Statistics in Economics and Management

Emina RESIĆ · Adela DELALIĆ · Merima BALAVAC · Ademir ABDIĆ
STATISTICS
IN ECONOMICS
AND MANAGEMENT
Sarajevo, 2010.
Title: STATISTICS IN ECONOMICS AND MANAGEMENT
Authors: Doc. dr Emina Resić
Mr Adela Delalić
Mr Merima Balavac
Ademir Abdić
Publisher: Ekonomski fakultet u Sarajevu
Editor-in-chief: Dean, Prof. dr Veljko Trivun
Reviewers: Prof. dr Rabija Somun-Kapetanović,
full professor, Ekonomski fakultet u Sarajevu
Prof. dr Ksenija Dumičić,
full professor, Ekonomski fakultet u Zagrebu
Design & DTP: Adis Duhović
Language editor: Mr Milica Babić
Printed by: Premier Febeco d.o.o. Mostar
Print run: 300
Year of publication: 2010
CIP - Cataloguing in Publication
Nacionalna i univerzitetska biblioteka
Bosne i Hercegovine, Sarajevo
330.45:519.2]:005(075.8)
STATISTICS in economics and management / Emina
Resić ... [et al.]. - Sarajevo : Ekonomski
fakultet, 2010. - 589 pp. : illus. ; 24 cm
Bibliography: pp. 561-564 ; bibliographic and
other notes accompanying the text.
ISBN 978-9958-25-056-9
1. Resić, Emina
COBISS.BH-ID 18502150
PREFACE
All kinds of activities require the use of numbers. Students are expected
to work with confusing sets of data, and statistics helps them to make
sense of those data. By using statistical tools, we aim to simplify complex
problems and present them to others in a comprehensible form.
We want our students to be effective when facing and working with
numbers. A better understanding of quantitative approaches should ease
problem solving and make us more confident in the research we undertake.
We stress the importance of an effective approach to problem solving
and of selecting the right methodology.
This book is intended for students of Economics and the closely
related Accountancy and Business disciplines. It provides examples
and problems relevant to those subjects, using real data where possible.
It is an elementary-level book and requires no prior knowledge
of statistics or advanced mathematics. The book covers all the relevant
concepts, so that the reader understands why a particular statistical test
should be used. These concepts are introduced naturally in the
course of the text, rather than in sections of their own. The book
can form the basis of a one- or two-term course, depending upon the
intensity of the teaching.
Some tasks were done using Excel, in order to show the benefit that
Excel and other computer programs offer for solving statistical
problems. We have included Excel output in the form of screenshots so
that readers become familiar with the program and are equipped to use it
on their own. Numerical results may differ slightly as
a consequence of differences in the precision of computing resources
and rounding.
This book is the result of our long-standing work on the subjects Statistics,
Statistics in Economics and Management, Business Statistics, and
Business Mathematics and Statistics at the School of Economics and
Business in Sarajevo, and it is intended primarily for the students of that
school. It follows the curriculum of the course
Statistics in Economics and Management at the School of Economics and
Business in Sarajevo, which is a first-year subject.
The content and scope of the material are intended to facilitate
students’ preparation for the exam. The book aims to develop their analytical
skills and to equip them with the knowledge to undertake basic statistical
analysis of the problems at hand.
We use this opportunity to thank the reviewers and everyone who contributed
to this book. We owe our most sincere thanks to Dr Rabija Somun-Kapetanović,
who generously shared her experience with us.
Since this is the first edition of this book, the authors will be grateful for
any suggestions that would improve its quality.
CONTENT
1. DATA COLLECTION AND PRESENTATION
1.1. WHAT IS STATISTICS
1.2. DATA/INFORMATION/STATISTICS
1.3. SCALES OF MEASUREMENT
1.4. DISCRETE AND CONTINUOUS VARIABLES
1.5. DATA COLLECTION
1.5.1. Population and sample
1.5.2. Census
1.5.3. Sampling
1.6. TYPES OF SAMPLE
1.6.1. Simple random sample
1.6.2. Stratified sample
1.6.3. Cluster sampling
1.6.4. Quota sampling
1.6.5. Systematic sampling
1.6.6. Calculating a Sample Size
1.7. FREQUENCY DISTRIBUTION
1.7.1. Constructing frequency distribution table
1.7.2. Constructing cumulative frequency distribution tables
1.7.3. Class intervals
1.7.4. Outliers
1.8. DATA PRESENTATION: TABLES, DIAGRAMS AND GRAPHS
2. DESCRIPTIVE STATISTICS
2.1. INTRODUCTION
2.2. MEASURES OF CENTRAL TENDENCY
2.2.1. Arithmetic mean
2.2.2. Harmonic mean
2.2.3. Geometric mean
2.2.4. Median
2.2.5. Mode
2.2.6. Quartiles
2.3. EXAMPLES FOR MEASURES OF CENTRAL TENDENCY
2.4. MEASURES OF DISPERSION
2.4.1. The middle absolute distance
2.4.2. The variance and the standard deviation
2.4.3. Coefficient of variation
2.4.4. Z value
2.4.5. The quartile range, the quartile deviation and the coefficient of quartile deviation
2.5. EXAMPLES FOR MEASURES OF DISPERSION
2.6. SHAPE OF DISTRIBUTION
2.6.1. Symmetry or skewness
2.6.2. Kurtosis
2.7. MEASURE OF CONCENTRATION
2.8. USING EXCEL TO OBTAIN DESCRIPTIVE STATISTICS
2.9. SOLVED EXAMPLES
2.10. SELF STUDY EXAMPLES
3. REGRESSION AND CORRELATION
3.1. INTRODUCTION
3.2. BASIC ASPECTS
3.3. SCATTER PLOT
3.4. LINE OF BEST FIT (REGRESSION LINE)
3.5. THE STANDARD ERROR OF ESTIMATE AND THE COEFFICIENT OF DETERMINATION
3.6. THE CORRELATION COEFFICIENT
3.7. INTERPRETATION OF THE SIZE OF A CORRELATION
3.8. CALCULATING THE EQUATION OF THE LINEAR REGRESSION MODEL
3.9. THE CORRELATION COEFFICIENT FOR LINEAR RELATIONSHIP
3.10. PREDICTION OR FORECASTING
3.11. SPEARMAN’S RANK CORRELATION COEFFICIENT
3.12. STATISTICAL TESTING FOR SIMPLE LINEAR REGRESSION MODEL (t TEST)
3.13. OVERVIEW EXAMPLE FOR SIMPLE LINEAR REGRESSION
OF THE EXPONENTIAL REGRESSION MODEL
OF THE PARABOLICAL REGRESSION MODEL
OF THE POWER REGRESSION MODEL
3.17. MULTIPLE REGRESSION MODEL
3.17.1. Measures for quality of multiple regression model
3.17.2. Statistical test for multiple regression model (t test, ANOVA)
3.18. INDICATOR – DUMMY VARIABLES
3.18.1. Simple model with dummy variable
3.18.2. Example of regression indicator variables in the simple model with a "dummy" variable
3.18.3. Example of multiple regression models with indicator and continuous variables as explanatory variables in the model
3.19. CONDITIONS FOR ECONOMETRIC MODELS
3.19.1. Assumptions of the regression models
3.20. SOLVED EXAMPLES
3.21. SELF STUDY EXAMPLES
4. TIME SERIES ANALYSIS
4.1. INTRODUCTION
4.2. COMPONENTS (ELEMENTS) OF TIME SERIES
4.2.1. Trend or long-term component
4.2.2. Seasonal component (seasonal variations)
4.2.3. Cyclical component
4.2.4. Irregular or random component
4.2.5. Systematic versus nonsystematic component in time series
4.2.6. Additive versus multiplicative model
4.3. GRAPHICAL METHOD FOR EVALUATION ANALYSIS OF SOME PHENOMENA
4.4. ABSOLUTE AND RELATIVE CHANGES
4.4.1. Absolute change
4.4.2. Relative change
4.5. THE INDEX METHOD
4.5.1. The average annual rate of change
4.5.2. Aggregate index numbers
4.5.3. Index of values
4.5.4. Aggregate price index
4.5.5. Aggregate volume (quantity) index
4.5.6. Decomposition of aggregate index
4.6. DETERMINATION OF THE TREND
4.6.1. Determination of trend by "eye"
4.6.2. The method of moving averages
4.7. MATHEMATICAL MODELS FOR DETERMINATION OF LONG-TERM TREND
4.7.1. Least squares method for determination of the trend
4.7.2. Trend isolation
4.8. SOLVED EXAMPLES
5. PROBABILITY AND THEORETICAL DISTRIBUTIONS
5.1. INTRODUCTION
5.2. RANDOM VARIABLES AND PROBABILITY DEFINITIONS
5.3. BASIC DEFINITIONS IN PROBABILITY AND NOTATION
5.4. BASIC RELATIONSHIPS IN PROBABILITY
5.5. BASIC RELATIONSHIPS IN PROBABILITY EXAMPLES
5.6. BAYES THEOREM
5.7. PROBABILITY DISTRIBUTIONS
5.8. BINOMIAL DISTRIBUTION
5.8.1. Probability distribution of a binomial random variable
5.8.2. Characteristics of the Binomial distribution
5.9. POISSON DISTRIBUTION
5.9.1. Probability distribution of Poisson random variable
5.9.2. Characteristics of the Poisson distribution
5.10. HYPERGEOMETRIC DISTRIBUTION
5.11. NORMAL DISTRIBUTION
5.11.1. Rules for standardized normal distribution
5.11.2. Characteristic intervals for normal distribution
5.12. STUDENT t-DISTRIBUTION
5.13. CHI-SQUARE DISTRIBUTION
5.14. F DISTRIBUTION
5.15. APPROXIMATIONS OF BINOMIAL, POISSON AND HYPERGEOMETRIC DISTRIBUTION WITH NORMAL DISTRIBUTION
5.16. SOLVED EXAMPLES
6. INFERENTIAL STATISTICS
6.1. INTRODUCTION
6.2. THE POINT ESTIMATOR
6.3. THE DISTRIBUTION OF THE SAMPLE MEANS
6.4. CONFIDENCE INTERVAL FOR THE POPULATION MEAN
6.4.1. Standard deviation of population is known
6.4.2. Standard deviation of population isn’t known
6.5. CONFIDENCE INTERVAL OF THE POPULATION PROPORTIONS
6.6. CONFIDENCE INTERVAL FOR VARIANCE IN POPULATION
6.7. HOW TO DETERMINE SAMPLE SIZE ACCORDING TO SAMPLE ERROR?
6.7.1. Determining sample size for estimating population mean
6.7.2. Determining sample size for estimating population proportion
6.8. HYPOTHESIS TESTING
6.8.1. Regions of rejection and non-rejection
6.8.2. Risks in decision making process
6.8.3. Procedure for hypothesis testing
6.8.4. Hypothesis for the mean
6.8.5. A two sample test for means
6.8.6. Testing differences between arithmetic means of more than two populations on the basis of their samples - analysis of variance ANOVA
6.8.7. Statistical tests for the proportion
6.8.8. Statistical tests for the variance
6.8.9. Chi-square (χ²) test of independence
6.8.10. Test for differences among proportion for populations
6.8.11. Test of adequacy to approximations (goodness of fit)
6.9. SOLVED EXAMPLES
REFERENCES
STATISTICAL TABLES
INDEX
CHAPTER 1
DATA COLLECTION AND PRESENTATION
1.1. WHAT IS STATISTICS?
“The best thing about being a statistician is that you get to play in
everyone else’s backyard.” John Tukey, Princeton University¹
Any manager operating in the business environment requires as much
information as possible about the different characteristics of that
environment. Nowadays, most of the available information is quantitative
(for example, interest rates, market prices, unemployment), partly
thanks to the massive information storage capacities of computer
systems. Market research surveys are carried out to determine the
strength of demand. An auditor is concerned with the number and size
of errors found in accounts receivable. A personnel manager may be able
to use attitude test scores to complement subjective judgment
of job candidates.
The data used in these examples are numerical. The human brain has a limited
capacity to deal with abundant incoming information, and when faced with
large groups of numbers, most people cannot hold them all in
mind at once. It is difficult to draw any conclusions by simply looking
at the raw data; therefore, it is useful to create some kind of overall
picture or summary of what is going on. The main purpose of statistics
is to summarize the data accurately in a smaller set of easily interpretable
numbers.² The statistician’s role involves the extraction and synthesis of
the important features of a large body of numerical data. Statisticians try to
make sense of numerical data by summarizing them, which gives an
easily understandable picture while little of importance is lost.
Statistics could also be defined as the science of uncertainty. Statistics
does not deal with questions such as “What will be?”, but rather
with questions such as “What could be?”, “What might be?” or “What
probably is?”
¹ http://math.hunter.cuny.edu/, accessed 25.04.2010.
² http://www.marketresearchworld.net/index.php?option=com_content&task=view&id=21&Itemid=41, accessed 27.01.2010.
Here are some of the many real-world examples that require the use of
statistics:
• Your company has created a new drug that may cure some disease.
  How would you conduct a test to confirm the drug's effectiveness?
• The latest sales data have just come in, and your manager wants you
  to prepare a report for management about areas where the company
  could improve its business. What should you look for? What should
  you not look for?
• A widget maker in your factory that normally breaks 4 widgets for
  every 100 it produces has recently started breaking 5 widgets for
  every 100. When is it time to buy a new widget maker? (And just
  what is a widget, anyway?)
Statistics, in short, is the study of data. It involves collecting, classifying,
summarizing, organizing, analyzing, and interpreting numerical
information. Statistics includes two important parts:
Descriptive statistics, which involves the study of methods
and tools for collecting data, and of mathematical models to
describe and interpret data.
It uses numerical and graphical methods to look for patterns in a data
set, to summarize the information revealed in a data set and to present
that information in a convenient form.
Inferential statistics, which involves the systems and techniques
for making probability-based decisions and accurate
predictions based on incomplete (or sample) data. It uses
sample data to make estimates, decisions, predictions, or
other generalizations about a larger set of data (the
population).
The statistical treatment of data has three main aspects:
1. The collection of qualitative or numerical data,
2. The different ways of presenting qualitative or numerical data, and
3. The presentation and analysis of numerical data with appropriate
statistical methods and models.
With the appropriate tools and a solid grounding in statistics, one can use
a limited sample to make intelligent and accurate statements about the
population. In today's information-overloaded age, statistics is one of
the most useful subjects anyone can learn. Newspapers are filled with
statistical data, and anyone who is ignorant of statistics is at risk of being
seriously misled about important real-life questions such as what to eat,
who is leading the polls, how dangerous smoking is, etc. Knowing a
little about statistics helps one to make better-informed decisions
about these and other important questions. Furthermore, statistics are
often used by politicians, advertisers and others to
twist the truth for their own gain. For example, a company selling the
cat food brand “Cat-sweet” may claim in its advertisements that eight
out of ten cat owners said that their cats preferred “Cat-sweet”
food to "the other leading brand" of cat food. What it may
not mention is that the cat owners questioned were those found in
a supermarket buying “Cat-sweet”.
Statistics is one of the most powerful tools available for assessing the
significance of experimental data and for drawing sound conclusions from the
vast amount of data faced by engineers, scientists, sociologists and
other professionals. Almost every social, health-care, environmental or
political study relies on statistical methodology. Since
variation is ubiquitous in nature, probability and statistics, the fields that
allow us to study, understand, model and interpret variation,
are ubiquitous as well.
1.2. DATA/INFORMATION/STATISTICS
Before one can present and interpret information, there has to be
a process of gathering and sorting data. Just as trees are the raw
material from which paper is produced, so too can data be viewed as
the raw material from which information is obtained.³
Data are defined as “facts or figures from which conclusions
can be drawn”.³
Data, information and statistics are often confused. They are
different categories, as the following table shows.
Table 1.1. Data collected on the weight of 20 individuals in a classroom

Data                 Information                               Statistics
20 kg, 24 kg, etc.   5 individuals in the 20-to-24-kg range    Mean weight = 22.5 kg
28 kg, 30 kg, etc.   15 individuals in the 25-to-30-kg range   Median weight = 28 kg
Data can take various forms, but are often numerical. As such, data can
relate to an enormous variety of aspects. Some examples are:
• the daily weight measurements of each individual in your classroom;
• the number of movie rentals per month for each household in your
  neighborhood;
• the city's temperature (measured every hour) for a one-week period,
  etc.
Other forms of data exist, such as radio signals, digitized images and
laser patterns on compact discs.
³ http://www.statcan.gc.ca/edu/power-pouvoir/ch1/definitions/5214853-eng.htm, accessed 25.05.2010.
Statistical offices collect data every day to provide information. Once
data have been collected and processed, they are ready to be organized
into information. Indeed, it is hard to imagine reasons for collecting
data other than to provide information. This information is a source of
knowledge about the issues and helps individuals and groups to make
informed decisions. Information is defined as "the data that have
been recorded, classified, organized, related or interpreted within a
framework so that meaning emerges".
Information, like data, can take various forms. Some examples of the
different types of information that can be derived from data include:
• the number of persons in a group in each weight category (20 to
  24 kg, 25 to 30 kg, etc.);
• the total number of households that did not rent a movie during the
  last month;
• the number of days during the week where the temperature went
  above 20°C, etc.
Statistics represent a common method of presenting information. In
general, statistics relate to numerical data, and the term can also refer to
the science of dealing with numerical data. Above all, statistics aim to
provide useful information by means of numbers. Therefore, statistics
is defined as "a type of information obtained through mathematical
operations on numerical data".
Using the previous examples, some of the statistics that can be obtained
include:
• the average weight of people in your office;
• the minimum number of rentals your household had to make to be in
  the top 10% of renters for the last month;
• the minimum and maximum temperature observed each day of the
  week, etc.
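The step from data to information to statistics can be sketched in a few lines of Python. This is only an illustration: the ten weights below are made up for the example and are not the data behind Table 1.1.

```python
import statistics

# Hypothetical raw data: weights (kg) of ten individuals, invented for illustration.
weights_kg = [20, 22, 24, 25, 26, 27, 28, 28, 29, 30]

# Information: the data organized into weight categories.
in_20_to_24 = sum(1 for w in weights_kg if 20 <= w <= 24)
in_25_to_30 = sum(1 for w in weights_kg if 25 <= w <= 30)

# Statistics: numbers obtained through mathematical operations on the data.
mean_weight = statistics.mean(weights_kg)
median_weight = statistics.median(weights_kg)

print("20-to-24-kg range:", in_20_to_24, "individuals")
print("25-to-30-kg range:", in_25_to_30, "individuals")
print("Mean weight:", mean_weight, "kg")
print("Median weight:", median_weight, "kg")
```

The list of weights plays the role of the data, the two counts are information, and the mean and median are statistics in the sense defined above.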
1.3. SCALES OF MEASUREMENT
Each scale of measurement corresponds to a particular type of data.
Nominal scale
A nominal scale classifies data into distinct categories
in which no ordering is implied. Nominal variables may be
used to identify different attributes.
For example, a nominal scale is appropriate for:
• Hair or eye color
• Gender
• Personal computer ownership
• Internet provider that you prefer
• The international telephone code for a country
• The numbers on the shirts of players in a sports team
• The license plate number of a car
With nominal data we can only check whether two values are equal or unequal.
There are no "less than" or "greater than" relations among them, nor operations
such as addition or subtraction.
Ordinal scale
An ordinal scale classifies data into distinct categories
in which ordering is implied. The ordinal scale is directly
connected with ranking.
An example is “product satisfaction”, because you can be very
satisfied, satisfied, neutral, unsatisfied or very unsatisfied. A physical
example is the Mohs scale of mineral hardness. Another example is the
result of a horse race: which horses arrived first, second, third, etc. is
reported, but the time intervals between the horses are not.
Most measurement in psychology and other social sciences is at
the ordinal level; for example, attitudes and IQ are only measured at
the ordinal level. If customers surveyed report preferring chocolate to
vanilla-flavored ice cream, the data are of this kind.
Comparisons of greater and less can be made, in addition to equality
and inequality. However, operations such as conventional addition and
subtraction are still without meaning. While the scale can be ranked
from high to low, the difference between points cannot be quantified.
We cannot say that the person who thinks facilities are good regards
the facilities as twice as good as the person who thinks they are below
average.
Ratio scale
A ratio scale is an ordered scale which involves a true zero
point. A certain distance along the scale means the same thing
everywhere on the scale (height, age, profit, etc.).
All mathematical operations are possible with this type of data and lead
to meaningful results. There are numerous methods for analyzing this
type of data.
Interval scale
The most important characteristic of an interval scale is that the
measurement does not involve a true zero point. The numbers
have all the features of ordinal measurement and are also
separated by the same interval. The “zero” value is arbitrary, not
real (temperature in degrees Celsius, etc.).
In this case, differences between arbitrary pairs of numbers can be
meaningfully compared. Operations such as addition and subtraction
are therefore meaningful. However, the zero point on the scale is
arbitrary, and ratios between numbers on the scale are not meaningful,
so operations such as multiplication and division cannot be carried out.
On the other hand, negative values on the scale can be used.
Categorical variables (attributes) are connected with the nominal
or ordinal scale, while numerical variables are connected with
the ratio or interval scale.
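The difference between the interval and the ratio scale can be checked numerically. The following Python sketch uses three temperatures chosen only for illustration and the standard conversion formulas between Celsius, Fahrenheit and Kelvin. It suggests why "20 degrees is twice as warm as 10 degrees" is not a meaningful statement on an interval scale, while comparisons of differences are.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin (a scale with a true zero point)."""
    return c + 273.15


# Illustrative temperatures in degrees Celsius (assumed values).
t1, t2, t3 = 10.0, 20.0, 30.0

# Ratios of values depend on where the arbitrary zero is placed.
ratio_c = t2 / t1                                                # 2.0
ratio_f = celsius_to_fahrenheit(t2) / celsius_to_fahrenheit(t1)  # about 1.36
ratio_k = celsius_to_kelvin(t2) / celsius_to_kelvin(t1)          # about 1.04

# Comparisons of differences do not depend on the unit or the zero point.
diff_ratio_c = (t3 - t2) / (t2 - t1)
diff_ratio_f = (celsius_to_fahrenheit(t3) - celsius_to_fahrenheit(t2)) / (
    celsius_to_fahrenheit(t2) - celsius_to_fahrenheit(t1)
)

print(ratio_c, ratio_f, ratio_k)
print(diff_ratio_c, diff_ratio_f)
```

The same pair of temperatures gives three different "ratios" on three scales, but the two equal temperature steps compare as 1 to 1 in both Celsius and Fahrenheit. Only the Kelvin ratio has a physical interpretation, because Kelvin has a true zero.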
1.4. DISCRETE AND CONTINUOUS VARIABLES
A numerical variable takes numerical values. It can be either discrete or
continuous.
Discrete variables produce numerical responses that arise
from a counting process.
An example of a discrete numerical variable is “the number of magazines
subscribed to”. The response is one of a finite number of integers. More
generally, a discrete variable can only take a finite (or countable) number of
distinct values. Another example of a discrete variable is the score given by
a judge to a gymnast in competition: the range is 0 to 10 and the score is
always given to one decimal place (e.g., a score of 8.5).
Continuous variables produce numerical responses that arise
from a measuring process.
The response can take any value within a continuum or interval,
depending on the precision of the measuring instrument. Examples of
continuous variables are distance, age, weight and height. For example,
your weight may be recorded as 57 kg, 57.5 kg or 57.58 kg, depending on the
units of measurement and on the precision of the available measuring instrument.
1.5. DATA COLLECTION
Depending on the scope of the research, data about statistical units can
be collected from a whole population or from a part of the population
(a sample).
1.5.1. Population and sample
A statistical unit is an element that possesses characteristics on
the basis of which the variation of a mass phenomenon is investigated.
A population is a set of statistical units (people, objects,
transactions, events or organizations of interest) that we want
to analyse. Population size (N) is the number of statistical units
comprising the population.
The definition of a population has three aspects:
• The notion-based definition of the population establishes the set on the
  basis of the notion of the statistical unit.
• The space-based definition of the population is determined by the
  space to which the statistical units of the set belong.
• The time-based definition of the population is determined by the
  time in which the statistical units are observed.
The time in question may be:
• A moment in time (e.g. current population size, current number
  of employed, etc.), or
• A time interval (annual business result, monthly production, etc.).
A sample is only a part of the population that is included in the
research.
A population usually contains too many objects or individuals to study
conveniently, so an investigation is very often restricted to one or more
samples drawn from the population. A well-chosen sample will contain
most of the information about a particular population parameter, but the
relation between the sample and the population must be such as to allow
valid inferences to be made about the population from that sample.
1.5.2. Census
A survey of a whole population is called a census.
A census refers to data collection about every unit in a group or
population. If you collected data about the height of everyone in your
class, that would be regarded as a class census.
A characteristic of a population (such as the population mean)
is referred to as a parameter.
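The class census example can be turned into a small Python sketch. The heights are simulated, so the class, its size of 30 and the height range are assumptions made for illustration. The point is that the census mean is a single fixed number (the parameter), whereas the mean of a sample of 10 students changes from one sample to the next.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

# Hypothetical population: heights (cm) of all 30 students in a class.
population = [random.randint(150, 190) for _ in range(30)]

# Census: every unit is measured, so this mean is the population parameter.
population_mean = statistics.mean(population)

# Sample surveys: only 10 units are measured each time, and the sample
# mean varies from one sample to the next.
sample_means = [
    statistics.mean(random.sample(population, 10)) for _ in range(5)
]

print("Population mean (parameter):", population_mean)
print("Means of five different samples:", sample_means)
```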
There are various reasons why a census may or may not be chosen as
the method of data collection:
Table 1.2. Census data (advantages and disadvantages)⁴
Advantages (+)
• Sampling variance is zero: There is no sampling variability attributed to the
  statistic because it is calculated using data from the entire population.
• Detail: Detailed information about small sub-groups of the population can be
  made available.
Disadvantages (–)
• Cost: In terms of money, conducting a census for a large population can be
  very expensive.
• Time: A census generally takes longer to conduct than a sample survey.
• Response burden: Information needs to be received from every member of the
  target population.
• Control: A census of a large population is such a huge undertaking that it is
  difficult to keep every single operation under the same level of scrutiny
  and control.
⁴ http://www.statcan.gc.ca/edu/power-pouvoir/ch2/types/5214777-eng.htm, accessed 20.05.2010.
1.5.3. Sampling
A sampling frame is a complete or partial listing of the items
comprising the population.
The frame can consist of data sources such as population lists, directories
or maps. Samples are drawn from this frame. From the sampling frame we can
identify every single element and include any of them in our sample. If the
frame is inadequate because certain groups of individuals or items in
the population were not properly included, then the samples will be
inaccurate and biased. The first important attribute of a sample is that
every object or individual in the population from which it is drawn must
have a known non-zero chance of being included in it5. The sampling
process comprises several stages:
Defining the population of concern,
Specifying a sampling frame, a set of items or events possible to
measure,
Specifying a sampling method for selecting items or events from the
frame,
Determining the sample size,
Implementing the sampling plan,
Sampling and data collection,
Reviewing the sampling process.
5 A natural suggestion is that these chances should be equal.
Examples of sample surveys:
Phoning the fifth person on every page of the local phonebook and
asking them how long they have lived in the area.
Selecting sub-populations in proportion to their incidence in the
overall population. For instance, a researcher may have reason to
select a sample consisting of 30% females and 70% males from a
population that has the same gender structure.
Selecting several cities in a country, several neighborhoods in
those cities and several streets in those neighborhoods to recruit
participants for a survey.
A characteristic of a sample (such as the sample standard

deviation) is referred to as a statistic.
In a sample survey, data are gathered for only part of the total population.
If you collected data about the height of 10 students in a class of 30, that
would be a sample survey of the class rather than a census. Reasons one
may or may not choose to use a sample survey include:
Table 1.3. Sample survey (advantages and disadvantages)6
Sample survey
Advantages (+)
Cost: A sample survey costs less than a census because data are collected from
only part of a group.
Time: Results are obtained far more quickly for a sample survey, than for a census.
Fewer units are contacted and less data needs to be processed.
Response burden: Fewer people have to respond in the sample.
Control: The smaller scale of this operation allows for better monitoring and
quality control.
Disadvantages (–)
Sampling variance is non-zero: The data may not be as precise because the data
came from a sample of a population, instead of the total population.
Detail: The sample may not be large enough to produce information about small
population sub-groups or small geographical areas.
6 http://www.statcan.gc.ca/edu/power-pouvoir/ch2/types/5214777-eng.htm, access 20.05.2010.
An estimate of a parameter taken from a random sample is known to be

unbiased7. As the sample size increases, it gets more precise.
1.6. TYPES OF SAMPLE
1.6.1. Simple random sample
A simple random sample is selected so that every possible

sample has an equal chance of being selected from the
population.
Each individual is chosen randomly and entirely by chance, so that

each individual has the same probability of being chosen at any stage
during the sampling process, and each subset of n individuals has the
same probability of being chosen for the sample as any other subset of n
individuals.
In small populations such sampling is typically done without
replacement. This means that a person or item, once selected, is
not returned to the frame and therefore cannot be selected again.
7 An estimator is unbiased when the average of a large set of estimates obtained
with it is close to the true value of the population parameter.
The chance that any particular member of the frame is selected on the
first draw is $\frac{1}{N}$. Then the chance that any particular member of
the frame not previously selected will be selected on the second draw is
$\frac{1}{N-1}$, etc. This process continues until the desired sample of
size n is obtained.
Sampling without replacement deliberately avoids choosing any member
of the population more than once. An unbiased random selection of
individuals is important so that in the long run, the sample represents the
population. However, this does not guarantee that a particular sample
is a perfect representation of the population. Simple random sampling
merely allows one to draw externally valid conclusions about the entire
population based on the sample.
Although simple random sampling can be conducted with
replacement instead, this is less common and would normally
be described more fully as simple random sampling with
replacement. This means that a person or item, once selected,
is returned to the frame and therefore can be selected again
with the same probability $\frac{1}{N}$.
Advantages are that a random sample is free of classification error and

it requires minimum advance knowledge of the population. Random
sampling best suits situations where not much information is available
about the population and data collection can be efficiently conducted on
randomly distributed items. If these conditions are not true, stratified
sampling or cluster sampling may be a better choice.
How do we select a simple random sample? Let's assume that we are

doing some research with a small service agency that wishes to assess
clients' views of quality of service over the past year. First, we have
to get the sampling frame organized. To accomplish this, we will go
through agency records to identify every client over the past 12 months.
If we're lucky, the agency has good accurate computerized records and
can quickly produce such a list. Then, we have to actually draw the
sample. Decide on the number of clients you would like to have in the
final sample. For the sake of the example, let's say you want to select
100 clients to survey and that there were 1000 clients over the past 12
months. Then, the sampling fraction is f = n/N = 100/1000 = 0.10 or
10%. Now, to actually draw the sample, you have several options. You
could print off the list of 1000 clients, tear them into separate strips, put
the strips in a hat, mix them up real good, close your eyes and pull out
the first 100. But this mechanical procedure would be tedious and the
quality of the sample would depend on how thoroughly you mixed them
up and how randomly you reached in. Perhaps a better procedure would
be to use the kind of ball machine that is popular with many of the state
lotteries. You would need three sets of balls numbered 0 to 9, one set
for each of the digits from 000 to 999 (if we select 000 we'll call that
1000). Number the list of names from 1 to 1000 and then use the ball
machine to select the three digits that select each person. The obvious
disadvantage here is that you need to get the ball machines.8
Neither of these mechanical procedures is very feasible and, with the

development of inexpensive computers there is a much easier way.
Here's a simple procedure that's especially useful if you have the names
of the clients already in the computer. Many computer programs can
generate a series of random numbers. Let's assume you can copy and
paste the list of client names into a column in an EXCEL spreadsheet.
Then, in the column right next to it paste the function =RAND() which is
EXCEL's way of putting a random number between 0 and 1 in the cells.
Then, sort both columns -- the list of names and the random number
-- by the random numbers. This rearranges the list in random order
from the lowest to the highest random number. Then, all you have to do
is take the first hundred names in this sorted list. You could probably
accomplish the whole thing in under a minute.
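The same idea can be sketched outside a spreadsheet. The following Python fragment is only an illustration under stated assumptions: the list `clients` is a hypothetical stand-in for the agency's client list (the real names are not given here), and the fixed seed is used only so that the sketch can be repeated. It first mimics the sort-by-random-number procedure and then shows the shortcut offered by the standard library.

```python
import random

# Hypothetical sampling frame: 1000 client identifiers standing in for the
# agency's list of client names (the real list is not given in the text).
clients = [f"client_{i:04d}" for i in range(1, 1001)]

n = 100                      # desired sample size
N = len(clients)             # size of the sampling frame
sampling_fraction = n / N    # f = n/N = 0.10

rng = random.Random(2010)    # fixed seed only so the sketch is repeatable

# Same idea as the spreadsheet procedure: attach a random number to each
# name, sort by that number, and keep the first n names.
keyed = [(rng.random(), name) for name in clients]
keyed.sort()
sample_by_sorting = [name for _, name in keyed[:n]]

# Shortcut: random.sample draws n distinct items, i.e. without replacement.
sample_direct = rng.sample(clients, n)
```

Both lists contain 100 distinct clients; which of the two routes is used is a matter of convenience.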
Simple random sampling is easy to accomplish and to explain to others.
Because simple random sampling is a fair way to select a sample, it
is reasonable to generalize the results from the sample back to the
population. Simple random sampling is not the most statistically
efficient method of sampling and you may, just because of the luck of
the draw, not get good representation of subgroups in a population. To
deal with these issues, we have to turn to other sampling methods.
8 http://www.socialresearchmethods.net/kb/sampprob.php, access 26. 05. 2010.
1.6.2. Stratified sample
When sub-populations vary considerably, it is advantageous to sample

each subpopulation (stratum) independently.
Stratification is the process of grouping members of the

population into relatively homogeneous subgroups before
sampling.
The strata should be mutually exclusive: every element in the

population must be assigned to only one stratum. The strata should
also be collectively exhaustive: no population element can be excluded.
Then random or systematic sampling is applied within each stratum.
This often improves the representativeness of the sample by reducing
sampling error.
In general, the size of the sample in each stratum is taken in proportion

to the size of the stratum. This is called proportionate allocation.
Proportionate allocation uses a sampling fraction in each of the strata
that is proportional to that of the total population. If the population
consists of 60% in the male stratum and 40% in the female stratum,
then the relative size of the two samples (three males, two females)
should reflect this proportion.
Example 1.1. Determination of stratified sample
Suppose that in a company there is the following staff:
male, full time: 90
male, part time: 18
female, full time: 9
female, part time: 63
total: 180
We are asked to take a sample of 40 staff, stratified according to the

above categories.
The first step is to find the total number of staff (180) and calculate the
percentage in each group:
% male, full time = (90/180) x 100 = 50%

% male, part time = (18/180) x100 = 10%
% female, full time = (9/180) x 100 = 5%
% female, part time = (63/180) x 100 = 35%.
This tells us that, of our sample of 40:
50% should be male, full time (50% of 40 is 20).
10% should be male, part time (10% of 40 is 4).
5% should be female, full time (5% of 40 is 2).
35% should be female, part time (35% of 40 is 14).
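The proportionate allocation in this example can be checked with a few lines of Python. This is a minimal sketch, not a general-purpose routine: the variable names are our own, and it relies on the fact that every stratum share here works out to a whole number.

```python
# Staff counts per stratum, taken from Example 1.1.
strata = {
    "male, full time": 90,
    "male, part time": 18,
    "female, full time": 9,
    "female, part time": 63,
}
sample_size = 40
total = sum(strata.values())  # 180

# Proportionate allocation: each stratum gets the same share of the sample
# as it has of the population. Integer division is exact here because every
# product divides evenly by 180; in general the results need rounding and
# the rounded sizes may not add up to sample_size.
allocation = {name: count * sample_size // total for name, count in strata.items()}
```

The resulting stratum sample sizes are 20, 4, 2 and 14, which add up to the required 40.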
1.6.3. Cluster sampling
The problem with random sampling methods when we have to sample a

population that is dispersed across a wide geographic region is that you
will have to cover a lot of ground geographically in order to get to each
of the units you sampled.9 Imagine taking a simple random sample of all
the residents of New York State in order to conduct personal interviews.
By the luck of the draw you will wind up with respondents who come
from all over the state. Your interviewers are going to have a lot of
travelling to do. It is for precisely this problem that cluster or area
random sampling was invented.
9
http://www.socialresearchmethods.net/kb/sampprob.php, access 25. 01. 2010.
In cluster sampling, we have to follow these steps:
divide the population into clusters (usually along geographic
boundaries)
randomly sample clusters
measure all units within sampled clusters.
Clusters can be naturally occurring groupings (countries, districts,
municipalities, city blocks, apartments, households etc). For instance, in
the figure we see a map of the counties in New York State.10
Figure 1.1. A map of the counties in New York State
Suppose that we have to do a survey of city governments that will require
us to go to the towns personally. If we do a simple random sample
state-wide we will have to cover the entire state geographically. Instead,
we decide to do a cluster sampling of five counties (marked in red in
the figure). Once these are selected, we go to every city government in
the five areas. Clearly this strategy will help us to economize on our
mileage. Cluster or area sampling, then, is useful in situations like this,
and is done primarily for efficiency of administration. Note also that
we probably don't have to worry about using this approach if we are
conducting a mail or telephone survey, because it doesn't matter as much
(in cost or efficiency) where we call or send letters to.
10 http://www.angelfire.com/empire/richardt/, access 26. 01. 2010.
Cluster samples are generally used if:
No list of the population exists.
Well-defined clusters, which will often be geographic
areas, exist.
A reasonable estimate of the number of
elements in each level of clustering can be made.
Often the total sample size must be fairly large to enable cluster
sampling to be used effectively. Cluster sampling is usually more
cost-effective than simple random sampling, particularly if the population
is spread over a wide territory.
1.6.4. Quota sampling
Quota sampling is the nonprobabilistic equivalent of stratified

sampling.
As with stratified sampling, the researcher first identifies the strata
and their proportions as they are represented in the population. Then
convenience or judgment sampling is used to select the required number
of subjects from each stratum. This differs from stratified sampling,
where the strata are filled by random sampling.
There are two types of quota sampling: proportional and
nonproportional.
In proportional quota sampling you want to represent

the major characteristics of the population by sampling a
proportional amount of each.
For instance, if you know the population has 40% women and 60% men,
and that you want a total sample size of 100, you will continue sampling
until you get those percentages and then you will stop. So, if you've
already got the 40 women for your sample, but not the sixty men, you
will continue to sample men but even if legitimate women respondents
come along, you will not sample them because you have already "met
your quota." The problem here (as in much purposive sampling) is that
you have to decide the specific characteristics on which you will base
the quota. Will it be by gender, age, education, race, religion, etc.?
Nonproportional quota sampling is a bit less restrictive. In

this method, you specify the minimum number of sampled
units you want in each category.
Here, you're not concerned with having numbers that match the
proportions in the population. Instead, you simply want to have enough
units to ensure that you will be able to talk about even small groups in the
population. This method is the nonprobabilistic analogue of stratified
random sampling in that it is typically used to ensure that smaller groups
are adequately represented in your sample.
1.6.5. Systematic sampling
Systematic sampling is a statistical method involving the

selection of every kth element from a sampling frame, where k,
the sampling interval, is calculated as:
k = population size (N) / sample size (n)
Using this procedure each element in the population has a known

and equal probability of selection. This makes systematic sampling
functionally similar to simple random sampling. It is, however, much
more efficient (if variance within systematic sample is more than
variance of population) and much less expensive to carry out. The
researcher must ensure that the chosen sampling interval does not hide
a pattern. Any pattern would threaten randomness. A random starting
point must also be selected.
Systematic sampling is to be applied only if the given population is

logically homogeneous, because systematic sample units are uniformly
distributed over the population.
Example 1.2. Determination of systematic sample
a) Suppose a supermarket wants to study buying habits of their
customers. By using systematic sampling they can choose every 10th
or 15th customer entering the supermarket and conduct the study on
this sample. This is random sampling with a system.
From the sampling frame, a starting point is chosen at random, and

choices thereafter are at regular intervals.
For example, suppose you want to sample 8 houses from a street of 120
houses. 120/8=15, so every 15th house is chosen after a random starting
point between 1 and 15. If the random starting point is 11, then the houses
selected are 11, 26, 41, 56, 71, 86, 101, and 116.
If, as is more often the case, the population size is not evenly divisible
by the sample size (suppose you want to sample 8 houses out of 125, where
125/8=15.625), should you take every 15th house or every 16th house? If you
take every 16th house, 8*16=128, so there is a risk that the last house
chosen does not exist. On the other hand, if you take every 15th house,
8*15=120, so the last five houses will never be selected. The random
starting point should instead be selected as a noninteger between 0 and
15.625 (inclusive on one endpoint only) to ensure that every house has an
equal chance of being selected; the interval should now be noninteger
(15.625); and each noninteger selected should be rounded up to the next
integer. If the random starting point is 3.3, then the houses selected are
4, 19, 35, 51, 66, 82, 98, and 113, where there are 3 cyclic intervals of 15
and 5 intervals of 16.
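The two house examples above can be reproduced with a short Python sketch. The function name `systematic_sample` and its parameters are our own illustrative choices; the starting point is passed in explicitly (rather than drawn at random) so that the worked numbers from the text can be checked.

```python
import math

def systematic_sample(population_size, sample_size, start):
    """Return 1-based unit numbers chosen by fractional-interval sampling.

    `start` is the random starting point, 0 < start <= interval; it is passed
    in explicitly so the worked examples from the text can be reproduced.
    """
    interval = population_size / sample_size      # k = N / n, may be fractional
    if not 0 < start <= interval:
        raise ValueError("start must lie in (0, interval]")
    # Step through the frame at the fractional interval, rounding each
    # position up to the next whole unit number.
    return [math.ceil(start + i * interval) for i in range(sample_size)]

houses_120 = systematic_sample(120, 8, start=11)    # interval 15
houses_125 = systematic_sample(125, 8, start=3.3)   # interval 15.625
```

In a real survey the starting point would itself be random; one possible way to draw it is `interval * (1 - random.random())`, which lies in the half-open range from just above 0 up to and including the interval.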
1.6.6. Calculating a Sample Size
A frequently asked question is “How many people should I sample?” It

is an extremely good question, although unfortunately there is no single
answer! In general, the larger the sample size, the more closely your
sample data will match that from the population. However, in practice,
you need to work out how many responses will give you sufficient
precision at an affordable cost. Calculation of an appropriate sample
size depends upon a number of factors unique to each survey and it is
down to you to make the decision regarding these factors.
The three most important factors that determine sample size are:
How accurate you wish to be?
How confident you are in the results?
What budget you have available?
The temptation is to say all should be as high as possible. The problem

is that an increase in either accuracy or confidence (or both) will always
require a larger sample and higher budget. Therefore, a compromise
must be reached and you must work out the degree of inaccuracy and
confidence you are prepared to accept.
1.7. FREQUENCY DISTRIBUTION
The first result that we get from research is a series of raw data.
It is a database in which we have entered data for each item or object
without any order (“piled data”). In order to get an arranged statistical
series (ordered array), we need to sort the data by order of magnitude (from
the smallest observation to the largest observation).
The easiest method of organizing data is a frequency
distribution, which converts raw data into a meaningful
pattern for statistical analysis.
The final form of data grouping is the statistical distribution of
frequencies, in which each modality or interval of the variable (there are
n modalities or intervals) is associated with a corresponding absolute
frequency $f_i$ (the number of times each value, modality or class
appears) ⇒ $(x_i, f_i)$, where for grouped data the class interval takes
the place of $x_i$.
Frequency distribution is a summary table in which the data

are arranged into numerically ordered class groupings or
categories.
The number of class groupings used depends on the number of data

observations (N). In general, the frequency distribution should have at
least 5 class groupings but no more than 15.
Frequency distribution is usually a list, ordered by quantity, showing
the number of times each modality appears $(x_i, f_i)$.
Example 1.3. The frequency distribution
If 100 people rate a five-point Likert scale assessing their agreement
with the same important statement on a scale on which 1 denotes strong
agreement and 5 strong disagreement, the frequency distribution of
their responses might look like:
Degree of agreement     Number of interviewed (absolute frequency)
Strongly agree           20
Agree somewhat           30
Not sure                 20
Disagree somewhat        15
Strongly disagree        15
Total                   100
From the table we can conclude that 30 people “agree somewhat” with this
statement, etc.
This simple tabulation has two drawbacks. When a variable can take
continuous values instead of discrete values or when the number of
possible values is too large, the table construction is cumbersome, if not
impossible. A slightly different tabulation scheme based on the range
of values (classes or intervals) is used in such cases.
Example 1.4. Constructing of frequency distribution using Excel
Here is an example of using an Excel procedure to create a frequency
distribution. According to the HBS 2004 database11 we have information
about several variables for 7,413 households:
Entity
Canton
Gender
Marital status
Education level
Employment status
11 Database Household Budget Survey 2004, B&H Agency for Statistics
We have qualitative variables with a small number of modalities, so we
will use non-interval grouping, that is, we will find the absolute
frequency for each modality.
First, we will enter the modalities of the given variable in an empty
column of the Excel sheet. We will take the variable “marital status” with
modalities: unmarried, married, informal marriage, divorced and
widower/widow.
To construct the frequency distribution we will use the Excel function
COUNTIF.
Now we will give the arguments to the chosen COUNTIF function:
Range will be the row or column with the original data (we will fix that
data range with $: $D$2:$D$7414)
Criteria is the cell with the given modality (H10)
We will get the absolute frequency for the modality “unmarried”.
With the Copy-Paste option, we will complete the other cells for absolute
frequency.
The result is a frequency distribution with absolute frequencies for all
modalities.
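For readers who work outside Excel, the counting step can be sketched in Python. The HBS 2004 records are not reproduced in this text, so the list `marital_status` below is a small hypothetical data set used purely for illustration; only the list of modalities comes from the example.

```python
from collections import Counter

# Hypothetical mini data set standing in for the marital-status column;
# the actual HBS 2004 records are not reproduced in the text.
marital_status = [
    "married", "unmarried", "married", "widower/widow", "divorced",
    "married", "unmarried", "informal marriage", "married", "unmarried",
]

modalities = ["unmarried", "married", "informal marriage", "divorced", "widower/widow"]

# Counter tallies every value in one pass; looking up each modality plays
# the role of one COUNTIF cell. A modality that never occurs counts as 0.
counts = Counter(marital_status)
frequency_distribution = {m: counts[m] for m in modalities}
```

The dictionary `frequency_distribution` then holds one absolute frequency per modality, in the same order as the column of modalities in the sheet.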
Example 1.5.
If we consider the heights of the students in a class, the frequency table

might look as follows:
Height range      Number of students (absolute frequency)
4.5 – 5.0 feet    25
5.0 – 5.5 feet    35
5.5 – 6.0 feet    20
6.0 – 6.5 feet    10
Total             90
From that table we can see that 25 students have height in range 4.5-5.0,
etc.
Frequency distribution tables can be used for both categorical and

numeric variables. Continuous variables should only be used with class
intervals, which will be explained shortly.
The relative frequency is the proportion of units of a statistical
set with the same modality or interval.
The relative frequency of a particular modality or class interval is found
by dividing the absolute frequency by the number of observations:
$$\text{relative frequency}_i = \frac{f_i}{N}$$
The percentage frequency is found by multiplying each relative
frequency value by 100.
The percentage frequency is shown in percentages, and it has the same
meaning as the relative frequency:
$$\text{percentage frequency}_i = \frac{f_i}{N} \cdot 100$$
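As a quick check of these definitions, the relative and percentage frequencies for the responses in Example 1.3 can be computed as follows. This is a minimal Python sketch; the variable names are our own.

```python
# Absolute frequencies from Example 1.3 (100 respondents).
absolute = {
    "Strongly agree": 20,
    "Agree somewhat": 30,
    "Not sure": 20,
    "Disagree somewhat": 15,
    "Strongly disagree": 15,
}
N = sum(absolute.values())   # total number of observations

# Relative frequency: absolute frequency divided by N.
relative = {k: f / N for k, f in absolute.items()}
# Percentage frequency: the same share expressed out of 100.
percentage = {k: 100 * f / N for k, f in absolute.items()}
```

For instance, "Agree somewhat" has relative frequency 0.3 and percentage frequency 30; the relative frequencies add up to 1 and the percentage frequencies to 100.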
1.7.1. Constructing frequency distribution table
Example 1.6. Constructing frequency distribution table
Volunteers conducted a survey in a Sarajevo suburb. In each of 20 homes,
people were asked about the number of cars registered to their
households. The following results were recorded:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Follow these steps to present this data in a frequency distribution table.
1. Divide the results (x) into modalities, and then count the number of
results for each modality. In this case, the modalities are the numbers
of cars per household: no car (0), one car (1), two cars (2) and so forth.
2. Make a table with separate columns for the modality numbers (the
number of cars per household), the tallied results and the frequency
of results for each modality. Label these columns Number of cars,
Tally and Frequency.
3. Read the list of data from left to right and place a tally mark in
the appropriate row. For example, the first result is a 1, so place a
tally mark in the row beside where 1 appears in the modality column
(Number of cars). The next result is a 2, so place a tally mark in the
row beside the 2, and so on. When you reach your fifth tally mark,
draw a tally line through the preceding four marks to make your
final frequency calculations easier to read.
4. Add up the number of tally marks in each row and record them in the
final column entitled Frequency.
Your frequency distribution table for this exercise should look like this:
Number of cars ($x_i$)    Tally      Frequency ($f_i$)
0                         ||||       4
1                         ||||| |    6
2                         |||||      5
3                         |||        3
4                         ||         2
Total                                20
By looking at this frequency distribution table quickly, we can see that
out of 20 households surveyed, 4 households had no cars, 6 households
had 1 car, etc.
Constructing frequency distribution table using Excel.
In this case, we can apply an Excel procedure to get the frequency
distribution. If we have a column with the original data, A2:A21,
we can use the Excel function FREQUENCY. First we have to enter all
modalities (0, 1, 2, 3, and 4) in a new column.
Then we have to select free cells beside that column and choose Excel
function – Statistical – Frequency:
Data array – row or column or array with original data,
Bins array – new column with modalities.
In the end we will press CTRL+SHIFT+ENTER at the same time. That
will produce absolute frequencies for all modalities.
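The counting rule behind this kind of bin-based function can be sketched in Python. The helper `frequency` below is our own illustrative function, not Excel's implementation; it assumes the usual convention that each bin collects the values greater than the previous bin value and less than or equal to its own, with one extra count for anything above the last bin.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values v with previous_bin < v <= bin for each bin in `bins`.

    `bins` must be sorted in increasing order. One extra count at the end
    collects values above the last bin.
    """
    counts = [0] * (len(bins) + 1)
    for v in data:
        # bisect_left gives the index of the first bin that is >= v.
        counts[bisect_left(bins, v)] += 1
    return counts

# Number of cars per household, as recorded in Example 1.6.
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]
car_counts = frequency(cars, [0, 1, 2, 3, 4])
```

For the car data the first five counts are 4, 6, 5, 3 and 2, matching the table above, and the extra last count is 0 because no household has more than four cars.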
1.7.2. Constructing cumulative frequency distribution tables
Increasing absolute cumulative frequency (CAF) is used to
determine how many data have a value that is equal to or
lower than the value of the present modality.
The cumulative frequency is calculated using a frequency distribution
table, by adding each frequency from the table to the sum of its
predecessors:
$$CAF_i = f_1 + f_2 + \dots + f_i = CAF_{i-1} + f_i$$
The last value will always be equal to the total for all
observations (N), since all frequencies will already have been
added to the previous total.
Increasing relative cumulative frequency (CRF) is used to determine
which part of the data has a value that is equal to or lower than the
value of the present modality. Its formula is:
$$CRF_i = \frac{CAF_i}{N}$$
Cumulative percentage or increasing percentage cumulative
frequency (CRF%) is used to determine which percent of
data have a value that is equal to or lower than the value of the
present modality.
It is calculated by:
1. dividing the cumulative absolute frequency by the total number of
observations, then multiplying it by 100, or by
2. adding each percentage frequency from a frequency distribution
table to the sum of its predecessors, or by
3. adding each relative frequency from a frequency distribution table to
the sum of its predecessors, then multiplying it by 100:
$$CRF\%_i = \frac{CAF_i}{N} \cdot 100 = CRF_i \cdot 100$$
The last value for increasing percentage cumulative frequency will always
be equal to 100%.
Example 1.7. Constructing cumulative frequency distribution table
Participants (10 in total) in the summer fair had to fill out a form with
personal information (sex, age, occupation,…). We were interested in the
age structure and hence sorted out the ages of the participants:
36, 48, 54, 92, 57, 63, 66, 76, 66, 80
Use the following steps to present these data and create a cumulative
frequency distribution table:
1. Divide the results into intervals, and then count the number of results
in each interval. In this case, intervals of 10 are appropriate. Since 36
is the lowest age and 92 is the highest age, start the intervals at 35 to
44 and end the intervals with 85 to 94.
2. Create a table similar to the frequency distribution table but with
three new columns for cumulative frequencies.
3. In the first column or the Lower value column, put the lower value
of the result intervals. For example, in the first row, we would put the
number 35.
4. The next column is the Upper value column. Place the upper value
of the result intervals. For example, we would put the number 44 in
the first row.
5. The third column is the Interval or class column. For the first interval,
the upper-bounded limits would be 35 – 45.
6. The fourth column is the Frequency column. Record the number of
times a result appears between the lower and upper values for the given
interval. In the first row, we would place the number 1.
7. The fifth column is the Cumulative frequency column. Here we add
the cumulative frequency of the previous row to the frequency of the
current row. Since this is the first row, the cumulative frequency is
the same as the absolute frequency. However, in the second row, the
frequency for the 35–45 interval (i.e., 1) is added to the frequency
for the 45–55 interval (i.e., 2). Thus, the cumulative frequency is 3
(1+2=3), meaning we have 3 participants in the 35 to 54 age group.
8. The next column is the Percentage column. In this column, a list
of the percentage of the frequency is given. To do this, divide the
frequency by the total number of data and multiply by 100. In this
case, the frequency of the first row is 1 and the total number of data
is 10. The percentage would then be 10.0 ((1/10)*100=10.0).
9. The final column is Cumulative percentage. In this column, divide
the cumulative frequency by the total number of results and then, to
make a percentage, multiply by 100. Note that the last number in this
column should always equal 100.0. In this example, the cumulative
frequency of the first row is 1 and the total number of data is 10,
therefore the cumulative percentage of the first row is 10.0
((1/10)*100=10.0). In the second row, the percentage for the 35–45
interval (i.e., 10) is added to the percentage for the 45–55 interval
(i.e., (2/10)*100=20). Thus, the cumulative percentage is 30 (10+20=30),
meaning we have 30% of participants in the 35 to 54 age group.
The cumulative frequency distribution table should be:
Lower    Upper    Class      Frequency    Cumulative absolute    Percentage    Cumulative
value    value               ($f_i$)      frequency                            percentage
35       44       35 - 45    1            1                      10.0          10.0
45       54       45 - 55    2            3                      20.0          30.0
55       64       55 - 65    2            5                      20.0          50.0
65       74       65 - 75    2            7                      20.0          70.0
75       84       75 - 85    2            9                      20.0          90.0
85       94       85 - 95    1            10                     10.0          100.0
For example, we can conclude that 50% of the participants are younger
than 65 (or are 64 years of age or less), etc.
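The columns of this table can be recomputed from the raw ages with a short Python sketch. The variable names are our own, and the classes are treated as closed on the left and open on the right (35 up to but not including 45, and so on), which for whole-year ages is the same as 35 to 44.

```python
from itertools import accumulate

# Ages of the 10 participants from Example 1.7.
ages = [36, 48, 54, 92, 57, 63, 66, 76, 66, 80]
lower_limits = [35, 45, 55, 65, 75, 85]   # classes [35, 45), [45, 55), ...
width = 10
N = len(ages)

# Absolute frequency of each class.
freq = [sum(lo <= a < lo + width for a in ages) for lo in lower_limits]

caf = list(accumulate(freq))           # cumulative absolute frequency
pct = [100 * f / N for f in freq]      # percentage frequency
cum_pct = [100 * c / N for c in caf]   # cumulative percentage
```

The last entries of `caf` and `cum_pct` are 10 and 100.0, as the rules for the final row of a cumulative table require.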
1.7.3. Class intervals
If a variable takes a large number of values, then it is easier to present and
handle the data by grouping the values into class intervals. Continuous
variables have to be presented in class intervals, while discrete variables
can be grouped into class intervals or not.
To illustrate, suppose we set out age ranges for a study of young people,
while allowing for the possibility that some older people may also fall
into the scope of our study. The absolute frequency of a class interval
is the number of observations that occur in a particular predefined
interval. So, for example, if 20 people aged 5 to 9 (9 is included) appear
in our study's data, the frequency for the [5–9] or [5–10[ interval is 20.
The endpoints of a class interval are the lowest and highest values that
a variable can take ($L_{1i}$ and $L_{2i}$). So, the closed intervals in our study are
0 to 4 years, 5 to 9 years, 10 to 14 years, 15 to 19 years, 20 to 24 years,
and 25 years and over. The endpoints of the first interval are 0 and 4
if the variable is discrete, and 0 and 4.999 if the variable is continuous.
The endpoints of the other class intervals would be determined in the
same way.
There are some approximate formulas for the number of intervals in terms
of N, the size of the data set, but the number of intervals is frequently
determined according to previous practice. We can then find an
appropriate width for the class interval by dividing the range of the data
(highest value minus lowest value) by the chosen number of intervals.
Class interval width is the difference between the lower
endpoint of an interval and the lower endpoint of the next
interval ($l_i = L_{1,i+1} - L_{1,i}$).
In our study continuous closed intervals are 0 to 4, 5 to 9, etc. The width

of the first five intervals is 5, and the last interval is open, since no
higher endpoint is assigned to it. The intervals could also be written as
0 to less than 5, 5 to less than 10, 10 to less than 15, 15 to less than 20,
20 to less than 25, and 25 and over.
In summary, follow these basic rules when constructing a frequency
distribution table for a data set that contains a large number of
observations:
• find the lowest and highest values of the variables,
• decide on the width of the class intervals,
• include all possible values of the variable.
In deciding on the width of the class intervals, you will have to find a
compromise between having intervals short enough so that not all of
the observations fall in the same interval, but long enough so that you
do not end up with only one observation per interval. It is also very
important to make sure that the class intervals are mutually exclusive.
Example 1.8.
Thirty AA batteries were tested to determine how long they would last.
The results, to the nearest minute, were recorded as follows:¹²
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363,
391, 405, 382, 400, 381, 399, 415, 428, 422, 396, 372, 410, 419, 386, 390
Construct a frequency distribution table. Use those data to make a table

giving the relative frequency and percentage frequency of each interval
of battery life. Calculate and interpret cumulative frequencies.
The lowest value is 363 and the highest is 431. Using the given data
and a width of class interval of 10, the interval for the first class is
[360 to 370[, where 363 (the lowest value) is included. Remember, there
should always be enough class intervals so that the highest value has
been included.
Battery life, minutes ( xi ) Tally Frequency ( fi )

[360 – 370[ 2
[370 – 380[ 3
[380 – 390[ 5
[390 – 400[ 7
[400 – 410[ 5
[410 – 420[ 4
[420 – 430[ 3
[430 – 440] 1
Total 30
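The tally above can be reproduced with a few lines of code. This is a minimal sketch, assuming half-open classes [lower, upper[ of width 10 starting at 360; the variable names are illustrative.

```python
battery_life = [
    423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363,
    391, 405, 382, 400, 381, 399, 415, 428, 422, 396, 372, 410, 419, 386, 390,
]

width = 10
# Lower endpoint of the first class: the lowest value rounded down to a multiple of the width.
start = (min(battery_life) // width) * width
# Enough classes so that the highest value is included.
n_classes = (max(battery_life) - start) // width + 1

frequencies = [0] * n_classes
for x in battery_life:
    frequencies[(x - start) // width] += 1   # index of the class containing x

for i, f in enumerate(frequencies):
    lower = start + i * width
    print(f"[{lower} - {lower + width}[  {f}")
print("Total", sum(frequencies))
```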
Constructing a frequency distribution table and calculating cumulative
frequencies using Excel.
In this case, we can also apply an Excel procedure to get the frequency
distribution. Suppose the original data are in the column A2:A31:
12
http://www.statcan.gc.ca/edu/power-pouvoir/ch8/5214814-eng.htm#Top, access: 26.01.2010.
First we have to find the minimal and maximal values in the data set in
order to decide on the number and width of classes. We will use the Excel
functions MIN and MAX (=MIN(A2:A31) and =MAX(A2:A31)):
The lowest value in the data set is 363 and the highest value is 431. We will take
classes of width 10 and, accordingly, create 8 classes. Then we
will create two new columns, one with the lower and one with the upper endpoints
of the closed classes:
Then we have to select the free cells beside those columns and choose the Excel
function Statistical – FREQUENCY:
• Data array – the row, column or array with the original data,
• Bins array – the new column with the upper endpoints of the classes.
Finally, we press CTRL+SHIFT+ENTER at the same time. That
will produce the absolute frequencies for all classes:
Calculating relative frequency and percentage frequency.
Relative frequency and percentage frequency of each interval of battery
life are:
xi fi pi Pi
360 – 370 2 0.07 7
370 – 380 3 0.10 10
380 – 390 5 0.17 17
390 – 400 7 0.23 23
400 – 410 5 0.17 17
410 – 420 4 0.13 13
420 – 430 3 0.10 10
430 – 440 1 0.03 3
Total 30 1.00 100
Interpreting relative frequency and percentage frequency.
An analyst of these data could now say that:
• 7% of AA batteries last from 360 to 370 minutes,
• 23% of AA batteries last from 390 to 400 minutes,
• 3% of AA batteries last from 430 to 440 minutes.
Each relative frequency can also be read as the probability that a
randomly selected AA battery has a life in the corresponding range.
Calculating center of interval, cumulative absolute frequencies and
cumulative relative frequencies.
In an interval grouped series, further calculations require that each
interval be represented by its class middle (class mark, midpoint,
center of interval):
xi fi ci CAFi CRF%i
360 – 370 2 365 2 6.67
370 – 380 3 375 5 16.67
380 – 390 5 385 10 33.33
390 – 400 7 395 17 56.67
400 – 410 5 405 22 73.33
410 – 420 4 415 26 86.67
420 – 430 3 425 29 96.67
430 – 440 1 435 30 100.00
Total 30
Interpreting cumulative absolute frequencies and cumulative relative
frequencies.
For example, we can say that:
• 17 out of 30 AA batteries from the sample have a life of less than 400
minutes, so 13 out of 30 AA batteries from the sample have a life of 400
minutes or longer,
• 86.67% of AA batteries have a life of less than 420 minutes.
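The columns of the two tables for the battery data can be derived from the absolute frequencies alone. The following is a minimal sketch, assuming classes of width 10 with the lower endpoints listed; the variable names are illustrative.

```python
from itertools import accumulate

lower_endpoints = [360, 370, 380, 390, 400, 410, 420, 430]
width = 10
f = [2, 3, 5, 7, 5, 4, 3, 1]           # absolute frequencies
n = sum(f)

p = [fi / n for fi in f]               # relative frequencies
c = [lo + width / 2 for lo in lower_endpoints]   # class midpoints
caf = list(accumulate(f))              # cumulative absolute frequencies
crf = [100 * a / n for a in caf]       # cumulative relative frequencies, in %

for lo, fi, pi, ci, a, r in zip(lower_endpoints, f, p, c, caf, crf):
    print(f"{lo} - {lo + width}  {fi:2d}  {pi:.2f}  {ci:.0f}  {a:2d}  {r:6.2f}")
```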
1.7.4. Outliers
An outlier is an extreme value of the data. It is an observation

value that is significantly different from the rest of the data.
There may be more than one outlier in a set of data. Sometimes, outliers
are significant pieces of information and should not be ignored. Other
times, they occur because of an error or misinformation and should be
ignored.
Example 1.9.
Weights for 20 products were measured and the following results were
recorded:
61.7, 58.4, 59.2, 61.5, 61.4, 59.8, 59.0, 61.1, 61.6, 56.3,
61.9, 65.7, 58.9, 59.0, 61.2, 61.4, 58.4, 60.0, 59.3, 61.9
In this case, the stems will be the whole number values and the leaves
will be the decimal values. The data range from 56.3 to 65.7, so the
stems should start at 56 and finish at 65. The following table is a stem
and leaf plot for the weights of 20 products:
Constructing stem and leaf plot.
Weights of 20 products
Stem Leaf
56 3
57
58 449
59 00238
60 0
61 124456799
62
63
64
65 7
In this case, 56.3 and 65.7 could be considered as outliers, since these
two values are quite different from the other values.
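The stem and leaf plot can also be built programmatically. This is a minimal sketch under the same convention as above (whole number as stem, first decimal as leaf); the names are illustrative.

```python
from collections import defaultdict

weights = [61.7, 58.4, 59.2, 61.5, 61.4, 59.8, 59.0, 61.1, 61.6, 56.3,
           61.9, 65.7, 58.9, 59.0, 61.2, 61.4, 58.4, 60.0, 59.3, 61.9]

leaves = defaultdict(list)
for w in weights:
    tenths = round(w * 10)            # 61.7 -> 617; rounding avoids floating-point drift
    stem, leaf = divmod(tenths, 10)   # 617 -> stem 61, leaf 7
    leaves[stem].append(leaf)

# Print every stem from the lowest to the highest, including empty ones.
for stem in range(min(leaves), max(leaves) + 1):
    row = "".join(str(d) for d in sorted(leaves[stem]))
    print(f"{stem} | {row}")
```

Stems without leaves (57, 62, 63, 64) are printed as empty rows, which makes the isolation of 56.3 and 65.7 visible at a glance.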
1.8. DATA PRESENTATION: TABLES,

DIAGRAMS AND GRAPHS
The two most important ways of presenting data are tables with
frequency distributions, presented previously, and graphs.
Why use graphs when presenting data? Because graphs:
• are quick and direct
• highlight the most important facts
• facilitate understanding of the data
• can convince readers
• can be easily remembered.
Knowing what type of graph to use with what type of information is

crucial. Depending on the nature of the data and variable type some
graphs might be more appropriate than others. A graph is not always
the most appropriate tool to present information. Sometimes text or a
data table can provide a better explanation to the readers and save you
considerable time and effort. We might want to reconsider the use of a
graph when:
• the data are very dispersed
• there are too few data (one, two or three data points)
• the data are very numerous
• the data show little or no variations.
A qualitative variable can be represented using:
• simple columns (bar graphs),
• a structural column,
• a structural circle (pie) or half-circle.
If it is a nominal variable, the order is irrelevant, and if it is an ordinal
variable, the order of columns is relevant and must not be disturbed.
Depending on the type, a quantitative variable may be represented by:
A small number of data, ungrouped series:
• Tukey’s stem-and-leaf diagram (S-L) (Stem and Leaf Plot)
• x – axis
A grouped series:
• Split columns – bar charts (discrete series, no intervals)
• Structured column
• Structured circle – pie
• histogram – adjoining columns (discrete series, intervals)
• polygon of absolute frequencies
• polygon of cumulative frequencies
• line diagram (discrete non-interval grouped series)
In the case of intervals with various class widths, we cannot draw a
histogram with absolute frequencies; instead we use corrected absolute
frequencies, which adjust each absolute frequency for the width of its
class.
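The correction can be written out explicitly. The form below is one common convention, stated here as an assumption rather than as the authors' own formula; the symbol l_0 for a reference width is introduced only for this sketch.

```latex
\[
  f_i^{c} = f_i \cdot \frac{l_0}{l_i}
\]
% f_i : absolute frequency of class i
% l_i : width of class i
% l_0 : chosen reference width (for example the smallest class width;
%       with l_0 = 1 the corrected frequency is the frequency per unit of the variable)
```

For instance, a class of width 20 with absolute frequency 8, drawn against a reference width of 10, gets the corrected frequency 8 · 10 / 20 = 4, so that the area of each column stays proportional to the absolute frequency.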
You too can experiment with different types of graphs and select the
most appropriate. Here are several suggestions for an appropriate selection
according to the effect that you want to achieve with the graph:
• pie chart (description of components)
• horizontal bar graph (comparison of items and relationships, time series)
• vertical bar graph (comparison of items and relationships, time series,
frequency distribution)
• line graph (time series and frequency distribution)
• scatter plot (analysis of relationships)
• histogram (continuous variable).
If you decide that a graph is the best way to present your information,
then no matter what type of graph you use, you need to keep in mind
the following rules:
• convey an important message
• decide on a clear purpose
• draw attention to the message, not the source
• experiment with various options and graph styles
• use simple design for complex data
• make the data 'speak'
• adapt graph presentation to suit the target audience
• ensure that the visual perception process is easy and accurate
• avoid distortion and ambiguity
• optimize design and integrate style with text and tables.
The next table describes different types of graphical presentation of data:
Table 1.4. Graph type
Graph type | Description
Age pyramid | Represents age structure of a population.
Vertical bar graph | Compares important data values. Displays data better than horizontal bar graphs, and is preferred to use when possible.
Dot graph | Displays a comparatively large number of categories when category order is unimportant. Best used when portraying category values in descending order.
Histogram | Shows continuous variable data in a similar way to column graphs, but without the gap between the columns.
Histograph (frequency polygon) | Depicts continuous variable data. Smoothes abrupt changes which may appear in a histogram.
Horizontal bar graph | Compares important data. Useful when category names are too long to fit at the foot of a column.
Line graph | Depicts data over time.
Pictograph | Favored by professional graphic artists, although students can create simple pictorial presentations as well. Comparisons must be accurately depicted and respect the scale.
Pie chart | Compares a small number of categories. Values should be markedly different, or differences may not be easy to decipher. Labeling pie segments with their actual values overcomes this problem. When data points are similar, the pie chart's message may be misunderstood. A bar graph may be better in this case.
Scatter plot | Measures two or more variables thought to be related.
If we use Excel, we can apply different types of graphs:
Now, we will give examples for different graph types.
Example 1.3. (cont.) Graphics
Graphical presentation of data using Excel.
Degree of agreement | Number of interviewed – absolute frequency
Strongly agree | 20
Agree somewhat | 30
Not sure | 20
Disagree somewhat | 15
Strongly disagree | 15
Total | 100
Example 1.6. (cont.) Graphics
Graphical presentation of data using Excel.
Number of cars ( xi ) Frequency ( fi ) CAF CRF (%)
0 4 4 20
1 6 10 50
2 5 15 75
3 3 18 90
4 2 20 100
Total 20
Example 1.8. (cont.) - Graphics
Graphical presentation of data using Excel.
xi fi ci CAFi CRF (%)i
360 – 370 2 365 2 6.67
370 – 380 3 375 5 16.67
380 – 390 5 385 10 33.33
390 – 400 7 395 17 56.67
400 – 410 5 405 22 73.33
410 – 420 4 415 26 86.67
420 – 430 3 425 29 96.67
430 – 440 1 435 30 100.00
Total 30
CHAPTER 2
DESCRIPTIVE STATISTICS
2.1. INTRODUCTION
Descriptive statistics is used to describe the basic features of the data

in a study. It provides simple summaries about the sample and the
measures. Together with simple graphics analysis, it forms the basis of
virtually every quantitative analysis of data.
Descriptive statistics is typically distinguished from inferential statistics.

With descriptive statistics you are simply describing what is or what
the data shows. With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone. For instance,
we use inferential statistics to try to infer conclusions about an entire
population from the sample data. Or, we use inferential statistics to
make judgments of the probability that an observed difference between
groups is a dependable one or one that might have happened by chance
in this study. Thus, we use inferential statistics to make inferences from
our data to more general conditions; we use descriptive statistics simply
to describe what’s going on in our data.
Descriptive statistics is used to present quantitative descriptions in a

manageable form. In a research study we may apply lots of measures. Or
we may measure a large number of people on any measure. Descriptive
statistics help us to simplify large amounts of data in a sensible way.
Each descriptive statistic reduces lots of data into a simpler summary.
For instance, consider a simple number used to summarize how well a
batter is performing in baseball, the batting average. This single number
is simply the number of hits divided by the number of times at bat
(reported to three significant digits). A batter who is hitting 0.333 is
getting a hit one time in every three at bats. A batter who is hitting
0.250 is hitting one time in four. The single number describes a large
number of discrete events. Or, consider the scourge of many students,
the Grade Point Average (GPA). This single number describes the
general performance of a student across a potentially wide range of
course experiences.
Univariate analysis involves the examination across cases of one variable
at a time.
There are five major characteristics of a single variable that we

tend to look at:
• the frequency distribution
• the central tendency or location
• the dispersion
• the shape ((a)symmetry and kurtosis)
• the concentration.
In most situations, we would describe all of these characteristics for

each of the variables in our study.
There may be several objectives for formulating a summary statistic:
• To choose a statistic that shows how different units seem similar.
Statistical textbooks call one solution to this objective a measure
of central tendency.
• To choose another statistic that shows how they differ. This kind of
statistic is often called a measure of statistical variability.
• To compare statistics for real variables with common statistics for
some theoretical distributions like the normal or binomial distribution.
In this case we use measures that show the shape of the real distribution.
• To measure the level of concentration for a given real variable.
2.2. MEASURES OF CENTRAL TENDENCY
Measures of location or measures of central tendency summarize a list

of numbers by a “typical” value. Measure of central tendency can be:
• calculational or complete – measure that works with all

data (arithmetic mean, harmonic mean and geometric
mean).
• positional or incomplete – measure that doesn’t work with
all data (mode, median, quartiles etc).
The three most common measures of location are the mean,

the median and the mode:
• The arithmetic mean is the sum of the values, divided by

the number of values. It has the smallest possible sum of
squared differences from members of the list.
• The median is the middle value in the sorted list. It is the

smallest number which is at least as big as at least half
the values in the list. It has the smallest possible sum of
absolute differences from members of the list.
• The mode is the most frequent value in the list (or one
of the most frequent values, if there is more than one). It
differs from the fewest possible members of the list.
The central tendency of a distribution is an estimate of the “centre” of a

distribution of values. When summarizing a quantity like length or weight
or age, it is common to describe the centre with the arithmetic
mean, the median, or, in case of a unimodal distribution, the mode.
Sometimes, it can be useful to calculate specific measures from the
cumulative distribution function such as quartiles, quintiles or percentiles.
2.2.1. Arithmetic mean
The arithmetic mean or average or just mean is probably the most

commonly used method of describing central tendency.
To compute the mean, add up all the values and divide the sum by the
number of values.
For example, the mean or average quiz score is determined by summing
all the scores and dividing by the number of students taking the
exam. Consider the test score values:
15, 20, 21, 20, 36, 15, 25, 15
The sum of these 8 values is 167, so the mean value is:
\[
  \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{167}{8} = 20.875
\]
If we work with a sample of size n, we take n instead of the population
size N in the formula for the mean.
The mean of a frequency distribution can be calculated according to
the relation:
\[
  \bar{x} = \frac{1}{N}\sum_{i=1}^{k} f_i x_i , \qquad N = \sum_{i=1}^{k} f_i ,
\]
where k is the number of modalities, but if the distribution is given with
classes we have to replace the original modalities with the class marks:
\[
  \bar{x} = \frac{1}{N}\sum_{i=1}^{k} f_i c_i
\]
This rule about changing xi into ci if there is a distribution with classes
will be applied for all parameters of descriptive statistics.
While probably not intuitively obvious, the mean has a very desirable
property: it is the “best guess” for a score in the distribution, when
we measure “best” as least in error¹³. This might seem especially odd
because, in some cases, the mean is not a value that can actually occur:
no one would report 5.4 best friends, so if you
guessed 5.4 for someone, you would always be wrong! But if you measure
how far off your guess would tend to be from the actual score that you
are trying to guess, 5.4 would produce the smallest error in your guess.
It is worth elaborating on this point because it is important. Suppose I
put the data into a hat, and pulled the scores out of the hat one by one
and each time I ask you to guess the score I pulled out of the hat. After
each guess, I record how far off your guess was, using the formula:
error = actual score - guess. Repeating this procedure for all the scores,
we can compute your mean error. Now, if you always guessed the mean (here 5.4), your
mean error would be, guess what? Mean error would be 0! Any other
constant guess would produce a mean error different from
zero. Because of this, the mean is often used to characterize the “typical”
value in a distribution. No other single number we could report would
more accurately describe every data point in the distribution.
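The guessing argument can be checked numerically. The sketch below uses the eight quiz scores introduced earlier in this section rather than the "best friends" data, which are not listed in the text; the function names are illustrative.

```python
scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = sum(scores) / len(scores)


def mean_error(guess):
    """Average of (actual score - guess) when the same guess is used every time."""
    return sum(x - guess for x in scores) / len(scores)


def mean_squared_error(guess):
    """Average of squared errors for a constant guess."""
    return sum((x - guess) ** 2 for x in scores) / len(scores)


print(mean_error(mean))          # guessing the mean: errors cancel out
print(mean_error(20))            # guessing the median instead
print(mean_squared_error(mean), mean_squared_error(20))
```

Guessing the mean gives a mean error of zero, and it also gives a smaller mean squared error than guessing the median (20), which is the sense in which the mean is "least in error".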
Main characteristics of the mean are:
If we have a series of data that are all equal to a constant c, then the
arithmetic mean of the series is equal to the constant.
Proof:
\[
  x_i = c \ \text{for all } i \;\Rightarrow\; \bar{x} = \frac{1}{N}\sum_{i=1}^{N} c = \frac{Nc}{N} = c
\]
The arithmetic mean is placed between the lowest and highest value
of the series.
13
http://www.une.edu.au/WebStat/unit_materials/c4_descriptive_statistics/central_tendency_
measure.html, access: 14. 11. 2009.
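Proof:
\[
  x_{\min} \le x_i \le x_{\max} \ \text{for all } i
  \;\Rightarrow\; N x_{\min} \le \sum_{i=1}^{N} x_i \le N x_{\max}
  \;\Rightarrow\; x_{\min} \le \bar{x} \le x_{\max}
\]
The sum of deviations of observations from the arithmetic mean is 0.
Proof:
\[
  \sum_{i=1}^{N} (x_i - \bar{x}) = \sum_{i=1}^{N} x_i - N\bar{x} = N\bar{x} - N\bar{x} = 0
\]
The characteristic of aggregating the arithmetic mean.
If we multiply each observation by the same constant, the arithmetic
mean of the new variable is equal to the product of the constant and
the arithmetic mean of the original variable.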
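Proof:
\[
  y_i = a x_i \;\Rightarrow\; \bar{y} = \frac{1}{N}\sum_{i=1}^{N} a x_i = a \cdot \frac{1}{N}\sum_{i=1}^{N} x_i = a\bar{x}
\]
If we add the same constant to each observation, the arithmetic
mean of the new variable is equal to the sum of the constant and the
arithmetic mean of the original variable.
Proof:
\[
  y_i = x_i + a \;\Rightarrow\; \bar{y} = \frac{1}{N}\sum_{i=1}^{N} (x_i + a) = \bar{x} + \frac{Na}{N} = \bar{x} + a
\]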
According to the formula and the main characteristics we can also conclude
that the mean is “sensitive” to changes in any data from a series.
Because its computation is based on every observation, the mean is
greatly affected by any extreme value or values. Therefore, use of the mean is not
recommended if the series contains outliers or other data that “spoil” it.
2.2.2. Harmonic mean
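Harmonic mean is the reciprocal value of the arithmetic mean
of the reciprocal values of the observations.
For non-grouped data harmonic mean is equal to:
\[
  H = \frac{N}{\sum_{i=1}^{N} \dfrac{1}{x_i}}
\]
If we work with a sample of size n, we take n instead of N
in the formula for harmonic mean.
The harmonic mean for a frequency distribution can be calculated using
the following relation:
\[
  H = \frac{N}{\sum_{i=1}^{k} \dfrac{f_i}{x_i}} , \qquad N = \sum_{i=1}^{k} f_i
\]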

It is calculated when the original data are expressed as reciprocal

values. Reciprocal values change opposite from the direction of changes
of original values. Harmonic mean is used to express the indirect
relation (productivity in the form of time required to produce a unit
of product, capital turnover time, speed and distance covered, population
density...).
The harmonic mean is sensitive to a single small value. The harmonic

mean tends to be small if at least one of the values of analyzed variable
is abnormally small. For this reason, the harmonic mean is often used,
for example, to aggregate scores in different types of activity to a single
final score, e.g. to estimate students’ performance. This ensures that no
partial scores are radically lower than the final score.¹⁴
14
http://www.statistics.com/resources/glossary/h/harmmean.php, access: 25. 01. 2010.
For example, we want to know the average turnover time
of production means in a company, if it is known that $30,000 was
invested in means with 15 years of useful life, $14,000 was invested in
means with 7 years of useful life, and $40,000 was invested in means
with 3 months (0.25 years) of useful life.
Turnover time and invested funds are indirectly related, so we
calculate the harmonic mean, weighting each useful life by the invested
funds:
\[
  H = \frac{30{,}000 + 14{,}000 + 40{,}000}{\dfrac{30{,}000}{15} + \dfrac{14{,}000}{7} + \dfrac{40{,}000}{0.25}}
    = \frac{84{,}000}{164{,}000} \approx 0.51 \ \text{years}
\]
The average turnover time of production means in the company is
approximately 6 months.
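The turnover example can be verified with a short calculation. This is a minimal sketch of a weighted harmonic mean, with the invested amounts acting as frequencies; the variable names are illustrative.

```python
invested = [30_000, 14_000, 40_000]     # dollars, used as weights
useful_life_years = [15, 7, 0.25]       # 3 months = 0.25 years

# Weighted harmonic mean: total weight divided by the sum of weight / value.
h_years = sum(invested) / sum(f / x for f, x in zip(invested, useful_life_years))
h_months = 12 * h_years
print(round(h_years, 3), round(h_months, 2))
```

The result is slightly more than half a year, in line with the "approximately 6 months" stated above.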
2.2.3. Geometric mean
Geometric mean is equal to the Nth root of the product of all
observations.
Instead of adding the set of numbers and then dividing the sum by the
count of numbers in the set, for the geometric mean the numbers are
multiplied and then the Nth root of the resulting product is taken. For
non-grouped data geometric mean is equal to:
\[
  G = \sqrt[N]{x_1 \cdot x_2 \cdots x_N} = \sqrt[N]{\prod_{i=1}^{N} x_i}
\]
If we work with a sample of size n, we take n instead of N
in the formula for geometric mean.
The geometric mean for a frequency distribution can be calculated
using the formula:
\[
  G = \sqrt[N]{\prod_{i=1}^{k} x_i^{f_i}} , \qquad N = \sum_{i=1}^{k} f_i
\]
We can use the geometric mean only for data sets where all values are
positive ($x_i > 0$). It is used
when phenomena behave according to a geometric progression.
It is important in the analysis of time series, for calculating the
average growth rate.
For example, we monitored the changes of gross investments for 9
years using appropriate chain indices and we want to know the average
chain index:
t I II III IV V VI VII VIII IX
It/t-1 (%) 122 124 125 121 142 179 193 196 274
We use the geometric mean, as is usual in the economic analysis of
time series:
\[
  G = \sqrt[9]{122 \cdot 124 \cdot 125 \cdot 121 \cdot 142 \cdot 179 \cdot 193 \cdot 196 \cdot 274} \approx 157.69
\]
The average chain index in the period was 157.69%.
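The average chain index can be recomputed from the table. The sketch below evaluates the geometric mean in two equivalent ways, through logarithms and through the direct product (the latter assumes Python 3.8 or later for math.prod); the variable names are illustrative.

```python
import math

chain_indices = [122, 124, 125, 121, 142, 179, 193, 196, 274]   # in percent
n = len(chain_indices)

# Geometric mean via logarithms: exp of the arithmetic mean of the logs.
g = math.exp(sum(math.log(i) for i in chain_indices) / n)

# The same value as the n-th root of the product.
g_direct = math.prod(chain_indices) ** (1 / n)

print(round(g, 2), round(g_direct, 2))
```

The logarithmic form is the safer choice for long series, because the product of many indices can become very large.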
2.2.4. Median
The most important positional measure of central tendency is the
median. The median is the score found at the exact middle of
the set of values: 50% of the observations are smaller than
or equal to the median and the other 50% are larger than or equal to it.
One way to compute the median is to list all scores in numerical order,
and then locate the score in the center of the sample. The theoretical
position of the median according to absolute frequencies is N/2, or according
to relative frequencies 0.5. For example, if there are 500 scores in the
list, the theoretical position of the median is 250.
There is a rule: if N is an odd number, then for an ordered set of
data the median is equal to the value at position (N+1)/2. But if
N is an even number, then for an ordered set of data the median is equal to
the arithmetic mean of the values at positions N/2 and N/2 + 1.¹⁵ Or by formula,
for an ordered set of data the median is equal to:
\[
  Me =
  \begin{cases}
    x_{(N+1)/2}, & N \text{ odd} \\[4pt]
    \dfrac{x_{N/2} + x_{N/2+1}}{2}, & N \text{ even}
  \end{cases}
\]
15
If the two middle scores had different values, we would have to interpolate to determine the
median.
For example, quiz score for 8 students taking the exam are given:
15, 20, 21, 20, 36, 15, 25, 15
If we order the 8 scores, we would get:
x1 = 15, x2 = 15, x3 = 15, x4 = 20, x5 = 20, x6 = 21, x7 = 25, x8 = 36
There are 8 scores and scores x4 and x5 represent the halfway point.
Since both of these scores are 20, the median is 20.
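Determination of the median for a frequency distribution is based on the
increasing cumulative frequencies. If we work with absolute cumulative
frequencies, the first modality or interval where the condition
$CAF_i \ge N/2$ is fulfilled
is the median or the interval which contains the median. If it is an
interval, then the median is determined using the following formula:
\[
  Me = L_{1,i} + \frac{\dfrac{N}{2} - CAF_{i-1}}{f_i} \cdot l_i
\]
where $L_{1,i}$, $f_i$ and $l_i$ are the lower endpoint, the absolute
frequency and the width of the median interval, and $CAF_{i-1}$ is the
cumulative absolute frequency of the preceding interval.
Another way to calculate the median for a frequency distribution is
based on the increasing cumulative relative frequencies. The first
modality or interval where the condition $CRF_i \ge 0.5$ is fulfilled is
the median or the interval which contains the median. If it is an interval,
then the median is determined using the following formula:
\[
  Me = L_{1,i} + \frac{0.5 - CRF_{i-1}}{p_i} \cdot l_i
\]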
Graphically we can determine the median on polygon of cumulative

frequencies (absolute or relative).
When we have a series with pronounced heterogeneity or with outliers,
we should use the median rather than the mean as the measure of central
tendency.
2.2.5. Mode
The mode is positional measure of central tendency that

represents the most frequently occurring value in the set of
scores. To determine the mode, you might again order the
scores as shown above, and then count each of them. The most
frequently occurring value is the mode.
In our example (quiz score for 8 students taking the exam where
following scores are obtained: 15, 20, 21, 20, 36, 15, 25, 15), mode
value is the value 15, which occurs the most frequently in the series
(three times). In some distributions there is more than one modal value.
For instance, in a bimodal distribution there are two values that occur
most frequently.
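The mode is determined for a statistical distribution (grouped series).
It is graphically represented via a histogram. In a non-interval grouped
distribution, determination of the mode value is based on the highest
frequency: the mode is the modality $x_i$ for which $f_i = f_{\max}$. For an
interval grouped distribution, the mode lies in the interval with the
highest frequency and is determined on the basis of the following formula:
\[
  Mo = L_{1,i} + \frac{f_i - f_{i-1}}{(f_i - f_{i-1}) + (f_i - f_{i+1})} \cdot l_i
\]
where i denotes the modal interval, $f_{i-1}$ and $f_{i+1}$ are the
frequencies of the neighbouring intervals, and $l_i$ is the width of the
modal interval.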
Notice that for the same set of 8 scores we got three different values
(20.875, 20, and 15) for the mean, median and mode respectively. If the
distribution is truly normal (i.e., bell-shaped), the mean, median and
mode are all equal to each other.
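The three values quoted for the quiz scores can be checked with Python's standard statistics module. This is only a verification sketch; the variable name is illustrative.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))     # arithmetic mean
print(statistics.median(scores))   # mean of the two middle values, since N is even
print(statistics.mode(scores))     # most frequent value
```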
The mode is used only for descriptive purposes because the mode is
more variable from sample to sample than other measures of central
tendency.
If we want to know what the most common modality is, we will
use the mode as the measure of central tendency. The mode is used less than
either the mean or the median in business applications. Perhaps its most
obvious use is by manufacturers who produce goods, such as clothing,
in various sizes. The modal size of items sold is then the one in heaviest
demand.
Graphically we can determine the mode on histogram.
2.2.6. Quartiles
Quartiles are positional measures of central tendency which
divide the statistical series (with ordered data) into
four equal parts or four quarters.
Each of the parts contains 25% of the data from the series. There are three
quartiles: Q1, Q2 = Me and Q3. The first quartile is a value for which
25% of the observations are smaller than or equal to it while the other 75% are
larger. The third quartile is a value for which 75% of the observations
are smaller than or equal to it and 25% are larger.
Theoretical positions of quartiles within series of data (represented by

the absolute frequencies) are:
For Q1 ⇒ N/4
For Me ⇒ N/2
For Q3 ⇒ 3·(N/4)
Theoretical positions of quartiles within series of data (represented by

the relative frequencies) are:
For Q1 ⇒ 0.25
For Me ⇒ 0.50
For Q3 ⇒ 0.75
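Determination of the quartiles for a frequency distribution is based
on the increasing cumulative frequencies. If we work with absolute
cumulative frequencies, then:
The first modality or interval where the condition $CAF_i \ge N/4$ is
fulfilled is the first quartile or the interval which contains the
first quartile. If it is an interval, then the first quartile is determined
by using the following formula:
\[
  Q_1 = L_{1,i} + \frac{\dfrac{N}{4} - CAF_{i-1}}{f_i} \cdot l_i
\]
The first modality or interval where the condition $CAF_i \ge 3N/4$ is
fulfilled is the third quartile or the interval which contains the third
quartile. If it is an interval, then the third quartile is determined by
using the following formula:
\[
  Q_3 = L_{1,i} + \frac{\dfrac{3N}{4} - CAF_{i-1}}{f_i} \cdot l_i
\]
Another way to calculate the quartiles for a frequency distribution is
based on the increasing cumulative relative frequencies:
The first modality or interval where the condition $CRF_i \ge 0.25$ is
fulfilled is the first quartile or the interval which contains the
first quartile. If it is an interval, then the first quartile is determined
by using the following formula:
\[
  Q_1 = L_{1,i} + \frac{0.25 - CRF_{i-1}}{p_i} \cdot l_i
\]
The first modality or interval where the condition $CRF_i \ge 0.75$ is
fulfilled is the third quartile or the interval which contains the
third quartile. If it is an interval, then the third quartile is determined
by using the following formula:
\[
  Q_3 = L_{1,i} + \frac{0.75 - CRF_{i-1}}{p_i} \cdot l_i
\]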
Graphically we can determine quartiles on polygon of cumulative

frequencies (absolute or relative).
2.3. EXAMPLES FOR MEASURES

OF CENTRAL TENDENCY
Example 2.1.
The following data represent the total daily number of produced burgers
(’000) from a selected 20 fast-food chains in one town:
34, 15, 9, 19, 31, 34, 35, 39, 19, 34, 43, 7, 9, 15, 19, 35, 15, 19, 9, 31.
a) Create frequency distribution.

b) Calculate mean, median, quartiles and mode. Explain.
Solution:
Firstly, we will make a series arranged in order from the smallest to the
largest in size:
7, 9, 9, 9, 15, 15, 15, 19, 19, 19, 19, 31, 31, 34, 34, 34, 35, 35, 39, 43.
a) We have a discrete variable with a small number of modalities ⇒ ( xi , fi )
form of frequency distribution.
Constructing frequency distribution.
xi fi
7 1
9 3
15 3
19 4
31 2
34 3
35 2
39 1
43 1
n 20
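b)
xi fi xi . fi CAF
7 1 7 1
9 3 27 4
15 3 45 7
19 4 76 11
31 2 62 13
34 3 102 16
35 2 70 18
39 1 39 19
43 1 43 20
n 20 471
Mean:
Calculating and interpreting arithmetic mean.
\[
  \bar{x} = \frac{1}{n}\sum_{i} f_i x_i = \frac{471}{20} = 23.55
\]
The average daily number of produced burgers in the analyzed sample was
23,550 burgers.
Median:
Calculating and interpreting median.
\[
  \frac{n}{2} = 10 \;\Rightarrow\; CAF_i \ge 10 \text{ first holds for } x_i = 19 \ (CAF_i = 11) \;\Rightarrow\; Me = 19
\]
50% of analyzed fast-food chains have daily production of burgers equal
to or less than 19,000, while 50% of analyzed fast-food chains produce
more than 19,000 burgers daily.
First quartile:
Calculating and interpreting first quartile.
\[
  \frac{n}{4} = 5 \;\Rightarrow\; CAF_i \ge 5 \text{ first holds for } x_i = 15 \ (CAF_i = 7) \;\Rightarrow\; Q_1 = 15
\]
25% of analyzed fast-food chains have daily production of burgers equal
to or less than 15,000, while 75% of analyzed fast-food chains produce
more than 15,000 burgers daily.
Third quartile:
Calculating and interpreting third quartile.
\[
  \frac{3n}{4} = 15 \;\Rightarrow\; CAF_i \ge 15 \text{ first holds for } x_i = 34 \ (CAF_i = 16) \;\Rightarrow\; Q_3 = 34
\]
75% of analyzed fast-food chains have daily production of burgers equal
to or less than 34,000, while 25% of analyzed fast-food chains produce
more than 34,000 burgers daily.
Mode:
Calculating and interpreting mode.
\[
  f_{\max} = 4 \;\Rightarrow\; Mo = 19
\]
In this sample, the most frequently occurring fast-food chain is one with a
production of 19,000 burgers per day.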
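The measures of Example 2.1 can be recomputed from the frequency distribution. This is a minimal sketch that follows the cumulative-frequency rule described in the text (the first modality whose cumulative absolute frequency reaches the theoretical position); the helper name `positional` is illustrative.

```python
from itertools import accumulate

x = [7, 9, 15, 19, 31, 34, 35, 39, 43]    # daily production, in thousands
f = [1, 3, 3, 4, 2, 3, 2, 1, 1]           # absolute frequencies
n = sum(f)
caf = list(accumulate(f))                 # cumulative absolute frequencies

mean = sum(xi * fi for xi, fi in zip(x, f)) / n


def positional(position):
    """First modality whose cumulative absolute frequency reaches `position`."""
    return next(xi for xi, c in zip(x, caf) if c >= position)


q1 = positional(n / 4)
median = positional(n / 2)
q3 = positional(3 * n / 4)
mode = x[f.index(max(f))]                 # modality with the highest frequency

print(mean, q1, median, q3, mode)
```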
Example 2.2.
If we consider the heights of the students in a class, the frequency table

is given below:
Height range (feet) Number of students

4.5 – 5.0 25
5.0 – 5.5 35
5.5 – 6.0 20
6.0 – 6.5 10
Total 90
Calculate mean, median, quartiles and mode. Explain.
Solution:
xi fi pi ci ci . fi CAF CRF
4.5 – 5.0 25 0.28 4.75 118.75 25 0.28
5.0 – 5.5 35 0.39 5.25 183.75 60 0.67
5.5 – 6.0 20 0.22 5.75 115.00 80 0.89
6.0 – 6.5 10 0.11 6.25 62.50 90 1.00
Total 90 1.00 480
Mean:
Calculating and interpreting arithmetic mean.
\[
  \bar{x} = \frac{1}{n}\sum_{i} f_i c_i = \frac{480}{90} \approx 5.33
\]
The average height of students in the class is 5.33 feet.
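Median:
Calculating and interpreting median.
\[
  \frac{n}{2} = 45 \;\Rightarrow\; \text{median interval } 5.0 - 5.5 \ (CAF_i = 60 \ge 45)
\]
\[
  Me = 5.0 + \frac{45 - 25}{35} \cdot 0.5 \approx 5.286
\]
Or with cumulative relative frequencies:
\[
  Me = 5.0 + \frac{0.5 - 0.2778}{0.3889} \cdot 0.5 \approx 5.286
\]
50% of analyzed students have height equal to or less than 5.286 feet,
while 50% of analyzed students are taller than 5.286 feet.
First quartile:
Calculating and interpreting first quartile.
\[
  \frac{n}{4} = 22.5 \;\Rightarrow\; \text{interval } 4.5 - 5.0 \ (CAF_i = 25 \ge 22.5), \qquad
  Q_1 = 4.5 + \frac{22.5 - 0}{25} \cdot 0.5 = 4.95
\]
25% of analyzed students have height equal to or less than 4.95 feet,
while 75% of analyzed students are taller than 4.95 feet.
Third quartile:
Calculating and interpreting third quartile.
\[
  \frac{3n}{4} = 67.5 \;\Rightarrow\; \text{interval } 5.5 - 6.0 \ (CAF_i = 80 \ge 67.5), \qquad
  Q_3 = 5.5 + \frac{67.5 - 60}{20} \cdot 0.5 \approx 5.69
\]
75% of analyzed students have height equal to or less than 5.69 feet,
while 25% of analyzed students are taller than 5.69 feet.
[Figure: graphical presentation of the median.]
Mode:
Calculating and interpreting mode.
\[
  Mo = 5.0 + \frac{35 - 25}{(35 - 25) + (35 - 20)} \cdot 0.5 = 5.2
\]
In this sample, the most frequently occurring student height is
approximately 5.2 feet.
[Figure: graphical presentation of the mode.]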
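The results of Example 2.2 can be reproduced with the interpolation approach for interval grouped data. The sketch below assumes equal class widths of 0.5 feet and linear interpolation inside the interval that contains the required position; the helper name `interpolate` is illustrative.

```python
from itertools import accumulate

lower = [4.5, 5.0, 5.5, 6.0]     # lower endpoints of the classes (feet)
width = 0.5
f = [25, 35, 20, 10]             # absolute frequencies
n = sum(f)
caf = list(accumulate(f))        # cumulative absolute frequencies

# Mean from class midpoints.
mean = sum((lo + width / 2) * fi for lo, fi in zip(lower, f)) / n


def interpolate(position):
    """Value at a given cumulative position, interpolated within its class."""
    i = next(k for k, c in enumerate(caf) if c >= position)
    below = caf[i - 1] if i > 0 else 0
    return lower[i] + (position - below) / f[i] * width


q1 = interpolate(n / 4)
median = interpolate(n / 2)
q3 = interpolate(3 * n / 4)

# Mode: interpolate inside the class with the highest frequency.
m = f.index(max(f))
prev_f = f[m - 1] if m > 0 else 0
next_f = f[m + 1] if m < len(f) - 1 else 0
mode = lower[m] + (f[m] - prev_f) / ((f[m] - prev_f) + (f[m] - next_f)) * width

print(round(mean, 3), round(q1, 3), round(median, 3), round(q3, 3), round(mode, 3))
```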
2.4. MEASURES OF DISPERSION
Dispersion refers to the spread of the values around the central

tendency.
Here is one example of the importance of variability. The average

number of children under 18 per family in the US was 0.89 according to
the 1990 census, so the average family size is about 2.9 people (does it
make sense? what is a family?). If you were in the construction business
that might suggest to you that a two-bedroom home is the right size to
build for the average American family (two parents sharing a room,
and another room for the 0.89 children). However, family sizes vary
over quite a large range; indeed, the same report shows that the average
number of children for families that have children is 1.86, so families
that have children would tend to need a three bedroom home, rather
than a two bedroom home, if the children are to have their own rooms.
There are four common absolute measures of dispersion:

• the range,
• the quartile range,
• the middle absolute distance (MAD) and
• the variance and the standard deviation.
The range is simply the highest value minus the lowest value:
RV = xmax - xmin.
In our example distribution with quiz score for students that take exam,
the highest value is 36 and the lowest is 15, so the range is 36 - 15 = 21.
Relying on the previous measures, we define relative measures of
dispersion such as:
• coefficient of variation
• z value
• coefficient of quartile deviation
2.4.1. The middle absolute distance
Middle absolute distance is the absolute measure of dispersion,

which is constructed as the deviation of analyzed variable
data from a representative parameter.
In order to construct the middle absolute distance we will go through
the following five phases:
Phase 1: Choose a representative indicator
For representative indicator we choose the arithmetic mean because

of its features and advantages compared to other parameters of central
92
tendency and the characteristic that we can check using the theorem
König-Huygens.
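We select any parameter of central tendency and denote it by w. The mean squared deviation of the observations from this parameter is equal to:
$$\frac{1}{N}\sum_{j=1}^{N}(x_j-w)^2=\frac{1}{N}\sum_{j=1}^{N}(x_j-\bar{x})^2+\frac{2(\bar{x}-w)}{N}\sum_{j=1}^{N}(x_j-\bar{x})+(\bar{x}-w)^2$$
Since the second term of this expression is equal to zero¹⁶, the expression reduces to:
$$\frac{1}{N}\sum_{j=1}^{N}(x_j-w)^2=\frac{1}{N}\sum_{j=1}^{N}(x_j-\bar{x})^2+(\bar{x}-w)^2$$
The first term on the right side does not depend on w and is the variance of the variable X. Hence the expression has its minimum value when:
$$(\bar{x}-w)^2=0$$
or when
$$w=\bar{x}$$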
We find that the arithmetic mean is the best representative indicator of central tendency in this sense: the mean squared deviation of the observations from the arithmetic mean is smaller than their mean squared deviation from any other parameter of central tendency.
¹⁶ According to the properties of the arithmetic mean.
Phase 2: We measure the deviation of each observation from the arithmetic mean.
For each observation we calculate this deviation:
$$x_j-\bar{x}$$
Phase 3: We calculate one number to represent all the deviations from the previous phase.
To summarize all the calculated deviations in one number, we calculate their arithmetic mean. The arithmetic mean of the deviations of the data from their arithmetic mean is always equal to zero, because the positive and negative deviations cancel each other out.
Phase 4: Resolving the problem of cancellation between positive and negative deviations from the arithmetic mean.
To prevent positive deviations from the arithmetic mean cancelling negative ones, we take for each observation the absolute value of its deviation from the arithmetic mean (zj):
$$z_j=\left|x_j-\bar{x}\right|$$
Phase 5: MAD calculation
The final phase is to determine the average absolute deviation, which is equal to the arithmetic mean of the absolute deviations of the observations from their arithmetic mean.
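Formulas for calculating MAD are:
for gross data:
for population:
$$MAD=\frac{1}{N}\sum_{j=1}^{N}\left|x_j-\bar{x}\right|$$
for sample:
$$MAD=\frac{1}{n-1}\sum_{j=1}^{n}\left|x_j-\bar{x}\right|$$
for the statistical distribution of absolute frequency
for population:
$$MAD=\frac{1}{N}\sum_{i=1}^{k}f_i\left|x_i-\bar{x}\right|$$
where
$$N=\sum_{i=1}^{k}f_i$$
for sample:
$$MAD=\frac{1}{n-1}\sum_{i=1}^{k}f_i\left|x_i-\bar{x}\right|$$
for the statistical distribution of relative frequency
for population:
$$MAD=\sum_{i=1}^{k}p_i\left|x_i-\bar{x}\right|$$
where
$$\sum_{i=1}^{k}p_i=1$$
(or 100%)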
Depending on the available data we apply the appropriate formula.
The middle absolute distance is a parameter that is easy to explain. A larger middle absolute distance indicates a greater dispersion of the data in relation to their arithmetic mean. This parameter is rarely used and has more theoretical than practical value.
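The five phases can be sketched in a few lines of Python. This is only an illustration: the function name and the `sample` switch are assumptions made for the sketch, and dividing by n - 1 for sample data follows the worked examples 2.3 and 2.5 of this chapter (many texts divide by n in both cases). The data are the exam scores used elsewhere in this chapter.

```python
def mean_absolute_distance(values, sample=False):
    """Average absolute deviation of the values from their arithmetic mean.

    sample=False divides by N (population).
    sample=True divides by n - 1, an assumption taken from the
    worked examples of this chapter.
    """
    n = len(values)
    mean = sum(values) / n                        # Phase 1: representative value
    deviations = [x - mean for x in values]       # Phase 2: deviations (sum to zero)
    absolute = [abs(d) for d in deviations]       # Phase 4: remove the signs
    divisor = n - 1 if sample else n
    return sum(absolute) / divisor                # Phase 5: average them


scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(mean_absolute_distance(scores))               # scores treated as a population
print(mean_absolute_distance(scores, sample=True))  # scores treated as a sample
```

Phase 3 is visible in the fact that `sum(deviations)` is zero, which is why the absolute values are needed before averaging.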
2.4.2. The variance and the standard deviation
Variance and standard deviation can be constructed in four phases. The first three phases are the same as those analyzed for the middle absolute distance (MAD), but the fourth phase is solved differently. To prevent positive deviations from the arithmetic mean cancelling negative ones, we square the distance of each observation from the arithmetic mean and then calculate the arithmetic mean of these squares.
The variance is equal to the arithmetic mean of the squared deviations between the observations and their arithmetic mean.
The standard deviation describes dispersion more fully than the range RV: it uses every observation, whereas the range depends only on the two extreme values, so a single outlier can greatly exaggerate it (as in the example with exam scores, where the single outlier value of 36 stands apart from the rest of the values).
The standard deviation shows the relation between the set of scores and the mean of the sample.
Again let’s analyse the set of scores: 15, 20, 21, 20, 36, 15, 25, 15.
To compute the standard deviation, we will first calculate the distance
between each value and the arithmetic mean. We previously calculated
the mean of 20.875. So, the differences from the mean are as follows:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = 0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = 4.125
15 - 20.875 = -5.875
We should notice that values that are less than the mean have
negative discrepancies and values greater than the mean have positive
discrepancies. For next step, we will square each distance:
(-5.875) . (-5.875) = 34.516

(-0.875) . (-0.875) = 0.766
(0.125) . (0.125) = 0.016
(-0.875) . (-0.875) = 0.766
(15.125) . (15.125) = 228.766
(-5.875) . (-5.875) = 34.516
(4.125) . (4.125) = 17.016
(-5.875) . (-5.875) = 34.516
Now we sum these “squares” to get the Sum of Squares (SS) value. That sum is 350.875 (adding the rounded squares listed above gives 350.878). In the next step, we divide this sum by the number of scores minus 1 (n - 1), because we are working with a sample, not with a population. Here, the result is 350.875 / 7 = 50.125. This value, 50.125, is roughly the average squared distance from the mean and is known as the variance. The unit of measure of the variance is hard to interpret: it is the square of the unit of measure of the analyzed variable.
To get the standard deviation, we take the square root of the variance, because we squared the deviations at an earlier stage: √50.125 = 7.0799. The standard deviation has the same measurement unit as the analyzed variable, so its value has a logical interpretation.
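The same calculation can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative. `variance` and `stdev` divide by n - 1 (sample), while `pvariance` and `pstdev` divide by N (population).

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)
sum_of_squares = sum((x - mean) ** 2 for x in scores)

sample_variance = statistics.variance(scores)   # SS / (n - 1)
sample_sd = statistics.stdev(scores)            # square root of the sample variance
population_sd = statistics.pstdev(scores)       # uses N instead of n - 1

print(mean, sum_of_squares, sample_variance, round(sample_sd, 4))
```

The population standard deviation is always slightly smaller than the sample one, because the same sum of squares is divided by a larger number.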
This computation may seem confusing, but it is actually quite simple. To see this, consider the formula for the standard deviation:
for population:
$$\sigma=\sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j-\bar{x})^2}$$
for sample:
$$s=\sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j-\bar{x})^2}$$
In the numerator of the ratio each score has the mean subtracted from its value, the difference is squared, and the squares are summed. In the denominator we take the number of scores (or the number of scores minus 1 for a sample). The ratio is the variance and its square root is the standard deviation.
The standard deviation is defined as the square root of the sum of the squared deviations from the mean divided by the number of scores (or the number of scores minus one, if we work with a sample). A lower value of the standard deviation indicates a lower dispersion of the variable around the arithmetic mean and a more homogeneous series. The standard deviation is expressed in the same measurement unit as the analyzed variable.
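The variance and standard deviation of a frequency distribution can be calculated by using the formulas:
For population:
$$\sigma^2=\frac{1}{N}\sum_{i=1}^{k}f_i(x_i-\bar{x})^2,\qquad \sigma=\sqrt{\sigma^2}$$
For sample:
$$s^2=\frac{1}{n-1}\sum_{i=1}^{k}f_i(x_i-\bar{x})^2,\qquad s=\sqrt{s^2}$$
For an interval distribution we replace the original modalities with the class marks $c_i$ of the intervals:
For population:
$$\sigma^2=\frac{1}{N}\sum_{i=1}^{k}f_i(c_i-\bar{x})^2,\qquad \sigma=\sqrt{\sigma^2}$$
For sample:
$$s^2=\frac{1}{n-1}\sum_{i=1}^{k}f_i(c_i-\bar{x})^2,\qquad s=\sqrt{s^2}$$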
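The main computational properties of the variance are:
If we add the same number a to each observation, the variance will not change. Or mathematically:
$$\sigma^2_{X+a}=\sigma^2_X$$
Proof:
If $y_j=x_j+a$ then $\bar{y}=\bar{x}+a$ and:
$$\sigma^2_Y=\frac{1}{N}\sum_{j=1}^{N}(y_j-\bar{y})^2=\frac{1}{N}\sum_{j=1}^{N}(x_j+a-\bar{x}-a)^2=\sigma^2_X$$
If we multiply each observation by the same number b, the variance will be multiplied by b². Or mathematically:
$$\sigma^2_{bX}=b^2\sigma^2_X$$
Proof:
If $y_j=bx_j$ then $\bar{y}=b\bar{x}$ and:
$$\sigma^2_Y=\frac{1}{N}\sum_{j=1}^{N}(bx_j-b\bar{x})^2=b^2\sigma^2_X$$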
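From the two previous properties, we can state the following proposition:
$$\sigma^2_{a+bX}=b^2\sigma^2_X$$
Feature of variance aggregation:
If we know the following data for two statistical series, $N_1,\ \bar{x}_1,\ \sigma_1^2$ and $N_2,\ \bar{x}_2,\ \sigma_2^2$, then the variance of the global series can be calculated by the following relation:
$$\sigma^2=\frac{N_1\sigma_1^2+N_2\sigma_2^2}{N_1+N_2}+\frac{N_1(\bar{x}_1-\bar{x})^2+N_2(\bar{x}_2-\bar{x})^2}{N_1+N_2}$$
where $\bar{x}$ is the arithmetic mean of the global series.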
The first term on the right side of the given relation is the weighted arithmetic mean of the variances of the two series and is called the variance within the series. The second term is the variance of the arithmetic means and is called the variance between the series. This rule can be generalized to the aggregation of more than two series. The variance is a dispersion parameter whose numerical value cannot be interpreted directly, but which has the computational properties analyzed above. Therefore we define the standard deviation, whose numerical value can be interpreted directly, but which does not share all the computational properties that we have demonstrated for the variance.
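The properties of the variance discussed above can be checked numerically. The sketch below uses small made-up series (hypothetical data, not taken from the text) and the population variance from Python's `statistics` module; all names are illustrative.

```python
import math
import statistics

pvar = statistics.pvariance   # population variance (divides by N)

x = [2, 4, 4, 4, 5, 5, 7, 9]
a, b = 10, 3

shifted = [v + a for v in x]      # adding a constant
scaled = [b * v for v in x]       # multiplying by a constant
linear = [a + b * v for v in x]   # both at once

print(pvar(x), pvar(shifted), pvar(scaled), pvar(linear))

# Aggregation of two series: variance within plus variance between
series_1 = [2, 4, 6]
series_2 = [1, 3, 5, 7, 9]
n1, n2 = len(series_1), len(series_2)
m1, m2 = statistics.mean(series_1), statistics.mean(series_2)
global_mean = (n1 * m1 + n2 * m2) / (n1 + n2)

within = (n1 * pvar(series_1) + n2 * pvar(series_2)) / (n1 + n2)
between = (n1 * (m1 - global_mean) ** 2 + n2 * (m2 - global_mean) ** 2) / (n1 + n2)
global_variance = pvar(series_1 + series_2)

print(within, between, global_variance)
```

The last line shows that the variance of the pooled data equals the sum of the two components, which is what the aggregation rule states.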
The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal (bell-shaped) or very close to it, the following conclusions can be reached (the three-sigma rule):
• approximately 68% of the scores in the sample fall within one standard deviation of the mean,
• approximately 95% of the scores in the sample fall within two standard deviations of the mean,
• approximately 99.7% of the scores in the sample fall within three standard deviations of the mean.
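These percentages can be reproduced from the standard normal distribution. The sketch below uses `NormalDist` from Python's `statistics` module; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

The shares hold only for a normal distribution; for a strongly skewed series they can differ noticeably.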
Problem with standard deviation, as absolute measure of dispersion,

is that we cannot use standard deviation for comparison of series with
different unit of measure or with different average.
2.4.3. Coefficient of variation
Standard deviation is a measure of variability expressed in the same unit as the variable X. This is why we cannot use the standard deviation to compare the variability of series expressed in different units of measure. To avoid this defect and be able to compare different series, a relative measure of dispersion is defined as the ratio of the standard deviation and the arithmetic mean.
The coefficient of variation is a relative measure of variability which can be used to compare series with different units of measure, because it is a dimensionless (unnamed) number.
$$V=\frac{\sigma}{\bar{x}}\cdot 100$$
or for sample
$$V=\frac{s}{\bar{x}}\cdot 100$$
The coefficient of variation is a dimensionless number and is commonly expressed as a percentage. We use it to compare dispersion when the variables are expressed in different units of measure or when their arithmetic means differ.
2.4.4. Z value
Standard deviation is a parameter that describes the dispersion of the statistical series as a whole. To determine the relative position of an individual value of the variable within the series we can apply the standardized value.
The z value determines the relative position of a modality of the variable in the series:
$$z_j=\frac{x_j-\bar{x}}{\sigma}$$
or for sample
$$z_j=\frac{x_j-\bar{x}}{s}$$
Z values are appropriate for comparing the positions of data in different series. They are specific in that a z value can be calculated for each modality, not only for the series as a whole.
2.4.5. The quartile range, the quartile deviation

and the coefficient of quartile deviation
The quartile range ( IQ = Q3 - Q1 ) is the range from the 25th to the 75th percentile of a distribution.
It represents the “middle half” of the data and is a measure of variability or spread that is robust to outliers.
The quartile deviation (semi-interquartile range) is the quartile range divided by 2.
We calculate the quartiles in the same way as the median, with theoretical positions of 25% and 75%.
The coefficient of quartile deviation is the relative dispersion indicator:
$$V_Q=\frac{Q_3-Q_1}{Q_3+Q_1}$$
A higher value of the coefficient of quartile deviation indicates greater dispersion and vice versa. It is a relative indicator of the variation of the data around the median.
2.5. EXAMPLES FOR MEASURES OF DISPERSION
Example 2.3.
We use data from the sample of 7 participants at one seminar who had
to fill out a form that gave their name, address and age. The following
ages of the participants were recorded:
36, 48, 54, 92, 57, 63, 66
Calculate and explain measures of central tendency and measures of

dispersion.
Solution:
There are only 7 observations (each different from the others), hence we will not construct a frequency distribution. First we order the data:
No. of participant   Age xj   xj - x̄    (xj - x̄)²    |xj - x̄|
1.                   36       -23.429   548.898      23.429
2.                   48       -11.429   130.612      11.429
3.                   54       -5.429    29.469       5.429
4.                   57       -2.429    5.898        2.429
5.                   63       3.571     12.755       3.571
6.                   66       6.571     43.184       6.571
7.                   92       32.571    1,060.898    32.571
Total                416      0.000     1,831.714    85.429
Arithmetic mean: x̄ = 416 / 7 = 59.4286. The average age of the participants in the sample is 59.4286 years.
Median: In an ordered series with an odd number of observations the theoretical position of the median is (N + 1) / 2 = (7 + 1) / 2 = 4, so the median is Me = 57. 50% of the selected participants are 57 years old or younger, while 50% are more than 57 years old.
Mode: All the data have different values and we did not create a frequency distribution, so we cannot determine the mode.
Range of variation: RV = xmax - xmin = 92 - 36 = 56. The range of variation between the youngest and the oldest participant is 56 years.
Standard deviation¹⁷: s = √(1,831.714 / 6) = 17.47. The average linear distance from the average age (59.4286) in the sample is 17.47 years.
Middle absolute distance: MAD = 85.429 / 6 = 14.24. The average absolute distance from the average age (59.4286) in the sample is 14.24 years.
Coefficient of variation: V = 17.47 / 59.4286 · 100 = 29.4%. The relative dispersion of the data about the average is 29.4%.
z value: For xj = 66, the z value is z = (66 - 59.4286) / 17.47 = 0.3762. The participant aged 66 is 0.3762 standard deviations above the average.
¹⁷ The data are given for a sample, so we use the formula for the sample standard deviation (with (N-1)).
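The results of Example 2.3 can be reproduced with a short Python sketch. The variable names are illustrative, and the division by n - 1 in the middle absolute distance is an assumption that follows the figures reported in this example.

```python
import statistics

ages = [36, 48, 54, 92, 57, 63, 66]
n = len(ages)

mean = statistics.mean(ages)
median = statistics.median(ages)
value_range = max(ages) - min(ages)
sample_sd = statistics.stdev(ages)                     # divides by n - 1
mad = sum(abs(x - mean) for x in ages) / (n - 1)       # assumption: n - 1, as in the example
cv_percent = sample_sd / mean * 100
z_66 = (66 - mean) / sample_sd

print(round(mean, 4), median, value_range)
print(round(sample_sd, 2), round(mad, 2), round(cv_percent, 1), round(z_66, 3))
```

The z value computed from the unrounded standard deviation differs from the hand calculation only in the fourth decimal place.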
Example 2.4.
A survey was taken on X Avenue. In each of 20 homes from the sample,

people were asked how many cars were registered to their households.
The results were recorded as follows:
Number of cars   Number of homes
1                6
2                7
3                4
4                3
Total            20
Calculate and explain measures of central tendency and measures of dispersion.
Solution:
xi   fi   xi · fi   CAF   fi(xi - x̄)²   fi|xi - x̄|
1    6    6         6     8.64          7.2
2    7    14        13    0.28          1.4
3    4    12        17    2.56          3.2
4    3    12        20    9.72          5.4
n    20   44              21.2          28
Arithmetic mean: x̄ = 44 / 20 = 2.2. The average number of cars registered per household in the analyzed sample is 2.2 cars.
Median: Me = 2. 50% of the analyzed households from the sample have 2 registered cars or fewer.
Mode: Mo = 2. Households with 2 registered cars are the most frequent in this sample.
Range of variation: RV = xmax - xmin = 4 - 1 = 3. The range of variation is 3 cars.
Standard deviation¹⁸: s = √(21.2 / 19) = 1.0563. The average linear distance from the average number of registered cars per household in the analyzed sample (2.2) is 1.056 cars.
Middle absolute distance: The average absolute distance from the average number of registered cars per household in the analyzed sample (2.2) is 1.4 cars.
Coefficient of variation: V = 1.0563 / 2.2 · 100 = 48.01%. The relative dispersion of the data around the mean is 48.01%.
z value: For xi = 1, the z value is z = (1 - 2.2) / 1.0563 = -1.136. Households with 1 registered car are 1.136 standard deviations below the average.
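For a frequency distribution the same measures are obtained by weighting each modality with its frequency. The following Python sketch reproduces the main results of Example 2.4; the helper `value_at` and the variable names are assumptions made for illustration.

```python
import math

cars = [1, 2, 3, 4]     # modalities x_i
homes = [6, 7, 4, 3]    # absolute frequencies f_i
n = sum(homes)

mean = sum(x * f for x, f in zip(cars, homes)) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(cars, homes))
sample_sd = math.sqrt(sum_sq / (n - 1))      # sample formula, n - 1
cv_percent = sample_sd / mean * 100
z_one_car = (1 - mean) / sample_sd


def value_at(position):
    """Modality of the observation at a 1-based position in the ordered series."""
    cumulative = 0
    for x, f in zip(cars, homes):
        cumulative += f
        if position <= cumulative:
            return x
    raise ValueError("position outside the series")


# With an even number of observations the median averages the two middle values
if n % 2 == 0:
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2
else:
    median = value_at((n + 1) // 2)

mode = cars[homes.index(max(homes))]   # modality with the highest frequency

print(mean, median, mode, round(sample_sd, 3), round(cv_percent, 2), round(z_one_car, 3))
```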
Example 2.5.
Thirty AA batteries from the sample were tested to determine how long
they would last. The results, to the nearest minute, were recorded as
follows:
¹⁸ The 20 homes are a sample, so we use the formula for the sample standard deviation (with (N-1)).
Battery life, minutes Frequency

360 – 370 2
370 – 380 3
380 – 390 5
390 – 400 7
400 – 410 5
410 – 420 4
420 – 430 3
430 – 440 1
Total 30
Calculate and explain measures of central tendency and measures of dispersion.
Solution:
xi          fi   CAF   ci    ci · fi   fi(ci - x̄)²   fi|ci - x̄|
360 – 370   2    2     365   730       2,178         66
370 – 380   3    5     375   1,125     1,587         69
380 – 390   5    10    385   1,925     845           65
390 – 400   7    17    395   2,765     63            21
400 – 410   5    22    405   2,025     245           35
410 – 420   4    26    415   1,660     1,156         68
420 – 430   3    29    425   1,275     2,187         81
430 – 440   1    30    435   435       1,369         37
Total       30               11,940    9,630         442
Arithmetic mean: x̄ = 11,940 / 30 = 398. The average battery life in the analyzed sample is 398 minutes.
Median: Me = 390 + (15 - 10) / 7 · 10 = 397.14. 50% of the analyzed batteries last 397.14 minutes or less, while 50% last longer.
Mode: Mo = 395. A battery life of 395 minutes is the most frequent in the sample.
Quartile 1: Q1 = 380 + (7.5 - 5) / 5 · 10 = 385. 25% of the analyzed batteries have a life of 385 minutes or less, while 75% last longer than 385 minutes.
Quartile 3: Q3 = 410 + (22.5 - 22) / 4 · 10 = 411.25. 75% of the analyzed batteries have a life of 411.25 minutes or less, while 25% last longer than 411.25 minutes.
Range of variation: RV = 440 - 360 = 80. The range of variation is 80 minutes.
Standard deviation¹⁹: s = √(9,630 / 29) = 18.22. The average linear distance from the average battery life in the analyzed sample (398 min.) is 18.22 minutes.
Middle absolute distance: MAD = 442 / 29 = 15.24. The average absolute distance from the average battery life in the analyzed sample (398 min.) is 15.24 minutes.
Coefficient of variation: V = 18.22 / 398 · 100 = 4.578%. The relative dispersion of the data around the mean is 4.578%.
z value: For xi = 405, the z value is z = (405 - 398) / 18.22 = 0.384. A battery with a life of 405 minutes lasts 0.384 standard deviations above the average.
Quartile range: IQ = Q3 - Q1 = 411.25 - 385 = 26.25. When we remove the 25% smallest and the 25% highest data, the new range of variation is 26.25 minutes.
Coefficient of quartile deviation: VQ = (Q3 - Q1) / (Q3 + Q1) · 100 = 26.25 / 796.25 · 100 = 3.30%. The relative dispersion of the data around the median is 3.30%.
¹⁹ The thirty AA batteries are a sample, so we use the formula for the sample standard deviation (with (N-1)).
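For interval data the quartiles are found by linear interpolation inside the class that contains the required position. The Python sketch below reproduces the quartiles, standard deviation and coefficient of quartile deviation of Example 2.5; the function name `grouped_quantile` and the variable names are illustrative assumptions, and equal class widths are assumed.

```python
import math

lower_limits = [360, 370, 380, 390, 400, 410, 420, 430]
width = 10                                   # all classes are 10 minutes wide
frequencies = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(frequencies)


def grouped_quantile(share):
    """Interpolated value below which the given share of the observations lies."""
    target = share * n
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("share must be between 0 and 1")


marks = [lower + width / 2 for lower in lower_limits]          # class marks c_i
mean = sum(c * f for c, f in zip(marks, frequencies)) / n
sum_sq = sum(f * (c - mean) ** 2 for c, f in zip(marks, frequencies))
sample_sd = math.sqrt(sum_sq / (n - 1))

q1, median, q3 = (grouped_quantile(s) for s in (0.25, 0.50, 0.75))
quartile_range = q3 - q1
vq_percent = (q3 - q1) / (q3 + q1) * 100

print(mean, round(sample_sd, 2))
print(q1, round(median, 2), q3, quartile_range, round(vq_percent, 2))
```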
2.6. SHAPE OF DISTRIBUTION
2.6.1. Symmetry or skewness
A frequency distribution may be symmetrical or asymmetrical.
Imagine constructing a histogram centred on a piece of paper and

folding the paper in half the long way. If the distribution is symmetrical,
the part of the histogram on the left side of the fold would be the mirror
image of the part on the right side of the fold.
If the distribution is asymmetrical, the two sides will not be mirror images of each other. The normal distribution, which will be elaborated later, is the typical example of a symmetric distribution. Asymmetric distributions are more commonly found in practice.
Table 2.1. Measure of skewness (for population and sample)
Measure of skewness (for population):
α3 = 0 ⇒ symmetry
α3 > 0 ⇒ positively skewed
α3 < 0 ⇒ negatively skewed
Measure of skewness (for sample):
α3 = 0 ⇒ symmetry
α3 > 0 ⇒ positively skewed
α3 < 0 ⇒ negatively skewed
When distribution is symmetrical, the arithmetic mean, median and

mode are equal.
Figure 2.1. A symmetrical frequency distribution
Source: Somun-Kapetanović R., Statistika u ekonomiji i menadžmentu, Ekonomski fakultet u Sarajevu, Sarajevo 2008., page 83
If a distribution is asymmetric it is either positively skewed or negatively

skewed.
A distribution is said to be positively skewed if the scores tend

to cluster toward the lower end of the scale (that is, the smaller
numbers) with increasingly fewer scores at the upper end of the
scale (that is, the larger numbers).
Figure 2.2. A positively skewed distribution frequency

A negatively skewed distribution is exactly the opposite. With a

negatively skewed distribution, most of the scores tend to occur
toward the upper end of the scale while increasingly fewer
scores occur toward the lower end.
Figure 2.3. A negatively skewed distribution frequency

2.6.2. Kurtosis
Another descriptive statistic that can be derived to describe a

distribution is called kurtosis. It refers to the relative concentration of
data in the centre, the upper and lower ends (tails) and the shoulders of
a distribution.
A distribution is platykurtic if it is flatter than the corresponding normal curve and leptokurtic if it is more peaked than the normal curve.
Table 2.2. Measure of kurtosis (for population and sample)
Measure of kurtosis (for population):
α4 = 3 ⇒ normal
α4 > 3 ⇒ leptokurtic
α4 < 3 ⇒ platykurtic
Measure of kurtosis (for sample):
α4 = 3 ⇒ normal
α4 > 3 ⇒ leptokurtic
α4 < 3 ⇒ platykurtic
The following graph presents the three types of kurtosis of a distribution.
Figure 2.4. Measure of kurtosis

A distribution is called unimodal if there is only one major “peak” in the

distribution of scores when represented as a histogram. A distribution is
bimodal if there are two major peaks. If there are more than two major
peaks, we call the distribution multimodal.
Example 2.6.
Over the last 2 years 50 injuries occurred in the company ICC, and the number of hours lost due to injury was:
Number of hours lost due to injury   Number of injuries

1 10
2 12
3 14
4 11
5 3
Total 50
Calculate and explain:
a) the average number of hours lost due to injury
b) standard deviation
c) measures of skewness and kurtosis.
Solution:
First we complete the worksheet for the analysed population:
xi   fi   xi · fi   xi - x̄   fi(xi - x̄)²   fi(xi - x̄)³   fi(xi - x̄)⁴
1    10   10        -1.7     28.9          -49.13        83.521
2    12   24        -0.7     5.88          -4.116        2.8812
3    14   42        0.3      1.26          0.378         0.1134
4    11   44        1.3      18.59         24.167        31.4171
5    3    15        2.3      15.87         36.501        83.9523
Σ    50   135                70.5          7.8           201.885
a) x̄ = 135 / 50 = 2.7. The average number of hours lost due to injury for the analysed population is 2.7.
b) σ = √(70.5 / 50) = 1.187²⁰. The average linear deviation from the average number of hours lost due to injury is 1.187 hours.
²⁰ This is the population for the two years, so we use the formula for the population standard deviation (with N).
c) Calculating and interpreting the measure of skewness.
Calculating and interpreting the measure of kurtosis.
Graphical presentation of the measures of skewness and kurtosis.
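Part c) can be sketched in Python from the worksheet totals. The sketch assumes the usual moment-based definitions for a population, α3 = μ3 / σ³ and α4 = μ4 / σ⁴, where μ3 and μ4 are the third and fourth central moments; these formulas and all names in the code are assumptions, chosen because they agree with the thresholds 0 and 3 in Tables 2.1 and 2.2.

```python
import math

hours = [1, 2, 3, 4, 5]          # modalities x_i
injuries = [10, 12, 14, 11, 3]   # absolute frequencies f_i
n = sum(injuries)

mean = sum(x * f for x, f in zip(hours, injuries)) / n


def central_moment(order):
    """Population central moment of the given order for the frequency distribution."""
    return sum(f * (x - mean) ** order for x, f in zip(hours, injuries)) / n


sd = math.sqrt(central_moment(2))
alpha_3 = central_moment(3) / sd ** 3   # assumed measure of skewness
alpha_4 = central_moment(4) / sd ** 4   # assumed measure of kurtosis

print(round(mean, 1), round(sd, 3), round(alpha_3, 3), round(alpha_4, 3))
```

Under these assumptions α3 is slightly above 0 and α4 is below 3, which would describe a mildly positively skewed and platykurtic distribution.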
Example 2.7.
Determine the arithmetic mean, standard deviation and measures of skewness and kurtosis for the variable amount of donations, for a sample of 40 donors:
Amount of donations   Number of donors

0 - 400 4
400 - 800 8
800 - 1200 14
1200 - 1600 8
1600 - 2000 6
Solution:
First we have to complete the worksheet for the given frequency distribution:
xi            fi   ci     ci · fi   fi(ci - x̄)²   fi(ci - x̄)³    fi(ci - x̄)⁴
0 - 400       4    200    800       2822400       -2370816000    1991485440000
400 - 800     8    600    4800      1548800       -681472000     299847680000
800 - 1200    14   1000   14000     22400         -896000        35840000
1200 - 1600   8    1400   11200     1036800       373248000      134369280000
1600 - 2000   6    1800   10800     3465600       2633856000     2001730560000
Σ             40          41600     8896000       -46080000      4427468800000
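Arithmetic mean: x̄ = 41,600 / 40 = 1,040. The average amount of donations in the sample is 1,040 KM.
Standard deviation: s = √(8,896,000 / 39) = 477.60. The average linear deviation from the average amount of the selected donations is 477.60 KM.
Measure of skewness
Measure of kurtosis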
2.7. MEASURE OF CONCENTRATION
The Lorenz curve is a graphical representation of the cumulative distribution function of a probability distribution; it is a graph showing the proportion of the distribution assumed by the bottom x% of the values.
It is often used to represent income distribution, where it shows, for the bottom x% of households, what percentage of the total income they have (y%).
Figure 2.5. The Lorenz curve
Each point on the Lorenz curve represents a statement such as “the bottom 20% of all households have 10% of the total income”. A perfectly equal income distribution would be one in which every person has the same income. In this case, the bottom N% of society would always have N% of the income. This can be depicted by the straight line y = x, called the line of perfect equality.
By contrast, a perfectly unequal distribution would be one in which one

person has all the income and everyone else has none. In that case, the
curve would be at y = 0 for all x < 100%, and y = 100% when x = 100%.
This curve is called the line of perfect inequality.
The Gini coefficient is determined by the area between the line of perfect equality and the observed Lorenz curve (the area of concentration). It is equal to the ratio of the area of concentration and the area of the triangle between the line of perfect equality and the line of perfect inequality. Since the area of this triangle is 1/2, the Gini coefficient equals two times the area of concentration.
The higher the Gini coefficient, the more unequal the distribution is.
There are two methods for calculating the Gini coefficient, the trapezoid method and the triangle method:
Example 2.8.
For company X, we look at the following distribution of wages:
Annual wages (in KM) Number of employees

[5,000 - 7,000[ 60
[7,000 - 8,000[ 80
[8,000 - 9,000[ 105
[9,000 - 11,000[ 110
[11,000 - 15,000[ 35
[15,000 - 20,000[ 10
Total 400
Calculate the Gini coefficient. Draw a conclusion. Construct the Lorenz curve.
Solution:
xi                  fi    ci      pi       CRFi     ci · fi   ci · fi / Σ ci · fi   Qi
[5,000 - 7,000[     60    6000    0.1500   0.1500   360000    0.100                 0.100
[7,000 - 8,000[     80    7500    0.2000   0.3500   600000    0.167                 0.268
[8,000 - 9,000[     105   8500    0.2625   0.6125   892500    0.249                 0.517
[9,000 - 11,000[    110   10000   0.2750   0.8875   1100000   0.307                 0.824
[11,000 - 15,000[   35    13000   0.0875   0.9750   455000    0.127                 0.951
[15,000 - 20,000[   10    17500   0.0250   1.0000   175000    0.049                 1.000
Total               400           1.0000            3582500   1.000
Gini coefficient – trapezoid method:
Gini coefficient – triangle method:
Since the Gini coefficient is closer to 0 than to 1, we say that the distribution is relatively equitable (concentration is low).
The constructed Lorenz curve is presented in the following graph:
Graphical presentation of the Lorenz curve.
From the graph we can draw the same conclusion as from the Gini coefficient (a relatively equitable distribution, i.e. low concentration).
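The calculation for Example 2.8 can be sketched in Python. The two formulas used here are assumptions about what the trapezoid and triangle methods refer to: the trapezoid form G = 1 - Σ pi (Qi + Qi-1) and the triangle form G = Σ (Fi Qi+1 - Fi+1 Qi), where Fi are the cumulative relative frequencies and Qi the cumulative relative aggregates. Both measure twice the area of concentration, so they should give the same value. All variable names are illustrative.

```python
from itertools import accumulate

lower = [5000, 7000, 8000, 9000, 11000, 15000]
upper = [7000, 8000, 9000, 11000, 15000, 20000]
employees = [60, 80, 105, 110, 35, 10]

marks = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]    # class marks c_i
aggregate = [c * f for c, f in zip(marks, employees)]      # c_i * f_i
total_employees = sum(employees)
total_wages = sum(aggregate)

p = [f / total_employees for f in employees]               # relative frequencies
q = [a / total_wages for a in aggregate]                   # relative aggregates
F = list(accumulate(p))                                    # x axis of the Lorenz curve
Q = list(accumulate(q))                                    # y axis of the Lorenz curve

# Trapezoid form (assumed): 1 minus the sum of p_i * (Q_i + Q_{i-1})
Q_previous = [0.0] + Q[:-1]
gini_trapezoid = 1 - sum(pi * (Qi + Qp) for pi, Qi, Qp in zip(p, Q, Q_previous))

# Triangle form (assumed): sum of F_i * Q_{i+1} - F_{i+1} * Q_i
gini_triangle = sum(F[i] * Q[i + 1] - F[i + 1] * Q[i] for i in range(len(F) - 1))

print(round(gini_trapezoid, 4), round(gini_triangle, 4))
```

Plotting the points (Fi, Qi), preceded by the point (0, 0), against the line y = x gives the Lorenz curve for these data.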
2.8. USING EXCEL TO OBTAIN DESCRIPTIVE STATISTICS
Overview example 2.1. Computing descriptive statistics using Excel
We have a database with variables related to the procedure of paying taxes in 181 countries²¹.
The data are given in an Excel sheet (A1-G363). The variables are:
• Payments (number of transactions) (B2-B363)
• Time (hours) (C2-C363)
• Total tax rate (% profit) (D2-D363).
These are quantitative variables, so we can apply the methodology of descriptive statistics to the series of 181 data for each variable to get several parameters which describe the given series.
The simplest and fastest way to get these parameters (xmin, xmax, average, deviation, mode, median, kurtosis and skewness) is to use the Excel function Tools – Data Analysis. If that option is not available we have to enable it:
1. Tools – Add-ins:
²¹ http://www.doingbusiness.org/CustomQuery/, data for 2008, accessed 15. 04. 2009.
2. We have to select Analysis ToolPak and Analysis ToolPak – VBA:
3. Click OK and the Data Analysis option will appear in Tools:
Now we can use the Data Analysis option:
We get a list of the analyses that we can perform. Currently we are interested in the option Descriptive statistics, so we choose it and click OK. In Input range we can select all the columns with several variables at once, grouped by columns ($B$1:$D$182). Since the selection includes the first cell with the variable name, we tick the option Labels in first row. Then we choose an empty cell or a new sheet where the results of the analysis will be saved and select which statistics we want to determine:
Summary statistics - xmin, xmax, average, deviation, mode, median, kurtosis and skewness, range, count...
Confidence level for mean – the margin of the confidence interval for the average at the given confidence level (for example 95%)
If we want to obtain quantiles we choose the Kth largest and Kth smallest options. The value K is a rank, that is, the position of an observation counted from the top or from the bottom of the ordered series, not a percentage. For example, in a series of 100 data the first and the 99th percentiles are obtained with K = 1 in both cases, the first and the third quartiles with K = 25 in both cases, and the first and ninth deciles with K = 10 in both cases. For a series of a different size K has to be scaled accordingly (for 181 data the quartiles correspond approximately to K = 45).
Click OK and the result is:
The interpretation of the descriptive statistics for one variable in this example, time (hours), is given as follows:
Average is 317.63 hours in a sample of 181 countries (count), so on average 317.63 hours are needed for the procedure of paying taxes.
Standard error of the average is based on the sample size and the calculated sample standard deviation.
Median is 256, so in 50% of countries 256 hours or less are needed for the procedure of paying taxes, while 50% of countries need more than 256 hours.
Mode is 270, so the most frequent value is 270 hours needed for the procedure of paying taxes.
Standard deviation indicates that the average linear deviation of the time needed for the procedure of paying taxes from the average time is 317.66 hours, so we can calculate the coefficient of variation:
V = 317.66 / 317.63 · 100 = 100.01%
The relative variability of the data around the average is about 100%. This information is meaningful only in comparison with another series.
Variance, defined as the average squared deviation of the data from the average, is 100,906.1, but we interpret it through the standard deviation.
Kurtosis is (19.96 + 3) = 22.96 (Excel reports kurtosis relative to the normal value of 3), which is more than 3, so we can conclude that this distribution is significantly more peaked than the normal curve.
Skewness is 3.77, which is more than 0, so we can conclude that this distribution is significantly right asymmetric in comparison with the normal curve.
Range, defined as the difference between the highest and the lowest value, is 2,600 h.
Minimum time for the procedure of paying taxes is 0 h.
Maximum time for the procedure of paying taxes is 2,600 h.
Sum of the data in the series is 57,491, but there is no logical interpretation for this information.
Third quartile is 453, so in 75% of countries 453 hours or less are needed for the procedure of paying taxes, while in 25% of countries more than 453 hours are needed.
First quartile is 105, so in 25% of countries 105 hours or less are needed for the procedure of paying taxes, while in 75% of countries more than 105 hours are needed.
The margin of the confidence interval for the average at the 95% confidence level is 46.59. The confidence interval for the average at the 95% confidence level is 317.63 ± 46.59 = [271.04, 364.22]. So, with a Type I error of 5%, we can conclude that the average time for the procedure of paying taxes lies within the interval [271.04, 364.22] hours.
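A similar summary can be sketched in Python. The data below are hypothetical (the original tax data are not reproduced here), and the function name `describe` is an assumption. The skewness and kurtosis lines use the sample-adjusted formulas that Excel documents for its SKEW and KURT functions, which is why the kurtosis is reported relative to 0 rather than 3.

```python
import math
import statistics


def describe(values):
    """Summary statistics similar to a descriptive statistics report (sketch)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)                       # sample standard deviation
    z3 = sum(((x - mean) / sd) ** 3 for x in values)
    z4 = sum(((x - mean) / sd) ** 4 for x in values)
    skewness = n / ((n - 1) * (n - 2)) * z3
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))   # excess kurtosis
    return {
        "mean": mean,
        "standard_error": sd / math.sqrt(n),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "standard_deviation": sd,
        "sample_variance": sd ** 2,
        "kurtosis": kurtosis,
        "skewness": skewness,
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


hypothetical_hours = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data for illustration
summary = describe(hypothetical_hours)
for name, value in summary.items():
    print(name, round(value, 4))
```

The margin of the confidence interval is left out of the sketch because it requires a quantile of the t distribution, which the standard library does not provide.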
To see these parameters visually we will construct histogram. We have

option in Data analysis:
Before we construct the histogram we have to define the intervals according to the minimum and maximum values and the number of intervals that we want to create. The maximum value is 2,600 and the minimum value is 0, so we define intervals with a width of 100: 0-100, 100-200, ..., 400-500, 500-600, ..., 2,500-2,600. The upper limits included in these intervals are: 99, 199, ..., 499, 599, ..., 2,600. We type these limits in one Excel column (I22:I47).
For Input range we select the column with the original data (C2:C182) and for Bin Range we select the cells where we typed the upper limits of the intervals (I22:I47). We choose where to save the result and tick the option Chart output:
The graph that we get has separated vertical bars, so we click on the graph and choose Chart options – Options. There we set the gap between the bars to 0:
Finally, the histogram is:
The conclusions about the shape of the distribution drawn from the histogram are the same as those inferred from the previously calculated parameters. The distribution is strongly positively (right) skewed and peaked, and it differs significantly from the normal curve.
Overview example 2.2. Computing descriptive statistics using Excel
To analyse the concentration of consumption based on the HBS 2008 database, we use the data on consumption per capita for 23,374 individuals from 7,071 households:
These are original gross data, so we will first construct appropriate

frequency distribution. We need to find minimal and maximal value for
consumption level in our sample:
Since we decided to use intervals 5,000 wide, the upper limits included in the intervals (bins) are: 4,999.99, 9,999.99, 14,999.99, …, 54,999.99. We type these limits in an empty column of the sheet that contains the original data:
We select the empty cells in the next column (E6:E16). In the function list ( fx) we choose Frequency:
With CTRL+SHIFT+ENTER we get the frequency distribution:
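The counting step can be sketched in Python as follows. The function name `frequency` and the small data set are hypothetical; the sketch assumes the same counting rule as above, where each value is counted in the first bin whose upper limit is greater than or equal to it, and values above the last limit are collected in one extra count.

```python
from bisect import bisect_left


def frequency(data, upper_limits):
    """Count the values per bin; upper_limits must be sorted in ascending order.

    Returns one count per upper limit plus a final count for values
    that are larger than the last limit.
    """
    counts = [0] * (len(upper_limits) + 1)
    for value in data:
        # index of the first upper limit that is >= value
        counts[bisect_left(upper_limits, value)] += 1
    return counts


upper_limits = [4999.99, 9999.99, 14999.99]                      # bins 5,000 wide
consumption = [1200.0, 4999.99, 5000.0, 7300.5, 12000.0, 20000.0]  # hypothetical data

print(frequency(consumption, upper_limits))
```

Dividing each count by the total number of observations gives the relative frequencies used in the following steps.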
Now we can start to construct Lorenz curve and to calculate Gini

coefficient. We need centers of intervals and relative frequencies, but
before that we have to form columns with lower and upper limits of
intervals:
First we will calculate centers of intervals:
With Copy-Paste option we will get column with centers of intervals:
Then we will calculate relative frequencies:
With Copy-Paste option we will get column with relative frequencies:
Afterwards, we will calculate relative cumulative frequencies. The first

relative cumulative frequency is the same as the first relative frequency
and all the other cumulative frequencies are obtained by adding
each frequency from a frequency distribution table to the sum of its
predecessors:
With Copy-Paste option we will get column with relative cumulative

frequencies:
Then we need the cumulant of the relative aggregate. First we calculate the aggregate (c · f) as the product of the centre of the interval and the absolute frequency of the given interval:
With Copy-Paste option we will get column for aggregate:
We will calculate relative aggregate as:
With Copy-Paste option we will get column for relative aggregate:
In the end we will find relative cumulative aggregate (Q):
With Copy-Paste option we will get column for cumulant of relative

aggregate:
To graph the Lorenz curve we take the relative cumulative frequencies for the x axis and the cumulant of the relative aggregate for the y axis. Before that we insert one point with the value 0 for both cumulants:
Now we can graph Lorenz curve:
For line of perfect equality we will take the same data for relative
cumulative frequencies for both axes.
For Lorenz curve we take:
Now with Add we will insert new series for line with perfect equality:
We choose Next and then the option to give titles appears:
Finally, the following graph is obtained:
White area is the area of concentration.
We will calculate the Gini coefficient, a quantitative measure of concentration, by using the following relation:
$$G=1-\sum_{i=1}^{k}p_i\,(Q_i+Q_{i-1})$$
With Copy-Paste option we will complete this column:
When we calculate (1-this sum) we will get Gini coefficient:
And finally, the value of Gini coefficient is:
Gini coefficient is 0.3378 so distribution of consumption is not perfectly

equal but the level of concentration is not very high.
2.9. SOLVED EXAMPLES
2.1. Given the following data set: 3, 4, 7, 18, 6, 10, 25.
a) Find the mean.
b) Find the median.
Solution:
Ordered data: 3, 4, 6, 7, 10, 18, 25
a) $\bar{x} = \frac{3 + 4 + 6 + 7 + 10 + 18 + 25}{7} = \frac{73}{7} = 10.43$
The average value of the 7 observed data is 10.43.
b) Note: Ungrouped data set, N=7 - odd number of data.
$Me = x_{(N+1)/2} = x_4 = 7$
50% of data have value 7 or less, while 50% of the data have value more
than 7.
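The two results can be cross-checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original solution.

```python
from statistics import mean, median

# Data set from example 2.1
data = [3, 4, 7, 18, 6, 10, 25]

ordered = sorted(data)   # ordering is needed to read off the median by hand
x_bar = mean(data)       # arithmetic mean: sum of values / number of values
me = median(data)        # middle value of the ordered series (N is odd)

print(ordered)
print(round(x_bar, 2))
print(me)
```

For an even number of data, as in the next example, `median` returns the average of the two middle values.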
2.2. Given the following data set: 2, 3, 7, 4, 3, 2, 8, 3.
a) Find the mean.
b) Find the median.
c) Find the mode.
Solution:
Ordered data: 2, 2, 3, 3, 3, 4, 7, 8
a) $\bar{x} = \frac{2 + 2 + 3 + 3 + 3 + 4 + 7 + 8}{8} = \frac{32}{8} = 4$
The average value of the 8 observed data is 4.
b) Note: Ungrouped data set, N=8 - even number of data.
$Me = \frac{x_4 + x_5}{2} = \frac{3 + 3}{2} = 3$
50% of data have value 3 or less, while 50% of the data have value more
than 3.
c) $Mo = 3$
The most frequent value in the observed data set is 3.
2.3. We monitored the appropriate chain index to observe changes of a stock
price over an 8-day period. The following data were recorded:
t            I    II   III  IV   V    VI   VII  VIII
It/t-1 (%)   105  125  123  127  145  98   178  197
Find the average value of the chain index.
Solution:
We use the geometric mean, as is usual in the economic analysis of
time series:
Geometric mean
$G = \sqrt[8]{105 \cdot 125 \cdot 123 \cdot 127 \cdot 145 \cdot 98 \cdot 178 \cdot 197} = 133.71$
The average chain index in the observed period is 133.71%.
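The geometric mean of the chain indices can be verified numerically. The sketch below is illustrative only; the function name `geometric_mean_percent` is hypothetical, and the indices are converted to growth factors before the 8th root is taken.

```python
import math

# Chain indices It/t-1 in percent, from example 2.3
chain_indices = [105, 125, 123, 127, 145, 98, 178, 197]


def geometric_mean_percent(indices):
    """Geometric mean of percentage indices, returned in percent."""
    factors = [i / 100 for i in indices]            # e.g. 105% -> 1.05
    return 100 * math.prod(factors) ** (1 / len(factors))


g = geometric_mean_percent(chain_indices)
print(f"{g:.2f}%")
```

Working with factors instead of raw percentages keeps the product small; the result is the same as taking the 8th root of the product of the percentages.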
2.4. We tested 40 workers from Sam factory to establish the average
time required for the execution of actions in the production process.
10 workers needed 20 minutes per product, 17 workers needed 25 minutes,
7 workers needed 30 minutes and 6 workers needed 35 minutes. Find the
average time required to execute the observed action.
Solution:
Note: Performance and the average time required to execute the observed
action are inversely related; hence we will calculate the harmonic mean.
xi (minutes)   fi
20             10
25             17
30             7
35             6
Σ              40
Harmonic mean
$H = \frac{N}{\sum f_i / x_i} = \frac{40}{\frac{10}{20} + \frac{17}{25} + \frac{7}{30} + \frac{6}{35}} = 25.24$
The average time required to perform the action is 25.24 minutes.
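As a cross-check, the weighted harmonic mean can be computed directly from the frequency table. This is a minimal sketch; `weighted_harmonic_mean` is an illustrative helper name, not something defined in the text.

```python
# Frequency table from example 2.4
times = [20, 25, 30, 35]    # xi: minutes needed per product
workers = [10, 17, 7, 6]    # fi: number of workers


def weighted_harmonic_mean(values, freqs):
    """N divided by the sum of fi / xi."""
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


h = weighted_harmonic_mean(times, workers)
print(round(h, 2))
```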
2.5. Daily earnings of 15 employees are (in KM):
80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 100, 100, 100, 110, 110.
a) Present the data with the polygon of cumulative absolute frequencies.
b) What is the average daily earning of employees in the group?
c) Determine the mode and median of the given frequency distribution and
interpret the results.
d) Determine and explain Q1 and Q3.
Solution:
Since we have a series with few distinct values, a non interval grouped
frequency distribution will be formed:
xi - Daily earnings   fi - Absolute frequency   xi . fi   Cumulative absolute frequency
80                    4                         320       4
90                    6                         540       10
100                   3                         300       13
110                   2                         220       15
Σ                     15                        1380
b) $\bar{x} = \frac{1380}{15} = 92$
The average daily earning of employees in the group is 92.00 KM.
c) $Mo = 90$
The most frequent daily earning for the 15 observed employees is 90 KM.
To find the median, we first use the formula for the location (position).
The position is $N/2 = 15/2 = 7.5$. Afterward, we look for the least value
of cumulative absolute frequency that is greater than or equal to the
calculated position (here 10). The corresponding modality represents the
median: $Me = 90$ KM.
Due to the large difference between the actual (0.6667 or 66.67%)
and theoretical (0.5 or 50%) cumulative frequency, we will be using
actual cumulative frequency in our interpretations. Therefore, 66.67%
of employees have daily earning 90 KM or less, while 33.33% of the
employees have daily earning more than 90 KM.
d) The position of the first quartile is $N/4 = 3.75$, so $Q_1 = 80$ KM.
In this case, there is no great difference between the actual
(0.2667 or 26.67%) and theoretical (0.25 or 25%) cumulative frequency,
so in our interpretations we use the theoretical cumulative frequency.
Therefore, 25% of employees have daily earning 80 KM or less, while 75%
of the employees have daily earning more than 80 KM.
The position of the third quartile is $3N/4 = 11.25$, so $Q_3 = 100$ KM.
Due to the large difference between the actual (0.8667 or 86.67%)
and theoretical (0.75 or 75%) cumulative frequency, we use actual
cumulative frequency in our interpretations. Therefore, 86.67% of
employees have daily earning 100 KM or less, while 13.33% of the
employees have daily earning more than 100 KM.
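The lookup rule used for the median and quartiles of a non interval grouped distribution can be sketched in Python. The helper `discrete_quantile` is a hypothetical name, and the sketch assumes the position is computed as p · N (N/2 for the median, N/4 and 3N/4 for the quartiles), which reproduces the results of this example.

```python
def discrete_quantile(values, freqs, p):
    """Return (quantile, actual cumulative relative frequency).

    Assumption: position = p * N; the quantile is the first modality whose
    cumulative absolute frequency is greater than or equal to that position.
    """
    n = sum(freqs)
    position = p * n
    cumulative = 0
    for x, f in zip(values, freqs):
        cumulative += f
        if cumulative >= position:
            return x, cumulative / n
    raise ValueError("p must lie between 0 and 1")


# Frequency table from example 2.5
earnings = [80, 90, 100, 110]
employees = [4, 6, 3, 2]

q1 = discrete_quantile(earnings, employees, 0.25)
me = discrete_quantile(earnings, employees, 0.50)
q3 = discrete_quantile(earnings, employees, 0.75)

for name, (value, actual) in [("Q1", q1), ("Me", me), ("Q3", q3)]:
    print(f"{name} = {value} KM, actual cumulative frequency = {actual:.2%}")
```

Returning the actual cumulative relative frequency alongside the quantile makes it easy to decide whether the theoretical or the actual percentage should be used in the interpretation.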
2.6. A teacher recorded the following quiz scores (out of possible 5
points) for 25 students:
2 3 4 2 4
3 3 1 3 4
4 5 5 1 2
2 1 4 0 3
3 3 2 2 1
a) Create a non interval grouped frequency distribution.
b) Graphically present the frequency distribution by using polygon of
cumulative absolute frequency.
c) What is the average quiz score for the 25 students?
d) Calculate and explain mode and median.
e) Calculate and explain first and third quartile.
Solution:
a)
Quiz scores   Number of students   Cumulative absolute frequency   xi . fi
0             1                    1                               0
1             4                    5                               4
2             6                    11                              12
3             7                    18                              21
4             5                    23                              20
5             2                    25                              10
Σ             25                                                   67
b)
c) $\bar{x} = \frac{67}{25} = 2.68$
The average quiz score for the 25 observed students is 2.68 points.
d) $Mo = 3$
The most frequent quiz score for the 25 observed students is 3 points.
$Me = 3$ (position $N/2 = 12.5$)
Due to the large difference between the actual (72%) and theoretical
(50%) cumulative frequency, we will be using actual cumulative frequency
in our interpretations. Therefore, 72% of students have quiz scores
3 points or less, while 28% of the students have more than 3 points.
e) $Q_1 = 2$ (position $N/4 = 6.25$)
Due to the large difference between the actual (44%) and theoretical
(25%) cumulative frequency, we will be using actual cumulative frequency
in our interpretations. Therefore, 44% of students have quiz scores
2 points or less, while 56% of the students have more than 2 points.
$Q_3 = 4$ (position $3N/4 = 18.75$)
Due to the large difference between the actual (92%) and theoretical
(75%) cumulative frequency, we use actual cumulative frequency in our
interpretations. Therefore, 92% of students have quiz scores 4 points
or less, while 8% of the students have more than 4 points.
2.7. The following values are the number of cars that households in one
rich part of the city possess:
Number of cars   Number of households
1                3
2                7
3                8
4                5
5                2
a) Graphically present the frequency distribution by using bar chart
(column).
b) Calculate and explain arithmetic mean.
c) Determine and interpret mode and median.
d) Determine and interpret D1 and D9.
Solution:
Number of cars   Number of households   xi . fi   Cumulative absolute frequency
1                3                      3         3
2                7                      14        10
3                8                      24        18
4                5                      20        23
5                2                      10        25
Σ                25                     71
b) $\bar{x} = \frac{71}{25} = 2.84$
The average number of cars for the 25 observed households is 2.84.
c) $Mo = 3$
The most frequent number of cars for the 25 observed households is 3.
$Me = 3$ (position $N/2 = 12.5$)
72% of households have 3 cars or less, while 28% of the households
have more than 3 cars.
d) Determining and interpreting the first decile.
$D_1 = 1$ (position $N/10 = 2.5$)
In this case, there is no great difference between the actual (12%) and
theoretical (10%) cumulative frequency, so in our interpretations we use
the theoretical cumulative frequency. Therefore, 10% of households
have 1 car or less, while 90% of the households have more than 1 car.
Determining and interpreting the ninth decile.
$D_9 = 4$ (position $9N/10 = 22.5$)
In this case, there is no great difference between the actual (92%) and
theoretical (90%) cumulative frequency, so in our interpretations we use
the theoretical cumulative frequency. Therefore, 90% of households
have 4 cars or less, while 10% of the households have more than 4 cars.
2.8. The numbers of new orders received by a company over the past 20
working days were recorded as follows:

b) Graphically present the frequency distribution by using pie chart.
c) Calculate and explain arithmetic mean.
d) Determine and interpret mode.
e) Determine and interpret quartiles.
Solution:
a)
Number of new orders   Number of working days   pi       Angle (°)   xi . fi   Cumulative absolute frequency
0                      2                        0.1000   36          0         2
1                      2                        0.1000   36          2         4
2                      4                        0.2000   72          8         8
3                      6                        0.3000   108         18        14
4                      4                        0.2000   72          16        18
5                      2                        0.1000   36          10        20
Σ                      20                       1        360         54
b)
c) $\bar{x} = \frac{54}{20} = 2.7$
The average number of new orders received by the company over the 20
observed days is 2.7.
d) $Mo = 3$
The most frequent number of new orders received by the company within
the 20 observed days is 3.
e) $Q_1 = 2$
40% of the time the company received 2 new orders or less, while 60% of
the time the company received more than 2 new orders.
$Q_3 = 4$
90% of the time the company received 4 new orders or less, while 10% of
the time the company received more than 4 new orders.
2.9. The speeds (in kph) of 20 cars on a highway were:
130 131 138 120 105
130 133 138 116 125
141 135 125 115 139
148 149 119 127 108
a) Create interval grouped statistical frequency distribution in a way
that the lower (left) boundary of the first interval is 100 and the lengths
(amplitudes) of intervals are 10.
b) Graphically present the frequency distribution by using histogram.
c) Calculate and explain arithmetic mean.
d) Calculate and explain mode.
e) Determine the mode graphically.
Solution:
a)
Speeds (in kph)   Number of cars   ci    ci . fi
[100 – 110[       2                105   210
[110 – 120[       3                115   345
[120 – 130[       4                125   500
[130 – 140[       8                135   1080
[140 – 150]       3                145   435
Σ                 20                     2570
b)
c) $\bar{x} = \frac{2570}{20} = 128.5$
The average speed of the 20 observed cars is 128.5 kph.
d) $Mo = 130 + \frac{8 - 4}{(8 - 4) + (8 - 3)} \cdot 10 = 134.44$
The most frequent speed of the 20 observed cars is 134.44 kph.
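The value 134.44 kph is consistent with the usual interpolation formula for the mode of an interval grouped distribution with equal interval lengths, Mo = L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) · a, where L is the lower boundary of the modal interval and a its length. The sketch below assumes that formula (it is not shown in this part of the text), and `grouped_mode` is a hypothetical helper name.

```python
def grouped_mode(lower_bounds, freqs, width):
    """Mode by interpolation inside the modal interval.

    Assumes all intervals have the same length `width`.
    """
    m = max(range(len(freqs)), key=lambda i: freqs[i])    # modal interval index
    prev_f = freqs[m - 1] if m > 0 else 0                 # frequency before it
    next_f = freqs[m + 1] if m < len(freqs) - 1 else 0    # frequency after it
    d1 = freqs[m] - prev_f
    d2 = freqs[m] - next_f
    return lower_bounds[m] + d1 / (d1 + d2) * width


# Frequency table from example 2.9
lower_bounds = [100, 110, 120, 130, 140]
cars = [2, 3, 4, 8, 3]

mo = grouped_mode(lower_bounds, cars, width=10)
print(round(mo, 2))
```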
e)
2.10. The following frequency distribution shows the distance (in km)
that 50 workers need to travel to work:
Distance (in km)   Number of workers
[0 – 5[            7
[5 – 10[           20
[10 – 15[          16
[15 – 20]          7
a) Graphically present the frequency distribution by using polygon of
cumulative absolute frequency.
b) Calculate and explain arithmetic mean.
c) Calculate and explain median.
d) Calculate and explain first and third quartile.
e) Determine median and quartiles graphically.
f) Graphically present the box plot.
Solution:
Distance (in km)   Number of workers   ci     Cumulative absolute frequency   ci . fi
[0 – 5[            7                   2.5    7                               17.5
[5 – 10[           20                  7.5    27                              150
[10 – 15[          16                  12.5   43                              200
[15 – 20]          7                   17.5   50                              122.5
Σ                  50                                                         490
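b) $\bar{x} = \frac{490}{50} = 9.8$
The average distance that the 50 observed workers need to travel is 9.8 km.
c) The position $N/2 = 25$ falls in the interval [5 – 10[. From the
interval, the median is determined using linear interpolation:
$Me = 5 + \frac{25 - 7}{20} \cdot 5 = 9.5$
50% of workers travel 9.5 km or less to get to the company, while 50% of
workers travel more than 9.5 km.
d) The position $N/4 = 12.5$ falls in the interval [5 – 10[. We determine
the first quartile from the interval, using linear interpolation:
$Q_1 = 5 + \frac{12.5 - 7}{20} \cdot 5 = 6.38$
25% of workers travel 6.38 km or less to the company, while 75% of workers
travel more than 6.38 km.
The position $3N/4 = 37.5$ falls in the interval [10 – 15[. We determine
the third quartile from the interval, using linear interpolation:
$Q_3 = 10 + \frac{37.5 - 27}{16} \cdot 5 = 13.28$
75% of workers travel 13.28 km or less to the company, while 25% of workers
travel more than 13.28 km.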
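The linear interpolation used for the median and the quartiles follows one pattern, which can be sketched as a single function. This is an illustrative sketch: `grouped_quantile` is a hypothetical name, and it assumes the position is p · N and that data are spread evenly inside each interval.

```python
def grouped_quantile(bounds, freqs, p):
    """Quantile of order p for interval grouped data, by linear interpolation.

    bounds: list of (lower, upper) interval boundaries
    freqs:  absolute frequencies of the intervals
    """
    n = sum(freqs)
    position = p * n
    cum_before = 0                       # cumulative frequency before interval
    for (lower, upper), f in zip(bounds, freqs):
        if f > 0 and cum_before + f >= position:
            return lower + (position - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("p must lie between 0 and 1")


# Frequency table from example 2.10
bounds = [(0, 5), (5, 10), (10, 15), (15, 20)]
workers = [7, 20, 16, 7]

me = grouped_quantile(bounds, workers, 0.50)
q1 = grouped_quantile(bounds, workers, 0.25)
q3 = grouped_quantile(bounds, workers, 0.75)
print(me, q1, q3)
```

The same function gives deciles and centiles by passing, for example, `p=0.1` or `p=0.01`.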
e) Graphical presentation of quartiles.
f) Graphical presentation of box plot.
2.11. A supervisor of a bank kept records of the time (in minutes) that
employees needed to complete a particular task. The data are given
in the next table:
11 29 16 24 15 23 10 21 18 20
15 22 13 24 16 28 21 14 26 27
25 20 19 23 17 23 18 22 19 29
a) Create interval grouped statistical frequency distribution in a way
that the lower boundary of the first interval is 10 and the amplitudes
of intervals are 5.
b) Graphically present the frequency distribution using pie chart.
c) Calculate the average time that employees needed to complete a
particular task.
d) Calculate and explain D1 and D9.
Solution:
a)
Time (in min)   Frequency   ci     pi       Angle (°)   ci . fi   Cumulative absolute frequency
[10 – 15[       4           12.5   0.1333   48          50        4
[15 – 20[       9           17.5   0.3000   108         157.5     13
[20 – 25[       11          22.5   0.3667   132         247.5     24
[25 – 30]       6           27.5   0.2000   72          165       30
Σ               30                 1        360         620
b)
c) $\bar{x} = \frac{620}{30} = 20.67$
The average time employees needed to complete a particular task for
the 30 observed employees is 20.67 minutes.
d) $D_1 = 10 + \frac{3 - 0}{4} \cdot 5 = 13.75$
10% of employees need 13.75 minutes or less to complete a particular
task, while 90% of employees need more than 13.75 minutes to complete
a particular task.
$D_9 = 25 + \frac{27 - 24}{6} \cdot 5 = 27.50$
90% of employees need 27.50 minutes or less to complete a particular
task, while 10% of employees need more than 27.50 minutes to complete
a particular task.
2.12. The table below shows the distribution of scores on a driving test
undertaken by 90 candidates:
Scores       Number of candidates
[0 – 20[     8
[20 – 40[    16
[40 – 60[    35
[60 – 80[    18
[80 – 100]   13
a) Draw a histogram.
b) Calculate the average score on the driving test.
c) Calculate and explain quartile.
d) Calculate and explain C1 and C99.
Solution:
Scores       Number of candidates   ci   ci . fi   Cumulative absolute frequency
[0 – 20[     8                      10   80        8
[20 – 40[    16                     30   480       24
[40 – 60[    35                     50   1750      59
[60 – 80[    18                     70   1260      77
[80 – 100]   13                     90   1170      90
Σ            90                          4740
a)
b) $\bar{x} = \frac{4740}{90} = 52.67$
The average score on the driving test undertaken by 90 candidates is
52.67 points.
c) $Q_1 = 20 + \frac{22.5 - 8}{16} \cdot 20 = 38.13$
25% of candidates on the driving test have 38.13 points or less, while
75% of candidates have more than 38.13 points.
d) Calculating and interpreting the first centile.
$C_1 = 0 + \frac{0.9 - 0}{8} \cdot 20 = 2.25$
1% of candidates on the driving test have 2.25 points or less, while 99%
of candidates have more than 2.25 points.
Calculating and interpreting the ninety-ninth centile.
2.13. Compute the arithmetic mean, standard deviation and coefficient
of variation of the following data:
5, 6, 6, 8, 7, 7, 7, 8, 9, 9, 8, 7, 7, 10, 9, 8
Solution:
Ordered data: 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10
xi   fi   xi . fi   xi - x̄   fi . (xi - x̄)²
5    1    5         -2.56     6.55
6    2    12        -1.56     4.87
7    5    35        -0.56     1.57
8    4    32        0.44      0.77
9    3    27        1.44      6.22
10   1    10        2.44      5.95
Σ    16   121                 25.93
$\bar{x} = \frac{121}{16} = 7.56$
The average value for the 16 observed data is 7.56.
$\sigma^2 = \frac{25.93}{16} = 1.62$
The average squared deviation of individual data from the arithmetic
mean is 1.62. (Note that the variance is expressed in a squared
measurement unit of the observed variable but is not interpreted that way.)
$\sigma = \sqrt{1.62} = 1.27$
The average linear deviation of individual data from the arithmetic
mean is 1.27.
$V = \frac{1.27}{7.56} \cdot 100 = 16.80\%$
Relative variation of data around the arithmetic mean is 16.80%.
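The whole chain (mean, variance, standard deviation, coefficient of variation) can be reproduced from the frequency table. This is a minimal sketch with illustrative variable names; it uses the population formulas (division by N). Because it keeps full precision, the coefficient of variation comes out at roughly 16.84%, marginally above the 16.80% obtained from rounded intermediate values.

```python
from math import sqrt

# Frequency table from example 2.13
values = [5, 6, 7, 8, 9, 10]
freqs = [1, 2, 5, 4, 3, 1]

n = sum(freqs)
mean = sum(x * f for x, f in zip(values, freqs)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
sd = sqrt(variance)
cv = sd / mean * 100        # coefficient of variation in percent

print(round(mean, 2), round(variance, 2), round(sd, 2), round(cv, 2))
```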
2.14. A company has produced the following table to describe the monthly
overhead expenses:
The monthly overhead expenses (in 000 KM)   Number of months
[1 – 3[                                     2
[3 – 5[                                     3
[5 – 7[                                     4
[7 – 9]                                     3
a) Graphically present the frequency distribution by using histogram.
b) Calculate the average monthly overhead expenses.
c) Compute and explain mode and median.
d) Compute and explain middle absolute distance.
Solution:
The monthly overhead expenses (in 000 KM)   Number of months   ci   ci . fi   Cumulative absolute frequency   ci - x̄   fi . |ci - x̄|
[1 – 3[                                     2                  2    4         2                               -3.33     6.66
[3 – 5[                                     3                  4    12        5                               -1.33     3.99
[5 – 7[                                     4                  6    24        9                               0.67      2.68
[7 – 9]                                     3                  8    24        12                              2.67      8.01
Σ                                           12                      64                                                  21.34
a)
b) $\bar{x} = \frac{64}{12} = 5.33$
The average monthly overhead expenses for the 12 observed months are
5330 KM.
c) $Mo = 5 + \frac{4 - 3}{(4 - 3) + (4 - 3)} \cdot 2 = 6$
The most frequent monthly overhead expenses for the 12 observed
months are 6000 KM.
$Me = 5 + \frac{6 - 5}{4} \cdot 2 = 5.5$
50% of the time the company had monthly overhead expenses of 5500 KM or
less, while 50% of the time the company had monthly overhead expenses of
more than 5500 KM.
d) $MAD = \frac{21.34}{12} = 1.78$
The average absolute deviation of the individual data from the average
monthly overhead expenses amounts to 1780 KM.
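The middle absolute distance (mean absolute deviation) can be checked from the interval centers. This is a small sketch with illustrative names; values are in thousands of KM, as in the table.

```python
# Interval centers and frequencies from example 2.14 (in 000 KM)
centers = [2, 4, 6, 8]
months = [2, 3, 4, 3]

n = sum(months)
mean = sum(c * f for c, f in zip(centers, months)) / n

# Mean absolute deviation: frequency-weighted average of |ci - mean|
mad = sum(f * abs(c - mean) for c, f in zip(centers, months)) / n

print(round(mean, 2), round(mad, 2))
```

Multiplying the result by 1000 gives the figure in KM quoted in the interpretation.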
2.15. Determine arithmetic mean, mode and standard deviation of the
data series given in the following table:
Number of sold cars   Number of working days
0                     3
1                     10
2                     8
3                     6
4                     3
Solution:
Number of sold cars   Number of working days   xi . fi   xi - x̄   fi . (xi - x̄)²
0                     3                        0         -1.87     10.49
1                     10                       10        -0.87     7.57
2                     8                        16        0.13      0.14
3                     6                        18        1.13      7.66
4                     3                        12        2.13      13.61
Σ                     30                       56                  39.47
$\bar{x} = \frac{56}{30} = 1.87$
The average number of sold cars for the 30 observed working days is
1.87.
$Mo = 1$
The most frequent number of sold cars for the 30 observed working
days is 1.
$\sigma = \sqrt{\frac{39.47}{30}} = 1.15$
The average linear deviation of the individual data from the average
number of sold cars amounts to 1.15.
2.16. We recorded the time cleaners needed to finish a certain job and for
40 cleaners obtained the following data (in minutes):
18 23 18 16 16 23 19 16 20 19
17 17 14 12 14 12 15 13 21 18
22 20 19 17 21 21 23 15 19 16
18 23 18 12 14 12 14 16 20 19
a) Create interval grouped statistical frequency distribution in a way
that the lower boundary of the first interval is 12 and the lengths of
intervals are 3.
b) Graphically present the frequency distribution by using histogram
and the polygon of absolute frequencies.
c) Calculate the average time needed to finish the job.
d) Calculate and explain median and quartiles.
e) Calculate and explain coefficient of variation and the quartile
deviation coefficient. What is the better representative of data: median
or mean?
Solution:
a)
Ri          fi   ci     ci . fi   Cumulative absolute frequency   fi . ci²
[12 – 15[   9    13.5   121.5     9                               1640.25
[15 – 18[   10   16.5   165       19                              2722.5
[18 – 21[   13   19.5   253.5     32                              4943.25
[21 – 24]   8    22.5   180       40                              4050
Σ           40          720                                       13356
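b)
c) $\bar{x} = \frac{720}{40} = 18$
The average time needed to finish the job for the 40 cleaners is 18.00 min.
d) $Me = 18 + \frac{20 - 19}{13} \cdot 3 = 18.23$
50% of cleaners finished the job in 18.23 min or less, while 50% of cleaners
needed more than 18.23 min.
$Q_1 = 15 + \frac{10 - 9}{10} \cdot 3 = 15.30$
$Q_3 = 18 + \frac{30 - 19}{13} \cdot 3 = 20.54$
e) $\sigma = \sqrt{\frac{13356}{40} - 18^2} = 3.15$, $V = \frac{3.15}{18} \cdot 100 = 17.50\%$
Relative variation of data around the mean is 17.50%.
$V_Q = \frac{Q_3 - Q_1}{Q_3 + Q_1} \cdot 100 = \frac{20.54 - 15.30}{20.54 + 15.30} \cdot 100 = 14.62\%$
Relative variation of data around the median is 14.62%.
The value of the relative indicator of variation which uses the median as
a series representative is lower than the value of the relative indicator
which uses the arithmetic mean as a series representative. Therefore, it is
better to use the median than the arithmetic mean as a data representative.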
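The comparison between the two relative indicators can be reproduced as follows. This is a sketch under stated assumptions: the quartiles are obtained by linear interpolation with position p · N, the quartile deviation coefficient is taken as (Q3 - Q1) / (Q3 + Q1), and the helper name `grouped_quantile` is hypothetical.

```python
from math import sqrt


def grouped_quantile(bounds, freqs, p):
    """Quantile of order p for interval grouped data (linear interpolation)."""
    position = p * sum(freqs)
    cum_before = 0
    for (lower, upper), f in zip(bounds, freqs):
        if f > 0 and cum_before + f >= position:
            return lower + (position - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("p must lie between 0 and 1")


# Frequency table from example 2.16
bounds = [(12, 15), (15, 18), (18, 21), (21, 24)]
freqs = [9, 10, 13, 8]
centers = [(lo + hi) / 2 for lo, hi in bounds]

n = sum(freqs)
mean = sum(c * f for c, f in zip(centers, freqs)) / n
sd = sqrt(sum(f * c * c for c, f in zip(centers, freqs)) / n - mean ** 2)
cv = sd / mean * 100                      # coefficient of variation (%)

q1 = grouped_quantile(bounds, freqs, 0.25)
median = grouped_quantile(bounds, freqs, 0.50)
q3 = grouped_quantile(bounds, freqs, 0.75)
vq = (q3 - q1) / (q3 + q1) * 100          # quartile deviation coefficient (%)

print(round(cv, 2), round(vq, 2))
print("median" if vq < cv else "mean", "is the better representative")
```

With unrounded intermediate values the coefficient of variation is about 17.48%, so small differences from the rounded figures in the text are expected.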
2.17. The number of flats built in certain municipalities was:
43 120 55 470 250 420
80 120 220 230 220 70
103 230 205 320 405 207
305 180 430 208 350 80
a) Create interval grouped statistical frequency distribution in a way
that the lower boundary of the first interval is 0 and the amplitudes
of intervals are 100.
b) Find the average number of flats built.
c) Calculate and explain first quartile.
d) What minimum number of flats should a municipality build to be
found in the upper 25% of municipalities by the number of flats built?
e) Calculate and explain the quartile range.
Solution:
a)
Ri            fi   ci    ci . fi   Cumulative absolute frequency
[0 – 100[     5    50    250       5
[100 – 200[   4    150   600       9
[200 – 300[   8    250   2000      17
[300 – 400[   3    350   1050      20
[400 – 500]   4    450   1800      24
Σ             24         5700
b) $\bar{x} = \frac{5700}{24} = 237.50$
The average number of flats built for the 24 municipalities is 237.50.
c) $Q_1 = 100 + \frac{6 - 5}{4} \cdot 100 = 125$
25% of municipalities have built 125 flats or less, while 75% of
municipalities have built more than 125 flats.
d) $Q_3 = 300 + \frac{18 - 17}{3} \cdot 100 = 333.33$
e) The interquartile range of a data set is the difference between the
third quartile and the first quartile. It is the range for the middle 50%
of the data. It overcomes the sensitivity to extreme data values.
$I_Q = Q_3 - Q_1 = 333.33 - 125 = 208.33$
When we remove 25% of the smallest and 25% of the highest data, the
new range of variation will be 208.33 flats.
2.18. The weekly amount spent on food in households (in KM):
The weekly amount spent on food (in KM)   Number of households
[100 – 300[                               13
[300 – 500[                               19
[500 – 700[                               30
[700 – 900[                               50
[900 – 1100]                              18
a) Graphically present the frequency distribution by using polygon of
cumulative absolute frequencies.
b) Calculate the average weekly amount spent on food in households.
c) Calculate and explain standard deviation.
d) Calculate and explain Z value for households that weekly spent 759 KM
on food.
e) Calculate and explain coefficient of variation.
Solution:
Amount spent (in KM)   Number of households   ci     ci . fi   fi . ci²
[100 – 300[            13                     200    2600      520000
[300 – 500[            19                     400    7600      3040000
[500 – 700[            30                     600    18000     10800000
[700 – 900[            50                     800    40000     32000000
[900 – 1100]           18                     1000   18000     18000000
Σ                      130                           86200     64360000
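b) $\bar{x} = \frac{86200}{130} = 663.08$
The average weekly amount spent for the 130 observed households is
663.08 KM.
c) $\sigma = \sqrt{\frac{64360000}{130} - 663.08^2} = 235.38$
The average linear deviation of the individual data from the average weekly
amount spent amounts to 235.38 KM.
d) For xi = 759, a z value is: $z = \frac{759 - 663.08}{235.38} = 0.41$
Households that spent 759 KM weekly are above the average spending by
0.41 standard deviations.
e) $V = \frac{235.38}{663.08} \cdot 100 = 35.50\%$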
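The standard deviation, z value and coefficient of variation for this distribution can be recomputed from the interval centers. This is a minimal sketch with illustrative names; it uses the shortcut σ² = Σ fi·ci² / N - x̄² suggested by the last column of the table.

```python
from math import sqrt

# Interval centers and frequencies from example 2.18
centers = [200, 400, 600, 800, 1000]
households = [13, 19, 30, 50, 18]

n = sum(households)
mean = sum(c * f for c, f in zip(centers, households)) / n
sum_fc2 = sum(f * c * c for c, f in zip(centers, households))   # 64360000
sd = sqrt(sum_fc2 / n - mean ** 2)

z = (759 - mean) / sd        # standardized value for a household spending 759 KM
cv = sd / mean * 100         # coefficient of variation in percent

print(round(mean, 2), round(sd, 2), round(z, 2), round(cv, 2))
```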
2.19. The number of working days lost by employees in the last month
is given in the following table:
Number of days   Number of employees
0                20
1                38
2                43
3                32
4                20
5                8
a) Graphically present the frequency distribution by using bar chart
(column).
b) Calculate and explain arithmetic mean.
c) Determine and interpret Q1 and Q3.
d) Compute and explain the quartile deviation coefficient.
Solution:
Number of days   Number of employees   xi . fi   Cumulative absolute frequency
0                20                    0         20
1                38                    38        58
2                43                    86        101
3                32                    96        133
4                20                    80        153
5                8                     40        161
Σ                161                   340
a)
b) $\bar{x} = \frac{340}{161} = 2.11$
The average number of working days lost by employees in the last month
is 2.11.
c) $Q_1 = 1$
36.02% of employees lost 1 working day or less, while 63.98% of
the employees lost more than 1 working day.
$Q_3 = 3$
82.61% of employees lost 3 working days or less, while 17.39% of the
employees lost more than 3 working days.
d) $V_Q = \frac{Q_3 - Q_1}{Q_3 + Q_1} \cdot 100 = \frac{3 - 1}{3 + 1} \cdot 100 = 50\%$
Relative variation of data around the median is 50%.
2.20. The number of traffic offences each day on a section of highway

were recorded for 90 days as follows:
Number of traffic offences Number of days

0 10
1 20
2 25
3 20
4 15
Note: Use relative frequencies.

relative frequency.
c) Determine and interpret mode and median.
d) What is better representative of data: median or arithmetic mean?
Solution:
Number of traffic offences   Number of days   pi       xi . pi   Cumulative pi   xi² . pi
0                            10               0.1111   0         0.1111          0
1                            20               0.2222   0.22      0.3333          0.22
2                            25               0.2778   0.56      0.6111          1.11
3                            20               0.2222   0.67      0.8333          2
4                            15               0.1667   0.67      1               2.67
Σ                            90               1        2.12                      6
a)
b)
The average number of traffic offences for the 90 observed days is 2.12.
c)
The most common number of traffic offences for the 90 observed days
is equal to 2.
2 traffic offences or less happened 61.11% of time, while more than 2

traffic offences happened 38.89% of time.
d)
The value of relative indicator of variation which uses median as a

series representative is lower than the value of relative indicator which
uses arithmetic mean as a series representative. Therefore, it is better to
use median than the arithmetic mean as a data representative.
2.21. The following frequency table summarize the ages of 43 workers

at the travel agency:
Ages of the workers (years) Number of workers

[15 – 21[ 5
[21 – 27[ 18
[27 – 33[ 13
[33 – 39[ 4
[39 – 45[ 2
[45 – 51] 1
Note: Use relative frequencies.

cumulative relative frequency.
c) Calculate and explain mode and median.
d) Calculate and explain coefficient of variation.
e) Calculate and explain the quartile deviation coefficient.
Solution:
Ages of the workers (years)   Number of workers   ci   pi       ci . pi   Cumulative pi   ci² . pi
[15 – 21[                     5                   18   0.1163   2.09      0.1163          37.68
[21 – 27[                     18                  24   0.4186   10.05     0.5349          241.11
[27 – 33[                     13                  30   0.3023   9.07      0.8372          272.07
[33 – 39[                     4                   36   0.0930   3.35      0.9302          120.53
[39 – 45[                     2                   42   0.0465   1.95      0.9767          82.03
[45 – 51]                     1                   48   0.0233   1.12      1               53.68
Σ                             43                       1        27.63                     807.10
a)
b)
The average age of the workers for the 43 observed workers is 27.63 years.
c)
Among the 43 observed workers, the most frequent worker's age is
25.33 years.
50% of workers are 26.50 years old or younger, while the remaining 50% of
workers are older than 26.50 years.
d)
e)
2.22. A company collected the ages of its middle managers with the data
shown below (in years):
65 35 46 40 25 28 58 39 41 41
38 53 36 49 43 52 60 54 59 30
a) Create statistical frequency distribution in a way that length of
intervals is 10 years (interval grouping).
b) Determine arithmetic mean.
c) Determine range of data.
d) Determine the quartile range.
e) Determine the deciles range.
f) Determine the centiles range.
Solution:
a)
Ages of the middle managers   Frequency   ci   ci . fi   Cumulative absolute frequency
[25 – 35[                     4           30   120       4
[35 – 45[                     7           40   280       11
[45 – 55[                     5           50   250       16
[55 – 65]                     4           60   240       20
Σ                             20               890
b)
The average age of the middle managers for the 20 observed managers
is 44.50 years.
c) The range of a data set is the difference between the largest and the
smallest data values. It is the simplest measure of variability. It is
very sensitive to the smallest and the largest data values.
d) The interquartile range of a data set is the difference between the
third quartile and the first quartile. It is the range for the middle 50%
of the data. It overcomes the sensitivity to extreme data values.
e) The interdecile range of a data set is the difference between the
ninth decile and the first decile. It is the range for the middle 80%
of the data.
Calculating the decile range.
f) The intercentile range of a data set is the difference between the
ninety-ninth centile and the first centile. It is the range for the middle
98% of the data.
Calculating the centile range.
2.23. The average weekly percentage returns on common stocks over 52

week period were as follows:
Returns (%) Number of weeks

[−10, −5[ 4
[−5, 0[ 7
[0, 5[ 15
[5, 10[ 18
[10, 15] 8

absolute frequency.
c) Calculate and explain standard deviation.
d) Calculate and explain coefficient of variation.
Solution:
Returns (%)   Number of weeks   ci     ci . fi   fi . ci²
[−10, −5[     4                 -7.5   -30       225
[−5, 0[       7                 -2.5   -17.5     43.75
[0, 5[        15                2.5    37.5      93.75
[5, 10[       18                7.5    135       1012.5
[10, 15]      8                 12.5   100       1250
Σ             52                       225       2625
a)
b)
The average return for the 52 observed weeks is 4.33%.
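c) $\sigma = \sqrt{\frac{2625}{52} - 4.33^2} = 5.63$
The average linear deviation of individual returns from the average
returns is equal to 5.63%.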
d)
2.24. Calculate and interpret coefficient of skewness (asymmetry) and

coefficient of kurtosis from the data given in the following table:
Number of new orders Number of working days

1 7
2 10
3 13
4 9
5 8
6 3
Solution:
Number of new orders   Number of working days   xi . fi   xi - x̄   fi . (xi - x̄)²   fi . (xi - x̄)³   fi . (xi - x̄)⁴
1                      7                        7         -2.20     33.88             -74.54            163.98
2                      10                       20        -1.20     14.40             -17.28            20.74
3                      13                       39        -0.20     0.52              -0.10             0.02
4                      9                        36        0.80      5.76              4.61              3.69
5                      8                        40        1.80      25.92             46.66             83.98
6                      3                        18        2.80      23.52             65.86             184.40
Σ                      50                       160                 104.00            25.21             456.81
Coefficient of skewness: positive (right) skewed (asymmetric) distribution.
Coefficient of kurtosis: wide (platykurtic, flat) distribution.
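The two conclusions can be checked numerically. The sketch below assumes the moment-based coefficients α3 = μ3 / σ³ and α4 = μ4 / σ⁴ (with 3 as the benchmark for kurtosis), which matches the columns of third and fourth powers in the table; the numeric values it prints are computed here and are not quoted from the text.

```python
# Frequency table from example 2.24
orders = [1, 2, 3, 4, 5, 6]
days = [7, 10, 13, 9, 8, 3]

n = sum(days)
mean = sum(x * f for x, f in zip(orders, days)) / n


def central_moment(k):
    """k-th central moment of the grouped distribution."""
    return sum(f * (x - mean) ** k for x, f in zip(orders, days)) / n


variance = central_moment(2)
alpha3 = central_moment(3) / variance ** 1.5    # coefficient of skewness
alpha4 = central_moment(4) / variance ** 2      # coefficient of kurtosis

print(round(alpha3, 2), "-> right skewed" if alpha3 > 0 else "-> left skewed")
print(round(alpha4, 2), "-> platykurtic" if alpha4 < 3 else "-> not platykurtic")
```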
2.25. Weekly earnings of employees in Star Company are given in the

following table:
Weekly earnings ($) Number of employees

350 10
450 14
550 16
650 24
750 6
850 2
a) Determine the average weekly earnings of employees in Star

Company.
b) Calculate and interpret the variance and standard deviation of
weekly earnings.
c) Calculate and interpret coefficient of asymmetry.
d) Calculate and interpret coefficient of kurtosis.
Solution:
Weekly earnings ($)   Number of employees   xi . fi   xi - x̄   fi . (xi - x̄)²   fi . (xi - x̄)³   fi . (xi - x̄)⁴
350                   10                    3500      -211.11   445674.32         -94086305.91      19862560039.88
450                   14                    6300      -111.11   172836.05         -19203813.45      2133735712.30
550                   16                    8800      -11.11    1974.91           -21941.29         243767.73
650                   24                    15600     88.89     189634.37         16856599.18       1498383101.54
750                   6                     4500      188.89    214076.59         40436927.58       7638131249.87
850                   2                     1700      288.89    166914.86         48220035.12       13930285945.45
Σ                     72                    40400               1191111.10        -7798498.77       45063339816.78
a) $\bar{x} = \frac{40400}{72} = 561.11$
The average weekly earning of employees in Star Company is equal to
561.11 $.
b) $\sigma^2 = \frac{1191111.10}{72} = 16543.21$
The average squared deviation of individual earnings from the average
earning in Star Company is equal to 16543.21.
$\sigma = \sqrt{16543.21} = 128.62$
The average linear deviation of individual earnings from the average
earning in Star Company is equal to 128.62 $.
c)
Slightly negative skewed (left asymmetric) frequency distribution.
d)
Wide (platykurtic, flat) frequency distribution.
2.26. The table below shows the distribution of the time students spend
on a particular homework assignment (sample of 30 students):
Time (in min) Number of students

[0 − 20[ 3
[20 − 40[ 18
[40 − 60[ 7
[60 − 80] 2
a) Graphically present the frequency distribution by using histogram.

b) Calculate and interpret arithmetic mean.
c) Calculate and interpret standard deviation.
d) Calculate and interpret coefficient of asymmetry.
Solution:
Time (in min)   Number of students   ci   ci . fi   fi . ci²   ci - x̄   fi . (ci - x̄)³
[0 − 20[        3                    10   30        300        -25.33    -48755.86
[20 − 40[       18                   30   540       16200      -5.33     -2725.55
[40 − 60[       7                    50   350       17500      14.67     22099.80
[60 − 80]       2                    70   140       9800       34.67     83347.30
Σ               30                        1060      43800                53965.69
a)
b)
The average time students from sample spend on a particular homework

assignment is equal to 35.33 min.
c)
The average linear deviation of individual time students spend on

assignment from the average time students spend on assignment is
equal to 14.55 min.
d)
Positive skewed (right asymmetric) distribution.
2.27. The following frequency distribution shows the number of hours

spent studying the course material during the week before the
final exam for 123 students:
The number of hours Frequency

[5 − 10[ 13
[10 − 15[ 30
[15 − 20[ 50
[20 − 25[ 20
[25 − 30] 10
a) Draw a polygon of absolute frequency.

b) Calculate and interpret mode.
c) Calculate and interpret standard deviation.
d) Calculate and interpret coefficient of kurtosis.
Solution:
The number of hours   Frequency   ci     ci . fi   ci - x̄   fi . (ci - x̄)²   fi . (ci - x̄)⁴
[5 − 10[              13          7.5    97.5      -9.35     1136.49           99355.02
[10 − 15[             30          12.5   375       -4.35     567.68            10741.83
[15 − 20[             50          17.5   875       0.65      21.12             8.93
[20 − 25[             20          22.5   450       5.65      638.45            20380.92
[25 − 30]             10          27.5   275       10.65     1134.23           128646.64
Σ                     123                2072.5              3497.97           259133.33
a)
b)
The most frequent hours spent studying the course material for the 123
observed students is 17.00 hours.
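c) $\sigma = \sqrt{\frac{3497.97}{123}} = 5.33$
The average linear deviation of individual values from the average
value (mean) is equal to 5.33 hours.
d)
Slightly wide (platykurtic, flat) distribution.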
2.28. A supervisor of a bank kept records of the time (in minutes) that
employees needed to complete a particular task. The data are
given in next table:
11 29 16 24 15 23 10 21 18 20
15 22 13 24 16 28 21 14 26 27
25 20 19 23 17 23 18 22 19 29
a) Create statistical frequency distribution in a way that length of

intervals is 5 minutes (interval grouping).
b) Draw a histogram.
c) Calculate and interpret coefficient of asymmetry and coefficient of
kurtosis.
Solution:
a)
Time (in min)   Frequency   ci     ci . fi   fi . ci²   ci - x̄   fi . (ci - x̄)³   fi . (ci - x̄)⁴
[10 − 15[       4           12.5   50        625.00     -8.17     -2181.35          17821.66
[15 − 20[       9           17.5   157.5     2756.25    -3.17     -286.70           908.82
[20 − 25[       11          22.5   247.5     5568.75    1.83      67.41             123.37
[25 − 30]       6           27.5   165       4537.50    6.83      1911.67           13056.72
Σ               30                 620       13487.50             -488.96           31910.57
b)
c)
Slightly negative skewed (left asymmetric) frequency distribution.
Wide (platykurtic, flat) frequency distribution.
2.29. Given the following distribution of annual salary of Sam Company
(in 000 KM):
Annual salary (in 000 KM)   Number of workers
[10 − 15[                   5
[15 − 20[                   15
[20 − 25[                   20
[25 − 30[                   30
[30 − 35]                   15
a) Sketch the Lorenz curve.
b) Calculate and interpret the Gini coefficient.
Solution:
Annual salary (in 000 KM)   Number of workers   ci     pi       Cumulative pi   ci . fi   Relative aggregate   Cumulative relative aggregate (Q)
[10 − 15[                   5                   12.5   0.0588   0.0588          62.5      0.0299               0.0299
[15 − 20[                   15                  17.5   0.1765   0.2353          262.5     0.1258               0.1557
[20 − 25[                   20                  22.5   0.2353   0.4706          450       0.2156               0.3713
[25 − 30[                   30                  27.5   0.3529   0.8235          825       0.3952               0.7665
[30 − 35]                   15                  32.5   0.1765   1               487.5     0.2335               1
Σ                           85                         1                        2087.5    1
a)
b)
Trapezoid method
As the Gini coefficient is closer to 0 than 1 we say that it is a relatively

equitable distribution (concentration is low).
Triangles method
The same comment as previously.
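The columns of the table and the trapezoid calculation can be reproduced in a few lines. This is a sketch under an assumption: it uses the common trapezoid form G = 1 - Σ pi · (Qi + Qi-1), with Q0 = 0, which is in line with the "(1 - this sum)" step described earlier; the value it prints is computed here and is not quoted from the text.

```python
from itertools import accumulate

# Interval centers and frequencies from example 2.29 (salary in 000 KM)
centers = [12.5, 17.5, 22.5, 27.5, 32.5]
workers = [5, 15, 20, 30, 15]

n = sum(workers)
aggregates = [c * f for c, f in zip(centers, workers)]    # ci . fi
total = sum(aggregates)

p = [f / n for f in workers]                  # relative frequencies
q = [a / total for a in aggregates]           # relative aggregate
F = list(accumulate(p))                       # x axis of the Lorenz curve
Q = list(accumulate(q))                       # y axis of the Lorenz curve

# Lorenz curve points, starting from the inserted point (0, 0)
lorenz_points = [(0.0, 0.0)] + list(zip(F, Q))

# Trapezoid method (assumed form): G = 1 - sum of pi * (Qi + Qi-1)
Q_prev = [0.0] + Q[:-1]
gini = 1 - sum(pi * (qi + qp) for pi, qi, qp in zip(p, Q, Q_prev))

print([round(v, 4) for v in Q])
print(round(gini, 4))
```

A value close to 0 agrees with the comment that the distribution is relatively equitable.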
2.30. A survey is made on a sample of 20 students attending the third year
of the Faculty of Economics who passed the Econometrics exam. The
grades of students are given in the following table:
Calculating descriptive statistics on the basis of a sample.
Grades   Number of students
6        4
7        4
8        7
9        3
10       2
Calculate and explain measures of central tendency, measures of
dispersion and measures of asymmetry and kurtosis.
Solution:
xi   fi   xi . fi   CAF   xi - x̄   fi . (xi - x̄)²   fi . |xi - x̄|   fi . (xi - x̄)³   fi . (xi - x̄)⁴
6    4    24        4     -1.75     12.25             7.00             -21.44            37.52
7    4    28        8     -0.75     2.25              3.00             -1.69             1.27
8    7    56        15    0.25      0.44              1.75             0.11              0.03
9    3    27        18    1.25      4.69              3.75             5.86              7.32
10   2    20        20    2.25      10.13             4.50             22.78             51.26
Σ    20   155                       29.76             20.00            5.63              97.39
Measures of central tendency:
Mean: $\bar{x} = \frac{155}{20} = 7.75$
Average grade of students that passed the exam of the course Econometrics
in the analyzed sample is 7.75.
Median: $Me = 8$
50% of students got grade 8 or less, while 50% of students got a grade
higher than 8.
Mode: $Mo = 8$
The most frequent grade for the 20 observed students is 8.
Measures of dispersion:
The range of data: $10 - 6 = 4$
Range of variation is 4 grades.
The standard deviation [22]: $s = \sqrt{\frac{29.76}{19}} = 1.25$
The average linear deviation of individual grades from the average
grade in the analyzed sample is 1.25 grades.
The middle absolute distance: $\frac{20.00}{19} = 1.05$
The average absolute deviation of individual grades from the average
grade in the analyzed sample is 1.05 grades.
Coefficient of variation:
Z value: For xi = 7, a z value is: $z = \frac{7 - 7.75}{1.25} = -0.6$
Students with grade 7 are below average by 0.6 standard deviations.
[22] There are 20 students in the sample, so we use the formula for the
standard deviation from a sample (with (n-1)).
Measures of asymmetry and kurtosis:
Slightly positive skewed (right asymmetric)
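Because the data come from a sample, the standard deviation uses n - 1 in the denominator. The sketch below illustrates this with `statistics.stdev`, which applies the n - 1 formula; expanding the frequency table into individual observations is a convenience chosen for this sketch and is not how the text presents the calculation.

```python
from statistics import mean, stdev

# Frequency table from example 2.30
grades = [6, 7, 8, 9, 10]
students = [4, 4, 7, 3, 2]

# Expand the table into the 20 individual grades
sample = [g for g, f in zip(grades, students) for _ in range(f)]

x_bar = mean(sample)
s = stdev(sample)            # sample standard deviation, divides by (n - 1)
z = (7 - x_bar) / s          # z value for a student with grade 7

print(x_bar, round(s, 2), round(z, 1))
```

`statistics.pstdev` would give the population version (division by n) used in the earlier examples.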
2.31. A survey on workers' age is conducted on a sample of 25 workers
of Melly Company. The ages of workers are given in the following
table:
Calculating descriptive statistics on the basis of a sample.
Ages        Number of workers
[15 − 25[   5
[25 − 35[   7
[35 − 45[   8
[45 − 55[   3
[55 − 65]   2
Calculate and explain measures of central tendency, measures of
dispersion and measures of asymmetry and kurtosis.
Solution:
xi          fi   CAF   ci   ci . fi   ci - x̄   fi . (ci - x̄)²   fi . |ci - x̄|   fi . (ci - x̄)³   fi . (ci - x̄)⁴
[15 − 25[   5    5     20   100       -16       1280              80               -20480            327680
[25 − 35[   7    12    30   210       -6        252               42               -1512             9072
[35 − 45[   8    20    40   320       4         128               32               512               2048
[45 − 55[   3    23    50   150       14        588               42               8232              115248
[55 − 65]   2    25    60   120       24        1152              48               27648             663552
Σ           25              900                 3400              244              14400             1117600
Measures of central tendency:
Mean:
The average age of the Melly Company’s workers is 36 years.
Median:
50% of workers of the Melly Company are 35.63 years old or younger,
while the remaining 50% of workers are older than 35.63 years.
Mode:
The most frequent age of the workers of the Melly Company is 36.67
years.
Quartile 1:
Quartile 3:
Measures of dispersion:
The range data:
Range of variation is 40 years.
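The standard deviation [23]: $s = \sqrt{\frac{3400}{24}} = 11.90$
The average linear deviation of individual ages from the average age in
the analyzed sample is 11.90 years.
[23] There are 25 workers in the sample, so we use the formula for the
standard deviation from a sample (with (n-1)).
The middle absolute distance: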
years in analyzed sample is 11.90 years.
Coefficient of variation:
Z value: For xi = 40, a z value is:
40-year-old workers are 0.34 standard deviations above average.
The quartile range:
When we remove 25% of the smallest and 25% of the highest data, the
new range of variation will be 16.65 years.
The quartile deviation coefficient:
Measures of asymmetry and kurtosis:
Slightly positive skewed (right asymmetric)
2.10. SELF STUDY EXAMPLES
2.32. A variable that can only take certain values (whole numbers) is
referred to as a:
a) continuous variable.
b) discrete variable.
c) constant.
d) statistical variable.
Answer: b)
2.33. What level of measurement would be involved in recording a

person’s social security number?
a) nominal level
b) ordinal level
c) interval level
d) ratio level
Answer: a)
2.34. You measure the width (in inches) of a number of fabric samples.
This would be an example of measurement at the:
a) nominal level.
b) ordinal level.
c) interval level.
d) ratio level.
Answer: d)
2.35. What is a frequency distribution? Create a frequency distribution for
the following set of data:
Data Set - High Temperatures for 30 Days

50 45 49 50 43
49 50 49 45 49
47 47 44 51 51
44 47 46 50 44
51 49 43 43 49
45 46 45 51 46
Give interpretation for those records.
Answer:
Temperature Frequency
51 4
50 4
49 6
48 0
47 3
46 3
45 4
44 3
43 3
Total 30
2.36. a) How do you define different types of frequencies?

b) Apply that to the previous example.
Answer:
Frequency Distribution for High Temperatures

Temperature   Frequency   Cumulative frequency   Percentage   Cumulative percentage
51            4            4                     13.3          13.3
50            4            8                     13.3          26.7
49            6           14                     20.0          46.7
48            0           14                      0.0          46.7
47            3           17                     10.0          56.7
46            3           20                     10.0          66.7
45            4           24                     13.3          80.0
44            3           27                     10.0          90.0
43            3           30                     10.0         100.0
Total         30                                 100.0
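The table above can be reproduced programmatically. The sketch below is an illustration in Python, not part of the original answer; it uses `collections.Counter` to count each temperature and then accumulates frequencies from the highest value down, as in the table. The variable names are illustrative.

```python
from collections import Counter

# High temperatures for 30 days (data set from example 2.35).
temps = [
    50, 45, 49, 50, 43,
    49, 50, 49, 45, 49,
    47, 47, 44, 51, 51,
    44, 47, 46, 50, 44,
    51, 49, 43, 43, 49,
    45, 46, 45, 51, 46,
]

counts = Counter(temps)
n = len(temps)

rows = []
cumulative = 0
# List every value from the maximum down to the minimum, so that
# values with zero frequency (here 48) also appear in the table.
for value in range(max(temps), min(temps) - 1, -1):
    f = counts.get(value, 0)
    cumulative += f
    rows.append(
        (value, f, cumulative, round(100 * f / n, 1), round(100 * cumulative / n, 1))
    )

for row in rows:
    print(*row)
```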
2.37. The weights of 30 students were measured and the following data were
recorded:
59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7, 61.6, 56.3, 61.9, 68.7, 60.4,
58.9, 59.0, 61.2, 62.1, 61.4, 58.4, 60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2
a) Is the variable discrete or continuous? Explain.

b) Are there some outliers?
c) Make frequency distribution and calculate cumulative frequency and
cumulative percentage. Explain.
d) What is the appropriate graph in this case? Create that graph.
Answer:
a) The variable is continuous, since a student's weight can take any value
within a certain interval and it is obtained by measurement.
b) There are two outliers: 56.3 and 68.7.
c) Frequency Distribution for Students' Weights (After Excluding Outliers)

Weight (xi)   fi   Cumulative frequency   Percentage   Cumulative percentage
[58 - 59)      3    3                     10.71         10.71
[59 - 60)      5    8                     17.86         28.57
[60 - 61)      7   15                     25.00         53.57
[61 - 62)      9   24                     32.14         85.71
[62 - 63]      4   28                     14.29        100.00
Total         28                         100.00
d) The appropriate graph is a histogram.
2.38. A frequency polygon and histogram would be examples of what

kind of data presentation?
Answer: A frequency polygon and histogram are examples of graphical

representation of data.
2.39. Which of the following types of graphs can be represented by
a “curve”?
a) bar graphs
b) histogram
c) pie chart
d) polygon of frequency.
Answer: d)
2.40. Given the following data set: 13, 15, 12, 13, 9, 13.
a) Find the mean.
b) Find the median.
c) Find the mode.
Answer: a) 12.5 b) 13 c) 13
2.41. Given the following data set: 11, 9, 10, 13, 11, 12, 13, 14, 11, 15, 9.
a) Find the mean.
b) Find the median.
c) Find the mode.
Answer: a) 11.64 b) 11 c) 11
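Answers to exercises of this kind are easy to verify. The snippet below is a small illustrative check using Python's standard `statistics` module; it is not part of the original text.

```python
import statistics

data_240 = [13, 15, 12, 13, 9, 13]
data_241 = [11, 9, 10, 13, 11, 12, 13, 14, 11, 15, 9]

for name, data in (("2.40", data_240), ("2.41", data_241)):
    mean = statistics.mean(data)      # arithmetic mean
    median = statistics.median(data)  # middle value of the sorted data
    mode = statistics.mode(data)      # most frequent value
    print(name, round(mean, 2), median, mode)
```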
2.42. The number of monthly traffic offences on a section of highway

was recorded for 12 months:
Number of traffic offences Number of months

10 2
11 3
12 4
13 2
14 1

absolute frequency and polygon of cumulative absolute frequencies.
c) Determine and interpret mode.
d) Determine and interpret the median and quartiles.
e) Determine and interpret the eighth decile.
Answer: b) 11.75 c) 12 d) 12; 11; 12 e) 13
2.43. Compute the arithmetic mean, mode and quartiles of the following data:
8 10 8 7 8 9 7 6 7 8 5 6
9 6 8 6 5 10 8 7 9 5 9 9
7 7 9 8 7 9 8 5 7 7 10 7
Answer: (7.53, 7, 7, 7, 9)
2.44. The following frequency table summarizes the ages of 195 visitors
at the local museum:
Ages of the visitors Number of visitors

[1 - 11[ 15
[11 - 21[ 23
[21 - 31[ 30
[31 - 41[ 35
[41 - 51[ 41
[51 - 61[ 38
[61 - 71] 13

absolute frequency.
b) Calculate average ages of the visitors.
c) Calculate and explain the mode and median.
d) Calculate and explain quartiles.
Answer: b) 37.79 c) 47.67; 39.43 d) 24.58; 51.59
2.45. The data about the number of persons that are temporarily
employed abroad, according to age, are given in the table below:
Age (years) Number of persons

[15 - 25[ 20083
[25 - 35[ 41249
[35 - 45[ 30499
[45 - 55[ 10273
[55 - 65] 2706
a) Determine the average age of persons temporarily employed abroad.

b) Calculate and graphically determine the most common age of person
temporarily employed abroad. Interpret the results.
c) Calculate and graphically determine median and quartiles. Interpret
the results.
Answer: a) 33.73 b) 31.63 c) 32.84; 26.48; 40.66
2.46. Suppose that you want to drive 27 km in your car. You will not
drive at the same speed all the time:
100 km/h for the first 5 km

110 km/h for the second 8 km
90 km/h for the third 10 km
120 km/h for the fourth 4 km.
What is your average speed?
Answer: 101.1
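The average speed here is the total distance divided by the total time, which is a weighted harmonic mean of the speeds with the distances as weights. The following Python lines are an illustrative check of the answer, not part of the original text.

```python
# Distances (km) and speeds (km/h) for the four legs of the trip.
distances = [5, 8, 10, 4]
speeds = [100, 110, 90, 120]

total_distance = sum(distances)
# Time spent on each leg is distance / speed (in hours).
total_time = sum(d / v for d, v in zip(distances, speeds))

# Weighted harmonic mean of the speeds.
average_speed = total_distance / total_time
print(round(average_speed, 1))
```

A simple arithmetic mean of the four speeds would give a different result because the legs have different lengths.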
2.47. A teacher recorded the following quiz scores (out of possible 5

points) for 30 students:
2 1 4 4 1 4 1 2 3 4
3 3 1 3 3 2 0 3 3 4
5 5 5 5 4 3 2 0 2 1

b) Graphically present the frequency distribution by using polygon of
absolute frequency.
c) What is the average quiz score for the 30 students?

d) Calculate and explain range of data.
e) Calculate and explain coefficient of variation.
f) Calculate and explain the quartile deviation coefficient.
Answer: c) 2.77 d) 5 e) 52.53% f) 33.33%
2.48. Consider the following frequency distribution for 35 companies:
Amount of annual revenue Number of companies

[0 - 3[ 15
[3 - 6[ 9
[6 - 9[ 6
[9 - 12] 5

absolute frequency.
c) Calculate and explain mode.
d) Calculate and explain range of data.
e) Calculate and explain the quartile deviation coefficient.
Answer: b) 4.59 c) 2.14 d) 12 e) 60.56%
2.49. Data about the level of capacity utilization in 23 factories are given
in the table below:
Level of capacity utilization Number of factories

[40 - 50[ 1
[50 - 60[ 3
[60 - 70[ 4
[70 - 80[ 5
[80 - 90[ 7
[90 - 100] 3
Calculate:
a) The average level of capacity utilization.
b) The most common level of capacity utilization.
c) The level of capacity utilization that split statistical series in two

parts with the same number of observations.
d) Coefficient of variation.
Answer: a) 75.00 b) 83.33 c) 77.00 d) 18.44%
2.50. Monthly earnings of employees in Melly Company are given in

the following table:
Monthly earnings (KM) Number of employees

1350 9
1450 14
1550 26
1650 22
1750 8
1850 3
a) Determine the average monthly earnings of employees in Melly

Company.
b) Calculate and interpret the variance and standard deviation of
monthly earnings.
c) Compute and explain middle absolute distance.
d) Calculate and explain the quartile range.
Answer: a) 1568.29 b) 15640.99; 125.06 c) 99.91 d) 200
2.51. Data about the age of cell phone users are given in the following
table:
Age of users Number of users

[10 - 20[ 9
[20 - 30[ 35
[30 - 40[ 25
[40 - 50[ 18
[50 - 60[ 10
[60 - 70[ 5
[70 - 80] 3
a) Draw histogram of absolute frequencies and polygon of cumulative

absolute frequencies.
b) Calculate the average age of cell phone users.
c) Calculate the upper age boundary for 50% of the youngest users.
d) Calculate the most common age of users in the series of data.
e) Calculate average linear deviation about arithmetic mean.
Answer: b) 36.14 c) 33.40 d) 27.22 e) 14.50
2.52. The following values are the numbers of cars that households in
a wealthy part of the city possess:
Number of cars Number of households

1 2
2 5
3 7
4 8
5 3
a) Calculate and interpret coefficient of asymmetry.

b) Calculate and interpret coefficient of kurtosis.
Answer: a) -0.232 b) 2.246
2.53. A company has produced the following table to describe the

monthly overhead expenses:
The monthly overhead expenses (in 000 KM)   Number of months
[1 - 3[ 5
[3 - 5[ 10
[5 - 7[ 8
[7 - 9] 6
a) Calculate and interpret coefficient of asymmetry.

b) Calculate and interpret coefficient of kurtosis.
Answer: a) 0.055 b) 1.921
2.54. Given the following distribution of monthly pay of 45 employees

in Melly Company (in 00 KM):
Monthly pay (in 00 KM) Number of employees

[10 - 15[ 7
[15 - 20[ 11
[20 - 25[ 15
[25 - 30[ 9
[30 - 35] 3
a) Sketch the Lorenz curve.
b) Calculate and interpret the Gini coefficient.
Answer: b) 0.1471
2.55. There are data for Expense ratio in 200 funds.
Ordinal numeral Expense ratio

1 0.77
2 1.77
3 0.67
4 1.00
5 1.00
6 1.00
7 0.93
8 0.85
9 1.00
10 0.87
11 1.03
12 0.75
13 0.98
14 0.89
15 0.93
16 0.71
17 0.96
18 1.15
19 0.95
20 1.41
21 0.95
22 1.88
23 0.51
24 1.03
25 1.26
26 1.31
27 1.14
28 0.87
29 0.84
30 0.81
31 0.93
32 0.88
33 0.84
34 0.74
35 0.63
36 0.77
37 1.38
38 1.42
39 0.71
40 1.30
41 0.67
42 0.88
43 0.94
44 1.14
45 1.95
46 0.85
47 1.81
48 2.06
49 1.28
50 1.59
51 0.87
52 0.84
53 0.84
54 1.00
55 0.96
56 1.03
57 1.22
58 0.94
59 0.62
60 1.11
61 1.49
62 0.89
63 0.49
64 0.88
65 1.02
66 1.99
67 0.71
68 0.11
69 1.20
70 0.91
71 0.73
72 0.85
73 1.06
74 0.87
75 0.22
76 0.40
77 0.48
78 0.63
79 0.22
80 0.31
81 1.11
82 1.36
83 1.04
84 1.13
85 0.72
86 1.03
87 2.11
88 1.96
89 1.97
90 2.13
91 0.99
92 1.00
93 0.95
94 0.99
95 1.36
96 1.13
97 0.65
98 0.99
99 0.77
100 1.19
101 1.34
102 1.25
103 1.06
104 2.06
105 1.20
106 0.85
107 0.85
108 0.89
109 0.94
110 0.52
111 1.04
112 0.88
113 0.90
114 0.86
115 1.57
116 0.79
117 0.64
118 1.40
119 1.00
120 1.29
121 0.84
122 0.85
123 1.11
124 1.74
125 0.80
126 1.09
127 1.26
128 1.37
129 0.61
130 0.83
131 0.99
132 1.25
133 1.06
134 1.06
135 1.90
136 1.95
137 0.85
138 0.81
139 0.99
140 0.89
141 0.89
142 0.89
143 0.88
144 1.51
145 1.05
146 1.02
147 1.07
148 1.14
149 0.95
150 1.00
151 0.88
152 0.85
153 1.04
154 0.99
155 0.93
156 0.89
157 0.71
158 0.77
159 0.44
160 1.44
161 0.97
162 0.96
163 1.32
164 1.67
165 0.83
166 1.26
167 0.97
168 1.20
169 0.95
170 0.95
171 0.78
172 1.12
173 0.54
174 0.88
175 1.15
176 1.54
177 1.16
178 0.94
179 1.18
180 0.84
181 0.94
182 0.67
183 0.63
184 1.06
185 0.91
186 1.36
187 1.22
188 0.80
189 0.96
190 0.56
191 0.93
192 1.08
193 0.83
194 2.07
195 0.93
196 0.98
197 0.79
198 1.35
199 0.78
200 1.10
a) Using 20 random numbers, select a simple random sample.
b) For the selected sample:

Present the expense ratio data as a frequency distribution of
grouped data.
Create a histogram and an ogive.
Calculate the average, median, mode and standard deviation from
the sample frequency distribution.
Calculate and explain the coefficient of variation.
c) For the given population:

Present the expense ratio data as a frequency distribution of grouped
data.
Create a histogram and an ogive.
Calculate the average, median, mode and standard deviation from the
population frequency distribution.
Calculate and explain the quartile range and the quartile deviation
coefficient.
For an expense ratio of 0.93, calculate and explain the z value.
Answer:
Ordered random sample of 20 expense ratios:
76 0.4 147 1.07

158 0.77 18 1.15
165 0.83 18 1.15
130 0.83 166 1.26
154 0.99 26 1.31
139 0.99 37 1.38
131 0.99 20 1.41
6 1 144 1.51
6 1 136 1.95
65 1.02 66 1.99
Sample
Expense ratio (xi) fi
[0 - 0.5) 1
[0.5 - 1.0) 6
[1.0 - 1.5) 10
[1.5 - 2.0] 3
Total 20
Sample average: 1.13 → Average expense ratio in the sample is 1.13.

Sample median: 1.15 → 50% of analyzed ratios are 1.15 or less.
Sample mode: 1 → The most frequent expense ratio in the sample
is 1.
Sample standard deviation: 1.18 → Average linear distance from

average expense ratio is 1.18.
Sample coefficient of variation: 104.8% → Relative variation of data around
sample average is 104.8%.
Population
Expense ratio (xi) fi
[0 - 0.25) 3
[0.25 - 0.50) 5
[0.50 - 0.75) 21
[0.75 - 1.00) 83
[1.00 - 1.25) 46
[1.25 - 1.50) 22
[1.50 - 1.75) 6
[1.75 - 2.00) 9
[2.00 - 2.25] 5
Total 200
Population average: 1.04 → Average expense ratio in the

population is 1.04.
Population median: 0.96 → 50% of analyzed ratios in the
population are 0.96 or less.
Population mode: 0.85 → The most frequent expense
ratio in the population is 0.85.
Population standard deviation: 1.22 → Average linear distance from
average expense ratio is 1.22.
Q1: 0.81 → 25% of expense ratios are 0.81 or less.

Q3: 1.21 → 75% of expense ratios are 1.21 or less.
Quartile range: 0.40 → When we remove the top 25% and the bottom
25% of data, new range of variation of expense
ratio is 0.40.
Quartile deviation coefficient: 20.0% → Relative variation of data
around median is 20%.
Z - value for expense ratio of 0.93: - 0.0901 → Expense ratio of 0.93 is

below average by 0.0901
standard deviation.
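The relative measures in this answer follow directly from the reported summary statistics. The Python sketch below is an illustration only; it takes the population mean, standard deviation and quartiles as given in the answer and recomputes the z value, the quartile range and the quartile deviation coefficient, the latter assumed here to be (Q3 − Q1) / (Q3 + Q1).

```python
# Summary statistics as reported in the answer for the population.
mean = 1.04
std = 1.22
q1, q3 = 0.81, 1.21

# z value: distance from the mean in units of standard deviation.
x = 0.93
z = (x - mean) / std

# Quartile range and quartile deviation coefficient.
quartile_range = q3 - q1
quartile_dev_coef = (q3 - q1) / (q3 + q1)

print(round(z, 4), round(quartile_range, 2), round(100 * quartile_dev_coef, 1))
```

A negative z value means that the observation lies below the average.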
2.56. The data below show the number of employees in manufacturing

plants in one region:
Number of employees Number of firms

1 – 10 409
10 – 15 961
20 – 50 1688
50 – 100 1229
100 – 200 804
200 – 500 213
500 – 1000 152
1000 – 1500 89
a) Draw a histogram of the data.

b) Calculate the mean, median and mode of the distribution. Why do
they differ?
c) Calculate the inter-quartile range, variance, standard deviation and
coefficient of variation of the data.
Answer:
Histogram
Frequency distribution of plants according to employment
Number of employees (xi)   fi      CAFi     ci      ci·fi       (ci − x̄)²·fi
[1 - 10)                     409     409      5.5     2,249.5     4,103,575.74
[10 - 15)                    961   1,370     12.5    12,012.5     8,341,355.99
[20 - 50)                  1,688   3,058     35      59,080       8,429,296.11
[50 - 100)                 1,229   4,287     75      92,175       1,155,742.78
[100 - 200)                  804   5,091    150     120,600       1,580,277.33
[200 - 500)                  213   5,304    350      74,550      12,715,927.27
[500 - 1000)                 152   5,456    750     114,000      63,105,312.41
[1000 - 1500]                 89   5,545  1,250     111,250     116,545,562.6
Total                      5,545  30,520            585,917     215,977,050.3
Mean: 105.67
Median: 44.93
Mode: 12.60
Standard deviation: 197.36
Variance: 38,949.87
Coefficient of variation: 186.77%
Q1: 20.28
Q3: 94.78
Inter-quartile range: 74.50
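Because the classes in this exercise have unequal widths, it is convenient to let a script handle the interpolation. The sketch below is a hedged Python illustration, not the authors' procedure. It assumes class midpoints for the mean and variance, division by N for the variance, and linear interpolation inside the class for the median and quartiles. The mode is left out because its calculation for unequal class widths depends on a convention the answer does not state.

```python
import math

# Class limits (number of employees) and number of firms, as tabulated.
classes = [(1, 10), (10, 15), (20, 50), (50, 100),
           (100, 200), (200, 500), (500, 1000), (1000, 1500)]
freqs = [409, 961, 1688, 1229, 804, 213, 152, 89]

n = sum(freqs)
mids = [(lo + hi) / 2 for lo, hi in classes]

mean = sum(c * f for c, f in zip(mids, freqs)) / n
# Assumption: variance with N in the denominator.
variance = sum(f * (c - mean) ** 2 for c, f in zip(mids, freqs)) / n
std = math.sqrt(variance)
cv = 100 * std / mean


def quantile(p):
    """Quantile of order p, interpolated linearly inside its class."""
    target = p * n
    cum = 0
    for (lo, hi), f in zip(classes, freqs):
        if cum + f >= target:
            return lo + (target - cum) / f * (hi - lo)
        cum += f
    raise ValueError("p must lie between 0 and 1")


median = quantile(0.5)
q1, q3 = quantile(0.25), quantile(0.75)

print(round(mean, 2), round(median, 2), round(std, 2), round(cv, 2))
print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))
```

Small differences in the last decimal place relative to the printed answer can arise from rounding of intermediate results.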
2.57. Your organization has recently started advertising its services on

the Internet. The marketing manager has indicated that she wants
to know how long it takes Internet users to access your company’s
Internet screen, since there is concern it is taking too long and
deterring interest. You have asked colleagues and friends at a
variety of other organizations to access your own company’s
website and keep a careful record of how long it took them to get
into the company’s home page. The results for 120 attempts are:
Access time (seconds) Number of attempts at access

up to 15 seconds 17
15 – 20 24
20 – 25 19
25 – 30 28
30 – 35 19
35 or over 13
a) Construct an ogive for this data and comment on your result.

b) Explain which measure(s) of average and dispersion you would
suggest using for this data and the reasons for your preference.
c) Calculate the measure(s) of average and dispersion.
d) Interpret these, and any other statistics you think might find useful,
in the context of the problem at hand.
Answer:
Ogive (Graph of cumulative percentage frequency)
First, we have to make sure that the access time intervals are of the
same length. In other words, our starting table will be following:
Access time (seconds) No. of attempts at access

10 – 15 17
15 – 20 24
20 – 25 19
25 – 30 28
30 – 35 19
35 – 40 13
Sample average: 24.46 → Average access time in the sample is
about 24.5 seconds.
Sample median: 25 → 50% of access times are 25 seconds
or less.
Sample standard deviation: 7.87 → Average linear distance from average
access time is 7.87 seconds.
Sample coefficient of variation: about 32.2%
Relative variation of data around the average access time is about 32.2%.
CHAPTER 3
REGRESSION AND CORRELATION
3.1. INTRODUCTION
Correlation and regression analysis has a different purpose than the

previous techniques we have looked at.
The goal of correlation and regression analysis is to determine
and quantify the relationship between two or more variables.
Each object or individual must have scores on two or more variables.
Over many cases we wish to know whether there is
a relationship between the variables.
Correlation and regression are methods of describing the

nature and degree of relationship between two or more
variables.
Examples of such relations are:

Hours spent studying and grade point average
Family’s income and child’s I.Q.
College G.P.A and adult income
Amount of time watching T.V. and fear of crime, etc.
In each case, for each object or person or case, measurement is made on

the two or more variables and we wish to determine if those variables
are related.
The three most important concepts in correlation and regression
analysis are:
The scatter plot displays the form, direction, and strength

of the relationship between two quantitative variables.
Straight-line or linear relationships are particularly important because a
straight line is a very simple pattern that is quite common. But when we
work with two or more independent variables, graphical presentation
is no longer practical.
The correlation measures the direction and strength of

relationship between two or more variables.
The least-squares regression model or equation is the model
that makes the sum of the squares of the distances, between
original data for dependent variable and predicted or
estimated data for dependent variable, as small as possible.
If we work with one independent variable, then we can present least-

squares regression line graphically as the line that shows the lowest sum
of the squares of the vertical distances of the data points from the line.
3.2. BASIC ASPECTS
In correlation and regression analysis, basic aspects are:
a) The direction of the relationship

Positive → high scores on one variable go with high scores on the
other variable and vice versa.
Negative → high scores on one variable go with low scores on the
other variable and vice versa.
b) The form of the relationship

Linear versus non–linear relationships
c) The degree of the relationship

In a positive relationship, are high scores always associated with other
high scores and low scores with other low scores, or just sometimes?
3.3. SCATTER PLOT
A scatter plot is a type of graph using Cartesian coordinates

to display values of two variables from a set of data.
The data is displayed as a collection of points, each having the value

of one variable (independent variable x) determining the position on
the horizontal axis and the value of the other variable determining the
position on the vertical axis (dependent variable y). A scatter plot is also
called a scatter chart, scatter diagram and scatter graph.
Example 3.1.
Here is a table showing the results of two examinations taken by 10 students.
They took Maths and Statistics exams, and the scores that they
obtained in both exams were recorded (see footnote 24):

Student            John  Betty  Sarah  Peter  Fiona  Charlie  Tim  Gerry  Martine  Rachel
Maths score          72     65     80     36     50       21   79     64       44      55
Statistics score     78     70     81     31     55       29   74     64       47      53

We want to create a scatter graph.
Solution (creating a scatter plot using Excel):
We will draw two-dimensional Cartesian coordinate system. The

horizontal axis will represent the score on the Maths exam. The vertical
axis will represent the score on the Statistics exam. For each student,
we then mark a dot at the co-ordinates representing their two scores. In
Excel, among Chart types, we choose scatter:
24 http://richardbowles.tripod.com/maths/correlation/corr.htm, access: 28. 01. 2010.
And we will get scatter plot:
We can see that the points follow a very strong pattern. Students who
are good at Maths tend to be good at Statistics as well. The marks lie
fairly close to an imaginary straight line that we can draw on the graph.
In the diagram above, we can draw in this straight line: we right-click
on the data points and select the options as shown below.
Then, we will choose the Add Trendline option:
And then the linear model, which is the obvious choice from the graph:
When the points lie close to the straight line, we say that there is a strong
correlation. When this line is upward sloping, indicating that the
Statistics mark tends to increase as the Maths mark increases, we say that
there is a positive correlation.
On the next graph we can see different forms of scatter plots:
Figure 3.1. Different forms of scatter plots

In cases a) and b) we have linear relationships. In case a) direction of

relationship is positive and direct (high score on one variable goes with
high on the other variable), but in case b) relationship is negative and
indirect (high score on one variable goes with low score on the other).
Under case c), there is no relationship between the variables, a case can
be high on one variable and either high or low on the other. Under cases
d), e) and f) there are non–linear relationships.
3.4. LINE OF BEST FIT (REGRESSION LINE)
The straight line that we draw through the points is called

either the line of best fit or the regression line.
It is a mathematical representation of the relationship between two
quantitative variables. There is a standard way to draw this line to
ensure that it fits as closely to the data points as possible. Later on, we
will present the exact mathematical procedure to obtain a regression line.
For now, we only have to remember one thing:
The regression line goes through the point whose co-ordinates

are the mean values of given variables in regression model.
The arithmetic means are found by adding the relevant scores for
exams, and dividing sum by 10. This is because there are results for ten
students in the table with original data.
We work out:
mean Maths scores =
= (72 + 65 + 80 + 36 + 50 + 21 + 79 + 64 + 44 + 55) / 10 = 56.6
mean Statistics scores =
= (78 + 70 + 81 + 31 + 55 + 29 + 74 + 64 + 47 + 53) / 10 = 58.2
and we can be sure that the line has to go through the point (56.6, 58.2).
We can see on the scatter plot from Example 3.1 that there is roughly the same
number of data points lying above this line as there are below it.
We can use the regression line to make predictions. For instance, what
Statistics mark would we expect someone to receive if they received a
Maths mark of 40? If we look at the straight line, we can see that when
the Maths mark is 40, the Statistics mark is approximately 42. Similarly,
we can assume that anyone who got 40 marks on Statistics exam, would
also get about 38 marks on Maths exam. However, there are limits on
the predictions that we can make, as we will elaborate later on.
3.5. THE STANDARD ERROR OF ESTIMATE AND

THE COEFFICIENT OF DETERMINATION
The standard error of estimate and the coefficient of determination are
obtained in the following steps:
1. Decomposition of an observed score if y is dependent variable:
$y_i = \hat{y}_i + e_i$, where $\hat{y}_i$ is the predicted score and $e_i = y_i - \hat{y}_i$ is the error of prediction.
Figure 3.2. Partitioning of variability
2. Partitioning the variance in scores
a) It may be more useful to look at it in terms of variability, breaking the
total variability of the score (its deviation from the mean) into two
portions:
$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$
- $(y_i - \bar{y})$: the deviation of the score from the mean.
- $(\hat{y}_i - \bar{y})$: the deviation of the predicted score from the mean; this
is the portion of the score that reflects the relationship with the x
variable.
- $(y_i - \hat{y}_i)$: the deviation of the observed score from the predicted
score. This is error, or the part of the score that is not related to the
x variable.
b) If we square these deviations and sum them we have sums of squares.
These sums of squares are additive:
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
$\sum (y_i - \bar{y})^2$ is the total sum of squares for the dependent variable –
SSy (total variability)
$\sum (\hat{y}_i - \bar{y})^2$ is the sum of squares due to prediction or regression
(SSregression). This is the part of the y variable that the x variable did
predict (explained variability).
$\sum (y_i - \hat{y}_i)^2$ is the sum of squares for the residual or the errors of
prediction, the part of SSy that the x variable did not predict
(SSerrors in prediction, or SSresidual, or unexplained variability).
3. $r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$
is the coefficient of determination which represents the fraction of
the total variation in the y scores that can be predicted from the x
scores.
It is denoted by r2. According to the formula, this coefficient must be in the
range 0 to +1.
The coefficient of determination tells us what proportion of the variation

between the data points is explained or accounted for by the line of the
best fit. It indicates how close the points are to the line.
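4. Then, we can calculate the standard error of estimate by using the formula:
$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$, where $\hat{y}_i$ is the predicted value for observation i and n is the number of observations.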
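The steps above can be traced numerically on the exam scores from Example 3.1. The following Python sketch is an illustration under stated assumptions: it fits the least-squares line (the procedure the chapter presents later), and it uses n − 2 in the denominator of the standard error of estimate, which is the convention used by Excel's regression output. The variable names are illustrative.

```python
import math

# Maths (x) and Statistics (y) scores of the ten students in Example 3.1.
x = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
y = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

# Partitioning of variability.
ss_total = sum((yi - mean_y) ** 2 for yi in y)                    # total
ss_regression = sum((yh - mean_y) ** 2 for yh in y_hat)           # explained
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # unexplained

r_squared = ss_regression / ss_total
# Assumption: n - 2 degrees of freedom for simple regression.
std_error = math.sqrt(ss_residual / (n - 2))

print(round(ss_total, 2), round(ss_regression, 2), round(ss_residual, 2))
print(round(r_squared, 4), round(std_error, 4))
```

The explained and unexplained sums of squares add up to the total sum of squares, which is the additivity property stated in step 2 b).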
We can see by looking at the graph whether there is a strong or weak

correlation between two variables, and whether that correlation is
positive or negative. However, there is a mathematical way of working
it out by calculating the correlation coefficient. This is also known as
Pearson’s Correlation Coefficient, represented by the letter r, and it is a
single number which ranges from -1 (perfect strong negative correlation)
to +1 (perfect strong positive correlation).
The correlation coefficient indicates whether there is a
relationship between the two variables, and whether the
relationship is positive or negative.
Mathematically, the correlation coefficient is the square root of the
coefficient of determination, with the sign of the slope of the regression line:
$r = \pm\sqrt{r^2}$
The stronger the correlation, the larger the explained variability will be:
If r = 0 then SSregression = 0
If r = 1 then SSregression = SSy
The stronger the correlation, the smaller the unexplained variability will be:
If r = 0 then SSresidual = SSy
If r = 1 then SSresidual = 0
Correlation coefficients which are close to -1 or +1 indicate a strong
correlation. Values close to 0 indicate a weak correlation, while 0 itself
indicates no correlation at all. The stronger the correlation, the
better the prediction and the smaller the errors of prediction.
3.7. INTERPRETATION OF THE SIZE

OF A CORRELATION
Some authors have offered guidelines for the interpretation of a correlation

coefficient:
Table 3.1. Guidelines for the interpretation of a correlation coefficient
Correlation Negative Positive

Small –0.3 to –0.1 0.1 to 0.3
Medium –0.5 to –0.3 0.3 to 0.5
Large –1.0 to –0.5 0.5 to 1.0
Cohen25 has observed that all such criteria are in some ways arbitrary
and should not be observed too strictly. This is because the interpretation
of a correlation coefficient depends on the context and purposes. A
correlation of 0.9 may be very low if one is verifying a physical law
using high-quality instruments, but may be regarded as very high in
the social sciences where there may be a greater contribution from
complicating, unobserved factors.
In this vein, it is important to remember that “large” and “small”
should not be taken as synonyms for “good” and “bad” in terms of
determining that a correlation is of a certain size. For example, a
correlation of 1.0 or –1.0 indicates that the two variables analyzed are
equivalent modulo scaling. Scientifically, this more frequently indicates
a trivial result than a profound one. For example, discovering
a correlation of 1.0 between how many feet tall a group of people are
and the number of inches from the bottom of their feet to the top of their
heads could not be considered particularly important.
25 Cohen, J., Statistical power analysis for the behavioral sciences (2nd ed.), Lawrence
Erlbaum Associates, 1988.

OF THE LINEAR REGRESSION MODEL
The linear regression model is defined by two numbers: the slope and
the intercept on the vertical axis of the line that best fits the data points.
We always refer to the slope of the line as b and the intercept as a, which
gives the equation of the regression line as:
$\hat{y} = a + bx$
The Least-Squares Method (LSM) determines the values of a and
b that minimize the sum of squares for the residual or the errors of
prediction:
$\sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2 \rightarrow \min$
According to the LSM, here are the formulas for calculation of the
slope and the intercept and general rules for their interpretation:
$a = \bar{y} - b\bar{x}$ - indicates the value of y when x is 0.
$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$ - indicates how much the y values
change, on average, when x changes by one unit.
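The least-squares formulas for the slope and intercept can be applied directly to the sums Σx, Σy, Σx² and Σxy. The Python sketch below is an illustration on the exam scores from Example 3.1; the function name `least_squares` is hypothetical and chosen only for this example.

```python
def least_squares(x, y):
    """Return (a, b) of the line y_hat = a + b * x fitted by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: (sum of xy - n * mean_x * mean_y) / (sum of x^2 - n * mean_x^2).
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    # Intercept: the line passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


# Maths (x) and Statistics (y) scores from Example 3.1.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
statistics_scores = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]

a, b = least_squares(maths, statistics_scores)
print(round(a, 3), round(b, 3))
```

Because the intercept is computed as the mean of y minus b times the mean of x, the fitted line always passes through the point of means, here (56.6, 58.2).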
Example 3.1. (cont.)
We want to create regression model for relationship between Maths

score and Statistics score, in the sense that Statistics score depends on
Maths score.
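Solution:
We will use the means and sums calculated from the data
(n = 10, x̄ = 56.6, ȳ = 58.2, Σx² = 35344, Σxy = 36046).
Calculating and interpreting slope (parameter b):
$b = \frac{36046 - 10 \cdot 56.6 \cdot 58.2}{35344 - 10 \cdot 56.6^2} = \frac{3104.8}{3308.4} = 0.938$
Statistics score will rise by 0.938 on average if Maths score rises by 1.
Calculating and interpreting intercept (parameter a):
$a = 58.2 - 0.93846 \cdot 56.6 = 5.083$
A student who has a score of 0 in Maths will be expected to have a score
of 5.083 in Statistics.
Calculating the equation of the linear regression model:
Regression model is: $\hat{y} = 5.083 + 0.938x$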
We can obtain the same results by using Excel options
(calculating the equation of the linear regression model using Excel).
One way to obtain the results is to use the Excel functions INTERCEPT and SLOPE:
Regression model: $\hat{y} = 5.083 + 0.938x$
Interpretation:
Statistics score will rise by 0.938 on average if Maths score rises by 1.
Students who have a score of 0 in Maths will be expected to have a score
of 5.083 in Statistics.
Another way of calculation by using Excel:
In Excel we will find Tools - Data Analysis – Regression:
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.971121335
R Square             0.943076647
Adjusted R Square    0.935961228
Standard Error       4.68868839
Observations         10

ANOVA
             Df   SS            MS            F          Significance F
Regression   1    2913.729609   2913.729609   132.5399   2.94E-06
Residual     8    175.8703905   21.98379882
Total        9    3089.6

             Coefficients   Standard Error   t Stat        P-value    Lower 95%   Upper 95%
Intercept    5.083182203    4.846187507      1.048903328   0.324874   -6.09215    16.25851
Math score   0.938459678    0.081515907      11.5125957    2.94E-06   0.750484    1.126436
RESIDUAL OUTPUT
Observation Predicted Y Residuals Standard Residuals
1 72.65227905 5.347720953 1.209744422
2 66.0830613 3.916938701 0.886077412
3 80.15995647 0.840043526 0.190031974
4 38.86773063 -7.867730625 -1.779812993
5 52.00616612 2.993833877 0.677255576
6 24.79083545 4.209164551 0.952183814
7 79.2214968 -5.221496796 -1.181190394
8 65.14460162 -1.14460162 -0.258928137
9 46.37540805 0.624591948 0.141293203
10 56.69846451 -3.698464515 -0.836654877

FOR LINEAR RELATIONSHIP
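According to the general formula for the correlation coefficient, here is
how we calculate the correlation coefficient when the relationship between
variables is linear:
$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$
where:
$\operatorname{cov}(x, y) = \frac{\sum x_i y_i}{n} - \bar{x}\bar{y}$ is the covariance between
x (independent variable) and y (dependent variable).
$\sigma_x = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$ is the standard deviation of variable x
$\sigma_y = \sqrt{\frac{\sum y_i^2}{n} - \bar{y}^2}$ is the standard deviation of variable y
$\bar{x} = \frac{\sum x_i}{n}$ - mean of variable x
$\bar{y} = \frac{\sum y_i}{n}$ - mean of variable y
n is the number of objects.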
We want to calculate correlation coefficient between Maths score and

Statistics score.
Solution:
First, all sums needed for calculation of correlation coefficient between

Maths score and Statistics score will be calculated in a working table.
Object    Maths score (x)   Statistics score (y)   x²      y²      x·y
John      72                78                     5184    6084    5616
Betty     65                70                     4225    4900    4550
Sarah     80                81                     6400    6561    6480
Peter     36                31                     1296     961    1116
Fiona     50                55                     2500    3025    2750
Charlie   21                29                      441     841     609
Tim       79                74                     6241    5476    5846
Gerry     64                64                     4096    4096    4096
Martine   44                47                     1936    2209    2068
Rachel    55                53                     3025    2809    2915
Total     566               582                    35344   36962   36046
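Then we will apply the formulas for covariance and standard deviations:
$\bar{x} = \frac{566}{10} = 56.6$, $\bar{y} = \frac{582}{10} = 58.2$
$\operatorname{cov}(x, y) = \frac{36046}{10} - 56.6 \cdot 58.2 = 310.48$
$\sigma_x = \sqrt{\frac{35344}{10} - 56.6^2} = \sqrt{330.84} = 18.189$
$\sigma_y = \sqrt{\frac{36962}{10} - 58.2^2} = \sqrt{308.96} = 17.577$
Now we have the parameters needed to calculate the correlation coefficient
(calculating and interpreting the correlation coefficient):
$r = \frac{310.48}{18.189 \cdot 17.577} = 0.9711$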
Those parameters could also be calculated by using Excel
(calculating and interpreting the correlation coefficient using Excel
statistical functions).
In Excel statistical functions we will choose the function CORREL:
And then we will select the variable data from the Excel worksheet.
The correlation coefficient is close to 1, which indicates a strong positive
correlation, as we assumed from the scatter plot. Hence we can draw the
conclusion that there is a strong direct relationship between scores in
Maths and Statistics.
Provided that the calculations have been done correctly, the correlation
coefficient will lie within the range from -1 to 1.
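The calculation of the correlation coefficient from the covariance and the two standard deviations can be sketched in a few lines of Python. This is an illustration on the exam scores from Example 3.1, assuming the computational forms of covariance and standard deviation that divide by n; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient r = cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations, all with n in the denominator.
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    sigma_x = math.sqrt(sum(xi ** 2 for xi in x) / n - mean_x ** 2)
    sigma_y = math.sqrt(sum(yi ** 2 for yi in y) / n - mean_y ** 2)
    return cov_xy / (sigma_x * sigma_y)


maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
statistics_scores = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]

r = pearson_r(maths, statistics_scores)
print(round(r, 4), round(r ** 2, 4))
```

The squared value of `r` agrees with the R Square figure in the Excel regression output for the same data.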
3.10. PREDICTION OR FORECASTING
This model, which is determined by LSM method, is used for forecasting

values of dependent variable y for different given values of independent
variable x. Predictions in regression analysis can be made by:
Interpolation – when values of the independent variable x are within
the original range from the smallest to the largest x used in developing the
regression model. This is a relatively reliable prediction.
Extrapolation – when values of the independent variable x are not within
the original range from the smallest to the largest x used in developing the
regression model. This prediction can be subject to unknown effects that
we do not expect, so in the case of extrapolation, reliability is questionable.
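If students have a Maths score of 75, what is the expected score for Statistics?
Solution:
We will make an interpolation (forecasting values of dependent variable y),
since 75 lies within the observed range of Maths scores (21 to 80):
$\hat{y} = 5.083 + 0.9385 \cdot 75 \approx 75.47$
According to the previous regression model, we will expect that students
who have a Maths score of 75 will get a score of about 75.47 in Statistics.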
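A forecast can be accompanied by a check of whether it is an interpolation or an extrapolation. The sketch below is a hedged Python illustration; it takes the intercept and slope from the Excel regression output for Example 3.1 and the observed range of Maths scores (21 to 80). The function name `predict` is hypothetical.

```python
# Coefficients from the Excel regression output for Example 3.1.
INTERCEPT = 5.083182203
SLOPE = 0.938459678

# Smallest and largest Maths score used to fit the model.
X_MIN, X_MAX = 21, 80


def predict(x_new):
    """Return the forecast and whether it is interpolation or extrapolation."""
    y_hat = INTERCEPT + SLOPE * x_new
    kind = "interpolation" if X_MIN <= x_new <= X_MAX else "extrapolation"
    return y_hat, kind


forecast, kind = predict(75)
print(round(forecast, 2), kind)

# A Maths score of 95 lies outside the observed range,
# so the forecast for it is less reliable.
print(predict(95)[1])
```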
3.11. SPEARMAN’S RANK CORRELATION COEFFICIENT
Spearman’s correlation coefficient (ρ), used with ranked data, can be calculated using the formula:
$$\rho = 1 - \frac{6 \sum d^2}{n (n^2 - 1)}$$
where n is the number of ranked pairs and d is the difference in ranking for x and y:
$$d = r_x - r_y$$
The only difference between ρ and the standard r is that the data used for the calculation of ρ are ranks.
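The formula can be sketched in a few lines of code. This is an illustrative sketch only: the function name and the two rankings are hypothetical, and the formula applies to rankings without ties.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two lists of untied ranks, with d = rx - ry."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both rankings must have the same length")
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of five items by two judges (illustrative data only).
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 4, 3, 5]
rho = spearman_rho(judge_1, judge_2)
print(round(rho, 4))
```

Identical rankings give ρ = 1 and exactly reversed rankings give ρ = -1, which is a quick way to check an implementation.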
Example 3.2.
Two art historians were asked to rank six paintings from 1 (best) to 6
(worst). Their rankings are shown as a table:
Painting Historian 1 Historian 2

A 6 5
B 5 6
C 1 2
D 3 1
E 2 4
Calculate Spearman’s rank correlation coefficient. Explain.
Solution:
We have ranks for two variables and we will calculate the difference in ranking for x and y: $d = r_x - r_y$.
Painting Historian 1 - rx Historian 2 - ry d d2

A 6 5 1 1
B 5 6 -1 1
C 1 2 -1 1
D 3 1 1 1
E 2 4 -2 4
Sum 8
Spearman’s rank correlation coefficient is:
$$\rho = 1 - \frac{6 \cdot 8}{6 (6^2 - 1)} = 1 - \frac{48}{210} = 0.7714$$
That suggests a relatively strong direct relationship (77.14%) between the opinions of these two art historians.
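3.12. STATISTICAL TESTING FOR SIMPLE LINEAR REGRESSION MODEL (t TEST)
It is possible to test the significance of the parameters in the model of simple linear regression. The testing procedure consists of five steps:
1. Hypotheses: $H_0: \beta = 0$ against $H_1: \beta \neq 0$.
2. Standard error for parameter b:
$$s_b = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}}$$
where:
$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}}$$
3. Empirical value of the test statistic: $t_e = b / s_b$.
4. Theoretical value: $t_t = t_{(\alpha/2;\, n-k-1)}$, where k = 1 is the number of independent variables in the simple regression model.
5. If $|t_e| \le t_t$, there is not sufficient evidence to reject $H_0$: parameter b is not significant, i.e. the independent variable is not significant in the model.
If $|t_e| > t_t$, we reject $H_0$: parameter b is significant.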
The concept of p-values, which is simpler, leads to the following conclusions:
If the p-value of the coefficient on the observed variable is less than 0.05, we conclude that the variable is significant. We reject H0 with a type I error of 5%, i.e. with a probability of 5% of rejecting H0 when it is actually true.
If the p-value for the coefficient of the observed variable is greater than 0.05, we conclude that the variable is not significant and could be excluded from the model. Since rejecting H0 would carry a probability of type I error greater than 5%, H0 is not rejected.
We will analyze part of the Excel output for the regression analysis in example 1:
               Coefficients   Standard Error   t Stat        P-value
Intercept      5.083182203    4.846187507      1.048903328   0.324874
X Variable 1   0.938459678    0.081515907      11.5125957    2.94E-06
In this case the p-value for the coefficient of the independent variable (Math score) is 2.94E-06, which is lower than 0.05, so we can say that Math score is a significant independent variable in that regression model.
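The t Stat column can be reproduced by dividing each coefficient by its standard error. The sketch below does this for the slope; the sample size n = 10 (hence 8 degrees of freedom) is an assumption not stated in this part of the text, and 2.306 is the tabulated two-sided 5% critical value of the t distribution for 8 degrees of freedom.

```python
# Values copied from the Excel output above.
b = 0.938459678        # slope ("X Variable 1")
se_b = 0.081515907     # its standard error
t_empirical = b / se_b

# ASSUMPTIONS: n = 10 observations, so n - k - 1 = 8 degrees of freedom;
# 2.306 is the tabulated two-sided 5% critical value for 8 degrees of freedom.
t_critical = 2.306
significant = abs(t_empirical) > t_critical

print(round(t_empirical, 4), significant)
```

Both approaches, the comparison with the critical value and the p-value, lead to the same decision about the slope.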
3.13. OVERVIEW EXAMPLE FOR SIMPLE LINEAR REGRESSION
Example 3.3.
To examine relationship between the store size (i.e. square footage) and
its annual sales, a sample of 14 stores was selected. The results for these
14 stores are summarized in the next table:
Store   Square feet (000)   Annual sales (in millions of $)
1 1.7 3.7
2 1.6 3.9
3 2.8 6.7
4 5.6 9.5
5 1.3 3.4
6 2.2 5.6
7 1.3 3.7
8 1.1 2.7
9 3.2 5.5
10 1.5 2.9
11 5.2 10.7
12 4.6 7.6
13 5.8 11.8
14 3.0 4.1
a) Create a scatter plot to examine the relationship between store size and annual sales. Comment.
b) Create the regression model and explain its parameters.
c) Calculate and explain the coefficient of correlation and the coefficient of determination.
d) If the store size is 4200 square feet, what level of annual sales could we expect for that store?
Solution:
a) Scatter plot:
1. the independent variable is store size,
2. the dependent variable is annual sales.
According to the scatter plot, we suppose that there is a direct linear relationship.
b) Linear model:
First we need to find the sums in the working table:
Store   Square feet (000) - x   Annual sales (in millions of $) - y   x2   y2   x.y
1 1.7 3.7 2.89 13.69 6.29
2 1.6 3.9 2.56 15.21 6.24
3 2.8 6.7 7.84 44.89 18.76
4 5.6 9.5 31.36 90.25 53.2
5 1.3 3.4 1.69 11.56 4.42
6 2.2 5.6 4.84 31.36 12.32
7 1.3 3.7 1.69 13.69 4.81

8 1.1 2.7 1.21 7.29 2.97
9 3.2 5.5 10.24 30.25 17.6
10 1.5 2.9 2.25 8.41 4.35
11 5.2 10.7 27.04 114.49 55.64
12 4.6 7.6 21.16 57.76 34.96
13 5.8 11.8 33.64 139.24 68.44
14 3 4.1 9 16.81 12.3
Total 40.9 81.8 157.41 594.9 302.3
Slope (parameter b):
$$b = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2} = \frac{302.3 - 14 \cdot 2.9214 \cdot 5.8429}{157.41 - 14 \cdot 2.9214^2} = 1.67$$
indicates that annual sales increase by 1.67 million dollars on average as store size increases by 1000 square feet.
Intercept (parameter a):
$$a = \bar{y} - b\,\bar{x} = 5.8429 - 1.67 \cdot 2.9214 = 0.964$$
indicates that expected annual sales are 0.964 million dollars when store size is 0 square feet.
Regression model is:
$$\hat{y} = 0.964 + 1.67\,x$$
c) Correlation coefficient is:
$$r = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sqrt{\left(\sum x_i^2 - n\,\bar{x}^2\right)\left(\sum y_i^2 - n\,\bar{y}^2\right)}} = 0.9508$$
This indicates a strong (but not perfect) positive correlation.
Coefficient of determination is r² = 0.9508² = 0.904. The regression model explains 90.4% of the variability in annual sales. Only 9.6% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses only square footage.
d) x = 4.2 lies within the original range, from the smallest to the largest x used in developing the regression model, so this is an interpolation:
$$\hat{y} = 0.964 + 1.67 \cdot 4.2 = 7.978$$
The predicted average annual sales of a store with 4,200 square feet are $7,978,000.
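The whole of Example 3.3 can be retraced with a short script. This is a sketch using the standard least squares formulas in deviation form, which may differ in layout from the working-table approach above; the variable names are illustrative. Because the script does not round the parameters before forecasting, the last digits can differ slightly from the hand calculation.

```python
import math

# Store data from Example 3.3: size in thousands of square feet, sales in millions of $.
size = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

n = len(size)
mean_x = sum(size) / n
mean_y = sum(sales) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, sales))
s_xx = sum((x - mean_x) ** 2 for x in size)
s_yy = sum((y - mean_y) ** 2 for y in sales)

b = s_xy / s_xx                     # slope
a = mean_y - b * mean_x             # intercept
r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient
r2 = r ** 2                         # coefficient of determination

x_new = 4.2                         # store of 4,200 square feet
is_interpolation = min(size) <= x_new <= max(size)
forecast = a + b * x_new

print(round(a, 3), round(b, 3), round(r, 4), round(r2, 3),
      round(forecast, 3), is_interpolation)
```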
3.14. CALCULATING THE EQUATION OF THE EXPONENTIAL REGRESSION MODEL
Exponential regression model is given by the relation:
$$\hat{y} = a \cdot b^{x}$$
The idea is to convert the exponential curve to a linear one, using the logarithm, as follows:
$$\log \hat{y} = \log a + x \log b$$
Replacement: $Y = \log \hat{y}$, $A = \log a$, $B = \log b$
linear model: $Y = A + B x$
a = antilogarithm A, b = antilogarithm B
$\hat{y}$ = antilogarithm $Y$
We will apply this exponential model to a set of data that we suspect

does not change linearly over time.
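The logarithmic transformation can be sketched as follows. This is an illustration under stated assumptions: base-10 logarithms are used, the helper names are hypothetical, and the data are invented so that they follow an exact exponential pattern.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def fit_exponential(xs, ys):
    """Fit y = a * b**x by regressing log10(y) on x; requires all y > 0."""
    A, B = fit_line(xs, [math.log10(y) for y in ys])
    return 10 ** A, 10 ** B          # antilogarithms of A and B

# Hypothetical series that grows by exactly 50% per period (illustrative only).
periods = [0, 1, 2, 3, 4]
values = [2.0, 3.0, 4.5, 6.75, 10.125]
a, b = fit_exponential(periods, values)
print(round(a, 4), round(b, 4))
```

For such data the fit recovers the starting value a and the growth factor b; with real data the least squares criterion is applied to the logarithms, not to the original values.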
3.15. CALCULATING THE EQUATION OF THE PARABOLICAL REGRESSION MODEL
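If we want to examine non-linear relationships among variables, we can use the parabolic regression model, which is given by the relation:
$$\hat{y} = a + b x + c x^2$$
This is a regression model in which the regression function is a polynomial.
For the calculation of the parameters we apply the system of normal equations (according to LSM):
$$\sum y = n a + b \sum x + c \sum x^2$$
$$\sum x y = a \sum x + b \sum x^2 + c \sum x^3$$
$$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$$
We will use the parabolic regression model if we want to look for a U-shaped pattern.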
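A parabolic model y = a + bx + cx² can be fitted by building and solving the three least squares normal equations. The sketch below is illustrative: the parameter names a, b, c, the helper functions and the data are assumptions, and the small linear system is solved by plain Gauss-Jordan elimination.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_parabola(xs, ys):
    """Fit y = a + b*x + c*x**2 from the three normal equations."""
    s = lambda p: sum(x ** p for x in xs)                      # sum of x^p
    t = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))    # sum of x^p * y
    matrix = [[len(xs), s(1), s(2)],
              [s(1),    s(2), s(3)],
              [s(2),    s(3), s(4)]]
    return solve_linear_system(matrix, [t(0), t(1), t(2)])

# Hypothetical U-shaped data generated from y = 1 - 2x + 0.5x^2 (illustrative only).
xs = [0, 1, 2, 3, 4, 5]
ys = [1 - 2 * x + 0.5 * x ** 2 for x in xs]
a, b, c = fit_parabola(xs, ys)
print(round(a, 4), round(b, 4), round(c, 4))
```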

3.16. CALCULATING THE EQUATION OF THE POWER REGRESSION MODEL
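Power (log-log) regression model is given by the relation:
$$\hat{y} = a \cdot x^{b}$$
We again convert the power model to a linear one, using the logarithm, as follows:
$$\log \hat{y} = \log a + b \log x$$
Replacement: $Y = \log \hat{y}$, $A = \log a$, $X = \log x$
linear model: $Y = A + b X$
a = antilogarithm A, $\hat{y}$ = antilogarithm $Y$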
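The log-log transformation can be sketched in the same way. Again this is only an illustration: base-10 logarithms, the function name and the data are assumptions.

```python
import math

def fit_power(xs, ys):
    """Fit y = a * x**b by least squares on (log10 x, log10 y); needs x > 0 and y > 0."""
    X = [math.log10(x) for x in xs]
    Y = [math.log10(y) for y in ys]
    n = len(X)
    mean_X, mean_Y = sum(X) / n, sum(Y) / n
    b = (sum((u - mean_X) * (v - mean_Y) for u, v in zip(X, Y))
         / sum((u - mean_X) ** 2 for u in X))
    A = mean_Y - b * mean_X
    return 10 ** A, b                # a is the antilogarithm of A; b needs no conversion

# Hypothetical data generated from y = 3 * x**2 (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]
a, b = fit_power(xs, ys)
print(round(a, 4), round(b, 4))
```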
3.17. MULTIPLE REGRESSION MODEL
The general multiple regression model with K independent variables is:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K + e$$
The dependent variable Y is expressed as a function of K independent variables and the random error e. If the variables form the functional part of the defined linear model, multiple linear regression estimates a linear equation of the form:
$$\hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_K x_K$$
Coefficients in the regression model have the following meaning:
Parameter a is the free, constant term, which represents the expected value of the dependent variable Y when the values of the K independent variables (X1, X2,...,XK) are equal to zero. The value of this parameter does not always have a logical explanation.
Parameter bi (i=1,2,...,K), the regression coefficient on the independent variable Xi, indicates the average change in the dependent variable Y for a unit increase in Xi, provided that the other independent variables remain unchanged. A positive coefficient indicates a direct relationship between Y and Xi: it shows how much the dependent variable is expected to increase when that independent variable increases by one, holding all the other independent variables constant. A negative value means an inverse relationship between the dependent variable Y and the independent variable Xi. In this case the independent and dependent variables change in opposite directions, and an increase of Xi by one tends to decrease Y, holding all the other independent variables constant.
The values of the multiple regression coefficients are estimated using the method of least squares.
3.17.1. Measures for quality of multiple regression model
There are several measures of the representativeness and quality of multiple regression models:
A. The root mean square error (RMSE), a measure of unexplained variability (model error):
$$\hat{\sigma} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}}$$
B. The coefficient of model variation:
$$V = \frac{\hat{\sigma}}{\bar{y}} \cdot 100$$
C. The coefficient of multiple determination (the ratio of explained to total variability) is defined by the following expression:
$$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
The coefficient of multiple determination shows what share of the variability of the dependent variable is explained by the K independent variables included in the regression model.
D. The coefficient of multiple linear correlation measures the strength of the relation between the dependent variable and all the independent variables jointly. It is determined as the square root of the coefficient of multiple determination:
$$R = \sqrt{R^2}$$
Or by expression:
The coefficient R does not indicate the direction of the association, because relations between the dependent and the independent variables can be multidirectional.
E. The partial correlation coefficient shows the strength and direction of the relation between the dependent variable Y and the j-th independent variable, holding the remaining (K-1) independent variables constant. The value of this coefficient ranges within the limits:
$$-1 \le r \le 1$$
For example, partial correlation coefficients of the first order for K = 2 are defined using the simple coefficients of linear correlation in the following manner:
$$r_{y1 \cdot 2} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{(1 - r_{y2}^2)(1 - r_{12}^2)}}, \qquad r_{y2 \cdot 1} = \frac{r_{y2} - r_{y1}\, r_{12}}{\sqrt{(1 - r_{y1}^2)(1 - r_{12}^2)}}$$
Interpretation of partial correlation coefficients: they express the strength and direction of the relation between the dependent variable and one independent variable (their variability) when the influence of the other (K-1) independent variables is removed.
F. Adjusted coefficient of determination:
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$
The adjustment takes into account the number of predictors and the sample size.
3.17.2. Statistical test for multiple regression model (t test, ANOVA)
a) Testing the significance of the multiple regression coefficient $b_j$
If we want to test the significance of an independent variable in the multiple regression model, we use the t test.
1. Hypotheses: $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$.
2. The standard error of the estimate of parameter $b_j$ is $s_{b_j}$; statistical software such as Excel reports it for each coefficient.
3. Empirical value: $t_e = b_j / s_{b_j}$.
4. Theoretical value: $t_t = t_{(\alpha/2;\, n-k-1)}$, where k = M-1 is the number of independent variables in the multiple regression model.
5. If $|t_e| \le t_t$, there is not sufficient evidence to reject Ho at the α level of significance. Parameter $b_j$ is not significant and the independent variable makes no contribution to the model.
If $|t_e| > t_t$, we reject Ho at the α level of significance. Parameter $b_j$ is significant and the independent variable contributes to the model.
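b) Analysis of variance in the regression model - F test for the regression model
This analysis tests whether there is a significant link between the set of independent variables included in the model and the dependent variable.
The methodology of conducting the F test is as follows:
1. Hypotheses: $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ against $H_1$: at least one parameter $\beta_j \neq 0$.
2. Empirical value:
$$F_e = \frac{\sum (\hat{y}_i - \bar{y})^2 / k}{\sum (y_i - \hat{y}_i)^2 / (n - k - 1)}$$
3. For the given α, the theoretical value is $F_t = F_{(\alpha;\, k;\, n-k-1)}$, where k is the number of independent variables in the regression model.
4. If $F_e > F_t$, we reject $H_0$ and accept the alternative hypothesis; otherwise there is not sufficient evidence to reject $H_0$.
If we accept the alternative hypothesis, it can be considered that at least one of the independent (explanatory) variables involved in the model has a significant effect on the dependent variable.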
Example 3.4.
A sample of 34 shops in a store chain was selected for a marketing test. The dependent variable is the volume of sales, while the independent variables are price and cost of promotion:
Sales (units)   Price (KM)   Promotion cost (00 KM)   Sales (units)   Price (KM)   Promotion cost (00 KM)
4141 59 200 2730 79 400
3842 59 200 2618 79 400
3056 59 200 4421 79 400
3519 59 200 4113 79 600
4226 59 400 3746 79 600
4630 59 400 3532 79 600
3507 59 400 3825 79 600
3754 59 400 1096 99 200
5000 59 600 761 99 200
5120 59 600 2088 99 200
4011 59 600 820 99 200
5015 59 600 2114 99 400

1916 79 200 1882 99 400
675 79 200 2159 99 400
3636 79 200 1602 99 400
3224 79 200 3354 99 600
2295 79 400 2927 99 600
Create an appropriate regression model and analyze the results.
Solution:
It is a multiple regression model with two independent variables. The regression model is obtained by using Excel (Data analysis - Regression)26 and the results are summarized in the following Excel output:
SUMMARY OUTPUT
Multiple R          0.870475
R Square            0.757726
Adjusted R Square   0.742095
Standard Error      638.0653
Observations        34

ANOVA        df   SS         MS         F          Significance F
Regression   2    39472731   19736365   48.47713   2.86E-10
Residual     31   12620947   407127.3
Total        33   52093677

                 Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept        5837.521       628.1502         9.293192   1.79E-10   4556.4      7118.642
Price            -53.2173       6.852221         -7.76644   9.2E-09    -67.1925    -39.2421
Promotion cost   3.613058       0.685222         5.272828   9.82E-06   2.215538    5.010578
26 The database column with the dependent variable must be either the first or the last, because the independent variables must be given as "block" variables.
The interpretation of the Excel output is as follows:
The correlation coefficient (Multiple R) of 0.87 indicates that there is a strong association between the dependent variable and all independent variables jointly.
The determination coefficient (R Square) is 0.758, which indicates that 75.8% of the variation in the volume of sales is explained by price and cost of promotion.
The adjusted determination coefficient (Adjusted R Square) is 0.742.
The model error (Standard Error) is 638.07.
The results of ANOVA (analysis of variance) for the tested model:
– The first column gives the number of degrees of freedom: k = 2 for the regression, n - k - 1 = 31 for the residual and n - 1 = 33 in total, where k is the number of independent variables in the model and n is the number of observations (objects).
– The second column gives the sums of squared deviations (SS).
– The third column gives the mean squares (MS), calculated as the sum of squared deviations divided by the number of degrees of freedom.
– The fourth column gives the empirical value of the F test and the fifth column the corresponding p-value (Significance F).
– Since the p-value of the F test (2.86E-10) is lower than 0.05, we consider the model significant (at least one of the independent variables included in the model significantly influences the dependent variable).
The table also contains information on the model parameters which form the regression equation:
$$\hat{y} = 5837.521 - 53.2173\,x_1 + 3.6131\,x_2$$
where $x_1$ is the price and $x_2$ is the promotion cost.
If the price increases by 1 KM, the volume of sales decreases by 53.2173 units on average, provided that the investment in promotion does not change.
If the cost of promotion increases by 100 KM, the volume of sales increases by 3.6131 units on average, provided that the price does not change.
Further information on the parameters (coefficients) of the regression model:
A. Standard error estimates of these parameters.
B. Empirical t values (t Stat)27 for testing the significance of each parameter separately. When the empirical values lie outside the interval of theoretical values, we reject the null hypothesis; this holds for both coefficients here, so we consider both explanatory variables significant in the model.
C. p-value for testing the significance of each coefficient separately. Since the p-values of the variable coefficients in the model are less than 0.05, we conclude that the explanatory variables of the model are significant.
D. Lower and upper limits of the interval estimate for each parameter (based on the theoretical distribution and the standard error).
27 If n is less than 30 we use t distribution with (n-k-1) degrees of freedom.
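The summary measures in the Excel output of Example 3.4 can be cross-checked from the ANOVA sums of squares alone. The following sketch uses the standard definitions; the variable names are illustrative.

```python
import math

# Sums of squares and sample description from the ANOVA table of Example 3.4.
ss_regression = 39472731
ss_residual = 12620947
n, k = 34, 2                       # observations, independent variables

ss_total = ss_regression + ss_residual
df_regression, df_residual = k, n - k - 1

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)       # model error (RMSE)
f_empirical = ms_regression / ms_residual
multiple_r = math.sqrt(r_square)

print(round(multiple_r, 4), round(r_square, 4), round(adjusted_r_square, 4),
      round(standard_error, 2), round(f_empirical, 2))
```

The results agree with Multiple R, R Square, Adjusted R Square, Standard Error and F in the output, up to the rounding of the published sums of squares.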
3.18. INDICATOR – DUMMY VARIABLES
Previously, we considered only quantitative independent variables. A “dummy” (dichotomous, coded or indicator) variable is a derived, artificial numerical variable, which is used in regression analysis to represent subsets of the analyzed sample from the population.
In the simplest case, the indicator variable takes the values 0 and 1:
0 for elements in the control group or elements that are not in the target group (do not have the desired characteristic) and
1 for elements in the experimental group (with a specific treatment) or for elements that are in the target group (with the desired characteristic).
When designing the research, the indicator variable is often used to set boundaries between different groups. The indicator variable is very useful because it does not require the construction of separate regression models for each group or subset and makes it possible to use a single regression equation to represent different groups.
The indicator variable is used to include qualitative explanatory (independent) variables in the regression model. Another advantage of the indicator variable is that, although it is a nominal scale variable, it can be treated as if it were measured on the interval scale. For example, if we calculate the average of this variable, the result is interpreted as the proportion of observations with the value 1.
Examples of indicator variables:
indicator variable for gender: 1 if male, 0 if not
indicator for marital status: 1 if married, 0 if not
indicator for employment: 1 if employed, 0 if not
indicator for categorization according to urbanity: 1 if urban, 0 if not
3.18.1. Simple model with dummy variable
A simple regression model with a dummy variable is a model with only one independent “dummy” variable: $y_i = a + b\,d_i + e_i$, where:
$y_i$ - value of the dependent variable (the outcome) for the i-th object
a - intercept
b - slope coefficient
$d_i$ - dichotomous (dummy) variable: $d_i = 1$ if the i-th object belongs to the experimental group, $d_i = 0$ if it belongs to the control group
$e_i$ - residual (error) of the i-th object
To illustrate the indicator variable, we will further analyse the simple regression model with a “dummy” variable. The first step is to specify the dummy variable in the regression equation. For the control group $d_i = 0$ and for the experimental group $d_i = 1$. When the dummy is introduced in the regression, assuming that the residuals (errors) are on average equal to 0, the following equations are obtained:
For the control group ($d_i = 0$): $\hat{y}_i = a + b \cdot 0 = a$
For the experimental group ($d_i = 1$): $\hat{y}_i = a + b \cdot 1 = a + b$
We will calculate the difference between the groups, i.e. the difference between the regression models for the experimental group and the reference (control) group:
$$(a + b) - a = b$$
Therefore, the difference between the groups is the coefficient b.
3.18.2. Example of regression indicator variables in the simple model with a “dummy” variable
Let us take a concrete example of a simple regression model where the dependent variable is wage and the independent indicator variable is an indicator for marital status (1 if married, 0 if not):
$$\hat{y} = 798.44 + 178.61\,d$$
What is the interpretation of these model parameters?
Parameter a indicates that the expected wage of those who are not married is equal to 798.44 KM.
Parameter b indicates that, on average, the wage of persons who are married is 178.61 KM greater than the wage of persons who are not married.
The sum of parameters a and b indicates that those who are married have an average wage of 977.05 KM.
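That the intercept equals the mean of the reference group and the slope equals the difference between the group means can be checked numerically. The wages below are hypothetical and serve only as an illustration; they are not the data behind the example above.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b

# Hypothetical wages in KM (illustrative only); d = 1 if married, 0 if not.
wage = [800, 760, 840, 1000, 950, 990, 960]
married = [0, 0, 0, 1, 1, 1, 1]

a, b = fit_line(married, wage)
mean_not_married = sum(w for w, d in zip(wage, married) if d == 0) / married.count(0)
mean_married = sum(w for w, d in zip(wage, married) if d == 1) / married.count(1)

print(round(a, 2), round(b, 2), round(mean_not_married, 2), round(mean_married, 2))
```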
3.18.3. Example of multiple regression models with indicator and continuous variables as explanatory variables in the model
Let us take a concrete example of a regression model where the dependent variable is wage and the independent variables are:
an indicator variable for university degree (1 if finished university, 0 if not),
a continuous variable for the length of employment (in months):
$$\hat{y} = 275 + 162\,d + 6.3\,x_1$$
What is the interpretation of these parameters in the model?
Parameter a indicates that the expected wage of those who have not completed university and whose work experience is equal to 0 (who are just starting to work) is equal to 275 KM.
Parameter bd indicates that, on average, the wage of a person who finished university is 162 KM higher than the wage of a person who has not completed university, holding other things constant.
Parameter bx1 indicates that, if all other factors in the model remain unchanged, an increase in the length of service by 1 month leads to an increase in wage of 6.3 KM, on average.
Note: It is possible to include more continuous and indicator variables in the model. Interpretations remain the same; we interpret the parameter obtained for a given variable noting that other factors remain unchanged (controlled).
Example 3.5.
For a sample of 15 houses, the following information is known: the sale value (000 KM), size (00 m2) and possession of fire protection systems:
Sale value   Size   Possession of fire protection systems
84.4 2.00 yes
77.4 1.71 no
75.7 1.45 no
85.9 1.76 yes
79.1 1.93 no
70.4 1.20 yes
75.8 1.55 yes
85.9 1.93 yes
78.5 1.59 yes

79.2 1.50 yes
86.7 1.90 yes
79.3 1.39 yes
74.5 1.54 no
83.8 1.89 yes
76.8 1.59 no
Construct the model to predict the sales value of the house depending
on its size and information about the system of fire protection. Interpret
the parameters obtained.
Solution:
Since the variable possession of the fire protection system indicates the absence/presence of the system, we need to create a dummy variable: d = 1 if the house has a fire protection system (yes) and d = 0 if it does not (no).
We will use the Excel IF function to create the dummy variable and then Copy-Paste the formula to obtain the dummy in every cell:
Sale value - y   Size - x   Possession of fire protection systems   d
84.4 2 yes 1
77.4 1.71 no 0
75.7 1.45 no 0
85.9 1.76 yes 1
79.1 1.93 no 0
70.4 1.2 yes 1
75.8 1.55 yes 1
85.9 1.93 yes 1
78.5 1.59 yes 1
79.2 1.5 yes 1
86.7 1.9 yes 1
79.3 1.39 yes 1
74.5 1.54 no 0
83.8 1.89 yes 1
76.8 1.59 no 0
Appropriate regression model is:
$$\hat{y} = a + b_1 x + b_2 d$$
The model is evaluated using multiple regression statistics (EXCEL - Data analysis):
SUMMARY OUTPUT
Multiple R          0.900587
R Square            0.811057
Adjusted R Square   0.779567
Standard Error      2.262596
Observations        15

ANOVA        df   SS         MS        F          Significance F
Regression   2    263.7039   131.852   25.75565   4.55E-05
Residual     12   61.43209   5.11934
Total        14   325.136

                                        Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept                               50.09049       4.351658         11.51067   7.68E-08   40.60904    59.57194
Size                                    16.18583       2.574442         6.287124   4.02E-05   10.57661    21.79506
Possession of fire protection systems   3.852982       1.241223         3.104183   0.009119   1.148591    6.557374
Interpretations are as follows:
The correlation coefficient of 0.9 indicates that there is a strong association between the dependent variable and all independent variables jointly.
The determination coefficient indicates that 81.11% of the variation in the sale value is explained by the house size and possession of the fire protection system.
The adjusted determination coefficient is 0.7796.
The model error is 2.26.
The results of ANOVA (analysis of variance) for the tested model: since the p-value of the F test (Significance F = 4.55E-05) is lower than 0.05, we consider the model significant (at least one of the independent variables included in the model is significant and has an influence on the dependent variable).
Further in the table are the parameters of the model, which form the regression equation:
$$\hat{y} = 50.090 + 16.186\,x + 3.853\,d$$
Interpretation of the coefficients is:
– For each additional 100 square meters the sale value is higher by 16.186 thousand KM on average, if other variables stay the same.
– A house that possesses a fire protection system has, on average, a sale value greater by 3.853 thousand KM than a house without a fire protection system, if other variables stay the same.
In addition to the parameters or regression coefficients, the Excel output also contains information on:
– standard error estimates of these parameters
– te (t Stat) for testing the significance of each parameter separately. First we have to find the theoretical interval. With n = 15 and k = 2 there are n - k - 1 = 12 degrees of freedom, so for α = 0.05 the theoretical value is t(0.025; 12) = 2.179 and the interval is (-2.179; 2.179). Since the empirical values (t Stat in the table, next to the parameters) are outside the theoretical interval, we accept the alternative hypothesis and consider both explanatory variables significant in the model.
– p-value for testing the significance of each parameter separately. Since all these values are less than the specified level of Type I error of 5%, we reject the null hypothesis and consider both explanatory variables to be significant in the model.
– lower and upper limits of the interval estimate of each parameter (based on the theoretical distribution and the standard error)
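The coefficients of Example 3.5 can be cross-checked without Excel. The sketch below builds the dummy variable from the yes/no answers (the same role the Excel IF function plays) and solves the least squares normal equations directly; the helper functions and variable names are illustrative and are not part of the Excel procedure.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Data of Example 3.5: sale value (000 KM), size (00 m2), fire protection system.
sale_value = [84.4, 77.4, 75.7, 85.9, 79.1, 70.4, 75.8, 85.9,
              78.5, 79.2, 86.7, 79.3, 74.5, 83.8, 76.8]
size = [2.00, 1.71, 1.45, 1.76, 1.93, 1.20, 1.55, 1.93,
        1.59, 1.50, 1.90, 1.39, 1.54, 1.89, 1.59]
fire_protection = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes",
                   "yes", "yes", "yes", "yes", "no", "yes", "no"]

# Dummy variable: 1 if the house has a fire protection system, 0 otherwise.
d = [1 if answer == "yes" else 0 for answer in fire_protection]

# Normal equations for y = a + b1 * size + b2 * d.
columns = [[1] * len(sale_value), size, d]
matrix = [[dot(ci, cj) for cj in columns] for ci in columns]
rhs = [dot(ci, sale_value) for ci in columns]
a, b1, b2 = solve_linear_system(matrix, rhs)

print(round(a, 3), round(b1, 3), round(b2, 3))
```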
3.19. CONDITIONS FOR ECONOMETRIC MODELS
The linear regression model $y_i = a + b\,x_i + e_i$ has two parts. The first part of the model (a + b·xi) represents a functional relationship in which Y is linearly dependent on X, if the other factors are constant. The second, stochastic part of the model (ei) represents the random variation, which takes into account the effect of changes in other variables that are not explicitly included in the model.
Provided that the specification of the model corresponds to economic reality and practice, the problem of measuring economic relations becomes a problem of statistical estimation of parameters, and the probability distribution of the errors must meet the assumptions of the linear regression model. These assumptions are as follows:
a) E(ei) = 0 (the expected value of the errors is equal to zero)
b) E(ei²) = σ² (constant common variance - homoskedasticity)
c) E(ei · ej) = 0 for each i ≠ j (independence, there is no autocorrelation in the stochastic part)
d) ei ~ N(0, σ²) (normality) - this assumption points to the absence of extreme data in the sample, i.e. outlier values of Xt and Yt that are very distant from the other values
e) ei is independent of Xj for each i, j.
To estimate the parameters of the regression model it is necessary to choose the estimator (formula) that gives their best estimates. Estimators should have the following characteristics:
1. Unbiasedness
2. Consistency
3. Efficiency
4. Best linear unbiasedness (BLUE).
3.19.1. Assumptions of the regression models
Multicollinearity
First, we inspect the correlation matrix. The rule of thumb says that if the correlation coefficient between the independent variables is higher than 0.8 (Gujarati, 2004, p.359), there could be a problem of multicollinearity.
VIF (Variance Inflation Factor):
$$VIF = \frac{1}{1 - R^2}, \qquad Tolerance = \frac{1}{VIF} = 1 - R^2$$
where R2 is the determination coefficient in the multiple regression model (for the VIF of a given independent variable, the model in which that variable is regressed on the other independent variables).
If VIF > 10 and Tolerance < 0.1, the assumption of noncollinearity is not met.
Eigenvalue (the total amount of variance of the independent variables which can be explained). If it is greater than 1, it indicates that the assumption of noncollinearity is not met.
Condition index (CI) – square roots quotient successive Eigen values:
and more than two proportions variances for the

independent variables are greater than 0.5 weak dependence
between the independent variables.
independent variables are greater than 0.5 medium dependence
between the independent variables.
independent variables are greater than 0.5 strong dependence
between the independent variables assumption of noncollinearity
is not met.
How to solve the problem of multicollinearity?
Combine related independent variables into one (the average z score of the independent variables, factor analysis ...).
Eliminate some of the independent variables which are interdependent.
Collect more data on the analyzed variables in order to reduce the problem, re-estimate the model with the new data and verify whether the problem of multicollinearity persists.
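For a model with only two independent variables, the VIF and the tolerance can be computed from their simple correlation, because the R² of one variable regressed on the other equals r². The sketch below relies on that special case; the function names and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Simple linear correlation coefficient of two variables."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def vif_two_predictors(x1, x2):
    """VIF and tolerance when the model has exactly two independent variables."""
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2
    return 1 / tolerance, tolerance

# Hypothetical values of two independent variables (illustrative only).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif, tolerance = vif_two_predictors(x1, x2)
print(round(vif, 3), round(tolerance, 3), vif > 10 and tolerance < 0.1)
```

With more than two independent variables each VIF requires its own auxiliary regression, which statistical software usually reports directly.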
Outliers
Outliers exist where standardized residuals have values . There

are several ways to detect outliers through appropriate tests:
Distance - analysis residuals. It is important that no more than 5% of

the standardized residuals have a value of
Leverage value (calculated as a new variable). Cases where the value is greater than 0.04 should be reviewed as potential outliers.
Cook’s D value (calculated as a new variable). Cases where the value is greater than (4/n) should be reviewed as potential outliers. A high Cook’s D value indicates outliers.
Standardized Dfbeta indicates the change in the regression coefficients if outliers are excluded. Cases where the absolute value is greater than (2/√n) should be reviewed as potential outliers. A high Dfbeta value indicates outliers.
Normality
After constructing the regression model, we can determine a new variable - the residuals. The Kolmogorov-Smirnov test checks whether the distribution of the residuals meets the assumption of normality. We use the Kolmogorov-Smirnov test for samples of 50 or more observations. The result is the empirical z value. The null hypothesis of the test is that the distribution of the analyzed variable is normal. The p-value of the Kolmogorov-Smirnov test is considered statistically significant if it is lower than 0.05, since the test works with a type I error of 5%. In this case we reject the null hypothesis and conclude that the distribution of the analyzed variable does not meet the assumption of normality. Otherwise (if the p-value of the KS test is higher than 0.05) there is not sufficient evidence to reject the null hypothesis, i.e. we can conclude that the distribution of the analyzed variable satisfies the assumption of normality.
Autocorrelation
The Durbin-Watson test is used to detect autocorrelation. A DW value equal to 2 indicates that there is no autocorrelation. As a rule of thumb, if the Durbin-Watson statistic is statistically significantly smaller than 2, there is evidence of positive serial correlation. A rough rule indicates that if the Durbin-Watson statistic is less than 1, it is cause for alarm because of autocorrelation. If the DW statistic lies in the interval from 2 to 4, it points toward negative serial correlation.
According to the position of the empirical value of DW in the interval between 0 and 4, relative to the tabulated lower and upper bounds dL and dU, we can conclude the following:
1. 0 < DW < dL: positive autocorrelation
2. dU < DW < 4 - dU: no autocorrelation
3. 4 - dL < DW < 4: negative autocorrelation
4. dL ≤ DW ≤ dU or 4 - dU ≤ DW ≤ 4 - dL: the test is inconclusive
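The DW statistic itself is easy to compute from the residuals: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses two hypothetical residual series to show how the value moves away from 2.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series (illustrative only).
alternating = [1, -1, 1, -1]            # signs flip every period
persistent = [1, 1, 1, -1, -1, -1]      # runs of equal signs

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
print(round(dw_alternating, 3), round(dw_persistent, 3))
```

The alternating series gives a value above 2 (toward negative autocorrelation), the persistent series a value below 1 (toward positive autocorrelation).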
Heteroskedasticity
The Goldfeld-Quandt test compares the sums of squared residuals after dividing the sample into two subsamples. Heteroskedasticity mainly arises in models with cross-section data rather than in models with time series data, because the variance across different cross-section units is greater than the variance within the same units at different points in time. We estimate two regressions for the two subsamples and use the F test to compare the residual variation. Hypothesis H0 is not rejected if there is no significant difference between the sums of squared residuals.
The data need to be sorted according to the independent variable that can be a source of heteroskedasticity. Divide the observations into two subsamples, run a regression for each of them and calculate the residuals. We then test whether the residual variances from the two subsamples are the same, using Levene's test (a test of equality of variances). If the residual variances from the different subsamples are not equal, there is a problem of heteroskedasticity.
This problem could be solved by weighted regression, using as weight the square root of the inverse of the variable that is the source of heteroskedasticity.
3.1. There has been extensive discussion in media all over the world about the unproductive public sector labour force in Greece, especially in light of the crisis that Greece is currently facing. Foreign analysts complain about the high salaries that workers receive for their poor performance. To see how workers’ earnings affect their productivity, we collect data on average earnings and the workers’ productivity index in five public institutions in Greece. The data are given in the table:
Institution   Workers’ productivity index   Average earnings (in 00KM)
I 103.3 139
II 103.9 140
III 104 140.5
IV 104.5 141
V 104.8 143
a) Plot a scatter diagram of the data.

b) What regression equation best predicts workers’ productivity, based on the average earnings of employees?
c) How well does the regression fit the data?
Solution:
a) The following scatter diagram can be drawn:
We want to determine how average workers’ earnings affect workers’ productivity. We plot the scatter diagram with the workers’ productivity index as the dependent variable (Y) and average earnings as the independent variable (X). Hence, we put the workers’ productivity index on the vertical axis (the y-axis) and the average earnings on the horizontal axis (the x-axis). In the scatter plot, workers’ productivity appears to have an upward trend, i.e. workers’ productivity increases as their average earnings increase.
b) A straight line drawn on a graph can be represented by a linear equation of the form:
$$\hat{y} = a + b\,x$$
To obtain the values of the regression coefficients, the Least-Squares Method is used. According to this method, the formulas for the calculation of the coefficients are:
Intercept: $a = \bar{y} - b\,\bar{x}$
Slope: $b = C_{xy} / \sigma_x^2$, where $C_{xy}$ is the covariance of x and y and $\sigma_x$ is the standard deviation of x
All sums needed for these formulas (i.e. for the means, the covariance and the standard deviations) are obtained in the following working table.
x y x·y x² y²
139 103.3 14358.7 19321 10670.89
140 103.9 14546 19600 10795.21
140.5 104 14612 19740.25 10816
141 104.5 14734.5 19881 10920.25
143 104.8 14986.4 20449 10983.04
Total: 703.5 520.5 73237.6 98991.25 54185.39
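The values of the means, covariance and variances needed to
calculate the coefficients are:
$\bar{x} = 703.5/5 = 140.7$, $\bar{y} = 520.5/5 = 104.1$
$C_{XY} = 73237.6/5 - 140.7 \cdot 104.1 = 0.65$
$\sigma_X^2 = 98991.25/5 - 140.7^2 = 1.76$, $\sigma_Y^2 = 54185.39/5 - 104.1^2 = 0.268$
Substituting these values into the formulas for the
coefficients, we obtain the following results:
$b = 0.65/1.76 \approx 0.3693$, $a = 104.1 - 0.3693 \cdot 140.7 \approx 52.14$
So the equation of our fitted line is:
$\hat{y} = 52.14 + 0.369x$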
Interpretation of regression coefficients:
The intercept (coefficient a) tells us that if average earnings are
0, we expect the productivity index to be equal to 52.14.
The slope (coefficient b) tells us that if average earnings increase by
one unit (100 KM), we expect the productivity index to increase by about
0.37 points, on average.
c) One way to assess fit is to check the coefficient of determination,
which can be computed from the following formula:
$r^2 = \dfrac{C_{XY}^2}{\sigma_X^2 \sigma_Y^2} = \dfrac{0.65^2}{1.76 \cdot 0.268} \approx 0.8957$
By using workers' average earnings as a predictor, we have explained
89.57% of the variance in productivity. This is considered a good fit
to the data, in the sense that it will substantially improve our ability to
predict the productivity index of workers in the public sector in Greece by
observing the average earnings of workers.
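The arithmetic in parts b) and c) can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original solution: the variable names are illustrative assumptions, and the sums are the column totals of the working table above.

```python
# Column totals from the working table of problem 3.1
# (x = average earnings in 00 KM, y = productivity index).
n = 5
sum_x, sum_y = 703.5, 520.5
sum_xy, sum_x2, sum_y2 = 73237.6, 98991.25, 54185.39

mean_x = sum_x / n
mean_y = sum_y / n
cov_xy = sum_xy / n - mean_x * mean_y   # covariance
var_x = sum_x2 / n - mean_x ** 2        # variance of x
var_y = sum_y2 / n - mean_y ** 2        # variance of y

b = cov_xy / var_x                      # slope
a = mean_y - b * mean_x                 # intercept
r2 = cov_xy ** 2 / (var_x * var_y)      # coefficient of determination

print(round(a, 2), round(b, 4), round(r2, 4))
```

Working from the totals mirrors the hand calculation; with larger data sets it is usually safer to work from deviations about the means, because subtracting two nearly equal totals loses precision.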
3.2. According to the World Health Organization, obesity has reached
epidemic proportions globally. Particularly worrying is childhood
obesity, which is increasing constantly. One of the factors often
mentioned as a cause is the rise in family income, which gives
way to more varied diets with a higher proportion of fats, saturated
fats and sugars. To check the validity of these claims, we undertake
a nutrition study in a large city. A sample of 6 seven-year-old children
was weighed and their family incomes were estimated. The following
results were recorded:
Monthly family income (in KM) Weight (in kg)

1000 23
1150 25.5
1100 25
1300 27
1600 30
1400 28

a) Draw a scatter diagram.
b) Determine and explain the parameters of the corresponding regression
model.
c) What could be concluded about direction and strength of the linear
association between variables in the model?
d) If the monthly income is 1500 KM, what is the expected weight of a
child?
Solution:
a) Since we are interested in determining the weight of a child when we
know the family income, weight is the dependent variable (Y) and
family income is the independent variable (X). Hence, we put weight on
the vertical axis (the y-axis) and income on the horizontal axis (the
x-axis). From the scatter plot we conclude that weight increases
as monthly income increases.
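b) The regression equation is a linear equation of the form:
$\hat{y} = a + bx$
According to the Least-Squares Method, the formulas for the calculation of the
coefficients are:
Intercept: $a = \bar{y} - b\bar{x}$
Slope: $b = \dfrac{C_{XY}}{\sigma_X^2}$
All sums needed to evaluate these formulas will be obtained in the following
working table.
x y x·y x² y²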
1000 23 23000 1000000 529
1150 25.5 29325 1322500 650.25
1100 25 27500 1210000 625
1300 27 35100 1690000 729
1600 30 48000 2560000 900
1400 28 39200 1960000 784
Total: 7550 158.5 202125 9742500 4217.25
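The values of the means, covariance and variances needed to
calculate the coefficients are:
$\bar{x} = 7550/6 \approx 1258.33$, $\bar{y} = 158.5/6 \approx 26.42$
$C_{XY} \approx 446.53$, $\sigma_X^2 \approx 40347.22$, $\sigma_Y^2 \approx 5.035$
Substituting these values into the formulas for the
coefficients, we obtain the following results:
$b = C_{XY}/\sigma_X^2 \approx 0.011$, $a = \bar{y} - b\bar{x} \approx 12.49$
So the equation of our fitted line is:
$\hat{y} = 12.49 + 0.011x$
Interpretation of regression coefficients: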
The intercept shows that if a family has no monthly income (income
is 0), the expected weight of a child is 12.49 kg.
The slope shows that if monthly family income increases by 1 KM, we expect
weight to increase by 0.011 kg (or 11 grams), on average.
c) The direction and strength of the linear relationship are assessed by the
coefficient of correlation:
$r = \dfrac{C_{XY}}{\sigma_X \sigma_Y} \approx 0.99$
The coefficient of correlation is positive and close to 1, so we conclude that
the relationship between monthly family income and the child's weight is direct
and strong.
d) $\hat{y} = 12.49 + 0.011 \cdot 1500 = 28.99$
If the monthly family income is 1500 KM, the estimated child's weight
is 28.99 kg.
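The fit and the prediction can also be reproduced directly from the raw data. The sketch below is an illustration only: the helper name `fit_line` is an assumption, not something defined in this book. It also shows how much the rounding of the coefficients affects the prediction.

```python
incomes = [1000, 1150, 1100, 1300, 1600, 1400]   # monthly family income, KM
weights = [23, 25.5, 25, 27, 30, 28]             # child's weight, kg

def fit_line(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b = s_xy / s_xx                   # slope
    a = mean_y - b * mean_x           # intercept
    r = s_xy / (s_xx * s_yy) ** 0.5   # coefficient of correlation
    return a, b, r

a, b, r = fit_line(incomes, weights)
exact = a + b * 1500              # prediction with unrounded coefficients
rounded = 12.49 + 0.011 * 1500    # prediction with the rounded coefficients
print(round(a, 2), round(b, 4), round(r, 3), round(exact, 2), round(rounded, 2))
```

With unrounded coefficients the prediction is about 29.09 kg; the value 28.99 kg in the solution comes from rounding the slope to 0.011 before substituting.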
3.3. The data on monthly loan payment and amount of monthly savings
in 6 households are given in the following table:
Monthly loan payment (00 KM)   Monthly savings (00 KM)
5 1.5
4.8 2
2.5 3
3.8 2.4
4 2.2
1.2 3.8

a) Draw a scatter diagram.
b) Determine and explain the parameters of the corresponding regression
model.
c) What percentage of the variation in monthly savings is explained by
your model? What could be concluded about direction and strength
of the linear association between variables in the model?
Solution:
a) The independent variable (monthly loan payment) is shown on the
horizontal x-axis, while the dependent variable (monthly savings) is shown
on the vertical y-axis. There appears to be a downward trend: monthly
savings fall as the monthly loan payment rises.
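b) According to the Least-Squares Method, the formulas for the calculation of
the regression coefficients are:
Intercept: $a = \bar{y} - b\bar{x}$
Slope: $b = \dfrac{C_{XY}}{\sigma_X^2}$
All sums needed to evaluate these formulas will be obtained in the following
working table.
x y x·y x² y²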
5 1.5 7.5 25 2.25
4.8 2 9.6 23.04 4
2.5 3 7.5 6.25 9
3.8 2.4 9.12 14.44 5.76
4 2.2 8.8 16 4.84
1.2 3.8 4.56 1.44 14.44
Total: 21.3 14.9 47.08 86.17 40.29
The intercept is a = 4.397: if the monthly loan payment is
equal to 0 KM, a household is expected to save 439.7 KM on a
monthly basis.
The slope is b = -0.54: if the monthly loan payment increases by one unit
(100 KM), the amount of monthly savings will, on average, decrease by
0.54 units (54 KM).
Finally, the regression equation is:
$\hat{y} = 4.397 - 0.54x$
c) To determine what percentage of the variation in monthly
savings is explained by our model, the coefficient of determination is
used:
$r^2 = \dfrac{C_{XY}^2}{\sigma_X^2 \sigma_Y^2} \approx 0.97$
About 97% of the variability in monthly
savings can be explained by the variability in monthly loan payments.
This is considered a good fit to the data.
As already noted, the covariance $C_{XY}$ and the parameter b
are both negative. The direction of the relationship between the variables can
also be examined by observing the sign of $C_{XY}$. In this example, the covariance
is negative; therefore the relationship between the variables is
indirect.
The correlation coefficient is also negative:
$r \approx -0.99$, which indicates an indirect and strong relationship.
3.4. The table presents the production volume and costs in one
international company that were recorded during 6 year period:
Year   Production volume (000 pieces)   Production costs (000 KM)
1 4 100
2 6 146
3 8 178
4 10 220
5 12 256
6 13 280
a) Draw a scatter plot. Is there a significant linear relationship between

production volume and production cost?
b) Calculate and explain coefficient of correlation and coefficient of
determination.
c) Determine the functional form of the regression and explain
parameters.
d) If the production volume is 15,000 pieces, what is the expected level
of production costs?
Solution:
a) The scatter plot presents the independent variable (production volume)
on the horizontal x-axis, while the dependent variable (production costs)
is given on the vertical y-axis.
Since the data points can be approximated with a straight line, we can
conclude there is strong evidence of a linear relationship between the
variables. The upward sloping line indicates that the relationship is positive
and direct, i.e. an increase in production volume will tend to increase
production costs.
b) In order to calculate the coefficient of determination, the following
formula is used:
$r^2 = \dfrac{C_{XY}^2}{\sigma_X^2 \sigma_Y^2}$
All sums needed for the calculation of the correlation coefficient will be
obtained in the following working table.
x y x·y x² y²
4 100 400 16 10000
6 146 876 36 21316
8 178 1424 64 31684
10 220 2200 100 48400
12 256 3072 144 65536
13 280 3640 169 78400
Total: 53 1180 11612 529 255336
Next, the covariance and standard deviations are calculated:
The calculated values are then substituted into the formula for the coefficient of
determination:
The coefficient of determination shows that 99.91% of the variability in production
costs can be explained by the variability in production volume.
This is considered a good fit to the data.
The coefficient of correlation is positive and close to 1, which indicates that
the relationship between production volume and production costs is direct
and strong.
c) The regression equation is a linear equation of the form:
$\hat{y} = a + bx$
The regression coefficients are:
$a = 24.6$, $b = 19.49$
If the production volume is 0 pieces, we expect production costs to be
equal to 24,600 KM (fixed cost).
If the production volume increases by 1 piece, we expect production
costs to increase by 19.49 KM, on average.
The fitted regression equation is:
$\hat{y} = 24.6 + 19.49x$
d) $\hat{y} = 24.6 + 19.49 \cdot 15 = 316.95$
For a production volume of 15,000 pieces, we expect production costs
to amount to 316,950 KM.
3.5. In order to determine effect that the costs of advertising (x) have
on sales volume (y), we collected data at 10 different shopping
malls and obtained the following result:
The costs of advertising - x Volume of sales - y

18 55
7 17
14 36
31 85
21 62
5 18
11 33
16 41
26 63
29 87
a) Draw a scatter diagram.

b) Determine the functional form, the parameters of the corresponding
regression model and strength of the relationship between the
advertising costs and the volume of sales.
c) For the costs of advertising of 30 $, how much of the sales volume is
expected?
d) Determine the strength of the correlation by using Spearman’s rank
correlation coefficient.
Solution:
a) Scatter diagram is:
We will create a working table with all the sums needed for our calculation
(x = costs of advertising, y = volume of sales).
x y x·y x² y² rx ry rx - ry (rx - ry)²
18 55 990 324 3025 6 6 0 0
7 17 119 49 289 2 1 1 1
14 36 504 196 1296 4 4 0 0
31 85 2635 961 7225 10 9 1 1
21 62 1302 441 3844 7 7 0 0
5 18 90 25 324 1 2 -1 1
11 33 363 121 1089 3 3 0 0
16 41 656 256 1681 5 5 0 0
26 63 1638 676 3969 8 8 0 0
29 87 2523 841 7569 9 10 -1 1
Total: 178 497 10820 3890 30311 4
b) To fit the regression, we need to determine the regression parameters a
and b.
First, parameter b is determined:
Parameter a has the following value:
Finally, the regression equation that describes the relationship between the volume of
sales and the costs of advertising is:
The strength of the relationship between the variables is determined by the
coefficient of correlation:
The relationship between the costs of advertising and the volume of sales is direct
and strong.
c)
If the cost of advertising is 30 $, the estimated sales volume is 82.96 $.
d) $r_s = 1 - \dfrac{6\sum d^2}{n(n^2-1)} = 1 - \dfrac{6 \cdot 4}{10 \cdot 99} \approx 0.976$, which indicates a strong and direct relationship.
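Part d) uses only the last column of the working table. As a minimal sketch (the function name `spearman_rs` is an illustrative assumption), Spearman's coefficient can be computed from the rank differences like this:

```python
def spearman_rs(sum_d2, n):
    """Spearman's rank correlation from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Rank differences d = rx - ry, copied from the working table of problem 3.5
d = [0, 1, 0, 1, 0, -1, 0, 0, 0, -1]
sum_d2 = sum(v ** 2 for v in d)
r_s = spearman_rs(sum_d2, len(d))
print(sum_d2, round(r_s, 4))  # 4 0.9758
```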
3.6. The qualification ranks and efficiency scores of 6
employees are given in the table below:
Worker B E D A F C
Qualification rank 1 2 3 4 5 6
Efficiency score 25 30 23 21 18 20
On the basis of Spearman's rank correlation coefficient, assess the
strength and direction of the relationship between the qualifications and
efficiency of workers.
Solution:
Starting from the formula for Spearman's rank correlation coefficient
$r_s = 1 - \dfrac{6\sum d^2}{n(n^2-1)}$
a working table is formed, where x is the efficiency score, rx the
qualification rank and ry the efficiency rank (rank 1 for the highest score).
Worker x rx ry d = rx - ry d²
B 25 1 2 -1 1
E 30 2 1 1 1
D 23 3 3 0 0
A 21 4 4 0 0
F 18 5 6 -1 1
C 20 6 5 1 1
Σ 4
$r_s = 1 - \dfrac{6 \cdot 4}{6 \cdot 35} \approx 0.886$
Spearman's rank correlation coefficient is positive and close to 1, which
indicates that the relationship between the qualifications of workers and
their efficiency is direct and strong.
3.7. The results of examination of the average monthly sales and

psychophysical ability (obtained by psychophysical performance
test) of sellers are given in the following table:
Monthly sales (in 1000 $)   Test results
10                          55
11                          62
29                          80
12                          62
20                          70
13                          62
24                          75
18                          80
15                          65
a) Determine and explain the strength of correlation between these

phenomena, using correlation coefficient.
b) On the basis of Spearman’s rank correlation coefficient, assess the
strength and direction of relationship between these phenomena.
Solution:
a) Starting with the calculations shown in the working table
y    x    x·y    y²    x²     ry   rx    d = rx - ry   d²
10   55   550    100   3025   1    1     0             0
11   62   682    121   3844   2    3     1             1
29   80   2320   841   6400   9    8.5   -0.5          0.25
12   62   744    144   3844   3    3     0             0
20   70   1400   400   4900   7    6     -1            1
13   62   806    169   3844   4    3     -1            1
24   75   1800   576   5625   8    7     -1            1
18   80   1440   324   6400   6    8.5   2.5           6.25
15   65   975    225   4225   5    5     0             0
Σ 152 611 10717  2900  42107                           10.5
we calculate the correlation coefficient to assess the strength of the correlation
between the variables:
$r = \dfrac{C_{XY}}{\sigma_X \sigma_Y} \approx 0.87$
The correlation coefficient amounts to 0.87, therefore the relationship
between the observed variables is direct and strong. 75.79% of the variability
in average monthly sales can be explained by the variability in the
psychophysical performance of sellers.
b) $r_s = 1 - \dfrac{6 \cdot 10.5}{9 \cdot 80} \approx 0.91$
Spearman's rank correlation coefficient is positive and close to 1, which
indicates that the relationship between psychophysical performance
and average monthly sales is direct and strong.
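In this problem several sellers share the same test result, so tied observations receive the average of the ranks they occupy (3 for the three scores of 62, and 8.5 for the two scores of 80). The following Python sketch illustrates that ranking step; the helper name `average_ranks` is an assumption introduced here. As in the solution, it applies the simple formula even though ties are present, which should be read as an approximation.

```python
def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(values)
    ranks = []
    for v in values:
        first = order.index(v) + 1    # first position occupied by v
        count = order.count(v)        # number of observations tied at v
        ranks.append(first + (count - 1) / 2)
    return ranks

sales = [10, 11, 29, 12, 20, 13, 24, 18, 15]    # monthly sales, 1000 $
scores = [55, 62, 80, 62, 70, 62, 75, 80, 65]   # test results

ry = average_ranks(sales)
rx = average_ranks(scores)
sum_d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
n = len(sales)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(rx, sum_d2, r_s)
```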
3.8. There have been significant changes in the clothing market since
the beginning of the 21st century. The expansion of the discount
fashion sector and the increasing number and variety of competitors
(supermarket chains are becoming a more and more important factor in
the clothing market) are just a few. In this competitive environment,
decision making is becoming more complex and requires more
information. The marketing manager of a popular clothing brand
would like to determine the effect of advertising expenditure on
the sales of clothes. To test the effectiveness of advertising, a
random sample of 5 markets is selected and the following values are
recorded:
Market   Total sale (in 000 KM)   Advertising expenditure (in 00 KM)
I 5 1.6
II 7 2.2
III 4 1.4
IV 6 1.9
V 10 2.4
a) Draw a scatter plot.
b) Calculate and explain the correlation coefficient.
c) What percentage of the variation in sales is explained by your
model?
Solution:
a) The scatter plot presents the independent variable (advertising expenditure)
on the horizontal x-axis, while the dependent variable (total sale) is given on
the vertical y-axis.
Since a straight line is an appropriate approximation for the data points, we can
conclude there is evidence of a linear relationship between the variables.
The upward sloping line indicates that the relationship is positive and direct,
i.e. an increase in advertising expenditure will tend to increase total sale.
b) In order to calculate the coefficient of correlation, the following formula
is used:
$r = \dfrac{C_{XY}}{\sigma_X \sigma_Y}$
x y x·y x² y²
1.6 5 8 2.56 25
2.2 7 15.4 4.84 49
1.4 4 5.6 1.96 16
1.9 6 11.4 3.61 36
2.4 10 24 5.76 100
Total: 9.5 32 64.4 18.73 226
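Next, the covariance and variances are calculated as:
$C_{XY} = 64.4/5 - 1.9 \cdot 6.4 = 0.72$, $\sigma_X^2 = 18.73/5 - 1.9^2 = 0.136$, $\sigma_Y^2 = 226/5 - 6.4^2 = 4.24$
The calculated values are substituted into the formula for the
correlation coefficient:
$r = \dfrac{0.72}{\sqrt{0.136 \cdot 4.24}} \approx 0.9481$
The relationship between advertising expenditure and total sale is direct
and strong.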
c) We use the coefficient of determination (r²) to assess how well
the model fits the data, i.e. what percentage of the variance in the dependent
variable (sales) is explained by the model. Its value follows from the
correlation coefficient (r² = 0.9481² = 0.8989) and shows that 89.9% of the
variability in sales can be explained by the variability in advertising
expenditure. This is considered a good fit to the data.
3.9. A sports equipment manufacturer wants to launch a new advertising
strategy with the message that physical activity is
important for one's figure and overall health. To check the validity of its
claims, the marketing team observed, for six months, the time (in
minutes) that a group of women of the same height (165 cm) and
weight (62 kg) spent in a gym, and recorded their weight afterwards.
The data are presented in the table:
Time spent in gym (minutes)   Weight (in kg)
30                            60
60                            59
90                            57
120                           55
140                           54.5
160                           53
a) Draw a scatter plot.

b) What percentage of the variation in weight is explained by your
model?
c) Determine the functional form of the regression and explain
parameters.
d) If the time spent in gym is 180 minutes, what is the expected person’s
weight?
Solution:
a) The scatter plot presents the independent variable (minutes spent in the gym) on
the horizontal x-axis, while the dependent variable (weight) is given on
the vertical y-axis.
Since the data points can be approximated with a straight line, we can
conclude there is strong evidence of a linear relationship between the
variables. The downward sloping line indicates a negative (indirect)
relationship between the variables.
b) In order to determine what percentage of the variation in weight is
explained by our model, the coefficient of determination is used:
$r^2 = \dfrac{C_{XY}^2}{\sigma_X^2 \sigma_Y^2}$
x y x·y x² y²
30 60 1800 900 3600
60 59 3540 3600 3481
90 57 5130 8100 3249
120 55 6600 14400 3025
140 54.5 7630 19600 2970.25
160 53 8480 25600 2809
Total: 600 338.5 33180 72200 19134.25
Next, the covariance and variances are calculated:
$C_{XY} \approx -111.67$, $\sigma_X^2 \approx 2033.33$, $\sigma_Y^2 \approx 6.2014$
The calculated values are substituted into the formula for the coefficient of
determination:
$r^2 = \dfrac{(-111.67)^2}{2033.33 \cdot 6.2014} \approx 0.9889$
The coefficient of determination shows that 98.89% of the variability in weight
can be explained by the time spent in the gym. This is considered a good
fit to the data.
c) The regression equation is a linear equation of the form:
$\hat{y} = a + bx$
The regression coefficients are:
$a = 61.908$, $b = -0.0549$
If the time spent in the gym is 0, we expect weight to be equal to
61.908 kg.
If the time spent in the gym increases by 1 minute, we expect weight to
decrease by 0.0549 kg, on average.
The fitted regression equation is:
$\hat{y} = 61.908 - 0.0549x$
d) $\hat{y} = 61.908 - 0.0549 \cdot 180 = 52.026$
If a person spends 180 minutes in the gym, the expected weight is
52.026 kg.
3.10. The percentage of rural population and the number of newborns (per
1000 residents) over a period of 6 years were:
Percentage of rural population (%)   Number of newborns (per 1000 residents)
17 4
24 6
26 9
29 11
34 13
42 18
a) Draw the scatter diagram.

b) Determine the functional form of the regression and explain
parameters.
c) Determine and explain coefficients of correlation and determination.
Answer: b)
3.11. Data have been collected to show that tenure affects a worker's
monthly earnings (assuming that other worker characteristics,
such as educational level or job responsibilities, are the same):
Tenure (years)   Worker's monthly earnings (in KM)
3 1200
5 1280
8 1350
10 1380
14 1400
17 1450
a) Which is the dependent and which is the independent variable in the model?
b) Plot the data and determine the nature of the relationship between the
variables.
c) Determine the regression equation and explain the parameters.
Answer: a) tenure is the independent variable and monthly earnings is the
dependent variable
c)
3.12. Scientists believe that there is an association between cigarette smoking
and learning performance. In order to check the validity of their
claims, they gather data on daily cigarette consumption (expressed
as the number of cigarettes consumed) and students' performance
(expressed as average grade) for a sample of 6 students:
Daily cigarette consumption (in ) Average grade

5 8.8
8 8.75
10 8.8
15 8.6
18 8.3
20 8
a) Create regression model for those variables. Explain parameters.

b) Determine and explain the strength of correlation between these
phenomena, using correlation coefficient.
c) On the basis of Spearman’s rank correlation coefficient, assess the

strength and direction of relationship between these phenomena.
Answer: b) c)
d)
3.13. The marketing manager of a large supermarket chain would like

to determine the effect of shelf space on the sales of pet food. A
random sample of 12 equal-sized stores is selected, with following
results:
Store Shelf space (feet) Weekly sales (000 of $)

1 5 1.6
2 5 2.2
3 5 1.4
4 10 1.9
5 10 2.4
6 10 2.6
7 15 2.3
8 15 2.7
9 15 2.8
10 20 2.6
11 20 2.9
12 20 3.1
a) Set up a scatter diagram.

b) Create regression model for these variables. Explain parameters.
c) Calculate and explain coefficient of correlation and coefficient of
determination.
d) Predict the average weekly sales of pet food for stores with 8 feet of
shelf space for pet food.
Answer: b) c) r = 0.827, d) 2.042
3.14. A large mail-order house believes that there is an association

between the weight of the mail it receives and the number of
orders to be filled. It would like to investigate the relationship in
order to predict the number of orders based on the weight of the
mail. From an operational perspective, knowledge of the number
of orders will help in the planning of the order fulfillment process.
A sample of 15 mail shipments is selected within range of 200-700
pounds. The results are as follows:
Weight of the mail (pounds)   Orders (in 000)
216 6.1
283 9.1
237 7.2
203 7.5
259 6.9
374 11.5
342 10.3
301 9.5
365 9.2
384 10.6
404 12.5
426 12.9
482 14.6
432 13.6
409 12.8

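a) Set up a scatter diagram.
b) Create a regression model for these variables. Explain the parameters.
c) Calculate and explain the coefficient of correlation and the coefficient of
determination.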
d) Predict the number of orders when the weight of the mail is 500
pounds.
Answer: b) c) r = 0.957, d) 14.96
3.15. The evil Swindler has been collecting data on the effect radiation
exposure has on Captain Amazing’s super powers. Here is the
number of minutes of exposure to radiation, paired with the
number of tons Captain Amazing is able to lift:
Radiation exposure Weight

(minutes) (tons)
3 14
3.5 14
4 12
4.5 10
5 8
5.5 9.5
6 8
6.5 9
7 6
Your job is to use least squares regression to find the line of best fit,
and then find the correlation coefficient to describe the strength of the
relationship between your line and the data. Sketch the scatter diagram
too. If Swindler exposes Captain Amazing to radiation for 5 minutes,
what weight do you expect Captain Amazing to be able to lift?
Answer: r = –0.81, prediction: 9.61
3.16. The sample data below show the predicted hours of sunshine and concert
attendance for different events. We can use them to estimate ticket
sales based on the predicted hours of sunshine for the day.
Sunshine (hours)   Concert attendance (100's)
1.9 22
2.5 33
3.2 30
3.8 42
4.7 38
5.5 49
5.9 42
7.2 55

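a) Set up a scatter diagram.
b) Create a regression model for these variables. Explain the parameters.
c) Calculate and explain the coefficient of correlation and the coefficient of
determination.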
d) The predicted amount of sunshine on the day of the next concert is 6
hours. What do you expect concert attendance to be?
Answer: b) c) r = 0.91, d) 4.772.
CHAPTER 4
TIME SERIES ANALYSIS
4.1. INTRODUCTION
Because economic and business conditions vary over time, managers
have to find ways to keep abreast of the effects that such changes will
have on their organizations. A very useful technique that can help
in planning future steps is business forecasting from time series
information. The main aim is to create predictions that can be incorporated
into the process of strategic planning. Time-series forecasting methods
involve predictions and projections of future movements based on
past and current observations of the given variables.
Dynamics involves quantitative and qualitative changes
observed in the scope and in the structure (quality) of a phenomenon
or variable within the observed time interval. Dynamic
analysis observes the phenomenon through its variations over
time.
Changes in one relatively isolated phenomenon over time are the
result of the influence of many other phenomena. When we establish
a connection between time as the independent variable and a phenomenon
as the dependent variable, all other phenomena that affect this dependent
variable are absorbed into the time variable.
In time series regression, time is the independent variable and the
analyzed phenomenon is the dependent variable. It is best to have
time intervals of equal length. The length of the time
interval depends on a number of factors:
• the nature of the observed phenomenon (e.g. if there is a seasonal influence,
it is best to monitor by months or quarters, because annual data
cannot show the influence of the season on the
observed phenomenon),
• the objective of the research,
• available instruments and resources, etc.
Some phenomena are relatively stable and do not show rapid changes in
scope and structure, so it is enough to follow yearly or even five-year
data (e.g. social product, capacity, landed estates…). But if we follow
current economic activities (e.g. production, prices, and transport of
goods), we should use monthly data or data for shorter time intervals.
If we use monthly data, we should take into account the
comparability of the data, because months do not all have the same number
of days.
The main aims or tasks of dynamic analysis are:
• Describing the development of a phenomenon over time
• Explaining the variations of a phenomenon over time
• Predicting the development of the phenomenon.
The most frequently used methods of dynamic analysis are:

• The graphic method
• The index method
• The average rate (dynamics indicator) method
• The trend method.
4.2. COMPONENTS (ELEMENTS) OF TIME SERIES
The basic assumption of time-series analysis is that the factors that have
influenced patterns of activity in the past and present will continue to do
so in more or less the same manner in the future. Because of that, the main
aim of time-series analysis is to identify and isolate these influencing
factors for the purpose of prediction.
To achieve this goal, many mathematical models have been devised for
exploring the changes and fluctuations among the component factors
of a time series. The most fundamental models are given for data recorded
annually, quarterly or monthly.
In a time series, the following elements or components can be
recognized:
• Trend as the long-term component
• Seasonal component or seasonal variations
• Cyclical component or cyclical variations
• Random (irregular) component or accidental changes.
If we analyze data on an annual basis, the phenomenon is explained
in two parts: the trend and the residuum (rest), which includes the other three
components of the time series. When the trend is determined from quarterly or
monthly data, additional variations appear because the seasonal component
becomes active.
4.2.1. Trend or long-term component
The overall persistent long-term tendency of upward
or downward movement is the trend.
It is possible to use an appropriate mathematical and statistical model to
express the long-term component; we determine the trend as a
function in which the independent variable is time.
We can define the trend as the systematic component of a time series. Relationships
in the economy often have a long-term trend lasting longer than 10 years.
The trend can follow changes in technology, population, wealth, value etc.
Long-term movements of economic time series such as sales, employment,
stock prices and other business phenomena follow different patterns.
Some move steadily upward, some decline and others stay almost the
same over a period of time.
4.2.2. Seasonal component (seasonal variations)
Seasonal variations in time series express the influence of the
season on the movement of a phenomenon. They are oscillations
about the trend with regular duration and intensity.
The seasonal component can be seen in the arithmetic diagram if the analyzed
variable is presented by month or by quarter.
There are two types of season:
• “active (alive)” season when the level of appearance is
significantly above or below the average level and
• “dead (non-active)” season when the development is
intensified or slow.
Many sales, production and other time series fluctuate with the seasons.
Typical examples of variables or phenomena with a seasonal component are:
consumption of electricity and gas, production of agricultural products,
number of overnight stays in tourism, intensity of construction, etc.
There are mathematical and statistical methods that enable us to “isolate”
the influence of the seasonal component.
4.2.3. Cyclical component
Cyclical components in time series express cyclical
variations over a short period of time. They recur with
varying intensity and character. The periodicity of the cyclical
component is 2 to 10 years.
A typical business cycle consists of a period of prosperity followed by
periods of recession, depression and recovery. No appropriate
mathematical or statistical model has been established that can reliably
track and predict cyclical variations.
4.2.4. Irregular or random component
Irregular variations are caused by random factors. They are

unpredictable and cannot be identified. The overall result
of influence by irregular component may sometimes lead to
deviations from the basic flow of movement.
These deviations are positive in some years and negative in others, and,
in general, do not lead to changes in trend. But if the effect of random
factors is strongly expressed (e.g. in case of war or an earthquake
etc.) then it is possible that their effect (positive or negative) will
lead to changes in the basic course of development of phenomena
(the trend).
4.2.5. Systematic versus nonsystematic component in time series
The trend component and the cyclical and seasonal changes are referred to as
systematic, deterministic components. They are variations of the
phenomenon that can be expressed as a function of time.
The random component is a non-systematic component. It indicates the
existence of irregular changes.
One task of time series analysis is to identify and eliminate the
influence of cyclical, seasonal and random changes (RESIDUUM)
in order to determine the long-term trend of the observed
phenomenon.
4.2.6. Additive versus multiplicative model
If the periodicity in the movement of a phenomenon is constant in relation to
the trend, we can apply an additive model for the time series components.
In the additive model all of the elements are added together to form
the original or actual data. We can write the following formula for the
additive model:28
$Y = T + S + C + R$
where T is the trend, S the seasonal, C the cyclical and R the random component.
The components in the additive model operate independently and therefore
the effects of the individual components of the time series can be summed.
In many models, the cyclical element cannot be identified and the additive
model is simplified to:
$Y = T + S + R$
In the multiplicative model the main elements are multiplied together:
$Y = T \cdot S \cdot C$
or a random component may be added:
$Y = T \cdot S \cdot C + R$
In the multiplicative model, the components are mutually
dependent and therefore their effects are multiplied. The multiplicative
model is appropriate for situations where the variations show a
proportionate shift around the trend in the same period of each year, or
quarter, or month, or week.
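The difference between the two models can be illustrated numerically. The sketch below uses made-up values (the trend, the seasonal effects and the seasonal indices are all hypothetical assumptions, chosen only for illustration) and leaves out the cyclical and random components.

```python
# Hypothetical values, for illustration only.
trend = [100.0, 120.0, 140.0, 160.0]            # trend component T
seasonal_effect = [10.0, -10.0, 10.0, -10.0]    # additive S, in units of Y
seasonal_index = [1.1, 0.9, 1.1, 0.9]           # multiplicative S, as a ratio

# Additive model: Y = T + S
additive = [t + s for t, s in zip(trend, seasonal_effect)]
# Multiplicative model: Y = T * S
multiplicative = [t * s for t, s in zip(trend, seasonal_index)]

# Deviations of the series from the trend under each model
add_dev = [y - t for y, t in zip(additive, trend)]
mul_dev = [y - t for y, t in zip(multiplicative, trend)]
print(add_dev)
print([round(d, 6) for d in mul_dev])
```

In the additive series the seasonal swing stays at 10 units whatever the level of the trend, while in the multiplicative series it stays at 10% of the trend and therefore grows in absolute terms as the trend rises.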
28
Source: Somun-Kapetanović R., Statistika u ekonomiji i menadžmentu, Ekonomski fakultet u
4.3. GRAPHICAL METHOD FOR EVALUATION ANALYSIS OF SOME PHENOMENA
A time series can be represented graphically in the following ways:
• Bars
• Arithmetic chart (lines)
• Semi-logarithmic diagram and
• Polar diagram (if the analyzed variable is presented by month
or by quarter).
When several series are monitored over the same period, we can apply:
• Arithmetic chart (lines)
• Connected bars and
• Split bars.
We will present the different types of time series graphs through examples.
Example 4.1.
In the period 2000-2008, we monitored Gross domestic product in

FB&H29. Results are given in the next table:
Year GDP ('000 KM)

2000 6,722,631
2001 7,273,874
2002 7,942,665
2003 9,688,863
2004 10,321,440
2005 10,831,267
2006 12,146,338
2007 13,861,000
2008 15,632,000
29
http://www.bhas.ba/new/indikatori.asp?Pripadnost=6, access: 28. 01. 2010.
The first graph that we will create is a bar graph:
Figure: Graphical presentation of time series by a bar chart.
Then we will create an arithmetic diagram:
Figure: Graphical presentation of time series by an arithmetic diagram.
The GDP figures are large, so we can use a semi-logarithmic diagram,
with the logarithm of GDP on the y-axis:
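The values plotted on the vertical axis of such a diagram can be obtained as follows. This is a minimal sketch assuming base-10 logarithms (the text does not say which base is used); the dictionary name `gdp` is illustrative.

```python
import math

# GDP of FB&H in '000 KM, from the table in Example 4.1
gdp = {
    2000: 6722631, 2001: 7273874, 2002: 7942665,
    2003: 9688863, 2004: 10321440, 2005: 10831267,
    2006: 12146338, 2007: 13861000, 2008: 15632000,
}

# Logarithm of each value, as plotted on the y-axis of the semi-log diagram
log_gdp = {year: math.log10(value) for year, value in gdp.items()}

for year, value in log_gdp.items():
    print(year, round(value, 4))
```

On a logarithmic scale equal vertical distances correspond to equal relative changes, so the diagram makes growth rates in different years directly comparable.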
Example 4.2.
In the period 2003-2008, we monitored simultaneously two phenomena:

B&H Import and Export30. Results are given in the next table:
Year Import Export

2003 8,365,183 2,428,234
2004 9,422,969 3,012,763
2005 11,180,797 3,783,199
2006 11,388,783 5,164,295
2007 13,898,242 5,936,583
2008 16,287,044 6,714,302
First graph that we will create is an arithmetic diagram:
Then we can use connected bars if we want to emphasize difference

between import and export:
30
Figure: Graphical presentation of time series with connected bars.
Or we can use split bars if we want to hide the difference between import
and export:
Figure: Graphical presentation of time series with split bars.
Example 4.3.
For 2008, we monitored monthly Export of B&H31. The results are given
in the next table:
Month 2008. Export from B&H

January 485,922
February 545,562
March 535,452
April 567,877
May 600,588
June 614,074
July 635,412
August 547,084
September 616,611
October 582,804
November 538,259
December 444,657
In this case we will apply a polar diagram:
Figure: Graphical presentation of time series with a polar diagram.
31
4.4. ABSOLUTE AND RELATIVE CHANGES
It is very often necessary to describe measure and interpret changes in

some economic, business or social variables over time. One way to make
quantification of those changes is to calculate absolute or relative change.
4.4.1. Absolute change
If $V_t$ is the level of a variable in period t and $V_0$ is its level in some previous reference period 0, then the absolute change that occurred between period t and the reference period 0 can be expressed by the formula:
$$\Delta V_{t/0} = V_t - V_0$$
As we can see from the formula, the absolute change is expressed in the measurement unit of the analyzed variable. Because of that, we cannot use the absolute change for comparison if we work with several variables with different units of measurement.
Absolute changes are additive over successive periods:
$$V_t - V_0 = (V_t - V_{t-1}) + (V_{t-1} - V_0)$$
Absolute change can be:
• Positive, if $V_t > V_0$, i.e. the variable increases.
• Equal to 0, if $V_t = V_0$, i.e. there is no change.
• Negative, if $V_t < V_0$, i.e. the variable decreases.
4.4.2. Relative change
When we divide the absolute change by the level of the variable in period 0, we get the relative change, or the rate of change:
$$r_{t/0} = \frac{V_t - V_0}{V_0}$$
The relative change is an unnamed (dimensionless) number. Because of that, we can use the relative change for comparison if we work with several variables with different units of measurement.
Unlike the absolute change, the relative change is not additive over successive periods:
$$r_{t/0} \neq r_{t/t-1} + r_{t-1/0}$$
Relative change can be:
• Positive, if $V_t > V_0$; the variable increases and this is the rate of growth.
• Equal to 0, if $V_t = V_0$; there is no change.
• Negative, if $V_t < V_0$; the variable decreases and this is the rate of decline.
In the period 2000-2008, we monitored Gross domestic product for

FB&H. The results are given in the next table:
Year GDP ('000 KM)

2000 6,722,631
2001 7,273,874
2002 7,942,665
2003 9,688,863
2004 10,321,440
2005 10,831,267
2006 12,146,338
2007 13,861,000
2008 15,632,000
Firstly we will calculate absolute and relative changes compared to 2000:
Calculating and interpreting absolute and relative changes.
Year   GDP ('000 KM)   Absolute change compared to 2000   Relative change compared to 2000
2000   6,722,631       0                                  0.0000
2001   7,273,874       551,243                            0.0820
2002   7,942,665       1,220,034                          0.1815
2003   9,688,863       2,966,232                          0.4412
2004   10,321,440      3,598,809                          0.5353
2005   10,831,267      4,108,636                          0.6112
2006   12,146,338      5,423,707                          0.8068
2007   13,861,000      7,138,369                          1.0618
2008   15,632,000      8,909,369                          1.3253
In 2005, the GDP of FB&H was 4,108,636,000 KM, or 61.12%, higher than in 2000.
Just as we took the initial year as the reference for comparison, we can take any other year from the given period. Comparisons can also always be made with the previous year:
Calculating and interpreting absolute and relative changes.
Year   GDP ('000 KM)   Absolute change compared to previous year   Relative change compared to previous year
2000   6,722,631       /                                           /
2001   7,273,874       551,243                                     0.0820
2002   7,942,665       668,791                                     0.0919
2003   9,688,863       1,746,198                                   0.2199
2004   10,321,440      632,577                                     0.0653
2005   10,831,267      509,827                                     0.0494
2006   12,146,338      1,315,071                                   0.1214
2007   13,861,000      1,714,662                                   0.1412
2008   15,632,000      1,771,000                                   0.1278
In 2005, the GDP of FB&H was 509,827,000 KM, or 4.94%, higher than in 2004.
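The two comparisons above can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionary `gdp` and the helper `changes` are illustrative names, not part of the text, and the figures are copied from the GDP table.

```python
# GDP of FB&H in '000 KM, copied from the table in the text.
gdp = {
    2000: 6_722_631, 2001: 7_273_874, 2002: 7_942_665,
    2003: 9_688_863, 2004: 10_321_440, 2005: 10_831_267,
    2006: 12_146_338, 2007: 13_861_000, 2008: 15_632_000,
}

def changes(series, year, reference):
    """Return (absolute change, relative change) of `year` against `reference`."""
    absolute = series[year] - series[reference]
    relative = absolute / series[reference]
    return absolute, relative

# Comparison with the fixed reference year 2000.
base_abs, base_rel = changes(gdp, 2005, 2000)
# Comparison with the previous year.
chain_abs, chain_rel = changes(gdp, 2005, 2004)

# Year-on-year absolute changes from 2001 to 2005, to check additivity.
yearly_abs = [changes(gdp, y, y - 1)[0] for y in range(2001, 2006)]

print(base_abs, round(base_rel, 4))
print(chain_abs, round(chain_rel, 4))
print(sum(yearly_abs) == base_abs)
```

The last line checks that the year-on-year absolute changes add up to the absolute change against 2000; the relative changes do not add up in this way.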
4.5. THE INDEX METHOD
Indices provide a measure of change over time with reference to a base period whose value is set to 100. An index is a number that expresses the relative change of a simple or complex quantity between two periods, one of which is defined as the base period. An index is always an unnamed (dimensionless) number, and index numbers are interpreted as percentages. Index numbers are not concerned with absolute values but with the movement of the analyzed variable. They can summarize changes by aggregating the available information and enabling comparison with a starting figure of 100.
If an index number is used to measure the relative change in just one variable, we talk about a simple or individual index number.
It is the ratio of two values of a variable converted to percentage form. We use individual index numbers to analyze a homogeneous variable. We fix the base period and calculate the change between the value in the observed period, denoted by t, and the value in the base period, denoted by 0.
If we work with more than one variable, we talk about aggregate index numbers.
We use aggregate index numbers to analyze heterogeneous categories. The structure of aggregate index numbers is technically and methodologically very complicated, which sometimes makes their interpretation difficult. Commonly used aggregate indices are: indices of value, price indices, volume indices, cost of living indices, stock-exchange indices (Dow Jones, the CAC 40), etc.
Individual index numbers
As we said before, we calculate individual indices to monitor the movement of a homogeneous phenomenon.
Fixed base indices (basis indices) always take the same year as the base year:
$$I_{t/0} = \frac{V_t}{V_0} \cdot 100$$
Indices with a variable basis (chain indices) always take the previous year as the base year:
$$I_{t/t-1} = \frac{V_t}{V_{t-1}} \cdot 100$$
There are several characteristics of index numbers:
1. The transitivity characteristic: $I_{t/0} = \frac{I_{t/t'} \cdot I_{t'/0}}{100}$
2. The reciprocity characteristic: $I_{0/t} = \frac{100^2}{I_{t/0}}$
3. The circularity characteristic:
There is a connection between the basis and chain indices, as follows:
$$I_{t/t-1} = \frac{I_{t/0}}{I_{t-1/0}} \cdot 100$$
We can use this connection to convert basis indices to chain indices or vice versa.
We can also find the connection between basis indices with different bases 0 and b:
$$I_{t/b} = \frac{I_{t/0}}{I_{b/0}} \cdot 100$$
Indices are used to calculate the rate of change r and vice versa, according to the following link between these parameters:
$$r = \frac{I}{100} - 1$$
In the period 2000-2008, we monitored phenomenon Gross domestic

product for FB&H. Results are given in the next table:
Year GDP ('000 KM)

2000 6,722,631
2001 7,273,874
2002 7,942,665
2003 9,688,863
2004 10,321,440
2005 10,831,267
2006 12,146,338
2007 13,861,000
2008 15,632,000
We will calculate and interpret individual index numbers:
Calculating and interpreting basis and chain indices.
Year   GDP ('000 KM)   Basis indices I(t/2000)   Chain indices I(t/t-1)
2000   6,722,631       100.00                    /
2001   7,273,874       108.20                    108.20
2002   7,942,665       118.15                    109.19
2003   9,688,863       144.12                    121.99
2004   10,321,440      153.53                    106.53
2005   10,831,267      161.12                    104.94
2006   12,146,338      180.68                    112.14
2007   13,861,000      206.18                    114.12
2008   15,632,000      232.53                    112.78
In 2005, GDP in FB&H increased by:
61.12% compared to 2000,
4.94% compared to the previous year, 2004.
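As a hedged illustration, the basis and chain indices in the table can be recomputed directly from the GDP series. The variable names below are assumptions made for this sketch; the last step recovers the chain indices from the basis indices alone.

```python
# GDP of FB&H in '000 KM for 2000..2008, copied from the table in the text.
years = list(range(2000, 2009))
gdp = [6_722_631, 7_273_874, 7_942_665, 9_688_863, 10_321_440,
       10_831_267, 12_146_338, 13_861_000, 15_632_000]

# Basis indices: every year is compared with the same base year (2000).
basis = [v / gdp[0] * 100 for v in gdp]

# Chain indices: every year is compared with the previous year.
chain = [None] + [gdp[i] / gdp[i - 1] * 100 for i in range(1, len(gdp))]

# Conversion: a chain index is the ratio of two neighbouring basis indices.
chain_from_basis = [None] + [basis[i] / basis[i - 1] * 100
                             for i in range(1, len(basis))]

for year, b, c in zip(years, basis, chain):
    print(year, round(b, 2), "/" if c is None else round(c, 2))
```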
Example 4.4.
We observed a phenomenon for three consecutive years. In the second

year, the phenomenon increased by 10%, and then dropped by 8% in
the third year. What is the rate of change in the third in relation to the
first year?
Solution:
To solve this problem, we apply the characteristic of transitivity to the two chain indices, $I_{2/1} = 110$ and $I_{3/2} = 92$:
$$I_{3/1} = \frac{I_{3/2} \cdot I_{2/1}}{100} = \frac{92 \cdot 110}{100} = 101.2$$
The rate of change in the third year in relation to the first year is an increase of 1.2%.
4.5.1. The average annual rate of change
Suppose that V is growing at an average annual rate r. If the level of V amounted to $V_1$ in the first year, then we expect that the level of V in the second year will be:
$$V_2 = V_1 (1 + r)$$
By analogy, the level in the third year is:
$$V_3 = V_2 (1 + r) = V_1 (1 + r)^2$$
Continuing in the same way, the level in the n-th year will be:
$$V_n = V_1 (1 + r)^{n-1}$$
From the last formula we can express the average annual rate r:
$$r = \sqrt[n-1]{\frac{V_n}{V_1}} - 1$$
The expression for the average annual rate of change can also be given through logarithms:
$$\log(1 + r) = \frac{\log V_n - \log V_1}{n - 1}$$
On the basis of a known average annual growth rate we can make predictions (projections or forecasts):
• What level of the phenomenon can we expect in a given year?
• In how many years will a given level $V_n$ be reached?
Example 4.5.
Known levels of investment in one branch of the economy (in $000) are
given in the next table:
Year Investment
2003 150
2004 184
2005 192
2006 185
2007 187
2008 191
a) What is the average annual growth rate and what does it mean?
b) If we continue this growth per annum, in which year will investment
reach the level of 82% higher than the level of investment in 2003?
c) If the growth per annum stays the same, what investment level can
be expected in 2012?
Solution:
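a) Calculating and interpreting the average annual growth rate, with $V_1 = 150$ (2003), $V_6 = 191$ (2008) and n = 6:
$$r = \sqrt[5]{\frac{191}{150}} - 1 = 0.0495$$
The average annual growth rate is 4.95%. On average, investment increased by 4.95% per annum in this period.
b) Determining the number of years k after which the level is 82% higher than in 2003:
$$(1 + r)^k = 1.82 \Rightarrow k = \frac{\log 1.82}{\log 1.0495} \approx 12.4$$
Investment will be 82% higher than the level of investment in 2003 about 12.4 years after 2003, that is, in 2015.
c) Forecasting the level of the phenomenon. The year 2012 is 9 years after 2003, so:
$$V_{2012} = 150 \cdot 1.0495^{9} \approx 231.7$$
According to the prediction based on the average annual rate of change, in 2012 investment will reach the level of 231,700 dollars.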
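The three parts of Example 4.5 can be checked numerically. The sketch below assumes the growth formula with n - 1 compounding periods between the first and the last year; the names `investment`, `periods` and `years_needed` are illustrative. Because the code keeps the unrounded rate, part c) differs slightly in the second decimal from a calculation with the rounded factor 1.0495.

```python
import math

# Investment in $000, copied from the table in Example 4.5.
investment = {2003: 150, 2004: 184, 2005: 192, 2006: 185, 2007: 187, 2008: 191}

first, last = 2003, 2008
periods = last - first  # n - 1 = 5 compounding periods

# a) Average annual growth rate from the first and last level only.
r = (investment[last] / investment[first]) ** (1 / periods) - 1

# b) Number of years after 2003 needed for an 82% higher level.
years_needed = math.log(1.82) / math.log(1 + r)

# c) Projected level in 2012, assuming the same rate continues.
level_2012 = investment[first] * (1 + r) ** (2012 - first)

print(round(r, 4))
print(round(years_needed, 1))
print(round(level_2012, 1))
```

Note that the rate depends only on the first and the last observation; the levels in between do not enter the calculation.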
4.5.2. Aggregate index numbers
Aggregate or group index numbers are used to express the dynamics, that is the relative change, of several phenomena at once. An aggregate index is a common time index that serves as a statistical indicator of the variations of different, but relatively homogeneous, phenomena.
For example, the price index expresses the common price variation of all products in the consumer basket for the observed period.
The most important aggregate indices are:
• Index of values
• Price index
• Volume (quantity) index
• Cost of living index
There are several methods for determining aggregate index numbers:
• The method of reducing to the "conditional" unit can be used when different but related phenomena, measured in different units, have to be brought into relation. For example, for different types of coal we can take the calorific value of the coal as the conditional unit. Due to its specific conditions of application, this method is rarely used.
• The method of weighted average is based on determining the mean (average) index number for the period, so that the indices of different phenomena in the same period can be reduced to an average index. First we calculate the individual indices, and after weighting them by the corresponding values we get the weighted average index number.
• The method of aggregation reduces various phenomena to comparable values and afterwards creates index numbers. It is necessary to fix the structure of one of the components of the complex time series in the base period or in the monitored period.
To explain the construction and calculation of aggregate indices we introduce some symbols:
$p_{0,j}$ - price of product j in the base (reference) period 0
$p_{i,j}$ - price of product j in the current (monitored) period i
$q_{0,j}$ - quantity (produced or consumed) of product j in the base (reference) period 0
$q_{i,j}$ - quantity (produced or consumed) of product j in the current (monitored) period i
$W_{0,j} = p_{0,j} \cdot q_{0,j}$ - value (produced or consumed) of product j in the base (reference) period 0
$W_{i,j} = p_{i,j} \cdot q_{i,j}$ - value (produced or consumed) of product j in the current (monitored) period i
4.5.3. Index of values
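Now we can define some important aggregate indices. First, there is the index of values.
For product j, the index of values is equal to:
$$I_{W,j} = \frac{W_{i,j}}{W_{0,j}} \cdot 100 = \frac{p_{i,j} \, q_{i,j}}{p_{0,j} \, q_{0,j}} \cdot 100$$
For a consumer basket or product line with m products, the index of values is:
$$I_W = \frac{\sum_{j=1}^{m} p_{i,j} \, q_{i,j}}{\sum_{j=1}^{m} p_{0,j} \, q_{0,j}} \cdot 100$$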
4.5.4. Aggregate price index
According to the method of aggregation, the structure of quantities (consumption or production) must be fixed in the base period or in the monitored period in order to calculate the aggregate price index number.
If we fix the quantities in the base period, we get the Laspeyres price index:
$$L_p = \frac{\sum_j p_{i,j} \, q_{0,j}}{\sum_j p_{0,j} \, q_{0,j}} \cdot 100$$
The Laspeyres index is calculated as a weighted arithmetic mean and it has the property of aggregation.
But if we fix the quantities in the monitored period, we get the Paasche price index:
$$P_p = \frac{\sum_j p_{i,j} \, q_{i,j}}{\sum_j p_{0,j} \, q_{i,j}} \cdot 100$$
According to the method of weighted average, we first have to introduce real budget coefficients based on the budget of the base period and on the budget of the monitored period:
$$w_{0,j} = \frac{p_{0,j} \, q_{0,j}}{\sum_j p_{0,j} \, q_{0,j}}, \qquad w_{i,j} = \frac{p_{i,j} \, q_{i,j}}{\sum_j p_{i,j} \, q_{i,j}}$$
Then the Laspeyres price index is equal to the arithmetic mean of the price indices of the individual products in the consumer basket, weighted with the real budget coefficients of the base period:
$$L_p = \sum_j w_{0,j} \, \frac{p_{i,j}}{p_{0,j}} \cdot 100$$
The Paasche price index is equal to the harmonic mean of the price indices of the individual products in the consumer basket, weighted with the real budget coefficients of the monitored period:
$$P_p = \frac{1}{\sum_j w_{i,j} \, \frac{p_{0,j}}{p_{i,j}}} \cdot 100$$
Theoretically, the Laspeyres and Paasche indices do not have the transitivity property. In practice, however, the property is numerically almost satisfied, so it is assumed that these indices are transitive in order to simplify their application.
To avoid subjectivity in choosing the weights, we can use the Fisher price index number, calculated as the geometric mean of the Laspeyres price index $L_p$ and the Paasche price index $P_p$:
$$F_p = \sqrt{L_p \cdot P_p}$$
4.5.5. Aggregate volume (quantity) index
According to the method of aggregation, the structure of prices must be fixed in the base period or in the monitored period in order to calculate the aggregate volume index number.
If we fix the prices of the base period, we get the Laspeyres volume index:
$$L_q = \frac{\sum_j p_{0,j} \, q_{i,j}}{\sum_j p_{0,j} \, q_{0,j}} \cdot 100$$
The Laspeyres index is calculated as a weighted arithmetic mean and it has the property of aggregation.
But if we fix the prices of the monitored period, we get the Paasche volume index:
$$P_q = \frac{\sum_j p_{i,j} \, q_{i,j}}{\sum_j p_{i,j} \, q_{0,j}} \cdot 100$$
To apply the method of weighted average, we again use the real budget coefficients $w_{0,j}$ and $w_{i,j}$, that is, the shares of product j in the total value of the basket in the base period and in the monitored period. The Laspeyres volume index is equal to the arithmetic mean of the volume indices of the individual products in the consumer basket, weighted with the real budget coefficients of the base period:
$$L_q = \sum_j w_{0,j} \, \frac{q_{i,j}}{q_{0,j}} \cdot 100$$
The Paasche volume index is equal to the harmonic mean of the volume indices of the individual products in the consumer basket, weighted with the real budget coefficients of the monitored period:
$$P_q = \frac{1}{\sum_j w_{i,j} \, \frac{q_{0,j}}{q_{i,j}}} \cdot 100$$
Theoretically, the Laspeyres and Paasche indices do not have the transitivity property. In practice, however, the property is numerically almost satisfied, so it is assumed that these indices are transitive in order to simplify their application.
To avoid subjectivity in choosing the weights, we can use the Fisher volume index number, calculated as the geometric mean of the Laspeyres volume index $L_q$ and the Paasche volume index $P_q$:
$$F_q = \sqrt{L_q \cdot P_q}$$
Example 4.6.
We have information about the prices and quantities sold of four items in two periods (2007 and 2008):
Product   Quantity 2007   Quantity 2008   Price 2007   Price 2008
I         72              91              4            6
II        24              26              11           15
III       9               16              7            9
IV        96              102             22           24
We should determine different aggregate indices from given data.
Solution:
First we complete a working table with the sums that we need for the calculation according to the method of aggregation32:
q0   q1    p0   p1   W0=p0·q0   W1=p1·q1   p1·q0   p0·q1
72   91    4    6    288        546        432     364
24   26    11   15   264        390        360     286
9    16    7    9    63         144        81      112
96   102   22   24   2,112      2,448      2,304   2,244
Σ                    2,727      3,528      3,177   3,006
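Calculating and interpreting the index of values:
$$I_W = \frac{\sum p_1 q_1}{\sum p_0 q_0} \cdot 100 = \frac{3{,}528}{2{,}727} \cdot 100 = 129.37$$
The observed value of the consumer basket of 4 products increased by 29.37% in 2008 compared to 2007.
32 The method of aggregation is a simpler option for calculation than the method of weighted average, hence we use the method of aggregation. However, the results have to be the same.
Calculating and interpreting price aggregate indices:
$$L_p = \frac{\sum p_1 q_0}{\sum p_0 q_0} \cdot 100 = \frac{3{,}177}{2{,}727} \cdot 100 = 116.50$$
$$P_p = \frac{\sum p_1 q_1}{\sum p_0 q_1} \cdot 100 = \frac{3{,}528}{3{,}006} \cdot 100 = 117.37$$
$$F_p = \sqrt{116.50 \cdot 117.37} = 116.93$$
The observed price of the consumer basket of 4 products increased by 16.93% in 2008 compared to 2007 (by Fisher).
Calculating and interpreting quantity aggregate indices:
$$L_q = \frac{\sum p_0 q_1}{\sum p_0 q_0} \cdot 100 = \frac{3{,}006}{2{,}727} \cdot 100 = 110.23$$
$$P_q = \frac{\sum p_1 q_1}{\sum p_1 q_0} \cdot 100 = \frac{3{,}528}{3{,}177} \cdot 100 = 111.05$$
$$F_q = \sqrt{110.23 \cdot 111.05} = 110.64$$
The observed volume of the consumer basket of 4 products increased by 10.64% in 2008 compared to 2007 (by Fisher).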
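The aggregate indices of Example 4.6 can be recomputed from the raw prices and quantities. This is a minimal sketch using the method of aggregation; the helper `agg` and the variable names are illustrative, not notation from the text.

```python
from math import sqrt

# Quantities and prices of products I-IV in 2007 (period 0) and 2008 (period 1).
q0 = [72, 24, 9, 96]
q1 = [91, 26, 16, 102]
p0 = [4, 11, 7, 22]
p1 = [6, 15, 9, 24]

def agg(prices, quantities):
    """Sum of price * quantity over all products in the basket."""
    return sum(p * q for p, q in zip(prices, quantities))

# Index of values.
value_index = agg(p1, q1) / agg(p0, q0) * 100

# Price indices: quantities fixed in the base (Laspeyres) or monitored (Paasche) period.
laspeyres_price = agg(p1, q0) / agg(p0, q0) * 100
paasche_price = agg(p1, q1) / agg(p0, q1) * 100
fisher_price = sqrt(laspeyres_price * paasche_price)

# Volume indices: prices fixed in the base (Laspeyres) or monitored (Paasche) period.
laspeyres_volume = agg(p0, q1) / agg(p0, q0) * 100
paasche_volume = agg(p1, q1) / agg(p1, q0) * 100
fisher_volume = sqrt(laspeyres_volume * paasche_volume)

print(round(value_index, 2))
print(round(laspeyres_price, 2), round(paasche_price, 2), round(fisher_price, 2))
print(round(laspeyres_volume, 2), round(paasche_volume, 2), round(fisher_volume, 2))
```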
4.5.6. Decomposition of aggregate index
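The decomposition of the aggregate index of values enables us to use information about the price and volume indices to determine the index of values.
According to the definition of the aggregate index of values:
$$I_W = \frac{L_p \cdot P_q}{100}$$
Or:
$$I_W = \frac{P_p \cdot L_q}{100}$$
where L and P denote the Laspeyres and Paasche indices of price (p) and of volume (q).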
Example 4.7.
For a consumer basket consisting of 10 products, in 2009 compared to 2008, the price index (method of aggregation, Paasche weights) is 130% and the volume index (method of aggregation, Laspeyres weights) is 98.5%. Calculate the aggregate index of values for that consumer basket. Explain the obtained result.
Solution:
According to the decomposition of the index of values:
$$I_W = \frac{P_p \cdot L_q}{100} = \frac{130 \cdot 98.5}{100} = 128.05$$
The observed value of the consumer basket increased by 28.05% in 2009 in comparison to 2008.
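The decomposition can also be verified on the sums from the working table of Example 4.6. The sketch below is only a numerical check under that assumption; the variable names are illustrative.

```python
# Sums from the working table of Example 4.6.
sum_p0q0, sum_p1q1, sum_p1q0, sum_p0q1 = 2727, 3528, 3177, 3006

value_index = sum_p1q1 / sum_p0q0 * 100

laspeyres_price = sum_p1q0 / sum_p0q0 * 100
paasche_price = sum_p1q1 / sum_p0q1 * 100
laspeyres_volume = sum_p0q1 / sum_p0q0 * 100
paasche_volume = sum_p1q1 / sum_p1q0 * 100

# Both pairings reproduce the index of values (up to floating point error).
via_laspeyres_price = laspeyres_price * paasche_volume / 100
via_paasche_price = paasche_price * laspeyres_volume / 100

# Example 4.7: Paasche price index 130, Laspeyres volume index 98.5.
example_4_7 = 130 * 98.5 / 100

print(round(value_index, 2), round(via_laspeyres_price, 2), round(via_paasche_price, 2))
print(round(example_4_7, 2))
```

Pairing two Laspeyres indices, or two Paasche indices, does not reproduce the index of values exactly, which is why the pairs are crossed.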
4.6. DETERMINATION OF THE TREND
As we said at the beginning of this section, the trend expresses the long-term evolution or direction of a phenomenon. The trend can be determined from annual, quarterly or monthly time series. If the seasonal component is pronounced, we determine the trend at the quarterly or monthly level. When we analyze data on an annual basis, the phenomenon is explained in two parts: the trend and the residuum, which includes the three other components of the time series.
For determination of the trend, we can apply three different methods:
• Determination of the trend by "eye"
• Empirical or graphical method of moving averages
• Analytical or mathematical method of least squares (regression model where time is the independent variable).
4.6.1. Determination of trend by "eye"
If we need only a general idea of where the trend is going, we use our judgment to draw a trend line onto the graph, that is, we use the method by "eye". The first step in a time series analysis is to plot the original data and observe any patterns that may occur over time.
The main problem with this method is that several persons drawing such a trend line will tend to create slightly different lines, and it is then debatable who has drawn the best line. Also, estimation by "eye" does not provide a basis that would be appropriate for more complex further analysis.
4.6.2. The method of moving averages
The method of moving averages is based on calculating the arithmetic mean of a certain number of data from the data series. Moving averages of order p (p < T) for the series $\{y_t, t = 1, \ldots, T\}$ are defined as the successive averages of p successive data. Accordingly, each data point in the series is replaced by the arithmetic mean of that data point and one or more previous and subsequent data points.
This type of trend tries to smooth out the oscillations in the original data series by looking at intervals of time that make sense, finding an average value, then moving forward by one step and calculating an average again. The process continues until we reach the end of the data set.
1. If the order of the moving average is odd (p = 2m + 1), then:
$$\bar{y}_t = \frac{1}{p} \sum_{k=-m}^{m} y_{t+k}$$
Moving averages of odd order are simple and symmetrical.
2. If the order of the moving average is even (p = 2m), the situation is more complex. The moving average for date t is a weighted average of the data, with weighting coefficients:
1/(2p) for the two extreme values $y_{t-m}$ and $y_{t+m}$, and
1/p for the (p-1) intermediate values $y_{t-m+1}$ to $y_{t+m-1}$.
There are (p+1) elements in the calculation. We can calculate (T-p) moving averages of even order by the formula:
$$\bar{y}_t = \frac{1}{p} \left( \frac{y_{t-m}}{2} + \sum_{k=-m+1}^{m-1} y_{t+k} + \frac{y_{t+m}}{2} \right)$$
The method of moving averages for smoothing a time series is very subjective and depends on the length of the period selected for constructing the moving averages. If cyclical oscillations are present in the time series, the length of the period should be chosen as an integer that corresponds to the estimated average length of the cycle in that series.
The longer the period selected for constructing the moving averages, the fewer moving averages can be computed and plotted. Selecting moving averages with periods longer than 7 time units (for example years) is therefore usually undesirable, because too many data points would be missing at the beginning and end of the original data set, and an overall impression of the whole series would be difficult to obtain.
By the method of moving averages we smooth ("press") the trend line and eliminate the impact of the residuum. On that basis we can conclude whether a phenomenon has a growing or declining long-term character. When the original series consists of quarterly data, moving averages that span a full year (order 4) do not contain seasonal variation, because the averaging eliminates it.
Example 4.8.
We know the data on the treasury bill rates for the period 2000-2009:
Year The treasury bill rates - yi

2000 5.42
2001 3.45
2002 3.02
2003 4.29
2004 5.51
2005 5.02
2006 5.07
2007 4.81
2008 4.66
2009 5.66
First, we create an arithmetic diagram:
[Figure: Arithmetic diagram.]
The method of moving averages reveals the long-term trend of this phenomenon, which could not be seen on the basis of the gross (original) data. We will calculate moving averages of order 3:
Calculating third order moving averages.
Year   The treasury bill rates - yi   Moving averages of order 3
2000   5.42                           /
2001   3.45                           3.96
2002   3.02                           3.59
2003   4.29                           4.27
2004   5.51                           4.94
2005   5.02                           5.20
2006   5.07                           4.97
2007   4.81                           4.85
2008   4.66                           5.04
2009   5.66                           /
We will plot the moving averages of order 3 on the graph:
[Figure: Arithmetic diagram.]
As we can see on the graph, we get a new smoothed (pressed) line of moving averages of order 3.
Then we will calculate moving averages of order 4:
Calculating moving averages of order 4.
Year   The treasury bill rates - yi   Moving averages of order 4
2000   5.42                           /
2001   3.45                           /
2002   3.02                           4.06
2003   4.29                           4.26
2004   5.51                           4.72
2005   5.02                           5.04
2006   5.07                           5.00
2007   4.81                           4.97
2008   4.66                           /
2009   5.66                           /
We will plot the moving averages of order 4 on the graph:
[Figure: Graphical presentation of moving averages of order 4.]
As we can see on the graph, we get a new smoothed (pressed) line for the moving averages of order 4. By comparing the graph of the gross data with the graph of the fourth-order moving averages, we can recognize the growth trend of the analyzed phenomenon in the observed period. We can conclude that this phenomenon mostly has a growing long-term character.
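Both tables of Example 4.8 can be reproduced with one small function. The sketch below assumes centered moving averages, with half weights on the two extreme values when the order is even; `moving_average` is an illustrative helper, not a library function.

```python
# Treasury bill rates for 2000..2009, copied from Example 4.8.
years = list(range(2000, 2010))
rates = [5.42, 3.45, 3.02, 4.29, 5.51, 5.02, 5.07, 4.81, 4.66, 5.66]

def moving_average(y, p):
    """Centered moving average of order p; None where it cannot be computed."""
    m = p // 2
    result = [None] * len(y)
    for t in range(m, len(y) - m):
        window = y[t - m:t + m + 1]
        if p % 2 == 1:
            # Odd order: simple mean of the p values around t.
            result[t] = sum(window) / p
        else:
            # Even order: p + 1 values, the two extremes get half weight.
            result[t] = (window[0] / 2 + sum(window[1:-1]) + window[-1] / 2) / p
    return result

ma3 = moving_average(rates, 3)
ma4 = moving_average(rates, 4)

for year, y, a3, a4 in zip(years, rates, ma3, ma4):
    print(year, y,
          "/" if a3 is None else f"{a3:.2f}",
          "/" if a4 is None else f"{a4:.2f}")
```

Order 3 leaves one value undefined at each end of the series, order 4 leaves two, which illustrates why long periods cost data points.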
Example 4.9.
We observed the movement of a phenomenon at the quarterly level:
Calculating moving averages using Excel.
Date      Gross data
2005-Q1   20
Q2        21
Q3        22
Q4        33
2006-Q1   23
Q2        26
Q3        23
Q4        37
2007-Q1   23
Q2        26
Q3        22
Q4        39
2008-Q1   24
Q2        29
Q3        28
Q4        40
Now we will present the Excel procedure for the moving average method:
1. First we plot the graph with the gross data. We choose the option Next, and the graph with the gross data is plotted.
2. On the graph in Excel we select the plotted line and right-click it. We choose the option Add Trendline and then Moving Average.
For Period we enter 4 and get a new line for the moving average of order 4.
In the same way we can get the graph for moving averages of any order p.
4.7. MATHEMATICAL MODELS FOR DETERMINATION OF LONG-TERM TREND
The trend is the most often analyzed component of a time series, and it is studied as an aid in forecasting.
Determining which model reflects the trend in the movement of the observed variable means finding the mathematical function that best fits the values of the analyzed time series.
In this section the main focus is on the least squares method for fitting the best mathematical trend model as a guide for forecasting.
Models are chosen based on an analysis of the arithmetic diagram of the time series. The most common forms of mathematical functions that are used include: linear, curvilinear, exponential etc. When we create a graph with the original data in Excel and select the Add Trendline option, we are offered different mathematical trend models.
4.7.1. Least squares method for determination of the trend
The least squares method for trend determination gives us the possibility to determine the most appropriate model to express the movement, that is, to find a mathematical function whose values are most similar to the values of the time series which is the subject of analysis. It assumes that the function that best approximates the observed series is the one whose deviation from the series is least (the sum of squares of the deviations is the smallest):
$$\sum_{i=1}^{n} (y_i - y_{ti})^2 \rightarrow \min$$
where $y_i$ are the original levels and $y_{ti}$ the levels estimated by the trend model of the phenomenon for a given time unit (usually a year).
Before applying the least squares method (LSM) we should set up the independent (time) variable. The simplest coding of the x values selects the first observation in the time series as the origin and assigns it the code value $x_1 = 1$. All successive observations are then assigned consecutively increasing integer codes: 2, 3, etc., so the last observation in the series has code n. But if the periods follow one another without gaps, we can center the independent time variable so that:
$$\sum_{i=1}^{n} x_i = 0$$
The time variable is centered so that it expresses deviations from its arithmetic mean. As the sum of these deviations is equal to zero, this simplifies the computation.
We can measure the representativeness of the trend by standard error

of trend that shows the average deviation of empirical values of the
series from the estimated trend values:
Or by relative error of trend that is given by trend variation coefficient:
Relative error of trend can be used for comparison of the series expressed
in different units of measure.
Linear trend
When the analyzed phenomenon changes by approximately the same absolute amount per unit of time, the general functional form that we can use to present that movement is the linear form:
$$Y = a + bX$$
where X is the independent variable (the time variable), Y is the dependent variable that represents the value of the trend, and a and b are the parameters to be estimated.
For the linear trend model, parameter a is the constant term, the estimated value of the trend for the period that precedes the first period (for $x_i = 0$). Parameter b shows the average change in the trend (y value) when the time variable x increases by one unit, that is, the absolute growth of the phenomenon per unit of time (usually a year).
According to the model of linear regression, the formulas for the parameters of the linear relationship obtained by LSM (based on the normal equations) are:
$$b = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}, \qquad a = \bar{y} - b \bar{x}$$
But if the periods follow one another without gaps, we can center the independent time variable so that $\sum x_i = 0$, and then the formulas for the parameters of the linear relationship are:
$$a = \frac{\sum y_i}{n}, \qquad b = \frac{\sum x_i y_i}{\sum x_i^2}$$
Transformation of the parameters of a linear annual trend to a monthly / quarterly level
The analysis of seasonal variation requires quarterly or monthly trend levels. If we have quarterly or monthly data, then we can determine the trend from the original time series. In practice, we usually determine the quarterly or monthly trend from the known trend function on an annual basis.
If a and b are parameters on an annual basis, while X refers to the

monthly time series, model for monthly trend will be:
If a and b are parameters on an annual basis, while X refers to the

quarterly time series, model for quarterly trend will be:
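Parabolic trend
When the movement of the phenomenon in the observed period shows a curvilinear tendency, we use the parabolic (quadratic or second-degree polynomial) trend. The parabolic trend equation is:
$$Y = a + bX + cX^2$$
The same rules for centering the independent time variable apply as in the linear trend model. Parameter a is the estimated intercept, parameter b is the estimated linear time effect on the dependent variable and parameter c is the estimated quadratic time effect on the dependent variable.
The parameters are estimated by LSM, which leads to the system of normal equations:
$$\sum y_i = n a + b \sum x_i + c \sum x_i^2$$
$$\sum x_i y_i = a \sum x_i + b \sum x_i^2 + c \sum x_i^3$$
$$\sum x_i^2 y_i = a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4$$
The parameters are calculated by solving this system of three equations with three unknowns (a, b and c).
If we center the independent time variable, so that $\sum x_i = \sum x_i^3 = 0$, then the formulas for calculating the parameters of the parabolic relationship are:
$$b = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad c = \frac{n \sum x_i^2 y_i - \sum x_i^2 \sum y_i}{n \sum x_i^4 - \left( \sum x_i^2 \right)^2}, \qquad a = \frac{\sum y_i - c \sum x_i^2}{n}$$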
Exponential trend
When the variable shows the same relative change in successive time intervals, we apply the exponential trend model. A basic trend that varies exponentially with time is the sign of an exponential trend.
The exponential trend equation is:
$$Y = a \cdot b^X$$
where $b - 1$ is the average rate of change.
LSM cannot be applied directly to the exponential trend model. First we have to linearize it:
$$\log Y = \log a + X \log b$$
After linearization the dependent variable is log y, the model is reduced to the linear form, and the formulas for the parameters with a centered independent variable are:
$$\log a = \frac{\sum \log y_i}{n}, \qquad \log b = \frac{\sum x_i \log y_i}{\sum x_i^2}$$
4.7.2. Trend isolation
The dynamics of a phenomenon is frequently the result of the influence of a number of factors. The factors that determine the movement can be divided into stable and volatile. Stable factors are those that act constantly and determine the long-term trend. It is possible to determine the stability of certain factors determining the trend. If we exclude the impact of the trend, we get the combined influence of the other (non-permanent) factors (the residuum). Exclusion or isolation of the trend is implemented as follows:
$$\frac{y_i}{y_{ti}} \cdot 100$$
The results of trend isolation are interpreted as follows:
• $\frac{y_i}{y_{ti}} \cdot 100 < 100$: influenced by the residuum, the phenomenon was below average
• $\frac{y_i}{y_{ti}} \cdot 100 = 100$: influenced by the residuum, the phenomenon was unchanged on average
• $\frac{y_i}{y_{ti}} \cdot 100 > 100$: influenced by the residuum, the phenomenon was above average.
In the period 2000-2008, we monitored phenomenon Gross domestic

product for FB&H33. Results are given in the next table:
Year GDP ('000 KM)

2000 6,722,631
2001 7,273,874
2002 7,942,665
2003 9,688,863
2004 10,321,440
2005 10,831,267
2006 12,146,338
2007 13,861,000
2008 15,632,000
When we present these data on a graph we get:
[Figure: Arithmetic diagram.]
According to the graph, we can conclude that a linear model is appropriate. The years follow one another without gaps, so the next step is to set up the independent time variable by centering. There are 9 years in the series; with an odd number of data, x is set up so that 0 falls in the middle of the series.
Centering independent time variable - odd number of data.
Year   y            x
2000   6,722,631    -4
2001   7,273,874    -3
2002   7,942,665    -2
2003   9,688,863    -1
2004   10,321,440   0
2005   10,831,267   1
2006   12,146,338   2
2007   13,861,000   3
2008   15,632,000   4
Now we can apply the linear trend model. First we need the sums from the working table:
Year    y            x    x2   x·y
2000    6,722,631    -4   16   -26,890,524
2001    7,273,874    -3   9    -21,821,622
2002    7,942,665    -2   4    -15,885,330
2003    9,688,863    -1   1    -9,688,863
2004    10,321,440   0    0    0
2005    10,831,267   1    1    10,831,267
2006    12,146,338   2    4    24,292,676
2007    13,861,000   3    9    41,583,000
2008    15,632,000   4    16   62,528,000
Total   94,420,078   0    60   64,948,604
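We can calculate the linear trend model coefficients (in '000 KM):
$$a = \frac{\sum y_i}{n} = \frac{94{,}420{,}078}{9} = 10{,}491{,}119.78$$
$$b = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{64{,}948{,}604}{60} = 1{,}082{,}476.73$$
The linear trend model is:
$$y_t = 10{,}491{,}119.78 + 1{,}082{,}476.73 \, x$$
The coefficients are interpreted as follows:
The expected GDP for x = 0 (for 2004) is 10,491,119,800 KM.
The average annual increase of GDP is 1,082,476,700 KM.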
We can measure the representativeness of the linear trend by the standard error of trend, which shows the average deviation of the empirical values of the series from the estimated trend values. First we have to calculate the predicted values for the given years.
Year    y            yt           (y - yt)2
2000    6,722,631    6,161,213    315,190,345,387.40
2001    7,273,874    7,243,690    911,099,344.89
2002    7,942,665    8,326,166    147,073,255,623.94
2003    9,688,863    9,408,643    78,523,223,491.56
2004    10,321,440   10,491,120   28,791,226,986.72
2005    10,831,267   11,573,597   551,053,103,066.46
2006    12,146,338   12,656,073   259,830,019,428.84
2007    13,861,000   13,738,550   14,994,007,942.22
2008    15,632,000   14,821,027   657,677,675,291.26
Total   94,420,078   94,420,078   2,054,043,956,563.29
Calculating standard error of trend.
The other way is to calculate the relative error of trend, which is given by the trend coefficient of variation:
Calculating trend coefficient of variation.
This value of the relative error of trend is low, so we can say that the linear trend model is representative of the original data set.
Now we can forecast the next period, for example 2010 (x = 6), assuming that the trend remains the same:
$$y_{2010} = 10{,}491{,}119.78 + 1{,}082{,}476.73 \cdot 6 = 16{,}985{,}980.2$$
Assuming that the trend remains the same, we can expect that GDP for 2010 will be 16,985,980,200 KM.
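The coefficients and the forecast can be reproduced with the centered-variable formulas. The following is a minimal sketch; the function `trend` and the variable names are illustrative, and all amounts are in '000 KM as in the table.

```python
# GDP of FB&H in '000 KM for 2000..2008.
years = list(range(2000, 2009))
gdp = [6_722_631, 7_273_874, 7_942_665, 9_688_863, 10_321_440,
       10_831_267, 12_146_338, 13_861_000, 15_632_000]

n = len(gdp)
# Centered time variable: -4, -3, ..., 4, so that sum(x) == 0.
x = [i - (n - 1) / 2 for i in range(n)]

# With a centered x the normal equations separate into two simple ratios.
a = sum(gdp) / n
b = sum(xi * yi for xi, yi in zip(x, gdp)) / sum(xi ** 2 for xi in x)

def trend(year):
    """Trend value for a calendar year; x = 0 corresponds to 2004."""
    return a + b * (year - 2004)

fitted = [trend(year) for year in years]
forecast_2010 = trend(2010)

print(round(a, 2), round(b, 2))
print(round(forecast_2010, 1))
```

Because x is centered, the fitted values sum to the same total as the original series, which matches the Total row of the working table.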
Finally, we can apply the trend isolation method. We need the original and the predicted data:
Application of trend isolation method.
Year   y            yt           (y / yt) · 100 (trend isolation)
2000   6,722,631    6,161,213    109.11
2001   7,273,874    7,243,690    100.42
2002   7,942,665    8,326,166    95.39
2003   9,688,863    9,408,643    102.98
2004   10,321,440   10,491,120   98.38
2005   10,831,267   11,573,597   93.59
2006   12,146,338   12,656,073   95.97
2007   13,861,000   13,738,550   100.89
2008   15,632,000   14,821,027   105.47
Where the trend isolation expression has a value higher than 100 (years: 2000, 2001, 2003, 2007, 2008), the residuum has a positive impact on the GDP movement. Where it has a value lower than 100 (years: 2002, 2004, 2005, 2006), the residuum has a negative impact on the GDP movement. We can present this on the graph:
[Figure: Graphical presentation of trend isolation.]
When the line for trend isolation is above 100, the residuum has a positive impact on the GDP movement. When the line for trend isolation is below 100, the residuum has a negative impact on the GDP movement.
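As a sketch of the trend isolation step, the ratios of original to trend values can be computed and split into years above and below 100. The linear trend is refitted here with the centered time variable so that the example is self-contained; all names are illustrative.

```python
# GDP of FB&H in '000 KM for 2000..2008.
years = list(range(2000, 2009))
gdp = [6_722_631, 7_273_874, 7_942_665, 9_688_863, 10_321_440,
       10_831_267, 12_146_338, 13_861_000, 15_632_000]

n = len(gdp)
x = [i - (n - 1) / 2 for i in range(n)]  # centered time variable
a = sum(gdp) / n
b = sum(xi * yi for xi, yi in zip(x, gdp)) / sum(xi ** 2 for xi in x)
trend = [a + b * xi for xi in x]

# Trend isolation: original value as a percentage of the trend value.
isolation = [yi / ti * 100 for yi, ti in zip(gdp, trend)]

above = [year for year, value in zip(years, isolation) if value > 100]
below = [year for year, value in zip(years, isolation) if value < 100]

for year, value in zip(years, isolation):
    print(year, round(value, 2))
print("above trend:", above)
print("below trend:", below)
```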
Example 4.10.
We have information about the actual gross revenues (in million dollars) of
one company for a period of 10 years:
Year Actual gross revenues (in million dollars)
1999 581
2000 581
2001 590
2002 620
2003 699
2004 781
2005 891
2006 992
2007 1110
2008 1148
To check which model reflects the trend in the movement of the observed
variable, we will create an arithmetic diagram for the time series.
Arithmetic diagram
According to the graph, we can conclude that a linear model is appropriate.
The data set is continuous, so the next step is to set up the independent time
variable by centering. There are 10 years in the series, and with an even number
of data points the values of x are set so that -0.5 and 0.5 fall on the two
middle years of the series.
Year y x
1999 581 -4.5
2000 581 -3.5
2001 590 -2.5
2002 620 -1.5
2003 699 -0.5
2004 781 0.5
2005 891 1.5
2006 992 2.5
2007 1110 3.5
2008 1148 4.5
Centering independent time variable - even number of data.
Now we can apply linear trend model. First we need sums from the
working table:
Year y x x² x·y
1999 581 -4.5 20.25 -2,614.5
2000 581 -3.5 12.25 -2,033.5
2001 590 -2.5 6.25 -1,475
2002 620 -1.5 2.25 -930
2003 699 -0.5 0.25 -349.5
2004 781 0.5 0.25 390.5
2005 891 1.5 2.25 1,336.5
2006 992 2.5 6.25 2,480
2007 1,110 3.5 12.25 3,885
2008 1,148 4.5 20.25 5,166
Total 7,993 0 82.5 5,855.5
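We can calculate the linear trend model coefficients. Because the centered
values of x sum to zero, the coefficients are:
a = Σy / n = 7,993 / 10 = 799.3
b = Σ(x·y) / Σx² = 5,855.5 / 82.5 = 70.98
yt = 799.3 + 70.98 · x
Calculating and interpreting linear trend model coefficients.
Determining linear trend model.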
Interpretation of coefficients is as follows:
Expected actual gross revenues for x = 0 (the midpoint between 2003 and
2004) are 799.3 million dollars.
The average annual increase of actual gross revenues is 70.98 million
dollars.
Using the Excel procedure, we will check the quality of the given linear trend model.
On the Excel graph, Add trend line will be selected, with the options for a linear
model, the equation and the R square value:
Calculating linear trend model using Excel.
Result is:
According to the R square value (0.9332), which is close to 1, we can say
that the linear model is representative of the given data set. Now we can make
a forecast for a future period, for example for 2011 (x = 7.5), assuming that the
trend remains the same:
yt = 799.3 + 70.98 · 7.5 = 1,331.65
Assuming that the trend remains the same, we can expect that actual
gross revenue in 2011 will be 1,331.65 million dollars.
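The calculation in this example can be sketched in a few lines of code. The sketch assumes the centered coding described above (x runs from -4.5 to 4.5), for which a is the mean of y and b is the sum of x·y divided by the sum of x². The function name trend is illustrative and not part of the original text.

```python
years = list(range(1999, 2009))
y = [581, 581, 590, 620, 699, 781, 891, 992, 1110, 1148]
n = len(y)

# Centered time variable for an even number of observations: -4.5, ..., 4.5.
x = [i - (n - 1) / 2 for i in range(n)]

# With sum(x) == 0 the least-squares coefficients simplify.
a = sum(y) / n
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def trend(year):
    """Trend value for a calendar year, using the same centered coding."""
    return a + b * (year - years[0] - (n - 1) / 2)

forecast_2011 = trend(2011)
print(f"a = {a:.1f}, b = {b:.2f}, forecast 2011 = {forecast_2011:.2f}")
```

The forecast differs from the rounded hand calculation only in the second decimal place, because the code does not round b to 70.98 before multiplying.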
Example 4.11.
The following annual time series for the number of passengers (in
millions) on a particular airline is given:
Year The number of passengers (in millions)
2000 30
2001 32.7
2003 36
2004 37.9
2005 39.2
2007 43.1
2008 45
2009 47.8
To check which model reflects the trend in the movement of the observed
variable, we will create an arithmetic diagram for the time series.
Arithmetic diagram
According to the graph, we can conclude that a linear model is appropriate.
The data set is not continuous (2002 and 2006 are missing), so the next step
is to set up the independent time variable without centering.
Year y x
2000 30 1
2001 32.7 2
2003 36 4
2004 37.9 5
2005 39.2 6
2007 43.1 8
2008 45 9
2009 47.8 10
Set up independent time variable without centering.
Now we can apply linear trend model. First we need sums from the
working table:
Year y x x² x·y
2000 30 1 1 30
2001 32.7 2 4 65.4
2003 36 4 16 144
2004 37.9 5 25 189.5
2005 39.2 6 36 235.2
2007 43.1 8 64 344.8
2008 45 9 81 405
2009 47.8 10 100 478
Total 311.7 45 327 1,891.9
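We can calculate coefficients for the linear trend model. Because x is not
centered, the general least-squares formulas are used:
b = (n · Σ(x·y) - Σx · Σy) / (n · Σx² - (Σx)²) = (8 · 1,891.9 - 45 · 311.7) / (8 · 327 - 45²) = 1,108.7 / 591 = 1.876
a = (Σy - b · Σx) / n = (311.7 - 1.876 · 45) / 8 = 28.41
yt = 28.41 + 1.876 · x
Calculating and interpreting linear trend model coefficients.
Determining linear trend model.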
Interpretation of coefficients is as follows:

Expected number of passengers for x = 0 (1999) is 28.41 million.
The average annual increase in the number of passengers is 1.876
million.
We will again use the same Excel procedure as in the previous example
to check quality of given linear trend model. The result is:
Graphical presentation of linear trend.
The value of the coefficient of determination is 99.54%. This means that the
estimated linear model fits the original data set almost perfectly. Now
we can make a forecast for a future period, for example for 2012 (x = 13),
assuming that the trend remains the same:
yt = 28.41 + 1.876 · 13 = 52.8
Assuming that the trend remains the same, we can expect that the
number of passengers in 2012 will be 52.8 million.
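When the time variable is not centered, both coefficients come from the general least-squares formulas. The sketch below is a minimal illustration that assumes the coding used in this example (x = year - 1999, so the missing years 2002 and 2006 are simply skipped); the forecast is evaluated at x = 13, which corresponds to 2012 under that coding.

```python
years = [2000, 2001, 2003, 2004, 2005, 2007, 2008, 2009]
y = [30, 32.7, 36, 37.9, 39.2, 43.1, 45, 47.8]

# Non-centered time variable: 2000 -> 1, 2001 -> 2, 2003 -> 4, ...
x = [yr - 1999 for yr in years]
n = len(y)

sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# General least-squares formulas for a straight line.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

forecast_2012 = a + b * (2012 - 1999)
print(f"a = {a:.2f}, b = {b:.3f}, forecast for 2012 = {forecast_2012:.1f}")
```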
Example 4.12.
Production in one branch of the economy was:
Year Production level (in 000 units)

2002 805
2003 615
2004 430
2005 200
2006 500
2007 850
2008 1150
The graph of the movement of this phenomenon in the given period has a parabolic
shape:
It is clear that we can apply a quadratic trend. We will use the Excel
procedure. On the graph line we will select Add trend line:
Then we will select Polynomial trend of order 2:
In Options we will select Display equation and R square:
We obtain the graph with the equation of the polynomial trend model:
According to the R square value (0.92), which is close to 1, we can say that
the polynomial model is representative of the given data set.
Example 4.13.
Data on retail trade turnover in company ABC in the period 2000-2008
are known (in millions of KM):
Year Trade
2000 395
2001 459
2002 558
2003 607
2004 751
2005 816
2006 956
2007 1,137
2008 1,328
Arithmetic diagram is:
The trend can be linear or exponential. We will use the Excel procedure for
both models and compare them.
For the linear trend model we will get:
But if we choose the exponential model in the option Add trend line, the result
will be:
The R square value is higher for the exponential model, and hence we will
apply the exponential model, y = 6E-174 · e^(0.2024·x), for forecasting retail
trade turnover in company ABC.
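The comparison of the two models can also be sketched without Excel. The code below computes R square for a straight line fitted to y and for a straight line fitted to ln(y), which is how an exponential trend is usually estimated. It is an assumption here that Excel reports R square for the exponential trend line on the log-transformed values, so the figures need not match the Excel output exactly; the helper name r_squared is illustrative.

```python
import math

y = [395, 459, 558, 607, 751, 816, 956, 1137, 1328]
x = list(range(len(y)))   # 0..8; R square does not depend on the origin of x

def r_squared(xs, ys):
    """Coefficient of determination of a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy ** 2 / (sxx * syy)

r2_linear = r_squared(x, y)
# Exponential trend y = c * e^(k*x) becomes linear after taking logarithms.
r2_exponential = r_squared(x, [math.log(v) for v in y])

print(f"linear R^2 = {r2_linear:.4f}, exponential R^2 = {r2_exponential:.4f}")
```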
4.1. Data about a certain phenomenon are collected for the period 1998-2004:
Year Value of phenomenon

1998 325
1999 338
2000 346
2001 342
2002 357
2003 359
2004 365
Calculate and explain the chain indices.
Solution:
Year Value Vt Chain index It/t-1
1998 325 /
1999 338 104.00
2000 346 102.37
2001 342 98.84
2002 357 104.39
2003 359 100.56
2004 365 101.67
Chain indices
It/t-1 = Vt / Vt-1 · 100 - chain index
I2000/1999 = 102.37: The value of the observed phenomenon in 2000
increased by 2.37% compared to 1999.
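Chain indices compare each value with the value of the previous period. A minimal sketch of the calculation, using the data of this problem, could look like this:

```python
values = {1998: 325, 1999: 338, 2000: 346, 2001: 342,
          2002: 357, 2003: 359, 2004: 365}
years = sorted(values)

# Chain index: current value as a percentage of the previous year's value.
chain = {
    cur: round(values[cur] / values[prev] * 100, 2)
    for prev, cur in zip(years, years[1:])
}

for year, index in chain.items():
    change = index - 100
    print(f"{year}: {index:.2f} ({change:+.2f}% compared to {year - 1})")
```

An index below 100, as in 2001, indicates a decrease compared to the previous year.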
4.2. Data about a certain phenomenon are collected in the period 1996-2002.
Year Value of phenomenon

1996 18
1997 21
1998 23
1999 24
2000 27
2001 26
2002 23
Calculate the base indices with the base in 1998. Explain.
Solution:
Year Value of phenomenon – Vt Base index It/98
1996 18 78.26
1997 21 91.30
1998 23 100.00
1999 24 104.35
2000 27 117.39
2001 26 113.04
2002 23 100.00
Base indices.
It/98 = Vt / V98 · 100 - base indices with the base in 1998
- The value of the phenomenon in 1997 was 8.70% lower
compared to 1998.
- The value of the phenomenon in 2000 increased by
17.39% compared to 1998.
- The value of the phenomenon in 2002 is the same as the
value of the phenomenon in 1998.
4.3. The number of graduate students at a certain faculty in the period
2000-2005 was:
Year Number of graduate students

2000 100
2001 112
2002 120
2003 127
2004 129
2005 133
a) Estimate and draw the trend line on the arithmetic diagram.

b) What is the expected number of graduate students in 2002?
Solution:
a) The arithmetic diagram and graphical presentation of trend line:
Year Number of graduate students - yi xi xi² xi · yi yti
2000 100 -5 25 -500 104.24
2001 112 -3 9 -336 110.61
2002 120 -1 1 -120 116.98
2003 127 1 1 127 123.36
2004 129 3 9 387 129.73
2005 133 5 25 665 136.10
Σ 721 0 70 223
yti = 120.17 + 3.186 · xi - linear trend equation
Interpretation of parameters a and b:
a = 120.17: If xi = 0 (the midpoint of the period, between 2002 and 2003), the
estimated number of graduate students is 120.17 ≈ 120.
b = 3.186: Every six months (Δx = 1), the number of graduate students
increases by 3.186, on average.
b) In 2002 (xi = -1): yti = 120.17 + 3.186 · (-1) = 116.98 ≈ 117 graduate students.
4.4. Values of investment in the car industry (000 $) in the period 1999-2003
are given in the following table:
Year Investment
1999 185
2000 187
2001 191
2002 188
2003 193
a) Calculate and explain the average absolute growth.

b) Calculate and explain the average annual growth rate.
c) If the same trend continues, how many years will it take for the level
of investment to increase by 68% compared to 1999?
d) If the same trend continues, what is the expected level of investment
in 2012?
Solution:
Year Investment ΔVt/t-1

1999 185 /
2000 187 2
2001 191 4
2002 188 -3
2003 193 5
a)
AAG = (193 - 185) / 4 = 2
Calculating and interpreting the average absolute growth.
Investment in the car industry increases by 2,000 $ annually, on average.
b)
r = ((193 / 185)^(1/4) - 1) · 100 = 1.06%
Investment in the car industry increases by 1.06% annually, on average.
c)
d)
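The four parts of this problem can be sketched in code. The printed solutions for c) and d) are not available in this extract, so the sketch below is only an illustration: it assumes that "the same trend" means geometric growth at the average annual growth rate, starting from the last observed value. The variable names are illustrative.

```python
import math

investment = {1999: 185, 2000: 187, 2001: 191, 2002: 188, 2003: 193}
first_year, last_year = min(investment), max(investment)
periods = last_year - first_year            # 4 annual changes
v_first, v_last = investment[first_year], investment[last_year]

# a) Average absolute growth (in 000 $ per year).
aag = (v_last - v_first) / periods

# b) Average annual growth rate, from the geometric mean of the growth factors.
growth_factor = (v_last / v_first) ** (1 / periods)
rate = (growth_factor - 1) * 100

# c) Years needed for a 68% increase at the average growth rate (assumption).
years_to_68 = math.log(1.68) / math.log(growth_factor)

# d) Expected level in 2012 if the average growth rate continues (assumption).
v_2012 = v_last * growth_factor ** (2012 - last_year)

print(f"a) AAG = {aag:.1f}   b) r = {rate:.2f}%")
print(f"c) about {years_to_68:.1f} years   d) V2012 = {v_2012:.1f}")
```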
4.5. Quantities and prices for the three products (A, B and C) in the period
1998-1999 are presented in the table below:
Product Quantity 1998 Quantity 1999 Price 1998 Price 1999
A 10 11 61 65
B 4 5 54 37
C 5 6 82 83
a) Applying the methods of generating units find Laspeyres's and

Paasche's price indices. Calculate and explain aggregate index of
value.
b) Determine Laspeyres's and Paasche's volume indices, using
previously obtained results. Interpret the results.
Solution:
q0 q1 p0 p1 p0 · q0 p1 · q1 p0 · q1 p1 · q0
10 11 61 65 610 715 671 650
4 5 54 37 216 185 270 148
5 6 82 83 410 498 492 415
Total 1236 1398 1433 1213
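a)
Laspeyres's price index: Σ(p1 · q0) / Σ(p0 · q0) · 100 = 1213 / 1236 · 100 = 98.14
According to Laspeyres, prices in 1999 decreased by 1.86% compared
to 1998.
Paasche's price index: Σ(p1 · q1) / Σ(p0 · q1) · 100 = 1398 / 1433 · 100 = 97.56
According to Paasche, prices in 1999 decreased by 2.44% compared to
1998.
Aggregate index of value: Σ(p1 · q1) / Σ(p0 · q0) · 100 = 1398 / 1236 · 100 = 113.11
The total value of the consumer basket in 1999 increased by 13.11%
compared to 1998.
b) According to the decomposition of the index of value, and the previously
obtained results:
Paasche's volume index: 113.11 / 98.14 · 100 = 115.25
According to Paasche, quantities in 1999 increased by 15.25% compared
to 1998.
Laspeyres's volume index: 113.11 / 97.56 · 100 = 115.94
According to Laspeyres, quantities in 1999 increased by 15.94%
compared to 1998.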
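The aggregate indices in this problem all come from the four sums in the working table. A minimal sketch, with an illustrative helper named total, shows how the indices relate to each other, including the decomposition of the value index used in part b):

```python
q0 = {"A": 10, "B": 4, "C": 5}     # quantities 1998
q1 = {"A": 11, "B": 5, "C": 6}     # quantities 1999
p0 = {"A": 61, "B": 54, "C": 82}   # prices 1998
p1 = {"A": 65, "B": 37, "C": 83}   # prices 1999

def total(prices, quantities):
    """Sum of price * quantity over all products."""
    return sum(prices[k] * quantities[k] for k in prices)

laspeyres_price = total(p1, q0) / total(p0, q0) * 100
paasche_price = total(p1, q1) / total(p0, q1) * 100
value_index = total(p1, q1) / total(p0, q0) * 100
laspeyres_volume = total(p0, q1) / total(p0, q0) * 100
paasche_volume = total(p1, q1) / total(p1, q0) * 100

print(f"price indices:  Laspeyres {laspeyres_price:.2f}, Paasche {paasche_price:.2f}")
print(f"volume indices: Laspeyres {laspeyres_volume:.2f}, Paasche {paasche_volume:.2f}")
print(f"value index:    {value_index:.2f}")
```

The value index equals the Laspeyres price index times the Paasche volume index (divided by 100), and equally the Paasche price index times the Laspeyres volume index.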
4.6. Value of phenomenon Y in a period of 7 years is given in the table
below:
Year Phenomenon Y
1997 28
1998 36
1999 33
2000 39
2001 41
2002 40
2003 45
a) Draw the arithmetic diagram.

b) Estimate the trend equation and explain the parameters.
c) Predict the value of phenomenon in 2009.
Solution:
t yi xi xi² xi · yi yti
1997 28 -3 9 -84 30.26
1998 36 -2 4 -72 32.65
1999 33 -1 1 -33 35.04
2000 39 0 0 0 37.43
2001 41 1 1 41 39.82
2002 40 2 4 80 42.21
2003 45 3 9 135 44.6
Σ 262 0 28 67
a) Arithmetic diagram:
b) yti = 37.43 + 2.39 · xi - the trend equation
If xi = 0 (in 2000), the estimated value of the phenomenon is 37.43. (The
actual value is 39.)
The value of the phenomenon increases by 2.39 annually (Δx = 1), on average.
c)
4.7. We followed the movement of monthly expenditure on personal
hygiene and obtained the following information:
Year Expenditure on personal hygiene (KM)

1998 150
1999 162
2000 170
2001 176
2002 180
a) Calculate and explain the basic indices with the base in 1999.
b) Calculate and explain relative change.
c) Calculate and explain the average growth rate.
d) Calculate and explain the average absolute growth.
e) If the same trend continues, what is the expected expenditure level in
2004?
f) When will the expenditure level double, compared to 1998?
Answer: c) r = 4.66%, Expenditure on personal hygiene increased by
4.66% annually, on average; d) AAG = 7.5; e) V2004 = 197.18 KM;
f) In 2014.
4.8. Data on meat production are presented in the table below:
Year Meat production (000 t)

1998 145
1999 136
2000 141
2001 145
2002 136
2003 131
2004 140
2005 132
a) Calculate and explain absolute change.

b) Calculate and explain the chain indices.
c) Calculate the average annual rate.
d) If the same trend continues, what is the expected level of meat
production in 2013?
Answer: c) r = -1.33%; d) V2013 = 118,560 tonnes
4.9. Data on meat, milk and cheese prices and quantities produced for
the period 1996-1998 are presented in the table below:
Products Production 1996 Production 1997 Production 1998 Price 1996 Price 1997 Price 1998
Meat (000 kg) 30 33 35 10.00 10.50 11.00
Milk (000 l) 25 27 30 1.10 1.20 1.25
Cheese (000 kg) 10 12 15 6.00 6.50 7.00
a) Calculate the indices of prices and quantities according to Laspeyres,
Paasche and Fisher for 1997 and 1998 compared to 1996.
b) Using the previous results, calculate the value indices for 1998
and 1997 compared to 1996.
c) Calculate the value index for 1998 compared to 1997.
Answer:
a)
b)
c)
4.10. Investment in a branch of the economy was:
Year Investments
1996 175
1997 250
1998 280
1999 300
2000 350
2001 400
2002 480
2003 565
2004 690
2005 720
a) Draw arithmetic diagram.
b) Evaluate and draw appropriate trend line.
c) Isolate the trend and explain the result.
Answer: b) yti = 421 + 30.3 · xi
4.11. Average net salary in Bosnia and Herzegovina in the period 1998-2003 was:
Year Net salary

1998 296
1999 343
2000 372
2001 408
2002 446
2003 484
a) Draw arithmetic diagram.

c) Isolate the trend and explain the result.
d) What level of average wages can be expected in 2006?
e) If the same trend continues, what level of average wages could be
expected in 2018?
Answer: b) yti = 391.5 + 18.36 · xi; d) 501.66; e) 703.662
4.12. Arrivals of tourists from Croatia in Bosnia and Herzegovina in
2004 per month were as follows:
Month (2004) Arrivals

1 1995
2 2070
3 2523
4 2209
5 2937
6 2478
7 3389
8 2291
9 2577

b) Through moving average evaluate trend and draw appropriate trend
line.
Answer: b) Note: use moving averages order 3
4.13. Number of children born monthly in 2004 in Bosnia and
Herzegovina was as follows:
Month (2004) Number of children born

1 2238
2 2554
3 2674
4 2621
5 2718
6 2993
7 3201
8 3075
9 3094

c) How many children could be expected in the 10th month of 2004?
Answer: b) yti = 2796.4 + 106.88 · xi; c) 3330.8 ≈ 3331
CHAPTER 5
PROBABILITY AND THEORETICAL DISTRIBUTIONS
5.1. INTRODUCTION
Probability is the branch of mathematics that studies the possible outcomes
of given events together with the outcomes' relative likelihoods and
distributions. In common usage, the word “probability” indicates the
chance that a particular event (or set of events) will occur, expressed
either on a linear scale from 0 (impossibility) to 1 (certainty) or as a
percentage between 0 and 100%. The analysis of events governed by
probability is called statistics.
An impossible event has a probability of 0 and a certain event has a
probability of 1. An uncertain event, which may or may not occur, has a
probability between 0 and 1.
5.2. RANDOM VARIABLES AND PROBABILITY DEFINITIONS
The outcome of a random trial, or of a number of trials, is a random
variable. A random variable can be thought of as a function
mapping the sample space of a random process to the real
numbers.
Broadly, there are two types of random variables: discrete and
continuous. A discrete random variable takes one value from a set of specific
values, each with some probability greater than zero. For discrete
variables there is a countable number of outcomes. A continuous
random variable can take any value in a range (e.g., any real number
between zero and one); the probability that it falls within an interval can be
greater than zero, while the probability of any single exact value is zero.
For continuous variables the number of outcomes is infinite.
An outcome of a trial that is of interest for research is an
event.
An event is each possible type of occurrence. If we assemble a deck of

52 playing cards and no jokers, and draw a single card from the deck,
then the sample space is a 52-element set, as each individual card is a
possible outcome. An event, however, is any subset of the sample space,
including any single-element set (an elementary event, of which there
are 52, representing the 52 possible cards drawn from the deck), the
empty set (which is defined to have probability of zero) and the entire
set of 52 cards i.e. the sample space itself (which is defined to have
probability of one). Other events are proper subsets of the sample space
that contain multiple elements. For example, potential events include:
“Red and black at the same time without being a joker” (0 elements),
“Red” (26 elements),
“The 5 of Hearts” (1 element),
“A King” (4 elements),
“A Face card” (12 elements),
“A Spade” (13 elements),
“A Face card or a red suit” (32 elements),
“A card” (52 elements).
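The element counts in this list can be verified by building the sample space explicitly and treating each event as a subset of it. The sketch below is an illustration only; the rank and suit labels and the helper names are chosen here and are not part of the original text.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUIT_COLOURS = {"Hearts": "red", "Diamonds": "red", "Clubs": "black", "Spades": "black"}
FACE_RANKS = {"J", "Q", "K"}

# Sample space: 52 (rank, suit) pairs, no jokers.
deck = set(product(RANKS, SUIT_COLOURS))

def is_red(card):
    return SUIT_COLOURS[card[1]] == "red"

def is_face(card):
    return card[0] in FACE_RANKS

# Each event is a subset of the sample space.
events = {
    "Red and black at the same time": {c for c in deck if is_red(c) and not is_red(c)},
    "Red": {c for c in deck if is_red(c)},
    "The 5 of Hearts": {("5", "Hearts")},
    "A King": {c for c in deck if c[0] == "K"},
    "A Face card": {c for c in deck if is_face(c)},
    "A Spade": {c for c in deck if c[1] == "Spades"},
    "A Face card or a red suit": {c for c in deck if is_face(c) or is_red(c)},
    "A card": set(deck),
}

def probability(event):
    """Classical probability: favourable outcomes over all outcomes."""
    return Fraction(len(event), len(deck))

for name, event in events.items():
    print(f"{name}: {len(event)} elements, p = {probability(event)}")
```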
There are 3 approaches to the subject of probability:
• A priori classical probability approach (classical definition)
The probability of success is based on prior knowledge of the
process involved. In the simplest case, where each outcome is
equally likely to happen, the probability of event A is:
p(A) = m / n, where m is the number of outcomes favourable to event A and n is
the total number of possible outcomes.
Consider a standard deck of cards that has 26 red and 26 black cards.
The probability of selecting a black card is:
p(black card) = 26 / 52 = 0.5
• Empirical classical probability approach (frequency definition)
In the previous example, which uses the a priori approach,
the number of successes and the number of outcomes are
known from the composition of the deck of cards. In the empirical
approach, the outcomes are based on observed data, not on
prior knowledge of a process.
For example, if a poll is taken and 57% of the respondents indicate that
they prefer candidate X, there is a 0.57 probability that a randomly
selected individual respondent prefers candidate X.
• Subjective probability approach
While the probability of a favorable event with the previous

two approaches was computed objectively, either from prior
knowledge or from actual data, subjective approach refers to
the chance of occurrence assigned to an event by a particular
individual. This chance will likely be different from the
subjective probability assigned by another individual.
For example, the development team of a new product may assign a

probability of 0.7 to the chance of success for that product while the
president of the company is less optimistic and assigns a probability of 0.45.
5.3. BASIC DEFINITIONS IN PROBABILITY AND NOTATION
Sample space is the collection of all the possible events.
For example, if we consider a die, there are 6 different faces (1, 2, 3, 4,
5, 6), and if we consider a coin, there are 2 possible events (heads or tails).
Simple event A is an event that can be described by a single
characteristic.
Simple or marginal probability p(A) refers to the probability
of occurrence of a simple event A.
For example, if we conduct an experiment with a coin, a simple event is “head”
and the probability of “head” is 1/2 = 0.5.
Complement of event A is the event that includes all events
that are not part of event A.
For example, consider a standard deck of cards. If event A is “a Face
card”, then the complement of event A is “a card that is not a Face
card”, and the probability of the event “a card that is not a Face card” is
1 - 12/52 = 40/52 ≈ 0.77.
Joint event is an event that has two or more characteristics.
For example, if we consider an experiment with a die, a joint event can be
“number greater than 4”, which means that this event consists of the simple
events 5 and 6. The probability of this event is 2/6 ≈ 0.33.
Two or more events are mutually exclusive if the occurrence
of any of them implies that the others cannot occur. Being
male and being female are mutually exclusive events.
When the outcome of one event does not affect the probability
of occurrence of another event, the events are independent.
For example, if we roll a die and toss a coin at the same time, the outcomes
are independent, as the outcome of the die does not influence the outcome
of the coin. However, if we select two cards from a deck without
replacement, the outcome of the second selection will be influenced by
the first selection. The probability of getting “a Face card” in the first
selection is 12/52. But the probability of getting “a Face card” in
the second selection depends on the outcome of the first selection and
it is:
12/51, if in the first selection the outcome was not “a Face card”,
11/51, if in the first selection the outcome was “a Face card”.
5.4. BASIC RELATIONSHIPS IN PROBABILITY
1. The probability of an event lies within the range 0 to 1.
p(A) = 0: an event cannot occur, or impossible event
p(A) = 1: an event will definitely always occur, or certain event
0 < p(A) < 1: an event will maybe occur, or uncertain event
If we conduct an experiment with a die:
Impossible events are “number lower than 1” or “number greater
than 6”.
Certain event is “number lower than 7”.
Uncertain event is “number 4 or 5”.
2. The sum of the probabilities of all possible outcomes from the
sample space is equal to 1.
If we conduct an experiment with a die: p(1) + p(2) + p(3) + p(4) + p(5) + p(6) = 6 · 1/6 = 1.
3. The sum of the probabilities of opposite events is equal to 1.
For example, if we have a group of 100 students: 40 from I year, 35 from
II year and 25 from III year, the probability that we will randomly select
a student from I year is 40/100 = 0.4 and the probability that we will
randomly select a student who is not from I year is 1 - 0.4 = 0.6.
4. The general multiplication rule:
• Independent events: p(A ∩ B) = p(A) · p(B), and vice versa: if
p(A ∩ B) = p(A) · p(B), the events A and B are independent.
For example, if two dice (black and white) are rolled, the events
“4 on the black die” and “2 on the white die” are independent and the
probability that these two events occur simultaneously is 1/6 · 1/6 = 1/36.
• Dependent events: p(A ∩ B) = p(A) · p(B/A), where p(B/A)
is the conditional probability that B will occur if A has already
occurred.
If we select two cards from a deck without replacement, the outcome
of the second selection will be influenced by the first selection. The
probability of getting “a Face card” in the first selection is 12/52.
But the probability of getting “a Face card” in the second selection
depends on the outcome of the first selection and it is:
i. 12/51, if in the first selection the outcome has not been “a Face
card”.
ii. 11/51, if in the first selection the outcome has been “a Face
card”.
iii. According to the general multiplication rule for dependent events, the
probability that both selected cards are Face cards is equal to:
12/52 · 11/51 = 132/2652 ≈ 0.0498.
5. The general addition rule: p(A ∪ B) = p(A) + p(B) - p(A ∩ B).
If two events are mutually exclusive, then p(A ∪ B) = p(A) + p(B).
For example, a hamburger chain found that 75% of all customers
use mustard, 80% use ketchup and 65% use both. The probability that a
particular customer will use at least one of these (mustard or ketchup or
both) is: 0.75 + 0.80 - 0.65 = 0.90.
Another example is rolling a die. If we roll a die, the probability that
we will get an odd number is: p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 0.5.
5.5. BASIC RELATIONSHIPS IN PROBABILITY EXAMPLES
Example 5.1.
A personnel officer has 8 candidates to fill 4 positions. 5 candidates are
men and 3 are women.
a) What is the probability that no woman will be hired?
b) What is the probability that at least one woman will be selected?
Determining probability by using combinatorics and classical approach.
Probability of opposite event.
Solution:
a) event A - no woman will be hired: p(A) = C(5,4) / C(8,4) = 5 / 70 ≈ 0.0714,
where C(n,k) is the number of combinations of k elements out of n
b) event Ā - at least one woman will be selected: p(Ā) = 1 - p(A) = 1 - 0.0714 = 0.9286
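The combinatorial calculation in this example can be checked with a few lines of code. The sketch uses the classical definition (favourable selections over all possible selections) and the complement rule; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

men, women, positions = 5, 3, 4

# a) No woman is hired: all 4 positions are filled from the 5 men.
p_no_woman = Fraction(comb(men, positions), comb(men + women, positions))

# b) At least one woman is hired: complement of "no woman is hired".
p_at_least_one_woman = 1 - p_no_woman

print(f"a) {p_no_woman} = {float(p_no_woman):.4f}")
print(f"b) {p_at_least_one_woman} = {float(p_at_least_one_woman):.4f}")
```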
Example 5.2.
It is estimated that 48% of all bachelor degrees are obtained by women
and that 17.5% of all bachelor degrees are in business. Also, 4.7% of all
bachelor degrees are obtained by women majoring in business.
a) Are the events “bachelor degree holder is a woman” and “bachelor
degree in business” statistically independent?
b) What is the probability that a randomly selected degree holder is a woman,
given that the degree is in business?
Investigation of events independency. General multiplication rule application.
Solution:
a) Let W be the event “bachelor degree holder is a woman” and B the event
“bachelor degree in business”. If the events are statistically independent, then
it has to be p(W ∩ B) = p(W) · p(B). Let's check this fact:
p(W) · p(B) = 0.48 · 0.175 = 0.084 ≠ 0.047 = p(W ∩ B)
These events are not independent.
b) According to the general multiplication rule:
p(W/B) = p(W ∩ B) / p(B) = 0.047 / 0.175 = 0.2686
The probability that a randomly selected degree holder is a woman, given that
the degree is in business, is 26.86%.
Example 5.3.
It is known that 90% of all personal computers of a particular model will
operate for at least 1 year before requiring repair. A manager purchases
3 of these computers. What is the probability that all 3 computers will work
for 1 year before any repair is needed?
Multiplication rule for three independent events.
Solution:
Event A – first computer will work for 1 year before any repair is needed
Event B – second computer will work for 1 year before any repair is needed
Event C – third computer will work for 1 year before any repair is needed
p(A ∩ B ∩ C) = p(A) · p(B) · p(C) = 0.9 · 0.9 · 0.9 = 0.729
Example 5.4.
Suppose that the probability that you will get an A in Statistics is 0.65
and that the probability that you will get an A in Organizational Behaviour
is 0.8. If these events are independent, what is the probability that:
a) you will get an A in both subjects.
b) you will get at least one A.
Illustration: general multiplication rule for independent events and general
addition rule for events that are not mutually exclusive.
Solution:
a) Events are independent: 0.65 · 0.8 = 0.52
Probability that you will get an A in both subjects is 52%.
b) 0.65 + 0.8 - 0.52 = 0.93
Probability that you will get at least one A is 93%.
Example 5.5.
In a large metropolitan area, a sample of 500 respondents was selected
to determine various information about consumer behavior. Among the
questions asked, one was: “Do you enjoy shopping for clothes?”. Out of
240 males, 136 answered affirmatively, while out of 260 females, 224
answered affirmatively.
Illustration of contingency table.
a) Set up a 2x2 contingency table to evaluate the probabilities.
b) Give an example of a simple event.
c) Give an example of a joint event.
d) What is the complement of “enjoys shopping for clothes”?
e) What is the probability that a respondent chosen at random:
i. is a male.
ii. enjoys shopping for clothes.
iii. is a female and enjoys shopping for clothes.
iv. is a male and does not enjoy shopping for clothes.
v. is a male or a female.
vi. is a female or does not enjoy shopping for clothes.
vii. is a male or enjoys shopping for clothes.
Solution:
a) Contingency table
Answer/gender Female Male Sum
Enjoys shopping for clothes 224 136 360
Does not enjoy shopping for clothes 36 104 140
Sum 260 240 500
b) Simple events:
respondent chosen at random is male
respondent chosen at random is female
respondent chosen at random enjoys shopping for clothes
respondent chosen at random does not enjoy shopping for clothes
c) Joint events:
a female and enjoys shopping for clothes
a female and does not enjoy shopping for clothes
a female or does not enjoy shopping for clothes
d) Complement of “enjoys shopping for clothes” is “does not enjoy shopping
for clothes”.
e) i. 240 / 500 = 0.48
ii. 360 / 500 = 0.72
iii. 224 / 500 = 0.448
iv. 104 / 500 = 0.208
v. (240 + 260) / 500 = 1
vi. (260 + 140 - 36) / 500 = 0.728
vii. (240 + 360 - 136) / 500 = 0.928
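The seven probabilities asked for in this example can all be read off the contingency table. The sketch below is an illustration that stores the four cell counts and counts the respondents satisfying each condition; the labels "enjoys" and "not" and the helper p are chosen here for the example.

```python
from fractions import Fraction

# Cell counts of the 2x2 contingency table: (gender, answer) -> respondents.
table = {
    ("female", "enjoys"): 224, ("male", "enjoys"): 136,
    ("female", "not"): 36,     ("male", "not"): 104,
}
total = sum(table.values())

def p(condition):
    """Probability that a randomly chosen respondent satisfies the condition."""
    favourable = sum(n for (gender, answer), n in table.items()
                     if condition(gender, answer))
    return Fraction(favourable, total)

results = {
    "i":   p(lambda g, a: g == "male"),
    "ii":  p(lambda g, a: a == "enjoys"),
    "iii": p(lambda g, a: g == "female" and a == "enjoys"),
    "iv":  p(lambda g, a: g == "male" and a == "not"),
    "v":   p(lambda g, a: g == "male" or g == "female"),
    "vi":  p(lambda g, a: g == "female" or a == "not"),
    "vii": p(lambda g, a: g == "male" or a == "enjoys"),
}

for item, value in results.items():
    print(f"{item}. {float(value):.3f}")
```

Counting cells directly gives the same result as the general addition rule, because a respondent who satisfies both conditions is counted only once.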
5.6. BAYES THEOREM
Bayes theorem relates the conditional and marginal
probabilities of events A and Bi:
p(Bi/A) = p(A/Bi) · p(Bi) / [p(A/B1) · p(B1) + p(A/B2) · p(B2) + ... + p(A/Bn) · p(Bn)]
where Bi is the i-th event of n mutually exclusive events
from the sample space, and the union of B1, B2, ..., Bn equals the entire sample space.
Bayes theorem defines the probability of event Bi occurring
given that event A has already occurred.
In practice, A is mainly consequence and Bi are the causes (assumption)

that precede A. In this case, Bayes theorem can provide the answer to the
question: “If consequence occurs, what is probability that it occurred as
a result of certain cause Bi?”
Example 5.6.
Suppose that a school has 60% boys and 40% girls. Half of the girls wear
trousers and the other half wear skirts, while all boys wear trousers. An
observer sees a (random) student from a distance; all they can see is that
this student is wearing trousers. What is the probability this student is
a girl?
Application of Bayes theorem.
Solution:
It is clear that the probability is less than 40%, but by how much? Is it
half that, since only half of the girls are wearing trousers? The correct
answer can be computed using Bayes' theorem.
Girls Boys Total

Trousers 20 60 80
Skirts 20 0 20
Total 40 60 100
The event A is that the observed student is a girl, and the event B is that
the observed student is wearing trousers. In order to compute p(A/B),
we first need to determine:
p(A), or the probability that the student is a girl regardless of any
other information. Since the observer sees a random student, all
students have the same probability of being observed,
and the fraction of girls among the students is 40%, so this probability
equals 0.4.
p(Ā), or the probability that the student is a boy regardless of any
other information (Ā is the complementary event to A). This is 60%,
or 0.6.
p(B/A), or the probability of the student wearing trousers given that
the student is a girl. As girls are likely to wear skirts and trousers
equally, this is 0.5.
p(B/Ā), or the probability that the student wears trousers given that the
student is a boy. This is 1.
p(B) is the probability that a (randomly selected) student
wears trousers regardless of any other information. Since
p(B) = p(B/A) · p(A) + p(B/Ā) · p(Ā), this probability is 0.5 · 0.4 + 1 · 0.6 = 0.8.
Given all this information, the probability of the observer having spotted
a girl given that the observed student is wearing trousers can be computed
by substituting these values in the formula:
p(A/B) = p(B/A) · p(A) / p(B) = 0.5 · 0.4 / 0.8 = 0.25
As expected, it is less than 40%, but more than half of 40%.
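The same calculation can be written as a small function that works for any number of mutually exclusive causes. This is a minimal sketch; the function name bayes_posterior and the dictionary keys are illustrative.

```python
def bayes_posterior(priors, likelihoods, target):
    """Posterior probability of `target` given the observed consequence.

    priors:      cause -> probability of the cause
    likelihoods: cause -> probability of the consequence given the cause
    """
    # Total probability of the consequence over all causes.
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    return priors[target] * likelihoods[target] / evidence

priors = {"girl": 0.4, "boy": 0.6}           # share of girls and boys
likelihoods = {"girl": 0.5, "boy": 1.0}      # probability of wearing trousers

p_girl_given_trousers = bayes_posterior(priors, likelihoods, "girl")
print(f"p(girl / trousers) = {p_girl_given_trousers:.2f}")
```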
5.7. PROBABILITY DISTRIBUTIONS
Frequency distribution formed by the group of population
units with the same characteristics is empirical distribution.
Distribution formed on the basis of theoretical propositions is
theoretical distribution.
Main characteristics of theoretical distributions are:
We assume them in a statistical model, or we use them in a
hypothesis that we have to test.
Theoretical distributions are given as analytic models with known
parameters: expectation, mode, median, standard deviation, skewness
and kurtosis.
Theoretical distributions are given as probability distributions.
Probability where we know the number of possible outcomes of an event
and the number of “success” realizations is a priori probability. But in
statistical research we mostly don't know the probability a priori, so we
use an experiment to gain the knowledge needed for probability calculations
(a posteriori). Hence, a posteriori probability is empirical or statistical
probability.
Empirical or a posteriori probability is the limit of the relative
frequency of the number of “successes” of event A, when the number
of trials tends to infinity.
p(A) = lim (n → ∞) m / n; m - number of “successes”, n - number of trials.
Cumulative function of random variable X is the probability that
X will take a value lower than or equal to some real number a:
F(a) = p(X ≤ a).
Cumulative function of discrete variable X is defined by:
F(a) = Σ p(xi), summed over all xi ≤ a
Cumulative function of continuous variable X has the general
form F(a) = ∫ f(x) dx, integrated from -∞ to a, and it is determined by
parameters such as expected value and variance. Here f(x) is the probability
density function of the continuous random variable.
If discrete variable X can take values from the set {x1, x2, ..., xn} with
probabilities p1, p2, ..., pn, where p1 + p2 + ... + pn = 1, the expected
value of X is:
E(X) = Σ xi · pi
For continuous variable expected value is:
E(X) = ∫ x · f(x) dx, integrated from -∞ to +∞
Variance for discrete variable is:
σ² = Σ (xi - E(X))² · pi
Variance for continuous variable is:
σ² = ∫ (x - E(X))² · f(x) dx, integrated from -∞ to +∞
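For a discrete variable, the expected value and the variance are probability-weighted sums. As a short illustration (a fair die is assumed here as the example variable; it is not taken from this part of the text), the calculation can be sketched as:

```python
from fractions import Fraction

# Discrete random variable: the result of rolling a fair die (illustrative example).
values = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6

# The probabilities of all possible outcomes must sum to 1.
total_probability = sum(probabilities)

# Expected value: sum of value * probability.
expected = sum(x * p for x, p in zip(values, probabilities))

# Variance: probability-weighted squared deviations from the expected value.
variance = sum((x - expected) ** 2 * p for x, p in zip(values, probabilities))

# Cumulative function at a = 4: probability that X <= 4.
cumulative_at_4 = sum(p for x, p in zip(values, probabilities) if x <= 4)

print(f"E(X) = {expected}, variance = {variance}, F(4) = {cumulative_at_4}")
```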
Probability distributions can be split into 2 groups:
• discrete probability distributions – deal with discrete variables
    binomial distribution
    Poisson distribution
    hypergeometric distribution
• continuous probability distributions – deal with continuous variables
    uniform distribution
    normal distribution
    Student (t) distribution
    χ² (chi-square) distribution
    F distribution.
The probability distribution of a random variable describes the
probability of all possible outcomes. The sum (integral) of these
probabilities equals 1.
5.8. BINOMIAL DISTRIBUTION
The binomial distribution is used when the discrete random variable of
interest is the number of successes obtained in an experiment that consists
of n observations. It is used to model situations that have the following
properties:
The experiment consists of a fixed number of observations – n.
Each observation is classified into one out of two mutually exclusive
categories, usually called “success” and “failure”.
The probability of an observation being classified as success, noted
as p, is constant from observation to observation. Thus, the probability
of an observation being classified as failure, noted as (1-p)=q, is
also constant over all observations.
The outcome (success or failure) of any observation is independent
of the outcome of any other observation.
The random variable related to each observation (trial, experiment
repetition), which can take either the value 1 (success) or 0 (failure), is
called a Bernoulli random variable.
Binomial distribution has two parameters:
n – number of observations, trials or experiment repetitions.
p – the probability of success (occurrence of a given event) in a single
observation, trial or experiment.
5.8.1. Probability distribution of a binomial random variable
The probability distribution of a binomial random variable is:
p(x) = C(n, x) · p^x · q^(n-x), where C(n, x) = n! / (x! · (n-x)!)
where x is the exact number of successes of interest and p(x) is the
probability that among n trials exactly x successes will be
realised (the given event will be realised exactly x times).
Figure 5.1. Binomial probability function for different values
of parameters n and p
Example 5.7.
Illustration of Binomial distribution.
An insurance broker believes that, for a particular contact, the probability of making a sale is 0.4. Suppose now that he has five contacts. What is the probability that he will realise three sales from these five contacts?
Solution:
If we define the event “sale is made” as a success (value 1) and “sale is not made” as a failure (value 0), then the variable X, “number of sales realised for the five contacts”, follows the Binomial distribution with n = 5 and p = 0.4:
$$p(3) = \binom{5}{3} \cdot 0.4^{3} \cdot 0.6^{2} = 10 \cdot 0.064 \cdot 0.36 = 0.2304$$
The probability that he will realise three sales among these five contacts is 23%.
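The calculation in Example 5.7 can be checked with a few lines of Python. This is only a minimal sketch: the function name `binomial_pmf` and the variable names are illustrative assumptions, not part of the text, and only the standard library is used.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Return P(X = x) for X ~ Binomial(n, p)."""
    if not 0 <= x <= n:
        return 0.0
    # Number of ways to place x successes among n trials,
    # times the probability of one such arrangement.
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Example 5.7: five contacts, probability of a sale 0.4 per contact.
n, p = 5, 0.4
p_three_sales = binomial_pmf(3, n, p)

# The probabilities over all possible outcomes 0..n must sum to 1.
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))

print(f"P(X = 3) = {p_three_sales:.4f}")
print(f"sum of all probabilities = {total:.4f}")
```

The same function can be reused for any n and p, which makes it easy to see how the shape of the distribution changes with the parameters.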
5.8.2. Characteristics of the Binomial distribution
The main characteristics of the Binomial distribution can be summarized as follows:
Mean: E(X) = n·p
Variance: σ² = n·p·q
Shape: the Binomial distribution can be symmetrical (if p = 0.5) or skewed (if p ≠ 0.5).
We distinguish four types of binomial distribution:
• symmetric, if p = q = 0.5;
• asymmetric, if p ≠ q;
• a priori, if the probabilities p and q are known;
• a posteriori, if p and q have to be found empirically.
Conditions for approximation of empirical distribution with binomial

distribution are:
Error of approximation is measure for quality of approximation.

According to modalities it is: where: f k is empirical frequency
and is theoretical frequency, so overall error of approximation is:
Example 5.8.
Determination of Binomial probabilities by using Excel.
Out of 1,000 products, 28 are defective. If we randomly select a sample of 14 products, what is the probability that:
a) exactly 4 products in the sample are defective;
b) at most 2 products in the sample are defective;
c) at least 4 products in the sample are defective?
Solution (using Excel):
The variable is dichotomous, so we apply the Binomial distribution with n = 14, p = 28/1,000 = 0.028 and modalities x: 0, 1, 2, 3, 4, ..., 14.
We will use Excel function:
a) exactly 4 defective products in the sample
We ask for the probability at a point, not for the cumulative function, so for the option Cumulative we enter FALSE.
{=BINOMDIST(4;14;0.028;FALSE)} = 0.000463, i.e. 0.0463%
b) at most 2 defective products in the sample (0, 1 or 2 defective products); this is a cumulative probability, so for the option Cumulative we enter TRUE.
{=BINOMDIST(2;14;0.028;TRUE)} = 0.993662, i.e. 99.3662%
c) at least 4 defective products in the sample (4, 5 or more defective products); this is the opposite of the cumulative event “at most 3 defective products” (0, 1, 2 or 3 defective products). The probabilities of an event and its opposite sum to 1, so we use Excel to get the probability of the opposite event and subtract it from 1:
1 − {=BINOMDIST(3;14;0.028;TRUE)} = 1 − 0.999509 = 0.000491, i.e. 0.0491%
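For readers without Excel, the three probabilities of Example 5.8 can be reproduced with the sketch below. The helper names `binomial_pmf` and `binomial_cdf` are assumptions made for illustration; they play the role of BINOMDIST with Cumulative set to FALSE and TRUE respectively.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x), the analogue of BINOMDIST(x; n; p; FALSE)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x), the analogue of BINOMDIST(x; n; p; TRUE)."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


# Example 5.8: sample of 14 products, 28 defective out of 1,000.
n, p = 14, 28 / 1000

p_exactly_4 = binomial_pmf(4, n, p)        # a) exactly 4 defective
p_at_most_2 = binomial_cdf(2, n, p)        # b) 0, 1 or 2 defective
p_at_least_4 = 1 - binomial_cdf(3, n, p)   # c) opposite of "at most 3"

print(f"a) {p_exactly_4:.6f}")
print(f"b) {p_at_most_2:.6f}")
print(f"c) {p_at_least_4:.6f}")
```

Part c) illustrates the opposite-event rule: the cumulative probability up to 3 is subtracted from 1.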
5.9. POISSON DISTRIBUTION
The Poisson distribution is a useful discrete probability distribution when you are interested in the number of times a certain event will occur in a given unit of time or area. This type of situation occurs frequently in business. For example, a quality assurance manager is interested in the number of noticeable surface defects of a new product. The distribution is used to model situations that have the following properties:
• We are interested in the number of times a particular event occurs in a given area of opportunity. The area of opportunity is defined by time, length, surface area and so forth.
• The probability that an event occurs in a given area of opportunity is the same for all of the areas of opportunity.
• The number of events that occur in one area of opportunity is independent of the number of events that occur in another area of opportunity.
• The probability that two or more events will occur in an area of opportunity approaches zero as the area of opportunity becomes smaller.
The Poisson distribution is determined by one parameter, λ, which is the average or expected number of events per unit.
5.9.1. Probability distribution of Poisson random variable
The probability distribution of a Poisson random variable is:
$$p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots$$
where:
• x is number of events per unit (number of successes per unit)
• p(x) is the probability of x successes given the knowledge of λ
• λ is the average or expected number of events per unit
(average or expected number of successes per unit)
• e=2.71828 (constant)
Figure 5.2. Poisson probability function for different values of parameter λ
The horizontal axis is the index k. The function is only defined at integer
values of k (empty lozenges). The connecting lines are only guides for the eye.
Example 5.9.
Recognition of and calculation of Poisson probabilities.
If the probability that an individual suffers a bad reaction from injection of a given serum is 0.001, determine the probability that out of 2,000 individuals
a) exactly 3
b) more than 2
individuals will suffer a bad reaction.
Solution:
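p = 0.001 is the probability that an individual suffers a bad reaction from injection of a given serum (rare event → Poisson distribution), so λ = n·p = 2,000 · 0.001 = 2.
a)
$$p(3) = \frac{2^{3} e^{-2}}{3!} = 0.1804$$
There is an 18% chance that exactly 3 out of 2,000 individuals will suffer a bad reaction.
b)
$$P(X > 2) = 1 - [p(0) + p(1) + p(2)] = 1 - e^{-2}(1 + 2 + 2) = 1 - 0.6767 = 0.3233$$
There is a 32.3% chance that more than 2 out of 2,000 individuals will suffer a bad reaction.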
Example 5.10.
Determining of parameter and calculation of Poisson probabilities.
Suppose that, on average, three customers arrive per minute at the bank between noon and 1 p.m. What is the probability that exactly two customers will arrive in a given minute?
Solution:
We are interested in the number of times a certain event will occur in a given unit of time → Poisson distribution with λ = 3.
$$p(2) = \frac{3^{2} e^{-3}}{2!} = 0.2240$$
There is a 22.4% probability that exactly two customers will arrive in a given minute.
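Examples 5.9 and 5.10 can be verified numerically. The following is a minimal sketch using only the standard library; `poisson_pmf` and the variable names are illustrative assumptions.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Return P(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)


# Example 5.9: rare event with n = 2,000 and p = 0.001, so lambda = n * p = 2.
lam_serum = 2000 * 0.001
p_exactly_3 = poisson_pmf(3, lam_serum)
# "More than 2" is the opposite event of "0, 1 or 2".
p_more_than_2 = 1 - sum(poisson_pmf(k, lam_serum) for k in range(3))

# Example 5.10: on average three customers per minute, so lambda = 3.
p_two_customers = poisson_pmf(2, 3)

print(f"Example 5.9 a) {p_exactly_3:.4f}")
print(f"Example 5.9 b) {p_more_than_2:.4f}")
print(f"Example 5.10   {p_two_customers:.4f}")
```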
5.9.2. Characteristics of the Poisson distribution
The main characteristics of the Poisson distribution are:
Shape: the Poisson distribution is always positively (right) skewed.
Mean: E(X) = λ
Variance: σ² = λ
The Poisson distribution can be derived as a limiting case of the

binomial distribution as the number of trials goes to infinity and
the expected number of successes remains fixed. Therefore it
can be used as an approximation of the binomial distribution if n
is sufficiently large and p is sufficiently small. There is a rule of
thumb stating that the Poisson distribution is a good approximation
of the binomial distribution if n is at least 20 and p is smaller than or
equal to 0.05. According to this rule the approximation is excellent if
n ≥ 100 and n· p ≤ 10.
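How close the two distributions are under this rule of thumb can be seen by computing both side by side. The sketch below is an illustration only: it borrows n = 2,000 and p = 0.001 from Example 5.9 (n ≥ 100 and n·p = 2 ≤ 10), and the function names are assumptions.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    return lam**x * exp(-lam) / factorial(x)


n, p = 2000, 0.001   # large n, small p
lam = n * p          # the expected number of successes stays fixed at n * p

max_abs_diff = 0.0
for k in range(11):
    exact = binomial_pmf(k, n, p)
    approx = poisson_pmf(k, lam)
    max_abs_diff = max(max_abs_diff, abs(exact - approx))
    print(f"k = {k:2d}  binomial = {exact:.6f}  Poisson = {approx:.6f}")

print(f"largest absolute difference: {max_abs_diff:.6f}")
```

Repeating the comparison with a larger p (for example n = 20, p = 0.4) shows visibly larger differences, which is why the rule requires p to be small.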
Example 5.11.
Calculation of Poisson probabilities by using Excel.
If the probability that a randomly selected person is colour blind is 0.3%, what is the probability that among 2,800 persons we will find:
a) exactly 4 colour blind persons;
b) more than 3 colour blind persons;
c) not more than 2 colour blind persons?
Solution (by Excel):
Rare event → Poisson distribution with λ = n·p = 2,800 · 0.003 = 8.4
We will use Excel function:
a) exactly 4 colour blind persons
We ask for the probability at a point, not for the cumulative function, so for the option Cumulative we enter FALSE.
P (X = 4) = {=POISSON(4;8.4;FALSE)} = 0.046648, i.e. 4.6648%
b) more than 3 colour blind persons; this is the opposite of a cumulative probability, so for the option Cumulative we enter TRUE and then find the probability of the opposite event:
P (X > 3) = 1 − P (X ≤ 3) = 1 − {=POISSON(3;8.4;TRUE)} = 1 − 0.03226 = 0.96774, i.e. 96.774%
c) not more than 2 colour blind persons; this is a cumulative probability, so for the option Cumulative we enter TRUE.
P (X ≤ 2) = {=POISSON(2;8.4;TRUE)} = 0.010047, i.e. 1.0047%
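The Excel results of Example 5.11 can be cross-checked in Python. In this sketch `poisson_pmf` and `poisson_cdf` are assumed helper names that mirror POISSON with Cumulative set to FALSE and TRUE.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x), the analogue of POISSON(x; lam; FALSE)."""
    return lam**x * exp(-lam) / factorial(x)


def poisson_cdf(x: int, lam: float) -> float:
    """P(X <= x), the analogue of POISSON(x; lam; TRUE)."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


# Example 5.11: 2,800 persons, probability of colour blindness 0.3%.
lam = 2800 * 0.003   # 8.4

p_exactly_4 = poisson_pmf(4, lam)         # a)
p_more_than_3 = 1 - poisson_cdf(3, lam)   # b) opposite of "at most 3"
p_at_most_2 = poisson_cdf(2, lam)         # c)

print(f"a) {p_exactly_4:.6f}")
print(f"b) {p_more_than_3:.5f}")
print(f"c) {p_at_most_2:.6f}")
```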
5.10. HYPERGEOMETRIC DISTRIBUTION
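The hypergeometric distribution H (N, n, p) is the distribution of the sum of n dependent Bernoulli random variables. It describes sampling without replacement. The symbols are:
• N - number of elements in the population
• M - number of elements in the population with characteristic A
• n - number of elements in the sample
• k - number of elements in the sample with characteristic A
• p = M/N - proportion of elements in the population with characteristic A
• P (X = k) is the probability that, in a sample from a particular population, k elements have characteristic A:
$$P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$$
The expectation and variance are:
$$E(X) = n p, \qquad \sigma^{2} = n p q \, \frac{N-n}{N-1}, \qquad q = 1 - p$$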
This distribution is applied in sampling procedures. When n/N < 1/10, the hypergeometric distribution can be approximated by the binomial distribution.
Example 5.12.
Illustration of Hypergeometric distribution.
A firm employs 10 economists and 22 employees with other vocations. What is the probability that, in a sample of 8 employees, 3 employees are not economists?
Solution:
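In this example, the characteristic A is “employee is not an economist”.
N = 32, n = 8, M = 22, k = 3
$$P(X = 3) = \frac{\binom{22}{3}\binom{10}{5}}{\binom{32}{8}} = \frac{1540 \cdot 252}{10518300} = 0.0369$$
The probability that 3 employees in a sample of 8 are not economists is 3.7%.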
Example 5.13.
Calculation of Hypergeometric probabilities by using Excel.
Out of 30 products in a population, 30% are defective. We choose a sample of 4 products without replacement. What is the probability that there will be not more than 2 defective products?
Solution (by using Excel):
30% defective → there are 9 defective products in the population
Without replacement → dependent events → hypergeometric distribution.
Not more than 2 defective products → 0 or 1 or 2 defective products
We apply the Excel function for the hypergeometric distribution:
probability that we select 0 defective products
{=HYPGEOMDIST(0;4;9;30)} = 0.218391, i.e. 21.84%
probability that we select 1 defective product
{=HYPGEOMDIST(1;4;9;30)} = 0.436782, i.e. 43.68%
probability that we select 2 defective products
{=HYPGEOMDIST(2;4;9;30)} = 0.275862, i.e. 27.59%
Finally, the probability that there are not more than 2 defective products is the sum of the previously computed probabilities (“or” probability for mutually exclusive events): 0.931034, i.e. 93.1%
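Both hypergeometric examples (5.12 and 5.13) can be recomputed with exact binomial coefficients. This is a hedged sketch: `hypergeom_pmf` and its keyword names are illustrative assumptions that follow the symbols N, M, n and k used in this section.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, M: int, n: int) -> float:
    """P(X = k): k elements with characteristic A in a sample of n,
    drawn without replacement from N elements, M of which have A."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


# Example 5.12: 32 employees, 22 are not economists, sample of 8, k = 3.
p_three_non_economists = hypergeom_pmf(3, N=32, M=22, n=8)

# Example 5.13: 30 products, 9 defective, sample of 4, at most 2 defective.
p_by_count = [hypergeom_pmf(k, N=30, M=9, n=4) for k in range(3)]
p_at_most_2_defective = sum(p_by_count)

print(f"Example 5.12: {p_three_non_economists:.4f}")
for k, prob in enumerate(p_by_count):
    print(f"Example 5.13: P(X = {k}) = {prob:.6f}")
print(f"Example 5.13: P(X <= 2) = {p_at_most_2_defective:.6f}")
```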
5.11. NORMAL DISTRIBUTION
The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family is defined by two parameters, location and scale: the mean (“average”, μ) and the variance (standard deviation squared, σ²) respectively.
The continuous probability density function of the normal distribution is the Gaussian function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \qquad -\infty < x < \infty,$$
where σ > 0 is the standard deviation and the real parameter μ is the expected value. To indicate that a real-valued random variable X is normally distributed with mean μ and variance σ² > 0, we write
$$X \sim N(\mu; \sigma^{2})$$
Proof:
since it is:
as integral of odd function on symmetric interval, and
We used following:
as integral for odd function on symmetric interval,
Since it is:
and
Finally:
The standard normal distribution is the normal distribution with a mean of zero and a variance of one (the red curve on the plot below).
According to transformation formula, it will be:
Figure 5.3. Normal probability density function
The red line is the standard normal distribution
The probability density function has notable properties including:
• symmetry around its mean μ
• the mode and median both equal to the mean μ
• the inflection points of the curve occur one standard deviation away from the mean, i.e. at μ − σ and μ + σ.
The cumulative distribution function of a probability distribution, evaluated at a number (lower-case) x_i, is the probability of the event that a random variable X with normal distribution is less than or equal to x_i. The cumulative distribution function of the normal distribution is expressed in terms of the density function as follows:
$$F(x_i) = P(X \le x_i) = \int_{-\infty}^{x_i} f(x)\,dx$$
Figure 5.4. Cumulative distribution function of the normal distribution
Likewise, the cumulative distribution function evaluated at a number (lower-case) z_i is the probability of the event that a random variable Z with standardized normal distribution is less than or equal to z_i. The cumulative distribution function of the standardized normal distribution (red line) is expressed in terms of the density function as follows:
$$F(z_i) = P(Z \le z_i) = \int_{-\infty}^{z_i} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^{2}}{2}}\,dz$$
There are tables with values of cumulative distribution function of the

standardized normal distribution.
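Instead of reading a table, the cumulative distribution function can be evaluated numerically. The sketch below relies on the known identity F(z) = ½·[1 + erf(z/√2)] between the standardized normal cumulative function and the error function from Python's `math` module; the function names are illustrative assumptions.

```python
from math import erf, sqrt


def standard_normal_cdf(z: float) -> float:
    """F(z) = P(Z <= z) for Z ~ N(0; 1), via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """F(x) = P(X <= x) for X ~ N(mu; sigma^2), using z = (x - mu) / sigma."""
    return standard_normal_cdf((x - mu) / sigma)


# A few values as they would appear in a standardized normal table.
for z in (0.0, 0.5, 1.0, 1.96):
    print(f"F({z:.2f}) = {standard_normal_cdf(z):.4f}")

# Symmetry: F(-z) = 1 - F(z).
print(f"F(-1.00) = {standard_normal_cdf(-1.0):.4f}")
```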
5.11.1. Rules for standardized normal distribution
The rules for determining standardized normal distribution probabilities are:
1.
2.
3.
4.
The next two graphs illustrate the determination of the area under the curve (probability) for the standardized normal distribution:
Figure 5.5. Determination of area under standardized normal density function
Graph 5.6. Determination of area under standardized normal density function
5.11.2. Characteristic intervals for normal distribution
If X ~ N (0;1), then we have characteristic intervals for distances of one, two and three standard deviations from the mean:
$$P(-1 < X < 1) = 0.6827$$
$$P(-2 < X < 2) = 0.9545$$
$$P(-3 < X < 3) = 0.9973$$
Figure 5.7. Illustration of three characteristic intervals of normal probability distribution
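The probabilities of the three characteristic intervals can be recomputed directly. This is a small sketch, assuming the identity P(−k < Z < k) = erf(k/√2) for a standardized normal variable; `prob_within` is a hypothetical helper name.

```python
from math import erf, sqrt


def prob_within(k: float) -> float:
    """P(-k < Z < k) for Z ~ N(0; 1), i.e. within k standard deviations."""
    return erf(k / sqrt(2.0))


intervals = {k: prob_within(k) for k in (1, 2, 3)}

for k, prob in intervals.items():
    print(f"within {k} standard deviation(s) of the mean: {prob:.4f}")
```

Because standardization maps any N(μ; σ²) variable to N(0; 1), the same three probabilities hold for the intervals μ ± σ, μ ± 2σ and μ ± 3σ of any normal distribution.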

Example 5.14.
Illustration of Normal distribution and standardized Normal distribution. Application of standardized Normal distribution rules. Using of statistical tables.
The tread life of a certain brand of tire has a normal distribution with mean 35,000 miles and standard deviation 4,000 miles. For a randomly selected tire, what is the probability that its life is:
a) less than 37,200 miles
b) more than 38,000 miles
c) between 30,000 and 36,000 miles
d) less than 34,000 miles
e) more than 33,000 miles.
Solution:
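a)
$$P(X < 37200) = P\left(Z < \frac{37200 - 35000}{4000}\right) = P(Z < 0.55) = F(0.55) = 0.7088$$
b)
$$P(X > 38000) = P(Z > 0.75) = 1 - F(0.75) = 1 - 0.7734 = 0.2266$$
c)
$$P(30000 < X < 36000) = P(-1.25 < Z < 0.25) = F(0.25) - [1 - F(1.25)] = 0.5987 - 0.1056 = 0.4931$$
d)
$$P(X < 34000) = P(Z < -0.25) = 1 - F(0.25) = 1 - 0.5987 = 0.4013$$
e)
$$P(X > 33000) = P(Z > -0.5) = F(0.5) = 0.6915$$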
Or we can get the solutions by using Excel functions:
Solution by using Excel.
First we have to standardize, i.e. transform x into z.
We use Excel function:
For probabilities with z scores we use Excel function:
a) less than 37,200 miles
First we standardize, i.e. transform x into z:
This is a table value of the cumulative function, because z is positive and the relation is <.
We do not ask for the probability at a point but for the cumulative function, so for the option Cumulative we enter TRUE:
b) more than 38,000 miles
First we standardize, i.e. transform x into z:
This is not a table value of the cumulative function, because z is positive and the relation is >. We do not look for the probability at a point but for the cumulative function, so for the option Cumulative we enter TRUE and at the end apply the formula for the opposite event:
And the formula for the opposite event:
c) between 30,000 and 36,000 miles
First, standardization is applied:
and
Now we find the cumulative probabilities for the z scores:
and
Now we complete the formula:
d) less than 34,000 miles
The first step is standardization:
Then we find the cumulative probability:
e) more than 33,000 miles
The first step is the transformation of x into z:
Then we calculate the cumulative probability:
This is the opposite event:
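As an alternative to tables and Excel, the five probabilities of Example 5.14 can be computed with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). This is a sketch, not the method used in the text; the variable names are assumptions.

```python
from statistics import NormalDist

# Tread life: normal with mean 35,000 miles and standard deviation 4,000 miles.
tread_life = NormalDist(mu=35_000, sigma=4_000)

p_a = tread_life.cdf(37_200)                            # less than 37,200
p_b = 1 - tread_life.cdf(38_000)                        # more than 38,000
p_c = tread_life.cdf(36_000) - tread_life.cdf(30_000)   # between 30,000 and 36,000
p_d = tread_life.cdf(34_000)                            # less than 34,000
p_e = 1 - tread_life.cdf(33_000)                        # more than 33,000

for label, prob in zip("abcde", (p_a, p_b, p_c, p_d, p_e)):
    print(f"{label}) {prob:.4f}")
```

`cdf` works directly on x, so the explicit standardization step is done internally; the z scores are only needed when a printed table is used.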
Example 5.15.
Scores on an examination taken by a very large group of students are normally distributed with mean 700 and standard deviation 120. It is decided to give a failing grade to the 5% of students with the lowest scores. What is the minimum score needed to avoid a failing grade (or the maximum score that means a failing grade)?
Solution:
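$$P(X < x_0) = 0.05 \;\Rightarrow\; P(Z < z_0) = 0.05 \;\Rightarrow\; z_0 \approx -1.65$$
we made transformation from z to x:
$$x_0 = \mu + z_0 \sigma = 700 - 1.65 \cdot 120 = 502$$
The minimum score needed to avoid a failing grade is 502.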
We can also use an Excel function to obtain the result. This is the inverse situation: we know the probability and need to find z and x for that probability. We use the Excel function NORMINV:
Example 5.16.
A journal editor finds that the length of time that elapses between receipt of a manuscript and a decision on publication follows a normal distribution with mean 18 weeks and standard deviation 4 weeks. If the probability that it will take longer is 0.2, how many weeks will pass before a decision on a manuscript is made?
Solution:
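$$P(X > x_0) = 0.2 \;\Rightarrow\; P(Z < z_0) = 0.8 \;\Rightarrow\; z_0 \approx 0.84$$
we made transformation from z to x:
$$x_0 = \mu + z_0 \sigma = 18 + 0.84 \cdot 4 = 21.36 \approx 21.4$$
21.4 weeks will pass before a decision on a manuscript is made.
We can also use an Excel function. The required probability is the opposite of the table cumulative value, so we find z for the table value (1 − 0.2) = 0.8.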
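The inverse problems of Examples 5.15 and 5.16 (probability known, x unknown) can be sketched with `NormalDist.inv_cdf` from the standard `statistics` module, which plays a role similar to NORMINV. The variable names are illustrative assumptions.

```python
from statistics import NormalDist

# Example 5.15: scores ~ N(700; 120^2); the lowest 5% of scores fail.
exam_scores = NormalDist(mu=700, sigma=120)
failing_cutoff = exam_scores.inv_cdf(0.05)

# Example 5.16: decision time ~ N(18; 4^2); P(X > x0) = 0.2, so P(X <= x0) = 0.8.
decision_time = NormalDist(mu=18, sigma=4)
weeks = decision_time.inv_cdf(1 - 0.2)

print(f"Example 5.15: cut-off score = {failing_cutoff:.2f}")
print(f"Example 5.16: weeks = {weeks:.2f}")
```

The cut-off comes out slightly above 502 because `inv_cdf` uses the unrounded z score (about −1.645), whereas a table lookup typically works with z rounded to two decimals.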
5.12. STUDENT t-DISTRIBUTION
The t distribution was constructed by W. S. Gosset in 1908. He published it under the pseudonym “Student”, hence the distribution is named the Student t distribution. He created the distribution while working on results of sampling methods.
Density function is:
where is beta-function with parameters
and n is the number of elements.
With the cumulative function F(t) we can compute the probability that the variable takes a value equal to or lower than a fixed t, and we can use tables with the appropriate probabilities.
The shape of the t distribution depends on n. The number of degrees of freedom, denoted ν (nu), is the number of independent observations minus the number of parameters that define the distribution: ν = n − 1.
The Student distribution is wider than the normal distribution. For greater values of n (more than 30) the Student distribution tends to the standardized normal distribution.
Unlike the normal distribution, the t distribution has no direct application in concrete problems, but it is very important for inferential statistics.
Example 5.17.
Illustration of Student distribution. Solution by using statistical tables and Excel.
For degrees of freedom n = 9, we have to find t0, for .
For the same distribution we have to determine the probability function if t = 2.54.
Solution:
Or we can do that by using the Excel procedure. This is the inverse situation: we know the area (probability) between two symmetric t scores, hence the two-tailed Excel function will be used:
We will calculate t for the opposite event:
Now we have to find the probability function and the cumulative probability if t = 2.54. We use the function TDIST:
Cumulative probability if t = 2.54 is equal to (1-0.032)=0.968
5.13. CHI-SQUARE (χ2) DISTRIBUTION
The chi-square distribution is applied when a decision has to be made on whether the actual (observed) and theoretical (expected) frequencies, or values of a variable (characteristic), differ significantly.
Denoted by the Greek letter chi (χ), it is defined as the sum of the squared differences between the observed and expected values relative to the expected values, that is
$$\chi^{2} = \sum_{i} \frac{(m_i - e_i)^{2}}{e_i}$$
m_i - observed frequency
e_i - expected (theoretical) frequency.
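The definition of the chi-square statistic translates into a one-line sum. The sketch below uses made-up frequencies purely for illustration (they do not come from the text), and `chi_square_statistic` is a hypothetical helper name.

```python
def chi_square_statistic(observed, expected):
    """Sum of squared differences between observed (m_i) and expected (e_i)
    frequencies, each divided by the expected frequency."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((m - e) ** 2 / e for m, e in zip(observed, expected))


# Hypothetical data: 100 observations spread over 5 categories,
# with 20 expected in each category under the theoretical distribution.
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

The larger the statistic, the worse the agreement between observed and expected frequencies; whether it is significant is judged against the chi-square table for the appropriate number of degrees of freedom.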
This distribution takes values from 0 to ∞ (never negative) and depends on the number of degrees of freedom. The chi-square distribution is different for each number of degrees of freedom. The probability distributions are given in the table. The table gives information up to 30 degrees of freedom. For more than 30 degrees of freedom, R. A. Fisher suggests the form $\sqrt{2\chi^{2}} - \sqrt{2v - 1}$, which is approximately normally distributed, so in that case we can use the normal distribution table.
The arithmetic mean of the chi-square distribution is equal to the number of degrees of freedom v, the mode is at the point where χ² = v − 2 (unless v = 1), the variance is 2v and the coefficient of skewness is $\sqrt{8/v}$. From the expression for the coefficient of skewness it follows that this distribution is very asymmetrical for a small number of degrees of freedom and that, with increasing degrees of freedom, it approaches a symmetric distribution.
Unlike the normal distribution, it has no independent application in specific problems, but it is very important for inferential statistics. Therefore, we illustrate calculations with the chi-square distribution.
Example 5.18.
Illustration of Chi square distribution. Solution by using statistical tables and Excel.
If the degree of freedom is 5 and known probability when . Is 0.9, we have to find appropriate value. Under the same conditions find considering that probability is known when .
Solution:
We can also obtain the same result by using Excel function. When
, it is direct relation for Excel function CHIINV.
Opposite event is , so:

. That means:
5.14. F DISTRIBUTION
Under the following assumptions:
• X is a continuous random variable which has a chi-square distribution (χ²) with m degrees of freedom,
• Y is a continuous random variable which has a chi-square distribution (χ²) with n degrees of freedom,
• these two variables are independent,
the variable F, defined as the quotient of the two variables, each divided by its respective degrees of freedom,
$$F = \frac{X/m}{Y/n},$$
follows the Fisher-Snedecor distribution with (m, n) degrees of freedom. The probability distribution is not symmetric with respect to m and n.
The random variable takes values from the interval (0, ∞) and the distribution has the following format:
where m and n represent degrees of freedom (df ).
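The expected value and variance are:
$$E(F) = \frac{n}{n-2} \;\; (n > 2), \qquad \sigma^{2} = \frac{2n^{2}(m+n-2)}{m(n-2)^{2}(n-4)} \;\; (n > 4)$$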
Fisher’s (F) distribution is used when we want to analyze the variability of two populations on the basis of samples. We use the F distribution to test hypotheses about the equality of two variances through their ratio, on the basis of the number of degrees of freedom of each of them. When the reference populations are normally distributed, the quotient of two independent variance estimates is given in the form of:
Example 5.19.
Illustration of F distribution. Solution by using statistical tables and Excel.
Under Fisher-Snedecor’s distribution, determine F0 if the appropriate number of degrees of freedom is v1 = 4, v2 = 7 and the corresponding probability is .
Solution:
We can also solve this problem in Excel. The relation is >, so we can directly apply the Excel function FINV:
5.15. APPROXIMATIONS OF BINOMIAL, POISSON AND HYPERGEOMETRIC DISTRIBUTION WITH NORMAL DISTRIBUTION
The conditions for approximating the Binomial, Poisson and Hypergeometric distributions with the Normal distribution are summarized here:
Figure 5.8. Conditions for approximations with Normal distribution


Constructing of sample space.
5.1. Two homogeneous dice are thrown and their up faces are recorded. Determine the sample space of this experiment.
Solution:
The sample space is the collection of all possible outcomes. In our example, the possible outcomes that can be realized on each of the dice are the numbers 1, 2, 3, 4, 5 and 6. Hence, the sample space is the set of all possible pairs of numbers, where the first number represents the result recorded on the first dice and the second number represents the result recorded on the second dice. Therefore, the required sample space is the following set:
$$S = \{(i, j) : i, j \in \{1, 2, 3, 4, 5, 6\}\}$$
Probability: classical approach, opposite event, addition and multiplication rule.
5.2. In the case of the experiment given in Example 3.1., determine the probabilities that:
a) “Three” will appear on the first dice.
b) “Three” will appear on both dice.
c) “Three” will appear at least on one dice.
d) “Three” will not appear on any dice.
Solution:
Let’s denote the events:

A – event that “three” will appear on the first dice;
B – event that “three” will appear on both dice;
C – event that “three” will appear at least on one dice;
D – event that “three” will not appear.
T is the total number of all possible (elementary) outcomes. According to Example 3.1. we conclude that T = 36 (the number of all possible pairs of numbers that can appear on two dice).
a) X(A) is the number of outcomes in which the event A occurs. Therefore, X(A) is the number of pairs with “3” in the first place. So, X(A) = 6.
According to the classical (a priori) definition of probability, the probability that “three” will appear on the first dice is equal to:
$$P(A) = \frac{X(A)}{T} = \frac{6}{36} = \frac{1}{6}$$
b) X(B) is the number of outcomes in which “three” appears on both dice. In this example, it is the number of pairs in which “3” is in the first and second place. Only the pair (3,3) satisfies the required condition. Therefore, X(B) = 1.
The probability that “three” will appear on both dice is equal to:
$$P(B) = \frac{X(B)}{T} = \frac{1}{36}$$
c) Let’s denote the event:
E – “three” will appear on the second dice.
Notice that (in the same way as under a) we conclude that P(E) = 6/36 = 1/6. The event C, “three” will appear at least on one dice, is satisfied if “three” appears on the first dice or “three” appears on the second dice (both of these events include the case that “three” appears on both dice). We conclude that event C will be realized if any of the events A and E is realized. Therefore: C = A ∪ E.
Considering the general addition rule:
$$P(C) = P(A \cup E) = P(A) + P(E) - P(A \cap E) = \frac{6}{36} + \frac{6}{36} - \frac{1}{36} = \frac{11}{36}$$
Notice that the event A ∩ E represents the situation where both events A and E are realized, i.e. the situation where “three” appears on both dice. Therefore, A ∩ E = B. The probability of the event A ∩ E was calculated in part b), but it can also be calculated using the multiplication rule for independent events:
$$P(A \cap E) = P(A) \cdot P(E) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}$$
d) The event D, “three” will not appear, is opposite to the event C, “three” will appear at least on one dice, therefore $D = \bar{C}$. The probability that “three” will not appear is equal to:
$$P(D) = 1 - P(C) = 1 - \frac{11}{36} = \frac{25}{36}$$
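Problems 5.1 and 5.2 can be checked by brute-force enumeration of the sample space. This is a minimal sketch of the classical approach (favourable outcomes divided by all outcomes); the helper name `probability` and the use of `Fraction` for exact results are choices made here, not part of the text.

```python
from fractions import Fraction
from itertools import product

# Sample space of problem 5.1: all ordered pairs (first dice, second dice).
sample_space = list(product(range(1, 7), repeat=2))


def probability(event) -> Fraction:
    """Classical probability: favourable outcomes / all possible outcomes."""
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))


p_a = probability(lambda o: o[0] == 3)     # "three" on the first dice
p_b = probability(lambda o: o == (3, 3))   # "three" on both dice
p_c = probability(lambda o: 3 in o)        # "three" on at least one dice
p_d = probability(lambda o: 3 not in o)    # "three" does not appear

print(len(sample_space), p_a, p_b, p_c, p_d)
```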
Probability: classical approach, combinatorics.
5.3. A marketing research department consists of 10 researchers: 6 of them are economists and 4 are mathematicians. In order to create the terms of reference for the newest project, a team of 3 researchers needs to be chosen. Find the probability that:
a) Exactly one mathematician will be chosen.
b) At least one mathematician will be chosen.
c) No mathematician will be chosen.
d) At least two mathematicians will be chosen.
Solution:
10 researchers = 6 economists + 4 mathematicians
T is the total number of all possible ways to choose 3 out of 10 researchers, regardless of profession:
$$T = \binom{10}{3} = 120$$
A – event that exactly one mathematician will be chosen;
B – event that at least one mathematician will be chosen;
C – event that no mathematician will be chosen;
D – event that at least two mathematicians will be chosen.
a) X (A) represents the number of ways to choose 1 out of 4 mathematicians and 2 out of 6 economists:
$$X(A) = \binom{4}{1}\binom{6}{2} = 4 \cdot 15 = 60, \qquad P(A) = \frac{X(A)}{T} = \frac{60}{120} = 0.5$$
b) The simplest way to calculate the probability that “at least one” mathematician will be chosen is by using the probability of the opposite event. In this case, the opposite event is: no mathematician will be chosen. $X(\bar{B})$ represents the number of ways to choose 3 out of 6 economists (and 0 out of 4 mathematicians). Therefore:
$$X(\bar{B}) = \binom{6}{3}\binom{4}{0} = 20, \qquad P(B) = 1 - P(\bar{B}) = 1 - \frac{20}{120} = \frac{5}{6} \approx 0.8333$$
c) Calculated in part b):
$$P(C) = P(\bar{B}) = \frac{20}{120} = \frac{1}{6} \approx 0.1667$$
d) “At least two mathematicians” is the same as “two or more mathematicians”. In this example three researchers have to be chosen, therefore “at least two mathematicians” is the same as “two or three mathematicians”. X (D) represents the number of ways to choose 2 out of 4 mathematicians and 1 out of 6 economists, or all 3 out of 4 mathematicians (and 0 out of 6 economists):
$$X(D) = \binom{4}{2}\binom{6}{1} + \binom{4}{3}\binom{6}{0} = 36 + 4 = 40, \qquad P(D) = \frac{40}{120} = \frac{1}{3} \approx 0.3333$$
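The counting in problem 5.3 can be reproduced with `math.comb`. The sketch below is illustrative only; `ways` is a hypothetical helper that counts teams with a given number of mathematicians.

```python
from fractions import Fraction
from math import comb

MATHEMATICIANS, ECONOMISTS, TEAM_SIZE = 4, 6, 3

# T: all ways to choose 3 out of 10 researchers.
total = comb(MATHEMATICIANS + ECONOMISTS, TEAM_SIZE)


def ways(mathematicians_chosen: int) -> int:
    """Number of teams with exactly this many mathematicians."""
    economists_chosen = TEAM_SIZE - mathematicians_chosen
    return comb(MATHEMATICIANS, mathematicians_chosen) * comb(ECONOMISTS, economists_chosen)


p_exactly_one = Fraction(ways(1), total)             # a)
p_none = Fraction(ways(0), total)                    # c)
p_at_least_one = 1 - p_none                          # b) opposite event of c)
p_at_least_two = Fraction(ways(2) + ways(3), total)  # d)

print(total, p_exactly_one, p_at_least_one, p_none, p_at_least_two)
```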
Contingency table, multiplication and addition rule, conditional probability.
5.4. A direct retailer can receive orders from its catalogue, by repeat-customer order forms or by phone. The orders are classified as small, medium and large. The data about the last 1000 orders are given in the table below:
            Small   Medium   Large   Total
Catalogue     112       82      54     248
Repeat         96      148     122     366
Phone          74      116     196     386
Total         282      346     372    1000
In order to improve their marketing activities, management wants to examine:
a) What is the probability that a randomly chosen order is large?
b) What is the probability that a randomly chosen order was received either by catalogue or by a repeat-customer order form?
c) What is the probability that a randomly chosen order is large and received by phone?
d) What is the probability that a randomly chosen large order was received by phone?
Solution:
C – order by catalogue forms;
R – order from repeat-customers;
P – order by phone;
S – small order;
M – medium order;
L – large order;
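a)
$$P(L) = \frac{372}{1000} = 0.372$$
b) The events C and R are mutually exclusive (an order cannot be received by catalogue form and from a repeat customer at the same time). Therefore, based on the addition rule for mutually exclusive events:
$$P(C \cup R) = P(C) + P(R) = \frac{248}{1000} + \frac{366}{1000} = 0.614$$
c)
$$P(L \cap P) = \frac{196}{1000} = 0.196$$
d) In this case, the event L is already realized (we know that the order is large), therefore it is a conditional probability:
$$P(P \mid L) = \frac{196}{372} = 0.5269$$
This probability can also be calculated in another way:
$$P(P \mid L) = \frac{P(P \cap L)}{P(L)} = \frac{0.196}{0.372} = 0.5269$$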
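The probabilities in problem 5.4 follow directly from the cell counts of the contingency table. Below is a minimal sketch in which the table is stored as a nested dictionary; this data layout and the variable names are assumptions made for illustration.

```python
# Orders by channel (rows) and size (columns), taken from the table in problem 5.4.
orders = {
    "catalogue": {"small": 112, "medium": 82, "large": 54},
    "repeat": {"small": 96, "medium": 148, "large": 122},
    "phone": {"small": 74, "medium": 116, "large": 196},
}

total = sum(sum(row.values()) for row in orders.values())
large_total = sum(row["large"] for row in orders.values())

# a) marginal probability of a large order
p_large = large_total / total
# b) mutually exclusive channels: the probabilities simply add
p_catalogue_or_repeat = (
    sum(orders["catalogue"].values()) + sum(orders["repeat"].values())
) / total
# c) joint probability: large and by phone
p_large_and_phone = orders["phone"]["large"] / total
# d) conditional probability: phone given large
p_phone_given_large = orders["phone"]["large"] / large_total

print(p_large, p_catalogue_or_repeat, p_large_and_phone, round(p_phone_given_large, 4))
```

Part d) shows both routes to the conditional probability: dividing within the "Large" column gives the same value as dividing the joint probability by P(L).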
Independency, general addition and multiplication rule, conditional probability.
5.5. A statistical research agency conducted a study on personal computer users and got the following results: 78% of citizens use a standard desktop personal computer, 56% use a laptop, and 36% of them use both.
Determine:
a) whether the events “citizen uses desktop personal computer” and “citizen uses laptop” are independent;
b) the probability that a randomly chosen citizen uses at least one of them, desktop personal computer or laptop;
c) the probability that a randomly chosen laptop user also uses a desktop personal computer.
Solution:
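The events:
D – citizen uses desktop personal computer; P (D) = 0.78
L – citizen uses laptop; P (L) = 0.56
D ∩ L – citizen uses both; P (D ∩ L) = 0.36
a) If the events D and L are independent, then:
$$P(D \cap L) = P(D) \cdot P(L)$$
The converse is also true: if P (D ∩ L) = P (D) · P (L), the events D and L are independent.
Let’s examine whether the events D and L are independent:
$$P(D) \cdot P(L) = 0.78 \cdot 0.56 = 0.4368 \neq 0.36 = P(D \cap L)$$
so the events D and L are not independent.
b)
$$P(D \cup L) = P(D) + P(L) - P(D \cap L) = 0.78 + 0.56 - 0.36 = 0.98$$
c)
$$P(D \mid L) = \frac{P(D \cap L)}{P(L)} = \frac{0.36}{0.56} = 0.6429$$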
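The three parts of problem 5.5 can be checked with a few lines of arithmetic. This is a sketch under the probabilities stated in the problem; the variable names are illustrative.

```python
from math import isclose

p_desktop = 0.78   # P(D)
p_laptop = 0.56    # P(L)
p_both = 0.36      # P(D and L)

# a) independence would require P(D and L) = P(D) * P(L)
product_of_marginals = p_desktop * p_laptop
independent = isclose(product_of_marginals, p_both)

# b) general addition rule
p_at_least_one = p_desktop + p_laptop - p_both

# c) conditional probability of desktop use given laptop use
p_desktop_given_laptop = p_both / p_laptop

print(f"P(D)P(L) = {product_of_marginals:.4f}, independent: {independent}")
print(f"P(D or L) = {p_at_least_one:.2f}")
print(f"P(D | L) = {p_desktop_given_laptop:.4f}")
```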
Application of Bayes theorem.
5.6. Suppose that 5 out of 100 men and 25 out of 10000 women are colour blind, and suppose that the number of men equals the number of women.
a) Find the probability that a randomly chosen person (regardless of gender) is colour blind.
b) Find the probability that a randomly chosen colour blind person is a man.
Solution:
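The events:
M – person is male, P (M) = 0.5
F – person is female, P (F) = 0.5
D – person is colour blind:
P (D | M) = 5/100 = 0.05 - probability that a randomly chosen man is colour blind;
P (D | F) = 25/10000 = 0.0025 - probability that a randomly chosen woman is colour blind.
The events M and F are mutually exclusive and their union covers the entire sample space (each person is either male or female).
a)
$$P(D) = P(M) P(D \mid M) + P(F) P(D \mid F) = 0.5 \cdot 0.05 + 0.5 \cdot 0.0025 = 0.02625$$
b) Based on Bayes theorem:
$$P(M \mid D) = \frac{P(M) P(D \mid M)}{P(D)} = \frac{0.025}{0.02625} = 0.9524$$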
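The total probability and Bayes steps of problem 5.6 can be sketched as follows; the variable names are assumptions chosen to mirror the events M, F and D.

```python
# Prior probabilities: equal numbers of men and women.
p_male, p_female = 0.5, 0.5

# Conditional probabilities of colour blindness.
p_blind_given_male = 5 / 100
p_blind_given_female = 25 / 10_000

# a) total probability over the two mutually exclusive groups
p_blind = p_male * p_blind_given_male + p_female * p_blind_given_female

# b) Bayes theorem: probability that a colour blind person is a man
p_male_given_blind = p_male * p_blind_given_male / p_blind

print(f"P(D) = {p_blind:.5f}")
print(f"P(M | D) = {p_male_given_blind:.4f}")
```

Although colour blindness is rare overall, about 95% of colour blind persons are men under these assumptions, because the condition is twenty times more likely for a man than for a woman.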
Binomial distribution: expected value, variance, probabilities.
5.7. From experience it is known that 25% of all management trainees are rated outstanding. 10 trainees are randomly chosen for the sample.
a) What theoretical distribution does the random variable X, the number of outstanding trainees in the sample, follow?
b) Find the expected value and variance of the variable X.
Find the following probabilities:
c) There is exactly one outstanding trainee in the sample;
d) There are no outstanding trainees in the sample;
e) There is at least one outstanding trainee in the sample;
f) There are fewer than 3 outstanding trainees in the sample;
g) There are at least 9 outstanding trainees in the sample.
Solution:
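a) X follows the Binomial probability distribution with parameters n = 10 and p = 0.25.
b) The expected value is:
$$E(X) = n p = 10 \cdot 0.25 = 2.5$$
The variance is:
$$\sigma^{2} = n p q = 10 \cdot 0.25 \cdot 0.75 = 1.875$$
c)
$$p(1) = \binom{10}{1} \cdot 0.25^{1} \cdot 0.75^{9} = 0.1877$$
d)
$$p(0) = \binom{10}{0} \cdot 0.25^{0} \cdot 0.75^{10} = 0.0563$$
e) The events “there is at least one outstanding trainee in the sample” and “there are no outstanding trainees in the sample” are opposite events and therefore:
$$P(X \ge 1) = 1 - p(0) = 1 - 0.0563 = 0.9437$$
f)
$$P(X < 3) = p(0) + p(1) + p(2) = 0.0563 + 0.1877 + 0.2816 = 0.5256$$
g)
$$P(X \ge 9) = p(9) + p(10) = 0.0000286 + 0.0000010 = 0.0000296$$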
Poisson distribution: expected value, probabilities.
5.8. On Saturday mornings, customers enter a boutique at a suburban shopping mall at an average rate of 0.5 per minute. Let X be “the number of customers arriving in a specified 10-minute interval of time”.
a) What is the expected number of customers arriving in a specified interval of time?
b) Find the probability that exactly 3 customers will arrive in a specified interval of time.
c) Find the probability that at most 3 customers will arrive in a specified interval of time.
d) Find the probability that at least 4 customers will arrive in a specified interval of time.
e) Find probability
Solution:
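We are interested in the number of occurrences of a certain event in a given unit of time, therefore it is reasonable to assume that X follows the Poisson probability distribution.
a) At an average rate of 0.5 per minute, over a 10-minute interval of time we would expect 0.5 · 10 = 5 arrivals. Therefore, the parameter λ of the Poisson probability distribution is:
$$\lambda = 5$$
b)
$$p(3) = \frac{5^{3} e^{-5}}{3!} = 0.1404$$
c)
$$P(X \le 3) = p(0) + p(1) + p(2) + p(3) = 0.0067 + 0.0337 + 0.0842 + 0.1404 = 0.2650$$
d)
$$P(X \ge 4) = 1 - P(X \le 3) = 1 - 0.2650 = 0.7350$$
e)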
Normal distribution: standardization, standardized Normal distribution rules, graphical presentation.
5.9. The daily duration of sleep (in minutes) of middle-aged people is a random variable X that follows the Normal probability distribution with an expected value of 500 minutes and a standard deviation of 100 minutes. Determine:
a)
b)
c)
d) the interval (x1, x2), symmetric around μ, such that the probability that X belongs to (x1, x2) is equal to 0.60.
Solution:
The formula for standardization is:
$$Z = \frac{X - \mu}{\sigma} = \frac{X - 500}{100}$$
a)
b)
c)
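d) The interval (x1, x2) is symmetric around μ (the expected value of X), which implies that the corresponding standardized interval is symmetric around E (Z) = 0. Therefore, the corresponding standardized interval has the form (z1, z2) with z1 = −z2 (symmetric around 0), where
$$P(-z_2 < Z < z_2) = 0.60 \;\Rightarrow\; F(z_2) = 0.80 \;\Rightarrow\; z_2 \approx 0.84$$
$$x_1 = 500 - 0.84 \cdot 100 = 416, \qquad x_2 = 500 + 0.84 \cdot 100 = 584$$
The required interval is (416, 584).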
Graphical presentation:
Application of Normal probability distribution.
5.10. The time needed to prepare microwave popcorn is a Normal random variable with an expected value of 4.5 minutes and a variance of 1.44.
a) Determine the time x0 needed to prepare microwave popcorn, such that 10% of all popcorn is prepared in at most x0.
b) Determine the time x1 needed to prepare microwave popcorn, such that 5% of all popcorn is prepared in at least x1.
Solution:
The formula for standardization is:
$$Z = \frac{X - \mu}{\sigma} = \frac{X - 4.5}{1.2}$$
a)
Statistical tables of the Normal probability distribution function do not contain values (probabilities) less than 0.5. In such cases it is necessary to use the following transformation:
$$F(-z) = 1 - F(z)$$
b)
5.11. Candidates for employment in a large corporation must pass
through two initial screening procedures: a written aptitude test
and an oral interview. 60% of the candidates are unsuccessful
on the written test, 40% are unsuccessful in the interview and
25% are unsuccessful in both. The corporation invites for a final
interview only candidates who are successful in both procedures.
Offers of employment are made to 30% of those invited for a final
interview.
a) What is the probability that a randomly chosen candidate will be
invited for a final interview?
b) What is the probability that a randomly chosen candidate will be
offered employment?
Answer: a) 25% b) 7.5%
5.12. Market research in a particular city indicated that 82% of all
households in the city have color TV and 37% have microwave
ovens. It was also found that 28% of all households in the city have
both appliances. A single household is chosen at random from this
city. What is the probability that the chosen household has at least one
of these appliances?
Answer: 91%
5.13. In examining a past record of a corporation’s account balances,
an auditor finds that 15% of them have contained errors. Of those
balances in error, 60% were regarded as unusual values based on
historical figures. Of all the account balances, 20% were unusual
values. If the figure for a particular balance appears unusual, what
is the probability that it is in error?
Answer: 45%
5.14. The accompanying table shows, for 1000 forecasts of earnings
per share made by financial analysts, the numbers of forecasts and
outcomes in particular categories (compared with the previous
year):
                    Forecast
Outcome             improvement   about the same   worse
improvement         218           82               66
about the same      106           153              75
worse               75            84               141
a) What is the probability that the forecast of the worse performance in
earnings will be realized?
b) What is the probability that the forecast of the improvement in
earnings will be realized?
Answer: a) 141/1000 b) 218/1000
5.15. A laundromat manager knows that 15% of new washing machines
purchased require maintenance during the first year of operation.
The manager purchases five new machines, whose performances
can be assumed to be independent.
a) What is the probability that all of them will require maintenance
during the first year of operation?
b) What is the probability that none of them will require maintenance
during the first year of operation?
c) What is the probability that at least two of them will require
maintenance during the first year of operation?
Answer: a) 0.0076% b) 44.37% c) 16.48%
5.16. An insurance company holds fraud insurance policies on 6000

firms. In any given year, the probability that any single policy will
result in a claim is 0.001. Find the probability that at least three
claims are made in a given year.
Answer: 93.81%
5.17. Scores on an achievement test are known to be normally distributed
with mean 420 and standard deviation 80. For a randomly selected person
taking this test, what is the probability that the score will be:
a) between 420 and 480
b) lower than 440
c) more than 410
d) A decision has been made that the 10% of persons with the lowest
scores will receive a failing grade. What is the minimum score needed
to avoid a failing grade?
e) A decision has been made that the 15% of persons with the highest
scores will receive a grant. What is the minimum score needed to get
a grant?
Answer: a) 27.34% b) 59.87% c) 54.97% d) 317.5 e) 502.9
5.18. Suppose that the variable X is normally distributed with mean of

150 and standard deviation of 25.
a) Find the probability that X is less than 97.

b) Find the probability that X is more than 93 and less than 162.
Solution:
a) the probability that X is less than 97.
b) the probability that X is more than 93 and less than 162 (between 93
and 162)
5.19. A light bulb manufacturer claims that the distribution of the

lifetimes of its light bulbs has a mean of 24 months and a standard
deviation of 5 months. Suppose that a consumer group decides
to check this claim by purchasing a sample of 100 light bulbs.
Assuming that the manufacturer’s claim is true, what is the
probability that the consumer’s group sample has a mean lifetime
of 23 months or less?
Answer: 2.28%
5.20. The probability of a randomly drawn individual having blue eyes

is 0.6.
a) What is the probability that four people drawn at random all have
blue eyes?
b) What is the probability that two individuals out of four in a sample
have blue eyes?
c) Calculate the mean and variance of blue eyed individuals in the
previous exercise
Answer:
a) 12.96%
b)
c) 2.4 and 0.96
5.21. The average income of a country is known to be £10,000 with

standard deviation £2,500. A sample of 40 individuals is taken and
their average income is calculated.
a) What is the probability distribution of this sample mean?

b) What is the probability that the sample mean is over £10,500?
c) What is the probability that the sample mean is below £8,000?
Solution:
a)
In our question, that means the following probability distribution of the
sample mean: $\bar{x} \sim N(10{,}000,\ 2{,}500^2/40) = N(10{,}000,\ 156{,}250)$
b) The probability of the sample mean being over £10,500 is 10.3%.
c) The probability of the sample mean being below £8,000 is significantly
below 1%.
5.22. There are 3 damaged products out of 60 products in the package.

Find the probability that randomly drawn product is damaged.
Answer: 0.05
5.23. Standard delivery consists of 90 products. Sender informed us

that there are 4 defective products. If we take control sample of 5
products, find the probability that there is one defective product in
the sample.
Answer: 0.19
5.24. 12 candidates applied for the job of market inspector: 5 lawyers

and 7 economists. 4 candidates will get the job. Find the probability
that:
a) All candidates that will get the jobs are economists;

b) At least one lawyer will get the job;
c) At least 3 lawyers will get the job.
Answer: a) 0.07; b) 0.93; c) 0.15.
5.25. 512 out of 1000 newborns are boys. Find the probability that a
newborn is a boy.
Answer: 0.512
5.26. One card is drawn from the deck of cards consisting of 32 cards
(from 7 to ace). Find the probability that the drawn card is ace or
king.
Answer: 0.25
5.27. Smoke detection system uses two devices, A and B. If smoke

occurs, the probability of detection on device A is 0.95, on device
B is 0.9 and on both devices 0.88.
If smoke occurs, find the probability that:

a) it will be detected;
b) it won’t be detected.
Answer: a) 0.97; b) 0.03.
5.28. Auditor controls accuracy of accounting entries. On the basis

of experience, entry is incorrect in 5% of cases. 20 entries are
submitted to the control. Find the probability that:
a) all entries are correct;

b) three entries are incorrect;
c) Find the expected number of incorrect entries.
Answer: a) 35.8%; b) 5.96%; c) 1
5.29. On the basis of experience, 10% of all shoes made in certain shoe
factory are damaged. Find the probability that:
a) There are 2 damaged shoes in the sample of 12 shoes.

b) There are 6 damaged shoes in the sample of 20 shoes.
Answer: a) 23.01%; b) 0.89%
5.30. If dice is thrown 10 times, find the probability that “four” falls 3
times.
Answer: 15.5%
5.31. Number of persons that use elevator in the building of Faculty

of Economics during one hour follows Poisson’s probability
distribution. If it is expected that (on average) 1.6 persons use
elevator during one hour, find the probability that three persons
will use the elevator during one hour.
Answer: 13.78%
5.32. X ~ N (6.7, 1.44) Find
Answer: 27.385%
5.33. The average thickness of a mechanical part is 80 mm with

standard deviation of 2 mm. If variable „thickness“ follows
Normal probability distribution, find the probability that thickness
of a randomly chosen mechanical part is outside boundaries of
tolerance 70 – 86 mm.
Answer: 0.135%
5.34. The variable “human height” follows Normal probability distribution

with expected value of 164 cm and standard deviation of 15 cm.
a) If 6% of people are taller than a certain height, find that height.
b) If 15% of people are shorter than a certain height, find that height.
c) What percentage of people is taller than 170 cm?
Answer: a) 187.325; b) 148.475; c) 34.46%.
CHAPTER 6
INFERENTIAL STATISTICS
6.1. INTRODUCTION
“I like to think of statistics as the science of learning from data ... It

presents exciting opportunities for those who work as professional
statisticians. Statistics is essential for the proper running of government,
central to decision making in industry, and a core component of modern
educational curricula at all levels.” Jon Kettenring, ASA President,
1997.35
Inferential statistics are used to draw inferences about a population
from a sample. We need statistical inference to make generalizations
from sample to population. It is very important that the chosen sample
is randomly selected and representative of the population. Appropriate
sampling methods are needed to ensure that sample results provide
“good” estimates of the population characteristics. However, some
sampling error is always possible when a sample is selected. As a rule,
a larger sample leads to a smaller sampling error.
Consider an experiment in which 10 subjects who performed a task after
24 hours of sleep deprivation scored 12 points lower than 10 subjects
who performed it after a normal night's sleep. Is the difference real or
could it be due to chance? How much larger could the real difference be
than the 12 points found in the sample? These are the types of questions
answered by inferential statistics.
There are two main methods used in inferential statistics: estimation
and hypothesis testing. In estimation, the sample is used to estimate a
parameter, and a confidence interval around the estimate is constructed.
A confidence interval gives an estimated range of values which is likely
to include an unknown population parameter.
35
ibidem
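The estimated range is calculated from a given set of sample data36:
$P(\varphi - h \le \theta \le \varphi + h) = 1 - \alpha$
where:
• ϕ - statistic from sample
• θ - parameter from population
• h - half-width of the interval around the statistic
• (1 – α) - confidence level
• α - probability of a type I error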
In the most common use of hypothesis testing, a "straw man" null
hypothesis is put forward and it is determined whether the data are strong
enough to reject it. For the sleep deprivation study, the null hypothesis
would be that sleep deprivation has no effect on performance.
Inferential statistics are used to make generalizations from a sample to a
population. There are two sources of error that may result in a sample's
being different from (not representative of) the population from which
it is drawn: sampling error and sample bias.
Figure 6.1. Illustration of sampling error and sample bias
Sampling error - chance, random error
Sample bias - constant error, due to inadequate design
36
Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1
Inferential statistics take into account sampling error. These statistics do

not correct for sample bias. That is a research design issue. Inferential
statistics only address random error (chance).
6.2. THE POINT ESTIMATOR
A point estimator of a population parameter is some function

or calculation that can be used to estimate the value of the
population parameter.
As an example, the point estimator of the population mean is the mean
of the sample, as we can use the sample mean to estimate the population
mean.
From the basic set of N elements in the population, we can choose
different samples of size n. For each of these samples we can calculate
a statistic with which we can estimate the corresponding characteristic
of the population. This statistic (the point estimator) has the
following properties:
• its value generally differs from the true value of the same
characteristic of the population, and
• its value differs from sample to sample.
In point estimation we use the data from the sample to compute a
value of a sample statistic that serves as an estimate of a population
parameter. We refer to the sample mean $\bar{x}$ as the point estimator
of the population mean μ. The sample standard deviation S is the point
estimator of the population standard deviation σ. The sample proportion
p is the point estimator of the population proportion π.
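The three point estimators named above can be sketched in a few lines of Python. This is only an illustration: the six readings and the counts 12 out of 40 are used here as sample data, and both common versions of the sample standard deviation (divisor n - 1 and divisor n) are shown, since the text does not say at this point which one S denotes.

```python
from statistics import mean, stdev, pstdev

# Illustrative sample data (six temperature readings).
readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]

x_bar = mean(readings)         # point estimate of the population mean
s_unbiased = stdev(readings)   # sample standard deviation, divisor n - 1
s_biased = pstdev(readings)    # sample standard deviation, divisor n

# Illustrative counts: 12 units with the attribute in a sample of 40.
p_hat = 12 / 40                # point estimate of the population proportion

print(round(x_bar, 2), round(s_unbiased, 3), round(s_biased, 3), p_hat)
```

The divisor n - 1 always gives the slightly larger value; the difference shrinks as the sample grows.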
6.3. THE DISTRIBUTION OF THE SAMPLE MEANS
The samples are chosen randomly, so the values of point estimators
are random variables. The values of these variables are distributed
according to a probability distribution. If we can determine this
probability distribution, we can determine the probability that the
estimator takes a value lower than or equal to a given real number (for
a discrete variable) or falls within a given interval of real numbers
(for a continuous variable). For a given distribution we can also
determine the expected value, variance and standard deviation.
If we randomly draw k samples of size n from a population with N
elements and calculate the arithmetic mean for each sample, we get as
many arithmetic means as we have samples:
$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$
where $\bar{x}_j$ is the mean of the j-th sample. As the samples were
randomly selected, the arithmetic mean of a sample is a new random
variable, for which we can calculate the arithmetic mean:
$\bar{\bar{x}} = \frac{1}{k} \sum_{j=1}^{k} \bar{x}_j$
The expected value (arithmetic mean) of the arithmetic means of samples
can be viewed as the expected value of the arithmetic mean of the
sample:
$E(\bar{x}) = E\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(x_i) = \frac{1}{n} \cdot n\mu = \mu$
This proves that the arithmetic mean of the arithmetic means of the
samples is equal to the arithmetic mean of the population. That means
that the sample mean is an unbiased estimator of the population mean.
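The unbiasedness of the sample mean can be checked by brute force on a very small population. The sketch below assumes a hypothetical population of four values and sampling with replacement, lists every possible sample of size 2, and compares the mean of all sample means with the population mean.

```python
from itertools import product
from statistics import mean

population = [2, 4, 6, 8]   # hypothetical population, N = 4
n = 2                        # sample size

# Every ordered sample of size n drawn with replacement (N**n samples).
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)
mean_of_means = mean(sample_means)

print(len(samples), mu, mean_of_means)   # the last two values coincide
```

Individual sample means range from 2 to 8, yet their average equals the population mean, which is what unbiasedness means.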
The distribution of sample means has some interesting characteristics.
First, if our samples are big enough (a large n), then the sampling
distribution will approximate a normal distribution, which, as you
know, is handy for computing probabilities.
Second, the mean of our sampling distribution, which is sometimes
denoted by $\mu_{\bar{x}}$, will be the same as the population mean.
Together, these two properties of sampling distributions comprise the
central limit theorem.
The central limit theorem says that if we take a sample from
a non-normal population X, and if the size of the sample is
large, then the distribution of the sample mean $\bar{x}$ is
approximately normal with mean μ and variance $\sigma^2/n$.
We can say that the larger the sample size n, the closer the sampling
distribution of the sample mean is to being normal. In other words, a
larger n means a better approximation.
Third, as you also know, to compute probabilities from a normal
distribution, we have to know the standard deviation of the distribution.
In this case, the standard deviation of the sampling distribution
is called the standard error of the mean, denoted by $\sigma_{\bar{x}}$,
and is calculated by dividing the population standard deviation by
the square root of n. In other words, the standard error of the
mean can be calculated as:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
The standard error of the mean depends on the sample size (n), so a
larger sample leads to a smaller standard error of the mean.
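A short sketch shows how quickly the standard error shrinks as n grows. The population standard deviation of 2,500 and the sample sizes are illustrative values chosen here, not results from the text.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

sigma = 2500  # illustrative population standard deviation
for n in (10, 40, 160, 640):
    print(n, round(standard_error(sigma, n), 2))
```

Because n enters under a square root, the sample must be four times larger to halve the standard error.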
6.4. CONFIDENCE INTERVAL FOR THE POPULATION MEAN
6.4.1. Standard deviation of population is known
For a population with unknown mean μ and known standard
deviation σ, a confidence interval for the population mean, based on a
random sample of size n, is:
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
where:
• $\bar{x}$ is the sample mean
• $z_{\alpha/2}$ is the upper critical value for the standard
normal distribution and depends on the required confidence
• $\sigma/\sqrt{n}$ is the standard error of the mean.
If we know the standard deviation of the population, there are some
rules of thumb for the sample size:
• In most applications, a sample size of n = 30 is adequate.
• If the population distribution is highly skewed or contains outliers, a
sample size of 50 or more is recommended.
• If the population is not normally distributed but is roughly symmetric,
a sample size as small as 15 will suffice.
• If the population is believed to be at least approximately normal, a
sample size of less than 15 can be used.
6.4.2. Standard deviation of population isn’t known
If standard deviation of population isn’t known, unbiased estimator

from the sample is:
where S is the standard deviation of sample.
In most practical research, the standard deviation of the population of
interest is not known. In this case, the population standard deviation
σ is replaced by the standard deviation S estimated from the sample, and
the resulting estimate of $\sigma/\sqrt{n}$ is known as the (estimated)
standard error. Since this standard error is only an estimate of the true
value, the standardized sample mean no longer follows the normal
distribution. Instead, it follows the t distribution. The t distribution
is described by its degrees of freedom. For a sample of size n, the t
distribution will have (n - 1) degrees of freedom. The notation for a t
distribution with k degrees of freedom is tk.
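For a population with unknown mean μ and unknown
standard deviation, a confidence interval for the population
mean, based on a random sample of size n, is:
$\bar{x} - t_{\alpha/2;\,n-1}\, \hat{\sigma}_{\bar{x}} \le \mu \le \bar{x} + t_{\alpha/2;\,n-1}\, \hat{\sigma}_{\bar{x}}$
where:
• $\bar{x}$ is the sample mean
• $t_{\alpha/2;\,n-1}$ is the upper critical value for the t distribution
with (n-1) degrees of freedom,
• $\hat{\sigma}_{\bar{x}}$ is the approximation for the standard error of
the mean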
As the sample size n increases, the t distribution becomes closer to

the normal distribution, since the standard error approaches the true
standard deviation σ for large n. So, for sample size n >30, we can use
normal instead of t distribution.
Example 6.1.
Confidence interval of the population mean with known population
standard deviation.
As a part of an experiment, a researcher measured the boiling temperature
of a liquid and recorded the following readings (in degrees Celsius):
102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6 different samples of the
liquid. The calculated sample mean is 101.82. If he knows from historical
data that the standard deviation for this procedure is 1.2 degrees, what
is the confidence interval for the population mean with a type I error
of 5%?
Solution:
Standard deviation σ for population is known:
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 101.82 \pm 1.96 \cdot \frac{1.2}{\sqrt{6}} = 101.82 \pm 0.96$
Confidence interval for the population mean at a 95% confidence level
is (100.86-102.78).
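The calculation for a known population standard deviation can be sketched as follows. The function name `z_interval` is introduced here for illustration only; the critical value is taken from Python's `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # upper critical value
    half_width = z * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
low, high = z_interval(mean(readings), sigma=1.2, n=len(readings))
print(round(low, 2), round(high, 2))
```

With a two-sided 95% level (z ≈ 1.96) the readings give an interval of about 100.86 to 102.78. Using z = 1.645 instead, which corresponds to a two-sided 90% level, would give a narrower interval of roughly 101.0 to 102.6.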
Example 6.2.
Confidence interval of the population mean with unknown population
standard deviation, small sample.
NGOs often claim in public that millionaires should be required
to donate to charity. Hence, we take a sample of 19 millionaires and
conduct a survey to find out what percent of their income the average
millionaire donates to charity. The mean percent in the observed sample
is 15 percent with a standard deviation of 5 percent. Determine the 95%
confidence interval for the mean percent.
Solution:
n<30 and the standard deviation σ for the population is unknown (we know
only the standard deviation S for the sample), so we use the t
distribution with 18 degrees of freedom:
$15 \pm 2.101 \cdot \frac{5}{\sqrt{19}} = 15 \pm 2.41$
The 95% confidence interval for the mean percent is (12.6-17.4).
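A small-sample interval can be sketched with critical values copied from a standard t table, as one would do by hand. The sketch assumes the estimated standard error S/√n; if the unbiased estimator of σ is used instead, the limits shift slightly. The dictionary and function names are illustrative.

```python
from math import sqrt

# Upper critical values t(alpha/2; 18 degrees of freedom) from a t table.
T_CRITICAL_18_DF = {0.95: 2.101, 0.99: 2.878}

def t_interval(sample_mean, s, n, t_critical):
    """Two-sided interval: sample mean +/- t * s / sqrt(n)."""
    half_width = t_critical * s / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

for confidence, t_crit in T_CRITICAL_18_DF.items():
    low, high = t_interval(15, 5, 19, t_crit)
    print(confidence, round(low, 1), round(high, 1))
```

For n = 19, mean 15 and S = 5, the 95% critical value gives about (12.6, 17.4), while the 99% critical value gives a wider interval of about (11.7, 18.3).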
Example 6.3.
Confidence interval of the population mean with unknown population
standard deviation, large sample.
Data on the trade in one region (130 observations) are gathered in a
database. The sample mean is 98.249 and the sample standard deviation is
0.733. Find a 99% confidence interval for the mean of the population.
Solution:
n>30 and the standard deviation σ for the population is unknown (we know
only the standard deviation S for the sample), so we use the z
distribution.
The 99% confidence interval for the population mean is (98.08-98.41).
Example 6.4.
According to the report for 2009, we have data on the predicted recovery
rate (in cents per dollar) after closing a business37 for a sample of 33
countries. The data are in an Excel sheet (cells A1-A33). We have to
construct a confidence interval for the recovery rate for the population
of all countries with a type I error of 1%.
Solution:
We will use Excel to solve this confidence interval problem.
To begin, we calculate descriptive statistics for the sample of 33
countries:
37
http://www.doingbusiness.org/CustomQuery/, predictions for 2009. year, access: 13. 12. 2009.
Tools – Descriptive statistics:
n>30 and the standard deviation σ for the population is unknown (we know
only the standard deviation S from the sample), so we use the z
distribution with the Excel function NORMSINV:
The confidence interval for the recovery rate for the population of all
countries with a type I error of 1% is (42.48-63.34).
6.5. CONFIDENCE INTERVAL OF THE POPULATION PROPORTIONS
Applying the general formula for a confidence interval, the
confidence interval for a proportion, pA, is
$p - z \sigma_p \le p_A \le p + z \sigma_p$
where:
• p is the proportion in the sample,
• z depends on the level of desired confidence, and
• $\sigma_p$, the standard error of a proportion, is equal to:
$\sigma_p = \sqrt{\frac{p_A (1 - p_A)}{n}}$
where:
• pA is the proportion of the population and
• n is the sample size.
Since pA is not known, p is used to estimate it. Therefore the estimated
value of $\sigma_p$ is:
$s_p = \sqrt{\frac{p (1 - p)}{n}}$
and then it will be:
$p - z \sqrt{\frac{p (1 - p)}{n}} \le p_A \le p + z \sqrt{\frac{p (1 - p)}{n}}$
Example 6.5.
Confidence interval of the population proportion, large sample.
Consider a researcher wishing to estimate the proportion of faulty copy
machines (which slow down work) in a library. A random sample of 40
machines is taken and 12 of the machines are faulty. The problem is to
compute the 95% confidence interval on π, the proportion of faulty
machines in the population.
Solution:
The value of the sample proportion p is: p = 12/40 = 0.30
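The remaining arithmetic for this example can be sketched as below. The function name is illustrative, and the interval uses the estimated standard error √(p(1 − p)/n) with a normal critical value, as described in this section.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = proportion_interval(12, 40)
print(round(low, 3), round(high, 3))
```

For 12 faulty machines out of 40 this gives roughly 0.158 to 0.442, that is, between about 16% and 44% of machines in the population.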
6.6. CONFIDENCE INTERVAL FOR VARIANCE IN POPULATION
Depending on whether the sample is small or large, we use the
chi-square or the normal distribution to determine the confidence
interval for the population variance, according to the following forms:
• small sample
• large sample
Example 6.6.
Confidence interval of the population variance, large sample.
In a sample of 40 elements, we calculated a mean of 50 and a variance
of 12. We wish to determine the interval in which the population
variance lies, with 99% certainty.
Solution:
This is a large sample, so the confidence interval is based on the
normal distribution:
With 99% certainty, the population variance lies in the interval
[7.44, 25.01].
Example 6.7.
Confidence interval of the population variance, small sample.
In a sample of 20 elements, we calculated a mean of 50 and a variance
of 12. We wish to determine the interval in which the population
variance lies, with 95% certainty.
Solution:
This is a small sample, so the confidence interval is based on the
chi-square distribution:
With 95% certainty, the population variance lies in the interval [7.3, 26.95].
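The two variance intervals can be reproduced with a short sketch. The formulas used here are assumptions: the small-sample bounds n·S²/χ² (chi-square with n − 1 degrees of freedom) and the large-sample normal approximation χ² ≈ ½(√(2n − 3) ± z)² are inferred because they reproduce the printed results of Examples 6.6 and 6.7; they are not quoted from the text. Critical values are copied from standard tables.

```python
from math import sqrt

def variance_interval_small(n, s2, chi2_upper, chi2_lower):
    """Bounds n*S^2/chi2, with table chi-square values for n - 1 df."""
    return n * s2 / chi2_upper, n * s2 / chi2_lower

def variance_interval_large(n, s2, z):
    """Normal approximation: chi2 is replaced by 0.5*(sqrt(2n-3) +/- z)**2."""
    root = sqrt(2 * n - 3)
    return 2 * n * s2 / (root + z) ** 2, 2 * n * s2 / (root - z) ** 2

# n = 20, S^2 = 12, 95%: chi-square table values for 19 degrees of freedom.
small = variance_interval_small(20, 12, chi2_upper=32.852, chi2_lower=8.907)
# n = 40, S^2 = 12, 99%: z = 2.58 from the normal table.
large = variance_interval_large(40, 12, z=2.58)

print([round(v, 2) for v in small])
print([round(v, 2) for v in large])
```

Both results agree with the printed intervals to within rounding in the second decimal place.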
6.7. HOW TO DETERMINE SAMPLE SIZE ACCORDING TO SAMPLE ERROR?

population mean
Determining sample size is a very important issue because samples
that are too large may waste time, resources and money, while samples
that are too small may lead to inaccurate results. In many cases, we
can easily determine the minimum sample size needed to estimate a
population parameter, such as the population mean μ.
When sample data are collected and the sample mean is calculated, that
sample mean is typically different from the population mean μ. This
difference between the sample and population means can be thought of
as an error.
The margin of error E is the maximum difference between
the observed sample mean $\bar{x}$ and the true value of the
population mean μ:
$E = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
where:
• $z_{\alpha/2}$ is known as the critical value, the positive z value that
is at the vertical boundary for the area of α/2 in the right tail
of the standard normal distribution.
• σ is the population standard deviation.
• n is the sample size.
Rearranging this formula, we can get the expression for the sample size
necessary to produce results accurate to a specified confidence and
margin of error:
$n = \left( \frac{z_{\alpha/2}\, \sigma}{E} \right)^2$
This formula can be used when σ is known and we want to determine
the sample size needed to estimate the mean value μ within ±E with a
confidence of (1 − α). We can still use this formula if we don’t know
our population standard deviation σ. The standard deviation for the
sample is:
Although it’s unlikely that you know σ when the population mean is not
known, you may be able to determine σ from a similar process or from
a pilot test/simulation.
Example 6.8.
Illustration of determining sample size for estimating population mean.
We want to estimate the average mobile phone bill of the inhabitants of
a capital. Studies obtained elsewhere find the standard deviation of
$25. The group wants to estimate the average bill within of the true
average and with 95% confidence. Determine the size of the sample needed.
Solution:
We need to have 96 or more elements in the sample to achieve 95%
confidence.
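The sample size rule for the mean, n = (z·σ/E)², can be sketched as a small function. The inputs below (σ = 12, margin of error 3 and 1.5) are hypothetical values chosen for illustration, and the result is rounded up so that the margin of error is not exceeded.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)

# Hypothetical inputs: sigma = 12, 95% confidence.
print(sample_size_for_mean(12, 3))     # margin of error 3
print(sample_size_for_mean(12, 1.5))   # margin of error halved
```

Halving the margin of error roughly quadruples the required sample size, because the margin enters the formula squared.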

population proportion
To develop a formula for the appropriate sample size needed
when constructing a confidence interval estimate of the proportion,
recall the margin of error E of the confidence interval estimate of the
proportion:
$E = z_{\alpha/2} \sqrt{\frac{p_A (1 - p_A)}{n}}$
where:
• $z_{\alpha/2}$ is known as the critical value, the positive value that is
at the vertical boundary for the area of α/2 in the right tail of the
standard normal distribution.
• pA is the proportion of the population.
• n is the sample size.
Rearranging this formula, we can get the expression for
the sample size necessary to produce results accurate to a
specified confidence and margin of error:
$n = \frac{z_{\alpha/2}^2\, p_A (1 - p_A)}{E^2}$
This formula can be used when you know pA and want to determine
the sample size necessary to establish, with a confidence of (1 − α), the
proportion for the population within ±E.
You can still use this formula if you don’t know your population
proportion and you have a proportion p from a sample:
$n = \frac{z_{\alpha/2}^2\, p (1 - p)}{E^2}$
Example 6.9.
Illustration of determining sample size for estimating population
proportion.
If you want 99% confidence of estimating the population proportion
to be within an error of ±0.02 and there is historical evidence that the
population proportion was 0.4, what sample size is needed?
Solution:
We need to have 3,994 or more elements in the sample to achieve 99%
confidence.
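The sample size for a proportion can be sketched in the same way. The critical value is passed in explicitly, because the result is sensitive to how it is rounded: the table value z = 2.58 and the more precise 2.5758 lead to different sample sizes. The function name is illustrative.

```python
from math import ceil

def sample_size_for_proportion(p, margin, z):
    """Smallest n for which z * sqrt(p * (1 - p) / n) stays within the margin."""
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size_for_proportion(0.4, 0.02, z=2.58))     # table value of z
print(sample_size_for_proportion(0.4, 0.02, z=2.5758))   # more precise z
print(sample_size_for_proportion(0.5, 0.02, z=2.58))     # most cautious p
```

With z = 2.58 the requirement is 3,994 elements; the more precise critical value lowers it to 3,981. When no prior proportion is available, p = 0.5 gives the largest, and therefore safest, sample size.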
6.8. HYPOTHESIS TESTING
In this part, focus is on hypothesis testing, another aspect of statistical

inference that like confidence interval estimation, is based on
information from sample. A step-by-step methodology is developed
and that methodology enables us to make inferences about a population
parameter by analyzing differences between results observed as the
statistic from sample and the results that can be expected if some
underlying hypothesis is actually true according to appropriate
theoretical distribution.38
Hypothesis testing typically begins with some theory, claim, or assertion

about a particular parameter of a population. For example, for purposes
of statistical analysis, our initial hypothesis about one production
example is that the process is working properly, meaning that the mean
fill is 350 grams, and no corrective action is needed.
38
Levine D.M. and others, Statistics for Managers Using Microsoft Excel, Prentice Hall, NY,
2005., p. 332
The hypothesis that the population parameter is equal to the company
specification is referred to as the null hypothesis. The null hypothesis
is always one of status quo and is identified by the symbol H0. The null
hypothesis here is that the filling process is working properly, that the
mean fill per box is the 350 grams according to standard. This can be
stated as:
$H_0: \mu = 350$
If a null hypothesis is specified, an alternative hypothesis must also
be specified, one that must be true if the null hypothesis is found to be
false. The alternative hypothesis H1 is always the opposite of the null
hypothesis H0. This is stated in our cereal example as:
$H_1: \mu \neq 350$
The alternative hypothesis represents the conclusion reached by rejecting

the null hypothesis if there is sufficient evidence from the sample
information to decide that the null hypothesis is unlikely to be true.
Hypothesis-testing methodology is designed so that the rejection of the

null hypothesis is based on evidence from the sample and the alternative
hypothesis is far more likely to be true. However, failure to reject the
null hypothesis is not proof that it is true. One can never prove that
the null hypothesis is correct because the decision is based only on
the sample information, not on the entire population. Therefore, if you
fail to reject the null hypothesis, you can only conclude that there is
insufficient evidence to warrant its rejection.39
The following key points summarize the null and alternative

hypotheses:39
1. The null hypothesis H0 represents the status quo or the
current belief in a situation.
39
Levine D.M. and others, Statistics for Managers Using Microsoft Excel, Prentice Hall,
NY, 2005., p. 333
2. The alternative hypothesis H1 is the opposite of the null
hypothesis and represents a research claim or specific
inference we would like to prove.
3. If we reject the null hypothesis, we have statistical proof
that the alternative hypothesis is correct.
4. If we do not reject the null hypothesis, then we have failed to
prove the alternative hypothesis. The failure to prove the
alternative hypothesis, however, does not mean that we
have proven the null hypothesis.
5. The null hypothesis H0 always refers to a specified value of
the population parameter (such as μ), not a sample statistic
(such as $\bar{x}$).
6. The statement of the null hypothesis always contains an
equal sign regarding the specified value of the population
parameter (e.g. H0: μ = 350).
7. The statement of the alternative hypothesis never
contains an equal sign regarding the specified value of the
population parameter (e.g. H1: μ ≠ 350).
Hypothesis-testing methodology provides clear definitions for evaluating

such differences and enables us to quantify the decision-making process
so that the probability of obtaining a given sample result can be found
if the null hypothesis is true. This is achieved by first determining
the sampling distribution for the sample statistic of interest (e.g. the
sample mean) and then computing the particular test statistics based
on the given sample result. Because the sampling distribution for the
test statistic often follows a well-known statistical distribution, such as
the standardized normal distribution or t distribution, these distributions
can be used to help decide whether the sample result is consistent with
the null hypothesis.
Statistical estimation and hypothesis testing do not guarantee that

decision makers make correct decisions, but utilization of the techniques
will increase the likelihood of the decisions being correct; they allow
uncertainty to be incorporated into the process.
6.8.1. Regions of rejection and non-rejection
The sampling distributions of the test statistics are divided into two
regions:
• Region of rejection (critical region) and
• Region of non-rejection.
Figure 6.2. Graphical presentation of rejection and non-rejection regions
According to the critical value approach, if the test statistic falls
into the region of non-rejection, the null hypothesis cannot be
rejected. If the value of the test statistic falls into the rejection
region, the null hypothesis is rejected because that value is
unlikely if the null hypothesis is true.
When we use a sample statistic to make decision about a population

parameter, there is a risk that an incorrect conclusion will be reached.
Two different types of errors can occur when applying hypothesis
testing methodology, type I errors and type II errors.
A type I error occurs if the null hypothesis H0 is rejected when

in fact it is true and should not be rejected. The probability of
a type I error occurring is α.
A type II error occurs if the null hypothesis H0 is not rejected
when in fact it is false and should be rejected. The probability
of a type II error occurring is β .
The confidence coefficient (1-α) is the probability that the

null hypothesis H0 is not rejected when in fact it is true and
should not be rejected.
The power of a statistical test (1-β ) is the probability of
rejecting the null hypothesis when in fact it is false and should
be rejected.
6.8.2. Risks in decision making process
Next table illustrates the results of two possible decisions (do not reject
H0 or reject H0) that can occur in any hypothesis test. Depending on the
specific decision, one of two types of errors may occur or one of two
types of correct conclusion may be reached.
Table 6.1. Hypothesis testing: two possible decisions and corresponding errors
                        Actual situation
Statistical decision    H0 true                  H0 false
do not reject H0        Correct decision         Type II error
                        Confidence = (1-α)       p(type II error) = β
reject H0               Type I error             Correct decision
                        p(type I error) = α      Power = (1-β)
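The quantities in the table can be made concrete with a small calculation of β and power for a lower-tail z test. The figures used (H0: μ = 350, a true mean of 345, σ = 15, n = 25, α = 0.05) are hypothetical values chosen for illustration, and the function name is not from the text.

```python
from math import sqrt
from statistics import NormalDist

def power_lower_tail(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of the lower-tail z test of H0: mu = mu0 against H1: mu < mu0."""
    se = sigma / sqrt(n)
    # H0 is rejected when the sample mean falls below this cut-off.
    cutoff = mu0 + NormalDist().inv_cdf(alpha) * se
    # Probability of landing in the rejection region if mu_true is the real mean.
    return NormalDist(mu_true, se).cdf(cutoff)

power = power_lower_tail(mu0=350, mu_true=345, sigma=15, n=25)
beta = 1 - power
print(round(power, 3), round(beta, 3))
```

If the true mean equals μ0, the same function returns α, the type I error probability. Increasing n raises the power, that is, it lowers β for the same α.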
6.8.3. Procedure for hypothesis testing
The procedure for hypothesis testing can be described in four steps:

1. Determine the null and alternative hypotheses.
2. State the critical value of the test statistic according to the
significance or confidence level and the appropriate theoretical distribution.
3. Calculate the test statistic from the sample values.
4. Compare the test statistic with the critical value and draw a conclusion.
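The four steps can be sketched in Python for the simplest case, a z test of the mean with a known population standard deviation. This is a minimal illustration rather than the book's own procedure: the function name `z_test_mean`, its parameters and the numbers in the usage line are hypothetical, and the critical value is taken from the standard library's `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, alpha, tail="two"):
    """Four-step z test for a mean (hypothetical helper, sigma assumed known)."""
    # Step 1: H0: mu = mu0; `tail` selects H1 ("two", "lower" or "upper").
    # Step 2: critical value from the standard normal distribution.
    if tail == "two":
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    else:
        z_crit = NormalDist().inv_cdf(1 - alpha)
    # Step 3: test statistic calculated from the sample values.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: compare the statistic with the critical value and decide.
    if tail == "two":
        reject = abs(z) > z_crit
    elif tail == "lower":
        reject = z < -z_crit
    else:
        reject = z > z_crit
    return z, z_crit, reject


# Illustrative numbers only (not taken from the text).
z, z_crit, reject = z_test_mean(sample_mean=10.3, mu0=10.0, sigma=1.0,
                                n=25, alpha=0.05)
print(round(z, 3), round(z_crit, 3), reject)
```

With these illustrative inputs the statistic is 1.5, below the two-tailed critical value of about 1.96, so H0 is not rejected.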
6.8.4. Hypothesis for the mean
We begin with the problem of testing the hypothesis that the
population mean is equal to, higher than or lower than some specified
value μ0. The choice of the appropriate test depends on the answer
to the question: “Do we know the standard deviation of the population or
only of the sample?” If we know only the sample standard deviation, we
have to decide which theoretical distribution to apply according to the
sample size.
To use the one-sample test about the mean, the numerical data
are assumed to represent a random sample from a population that is
normally distributed. In practice, as long as the sample size is not very
small and the population is not very skewed, Student's t distribution
provides a good approximation to the sampling distribution of the mean
when the population variance is unknown. When a large sample is
available, the sample standard deviation estimates the population
standard deviation precisely enough that there is little difference
between the t and z distributions. Therefore, for a large sample, a z test
can be used instead of a t test when the population variance is unknown.
Population variance σ² is known
1. Two-tailed test
1.
2.
3.
4.
2. One-tailed test
a. Lower boundary
1.
2.
3.
4.
b. Upper boundary
1.
2.
3.
4.
Population variance σ² is unknown, small sample
1. Two-tailed test
1.
2.
3.
4.
2. One-tailed test
a. Lower boundary
1.
2.
3.
4.
b. Upper boundary
1.
2.
3.
4.
Population variance σ² is unknown, large sample
1. Two-tailed test
1.
2.
3.
4.
2. One-tailed test
a. Lower boundary
1.
2.
3.
4.
b. Upper boundary
1.
2.
3.
4.
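In their usual textbook form, the three cases above differ only in the standard error and in the reference distribution. The following is a sketch under assumed notation (x̄ for the sample mean, s for the sample standard deviation computed with divisor n - 1, n for the sample size); the symbols are illustrative and may differ from the notation used elsewhere in this chapter.

```latex
% Population variance known: standard normal distribution
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

% Population variance unknown, small sample: Student's t with n - 1 degrees of freedom
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

% Population variance unknown, large sample: normal approximation
z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

% Decision rules at significance level \alpha (use t in place of z for small samples)
% two-tailed test:  reject H_0 if |z| > z_{\alpha/2}
% lower boundary:   reject H_0 if z < -z_{\alpha}
% upper boundary:   reject H_0 if z > z_{\alpha}
```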
Example 6.10.
Two-tailed test of the population mean, known population variance.
A survey of visitors' satisfaction with the service in a restaurant is
undertaken. Visitors graded their satisfaction using a Likert scale to
determine their level of agreement with statements defined in the survey
(scale from 0 to 5, where 0 is completely agree and 5 is completely
disagree). The manager believes that the true average is 2, and the sample
results reveal a mean of 1.99 and a standard deviation of 0.05 liter. At the
0.95 level of confidence, test whether the manager is right or the mean
differs from 2.
Solution:
We know the population standard deviation, so this is a two-tailed z test:
1.
2.
3.
4.
There is evidence that the mean amount in the bottles is different from
2.0 liters.
Example 6.11.
One-tailed test of the population mean (lower boundary), unknown
population variance, large sample.
The director of admissions at a large university advises parents of
incoming students about the cost of textbooks during a typical semester.
A sample of 80 students enrolled in the university indicates a sample
mean cost of $315.4 with a sample standard deviation of $43.2.
Using the 0.01 level of significance, is there evidence that the population
mean is less than $320?
Solution:
We don’t know the population standard deviation and the sample is large,
so this is a one-tailed z test:
1.
2.
3.
4.
There is no evidence that the population mean is less than $320.
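The calculation in this example can be checked with a short Python sketch. It assumes the standard error is s/√n and reads the critical value from `statistics.NormalDist` rather than from a table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, s = 80, 315.4, 43.2   # sample data from Example 6.11
mu0, alpha = 320.0, 0.01              # hypothesized mean and significance level

z = (sample_mean - mu0) / (s / sqrt(n))   # assumes standard error s / sqrt(n)
z_crit = NormalDist().inv_cdf(alpha)      # lower-tail critical value
reject_h0 = z < z_crit

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

The statistic is about -0.95, which lies above the critical value of about -2.33, in line with the conclusion that H0 cannot be rejected.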
Example 6.12.
One-tailed test of the population mean (upper boundary), unknown
population variance, small sample.
We took a sample of 13 machines and counted the daily
production on each of them. The following results were recorded:
342, 426, 317, 545, 264, 451, 1,049, 631, 512, 266, 492, 562, 298.
At the 99% level of confidence, is there evidence that the machines
produce more than 350 products?
Solution:
From original data we calculate:
We don’t know the population standard deviation and the sample is small,
so this is a one-tailed t test:
1.
2.
3.
4.
There is evidence that the average number of products is more than 350.
6.8.5. A two sample test for means
Means are used to summarize distributions based on continuous data
(interval or ratio measurement). A statistical procedure called the t test
is used to test the significance of the difference between two means.
The t test assesses the degree of overlap between the distributions of
scores in the two samples being compared. When the two distributions are
highly similar, there will be little difference between the means. When
the scores in one distribution are distributed differently from those in
the other, the difference between the means is likely to be larger.
A t test can be used with large or small samples. However, as the sample
size becomes smaller, mean differences have to be larger to become
significant. In addition to the requirement of continuous measurement,
the t test assumes that the variable being measured is normally
distributed in the population from which the sample was selected. Even
when distributions for samples are mildly skewed, it may be reasonable
to assume a normal distribution for the variable in the population.
However, when the distribution for a sample is badly skewed or you
doubt that the variable is normally distributed in the population, you
should not use a t test. As an alternative you can compare medians or
convert continuous data to a set of intervals and conduct a chi square
test.
We have two main types of test for the significance of the difference
between two means if we don’t know population variances:
1.
1.
2.
3.
4.
2.
1.
2.
3.
4.
Example 6.13.
Test of the difference between two population means, large samples.
Let’s imagine that a new soft drink has been developed and its
manufacturers claim that it boosts memory recall. We need to test
whether or not the drink is effective. We start by collecting two random
samples, each of 100 students. We give all students a soft drink, but one
group receives the memory drink (Total-Recall) and the other a carbonated
sugar water drink (this is known as a placebo). All 200 students think
they have received the memory drink. The students all take a memory
recall test, with the following results:
Group 1 (Total-Recall): Mean Score: 55; Standard Deviation: 12 marks
Group 2 (placebo): Mean Score: 51.8; Standard Deviation: 9 marks
The difference in the Mean Scores between the two groups is 3.2 marks,
in favour of the Total-Recall drink. Is this result significant (α = 1%)?
Solution:
1.
2.
3.
4.
The difference of 3.2 marks in the Mean Scores between the two groups
is not statistically significant at the 1% level.
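A short Python sketch reproduces this comparison. It assumes the large-sample z statistic with standard error √(s1²/n1 + s2²/n2) and a two-tailed test; the variable names are illustrative, and the critical value comes from `statistics.NormalDist` rather than a table.

```python
from math import sqrt
from statistics import NormalDist

n1, mean1, sd1 = 100, 55.0, 12.0   # Total-Recall group
n2, mean2, sd2 = 100, 51.8, 9.0    # placebo group
alpha = 0.01

se = sqrt(sd1**2 / n1 + sd2**2 / n2)           # standard error of the difference
z = (mean1 - mean2) / se
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
significant = abs(z) > z_crit

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, significant: {significant}")
```

The statistic is about 2.13 and the critical value about 2.58, so the difference is not significant at α = 1%.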
Example 6.14.
Test of the difference between two population means, small samples.
In order to investigate the effect of a new insecticide on the number of
apple buds, a study was conducted on apple trees that were attacked by a
sort of aphid. 15 apple trees from the sample were treated with the new
insecticide, while 14 of them were not treated at all. After a month, in
the blooming period, the sample data were as follows:
                      Treated   Not treated
Number of buds mean   820       582
Number of buds SD     223.6     277.3
Sample size           15        14
Is there a difference in the number of buds between the groups of
treated and not treated apple trees (α = 5%)?
Solution:
1.
2.
3.
4.
There is a significant difference between the number of buds on the
treated and not treated trees.
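The comparison can be sketched in Python under two stated assumptions: the reported standard deviations are sample standard deviations with divisor n - 1, and the two populations share a common variance, so that the pooled-variance t statistic applies. The constant `T_CRIT` is the tabulated two-tailed t value for α = 5% and 27 degrees of freedom; all names are illustrative.

```python
from math import sqrt

n1, mean1, sd1 = 15, 820.0, 223.6   # treated trees
n2, mean2, sd2 = 14, 582.0, 277.3   # not treated trees
T_CRIT = 2.052                      # tabulated t, two-tailed alpha = 0.05, 27 df

df = n1 + n2 - 2
# Pooled variance: weighted average of the two sample variances (assumption).
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
se = sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean1 - mean2) / se
significant = abs(t) > T_CRIT

print(f"t = {t:.3f}, df = {df}, significant: {significant}")
```

Under these assumptions the statistic is about 2.55, which exceeds 2.052, in line with the conclusion that the difference is significant.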
Example 6.15.
We conducted research on the impact of lack of sleep on the ability to
solve mathematical tasks. A sample of 30 participants first took a
mathematics test in “normal” circumstances. After that they were not
allowed to sleep for 72 hours and a parallel test was given to them.
The results are:
No. I test II test
1 32 28
2 34 26
3 28 30
4 27 25
5 35 33
6 19 21
7 24 22
8 30 30
9 30 27
10 27 22
11 40 32
12 28 29
13 35 31
14 37 36
15 15 20
16 18 20
17 19 15
18 21 20
19 27 26
20 30 28
21 38 34
22 32 30
23 30 20
24 28 21
25 27 26
26 29 33
27 22 20
28 14 15
29 35 30
30 33 32
Is there a significant difference between the results of tests I and II?
The data are in the table; use the reliability of 0.94.
Solution:
we have z distribution and paired samples. Hypotheses

are:
1.
The Data Analysis option in Excel (from the Tools menu) is used to analyse
the given paired samples:
The result is:
t-Test: Paired Two Sample for Means
                               I test     II test
Mean                           28.13333   26.06667
Variance                       45.29195   32.61609
Observations                   30         30
Pearson Correlation            0.853868
Hypothesized Mean Difference   0
df                             29
t Stat                         3.231368
P(T<=t) one-tail               0.001531
t Critical one-tail            1.601972
P(T<=t) two-tail               0.003063
t Critical two-tail            1.957293
As we can see in Excel output:
There is a significant difference between the population averages,
which confirms that lack of sleep has an impact on the ability
to solve mathematical tasks.
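The t Stat reported by Excel can be reproduced from the table with the Python standard library. The sketch assumes the usual paired-samples statistic, the mean of the differences divided by its standard error; the list names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

test_1 = [32, 34, 28, 27, 35, 19, 24, 30, 30, 27, 40, 28, 35, 37, 15,
          18, 19, 21, 27, 30, 38, 32, 30, 28, 27, 29, 22, 14, 35, 33]
test_2 = [28, 26, 30, 25, 33, 21, 22, 30, 27, 22, 32, 29, 31, 36, 20,
          20, 15, 20, 26, 28, 34, 30, 20, 21, 26, 33, 20, 15, 30, 32]

# Paired design: work with the per-person differences.
diffs = [a - b for a, b in zip(test_1, test_2)]
n = len(diffs)
df = n - 1
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))   # stdev uses divisor n - 1

print(f"mean difference = {mean(diffs):.4f}, t = {t_stat:.4f}, df = {df}")
```

The result agrees with the Excel output (t Stat of about 3.2314 with 29 degrees of freedom), which exceeds the two-tailed critical value of 1.957293 shown there.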
6.8.6. Testing differences between arithmetic means of more than two populations on the basis of their samples - analysis of variance (ANOVA)
The one-way analysis of variance (one-way ANOVA) tests whether there
is a difference among the arithmetic means of more than two
populations by comparing variances. In other words, one-way
ANOVA investigates the influence of a certain factor with k dimensions
(levels) on one characteristic (variable). Therefore, we have k samples,
one for each of the k factor dimensions. An example is investigating the
influence of different fertilizers on the yield of some kind of wheat. If
the number of elements in the i-th sample is ni, and the j-th element of
the i-th sample is denoted by xij, we have the following results of
measurements:
Arithmetic mean and variance for these samples are:
If all of these blocks are connected in one sample, it returns a sample
with elements of the arithmetic mean and
total variance
From the
it is
where is residual variance,
and is factorial variance.
Degrees of freedom are:
The corresponding estimates of the variances are:
- this is an estimate of the total variance of the population and is a
result of fluctuations in the sample as well as of all other causes that
effectively influence the observed characteristic.
- this is an estimate of the variance between the groups
of samples and is a result of the sample fluctuations and of the different
effects of the factor. Therefore it is called the factorial variance.
- this is an estimate of the variance from which the
influence of the factor has been eliminated. It is a product of the
fluctuations in the sample and, therefore, is called the residual variance.
If there is no difference in the effects of the different factor dimensions
on the observed characteristic, the variances WA and Wr should represent
the same variance and the quotient F = WA / Wr should not be
significantly different from 1.
If our assumption is correct that all k samples belong to the same
normally distributed population (and so do not differ in their arithmetic
means), then the theoretical value for the comparison is the tabulated
value of F with k - 1 and n - k degrees of freedom (n being the total
number of observations) and type I error α.
Example 6.16.
Test of the differences of four population means, analysis of variance.
(Kendall example) A bulb factory needs to examine which of the 4
available qualities of fibre should be used in its production, that is, with
which fibre the bulb has the longest life. They randomly selected 7 bulbs
with fibre A1, 5 bulbs with fibre A2, 8 bulbs with fibre A3 and 6 bulbs
with fibre A4.
For the selected samples, bulb life (in hours) was:
A1: 1,600, 1,610, 1,650, 1,680, 1,700, 1,720, 1,800
A2: 1,580, 1,640, 1,640, 1,700, 1,750
A3: 1,460, 1,550, 1,600, 1,620, 1,640, 1,660, 1,740, 1,820
A4: 1,510, 1,520, 1,530, 1,570, 1,600, 1,680.
Which of the 4 available qualities of fibre should be used in production
(α = 5%)?
Solution:
We work with ANOVA, because we have 4 samples. For each sample
we will calculate the average:
fibre   ni   average
A1      7    1,680.0
A2      5    1,662.0
A3      8    1,636.2
A4      6    1,568.3
Total mean is:
Total variance is:
Residual variance is:
Factorial variance is:
Then:
For 3 and 22 degrees of freedom and type I error 5%, we have the theoretical
F value: Ft = 3.05. Considering that the calculated F value is lower than
Ft, we conclude that the heterogeneity in the results cannot be considered
significant and we are indifferent to the choice of fibre quality.
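The F ratio for the bulb data can be recomputed with a few lines of Python. The sketch assumes the usual one-way ANOVA decomposition into between-group and within-group sums of squares, corresponding to the factorial and residual variances; the variable names are illustrative.

```python
from statistics import mean

groups = {
    "A1": [1600, 1610, 1650, 1680, 1700, 1720, 1800],
    "A2": [1580, 1640, 1640, 1700, 1750],
    "A3": [1460, 1550, 1600, 1620, 1640, 1660, 1740, 1820],
    "A4": [1510, 1520, 1530, 1570, 1600, 1680],
}
all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)
k, n = len(groups), len(all_values)

# Between-group (factorial) and within-group (residual) sums of squares.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

ms_between = ss_between / (k - 1)   # factorial variance estimate
ms_within = ss_within / (n - k)     # residual variance estimate
f_stat = ms_between / ms_within

print(f"F = {f_stat:.3f} with {k - 1} and {n - k} degrees of freedom")
```

The sketch gives an F of about 2.15 with 3 and 22 degrees of freedom, below the theoretical value of 3.05.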
Example 6.17.
Test of the differences of three population means, analysis of variance.
Solution by Excel ANOVA.
We know data on the number of days of sick leave of employees of
enterprise X in 2004. We analyzed three employee groups: younger
workers (20 - 30 yrs.), middle age workers (30 - 50 yrs.) and older
workers (50 - 60 yrs.).
Younger workers Middle age workers Older workers

2 7 25
5 9 30
10 0 0
0 0 0
0 15 0
7 8 3
15 10 2
8 0 2
9 15 0
8 5 30
4 2 0
5 7 5
0 5 0
7 8 2
Is there a statistically significant difference in the number of days of
sick leave among the three age groups, that is, does the employee’s age
have a significant impact on the number of days of sick leave
(reliability 95%)?
Solution:
The data are given in three groups (samples). To test the differences
between the averages we will use ANOVA. We will apply the Excel procedure
from Data Analysis:
The result is given by the Excel output:
Anova: Single Factor

SUMMARY
Groups        Count   Sum   Average    Variance
younger age   14      80    5.714286   18.83516
mean age      14      91    6.5        24.57692
older age     14      99    7.071429   136.2253

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        13         2    6.5        0.108552   0.897402   3.238096
Within Groups         2335.286   39   59.87912
Total                 2348.286   41
The p-value of the F test (0.897) is higher than 0.05, so the hypothesis
that the averages are equal cannot be rejected. The conclusion
is that there is no statistically significant difference in the number of
days of sick leave among the three age groups.
6.8.7. Statistical tests for the proportion
In some situations, we need to check a hypothesis pertaining to the
population proportion of values that are in a particular category.
A random sample is selected from the population and the sample
proportion is computed. The value of this statistic is then compared
to the hypothesized (assumed) value of the parameter pA, so that a
decision can be made about the hypothesis.
The next problem is testing the hypothesis that the
population proportion is equal to, higher than or lower than some
specified value p0. We start from the model of the confidence interval
for the population proportion to find the test statistic for the test of
the population proportion:
Then we can set out the procedures for the different kinds of test for the
population proportion.
Two sided test for proportion
1.
2.
3.
4.
One sided test for proportion, lower boundary
1.
2.
3.
4.
One sided test for proportion, upper boundary
1.
2.
3.
4.
Example 6.18.
One-tailed test of the population proportion (upper boundary).
A manufacturer of a specific type of product for cleaning kitchen
appliances knows, from historical data, that 51.6% of customers buy his
product. With the aim of improving sales, advertisements are broadcast
on TV and a series of promotional activities is organized. After some
time he interviewed 450 randomly selected households. This time, 264
households were identified as buying his product. With a 2% error we have
to conclude whether the advertisements influenced the views of customers
about the choice of type of detergent.
Solution:
This is one sided test for the population proportion, upper boundary:
1.
2.
3.
4.
We have shown that the advertisements influenced the views of customers
about the choice of type of detergent, that is, demand for this
manufacturer’s product increased under the influence of the new
advertisements.
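The test for this example can be sketched in Python. One assumption is made loudly here: the standard error is computed under H0 from p0, whereas a test statistic derived from the confidence interval model may use the sample proportion instead. With these data either choice leads to the same decision. Variable names are illustrative and the critical value comes from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

p0, n, buyers, alpha = 0.516, 450, 264, 0.02

p_hat = buyers / n                         # sample proportion
se = sqrt(p0 * (1 - p0) / n)               # assumption: standard error under H0
z = (p_hat - p0) / se
z_crit = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value
reject_h0 = z > z_crit

print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, critical value = {z_crit:.3f}")
```

The statistic is about 3.00 against a critical value of about 2.05, so H0 is rejected.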
Example 6.19.
One-tailed test of the population proportion (lower boundary).
A sample of 200 college-educated 35 to 64-year-olds with household
incomes of more than $100,000 per year were asked if they agreed with
the following statement: “Government should be more involved in
regulation of private enterprise.” 57.5% of the respondents agreed with
the given statement. With a 1% type I error, test whether there is evidence
that less than 60% of the population of college-educated 35 to 64-year-olds
with household incomes of more than $100,000 per year agrees that
government should be more involved in regulation of private enterprise.
Solution:
This is one sided test for the population proportion, lower boundary:
1.
2.
3.
4.
Our conclusion is that there is no evidence that less than 60% of the
population of college-educated 35 to 64-year-olds with household incomes
of more than $100,000 per year agrees that government should be more
involved in regulation of private enterprise.
6.8.8. Statistical tests for the variance
Decision makers are frequently more interested in the spread
of a population than in its central location. Many product specifications
involve both an average value and some limit on the dispersion that
these values can have, such as tolerance levels. There is no technique for
directly testing hypotheses about a population standard deviation, but
a test is available for a population variance.
A test of a single variance assumes that the underlying distribution
is normal. The null and alternative hypotheses are stated in terms of
the population variance (or population standard deviation). The test
statistic is:
To test an assumption about the population variance we use the
chi-square test if n is less than 30. For a large sample we use the
normal approximation to the chi-square distribution. The theoretical
value of chi-square under the normal approximation is then determined
by the relation:
Two sided test for variance
1.
2.
3.
4.
One sided test for variance, lower boundary
1.
2.
3.
4.
One sided test for variance, upper boundary
1.
2.
3.
4.
Example 6.20.
Two-tailed test of the population variance, large sample.
Teachers are interested not only in how their students do on exams on
average, but also in how the exam scores vary. The assumption is that the
variance of exam scores is 25 points. We took a sample of 45 students and
found a standard deviation of 4.7. Test the assumption with a significance
level of 4%.
Solution:
This is a two-sided test for the population variance, large sample:
1.
2.
Large sample: normal distribution approximation of the theoretical value
of chi-square
a.
b.
3.
4.
There is not enough evidence to reject the math instructor's assumption
that the variance is 25 points.
Example 6.21.
One-tailed test of the population variance (lower boundary), small sample.
An oil station management assumed that the standard deviation of
normally distributed waiting times for customers on Sunday afternoon
is 7.2 minutes. We find that for 25 randomly chosen customers, the
waiting times have a standard deviation of 5.5 minutes. With a confidence
of 95%, test the assumption that the variation of waiting time is lower than
management previously believed.
Solution:
1.
2.
3.
4.
We will accept the assumption that variation among waiting times is

lower than 7.2 minutes.
6.8.9. Chi-square (χ²) test of independence
This is a nonparametric test. For a contingency table that has
r rows and c columns, the χ² test can be generalized as a test of
independence or association in the joint responses to two
categorical or interval-grouped numerical variables.
Table 6.2. Contingency table (mij = fij, ni. = fi., n.j = f.j)
modalities for   modalities for variable B            total
variable A       B1    B2    ...   Bj    ...   Bc     (∑)
A1               m11   m12   ...   m1j   ...   m1c    n1.
A2               m21   m22   ...   m2j   ...   m2c    n2.
...
Ai               mi1   mi2   ...   mij   ...   mic    ni.
...
Ar               mr1   mr2   ...   mrj   ...   mrc    nr.
total (∑)        n.1   n.2   ...   n.j   ...   n.c    n
Testing procedure is:
1. H0: there is no relationship between the two variables, that is, the
variables are independent / H1: there is a relationship between the two
variables, that is, the variables are dependent
2.
3.
4.
Example 6.22.
Chi-square test of independence between two variables.
A university is interested in determining whether an association exists
between the traveling time (from home to faculty) of its students and the
level of stress related problems observed on the lecturers. A study of 116
students reveals the following:
Traveling      Stress
time           high   moderate   low   total
Under 15 min   9      5          18    32
15-45 min      17     8          28    53
Over 45 min    18     6          7     31
Total          44     19         53    116
At the level of significance 0.01, is there evidence of a significant
relationship between traveling time and stress?
Solution:
In the contingency table we have the empirical frequencies.
We will calculate the theoretical frequencies by the formula:
Theoretical frequencies
Traveling      Stress
time           high       moderate   low        total
Under 15 min   12.13793   5.241379   14.62069   32
15-45 min      20.10345   8.681034   24.21552   53
Over 45 min    11.75862   5.077586   14.16379   31
Total          44         19         53         116
Now we can calculate the value of the expression for each cell:
               high       moderate   low
Under 15 min   0.811226   0.011116   0.781067
15-45 min      0.479092   0.053428   0.591452
Over 45 min    3.312873   0.167569   3.623318
At the end we sum all the values:
Appropriate test procedure is:
1. H0: there is no relationship between the two variables / H1: there is
a relationship between the two variables
2.
3.
4.
There is no evidence of a significant relationship between traveling

time and stress.
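The theoretical frequencies and the chi-square sum for this example can be reproduced with a short Python sketch. It assumes the usual formula (row total × column total) / n for the theoretical frequencies and (r - 1)(c - 1) degrees of freedom; the variable names are illustrative.

```python
observed = [
    [9, 5, 18],    # under 15 min: high, moderate, low stress
    [17, 8, 28],   # 15-45 min
    [18, 6, 7],    # over 45 min
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Theoretical frequency of each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.3f} with {df} degrees of freedom")
```

The sketch gives a statistic of about 9.83 with 4 degrees of freedom, to be compared with the tabulated chi-square value for the chosen significance level.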
6.8.10. Test for differences among proportions for populations
Differences among population proportions are tested when we want to
examine whether there are significant differences between the
proportions of two or more populations, based on samples.
The model used for testing is as follows:
1.
2.
3.
4.
where:
m - number of samples (number of populations)
Pk - proportion in k-th population
nk - sample size for sample from k-th population
fk (fkt) - empirical (theoretical) frequency
Example 6.23.
Chi-square test of differences among four population proportions.
The purchase of coffee is investigated in 4 separate areas. It is assumed
that consumers buy coffee in the same proportion in each of
these 4 areas. We have selected a sample of coffee consumers to test this
assumption.
Area    Sample size   Number of coffee consumers and buyers in sample
A       100           20
B       200           35
C       150           37
D       250           43
Total   700           135
Can we accept the assumption that the proportion of coffee buyers is
equal, with a 5% error?
Solution:
Area    Sample size - ni   Number of coffee consumers and buyers in sample - fi
A       100                20
B       200                35
C       150                37
D       250                43
Total   700                135
1.
2.
3.
Area   ni    fi    Expected number of coffee consumers   (fi - fti)²/fti
                   and buyers in sample - fti
A      100   20    19.286                                0.02643
B      200   35    38.572                                0.33079
C      150   37    28.929                                2.25176
D      250   43    48.215                                0.56406
Sum    700   135   135.002                               3.17304
4.
Therefore, we can say that the proportion of coffee buyers is equal in
every area, with a 5% error.
6.8.11. Test of adequacy to approximations (goodness of fit)
If we have approximated an empirical distribution by some
theoretical distribution and we want to examine the quality (adequacy)
of the approximation, we use a nonparametric chi-square test:
1. H0: the population is distributed according to a specific
theoretical frequency distribution / H1: the H0 approximation is not
correct
2.
3.
4.
where:
r - number of parameters that are estimated from empirical data
m- number of modalities or intervals
fk (fkt) - empirical (theoretical) frequencies
Example 6.24.
Chi-square test of adequacy of approximation with the Poisson distribution.
For the empirical distribution:
Modalities   Frequencies
0            35
1            115
2            130
3            75
4            30
5            10
6            5
We assume that the variable behaves according to the Poisson distribution.
We have to test the validity of this assumption (α = 4%).
Solution:
First we have to approximate this empirical distribution by the
Poisson distribution. We will calculate the average.
xi fi xi · fi
0 35 0
1 115 115
2 130 260
3 75 225
4 30 120
5 10 50
6 5 30
Sum 400 800
Then we will use tables or the previous formula for pti.
xi fi pti
0 35 0.13534
1 115 0.27067
2 130 0.27067
3 75 0.18045
4 30 0.09022
5 10 0.03609
6 5 0.01656
Sum 400 1
Theoretical frequencies are equal to fti = n · pti.
xi fi fti
0 35 54.136
1 115 108.268
2 130 108.268
3 75 72.18
4 30 36.088
5 10 14.436
6 5 6.624
Sum 400 400
Now we can apply chi-square test for goodness of fit:
1.
2.
3.
xi    fi    fti       (fi - fti)²/fti
0     35    54.136    6.7642
1     115   108.268   0.4186
2     130   108.268   4.3621
3     75    72.18     0.1102
4     30    36.088    1.0270
5     10    14.436    1.3631
6     5     6.624     0.3982
Sum   400   400       14.4434
4. Our approximation isn't valid.
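The goodness-of-fit calculation can be sketched in Python. Several assumptions are made here: each squared deviation is divided by the theoretical frequency (the usual Pearson form, as in the independence test of section 6.8.9), the last class collects the whole upper tail of the Poisson distribution (6 or more), and the degrees of freedom are m - r - 1 with r = 1 estimated parameter. Totals obtained with the empirical frequency in the denominator will be different. Variable names are illustrative.

```python
from math import exp, factorial

values = [0, 1, 2, 3, 4, 5, 6]
observed = [35, 115, 130, 75, 30, 10, 5]
n = sum(observed)

# Poisson parameter estimated by the sample average.
lam = sum(x * f for x, f in zip(values, observed)) / n

# Poisson probabilities; the last class takes the remaining upper tail.
probs = [exp(-lam) * lam**x / factorial(x) for x in values[:-1]]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

chi_square = sum((f - ft) ** 2 / ft for f, ft in zip(observed, expected))
df = len(values) - 1 - 1   # m - r - 1, with r = 1 estimated parameter

print(f"mean = {lam}, chi-square = {chi_square:.3f}, df = {df}")
```

Because the sketch uses unrounded probabilities, its theoretical frequencies differ slightly from the rounded values in the table. The statistic comes out at about 14.44 with 5 degrees of freedom, to be compared with the tabulated chi-square value for α = 4%.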
6.1. Illustration of unbiased estimation.
Consider a population of 3 possible elements: 0, 1 and 2. Their
probability distribution is shown in the table below:
x 0 1 2
p 1/3 1/3 1/3
A random sample of n = 3 elements is selected from the population.
a) Find the sampling distribution of the sample mean X̄ and the sample
median Me.
b) Show that X̄ is an unbiased estimator of the population mean in this
situation.
c) Show that Me is an unbiased estimator of the population mean in this
situation.
Solution:
a) The following table contains all possible samples, their mean, median
and probability:
Sample Mean Me Probability
000 0 0 1/27
001 0.33 0 1/27
002 0.67 0 1/27
010 0.33 0 1/27
011 0.67 1 1/27
012 1 1 1/27
020 0.67 0 1/27
021 1 1 1/27
022 1.33 2 1/27
100 0.33 0 1/27
101 0.67 1 1/27
102 1 1 1/27
110 0.67 1 1/27
111 1 1 1/27
112 1.33 1 1/27
120 1 1 1/27
121 1.33 1 1/27
122 1.67 2 1/27
200 0.67 0 1/27
201 1 1 1/27
202 1.33 2 1/27
210 1 1 1/27
211 1.33 1 1/27
212 1.67 2 1/27
220 1.33 2 1/27
221 1.67 2 1/27
222 2 2 1/27
From the table it can be concluded that X̄ can assume the values 0;
0.33; 0.67; 1; 1.33; 1.67 and 2. The value X̄ = 0 occurs in only one sample:
000. Similarly, the value X̄ = 0.33 occurs in three samples:
001, 010 and 100, and so on. By calculating the probabilities of the
remaining values of X̄, we obtain the sampling distribution of X̄, given in
the table below:
X̄      0      0.33   0.67   1      1.33   1.67   2
p(X̄)   1/27   3/27   6/27   7/27   6/27   3/27   1/27
Similarly, we obtain sampling distribution of Me:
Me 0 1 2
p(Me) 7/27 13/27 7/27
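b) The population mean (the expected value of the discrete random variable)
is equal to:
μ = 0 · 1/3 + 1 · 1/3 + 2 · 1/3 = 1
The expected value of the sample mean X̄ is equal to:
E(X̄) = (0 · 1 + 1/3 · 3 + 2/3 · 6 + 1 · 7 + 4/3 · 6 + 5/3 · 3 + 2 · 1) / 27 = 1
Since E(X̄) = μ, we conclude that X̄ is an unbiased estimator of μ.
c) The expected value of the sample median is:
E(Me) = 0 · 7/27 + 1 · 13/27 + 2 · 7/27 = 1
Since the expected value of Me is equal to μ, it can be concluded that
Me is an unbiased estimator of μ.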
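The enumeration in this problem is small enough to verify by brute force. The sketch below lists all 27 equally likely ordered samples and computes the expected values of the sample mean and the sample median with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from statistics import median

population = [0, 1, 2]
samples = list(product(population, repeat=3))   # 27 equally likely ordered samples
prob = Fraction(1, len(samples))

mu = Fraction(sum(population), len(population))            # population mean
expected_mean = sum(Fraction(sum(s), 3) * prob for s in samples)
expected_median = sum(median(s) * prob for s in samples)

print(mu, expected_mean, expected_median)
```

All three values are equal to 1, so both the sample mean and the sample median are unbiased estimators of μ for this population.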
6.2. Confidence interval of the mean, unknown population variance,
large sample.
Research indicated that bicycle helmets save lives. A study reported
in Public Health Reports (1992) intended to identify ways of
encouraging helmet use by children. One of the variables measured
was the children’s perception of the risk. A 4-point scale was used,
with scores ranging from 1 (no risk) to 4 (very high risk). A sample
of 797 children in grades 4 to 6 yielded the following results
on the perception of risk variable: . Estimate a
90% confidence interval for the average perception of risk for all
children.
Solution:
The population variance is unknown and the sample is large, therefore
we use the Normal probability distribution.
Confidence interval of the mean:
6.3. Confidence interval of the mean, unknown population variance,
small sample.
In order to examine the level of hearing damage, absolute sound
pressure level (SPL) was measured on eight patients. The results of the
test are listed below:
73.0; 80.1; 82.8; 76.8; 73.5; 74.3; 76.0; 68.1.
Estimate a 90% confidence interval for the true mean of the SPL test.
Solution:
xi (SPL test)   (xi - X̄)²
73.0            6.631
80.1            20.476
82.8            52.201
76.8            1.501
73.5            4.306
74.3            1.626
76.0            0.181
68.1            55.876
Sum: 604.6      142.795
The population variance is unknown and the sample is small, so we use
Student’s probability distribution with n - 1 = 7 degrees of freedom:
6.4. Confidence interval of the mean, known population variance.
A chain of “quick lube” shops has a standard service for performing
oil changes and basic checkups on automobiles. The chain has a
standard that says that the average time per car for this service
should be 12.5 minutes. There is considerable variability in times,
due to differences in layout of engines, degree of time pressure
from other jobs, and many other sources. The standard deviation
for the chain has been 2.4 minutes. The manager of one shop picked
48 times at random and timed the next job after each random time.
The data were analyzed and the following results were obtained:
Estimate the confidence interval for the mean with a type I error of 5%.
Solution:
Considering that the variance of the population is known, we will use the
Normal probability distribution, regardless of sample size.
- standard error of the mean when the variance
of the population is known.
6.5. Confidence interval of the proportion.
A study was conducted on a sample consisting of 1000 people.
They were asked about belief in the existence of extraterrestrials.
637 people said they believe in the existence of extraterrestrials.
Estimate the percentage of all people that believe in the existence
of extraterrestrials with a confidence level of 95%.
Solution:
Sample proportions:
Type I error:
Confidence interval of the proportion, based on large sample:
Standard error of the sampling distribution of proportion:
- confidence interval for the percentage
- confidence interval for the percentage
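The interval for this problem can be sketched in Python. The sketch assumes the large-sample form p̂ ± z·√(p̂(1 - p̂)/n) with divisor n in the standard error (a variant with n - 1 would give almost the same limits for a sample this large); the variable names are illustrative and z is taken from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

n, believers, confidence = 1000, 637, 0.95

p_hat = believers / n                                # sample proportion
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
se = sqrt(p_hat * (1 - p_hat) / n)                   # assumption: divisor n
lower, upper = p_hat - z * se, p_hat + z * se

print(f"{100 * lower:.2f}% to {100 * upper:.2f}%")
```

Under these assumptions the interval runs from about 60.7% to about 66.7%.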
6.6. Confidence interval of the variance, large sample.
Students can receive a maximum of 100 points on the Statistics exam.
The professor is interested in understanding the variability of their
test results. For 100 randomly chosen students the variance of test
results was 27. Using a type I error of 8%, estimate the confidence
interval of the variance for all students.
Solution:
S² = 27, n = 100 (large sample) - in order to estimate the confidence
interval for the variance of the population, based on a large sample, we
will use the Normal probability distribution.
α = 0.08
Confidence interval of the variance, based on large sample:
6.7. Confidence interval of the variance, small sample.
An association of farmers wants to examine the stability of the milk
quantity obtained from a certain type of cow. For this purpose, a sample of
20 cows was selected and the variance of the milk quantity they produced
during the day was 2.2. Estimate a 90% confidence interval for the
variance of milk quantity.
Solution:
- small sample probability distribution with

degrees of freedom.
- confidence interval of the variance in the case of
small sample.
6.8. One-tailed test of the population mean (lower boundary), unknown
population variance, small sample.
A new drug was tested on a sample of 12 patients. The results
obtained on the sample showed that the symptoms begin to
disappear on average after 16 hours, with a standard deviation of 0.8
hours. The pharmaceutical manufacturer claims that the drug works
in less than 14 hours. Test the manufacturer’s claim with a
confidence level of 95%.
Solution:
Hypothesis about the population mean, unknown population variance and
small sample: Student probability distribution with (n-1) degrees of
freedom.
1.
2.
3.
4. The experiment does not provide sufficient evidence to reject
H0 at the 95% confidence level.
6.9. One-tailed test of the population mean (upper boundary), unknown
population variance, large sample.
A manufacturer of bottled water wants to examine the correctness of
filling 1-liter bottles. The assumption is that the bottles contain
at least 1 liter of water, on average. In a sample of 100 filled
bottles the average content was 1.05 liters with a variance of 0.04. Test
the validity of the assumption with a confidence level of 94%.
Solution:
The variance of the population is unknown and the sample is large, so
we’ll use the Normal probability distribution.
1.
2.
3.
4. We reject H0 and conclude that the manufacturer's assumption is true.
6.10. Two-tailed test of the mean, known population variance.
A factory produces chips packaged in 50-gram bags, with a
weight standard deviation of 2 grams. The quality control
supervisor wants to check the weight of the packaged chips bags. In
a sample of 25 bags of chips, the average weight was 48.9 grams.
Test the hypothesis that the weight of the produced chips bags is 50
grams with a type I error of 6%.
Solution:
Known standard deviation (and variance) of the population: Normal
probability distribution, regardless of sample size.
1.
2.
3.
4. We reject the null hypothesis that the average weight
obtained in the sample does not statistically differ from the
assumption for the entire population.
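A short Python sketch of the two tailed test in 6.10, using only the figures given in the problem. The variable names are illustrative, and the critical value is computed from the standard Normal distribution instead of being read from a table.

```python
import math
from statistics import NormalDist

# Problem 6.10: H0: mu = 50, H1: mu != 50, known sigma
mu0, sigma, n, sample_mean, alpha = 50, 2, 25, 48.9, 0.06

# Empirical z value with the known population standard deviation
z_emp = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Two tailed critical value: alpha / 2 in each tail
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z_emp) > z_crit
print(round(z_emp, 2), round(z_crit, 2), reject_h0)
```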
6.11. The manufacturer claims that a new cosmetic treatment is effective
in more than 70% of cases. 120 women were using the treatment,
and 93 of them reported that the treatment is effective. Test the
manufacturer’s claim at the 5% level of significance.
[One tailed test of proportion (upper boundary).]
Solution:
Hypothesis of proportion, large sample, so we use the Normal probability distribution.
1.
2.
3.
4. We reject H0 and conclude that the treatment is effective in
more than 70% of cases, with type I error of 5%.
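The proportion test in 6.11 can be sketched as follows. The hypotheses are assumed to be H0: p ≤ 0.70 against H1: p > 0.70, and the standard error uses the hypothesised proportion, which is the usual large-sample form; variable names are illustrative.

```python
import math
from statistics import NormalDist

# Problem 6.11: 93 of 120 women report that the treatment is effective
p0, n, successes, alpha = 0.70, 120, 93, 0.05

p_hat = successes / n

# Standard error under H0 uses the hypothesised proportion p0
z_emp = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Upper-tail critical value of the standard Normal distribution
z_crit = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z_emp > z_crit
print(round(p_hat, 3), round(z_emp, 2), round(z_crit, 3), reject_h0)
```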
6.12. In a sample of 25 elements we calculated the variance of 4.4.
With the type I error of 5%, test the statement that the variance of
the population is equal to 4.5.
[Two tailed test of variance, small sample.]
Solution:
Hypothesis of population variance, small sample, so we use the
Chi-square probability distribution with (n-1) degrees of freedom.
1.
2.
3.
4. There is not enough evidence to reject the assumption

that the population variance is equal to 4.5.
6.13. The variance of professional skill test results is 12. It is assumed
that, after the special training of personnel, the variance of test
results will decrease. 40 employees completed the training and
the variance of professional skill test results was 10.5. Test the
assumption at the 5% level of significance.
[One tailed test of variance (lower boundary), large sample.]
Solution:
Hypothesis of population variance, large sample, so we use the Normal
distribution approximation of the theoretical Chi-square value.
1.
2.
3.
4. We cannot reject H0.
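One way to carry out the large-sample test in 6.13 is sketched below. Two points are assumptions, not taken from the text: the empirical value is computed as n·S²/σ0² (S² with divisor n), and the theoretical Chi-square value is approximated by χ² ≈ ½(z + √(2·df − 1))², a common Normal approximation for many degrees of freedom.

```python
import math
from statistics import NormalDist

# Problem 6.13: assumed H0: variance >= 12, H1: variance < 12
n, s2, sigma0_sq, alpha = 40, 10.5, 12, 0.05
df = n - 1

# Empirical value; the form n * S^2 / sigma0^2 is an assumed convention
chi2_emp = n * s2 / sigma0_sq

# Lower-tail critical value from the Normal approximation of Chi-square
z_alpha = NormalDist().inv_cdf(alpha)
chi2_crit = 0.5 * (z_alpha + math.sqrt(2 * df - 1)) ** 2

# Lower boundary test: reject H0 only for small empirical values
reject_h0 = chi2_emp < chi2_crit
print(round(chi2_emp, 2), round(chi2_crit, 2), reject_h0)
```

Using (n-1)·S²/σ0² instead would lower the empirical value slightly, but it would still lie above the approximate critical value.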
6.14. Statistics lectures are carried out by teaching methods A and B.
From a sample of 10 students that followed teaching method A,
we got the following test results:
Points gained: 67 51 89 72 80 55 74 92 58 72
On the basis of a random sample consisting of test results for 10 students
that followed teaching method B, it is calculated: the mean of 84 and
the variance of 140.
a) Can we accept the assumption that, on average, there is no statistically
significant difference between the efficiency of methods A and B, with
type I error of 5%?
b) With the type I error of 1%, test the statement that students who
attend Statistics lectures by method B gain a minimum of 80 points on
the exam.
[Test of difference between two population means, small samples.]
Solution:
Let’s calculate the mean and the variance from the given sample:
x1j     (x1j - x̄1)²
67        16
51       400
89       324
72         1
80        81
55       256
74         9
92       441
58       169
72         1
Σ 710   1698
a) 1.
2. Student’s probability distribution!
3.
4. We reject H0 and cannot accept given assumption.
b) 1.
2.
3.
4.
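Part a) of 6.14 can be reproduced with a short Python sketch. It assumes that both variances are computed with divisor n and pooled as (n1·S1² + n2·S2²)/(n1 + n2 − 2); the critical value 2.101 is the two tailed 5% point of Student's distribution with 18 degrees of freedom, taken from a standard table.

```python
import math

# Method A: raw points from the sample
points_a = [67, 51, 89, 72, 80, 55, 74, 92, 58, 72]
n1 = len(points_a)
mean_a = sum(points_a) / n1
ss_a = sum((x - mean_a) ** 2 for x in points_a)  # sum of squared deviations
var_a = ss_a / n1                                # divisor n (assumed convention)

# Method B: summary figures given in the problem
n2, mean_b, var_b = 10, 84, 140

# Pooled variance and empirical t value
pooled = (n1 * var_a + n2 * var_b) / (n1 + n2 - 2)
t_emp = (mean_a - mean_b) / math.sqrt(pooled * (1 / n1 + 1 / n2))

t_crit = 2.101  # table value, 18 degrees of freedom, two tailed 5%
reject_h0 = abs(t_emp) > t_crit
print(mean_a, ss_a, round(t_emp, 3), reject_h0)
```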
6.15. A building company wants to examine whether there is a statistically
significant difference in current account payments for two
departments. The resulting data were:

Department   Number of covered accounts   Average payment in the sample   Standard deviation in the sample
I            120                          1000$                           150$
II           100                          900$                            120$

What did they conclude with type I error of 5%?
[Test of difference between two population means, large samples.]
Solution:
Department   Number of covered accounts   Average payment in the sample   Standard deviation in the sample
I            n1 = 120                     x̄1 = 1000$                      S1 = 150$
II           n2 = 100                     x̄2 = 900$                       S2 = 120$
1.
2. Normal probability distribution
3.
4. We reject H0, therefore it can be concluded that the
difference between the current account payments in the two covered
departments is statistically significant, with type I error of 5%.
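A sketch of the large-sample comparison in 6.15. The standard error is taken as √(S1²/n1 + S2²/n2); this is an assumption, since dividing by n − 1 instead changes the result only in the second decimal place.

```python
import math
from statistics import NormalDist

# Problem 6.15: summary figures for the two departments
n1, mean1, s1 = 120, 1000, 150
n2, mean2, s2 = 100, 900, 120
alpha = 0.05

# Standard error of the difference between two sample means (large samples)
se_diff = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
z_emp = (mean1 - mean2) / se_diff

# Two tailed critical value
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z_emp) > z_crit
print(round(z_emp, 2), round(z_crit, 2), reject_h0)
```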
6.16. A factory produces pudding powder. The variable “weight of a pudding
powder bag” follows the Normal probability distribution with
standard deviation of 1.2 g. In a sample consisting of 25 bags,
the average weight of bags was 20 g. With type I error of 5%,
estimate the confidence interval for the average weight of all
produced pudding powder bags.
Answer: (19.53; 20.47)
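The interval in 6.16 follows from the known population standard deviation. A minimal sketch, with a hypothetical helper name:

```python
import math
from statistics import NormalDist

def mean_ci_known_sigma(sample_mean, sigma, n, alpha):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Problem 6.16: sigma = 1.2 g, n = 25, sample mean 20 g, alpha = 5%
low, high = mean_ci_known_sigma(20, 1.2, 25, 0.05)
print(round(low, 2), round(high, 2))
```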
6.17. A study of the impact of mergers in the automotive industry includes
a sample of 17 companies with similar location and business
conditions. On the basis of data from the sample, it is calculated
that the average difference between the development rates of companies
in a merger and outside a merger was -0.105 and the standard deviation
was 0.44. Find the confidence interval of the average difference
between the development rates of companies in a merger and outside it,
at the 98% level of confidence.
Answer: (-0.38332; 0.17332)
6.18. There are 10000 employees in a factory. In order to estimate the
average age, a sample of 500 employees is chosen and an average age
of 37.19 years is calculated, with standard deviation of 8.85 years.
At the 95% and 99% levels of confidence, estimate the confidence
interval of the average age of all employees in the factory.
Answer: (36.41; 37.97) for α=5% and (36.17; 38.21) for α=1%
6.19. A sample containing 17 units is taken out of a normally distributed
population. The sample mean and sample variance were x̄ = 4
and S² = 5.76. With 95% level of confidence, find the confidence
interval of the population mean.
Answer: (2.73; 5.27) α=5%
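The answer to 6.19 is reproduced when the standard error is taken as S/√(n-1), which corresponds to a sample variance computed with divisor n. The sketch below assumes that convention; the critical value 2.120 (Student's distribution, 16 degrees of freedom, 2.5% in each tail) is a table value.

```python
import math

# Problem 6.19: n = 17, sample mean 4, sample variance 5.76
n, sample_mean, s2 = 17, 4, 5.76

t_crit = 2.120  # table value, 16 degrees of freedom, two tailed 5%

# Standard error S / sqrt(n - 1): assumed small-sample convention
margin = t_crit * math.sqrt(s2) / math.sqrt(n - 1)

low, high = sample_mean - margin, sample_mean + margin
print(round(low, 2), round(high, 2))
```

With S/√n instead, the interval would be slightly narrower, about (2.77; 5.23), which does not match the stated answer.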
6.20. The sample containing 40 units is taken out of population containing

500 units. The sample mean was 25 and sample standard deviation
was 2. With first type error of 4% find the interval that population
mean belongs to.
Answer: (24.34; 25.66) α=4%
6.21. In a sample of 100 patients, 78 respond that they are satisfied with
health care they got. With type I error of 6%, find the confidence
interval for proportion of satisfied patients in population.
Answer: (0.7046; 0.8554)
6.22. In the sample with 15 car parts, the calculated variance of particular
car part width was 4. With type I error of 5%, find the confidence
interval for variance.
Answer: (2.264; 10)
6.23. In a sample with 52 elements, the calculated variance is equal
to 24.8. At the 99% level of confidence, find the confidence interval of
the variance in the population.
Answer: (16.17; 46.22)
6.24. A stationery store wants to estimate the mean retail value of

greeting cards that it has in its inventory. A random sample of 20
greeting cards indicates an average value of $1.67 and a standard
deviation of $0.32.
Set up a 99% confidence interval estimate of the mean value of all

greeting cards in the store’s inventory.
Answer: n < 30 and the population standard deviation σ is unknown (we
know only the sample standard deviation S), so we use the t distribution:
(1.475, 1.865)
6.25. In NY state, savings banks are permitted to sell a form of life

insurance called Saving Bank Life Insurance. The approval
process consists of underwriting, which includes a review of the
application, a medical information bureau check, possible request
for additional medical information and medical exams, and a policy
compilation stage where the policy pages are generated and sent
to the bank for delivery. The ability to deliver approved policies to
customers in a timely manner is critical to the profitability of this
service to the bank. During a period of 1 month, a random sample
of 27 approved policies was selected and the total processing time
in days was recorded with the following results:
73, 19, 16, 64, 28, 28, 31, 90, 60, 56, 31, 56, 22, 18, 45, 48, 17, 17,
17, 91, 92, 63, 50, 51, 69, 16, 17.
a) Set up a 95% confidence interval estimate of the mean

processing time.
b) At the 0.01% level of significance, is there evidence that the
mean processing time has changed from 45 days?
Answer: a) (34.075, 53.075)
6.26. A survey of working women in North America was conducted. Of
the 1000 women surveyed, 55% believed that companies should
hold positions for those on maternity leave for six months or less
and 45% felt that they should hold positions for more than six
months.
Construct a 90% confidence interval for the proportion of all
working women in North America who believe that companies should
hold positions for those on maternity leave for more than six months.
Answer: (0.424, 0.476)
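The interval in 6.26 can be verified with the usual large-sample formula for a proportion. A brief sketch with illustrative variable names:

```python
import math
from statistics import NormalDist

# Problem 6.26: 45% of 1000 surveyed women, 90% confidence
p_hat, n, confidence = 0.45, 1000, 0.90

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)

low, high = p_hat - margin, p_hat + margin
print(round(low, 3), round(high, 3))
```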
6.27. An advertising agency that serves a major radio station wants

to estimate the mean amount of time that the station’s audience
spends listening to the radio on a daily basis. From past studies,
the standard deviation is estimated as 45 minutes.
a) What sample size is needed if the agency wants to be 90%

confident of being correct to within ±5 minutes?
b) If 99% confidence is desired, what sample size is needed?
Answer: a) 220 b) 538
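Both answers to 6.27 come from n = (z·σ/e)², rounded up to the next whole unit. A small sketch; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, error, confidence):
    """Smallest n for which the margin of error of the mean is at most `error`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / error) ** 2)

# Problem 6.27: sigma = 45 minutes, desired precision of 5 minutes
print(sample_size_for_mean(45, 5, 0.90))
print(sample_size_for_mean(45, 5, 0.99))
```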
6.28. What proportion of people living in the USA use the Internet when
planning their vacation? According to a poll, 35% of them use
Internet. If you were to conduct a study that would provide 95%
confidence that the point estimate is correct to within ±0.04 of the
population proportion, how large a sample size would be required?
Answer: n=547
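For a proportion, the corresponding formula is n = z²·p(1 − p)/e², again rounded up. The sketch below uses the poll figure of 35% as the planning value; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p, error, confidence):
    """Smallest n for which the margin of error of the proportion is at most `error`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)

# Problem 6.28: p = 0.35, precision of 0.04, 95% confidence
print(sample_size_for_proportion(0.35, 0.04, 0.95))
```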
6.29. The director of manufacturing at a clothing factory needs to

determine whether a new machine is producing a particular type
of cloth according to the manufacturer’s specifications, which
indicates that the cloth should have a mean breaking strength of
at least 70 pounds and a standard deviation of 3.5 pounds. The
director is concerned that if the mean breaking strength is actually
less than 70 pounds, the company will face too many lawsuits. A
sample of 49 pieces of cloth reveals a sample mean of 69.1 pounds.
At the 0.05 level of significance, using the critical value approach

to hypothesis testing, is there evidence that the mean breaking
strength is less than 70 pounds?
Answer: ze = -1.80, zt = -1.65; ze < zt, so we accept H1. Yes.
6.30. One of the major measures of the quality of service provided

by any organization is the speed with which it responds to
customer complaints. A large family-held department store
selling furniture and flooring including carpeting had undergone
a major expansion in the past several years. In particular, the
flooring department had expanded from two installation crews
to an installation supervisor, a measurer, and 15 installation
crews. During a recent year there were 50 complaints concerning
carpeting installation. The following data represent the number of
days between the receipt of the complaint and the resolution of the
complaint:
54, 5, 35, 137, 31, 27, 152, 2, 123, 81, 74, 27, 11, 19, 126, 110, 110,
29, 61, 35, 94, 31, 26, 5, 12, 4, 165, 32, 29, 28, 29, 26, 25, 1, 14, 13,
13, 10, 5, 27, 4, 52, 30, 22, 36, 26, 20, 23, 33, 68
The installation supervisor claims that the mean number of days

between the receipt of the complaint and the resolution of the
complaint is 20 days or less. At the 0.05 level of significance, is
there evidence that the claim is not true?
Answer: ze = 3.925, zt = 1.65; ze > zt, so we accept H1. Yes.
6.31. The operations manager at a lightbulb factory wants to determine
whether there is any difference in the average life expectancy
of bulbs manufactured on two types of machines. The process
standard deviation of machine I is 110 hours and of machine II
is 125 hours. A random sample of 25 lightbulbs obtained from
machine I indicates a sample mean of 375 hours and a similar
sample of 25 lightbulbs from machine II indicates a sample mean
of 362 hours.
Using the 0.05 level of significance, is there any evidence of a

difference in the average life of bulbs produced by the two types
of machines?
Answer: ze = 0.39,
6.32. A survey of holidaymakers found that on average women spent
3 hours per day sunbathing and men spent 2 hours. The sample sizes
were 36 in each case and the standard deviations were 1.1 hours
and 1.2 hours respectively. Is there a true difference between men
and women in sunbathing habits? Use the 99% confidence level.
Answer: Confidence interval for difference is [0.30, 1.70]. This

suggests that women spend more time sunbathing than men.
Please note that 0 is not included in the confidence interval.
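The interval in 6.32 can be checked as follows. The sketch treats both samples as large and uses the standard Normal distribution; variable names are illustrative.

```python
import math
from statistics import NormalDist

# Problem 6.32: women vs. men, 36 respondents in each sample
mean_w, s_w, n_w = 3, 1.1, 36
mean_m, s_m, n_m = 2, 1.2, 36
confidence = 0.99

# Standard error of the difference between the two sample means
se_diff = math.sqrt(s_w ** 2 / n_w + s_m ** 2 / n_m)

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
diff = mean_w - mean_m
low, high = diff - z * se_diff, diff + z * se_diff
print(round(low, 2), round(high, 2))
```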
6.33. A large chain of supermarkets sells 5,000 packets of cereal in each

of its stores each month. It decides to test-market a different brand
of cereal in 15 of its stores. After a month the 15 stores have sold
an average of 5,200 packets each, with a standard deviation of 500
packets. Should all supermarkets switch to selling the new brand?
Use the 99% confidence level.
Answer: te = 1.55 < tt = 2.62. The null hypothesis is not rejected. Therefore,
supermarkets should not switch to selling the new brand.
6.34. The output of a group of 11 workers before and after an improvement

in the lighting in their factory is as follows:
Before 52 60 58 58 53 51 52 59 60 53 55
After 56 62 63 50 55 56 55 59 61 58 56
Test whether there is a significant improvement in performance.

Use the 95% confidence level.
Answer: Accept the null hypothesis at 5% as te = -1.233 is included

in the tt interval which includes values from -2.1 to 2.1. We
can conclude that there is not a significant improvement in
performance.
6.35. With business moving at lightning speed, marketing managers

often struggle with demands that they reduce the time it takes to
craft and launch (cycle time) a marketing campaign. A survey of
175 US and UK marketing managers revealed that a marketing
campaign has an average cycle time of 2.5 months and slightly
more than 16% have cycle times less than one month. The study
results suggest that longer is not necessarily better. The marketing
managers indicated that a long development time might miss
the mark because the data become outdated. On the other hand,
they indicated that development time of less than one month can
also impair the campaign’s effectiveness. Suppose that a cross-
classification of the most recent marketing campaigns by cycle
time and effectiveness resulted in the following cross-classification
table:
                                 CYCLE TIME
EFFECTIVENESS       <1 month   1-2 months   2-4 months   >4 months   Total
Highly effective       15          28           24            6        73
Effective               9          26           33           19        87
Ineffective             5           2            3            5        15
Total                  29          56           60           30       175
At the 0.01 level of significance, is there evidence of significant

relationship between the length of cycle time and the effectiveness of a
marketing campaign?
Answer:
There is no evidence of a significant relationship between the length of

cycle time and the effectiveness of a marketing campaign.
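The conclusion in 6.35 can be checked with the Chi-square test of independence. In the sketch below the expected frequencies are computed from the row and column totals; the critical value 16.812 (6 degrees of freedom, α = 0.01) is taken from a standard Chi-square table rather than computed.

```python
# Observed frequencies: rows = effectiveness, columns = cycle time
observed = [
    [15, 28, 24, 6],   # Highly effective
    [9, 26, 33, 19],   # Effective
    [5, 2, 3, 5],      # Ineffective
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells
chi2_emp = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2_emp += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
chi2_crit = 16.812  # table value for df = 6, alpha = 0.01

reject_h0 = chi2_emp > chi2_crit
print(round(chi2_emp, 2), df, reject_h0)
```

The empirical value lies just below the critical value, so the decision is sensitive to the chosen level of significance. Several expected frequencies in the "Ineffective" row are below 5, which weakens the Chi-square approximation.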
6.36. Juice producer wants to check whether average juice content in

the bottles is less than the content labeled on the bottles (75 cl). 10
bottles are randomly chosen and after measuring of their content,
following results are obtained:
73.2 72.6 74.5 75.0 73.7

74.1 75.1 74.8 74.0 75.0
If we assume that obtained distribution is Normal, can the producer

conclude that average content of the bottle is less than 75 cl with type I
error of 1%?
Answer: reject H0 and accept the assumption.
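A sketch of the test in 6.36, computed from the ten measurements. `statistics.stdev` uses divisor n − 1 together with √n in the standard error, which gives the same value as a divisor-n standard deviation with √(n-1). The critical value -2.821 (Student's distribution, 9 degrees of freedom, lower tail 1%) is a table value.

```python
import math
from statistics import mean, stdev

# Problem 6.36: measured content of 10 bottles, in cl
contents = [73.2, 72.6, 74.5, 75.0, 73.7, 74.1, 75.1, 74.8, 74.0, 75.0]
n = len(contents)

# H0: mu >= 75, H1: mu < 75
t_emp = (mean(contents) - 75) / (stdev(contents) / math.sqrt(n))

t_crit = -2.821  # table value, 9 degrees of freedom, lower tail 1%
reject_h0 = t_emp < t_crit
print(round(mean(contents), 2), round(t_emp, 2), reject_h0)
```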
6.37. Producer claims that the average strength of produced machine

is not less than 8 000 lb. In the sample of 6 machines, average
strength was 7 500 lb with standard deviation of 200 lb. Is the
producer’s claim true with type I error of 8%?
Answer: Cannot reject H0: the average strength is less than 8000 lb.
6.38. The variance of damaged products produced in the production process
is 25. A new production process is introduced and an increase of
variance is predicted. In the new production process, 55 products
have been produced and we calculate a variance of 28. With type I
error of 5%, test whether the prediction is justified.
Answer: Cannot reject H0.
6.39. It is assumed that the population variance is 3. A sample of 15
elements is taken out of the population and its variance is 3.2. Test
the assumption at the confidence level of 0.98.
Answer: We can accept the assumption.
6.40. In a sample of 200 products produced on one machine, 24
products were malfunctioning. Test the hypothesis that the machine gives
at most 10% malfunctioning products, with confidence of 91%.
Answer: Cannot reject H0.
6.41. A medicine producer claims that use of his medicine for 8 months
will cure allergy with 90% probability. In a sample consisting of
200 persons who have been taking that medicine for at least 8
months, 40 of them have not been cured. Test the producer’s claim
with type I error of 2%.
Answer: Reject the producer’s claim (H0).
6.42. In order to discover whether physical activity of the mother during
pregnancy affects the newborn’s weight, we took two samples. The first
sample, of 50 newborns whose mothers were housewives, has an
average weight of 3.725 kg. The second sample, of 40 newborns whose
mothers were employed, has an average weight of 3.61 kg. The variance
for both samples is the same and is equal to 1.2. Could we conclude
that the mother’s activity during pregnancy affects the newborn’s weight
(α = 5%)?
Answer: Cannot reject H0. Mother’s activity during pregnancy doesn’t
affect newborn’s weight.
6.43. The M procedure was designed to evaluate attitudes toward women in
managerial positions. A high score indicates a negative attitude and
vice versa. We took a sample of 108 female students and obtained an
average value of 71.5 with standard deviation of 12.2. In a sample of
151 male students, we calculated an average value of 85 with standard
deviation of 19.3. Test the hypothesis that both populations have the
same attitude, with type I error of 2%.
Answer: Reject H0. Men and women have different attitudes toward
women in managerial positions.
558
REFERENCES
REFERENCES
1. Beals, R. E., (1972). Statistics for Economists. Chicago: Rand

MçNally & Company.
2. Blažić, M., (1982). Opšta Statistika - osnovi i analiza. Beograd:
Savremena administracija.
3. Berenson, M., L., Levine, D., M., Krehbiel, T., C., (20029. Basic
Business Statistics. New Jersey: Prentice Hall.
4. Bernstein, S., Bernstein, R., (1999). ELEMENTS OF STATISTICS
I: Descriptive Statistics and Probability. New York: McGraw – Hill.
5. Cohen, J., (1988). Statistical power analysis for the behavioral
sciences. Lawrence Erlbaum Associates.
6. Curwin, J. and Slater, R., (2008). Quantitative Methods for Business
Decisions. Birmingham: Thomson Learning.
7. Darlington, R. B., (1990). Regression and linear models. New York:
McGraw-Hill.
8. DeLurgio, S. A., (1998). Forecasting Principles and Applications.
Boston: Irwin McGraw-Hill.
9. Dixon, W. J., & Massey, F. J., (1983). Introduction to statistical analysis.
New York: McGraw-Hill.
10. Elazar, S., (1972). Matematička statistika. Sarajevo: Zavod za izdavanje
udžbenika.
11. Evans, M., Hastings, N., & Peacock, B., (1993). Statistical
Distributions. New York: Wiley.
12. Foster, J. J., Barkus, E., Yavorsky, C., (2006). Understanding and
using Advanced Statistics. London: SAGE Publications.
13. Freund, J. E.,Williams, F. J., Perles, B. M., (1993). Elementary
14. Gujerati, D., (2004). Basic Econometrics. New York: McGraw-Hill
Companies Inc.
15. Harnett, D. L., Murphy, J. L., (1985). Statistical Analysis for Business
and Economics. Massachusetts: Addison-Wesley Publishing
Company.
16. Hildebrand, D. K., Lyman Ott, R., (1996). Basic Statistical Ideas for
Managers. Belmont: Duxbury Press.
17. Ivanović, B., (1979). Teorijska statistika. Beograd: Naučna knjiga.
18. Illowsky, B., Dean, S., (2008). Collaborative Statistics, Texas: Rice
University.
561
REFERENCES
19. Jovetić, S., (2006). Statistika sa aplikacijom u Excelu. Kragujevac:

Inter Print.
20. Kemp, M. S., Kemp, S., (2004). Business Statistics Demystified,
New York: McGraw-Hill.
21. Komić, J., Lovrić, M., Stević, S., (2006). Statistička analiza. Banja
Luka: Ekonomski fakultet.
22. Komić, J., (2000). Metodi statističke analize kroz primjere – zbirka
zadataka. Banja Luka: Ekonomski fakultet.
23. Kuč, H., (1999). Statističke funkcije u Excel-u kroz primjere. Zenica:
Chip studio.
24. Levine, D.M. and others, (2005). Statistics for Managers Using
Microsoft Excel. New Jersey: Prentice Hall.
25. Lind, D. A., Mason, R. A., (1997). Basic Statistics for Business and
Economics. Boston: Irwin McGraw-Hill.
26. Lozanov-Crvenković, Z., Subotić, B., (2006). Poslovna statistika.
Zaječar: Fakultet za menadžment.
27. Lučić, B., (1996). Statistika. Sarajevo: Ekonomski Fakultet.
28. Mann, S. P., (2009). Uvod u statistiku. Beograd: Ekonomski fakultet.
29. Martić, Lj., (1986). Mjere nejednakosti i siromaštva. Zagreb:
Birotehnika.
30. McClave, J. T., Benson, P. G., Sincich, T., (2001). A First Course in
31. Mladenović, P., (2005). Verovatnoća i statistika. Beograd:
Matematički fakultet.
32. Newbold, P., Carlson, L. W., Thorne, B., (2007). Statistics for
Business and Economics. New Jersey: Prentice Hall.
33. Pauše, Ž., (1993). Uvod u matematičku statistiku. Zagreb: Školska
knjiga.
34. Petrović, Lj., (2005). Zbirka rešenih zadataka iz teorije uzoraka i
planiranja eksperimenata. Beograd: Ekonomski fakultet.
35. Poirier, D. J., (1995). Intermediate Statistics and Econometrics - A
Comparative Approach, Massachusetts: The MIT Press.
36. Resić, E., (2006). Zbirka zadataka iz statistike. Sarajevo: Ekonomski
fakultet.
37. Serdar, V., Šošić, I., (1994). Uvod u statistiku. Zagreb: Školska
knjiga.
38. Somun-Kapetanović, R., (2008). Statistika u ekonomiji i
menadžmentu. Sarajevo: Ekonomski fakultet u Sarajevu.
562
39. Spiegel, M. R., (1961). Schaum‘s Outline of Theory and Problems of

Statistics. New York: Shaum Publishing Company.
40. Šilj M., (1998). Uvod u modernu poslovnu statistiku. Varaždinske
Toplice: Nakladnička kuća Tonimir.
41. Šošić, I., (2004). Primijenjena statistika. Zagreb: Školska knjiga
42. Šošić, I., (1989). Zbirka zadataka iz osnova statistike. Zagreb:
Informator.
43. Vukadinović, S., (1981), Elementi teorije verovatnoće i matematičke
statistike. Beograd: Privredni pregled.
44. Vukadinović, S., (1983). Zbirka rešenih zadataka iz teorije
verovatnoće. Beograd: Privredni pregled.
45. Vuković N., (2003). PC Statistika i verovatnoća. Beograd: Fakultet
organizacionih nauka.
46. Waxman, P., (1993). Business Mathematics and Statistics. Prentice
Hall.
47. Webster, A. L., (1998). Applied Statistics for Business and
Economics: An Essentials Version. Boston: Irwin McGraw – Hill.
48. Wonnacott, T. H., Wonnacott, R. J., (1984). Student Workbook for
Intraductory Statistics for Business and Economics. New York:
John Wiley & Sons.
49. Žižić, M., Lovrić, M., Pavličić, D., (2001). Metodi statističke analize.
Beorgrad: Ekonomski fakultet.
50. Žužul, J., Branica, M., (1998). Statistika. Zagreb: Informator.
http://www.angelfire.com
http://bma.ac.in:8080/dspace
http://courses.wcupa.edu
http://www.bhas.ba
http://www.doingbusiness.org
http://www.mnstate.edu
http://www.statcan.ca
http://www.zoology.ubc.ca
http://www.socialresearchmethods.net
http://www.une.edu.au
http://www.statsoft.com
http://www.archive.org
http://home.ubalt.edu
563
REFERENCES
http://sofia.fhda.edu
http://www.learner.org
http://www.vias.org
http://www.frc.mass.edu
http://bcs.whfreeman.com
http://www.xycoon.com
http://www.wessa.net
http://www.le.ac.uk
564
STATISTICAL TABLES
BINOMIAL DISTRIBUTION - Probability distribution

n=5
p 0.05 0.06 0.07 0.08 0.09 0.1 0.2 0.3 0.4 0.5
x 0 773781 733904 695688 659082 624032 590490 327680 168070 077760 031250
1 203627 234225 261818 286557 308587 328050 409600 360150 259200 156250
2 021434 029901 039413 049836 061039 072900 204800 308700 345600 312500
3 001128 001909 002967 004334 006037 008100 051200 132300 230400 312500
4 000030 000061 000112 000188 000299 000450 006400 028350 076800 156250
5 000000 000001 000002 000003 000006 000010 000320 002430 010240 031250
n=10
p 0.05 0.06 0.07 0.08 0.09 0.1 0.2 0.3 0.4 0.5
x 0 598737 538615 483982 434388 389416 348678 107374 028248 006047 000977
1 315125 343797 364288 377729 385137 387420 268435 121061 040311 009766
2 074635 098750 123388 147807 171407 193710 301990 233474 120932 043945
3 010475 016809 024766 034274 045206 057396 201327 266828 214991 117188
4 000965 001878 003262 005216 007824 011160 088080 200121 250823 205078
5 000061 000144 000295 000544 000929 001488 026424 102919 200658 246094
6 000003 000008 000018 000039 000077 000138 005505 036757 111477 205078
7 000000 000000 000001 000002 000004 000009 000786 009002 042467 117188
8 000000 000000 000000 000000 000000 000000 000074 001447 010617 043945
9 000000 000000 000000 000000 000000 000000 000004 000138 001573 009766
10 000000 000000 000000 000000 000000 000000 000000 000006 000105 000977
n=15
p 0.05 0.06 0.07 0.08 0.09 0.1 0.2 0.3 0.4 0.5
x 0 463291 395292 336701 286297 243008 205891 035184 004748 000470 000031
1 365756 378471 380146 373431 360507 343152 131941 030520 004702 000458
2 134752 169104 200292 227306 249582 266896 230897 091560 021942 003204
3 030733 046773 065328 085652 106964 128505 250139 170040 063388 013885
4 004853 008957 014752 022344 031736 042835 187604 218623 126776 041656
5 000562 001258 002443 004274 006905 010471 103182 206130 185938 091644
6 000049 000134 000306 000619 001138 001939 042993 147236 206598 152740
7 000003 000011 000030 000069 000145 000277 013819 081130 177084 196381
8 000000 000001 000002 000006 000014 000031 003455 034770 118056 196381
9 000000 000000 000000 000000 000001 000003 000672 011590 061214 152740
10 000000 000000 000000 000000 000000 000000 000101 002980 024486 091644
11 000000 000000 000000 000000 000000 000000 000011 000581 007420 041656
12 000000 000000 000000 000000 000000 000000 000001 000083 001649 013885
13 000000 000000 000000 000000 000000 000000 000000 000008 000254 003204
14 000000 000000 000000 000000 000000 000000 000000 000001 000024 000458
15 000000 000000 000000 000000 000000 000000 000000 000000 000001 000031
n=20
p 0.05 0.06 0.07 0.08 0.09 0.1 0.2 0.3 0.4 0.5
x 0 358486 290106 234239 188693 151645 121577 011529 000798 000037 000001
1 377354 370348 352618 328162 299957 270170 057646 006839 000487 000019
2 188677 224573 252141 271091 281828 285180 136909 027846 003087 000181
3 059582 086007 113870 141439 167238 190120 205364 071604 012350 001087
4 013328 023332 036426 052271 070295 089779 218199 130421 034991 004621
5 002245 004766 008774 014545 022247 031921 174560 178863 074647 014786
6 000295 000760 001651 003162 005501 008867 109100 191639 124412 036964
7 000031 000097 000249 000550 001088 001970 054550 164262 165882 073929
8 000003 000010 000030 000078 000175 000356 022161 114397 179706 120134
9 000000 000001 000003 000009 000023 000053 007387 065370 159738 160179
10 000000 000000 000000 000001 000003 000006 002031 030817 117142 176197
11 000000 000000 000000 000000 000000 000001 000462 012007 070995 160179
12 000000 000000 000000 000000 000000 000000 000087 003859 035497 120134
13 000000 000000 000000 000000 000000 000000 000013 001018 014563 073929
14 000000 000000 000000 000000 000000 000000 000002 000218 004854 036964
15 000000 000000 000000 000000 000000 000000 000000 000037 001294 014786
16 000000 000000 000000 000000 000000 000000 000000 000005 000270 004621
17 000000 000000 000000 000000 000000 000000 000000 000001 000042 001087
18 000000 000000 000000 000000 000000 000000 000000 000000 000005 000181
19 000000 000000 000000 000000 000000 000000 000000 000000 000000 000019
20 000000 000000 000000 000000 000000 000000 000000 000000 000000 000001
n=30
p 0.05 0.06 0.07 0.08 0.09 0.1 0.2 0.3 0.4 0.5
x 0 214639 156256 113367 081966 059053 042391 001238 000023 000000 000000
1 338903 299213 255991 213825 175212 141304 009285 000290 000004 000000
2 258637 276931 279388 269605 251266 227656 033656 001801 000043 000000
3 127050 164980 196273 218810 231938 236088 078532 007203 000266 000004
4 045136 071082 099719 128432 154837 177066 132522 020838 001197 000026
5 012353 023593 039030 058074 079631 102305 172279 046440 004149 000133
6 002709 006275 012241 021041 032815 047363 179457 082928 011524 000553
7 000489 001373 003159 006273 011127 018043 153821 121854 026341 001896
8 000074 000252 000684 001568 003164 005764 110559 150141 050487 005451
9 000010 000039 000126 000333 000765 001565 067564 157291 082275 013325
10 000001 000005 000020 000061 000159 000365 035471 141562 115185 027982
11 000000 000001 000003 000010 000029 000074 016123 110308 139619 050876
12 000000 000000 000000 000001 000004 000013 006382 074852 147375 080553
13 000000 000000 000000 000000 000001 000002 002209 044418 136039 111535
14 000000 000000 000000 000000 000000 000000 000671 023115 110127 135435
15 000000 000000 000000 000000 000000 000000 000179 010567 078312 144464
16 000000 000000 000000 000000 000000 000000 000042 004246 048945 135435
17 000000 000000 000000 000000 000000 000000 000009 001498 026872 111535
18 000000 000000 000000 000000 000000 000000 000002 000464 012938 080553
19 000000 000000 000000 000000 000000 000000 000000 000126 005448 050876
20 000000 000000 000000 000000 000000 000000 000000 000030 001997 027982
21 000000 000000 000000 000000 000000 000000 000000 000006 000634 013325
22 000000 000000 000000 000000 000000 000000 000000 000001 000173 005451
23 000000 000000 000000 000000 000000 000000 000000 000000 000040 001896
24 000000 000000 000000 000000 000000 000000 000000 000000 000008 000553
25 000000 000000 000000 000000 000000 000000 000000 000000 000001 000133
26 000000 000000 000000 000000 000000 000000 000000 000000 000000 000026
27 000000 000000 000000 000000 000000 000000 000000 000000 000000 000004
28 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
29 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
30 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
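The entries of the binomial tables are probabilities rounded to six decimal places and printed without the leading "0."; for example, 773781 stands for 0.773781. A minimal sketch of how any entry can be recomputed (the function name is illustrative):

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable with n trials and success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Table entry for n = 5, p = 0.05, x = 0
print(f"{binomial_pmf(0, 5, 0.05):.6f}")

# Table entry for n = 10, p = 0.3, x = 2
print(f"{binomial_pmf(2, 10, 0.3):.6f}")
```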
POISSON DISTRIBUTION - Probability distribution
    λ     0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9       1
x   0  904837  818731  740818  670320  606531  548812  496585  449329  406570  367879
    1  090484  163746  222245  268128  303265  329287  347610  359463  365913  367879
    2  004524  016375  033337  053626  075816  098786  121663  143785  164661  183940
    3  000151  001092  003334  007150  012636  019757  028388  038343  049398  061313
    4  000004  000055  000250  000715  001580  002964  004968  007669  011115  015328
    5  000000  000002  000015  000057  000158  000356  000696  001227  002001  003066
    6          000000  000001  000004  000013  000036  000081  000164  000300  000511
    7                  000000  000000  000001  000003  000008  000019  000039  000073
    8                                  000000  000000  000001  000002  000004  000009
    9                                                  000000  000000  000000  000001
   10                                                                          000000
    λ       2       3       4       5       6       7       8       9      10      11
x   0  135335  049787  018316  006738  002479  000912  000335  000123  000045  000017
    1  270671  149361  073263  033690  014873  006383  002684  001111  000454  000184
    2  270671  224042  146525  084224  044618  022341  010735  004998  002270  001010
    3  180447  224042  195367  140374  089235  052129  028626  014994  007567  003705
    4  090224  168031  195367  175467  133853  091226  057252  033737  018917  010189
    5  036089  100819  156293  175467  160623  127717  091604  060727  037833  022415
    6  012030  050409  104196  146223  160623  149003  122138  091090  063055  041095
    7  003437  021604  059540  104445  137677  149003  139587  117116  090079  064577
    8  000859  008102  029770  065278  103258  130377  139587  131756  112599  088794
    9  000191  002701  013231  036266  068838  101405  124077  131756  125110  108526
   10  000038  000810  005292  018133  041303  070983  099262  118580  125110  119378
   11  000007  000221  001925  008242  022529  045171  072190  097020  113736  119378
   12  000001  000055  000642  003434  011264  026350  048127  072765  094780  109430
   13  000000  000013  000197  001321  005199  014188  029616  050376  072908  092595
   14          000003  000056  000472  002228  007094  016924  032384  052077  072753
   15          000001  000015  000157  000891  003311  009026  019431  034718  053352
   16          000000  000004  000049  000334  001448  004513  010930  021699  036680
   17                  000001  000014  000118  000596  002124  005786  012764  023734
   18                  000000  000004  000039  000232  000944  002893  007091  014504
   19                          000001  000012  000085  000397  001370  003732  008397
   20                          000000  000004  000030  000159  000617  001866  004618
   21                                  000001  000010  000061  000264  000889  002419
   22                                  000000  000003  000022  000108  000404  001210
   23                                          000001  000008  000042  000176  000578
   24                                          000000  000003  000016  000073  000265
   25                                                  000001  000006  000029  000117
   26                                                  000000  000002  000011  000049
   27                                                          000001  000004  000020
   28                                                          000000  000001  000008
   29                                                                  000001  000003
   30                                                                  000000  000001
STANDARDIZED NORMAL DISTRIBUTION - Probability distribution
Rows: integer and first decimal place of z. Columns: second decimal place of z.
z 0 1 2 3 4 5 6 7 8 9
0.0 398942 398922 398862 398763 398623 398444 398225 397966 397668 397330
0.1 396953 396536 396080 395585 395052 394479 393868 393219 392531 391806
0.2 391043 390242 389404 388529 387617 386668 385683 384663 383606 382515
0.3 381388 380226 379031 377801 376537 375240 373911 372548 371154 369728
0.4 368270 366782 365263 363714 362135 360527 358890 357225 355533 353812
0.5 352065 350292 348493 346668 344818 342944 341046 339124 337180 335213
0.6 333225 331215 329184 327133 325062 322972 320864 318737 316593 314432
0.7 312254 310060 307851 305627 303389 301137 298872 296595 294305 292004
0.8 289692 287369 285036 282694 280344 277985 275618 273244 270864 268477
0.9 266085 263688 261286 258881 256471 254059 251644 249228 246809 244390
1.0 241971 239551 237132 234714 232297 229882 227470 225060 222653 220251
1.1 217852 215458 213069 210686 208308 205936 203571 201214 198863 196520
1.2 194186 191860 189543 187235 184937 182649 180371 178104 175847 173602
1.3 171369 169147 166937 164740 162555 160383 158225 156080 153948 151831
1.4 149727 147639 145564 143505 141460 139431 137417 135418 133435 131468
1.5 129518 127583 125665 123763 121878 120009 118157 116323 114505 112704
1.6 110921 109155 107406 105675 103961 102265 100586 098925 097282 095657
1.7 094049 092459 090887 089333 087796 086277 084776 083293 081828 080380
1.8 078950 077538 076143 074766 073407 072065 070740 069433 068144 066871
1.9 065616 064378 063157 061952 060765 059595 058441 057304 056183 055079
2.0 053991 052919 051864 050824 049800 048792 047800 046823 045861 044915
2.1 043984 043067 042166 041280 040408 039550 038707 037878 037063 036262
2.2 035475 034701 033941 033194 032460 031740 031032 030337 029655 028985
2.3 028327 027682 027048 026426 025817 025218 024631 024056 023491 022937
2.4 022395 021862 021341 020829 020328 019837 019356 018885 018423 017971
2.5 017528 017095 016670 016254 015848 015449 015060 014678 014305 013940
2.6 013583 013234 012892 012558 012232 011912 011600 011295 010997 010706
2.7 010421 010143 009871 009606 009347 009094 008846 008605 008370 008140
2.8 007915 007697 007483 007274 007071 006873 006679 006491 006307 006127
2.9 005953 005782 005616 005454 005296 005143 004993 004847 004705 004567
3.0 004432 004301 004173 004049 003928 003810 003695 003584 003475 003370
3.1 003267 003167 003070 002975 002884 002794 002707 002623 002541 002461
3.2 002384 002309 002236 002165 002096 002029 001964 001901 001840 001780
3.3 001723 001667 001612 001560 001508 001459 001411 001364 001319 001275
3.4 001232 001191 001151 001112 001075 001038 001003 000969 000936 000904
3.5 000873 000843 000814 000785 000758 000732 000706 000681 000657 000634
3.6 000612 000590 000569 000549 000529 000510 000492 000474 000457 000441
3.7 000425 000409 000394 000380 000366 000353 000340 000327 000315 000303
3.8 000292 000281 000271 000260 000251 000241 000232 000223 000215 000207
3.9 000199 000191 000184 000177 000170 000163 000157 000151 000145 000139
4.0 000134 000129 000124 000119 000114 000109 000105 000101 000097 000093
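
Any entry of the density table can be reproduced from the formula of the standard normal density. The sketch below assumes that an entry is read as the six digits after the decimal point, with the row giving z to one decimal place and the column giving the second decimal place. The name normal_pdf is illustrative.

```python
import math


def normal_pdf(z: float) -> float:
    """Density of the standard normal distribution at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Row 1.0, column 0 (z = 1.00): table entry 241971
print(f"{normal_pdf(1.00):.6f}")

# Row 1.9, column 6 (z = 1.96): table entry 058441
print(f"{normal_pdf(1.96):.6f}")
```

Because the density is symmetric about zero, the same entries apply to negative values of z.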
STATISTICAL TABLES
STANDARDIZED NORMAL DISTRIBUTION - Cumulative distribution function
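Rows: z to the integer and first decimal place. Columns: second decimal place of z.
Entries are the six digits after the decimal point (500000 reads 0.500000).
z 0 1 2 3 4 5 6 7 8 9
0.0 500000 503989 507978 511966 515953 519939 523922 527903 531881 535856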
0.1 539828 543795 547758 551717 555670 559618 563559 567495 571424 575345
0.2 579260 583166 587064 590954 594835 598706 602568 606420 610261 614092
0.3 617911 621720 625516 629300 633072 636831 640576 644309 648027 651732
0.4 655422 659097 662757 666402 670031 673645 677242 680822 684386 687933
0.5 691462 694974 698468 701944 705401 708840 712260 715661 719043 722405
0.6 725747 729069 732371 735653 738914 742154 745373 748571 751748 754903
0.7 758036 761148 764238 767305 770350 773373 776373 779350 782305 785236
0.8 788145 791030 793892 796731 799546 802337 805105 807850 810570 813267
0.9 815940 818589 821214 823814 826391 828944 831472 833977 836457 838913
1.0 841345 843752 846136 848495 850830 853141 855428 857690 859929 862143
1.1 864334 866500 868643 870762 872857 874928 876976 879000 881000 882977
1.2 884930 886861 888768 890651 892512 894350 896165 897958 899727 901475
1.3 903200 904902 906582 908241 909877 911492 913085 914657 916207 917736
1.4 919243 920730 922196 923641 925066 926471 927855 929219 930563 931888
1.5 933193 934478 935745 936992 938220 939429 940620 941792 942947 944083
1.6 945201 946301 947384 948449 949497 950529 951543 952540 953521 954486
1.7 955435 956367 957284 958185 959070 959941 960796 961636 962462 963273
1.8 964070 964852 965620 966375 967116 967843 968557 969258 969946 970621
1.9 971283 971933 972571 973197 973810 974412 975002 975581 976148 976705
2.0 977250 977784 978308 978822 979325 979818 980301 980774 981237 981691
2.1 982136 982571 982997 983414 983823 984222 984614 984997 985371 985738
2.2 986097 986447 986791 987126 987455 987776 988089 988396 988696 988989
2.3 989276 989556 989830 990097 990358 990613 990863 991106 991344 991576
2.4 991802 992024 992240 992451 992656 992857 993053 993244 993431 993613
2.5 993790 993963 994132 994297 994457 994614 994766 994915 995060 995201
2.6 995339 995473 995604 995731 995855 995975 996093 996207 996319 996427
2.7 996533 996636 996736 996833 996928 997020 997110 997197 997282 997365
2.8 997445 997523 997599 997673 997744 997814 997882 997948 998012 998074
2.9 998134 998193 998250 998305 998359 998411 998462 998511 998559 998605
3.0 998650 998694 998736 998777 998817 998856 998893 998930 998965 998999
3.1 999032 999065 999096 999126 999155 999184 999211 999238 999264 999289
3.2 999313 999336 999359 999381 999402 999423 999443 999462 999481 999499
3.3 999517 999534 999550 999566 999581 999596 999610 999624 999638 999651
3.4 999663 999675 999687 999698 999709 999720 999730 999740 999749 999758
3.5 999767 999776 999784 999792 999800 999807 999815 999822 999828 999835
3.6 999841 999847 999853 999858 999864 999869 999874 999879 999883 999888
3.7 999892 999896 999900 999904 999908 999912 999915 999918 999922 999925
3.8 999928 999931 999933 999936 999938 999941 999943 999946 999948 999950
3.9 999952 999954 999956 999958 999959 999961 999963 999964 999966 999967
4.0 999968 999970 999971 999972 999973 999974 999975 999976 999977 999978
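
The cumulative table lists only non-negative z. A minimal sketch of how its entries, and the values for negative z that follow by symmetry, can be computed with the error function from the Python standard library is given here. The name normal_cdf is illustrative, and entries are assumed to be read as six digits after the decimal point.

```python
import math


def normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard normal variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Row 1.9, column 6 (z = 1.96): table entry 975002
print(f"{normal_cdf(1.96):.6f}")

# Negative arguments follow from symmetry: P(Z <= -z) = 1 - P(Z <= z)
print(f"{normal_cdf(-1.96):.6f}")

# Probability of a symmetric interval, P(-1.96 < Z < 1.96)
print(f"{normal_cdf(1.96) - normal_cdf(-1.96):.6f}")
```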
Student (t) distribution - Cumulative distribution function

degrees of freedom
t 1 2 3 4 5 6 7 8 9 10
0.0 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000
0.1 0.53173 0.53527 0.53667 0.53742 0.53788 0.53820 0.53843 0.53860 0.53873 0.53884
0.2 0.56283 0.57001 0.57286 0.57438 0.57532 0.57596 0.57642 0.57676 0.57704 0.57726
0.3 0.59277 0.60376 0.60812 0.61044 0.61188 0.61285 0.61355 0.61409 0.61450 0.61484
0.4 0.62112 0.63608 0.64203 0.64520 0.64716 0.64850 0.64946 0.65019 0.65076 0.65122
0.5 0.64758 0.66667 0.67428 0.67834 0.68085 0.68256 0.68380 0.68473 0.68546 0.68605
0.6 0.67202 0.69528 0.70460 0.70958 0.71267 0.71477 0.71629 0.71744 0.71835 0.71907
0.7 0.69440 0.72180 0.73284 0.73875 0.74243 0.74493 0.74674 0.74811 0.74919 0.75006
0.8 0.71478 0.74618 0.75890 0.76574 0.76999 0.77289 0.77500 0.77659 0.77784 0.77885
0.9 0.73326 0.76845 0.78277 0.79050 0.79531 0.79860 0.80099 0.80280 0.80422 0.80536
1.0 0.75000 0.78868 0.80450 0.81305 0.81839 0.82204 0.82469 0.82670 0.82828 0.82955
1.1 0.76515 0.80698 0.82416 0.83346 0.83927 0.84325 0.84614 0.84834 0.85006 0.85145
1.2 0.77886 0.82350 0.84187 0.85182 0.85805 0.86232 0.86541 0.86777 0.86961 0.87110
1.3 0.79129 0.83838 0.85777 0.86827 0.87485 0.87935 0.88262 0.88510 0.88705 0.88862
1.4 0.80257 0.85176 0.87200 0.88295 0.88980 0.89448 0.89788 0.90046 0.90249 0.90412
1.5 0.81283 0.86380 0.88471 0.89600 0.90305 0.90786 0.91135 0.91400 0.91607 0.91775
1.6 0.82219 0.87463 0.89605 0.90758 0.91475 0.91964 0.92318 0.92587 0.92797 0.92966
1.7 0.83075 0.88438 0.90615 0.91782 0.92506 0.92998 0.93354 0.93622 0.93833 0.94002
1.8 0.83859 0.89317 0.91516 0.92688 0.93412 0.93902 0.94256 0.94522 0.94730 0.94897
1.9 0.84579 0.90109 0.92318 0.93488 0.94207 0.94692 0.95040 0.95302 0.95506 0.95669
2.0 0.85242 0.90825 0.93034 0.94194 0.94903 0.95379 0.95719 0.95974 0.96172 0.96331
2.1 0.85854 0.91473 0.93672 0.94817 0.95512 0.95976 0.96306 0.96553 0.96744 0.96896
2.2 0.86420 0.92060 0.94241 0.95367 0.96045 0.96495 0.96813 0.97050 0.97233 0.97378
2.3 0.86945 0.92593 0.94751 0.95853 0.96511 0.96945 0.97250 0.97476 0.97650 0.97787
2.4 0.87433 0.93077 0.95206 0.96282 0.96919 0.97335 0.97627 0.97841 0.98005 0.98134
2.5 0.87888 0.93519 0.95615 0.96662 0.97275 0.97674 0.97950 0.98153 0.98307 0.98428
2.6 0.88312 0.93923 0.95981 0.96998 0.97588 0.97967 0.98229 0.98419 0.98563 0.98675
2.7 0.88709 0.94292 0.96311 0.97295 0.97861 0.98221 0.98468 0.98646 0.98780 0.98884
2.8 0.89081 0.94630 0.96607 0.97559 0.98100 0.98442 0.98674 0.98840 0.98964 0.99060
2.9 0.89430 0.94941 0.96875 0.97794 0.98310 0.98633 0.98851 0.99005 0.99120 0.99208
3.0 0.89758 0.95227 0.97117 0.98003 0.98495 0.98800 0.99003 0.99146 0.99252 0.99333
3.1 0.90067 0.95490 0.97335 0.98189 0.98657 0.98944 0.99134 0.99267 0.99364 0.99437
3.2 0.90359 0.95733 0.97533 0.98355 0.98800 0.99070 0.99247 0.99369 0.99458 0.99525
3.3 0.90634 0.95958 0.97713 0.98503 0.98926 0.99180 0.99344 0.99457 0.99539 0.99599
3.4 0.90895 0.96166 0.97877 0.98636 0.99037 0.99275 0.99428 0.99532 0.99606 0.99661
3.5 0.91141 0.96359 0.98026 0.98755 0.99136 0.99359 0.99500 0.99596 0.99664 0.99714
3.6 0.91375 0.96538 0.98162 0.98862 0.99223 0.99432 0.99563 0.99651 0.99713 0.99758
3.7 0.91598 0.96705 0.98286 0.98958 0.99300 0.99496 0.99617 0.99698 0.99754 0.99795
3.8 0.91809 0.96860 0.98400 0.99045 0.99369 0.99552 0.99664 0.99738 0.99789 0.99826
3.9 0.92010 0.97005 0.98504 0.99123 0.99430 0.99601 0.99705 0.99773 0.99819 0.99852
4.0 0.92202 0.97140 0.98600 0.99193 0.99484 0.99644 0.99741 0.99803 0.99844 0.99874
4.1 0.92385 0.97267 0.98687 0.99257 0.99532 0.99682 0.99771 0.99828 0.99866 0.99893
4.2 0.92560 0.97386 0.98768 0.99315 0.99576 0.99716 0.99798 0.99850 0.99885 0.99909
4.3 0.92727 0.97497 0.98843 0.99368 0.99614 0.99745 0.99822 0.99869 0.99900 0.99922
4.4 0.92887 0.97602 0.98912 0.99415 0.99649 0.99772 0.99842 0.99886 0.99914 0.99933
4.5 0.93040 0.97700 0.98975 0.99459 0.99680 0.99795 0.99860 0.99900 0.99926 0.99943
4.6 0.93186 0.97792 0.99034 0.99498 0.99708 0.99815 0.99876 0.99912 0.99935 0.99951
4.7 0.93327 0.97879 0.99089 0.99535 0.99733 0.99834 0.99890 0.99923 0.99944 0.99958
4.8 0.93462 0.97962 0.99140 0.99568 0.99756 0.99850 0.99902 0.99932 0.99951 0.99964
4.9 0.93592 0.98039 0.99187 0.99598 0.99776 0.99864 0.99912 0.99940 0.99958 0.99969
5.0 0.93717 0.98113 0.99230 0.99625 0.99795 0.99877 0.99922 0.99947 0.99963 0.99973
5.1 0.93837 0.98182 0.99271 0.99651 0.99811 0.99889 0.99930 0.99954 0.99968 0.99977
5.2 0.93952 0.98248 0.99309 0.99674 0.99827 0.99899 0.99937 0.99959 0.99972 0.99980
5.3 0.94064 0.98310 0.99344 0.99696 0.99840 0.99909 0.99944 0.99964 0.99975 0.99983
5.4 0.94171 0.98369 0.99378 0.99715 0.99853 0.99917 0.99950 0.99968 0.99978 0.99985
5.5 0.94275 0.98425 0.99409 0.99734 0.99864 0.99924 0.99955 0.99971 0.99981 0.99987
5.6 0.94375 0.98478 0.99437 0.99750 0.99875 0.99931 0.99959 0.99974 0.99983 0.99989
5.7 0.94472 0.98529 0.99465 0.99766 0.99884 0.99937 0.99963 0.99977 0.99985 0.99990
5.8 0.94565 0.98577 0.99490 0.99780 0.99893 0.99942 0.99967 0.99980 0.99987 0.99991
5.9 0.94656 0.98623 0.99514 0.99794 0.99900 0.99947 0.99970 0.99982 0.99989 0.99992
6.0 0.94743 0.98666 0.99536 0.99806 0.99908 0.99952 0.99973 0.99984 0.99990 0.99993

degrees of freedom
t 11 12 13 14 15 16 17 18 19 20
0.0 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000
0.1 0.53893 0.53900 0.53906 0.53912 0.53917 0.53921 0.53924 0.53928 0.53930 0.53933
0.2 0.57743 0.57759 0.57771 0.57782 0.57792 0.57800 0.57807 0.57814 0.57820 0.57825
0.3 0.61511 0.61534 0.61554 0.61571 0.61585 0.61598 0.61609 0.61619 0.61628 0.61636
0.4 0.65159 0.65191 0.65217 0.65240 0.65260 0.65278 0.65293 0.65307 0.65319 0.65330
0.5 0.68654 0.68694 0.68728 0.68758 0.68783 0.68806 0.68826 0.68843 0.68859 0.68873
0.6 0.71967 0.72017 0.72059 0.72095 0.72127 0.72155 0.72179 0.72201 0.72220 0.72238
0.7 0.75077 0.75136 0.75187 0.75230 0.75268 0.75301 0.75330 0.75356 0.75379 0.75400
0.8 0.77968 0.78037 0.78096 0.78146 0.78190 0.78229 0.78263 0.78293 0.78320 0.78344
0.9 0.80630 0.80709 0.80776 0.80833 0.80883 0.80927 0.80965 0.81000 0.81031 0.81059
1.0 0.83060 0.83148 0.83222 0.83286 0.83341 0.83390 0.83433 0.83472 0.83506 0.83537
1.1 0.85259 0.85355 0.85436 0.85506 0.85566 0.85620 0.85667 0.85709 0.85746 0.85780
1.2 0.87233 0.87335 0.87422 0.87497 0.87563 0.87620 0.87670 0.87715 0.87756 0.87792
1.3 0.88991 0.89099 0.89191 0.89270 0.89339 0.89399 0.89452 0.89500 0.89542 0.89581
1.4 0.90546 0.90658 0.90754 0.90836 0.90907 0.90970 0.91025 0.91074 0.91118 0.91158
1.5 0.91912 0.92027 0.92125 0.92209 0.92282 0.92346 0.92402 0.92452 0.92498 0.92538
1.6 0.93105 0.93221 0.93320 0.93404 0.93478 0.93542 0.93599 0.93650 0.93695 0.93736
1.7 0.94140 0.94256 0.94354 0.94439 0.94512 0.94576 0.94632 0.94683 0.94728 0.94768
1.8 0.95034 0.95148 0.95245 0.95328 0.95400 0.95463 0.95518 0.95568 0.95612 0.95652
1.9 0.95802 0.95914 0.96008 0.96089 0.96158 0.96220 0.96273 0.96321 0.96364 0.96403
2.0 0.96460 0.96567 0.96658 0.96736 0.96803 0.96861 0.96913 0.96959 0.97000 0.97037
2.1 0.97020 0.97123 0.97209 0.97283 0.97347 0.97403 0.97452 0.97495 0.97534 0.97569
2.2 0.97496 0.97593 0.97675 0.97745 0.97805 0.97858 0.97904 0.97945 0.97981 0.98014
2.3 0.97898 0.97990 0.98067 0.98132 0.98189 0.98238 0.98281 0.98319 0.98352 0.98383
2.4 0.98238 0.98324 0.98396 0.98457 0.98509 0.98554 0.98594 0.98629 0.98660 0.98688
2.5 0.98525 0.98604 0.98671 0.98727 0.98775 0.98816 0.98853 0.98885 0.98913 0.98938
2.6 0.98765 0.98839 0.98900 0.98951 0.98995 0.99033 0.99066 0.99095 0.99121 0.99144
2.7 0.98967 0.99035 0.99090 0.99137 0.99177 0.99211 0.99241 0.99267 0.99291 0.99311
2.8 0.99136 0.99198 0.99249 0.99291 0.99327 0.99358 0.99385 0.99408 0.99429 0.99447
2.9 0.99278 0.99334 0.99380 0.99418 0.99450 0.99478 0.99502 0.99523 0.99541 0.99557
3.0 0.99396 0.99447 0.99488 0.99522 0.99551 0.99576 0.99597 0.99616 0.99632 0.99646
3.1 0.99495 0.99541 0.99578 0.99608 0.99634 0.99656 0.99675 0.99691 0.99705 0.99718
3.2 0.99577 0.99618 0.99652 0.99679 0.99702 0.99721 0.99738 0.99752 0.99764 0.99775
3.3 0.99646 0.99683 0.99713 0.99737 0.99757 0.99774 0.99788 0.99801 0.99812 0.99821
3.4 0.99704 0.99737 0.99763 0.99784 0.99802 0.99817 0.99830 0.99840 0.99850 0.99858
3.5 0.99751 0.99781 0.99804 0.99823 0.99839 0.99852 0.99863 0.99872 0.99880 0.99887
3.6 0.99792 0.99818 0.99838 0.99855 0.99869 0.99880 0.99890 0.99898 0.99905 0.99911
3.7 0.99825 0.99848 0.99866 0.99881 0.99893 0.99903 0.99911 0.99918 0.99924 0.99929
3.8 0.99853 0.99874 0.99890 0.99902 0.99913 0.99921 0.99928 0.99934 0.99940 0.99944
3.9 0.99876 0.99894 0.99909 0.99920 0.99929 0.99936 0.99942 0.99948 0.99952 0.99956
4.0 0.99896 0.99912 0.99924 0.99934 0.99942 0.99948 0.99954 0.99958 0.99962 0.99965
4.1 0.99912 0.99926 0.99937 0.99946 0.99953 0.99958 0.99963 0.99966 0.99970 0.99972
4.2 0.99926 0.99938 0.99948 0.99955 0.99961 0.99966 0.99970 0.99973 0.99976 0.99978
4.3 0.99937 0.99948 0.99957 0.99963 0.99968 0.99972 0.99976 0.99978 0.99981 0.99983
4.4 0.99947 0.99957 0.99964 0.99970 0.99974 0.99978 0.99980 0.99983 0.99985 0.99986
4.5 0.99955 0.99964 0.99970 0.99975 0.99979 0.99982 0.99984 0.99986 0.99988 0.99989
4.6 0.99962 0.99969 0.99975 0.99979 0.99983 0.99985 0.99987 0.99989 0.99990 0.99991
4.7 0.99967 0.99974 0.99979 0.99983 0.99986 0.99988 0.99990 0.99991 0.99992 0.99993
4.8 0.99972 0.99978 0.99983 0.99986 0.99988 0.99990 0.99992 0.99993 0.99994 0.99995
4.9 0.99976 0.99982 0.99985 0.99988 0.99990 0.99992 0.99993 0.99994 0.99995 0.99996
5.0 0.99980 0.99985 0.99988 0.99990 0.99992 0.99993 0.99995 0.99995 0.99996 0.99997
5.1 0.99983 0.99987 0.99990 0.99992 0.99993 0.99995 0.99996 0.99996 0.99997 0.99997
5.2 0.99985 0.99989 0.99991 0.99993 0.99995 0.99996 0.99996 0.99997 0.99997 0.99998
5.3 0.99987 0.99991 0.99993 0.99994 0.99996 0.99996 0.99997 0.99998 0.99998 0.99998
5.4 0.99989 0.99992 0.99994 0.99995 0.99996 0.99997 0.99998 0.99998 0.99998 0.99999
5.5 0.99991 0.99993 0.99995 0.99996 0.99997 0.99998 0.99998 0.99998 0.99999 0.99999
5.6 0.99992 0.99994 0.99996 0.99997 0.99997 0.99998 0.99998 0.99999 0.99999 0.99999
5.7 0.99993 0.99995 0.99996 0.99997 0.99998 0.99998 0.99999 0.99999 0.99999 0.99999
5.8 0.99994 0.99996 0.99997 0.99998 0.99998 0.99999 0.99999 0.99999 0.99999 0.99999
5.9 0.99995 0.99996 0.99997 0.99998 0.99999 0.99999 0.99999 0.99999 0.99999 1.00000
6.0 0.99996 0.99997 0.99998 0.99998 0.99999 0.99999 0.99999 0.99999 1.00000 1.00000

degrees of freedom
t 21 22 23 24 25 26 27 28 29 30
0.0 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000 0.50000
0.1 0.53935 0.53937 0.53939 0.53941 0.53943 0.53944 0.53946 0.53947 0.53948 0.53950
0.2 0.57830 0.57834 0.57838 0.57842 0.57845 0.57848 0.57851 0.57854 0.57856 0.57858
0.3 0.61644 0.61650 0.61656 0.61662 0.61667 0.61672 0.61676 0.61680 0.61684 0.61688
0.4 0.65340 0.65349 0.65358 0.65365 0.65372 0.65379 0.65385 0.65390 0.65396 0.65400
0.5 0.68886 0.68898 0.68909 0.68919 0.68928 0.68936 0.68944 0.68951 0.68958 0.68964
0.6 0.72254 0.72268 0.72281 0.72294 0.72305 0.72315 0.72325 0.72333 0.72342 0.72349
0.7 0.75420 0.75437 0.75453 0.75467 0.75480 0.75493 0.75504 0.75515 0.75525 0.75534
0.8 0.78367 0.78387 0.78405 0.78422 0.78438 0.78452 0.78465 0.78478 0.78489 0.78500
0.9 0.81084 0.81107 0.81128 0.81147 0.81165 0.81181 0.81196 0.81210 0.81223 0.81236
1.0 0.83565 0.83591 0.83614 0.83636 0.83655 0.83674 0.83691 0.83706 0.83721 0.83735
1.1 0.85811 0.85839 0.85864 0.85888 0.85909 0.85929 0.85948 0.85965 0.85981 0.85996
1.2 0.87825 0.87855 0.87882 0.87907 0.87931 0.87952 0.87972 0.87990 0.88007 0.88023
1.3 0.89616 0.89647 0.89676 0.89703 0.89727 0.89750 0.89770 0.89790 0.89808 0.89825
1.4 0.91194 0.91227 0.91257 0.91285 0.91310 0.91333 0.91355 0.91375 0.91394 0.91411
1.5 0.92575 0.92609 0.92639 0.92667 0.92693 0.92717 0.92739 0.92760 0.92779 0.92797
1.6 0.93773 0.93807 0.93838 0.93866 0.93892 0.93916 0.93938 0.93959 0.93978 0.93996
1.7 0.94805 0.94839 0.94869 0.94897 0.94923 0.94947 0.94969 0.94989 0.95008 0.95026
1.8 0.95688 0.95720 0.95750 0.95778 0.95803 0.95826 0.95848 0.95868 0.95886 0.95904
1.9 0.96437 0.96469 0.96498 0.96524 0.96549 0.96571 0.96592 0.96611 0.96629 0.96646
2.0 0.97070 0.97100 0.97128 0.97153 0.97176 0.97198 0.97217 0.97236 0.97253 0.97269
2.1 0.97601 0.97629 0.97655 0.97679 0.97701 0.97721 0.97740 0.97757 0.97773 0.97788
2.2 0.98043 0.98070 0.98094 0.98116 0.98137 0.98155 0.98173 0.98189 0.98204 0.98218
2.3 0.98410 0.98435 0.98457 0.98478 0.98496 0.98514 0.98530 0.98544 0.98558 0.98571
2.4 0.98713 0.98735 0.98756 0.98775 0.98792 0.98807 0.98822 0.98836 0.98848 0.98860
2.5 0.98961 0.98982 0.99000 0.99017 0.99033 0.99047 0.99060 0.99072 0.99084 0.99094
2.6 0.99164 0.99183 0.99200 0.99215 0.99229 0.99242 0.99253 0.99264 0.99274 0.99284
2.7 0.99330 0.99346 0.99361 0.99375 0.99387 0.99398 0.99409 0.99419 0.99427 0.99436
2.8 0.99464 0.99478 0.99491 0.99504 0.99515 0.99525 0.99534 0.99542 0.99550 0.99557
2.9 0.99572 0.99585 0.99596 0.99607 0.99617 0.99625 0.99633 0.99641 0.99648 0.99654
3.0 0.99659 0.99670 0.99680 0.99690 0.99698 0.99706 0.99713 0.99719 0.99725 0.99731
3.1 0.99729 0.99739 0.99748 0.99756 0.99763 0.99769 0.99775 0.99781 0.99786 0.99791
3.2 0.99785 0.99793 0.99801 0.99808 0.99814 0.99820 0.99825 0.99830 0.99834 0.99838
3.3 0.99830 0.99837 0.99843 0.99849 0.99855 0.99860 0.99864 0.99868 0.99872 0.99875
3.4 0.99865 0.99871 0.99877 0.99882 0.99887 0.99891 0.99894 0.99898 0.99901 0.99904
3.5 0.99893 0.99899 0.99904 0.99908 0.99912 0.99915 0.99918 0.99921 0.99924 0.99926
3.6 0.99916 0.99920 0.99924 0.99928 0.99931 0.99934 0.99937 0.99939 0.99941 0.99943
3.7 0.99934 0.99937 0.99941 0.99944 0.99947 0.99949 0.99951 0.99953 0.99955 0.99957
3.8 0.99948 0.99951 0.99954 0.99956 0.99959 0.99961 0.99963 0.99964 0.99966 0.99967
3.9 0.99959 0.99962 0.99964 0.99966 0.99968 0.99970 0.99971 0.99973 0.99974 0.99975
4.0 0.99968 0.99970 0.99972 0.99974 0.99975 0.99977 0.99978 0.99979 0.99980 0.99981
4.1 0.99974 0.99976 0.99978 0.99980 0.99981 0.99982 0.99983 0.99984 0.99985 0.99986
4.2 0.99980 0.99981 0.99983 0.99984 0.99985 0.99986 0.99987 0.99988 0.99988 0.99989
4.3 0.99984 0.99986 0.99987 0.99988 0.99989 0.99989 0.99990 0.99991 0.99991 0.99992
4.4 0.99988 0.99989 0.99990 0.99990 0.99991 0.99992 0.99992 0.99993 0.99993 0.99994
4.5 0.99990 0.99991 0.99992 0.99993 0.99993 0.99994 0.99994 0.99995 0.99995 0.99995
4.6 0.99992 0.99993 0.99994 0.99994 0.99995 0.99995 0.99996 0.99996 0.99996 0.99996
4.7 0.99994 0.99995 0.99995 0.99996 0.99996 0.99996 0.99997 0.99997 0.99997 0.99997
4.8 0.99995 0.99996 0.99996 0.99997 0.99997 0.99997 0.99997 0.99998 0.99998 0.99998
4.9 0.99996 0.99997 0.99997 0.99997 0.99998 0.99998 0.99998 0.99998 0.99998 0.99998
5.0 0.99997 0.99997 0.99998 0.99998 0.99998 0.99998 0.99998 0.99999 0.99999 0.99999
5.1 0.99998 0.99998 0.99998 0.99998 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999
5.2 0.99998 0.99998 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999
5.3 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 1.00000
5.4 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 0.99999 1.00000 1.00000 1.00000
5.5 0.99999 0.99999 0.99999 0.99999 0.99999 1.00000 1.00000 1.00000 1.00000 1.00000
5.6 0.99999 0.99999 0.99999 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
5.7 0.99999 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
5.8 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
5.9 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
6.0 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
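
The entries of the Student t table are values of P(T <= t). A rough way to reproduce them without a statistics library is to integrate the t density numerically. The sketch below uses composite Simpson integration; the function names t_pdf and t_cdf and the number of subintervals n are illustrative choices, not the method used to build the table.

```python
import math


def t_pdf(t: float, df: int) -> float:
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t: float, df: int, n: int = 2000) -> float:
    """P(T <= t), integrating the density from 0 to |t| (n must be even)."""
    if t == 0:
        return 0.5
    upper = abs(t)
    h = upper / n
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, n):
        # Simpson weights: 4 for odd interior points, 2 for even ones
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    # The density is symmetric, so half of the mass lies below zero
    return 0.5 + area if t > 0 else 0.5 - area


# Row t = 1.0, 1 degree of freedom: table entry 0.75000
print(f"{t_cdf(1.0, 1):.5f}")

# Row t = 2.0, 10 degrees of freedom: table entry 0.96331
print(f"{t_cdf(2.0, 10):.5f}")
```

For negative t the symmetry P(T <= -t) = 1 - P(T <= t) applies, which is why the table starts at t = 0.0.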
Chi-square distribution - Cumulative distribution function

Degrees of freedom
Chi-square 1 2 3 4 5 6 7 8 9 10
0.001 0.025227 0.000500 0.000008 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.01 0.079656 0.004988 0.000265 0.000012 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000
0.05 0.176937 0.024690 0.002929 0.000307 0.000029 0.000003 0.000000 0.000000 0.000000 0.000000
0.1 0.248170 0.048771 0.008163 0.001209 0.000162 0.000020 0.000002 0.000000 0.000000 0.000000
0.2 0.345279 0.095163 0.022411 0.004679 0.000886 0.000155 0.000025 0.000004 0.000001 0.000000
0.3 0.416118 0.139292 0.039972 0.010186 0.002357 0.000503 0.000100 0.000019 0.000003 0.000001
0.4 0.472911 0.181269 0.059758 0.017523 0.004670 0.001148 0.000263 0.000057 0.000012 0.000002
0.5 0.520500 0.221199 0.081109 0.026499 0.007877 0.002161 0.000554 0.000133 0.000030 0.000007
0.6 0.561422 0.259182 0.103568 0.036936 0.011997 0.003599 0.001008 0.000266 0.000066 0.000016
0.7 0.597216 0.295312 0.126796 0.048671 0.017031 0.005509 0.001664 0.000473 0.000128 0.000033
0.8 0.628907 0.329680 0.150533 0.061552 0.022967 0.007926 0.002556 0.000776 0.000223 0.000061
0.9 0.657218 0.362372 0.174572 0.075439 0.029778 0.010879 0.003715 0.001195 0.000365 0.000106
1 0.682689 0.393469 0.198748 0.090204 0.037434 0.014388 0.005171 0.001752 0.000562 0.000172
2 0.842701 0.632121 0.427593 0.264241 0.150855 0.080301 0.040160 0.018988 0.008532 0.003660
3 0.916735 0.776870 0.608375 0.442175 0.300014 0.191153 0.114998 0.065642 0.035705 0.018576
4 0.954500 0.864665 0.738536 0.593994 0.450584 0.323324 0.220223 0.142877 0.088587 0.052653
5 0.974653 0.917915 0.828203 0.712703 0.584120 0.456187 0.340037 0.242424 0.165692 0.108822
6 0.985694 0.950213 0.888390 0.800852 0.693781 0.576810 0.460251 0.352768 0.260082 0.184737
7 0.991849 0.969803 0.928102 0.864112 0.779360 0.679153 0.571120 0.463367 0.362881 0.274555
8 0.995322 0.981684 0.953988 0.908422 0.843764 0.761897 0.667406 0.566530 0.465854 0.371163
9 0.997300 0.988891 0.970709 0.938901 0.890936 0.826422 0.747344 0.657704 0.562726 0.467896
10 0.998435 0.993262 0.981434 0.959572 0.924765 0.875348 0.811427 0.734974 0.649515 0.559507
11 0.999089 0.995913 0.988274 0.973436 0.948620 0.911624 0.861381 0.798301 0.724291 0.642482
12 0.999468 0.997521 0.992617 0.982649 0.965212 0.938031 0.899441 0.848796 0.786691 0.714943
13 0.999689 0.998497 0.995363 0.988724 0.976621 0.956964 0.927892 0.888150 0.837394 0.776328
14 0.999817 0.999088 0.997095 0.992705 0.984391 0.970364 0.948819 0.918235 0.877675 0.827008
15 0.999892 0.999447 0.998183 0.995299 0.989638 0.979743 0.964001 0.940855 0.909064 0.867938
16 0.999937 0.999665 0.998866 0.996981 0.993156 0.986246 0.974884 0.957620 0.933118 0.900368
17 0.999963 0.999797 0.999293 0.998067 0.995500 0.990717 0.982604 0.969891 0.951284 0.925636
18 0.999978 0.999877 0.999560 0.998766 0.997054 0.993768 0.988030 0.978774 0.964826 0.945036
19 0.999987 0.999925 0.999727 0.999214 0.998078 0.995836 0.991813 0.985140 0.974807 0.959737
20 0.999992 0.999955 0.999830 0.999501 0.998750 0.997231 0.994430 0.989664 0.982088 0.970747
21 0.999995 0.999972 0.999895 0.999683 0.999190 0.998165 0.996230 0.992853 0.987350 0.978906
22 0.999997 0.999983 0.999935 0.999800 0.999476 0.998789 0.997460 0.995084 0.991121 0.984895
23 0.999998 0.999990 0.999960 0.999873 0.999662 0.999204 0.998295 0.996636 0.993804 0.989253
24 0.999999 0.999994 0.999975 0.999920 0.999783 0.999478 0.998861 0.997708 0.995699 0.992400
25 0.999999 0.999996 0.999985 0.999950 0.999861 0.999659 0.999241 0.998445 0.997029 0.994654
26 1.000000 0.999998 0.999990 0.999968 0.999911 0.999777 0.999496 0.998950 0.997957 0.996260
27 1.000000 0.999999 0.999994 0.999980 0.999943 0.999855 0.999667 0.999293 0.998601 0.997396
28 1.000000 0.999999 0.999996 0.999988 0.999964 0.999906 0.999780 0.999526 0.999046 0.998195
29 1.000000 0.999999 0.999998 0.999992 0.999977 0.999939 0.999855 0.999683 0.999352 0.998754
30 1.000000 1.000000 0.999999 0.999995 0.999985 0.999961 0.999905 0.999789 0.999561 0.999143
31 1.000000 1.000000 0.999999 0.999997 0.999991 0.999975 0.999938 0.999859 0.999704 0.999413
32 1.000000 1.000000 0.999999 0.999998 0.999994 0.999984 0.999959 0.999907 0.999801 0.999600
33 1.000000 1.000000 1.000000 0.999999 0.999996 0.999990 0.999974 0.999938 0.999866 0.999728
34 1.000000 1.000000 1.000000 0.999999 0.999998 0.999993 0.999983 0.999959 0.999911 0.999815
35 1.000000 1.000000 1.000000 1.000000 0.999998 0.999996 0.999989 0.999973 0.999940 0.999875
36 1.000000 1.000000 1.000000 1.000000 0.999999 0.999997 0.999993 0.999982 0.999960 0.999916
37 1.000000 1.000000 1.000000 1.000000 0.999999 0.999998 0.999995 0.999988 0.999974 0.999943
38 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999997 0.999992 0.999983 0.999962
39 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999998 0.999995 0.999988 0.999975
40 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999997 0.999992 0.999983
41 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999998 0.999995 0.999989
42 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999999 0.999997 0.999993
43 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999998 0.999995
44 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999999 0.999997
45 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999998
46 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999 0.999999
47 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999
48 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999999
49 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
50 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Degrees of freedom
Chi-square 11 12 13 14 15 16 17 18 19 20
0.001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.01 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.05 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.5 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.6 0.000004 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.7 0.000008 0.000002 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.8 0.000016 0.000004 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.9 0.000029 0.000008 0.000002 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000050 0.000014 0.000004 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.001504 0.000594 0.000226 0.000083 0.000030 0.000010 0.000003 0.000001 0.000000 0.000000
3 0.009274 0.004456 0.002066 0.000926 0.000402 0.000170 0.000070 0.000028 0.000011 0.000004
4 0.030083 0.016564 0.008809 0.004534 0.002263 0.001097 0.000517 0.000237 0.000106 0.000046
5 0.068833 0.042021 0.024807 0.014187 0.007874 0.004247 0.002229 0.001140 0.000569 0.000277
6 0.126636 0.083918 0.053847 0.033509 0.020252 0.011905 0.006814 0.003803 0.002072 0.001102
7 0.200916 0.142386 0.097848 0.065288 0.042350 0.026739 0.016451 0.009874 0.005787 0.003315
8 0.286696 0.214870 0.156400 0.110674 0.076217 0.051134 0.033453 0.021363 0.013329 0.008132
9 0.378108 0.297070 0.227056 0.168949 0.122483 0.086586 0.059738 0.040257 0.026521 0.017093
10 0.469613 0.384039 0.306066 0.237817 0.180260 0.133372 0.096390 0.068094 0.047054 0.031828
11 0.556737 0.471081 0.389182 0.313964 0.247406 0.190515 0.143436 0.105643 0.076162 0.053777
12 0.636357 0.554320 0.472356 0.393697 0.320971 0.256020 0.199863 0.152763 0.114375 0.083924
13 0.706675 0.630959 0.552188 0.473476 0.397702 0.327242 0.263814 0.208427 0.161429 0.122616
14 0.767007 0.699292 0.626156 0.550289 0.474471 0.401286 0.332898 0.270909 0.216309 0.169504
15 0.817503 0.758564 0.692647 0.621845 0.548583 0.475361 0.404518 0.338033 0.277403 0.223592
16 0.858869 0.808764 0.750870 0.686626 0.617948 0.547039 0.476165 0.407453 0.342722 0.283376
17 0.892124 0.850403 0.800696 0.743822 0.681136 0.614403 0.545634 0.476895 0.410132 0.347026
18 0.918419 0.884309 0.842481 0.793219 0.737334 0.676103 0.611159 0.544347 0.477562 0.412592
19 0.938906 0.911472 0.876896 0.835051 0.786266 0.731337 0.671468 0.608177 0.543164 0.478174
20 0.954659 0.932914 0.904790 0.869859 0.828067 0.779779 0.725771 0.667180 0.605422 0.542070
21 0.966629 0.949620 0.927071 0.898367 0.863171 0.821489 0.773710 0.720587 0.663199 0.602867
22 0.975627 0.962480 0.944638 0.921386 0.892196 0.856808 0.815281 0.768015 0.715744 0.659489
23 0.982325 0.972274 0.958324 0.939730 0.915860 0.886265 0.850749 0.809410 0.762658 0.711205
24 0.987267 0.979659 0.968870 0.954178 0.934907 0.910496 0.880565 0.844972 0.803848 0.757608
25 0.990883 0.985177 0.976916 0.965433 0.950057 0.930175 0.905290 0.875084 0.839458 0.798569
26 0.993510 0.989266 0.982999 0.974113 0.961977 0.945972 0.925539 0.900242 0.869811 0.834188
27 0.995405 0.992273 0.987559 0.980746 0.971264 0.958517 0.941932 0.921005 0.895347 0.864736
28 0.996763 0.994468 0.990950 0.985772 0.978431 0.968380 0.955062 0.937945 0.916571 0.890601
29 0.997730 0.996060 0.993454 0.989550 0.983915 0.976064 0.965474 0.951621 0.934015 0.912241
30 0.998415 0.997208 0.995290 0.992368 0.988079 0.981998 0.973655 0.962554 0.948202 0.930146
31 0.998898 0.998030 0.996628 0.994456 0.991215 0.986544 0.980028 0.971213 0.959627 0.944810
32 0.999237 0.998616 0.997598 0.995994 0.993562 0.990000 0.984952 0.978013 0.968745 0.956702
33 0.999474 0.999032 0.998296 0.997119 0.995306 0.992610 0.988728 0.983310 0.975960 0.966259
34 0.999638 0.999325 0.998796 0.997938 0.996595 0.994567 0.991604 0.987404 0.981622 0.973875
35 0.999752 0.999532 0.999153 0.998530 0.997541 0.996026 0.993779 0.990548 0.986033 0.979896
36 0.999831 0.999676 0.999407 0.998957 0.998232 0.997107 0.995413 0.992944 0.989444 0.984619
37 0.999885 0.999777 0.999586 0.999262 0.998734 0.997903 0.996635 0.994759 0.992065 0.988298
38 0.999922 0.999846 0.999712 0.999480 0.999098 0.998487 0.997542 0.996127 0.994065 0.991144
39 0.999947 0.999895 0.999800 0.999635 0.999359 0.998912 0.998213 0.997150 0.995583 0.993333
40 0.999964 0.999928 0.999862 0.999745 0.999547 0.999221 0.998706 0.997913 0.996728 0.995005
41 0.999976 0.999951 0.999905 0.999822 0.999680 0.999445 0.999067 0.998478 0.997587 0.996275
42 0.999984 0.999967 0.999935 0.999876 0.999775 0.999605 0.999329 0.998894 0.998228 0.997234
43 0.999989 0.999977 0.999955 0.999914 0.999843 0.999721 0.999520 0.999200 0.998704 0.997956
44 0.999993 0.999985 0.999969 0.999941 0.999890 0.999803 0.999657 0.999423 0.999056 0.998495
45 0.999995 0.999990 0.999979 0.999959 0.999923 0.999861 0.999756 0.999586 0.999315 0.998897
46 0.999997 0.999993 0.999986 0.999972 0.999947 0.999903 0.999827 0.999703 0.999504 0.999194
47 0.999998 0.999995 0.999990 0.999981 0.999963 0.999932 0.999878 0.999788 0.999643 0.999413
48 0.999999 0.999997 0.999993 0.999987 0.999975 0.999953 0.999914 0.999849 0.999743 0.999575
49 0.999999 0.999998 0.999996 0.999991 0.999982 0.999967 0.999940 0.999893 0.999816 0.999693
50 0.999999 0.999999 0.999997 0.999994 0.999988 0.999977 0.999958 0.999925 0.999869 0.999779
STATISTICAL TABLES

Degrees of freedom
Chi-square 21 22 23 24 25 26 27 28 29 30
0.001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.01 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.05 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000002 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000020 0.000008 0.000003 0.000001 0.000001 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000132 0.000062 0.000028 0.000013 0.000006 0.000002 0.000001 0.000000 0.000000 0.000000
6 0.000574 0.000292 0.000146 0.000071 0.000034 0.000016 0.000007 0.000003 0.000002 0.000001
7 0.001858 0.001019 0.000548 0.000289 0.000150 0.000076 0.000038 0.000019 0.000009 0.000004
8 0.004856 0.002840 0.001628 0.000915 0.000505 0.000274 0.000146 0.000076 0.000039 0.000020
9 0.010786 0.006669 0.004043 0.002404 0.001404 0.000805 0.000454 0.000252 0.000137 0.000074
10 0.021088 0.013695 0.008723 0.005453 0.003347 0.002019 0.001197 0.000698 0.000401 0.000226
11 0.037213 0.025251 0.016812 0.010988 0.007054 0.004451 0.002761 0.001685 0.001012 0.000599
12 0.060382 0.042621 0.029529 0.020092 0.013432 0.008827 0.005706 0.003628 0.002271 0.001400
13 0.091376 0.066839 0.048010 0.033880 0.023499 0.016027 0.010753 0.007100 0.004616 0.002956
14 0.130401 0.098521 0.073129 0.053350 0.038268 0.027000 0.018745 0.012811 0.008623 0.005717
15 0.177048 0.137762 0.105366 0.079241 0.058617 0.042666 0.030568 0.021565 0.014985 0.010260
16 0.230349 0.184114 0.144731 0.111924 0.085171 0.063797 0.047053 0.034181 0.024464 0.017257
17 0.288894 0.236638 0.190748 0.151338 0.118206 0.090917 0.068878 0.051411 0.037819 0.027425
18 0.350996 0.294012 0.242511 0.196992 0.157609 0.124227 0.096480 0.073851 0.055728 0.041466
19 0.414860 0.354672 0.298775 0.248010 0.202879 0.163570 0.129999 0.101864 0.078712 0.059992
20 0.478739 0.416960 0.358088 0.303224 0.253175 0.208443 0.169244 0.135536 0.107073 0.083458
21 0.541056 0.479262 0.418912 0.361275 0.307390 0.258036 0.213712 0.174651 0.140851 0.112112
22 0.600490 0.540111 0.479748 0.420733 0.364256 0.311303 0.262623 0.218709 0.179811 0.145956
23 0.656022 0.598270 0.539229 0.480202 0.422437 0.367053 0.314988 0.266960 0.223457 0.184740
24 0.706941 0.652771 0.596192 0.538403 0.480626 0.424035 0.369684 0.318464 0.271068 0.227975
25 0.752836 0.702925 0.649715 0.594239 0.537626 0.481025 0.425538 0.372165 0.321752 0.274968
26 0.793551 0.748318 0.699134 0.646835 0.592401 0.536895 0.481399 0.426955 0.374509 0.324868
27 0.829147 0.788774 0.744032 0.695547 0.644115 0.590667 0.536205 0.481753 0.428295 0.376729
28 0.859849 0.824319 0.784218 0.739960 0.692147 0.641542 0.589026 0.535552 0.482087 0.429563
29 0.885998 0.855139 0.819690 0.779869 0.736084 0.688918 0.639101 0.587472 0.534934 0.482403
30 0.908012 0.881536 0.850598 0.815248 0.775711 0.732389 0.685846 0.636782 0.585996 0.534346
31 0.926342 0.903884 0.877207 0.846217 0.810981 0.771731 0.728861 0.682919 0.634576 0.584593
32 0.941450 0.922604 0.899857 0.873007 0.841988 0.806878 0.767916 0.725489 0.680127 0.632473
33 0.953783 0.938126 0.918933 0.895927 0.868932 0.837902 0.802930 0.764256 0.722261 0.677458
34 0.963761 0.950876 0.934842 0.915331 0.892092 0.864976 0.833953 0.799127 0.760740 0.719167
35 0.971765 0.961255 0.947984 0.931599 0.911797 0.888351 0.861134 0.830133 0.795460 0.757360
36 0.978135 0.969634 0.958747 0.945113 0.928400 0.908331 0.884701 0.857402 0.826436 0.791923
37 0.983166 0.976344 0.967487 0.956240 0.942263 0.925246 0.904933 0.881139 0.853776 0.822856
38 0.987111 0.981678 0.974528 0.965327 0.953739 0.939439 0.922138 0.901601 0.877664 0.850250
39 0.990185 0.985888 0.980159 0.972691 0.963160 0.951245 0.936641 0.919077 0.898336 0.874271
40 0.992563 0.989188 0.984631 0.978613 0.970836 0.960988 0.948763 0.933872 0.916063 0.895136
41 0.994393 0.991759 0.988158 0.983343 0.977043 0.968966 0.958814 0.946294 0.931134 0.913096
42 0.995792 0.993749 0.990922 0.987095 0.982027 0.975451 0.967085 0.956641 0.943841 0.928426
43 0.996857 0.995281 0.993074 0.990053 0.986003 0.980686 0.973841 0.965195 0.954471 0.941404
44 0.997662 0.996453 0.994741 0.992370 0.989155 0.984884 0.979322 0.972215 0.963298 0.952307
45 0.998268 0.997346 0.996025 0.994175 0.991638 0.988229 0.983739 0.977938 0.970576 0.961398
46 0.998722 0.998022 0.997009 0.995573 0.993582 0.990878 0.987277 0.982572 0.976535 0.968926
47 0.999061 0.998532 0.997758 0.996650 0.995097 0.992964 0.990093 0.986301 0.981383 0.975116
48 0.999312 0.998915 0.998327 0.997476 0.996270 0.994598 0.992322 0.989284 0.985302 0.980175
49 0.999498 0.999201 0.998756 0.998106 0.997175 0.995870 0.994076 0.991656 0.988452 0.984282
50 0.999635 0.999414 0.999079 0.998584 0.997869 0.996856 0.995449 0.993533 0.990968 0.987598
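The entries of this table appear to be cumulative probabilities P(χ² ≤ x), with the chi-square value x in the rows and the degrees of freedom in the columns. As a minimal sketch of how such values could be recomputed, the following uses a standard series expansion of the regularized lower incomplete gamma function. The function name chi2_cdf is illustrative and not taken from the book, and the series is an assumption about one workable method, not the method the authors used to generate the table.

```python
import math


def chi2_cdf(x, df):
    """Return P(X <= x) for a chi-square variable with df degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = df / 2 and z = x / 2:
        P(a, z) = z**a * exp(-z) / Gamma(a + 1) * sum_{n >= 0} z**n / ((a+1)...(a+n))
    """
    if x <= 0:
        return 0.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0   # current term of the series (n = 0 term equals 1)
    total = 1.0  # running sum of the series
    n = 0
    # Terms eventually shrink because z / (a + n) tends to 0 as n grows.
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    # Work in logarithms so the prefactor does not overflow for large df.
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1.0)
    return math.exp(log_prefactor) * total


if __name__ == "__main__":
    # Recompute a few cells of the table (x in rows, df in columns).
    for x in (20.0, 30.0, 40.0):
        row = [round(chi2_cdf(x, df), 6) for df in (21, 25, 30)]
        print(x, row)
```

For two degrees of freedom the result can be checked by hand, because the distribution function reduces to 1 - exp(-x/2).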
Fisher-Snedecor's F distribution
P(F)=0.1
m 1 2 3 4 5 6 7 8 9 10
n 1 39.8635 49.5000 53.5932 55.8330 57.2401 58.2044 58.9060 59.4390 59.8576 60.1950
2 8.5263 9.0000 9.1618 9.2434 9.2926 9.3255 9.3491 9.3668 9.3805 9.3916
3 5.5383 5.4624 5.3908 5.3426 5.3092 5.2847 5.2662 5.2517 5.2400 5.2304
4 4.5448 4.3246 4.1909 4.1072 4.0506 4.0097 3.9790 3.9549 3.9357 3.9199
5 4.0604 3.7797 3.6195 3.5202 3.4530 3.4045 3.3679 3.3393 3.3163 3.2974
6 3.7759 3.4633 3.2888 3.1808 3.1075 3.0546 3.0145 2.9830 2.9577 2.9369
7 3.5894 3.2574 3.0741 2.9605 2.8833 2.8274 2.7849 2.7516 2.7247 2.7025
8 3.4579 3.1131 2.9238 2.8064 2.7264 2.6683 2.6241 2.5893 2.5612 2.5380
9 3.3603 3.0065 2.8129 2.6927 2.6106 2.5509 2.5053 2.4694 2.4403 2.4163
10 3.2850 2.9245 2.7277 2.6053 2.5216 2.4606 2.4140 2.3772 2.3473 2.3226
12 3.1765 2.8068 2.6055 2.4801 2.3940 2.3310 2.2828 2.2446 2.2135 2.1878
15 3.0732 2.6952 2.4898 2.3614 2.2730 2.2081 2.1582 2.1185 2.0862 2.0593
20 2.9747 2.5893 2.3801 2.2489 2.1582 2.0913 2.0397 1.9985 1.9649 1.9367
24 2.9271 2.5383 2.3274 2.1949 2.1030 2.0351 1.9826 1.9407 1.9063 1.8775
30 2.8807 2.4887 2.2761 2.1422 2.0492 1.9803 1.9269 1.8841 1.8490 1.8195
40 2.8354 2.4404 2.2261 2.0909 1.9968 1.9269 1.8725 1.8289 1.7929 1.7627
60 2.7911 2.3933 2.1774 2.0410 1.9457 1.8747 1.8194 1.7748 1.7380 1.7070
120 2.7478 2.3473 2.1300 1.9923 1.8959 1.8238 1.7675 1.7220 1.6842 1.6524
∞ 2.7055 2.3026 2.0838 1.9449 1.8473 1.7741 1.7167 1.6702 1.6315 1.5987
P(F)=0.05
m 1 2 3 4 5 6 7 8 9 10
n 1 161.4476 199.5000 215.7073 224.5832 230.1619 233.9860 236.7684 238.8827 240.5433 241.8817
2 18.5128 19.0000 19.1643 19.2468 19.2964 19.3295 19.3532 19.3710 19.3848 19.3959
3 10.1280 9.5521 9.2766 9.1172 9.0135 8.9406 8.8867 8.8452 8.8123 8.7855
4 7.7086 6.9443 6.5914 6.3882 6.2561 6.1631 6.0942 6.0410 5.9988 5.9644
5 6.6079 5.7861 5.4095 5.1922 5.0503 4.9503 4.8759 4.8183 4.7725 4.7351
6 5.9874 5.1433 4.7571 4.5337 4.3874 4.2839 4.2067 4.1468 4.0990 4.0600
7 5.5914 4.7374 4.3468 4.1203 3.9715 3.8660 3.7870 3.7257 3.6767 3.6365
8 5.3177 4.4590 4.0662 3.8379 3.6875 3.5806 3.5005 3.4381 3.3881 3.3472
9 5.1174 4.2565 3.8625 3.6331 3.4817 3.3738 3.2927 3.2296 3.1789 3.1373
10 4.9646 4.1028 3.7083 3.4780 3.3258 3.2172 3.1355 3.0717 3.0204 2.9782
12 4.7472 3.8853 3.4903 3.2592 3.1059 2.9961 2.9134 2.8486 2.7964 2.7534
15 4.5431 3.6823 3.2874 3.0556 2.9013 2.7905 2.7066 2.6408 2.5876 2.5437
20 4.3512 3.4928 3.0984 2.8661 2.7109 2.5990 2.5140 2.4471 2.3928 2.3479
24 4.2597 3.4028 3.0088 2.7763 2.6207 2.5082 2.4226 2.3551 2.3002 2.2547
30 4.1709 3.3158 2.9223 2.6896 2.5336 2.4205 2.3343 2.2662 2.2107 2.1646
40 4.0847 3.2317 2.8387 2.6060 2.4495 2.3359 2.2490 2.1802 2.1240 2.0772
60 4.0012 3.1504 2.7581 2.5252 2.3683 2.2541 2.1665 2.0970 2.0401 1.9926
120 3.9201 3.0718 2.6802 2.4472 2.2899 2.1750 2.0868 2.0164 1.9588 1.9105
∞ 3.8415 2.9957 2.6049 2.3719 2.2141 2.0986 2.0096 1.9384 1.8799 1.8307
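The tabulated values are upper-tail critical values: the point that an F variable with m (numerator, columns) and n (denominator, rows) degrees of freedom exceeds with the stated probability. For the special case m = 2 the F distribution function has a closed form, so that column can be checked directly. The sketch below assumes this closed form; the function names are illustrative and not from the book.

```python
import math


def f_critical_m2(alpha, n):
    """Upper-tail critical value of F with (2, n) degrees of freedom.

    For m = 2 the upper-tail probability is P(F > x) = (1 + 2x/n) ** (-n/2),
    and solving P(F > x) = alpha for x gives the expression below.
    """
    return (n / 2.0) * (alpha ** (-2.0 / n) - 1.0)


def f_critical_m2_limit(alpha):
    """Limit of the critical value as n grows without bound (last table row)."""
    return -math.log(alpha)


if __name__ == "__main__":
    for alpha in (0.1, 0.05):
        values = [round(f_critical_m2(alpha, n), 4) for n in (1, 2, 10, 120)]
        print(alpha, values, round(f_critical_m2_limit(alpha), 4))
```

For example, with an upper-tail probability of 0.05 and n = 1 the formula gives (1/2)(0.05⁻² − 1) = 199.5, which matches the m = 2 entry in the first row of the table above.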
P(F)=0.01
m 1 2 3 4 5 6 7 8 9 10
n 1 4052.1807 4999.5000 5403.3520 5624.5833 5763.6496 5858.9861 5928.3557 5981.0703 6022.4732 6055.8467
2 98.5025 99.0000 99.1662 99.2494 99.2993 99.3326 99.3564 99.3742 99.3881 99.3992
3 34.1162 30.8165 29.4567 28.7099 28.2371 27.9107 27.6717 27.4892 27.3452 27.2287
4 21.1977 18.0000 16.6944 15.9770 15.5219 15.2069 14.9758 14.7989 14.6591 14.5459
5 16.2582 13.2739 12.0600 11.3919 10.9670 10.6723 10.4555 10.2893 10.1578 10.0510
6 13.7450 10.9248 9.7795 9.1483 8.7459 8.4661 8.2600 8.1017 7.9761 7.8741
7 12.2464 9.5466 8.4513 7.8466 7.4604 7.1914 6.9928 6.8400 6.7188 6.6201
8 11.2586 8.6491 7.5910 7.0061 6.6318 6.3707 6.1776 6.0289 5.9106 5.8143
9 10.5614 8.0215 6.9919 6.4221 6.0569 5.8018 5.6129 5.4671 5.3511 5.2565
10 10.0443 7.5594 6.5523 5.9943 5.6363 5.3858 5.2001 5.0567 4.9424 4.8491
12 9.3302 6.9266 5.9525 5.4120 5.0643 4.8206 4.6395 4.4994 4.3875 4.2961
15 8.6831 6.3589 5.4170 4.8932 4.5556 4.3183 4.1415 4.0045 3.8948 3.8049
20 8.0960 5.8489 4.9382 4.4307 4.1027 3.8714 3.6987 3.5644 3.4567 3.3682
24 7.8229 5.6136 4.7181 4.2184 3.8951 3.6667 3.4959 3.3629 3.2560 3.1681
30 7.5625 5.3903 4.5097 4.0179 3.6990 3.4735 3.3045 3.1726 3.0665 2.9791
40 7.3141 5.1785 4.3126 3.8283 3.5138 3.2910 3.1238 2.9930 2.8876 2.8005
60 7.0771 4.9774 4.1259 3.6490 3.3389 3.1187 2.9530 2.8233 2.7185 2.6318
120 6.8509 4.7865 3.9491 3.4795 3.1735 2.9559 2.7918 2.6629 2.5586 2.4721
∞ 6.6349 4.6052 3.7816 3.3192 3.0173 2.8020 2.6393 2.5113 2.4073 2.3209
P(F)=0.005
m 1 2 3 4 5 6 7 8 9 10
n 1 16210.7227 19999.5000 21614.7414 22499.5833 23055.7982 23437.1111 23714.5658 23925.4062 24091.0041 24224.4868
2 198.5013 199.0000 199.1664 199.2497 199.2996 199.3330 199.3568 199.3746 199.3885 199.3996
3 55.5520 49.7993 47.4672 46.1946 45.3916 44.8385 44.4341 44.1256 43.8824 43.6858
4 31.3328 26.2843 24.2591 23.1545 22.4564 21.9746 21.6217 21.3520 21.1391 20.9667
5 22.7848 18.3138 16.5298 15.5561 14.9396 14.5133 14.2004 13.9610 13.7716 13.6182
6 18.6350 14.5441 12.9166 12.0275 11.4637 11.0730 10.7859 10.5658 10.3915 10.2500
7 16.2356 12.4040 10.8824 10.0505 9.5221 9.1553 8.8854 8.6781 8.5138 8.3803
8 14.6882 11.0424 9.5965 8.8051 8.3018 7.9520 7.6941 7.4959 7.3386 7.2106
9 13.6136 10.1067 8.7171 7.9559 7.4712 7.1339 6.8849 6.6933 6.5411 6.4172
10 12.8265 9.4270 8.0807 7.3428 6.8724 6.5446 6.3025 6.1159 5.9676 5.8467
12 11.7542 8.5096 7.2258 6.5211 6.0711 5.7570 5.5245 5.3451 5.2021 5.0855
15 10.7980 7.7008 6.4760 5.8029 5.3721 5.0708 4.8473 4.6744 4.5364 4.4235
20 9.9439 6.9865 5.8177 5.1743 4.7616 4.4721 4.2569 4.0900 3.9564 3.8470
24 9.5513 6.6609 5.5190 4.8898 4.4857 4.2019 3.9905 3.8264 3.6949 3.5870
30 9.1797 6.3547 5.2388 4.6234 4.2276 3.9492 3.7416 3.5801 3.4505 3.3440
40 8.8279 6.0664 4.9758 4.3738 3.9860 3.7129 3.5088 3.3498 3.2220 3.1167
60 8.4946 5.7950 4.7290 4.1399 3.7599 3.4918 3.2911 3.1344 3.0083 2.9042
120 8.1788 5.5393 4.4972 3.9207 3.5482 3.2849 3.0874 2.9330 2.8083 2.7052
∞ 7.8794 5.2983 4.2794 3.7151 3.3499 3.0913 2.8968 2.7444 2.6210 2.5188
P(F)=0.1
m 12 15 20 24 30 40 60 120 ∞
n 1 60.7052 61.2203 61.7403 62.0020 62.2650 62.5291 62.7943 63.0606 63.3282
2 9.4081 9.4247 9.4413 9.4496 9.4579 9.4662 9.4746 9.4829 9.4912
3 5.2156 5.2003 5.1845 5.1764 5.1681 5.1597 5.1512 5.1425 5.1337
4 3.8955 3.8704 3.8443 3.8310 3.8174 3.8036 3.7896 3.7753 3.7607
5 3.2682 3.2380 3.2067 3.1905 3.1741 3.1573 3.1402 3.1228 3.1050
6 2.9047 2.8712 2.8363 2.8183 2.8000 2.7812 2.7620 2.7423 2.7222
7 2.6681 2.6322 2.5947 2.5753 2.5555 2.5351 2.5142 2.4928 2.4708
8 2.5020 2.4642 2.4246 2.4041 2.3830 2.3614 2.3391 2.3162 2.2926
9 2.3789 2.3396 2.2983 2.2768 2.2547 2.2320 2.2085 2.1843 2.1592
10 2.2841 2.2435 2.2007 2.1784 2.1554 2.1317 2.1072 2.0818 2.0554
12 2.1474 2.1049 2.0597 2.0360 2.0115 1.9861 1.9597 1.9323 1.9036
15 2.0171 1.9722 1.9243 1.8990 1.8728 1.8454 1.8168 1.7867 1.7551
20 1.8924 1.8449 1.7938 1.7667 1.7382 1.7083 1.6768 1.6433 1.6074
24 1.8319 1.7831 1.7302 1.7019 1.6721 1.6407 1.6073 1.5715 1.5327
30 1.7727 1.7223 1.6673 1.6377 1.6065 1.5732 1.5376 1.4989 1.4564
40 1.7146 1.6624 1.6052 1.5741 1.5411 1.5056 1.4672 1.4248 1.3769
60 1.6574 1.6034 1.5435 1.5107 1.4755 1.4373 1.3952 1.3476 1.2915
120 1.6012 1.5450 1.4821 1.4472 1.4094 1.3676 1.3203 1.2646 1.1926
∞ 1.5458 1.4871 1.4206 1.3832 1.3419 1.2951 1.2400 1.1686 1.0000
P(F)=0.05
m 12 15 20 24 30 40 60 120 ∞
n 1 243.9060 245.9499 248.0131 249.0518 250.0951 251.1432 252.1957 253.2529 254.3148
2 19.4125 19.4291 19.4458 19.4541 19.4624 19.4707 19.4791 19.4874 19.4957
3 8.7446 8.7029 8.6602 8.6385 8.6166 8.5944 8.5720 8.5494 8.5264
4 5.9117 5.8578 5.8025 5.7744 5.7459 5.7170 5.6877 5.6581 5.6281
5 4.6777 4.6188 4.5581 4.5272 4.4957 4.4638 4.4314 4.3985 4.3650
6 3.9999 3.9381 3.8742 3.8415 3.8082 3.7743 3.7398 3.7047 3.6689
7 3.5747 3.5107 3.4445 3.4105 3.3758 3.3404 3.3043 3.2674 3.2297
8 3.2839 3.2184 3.1503 3.1152 3.0794 3.0428 3.0053 2.9669 2.9276
9 3.0729 3.0061 2.9365 2.9005 2.8637 2.8259 2.7872 2.7475 2.7067
10 2.9130 2.8450 2.7740 2.7372 2.6996 2.6609 2.6211 2.5801 2.5379
12 2.6866 2.6169 2.5436 2.5055 2.4663 2.4259 2.3842 2.3410 2.2962
15 2.4753 2.4034 2.3275 2.2878 2.2468 2.2043 2.1601 2.1141 2.0658
20 2.2776 2.2033 2.1242 2.0825 2.0391 1.9938 1.9464 1.8963 1.8432
24 2.1834 2.1077 2.0267 1.9838 1.9390 1.8920 1.8424 1.7896 1.7330
30 2.0921 2.0148 1.9317 1.8874 1.8409 1.7918 1.7396 1.6835 1.6223
40 2.0035 1.9245 1.8389 1.7929 1.7444 1.6928 1.6373 1.5766 1.5089
60 1.9174 1.8364 1.7480 1.7001 1.6491 1.5943 1.5343 1.4673 1.3893
120 1.8337 1.7505 1.6587 1.6084 1.5543 1.4952 1.4290 1.3519 1.2539
∞ 1.7522 1.6664 1.5705 1.5173 1.4591 1.3940 1.3180 1.2214 1.0000
P(F)=0.01
m 12 15 20 24 30 40 60 120 ∞
n 1 6106.3207 6157.2846 6208.7302 6234.6309 6260.6486 6286.7821 6313.0301 6339.3913 6365.8685
2 99.4159 99.4325 99.4492 99.4575 99.4658 99.4742 99.4825 99.4908 99.4992
3 27.0518 26.8722 26.6898 26.5975 26.5045 26.4108 26.3164 26.2211 26.1251
4 14.3736 14.1982 14.0196 13.9291 13.8377 13.7454 13.6522 13.5581 13.4631
5 9.8883 9.7222 9.5526 9.4665 9.3793 9.2912 9.2020 9.1118 9.0204
6 7.7183 7.5590 7.3958 7.3127 7.2285 7.1432 7.0567 6.9690 6.8800
7 6.4691 6.3143 6.1554 6.0743 5.9920 5.9084 5.8236 5.7373 5.6495
8 5.6667 5.5151 5.3591 5.2793 5.1981 5.1156 5.0316 4.9461 4.8588
9 5.1114 4.9621 4.8080 4.7290 4.6486 4.5666 4.4831 4.3978 4.3105
10 4.7059 4.5581 4.4054 4.3269 4.2469 4.1653 4.0819 3.9965 3.9090
12 4.1553 4.0096 3.8584 3.7805 3.7008 3.6192 3.5355 3.4494 3.3608
15 3.6662 3.5222 3.3719 3.2940 3.2141 3.1319 3.0471 2.9595 2.8684
20 3.2311 3.0880 2.9377 2.8594 2.7785 2.6947 2.6077 2.5168 2.4212
24 3.0316 2.8887 2.7380 2.6591 2.5773 2.4923 2.4035 2.3100 2.2107
30 2.8431 2.7002 2.5487 2.4689 2.3860 2.2992 2.2079 2.1108 2.0062
40 2.6648 2.5216 2.3689 2.2880 2.2034 2.1142 2.0194 1.9172 1.8047
60 2.4961 2.3523 2.1978 2.1154 2.0285 1.9360 1.8363 1.7263 1.6006
120 2.3363 2.1915 2.0346 1.9500 1.8600 1.7628 1.6557 1.5330 1.3805
∞ 2.1847 2.0385 1.8783 1.7908 1.6964 1.5923 1.4730 1.3246 1.0000
P(F)=0.005
m 12 15 20 24 30 40 60 120 ∞
n 1 24426.3662 24630.2051 24835.9709 24939.5653 25043.6277 25148.1532 25253.1369 25358.5734 25464.4593
2 199.4163 199.4329 199.4496 199.4579 199.4663 199.4746 199.4829 199.4912 199.4996
3 43.3874 43.0847 42.7775 42.6222 42.4658 42.3082 42.1494 41.9895 41.8283
4 20.7047 20.4383 20.1673 20.0300 19.8915 19.7518 19.6107 19.4684 19.3247
5 13.3845 13.1463 12.9035 12.7802 12.6556 12.5297 12.4024 12.2737 12.1435
6 10.0343 9.8140 9.5888 9.4742 9.3582 9.2408 9.1219 9.0015 8.8793
7 8.1764 7.9678 7.7540 7.6450 7.5345 7.4224 7.3088 7.1933 7.0760
8 7.0149 6.8143 6.6082 6.5029 6.3961 6.2875 6.1772 6.0649 5.9506
9 6.2274 6.0325 5.8318 5.7292 5.6248 5.5186 5.4104 5.3001 5.1875
10 5.6613 5.4707 5.2740 5.1732 5.0706 4.9659 4.8592 4.7501 4.6385
12 4.9062 4.7213 4.5299 4.4314 4.3309 4.2282 4.1229 4.0149 3.9039
15 4.2497 4.0698 3.8826 3.7859 3.6867 3.5850 3.4803 3.3722 3.2602
20 3.6779 3.5020 3.3178 3.2220 3.1234 3.0215 2.9159 2.8058 2.6904
24 3.4199 3.2456 3.0624 2.9667 2.8679 2.7654 2.6585 2.5463 2.4276
30 3.1787 3.0057 2.8230 2.7272 2.6278 2.5241 2.4151 2.2998 2.1760
40 2.9531 2.7811 2.5984 2.5020 2.4015 2.2958 2.1838 2.0636 1.9318
60 2.7419 2.5705 2.3872 2.2898 2.1874 2.0789 1.9622 1.8341 1.6885
120 2.5439 2.3727 2.1881 2.0890 1.9840 1.8709 1.7469 1.6055 1.4311
∞ 2.3583 2.1868 1.9998 1.8983 1.7891 1.6691 1.5325 1.3637 1.0000
INDEX
Absolute change 8, 322, 323

Absolute frequency 146
Aggregate index 8, 331
Aggregate price index 8, 333
Aggregate volume (quantity) index 8, 335
Analysis of variance 259
ANOVA 7, 10, 243, 259, 261, 262, 270, 271, 511, 514, 515, 516, 517,
518, 519
Arithmetic mean 5, 73, 93, 444, 512
Basic indices 375

Bayes theorem 403, 455
Bernoulli random variable 408
Binomial distribution 9, 408, 409, 410, 411, 456
Census 5, 22
Chain indices 328
Chi-square 10, 443, 525, 545, 546, 576, 577, 578
Class 5, 48, 49
Coefficient of determination 253, 289, 301
Coefficient of variation 6, 101, 198, 202, 210, 222
Complement of event 394
Confidence interval 127, 480, 481, 482, 484, 485, 487, 536, 537, 538,
539, 540, 541, 554
Contingency table 401, 452, 526
Continuous variables 20, 40, 48
Correlation 227, 237, 238, 247, 253, 262, 270, 287, 511
Cumulative frequency 47
Cumulative function 405, 406
Data 5, 11, 14, 15, 21, 58

Density function 440
Descriptive 14, 71, 125, 483, 561
Discrete variables 20
Distribution 204, 205, 222, 405, 446
Event 400
Factorial variance 515

F distribution 407, 446, 447
Fisher-Snedecor’s distribution 447
Forecasting 248, 254, 331, 359, 561
Frequency distribution 35, 40, 405
Gaussian function 423

Geometric mean 6, 79, 145
Gini coefficient 120, 121, 122, 133, 141, 142, 143, 195, 196
Histogram 61, 221

Hypergeometric distribution 407, 419
Index of values 8, 332, 333

Inferential statistics 14, 473, 474, 475
Joint event 394
Kurtosis 6, 113, 127
Linear trend 351

Lorenz's curve 195, 212
Measures of central tendency 197, 200

Measures of dispersion 198, 201
Median 6, 16, 81, 87, 89, 104, 105, 108, 126, 197, 200, 222
Middle absolute distance 92, 95, 109
Mode 6, 83, 88, 90, 104, 106, 108, 198, 200, 222
Model 262, 270, 353, 528
Normal distribution 429, 447, 448, 458, 546
One-tailed test 498, 499, 500

Outliers 5, 57, 205, 274
Point estimator 475

Poisson distribution 9, 407, 413, 414, 415, 416, 417, 457, 531, 532
Population 5, 21, 220, 498, 499, 500, 538
Probability 8, 9, 391, 394, 398, 399, 400, 401, 405, 407, 408, 410, 414,
420, 443, 449, 450, 534, 561, 567, 568, 569, 570, 571
Probability distribution 8, 9, 408, 414, 567, 568, 569, 570, 571
Quartile deviation 220

Quartile range 220
Quartiles 6, 84
Random variable 446

Regression 240, 242, 243, 253, 261, 270, 272, 290, 301, 561
Relative change 8, 322, 323
Relative frequency 56
Residual variance 515
Sample 5, 21, 24, 34, 219, 220, 224, 260, 306, 393, 474, 475, 481, 507,
511, 529, 534, 539, 557
Sample bias 474
Sample size 507, 529
Sample space 393

Sampling 5, 22, 23, 24, 25, 26, 474
Sampling error 474
Scatter plot 61, 252
Simple event 394
Skewness 127
Spearman’s rank 249, 291, 293, 294, 295, 296, 304
Standard deviation 9, 97, 101, 109, 126, 222, 478, 480
Standard error 126, 249, 259, 263, 540
Statistical test 7, 259, 261
Statistical unit 21
Statistics 3, 4, 9, 13, 14, 15, 16, 17, 36, 71, 227, 229, 231, 233, 234, 239,
240, 242, 243, 245, 247, 248, 261, 270, 391, 400, 473, 474, 492, 493,
540, 546, 547, 561, 562, 563, 567, 571, 573, 575, 585
Stratification 28
Stratified sample 5, 28
Student distribution 441
Symmetry 6, 110
Total variance 514

Trend 7, 8, 313, 315, 349, 355
Trend isolation 8, 355
Two-tailed test 498, 499, 500
Type I error 127, 272, 496, 539
Type II error 496
Variables 123
Variance 95, 100, 127, 222, 273, 406, 407, 410, 416, 511, 517, 546, 557
Z value 6, 101, 173, 198, 202