
Module Code and title: (E304) INITIATION TO STATISTICS
Faculty of Education
Department: Languages and Humanities
Level and Mode: Level III; Day, Weekend and Holidays Modes
Lecturer’s name: Mweruli T. Fidèle (MSc.)
KIBOGORA POLYTECHNIC
A.Y 2021-2022
Outline

1 Introduction
2 The Sigma (∑) notation
3 Chapter 1 Exercises
4 Graphical representation of data
5 DATA DESCRIPTION: Measures of location and dispersion
6 Measures of skewness
7 INFERENTIAL STATISTICS: Probability
8 A brief overview of the classical linear regression model
Statistics is the study of how to collect, organize, analyze, and interpret data. It is employed in many fields, e.g. business, education, psychology, pharmacy, government etc. We are exposed to statistics in everyday events such as planning a budget, measuring the performance of a classroom of students, gambling, sports games etc.
Population
The conceptual totality of objects under consideration is called the population. In other words, population refers to the group of all subjects being studied. The size of the population is denoted by N. It is frequently very large. A population may be finite or infinite. A finite population has a limit or fixed size, i.e. it is a countable population. Example 1.1 lists finite populations. An infinite population is impossible to measure since it has no limits, i.e. it is an uncountable population (see Example 1.2).
Introduction to Statistics

Parameter
A number that describes a characteristic of a population is called a parameter. Greek letters (for instance µ, σ, ρ, σ²) are normally used to denote population parameters.
Introduction to statistics: Population vs. Sample

Population parameter vs. Sample Statistic: Example

Example 1.3:
The 49 numbers from which the lotto numbers are drawn constitute a population of 49 items. The six numbers that are randomly drawn are a sample from the population. The largest number in the population is 49, and it is a population parameter, which could be denoted by a Greek letter, for instance α; in this case α = 49.
If the sample consists of the numbers 5, 21, 3, 42, 10 and 11, the sample maximum is 42 and could be denoted by the Roman letter a, thus a = 42.
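The distinction in Example 1.3 can be sketched in a few lines of Python. This is only an illustration: the variable names are made up here, and the population is assumed to be the whole numbers 1 to 49.

```python
# Population: the 49 lotto numbers (assumed here to be 1, 2, ..., 49).
population = list(range(1, 50))

# Sample: the six numbers drawn in Example 1.3.
sample = [5, 21, 3, 42, 10, 11]

# A parameter describes the population; a statistic describes the sample.
alpha = max(population)  # population maximum (parameter)
a = max(sample)          # sample maximum (statistic)

print(f"N = {len(population)}, alpha = {alpha}, a = {a}")
```

A different draw would generally give a different value of a, while α stays fixed at 49.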
Purpose of statistics

The field of statistics is concerned with:
1 Descriptive statistics
The collection, arrangement, summarization and presentation of data. Methods used are graphs, tables and numerical measures (see Sections 2 and 3). We determine or estimate population characteristics and distinguish between a population parameter and a sample statistic.
2 Inferential statistics
The process of inferring an estimate, prediction or decision about a population based on sample data. It is easier and cheaper to take a representative sample from the population of interest than to study the whole population. We draw conclusions or make estimates about the population on the basis of information provided by the sample. However, such conclusions and estimates are not always going to be correct.
Purpose of Statistics

For this reason, a measure of reliability is built into the statistical inference. Two such measures are the confidence level and the significance level. The sample should be representative of the population. One way of trying to ensure this is simple random sampling. A simple random sample is chosen in such a manner that every element (observation or object) has an equal chance of being selected. An example of simple random sampling is selecting a playing card from a shuffled deck. Another example is to assign a number to each member of the population, so that the population becomes a set of numbers. Then, using a random number table, we can choose a sample of the desired size.
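The numbering approach can be sketched in Python, with the standard library's random number generator standing in for a random number table. The function name, the population size of 40 and the sample size of 5 are illustrative assumptions, not values taken from the module.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw a simple random sample of member numbers 1..population_size."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    members = range(1, population_size + 1)
    # rng.sample draws without replacement; each member is equally likely.
    return rng.sample(members, sample_size)

# Hypothetical case: 40 numbered members, sample of size 5.
chosen = simple_random_sample(population_size=40, sample_size=5, seed=1)
print(chosen)
```

Because the draw is made without replacement, no member can appear twice in the sample.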
Definitions

If, on observing a characteristic, we find that it takes on different values in different persons or objects, we call the characteristic a variable, for example age, gender, marks of students etc. Not all students achieve the same mark and the marks vary from student to student, thus we call it a variable. A variable is denoted by a symbol, usually a capital letter of the alphabet, such as X, Y, T etc. Values are all the possible observations of a variable. Data are the observed values of a variable. Data that are randomly collected are called data of a random variable (RV).
The actual observations of a random variable are usually denoted by the corresponding lower case letters.
Definitions: Examples

Discrete random variable (RV)

A discrete random variable is a variable that can assume only certain distinct values. Data described by a discrete variable are called discrete data. If a variable can assume only one value it is called a constant. Discrete values are often integers, but in certain situations fractional values are also discrete.
Discrete Random Variable: Examples

Continuous random variable (RV)

These types of random variables can assume any value within a specified interval. Data described by a continuous variable are called continuous data. The values recorded depend on the accuracy of measurement and the sensitivity of the measuring instrument.
Example 1.10:
a) The weight of a bag of potatoes can be 5 kg or 5.05 kg. It can also assume any value between these two numbers.
b) Sets of measurement data, such as length, mass, yield per hectare etc.
Quantitative variable and Qualitative variable

Nominal-scaled data

Nominal-scaled data: Examples

Ordinal-scaled data

Ordinal-scaled data: Examples

Interval-scaled data

Interval-scaled data: examples

Ratio-scaled data

The Sigma (∑) notation
Sigma (∑) notation
Sigma (∑) notation: Examples
Sigma (∑) notation: Summation of a constant
Summation of a constant: Examples

An extension to the sigma notation concept

As an extension to the concept, note that
Chapter 1 Exercises
1 Forty students were enrolled in an English literature class. The lecturer asked five students who usually sit in the back of the classroom if they would like A Tale of Two Cities as the next class reading assignment. Three of the five students replied yes.
a) Identify the population and the sample in the situation.
b) Is this likely to be a representative sample? If not, why not?

2 Is the random variable in Example 1.4(a) and Example 1.4(b) the same? What is it?

3 Table 1.8 is an example of student data. In this table the random variables are age, weight etc., and the entries in the table are the realizations of the variables. Classify each of the following variables, first, as qualitative or quantitative and second, as a nominal, ordinal, interval, or ratio scale.

4 For each of the following, indicate whether the appropriate variable would be qualitative or quantitative. If you identify the variable as qualitative, indicate whether it would be nominal or ordinal. If you identify the variable as quantitative, indicate whether it would be discrete or continuous.
a) Whether you own a “Panasonic”, “LG”, “Telefunken” or “other” television set.
b) Your status as either “first-year”, “second-year”, “third-year” or “post-graduate”.
c) The number of people who attended the first-year orientation programme in 2010.
d) The price of your most recent haircut.
e) Distance (in km) between KP and Tyazo station.
f) The number of students on campus who live in residence.

5 For each of the following, indicate the type of data (scale of measurement) that best describes the information.
a) In mid June 2009, a corporation had approximately 39,000 employees.
b) Yesterday’s newspaper reported that the previous day’s highest temperature in Cape Town was 27 degrees Celsius.
c) The news broadcast at 19:00 every weekday.
d) An individual respondent answered “yes” when asked if TV contributes to violence.
Graphical representation of data
How do we construct a frequency distribution (grouped data)?
Table 2.1 represents nothing more than a jumble of numbers, called raw data. It is not very interesting or informative. One way of organizing the data would be to summarize and group it in a frequency table. A frequency table counts the number of observations that fall into each of a series of class intervals that cover the complete range of observations. The class intervals are defined by lower and upper limits. The class frequency is the number of observations whose values lie between the lower and upper limits. The advantage of the frequency table is that one obtains a clear “overall” picture of the data by summarizing discrete or continuous data into class intervals, each with a corresponding frequency. Consider the data set of the monthly salaries of company employees (in rands) (see Table 2.1).
Graphical representation of data: Frequency distribution

Note:
A data set that has not been organized numerically is called raw data.
Graphical representation of data: Frequency distribution

An arrangement of raw numerical data in descending or ascending order of magnitude is called an array. Class intervals do not overlap. By grouping the data into class intervals, much of the original detail in the array is lost. We will group the list of 50 observations (see Table 2.1) into 6 groups. By doing that we will lose some information. There is no single “correct” frequency table for a given set of data.
How do we construct a frequency distribution (grouped data)?
Consider the monthly salaries of company employees (see Table 2.1).
Step 1: Determine the data range
The range is the difference between the maximum and minimum values, i.e.
$$R = M - m = 2290 - 1384 = 906 \tag{1}$$
Steps in constructing a Frequency distribution table

where M denotes the maximum value and m denotes the minimum value of the data set.
Step 2: Decide the number of classes, k
Find the smallest integer k such that
$$2^k \ge n \tag{2}$$
(Sturges' Rule), where n is the sample size. In our case the sample size is n = 50.
$2^5 = 32 < n$ and $2^6 = 64 > n$, so k = 6. The number of classes, k, is 6 or more.
Step 3: Determine the class width, c
$$c = \frac{R}{k} = \frac{906}{6} = 151 \tag{3}$$
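Steps 1 to 3 can be checked with a short Python sketch. The function name is an illustrative choice. The numbers (n = 50, maximum 2290, minimum 1384) are the ones quoted in the text.

```python
def number_of_classes(n):
    """Smallest integer k with 2**k >= n (the rule used in Step 2)."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k

n = 50                           # sample size
maximum, minimum = 2290, 1384    # largest and smallest salary

data_range = maximum - minimum   # Step 1: R = M - m
k = number_of_classes(n)         # Step 2: smallest k with 2**k >= n
width = data_range / k           # Step 3: c = R / k, before rounding

print(data_range, k, width)
```

The computed width is only a starting point and is normally rounded to a convenient value.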
Steps in constructing a Frequency distribution table

Round the result to a convenient value (e.g. a multiple of 2, 5 or 10) that is close to the calculated value. Select either 150 or 200. Let the class width be c = 200.
Step 4: Determine the class intervals (see Table 2.2)
The first class interval should contain the minimum value.
For the first class interval: the lower limit should be smaller than or equal to the minimum value. The minimum value is 1384. Select lower limit = 1200.
Upper limit = lower limit + class width = 1200 + 200 = 1400
The first class interval is denoted as 1200 − 1400. The first class interval is defined as amounts that are more than R1,200 but less than R1,400.
Frequency distribution

For the second class interval:
The lower limit of the 2nd class interval is the upper limit of the first class interval. The upper limit of the 2nd class interval is the lower limit of the 2nd interval plus the class width. The second class interval is denoted as “1400-1600”. The second class interval is defined as amounts that are more than R1,400 but less than R1,600. Continue to determine the class intervals until the final class interval.
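The procedure of Step 4 can be sketched as a loop that keeps adding the class width until the maximum value is covered. The function name is illustrative, and the sketch assumes that a value equal to an upper limit belongs to the next class, which the text does not state explicitly.

```python
def class_intervals(lower_limit, width, maximum):
    """Return (lower, upper) pairs until the maximum value is covered."""
    intervals = []
    lower = lower_limit
    while True:
        upper = lower + width
        intervals.append((lower, upper))
        if maximum < upper:  # the final class must contain the maximum
            break
        lower = upper        # next class starts where this one ends
    return intervals

# Values from the salary example: lower limit 1200, width 200, maximum 2290.
intervals = class_intervals(lower_limit=1200, width=200, maximum=2290)
for lower, upper in intervals:
    print(f"{lower} - {upper}")
```

With these inputs the loop stops at the class 2200 - 2400, giving six classes in total.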
Frequency distribution

Note:
1 The class intervals are chosen in such a way that there can be no doubt about the placing of an entry.
2 A class interval with no upper or lower class limit (e.g. for monthly salary data “R2,200 and over”) is called an open class interval.
3 The class width is the difference between the upper and lower limits.
4 Usually all the class intervals of the frequency distribution have equal class widths.
5 The number of class intervals is usually between 5 and 20, depending on the data.
6 The final class interval should contain the maximum value (see Table 2.2).
Frequency distribution

Step 5: List and count the number of values that fall into each class interval. These counts are called frequencies (see Table 2.3 to Table 2.5). For example, the class interval 1200 − 1400 contains two values, 1384 and 1385. Thus the frequency is 2. The list of classes, together with their frequencies, is called a frequency table or frequency distribution.

These values are arranged in ascending order in Table 2.4.
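The counting in Step 5 can be sketched as follows. The salary list is hypothetical (only 1384 and 1385 come from the text, since Table 2.1 is not reproduced here), and the sketch assumes each class includes its lower limit and excludes its upper limit.

```python
def frequency_table(values, lower_limit, width, n_classes):
    """Count how many values fall into each class [lower, upper)."""
    counts = [0] * n_classes
    for x in values:
        index = int((x - lower_limit) // width)  # which class x belongs to
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} lies outside the class intervals")
        counts[index] += 1
    return counts

# Hypothetical salaries; only 1384 and 1385 are taken from the text.
salaries = [1384, 1385, 1450, 1599, 1600, 1755, 1990, 2290]
counts = frequency_table(salaries, lower_limit=1200, width=200, n_classes=6)

# Print each class with one stroke per entry and the resulting frequency.
for i, f in enumerate(counts):
    lower = 1200 + i * 200
    print(f"{lower} - {lower + 200}: {'|' * f} ({f})")
```

The frequencies always add up to the number of observations, which is a quick check that no value was missed or counted twice.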
Frequency distribution table

An alternative way of presenting Table 2.4 is to draw a tally frequency table. In the tally column (score column) each stroke represents an entry. The fifth stroke is drawn through the previous four as shown, which makes it easy to add the tally and obtain the frequency. This would appear in a frequency table as shown in Table 2.5.
Frequency Distribution

Class mark (also called class midpoint)
It is obtained by adding the lower and upper class limits and dividing by two (see Table 2.6).
Frequency Distribution

$$\text{class mark} = \frac{\text{lower limit} + \text{upper limit}}{2} \tag{4}$$
The class mark and the class limits are used to construct histograms, frequency polygons and ogives.
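As a small sketch, formula (4) can be applied to the six salary class intervals that start at 1200 with a class width of 200. The function and variable names are illustrative.

```python
def class_mark(lower, upper):
    """Midpoint of a class interval: (lower limit + upper limit) / 2."""
    return (lower + upper) / 2

# The six salary classes: 1200-1400, 1400-1600, ..., 2200-2400.
intervals = [(1200 + i * 200, 1400 + i * 200) for i in range(6)]
marks = [class_mark(lower, upper) for lower, upper in intervals]
print(marks)
```

Because the classes have equal width, consecutive class marks are also 200 apart.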
Frequency distribution table: Examples

Table 2.7 shows the raw data of weights (kg) of 50 pieces of luggage.

Question: Construct a frequency distribution table of this data set.
Solution:
Step 1: Calculate the range
Range = maximum value − minimum value = 20 − 7 = 13
Step 2: Decide the number of classes, k
Let n be the number of observations (or sample size), thus n = 50.
Frequency distribution table: Examples

Find the smallest integer k such that $2^k \ge n$ (Sturges' Rule). Note that $2^5 = 32$, which is less than 50, and $2^6 = 64$, which is greater than 50. Thus, the number of classes is 6 or more.
Step 3: Determine the class width, c
$$c = \frac{R}{k} = \frac{13}{6} = 2.1667$$
Select a more convenient class width, say 2.
Step 4: Determine the class limits (see Table 2.8)
Frequency distribution: Examples

Step 5:
Count the number of observations that fall into each class interval (see Table 2.8).
Graphical presentation of data: Histogram

Histogram (also called a frequency histogram):
A graphical representation of a frequency distribution. A histogram is useful for large data sets involving quantitative variables. The vertical axis is labeled with the frequencies and the horizontal axis is labeled with the class intervals. It is a set of rectangles, and each base is centered at the class mark. There are no gaps between the rectangles, because the data are continuous (see Figure 2.1).
Graphical presentation of data: Histogram

Figure 2.1 is a histogram of the monthly salaries of employees from Table 2.6 with
x-axis: monthly salaries
y-axis: frequency
Graphical presentation of data: Histogram

The interval end points are 1200, 1400 etc. and the class midpoints are 1300; 1500; 1700; 1900; 2100; 2300. The width of the base of each rectangle is equal to 200 units. The histogram is useful as a crude summary of the data. Certain features of the population represented by the data are immediately evident. For instance, the majority of employees earned salaries between 1400 and 1800. Employees earning more than 2400 are rare. By simply changing the vertical scale from frequency to relative frequency (or relative frequency (%)) and keeping exactly the same diagram, a histogram can be changed into a relative frequency histogram or percentage histogram.
Mode
The observation with the greatest frequency (explained in Section 3). A modal class interval is the class interval with the highest frequency (largest number of observations). A unimodal histogram has a single peak.
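Finding the modal class interval from a frequency table can be sketched as below. The frequencies are hypothetical and are not the values of Table 2.6, and the function name is an illustrative choice.

```python
def modal_classes(intervals, frequencies):
    """Return every class interval that has the highest frequency."""
    highest = max(frequencies)
    return [iv for iv, f in zip(intervals, frequencies) if f == highest]

# Six salary classes of width 200 starting at 1200.
intervals = [(1200 + i * 200, 1400 + i * 200) for i in range(6)]
frequencies = [2, 14, 16, 10, 6, 2]   # hypothetical, not from Table 2.6

print(modal_classes(intervals, frequencies))
```

The function returns a list because two or more classes can share the highest frequency.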
Unimodal Histogram

The histogram in Figure 2.1 is unimodal. A special type of symmetric unimodal histogram is one that is bell shaped (see Figure 2.2). A bimodal histogram has two peaks (see Figure 2.3).
Bimodal Histogram

Skewness
A skewed histogram is one with a long tail extending to either the right or the left. The former is called positively skewed, and the latter is called negatively skewed (see Figure 2.4). Figure 2.1 has a long tail extending to the right, i.e. it is positively skewed.
Histogram

Incomes of employees in large firms tend to be positively skewed, because there is a large number of relatively low-paid workers and a small number of well-paid executives. The time taken by students to write exams is frequently negatively skewed because few students hand in their exams early; most prefer to reread their papers and hand them in near the end of the scheduled test period. Calculating skewness will be discussed in Section 3.
Skewness

Relative frequency
\[ \text{relative frequency} = \frac{\text{frequency}}{\text{number of observations}} = \frac{f}{n} \tag{5} \]
Relative frequency

Relative frequency can be expressed as a percentage.
\[ \text{relative frequency (\%)} = \frac{f}{n} \times 100 \tag{6} \]
Example 2.3:
Consider Table 2.6. Find the relative frequency (or fraction) of
company employees earning a monthly salary between R1 600
and R2 300.
Solution:
Adjust the frequencies which fall in this interval. The salary of
R2 300 lies within the class interval from R2 200 to R2 400
(rounded off to four decimals). Here is how the adjustment is
performed. The upper limit of this interval (2 400) will now be
changed to 2 300 because the frequencies we are interested in
fall between 1 600 and 2 300, i.e.
Relative frequency

1600 < salary < 2300
Relative frequency

Example 2.5:
Consider Table 2.6. Calculate the relative frequency, expressed
as a percentage, of salaries between R1 400 and R1 600.
Solution:
\[ \frac{18}{50} \times 100 = 36\% \]
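The calculation in Example 2.5 can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequency` and its `as_percentage` flag are hypothetical, and the values f = 18 and n = 50 are the ones quoted in the solution.

```python
def relative_frequency(f, n, as_percentage=False):
    """Return f / n, or 100 * f / n when a percentage is requested."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return 100 * f / n if as_percentage else f / n


# Example 2.5: 18 of the 50 observations fall in the class of interest.
fraction = relative_frequency(18, 50)
percent = relative_frequency(18, 50, as_percentage=True)
print(f"relative frequency = {fraction:.2f} = {percent:.0f}%")
```

The same helper applies to any class once its frequency has been read from the frequency table.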
Frequency polygon
A line graph of class frequencies plotted against class marks. It
can be obtained by connecting the midpoints of the tops of
the rectangles in the histogram. In Figure 2.5, a histogram of
the data and the corresponding frequency polygon are drawn to
the same set of axes, in order to clearly illustrate the
relationship between both ways of depicting the data.
Frequency polygon

Note the extensions of the frequency polygon on either side of
the histogram. Create a lower interval (1000 - <1200) with zero
frequency and an upper interval (2400 - <2600) with zero
frequency to "anchor" the graph. Note also that the sum of the
areas of the rectangles in the histogram equals the total area
bounded by the frequency polygon and the axis. By simply
changing the vertical scale from frequency to relative frequency
(or frequency (%)), and keeping exactly the same diagram, a
frequency polygon can be changed into a relative frequency
polygon or percentage polygon.
Frequency polygon

Cumulative frequency (F)
The sum of the frequencies of all class intervals up to and
including that class interval. For example, the cumulative
frequency for 1400 - <1600 is 0 + 2 + 18 = 20 (see Table 2.9).
Cumulative frequency distributions are useful in answering
questions such as: "How many observations fall below this
upper class limit?"
Frequency polygon

Create an extra class below the first class (for example, for the
monthly salaries the first interval is "below 1200"). Table 2.9
shows cumulative frequencies and is called a cumulative
frequency table or cumulative distribution. Note that if the
sample size increases and the class widths decrease, the ogive
becomes a virtually smooth curve.
Cumulative frequency polygon (Ogive)

A graph showing the cumulative frequencies is called a
cumulative frequency polygon or ogive (pronounced o-jive) (see
Figure 2.6). By changing the vertical scale from frequency to
relative frequency (or frequency (%)), and keeping exactly the
same diagram, a cumulative histogram (or polygon) can be
changed into a relative cumulative histogram (or relative
cumulative polygon) or percentage cumulative histogram (or
percentage cumulative polygon).
Cumulative frequency polygon (Ogive)

Relative cumulative frequency (F (%))
The cumulative frequency divided by the total n. It is
expressed as a percentage (%) when multiplied by 100.
Relative cumulative frequency (F (%)): example

Example 2.6: Consider the monthly salaries for company
employees. Find the relative cumulative frequencies of the
second and third class intervals, expressed as percentages.
Solution: Second class interval
\[ \frac{2}{50} \times 100 = 4\% \]
Third class interval
\[ \frac{20}{50} \times 100 = 40\% \]
Note: see last column in Table 2.9.
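As a rough illustration, cumulative and relative cumulative frequencies can be obtained as running totals. The sketch below uses only the three frequencies quoted in the text (0 for the extra class, then 2 and 18) with n = 50; the variable names are illustrative, and the remaining rows of Table 2.9 are deliberately not reproduced.

```python
from itertools import accumulate

# Frequencies of the extra class and of the next two salary classes, as
# quoted in the text; n = 50 is the total number of employees.
frequencies = [0, 2, 18]
n = 50

cumulative = list(accumulate(frequencies))           # running totals F
cumulative_pct = [100 * F / n for F in cumulative]   # F (%)

for f, F, pct in zip(frequencies, cumulative, cumulative_pct):
    print(f"f = {f:2d}   F = {F:2d}   F(%) = {pct:.0f}%")
```

Plotting the cumulative values against the upper class limits would give the points of the ogive.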
Bar graph (also called rod graph)

A graphical display of categorical (or qualitative) variables. It
is similar to the histogram in that the height of each rectangle
indicates the frequency. However, the base width is not
important. In 2006, first-time entering first-year students
attended the orientation programme. One of the questions asked
during the orientation period was: In which province did you
matriculate? Table 2.16 and Figure 2.7 illustrate the outcome.
Note: The frequencies can only be whole numbers such as 0; 1;
2; etc.
Bar graph (also called rod graph)

R commands to construct all the above plots are described in
the Lecturer's notes that will be covered in the R sessions.
DATA DESCRIPTION: Measures of location and dispersion
This section introduces numerical descriptive measurements,
which explain various characteristics of a population and
sample. For each descriptive measurement we will describe how
to calculate both the population parameter and the sample
statistic. These descriptive measurements are critical to the
development of statistical inferences and are applied mostly to
quantitative data.
Realistically, populations are very large. The formulas describing
the calculation of parameters are not practical and are seldom
used. They are provided to teach the concept and the notation.
The population descriptive measurements have sample
counterparts defined similarly. They are estimates of the
respective population parameters. When a frequency
distribution is constructed (see Section 2), the individual
observations of the random variable are lost and are therefore
unknown.
Measures of central location (central tendency)

A measure of central tendency (also referred to as a measure of
centre or central location) is a summary measure that attempts
to describe a whole set of data with a single value that
represents the middle or centre of its distribution. There are
three main measures of central tendency: the mode, the median
and the mean. Each of these measures gives a different
indication of the typical or central value in the distribution.
Mode
This is one of the measures of central tendency. The mode is
defined as the value within a frequency distribution which has
the highest frequency.
Example: Find the mode of the signing bonuses of eight
football players for a specific year. The bonuses in millions of
dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4 and 10.
Measures of central location: Mode

Solution:
It is helpful to arrange the data in order although it is not
necessary. 10, 10, 10, 11.3, 12.4, 14.0, 18.0, 34.5. Since 10
million occurred 3 times, a frequency larger than any other
number, the mode is 10 million.
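The same result can be checked with Python's standard `statistics` module. This is a minimal sketch, assuming Python 3.8 or later for `multimode`; the variable names are illustrative.

```python
from statistics import multimode

# Signing bonuses of the eight players, in millions of dollars.
bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]

# multimode returns every value that shares the highest frequency, so it
# also copes with bimodal and multimodal data sets.
modes = multimode(bonuses)
print("mode(s):", modes)
```

Because `multimode` returns a list, a data set with several modes yields several entries rather than an error.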
Computation of the mode from grouped data
The mode can also be computed for a grouped frequency
distribution. The following formula makes the task easier:
\[ \text{Mode} = L_1 + \left(\frac{\delta_1}{\delta_1 + \delta_2}\right) \times c \tag{7} \]
where $L_1$ = lower class boundary of the modal class,
Computation of the mode from grouped Data

$\delta_1$ = difference between the highest frequency (that of the
modal class) and the frequency of the class before it,
$\delta_2$ = difference between the highest frequency and the
frequency of the class after it,
$c$ = class interval (class width).
Example: Based on the grouped data below, find the mode.
Solution: Based on the table, we have the following:
$L_1 = 10.5$, $\delta_1 = 6$, $\delta_2 = 2$ and $c = 10$. Substituting
these values in the above formula of the mode we get
\[ \text{Mode} = 10.5 + \left(\frac{6}{6 + 2}\right) \times 10 = 18. \]
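Formula (7) translates directly into a small function. The sketch below is illustrative only: `grouped_mode` is a hypothetical helper name, and it takes the two differences as inputs because the worked example quotes them directly rather than the class frequencies.

```python
def grouped_mode(lower_boundary, delta1, delta2, class_width):
    """Mode of grouped data: L1 + delta1 / (delta1 + delta2) * c."""
    if delta1 + delta2 == 0:
        raise ValueError("delta1 + delta2 must be non-zero")
    return lower_boundary + delta1 / (delta1 + delta2) * class_width


# Values quoted in the worked example: L1 = 10.5, delta1 = 6, delta2 = 2, c = 10.
mode_value = grouped_mode(10.5, 6, 2, 10)
print("grouped mode:", mode_value)
```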
Measures of central location: Mode

A data set that has only one value that occurs with the
greatest frequency is said to be unimodal. If a data set has
two values that occur with the same greatest frequency, both
values are considered to be the mode and the data set is said
to be bimodal. If a data set has more than two values that
occur with the same greatest frequency, each value is used as
the mode, and the data set is said to be multimodal. When no
data value occurs more than once, the data set is said to have
no mode. A data set can have more than one mode or no
mode at all.
Median
The median is calculated by placing all the observations in
order (ascending or descending). The median splits the data
set into two parts that have an equal number of values.
Measures of Central tendency: Median

As a result, half (50%) of the data values are greater than or
equal to the median and half (50%) of the data values are less
than or equal to the median. If the number of observations is
odd, then the median is the central value when they are
arranged in order of magnitude. When the number of
observations is even, the median is the average of the two
middle observations. The median is not as sensitive to extreme
values as the mean is.
Example: Calculate the median of 8; 3; 5; 12; 10.
Solution:
Step 1: Put the data in ascending order
Ordered data: 3, 5, 8, 10, 12
Step 2: Find the median (middle value): 8
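Both cases (odd and even number of observations) can be tried out with `statistics.median`. This is a small sketch: the odd case uses the five values from the example above, and the even case reuses the eight signing bonuses from the mode example purely for illustration.

```python
from statistics import median

# Odd number of observations: the middle value of the ordered data.
odd_data = [8, 3, 5, 12, 10]
median_odd = median(odd_data)

# Even number of observations: the average of the two middle values.
even_data = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
median_even = median(even_data)

print("ordered:", sorted(odd_data), "median:", median_odd)
print("ordered:", sorted(even_data), "median:", round(median_even, 2))
```

The function sorts the data internally, so the observations do not need to be ordered first.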
Computation of the median from grouped Data

The median for grouped data is computed a little differently
from the median for ungrouped data, using the following steps:
Step 1: Construct the cumulative frequency distribution.
Step 2: Identify the class that contains the median. The median
class is the first class whose cumulative frequency is at least
$n/2$.
Step 3: Find the median by using the following formula:
\[ \text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) \times c \tag{8} \]
where $n$ = the total frequency,
$F$ = the cumulative frequency before the median class,
$c$ = the class interval (class width),
$L_m$ = the lower boundary of the median class,
$f_m$ = the frequency of the median class.
Computation of the median from grouped Data

Example: Based on the grouped data below, find the median.
To find the median we first construct the cumulative frequency
distribution as follows:
Since $n/2 = 25$, the median class is the 3rd class, with
$F = 22$, $f_m = 12$, $L_m = 20.5$ and $c = 10$. Therefore, by
applying the formula, we get the following:
Computation of the median from grouped Data: Example

\[ \text{Median} = 20.5 + \left(\frac{\frac{50}{2} - 22}{12}\right) \times 10 = 23 \]
Thus, 25 persons take less than 23 minutes to travel to work
and another 25 persons take more than 23 minutes to travel to
work.
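Step 3 of the procedure can be sketched as a short function. The name `grouped_median` and its parameter names are hypothetical; the inputs are the quantities quoted in the worked example (n = 50, F = 22, f_m = 12, L_m = 20.5, c = 10), since the underlying frequency table is not reproduced here.

```python
def grouped_median(lower_boundary, n, cum_freq_before, class_freq, class_width):
    """Median of grouped data: L_m + ((n / 2 - F) / f_m) * c."""
    if class_freq <= 0:
        raise ValueError("the median class must have a positive frequency")
    return lower_boundary + (n / 2 - cum_freq_before) / class_freq * class_width


median_minutes = grouped_median(20.5, n=50, cum_freq_before=22,
                                class_freq=12, class_width=10)
print("grouped median (minutes):", median_minutes)
```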
Mean: The arithmetic mean
This is commonly known as the average or mean. It is obtained
by summing up the values given and dividing the total by the
number of observations.
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{9} \]
where $n$ represents the total number of observations in the
sample.
Mean: The arithmetic mean

For a population, the Greek letter $\mu$ (mu) is used for the mean.
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{10} \]
where $N$ denotes the population size.
Example:
The following data represent the number of days off per year
for a sample of individuals selected from nine different
countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30.
Solution:
\[ \bar{X} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} \approx 30.7 \text{ days} \]
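The days-off example can be reproduced in a couple of lines of Python. This is a plain sketch of formula (9), with illustrative variable names.

```python
# Number of days off per year for the nine sampled individuals.
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

total = sum(days_off)            # numerator of formula (9)
n = len(days_off)                # number of observations
mean_days = total / n

print(f"sum = {total}, n = {n}, mean = {mean_days:.1f} days")
```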
Computation of the mean from grouped Data

For grouped data the mean is given by
\[ \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum f_i} \tag{11} \]
where in the above formula the $X_i$ are now the midpoints of
each class.
Example: The data represent the number of miles run during
one week for a sample of 20 runners. Find the mean.
Weighted Mean

The type of mean that considers an additional factor is called
the weighted mean, and it is used when the values are not all
equally represented.
To find the weighted mean of a variable (for instance X) we
multiply each value by its corresponding weight and divide
the sum of the products by the sum of the weights.
\[ \bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum w_i} \tag{12} \]
where $w_1, w_2, \cdots, w_n$ are the weights and $X_1, X_2, \cdots, X_n$ are
the values.
Example: A student got an A in English Composition I (3
credits), a C in Introduction to Psychology (3 credits), a B in
Biology I (4 credits), and a D in Physical Education (2 credits).
Assuming A = 4 grade points, B = 3 grade points, C = 2 grade
points, D = 1 grade point, and F = 0 grade points, find the
student's grade point average.
Weighted Mean: Example

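One way to work through the grade point average example is sketched below in Python, treating the credits as the weights in formula (12). The dictionary and variable names are illustrative.

```python
# Grade points assumed in the example: A = 4, B = 3, C = 2, D = 1, F = 0.
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# (course, grade, credits); the credits act as the weights w_i.
courses = [
    ("English Composition I", "A", 3),
    ("Introduction to Psychology", "C", 3),
    ("Biology I", "B", 4),
    ("Physical Education", "D", 2),
]

weighted_sum = sum(credits * grade_points[grade] for _, grade, credits in courses)
total_credits = sum(credits for _, _, credits in courses)
gpa = weighted_sum / total_credits

print(f"GPA = {weighted_sum}/{total_credits} = {gpa:.2f}")
```

Giving every course the same weight would reduce this to the ordinary arithmetic mean of the grade points.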
Harmonic mean
This is a measure of central tendency which is used to
determine the average growth rates for natural economies. It is
denoted by HM and defined as the reciprocal of the average of
the reciprocals of all the values.
\[ HM = \frac{1}{\frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \cdots + \frac{1}{x_n}\right)} \tag{13} \]
Harmonic mean: Example

The economic growth rates of five countries were given as
20%, 15%, 25%, 18% and 5%. Calculate the harmonic mean.
\[ HM = \frac{1}{\frac{1}{5}\left(\frac{1}{20} + \frac{1}{15} + \frac{1}{25} + \frac{1}{18} + \frac{1}{5}\right)} = \frac{1}{0.0824} \approx 12.13\% \]

Exercise here!
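The arithmetic in formula (13) is easy to get wrong by hand, so a quick numerical check can help. The sketch below computes the harmonic mean of the five growth rates twice: once term by term, and once with `statistics.harmonic_mean` from the Python standard library. Variable names are illustrative.

```python
from statistics import harmonic_mean

growth_rates = [20, 15, 25, 18, 5]   # percentages from the example

# Term by term: the reciprocal of the average of the reciprocals.
n = len(growth_rates)
mean_of_reciprocals = sum(1 / x for x in growth_rates) / n
hm_manual = 1 / mean_of_reciprocals

# The same quantity from the standard library.
hm_library = harmonic_mean(growth_rates)

print(f"average of reciprocals = {mean_of_reciprocals:.4f}")
print(f"harmonic mean = {hm_manual:.2f}%")
```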
Measures of variation/ Measures of spread

Measures of variation describe how stretched or squeezed
a distribution is. In other words, they show how spread out a
distribution is. Common examples of measures of statistical
dispersion are the range, variance, and standard deviation.
Range
The range is the highest value minus the lowest value, i.e. the
difference between the largest value and the smallest value of
the data set. It is denoted by the letter R and mathematically
expressed by
\[ R = X_L - X_S \tag{14} \]
where $X_L$ denotes the largest observation of the data set and
$X_S$ denotes the smallest observation of the data set. In some
other textbooks this measure of spread is denoted as
\[ R = M - m \tag{15} \]
where $M$ denotes the maximum value of the data set and $m$
denotes the minimum value.
Measures of variation/ Measures of spread: Variance

The variability can also be defined in terms of how close the
scores in the distribution are to the middle of the distribution.
Using the mean as the measure of the middle of the
distribution, the variance is defined as the average squared
difference of the scores from the mean. The variance can be
computed as follows:
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \tag{16} \]
where $\sigma^2$: population variance, $X$: individual value,
$\mu$: population mean, and $N$: population size.
If the variance in a sample is used to estimate the variance in a
population, then the previous formula underestimates the
variance and the following formula should be used:
Measures of variation: Sample Variance

\[ S^2 = \frac{\sum \left(X - \bar{X}\right)^2}{n - 1} \tag{17} \]
where $S^2$: sample variance, $X$: individual value, $\bar{X}$: sample
mean, and $n$: sample size.
It is worth noting that if the values are near the mean, the
variance will be small. In contrast, if the values are far from the
mean, the variance will be large. Someone might wonder why
the squared distances are used instead of the actual distances.
One reason is that the sum of the signed distances from the
mean will always be zero.
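The difference between formulas (16) and (17), and the fact that the signed deviations cancel out, can be seen numerically. The sketch below reuses the days-off data from the arithmetic mean example purely as an illustration; `pvariance` divides by N and `variance` divides by n - 1.

```python
from statistics import mean, pvariance, variance

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

x_bar = mean(days_off)
deviations = [x - x_bar for x in days_off]

# The signed deviations cancel out (zero up to floating-point rounding),
# which is why they are squared before averaging.
sum_of_deviations = sum(deviations)
sum_of_squares = sum(d ** 2 for d in deviations)

population_var = pvariance(days_off)   # sum of squares / N
sample_var = variance(days_off)        # sum of squares / (n - 1)

print(f"sum of deviations  = {sum_of_deviations:.10f}")
print(f"sum of squares     = {sum_of_squares:.2f}")
print(f"divide by N        = {population_var:.2f}")
print(f"divide by n - 1    = {sample_var:.2f}")
```

Dividing by n - 1 always gives the larger of the two values, which compensates for the underestimation mentioned above.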
Variance and Standard Deviation for Grouped Data
The procedure for finding the variance and standard deviation
for grouped data is similar to that for finding the mean for
grouped data, and it uses the midpoints of each class.
Variance and Standard Deviation for Grouped Data

For grouped data, the population variance is given by
\[ \sigma^2 = \frac{\sum f_i x_m^2}{\sum f_i} - \left(\frac{\sum f_i x_m}{\sum f_i}\right)^2 \tag{18} \]
which can also be written as
\[ \sigma^2 = \frac{\sum f_i \left(x_m - \bar{x}_m\right)^2}{\sum f_i} \]
where $x_m$ denotes the midpoint of each class and $\bar{x}_m$ the
frequency-weighted mean of the class midpoints.
For grouped data, the sample variance is given by
\[ S^2 = \frac{\sum f_i}{\sum f_i - 1} \left[ \frac{\sum f_i x_m^2}{\sum f_i} - \left(\frac{\sum f_i x_m}{\sum f_i}\right)^2 \right]. \tag{19} \]
Variance and Standard Deviation for Grouped Data

Example: A certain factory produces ball bearings. A sample of
bearings from the factory produced the following results.

Table: Frequency distribution table of the data
Class boundaries (diameter of bearing in mm)   Frequency
91-93                                           4
94-96                                           6
97-99                                           34
100-102                                         40
103-105                                         13
106-108                                         3

Determine the variance and the standard deviation of the
diameter of the sample bearings.
Standard deviation

The standard deviation is the positive square root of the
variance. The symbols for the population and sample standard
deviation are $\sigma$ and $S$ respectively. The reason to find the
square root is that since the distances were squared, the units
of the resultant numbers are the squares of the units of the
original raw data. Finding the square root of the variance puts
the standard deviation in the same units as the raw data. Thus
for the population, the standard deviation is given by
\[ \sigma = +\sqrt{\frac{\sum f_i x_m^2}{\sum f_i} - \left(\frac{\sum f_i x_m}{\sum f_i}\right)^2} \tag{20} \]
and for the sample the standard deviation is given by
\[ S = \sqrt{\frac{\sum f_i}{\sum f_i - 1} \left[ \frac{\sum f_i x_m^2}{\sum f_i} - \left(\frac{\sum f_i x_m}{\sum f_i}\right)^2 \right]}. \tag{21} \]
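As an illustration of the grouped-data formulas, the sketch below applies them to the ball bearing sample from the frequency distribution table. It assumes each class is represented by the midpoint of its stated limits (92, 95, ...); the variable names are illustrative and the printed figures are whatever the formulas give for that table.

```python
from math import sqrt

# Class limits (diameter in mm) and frequencies from the ball bearing table.
class_limits = [(91, 93), (94, 96), (97, 99), (100, 102), (103, 105), (106, 108)]
frequencies = [4, 6, 34, 40, 13, 3]

midpoints = [(low + high) / 2 for low, high in class_limits]   # x_m
n = sum(frequencies)                                           # sum of f_i
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))

grouped_mean = sum_fx / n
population_var = sum_fx2 / n - grouped_mean ** 2   # population variance formula
sample_var = n / (n - 1) * population_var          # sample variance formula
sample_sd = sqrt(sample_var)                       # sample standard deviation

print(f"n = {n}, mean = {grouped_mean:.2f} mm")
print(f"sample variance = {sample_var:.4f} mm^2")
print(f"sample standard deviation = {sample_sd:.4f} mm")
```

Because the data come from a sample of bearings, the sample versions of the formulas are the relevant ones here; the population variance appears only as an intermediate step.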
Uses of the Variance and Standard Deviation

As previously stated, variance and standard deviation can
be used to determine the spread of the data. If the
variance or standard deviation is large, the data are more
dispersed. This information is useful in comparing two (or
more) data sets to determine which is more (most)
variable.

The measures of variance and standard deviation are used
to determine the consistency of a variable. For example, in
the manufacture of fittings, such as nuts and bolts, the
variation in the diameters must be small, or the parts will
not fit together.

The variance and standard deviation are used to determine
the number of data values that fall within a specified
interval in a distribution.
Uses of the Variance and Standard Deviation

For example, Chebyshev's theorem (see section of
inferential statistics) shows that, for any distribution, at
least 75% of the data values will fall within 2 standard
deviations of the mean.

Finally, the variance and standard deviation are used quite
often in inferential statistics.

Coefficient of Variation
The coefficient of variation expresses the standard deviation
relative to the mean. It is a measure of dispersion used to
compare the dispersion in two sets of data. This measure is
independent of the unit of measurement, i.e. it is a statistic
that allows us to compare standard deviations when the units
are different. It is equal to the standard deviation divided by
the mean. The result is expressed as a percentage.
Coefficient of Variation

When to use the coefficient of variation:
1 When comparison groups have very different means (CV is
  suitable as it expresses the standard deviation relative to
  its corresponding mean)
2 When different units of measurement are involved, e.g.
  group 1 unit is m, and group 2 unit is kg (CV is suitable
  for comparison as it is unit-free)
In such cases, standard deviation should not be used for
comparison. The coefficient of variation is given by
\[ C.V = \frac{S}{\bar{X}} \times 100 \tag{22} \]
for the sample, and for the population the C.V is given by
\[ C.V = \frac{\sigma}{\mu} \times 100 \tag{23} \]
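A minimal sketch of the sample coefficient of variation is given below. The helper name `coefficient_of_variation` is hypothetical, and the data are the days-off values from the arithmetic mean example, reused here only to have concrete numbers.

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """Sample C.V as a percentage: 100 * S / X-bar."""
    x_bar = mean(sample)
    if x_bar == 0:
        raise ValueError("the C.V is undefined when the mean is zero")
    return 100 * stdev(sample) / x_bar


days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
cv_days = coefficient_of_variation(days_off)
print(f"C.V = {cv_days:.2f}%")
```

Because the units cancel in the ratio, the same function can be applied to two groups measured in different units and the results compared directly.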
The Coefficient of Variation (C.V): Example

The following two data sets show the weights of newborn
elephants (kg) and the weights of newborn mice (kg) respectively.
Compute the coefficient of variation for each group and
compare the variability of data in the two groups.

Exercise here!
Measures of skewness
Skewness is a concept which is commonly used in statistical
decision making. It refers to the degree to which a given
frequency curve deviates from the normal distribution.
Measures of skewness are numerical values which assist in
evaluating the degree of deviation of a frequency distribution
from the normal distribution. The following is the commonly
used measure of skewness:
\[ \text{Coefficient of skewness} = 3 \times \frac{\text{Mean} - \text{Median}}{\text{Standard deviation}} \tag{24} \]
There are 2 types of skewness, namely
1 Positive skewness
2 Negative skewness
Positive Skewness
This is the tendency of a given frequency curve to lean towards
the left. In a positively skewed distribution, the long tail
extends to the right. In this distribution one should note the
following:
The mean is usually bigger than the mode and median
The median usually occurs between the mode and mean
There are more observations below the mean than above
the mean
Negative Skewness
This is an asymmetrical curve in which the long tail extends to
the left. This shape is characteristic of the age distribution in
developed countries.
The mode is usually bigger than the mean and median
The median usually occurs in between the mean and mode
The number of observations above the mean is usually
greater than the number below the mean
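The coefficient of skewness in formula (24) can be sketched as follows. The function name `pearson_skewness` is an illustrative choice, the sample standard deviation is assumed for the denominator, and the days-off data from the arithmetic mean example are reused only to have concrete numbers.

```python
from statistics import mean, median, stdev


def pearson_skewness(sample):
    """Coefficient of skewness: 3 * (mean - median) / standard deviation."""
    s = stdev(sample)   # sample standard deviation (an assumption here)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean(sample) - median(sample)) / s


days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
skew = pearson_skewness(days_off)

direction = "positive" if skew > 0 else "negative" if skew < 0 else "none"
print(f"coefficient of skewness = {skew:.3f} ({direction})")
```

A positive value means the mean lies above the median (long tail to the right); a negative value means the opposite.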
Chebyshev's Theorem

As stated previously, the variance and standard deviation of a
variable can be used to determine the spread, or dispersion, of
a variable. That is, the larger the variance or standard
deviation, the more the data values are dispersed. For example,
if two variables measured in the same units have the same
mean, say, 70, and the first variable has a standard deviation of
1.5 while the second variable has a standard deviation of 10,
then the data for the second variable will be more spread out
than the data for the first variable.
Chebyshev's theorem, developed by the Russian mathematician
Chebyshev (1821-1894), specifies the proportions of the spread
in terms of the standard deviation.
Chebyshev's theorem will show us how to use the mean and the
standard deviation to find the percentage of the total
observations that fall within a given interval about the mean.
Theorem
Chebyshev’s theorem states that the proportion of values from a data set that fall within $k$ standard deviations of the mean is at least $1 - 1/k^2$, where $k$ is a number greater than 1. With $k = 2$, at least $1 - 1/4 = 3/4$, or 75%, of the data values fall within 2 standard deviations of the mean of the data set. With $k = 3$, at least $1 - 1/9 = 8/9$, or 88.89%, of the data values fall within 3 standard deviations of the mean. This can be seen on the following ideal distribution.
[Figure: ideal distribution illustrating Chebyshev’s theorem]
Example: The mean price of houses in a certain neighborhood is R. 50,000, and the standard deviation is R. 10,000. Find the price range for which at least 75% of the houses will sell.
Solution: By Chebyshev’s theorem, at least 75% of the data values fall within $k = 2$ standard deviations of the mean, since $1 - 1/2^2 = 0.75$. The bounds are $50{,}000 - 2(10{,}000) = 30{,}000$ and $50{,}000 + 2(10{,}000) = 70{,}000$, so at least 75% of all homes sold in the area will have a price between R. 30,000 and R. 70,000.
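The bound and the house-price example can be checked numerically. The following is a minimal Python sketch; the function names `chebyshev_bound` and `chebyshev_interval` are illustrative and do not come from the course material.

```python
import math


def chebyshev_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean: float, sd: float, proportion: float) -> tuple[float, float]:
    """Interval about the mean that contains at least `proportion` of the data."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must lie strictly between 0 and 1")
    # Solve 1 - 1/k^2 = proportion for k.
    k = 1 / math.sqrt(1 - proportion)
    return mean - k * sd, mean + k * sd


print(chebyshev_bound(2))                         # 0.75
print(chebyshev_interval(50_000, 10_000, 0.75))   # (30000.0, 70000.0)
```

Solving $1 - 1/k^2 = 0.75$ gives $k = 2$, which is why the interval is the mean plus or minus two standard deviations.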
The Empirical (Normal) Rule

Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true:
1. Approximately 68% of the data values will fall within 1 standard deviation of the mean.
2. Approximately 95% of the data values will fall within 2 standard deviations of the mean.
3. Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
For example, suppose that the scores on a national achievement exam have a mean of 480 and a standard deviation of 90. If these scores are normally distributed, then approximately 68% will fall between 390 and 570 ($480 + 90 = 570$ and $480 - 90 = 390$). Approximately 95% of the scores will fall between 300 and 660 ($480 + 2 \cdot 90 = 660$ and $480 - 2 \cdot 90 = 300$). Approximately 99.7% will fall between 210 and 750 ($480 + 3 \cdot 90 = 750$ and $480 - 3 \cdot 90 = 210$).
The empirical rule gives an estimate for one standard deviation, while Chebyshev’s theorem does not, because it requires $k > 1$. It can be seen as follows:
[Figure: illustration of the empirical rule]
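The three intervals of the exam-score example can be reproduced with a few lines of Python. This is only a sketch; the helper name `empirical_rule_intervals` is an assumption made for illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate coverage level to its (lower, upper) interval."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd) for k, label in coverage.items()}


for label, (low, high) in empirical_rule_intervals(480, 90).items():
    print(label, low, high)
# 68% 390 570
# 95% 300 660
# 99.7% 210 750
```

These percentages hold only for bell-shaped data. For $k = 2$, Chebyshev’s theorem guarantees just 75% for any shape, against about 95% under normality.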
Exercise here!
INFERENTIAL STATISTICS

Probability

The study of probability stems from the analysis of certain games of chance, and it has found applications in most branches of science and engineering. In this section the basic concepts of probability theory are presented.
SAMPLE SPACE AND EVENTS

Random Experiments:
In the study of probability, any process of observation is referred to as an experiment. The results of an observation are called the outcomes of the experiment. An experiment is called a random experiment if its outcome cannot be predicted. Typical examples of a random experiment are the roll of a die, the toss of a coin, drawing a card from a deck, or selecting a message signal for transmission from several messages.
Sample Space: The set of all possible outcomes of a random experiment is called the sample space (or universal set), and it is denoted by $S$. An element in $S$ is called a sample point. Each outcome of a random experiment corresponds to a sample point.
Sample space and Events: Examples

(1). Find the sample space for the experiment of tossing a coin (a) once and (b) twice.
(a) There are two possible outcomes, heads or tails. Thus $S = \{H, T\}$, where $H$ and $T$ represent head and tail, respectively.
(b) There are four possible outcomes. They are pairs of heads and tails. Thus $S = \{HH, HT, TH, TT\}$.
(2). Find the sample space for the experiment of tossing a coin repeatedly and counting the number of tosses required until the first head appears.
Clearly all possible outcomes for this experiment are the terms of the sequence $1, 2, 3, \cdots$. Thus $S = \{1, 2, 3, \cdots\}$. Note that there are an infinite number of outcomes.
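The finite sample spaces of Example 1 can also be listed by machine. A minimal Python sketch, assuming a hypothetical helper `coin_sample_space` that writes each outcome as a string of H and T:

```python
from itertools import product


def coin_sample_space(tosses: int) -> list[str]:
    """All ordered sequences of heads (H) and tails (T) for the given number of tosses."""
    return ["".join(outcome) for outcome in product("HT", repeat=tosses)]


print(coin_sample_space(1))  # ['H', 'T']
print(coin_sample_space(2))  # ['HH', 'HT', 'TH', 'TT']
```

The sample space of Example 2 cannot be listed this way because it is infinite.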
(3). Find the sample space for the experiment of measuring (in hours) the lifetime of a transistor.
Clearly all possible outcomes are all nonnegative real numbers. Thus $S = \{\tau : 0 \le \tau < \infty\}$, where $\tau$ represents the life of a transistor in hours.
Note that any particular experiment can often have many different sample spaces depending on the observation of interest. A sample space $S$ is said to be discrete if it consists of a finite number of sample points (as in Example 1) or countably infinite sample points (as in Example 2). A set is called countable if its elements can be placed in a one-to-one correspondence with the positive integers. A sample space $S$ is said to be continuous if the sample points constitute a continuum (as in Example 3).
Events

Since we have identified a sample space $S$ as the set of all possible outcomes of a random experiment, we review some set notation in the following. If $\tau$ is an element of $S$ (or belongs to $S$), then we write $\tau \in S$, and if $\tau$ is not in the sample space $S$, we write $\tau \notin S$. Any subset of $S$ is called an event. A sample point of $S$ is often referred to as an elementary event.
Note that the sample space $S$ is a subset of itself, i.e. $S \subset S$, and since $S$ is the set of all possible outcomes, it is often called the certain event.
Example: Consider the experiment of Example 2. Let $A$ be the event that the number of tosses required until the first head appears is even. Let $B$ be the event that the number of tosses required until the first head appears is odd. Let $C$ be the event that the number of tosses required until the first head appears is less than 5. Express events $A$, $B$, and $C$.
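One way to express the three events is to list each as a subset of the sample space $S = \{1, 2, 3, \cdots\}$ of Example 2:

```latex
A = \{2, 4, 6, \ldots\}, \qquad
B = \{1, 3, 5, \ldots\}, \qquad
C = \{1, 2, 3, 4\}
```

Note that $A$ and $B$ have no sample point in common and together make up the whole of $S$.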
NOTION AND AXIOMS OF PROBABILITY

An assignment of real numbers to the events defined in a sample space $S$ is known as the probability measure. Consider a random experiment with a sample space $S$ and let $A$ be a particular event defined in $S$.
Relative Frequency Definition:
Suppose that the random experiment is repeated $n$ times. If event $A$ occurs $n(A)$ times, then the probability of event $A$, denoted $P(A)$, is defined as
$$P(A) = \lim_{n \to \infty} \frac{n(A)}{n} \tag{25}$$
where $n(A)/n$ is called the relative frequency of event $A$. Note that this limit may not exist, and in addition, there are many situations in which the concept of repeatability may not be valid. It is clear that for any event $A$, the relative frequency will have the following properties:
1. $0 \le n(A)/n \le 1$, where $n(A)/n = 0$ if $A$ occurs in none of the $n$ repeated trials and $n(A)/n = 1$ if $A$ occurs in all of the $n$ repeated trials.
2. If $A$ and $B$ are mutually exclusive events, then
$$n(A \cup B) = n(A) + n(B)$$
and
$$\frac{n(A \cup B)}{n} = \frac{n(A)}{n} + \frac{n(B)}{n}$$
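The relative frequency definition can be illustrated by simulation. The sketch below assumes a fair coin and takes $A$ to be the event "heads"; the function name and the fixed seed are illustrative choices, and a simulation only suggests the limit in Eq. (25), it does not prove it.

```python
import random


def relative_frequency(n_trials: int, seed: int = 0) -> float:
    """Relative frequency n(A)/n of the event A = 'heads' in n simulated fair tosses."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    heads = sum(rng.random() < 0.5 for _ in range(n_trials))
    return heads / n_trials


for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

As $n$ grows, the printed values tend to settle near 0.5, and every one of them lies between 0 and 1, in line with property 1.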
Axiomatic Definition:
Let $S$ be a sample space and $A$ be an event defined in $S$. Then in the axiomatic definition, the probability of event $A$, $P(A)$, is a real number assigned to $A$ which satisfies the following three axioms:
Axiom 1: $P(A) \ge 0$
Axiom 2: $P(S) = 1$
Axiom 3: $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
If the sample space $S$ is not finite, then axiom 3 must be modified as follows:
Axiom 3: If $A_1, A_2, \cdots$ is an infinite sequence of mutually exclusive events in $S$ ($A_i \cap A_j = \emptyset$, $i \ne j$), then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \tag{26}$$
These axioms satisfy our intuitive notion of probability measure obtained from the notion of relative frequency.
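To see the axioms at work, one can assign probabilities to the points of a small sample space and check them directly. The sketch below uses a made-up three-point space with hypothetical weights 1/2, 1/3 and 1/6; none of these values come from the course material.

```python
from fractions import Fraction

# Hypothetical probabilities of the sample points a, b, c (they sum to 1).
weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
S = set(weights)


def P(event):
    """Probability of an event, given as a set of sample points."""
    return sum((weights[point] for point in event), Fraction(0))


A = {"a"}
B = {"b", "c"}
assert all(P({point}) >= 0 for point in S)  # Axiom 1, checked on each sample point
assert P(S) == 1                            # Axiom 2
assert A & B == set()                       # A and B are mutually exclusive
assert P(A | B) == P(A) + P(B)              # Axiom 3
```

Exact fractions are used instead of decimals so that the equalities hold without rounding error.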
Elementary Properties of Probability

By using the above axioms, the following useful properties of probability can be obtained:
$$P(\bar{A}) = 1 - P(A) \tag{27}$$
$$P(A) \le P(B) \quad \text{if } A \subset B \tag{28}$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{29}$$
$$P(A) \le 1 \tag{30}$$
If $A_1, A_2, \cdots, A_n$ are $n$ arbitrary events in $S$, then
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1} P(A_1 \cap A_2 \cap \cdots \cap A_n) \tag{31}$$
where the sum of the second term is over all distinct pairs of events, that of the third term is over all distinct triples of events, and so forth.
If $A_1, A_2, \cdots, A_n$ is a finite sequence of mutually exclusive events in $S$ ($A_i \cap A_j = \emptyset$, $i \ne j$), then
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) \tag{32}$$
and a similar equality holds for any subcollection of the events.
Note that Eq. (30) can be easily derived from axiom 2 and Eq. (28). Since $A \subset S$, we have
$$P(A) \le P(S) = 1$$
Thus, combining with axiom 1, we obtain
$$0 \le P(A) \le 1$$
Eq. (29) implies that
$$P(A \cup B) \le P(A) + P(B)$$
since $P(A \cap B) \ge 0$ by axiom 1.
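The alternating sum in Eq. (31) is easy to get wrong by hand, so a numerical check can help. The sketch below assumes a hypothetical sample space of 12 equally likely outcomes and three overlapping events chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

S = set(range(1, 13))  # hypothetical sample space: 12 equally likely outcomes


def P(event):
    """Probability of an event under the equally-likely assumption."""
    return Fraction(len(event), len(S))


def union_probability(events):
    """Right-hand side of Eq. (31): alternating sum over all r-fold intersections."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r - 1)
        for group in combinations(events, r):
            total += sign * P(set.intersection(*group))
    return total


A1 = {n for n in S if n % 2 == 0}  # multiples of 2
A2 = {n for n in S if n % 3 == 0}  # multiples of 3
A3 = {n for n in S if n % 4 == 0}  # multiples of 4

print(union_probability([A1, A2, A3]))  # 2/3
print(P(A1 | A2 | A3))                  # 2/3
```

`combinations` visits each pair and each triple exactly once, which matches the "distinct pairs" and "distinct triples" reading of the sums.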
EQUALLY LIKELY EVENTS: A. Finite Sample Space

Consider a finite sample space $S$ with $n$ elements
$$S = \{\tau_1, \tau_2, \cdots, \tau_n\} \tag{33}$$
where the $\tau_i$'s are elementary events. Let $P(\tau_i) = p_i$. Then
$$0 \le p_i \le 1, \quad i = 1, 2, 3, \cdots, n \tag{34}$$
$$\sum_{i=1}^{n} p_i = p_1 + p_2 + p_3 + \cdots + p_n = 1 \tag{35}$$
If $A = \bigcup_{i \in I} \tau_i$, where $I$ is a collection of subscripts, then
$$P(A) = \sum_{\tau_i \in A} P(\tau_i) = \sum_{i \in I} p_i \tag{36}$$
When all elementary events $\tau_i$ ($i = 1, 2, \cdots, n$) are equally likely, that is,
$$p_1 = p_2 = \cdots = p_n$$
then from Eq. (35) we have
$$p_i = \frac{1}{n} \tag{37}$$
and
$$P(A) = \frac{n(A)}{n} \tag{38}$$
where $n(A)$ is the number of outcomes belonging to event $A$ and $n$ is the number of sample points in $S$.
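Eq. (38) reduces probability to counting. As an illustration, take the sample space of two coin tosses from Example 1(b), assume the four outcomes are equally likely, and let $A$ be the event "at least one head". The function name `probability` is an illustrative choice.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(A) = n(A) / n for equally likely sample points, as in Eq. (38)."""
    return Fraction(len(event), len(sample_space))


S = ["HH", "HT", "TH", "TT"]
A = [outcome for outcome in S if "H" in outcome]  # at least one head

print(A, probability(A, S))  # ['HH', 'HT', 'TH'] 3/4
```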
CONDITIONAL PROBABILITY: Definition

The conditional probability of an event $A$ given event $B$, denoted by $P(A \mid B)$, is defined as
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0 \tag{39}$$
where $P(A \cap B)$ is the joint probability of $A$ and $B$. Similarly,
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad P(A) > 0 \tag{40}$$
is the conditional probability of an event $B$ given event $A$. From Eqs. (39) and (40), we have
$$P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A) \tag{41}$$
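A small worked case may make Eqs. (39) to (41) concrete. The sketch assumes one roll of a fair die, with $A$ = "even number" and $B$ = "greater than 3"; both events are hypothetical choices made for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # one roll of a fair die


def P(event):
    """Probability of an event under the equally-likely assumption."""
    return Fraction(len(event & S), len(S))


def conditional(A, B):
    """P(A | B) as in Eq. (39); not defined when P(B) = 0."""
    if P(B) == 0:
        raise ValueError("P(B) must be positive")
    return P(A & B) / P(B)


A = {2, 4, 6}  # even number
B = {4, 5, 6}  # greater than 3

print(conditional(A, B))  # 2/3
# Eq. (41): both products give back the joint probability P(A and B) = 1/3.
assert conditional(A, B) * P(B) == conditional(B, A) * P(A) == P(A & B)
```

Knowing that the roll exceeds 3 raises the probability of an even number from 1/2 to 2/3, because two of the three remaining outcomes are even.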
Bayes’ Rule

From Eq. (41) we can obtain the following Bayes’ rule:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \tag{42}$$

TOTAL PROBABILITY
The events $A_1, A_2, \cdots, A_n$ are called mutually exclusive and exhaustive if
$$\bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \cdots \cup A_n = S \tag{43}$$
and $A_i \cap A_j = \emptyset$, $i \ne j$.
Let $B$ be an event in the sample space $S$. Then
$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i) \tag{44}$$
which is known as the total probability of event $B$.
Let $A = A_i$ in Eq. (42); then, using Eq. (44), we obtain
$$P(A_i \mid B) = \frac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j) P(A_j)} \tag{45}$$
Note that the terms on the right-hand side are all conditioned on the events $A_i$, while the term on the left is conditioned on $B$. Equation (45) is sometimes referred to as Bayes’ rule.
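Eqs. (44) and (45) can be tried on a small invented case. Suppose, purely for illustration, that three machines $A_1, A_2, A_3$ produce 50%, 30% and 20% of a factory's items and that their defect rates are 1%, 2% and 3%; let $B$ be the event "the item is defective". All of these numbers are hypothetical.

```python
from fractions import Fraction as F


def total_probability(priors, likelihoods):
    """P(B) as in Eq. (44): the sum of P(B | A_i) P(A_i) over all i."""
    return sum(like * prior for like, prior in zip(likelihoods, priors))


def bayes(priors, likelihoods):
    """P(A_i | B) for every i, as in Eq. (45)."""
    p_b = total_probability(priors, likelihoods)
    return [like * prior / p_b for like, prior in zip(likelihoods, priors)]


priors = [F(1, 2), F(3, 10), F(1, 5)]            # P(A_i), hypothetical
likelihoods = [F(1, 100), F(2, 100), F(3, 100)]  # P(B | A_i), hypothetical

print(total_probability(priors, likelihoods))  # 17/1000
print(bayes(priors, likelihoods)[0])           # 5/17
```

Because the $A_i$ are mutually exclusive and exhaustive, the three posterior probabilities add up to 1.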
INDEPENDENT EVENTS

Two events $A$ and $B$ are said to be (statistically) independent if and only if
$$P(A \cap B) = P(A) P(B) \tag{46}$$
It follows immediately that if $A$ and $B$ are independent, then by Eqs. (39) and (40),
$$P(A \mid B) = P(A) \quad \text{and} \quad P(B \mid A) = P(B) \tag{47}$$
If two events $A$ and $B$ are independent, then it can be shown that $A$ and $\bar{B}$ are also independent, that is,
$$P(A \cap \bar{B}) = P(A) P(\bar{B}) \tag{48}$$
Then
$$P(A \mid \bar{B}) = \frac{P(A \cap \bar{B})}{P(\bar{B})} = P(A) \tag{49}$$
Thus, if $A$ is independent of $B$, then the probability of $A$'s occurrence is unchanged by information as to whether or not $B$ has occurred.
Three events $A$, $B$, $C$ are said to be independent if and only if
$$P(A \cap B \cap C) = P(A) P(B) P(C) \tag{50}$$
$$P(A \cap B) = P(A) P(B)$$
$$P(A \cap C) = P(A) P(C)$$
$$P(B \cap C) = P(B) P(C)$$
We may also extend the definition of independence to more than three events. The events $A_1, A_2, \cdots, A_n$ are independent if and only if for every subset $\{A_{i_1}, A_{i_2}, \cdots, A_{i_k}\}$ ($2 \le k \le n$) of these events,
$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_k}) \tag{51}$$
Finally, we define an infinite set of events to be independent if and only if every finite subset of these events is independent.
To distinguish between the mutual exclusiveness (or disjointness) and independence of a collection of events, we summarize as follows:
1. If $\{A_i, i = 1, 2, \cdots, n\}$ is a sequence of mutually exclusive events, then
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) \tag{52}$$
2. If $\{A_i, i = 1, 2, \cdots, n\}$ is a sequence of independent events, then
$$P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{i=1}^{n} P(A_i) \tag{53}$$
and a similar equality holds for any subcollection of the events.
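The definition for three events asks for four equalities, not one, and a small case shows why. The sketch below assumes two tosses of a fair coin with four equally likely outcomes; the events $A$, $B$, $C$ and the helper `independent` are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations

S = {"HH", "HT", "TH", "TT"}  # two tosses of a fair coin


def P(event):
    """Probability of an event under the equally-likely assumption."""
    return Fraction(len(event), len(S))


def independent(*events):
    """True if every sub-collection of two or more events satisfies Eq. (51)."""
    for k in range(2, len(events) + 1):
        for group in combinations(events, k):
            expected = Fraction(1)
            for event in group:
                expected *= P(event)
            if P(set.intersection(*group)) != expected:
                return False
    return True


A = {"HH", "HT"}  # first toss is a head
B = {"HH", "TH"}  # second toss is a head
C = {"HT", "TH"}  # exactly one head

print(independent(A, B), independent(A, C), independent(B, C))  # True True True
print(independent(A, B, C))                                     # False
```

Each pair passes the product test, yet $P(A \cap B \cap C) = 0$ while $P(A)P(B)P(C) = 1/8$, so the three events are not independent.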
EXERCISES
1. Consider a random experiment of tossing a coin three times.
(a) Find the sample space $S$ if we wish to observe the exact sequences of heads and tails obtained.
(b) Find the sample space $S$ if we wish to observe the number of heads in the three tosses.
2. Consider an experiment of drawing two cards at random from a bag containing four cards marked with the integers 1 through 4.
(a) Find the sample space $S$ of the experiment if the first card is replaced before the second is drawn.
(b) Find the sample space $S$ of the experiment if the first card is not replaced.
3. An experiment consists of tossing two dice. Find the sample space $S$.
Find the event $A$ that the sum of the dots on the dice equals 7.
Find the event $B$ that the sum of the dots on the dice is greater than 10.
Find the event $C$ that the sum of the dots on the dice is greater than 12.
4. State every possible event in the sample space $S = \{a, b, c, d\}$.
A brief overview of the classical linear regression model

Regression analysis is almost certainly the most important tool at the econometrician’s disposal. But what is regression analysis? In very general terms, regression is concerned with describing and evaluating the relationship between a given variable and one or more other variables. More specifically, regression is an attempt to explain movements (variations) in a variable by reference to movements in one or more other variables.
To make this more concrete, denote the variable whose movements the regression seeks to explain by $y$ and the variables which are used to explain those variations by $x_1, x_2, \cdots, x_k$. Hence, in this relatively simple setup, it would be said that variations in $k$ variables (the $x$'s) cause changes in some other variable, $y$. In these teaching sessions we will be limited to the case where the model seeks to explain changes in one variable $y$ by a single variable $x$ (simple linear regression model) and the case where the model seeks to explain changes in one variable $y$ by two or more variables (multiple linear regression model).
The table below shows the various completely interchangeable names for $y$ and the $x$'s, and all of these terms will be used synonymously in these teaching sessions.
[Table: alternative names for y and the x's]
Regression versus correlation

The correlation between two variables measures the degree of linear association between them. If it is stated that $y$ and $x$ are correlated, it means that $y$ and $x$ are being treated in a completely symmetrical way. Thus, it is not implied that changes in $x$ cause changes in $y$, or indeed that changes in $y$ cause changes in $x$. Rather, it is simply stated that there is evidence for a linear relationship between the two variables, and that movements (variations) in the two are on average related to an extent given by the correlation coefficient.
In regression, the dependent variable ($y$) and the independent variable(s) ($x$'s) are treated very differently. The $y$ variable is assumed to be random or “stochastic” in some way, i.e. to have a probability distribution. The $x$ variables are, however, assumed to have fixed (“non-stochastic”) values in repeated samples. Regression as a tool is more flexible and more powerful than correlation.
Suppose you need to establish the relation between sales and the cost of publicity. The question will be “Is there a relationship between sales and publicity?” Answering this question calls for the regression and correlation analysis that we will be dealing with in these teaching sessions. With regression analysis, it should be noted that the best-fitting curve can take different forms, such as linear, quadratic, polynomial, exponential, logarithmic, moving average, etc.
In this unit, the problem under study is to find the line that best fits all the points. The general convention is that the sum of squares of the vertical distances between the fitted line and the observed points should be minimized, and thus the method is called least squares regression. First of all, sketch the scatter plot and see if the scatter is linear.
Other questions could be:
Is there a relationship between consumption and income?
Does the amount spent on advertising have an influence on the volume of sales?
Is there a relationship between the inflation rate and the interest rate the banks charge?
When the relationship between variables is examined, the purpose is to:
Determine whether there is a relationship
Describe the relationship, if it exists, with an equation
Use the equation to make predictions.
There are two techniques that are used to estimate a relationship that may exist between variables:
1. Correlation analysis determines the strength and direction of the relationship between variables.
2. Regression analysis involves a mathematical equation that explains just how the variables are related. The equation is also used for the purpose of predicting future values for one variable based on the knowledge of the values of the other variable.
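The symmetry of correlation, and the lack of symmetry in regression, can be seen on a few made-up numbers. The sketch below uses the usual sample correlation coefficient and the standard least-squares slope formula, neither of which is derived in the text above; the data are hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Sample correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slope(x, y):
    """Least-squares slope obtained when y is regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

print(correlation(x, y), correlation(y, x))  # the same value twice
print(slope(x, y), slope(y, x))              # two different slopes
```

Swapping the roles of the two variables leaves the correlation unchanged but changes the fitted slope, which is the sense in which regression treats $y$ and $x$ differently.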
Simple regression

For simplicity, suppose for now that it is believed that $y$ depends on only one $x$ variable. Again, this is of course a severely restricted case, but the case of more explanatory variables will be considered in the next section on the multiple regression model. Three examples of the kind of relationship that may be of interest include:
How asset returns vary with their level of market risk
Measuring the long-term relationship between stock prices and dividends
Constructing an optimal hedge ratio.
Suppose that a researcher has some idea that there should be a relationship between two variables $y$ and $x$, and that financial theory suggests that an increase in $x$ will lead to an increase in $y$. A sensible first stage in testing whether there is indeed an association between the variables would be to form a scatter plot of them. Suppose that the outcome of this plot is the figure illustrated below:
[Figure: Scatter plot of two variables, y and x]
In this case, it appears that there is an approximate positive linear relationship between $x$ and $y$, which means that increases in $x$ are usually accompanied by increases in $y$, and that the relationship between them can be described approximately by a straight line.
It would therefore be of interest to determine to what extent this relationship can be described by an equation that can be estimated using a defined procedure. It is possible to use the general equation for a straight line of the form
$$y = \alpha + \beta x \tag{54}$$
to get the line that best “fits” the data. The researcher would then be seeking to find the values of the parameters or coefficients, $\alpha$ and $\beta$, which would place the line as close as possible to all of the data points taken together.
However, this equation ($y = \alpha + \beta x$) is an exact one. Assuming that this equation is appropriate, if the values of $\alpha$ and $\beta$ had been calculated, then given a value of $x$, it would be possible to determine with certainty what the value of $y$ would be. Imagine a model which says with complete certainty what the value of one variable will be given any value of the other!
Clearly this model is not realistic. Statistically, it would correspond to the case where the model fitted the data perfectly, that is, all of the data points lay exactly on a straight line. To make the model more realistic, a random disturbance term, denoted by $u$, is added to the equation, thus
$$y_t = \alpha + \beta x_t + u_t \tag{55}$$
where the subscript $t$ ($= 1, 2, 3, \cdots$) denotes the observation number. The disturbance term can capture a number of features as described below:
[Figure: Reasons for the inclusion of the error term in the model]
So how are the appropriate values of $\alpha$ and $\beta$ determined? $\alpha$ and $\beta$ are chosen so that the (vertical) distances from the data points to the fitted line are minimised collectively, so that the line fits the data as closely as possible.
This could be done by “eye-balling” the data: for each set of variables $y$ and $x$, one could form a scatter plot and draw on, by hand, a line that looks as if it fits the data well, as in the figure below:
[Figure: scatter plot with a line fitted by eye]
Note that the vertical distances are usually minimised rather than the horizontal distances or those taken perpendicular to the line. This arises as a result of the assumption that $x$ is fixed in repeated samples, so that the problem becomes one of determining the appropriate model for $y$ given (or conditional upon) the observed values of $x$.
This “eye-balling” procedure may be acceptable if only indicative results are required, but of course this method, as well as being tedious, is likely to be imprecise. The most common method used to fit a line to the data is known as ordinary least squares (OLS).
Two alternative estimation methods (for determining the appropriate values of the coefficients $\alpha$ and $\beta$) are the method of moments and the method of maximum likelihood.
Suppose now, for ease of exposition, that the sample of data contains only five observations. The method of OLS entails taking each vertical distance from the point to the line, squaring it and then minimising the total sum of the areas of squares (hence "least squares"), as shown in the figure below.

[Figure: five data points, the fitted line, and the squares drawn on the vertical distances]

This can be viewed as equivalent to minimising the sum of the areas of the squares drawn from the points to the line.

Tightening up the notation, let $y_t$ denote the actual data point for observation $t$ and let $\hat{y}_t$ denote the fitted value from the regression line. In other words, for the given value of $x$ of this observation $t$, $\hat{y}_t$ is the value for $y$ which the model would have predicted.

Finally, let $\hat{u}_t$ denote the residual, which is the difference between the actual value of $y$ and the value fitted by the model for this data point, i.e. $(y_t - \hat{y}_t)$. This is shown for just one observation $t$ in the figure below:

[Figure: plot of a single observation, together with the line of best fit, the residual and the fitted value]

What is done is to minimise the sum of the $\hat{u}_t^2$.

The reason that the sum of the squared distances is minimised rather than, for example, finding the sum of $\hat{u}_t$ that is as close to zero as possible, is that in the latter case some points will lie above the line while others lie below it. Then, when the sum to be made as close to zero as possible is formed, the points above the line would count as positive values, while those below would count as negatives. So these distances will in large part cancel each other out, which would mean that one could fit virtually any line to the data, so long as the sum of the distances of the points above the line and the sum of the distances of the points below the line were the same. In that case, there would not be a unique solution for the estimated coefficients.

In fact, any fitted line that goes through the mean of the observations (i.e. the point $(\bar{x}, \bar{y})$) would set the sum of the $\hat{u}_t$ to zero.

However, taking the squared distances ensures that all deviations that enter the calculation are positive and therefore do not cancel out.

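The cancelling-out argument can be checked numerically. The sketch below uses a small data set invented purely for illustration (it is not the module's example) and the hypothetical helper names `residuals_through_mean` and `summarise`. Every line drawn through the point of means gives residuals that sum to zero, yet the sum of squared residuals differs from line to line, so only the squared criterion can single out one line.

```python
# Hypothetical data, invented for illustration only (not the module's example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)


def residuals_through_mean(slope):
    """Residuals of the line with the given slope that passes through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return [yt - (intercept + slope * xt) for xt, yt in zip(x, y)]


def summarise(slope):
    """Return (sum of residuals, sum of squared residuals) for that line."""
    u = residuals_through_mean(slope)
    return sum(u), sum(ut ** 2 for ut in u)


# Several different slopes, all forced through the point of means.
for slope in (-1.0, 0.0, 0.6, 2.0):
    total, rss = summarise(slope)
    print(f"slope={slope:5.1f}  sum of residuals={total:9.6f}  RSS={rss:8.3f}")
```

For this particular data set the slope 0.6 is the least squares slope, so it gives the smallest RSS of the four lines tried.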
So minimising the sum of squared distances is given by minimising $(\hat{u}_1^2 + \hat{u}_2^2 + \hat{u}_3^2 + \hat{u}_4^2 + \hat{u}_5^2)$, or minimising $\sum_{t=1}^{5} \hat{u}_t^2$.

This sum is known as the residual sum of squares (RSS) or the sum of squared residuals. But what is $\hat{u}_t$? Again, it is the difference between the actual point and the line, $y_t - \hat{y}_t$. So minimising $\sum_t \hat{u}_t^2$ is equivalent to minimising $\sum_t (y_t - \hat{y}_t)^2$.

Letting $\hat{\alpha}$ and $\hat{\beta}$ denote the values of $\alpha$ and $\beta$ selected by minimising the RSS, respectively, the equation for the fitted line is given by

$$\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t.$$

Mathematical derivations of CLRM results

Now let $L$ denote the RSS, which is also known as a loss function. Take the summation over all of the observations, i.e. from $t = 1$ to $T$, where $T$ is the number of observations:

$$L = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 = \sum_{t=1}^{T} \left(y_t - \hat{\alpha} - \hat{\beta} x_t\right)^2 \tag{56}$$

$L$ is minimised with respect to (w.r.t.) $\hat{\alpha}$ and $\hat{\beta}$, to find the values of $\alpha$ and $\beta$ which minimise the residual sum of squares to give the line that is closest to the data. So $L$ is differentiated w.r.t. $\hat{\alpha}$ and $\hat{\beta}$, setting the first derivatives to zero.

A derivation of the ordinary least squares (OLS) estimator is to be performed on the whiteboard in class. The coefficient estimators for the slope and the intercept are given by

Simple Regression: coefficient estimators for the slope and the intercept

$$\hat{\beta} = \frac{\sum x_t y_t - T \bar{x} \bar{y}}{\sum x_t^2 - T \bar{x}^2} \tag{57}$$

$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \tag{58}$$

Equations (57) and (58) state that, given only the sets of observations $x_t$ and $y_t$, it is always possible to calculate the values of the two parameters, $\hat{\alpha}$ and $\hat{\beta}$, that best fit the set of data. Equation (57) is the easiest formula to use to calculate the slope estimate, but the formula can also be written, more intuitively, as

$$\hat{\beta} = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2} \tag{59}$$

which is equivalent to the sample covariance between $x$ and $y$ divided by the sample variance of $x$.

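Equations (57) and (58) follow from setting the two first derivatives of $L$ in equation (56) to zero. The steps below are a sketch of the standard textbook derivation, offered as a reading aid and not necessarily the version presented in class. It uses only the symbols already defined.

```latex
\begin{align*}
\frac{\partial L}{\partial \hat{\alpha}}
  &= -2 \sum_{t=1}^{T} \left(y_t - \hat{\alpha} - \hat{\beta} x_t\right) = 0
  &&\Rightarrow\quad \sum y_t - T\hat{\alpha} - \hat{\beta} \sum x_t = 0
  &&\Rightarrow\quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \\[4pt]
\frac{\partial L}{\partial \hat{\beta}}
  &= -2 \sum_{t=1}^{T} x_t \left(y_t - \hat{\alpha} - \hat{\beta} x_t\right) = 0
  &&\Rightarrow\quad \sum x_t y_t - \hat{\alpha}\, T\bar{x} - \hat{\beta} \sum x_t^2 = 0 .
\end{align*}
% Substitute \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} into the second condition:
\begin{align*}
\sum x_t y_t - T\bar{x}\bar{y} + \hat{\beta}\, T\bar{x}^2 - \hat{\beta} \sum x_t^2 = 0
  \quad\Rightarrow\quad
  \hat{\beta} = \frac{\sum x_t y_t - T\bar{x}\bar{y}}{\sum x_t^2 - T\bar{x}^2}.
\end{align*}
```

The first condition also says that the OLS residuals sum to zero, and the expression for $\hat{\alpha}$ says that the fitted line passes through $(\bar{x}, \bar{y})$.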
To reiterate, this method of finding the optimum is known as OLS. It is also worth noting that it is obvious from the equation for $\hat{\alpha}$ that the regression line will go through the mean of the observations, i.e. that the point $(\bar{x}, \bar{y})$ lies on the regression line.

Note: Equations (57) and (58) give the regression line of $y$ on $x$, with $y$ as the dependent variable and $x$ as the independent variable. The two equations can also be written in the more familiar form $l_1 \equiv \hat{y} = a + bx$, where $b$ is the slope:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}, \qquad a = \bar{y} - b\bar{x} \tag{60}$$

The regression line of $x$ on $y$ is $l_2 \equiv \hat{x} = a' + b'y$, where $b'$ is the slope:

$$b' = \frac{S_{xy}}{S_{yy}} = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}, \qquad a' = \bar{x} - b'\bar{y} \tag{61}$$

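As an illustration of equations (60) and (61), the sketch below computes both regression lines in plain Python. The data are invented for the purpose (they are not the sales and publicity figures of the example that follows), and `regression_line` is a hypothetical helper name.

```python
def regression_line(x, y):
    """Return (a, b) for the fitted line y_hat = a + b*x, following equation (60)."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    b = s_xy / s_xx                      # slope
    a = sum(y) / n - b * sum(x) / n      # intercept: y_bar - b * x_bar
    return a, b


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = regression_line(x, y)      # l1: regression of y on x
a2, b2 = regression_line(y, x)    # l2: regression of x on y (roles swapped), equation (61)

print(f"l1: y_hat = {a:.3f} + {b:.3f} x")
print(f"l2: x_hat = {a2:.3f} + {b2:.3f} y")
```

Swapping the arguments is enough to obtain the second line, because equation (61) is equation (60) with the roles of $x$ and $y$ exchanged. The two lines coincide only when the points lie exactly on a straight line.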
Simple Regression: Example

The table below presents the amounts of sales and publicity (in $\times 10^3$ dollars).

[Table: publicity (x) and sales (y)]

1. Draw the scatter plot and comment on it.
2. Determine the equation of the regression line.
3. Determine the correlation coefficient and interpret its value.
4. Determine the coefficient of determination and interpret its value.

The scatter plot of the above data set is given by

[Figure: scatter plot of sales (y) against publicity (x)]

and the scatter appears to be linear.

The regression line of $y$ on $x$ is given by $l_1 \equiv \hat{y} = -0.5 + 2.1x$.

What should the second regression line be?

Correlation

The word correlation is used when you want to assess whether there is a relationship between two numerical variables or two categorical variables. The word association is more appropriate if the two variables are categorical.

Correlation: Pearson correlation

If the two variables $X$ and $Y$ are assumed to be normally distributed, then use the Pearson correlation. It is denoted by $r$, lies between $-1$ and $1$, and is expressed by:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} \times S_{yy}}} \qquad (-1 \le r \le 1) \tag{62}$$

Interpretation:

[Figure: interpretation of the values of r]

From the above example, the Pearson correlation coefficient is $r = 0.991968 \approx 0.9920$, which indicates that there is a positive and very strong correlation between publicity ($x$) and sales ($y$).

Spearman's Rho correlation

If the two variables $X$ and $Y$ are not normally distributed, then use Spearman's Rho rank correlation. It is denoted by $r_s$, lies between $-1$ and $1$, and is expressed as:

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{63}$$

where $d_i = \text{rank}(x_i) - \text{rank}(y_i)$ and $n$ = number of observations.

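Equations (62) and (63) can be tried out with a few lines of Python. This is a minimal sketch on invented data. The helper names `pearson_r`, `ranks` and `spearman_rs` are hypothetical, and the ranking helper assumes there are no tied values, which is the situation in which formula (63) applies directly.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation r = S_xy / sqrt(S_xx * S_yy), equation (62)."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    s_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)


def ranks(values):
    """Rank each value from 1 (smallest) to n. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rho from rank differences, equation (63). Assumes no ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data, invented for illustration only (no tied values).
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

print("Pearson r   :", round(pearson_r(x, y), 4))
print("Spearman r_s:", round(spearman_rs(x, y), 4))
```

In this data set the values are already their own ranks, so the two coefficients agree. In general they differ, because Spearman's coefficient depends only on the ordering of the observations.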
Coefficient of determination ($R^2$)

Variation about the regression line:

Total variation (TSS) $= \sum (y_i - \bar{y})^2$
The explained variation (ESS) $= \sum (\hat{y}_i - \bar{y})^2$
The unexplained variation (RSS) $= \sum (y_i - \hat{y}_i)^2$

$$TSS = RSS + ESS, \qquad ESS = TSS - RSS \tag{64}$$

Dividing both sides of equation (64) by TSS yields the coefficient of determination $R^2$:

$$R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \tag{65}$$

From the example described above, $R^2 = 0.984 = 98.4\%$, which indicates that publicity ($x$) explains 98.4% of the variation in sales ($y$).

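The decomposition in equation (64) and the two forms of $R^2$ in equation (65) can be verified numerically. The sketch below fits the OLS line to a small invented data set (not the module's example) and computes the three sums of squares. The function name `sums_of_squares` is hypothetical.

```python
def sums_of_squares(x, y):
    """Fit y_hat = a + b*x by OLS and return (TSS, ESS, RSS)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]

    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ess = sum((fi - y_bar) ** 2 for fi in y_hat)             # explained variation
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # unexplained variation
    return tss, ess, rss


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

tss, ess, rss = sums_of_squares(x, y)
r_squared = 1 - rss / tss

print(f"TSS={tss:.3f}  ESS={ess:.3f}  RSS={rss:.3f}")
print(f"R^2 = 1 - RSS/TSS = {r_squared:.3f},  ESS/TSS = {ess / tss:.3f}")
```

In simple regression $R^2$ also equals the square of the Pearson correlation coefficient, which is consistent with the example above, where $0.991968^2 \approx 0.984$.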
Residual and autocorrelation

The Durbin-Watson ($DW$) test reports a test statistic,

$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \tag{66}$$

where $e_i$ denotes the residual for observation $i$. $DW$ takes values from 0 to 4:

1. If $DW = 2$, there is no indication of autocorrelation.
2. If $0 \le DW < 2$, there is an indication of positive autocorrelation.
3. If $2 < DW \le 4$, there is an indication of negative autocorrelation.

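The statistic in equation (66) is simple to compute directly from a list of residuals. The sketch below uses two invented residual sequences, chosen only to show the two directions of departure from 2. The function name `durbin_watson` is a hypothetical helper, not a library call.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of equation (66) for residuals in observation order."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, invented for illustration only.
slowly_changing = [-2.0, -1.0, 1.0, 2.0]   # neighbouring residuals are similar
alternating = [1.0, -1.0, 1.0, -1.0]       # neighbouring residuals flip sign

print("slowly changing residuals: DW =", durbin_watson(slowly_changing))
print("alternating residuals    : DW =", durbin_watson(alternating))
```

The first sequence gives a value below 2 (indication of positive autocorrelation) and the second a value above 2 (indication of negative autocorrelation). In practice the residuals would come from a fitted regression, and the value would be compared with tabulated critical bounds before concluding.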
The assumptions underlying the classical linear regression model

[Figure: list of the assumptions of the classical linear regression model]

Properties of the OLS estimator

If assumptions 1-4 hold, then the estimators $\hat{\alpha}$ and $\hat{\beta}$ determined by OLS will have a number of desirable properties, and are known as Best Linear Unbiased Estimators (BLUE). (See further details in the reference book.)

Precision and standard errors

Any set of regression estimates $\hat{\alpha}$ and $\hat{\beta}$ is specific to the sample used in their estimation. In other words, if a different sample of data were selected from within the population, the data points (the $x_t$ and $y_t$) would be different, leading to different values of the OLS estimates.

Recall that the OLS estimators ($\hat{\alpha}$ and $\hat{\beta}$) are given by equations (57) and (58). It would be desirable to have an idea of how good these estimates of $\alpha$ and $\beta$ are in the sense of having some measure of the reliability or precision of the estimators ($\hat{\alpha}$ and $\hat{\beta}$). It is thus useful to know whether one can have confidence in the estimates, and whether they are likely to vary much from one sample to another sample within the given population. An idea of the sampling variability and hence of the precision of the estimates can be calculated using only the sample of data available.

This estimate is given by its standard error. Given assumptions 1-4 above, valid estimators of the standard errors can be shown to be given by

$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}} = s \sqrt{\frac{\sum x_t^2}{T \left(\sum x_t^2 - T\bar{x}^2\right)}} \tag{67}$$

$$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}} = s \sqrt{\frac{1}{\sum x_t^2 - T\bar{x}^2}} \tag{68}$$

where $s$ is the estimated standard deviation of the residuals and is given by

$$s = \sqrt{\frac{\sum \hat{u}_t^2}{T - 2}} \tag{69}$$

$s$ is also known as the standard error of the regression or the standard error of the estimate. Example here!

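Equations (67) to (69) can be put together in a short computation. The sketch below is an illustration on invented data (not the module's sales and publicity example), and `ols_standard_errors` is a hypothetical helper name. It fits the line by OLS, forms the residuals, and then applies the three formulas.

```python
from math import sqrt


def ols_standard_errors(x, y):
    """Return (alpha_hat, beta_hat, s, se_alpha, se_beta) for the simple regression of y on x."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    s_xx = sum((xt - x_bar) ** 2 for xt in x)

    beta_hat = sum((xt - x_bar) * (yt - y_bar) for xt, yt in zip(x, y)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar

    # Residual sum of squares and the standard error of the regression, equation (69).
    rss = sum((yt - alpha_hat - beta_hat * xt) ** 2 for xt, yt in zip(x, y))
    s = sqrt(rss / (T - 2))

    se_alpha = s * sqrt(sum(xt ** 2 for xt in x) / (T * s_xx))   # equation (67)
    se_beta = s * sqrt(1 / s_xx)                                 # equation (68)
    return alpha_hat, beta_hat, s, se_alpha, se_beta


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha_hat, beta_hat, s, se_alpha, se_beta = ols_standard_errors(x, y)
print(f"alpha_hat = {alpha_hat:.4f}  (SE {se_alpha:.4f})")
print(f"beta_hat  = {beta_hat:.4f}  (SE {se_beta:.4f})")
print(f"s         = {s:.4f}")
```

The divisor $T - 2$ in equation (69) reflects the two parameters estimated from the data, so at least three observations are needed for the standard errors to be defined.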
