STAE Lecture Notes - LU1

Learning Unit 1: INTRODUCTION TO STATISTICS
LEARNING OBJECTIVES
• Understand the concepts of a population, sample, parameter, statistic, random variable and data
• Distinguish between descriptive and inferential statistics
• Distinguish between categorical and numerical variables
• Identify the four different scales of measure
• Know the difference between raw data and frequency data
Textbook reference: Chapter 1
1.1. What Is Statistics

An important part of the scientific research process is the gathering, ordering and analysis of information from
which conclusions can be drawn and interpretations made. The study of statistical methods focuses on
how the gathered data should be analysed so that meaningful conclusions can be drawn.
1.2. Notation And Terminology

Population
The term population refers to the entire collection of individuals, objects or items under consideration. A
population may be finite or infinite. For example, the shoes manufactured on any given day in a factory form a
finite population. However, all the outcomes when flipping a coin repeatedly (and indefinitely) would be
considered an infinite population. The total number of elements in a population is denoted by N.
Population parameter
A population parameter is a constant value (usually unknown) that describes some measurable aspect of a
population. Population parameters are generally denoted using Greek letters.
Sample
A sample is a subset of the population of interest. Samples are generally used to collect information since it is
not always possible or feasible to consider the entire population. The total number of elements in a sample is
denoted by n.
Sampling unit
A sampling unit is the object being measured, counted or observed.
Sample statistic
A number calculated from sampled data, which describes a measurable aspect of a sample, is called a statistic.
Sample statistics are generally denoted using Roman letters.
Notation
Measure            | Sample statistic | Population parameter
Mean               | x̄ (x-bar)        | μ (mu)
Variance           | s² (s-squared)   | σ² (sigma-squared)
Standard deviation | s                | σ (sigma)
Proportion         | p = x/n          | π (pi)
Size               | n                | N
Descriptive statistics
Descriptive statistics comprise those methods used to organise and describe information that has been
collected in a sample.
Inferential statistics
Inferential statistics comprise those methods and techniques used for making generalisations, predictions or
estimates about the population using sampled data.
Random variable
A characteristic of the elements of a population (or sample) for which the observed values differ from element
to element is called a variable. In probability theory, where a variable assumes certain values with certain
associated probabilities, the variable is called a random variable. Variables are denoted by capital letters, e.g.,
X, Y, Z, and the actual values assumed by the random variables are denoted by lower case letters, e.g., x, y, z.
For example, let X = the height of boys in metres. Here X is a random variable, which measures the variable
“height”. If three boys are selected at random, i.e., n = 3, and their respective heights are 1.40m, 1.37m and
1.41m, then the realisations of the random variable X are denoted as xi, for i = 1, 2, 3:
x1 = 1.40 x2 = 1.37 x3 = 1.41
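As a minimal sketch, the three realisations above can be stored in a Python list and summarised with the sample statistics named in the notation table. The variable names are illustrative only, and the sketch assumes the usual sample variance with divisor n - 1, which these notes have not yet defined.

```python
import statistics

# Realisations x1, x2, x3 of X = height of boys in metres (from the example above)
heights = [1.40, 1.37, 1.41]

n = len(heights)                          # sample size n
x_bar = statistics.mean(heights)          # sample mean (x-bar)
s_squared = statistics.variance(heights)  # sample variance, divisor n - 1 (assumed)
s = statistics.stdev(heights)             # sample standard deviation

print(f"n = {n}, x-bar = {x_bar:.3f}, s^2 = {s_squared:.5f}, s = {s:.3f}")
```

These values are statistics because they are calculated from the sample; the corresponding population parameters (μ, σ², σ) remain unknown.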
Data
The collection of all variables measured forms the data.
Unit of measurement
The unit of measurement of a variable is the standard unit used to express a quantity. For example, if
measurements are made in whole seconds, the unit of measurement is one second, i.e., unit = 1. If a stopwatch
records the length of time to solve a problem in tenths of a second, the unit of measurement is 0.1 second, i.e.,
unit = 0.1.
Sigma notation
In Mathematics, sigma notation is the standard notation used to represent summation. It is a convenient and
simple way to write long sums in a compact form. It is denoted by the Greek capital letter sigma, Σ. If a
random variable X consists of n observations x1, x2, ..., xn, the sum of all n values is represented in sigma
notation as $\sum_{i=1}^{n} x_i$, or simply as $\sum x$.
For example, if X = the number of children in a household where x1 = 2, x2 = 3 and x3 = 0, then the total
number of children in all three households in the sample is:

$\sum x = \sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 2 + 3 + 0 = 5$
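The same summation can be sketched in Python. The list name `x` mirrors the notation above; the explicit loop is included only to show what the sigma symbol abbreviates.

```python
# x_i = number of children in household i (values from the example above)
x = [2, 3, 0]

# Sigma notation: add up x_i for i = 1, ..., n
total = sum(x)

# The same sum written as an explicit loop over the index i
running_total = 0
for i in range(len(x)):
    running_total += x[i]

print(total, running_total)
```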
Exercise 1.1
Consider the results of the 2 semester tests for STAE:
1) All the STAE students form the
2) A selection of 50 STAE students is a
3) Each test is a
4) The sampling unit is
5) The results from all 2 tests form the
6) The average mark for Test 1 is a
7) To test whether the current group of STAE students performs better than groups from previous years is the
process of
1.3. Variable Type
It is important to identify a variable in terms of its type, namely categorical or numerical. This distinction
determines the appropriate analyses that can be performed on a variable.
Categorical variables
Categorical variables are also known as qualitative variables. Such variables allow for classification based on
some characteristic. For example, gender can be classified as male or female. The values of categorical variables
are often recorded as numerical values, e.g., coding the gender variable as 1 = Male and 2 = Female, but
these values have no numerical meaning as they simply denote the categories of the variable.
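A short sketch of why coded categories carry no numerical meaning follows. The five coded responses are hypothetical and serve only as an illustration; the coding 1 = Male, 2 = Female is the one given above.

```python
from collections import Counter

# Hypothetical coded responses, using the coding 1 = Male, 2 = Female
gender_codes = [1, 2, 2, 1, 2]
labels = {1: "Male", 2: "Female"}

# Meaningful summary: how many observations fall in each category
counts = Counter(labels[code] for code in gender_codes)

# Arithmetically possible, but not interpretable: an "average gender"
mean_code = sum(gender_codes) / len(gender_codes)

print(counts)
print(mean_code)
```

The counts describe the sample, whereas the mean of the codes would change if the categories were coded differently (for example 0 and 1), which is why it has no interpretation.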
Numerical variables
Numerical variables are also known as quantitative variables. Such variables are naturally measured as
numbers. For example, a person’s height in centimetres. Arithmetic operations can be performed on the
variables as the values have numerical meaning. Numerical variables are further classified as either discrete
or continuous. Discrete variables assume values obtained by counting, so their possible values can be listed,
for example the number of children in a household. Continuous variables assume values obtained by
measuring and can take any of the infinitely many values in an interval of the real line, such as the variable age.
1.4. Measurement Scales

Variables are also classified according to the measurement scale, i.e., the procedure used to measure or obtain
the data. There are four measurement scales, namely nominal, ordinal, interval and ratio.
Nominal
A categorical variable is measured on a nominal scale if the variable consists of two or more categories with
no intrinsic order. For example, a person’s eye colour could be classified as brown, blue, green or grey. There
is no logical way in which these four categories can be ordered.
Ordinal
A categorical variable is measured on an ordinal scale if the variable consists of two or more categories that
can be ordered or ranked. For example, a person’s age classified as young, middle-aged or old. The three
possible values of this variable are ordered in a logical way. Another example is an anxiety rating on a scale
from 1 to 5 where 1 = no anxiety and 5 = high anxiety. In this case numbers are used to reflect the measurement
in an order from low to high, i.e., a score of 4 indicates higher anxiety than a score of 2, but that does not mean
a person who rated 4 is twice as anxious as a person who rated 2.
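The idea that ordinal categories can be ranked, but not meaningfully added or divided, can be sketched as follows. The rank numbers and the five-person sample are hypothetical; only the order of the categories comes from the example above.

```python
# Rank of each category; only the order matters, not the spacing between ranks
age_rank = {"young": 1, "middle-aged": 2, "old": 3}

# Hypothetical sample of five respondents
sample = ["old", "young", "middle-aged", "young", "old"]

# Ordering the observations is meaningful on an ordinal scale
ordered = sorted(sample, key=age_rank.get)

# So is the middle observation of the ordered sample
middle_category = ordered[len(ordered) // 2]

print(ordered)
print(middle_category)
```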
Interval
A numerical variable (discrete or continuous) is measured on an interval scale if the values of the variable can
be arranged in order and differences between data values are meaningful, but there is no true or absolute zero,
i.e., the value of zero is an arbitrary reference point, so ratios between values are not meaningful. For example,
temperature in degrees Celsius. The values are numerical and ordered. A temperature of 0 °C does not mean
an absence of temperature, i.e., the scale has an arbitrary zero value. The difference between
10 °C and 20 °C is the same as the difference between 30 °C and 40 °C, namely a 10-degree difference.
However, 20 °C is not twice as hot as 10 °C, i.e., ratios are not meaningful.
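One way to see why ratios are not meaningful on an interval scale is to express the same temperatures on a second scale with a different arbitrary zero. The sketch below uses degrees Fahrenheit, which is not part of these notes and is brought in purely for illustration.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Equal differences in Celsius remain equal differences in Fahrenheit
diff_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
diff_high = celsius_to_fahrenheit(40) - celsius_to_fahrenheit(30)

# The ratio depends on where the arbitrary zero is placed
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

print(diff_low, diff_high)
print(ratio_celsius, ratio_fahrenheit)
```

The two differences agree, but the ratio changes with the choice of scale, so "twice as hot" is not a property of the temperatures themselves.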
Ratio
A numerical variable (discrete or continuous) is measured on a ratio scale if the values of the variable can be
arranged in order, there is a true or absolute zero, differences between data values are meaningful, and ratios
between values are meaningful. For example, the amount of money in a bank account in Rand. The values are
numerical and ordered. An amount of R0 implies an absence of money, i.e., the scale has an absolute zero value.
The difference between R10 and R20 is the same as the difference between R30 and R40, namely a R10
difference. R20 is twice as much money as R10, i.e., ratios are meaningful.
1.5. Data Formats

Raw data
Raw data refers to unprocessed information, also known as source data or primary data. All information
collected is first represented in raw data format, i.e., the dataset. The dataset is shown in a matrix format with
rows and columns. Variables are given in the columns and observations are given in the rows. A sample of n
observations and p variables will yield a dataset with n rows and p columns.
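The matrix layout can be sketched with a list of rows. The values are the ID, Gender and Age entries of the first three consumers in Table 1 (Exercise 1.2); the names `variables` and `dataset` are illustrative.

```python
# Column names: one entry per variable
variables = ["ID", "Gender", "Age"]

# One row per observation (first three consumers of Table 1)
dataset = [
    [1, "Male", 24],
    [2, "Male", 26],
    [3, "Female", 25],
]

n = len(dataset)    # number of observations (rows)
p = len(variables)  # number of variables (columns)

# Reading down a column gives all observed values of one variable
age_column = [row[variables.index("Age")] for row in dataset]

print(n, p, age_column)
```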
Steps to enter raw data into calculator

1) SETUP → down arrow → 3:STAT → 2:OFF
2) MODE → 2:STAT → 1:1-VAR
3) Enter variable values in the column labelled X
4) AC
Frequency data
Frequency data are raw data in aggregated format, where individual data values, or ranges of values, are listed with
a count of the number of times each value or range appears in the dataset. This count is referred to as the
frequency of occurrence, or simply the frequency. It shows how the data are distributed across the scale.
Frequency data provide an overview of the sampled information.
Univariate frequency data represent counts of a single variable, and bivariate frequency data represent counts
of the combination of two variables. Steps to enter frequency data into the calculator are discussed in Section
2.1.3.
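As a rough sketch, raw data can be aggregated into univariate and bivariate frequency data with `collections.Counter`. The pairs below are the Gender and Coffee type preference values of the 16 consumers listed in Table 1 (Exercise 1.2), in the order in which they appear; the variable names are illustrative.

```python
from collections import Counter

# (Gender, Coffee type preference) for the 16 rows listed in Table 1
rows = [
    ("Male", "Instant"), ("Male", "Instant"), ("Female", "Filter"), ("Female", "Instant"),
    ("Male", "Filter"), ("Male", "Filter"), ("Female", "Instant"), ("Female", "Instant"),
    ("Female", "Filter"), ("Male", "Filter"), ("Female", "Filter"), ("Male", "Instant"),
    ("Female", "Filter"), ("Male", "Instant"), ("Female", "Instant"), ("Female", "Instant"),
]

# Univariate frequency data: counts of a single variable
coffee_freq = Counter(coffee for _, coffee in rows)

# Bivariate frequency data: counts of each combination of two variables
gender_coffee_freq = Counter(rows)

print(coffee_freq)
print(gender_coffee_freq)
```

In both cases the frequencies add up to the number of rows that were aggregated.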
Exercise 1.2
Data were collected for a random sample of 20 coffee consumers. The survey yielded the following nine
variables, and the data are given in Table 1. For each variable, identify the type and the scale of measure.
ID | Gender | Age | Highest qualification | Household size | Daily coffee consumption | Coffee type preference | Choice of brand rating | Coffee affinity score
1  | Male   | 24  | Tertiary certificate  | 4 | 3 | Instant | 2 | 2.3
2  | Male   | 26  | Degree/Diploma        | 2 | 1 | Instant | 1 | 1.9
3  | Female | 25  | Degree/Diploma        | 3 | 2 | Filter  | 1 | 0.8
4  | Female | 29  | Less than matric      | 5 | 7 | Instant | 5 | 4.4
6  | Male   | 21  | Tertiary certificate  | 1 | 1 | Filter  | 3 | 0.4
8  | Male   | 19  | Matric                | 1 | 1 | Filter  | 4 | 0.4
9  | Female | 28  | Postgraduate degree   | 2 | 3 | Instant | 2 | 3.1
10 | Female | 34  | Matric                | 3 | 2 | Instant | 1 | 1.9
12 | Female | 39  | Postgraduate degree   | 5 | 2 | Filter  | 3 | 0.6
14 | Male   | 35  | Degree/Diploma        | 2 | 4 | Filter  | 5 | 3.6
15 | Female | 29  | Matric                | 3 | 1 | Filter  | 4 | 1.0
16 | Male   | 19  | Matric                | 6 | 2 | Instant | 4 | 1.4
17 | Female | 32  | Degree/Diploma        | 1 | 3 | Filter  | 3 | 2.4
18 | Male   | 19  | Less than matric      | 2 | 5 | Instant | 2 | 3.4
19 | Female | 26  | Tertiary certificate  | 5 | 2 | Instant | 3 | 0.2
20 | Female | 36  | Postgraduate degree   | 3 | 8 | Instant | 2 | 4.6
Table 1: Raw data from coffee consumption survey
Variable | Type | Scale of measure
Consumer ID number | |
Gender | |
Age | |
Highest qualification | |
Household size | |
Daily coffee consumption (number of cups) | |
Coffee type preference | |
Choice of brand rating from not important (1) to very important (5) | |
Coffee affinity score | |
