Chapter 1

Introduction

Objectives
By the end of this chapter the student should be able to:
• understand the basic concepts of statistics
• know the sigma (Σ) notation

1.1 Introduction
The Oxford English Dictionary says that the word statistics came into use more than 200 years ago.
Statistics then referred to a country’s quantifiable political characteristics: population, taxes and provinces.
Statistics was “state numbers”. The tools of statistics are employed in many fields, e.g. business, education,
psychology, pharmacy, etc.

Population
Population is the conceptual totality of objects under consideration. In other words, population is the group
of all items of interest. It is frequently very large. A population may be finite or infinite. A finite population
has a limit or fixed size. Example 1.1 lists finite populations. An infinite population is impossible to measure
since it has no limits (see Example 1.2). The descriptive measure of a population is called a parameter.

Example 1.1:
a) All the leaves on a tree are a finite population.
b) All the books in a library are a finite population.
c) All the sand grains on all the beaches in the world are a very large population of sand grains, but
still a finite population.

Example 1.2:
The population of particles in the cosmos is an infinite population.

Parameter
A parameter is a number that describes some property of a population. Normally Greek letters (e.g. 𝜇,
𝜌, 𝜎, 𝜎²; see the table of the Greek alphabet on page A5 of the Appendix) are used to denote population
parameters.

Sample
A sample is a set of data drawn from the population. The descriptive measure of a sample is called a sample
statistic (see Example 1.3).

Figure 1.1: Population and Sample

Sample statistic
A sample statistic, also simply called a statistic, is a number that describes some property of a sample.
Sample statistics are used to make inferences (or decisions) about parameters. Lower case letters
(e.g. 𝑥̅, 𝑟, 𝑠, 𝑠²) are normally used to denote sample statistics.

Example 1.3:
The 49 numbers from which the lotto numbers are drawn constitute a population of 49 items. The six
numbers that are randomly drawn are a sample from the population. The largest number in the population
is 49 and it is a population parameter which could be denoted by the Greek letter α, thus α = 49. If the
sample consists of the numbers 5, 21, 3, 42, 10 and 11, the sample maximum is 42 and could be denoted
by the Roman letter a, thus a = 42.

1.2 Purpose of statistics
The field of statistics is concerned with

1. Descriptive statistics
The collection, arrangement, summary and presentation of a set of data. Methods used are graphs,
tables and numerical measures (see Chapters 2 and 3). We determine or estimate population
characteristics and distinguish between a population parameter and a sample statistic. Consider
for example the case where the observations of individuals are categorical, such as having been
vaccinated against measles or not. The proportion of vaccinated persons in the population, P, is
a population proportion. Now suppose that a random sample of size n from the population
contains X vaccinated individuals. Then the sample proportion of vaccinated persons, p̂ = X/n, is an
estimate of P. In general, however, p̂ does not equal P. Note: Another sample may produce a
different value of p̂.

2. Inferential statistics
The process of inferring an estimate, prediction or decision about a population based on sample
data. For example, scientists working at a pharmaceutical manufacturing company want to
determine if the average weight of a specific drug is 𝜇 = 10 mg. However, this pharmaceutical
manufacturing company produces the drug in batches (lots) of 50,000 tablets. That is, one complete
production cycle is represented by 50,000 tablets. To calculate the average weight, the scientists
would have to weigh each of the 50,000 tablets. Since the population of one complete production cycle is
very large, investigating each tablet would be impractical, time consuming and expensive (e.g.
costs for damage to product, hiring staff, training staff). It is easier and cheaper to take a
representative sample from the population of interest and to draw conclusions or make estimates about
the population on the basis of information provided by the sample. However, such conclusions
and estimates are not always going to be correct. For this reason, a measure of reliability is built
into the statistical inference. Two such measures are the confidence level and the significance level (see
Chapters 7 and 8).

The sample should be representative of the population. One way of trying to ensure this is by simple
random sampling. A simple random sample is chosen in such a manner that every
element/observation/object has an equal chance of being selected. An example of simple random sampling
is selecting a playing card from a shuffled deck. Another example is to assign a number to each member
of the population, so that the population becomes a set of numbers. Then, using a random number table, we
can choose a sample of the desired size.
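The ideas of simple random sampling and of the sample proportion p̂ = X/n can be sketched in a few lines of Python. This is only an illustration: the population of 1000 persons, the true proportion P = 0.7 and the sample size of 50 are assumptions made up for the example, and a pseudo-random number generator stands in for the random number table.

```python
import random

# Hypothetical population: 1 = vaccinated, 0 = not vaccinated.
# The size (1000) and the true proportion P = 0.7 are assumptions for illustration.
population = [1] * 700 + [0] * 300
P = sum(population) / len(population)   # population parameter

def sample_proportion(pop, n, rng):
    """Draw a simple random sample of size n (without replacement) and return X / n."""
    sample = rng.sample(pop, n)   # every element has an equal chance of selection
    X = sum(sample)               # number of vaccinated individuals in the sample
    return X / n

rng = random.Random(1)            # fixed seed so the run is repeatable
estimates = [sample_proportion(population, 50, rng) for _ in range(5)]
print("P =", P)
print("p-hat values:", estimates)
```

Each of the five samples gives its own value of p̂, and these values will usually differ from one another and from P.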

1.3 Definitions
If, on observing a characteristic, we find that it takes on different values in different persons or objects, we call
the characteristic a variable, for example age, gender, marks of students, etc. Not all students achieve the
same mark and the marks vary from student to student, thus we call it a variable. A variable is denoted
by a symbol, usually a capital letter of the alphabet, such as X, Y, Z, etc. Values are all the possible
observations of a variable. Data are the observed values of a variable. Data that are randomly collected
are called data of a random variable (RV).

The actual observations of a random variable are usually denoted by the corresponding lower case letters.

Let $x_1$ denote a particular variate, i.e. the value of a particular member of X, in this case the first member
of X. The symbol $x_i$ denotes the value of X when the subscript i assumes a particular value. The
only values which the subscript can assume are the integers 1, 2, 3, etc. Another word for subscript is
index. In Table 1.1, the first subscript represents the row and the second subscript the column:
$x_{ij}$ represents the number at the intersection of the i-th row and the j-th column. Subscripts are usually
denoted by the letters i and j, but other letters, e.g. k, p, q, s, etc., can also be used.

Table 1.1: Subscript of rows and columns

         column 1   column 2   column 3
row 1    x11        x12        x13
row 2    x21        x22        x23
row 3    x31        x32        x33
row 4    x41        x42        x43
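The row and column subscripts of Table 1.1 map naturally onto a nested list. The sketch below is illustrative only: the numbers in the table and the helper name `x` are made up, and the helper exists only to translate the 1-based subscripts used in these notes into Python's 0-based indices.

```python
# Hypothetical 4 x 3 table of values; the numbers are made up for illustration.
table = [
    [11, 12, 13],   # row 1
    [21, 22, 23],   # row 2
    [31, 32, 33],   # row 3
    [41, 42, 43],   # row 4
]

def x(i, j):
    """Return x_ij using the 1-based row/column subscripts of Table 1.1."""
    return table[i - 1][j - 1]   # Python lists are 0-based

print(x(1, 1))  # first row, first column
print(x(4, 3))  # fourth row, third column
```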

Example 1.4:
a) A mark for a statistics test is out of 100; the values are the integers from 0 to 100.
b) The marks for a statistics test for 10 students are listed below and are the data from which we will
extract the information: 32, 35, 40, 52, 68, 72, 75, 82, 89, 90

Example 1.5:
Let the RV Y represent tree height.
a) If the height of the tenth tree in the population of trees is 3 metres, then $y_{10} = 3$
b) If the height of the second tree is 4 metres, then $y_2 = 4$

Example 1.6:

In a candy shop, let the RV X be the price of the sweets.
a) If the price of the first packet of sweets on the shelf is R4.50, then $x_1 = \text{R}4.50$
b) If the price of the second packet of sweets on the shelf is R6.75, then $x_2 = \text{R}6.75$

Discrete random variable (RV)


A variable that can assume only certain distinct values. Data described by a discrete variable are called
discrete data. If a variable can assume only one value it is called a constant. In certain situations, fractional
values are also discrete (see Example 1.7 c).

Example 1.7:
a) The number of children in a family can only assume discrete values such as 0, 1, 2, 3, 4, etc. It
is not possible for a family to have 2.6 children.
b) The number of goats on a farm can only assume discrete values such as 0, 1, 2, 3, 4, etc.
c) Coke bottles are available in 1L, 1.5L and 2L. Since other fractional values between 1L and 1.5L
do not occur, these data can be considered as discrete.

Example 1.8:

Let X be the number of defective parts in a sample of size n = 5.
The values are tabulated in Table 1.2.

Table 1.2: Example 1.8 data values

x_i      x_1  x_2  x_3  x_4  x_5  x_6
values   0    1    2    3    4    5

Example 1.9:
Let Y be the number of correct answers in a multiple choice test paper consisting of 10 questions. The
values are tabulated in Table 1.3.

Table 1.3: Example 1.9 data values

y_i      y_1  y_2  y_3  y_4  y_5  y_6  y_7  y_8  y_9  y_10  y_11
values   0    1    2    3    4    5    6    7    8    9     10

Continuous random variable (RV)


A variable that can assume any value within a specified interval. Data described by a continuous variable are called continuous
data. The values recorded depend on the accuracy of measurement and the sensitivity of the measuring
instrument.

Example 1.10:
a) The weight of a bag of potatoes can be 5kg or 5.05kg. It can also assume any value between
these two numbers.
b) Sets of measurement data, such as length, mass, yield per hectare etc.

Quantitative variable
Described mostly by numbers, for example weight, height, age etc. It is suitable for performing calculations
such as adding, subtracting, multiplying and dividing data values.

Qualitative variable
Described by categorical (non-numeric) responses, for example eye colour, gender, etc. Usually these are
described by words and are also called attribute variables.

Scale of measurement
The four data types are
1. Nominal-scaled data
2. Ordinal-scaled data
3. Interval-scaled data
4. Ratio-scaled data

Nominal-scaled data
• values are not numbers but words describing categories
• qualitative (non-numeric)
• categories are mutually exclusive (no object can be placed in more than one category)
• not ordered/ranked in any way
• minimum number of categories is two and there can be as many categories as needed
• we can code nominal data by assigning a number to each category
• we cannot perform any calculations on the codes
• we are only permitted to count the occurrences of each category and present them graphically as a pie chart
and/or bar graph (see Chapter 2)

Example 1.11:
a) Gender coded as follows: ‘0’: male, ‘2’: female
b) Marital status coded as follows: ‘1’: single, ‘2’: married, ‘3’: divorced, ‘4’: widowed
c) The provinces where first-year UWC students matriculated, coded as follows:
‘1’: Western Cape, ‘2’: Eastern Cape, ‘3’: Northern Cape, ‘4’: Gauteng,
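
Counting the occurrences of each category, the only calculation permitted on nominal data, can be sketched as follows. The codes follow Example 1.11 b); the eight responses are hypothetical and serve only to make the example runnable.

```python
from collections import Counter

# Codes follow Example 1.11 b); the eight responses are hypothetical.
labels = {1: "single", 2: "married", 3: "divorced", 4: "widowed"}
responses = [1, 2, 2, 1, 3, 2, 4, 2]

counts = Counter(responses)        # counting is permitted for nominal data
for code in sorted(labels):
    print(labels[code], counts[code])

# The most frequent category is meaningful; an "average code" would not be.
most_common_code, frequency = counts.most_common(1)[0]
print("most frequent:", labels[most_common_code], frequency)
```

The counts produced this way are what a pie chart or bar graph of the data would display.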

Ordinal-scaled data
• appear to be nominal
• qualitative (non-numeric)
• each object is classified into one and only one category
• categories are ordered/ranked; each category can be greater than (>) or less than (<) its
neighbour
• when assigning codes to the values, maintain the order of the values
• the permissible calculation is to place all the data in order and split the data set into 100, 10, 4 or 2 equal
parts; the resulting descriptive measures are respectively called percentiles, deciles, quartiles and the median
(see Chapter 3); a box plot can also be constructed (see Chapter 3)
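
As a rough sketch of the permissible calculation, the 'Study' column of Table 1.8 can be coded so that the order of the categories is maintained, and the ordered data can then be split in two to find the median category. The particular codes 1 to 4 are an assumption; any order-preserving coding would give the same median category.

```python
import statistics

# Ordered codes for the 'Study' column of Table 1.8 (the coding itself is an assumption).
order = {"NEVER": 1, "SOMETIMES": 2, "MOST TIME": 3, "ALWAYS": 4}
study = ["ALWAYS", "MOST TIME", "MOST TIME", "SOMETIMES",
         "MOST TIME", "NEVER", "SOMETIMES", "NEVER"]

codes = sorted(order[s] for s in study)      # place the data in order
median_code = statistics.median_low(codes)   # always returns an observed category
by_code = {code: name for name, code in order.items()}
print(by_code[median_code])
```

`median_low` is used here instead of an averaged median because averaging two codes is not a meaningful operation on ordinal data.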

Example 1.12:
Marathon rankings for female runners:
1st place: Susan
2nd place: Janet
3rd place: Mary
4th place: Denise

Marathon rankings show that Susan is faster than Janet, who is faster than Mary. However, we cannot say
that the time between Susan and Janet is the same as the time between Mary and Denise (i.e. a ranking has no unit
of measurement).

Interval-scaled data
• is quantitative (numeric).
• includes ‘greater than (>)’ or ‘less than (<)’ measurements, such as rankings or preferences.
• has unit of measurement (i.e. it is possible to determine exact distance between categories).
• has no meaningful zero point (or absolute zero point).

Example 1.13:
We can easily tell if one temperature is greater than, equal to, or less than another, e.g. 40 degrees
Celsius is greater than 20 degrees Celsius (order/ranking). The distance between 40 degrees Celsius
and 30 degrees Celsius is the same as the distance between 20 degrees Celsius and 10 degrees Celsius
(unit of measurement). Zero degrees Celsius is the freezing point of water at sea level but does
not mean the complete absence of heat. It is simply a convenient starting point, so temperature in
degrees Celsius has no meaningful zero point.

Example 1.14:
01:00 is less than 02:00, which is less than 04:00 (ranking/order). The time between 01:00 and
03:00 is the same as the time between 06:00 and 08:00 (unit of measurement). Midnight (00:00) is merely the
start of the day and does not mean the absence of time (no absolute zero point).

Ratio-scaled data
• is quantitative (numeric).
• includes ‘greater than’ or ‘less than’ measurements, such as rankings or preferences.
• has unit of measurement (i.e. it is possible to determine exact distance between categories).
• has absolute zero point (or meaningful zero point/origin), i.e. at the zero value on the ratio scale,
the characteristic being measured has decreased to the point where it is not present or at least it
is not observable.

Example 1.15:
a) Age, height, weight, etc.
b) If you are 20 years old, you are older than a person aged 15 (ranking) and you are five
years older (unit of measurement). With a ratio scale, we also have a point where none of the
scale exists; when a person is born his or her age is zero.

1.4 Sigma (Σ) notation

The symbol Σ is shorthand for ‘the sum of’. Hence, the Greek capital letter Σ is used to denote sum,
which is the answer to an addition sum.

\[ \sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 \]

In words: ‘the sum of $x_i$ for i going from one to three’ is $x_1 + x_2 + x_3$.

Generally

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n \tag{1.1} \]

\[ \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 \tag{1.2} \]

The notation may be simplified even further, for example $\sum x$ denoting the sum of all x values. The
subscript ‘i’ is called a dummy variable. Any other subscript could be used, such that $\sum_{i=1}^{5} y_i$ is equivalent
to $\sum_{j=1}^{5} y_j$. The subscript is also called the index variable. If the index variable is omitted from the sigma
notation, then in these notes it is assumed to run from 1 to n, that is its full range; thus in this context
$\sum y_i$ means $\sum_{i=1}^{n} y_i$.

Example 1.16:
Table 1.4 gives the values.

Table 1.4: Example 1.16 data values

y_i      y_1  y_2  y_3  y_4  y_5  y_6
values   3    1    2    3    5    4

Calculate
a) $\sum_{i=1}^{5} y_i$
b) $\sum_{i=4}^{6} y_i$
c) $\sum_{i=1}^{4} y_i^2$
d) $\left( \sum_{i=1}^{6} y_i \right)^2$

Solution:
a) $\sum_{i=1}^{5} y_i = y_1 + y_2 + y_3 + y_4 + y_5 = 3 + 1 + 2 + 3 + 5 = 14$
b) $\sum_{i=4}^{6} y_i = y_4 + y_5 + y_6 = 3 + 5 + 4 = 12$
c) $\sum_{i=1}^{4} y_i^2 = y_1^2 + y_2^2 + y_3^2 + y_4^2 = 3^2 + 1^2 + 2^2 + 3^2 = 9 + 1 + 4 + 9 = 23$
d) $\left( \sum_{i=1}^{6} y_i \right)^2 = (y_1 + y_2 + y_3 + y_4 + y_5 + y_6)^2 = (3 + 1 + 2 + 3 + 5 + 4)^2 = 18^2 = 324$
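
The four sums of Example 1.16 can be checked with a short Python sketch. The helper name `sigma` is hypothetical and is used only to keep the 1-based subscripts of the notes; it is one possible way of writing the sums, not a standard function.

```python
y = [3, 1, 2, 3, 5, 4]   # y_1 ... y_6 from Table 1.4

def sigma(values, start, stop, f=lambda v: v):
    """Sum f(values_i) for i = start, ..., stop using 1-based subscripts."""
    return sum(f(values[i - 1]) for i in range(start, stop + 1))

a = sigma(y, 1, 5)                       # y_1 + ... + y_5
b = sigma(y, 4, 6)                       # y_4 + y_5 + y_6
c = sigma(y, 1, 4, lambda v: v ** 2)     # sum of squares, i = 1..4
d = sigma(y, 1, 6) ** 2                  # square of the sum, i = 1..6
print(a, b, c, d)  # 14 12 23 324
```

Parts c) and d) differ only in whether the squaring happens before or after the addition, which is why the two results are so different.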

Summation of a constant
In the case where k is a constant,

\[ \sum_{i=1}^{n} k_i = k_1 + k_2 + \dots + k_n = k + k + k + \dots + k = n \times k \tag{1.3} \]

Example 1.17:
Calculate

\[ \sum_{i=1}^{4} 2 \]

Solution

\[ \sum_{i=1}^{4} 2 = 2 + 2 + 2 + 2 = 2 \times 4 = 8 \]

Example 1.18:
Calculate

\[ \sum_{i=1}^{3} 4 \]

Solution

\[ \sum_{i=1}^{3} 4 = 4 + 4 + 4 = 4 \times 3 = 12 \]

The subscript associated with the constant is generally omitted. In the following, where k is a constant,

\[ \sum_{i=1}^{n} (k \times x_i) = k \times x_1 + k \times x_2 + \dots + k \times x_n = k \times (x_1 + x_2 + \dots + x_n) = k \times \sum_{i=1}^{n} x_i \tag{1.4} \]

Thus $\sum_{i=1}^{n} (k \times x_i) = k \times \sum_{i=1}^{n} x_i$

Example 1.19:
Consider Table 1.4. Calculate
a) $\sum_{i=1}^{3} (3 \times y_i)$
b) $\sum_{i=1}^{6} (2 \times y_i)$

Solution:
a) $\sum_{i=1}^{3} (3 \times y_i) = 3 \times \sum_{i=1}^{3} y_i = 3 \times (y_1 + y_2 + y_3) = 3 \times (3 + 1 + 2) = 3 \times 6 = 18$
b) $\sum_{i=1}^{6} (2 \times y_i) = 2 \times \sum_{i=1}^{6} y_i = 2 \times (y_1 + y_2 + y_3 + y_4 + y_5 + y_6) = 2 \times (3 + 1 + 2 + 3 + 5 + 4) = 2 \times 18 = 36$

An extension to the sigma notation concept


As an extension to the concept, note that

\[ \sum_{i=1}^{n} (x_i \times y_i) = (x_1 \times y_1) + (x_2 \times y_2) + \dots + (x_n \times y_n) \tag{1.5} \]

which may simply be written as $\sum (x \times y)$. Note that this is not the same as $\sum x \times \sum y$, which means
the sum of x multiplied by the sum of y. Furthermore, consider Table 1.5 with figures arranged in rows
and columns, with r rows and c columns.

Table 1.5: Subscript of r rows and c columns

         column 1   column 2   column 3   …   column c
row 1    x11        x12        x13        …   x1c
row 2    x21        x22        x23        …   x2c
row 3    x31        x32        x33        …   x3c
row 4    x41        x42        x43        …   x4c
…        …          …          …          …   …
row r    xr1        xr2        xr3        …   xrc


\[ \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} = x_{11} + x_{21} + \dots + x_{ij} + \dots + x_{rc} \tag{1.6} \]

which represents the sum of all the figures in Table 1.5.
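
A double sum over rows and columns can be sketched with a nested list. The 2 x 3 table below is hypothetical; Python indices start at 0, so `x[i][j]` here corresponds to the entry in row i + 1 and column j + 1 of the notation above.

```python
# Hypothetical table with r = 2 rows and c = 3 columns.
x = [
    [1, 2, 3],
    [4, 5, 6],
]
r, c = len(x), len(x[0])

# Outer index i runs over rows, inner index j over columns.
total = sum(x[i][j] for i in range(r) for j in range(c))
row_totals = [sum(row) for row in x]     # inner sums, one per row
print(total, row_totals)  # 21 [6, 15]
```

Adding the row totals gives the same grand total, which is what the double sigma expresses.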


Consider

\[ \sum_{i=1}^{n} (x_i + y_i) = (x_1 + y_1) + (x_2 + y_2) + \dots + (x_n + y_n) \tag{1.7} \]

Re-grouping the x and y values separately,

\[ \sum_{i=1}^{n} (x_i + y_i) = (x_1 + x_2 + \dots + x_n) + (y_1 + y_2 + \dots + y_n) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i \tag{1.8} \]
Where there is no doubt as to what is meant, the above is written as $\sum x + \sum y$. In the same way it can
be shown that

\[ \sum_{i=1}^{n} \left[ (a \times x_i) + (b \times y_i) + (c \times z_i) \right] = \left( a \times \sum_{i=1}^{n} x_i \right) + \left( b \times \sum_{i=1}^{n} y_i \right) + \left( c \times \sum_{i=1}^{n} z_i \right) \tag{1.9} \]

\[ \sum_{i=1}^{n} \left[ (a \times x_i) + b \right] = \left( a \times \sum_{i=1}^{n} x_i \right) + (n \times b) \tag{1.10} \]

\[ \sum_{i=1}^{n} \left[ (a \times x_i) - b \right] = \left( a \times \sum_{i=1}^{n} x_i \right) - (n \times b) \tag{1.11} \]

Example 1.20:
Table 1.6 gives the data values.

Table 1.6: Example 1.20 data values

y_i      y_1  y_2  y_3
values   3    2    1
x_i      x_1  x_2  x_3
values   2    4    2

Calculate

\[ \sum_{i=1}^{3} (x_i + y_i) \]

Solution:

\[ \sum_{i=1}^{3} (x_i + y_i) = \sum_{i=1}^{3} x_i + \sum_{i=1}^{3} y_i = (x_1 + x_2 + x_3) + (y_1 + y_2 + y_3) = (2 + 4 + 2) + (3 + 2 + 1) = 14 \]
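
The re-grouping rules, and the warning that the sum of products is not the product of sums, can be illustrated with the values of Table 1.6. The constants `a` and `b` below are hypothetical and chosen only for the check.

```python
x = [2, 4, 2]   # Table 1.6
y = [3, 2, 1]
a, b = 5, 7     # hypothetical constants
n = len(x)

sum_of_sums = sum(xi + yi for xi, yi in zip(x, y))        # Example 1.20
linear = sum(a * xi + b for xi in x)                      # constant term added n times
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))    # 6 + 8 + 2
product_of_sums = sum(x) * sum(y)                         # 8 * 6

print(sum_of_sums, sum(x) + sum(y))       # 14 14
print(linear, a * sum(x) + n * b)         # 61 61
print(sum_of_products, product_of_sums)   # 16 48
```

The last line shows two different numbers, which is the point of the caution about writing the sum of x multiplied by the sum of y.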

Example 1.21:
Calculate a value for $\sum_{i=1}^{10} (x_i + y_i)^2$ given that $\sum_{i=1}^{10} x_i^2 = 385$; $\sum_{i=1}^{10} y_i^2 = 1540$
and $\sum_{i=1}^{10} (x_i \times y_i) = 770$.

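Solution:
Note:

\[ (x + y)^2 = (x + y) \times (x + y) = x^2 + (x \times y) + (y \times x) + y^2 = x^2 + 2 \times (x \times y) + y^2 \]

Thus,

\[
\begin{aligned}
\sum_{i=1}^{10} (x_i + y_i)^2 &= \sum_{i=1}^{10} \left( x_i^2 + 2 \times (x_i \times y_i) + y_i^2 \right) \\
&= \left( \sum_{i=1}^{10} x_i^2 \right) + \left( 2 \times \sum_{i=1}^{10} (x_i \times y_i) \right) + \left( \sum_{i=1}^{10} y_i^2 \right) \\
&= 385 + (2 \times 770) + 1540 = 385 + 1540 + 1540 = 3465
\end{aligned}
\]

Hence $\sum_{i=1}^{10} (x_i + y_i)^2 = 3465$.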

Suppose we have observations $y_1, y_2, \dots, y_n$ and f is a function; then

\[ \sum_{i=1}^{n} f(y_i) \text{ denotes } f(y_1) + f(y_2) + \dots + f(y_n) \tag{1.13} \]

Example 1.22:
Table 1.7 gives the data values.

Table 1.7: Example 1.22 data values

y_i      y_1  y_2  y_3  y_4  y_5  y_6
values   3    1    2    3    5    4

If $f(y) = (2 \times y) - 1$, show that $\sum_{i=1}^{6} f(y_i)$ is 30.

Solution:

\[
\begin{aligned}
\sum_{i=1}^{6} f(y_i) &= f(y_1) + f(y_2) + f(y_3) + f(y_4) + f(y_5) + f(y_6) \\
&= [(2 \times y_1) - 1] + [(2 \times y_2) - 1] + [(2 \times y_3) - 1] + [(2 \times y_4) - 1] + [(2 \times y_5) - 1] + [(2 \times y_6) - 1] \\
&= [(2 \times 3) - 1] + [(2 \times 1) - 1] + [(2 \times 2) - 1] + [(2 \times 3) - 1] + [(2 \times 5) - 1] + [(2 \times 4) - 1] \\
&= (6 - 1) + (2 - 1) + (4 - 1) + (6 - 1) + (10 - 1) + (8 - 1) \\
&= 5 + 1 + 3 + 5 + 9 + 7 = 30
\end{aligned}
\]
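
Example 1.22 can be reproduced by applying the function to every observation and adding the results. The sketch also shows a shortcut that takes the constant factor outside the sum and adds the constant term once per observation; it is offered as a cross-check, not as the method the notes require.

```python
y = [3, 1, 2, 3, 5, 4]   # Table 1.7

def f(v):
    return 2 * v - 1

total = sum(f(v) for v in y)           # f(y_1) + ... + f(y_6)
shortcut = 2 * sum(y) - len(y) * 1     # factor 2 outside the sum, constant 1 subtracted n times
print(total, shortcut)  # 30 30
```
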
Chapter 1 - Exercises

1. Forty students were enrolled in an English literature class. The lecturer asked five students who usually
sit in the back of the classroom if they would like A Tale of Two Cities as the next class reading assignment.
Three of the five students replied ‘yes’.
a) Identify the population and the sample in the situation.
b) Is this likely to be a representative sample? If not, why not?
2. Is the random variable in Example 1.4 a) and Example 1.4 b) the same? What is it?

3. Table 1.8 is an example of student data. In this table the random variables are age, weight, etc., and
the entries in the table are the realisations of the variables. Classify each of the following variables, first,
as qualitative or quantitative and second, as a nominal, ordinal, interval, or ratio scale.

Table 1.8: Student data
ID   Age   Weight   Height   Class    Gender   Study
1    27    73       174      THIRD    MALE     ALWAYS
2    24    59       156      FIRST    FEMALE   MOST TIME
3    22    56       164      SECOND   FEMALE   MOST TIME
4    20    70       176      FIRST    MALE     SOMETIMES
5    28    67       172      HONS     MALE     MOST TIME
6    24    61       168      FIRST    FEMALE   NEVER
7    21    73       186      SECOND   MALE     SOMETIMES
8    27    59       158      FIRST    FEMALE   NEVER

4. For each of the following, indicate whether the appropriate variable would be qualitative or quantitative.
If you identify the variable as qualitative, indicate whether it would be nominal or ordinal. If you identify
the variable as quantitative, indicate whether it would be discrete or continuous.
a) Whether you own a ‘Panasonic’, ‘LG’, ‘Telefunken’ or ‘other’ television set.
b) Your status as either ‘first-year’, ‘second-year’, ‘third-year’ or ‘post-graduate’ student.
c) The number of people who attended the first-year orientation programme in 2010.
d) The price of your most recent haircut.
e) Distance (in km) between Sc2 and Unibell station.
f) The number of students on campus who live on residence.

5. For each of the following, indicate the type of data (scale of measurement) that best describes the
information.
a) In mid June 2009, a corporation had approximately 39,000 employees.
b) Yesterday’s newspaper reported that the previous day’s highest temperature in Cape Town was 27
degrees Celsius.
c) The news broadcast at 19:00 every weekday.
d) An individual respondent answered ‘yes’ when asked if TV contributes to violence.

Chapter 1 - Answers

Exercise 1.1
a) Population: the forty students enrolled in the English literature class; sample: the five students who sat at
the back of the classroom
b) Not a representative sample. The five students all sat at the back of the classroom and do not represent the
entire class.

Exercise 1.2
Yes, the random variable (RV) is the same. The random variable (RV) is Statistics test mark.

Exercise 1.3
a) ID: qualitative & nominal
b) Age: quantitative & ratio
c) Weight: quantitative & ratio
d) Height: quantitative & ratio
e) Class: qualitative & ordinal
f) Gender: qualitative & nominal
g) Study: qualitative & ordinal

Exercise 1.4
a) Qualitative & nominal
b) Qualitative & ordinal
c) Quantitative & discrete
d) Quantitative & continuous
e) Quantitative & continuous
f) Quantitative & discrete

Exercise 1.5
a) Ratio
b) Interval
c) Interval
d) Nominal
