
Statistics

Types of Data
Data can be categorized in many ways. First, we’ll look at the two main types of data:
• Qualitative Data – information given using words; it may be structured or semi-structured.
(E.g., the ball is red)
• Quantitative Data – information given using numbers or other statistical references. (E.g.,
Consumption = $100 + 5wages)
There are two types of quantitative data:
1. Discrete Data – data that can only take certain countable values. It is usually whole
numbers, but not always. (E.g., shoe sizes: 6, 7, 8, 9, 10.5, 11.5)
2. Continuous Data – data that can take any value within a range; it is usually
decimals or real numbers. (E.g., the average of 5 numbers is 7.932)

Measures of Central Tendency


In statistics, there is sometimes a need to find or use a single value that represents or
characterizes a group as a whole. This single value is called a statistical average (measure of
central tendency).
There are three statistical averages, namely:
• Mean
• Mode
• Median
Mean
The mean for a given set of data is the arithmetic average of the total set of data.
Formula:
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{\sum x}{n}$$
Where: $\sum$ – means the ‘sum of’
$\bar{x}$ – The mean
$x$ – The value of an observation (variable)
$f$ – The frequency
$n$ – The number of observations
Median
The median for a given set of data is the ‘middle’ term or central value when the data is
arranged in order (i.e., ascending or descending). Its symbol is Q2. The median should always
have the same number of values above and below it. When we have an odd number of values the
median is easily obtained; however, when we have an even number of values the median is found
by finding the arithmetic average of the two middle numbers.

Formula: $Q_2 = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ term}$, where ‘N’ – the number of terms

Mode
The mode for a given set of data is the value with the highest frequency, that is, the most
common observation. In everyday terms we would say that the mode is the most popular or most
fashionable item. If a distribution has a single mode, two modes or three modes, then it is said
to be unimodal, bimodal or trimodal respectively.

EXAMPLE:
Given the group of numbers below, find the mean, mode and median.
3, 5, 7, 9, 5, 8, 6, 5, 3, 4, 7, 8, 5, 5, 2
First, we would arrange the numbers in order, whether ascending or descending:
2, 3, 3, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9
The mode is the most frequent number, that is, the number which appears the most.
Mode = 5
The median is the middle term. It can be found by using the formula, or by crossing off one number
from the back for each number crossed off from the front until only the middle number remains.
2, 3, 3, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9

$$Q_2 = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ term} = \left(\frac{15+1}{2}\right)^{\text{th}} \text{ term} = \left(\frac{16}{2}\right)^{\text{th}} \text{ term} = 8^{\text{th}} \text{ term}$$
The 8th term of the ordered list is 5.
In both cases, we can see that the median = 5
$$\text{Mean} = \frac{\sum x}{n} = \frac{2+3+3+4+5+5+5+5+5+6+7+7+8+8+9}{15} = \frac{82}{15}$$
= 5.47 (2 decimal places)
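The three averages from the example above can be checked with a short Python sketch. It assumes only the standard-library `statistics` module (Python 3.8 or later for `multimode`); the variable names are illustrative and not part of the original notes.

```python
from statistics import mean, median, multimode

# Data set from the worked example
data = [3, 5, 7, 9, 5, 8, 6, 5, 3, 4, 7, 8, 5, 5, 2]

ordered = sorted(data)            # arrange the values in ascending order
data_mean = mean(data)            # sum of the values divided by how many there are
data_median = median(data)        # middle term of the ordered data
data_modes = multimode(data)      # list of every value that shares the highest frequency

print("Ordered:", ordered)
print("Mean (2 d.p.):", round(data_mean, 2))
print("Median:", data_median)
print("Mode(s):", data_modes)
```

`multimode` is used rather than `mode` because it returns every mode, which matters for bimodal or trimodal data.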

Comparing the three measures of central tendency

| Average | Advantages | Disadvantages |
| Mean | It is the most popular measure | It cannot be obtained graphically |
| | It can be calculated exactly | It can be affected by outliers |
| | It can help with further statistical calculations | It can sometimes give an impossible value when data is discrete |
| | All data is used in its calculation | |
| Mode | Easy to find or understand | It can’t help with further statistical calculations |
| | It is not affected by outliers | A set of data can have more than one mode |
| | It can be easily obtained from a graph | It can’t be determined exactly from grouped data |
| Median | Easy to find or understand | It’s rarely used in further statistical calculations |
| | It is not affected by outliers | In grouped data the median is estimated |
| | It can both represent the set of data and be an actual member of it | In a limited set of data it may not represent the entire group |
Measures of Dispersion
In statistics it’s also good to know the dispersion, or spread, of a set of data.
This gives us an idea of how close the values in the group are to the central value.
There are four measures of dispersion:
• Range
• Interquartile Range
• Semi-Interquartile Range
• Standard Deviation
Range
The range of a set of values is the difference between the largest and the smallest observations.
Formula:
Range = Largest Observation - Smallest Observation
Interquartile & Semi-Interquartile Ranges
A quartile is one of three values that divide an ordered set of data into four equal parts. The first
(lower) quartile ‘Q1’ is the value below which one-quarter of the data lies. The second (middle) quartile
‘Q2’ is the value below which one-half of the data lies. The third (upper) quartile ‘Q3’ is the value below
which three-quarters of the data lies.

Formula: $Q_1 = \left(\frac{N+1}{4}\right)^{\text{th}} \text{ term}$

$Q_3 = \left(\frac{3(N+1)}{4}\right)^{\text{th}} \text{ term}$
The Interquartile Range of a set of values is the difference between the upper and lower quartiles.
Formula:
I.Q.R. = Q3 – Q1
The Semi-Interquartile Range of a set of values is half the difference between the upper and
lower quartiles. Hence, it is half of the Interquartile Range.

$$\text{S.I.Q.R.} = \frac{\text{I.Q.R.}}{2} = \frac{Q_3 - Q_1}{2}$$
Standard Deviation
The standard deviation of a set of values tells us, on average, how far each value lies from the
central value.
Formula:
$$s_x = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2} = \sqrt{\frac{\sum X^2}{n} - \bar{X}^2}$$

EXAMPLE
Given the following numbers, find the range, Interquartile range, semi-interquartile range and
standard deviation.

3, 5, 7, 9, 5, 8, 6, 5, 3, 4, 7, 8, 5, 5, 2
First, again we would arrange the values in order.
2, 3, 3, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9

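Range = Highest – Lowest
= 9 – 2
= 7

$$Q_1 = \left(\frac{N+1}{4}\right)^{\text{th}} \text{ term} = \left(\frac{15+1}{4}\right)^{\text{th}} \text{ term} = 4^{\text{th}} \text{ term} = 4$$
$$Q_3 = \left(\frac{3(N+1)}{4}\right)^{\text{th}} \text{ term} = \left(\frac{3(15+1)}{4}\right)^{\text{th}} \text{ term} = 12^{\text{th}} \text{ term} = 7$$
Interquartile range = Q3 – Q1
= 7 – 4
= 3

$$\text{Semi-interquartile range} = \frac{\text{I.Q.R.}}{2} = \frac{Q_3 - Q_1}{2} = \frac{7-4}{2} = 1.5$$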
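$$\text{Standard deviation} = \sqrt{\frac{\sum X^2}{n} - \bar{X}^2}$$

We would use the mean we found earlier, mean = 5.47.

We would then find the square of each of the values:

(2, 3, 3, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9)² → (4, 9, 9, 16, 25, 25, 25, 25, 25, 36, 49, 49, 64, 64, 81)
$$\frac{\sum X^2}{n} = \frac{4+9+9+16+25+25+25+25+25+36+49+49+64+64+81}{15} = \frac{506}{15}$$
= 33.73 (2 decimal places)

$$\text{Standard deviation} = \sqrt{\frac{\sum X^2}{n} - \bar{X}^2} = \sqrt{33.73 - (5.47)^2} = \sqrt{33.73 - 29.9209} = \sqrt{3.8091}$$
= 1.95
The standard deviation is never negative, so only the positive square root is taken. Working
with unrounded values throughout gives 1.96 to 2 decimal places; the small difference comes from
rounding the mean to 5.47 before squaring it.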
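The measures of dispersion in this example can be reproduced with the sketch below. It assumes Python 3.8 or later and uses only the standard-library `statistics` module. Its default `exclusive` quantile method places the quartiles at the (N + 1)/4 and 3(N + 1)/4 positions, which matches the formulas used in these notes. The variable names are illustrative.

```python
from statistics import pstdev, quantiles

# Data set from the worked example
data = [3, 5, 7, 9, 5, 8, 6, 5, 3, 4, 7, 8, 5, 5, 2]

# Range: largest observation minus smallest observation
data_range = max(data) - min(data)

# Quartiles: quantiles() sorts the data itself and returns [Q1, Q2, Q3] for n=4
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1        # interquartile range
siqr = iqr / 2       # semi-interquartile range

# Population standard deviation, i.e. sqrt(sum(x^2)/n - mean^2) with no intermediate rounding
sd = pstdev(data)

print("Range:", data_range)
print("Q1:", q1, "Q3:", q3)
print("I.Q.R.:", iqr, "S.I.Q.R.:", siqr)
print("Standard deviation (2 d.p.):", round(sd, 2))
```

Because no intermediate value is rounded here, the standard deviation prints as 1.96 rather than the 1.95 obtained by hand with the rounded mean.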
Statistical Diagrams
There are a number of statistical diagrams, ranging from tables to charts and graphs. We
will first explore the concepts of grouped vs. ungrouped data, and also look at which diagrams
are better suited to illustrating which sort of data.
We will use a tally chart, frequency table and cumulative frequency table to illustrate the
difference.
EXAMPLE
Given the numbers below, construct a cumulative frequency table to illustrate the data.
3, 2, 4, 5, 1, 4, 5, 3, 5, 4, 3, 2, 3, 1, 1 ,1, 3, 2, 4, 2, 5, 2, 3, 4
Class Tally Frequency Cumulative Frequency
1 IIII 4 4
2 IIII 5 5+4=9
3 IIII I 6 9 + 6 = 15
4 IIII 5 15 + 5 = 20
5 IIII 4 20 + 4 = 24
The table above shows cumulative frequency of ungrouped data
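As a rough sketch, the frequency and cumulative frequency columns of the ungrouped table can be built in Python with `collections.Counter` and `itertools.accumulate` from the standard library. The variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Raw data from the ungrouped example
data = [3, 2, 4, 5, 1, 4, 5, 3, 5, 4, 3, 2, 3, 1, 1, 1, 3, 2, 4, 2, 5, 2, 3, 4]

counts = Counter(data)                        # value -> how many times it appears
classes = sorted(counts)                      # each distinct value is its own class
frequencies = [counts[c] for c in classes]
cumulative = list(accumulate(frequencies))    # running total of the frequencies

print("Class  Frequency  Cumulative Frequency")
for c, f, cf in zip(classes, frequencies, cumulative):
    print(f"{c:<7}{f:<11}{cf}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check that no value was missed.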

Given the numbers below, construct a cumulative frequency table to illustrate the data.
2, 3, 7, 11, 12, 21, 24, 18, 19, 25, 19, 10, 8, 9, 7, 6, 16, 20, 22, 17, 4, 12, 18
Class Tally Frequency Cumulative Frequency
1–5 III 3 3
6 – 10 IIII I 6 3+6=9
11 – 15 III 3 9 + 3 = 12
16 – 20 IIII I 6 12 + 6 = 18
21 – 25 IIII 5 18 + 5 = 23
The table above shows cumulative frequency of grouped data.
Grouped vs. ungrouped data
Ungrouped data is simply data where each value is its own class, whereas grouped
data is data where each class is a range (group) of values, and the members of the class are the
values that fall within that range.
Class characteristics (Grouped data)
• Class Interval – this is defined as a grouping (e.g., 1 – 5).
• Class Limits – these are the end values of a class interval. The higher limit is called the
‘upper limit’ and the other is called the ‘lower limit’.
• Class Boundaries – these are the limits adjusted by 0.5 (when the limits are whole numbers). The
‘upper boundary’ is the ‘upper limit + 0.5’ and the ‘lower boundary’ is the ‘lower limit – 0.5’.
• Class Width – this is the distance from the lower boundary to the upper boundary.
It is found by subtracting the lower boundary from the upper boundary.
• Class Midpoint – this is defined as the average of the lower limit/boundary and the upper
limit/boundary. It is found by adding either the two boundaries or the two limits together, then
dividing by two (2).
EXAMPLE
| Class/Class Interval | Class Limits | Class Boundaries | Class Width (U.B – L.B) | Class Midpoint |
| 1 – 5 | Upper limit = 5, Lower limit = 1 | U.B = 5 + 0.5 = 5.5, L.B = 1 – 0.5 = 0.5 | 5.5 – 0.5 = 5 | (5 + 1) / 2 = 3 |
| 6 – 10 | Upper limit = 10, Lower limit = 6 | U.B = 10 + 0.5 = 10.5, L.B = 6 – 0.5 = 5.5 | 10.5 – 5.5 = 5 | (10 + 6) / 2 = 8 |
| 11 – 15 | Upper limit = 15, Lower limit = 11 | U.B = 15 + 0.5 = 15.5, L.B = 11 – 0.5 = 10.5 | 15.5 – 10.5 = 5 | (15 + 11) / 2 = 13 |
| 16 – 20 | Upper limit = 20, Lower limit = 16 | U.B = 20 + 0.5 = 20.5, L.B = 16 – 0.5 = 15.5 | 20.5 – 15.5 = 5 | (20 + 16) / 2 = 18 |
| 21 – 25 | Upper limit = 25, Lower limit = 21 | U.B = 25 + 0.5 = 25.5, L.B = 21 – 0.5 = 20.5 | 25.5 – 20.5 = 5 | (25 + 21) / 2 = 23 |
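The class characteristics in the table can be computed with a small Python sketch. The helper name `describe_class` is hypothetical, and the 0.5 adjustment assumes whole-number class limits, as in this example.

```python
# Class limits (lower, upper) taken from the example table
class_limits = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25)]


def describe_class(lower, upper):
    """Return (lower boundary, upper boundary, width, midpoint) for whole-number limits."""
    lower_boundary = lower - 0.5
    upper_boundary = upper + 0.5
    width = upper_boundary - lower_boundary   # upper boundary minus lower boundary
    midpoint = (lower + upper) / 2            # average of the two limits
    return lower_boundary, upper_boundary, width, midpoint


for lower, upper in class_limits:
    lb, ub, width, mid = describe_class(lower, upper)
    print(f"{lower} - {upper}: boundaries {lb} to {ub}, width {width}, midpoint {mid}")
```

Averaging the two boundaries gives the same midpoint as averaging the two limits, since the +0.5 and –0.5 adjustments cancel.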

Finding the mean, median and mode of grouped vs. ungrouped data
EXAMPLE
Class Tally Frequency Cumulative Frequency
1 IIII 4 4
2 IIII 5 9
3 IIII I 6 15
4 IIII 5 20
5 IIII 4 24
The table above shows cumulative frequency of ungrouped data
Mode – the mode is the class with the highest frequency, which is 3.
Median – the median is the middle value after the data has been arranged.

$$Q_2 = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ term} = \left(\frac{24+1}{2}\right)^{\text{th}} \text{ term} = 12.5^{\text{th}} \text{ term}$$
Notice that the table is already in order, so we can use the cumulative frequency column to
determine the median. The 12.5th term lies between the 12th and 13th terms. The cumulative
frequency column shows that the 10th to the 15th terms all fall in class 3, which means that the
median of the data set is 3.
Mean – the arithmetic average of all the numbers. We normally add all the values and then divide;
this time, since the table has already collected equal values together, we can multiply.
Class 1 has a frequency of four (4), which means that we have four 1s in the data set;
we therefore multiply 1 by 4. Similarly, class 2 has a frequency of five (5), which means that
we have five 2s in the data set, hence we multiply 2 by 5, and so on.
1 x 4 = 4, 2 x 5 = 10, 3 x 6 = 18, 4 x 5 = 20 and 5 x 4 = 20
Now adding all the values and dividing by the total frequency gives the mean:
$$\text{Mean} = \frac{4+10+18+20+20}{24} = \frac{72}{24}$$
= 3
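Notice that the mean, median and mode all have the same value here. That is because this data
set is symmetrical about 3. The three averages do not always agree; they coincide when the
distribution is symmetrical with a single peak, and generally differ when it is skewed.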
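The same three averages can be worked out directly from the frequency table, without listing every value. The sketch below is one possible way to do it in plain Python; the names `freq_table` and `term` are illustrative.

```python
# Frequency table from the ungrouped example: value -> frequency
freq_table = {1: 4, 2: 5, 3: 6, 4: 5, 5: 4}

n = sum(freq_table.values())   # total number of observations

# Mean: sum of (value x frequency) divided by the total frequency
mean = sum(x * f for x, f in freq_table.items()) / n

# Mode: the value(s) with the highest frequency
highest = max(freq_table.values())
modes = [x for x, f in freq_table.items() if f == highest]


def term(position):
    """Return the value at a 1-based position by walking the cumulative frequency."""
    running = 0
    for x in sorted(freq_table):
        running += freq_table[x]
        if position <= running:
            return x
    raise ValueError("position is beyond the end of the data")


# Median: the (n + 1)/2-th term; if that falls between two terms, average them
position = (n + 1) / 2
lower_pos = int(position)
upper_pos = lower_pos if position == lower_pos else lower_pos + 1
median = (term(lower_pos) + term(upper_pos)) / 2

print("Mean:", mean, "Median:", median, "Mode(s):", modes)
```

Here the 12th and 13th terms are both 3, so averaging them still gives a median of 3.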

Class Tally Frequency Cumulative Frequency


1–5 III 3 3
6 – 10 IIII I 6 9
11 – 15 III 3 12
16 – 20 IIII I 6 18
21 – 25 IIII 5 23
The table above shows cumulative frequency of grouped data.
Once again, the mode is the class with the highest frequency. Here the modal classes are
6 – 10 and 16 – 20; remember, we can have multiple modes.
Median – the median class is the class that contains the middle term once the data has been arranged.

$$Q_2 = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ term} = \left(\frac{23+1}{2}\right)^{\text{th}} \text{ term} = 12^{\text{th}} \text{ term}$$
The twelfth term falls in the class 11 – 15, where the cumulative frequency reaches 12. Hence that
is our median class, the class where the median is found.
Again, we have a process to find the mean. Given that we do not have the exact values, we can only
estimate the mean of grouped data. We do this by finding the midpoint of each class and then
multiplying the midpoint by the frequency. The rest of the process remains the same as in the
ungrouped example above.
Class      Class midpoint   Frequency   Midpoint x Frequency
1 – 5      3                3           3 x 3 = 9
6 – 10     8                6           8 x 6 = 48
11 – 15    13               3           13 x 3 = 39
16 – 20    18               6           18 x 6 = 108
21 – 25    23               5           23 x 5 = 115

Now adding all the values and dividing by the total frequency gives the mean:
$$\text{Mean} = \frac{9+48+39+108+115}{23} = \frac{319}{23}$$
= 13.87 (2 decimal places)
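For grouped data the same steps can be sketched in Python, working from the frequency table rather than the raw values. The dictionary layout and variable names are illustrative choices, not part of the original notes. The mean is only an estimate because each class is represented by its midpoint.

```python
# Grouped frequency table from the example: (lower limit, upper limit) -> frequency
grouped = {(1, 5): 3, (6, 10): 6, (11, 15): 3, (16, 20): 6, (21, 25): 5}

n = sum(grouped.values())   # total frequency

# Estimated mean: sum of (midpoint x frequency) divided by the total frequency
estimated_mean = sum(((lo + hi) / 2) * f for (lo, hi), f in grouped.items()) / n

# Modal class(es): the class or classes with the highest frequency
highest = max(grouped.values())
modal_classes = [c for c, f in grouped.items() if f == highest]

# Median class: the first class whose cumulative frequency reaches the (n + 1)/2-th term
median_position = (n + 1) / 2
running = 0
median_class = None
for c, f in grouped.items():   # dicts keep insertion order, so classes are visited in order
    running += f
    if running >= median_position:
        median_class = c
        break

print("Estimated mean (2 d.p.):", round(estimated_mean, 2))
print("Modal class(es):", modal_classes)
print("Median class:", median_class)
```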

Constructing statistical charts and graphs


Bar/Column chart – this is where we have the classes on the ‘x’ axis and the frequency on the
‘y’ axis for a bar chart, and the reverse for a column chart. In bar charts the individual bars
should not touch each other. Also note that the ‘y’ axis should be labelled (numbered) on a
regular scale, whether you decide to count in 2s, 5s, 10s, etc.
EXAMPLE
Histogram – this is where we have the classes on the ‘x’ axis and the frequency on the ‘y’ axis.
It should not be confused with a bar chart, since in a histogram the bars MUST touch each
other; there should be no space between them. Similarly, the ‘y’ axis should be numbered on a
regular scale.
EXAMPLE

Pie Chart – perhaps the hardest chart to construct at this stage. We first need to determine the
angle (portion of the pie) each class takes up, then we proceed to draw our pie chart. The angle
for a class is its frequency divided by the total frequency, multiplied by 360°.
EXAMPLE
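As an illustration of the pie chart step, the sketch below works out the angle for each class, using the ungrouped frequency table from earlier in these notes as assumed input. Each angle is the class frequency as a fraction of the total, multiplied by 360°. The variable names are illustrative.

```python
# Frequency table reused from the ungrouped example: class -> frequency
freq_table = {1: 4, 2: 5, 3: 6, 4: 5, 5: 4}

total = sum(freq_table.values())

# Angle for each class = frequency / total frequency x 360 degrees
angles = {c: f * 360 / total for c, f in freq_table.items()}

for c, angle in angles.items():
    print(f"Class {c}: {angle} degrees")

print("Sum of angles:", sum(angles.values()))
```

The angles should always add up to 360°, which is a useful check before drawing the chart.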
Probability
Probability is defined as the measure of how likely an event is to occur. A probability must lie
between zero (0) and one (1) inclusive. One (1) represents total certainty, while zero (0)
represents impossibility.
Example:
Probability of a person having three heads is 0
Probability that a person will die is 1

Terminology
• Sample space – the sample space (U) is the set of all possible outcomes of a given
experiment. Each element of the sample space is called a sample point or outcome.
• Event – a set of outcomes that fit a given criterion.
• Equally likely events – two events are equally likely when both have the same chance of
occurring at any given time. It also means that, over a long period of time, each event
is expected to occur about an equal number of times.
Example:
  • Tossing a fair coin: each time you toss the coin it has an equal chance of landing on
  ‘head’ or ‘tail’. Also, over a large number of tosses you are expected to get about the same
  number of heads as tails.
  • Rolling a fair die: each of the six faces has the same chance of landing face up.
• Impossible Event – if the probability of an event occurring is zero (0), this event is said to
be impossible (i.e., would never occur).
• Certain Event – if the probability of an event occurring is one (1), this event is said to
be certain (i.e., would always occur).

Notations
Events are always denoted with capital letters.
• The probability of event A occurring – P(A)
• The probability of the sample space – P(U)
• The probability of the complement of event A (event A not occurring) – P(A’)
• The number of outcomes in which event A occurs – n(A)
• The number of outcomes in which event A does not occur – n(A’)
• The number of possible outcomes (sample space) – n(U) or n(T)
Formulae
• The probability of event A occurring [P(A)], when all outcomes are equally likely
$$P(A) = \frac{\text{The number of favourable outcomes}}{\text{The total number of possible outcomes}} = \frac{n(A)}{n(U)}$$

• The probability of event A not occurring [P(A’)]
= Probability of sample space – Probability of event A
= P(U) – P(A)
= 1 – P(A)
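The two formulae above can be tried out with a minimal Python sketch. The experiment (one roll of a fair six-sided die) and the event (a number greater than 4) are hypothetical choices made for illustration. `fractions.Fraction` from the standard library keeps the answers as exact fractions.

```python
from fractions import Fraction

# Hypothetical experiment: rolling a fair six-sided die once
sample_space = {1, 2, 3, 4, 5, 6}

# Hypothetical event A: the number rolled is greater than 4
event_a = {outcome for outcome in sample_space if outcome > 4}


def probability(event, space):
    """P(event) = n(event) / n(space), assuming every outcome is equally likely."""
    return Fraction(len(event), len(space))


p_a = probability(event_a, sample_space)   # P(A) = n(A) / n(U)
p_not_a = 1 - p_a                          # P(A') = 1 - P(A)

print("P(A) =", p_a)
print("P(A') =", p_not_a)
```

An event with no outcomes in the sample space gives a probability of 0, and the sample space itself gives a probability of 1, matching the definitions of impossible and certain events.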

Questions:
1. A bag contains 60 marbles: 45 green and 15 red.
a) What is the probability of drawing a green marble?
b) What is the probability of drawing a red marble?
c) If 15 green marbles are removed from the bag, what is the chance now of drawing
a green marble?
d) What is the chance of drawing a yellow marble?
e) What is the probability of drawing either a green marble or a red marble?
2. A card is chosen from a standard pack of 52 cards. What is the probability that the card is:
a) An ace
b) A red card
c) A spade
d) A red king
e) A jack of clubs
f) A black diamond
3. If a letter is taken at random from the words MATHEMATICS OLYMPIAD, what is the
probability that:
a) It is a vowel
b) It is an M
c) It is an O
d) It is a T
Frequency curves
If a large sample is taken from a very large population and a frequency polygon is drawn
for class midpoints which are close to each other, we can draw a smooth
curve through the points instead of straight lines. We say that the frequency polygon tends
towards a smooth continuous curve, called the frequency curve. This is seen below.

Types of Frequency Curves


There are three types of frequency curves that we need to have knowledge of:
1. The normal (symmetrical) curve
2. The negatively skewed curve
3. The positively skewed curve

The Normal Curve

Whenever we plot a variable against the corresponding frequency and a symmetrical bell-shaped
curve is obtained, this is said to be a normal probability curve and represents a normal
distribution. In a normal distribution, the measures of central tendency (the mean, the median and
the mode) all coincide. That is, they all have the same value. Examples of data that approximately
follow a normal distribution are height, weight, mass, test scores, etc.
Skewness
When a frequency curve is drawn and the graph obtained is non-symmetrical, then the
curve is said to be lopsided or skewed. The data is said to represent a skewed distribution.

The Negatively Skewed Curve

The negatively skewed distribution is one that is skewed to the left. This distribution can
be obtained from the results of a test in which most of the students performed well and only a
few students performed poorly.
In a negatively skewed distribution, the measures of central tendency are all different, in
such a way that the mean is less than the median, which is less than the mode. That means
‘mean < median < mode’.

The Positively Skewed Curve

The positively skewed distribution is one that is skewed to the right. This distribution can be
obtained from the results of a test in which most of the students performed poorly and only a few
students performed well. A number of positively skewed distributions occur in their own right,
for example, the number of children per family.
In a positively skewed distribution, the measures of central tendency are all different, in
such a way that the mean is more than the median, which is more than the mode. That means
‘mean > median > mode’.
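The ordering of the three averages in skewed data can be seen with a small Python sketch. Both data sets below are made up purely for illustration (they do not come from the notes), and only the standard-library `statistics` module is used.

```python
from statistics import mean, median, mode

# Hypothetical test scores out of 10: most students did well, a few did poorly
negatively_skewed = [2, 5, 7, 8, 8, 9, 9, 9, 10]

# Hypothetical number of children per family: most families are small, a few are large
positively_skewed = [0, 1, 1, 1, 2, 2, 3, 4, 6]


def averages(values):
    """Return the (mean, median, mode) of a list of values."""
    return mean(values), median(values), mode(values)


neg_mean, neg_median, neg_mode = averages(negatively_skewed)
pos_mean, pos_median, pos_mode = averages(positively_skewed)

print("Negatively skewed:", round(neg_mean, 2), neg_median, neg_mode)
print("Positively skewed:", round(pos_mean, 2), pos_median, pos_mode)
```

In the first data set the few low scores pull the mean below the median and mode; in the second the few large families pull the mean above them.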
