
L01: Basic statistics

2007 Winter Math Course, COMP Department, HKUST.
Statistics
 Statistics is the science dealing with the
collection, organization, analysis, and
interpretation of numerical data
 We will focus on the analysis of data
 Two topics will be discussed
 Measures of central tendency (average)
 Measures of variation (dispersion)
Measures of central tendency
 Motivation
 We have a large amount of data in hand; how can
we reduce it to a human-understandable form?
 All values lie in between two extreme values
 Can we make use of a single number to
represent such a data set?
 Goal
 Summarize the data set by computing its “average”
Types of averages
 There are several methods to measure the
central tendency using different averages
 List of different averages
 Arithmetic mean or simply, mean
 Median
 Mode
Arithmetic mean
 Definition
 Suppose we have N numbers x1, x2, …, xN
 The arithmetic mean is defined as
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Merits and demerits of mean
 Merits
 Mean is well understood by most people
 Computation of mean is easy
 Demerits
 Sensitive to extreme values
 For example:
 X={1,1,1,1,2,9}, mean(X)=2.5, which does not
reflect the actual central tendency of this set of
numbers
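As a small illustration of this sensitivity, the sketch below recomputes the mean of the example set with and without its extreme value. The helper name `mean` and the variable names are chosen here for illustration only.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

X = [1, 1, 1, 1, 2, 9]

mean_all = mean(X)                   # 15 / 6
mean_without_extreme = mean(X[:-1])  # 6 / 5, dropping the value 9

print(mean_all, mean_without_extreme)
```

Removing the single value 9 moves the mean from 2.5 to 1.2, much closer to where most of the data lie.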
Median
 Definition
 It divides the numbers into two halves such
that the number of items below it is the same
as the number of items above it
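 Suppose we have n numbers x1, x2, …, xn, sorted in ascending order.
 Median is defined as
$$\mathrm{Median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$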
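A minimal Python sketch, taking the usual convention that the median is the middle value of the sorted numbers when n is odd and the mean of the two middle values when n is even. The function name `median` and the sample lists are illustrative, and a non-empty input is assumed.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)          # the definition assumes sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle value
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([5, 1, 3])             # sorted: 1, 3, 5
even_case = median([1, 1, 1, 1, 2, 9])   # two middle values are 1 and 1
```

For the set {1,1,1,1,2,9} the median is 1, whereas the mean is 2.5.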
Merits and demerits of median
 Merits
 Another widely used measure of central
tendency
 It is not influenced by extreme values
 Demerits
 When the number of items is small, the median
may not be representative, because it is a
positional average
Mode
 Definition
 Mode is defined as the most frequent value in
a set of numbers
 Example: X={1,1,3,3,3,3,4,5,6}, Mode(X)=3
 Merits and demerits
 Merits
 It represents the most typical value in the
distribution
 Demerits
 It may not be uniquely defined
 Example: X={1,1,2,2}, Mode(X)=1 or 2
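Because the mode may not be unique, one reasonable sketch returns every value that attains the highest frequency. The function name `modes` is an assumption made for this example.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, in ascending order."""
    counts = Counter(values)              # value -> number of occurrences
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

single = modes([1, 1, 3, 3, 3, 3, 4, 5, 6])
tied = modes([1, 1, 2, 2])
```

Here `single` holds one mode (3), while `tied` holds both 1 and 2, matching the two examples above.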
Measures of variation (dispersion)
 Motivation
 Sometimes using measures of central tendency
alone is not enough
 Two data sets which look very different from
each other may have the same average value
 How can we solve this problem?
 Goal
 Compute a value called “variation” or
“dispersion” that characterizes how the data vary on
each side of the average value
Types of variation (dispersion)
 There are several methods to measure the
variation of a data set
 List of different measures
 Range
 Mean deviation
 Variance and standard deviation
Range
 Definition
 Range is the difference between the largest
and the smallest numbers
 Suppose we have N numbers X={x1,x2,…,xN},
then Range(X) = max(X) - min(X)
 Merits
 Meaningful in some scenarios
 Easy to compute and well understood
 Demerits
 Greatly affected by extreme values
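The following sketch computes the range and shows how strongly one extreme value affects it. The name `value_range` and the sample set (reused from the mean example) are illustrative choices.

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

X = [1, 1, 1, 1, 2, 9]

range_all = value_range(X)            # 9 - 1
range_trimmed = value_range(X[:-1])   # 2 - 1, dropping the value 9
```

Dropping the single value 9 shrinks the range from 8 to 1.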
Mean deviation
 Definition
 First, sum the absolute differences between
every item value and the mean of the
distribution.
 Then divide the sum by the number of items
 Suppose we have N numbers x1, x2, …, xN with mean $\bar{x}$.
The mean deviation is defined as
$$\mathrm{MD} = \frac{1}{N}\sum_{i=1}^{N} \lvert x_i - \bar{x} \rvert$$
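The two steps described above (sum the absolute differences from the mean, then divide by the number of items) can be sketched as follows. The function name `mean_deviation` is hypothetical.

```python
def mean_deviation(values):
    """Average absolute difference between each value and the mean."""
    n = len(values)
    m = sum(values) / n
    return sum(abs(x - m) for x in values) / n

X = [1, 1, 1, 1, 2, 9]
md = mean_deviation(X)   # mean is 2.5; absolute differences sum to 13
```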
Merits and demerits of mean deviation
 Merits
 Relatively easy to understand
 Less affected by the extreme values
 Demerits
 We have no short-cut method to compute
mean deviation
 Given only the means and mean deviations of
two sets, we cannot compute the mean
deviation of the combined set. We need to
know every item in the combined set
Variance and standard deviation
 Definition
 Suppose we have N numbers x1, x2, …, xN with mean $\bar{x}$.
The variance σ² is defined as
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$
 The square root of variance, σ, is called the
standard deviation
 In the literature, N-1 is sometimes used as the
denominator in the computation of σ, instead
of N
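A minimal sketch of both conventions, assuming the variance is the average squared difference from the mean. The parameter name `ddof` (0 for the N denominator, 1 for N-1) is borrowed here as an illustrative convention and is not part of the slides.

```python
import math

def variance(values, ddof=0):
    """Variance with denominator N - ddof (ddof=0 gives N, ddof=1 gives N-1)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - ddof)

X = [1, 1, 1, 1, 2, 9]

var_n = variance(X)                 # squared differences sum to 51.5, divided by 6
var_n_minus_1 = variance(X, ddof=1) # same sum, divided by 5
std_n = math.sqrt(var_n)            # standard deviation with the N denominator
```

The N-1 version is always slightly larger, and the gap shrinks as N grows.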
Merits and demerits of standard deviation
 Merits
 Standard deviation is the most common
method used to measure variation
 A short-cut method exists to compute the standard
deviation
 Demerits
 More affected by extreme values in comparison
with the mean deviation
 Reason for this?
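One likely reason is that the deviations are squared, so a large deviation counts disproportionately. The sketch below, using an illustrative data set, compares how much of the total the most extreme item contributes under absolute and squared deviations.

```python
X = [1, 1, 1, 1, 2, 9]
m = sum(X) / len(X)                      # mean is 2.5

abs_dev = [abs(x - m) for x in X]        # terms summed by the mean deviation
sq_dev = [(x - m) ** 2 for x in X]       # terms summed by the variance

# Fraction of each total contributed by the most extreme item (the value 9)
share_abs = max(abs_dev) / sum(abs_dev)  # 6.5 / 13
share_sq = max(sq_dev) / sum(sq_dev)     # 42.25 / 51.5
```

The value 9 accounts for half of the summed absolute deviations but roughly 82% of the summed squared deviations.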
Standard deviation
 Besides computing σ directly from the definition, there is a
short-cut method. How can it be derived?
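One possible derivation, assuming the short-cut in question is the standard identity "mean of the squares minus the square of the mean" and that the variance uses the N denominator. Expanding the square and using $\frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}$:

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 \\
         &= \frac{1}{N}\sum_{i=1}^{N} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right) \\
         &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 - 2\bar{x}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i + \bar{x}^2 \\
         &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2
\end{aligned}
```

Taking the square root gives $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2}$, which needs only the sum of the values and the sum of their squares.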
Combined standard deviation
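 Let x1,x2,…,xm and y1,y2,…,yn be two sets of data
with means $\bar{x}$, $\bar{y}$ and standard deviations σx
and σy respectively
 Then the combined mean $\bar{z}$ and the combined standard
deviation σz can be computed as follows:
$$\bar{z} = \frac{m\bar{x} + n\bar{y}}{m+n}$$
$$\sigma_z = \sqrt{\frac{m\left(\sigma_x^2 + (\bar{x}-\bar{z})^2\right) + n\left(\sigma_y^2 + (\bar{y}-\bar{z})^2\right)}{m+n}}$$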
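A minimal sketch of one standard way to combine the statistics, assuming the N-denominator definition of variance: the combined mean is the size-weighted average of the two means, and each set contributes its own variance plus the squared shift of its mean from the combined mean. The function names and sample data are illustrative.

```python
import math

def mean_std(values):
    """Mean and standard deviation (N denominator) computed directly."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / n)

def combined_mean_std(m, mean_x, std_x, n, mean_y, std_y):
    """Combine two sets' sizes, means and standard deviations."""
    mean_z = (m * mean_x + n * mean_y) / (m + n)
    var_z = (
        m * (std_x ** 2 + (mean_x - mean_z) ** 2)
        + n * (std_y ** 2 + (mean_y - mean_z) ** 2)
    ) / (m + n)
    return mean_z, math.sqrt(var_z)

xs = [1, 1, 1]
ys = [1, 2, 9]

mean_x, std_x = mean_std(xs)
mean_y, std_y = mean_std(ys)
combined = combined_mean_std(len(xs), mean_x, std_x, len(ys), mean_y, std_y)
direct = mean_std(xs + ys)   # reference: computed from all items at once
```

Unlike the mean deviation, the combined result agrees with the direct computation without access to the individual items.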
