Analysing Data

Quantitative Research Methods

Summary Measures
Describing Data Numerically
• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
Overview
Central Tendency
• Mean: the arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: the midpoint of the ranked values
• Mode: the most frequently observed value
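Arithmetic Mean
• The arithmetic mean (mean) is the most common measure of central tendency
• For a population of N values:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
where the x_i are the population values and N is the population size
• For a sample of size n:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where the x_i are the observed values and n is the sample size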
Arithmetic Mean (continued)
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
[Figure: two dot plots on a 0 to 10 axis]
Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
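The effect of a single extreme value on the mean can be checked directly. The following is a minimal sketch using Python's standard `statistics` module and the two small data sets from the dot plots; the variable names are illustrative only.

```python
from statistics import mean

# The two data sets differ only in their largest value.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

mean_without = mean(without_outlier)  # 15 / 5
mean_with = mean(with_outlier)        # 20 / 5

print(mean_without, mean_with)
```

Changing one value from 5 to 10 moves the mean from 3 to 4, even though the other four values are unchanged.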
Median
• In an ordered array, the median is the "middle" number, i.e., the number that splits the distribution in half
• The median is not affected by extreme values
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]
Median (continued)
• To find the median, sort the n data values from low to high (sorted data is called a data array)
• Find the value in the i = (1/2)n position
• The ith position is called the Median Index Point
• If i is not an integer, round up to the next highest integer
• If i is an integer, the median is the average of the values in positions i and i + 1
Median Example
Data array:
4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24
• Note that n = 13
• Find the i = (1/2)n position: i = (1/2)(13) = 6.5
• Since 6.5 is not an integer, round up to 7
• The median is the value in the 7th position: Md = 12
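The index-point procedure can be sketched in a few lines of Python. The function name `median_by_index_point` is hypothetical. For an even number of values, where i is already an integer, this sketch assumes the usual convention of averaging the values in positions i and i + 1.

```python
import math
from statistics import median


def median_by_index_point(values):
    """Median via the i = (1/2)n index-point rule (positions are 1-based)."""
    data = sorted(values)            # the sorted "data array"
    n = len(data)
    if n == 0:
        raise ValueError("no data values")
    if n % 2 == 1:
        i = math.ceil(n / 2)         # i is not an integer, so round up
        return data[i - 1]
    i = n // 2                       # i is an integer
    # Assumption: average the values in positions i and i + 1.
    return (data[i - 1] + data[i]) / 2


data_array = [4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24]
md = median_by_index_point(data_array)
print(md)
```

For the 13-value data array the function picks the 7th value, which agrees with `statistics.median` from the standard library.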
Shape of a Distribution
• Describes how the data are distributed
• Symmetric or skewed
Left-Skewed: Mean < Median (longer tail extends to left)
Symmetric: Mean = Median
Right-Skewed: Median < Mean (longer tail extends to right)
Mode
 A measure of location
 The value that occurs most often
 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode
 There may be several modes
[Figure: two dot plots; the first (0 to 14 axis) has Mode = 5, the second (0 to 6 axis) has no mode]
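The three cases (one mode, several modes, no mode) can be illustrated with a short Python sketch. The helper `modes` and all of the data below are hypothetical and chosen only to show each case. Treating "every value occurs once" as "no mode" is the convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []                    # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


single_mode = modes([1, 3, 5, 5, 5, 9])         # one value clearly most frequent
no_mode = modes([0, 1, 2, 3, 4, 5, 6])          # nothing repeats
two_modes = modes([1, 1, 2, 2, 3])              # a tie between two values
categorical = modes(["red", "blue", "red"])     # works for categories too

print(single_mode, no_mode, two_modes, categorical)
```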
Weighted Mean
• Used when values are grouped by frequency or relative importance
Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted Mean Days to Complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} \approx 6.31$ days
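The repair-project calculation can be reproduced as a small Python sketch, with the frequencies acting as the weights w_i and the days to complete as the values x_i. The variable names are illustrative.

```python
# Days to complete (x_i) and how many projects took that long (w_i).
days = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]

weighted_sum = sum(w * x for w, x in zip(frequency, days))  # sum of w_i * x_i
total_weight = sum(frequency)                               # sum of w_i
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight, round(weighted_mean, 2))
```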
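Geometric Mean
• Geometric mean
• Used to measure the rate of change of a variable over time
$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n}$
• Geometric mean rate of return
• Measures the status of an investment over time
$\bar{r}_g = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} - 1$
• Where x_i is the growth factor in time period i, that is, 1 plus the rate of return in that period (a 50% return gives x_i = 1.50)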
Example
An investment of $100,000 rose to $150,000 at the end of year one and increased to $180,000 at the end of year two:
X1 = $100,000, X2 = $150,000, X3 = $180,000
Year one: 50% increase; year two: 20% increase
What is the mean percentage return over time?
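Example (continued)
Use the 1-year returns to compute the arithmetic mean and the geometric mean:
Arithmetic mean rate of return:
$\bar{X} = \frac{50\% + 20\%}{2} = 35\%$ (misleading result)
Geometric mean rate of return, using the growth factors 1.50 and 1.20 (1 plus each yearly return):
$\bar{r}_g = (1.50 \times 1.20)^{1/2} - 1 = (1.80)^{1/2} - 1 \approx 1.3416 - 1 = 34.16\%$ (more accurate result)
Check: $100,000 × 1.3416 × 1.3416 ≈ $180,000, the actual ending value, whereas compounding at 35% for two years would give $182,250.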
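The investment example can be verified numerically. The sketch below treats each period's x_i as a growth factor (ending value divided by starting value, i.e. 1 plus the return), which is the form the geometric mean needs in order to reproduce the ending balance. The variable names are illustrative.

```python
import math

values = [100_000, 150_000, 180_000]   # value at start, end of year 1, end of year 2

# Growth factor per year = later value / earlier value (1 + rate of return).
growth_factors = [later / earlier for earlier, later in zip(values, values[1:])]
returns = [g - 1 for g in growth_factors]          # about 0.50 and 0.20
n = len(growth_factors)

arithmetic_mean_return = sum(returns) / n
geometric_mean_return = math.prod(growth_factors) ** (1 / n) - 1

# Compound the starting value at each mean rate to see which one is consistent.
end_with_geometric = values[0] * (1 + geometric_mean_return) ** n
end_with_arithmetic = values[0] * (1 + arithmetic_mean_return) ** n

print(f"arithmetic: {arithmetic_mean_return:.2%}, geometric: {geometric_mean_return:.2%}")
print(round(end_with_geometric), round(end_with_arithmetic))
```

Compounding at the geometric mean rate recovers the actual $180,000 ending value, while the arithmetic mean rate overshoots it.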
Review Example
• Five houses on a hill by the beach
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Review Example: Summary Statistics
House Prices: $2,000,000; 500,000; 300,000; 100,000; 100,000 (Sum = $3,000,000)
• Mean: $3,000,000 / 5 = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
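The three summary statistics for the house prices can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

price_mean = mean(house_prices)      # pulled upward by the $2,000,000 house
price_median = median(house_prices)  # middle value of the ranked data
price_mode = mode(house_prices)      # most frequent value

print(price_mean, price_median, price_mode)
```

The single expensive house lifts the mean to twice the median, which is why the median is often preferred for data like these.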
Which measure of location is the "best"?
• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values
• Example: Median home prices may be reported for a region, because the median is less sensitive to outliers
Other Location Measures
Percentiles
The pth percentile in a data array (where 0 ≤ p ≤ 100):
• p% of the values are less than or equal to this value
• (100 – p)% of the values are greater than or equal to this value
Quartiles
• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile = median
• 3rd quartile = 75th percentile
Percentiles
• The pth percentile in an ordered array of n values is the value in the ith position, where
$i = \frac{p}{100}(n)$
If i is not an integer, round up to the next higher integer value
• Example: Find the 60th percentile in an ordered array of 19 values.
$i = \frac{p}{100}(n) = \frac{60}{100}(19) = 11.4$
So use the value in the i = 12th position
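The position rule can be written as a small Python sketch. The function names are hypothetical, and the sketch assumes whole-number p and n so that the rounding can be done with exact integer arithmetic. The rule as stated does not say what to do when i is already an integer; here that position is simply used as is, which is an assumption.

```python
def percentile_position(p, n):
    """1-based position i = (p/100) * n, rounded up when it is not an integer."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    position = -(-p * n // 100)
    return max(1, position)          # assumption: never go below the first position


def percentile_value(sorted_values, p):
    """Value at the pth-percentile position of an already sorted list."""
    i = percentile_position(p, len(sorted_values))
    return sorted_values[i - 1]


i_60 = percentile_position(60, 19)           # (60/100)(19) = 11.4, rounded up
hypothetical_array = list(range(1, 20))      # 19 made-up ordered values: 1..19
value_60 = percentile_value(hypothetical_array, 60)

print(i_60, value_60)
```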
Quartiles
• Quartiles split the ranked data into 4 segments with an equal number of values per segment:
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile
Quartiles
• Example: Find the first quartile
Sample Data in Ordered Array (n = 9): 11 12 13 16 16 17 18 21 22
Q1 = 25th percentile, so find i: $i = \frac{25}{100}(9) = 2.25$
so round up and use the value in the 3rd position: Q1 = 13
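Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 position = 0.25(n+1)
Second quartile position: Q2 position = 0.50(n+1) (the median position)
Third quartile position: Q3 position = 0.75(n+1)
where n is the number of observed values
Note that this (n+1) rule is a different convention from the i = (p/100)n percentile rule and can give a slightly different position: for n = 9 the first quartile position is 0.25(10) = 2.5 rather than the 3rd position.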
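The (n+1) position formulas can be sketched in Python as follows. The function names are hypothetical. The formulas give a position but not what to do when it falls between two ranked values; this sketch assumes linear interpolation between the neighbouring values, which is a common convention rather than something stated here. It also assumes n is large enough that every position lies between 1 and n.

```python
def quartile_positions(n):
    """Positions 0.25(n+1), 0.50(n+1), 0.75(n+1) in the ranked data (1-based)."""
    return [q * (n + 1) for q in (0.25, 0.50, 0.75)]


def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring ranked values.
    """
    lower = int(position)            # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # ordered sample, n = 9
positions = quartile_positions(len(data))
q1, q2, q3 = (value_at_position(data, pos) for pos in positions)

print(positions, q1, q2, q3)
```

For this sample the first quartile position is 2.5, halfway between the 2nd and 3rd ranked values, so the interpolated Q1 differs slightly from the value obtained by rounding a percentile position up.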
Measures of Variation
Variation
• Range
• Interquartile Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation
• Coefficient of Variation
Variation
• Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variation]
Measuring variation
[Figure: one distribution with a small standard deviation and one with a large standard deviation]
Range
• Simplest measure of variation
• Difference between the largest and the smallest observations:
Range = Xlargest – Xsmallest
Example:
[Figure: dot plot on a 0 to 14 axis, smallest value 1 and largest value 14]
Range = 14 - 1 = 13
Disadvantages of the Range
• Ignores the way in which data are distributed
[Figure: two differently distributed dot plots on a 7 to 12 axis; Range = 12 - 7 = 5 for both]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
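The sensitivity of the range to a single outlier can be reproduced with the two 25-value data sets. This is a minimal Python sketch; `value_range` is an illustrative helper name.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# Eleven 1s, eight 2s, four 3s, then 4 and 5.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
# Same data with the final 5 replaced by an outlier of 120.
with_outlier = base[:-1] + [120]

range_base = value_range(base)
range_outlier = value_range(with_outlier)

print(range_base, range_outlier)
```

Only one of the 25 values changed, yet the range jumps from 4 to 119.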
Interquartile Range
• Can eliminate some outlier problems by using the interquartile range
• Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
• Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Interquartile Range Example
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the data lie in each of the four segments)
Interquartile range = 57 – 30 = 27
Variance
• Average of squared deviations of values from the mean
• Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Population Variance
• Average of squared deviations of values from the mean
• Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Where μ = population mean
N = population size
x_i = ith value of the variable x
Sample Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Where $\bar{x}$ = arithmetic mean of the sample
n = sample size
x_i = ith value of the variable x
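The difference between the two variance formulas is only the divisor: N for a population and n - 1 for a sample. The following Python sketch makes that explicit and compares the results with `statistics.pvariance` and `statistics.variance` from the standard library. The data values are hypothetical and chosen so the arithmetic is easy to follow.

```python
from statistics import pvariance, variance


def sum_of_squared_deviations(values):
    """Sum of (x_i - mean)^2 over all values."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical values, mean = 5
ss = sum_of_squared_deviations(data)   # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16

population_variance = ss / len(data)        # divide by N
sample_variance = ss / (len(data) - 1)      # divide by n - 1

print(ss, population_variance, round(sample_variance, 3))
```

Dividing by n - 1 always gives a slightly larger value than dividing by N, and the gap shrinks as the number of observations grows.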
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
• Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Population Standard Deviation
• Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample Standard Deviation
• Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
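Calculation Example: Sample Standard Deviation
Sample Data (x_i): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{x}$ = 16
$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{36 + 16 + 4 + 1 + 1 + 4 + 4 + 64}{7}} = \sqrt{\frac{130}{7}} \approx 4.3095$
A measure of the "average" scatter around the mean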
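The calculation can be checked step by step in Python. Recomputing the squared deviations for these eight values gives a sum of 130, so the sample standard deviation is the square root of 130/7. The sketch also compares the result with `statistics.stdev`; the variable names are illustrative.

```python
import math
from statistics import stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(sample)

x_bar = sum(sample) / n                                   # 128 / 8
squared_deviations = [(x - x_bar) ** 2 for x in sample]   # 36, 16, 4, 1, 1, 4, 4, 64
sum_sq = sum(squared_deviations)

s = math.sqrt(sum_sq / (n - 1))                           # divide by n - 1, then take the root

print(x_bar, sum_sq, round(s, 4))
```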
Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 axis, all with Mean = 15.5]
Data A: s = 3.338
Data B: s = 0.926
Data C: s = 4.570
Advantages of Variance and Standard Deviation
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight (because deviations from the mean are squared)
Coefficient of Variation
 Measures relative variation
 Always in percentage (%)
 Shows variation relative to mean
 Is used to compare two or more sets of data
measured in different units
Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
Sample: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
Comparing Coefficient of Variation
• Stock A:
• Average price last year = $50
• Standard deviation = $5
$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
• Average price last year = $100
• Standard deviation = $5
$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price
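The stock comparison can be expressed as a short Python sketch. The helper name `coefficient_of_variation` is hypothetical, and the guard against a zero mean is an added assumption since the ratio is undefined in that case.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"Stock A: {cv_a:.0f}%, Stock B: {cv_b:.0f}%")
```

Because the CV is a ratio, it has no units, which is what allows data sets measured on different scales to be compared.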
The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
• μ ± 1σ contains about 68% of the values in the population or the sample
• μ ± 2σ contains about 95% of the values in the population or the sample
• μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample
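The 68%, 95% and 99.7% figures can be checked against a normal distribution, which is the standard model of a bell-shaped distribution (using it here is an assumption; real data will only match approximately). The sketch uses `statistics.NormalDist` from the Python standard library, and `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The proportions do not depend on the particular mean and standard deviation, so the standard normal (mean 0, standard deviation 1) is enough for the check.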
Standardized Data Values
• A standardized data value refers to the number of standard deviations a value is from the mean
• Standardized data values are sometimes referred to as z-scores
Standardized Population Values
$z = \frac{x - \mu}{\sigma}$
where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score (number of standard deviations x is from μ)
Standardized Sample Values
$z = \frac{x - \bar{x}}{s}$
where:
• x = original data value
• $\bar{x}$ = sample mean
• s = sample standard deviation
• z = standard score (number of standard deviations x is from $\bar{x}$)
Standardized Value Example
• IQ scores in a large population have a bell-shaped distribution with mean μ = 100 and standard deviation σ = 15
Find the standardized score (z-score) for a person with an IQ of 121.
Answer:
$z = \frac{x - \mu}{\sigma} = \frac{121 - 100}{15} = 1.4$
Someone with an IQ of 121 is 1.4 standard deviations above the mean
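The z-score calculation is a one-line formula, sketched here in Python with the IQ figures from the example. The function name `z_score` is illustrative, and the check on the standard deviation is an added assumption.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


z = z_score(x=121, mean=100, std_dev=15)   # (121 - 100) / 15

print(round(z, 2))
```

A positive z-score means the value is above the mean; a negative one means it is below.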
Using Microsoft Excel
• Descriptive Statistics can be obtained from Microsoft® Excel
• Select: data / data analysis / descriptive statistics
• Enter input range details in the dialog box
• Check box for summary statistics
• Click OK
Excel output
Microsoft Excel descriptive statistics output, using the house price data:
House Prices: $2,000,000; 500,000; 300,000; 100,000; 100,000
[Figure: Excel descriptive statistics output]
