Analysing Data

Quantitative Research Methods

Analysing Data

Summary Measures

Describing Data Numerically
• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

2.1
Measures of Central Tendency: Overview

Central Tendency
• Mean: the arithmetic average, \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \)
• Median: the midpoint of the ranked values
• Mode: the most frequently observed value

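Arithmetic Mean
• The arithmetic mean (mean) is the most common measure of central tendency
• For a population of N values:

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

  where x_1, ..., x_N are the population values and N is the population size
• For a sample of size n:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

  where x_1, ..., x_n are the observed values and n is the sample size
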
Arithmetic Mean (continued)
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

Example: the values 1, 2, 3, 4, 5 have mean (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3. Replacing 5 with the extreme value 10 gives (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4.

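The sensitivity of the mean to an extreme value can be checked numerically. The following is a minimal Python sketch using the two five-value data sets from the slide (1, 2, 3, 4, 5 and 1, 2, 3, 4, 10); the function name `arithmetic_mean` is an illustrative choice, not something the slides define.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


base = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # same data, largest value replaced by 10

print(arithmetic_mean(base))          # 3.0
print(arithmetic_mean(with_outlier))  # 4.0
```

Changing a single observation moves the mean by a full unit, which is the behaviour the slide describes.
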
Median
• In an ordered array, the median is the "middle" number, i.e., the number that splits the distribution in half
• The median is not affected by extreme values

[Figure: two dot plots on a 0 to 10 axis, one with an extreme value; Median = 3 in both]

Median (continued)
• To find the median, sort the n data values from low to high (sorted data is called a data array)
• Find the value in the i = (1/2)n position
• The ith position is called the Median Index Point
• If i is not an integer, round up to the next highest integer
• If i is an integer, the median is the average of the values in positions i and i + 1

Median Example (continued)

Data array:
4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24

• Note that n = 13
• Find the i = (1/2)n position: i = (1/2)(13) = 6.5
• Since 6.5 is not an integer, round up to 7
• The median is the value in the 7th position: Md = 12

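The Median Index Point procedure can be sketched in Python as follows. The function name `median_by_index` is hypothetical. For an even n, where i = n/2 is an integer, the sketch averages the values in positions i and i + 1, which is the usual convention.

```python
import math


def median_by_index(values):
    """Median via the index point i = n/2 (positions are 1-based)."""
    data = sorted(values)          # the ordered "data array"
    n = len(data)
    i = n / 2
    if i != int(i):
        # i is not an integer: round up and take the value at that position
        return data[math.ceil(i) - 1]
    # i is an integer: average the values in positions i and i + 1
    i = int(i)
    return (data[i - 1] + data[i]) / 2


data = [4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24]
print(median_by_index(data))  # 12 (the value in the 7th position)
```
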
Shape of a Distribution
• Describes how data is distributed
• Symmetric or skewed
  • Left-skewed: Mean < Median (longer tail extends to left)
  • Symmetric: Mean = Median
  • Right-skewed: Median < Mean (longer tail extends to right)

Mode
• A measure of location
• The value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes

[Figure: dot plot on a 0 to 14 axis with Mode = 5; dot plot on a 0 to 6 axis with no mode]

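The three cases listed above (one mode, no mode, several modes) can be illustrated with a short Python sketch. The function name `modes` and the sample lists are hypothetical. Treating data in which every distinct value occurs equally often as having "no mode" is a convention assumed here, not a rule stated on the slide.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    # Assumed convention: if several distinct values all occur equally
    # often, no value stands out, so report no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 3, 5, 5, 5, 9]))        # [5]
print(modes([0, 1, 2, 3, 4, 5, 6]))     # []  (no mode)
print(modes([1, 1, 2, 2, 3]))           # [1, 2]  (several modes)
print(modes(["red", "blue", "blue"]))   # ['blue']  (categorical data)
```
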
Weighted Mean
• Used when values are grouped by frequency or relative importance

Example: Sample of 26 Repair Projects

Days to Complete    Frequency
       5                4
       6               12
       7                8
       8                2

Weighted mean days to complete:

\[ \bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days} \]

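The repair-project calculation can be reproduced with a few lines of Python. This is a sketch using the frequencies as weights; the names `weighted_mean`, `days` and `frequency` are illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


days = [5, 6, 7, 8]          # days to complete
frequency = [4, 12, 8, 2]    # number of projects taking that many days

result = weighted_mean(days, frequency)
print(round(result, 2))  # 6.31
```
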
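Geometric Mean
• Geometric mean
  • Used to measure the rate of change of a variable over time

\[ \bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} \]

• Geometric mean rate of return
  • Measures the status of an investment over time

\[ \bar{r}_g = [(1 + x_1)(1 + x_2) \cdots (1 + x_n)]^{1/n} - 1 \]

  • Where x_i is the rate of return in time period i, expressed as a decimal, so that 1 + x_i is the growth factor for period i
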
Example

An investment of $100,000 rose to $150,000 at the end of year one and increased to $180,000 at the end of year two:

X_1 = $100,000 → X_2 = $150,000 (50% increase) → X_3 = $180,000 (20% increase)

What is the mean percentage return over time?

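Example (continued)

Use the 1-year returns to compute the arithmetic mean and the geometric mean:

Arithmetic mean rate of return (misleading result):

\[ \bar{X} = \frac{50\% + 20\%}{2} = 35\% \]

Geometric mean rate of return (more accurate result), using the growth factors 1 + 0.50 and 1 + 0.20:

\[ \bar{r}_g = [(1 + 0.50)(1 + 0.20)]^{1/2} - 1 = (1.80)^{1/2} - 1 = 1.3416 - 1 = 34.16\% \]

Check: $100,000 × 1.3416² ≈ $180,000, the actual final value, whereas growing at 35% a year would give $100,000 × 1.35² = $182,250.
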
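The comparison between the two averages can be verified with a small Python sketch. It assumes returns are given as decimals (0.50 for 50%) and compounded as growth factors 1 + r; the function name `geometric_mean_return` is hypothetical.

```python
import math


def geometric_mean_return(returns):
    """Constant per-period return that compounds to the same final value."""
    growth = math.prod(1 + r for r in returns)  # total growth factor
    return growth ** (1 / len(returns)) - 1


returns = [0.50, 0.20]  # year one: +50%, year two: +20%

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean: {arithmetic:.2%}")  # 35.00%
print(f"geometric mean:  {geometric:.2%}")   # 34.16%

# Compounding the geometric mean for two years recovers the final value.
final_value = 100_000 * (1 + geometric) ** 2
print(round(final_value))  # 180000
```

Only the geometric mean reproduces the $180,000 end value when compounded, which is why it is the more accurate summary of growth over time.
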
Review Example
• Five houses on a hill by the beach

House Prices:
$2,000,000
   500,000
   300,000
   100,000
   100,000

Review Example: Summary Statistics

House Prices:
$2,000,000
   500,000
   300,000
   100,000
   100,000
Sum 3,000,000

• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000

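The three summary statistics for the house prices can be reproduced with Python's standard `statistics` module. This is a quick check rather than part of the original slides; the variable names are illustrative.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # 600000
median_price = statistics.median(prices)  # 300000
mode_price = statistics.mode(prices)      # 100000

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean to twice the median, which is why the choice of measure matters for data like these.
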
Which measure of location is the "best"?
• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values
  • Example: median home prices may be reported for a region, since the median is less sensitive to outliers

Other Location Measures

Other Measures of Location: Percentiles and Quartiles

Percentiles. The pth percentile in a data array (where 0 ≤ p ≤ 100):
• p% of the values are less than or equal to this value
• (100 − p)% of the values are greater than or equal to this value

Quartiles:
• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile = median
• 3rd quartile = 75th percentile

Percentiles
• The pth percentile in an ordered array of n values is the value in the ith position, where

\[ i = \frac{p}{100}(n) \]

  If i is not an integer, round up to the next higher integer value.
• Example: Find the 60th percentile in an ordered array of 19 values.

\[ i = \frac{p}{100}(n) = \frac{60}{100}(19) = 11.4 \]

  So use the value in the i = 12th position.

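The position rule i = (p/100)n with rounding up can be sketched in Python as below. The function names are hypothetical. When i is already an integer, the sketch simply uses position i, which is an assumption, since the slide only covers the non-integer case. The nine-value array is the ordered sample used in the deck's first-quartile example.

```python
import math


def percentile_position(p, n):
    """1-based position of the pth percentile in an ordered array of n values."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    return math.ceil(p * n / 100)  # round up when i is not an integer


def percentile(values, p):
    """Value at the percentile position (assumes a non-empty list)."""
    data = sorted(values)
    return data[percentile_position(p, len(data)) - 1]


print(percentile_position(60, 19))  # 12

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(percentile(sample, 25))       # 13 (position 3, since i = 2.25)
```
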
Quartiles
• Quartiles split the ranked data into 4 segments with an equal number of values per segment

  | 25% | 25% | 25% | 25% |
        Q1    Q2    Q3

• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3

Quartiles
• Example: Find the first quartile

Sample Data in Ordered Array (n = 9): 11 12 13 16 16 17 18 21 22

Q1 = 25th percentile, so find i:

\[ i = \frac{25}{100}(9) = 2.25 \]

so round up and use the value in the 3rd position: Q1 = 13

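Quartile Formulas

Find a quartile by determining the value in the appropriate position in the ranked data, where

First quartile position: Q1 = 0.25(n+1)
Second quartile position: Q2 = 0.50(n+1) (the median position)
Third quartile position: Q3 = 0.75(n+1)

where n is the number of observed values

Note: this (n + 1) rule is a different convention from the percentile rule i = (p/100)n and can point to a different position. For n = 9 it gives a first quartile position of 0.25(10) = 2.5, rather than 2.25 rounded up to 3.
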
Measures of Variation

Variation
• Range
• Interquartile Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation
• Coefficient of Variation

Variation
• Measures of variation give information on the spread or variability of the data values.

[Figure: two distributions with the same center but different variation]

Measuring variation

[Figure: one distribution with a small standard deviation and one with a large standard deviation]

Range
• Simplest measure of variation
• Difference between the largest and the smallest observations:

Range = Xlargest − Xsmallest

Example:

[Figure: dot plot on a 0 to 14 axis; smallest value 1, largest value 14]

Range = 14 − 1 = 13

Disadvantages of the Range
• Ignores the way in which data are distributed

[Figure: two differently distributed dot plots on a 7 to 12 axis; Range = 12 − 7 = 5 in both]

• Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 − 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 − 1 = 119

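The outlier example above is easy to reproduce. A minimal Python sketch, with the two 25-value data sets written compactly; the function name `value_range` is illustrative.

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


# Eleven 1s, eight 2s, four 3s, then the last two values.
typical = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(typical))       # 4
print(value_range(with_outlier))  # 119
```

Only one of the 25 observations differs between the two lists, yet the range changes from 4 to 119.
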
Interquartile Range
• Can eliminate some outlier problems by using the interquartile range
• Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
• Interquartile range = 3rd quartile − 1st quartile

IQR = Q3 − Q1

Interquartile Range Example

Example:

X minimum    Q1    Median (Q2)    Q3    X maximum
    12       30        45         57       70

Each of the four segments between these values contains 25% of the data.

Interquartile range = 57 − 30 = 27

Variance
• Average of squared deviations of values from the mean
• Population variance:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

• Sample variance:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Population Variance
• Average of squared deviations of values from the mean
• Population variance:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

Where μ = population mean
N = population size
x_i = ith value of the variable x

Sample Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Where \bar{x} = arithmetic mean
n = sample size
x_i = ith value of the variable x

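The only difference between the two variance formulas is the divisor, N versus n − 1. The Python sketch below implements both and cross-checks them against the standard library. The data set is hypothetical, chosen so the arithmetic is easy to follow, and the function names are illustrative.

```python
import statistics


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data: mean 5, squared deviations sum to 32

print(population_variance(data))        # 4.0
print(round(sample_variance(data), 4))  # 4.5714

# The standard library uses the same two definitions.
print(statistics.pvariance(data), round(statistics.variance(data), 4))
```
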
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

• Sample standard deviation:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

Population Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

Sample Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Sample standard deviation:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

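Calculation Example: Sample Standard Deviation

Sample data (x_i): 10 12 14 15 17 18 18 24
n = 8, mean = \bar{x} = 16

\[ s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} \]

\[ = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} \]

\[ = \sqrt{\frac{130}{7}} = 4.3095 \]

The eight squared deviations are 36, 16, 4, 1, 1, 4, 4 and 64, which sum to 130.

The result is a measure of the "average" scatter around the mean.
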
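The calculation can be followed step by step in Python. This sketch uses the eight sample values from the slide; working through the squared deviations gives a sum of 130, so s = sqrt(130/7). The variable names are illustrative, and the last line cross-checks against `statistics.stdev`.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                                  # 16.0
squared_deviations = [(x - mean) ** 2 for x in data]  # 36, 16, 4, 1, 1, 4, 4, 64
sum_of_squares = sum(squared_deviations)              # 130.0

s = math.sqrt(sum_of_squares / (n - 1))               # divide by n - 1 for a sample
print(round(s, 4))                                    # 4.3095

# The standard library's sample standard deviation agrees.
print(round(statistics.stdev(data), 4))               # 4.3095
```
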
Comparing Standard Deviations

[Figure: three dot plots on a common 11 to 21 axis]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570

Advantages of Variance and Standard Deviation
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight (because deviations from the mean are squared)

Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different units

Population:

\[ CV = \left( \frac{\sigma}{\mu} \right) \cdot 100\% \]

Sample:

\[ CV = \left( \frac{s}{\bar{x}} \right) \cdot 100\% \]

Comparing Coefficient of Variation
• Stock A:
  • Average price last year = $50
  • Standard deviation = $5

\[ CV_A = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]

• Stock B:
  • Average price last year = $100
  • Standard deviation = $5

\[ CV_B = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]

Both stocks have the same standard deviation, but stock B is less variable relative to its price.

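The stock comparison reduces to one division per stock. A minimal Python sketch, with an illustrative function name:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Stock A: s = $5, mean = $50
cv_b = coefficient_of_variation(5, 100)   # Stock B: s = $5, mean = $100

print(cv_a)  # 10.0
print(cv_b)  # 5.0
```

Because the dollar units cancel, the two percentages can be compared directly even though the price levels differ.
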
The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
  • μ ± 1σ contains about 68% of the values in the population or the sample
  • μ ± 2σ contains about 95% of the values in the population or the sample
  • μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample

[Figure: bell-shaped curves centered at μ with the intervals μ ± 1σ (68%), μ ± 2σ (95%) and μ ± 3σ (99.7%) shaded]

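The 68%, 95% and 99.7% figures can be checked against an exactly normal distribution, which is a stronger assumption than the "bell-shaped" condition on the slide. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `share_within` is hypothetical.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal value lies within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on the particular μ and σ, so the same proportions apply to any normal distribution.
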
Standardized Data Values
• A standardized data value refers to the number of standard deviations a value is from the mean
• Standardized data values are sometimes referred to as z-scores

Standardized Population Values

\[ z = \frac{x - \mu}{\sigma} \]

where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score (number of standard deviations x is from μ)

Standardized Sample Values

\[ z = \frac{x - \bar{x}}{s} \]

where:
• x = original data value
• \bar{x} = sample mean
• s = sample standard deviation
• z = standard score (number of standard deviations x is from \bar{x})

Standardized Value Example
• IQ scores in a large population have a bell-shaped distribution with mean μ = 100 and standard deviation σ = 15

Find the standardized score (z-score) for a person with an IQ of 121.

Answer:

\[ z = \frac{x - \mu}{\sigma} = \frac{121 - 100}{15} = 1.4 \]

Someone with an IQ of 121 is 1.4 standard deviations above the mean.

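The z-score formula translates directly into code. A minimal Python sketch using the IQ figures from the example; the function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


z = z_score(121, 100, 15)
print(z)  # 1.4
```

A positive result means the value is above the mean; a negative result means it is below.
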
Using Microsoft Excel
• Descriptive statistics can be obtained from Microsoft® Excel
• Select: Data / Data Analysis / Descriptive Statistics
• Enter the input range details in the dialog box
• Check the box for summary statistics
• Click OK

Excel output

Microsoft Excel descriptive statistics output, using the house price data:

House Prices:
$2,000,000
   500,000
   300,000
   100,000
   100,000

[Figure: Excel descriptive statistics output table]
