
Business Statistics

CHAPTER 3
DESCRIBING DATA

Dang Quan Tri


Summary Measures
Describing Data Numerically

Center and Location
• Mean
• Median
• Mode
• Weighted Mean

Other Measures of Location
• Percentiles
• Quartiles

Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
Content

Mean

Median

Mode

Weighted Mean

Range
Mean (Arithmetic Average)

• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

[Figure: two dot plots on a 0 to 10 axis]
• Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
• Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10) / 5 = 20 / 5 = 4
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-
Hall, Inc.
Mean

The mean is the average of a set of numbers.

How to calculate?
• Add up all the numbers.
• Divide by how many numbers there are.
Example
• What is the mean of these numbers?
• 6, 11, 7
• S1: Add the numbers:
• 6 + 11 + 7 = 24
• S2: Divide by how many numbers there are:
• 24 / 3 = 8
• -> The mean is 8
Why does this work?
• Because 6, 11 and 7 added together give the same total (24) as 3 lots of 8.
• It is like "flattening out" the numbers.
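The two steps above (add, then divide) can be sketched in a few lines of Python. The function name `mean` is only an illustration, not part of the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)


print(mean([6, 11, 7]))        # 24 / 3 = 8.0
print(mean([1, 2, 3, 4, 10]))  # 20 / 5 = 4.0
```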
Negative numbers
• How to handle negative numbers?
• Adding a negative number is the same as subtracting the number (without the negative sign):
3 + (-2) = 3 - 2 = 1
Example
• What is the mean of these numbers?
• 6, 7, 9, 5, 2, 3
• 3, -7, 5, 13, -2
• -1, -3, -6, -9, -10
Mean (Arithmetic Average)
• The mean is the arithmetic average of the data values
• Sample mean (n = sample size):
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
• Population mean (N = population size):
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
• Ex: Suppose that in thirty shots at a target, a marksman makes the following scores:
• 5 2 2 3 4 4 3 2 0 3 0 3 2 1 5 1 3 1 5 5 2 4 0 0 4 5 4 4 5 5
• Calculate the mean
Weighted Mean

• Used when values are grouped by frequency or relative importance
Weighted Mean - Case 1

Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

• When the weights don't add to 1:
• Multiply each weight w by its matching value x, sum the products, and divide by the sum of the weights.
• Weighted mean days to complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days}$
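As a minimal sketch of the weighted mean formula, the repair-project example can be reproduced in Python. The function name `weighted_mean` is an assumption made for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


days = [5, 6, 7, 8]         # days to complete
frequency = [4, 12, 8, 2]   # number of projects (the weights)
print(round(weighted_mean(days, frequency), 2))  # 164 / 26 -> 6.31
```

When the weights already add to 1 (for example 0.5, 0.3, 0.2), the same function still works because dividing by 1 changes nothing.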
Weighted Mean - Case 2
• Decisions: weighted means can help with decisions where some things are more important than others.
• Ex: Sam wants to buy a new camera and decides on the following rating system:
  • Image Quality 50%
  • Battery Life 30%
  • Zoom Range 20%
• The Sony camera gets 8 (out of 10) for Image Quality, 6 for Battery Life and 7 for Zoom Range.
• The Canon camera gets 9 for Image Quality, 4 for Battery Life and 6 for Zoom Range.
• Which camera is best?
• Sony: 0.5 x 8 + 0.3 x 6 + 0.2 x 7 = 7.2
• Canon: 0.5 x 9 + 0.3 x 4 + 0.2 x 6 = 6.9
• -> Sam decides to buy the Sony.

• When the weights add to 1:
• Multiply each weight by the matching value and sum the products. No division is needed because the weights sum to 1.
• Ex: Sam wants to buy a new phone and decides on the following rating system:
  • System 40%
  • Battery Life 30%
  • Zoom Range 30%
• The iPhone gets 8 (out of 10) for System, 6 for Battery Life and 7 for Zoom Range.
• The Samsung gets 9 for System, 4 for Battery Life and 6 for Zoom Range.
• Which phone is best?
Median
• Not affected by extreme values
• In an ordered array, the median is the "middle" number

[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]
How to calculate Median
• S1: Collect the data
• S2: Put the data in order
• S3: Calculate the median index (i)
• S4: Find the median (value)
Median index
• i = ½ n
• Where: i = index of the point in the ordered data set corresponding to the median value
• n = sample size

• If i is not an integer, round it up to the next highest integer. That integer is the position of the median in the data array.
• If i is an integer, the median is the average of the values in position i and position i + 1.
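The index rule above can be sketched as follows. The function name `median_by_position` is hypothetical; positions in the slides are counted from 1, while Python lists are indexed from 0, hence the `- 1` offsets.

```python
import math


def median_by_position(data):
    """Median via the index rule i = n / 2 on the ordered data."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    i = n / 2
    if i != int(i):
        # i is not an integer: round up and take the value in that position
        return ordered[math.ceil(i) - 1]
    # i is an integer: average the values in positions i and i + 1
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2


print(median_by_position([1, 2, 3, 4, 10]))  # n = 5, i = 2.5 -> position 3 -> 3
print(median_by_position([1, 2, 3, 4]))      # n = 4, i = 2 -> (2 + 3) / 2 = 2.5
```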
Example
• Find the median of 3, 13, 7, 5, 21, 23, 39, 23, 40, 14, 12, 56, 23, 15 and 29.
• Find the median of 5, 6, 7, 10, 4, 2, 3, 8.
In statistics, we often locate the center of the data by calculating the mean, but the mean can sometimes be misleading.
Skewed and Symmetric Data
• Data in a population or sample can be either symmetric or skewed (the shape of the data), depending on how the data are distributed around the center.
Skewed and Symmetric Data
• Symmetric data: data sets whose values are evenly spread around the center. Median = mean.
• Skewed data: data sets that are not symmetric. For skewed data, the mean will be larger or smaller than the median.
Mode
• The mode is simply the number that appears most often
 Mode is most

• How to find?
  • Put the numbers in order
  • Count how many times each number appears (the value with the highest frequency is the mode)
• Find the mode
  • 1, 3, 3, 3, 4, 4, 6, 6, 6, 9
  • 3 appears three times, as does 6
  • -> there are two modes: 3 and 6

• A data set can have more than one mode
  • Having two modes is called "bimodal"
  • Having more than two modes is called "multimodal"
• Find the mode of 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23 and 29.
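The counting procedure can be sketched with `collections.Counter` from the Python standard library. The helper name `modes` is an assumption; it returns every value that shares the highest frequency, so bimodal and multimodal data are handled too.

```python
from collections import Counter


def modes(data):
    """Return all values that occur with the highest frequency."""
    if not data:
        raise ValueError("the mode of an empty list is undefined")
    counts = Counter(data)              # value -> how many times it appears
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 3, 3, 3, 4, 4, 6, 6, 6, 9]))  # [3, 6] -> bimodal
```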
Range
• Simplest measure of variation
• Difference between the largest and the smallest observations:
Range = x_maximum – x_minimum

Example:
[Figure: dot plot on a 0 to 14 axis]
Range = 14 - 1 = 13
Disadvantage of range
• The range can sometimes be misleading when there are extremely high or low values.

[Figure: two dot plots on a 7 to 12 axis, each with Range = 12 - 7 = 5]

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
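A short sketch of the same comparison in Python shows how a single extreme value changes the range. The name `value_range` is chosen here only to avoid clashing with Python's built-in `range`.

```python
def value_range(data):
    """Range = largest observation minus smallest observation."""
    return max(data) - min(data)


# The two data sets from the slide: identical except for the last value.
typical = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(typical))       # 5 - 1 = 4
print(value_range(with_outlier))  # 120 - 1 = 119
```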
Review Example
• Five houses on a hill by the beach
House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
Summary Statistics

House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000

• Mean: $3,000,000 / 5 = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
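The three summary statistics can be checked with Python's standard `statistics` module. This is only a sketch of the house-price example; the variable name `prices` is an assumption.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

print(statistics.mean(prices))    # 3,000,000 / 5 = 600000
print(statistics.median(prices))  # middle value of the ranked data = 300000
print(statistics.mode(prices))    # most frequent value = 100000
```

The mean is pulled far above the median by the single $2,000,000 house, which is why the median is often preferred for skewed data such as prices.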
Quartiles

Quartiles are the values that divide an ordered list of numbers into quarters.

How to find?
* First put the list of numbers in order
* Then cut the list into four equal parts
-> The quartiles are at the "cuts"
Example

5, 8, 4, 4, 6, 3, 8

• Put them in order: 3, 4, 4, 5, 6, 8, 8
• Cut the list into quarters at the 2nd, 4th and 6th values: 3, [4], 4, [5], 6, [8], 8
• And the result is:
• Quartile 1 (Q1) = 4
• Quartile 2 (Q2), which is also the median, = 5
• Quartile 3 (Q3) = 8
How to calculate
i = p × n, where p is the proportion and n is the number of values
Q1 -> p = 25% -> lower quartile
Q2 -> p = 50% -> median
Q3 -> p = 75% -> upper quartile
If i is not an integer -> round up to the next highest integer; the quartile is the value in that position
If i is an integer, the pth percentile is the average of the values in position i and position i + 1
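The position rule can be sketched as a small Python function. The name `percentile_by_position` is hypothetical, and p is passed as a percentage (25, 50, 75) rather than a fraction. Other textbooks and software use slightly different quartile conventions, so results may differ from, for example, spreadsheet functions.

```python
import math


def percentile_by_position(data, p):
    """pth percentile (0 < p < 100) using the rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    n = len(ordered)
    i = p * n / 100
    if i != int(i):
        # not an integer: round up and take the value in that position
        return ordered[math.ceil(i) - 1]
    # integer: average the values in positions i and i + 1
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2


data = [5, 8, 4, 4, 6, 3, 8]
print(percentile_by_position(data, 25))  # i = 1.75 -> position 2 -> 4
print(percentile_by_position(data, 50))  # i = 3.5  -> position 4 -> 5
print(percentile_by_position(data, 75))  # i = 5.25 -> position 6 -> 8
```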
COMPARING HURDLES SCORES

Here are the top eleven 50 m goat racing times in seconds for 2007 and 2008.

2007    2008
12.1    12.3
14.0    13.7
15.3    15.5
15.4    15.5
15.4    15.6
15.6    15.9
15.7    16.0
15.7    16.1
16.1    16.1
16.7    17.1
17.0    22.9

Work out the mean and range.

         2007    2008
Mean     15.4    16.1
Range    4.9     10.6

Which year was better and why?
Why might this comparison be unfair?

The interquartile range is a better measure of spread when the data contains an outlier.
Note

Sometimes a "cut" falls between two numbers.
-> The quartile is then the average of the two numbers
FINDING THE INTERQUARTILE RANGE

When there are outliers in the data, it is more appropriate to calculate the interquartile range.
The interquartile range (IQR) is the range of the middle half of the data.

The lower quartile is the data value that is one quarter of the way along the list (when written in order of size).
The upper quartile is the data value that is three quarters of the way along the ordered list.

interquartile range = upper quartile – lower quartile
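As a sketch, the interquartile range of the 2007 and 2008 race times can be compared with the full range. The quartile positions here assume the rule i = p × n (round up when i is not an integer, average positions i and i + 1 otherwise); the helper names are hypothetical.

```python
import math


def quartile(data, p):
    """Quartile at percentage p, assuming the position rule i = (p / 100) * n."""
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if i != int(i):
        return ordered[math.ceil(i) - 1]
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2


def interquartile_range(data):
    """Upper quartile minus lower quartile."""
    return quartile(data, 75) - quartile(data, 25)


times_2007 = [12.1, 14.0, 15.3, 15.4, 15.4, 15.6, 15.7, 15.7, 16.1, 16.7, 17.0]
times_2008 = [12.3, 13.7, 15.5, 15.5, 15.6, 15.9, 16.0, 16.1, 16.1, 17.1, 22.9]

for year, times in (("2007", times_2007), ("2008", times_2008)):
    full_range = round(max(times) - min(times), 1)
    iqr = round(interquartile_range(times), 1)
    print(year, "range:", full_range, "IQR:", iqr)
# 2007 range: 4.9 IQR: 0.8
# 2008 range: 10.6 IQR: 0.6
```

The 22.9 second time inflates the 2008 range, while the IQR for the two years stays close.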
Box and Whisker Diagrams.

Anatomy of a Box and Whisker Diagram.

[Figure: box-and-whisker diagram on a 4 to 12 axis. Labels, left to right: lower limit, lower quartile, median, upper quartile, upper limit. The box spans the two quartiles and a whisker extends from each side of the box.]

Box plots are useful for comparing two or more sets of data, such as the heights of boys and girls in a class.

[Figure: box plots for Boys and Girls on a 130 to 190 cm axis]
S1: Sort the data
S2: Calculate the quartiles
S3: Draw the box from Q1 to Q3
S4: Draw a vertical line through the box at the median
S5: Compute the upper and lower limits

*** Lower limit: Q1 – 1.5 (Q3 – Q1)
*** Upper limit: Q3 + 1.5 (Q3 – Q1)

S6: Draw the whiskers
S7: Plot the outliers (values outside the limits)
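Steps S5 and S7 can be sketched in Python. The function names are hypothetical, and the quartiles are passed in rather than computed, so any quartile convention can be used. The value 25 in the second call is an invented observation added only to show an outlier being flagged.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Values that fall outside the box plot limits."""
    lower, upper = box_plot_limits(q1, q3)
    return [x for x in data if x < lower or x > upper]


data = [3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15]   # Q1 = 4, Q3 = 10
print(box_plot_limits(4, 10))                 # (-5.0, 19.0)
print(find_outliers(data, 4, 10))             # [] -> no outliers
print(find_outliers(data + [25], 4, 10))      # [25], quartiles held fixed for illustration
```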
* n = 45
Drawing a Box Plot.

Example 1: Draw a box plot for the data below

4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

[Figure: box plot on a 4 to 12 axis]
Drawing a Box Plot.

Example 2: Draw a box plot for the data below

3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

[Figure: box plot on a 3 to 15 axis]
Drawing a Box Plot.

Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.

137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186

Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (QU) = 180

[Figure: box plot on a 130 to 190 cm axis]

Drawing a Box Plot.

Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct statements comparing heights of boys and girls in the class. Justify your answers.

[Figure: box plots for Boys and Girls on a 130 to 190 cm axis]

1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.
PERCENTILES

• A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.
• Ex: You are the fourth tallest person in a group of 20, so 16 of the 20 people (80%) are shorter than you
• -> That means you are at the 80th percentile
• If your height is 1.85 m, then 1.85 m is the 80th percentile height in that group.
• i = (p/100) × n
• Where p: desired percentage, n: number of values
• If i is not an integer -> round up to the next highest integer
• If i is an integer, the pth percentile is the average of the values in position i and position i + 1
Estimating Percentiles from a Line Graph

• A total of 10,000 people visited Aeon mall over 12 hours

Time (hours)    People (cumulative)
0               0
2               350
4               1,100
6               2,400
8               6,500
10              8,850
12              10,000

• Estimate the 30th percentile (when 30% of the visitors had arrived).
• Estimate what percentile of visitors had arrived after 11 hours.
Key Points:

• Percentile rank is a number between 0 and 100 indicating the percent of cases falling at or below that score.
• Percentile ranks are usually written to the nearest whole percent: 74.5% rounds to 75% = 75th percentile
• Scores are arranged in rank order from lowest to highest
• There is no 0th percentile rank - the lowest score is at the first percentile
• There is no 100th percentile - the highest score is at the 99th percentile
• You have 25 test scores, and in order from lowest to highest they look like this:
• 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99.
• Find the 90th percentile and the 20th percentile for these (ordered) scores.
Practice Problems
Measures of variation

Plan A    Plan B
15        23
25        26
35        25
20        24
30        27

Calculate the mean and median of each plan.


Variance
• Variance is the average of the squared differences of the data values from the mean.
MEASURES OF VARIABILITY
POPULATION VARIANCE

• The population variance is the mean squared deviation from the population mean:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Where $\sigma^2$ stands for the population variance
• $\mu$ is the population mean
• N is the total number of values in the population
• $x_i$ is the value of the i-th observation
• $\sum$ represents a summation
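A minimal sketch of the population variance formula, applied to the Plan A and Plan B figures from the practice problem and treating each plan as a complete population (an assumption made only for this illustration). The function name is hypothetical.

```python
def population_variance(values):
    """Mean squared deviation from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


plan_a = [15, 25, 35, 20, 30]
plan_b = [23, 26, 25, 24, 27]

print(population_variance(plan_a))  # 250 / 5 = 50.0
print(population_variance(plan_b))  # 10 / 5 = 2.0
```

Both plans have a mean of 25, yet Plan A is far more spread out, which the variance makes visible.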
These are the numbers of newspapers sold at the local shop over the
last 20 days:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20, 19 ,19 ,20, 22, 21, 20, 21,
23, 25, 29.
Standard Deviation
• Standard deviation is a measure of how spread out numbers are
• Its symbol is σ (the Greek letter sigma)
• The formula is the square root of the variance
Population Variance
• In practice, the population variance cannot be computed directly because the entire population is not ordinarily observed.
• An analogous measure of variability may be determined with sample data.
• This is referred to as the sample variance.
MEASURES OF VARIABILITY
SAMPLE VARIANCE

• The sample variance is defined as follows:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Where $s^2$ stands for the sample variance
• $\bar{x}$ is the sample mean
• n is the total number of values in the sample
• $x_i$ is the value of the i-th observation
• $\sum$ represents a summation
MEASURES OF VARIABILITY
POPULATION/SAMPLE STANDARD DEVIATION

• The standard deviation is the positive square root of the variance:
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$

• Compute the standard deviations of advertising and sales.
MEASURES OF VARIABILITY
POPULATION/SAMPLE STANDARD DEVIATION

• Compute the sample standard deviation of the advertising data: 2.5, 1.3, 1.4, 1.0 and 2.0
• Compute the population standard deviation of the sales data: 264, 116, 165, 101 and 209
MEASURES OF VARIABILITY
POPULATION/SAMPLE CV

• The coefficient of variation is the standard deviation divided by the mean
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Coefficient of Variation
• Measures relative variation
• Always expressed as a percentage (%)
• Shows variation relative to the mean
• Is used to compare two or more sets of data measured in different units
Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
Sample: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
Comparing Coefficient of Variation
• Stock A:
  • Average price last year = $50
  • Standard deviation = $5
$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
  • Average price last year = $100
  • Standard deviation = $5
$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
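The stock comparison can be sketched in Python. The function name `coefficient_of_variation` is an assumption; it takes the standard deviation and mean directly, as given on the slide.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100 * std_dev / mean


print(coefficient_of_variation(5, 50))   # Stock A: 10.0 (%)
print(coefficient_of_variation(5, 100))  # Stock B: 5.0 (%)
```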
• Data can be "distributed" (spread out) in different ways.
Skewed and Symmetric Data
• Data in a population or sample can be either symmetric or skewed (the shape of the data), depending on how the data are distributed around the center.
Use for
• In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution.
• Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.
Examples
• Many things closely follow a Normal Distribution:
  • Heights of people
  • Size of things produced by machines
  • Errors in measurements
  • Blood pressure
  • Marks on a test
• => we say the data is "normally distributed"
Characteristics
• The normal distribution has:
  • Mean = median = mode
  • Symmetry about the center
  • 50% of values less than the mean and 50% greater than the mean.
• What is the standard deviation?
  • A measure of how spread out numbers are.
When you calculate the standard deviation of normally distributed data, you will generally find that about 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Example
• 95% of students at school are between 1.1 m and 1.7 m tall. Assuming this data is normally distributed, can you calculate the mean and standard deviation?
• The mean is halfway between 1.1 m and 1.7 m:
• Mean = (1.1 + 1.7) / 2 = 1.4 m
• 95% is 2 standard deviations either side of the mean (a total of 4 standard deviations), so:
• 1 standard deviation = (1.7 - 1.1) / 4 = 0.15 m
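The reasoning in this example can be sketched as a small function. The name `mean_and_sd_from_interval` and the parameter `k` (number of standard deviations either side of the mean: 1 for about 68%, 2 for about 95%, 3 for about 99.7%) are assumptions for illustration, and the data are assumed to be normally distributed.

```python
def mean_and_sd_from_interval(low, high, k):
    """Recover the mean and standard deviation from a symmetric interval.

    k is how many standard deviations the interval extends either side of
    the mean (assumed: 1 for ~68%, 2 for ~95%, 3 for ~99.7%).
    """
    mean = (low + high) / 2           # the mean is halfway between the ends
    sd = (high - low) / (2 * k)       # the interval spans 2 * k standard deviations
    return mean, sd


mean, sd = mean_and_sd_from_interval(1.1, 1.7, k=2)   # 95% interval
print(round(mean, 2), round(sd, 2))                   # 1.4 0.15
```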
The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
• μ ± 1σ contains about 68% of the values in the population or the sample

[Figure: bell curve with the central 68% marked between μ − 1σ and μ + 1σ]
The Empirical Rule
• μ ± 2σ contains about 95% of the values in the population or the sample
• μ ± 3σ contains about 99.7% of the values in the population or the sample

[Figure: bell curves with the central 95% marked between μ ± 2σ and the central 99.7% marked between μ ± 3σ]
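As a quick check of these percentages, the exact areas under a normal curve can be computed with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This sketch uses the standard normal distribution (μ = 0, σ = 1), since the shares are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Area between mu - k*sigma and mu + k*sigma
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {100 * share:.1f}%")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```

The rule's 68% and 95% are rounded versions of these figures.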
Standardized Data Values

• A standardized data value refers to the number of standard deviations a value is from the mean
• Standardized data values are sometimes referred to as z-scores
• -> The number of standard deviations from the mean is also called the "standard score", "sigma", or "z-score"
• One of your friends is 1.85 m tall. You can see on the bell curve that 1.85 m is 3 standard deviations from the mean of 1.4 m, so:
• -> your friend's height has a "z-score" of 3.0
• It is also possible to calculate how many standard deviations 1.85 m is from the mean.
• How far is 1.85 m from the mean?
• It is 1.85 - 1.4 = 0.45 m from the mean
• How many standard deviations is that? The standard deviation is 0.15 m, so:
• 0.45 / 0.15 = 3 standard deviations
• To convert a value to a standard score ("z-score"):
• First subtract the mean,
• Then divide by the standard deviation
• -> Doing that is called "standardizing"
• Example
• A survey of daily time had these results (in minutes):
• 26, 33, 65, 28, 34, 55, 25, 44, 50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34
• The mean is 38.8 minutes and the standard deviation is 11.4 minutes
• Convert the values to z-scores ("standard scores")
• To convert 26:
• First subtract the mean: 26 – 38.8 = -12.8
• Then divide by the standard deviation: -12.8 / 11.4 = -1.12
• So 26 is -1.12 standard deviations from the mean
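The two conversion steps (subtract the mean, divide by the standard deviation) can be sketched as a one-line function. The name `z_score` is chosen for illustration only.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / std_dev


print(round(z_score(26, 38.8, 11.4), 2))    # -12.8 / 11.4 -> -1.12
print(round(z_score(1.85, 1.4, 0.15), 2))   # 0.45 / 0.15 -> 3.0
```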
Standardized Population Values

$z = \frac{x - \mu}{\sigma}$
where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score (number of standard deviations x is from μ)
Standardized Sample Values

$z = \frac{x - \bar{x}}{s}$
where:
• x = original data value
• $\bar{x}$ = sample mean
• s = sample standard deviation
• z = standard score (number of standard deviations x is from $\bar{x}$)
Why standardize…?
• It can help you make decisions about your data.
• Example: Professor Willoughby is marking a test.
• Here are the students' results (out of 60 points):
• 20, 15, 26, 32, 18, 35, 14, 26, 22, 17
• Most students didn't even get 30 out of 60, and most will fail
• The test must have been really hard, so the Prof decides to standardize all the scores and only fail people more than 1 standard deviation below the mean.
• The mean is 22.5 and the standard deviation is about 6.76, and these are the standard scores (in the same order as the results):
• -0.37, -1.11, 0.52, 1.41, -0.67, 1.85, -1.26, 0.52, -0.07, -0.81
• Therefore, 2 students will fail (the ones who scored 15 and 14 on the test)
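The professor's procedure can be sketched in Python. This assumes the ten results are treated as a whole population, so `statistics.pstdev` (population standard deviation) is used; the variable names are hypothetical.

```python
import statistics

scores = [20, 15, 26, 32, 18, 35, 14, 26, 22, 17]

mean = statistics.mean(scores)     # 22.5
sd = statistics.pstdev(scores)     # population standard deviation (assumption)

# Standardize every score, then fail anyone more than 1 SD below the mean.
z_scores = [(x - mean) / sd for x in scores]
failed = [x for x, z in zip(scores, z_scores) if z < -1]

print(failed)  # [15, 14]
```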
In more detail
• Here is the standard normal distribution with percentages for every half of a standard deviation and cumulative percentages.
• Your score in a recent test was 0.5 standard deviations above the average. How many people scored lower than you did? (P(Z < 0.5))
Question 1
• 95% of students at school weigh between 62 kg and 90 kg.
• Assuming this data is normally distributed, what are the mean and standard deviation?
Question 2

• A machine produces electrical components
• 99.7% of the components have lengths between 1.176 cm and 1.224 cm.
• Assuming this data is normally distributed, what are the mean and standard deviation?
Question 3
• 68% of the marks in a test are between 51 and 64.
• Assuming this data is normally distributed, what are the mean and standard deviation?
Question 4
• A company makes parts for a machine. The lengths of the parts must be within certain limits or they will be rejected.
• A large number of parts were measured and the mean and standard deviation were calculated as 3.1 m and 0.005 m respectively.
• Assuming this data is normally distributed and 99.7% of the parts were accepted, what are the limits?
Question 5
• Students pass a test if they score 50% or more.
• The marks of a large number of students were sampled and the mean and standard deviation were calculated as 42% and 8% respectively.
• Assuming this data is normally distributed, what percentage of students pass the test?
Standard normal distribution table
• It shows you the percent of the population:
  • Between 0 and z
  • Less than z
  • Greater than z
Example
• Find the percent of the population between 0 and 0.45
• Start at the row for 0.4 and read along to the column for 0.05 (z = 0.45): there is the value 0.1736
• And 0.1736 is 17.36%
• So 17.36% of the population are between 0 and 0.45 standard deviations from the mean.
• Because the curve is symmetrical, the same table can be used for values going in either direction, so negative 0.45 (-0.45) also has an area of 0.1736.
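The table lookup can be cross-checked in Python with `statistics.NormalDist` (standard library, Python 3.8+). This is a sketch that replaces the printed table with the cumulative distribution function; the helper name `area_between_zero_and` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_between_zero_and(z):
    """Area under the standard normal curve between 0 and z."""
    return abs(standard_normal.cdf(z) - standard_normal.cdf(0))


print(round(area_between_zero_and(0.45), 4))   # 0.1736
print(round(area_between_zero_and(-0.45), 4))  # 0.1736, by symmetry
```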
Example
• Find the percent of the population with z between -1 and +2
 Use the standard Normal Distribution table to find
 P(0<Z
 P (Z
 P ( -1.65 <Z
 P (0.85 < Z
 P (Z>1.75)
 P (Z -0.69)
 P (-1.27 < Z
 P (Z >-2.64)
 P (Z 0.96)
• Find Z when you know the percentage.

• P(Z to +∞) = 50.8%
• P(-∞ to Z) = 30.85%
• P(-2 to Z) = 11.29%
• P(Z to 3) = 0.3%
Question 9
• The mean July daily rainfall in Waterville is 10 mm and the standard deviation is 1.5 mm
• Assume that this data is normally distributed
• How many days in July would you expect the daily rainfall to be less than 8.5 mm?
• In the music video "Em gái mưa" by Hương Tràm, the average rainfall measured in each rain scene is 20 mm, with a standard deviation of 3 mm.
• Assume that the rainfall follows a normal distribution.
• How many rain scenes in total have rainfall below 16 mm, given that there are 40 rain scenes in Hương Tràm's video?
• The mean age in a population of local residents is 43, with a standard deviation of 14. The locality has 5,000 people. How many people are between 22 and 57 years old? Assume that this is a normal distribution.
• The mean of a population is 20 and the standard deviation is 3. The population has 2,000 values. Assume that this is a normal distribution. Questions:
• a. How many values lie between 14 and 17? (1 point)
• b. It is known that in this survey the values from 17 to x account for 68.28%. Find x. (1 point)
• c. It is known that in this survey 131 people (unrounded) have a value greater than or equal to x. Find x.
• The mean July daily rainfall in Waterville is 10 mm and the standard deviation is 1.5 mm
• Assume that this data is normally distributed
• How many days in July would you expect the daily rainfall to be between 8.5 mm and 11.5 mm?
Thank you
