Lecture 1

Handout 1
Describing Data
Summarizing and Describing Data (Graphical)
• Pareto Chart
• Histogram
Pareto Chart
• Consider a company that sells business bags. Some items generate more
revenue than others. By ranking the items by revenue, the company can see
which items to emphasize (in terms of cost management, etc.). A Pareto chart
is useful for this purpose.
Pareto Chart Example
[Figure: Pareto chart of the bag sales. Bars show Revenue for each item (left axis, ￥0 to ￥16,000,000); the line shows Cumulative Percentage (right axis, 0% to 120%).]

The following data is in our IM Folder. Open the Excel file “Bag sales”.

Item                 Price   Number sold  Revenue  Percentage revenue  Cumulative Percentage
OA Bag               25,000  250
Attache case A       18,000   50
Attache case B       17,000   60
Key Holders           3,000  300
Business Bag Black   45,000  300
Business Bag Brown   45,000  250
Name Card holders     5,000  500
Day Packs             9,000  100
Suit Case A          30,000  120
Suit Case B          25,000   90
Sum
A Procedure to Make a Pareto Chart
1. Compute the revenue for each item
2. Compute the total revenue
3. Sort the data according to the revenue
4. Compute the percentage of revenue for
each item
5. Compute the cumulative percentage of
revenue
6. Make the Pareto Chart
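Steps 1 to 5 can also be checked outside Excel. The following is a minimal Python sketch, not the method used in class; it assumes the prices and quantities in the “Bag sales” table, and the names `bag_sales` and `pareto_table` are made up for this illustration. Step 6 (drawing the chart) is left to Excel.

```python
# Bag sales data copied from the table above: (item, price, number sold).
bag_sales = [
    ("OA Bag", 25_000, 250),
    ("Attache case A", 18_000, 50),
    ("Attache case B", 17_000, 60),
    ("Key Holders", 3_000, 300),
    ("Business Bag Black", 45_000, 300),
    ("Business Bag Brown", 45_000, 250),
    ("Name Card holders", 5_000, 500),
    ("Day Packs", 9_000, 100),
    ("Suit Case A", 30_000, 120),
    ("Suit Case B", 25_000, 90),
]


def pareto_table(sales):
    """Return (total, rows); each row is (item, revenue, pct, cumulative pct)."""
    # Step 1: revenue for each item = price x number sold.
    revenues = [(item, price * sold) for item, price, sold in sales]
    # Step 2: total revenue.
    total = sum(rev for _, rev in revenues)
    # Step 3: sort by revenue, largest first.
    revenues.sort(key=lambda pair: pair[1], reverse=True)
    # Steps 4 and 5: percentage and cumulative percentage of revenue.
    rows, running = [], 0
    for item, rev in revenues:
        running += rev
        rows.append((item, rev, 100 * rev / total, 100 * running / total))
    return total, rows


total_revenue, pareto_rows = pareto_table(bag_sales)
for item, rev, pct, cum in pareto_rows:
    print(f"{item:<20}{rev:>12,}{pct:>7.1f}%{cum:>7.1f}%")
print(f"{'Sum':<20}{total_revenue:>12,}")
```

The first three rows printed are Business Bag Black, Business Bag Brown, and OA Bag, and their cumulative percentage is a little under 72%.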
[Figure: the resulting Pareto chart, with Revenue bars (left axis, ￥0 to ￥16,000,000) and the Cumulative Percentage line (right axis, 0% to 120%).]

From the Pareto Chart example, we can learn
• Business Bag Black, Business Bag Brown, and OA Bag account for more than
70% of total revenue. These items require a lot of inventory.
• The company relies too much on a small number of items. More marketing
effort is needed for suit cases and name card holders.
Pareto Chart Example 2
Visualizing Revenue by Clients
• Use Pivot Table
• Pareto Chart
Example (Sales by Clients spreadsheet)
This data is a part of the “Sales by clients” data stored in our Applied Stat Folder.

Date    Client               Revenue
Sep-03  Office ABC           1,000,000
Sep-03  Taiyo Advertisement  1,300,000
Sep-03  Ad soken               600,000
Sep-03  Hakuhodo             8,000,000
Oct-03  Office ABC           1,000,000
Oct-03  Hakuhodo             1,000,000
Nov-03  Ad soken               500,000
Nov-03  Daisan Kikaku          800,000
Nov-03  Asahi Agency         1,000,000
Dec-03  Office ABC           2,000,000
Dec-03  Taiyo Advertisement    900,000

From this data, we would like to make (1) a table that ranks the revenue by clients, and (2) a Pareto Chart.

Revenue Ranking Table (Example of the Use of Pivot Table)

Client               Revenue     % Revenue  Cumulative %
Hakuhodo             21,000,000  47%         47%
Office ABC            6,800,000  15%         62%
Daisan Kikaku         6,200,000  14%         76%
Asahi Agency          3,900,000   9%         85%
Ad soken              3,600,000   8%         93%
Taiyo Advertisement   3,200,000   7%        100%
Grand Total          44,700,000  100%
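What the pivot table does here is sum the revenue for each client and then rank the clients by that sum. The sketch below shows the same idea in Python under stated assumptions: it uses only the eleven rows shown in the excerpt of the sales data, so its totals are smaller than those in the ranking table, which is built from the full file. The names `sales` and `revenue_by_client` are made up for this illustration.

```python
from collections import defaultdict

# The eleven rows shown in the excerpt: (date, client, revenue).
# The full "Sales by clients" file has more rows than this.
sales = [
    ("Sep-03", "Office ABC", 1_000_000),
    ("Sep-03", "Taiyo Advertisement", 1_300_000),
    ("Sep-03", "Ad soken", 600_000),
    ("Sep-03", "Hakuhodo", 8_000_000),
    ("Oct-03", "Office ABC", 1_000_000),
    ("Oct-03", "Hakuhodo", 1_000_000),
    ("Nov-03", "Ad soken", 500_000),
    ("Nov-03", "Daisan Kikaku", 800_000),
    ("Nov-03", "Asahi Agency", 1_000_000),
    ("Dec-03", "Office ABC", 2_000_000),
    ("Dec-03", "Taiyo Advertisement", 900_000),
]


def revenue_by_client(rows):
    """Sum revenue per client and return (client, total) pairs, largest first."""
    totals = defaultdict(int)
    for _date, client, revenue in rows:
        totals[client] += revenue
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


ranking = revenue_by_client(sales)
grand_total = sum(revenue for _, revenue in ranking)

# Print the ranking with percentage and cumulative percentage of revenue.
cumulative = 0
for client, revenue in ranking:
    cumulative += revenue
    share = 100 * revenue / grand_total
    print(f"{client:<22}{revenue:>12,}{share:>6.0f}%{100 * cumulative / grand_total:>6.0f}%")
print(f"{'Grand Total':<22}{grand_total:>12,}")
```

The ranked totals are then the input for the Pareto chart, exactly as in the bag sales example.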

Pareto Chart Example 2
[Figure: Pareto chart “Revenue by clients”. Bars show Revenue (left axis, 0 to 25,000,000) for Hakuhodo, Office ABC, Daisan Kikaku, Asahi Agency, Ad soken, and Taiyo Advertisement; the line shows Cumulative % (right axis, 0% to 120%).]
Histogram and frequency table
Example
• Visualizing your clients’ age range using a histogram.
Histogram Example

Age range  Frequency
～ 15        0
～ 20        0
～ 25        4
～ 30        5
～ 35       11
～ 40       11
～ 45        6
～ 50        4
～ 55        2
～ 60        0
More        0

[Figure: Histogram of the frequency table above. Horizontal axis: Clients' Age range; vertical axis: Frequency (0 to 12).]
From the histogram, we can learn that
• Clients of age between 35 and 45 are the primary clients.
• It is important to maintain the satisfaction of these clients.
• Provide new services for other age ranges to increase client base.
Making Histogram and Frequency Table
• Open the data “Clients list”, which is stored in our Applied Stat Folder.
This is the data for the histogram shown in the previous slides.
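A frequency table is a count of how many observations fall in each class. The sketch below is one way to build it in Python, with two assumptions stated up front. First, a label such as “～ 25” is read as an upper bound that includes 25, so that class holds ages above 20 and up to 25; this is how Excel's Histogram tool treats bins, but the slides do not say so. Second, the ages are hypothetical values invented for illustration, not the “Clients list” data.

```python
from bisect import bisect_left

# Upper bounds of the age classes: 15, 20, ..., 60 (assumed inclusive).
BIN_UPPER_BOUNDS = list(range(15, 65, 5))


def frequency_table(ages, upper_bounds=BIN_UPPER_BOUNDS):
    """Count ages per class; ages above the last bound go to "More"."""
    counts = [0] * (len(upper_bounds) + 1)
    for age in ages:
        # bisect_left gives the index of the first bound that is >= age,
        # so an age equal to a bound is counted in that bound's class.
        counts[bisect_left(upper_bounds, age)] += 1
    labels = [f"～ {bound}" for bound in upper_bounds] + ["More"]
    return list(zip(labels, counts))


# Hypothetical ages, for illustration only.
example_ages = [23, 25, 28, 31, 33, 35, 36, 38, 40, 42, 47, 52, 61]

table = frequency_table(example_ages)
for label, count in table:
    # A row of '#' marks gives a rough text version of the histogram.
    print(f"{label:<6}{count:>3} {'#' * count}")
```

Changing `BIN_UPPER_BOUNDS` changes the class width, which is the main choice to make when drawing a histogram.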
Numerical Measure of data summary (I)
• Difference between Population and Sample
• Mean (Average)
• Median
Difference between Population and Sample
Population
A population is the complete set of all items in which an investigator is interested.
Examples of Populations
• Names of all registered voters in the United States.
• Incomes of all families living in Daytona Beach.
• Grade point averages of all the students in your university.
• A major objective of statistics is to make an inference about the
population. For example, “What is the average income of all families living
in Daytona Beach?”
• Often, collecting the data for the whole population is costly or
impossible. Therefore, we often collect data for only a part of the
population. Such data is called a “sample”.
A Sample
Sample
A sample is an observed subset of
population values.
Numerical Measure of Summarizing Data
1-1 Mean (Average)
• How to compute the mean (average)
• Understanding the mathematical notation of the mean (average)
• Cautionary notes for the use of the mean
1-2 How to compute the mean
• Sum all the data, then divide the sum by the number of observations.
• We use the term “sample size” to mean the number of observations.
1-3 Computing the mean: an example
• This is sample data of the ages of your business clients. Compute the mean
age of your clients in this sample.

Client ID  Age
1          49
2          37
3          48
4          46
5          37

• Note that this is a typical data format that we will encounter in this
course. It has the observation id (Client ID) and the value of the variable
of interest (age) for each observation.
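For this five-client sample, the computation can be sketched in Python as follows. The names `client_ages` and `sample_mean` are made up for this illustration; the ages are the ones in the table.

```python
# Ages of the five clients in the sample, keyed by Client ID.
client_ages = {1: 49, 2: 37, 3: 48, 4: 46, 5: 37}


def sample_mean(values):
    """Sum all the data, then divide by the number of observations."""
    values = list(values)
    n = len(values)  # the sample size
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


mean_age = sample_mean(client_ages.values())
print(mean_age)  # (49 + 37 + 48 + 46 + 37) / 5 = 217 / 5 = 43.4
```

The Client ID plays no role in the result; only the ages and the sample size (5) matter.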
2-1 Understanding the mathematical notation of the mean

Observation id  Variable X
1               x_1
2               x_2
3               x_3
.               .
.               .
n               x_n

This is one of the most common formats of data that we deal with. In the
first column, we have the observation id, and the second column has the value
for each observation. (Often the observation id is omitted.)

In the previous example, variable X is the age of the clients. Then
observation id = 1 means that this is the first customer in your customer
list, and x_1 is the age of that customer.
2-2 Understanding the mathematical notation of the mean

Observation id  Variable X
1               x_1
2               x_2
3               x_3
.               .
.               .
n               x_n

When a data set is given in this format, the sample mean of the variable X,
denoted by $\bar{X}$, is given by

\[
\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

The notation $\sum_{i=1}^{n} x_i$ is the summation notation. This is simply
the sum from $x_1$ to $x_n$.
2-3 Sample Mean and Population Mean
• Most often we use sample data. For example, if we want to know the
popularity rating of the current government, we may use data from 10,000
interviews. This is just a part of the whole voting population.
• Though not often, we may have the data from the whole population.
2-4 Sample Mean and Population Mean
• Later, it will become convenient to distinguish the sample mean from the
population mean. Thus we will use different notations for the sample mean and
the population mean.
2-5 Notations for the sample mean and the population mean
For a sample mean, we use the following notation:

\[
\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

For the population mean, we use $\mu$. We also use upper case $N$ to denote
the population size.

\[
\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}
\]
3-1 Cautionary note
• The mean (average) is not necessarily the “center of the data”.
3-2 Example
• “The average Japanese household saving in year 2005 is ￥17,280,000”
• This data may make you feel “well, if I do not have this much saving, I am
not normal”.
• Now, take a look at the histogram of the household saving in the next slide.
The mean may not be “the center of the data”. An example
[Figure: Histogram of Japanese Household Savings. Horizontal axis: Savings in thousand yen, in classes of 2,000 from “below 2,000” to “Above 40,000”; vertical axis: Percentage (0 to 16). The “below 2,000” class is the largest at 14.1%, and the “Above 40,000” class holds 10.7%. The sample mean (￥17,280,000) is marked.]
• One may think that the average represents the “normal household”. However,
you can see that a lot of households have savings much less than the average.
The average saving is very high because a few households have huge savings.
• In such a case, the “median” can give you a better sense of a “normal
household”. The definition of the median is given in the next slide.
4-1 Median
• Sort the data in ascending order. Then the median is the value in the
middle (middle observation).
• When the number of observations is an even number, there is no “middle
observation”. In such a case, take the average of the two middle numbers.
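The two cases of the definition can be sketched in Python as follows. The function name `median` and the variable names are made up for this illustration. The first list is the five-client age sample from the mean example; the sixth client and the savings figures are hypothetical values added only to show the even case and the effect of one very large value.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)  # sort in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd: a single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the two


# Odd number of observations: sorted ages are 37, 37, 46, 48, 49.
ages = [49, 37, 48, 46, 37]
median_odd = median(ages)

# Even number of observations, with a hypothetical sixth client aged 52:
# sorted ages are 37, 37, 46, 48, 49, 52, so the median is (46 + 48) / 2.
median_even = median(ages + [52])

# Hypothetical savings in thousand yen; one household has huge savings.
savings = [1_000, 2_000, 3_000, 5_000, 89_000]
mean_savings = sum(savings) / len(savings)
median_savings = median(savings)

print(median_odd, median_even)       # 46 47.0
print(mean_savings, median_savings)  # 20000.0 3000
```

In the hypothetical savings list, four of the five households are far below the mean of 20,000, while the median of 3,000 sits in the middle of them. This is the same pattern as in the household savings histogram.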
4-2 Median Exercise
• Open the file “Computation of median A”. This data contains the ages of a
company’s clients. Find the median age of this sample.
• Open the file “Computation of median B”. This data contains the revenue of
bag sales. Find the median of this sample.
Japanese Household saving revisited
[Figure: Histogram of Japanese Household Savings (horizontal axis: Savings in thousand yen; vertical axis: Percentage), now marked with both the sample average (￥17,280,000) and the median (￥10,520,000).]
Corresponding chapters
• This lecture note covers the following topics of the textbook.
  • 1.1 Sampling
  • Example 2.6 Pareto Diagram
  • 2.4 Arithmetic Mean, Median
