
Handout 1

Describing Data

Summarizing and Describing
Data (Graphical)

 Pareto Chart
 Histogram
Pareto Chart
 Consider a company that sells business
bags. Some items generate more revenue
than others. By ranking the items by
revenue, the company can see which
items it should emphasize (in terms of
cost management, etc.). A Pareto chart
is useful for this purpose.
Pareto Chart Example
[Figure: Pareto chart of bag sales by item. Bars: Revenue (left axis, ¥0 to ¥16,000,000). Line: Cumulative Percentage (right axis, 0% to 120%).]


Pareto Chart Example
The following data is in our IM Folder. Open the
Excel file “Bag sales”.
Item                  Price    Number sold   Revenue   Percentage revenue   Cumulative Percentage
OA Bag                25,000   250
Attache case A        18,000   50
Attache case B        17,000   60
Key Holders           3,000    300
Business Bag Black    45,000   300
Business Bag Brown    45,000   250
Name Card holders     5,000    500
Day Packs             9,000    100
Suit Case A           30,000   120
Suit Case B           25,000   90
Sum
A Procedure to Make a Pareto Chart
1. Compute the revenue for each item
2. Compute the total revenue
3. Sort the data according to the revenue
4. Compute the percentage of revenue for
each item
5. Compute the cumulative percentage of
revenue
6. Make the Pareto Chart
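
Steps 1 to 5 can also be sketched in Python, as a cross-check on the Excel work. This is only a minimal sketch: the prices and quantities are copied from the “Bag sales” table, while the function name `pareto_table` and the output layout are assumptions made for illustration. Step 6 (drawing the chart) is left to Excel or a plotting library.

```python
# Bag sales data from the handout table: item -> (price, number sold)
bag_sales = {
    "OA Bag": (25_000, 250),
    "Attache case A": (18_000, 50),
    "Attache case B": (17_000, 60),
    "Key Holders": (3_000, 300),
    "Business Bag Black": (45_000, 300),
    "Business Bag Brown": (45_000, 250),
    "Name Card holders": (5_000, 500),
    "Day Packs": (9_000, 100),
    "Suit Case A": (30_000, 120),
    "Suit Case B": (25_000, 90),
}


def pareto_table(sales):
    """Return rows (item, revenue, % of revenue, cumulative %), largest revenue first."""
    # Step 1: revenue for each item
    revenue = {item: price * qty for item, (price, qty) in sales.items()}
    # Step 2: total revenue
    total = sum(revenue.values())
    # Step 3: sort by revenue, largest first
    ranked = sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)
    # Steps 4 and 5: percentage and cumulative percentage
    rows = []
    running = 0
    for item, rev in ranked:
        running += rev
        rows.append((item, rev, 100 * rev / total, 100 * running / total))
    return rows


rows = pareto_table(bag_sales)
for item, rev, pct, cum in rows:
    print(f"{item:<20} {rev:>12,} {pct:6.1f}% {cum:6.1f}%")
```

The third row of the output gives the cumulative share of the top three items, which is the number the Pareto chart is meant to make visible.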
Pareto Chart Example
[Figure: the Pareto chart of bag sales, shown again. Bars: Revenue (left axis). Line: Cumulative Percentage (right axis).]


From the Pareto Chart example,
we can learn
 Business Bag Black, Business Bag Brown,
and OA Bag account for more than 70% of
total revenue.
These items require a lot of inventory.
 The company relies too much on a small
number of items. More marketing effort is
needed for suit cases and name card holders.
Pareto Chart Example 2

Visualizing Revenue by Clients

 Use Pivot Table


 Pareto Chart
Example (Sales by Clients
spread sheet)
This data is a part of “Sales by clients” data stored on our Applied Stat Folder.

Date     Client                 Revenue
Sep-03   Office ABC             1,000,000
Sep-03   Taiyo Advertisement    1,300,000
Sep-03   Ad soken               600,000
Sep-03   Hakuhodo               8,000,000
Oct-03   Office ABC             1,000,000
Oct-03   Hakuhodo               1,000,000
Nov-03   Ad soken               500,000
Nov-03   Daisan Kikaku          800,000
Nov-03   Asahi Agency           1,000,000
Dec-03   Office ABC             2,000,000
Dec-03   Taiyo Advertisement    900,000

From this data, we would like to make (1) a table that ranks the revenue by clients, and (2) a Pareto Chart.
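
What a pivot table does here (sum the revenue per client, then rank the clients) can be sketched in Python. The sketch uses only the eleven excerpt rows listed on this slide, so its totals will not match the full “Sales by clients” file. The function name `revenue_by_client` is an assumption made for illustration.

```python
from collections import defaultdict

# The eleven excerpt rows from this slide: (date, client, revenue)
sales = [
    ("Sep-03", "Office ABC", 1_000_000),
    ("Sep-03", "Taiyo Advertisement", 1_300_000),
    ("Sep-03", "Ad soken", 600_000),
    ("Sep-03", "Hakuhodo", 8_000_000),
    ("Oct-03", "Office ABC", 1_000_000),
    ("Oct-03", "Hakuhodo", 1_000_000),
    ("Nov-03", "Ad soken", 500_000),
    ("Nov-03", "Daisan Kikaku", 800_000),
    ("Nov-03", "Asahi Agency", 1_000_000),
    ("Dec-03", "Office ABC", 2_000_000),
    ("Dec-03", "Taiyo Advertisement", 900_000),
]


def revenue_by_client(rows):
    """Sum revenue per client and return (client, total) pairs, largest first."""
    totals = defaultdict(int)
    for _date, client, revenue in rows:
        totals[client] += revenue
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = revenue_by_client(sales)
grand_total = sum(rev for _client, rev in ranking)

# Print the ranking with percentage and cumulative percentage of revenue
cumulative = 0
for client, rev in ranking:
    cumulative += rev
    print(f"{client:<22} {rev:>12,} {rev / grand_total:6.0%} {cumulative / grand_total:6.0%}")
```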


Revenue Ranking Table
(Example of the Use of Pivot
Table)
Client                 Revenue       % Revenue   Cumulative %
Hakuhodo               21,000,000    47%         47%
Office ABC             6,800,000     15%         62%
Daisan Kikaku          6,200,000     14%         76%
Asahi Agency           3,900,000     9%          85%
Ad soken               3,600,000     8%          93%
Taiyo Advertisement    3,200,000     7%          100%
Grand Total            44,700,000    100%


Pareto Chart Example 2

[Figure: Pareto chart “Revenue by clients”. Bars: Revenue (left axis, 0 to 25,000,000). Line: Cumulative % (right axis, 0% to 120%). Clients in order: Hakuhodo, Office ABC, Daisan Kikaku, Asahi Agency, Ad soken, Taiyo Advertisement.]
Histogram and frequency table

Example

Visualizing your clients’ age range using a histogram.
Histogram Example
Age range   Frequency
~ 15        0
~ 20        0
~ 25        4
~ 30        5
~ 35        11
~ 40        11
~ 45        6
~ 50        4
~ 55        2
~ 60        0
More        0

[Figure: Histogram of the frequency table above. Horizontal axis: Clients' Age range. Vertical axis: Frequency (0 to 12).]
From the histogram, we can
learn that
 Clients of age between 30 and 40 (the “~ 35” and
“~ 40” ranges, 11 clients each) are the primary clients.

It is important to maintain the satisfaction of these clients.

Provide new services for other age ranges to increase client base.
Making Histogram and
Frequency Table
 Open the data “Clients list” which is
stored in our Applied Stat Folder. This is
the data for the histogram shown in the
previous slides.
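
For readers without Excel, the counting behind a frequency table can be sketched in Python. The ages below are hypothetical (the “Clients list” file is not reproduced in this handout), and the sketch assumes that each label such as “~ 35” is the upper bound of its range, so “~ 35” counts ages above 30 up to and including 35.

```python
from bisect import bisect_left


def frequency_table(values, upper_bounds):
    """Count values into ranges (previous bound, bound]; larger values go to 'More'."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # bisect_left returns the index of the first bound >= v,
        # or len(upper_bounds) when v exceeds every bound ("More")
        counts[bisect_left(upper_bounds, v)] += 1
    labels = [f"~ {b}" for b in upper_bounds] + ["More"]
    return list(zip(labels, counts))


ages = [23, 25, 28, 31, 34, 35, 36, 38, 40, 42, 47, 52]  # hypothetical client ages
bins = list(range(15, 65, 5))  # upper bounds 15, 20, ..., 60 as in the handout

table = frequency_table(ages, bins)
for label, count in table:
    # A simple text histogram: one '#' per client
    print(f"{label:<6} {count:>2} {'#' * count}")
```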
Numerical Measure of data
summary (I)
 Difference between Population and
Sample
 Mean (Average)
 Median
Difference between
Population and Sample

Population
A population is the complete set of all items
in which an investigator is interested.
Examples of Populations
 Names of all registered voters in the
United States.
 Incomes of all families living in Daytona
Beach.
 Grade point averages of all the students in
your university.
 A major objective of statistics is to make
an inference about the population. For
example: “What is the average income of
all families living in Daytona Beach?”
 Often, collecting the data for the
population is costly or impossible.
Therefore, we often collect data for only a
part of the population. Such data is called
a “Sample”.
A Sample

Sample
A sample is an observed subset of
population values.
Numerical Measure of
Summarizing Data

1-1 Mean (Average)

 How to compute the mean (average)


 Understanding the mathematical notation
of the mean (average)
 Cautionary notes for the use of the mean
1-2 How to compute the mean

 Sum all the data, then divide the sum by the
number of observations.
 We use the term “sample size” to mean
the number of observations.
1-3 Computing the mean: an
example
• This is sample data on the ages of your business clients. Compute the mean age of your clients in this sample.

Client ID   Age
1           49
2           37
3           48
4           46
5           37

• Note that this is a typical data format that we will encounter in this course. It has the observation id (Client ID) and the value of the variable of interest (age) for each observation.
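
The computation for this sample can be checked with a few lines of Python. This is a minimal sketch of the rule “sum all the data, then divide by the number of observations”; the helper name `mean` is chosen here for illustration.

```python
# Ages of the five clients in the sample (Client ID 1 to 5)
ages = [49, 37, 48, 46, 37]


def mean(values):
    """Sum all the data, then divide by the number of observations (sample size)."""
    return sum(values) / len(values)


sample_size = len(ages)
mean_age = mean(ages)
print(f"sample size = {sample_size}, sum = {sum(ages)}, mean age = {mean_age}")
```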
2-1 Understanding the
mathematical notation of the
mean
Observation id   Variable X
1                x1
2                x2
3                x3
.                .
.                .
n                xn

This is one of the most common data formats that we deal with. The first column has the observation id, and the second column has the value for each observation. (Often the observation id is omitted.)

In the previous example, variable X is the age of the clients. Observation id = 1 means that this is the first customer in your customer list, and x1 is the age of that customer.
2-2 Understanding the
mathematical notation of the
mean
When a data set is given in this format (observation id 1, ..., n in the first column and the values $x_1, \ldots, x_n$ of variable X in the second), the sample mean of the variable X, denoted by $\bar{X}$, is given by

\[
\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

The notation $\sum_{i=1}^{n} x_i$ is the summation notation. This is simply the sum from $x_1$ to $x_n$.
2-3 Sample Mean and
Population Mean
 Most often we use sample data. For
example, if we want to know the
popularity rating of the current
government, we may use data from 10,000
interviews. This is just a part of the whole
voting population.
 Less often, we may have data
from the whole population.
2-4 Sample Mean and
Population Mean
 Later, it will be convenient to
distinguish the sample mean from the
population mean. Thus we will use different
notations for the sample mean and the
population mean.
2-5 Notations for the sample
mean and the population mean

For the sample mean, we use the following notation:

\[
\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

For the population mean, we use μ. We also use upper case N to denote the population size (the number of observations in the population):

\[
\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}
\]
3-1 Cautionary note
 The mean (average) is not necessarily the
“center of the data”.
3-2 Example
 “The average Japanese household saving
in year 2005 is ¥17,280,000.”

This figure may make you feel: “Well, if I do not have this much saving, I am not normal.”

Now, take a look at the histogram of household savings in the next slide.
The mean may not be “the center of the data”. An example
[Figure: Histogram of Japanese Household Savings. Horizontal axis: Savings in thousand yen, in ranges of 2,000 from “below 2,000” to “above 40,000”. Vertical axis: Percentage (0 to 16). The tallest bar is “below 2,000” at 14.1%, the bars shrink steadily to about 1% near 40,000, and “above 40,000” holds 10.7%. Annotation: Sample mean = 17,280,000.]
 One may think that the average describes the
“normal household”. However, you can
see that a lot of households have savings
much lower than the average. The average
saving is very high because a few
households have huge savings.
 In such a case, the “median” can give you a
better sense of a “normal household”. The
definition of the median is given in the
next slide.
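
The effect can be reproduced with a tiny made-up data set. The five savings figures below are hypothetical (they are not taken from the household survey) and are chosen so that one very large value pulls the mean far above what most households have. The sketch uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical savings of five households, in thousand yen.
# Four households hold modest savings; one holds a very large amount.
savings = [200, 300, 400, 500, 10_000]

avg = mean(savings)    # pulled upward by the single large value
mid = median(savings)  # the middle value after sorting

below_avg = sum(1 for s in savings if s < avg)
print(f"mean = {avg}, median = {mid}, households below the mean = {below_avg} of {len(savings)}")
```

Four of the five households are below the mean, which is the same pattern as in the savings histogram.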
4-1 Median
Sort the data in ascending order.
Then the median is the value in the
middle (the middle observation).

When the number of observations is an even number, there is no
“middle observation”. In such a case,
take the average of the two middle
numbers.
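
The definition translates directly into a short Python function. This is a sketch for illustration: the function name `median` is chosen here, the odd-sized example reuses the five client ages from slide 1-3, and the even-sized example simply drops the last of those ages.

```python
def median(values):
    """Sort the data; return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: there is a single middle observation
        return ordered[middle]
    # Even number of observations: average the two middle numbers
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [49, 37, 48, 46, 37]   # sorted: 37, 37, 46, 48, 49
even_sample = [49, 37, 48, 46]      # sorted: 37, 46, 48, 49

median_odd = median(odd_sample)
median_even = median(even_sample)
print(median_odd, median_even)
```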
4-2 Median Exercise
 Open the file “Computation of median
A”. This data contains the age of a
company’s clients. Find the median age of
this sample.
 Open the file “Computation of median B”.
This data contains the revenue of bag
sales. Find the median of this sample.
Japanese Household saving revisited
[Figure: Histogram of Japanese Household Savings. Horizontal axis: Savings in thousand yen. Vertical axis: Percentage (0 to 16). Annotations: Median = 10,520,000; Sample Average = 17,280,000.]
Corresponding chapters
 This lecture note covers the following
topics of the textbook.

 1.1 Sampling
 Example 2.6 Pareto Diagram
 2.4 Arithmetic Mean, Median
