Chap 1 PZ
Summarizing Data
1.1 Graphical summaries of the data
Dot plot and histogram
The time series plot
1.2 Numerical descriptive measures
1.3 Measures of central tendency
The sample mean
The median
Mean versus median
1.4 Measures of dispersion
The sample variance
The sample standard deviation
1.5 The empirical rule
1.6 How to relate two things
1.7 Linearly related variables
Linear functions
Mean and variance of a linear function
Linear combinations
Mean and variance of a linear combination
1.1 Graphical Summaries of the Data
© Imperial College Business School
•Data is the statistician’s raw material, the numbers that
we use to interpret reality
E = (1+.1)B = (1.1)B
Interpret:
The returns are centered or located at about .01.
The spread or variation in the returns is huge.
[Figure: dot plot for canada, marking the center or location of the data]
Some data does not have the mound shape.
[Figure: histogram of Volume]
It is skewed to the left.
We also have data on countries other than Canada.
Let us compare Canada with Japan.
It really helps to get things on the same scale.
How is Japan different from Canada?
Mutual fund data:
• Dreyfus growth fund
• Putnam income fund
• Equally weighted market
• T-bills
The beer data:
nbeerm: the number of beers male MBA students claim
they can drink without getting drunk
nbeerf: same for females
We call a point like this an outlier.
Generally the males claim they can drink more, their numbers are
centered or located at larger values.
The number of bars you use affects how “smooth” the picture looks.
• The return data has an important feature that the beer data does
not have
• It has an order!
On the vertical axis we have returns.
On the horizontal axis we have “time”.
Now do you see a pattern?
$x_1, x_2, x_3, \ldots, x_n$
$x_1$ is the first number and $x_n$ is the last number.
n is the number of numbers, or the “number of observations.” You may also hear it referred to as the “sample size.”
Example with n = 5:
$x_1 = 5,\ x_2 = 2,\ x_3 = 8,\ x_4 = 6,\ x_5 = 2$
$$\bar{x} = \frac{\text{sum}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
We often use the symbol $\bar{x}$ to denote the mean of the numbers x.
We call it “x bar”.
Here is a more compact way to write the same thing…
Consider $x_1 + x_2 + \cdots + x_n$.
We use a shorthand for it (it is just notation):
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
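The formula can be checked numerically. The sketch below, in the same MatLab style used in these notes, computes the mean of a small made-up data set both from the formula and with the built-in mean. The vector x is a hypothetical illustration, not course data.

```matlab
% Hypothetical data, chosen only for illustration (not from the course files).
x = [4 7 1 8 5];

n = numel(x);                 % number of observations (the sample size)
xbar_formula = sum(x) / n;    % (1/n) * (x_1 + x_2 + ... + x_n)
xbar_builtin = mean(x);       % built-in sample mean, should agree
```

Both variables should hold the same value, since mean is just the sum divided by n.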
Graphical interpretation of the sample mean
Let us go back to our standard histogram.
To summarize this we can compute the average value for both men and women.
>> mean(nbeerm(1:Tm))
ans = 7.862500000000000
>> mean(nbeerf(1:Tf))
ans = 4.222222222222222
• Let us compare the means of the Canadian
and Japanese returns
>> mean(canada)
ans = 0.009065420560748
>> mean(japan)
ans = 0.002336448598131
$\sum_{i=1}^{n} x_i$ means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$.
Compute y bar.
(Here, we do not sum over all observations: we sum only the second and third.)
Example
1,2,3,4,5 Median = 3
1,1,2,3,4,5 Median = (2+3)/2 =2.5
Example:
1,2,3,4,5 Mean: 3 Median: 3
1,2,3,4,100 Mean: 22 Median: 3
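The comparison above can be reproduced in a few lines. This is a minimal sketch using the same numbers as the examples; the variable names are chosen here for illustration.

```matlab
% The two data sets from the mean-versus-median example.
a = [1 2 3 4 5];
b = [1 2 3 4 100];            % same data with one extreme value

stats_a = [mean(a) median(a)];   % mean and median of a
stats_b = [mean(b) median(b)];   % the mean moves a lot, the median does not

% With an even number of observations the median averages the two middle values.
c = [1 1 2 3 4 5];
med_c = median(c);
```

A single extreme value pulls the mean from 3 to 22 while the median stays at 3, which is why the median is often preferred when outliers are present.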
[Figure: dot plot of the y data on an axis running from 0.030 to 0.105]
The y numbers are more spread out than the x numbers.
We want a numerical measure of variation or spread.
$x_i - \bar{x}$
[Figure: dot plots of the x data and the y data on a common axis running from 0.030 to 0.105]
$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$s_x = \sqrt{s_x^2}$$
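As a check on the two formulas, the sketch below computes the sample variance and standard deviation step by step and compares them with the built-in var and std. It uses the four x values that appear in the covariance example later in these notes; the variable names are illustrative.

```matlab
% The x data from the covariance example in these notes.
x = [0.07 0.06 0.04 0.03];

n   = numel(x);
dev = x - mean(x);               % deviations x_i - xbar
s2  = sum(dev.^2) / (n - 1);     % sample variance, dividing by n-1
s   = sqrt(s2);                  % sample standard deviation

% The built-ins use the same n-1 divisor by default.
s2_builtin = var(x);
s_builtin  = std(x);
```

The standard deviation comes out at roughly .0183, in the same units as the data, which is why it is usually easier to interpret than the variance.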
XX
= .02, .01, .01, .02
The sample standard deviation for the y data is bigger than that for the x data.
This numerically captures the fact that y has “more variation” about its mean than x.
Example 2 (graphical)
The standard deviations measure the fact that there is more spread in the Japanese returns.
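$\bar{x}$: where the data is
$s_x$: how spread out, how variable the data is
Approximately 68% of the data is in the interval
$(\bar{x} - s_x,\ \bar{x} + s_x) = \bar{x} \pm s_x$
Approximately 95% of the data is in the interval
$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$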
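One way to get a feel for the rule is to count how much of a mound-shaped sample actually falls inside each interval. The sketch below uses simulated data from randn as a stand-in for real returns; the center .01, the spread .04, and the sample size are arbitrary assumptions.

```matlab
% Simulated mound-shaped data (an assumption standing in for real returns).
x = 0.01 + 0.04 * randn(10000, 1);

xbar = mean(x);
sx   = std(x);

% Fraction of observations within one and two standard deviations of the mean.
frac1 = mean(abs(x - xbar) <= sx);       % expected to be near 0.68
frac2 = mean(abs(x - xbar) <= 2 * sx);   % expected to be near 0.95
```

The fractions vary a little from run to run because the data is random. For data that is not mound shaped, they can be quite different from 68% and 95%.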
The empirical rule will help us understand sx and relate the
summaries back to the histogram
$s_x = .03833$
[Figure: density histogram of canada, with dotted lines at $\bar{x} \pm s_x$ and dashed lines at $\bar{x} \pm 2s_x$]
The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines.
Looks reasonable.
Same thing viewed from the perspective of the time series plot.
[Figure: time series plot with horizontal lines at $\bar{x}$ and at $\bar{x} \pm 2s_x$]
5% outside would be about 5 points. There are 4 points outside, which is pretty close.
A little finance: comparing mutual funds
Let us use the means and standard deviations to compare mutual
funds.
For 9 different assets we compute the means and standard deviations.
Then, we plot the means versus the standard deviations.
The assets are:
Variable    N     Mean      StDev
drefus      180   0.00677   0.04724
fidel       180   0.00470   0.05659
keystne     180   0.00654   0.08424
Putnminc    180   0.00552   0.03008
scudinc     180   0.00443   0.03597
windsor     180   0.01002   0.04864
eqmrkt      180   0.01082   0.06856
valmrkt     180   0.00681   0.04800
tbill       180   0.00598   0.00252
It is considered good to have a large
mean return and a small standard deviation.
[Figure: scatter plot of Mean (vertical axis) against StDev (horizontal axis) for the nine assets]
Let us compare some countries.
Based on monthly returns from ‘88 to ‘96.
[Figure: scatter plot of Mean (vertical axis) against StDev (horizontal axis) for honkong, usa, singapor, france, belgium, germany, australi, finalnd, canada, italy, japan]
nbeer   weight   i
12.0    192      1
12.0    160      2
5.0     155      3
5.0     120      4
7.0     150      5
13.0    175      6
4.0     100      7
12.0    165      8
12.0    165      9
12.0    150      10
. . .
[Figure: scatter plot of nbeer (vertical axis) against weight (horizontal axis)]
Now we think of each pair of numbers as an observation.
Each pair corresponds to a person.
Each person has two numbers associated with him/her,
# beers and weight.
Each pair corresponds to a point on the plot.
Example:
[Figure: scatter plot with windsor on the vertical axis]
Each point corresponds to a month.
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
[Figure: scatter plot of y1 against x1]
Correlation of y1 and x1 = 0.019
[Figure: scatter plot of y2 against x2]
Correlation of y2 and x2 = 0.995
[Figure: scatter plot of y3 against x3]
Correlation of y3 and x3 = 0.586
[Figure: scatter plot of y4 against x4]
Correlation of y4 and x4 = -0.982
[Figure: scatter plot of y5 against x5]
Correlation of y5 and x5 = 0.210
[Figure: scatter plot of canada (vertical axis) against usa (horizontal axis)]
            japan    usa
usa         0.246
singapor    0.407    0.473
x y
0.07 0.11
0.06 0.05
0.04 0.09
0.03 0.03
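First, let us compute the covariance (which is a necessary ingredient to compute the correlation):
$$
\begin{aligned}
s_{xy} &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
&= \frac{1}{3}\big((.07 - .05)(.11 - .07) + (.06 - .05)(.05 - .07) + (.04 - .05)(.09 - .07) + (.03 - .05)(.03 - .07)\big) \\
&= \frac{1}{3}\big(.02 \cdot .04 + .01 \cdot (-.02) + (-.01) \cdot .02 + (-.02) \cdot (-.04)\big) \\
&= \frac{1}{3}(.0008 - .0002 - .0002 + .0008) = \frac{1}{3}(.0012) = .0004
\end{aligned}
$$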
[Figure: scatter plot of y against x, split into four regions by lines at $\bar{x}$ and $\bar{y}$: (I) upper right, (II) lower left, (III) upper left, (IV) lower right]
Points in (I) have both x and y bigger than their means so we get a
positive contribution to the covariance.
Points in (II) have both x and y less than their means so we get a
positive contribution to the covariance.
In (III) and (IV) one of x and y is less than its mean and the other is
greater so we get a negative contribution. The further out the point is,
the bigger the contribution.
[Figure: scatter plot annotated “Lots of positive contributions” and “just a few relatively small contributions”]
$$r_{xy} = \frac{.0004}{(.0365)(.0183)} = .6$$
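The hand calculation can be checked in MatLab. The sketch below computes the covariance and correlation of the four (x, y) pairs directly from the formulas and again with the built-in cov and corrcoef; the variable names are illustrative.

```matlab
% The four (x, y) pairs from the worked example.
x = [0.07 0.06 0.04 0.03]';
y = [0.11 0.05 0.09 0.03]';
n = numel(x);

% Covariance and correlation straight from the formulas.
sxy = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);
rxy = sxy / (std(x) * std(y));

% Built-in versions: the off-diagonal entry is the quantity for the pair.
C = cov([x y]);        % 2-by-2 covariance matrix
R = corrcoef([x y]);   % 2-by-2 correlation matrix
sxy_builtin = C(1, 2);
rxy_builtin = R(1, 2);
```

Both routes should give a covariance of .0004 and a correlation of .6. Passing the two columns as one matrix keeps the call unambiguous: the diagonal of C holds the two variances.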
$y = c_0 + c_1 x$
$c_0$: the intercept
$c_1$: the slope
We think of the c’s as constants (fixed numbers) while x and y vary.
Let us look at our temperature example.
Suppose we first multiply by (9/5) and then add 32.
>> cel = [ -10 0 10 15 20 25 30 35 ]';
>> mul = (9/5)*cel;
>> fahr = 32+mul;
>> mean([ cel mul fahr])
ans =
15.625000000000000
28.125000000000000
60.125000000000000
>> std([ cel mul fahr])
ans =
15.221577729375776
27.398839912876394
27.398839912876394
[Figure: dot plots of cel, mul, and fahr on a common axis running from 0 to 150]
Interpret
$$s_y^2 = c_1^2 s_x^2$$
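The temperature data gives a quick check of both rules for a linear function: the mean follows $\bar{y} = c_0 + c_1 \bar{x}$ and the variance follows $s_y^2 = c_1^2 s_x^2$, so the intercept shifts the mean but does not change the spread. A minimal sketch, with the comparison variables named here for illustration:

```matlab
% Celsius temperatures from the example, converted to Fahrenheit.
cel  = [-10 0 10 15 20 25 30 35]';
c0   = 32;
c1   = 9/5;
fahr = c0 + c1 * cel;

% Mean of a linear function: ybar = c0 + c1 * xbar.
mean_direct  = mean(fahr);
mean_formula = c0 + c1 * mean(cel);

% Variance of a linear function: s_y^2 = c1^2 * s_x^2 (c0 drops out).
var_direct  = var(fahr);
var_formula = c1^2 * var(cel);
```

Taking square roots gives $s_y = |c_1| s_x$, which is why std reports the same value for mul and fahr in the output above.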
Example:
Suppose our movie star also gets 5 percent of
all sales of the CD released with the movie.
How is the star’s income related to the film’s
gross and CD sales (in millions of dollars)?
$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k$
y is a linear combination of the x’s.
ci is the coefficient of xi.
(100)*.5*(1+.1) + (100)*.5*(1+.15)
= 100*(1 + .5*.1 + .5*.15) = 100(1 + $R_P$)
$R_P = w_1 x_1 + w_2 x_2$
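The same arithmetic in MatLab, as a small sketch. The weights and returns are the ones in the calculation above; the variable names are chosen here for illustration.

```matlab
% Half of 100 goes into an asset returning 10%, half into one returning 15%.
w = [0.5 0.5];      % portfolio weights w1, w2
x = [0.10 0.15];    % returns x1, x2 on the two assets

Rp    = sum(w .* x);      % portfolio return: w1*x1 + w2*x2
value = 100 * (1 + Rp);   % end-of-period value of the 100 invested
```

Here Rp works out to .125, so the 100 grows to 112.5. The portfolio return is a linear combination of the asset returns with the weights as coefficients.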
This is beautiful (…some people get a kick out of weird stuff!)
Questions:
Let us use our country data and suppose that we had put
.5 into USA and .5 into Hong Kong.
What would our returns have been?
In MatLab:
>> port = .5*honkong + .5*usa
[Figure: scatter plot of Mean (vertical axis) against StDev (horizontal axis) showing honkong, usa, canada, and the portfolio]
Forming portfolios is an interesting thing to do!
Suppose,
$$y = c_0 + c_1 x_1 + c_2 x_2$$
Then,
$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2$$
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2}$$
>> cov([ honkong usa port])
Covariances
            honkong       usa           port
honkong     0.00521497
usa         0.00103037    0.00110774
port        0.00416882    0.00104972    0.00338905
.0033 =
(.25)*(.25)*.00111 + (.75)*(.75)*.0052+(2)*(.25)*(.75)*(.00103)
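The arithmetic line above uses weights of .75 on honkong and .25 on usa. Under that reading, the whole port row of the covariance table can be rebuilt from the honkong and usa entries alone. This is a sketch, with the covariance values typed in from the table rather than computed from the return series.

```matlab
% Covariance matrix of (honkong, usa), copied from the table above.
S = [0.00521497 0.00103037;
     0.00103037 0.00110774];

% Weights read off the arithmetic line: .75 in honkong, .25 in usa.
w = [0.75; 0.25];

% Variance of the portfolio: w1^2*s1^2 + w2^2*s2^2 + 2*w1*w2*s12.
var_port = w' * S * w;

% Covariance of the portfolio with each asset.
cov_port_hk  = w' * S(:, 1);
cov_port_usa = w' * S(:, 2);
```

The quadratic form w' * S * w is just a compact way of writing the two-variable variance formula. The three results should match the port row of the table up to rounding.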
y = .5x1 + .5x2
[Figure: scatter plot of x2 against x1. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]
The variances and covariance are:
       x1           x2
x1     1.334636
x2    -1.208679     1.106238
Then, the variance of y is
.0059 = .5*.5*1.334636 + .5*.5*1.106238 + 2*.5*.5*(-1.208679)
[Figure: scatter plot of x2 against x1. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]
The variances and covariance are:
       x1           x2
x1     1.158167
x2     1.046490     0.9609463
Then, the variance of y is
1.053 = .5*.5*1.158 + .5*.5*.961 + 2*.5*.5*1.0465
Why is the variance of y not so much smaller than those of the x’s ?
Example:
y = .5x1 + .5x2
[Figure: scatter plot of x2 against x1. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]
The variances and covariance are:
       x1           x2
x1     1.3870537
x2     0.1976187    0.8247886
Then, the variance of y is
0.65175 = .5*.5*1.387 + .5*.5*.8248 + 2*.5*.5*.1976
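Putting the three examples side by side shows how much the covariance term matters. The sketch below applies the same formula to the three sets of variances and covariances quoted in these examples; the matrix and variable names are illustrative.

```matlab
% Equal weights, as in y = .5*x1 + .5*x2.
c = [0.5; 0.5];

% Variances (diagonal) and covariance (off-diagonal) from the three examples.
S_neg  = [1.334636 -1.208679; -1.208679 1.106238];   % strongly negative covariance
S_pos  = [1.158167  1.046490;  1.046490 0.9609463];  % strongly positive covariance
S_weak = [1.3870537 0.1976187; 0.1976187 0.8247886]; % weak covariance

% Variance of y: c1^2*s1^2 + c2^2*s2^2 + 2*c1*c2*s12.
v_neg  = c' * S_neg  * c;
v_pos  = c' * S_pos  * c;
v_weak = c' * S_weak * c;
```

With a strongly negative covariance the variance of y nearly vanishes (about .006), with a strongly positive covariance it stays close to the variances of the x’s (about 1.05), and with a weak covariance it falls in between (about .65).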
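Suppose,
$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_k x_k$$
Then,
$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 + \cdots + c_k \bar{x}_k$$
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + \cdots + c_k^2 s_{x_k}^2 + 2 \sum_{i<j} c_i c_j s_{x_i x_j}$$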
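$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$$
$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3$$
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + \underbrace{2 c_1 c_2 s_{x_1 x_2} + 2 c_1 c_3 s_{x_1 x_3} + 2 c_2 c_3 s_{x_2 x_3}}_{\text{Covariances}}$$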
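For three or more variables it is easier to let the covariance matrix do the bookkeeping. The sketch below builds a linear combination of three simulated variables and checks that its mean and variance match the formulas. The data, the mixing matrix, and the coefficients are all made-up assumptions for illustration.

```matlab
% Three correlated variables, simulated (hypothetical data, not course data).
X = randn(200, 3) * [1 0.5 0; 0 1 0.3; 0 0 1];

% A linear combination y = c0 + c1*x1 + c2*x2 + c3*x3 with made-up coefficients.
c0 = 2;
c  = [0.2; 0.3; 0.5];
y  = c0 + X * c;

% Mean of the combination: c0 + c1*xbar1 + c2*xbar2 + c3*xbar3.
mean_formula = c0 + mean(X) * c;

% Variance: c' * S * c collects every c_i^2 * s_i^2 and every 2*c_i*c_j*s_ij term.
S = cov(X);
var_formula = c' * S * c;

% Direct calculations for comparison.
mean_direct = mean(y);
var_direct  = var(y);
```

The two routes agree up to floating-point rounding for any data set, because the formulas are exact identities for sample means and sample variances, not approximations.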