FRM P1.Quantitative-Analysis

Quantitative Analysis
FRM 2013 Study Notes – Part1.Topic2
By David Harper, CFA FRM CIPM

www.bionicturtle.com
Table of Contents
Miller, Chapter 2: Probabilities ............................................................................................. 2

Miller, Chapter 3: Basic Statistics ........................................................................................ 20
Miller, Chapter 4: Distributions ........................................................................................... 36
Miller, Chapter 5: Hypothesis Testing and Confidence Intervals ......................................... 60
Stock & Watson’s Probability and Statistics Review (Chapters 2 & 3) ................................ 75
Stock, Chapter 4: Linear Regression with one regressor.................................................... 78
Stock, Chapter 5: Single Regression: Hypothesis Tests and Confidence Intervals ............... 90
Stock: Chapter 6: Linear Regression with Multiple Regressors ........................................... 95
Stock, Chapter 7: Hypothesis Tests and Confidence Intervals in Multiple Regression ........100
Jorion, Chapter 12: Monte Carlo Methods .........................................................................104
Hull, Chapter 22: Estimating Volatilities and Correlations .................................................115
Allen, Boudoukh, and Saunders, Chapter 2: Quantifying Volatility in VaR Models ...........125
www.bionicturtle.com FRM 2012  QUANTITATIVE ANALYSIS  1

Miller, Chapter 2:
Probabilities
In this chapter …
 Describe the concept of probability.
 Describe and distinguish between continuous and discrete random variables.
 Define and distinguish between the probability density function, the cumulative
distribution function and the inverse cumulative distribution function, and
calculate probabilities based on each of these functions.
 Calculate the probability of an event given a discrete probability function.
 Distinguish between independent and mutually exclusive events.
 Define joint probability, describe a probability matrix and calculate joint
probabilities using probability matrices.
 Define and calculate a conditional probability, and distinguish between conditional
and unconditional probabilities.
 Describe Bayes’ Theorem and apply it in the calculation of conditional probabilities.
Terminology (for discrete/compound, these do not need to be memorized)

 Statistical or random experiment: an observation or measurement process with
multiple but uncertain outcomes
 Population or sample space: set of all possible outcomes of an experiment; Sample
point: each member or outcome of the sample space
 Outcome: the result of a single trial. For example, if we roll two dice, an outcome might
be a three (3) and a four (4); a different outcome might be a (5) and a (2).
 Event: the result that reflects none, one, or more outcomes in the sample space. Events
can be simple or compound. An event is a subset of the sample space. If we roll two-
dice, an example of an event might be rolling a seven (7) in total.
 Random variable (or stochastic variable): A stochastic or random variable (r.v.) is a
“variable whose value is determined by the outcome of an experiment”
 Discrete random variable: r.v. that can take a finite number of values (or countably
infinite). For example, coin, six-sided die, bond default (yes or no).
 Continuous random variable: r.v. that can take any value in some interval. For
example, asset returns, time.
 Mutually exclusive events: events which cannot simultaneously occur. If A and B are
mutually exclusive, the probability of (A and B) is zero. Put another way, their
intersection is the null set.
 Collectively exhaustive events (a.k.a., cumulatively exhaustive): events that
cumulatively describe all possible outcomes.

Describe the concept of probability.
Assume a discrete random variable X, which can take various values, x(i). The probability of any
given x(i) occurring is p(i), which is represented by the following:
P  X  xi   pi s.t . x  x1, x2, , xn 

where P   is the probability operator
The probability (p) of an event always falls between zero (0) and one (1.0 or 100%). In the
case of a discrete random variable, a probability function solves for the expectation that a
variable will equal a certain value. In the case of a continuous random variable, the probability
function gives the likelihood that the random variable will fall between specified intervals.
If f(x) is a probability function, in order to satisfy the definition of a “probability” the following
two conditions must be true:
 The probability function f(x) must be greater than (or equal to) zero and less than (or
equal to) one; for example, there is no such thing as a -30% probability of occurrence or
a 120% probability of occurrence.
 The sum of each mutually exclusive probability must be equal to one: if the
probabilities are mutually exclusive and cumulatively exhaustive (we include all possible
outcomes), the sum of those probabilities must equal one.
In mathematical terms, these conditions are represented as properties of a probability, where

f(x) characterizes a discrete random variable:
1st condition: 0  f ( x)  1
2nd condition:  f ( x)  1
x
A two-sided coin is the classic discrete random variable (Bernoulli). If we flip a coin,
there are two possible outcomes: heads or tails. The first condition above insists that the
probability of flipping a head (or a tail) must lie between 0% and 100%. And the odds for a
head are 50%. The second condition insists that all probability functions must sum to
100%: 50% probability of “heads” plus 50% probability of “tails” = 100% (1.0). In other
words, “outcome = heads” and “outcome = tails” cover all the possible outcomes; we
aren’t omitting another possible outcome.
In contrast to the case of a discrete random variable, a continuous random variable

does not have exact outcomes like 1.0 or 3.5. For example, asset returns. Strictly, we
must define continuous outcomes on an interval: instead of P[X=x] we need P[x1 < X <
x2]. For a continuous variable, the probability of any specific value occurring is zero.

Describe and distinguish between continuous and discrete random
variables.
We characterize (describe) a random variable with a probability distribution. The random

variable can be discrete or continuous; and in either the discrete or continuous case, the
probability can be local (PMF, PDF) or cumulative (CDF).
A random variable is a variable whose value is determined by the outcome of an

experiment (a.k.a., stochastic variable). “A random variable is a numerical summary of
a random outcome. The number of times your computer crashes while you are writing a
term paper is random and takes on a numerical value, so it is a random variable.”—S&W
Continuous
Pr (c1 ≤ Z ≤ c2) =
Pr (Z ≤ c)= φ(c)
φ(c2) - φ(c1)
probability Cumulative
function Distribution
(pdf, pmf) Function (CDF)
Pr (X = 3) Pr (X ≤ 3)
Discrete
Continuous random variable
A continuous random variable (X) has an infinite number of values within an interval:
b
P (a  X  b)  a f ( x )dx

Discrete random variable
A discrete random variable (X) assumes a value among a finite set including x1, x2, x3 and so
on. The probability function is expressed by:
P( X  xk )  f ( xk )
Notes on continuous versus discrete random variables
 Discrete random variables can be counted. Continuous random variables must be

measured.
 Examples of a discrete random variable include: coin toss (head or tails, nothing in
between); roll of the dice (1, 2, 3, 4, 5, 6); and “did the fund beat the
benchmark?”(yes, no). In risk, common discrete random variables are default/no
default (0/1) and loss frequency.
 Examples of continuous random variables include: distance and time. A common

example of a continuous variable, in risk, is loss severity.
 Note the similarity between the summation (∑ ) under the discrete variable and the
integral (∫) under the continuous variable. The summation (∑) of all discrete
outcomes must equal one. Similarly, the integral (∫) captures the area under the
continuous distribution function. The total area “under this curve,” from (-∞) to (∞),
must equal one.
 All four of the so-called sampling distributions—that each converge to the

normal—are continuous: normal, student’s t, chi-square, and F distribution.

Summary
Continuous Discrete
Are measured Are counted
Infinite Finite
Examples in Finance
Distance, Time (e.g.) Default (1,0) (e.g.)
Severity of loss (e.g.) Frequency of loss (e.g.)
Asset returns (e.g.)
For example
Normal Bernoulli (0/1)

Student’s t Binomial (series i.i.d. Bernoullis)
Chi-square Poisson
F distribution Logarithmic
Lognormal
Exponential
Gamma, Beta
EVT Distributions (GPD, GEV)

Define and distinguish between the probability density function, the
cumulative distribution function and the inverse cumulative distribution
function, and calculate probabilities based on each of these functions.
Probability density functions (pdf): top row below
The probability density function answers a “local” question: If the random variable is discrete,
the pdf (a.k.a., probability mass function, pmf, if discrete) is the probability the variable will
assume an exact value of x; i.e., PMF: P(X = x). If the variable is continuous, the pdf tells us the
likelihood of outcomes occurring on an interval between any two points. The pdf functions are
illustrated on the top row below, continuous (left-hand) and discrete (right-hand):
r2
P r1  X  r2    f ( x )dx  p f ( x )  P( X  xi )
r1
a
F (a )   f ( x )dx  P  X  a  F (a)  P( X  x )


Cumulative distribution functions (CDF): bottom row above
The cumulative density function (CDF) associates with either a PMF or PDF (i.e., the CDF can
apply to either a continuous or random variable). The CDF gives the probability the random
variable will be less than, or equal to, some value. CDF: P(X  x).
Inverse cumulative distribution function
If F(x) is a cumulative distribution function, then we define F −1(p), the inverse cumulative
distribution, as follows:
F(x)  p  F 1( p)  x s.t. 0  p  1

We can examine the inverse cumulative distribution function by applying it to the standard
normal distribution, N(0,1). For example, F(-1)(95%) = 1.645 because at +1.645 standard
deviations to the right of mean in a standard normal, 95% of the area is to the left (under the
curve). Here are other common inverse CDFs:
The inverse cumulaitve distribution function (CDF) is also called the quantile function;
see http://en.wikipedia.org/wiki/Quantile_function. We can now see why Dowd says that
“VaR is just a quantile [function];” e.g., 95% VaR is the inverse CDF for p = 5% or p 95%.
Univariate versus multivariate probability density functions
A single variable (univariate) probability distribution is concerned with only a single random
variable; e.g., roll of a die, default of a single obligor. A multivariate probability density
function concerns the outcome of an experiment with more than one random variable. This
includes, the simplest case, two variables (i.e., a bivariate distribution).
Density Cumulative
Univariate f(x)= P(X = x) F(x) = P(X ≤ x)
Bivariate f(x)= P(X = x, Y =y) f(x) = P(X ≤ x, Y ≤ y)

Calculate the probability of an event given a discrete probability
function.
Probability: Classical or “a priori” definition
The probability of outcome (A) is given by:
Number of outcomes favorable to A

P ( A) 
Total number of outcomes
For example, consider a craps roll of two six-sided dice. What is the probability of rolling a
seven; i.e., P[X=7]? There are six outcomes that generate a roll of seven: 1+6, 2+5, 3+4, 4+3, 5+2,
and 6+1. Further, there are 36 total outcomes. Therefore, the probability is 6/36.
In this case, the outcomes need to be mutually exclusive, equally likely, and
“cumulatively exhaustive” (i.e., all possible outcomes included in total). A key property
of a probability is that the sum of the probabilities for all (discrete) outcomes is 1.0.
Probability: Relative frequency or empirical definition
Relative frequency is based on an actual number of historical observations (or Monte Carlo
simulations). For example, here is a simulation (produced in Excel) of one hundred (100) rolls of
a single six-sided die:
Empirical Distribution
Roll Freq. %
1 11 11%
2 17 17%
3 18 18%
4 21 21%
5 18 18%
6 15 15%
Total 100 100%

Note the difference between an a priori probability and an empirical probability:
 The a priori (classical) probability of rolling a three (3) is 1/6,
 But the empirical frequency, based on this sample, is 18%. If we generate another
sample, we will produce a different empirical frequency.
This relates also to sampling variation. The a priori probability is based on population
properties; in this case, the a priori probability of rolling any number is clearly 1/6th.
However, a sample of 100 trials will exhibit sampling variation: the number of threes (3s)
rolled above varies from the parametric probability of 1/6 th. We do not expect the
sample to produce 1/6th perfectly for each outcome.
Distinguish between independent and mutually exclusive events.
Mutually exclusive events
For a given random variable, the probability of any of two mutually exclusive events occurring is
just the sum of their individual probabilities. In statistics notation, we can write:
P  A  B   P  A  P B  if mutually exclusive
where A  B is the union of A and B; i.e., the probability of either A or B occurring. This equality
is true only for mutually exclusive events. This property of mutually exclusive events can be
extended to any number of events. The probability that any of (n) mutually exclusive events
occurs is the sum of the probabilities of (each of) those (n) events.
Independent events
X and Y are independent if the condition distribution of Y given X equals the marginal
distribution of Y. Since independence implies Pr (Y=y | X=x) = Pr(Y=y):
Pr( X  x,Y  y )
Pr(Y  y | X  x ) 
Pr( X  x )
The most useful test of statistical independence is given by:
Pr( X  x,Y  y )  Pr( X  x )P(Y  y )
X and Y are independent if their joint distribution is equal to the product of their
marginal distributions.

Statistical independence is when the value taken by one variable has
no effect on the value taken by the other variable. If the variables are
independent, their joint probability will equal the product of
their marginal probabilities. If they are not independent, they are
dependent.
For example, when rolling two dice, the second will be independent of the first.
This independence implies that the probability of rolling double-sixes is equal to the product of
P(rolling one six) and P(rolling one six). If two die are independent, then P (first roll = 6, second
roll = 6) = P(rolling a six) * P (rolling a six). And, indeed: 1/36 = (1/6)*(1/6)
Define joint probability, describe a probability matrix and calculate joint

probabilities using probability matrices.
It is convenient to summarize joint probabilities in a probability matrix (a.k.a., probability table).

In Miller’s example, we assume a company that issue both bonds and stock.
 The bonds can either be downgraded, be upgraded, or have no change in rating.
 The stock can either outperform the market or underperform the market.
Consider Miller’s table below:
 The joint probability of both the company's stock outperforming the market and the
bonds being upgraded is 15% (yellow cell).
 Similarly, the joint probability of the stock underperforming the market and the bonds
having no change in rating is 25% (orange cell).
We can also see the unconditional (a.k.a., marginal) probabilities, by adding across a row or
down a column. The probability of the bonds being upgraded, irrespective of the stock's
performance, is: 15% + 5% = 20%. Similarly, the probability of the equity outperforming the
market is: 15% + 30% + 5% = 50%. Importantly, all of the joint probabilities add to 100%.
Equity
Outperform Underperform
Upgrade 15% 5% 20%
Bonds No change 30% 25% 55%
Downgrade 5% 20% 25%
50% 50% 100%

In perhaps a more comprehensive illustration, Stock & Watson illustrate with two variables:
 The age of the computer (A), a Bernoulli such that the computer is old (0) or new (1)
 The number of times the computer crashes (M)
Joint probability function (Stock & Watson example)
The joint probability is the probability that the random variables (in this case, both random
variables) take on certain values simultaneously.
Pr( X  y,Y  y ) general form
Pr( A  0, M  0)  0.35 for example
0 1 2 3 4 Tot
0 0.35 0.065 0.05 0.025 0.01 0.50

Old
1 0.45 0.035 0.01 0.005 0.00 0.50

New
Tot 0.80 0.100 0.03 0.030 0.01 1.00
“The joint probability distribution of two discrete random variables, say X and Y, is the
probability that the random variables simultaneously take on certain values, say x and y.
The probabilities of all possible ( x, y) combinations sum to 1. The joint probability
distribution can be written as the function Pr(X = x, Y = y).” —S&W
Define and calculate a conditional probability, and distinguish between

conditional and unconditional probabilities.
Marginal (a.k.a., unconditional) probability functions
A marginal (or unconditional) probability is the simple case: it is the probability that does
not depend on a prior event or prior information. The marginal probability is also called the
unconditional probability.
In the following table, please note that ten joint outcomes are possible because the age variable
(A) has two outcomes and the “number of crashes” variable (M) has five outcomes. Each of the
ten outcomes is mutually exclusive and the sum of their probabilities is 1.0 or 100%. For
example, the probability that a new computer crashes once is 0.035 or 3.5%.

The marginal (unconditional) probability that a computer is new (A = 1) is the sum of joint
probabilities in the second row:
l
Pr(Y  y )   Pr  X  xi ,Y  y 
i 1
Pr( A  1)  0.5
0 1 2 3 4 Tot
0 0.35 0.065 0.05 0.025 0.01 0.50

Old
1 0.45 0.035 0.01 0.005 0.00 0.50

New
Tot 0.80 0.100 0.03 0.030 0.01 1.00
“The marginal probability distribution of a random variable Y is just another name for
its probability distribution. This term distinguishes the distribution of Y alone (marginal
distribution) from the joint distribution of Y and another random variable. The marginal
distribution of Y can be computed from the joint distribution of X and Y by adding up the
probabilities of all possible outcomes for which Y takes on a specified value”—S&W
Conditional probability functions
Conditional is the probability of an outcome given (conditional on) another outcome:
Pr( X  x,Y  y )
Pr(Y  y | X  x ) 
Pr( X  x )
Pr(M  0 | A  0)  0.35 0.50  0.70
0 1 2 3 4 Tot
0 0.35 0.065 0.05 0.025 0.01 0.50

Old
1 0.45 0.035 0.01 0.005 0.00 0.50

New
Tot 0.80 0.100 0.03 0.030 0.01 1.00
“The distribution of a random variable Y conditional on another random variable X taking

on a specific value is called the conditional distribution of Y given X. The conditional
probability that Y takes on the value y when X takes on the value x is written:
Pr(Y = y | X = x).” –S&W

Conditional probability = Joint Probability ÷ Marginal Probability
What is the probability of B occurring, given that A has already occurred?
P( A  B)
P (B | A )   P ( A)P (B | A)  P ( A  B )
P ( A)
Conditional and unconditional expectation
The conditional concept extends to expectations also:
 An unconditional expectation is the expected value of the variable without any

restrictions (or lacking any prior information)
A conditional expectation is an expected value for the variable conditional on prior

 information or some restriction (e.g., the value of a correlated variable). The conditional
expectation of Y, conditional on X = x, is given by E(Y | X= x)
And we can refer also to a conditional variance of Y, conditional on X=x, is given by variance
(Y | X = x)
The two-variable regression is a important conditional expectation. In this case, we say

the expected Y is conditional on X: E(Y | X i )  B1  B2 X i
Miller: “The concept of independence is closely related to the concept of conditional

probability. Rather than trying to determine the probability of the market being up and
having rain, we can ask, ‘What is the probability that the stock market is up given that it
is raining?’ We can write this as a conditional probability:
P[ market up | rain].
The vertical bar tells us the probability of the first argument is conditional on the
second. We read this as “The probability of ‘market up’ given ‘rain’ is equal to p.”
If the weather and the stock market are independent, then the probability of the market
being up on a rainy day is the same as the probability of the market being up on a sunny
day. If the weather somehow affects the stock market, however, then the conditional
probabilities might not be equal. We could have a situation where:
P[market up | rain] ≠ [market up | no rain]
In this case, the weather and the stock market are no longer independent. We can no
longer multiply their probabilities together to get their joint probability.”

For Example: Two Stocks (S) and (T)
For example, consider two stocks. Assume that both Stock (S) and Stock (T) can each only reach
three price levels. Stock (S) can achieve: $10, $15, or $20. Stock (T) can achieve: $15, $20, or $30.
Historically, assume we witnessed 26 outcomes and they were distributed as follows.
Note S = S$10/15/20 and T = T$15/20/30 :
S= $10 S= $15 S=$20 Total

T=$15 0 2 2 4
T=$20 3 4 3 10
T=$30 3 6 3 12
Total 6 12 8 26
What is the joint probability?
A joint probability is the probability that both random variables will have a certain outcome.
Here the joint probability P(S=$20, T=$30) = 3/26.
What is the marginal (unconditional) probability
The unconditional probability of the outcome where S=$20 = 8/26 because there are eight
events out of 26 total events that produce S=$20. The unconditional probability P(S=20) = 8/26
What is the conditional probability
Instead we can ask a conditional probability question: “What is the probability that S=$20 given
that T=$20?” The probability that S=$20 conditional on the knowledge that T=$20 is 3/10
because among the 10 events that produce T=$20, three are S=$20.
P (S  $20,T  $20) 3
P (S  $20 T  $20)  
P (T  $20) 10
In summary:
 The unconditional probability P(S=20) = 8/26

 The conditional probability P(S=20 | T=20) = 3/10
 The joint probability P(S=20,T=30) = 3/26

Describe Bayes’ Theorem and apply this theorem in the calculation of
conditional probabilities.
Bayes Theorem shows how a conditional probability of the form P(B|A) may be combined with
the initial probability P(A) to obtain the final probability P(A|B):
P B | A   P  A 
P  A | B 
P B 
P B | A   P  A 

P B | A  P  A  P B | A  P  A
Bayes’ Theorem: Example #1, Stock moves conditional on economic state
In the following example, assume a simplified model of reality: a Bernoulli trial in which the
economy can either grow (G) or slow (S). Additionally, our stock can either go up (U) or (D).
Therefore, there are four possible future states:
 Economy grows, stock goes up (G then U)

 Economy slows, stock goes up (S then U)
 Economy grows, stock goes down (G then D)
 Economy slows, stock goes down (S then D)
Bayes’ Theorem solves for a conditional probability. In this case, we can ask the following
question: what is the probability that the economy grew given the prior information (conditional
on the evidence given) that the stock went up:
P(U | G)P(G)
P(G |U ) 
P(U )
Because P(U) is itself the sum of two possible outcomes, the denominator can be expanded and
the we get the elaborate version of the Bayes’ formula:
P(U | G)P(G)
P(G |U ) 
P(U | G)P(G)  P(U | G)P(G)

Now let’s assume specific probabilities:
 Unconditional probability the economy will grow: P(G) = 70%

 Unconditional probability the economy will slow: P(S) = 1 – 70% = 30%
 Conditional probability that stock goes up if economy grows: P(U|G) = 80%
 Conditional probability that stock goes down if economy grows: P(D|G) = 1- 80% = 20%
 Conditional probability that stock goes up if economy slows: P(U|S) = 30%
 Conditional probability that stock goes down if economy slows: P(D|S) = 1- 30% = 70%
 Joint probability that economy grows & stock goes up = P(GU) = P(U|G)P(G) = 56%
 Joint probability that economy grows & stock goes down = P(GD) = P(D|G)P(G) = 14%
 Note something about this two-step binomial tree: the four terminal nodes are
mutually exclusive and cumulatively exhaustive.
Specifically, P(GU) + P(GD) + P(SU) + P(SD) = 56% + 14% + 9% + 21% = 100%.
Up (U)
Econ Grow
(G)
Down (D)
a priori
Up (U)
Econ Slows
(S)
Down (D)
Without applying Bayes the unconditional (marginal) probability the economy will grow is
given by: P(G) = 70%; this is the probability without any prior information.
Apply Bayes Theorem: Instead, assume that we are given additional information. Specifically,
we are told “the stock went up.” Now, what is the probability the economy grew given
(conditional on) the stock went up?
(80%)(70%) 56%
P(G |U )    86%
(80%)(70%)  (30%)(30%) 65%
In summary:
 The unconditional (a.k.a., marginal) probability that the economy will grow, P(G), is 70%
 The posterior probability, conditional on the knowledge that the stock went up, that the
economy grew, P(G|U) = 86%

Bayes’ Theorem: Example #2, GARP’s 2011 Practice Exam, Part I, Question #5
Question: John is forecasting a stock’s performance in 2010 conditional on the state of the
economy of the country in which the firm is based. He divides the economy’s performance into
three categories of “GOOD”, “NEUTRAL” and “POOR” and the stock’s performance into three
categories of “increase”, “constant” and “decrease”.
He estimates:
 The probability that the state of the economy is GOOD is 20%. If the state of the economy
is GOOD, the probability that the stock price increases is 80% and the probability that
the stock price decreases is 10%.
 The probability that the state of the economy is NEUTRAL is 30%. If the state of the
economy is NEUTRAL, the probability that the stock price increases is 50% and the
probability that the stock price decreases is 30%.
 If the state of the economy is POOR, the probability that the stock price increases is 15%
and the probability that the stock price is 70%.
Billy, his supervisor, asks him to estimate the probability that the state of the economy is
NEUTRAL given that the stock performance is constant. John’s best assessment of that
probability is closest to what?
Given Answer: Use Bayes’ Theorem:

P(NEUTRAL | Constant) = P(Constant | Neutral) * P(Neutral) / P(Constant)
= 0.2 * 0.3 / (0.1 * 0.2 + 0.2 * 0.3 + 0.15 * 0.5) = 0.387
This may be easier to follow if we represent question’s assumptions as a matrix (joint

probabilities):

Bayes’ Theorem: Example #2, GARP’s 2011 Practice Exam, Part 2, Question #5
Question: John is forecasting a stock’s price in 2011 conditional on the progress of certain
legislation in the United States Congress. He divides the legislative outcomes into three
categories of “Passage”, “Stalled” and “Defeated” and the stock’s performance into three
categories of “increase”, “constant” and “decrease” and estimates the following events:
Passage Stalled Defeated
Prob of legislative outcome 20% 50% 30%
Prob of Increase in Stock 10% 40% 70%

Price given legislative
outcome
Prob of Decrease in Stock 60% 30% 10%

Price given legislative
outcome
Answer:

Miller, Chapter 3:
Basic Statistics
In this chapter…
 Define, calculate, and interpret the mean, standard deviation, and variance of a
random variable.
 Define, calculate, and interpret the covariance and correlation between two
random variables.
 Interpret and calculate the variance for a portfolio and understand the derivation
of the minimum variance hedge ratio.
 Calculate the mean and variance of sums of larger variables.
 Describe the four central moments of a statistical variable or distribution: mean,
variance, skewness and kurtosis.
 Interpret the skewness and kurtosis of a statistical distribution, and interpret the
concepts of coskewness and cokurtosis.
 Define and interpret the best linear unbiased estimator (BLUE).

Define, calculate, and interpret the mean, standard deviation, and
variance of a random variable.
If we can characterize a random variable (e.g., if we know all outcomes and that each outcome is
equally likely—as is the case when you roll a single die)—the expectation of the random
variable is often called the mean or arithmetic mean.
Expected value (mean)

Expected value exists when we have a parametric distribution (e.g., normal, binomial) or
probabilities. Expected value is the weighted average of possible values. In the case of a
discrete random variable, expected value is given by:
k
E (Y )  y1p1  y 2 p2   y k pk   y i pi
i 1
In the case of a continuous random variable, expected value is given by:
E( X )   xf ( X )dx
Population mean versus sample mean
If we have a complete data set, then the mean is a population mean which implies that the
mean is exactly the true (and only true) mean:
1 n 1
  ai   a1  a2   an 
n i 1 n
However, in practice, we typically do not have the population. Rather, more often we have only a
subset of the population or a dataset that cannot realistically be considered comprehensive; e.g.,
the most recent year of equity returns. A mean of such a dataset, which is much more likely in
practice, is called the sample mean. The sample mean, of course, uses the same formula:
1 n 1
ˆ   ri   r1  r2   rn 
n i 1 n
But the difference between a population parameter (e.g., population mean) and a sample
estimate (e.g., sample mean) is essential to statistics: each sample will produce a different
sample mean, which is likely to be near the “true” population mean but different depending on
the sample. We use the sample estimate to infer something about the unobserved population
parameter.

Variance
Variance (and standard deviation) is the second moment, the most common measures of
dispersion. The variance of a discrete random variable Y is given by:
k
Y2  variance(Y )  E Y  Y      y i  Y  pi
2 2
  i 1
Variance is also expressed as the difference between the expected value of X^2 and the square
of the expected value of X. This is the more useful variance formula:
Y2  E[(Y  Y )2 ]  E(Y 2 )  [E(Y )]2
Please memorize this variance formula above: it comes in handy! For example, if the
probability of loan default (PD) is a Bernouilli trial, what is the variance of PD?
E[PD^2] – (E[PD])^2, As E[PD^2] = p and E[PD] = p, E[PD^2] – (E[PD])^2 = p – p^2 = p*(1-p).
Example: Variance of a single six-sided die
For example, what is the variance of a single six-sided die? First, we need to solve for the
expected value of X-squared, E[X2]. This is given by:
 1  1  1  1  1  1 91
E [ X 2 ]    (12 )    (22 )    (32 )    (42 )    (52 )    (62 ) 
6 6 6 6 6 6 6
Then, we need to square the expected value of X, [E(X)]2. The expected value of a single six-sided
die is 3.5 (the average outcome). So, the variance of a single six-sided die is given by:
91
Variance( X )  E ( X 2 )  [E ( X )]2   (3.5)2  2.92
6
What is the variance of the total of two six-sided die cast together? It is simply the
Variance (X) plus the Variance (Y) or about 5.83. The reason we can simply add them
together is that they are independent random variables.

Sample Variance:
The unbiased estimate of the sample variance is given by:
1 k
sx2  
k  1 i 1
( y i  Y )2
The above sample variance is used by Hull, for example, to calculate historical variance (and
volatility) of asset returns. Specifically, Hull employs a sample variance (which divides by k-
1 or n-1) to compute historical volatility. Admittedly, because the variable is daily returns, he
subsequently makes two simplifying assumptions, including reversion to division by (n) or (k).
However, the point remains: when computing the volatility (standard deviation) of an historical
set of the returns, the square root of the above sample variance it typically appropriate: it gives
an unbiased estimate (of the variance, at least).
Properties of variance
 constant
2
0
 X2 Y   X2   Y2 only if independent
 X2 Y   X2   Y2 only if independent
 X2  b   X2
 aX
2
 a 2 X2
 aX
2
b  a X
2 2
 aX
2
 bY  a  X  b  Y
2 2 2 2
only if independent
 X2  E ( X 2 )  E ( X )2
Standard deviation:
Standard deviation is given by:
Y  var(Y )  E Y  Y     y i  Y 2 pi
2
 
As variance = standard deviation^2, standard deviation = Square Root[variance]

Sample Standard Deviation:
The unbiased estimate of the sample standard deviation is given by:
1 k
sX  
k  1 i 1
( y i  Y )2
This is merely the square root of the sample variance. This formula is important because
this is the technically a safe way to calculate sample volatility; i.e., when in doubt,
you are rarely mistaken to employ the (n-1) or (k-1)
Define, calculate, and interpret the covariance and correlation between

two random variables.
Covariance
Covariance is analogous to variance, but instead of looking at the deviation from the mean of one
variable, we are look at the relationship between the deviations of two variables. Put another
way, the covariance is the average cross-product. If the means of both variables, X and Y, are
known we can use the following formula for covariance, which might be called a population
covariance:
1 n
 XY    X i   X Yi  Y 
n i 1
If the true means are unknown, the we calculate the same means; however, in this case, we
acknowledge that the means are sample means and we are calculating a sample covariance. The
Sample covariance multiplies the sum of cross-products by 1/(n-1) rather than 1/n:
  X i  X Yi  Y 
1 n
s XY 
n  1 i 1
Correlation
Correlation is a key measure in the FRM and is typically denoted by Greek rho (ρ). Correlation
is the covariance between two variables divided by the product of their respective standard
deviations (a.k.a., volatilities):
 XY
 XY  where  XY  cov( X ,Y )  E [( X   X )(Y  Y )]
 XY

The correlation coefficient translates covariance into a unitless metric that runs from -1.0
to +1.0: We may also refer to the computationally similar sample correlation, which is sample
covariance divided by the product of sample standard deviations:
s XY
r XY 
S X SY
What is the covariance of a variable with itself; i.e., what is covariance(X,X)? It is the
variance(X). It will be helpful to keep in mind that a variable’s covariance with itself is
its variance; for example, knowing this, we realize that the diagonal in a covariance
matrix is populated with variances, because variance is a special case of covariance!
Covariance: For example
For a very simple example, consider three (X,Y) pairs: {(3,5), (2,4), (4,6)}:
X Y (X-X )(Y-Y )
avg avg
3 5 0.0
2 4 1.0
4 6 1.0
Avg = 3 Avg = 5 Avg =  = 0.67
XY
StdDev = SQRT(.67) StdDev = SQRT(.67) Correl. = 1.0
Please note:
 Average X = (3+2+4)/3 = 3.0. Average Y = (5+4+6)/3 = 5.0

 The first cross-product = (3 – 3)*(5 - 5) = 0.0
 The sum of cross-products = 0 + 2 + 1 = 2.0
 The population covariance = [sum of cross-products] / n = 2.0 / 3 = 0.67
 The sample covariance = [sum of cross-products] / (n- 1) = 2.0 / 2 = 1.0
Properties of covariance
1. If X &Y are independent,  XY  cov( X ,Y )  0
2. cov(a  bX , c  dY )  bd cov( X ,Y )
3. cov( X , X )  var( X ). In notation,  XX   X2
4. If X &Y are not independent,
 X2 Y   X2   Y2  2 XY
 X2 Y   X2   Y2  2 XY

On the next page we illustrate the application of the variance theorems and the correlation
coefficient.
Please walk through this example so you understand the calculations.
The example refers to two products, Coke (X) and Pepsi (Y).
We (somehow) can generate growth projections for both products. For both Coke (X) and Pepsi
(Y), we have three scenarios (bad, medium, and good). Probabilities are assigned to each
growth scenario.
In regard to Coke:
 Coke has a 20% chance of growing 3.0%,

 A 60% of growing 9.0%, and
 A 20% chance of growing 12.0%.
In regard to Pepsi,
 Pepsi has a 20% chance of growing 5.0%,

 A 60% chance of growing 7.0%, and
 A 20% of growing 9.0%
Finally, we know these outcomes are not independent. We want to calculate the correlation
coefficient.

20% 60% 20%
Coke (X) 3 9 12
Pepsi (Y) 5 7 9
pX 0.6 5.4 2.4

pY 1.0 4.2 1.8
E(X) 8.4 Sum of pXs above

E(Y) 7.0 Sum of pYs above
XY 15 63 108
pXY 3 37.8 21.6
E(XY) 62.4
E(XY)-E(X)E(Y) 3.6 Key formula: Covariance of X,Y
X2 9 81 144
Y2 25 49 81
pX2 1.8 48.6 28.8

pY2 5 29.4 16.2
E(X2) 79.2
E(Y2) 50.6
VAR(X) 8.64 E[X^2] – [E(X)]^2

VAR(Y) 1.60 E[Y^2] – [E(Y)]^2
STDEVP(X) 2.939
STDEVP(Y) 1.265
COV/(STD)(STD) 0.9682
The calculation of expected values is required: E(X), E(Y), E(XY), E(X 2) and E(Y2). Make sure you
can replicate the following two steps:
 The covariance is equal to E(XY) – E(X)E(Y): 3.6 = 62.4 – (8.4)(7.0)
 The correlation coefficient () is equal to the Cov(X,Y) divided by the product of the
standard deviations: XY = 97% = 3.6  (2.94  1.26)

Key properties of correlation:
 Correlation has the same sign (+/-) as covariance
 Correlation measures the linear relationship between two variables
 Between -1.0 and +1.0, inclusive
 The correlation is a unit-less metric
 Zero covariance → zero correlation (But the converse not necessarily true. For example,
Y=X^2 is nonlinear )
Correlation (or dependence) is not causation. For example, in a basket credit default
swap, the correlation (dependence) between the obligors is a key input. But we do not
assume there is mutual causation (e.g., that one default causes another). Rather, more
likely, different obligors are similarly sensitive to economic conditions. So, economic
deterioration may the the external cause that all obligors have in common.
Consqequently, their default exhibit dependence. But the causation is not internal.
Further, note that (linear) correlation is a special case of dependence. Dependence is

more general and includes non-linear relationships.
Interpret and calculate the variance for a portfolio and understand the
derivation of the minimum variance hedge ratio.
If we assume the variables X(A) and X(B) and (a) and (b) as constant, we can express Y as a
linear combination of the variables:
Y  aX A  bX B
In which case, the variance of this linear combination is the classic two-variable variance where
the third term includes correlation, rho(AB):
Y2  a2 A2  b2 B2  2ab AB A B

If we imagine that we hold $1 of Security A and we hedge it with ($h) of Security B (where
positive h implies a long position in Security B; and negative h implies a short position in
Security B), then we can re-express the same two-variable variance as a portfolio variance
consisting of $1 of Security A plus $h of Security B where the portfolio variance is:
P  X A  hX B
 P2  12 A2  h2 B2  2(1)h AB A B   A2  h2 B2  2h AB A B

The above implies various portfolio variances as a function of various hedge ratios. We can look
for the portfolio with the least (lowest) variance, as a function of some hedge ratio, by taking the
derivative of portfolio variance with respect to the hedge ratio (h) and setting this to zero (i.e.,
the local minimum or maximum):
d P2
 2h B2  2  AB A B
dh
A
h *    AB
B
This h(*) is the minimum variance hedge: the hedge ratio (i.e., the amount of Security B) that
returns the lowest portfolio variance.
Calculate the mean and variance of sums of larger variables.
Mean
E(a  bX  cY )  a  b X  c Y
Variance
In regard to the sum of correlated variables, the variance of correlated variables is given by the
following (note the two expressions; the second merely substitutes the covariance with the
product of correlation and volatilities. Please make sure you are comfortable with this
substitution).
 X2 Y   X2  Y2  2 XY , and given that  XY   XY

 X2 Y   X2  Y2  2 X Y
In regard to the difference between correlated variables, the variance of correlated variables is
given by:
 X2 Y   X2  Y2  2 XY and given that  XY   X Y

 X2 Y   X2  Y2  2 X Y

Variance with constants (a) and (b)
Variance of sum includes covariance (X,Y):

variance(aX  bY )  a2 X2  2ab XY  b2Y2
If X and Y are independent, the covariance term drops out and the variance simply adds:
variance( X  Y )   X2  Y2
Describe the four central moments of a statistical variable or distribution:

mean, variance, skewness and kurtosis.
Moments of a distribution
The k-th moment of X is defined as:
mk  E  X k 
We refer to m(k) as the k-the moment of X. But this is a raw moment and we are generally
more concerned with central moments; i.e., “moments about the mean” or sometimes
“moments around the mean.”
The k-th moment about the mean (), or k-th central moment, is given by:
 ( x   )k
n
k-th moment  i 1 i
, or equivalently
n
 k  E  X    
k
 
In this way, the difference of each data point from the mean is raised to a power (k=1, k=2, k=3,
and k=4). There are the four moments of the distribution:
 If k=1, this refers to the first moment about zero: the mean.
 If k=2, this refers to the second moment about the mean: the variance.
With respect to skewness and kurtosis, it is convention to standardize the moment, such that:
 If k=3, then the third moment divided by the cube of the standard deviation returns
the skewness
 If k=4, then the fourth moment divided by the square of the variance (standard
deviation^4) about the mean returns the kurtosis; a.k.a., tail density, peakedness.

Interpret the skewness and kurtosis of a statistical distribution, and
interpret the concepts of coskewness and cokurtosis.
Skewness (asymmetry)
Skewness refers to whether a distribution is symmetrical. An asymmetrical distribution is

skewed, either positively (to the right) or negatively (to the left) skewed. The measure of “relative
skewness” is given by the equation below, where zero indicates symmetry (no skewness):
E [( X   )3 ]
Skewness =  3 
3
Please note that skewness is not actually the (raw) third moment, or even the third moment
about the mean. Skewness is the standardized central third moment: the third moment
about the mean divided by the cube of the standard deviation.
For example, the gamma distribution has positive skew (skew > 0):
Gamma Distribution
Positive (Right) Skew
1.20
1.00 alpha=1,
0.80 beta=1
0.60
0.40 alpha=2,
0.20 beta=.5
-
alpha=4,
0.0
0.6
1.2
1.8
2.4
3.0
3.6
4.2
4.8
beta=.25
Skewness is a measure of asymmetry
If a distribution is symmetrical, mean = median = mode. If a distribution has positive

skew, the mean > median > mode. If a distribution has negative skew, the mean <
median < mode.

Kurtosis
Kurtosis measures the degree of “peakedness” of the distribution, and consequently of

“heaviness of the tails.” A value of three (3) indicates normal peakedness. The normal
distribution has kurtosis of 3, such that “excess kurtosis” equals (kurtosis – 3).
E [( X   )4 ]
Kurtosis =  4 
4
Please note that kurtosis is not actually the (raw) fourth moment, or even the fourth moment
about the mean. Kurtosis is the standardized central fourth moment: the fourth moment
about the mean divided by square of the variance.
A normal distribution has relative skewness of zero and kurtosis of three (or the same
idea put another way: excess kurtosis of zero). Relative skewness > 0 indicates positive
skewness (a longer right tail) and relative skewness < 0 indicates negative skewness (a
longer left tail). Kurtosis greater than three (>3), which is the same thing as saying
“excess kurtosis > 0,” indicates high peaks and fat tails (leptokurtic). Kurtosis less than
three (<3), which is the same thing as saying “excess kurtosis < 0,” indicates lower peaks.
Kurtosis is a measure of tail weight (heavy, normal, or light-tailed) and “peakedness”:

kurtosis > 3.0 (or excess kurtosis > 0) implies heavy-tails.
Financial asset returns are typically considered leptokurtic (i.e., heavy or fat- tailed)
For example, the logistic distribution exhibits leptokurtosis (heavy-tails; kurtosis > 3.0):
Logistic Distribution
Heavy-tails (excess kurtosis > 0)
0.50
0.40 alpha=0, beta=1
0.20
alpha=0, beta=3
0.10
- N(0,1)
1 5 9 13 17 21 25 29 33 37 41

Coskewness and Cokurtosis
Assume four series of fund returns where the returns are the same for each fund manager
(A),(B), (C), and (D) but only the order of returns is different:
Fund Returns
time A B C D
1 0.0% -3.8% -15.3% -15.3%
2 -3.8% -15.3% -7.2% -7.2%
3 -15.3% 3.8% 0.0% -3.8%
4 -7.2% -7.2% -3.8% 15.3%
5 3.8% 0.0% 3.8% 0.0%
6 7.2% 7.2% 7.2% 7.2%
7 15.3% 15.3% 15.3% 3.8%
Due to the ordering difference, the portfolio (A+B) is different than the portfolio (C+D):
Blended Portfolios
time A+B C+D
1 -1.9% -15.3%
2 -9.5% -7.2%
3 -5.8% -1.9%
4 -7.2% 5.8%
5 1.9% 1.9%
6 7.2% 7.2%
7 15.3% 9.5%
And scatterplots show the difference between (B versus A) and (D versus C):
20% 20%
15% 15%
10% 10%
5% 5%
B 0% D 0%
-5% -5%
-10% -10%
-15% -15%
-20% -20%
-20% -10% 0% 10% 20% -20% -10% 0% 10% 20%
A C
Note:
 The worst return for A+B is only -9.5% = (-3.8% - 15.3%)/2, but
 The worst return for C+D is -15.3% = (-15.3% - 15.3%)/2

For two random variables, there are two non-trivial coskewness statistics:
 AAB   A   A   B  B  
2
 
 ABB   A   A  B  B  
2
 
In general, for (n) random variables, the number of non-trivial cross-central moments of order
(m) is given by:
k
 m  n  1!  n
m !  n  1!
In the case of n=3, we have coskewness which is given by:
k3 
 n  2 n  1 n  n
6
Define and interpret the best linear unbiased estimator (BLUE).
An estimator is a function of a sample of data to be drawn randomly from a population.
An estimate is the numerical value of the estimator when it is actually computed using data from
a specific sample. An estimator is a random variable because of randomness in selecting the
sample, while an estimate is a nonrandom number.
The sample mean, Ӯ, is the best linear unbiased estimator (BLUE). In the Stock & Watson
example, the average (mean) wage among 200 people is $22.64:
Sample Mean $22.64

Sample Standard Deviation $18.14
Sample size (n) 200
Standard Error 1.28
H0: Population Mean = $20.00
Test t statistic 2.06
p value 4.09%
Please note:
 The average wage of (n = ) 200 observations is $22.64

 The standard deviation of this sample is $18.14
 The standard error of the sample mean is $1.28 because $18.14/SQRT(200) = $1.28
 The degrees of freedom (d.f.) in this case are 199 = 200 – 1

“An estimator is a recipe for obtaining an estimate of a population parameter. A simple
analogy explains the core idea: An estimator is like a recipe in a cook book; an estimate
is like a cake baked according to the recipe.” Barreto & Howland, Introductory
Econometrics
In the above example, the sample mean is an estimator of the unknown, true population mean
(in this case, the same mean estimator gives an estimate of $22.64).
What makes one estimator superior to another?
 Unbiased: the mean of the sampling distribution is the population mean (mu)
 Consistent. When the sample size is large, the uncertainty about the value of arising
from random variations in the sample is very small.
 Variance and efficiency. Among all unbiased estimators, the estimator has the smallest
variance is “efficient.”
If the sample is random (i.i.d.), the sample mean is the Best Linear Unbiased Estimator
(BLUE). The sample mean is:
 Consistent, AND
 The most EFFICIENT among all linear UNBIASED estimators of the population mean

Miller, Chapter 4:
Distributions
In this chapter…
 Define and distinguish between parametric and nonparametric distributions.
 Describe the key properties of the uniform distribution, Bernoulli distribution,
Binomial distribution, Poisson distribution, normal distribution and lognormal
distribution, and identify common occurrences of each distribution.
 Describe and apply the Central Limit Theorem and explain the conditions for the
theorem.
 Describe the properties of independent and identically distributed (i.i.d.) variables.
 Describe the properties of linear combinations of normally distributed variables.
 Identify the key properties and parameters of the Chi-squared, Student’s t, and F-
distributions.
 Describe a mixture distribution and explain the creation and characteristics of
mixture distributions.
Define and distinguish between parametric and nonparametric

distributions.
A parametric distribution can be described by a mathematical function. For example, the normal
distribution is perhaps the most well-known parametric distribution. The normal distribution is
a parametric distribution that requires only two parameters, mean and variance:
( x   )2

1 2 2
f (x)  e
 2
A nonparametric distribution cannot be summarized by a mathematical formula; in its simplest
form it is “just a collection of data.”

Describe the key properties of the uniform distribution, Bernoulli
distribution, Binomial distribution, Poisson distribution, normal
distribution and lognormal distribution, and identify common occurrences
of each distribution.
Uniform distribution
If the random variable, X, is continuous, the uniform distribution is given by the following
probability density function (pdf):
 1
 for a  x  b
f (x)   b  a
0 for x  a or x  b
If the random variable, X, is discrete, the uniform distribution is given by the following
probability density function (pdf; although, in the case of a discrete variable, we can also refer to
this as a probability mass function, pmf)
1
f (x) 
n
This is an extremely simple distribution. Common examples of discrete uniform distributions
are:
 A coin, where n=2, such that the Prob[heads] = ½ and Prob[tails] = ½; or
 A six-sided die, where for example, Prob[rolling a one] = 1/6
The uniform distribution is characterized by the following cumulative distribution function

(CDF):
a  b1
P  X  a 
b2  b1

Bernoulli
A random variable X is called Bernoulli distributed with parameter (p) if it has only two
possible outcomes, often encoded as 1 (“success” or “survival”) or 0 (“failure” or “default”), and
if the probability for realizing “1” equals p and the probability for “0” equals 1 – p. The classic
example for a Bernoulli-distributed random variable is the default event of a company.
A Bernoulli variable is discrete and has two possible outcomes:
1 if C defaults in I
X 
0 else
Binomial
A binomial distributed random variable is the sum of (n) independent and identically distributed
(i.i.d.) Bernoulli-distributed random variables. The probability of observing (k) successes is
given by:
n n n!
P (Y  k )    pk (1  p )n k ,   
k   k  (n  k )! k !
Poisson
The Poisson distribution depends upon only one parameter, lambda λ, and can be interpreted as
an approximation to the binomial distribution. A Poisson-distributed random variable is usually
used to describe the random number of events occurring over a certain time interval. The
lambda parameter (λ) indicates the rate of occurrence of the random events; i.e., it tells us
how many events occur on average per unit of time.
In the Poisson distribution, the random number of events that occur during an interval of time,
(e.g., losses/ year, failures/ day) is given by:
k
P (N  k )  e 
k!
Normal Binomial Poisson

Mean    np  
Variance 2  2  npq 2  
Standard Dev.    npq  
In Poisson, lambda is both the expected value (the mean) and the variance!

Normal
Characteristics of the normal distribution include:
 The middle of the distribution, mu (µ), is the mean (and median). This first moment is
also called the “location”
 Standard deviation and variance are measures of dispersion (a.k.a., shape). Variance is
the second-moment; typically, variance is denoted by sigma-squared such that standard
deviation is sigma.
 The distribution is symmetric around µ. In other words, the normal has skew = 0
 The normal has kurtosis = 3 or “excess kurtosis” = 0
Properties of normal distribution:
 Location-scale invariance: Imagine random variable X, which is normally distributed

with the parameters µ and σ. Now consider random variable Y, which is a linear function
of X: Y = aX + b. In general, the distribution of Y might substantially differ from the
distribution of X, but in the case where X is normally distributed, the random variable Y
is again normally distributed with parameters [mean = a*mu + b] and [variance =
a^2*sigma]. Specifically, we do not leave the class of normal distributions if we
multiply the random variable by a factor or shift the random variable.
 Summation stability: If you take the sum of several independent random variables,
which are all normally distributed with mean (µi) and standard deviation (σi), then the
sum will be normally distributed again.
 The normal distribution possesses a domain of attraction. The central limit theorem
(CLT) states that—under certain technical conditions—the distribution of a large sum of
random variables behaves necessarily like a normal distribution.
The normal distribution is not the only class of probability distributions having a domain
of attraction. Actually three classes of distributions have this property: they are called
stable distributions.
0.5
0.3 1 2 2 2
f (x)  e ( x   )
 2
0.1
(4.0)
(3.0)
(2.0)
(1.0)
-0.1
0.0
1.0
2.0
3.0
4.0

The normal distribution is commonplace for at least three (or four) reasons:
 The central limit theorem (CLT) says that sampling distribution of sample means tends
to be normal (i.e., converges toward a normally shaped distributed) regardless of the
shape of the underlying distribution; this explains much of the “popularity” of the normal
distribution.
 The normal is economical (elegant) because it only requires two parameters (mean
and variance). The standard normal is even more economical: it requires no
parameters.
 The normal is tractable: it is easy to manipulate (especially in regard to closed-form

equations like the Black-Scholes)
 Parsimony: It only requires (is fully described by) two parameters: mean and variance
It is common to retrieve an historical dataset such as a series of monthly returns and

compute the mean and standard deviation of the series. In some cases, the analyst will
stop at that point, having determined the first and second moments of the data.
Often times, the user is implicitly “imposing normality” by assuming the data is normally
distributed. For example, the user might multiply the standard deviation of the dataset
by 1.645 or 2.33 (i.e., normal distribution deviates) in order to estimate a value-at-risk.
But notice what happens in this case: without a test (or QQ-plot, for example) the analyst
is merely assuming normality because the normal distribution is conveniently
summarized by only the first two moments! Many other non-normal distributions have
first (aka, location) and second moments (aka, scale or shape).
In this way, it is not uncommon to see the normal distribution used merely for the sake of
convenience: when we only have the first two distributional moments, the normal is
implied perhaps merely because they are the only moments that have been computed.

Standard normal distribution
A normal distribution is fully specified by two parameters, mean and variance (or standard
deviation). We can transform a normal into a unit or standardized variable:
 Standard normal has mean = 0,and variance = 1

 No parameters required!
This unit or standardized variable is normally distributed with zero mean and variance of
one (1.0). Its standard deviation is also one (variance = 1.0 and standard deviation = 1.0). This is
written as: Variable Z is approximately (“asymptotically”) normally distributed: Z ~ N(0,1)
Standard normal distribution: Critical Z values:
Key locations on the normal distribution are noted below. In the FRM curriculum, the choice of
one-tailed 5% significance and 1% significance (i.e., 95% and 99% confidence) is common, so
please pay particular attention to the yellow highlights:
Critical Two-sided One-sided

z values Confidence Significance
1.00 ~ 68% ~ 15.87%
1.645
~ 90% ~ 5.0 %
(~1.65)
1.96 ~ 95% ~ 2.5%
2.327
~ 98% ~ 1.0 %
(~2.33)
2.58 ~ 99% ~ 0.5%
Memorize two common critical values: 1.65 and 2.33. These correspond to confidence
levels, respectively, of 95% and 99% for a one-tailed test. For VAR, the one-tailed test is
relevant because we are concerned only about losses (left-tail) not gains (right-tail).
Multivariate normal distributions
Normal can be generalized to a joint distribution

of normal; e.g., bivariate normal distribution.
Properties include:
1. If X and Y are bivariate normal, then aX + bY is normal;

any linear combination is normal
2. If a set of variables has a multivariate normal distribution,

the marginal distribution of each is normal
3. If variables with a multivariate normal distribution have covariances that equal zero,
then the variables are independent

Common examples of Bernoulli, binomial, normal and Poisson
In the FRM, these four distributions are quite common:
 The Bernoulli is used to characterize default: an obligor or bond will either default or
survive. Most bonds “survive” each year, until perhaps one year they default. At any
given point in time, or (for example) during any given year, the bond will be in one of
two states. The Bernoulli is invoked when there are only two outcomes.
 The binomial is a series of independent and identically distributed (i.i.d.) Bernoulli

variables, such that the binomial is commonly used characterize a portfolio of credits.
 The normal distribution is common:
o Typically the central limit theorem (CLT) will justify the significance test of the
sample average in a large sample. For example, to test the sample average asset
return or excess return.
o In many cases, due to convenience, the normal distribution is employed to model

equity returns for short horizons; typically this is an assumption made with the
understanding that it may not be realistic.
 The Poisson distribution has two very common purposes:
o Poisson is often used, as a generic stochastic process, to model the time of

default in some credit risk models
o As a discrete distribution, the Poisson is arguably the most common distribution

employed for operational loss frequency (but not for loss severity, which wants a
continuous distribution)
Bernoulli Binomial Normal Poisson
• Default • Basket of • Significance • Operational

(0/1) Credits; test of large loss
• Basket sample frequency
credit average
default (CLT)
swap (CDS) • Short
horizon
equity
returns

Lognormal
The lognormal is common in finance: If an asset return (r) is normally distributed, the
continuously compounded future asset price level (or ratio or prices; i.e., the wealth ratio) is
lognormal. Expressed in reverse, if a variable is lognormal, its natural log is normal.
Lognormal
Non-zero, positive skew, heavy right tail
1.00%
0.80%
0.60%
0.40%
0.20%
0.00%
Additional distributions: not in 2013 AIMs but occasionally relevant
The following distributions are not explicitly assigned in this section (Miller), but have
historically been relevant to the FRM, to various degrees
Exponential
The exponential distribution is popular in queuing theory. It is used to model the time we have
to wait until a certain event takes place. According to the text, examples include “the time
until the next client enters the store, the time until a certain company defaults or the time until
some machine has a defect.” The exponential function is non-zero
f ( x )  e   x ,   1 , x 0

Exponential
3.00
2.00 0.5
1.00
1
0.00
2
0.6
1.2
1.8
2.4
3.0
3.6
4.2
4.8
-

Weibull
Weibull is a generalized exponential distribution; i.e., the exponential is a special case of the
Weibull where the alpha parameter equals 1.0.

x
 
 
F(x)  1 e ,x  0
Weibull distribution
2.00
1.50 alpha=.5,
beta=1
1.00
alpha=2, beta=1
0.50
- alpha=2, beta=2
2.0
0.0
0.5
1.0
1.5
2.5
3.0
3.5
4.0
4.5
5.0
The main difference between the exponential distribution and the Weibull is that, under the
Weibull, the default intensity depends upon the point in time t under consideration. This allows
us to model the aging effect or teething troubles:
For α > 1—also called the “light-tailed” case—the default intensity is monotonically increasing
with increasing time, which is useful for modeling the “aging effect” as it happens for machines:
The default intensity of a 20-year old machine is higher than the one of a 2-year old machine.
For α < 1—the “heavy-tailed” case—the default intensity decreases with increasing time. That
means we have the effect of “teething troubles,” a figurative explanation for the effect that after
some trouble at the beginning things work well, as it is known from new cars. The credit spread
on noninvestment-grade corporate bonds provides a good example: Credit spreads usually
decline with maturity. The credit spread reflects the default intensity and, thus, we have the
effect of “teething troubles.” If the company survives the next two years, it will survive for a
longer time as well, which explains the decreasing credit spread.
For α = 1, the Weibull distribution reduces to an exponential distribution with parameter β.

Gamma distribution
The family of Gamma distributions forms a two parameter probability distribution family with
the density function (pdf) given by:
 1
1 x
f (x)   e x    ,x  0
  ( )  
Gamma distribution
1.20
1.00
0.60 alpha=2, beta=.5
0.40 alpha=4, beta=.25
0.20
-
The Gamma distribution is related to:
 For alpha = 1, Gamma distribution becomes exponential distribution
 For alpha = k/2 and beta = 2, Gamma distribution becomes Chi-square distribution
Beta distribution
The beta distribution has two parameters: alpha (“center”) and beta (“shape”). The beta
distribution is very flexible, and popular for modeling recovery rates.
Beta distribution
(popular for recovery rates)
0.06
0.05 alpha = 0.6, beta = 0.6
0.04
0.03 alpha = 1, beta = 5
0.02
0.01 alpha = 2, beta = 4
0.00
alpha = 2, beta = 1.5
0.07
0.14
0.21
0.28
0.35
0.42
0.49
0.56
0.63
0.70
0.77
0.84
0.91
0.98
-

Example of Beta Distribution
The beta distribution is often used to model recovery rates. Here are two examples: one beta
distribution to model a junior class of debt (i.e., lower mean recovery) and another for a senior
class of debt (i.e., lower loss given default):
Junior Senior
alpha (center) 2.0 4.0
beta (shape) 6.0 3.3
Mean recovery 25% 55%
0.03
0.02
0.01 Senior
Junior
0.00
12%
18%
24%
30%
36%
42%
48%
54%
60%
66%
72%
78%
84%
90%
96%
0%
6%
Recovery (Residual Value)
Logistic
A logistic distribution has heavy tails:
Logistic distribution
0.45
0.40
0.35
0.25
alpha=2, beta=1
0.20
alpha=0, beta=3
0.15
0.10 N(0,1)
0.05
-
1 4 7 10 13 16 19 22 25 28 31 34 37 40

Extreme Value Theory
Measures of central tendency and dispersion (variance, volatility) are impacted more by
observations near the mean than outliers. The problem is that, typically, we are concerned with
outliers; we want to size the liklihood and magnitude of low frequency, high severity (LFHS)
events. Extreme value theory (EVT) solves this problem by fitting a separate distribution to
the extreme loss tail. EVT uses only the tail of the distribution, not the entire dataset.
In applying extreme value theory (EVT), two general approaches are:
 Block maxima (BM). The classic approach
 Peaks over threshold (POT). The modern approach that is often preferred.
Block maxima
The dataset is parsed into (m) identical, consecutive and non-overlapping periods called blocks.
The length of the block should be greater than the periodicity; e.g., if the returns are daily, blocks
should be weekly or more. Block maxima partitions the set into time-based intervals. It requires
that observations be identically and independently (i.i.d.) distributed.
Generalized extreme value (GEV) fits block maxima

The Generalized extreme value (GEV) distribution is given by:
   
1

exp  (1   y )    0
H ( y )    
 
 y
exp( e )  0
The  (xi) parameter is the “tail index;” it represents the fatness of the tails. In this expression, a
lower tail index corresponds to fatter tails.
Generalized Extreme Value (GEV)

0.15
0.10
0.05
0.00
0
15
10
20
25
30
35
40
45
Per the (unassigned) Jorion reading on EVT, the key thing to know here is that (1) among
the three classes of GEV distributions (Gumbel, Frechet, and Weibull), we only care
about the Frechet because it fits to fat-tailed distributions, and (2) the shape parameter
determines the fatness of the tails (higher shape → fatter tails)
Peaks over threshold (POT)
Peaks over threshold (POT) collects the dataset of losses above (or in excess of) some threshold.

The cumulative distribution function here refers to the probability that the “excess loss” (i.e., the
loss, X, in excess of the threshold, u, is less than some value, y, conditional on the loss exceeding
the threshold):
FU ( y )  P( X  u  y | X  u )
u X
-4 -3 -2 -1 0 1 2 3 4
Peaks over threshold (POTS):
 1
 x 
1  (1  )  0
 
G , ( x )  
1  exp(  x )  0
 
Generalized Pareto Distribution

(GPD)
1.50
1.00
0.50
-
0 1 2 3 4
Block maxima is: time-based (i.e., blocks of time), traditional, less sophisticated, more
restrictive in its assumptions (i.i.d.)
Peaks over threshold (POT) is: more modern, has at least three variations (semi-
parametric; unconditional parametric; and conditional parametric), is more flexible

EVT Highlights:
Both GEV and GPD are parametric distributions used to model heavy-tails.
GEV (Block Maxima)
 Has three parameters: location, scale and tail index
 If tail > 0: Frechet
GPD (peaks over threshold, POT)
 Has two parameters: scale and tail (or shape)
 But must select threshold (u)

Describe and apply the Central Limit Theorem and explain the conditions
for the theorem.
In brief:
 Law of large numbers: under general conditions, the sample mean (Ӯ) will be near the
population mean.
 Central limit theorem (CLT): As the sample size increases, regardless of the underlying
distribution, the sampling distributions approximates (tends toward) normal
Central limit theorem (CLT)
We assume a population with a known mean and finite variance, but not necessarily a normal
distribution (we may not know the distribution!). Random samples of size (n) are then
drawn from the population. The expected value of each random variable is the population’s
mean. Further, the variance of each random variable is equal the population’s variance divided
by n (note: this is equivalent to saying the standard deviation of each random variable is equal to
the population’s standard deviation divided by the square root of n).
The central limit theorem says that this random variable (i.e., of sample size n, drawn from the
population) is itself normally distributed, regardless of the shape of the underlying
population. Given a population described by any probability distribution having mean () and
finite variance (2), the distribution of the sample mean computed from samples (where each
sample equals size n) will be approximately normal. Generally, if the size of the sample is at least
30 (n  30), then we can assume the sample mean is approximately normal!
But sample mean (and sum)

Not Normal! → Normal Distribution!
(individually) (if finite variance)
Each sample has a sample mean. There are many sample means. The sample means have
variation: a sampling distribution. The central limit theorem (CLT) says the sampling
distribution of sample means is asymptotically normal.

Summary of central limit theorem (CLT):
 We assume a population with a known mean and finite variance, but not necessarily a
normal distribution.
 Random samples (size n) drawn from the population.
 The expected value of each random variable is the population mean
 The distribution of the sample mean computed from samples (where each sample equals
size n) will be approximately (asymptotically) normal.
 The variance of each random variable is equal to population variance divided by n
(equivalently, the standard deviation is equal to the population standard deviation
divided by the square root of n).
Sample Statistics and Sampling Distributions

When we draw from (or take) a sample, the sample is a random variable with its own
characteristics. The “standard deviation of a sampling distribution” is called the
standard error. The mean of the sample or the sample mean is a random variable defined by:
X1  X 2  Xn
X
n
Describe the properties of independent and identically distributed (i.i.d.)

variables.
A random sample is a sample of

random variables that are
independent and identically
distributed (i.i.d.)
Independent and identically Independent Identical

distributed (i.i.d.) variables: Not (auto) correlated Same Mean,
Same Variance
 Each random variable has the
same (identical) probability Homo-skedastic
distribution (PDF/PMF, CDF) distribution

 Each random variable is drawn independently of the others; no serial- or auto-
correlation
The concept of independent and identically distributed (i.i.d.) variables is a key

assumption we often encounter: to scale volatility by the square root of time requires
i.i.d. returns. If returns are not i.i.d., then scaling volatlity by the square root of time
will give an incorrect answer.

Describe the properties of linear combinations of normally distributed
variables.
As previously noted, a property of the normal distribution is summation stability: If you take
the sum of several independent random variables, which are all normally distributed with mean
(µi) and standard deviation (σi), then the sum will be normally distributed again.
According to Rachev, “Why is the summation stability property important for financial
applications? Imagine that the daily returns of the S&P 500 are independently normally
distributed with µ= 0.05% and σ= 1.6%. Then the monthly returns again are normally
distributed with parameters µ= 1.05% and σ = 7.33% (assuming 21 trading days per month) and
the yearly return is normally distributed with parameters µ = 12.6% and σ = 25.40% (assuming
252 trading days per year). This means that the S&P 500 monthly return fluctuates randomly
around 1.05% and the yearly return around 12.6%.
The last important property that is often misinterpreted to justify the nearly exclusive use of
normal distributions in financial modeling is the fact that the normal distribution possesses a
domain of attraction. A mathematical result called the central limit theorem states that—under
certain technical conditions—the distribution of a large sum of random variables behaves
necessarily like a normal distribution. In the eyes of many, the normal distribution is the unique
class of probability distributions having this property. This is wrong and actually it is the class of
stable distributions (containing the normal distributions), which is unique in the sense that a
large sum of random variables can only converge to a stable distribution.”

Identify the key properties and parameters of the Chi-squared, Student’s
t, and F-distributions.
Chi-squared distribution
Chi-square distribution
40%
30% k=2
20%
k=5
10%
0%
0 10 20 30
For the chi-square distribution, we observe a sample variance and compare to hypothetical
population variance. This variable has a chi-square distribution with (n-1) d.f.:
 s2 
 2  (n  1) ~ ( n 1)
2
 
Chi-squared distribution is the sum of m squared independent standard normal random

variables. Properties of the chi-squared distribution include:
 Nonnegative (>0)
 Skewed right, but as d.f. increases it approaches normal
 Expected value (mean) = k, where k = degrees of freedom
 Variance = 2k, where k = degrees of freedom
 The sum of two independent chi-square variables is also a chi-squared variable
Chi-squared distribution: For example (Google’s stock return variance)
Google’s sample variance over 30 days is 0.0263%. We can test the hypothesis that the
population variance (Google’s “true” variance) is 0.02%. The chi-square variable = 38.14:
Sample variance (30 days) 0.0263%

Degrees of freedom (d.f.) 29
Population variance? 0.0200%
Chi-square variable 38.14 = 0.0263%/0.02%*29
=CHIDIST() = p value 11.93% @ 29 d.f., Pr[.1] = 39.0875
Area under curve (1- ) 88.07%
With 29 degrees of freedom (d.f.), 38.14 corresponds to roughly 10% (i.e., to left of 0.10 on the
lookup table). Therefore, we can reject the null with only 88% confidence; i.e., we are likely to
accept the probability that the true variance is 0.02%.

Student t’s distribution
t distribution vs. Normal

0.04
0.03
2
0.02
20
0.01 Normal
0.00
0.4
0.8
1.2
1.6
2.4
2.8
3.2
3.6
0
2
The student’s t distribution (t distribution) is among the most commonly used distributions. As
the degrees of freedom (d.f.) increases, the t-distribution converges with the normal
distribution. It is similar to the normal, except it exhibits slightly heavier tails (the lower the d.f..,
the heavier the tails). The student’s t variable is given by:
X  X
t
Sx n
Properties of the t-distribution:
 Like the normal, it is symmetrical
 Like the standard normal, it has mean of zero (mean = 0)
 Its variance = k/(k-2) where k = degrees of freedom. Note, as k increases, the variance
approaches 1.0. Therefore, as k increases, the t-distribution approximates the
standard normal distribution.
 Always slightly heavy-tail (kurtosis>3.0) but converges to normal. But the student’s t is
not considered a really heavy-tailed distribution
In practice, the student’s t is the mostly commonly used distribution. When we test the
significance of regression coefficients, the central limit thereom (CLT) justifies the
normal distribution (because the coefficients are effectively sample means). But we
rarely know the population variance, such that the student’s t is the appropriate
distribution.
When the d.f. is large (e.g., sample over ~30), as the student’s t approximates the
normal, we can use the normal as a proxy. In the assigned Stock & Watson, the sample
sizes are large (e.g., 420 students), so they tend to use the normal.

Student t’s distribution: For example
For example, Google’s average periodic return over a ten-day sample period was +0.02% with
sample standard deviation of 1.54%. Here are the statistics:
Sample Mean 0.02%

Sample Std Dev 1.54%
Days (n=10) 10
Confidence 95%
Significance (1-) 5%
Critical t 2.262
Lower limit -1.08%

Upper limit 1.12%
The sample mean is a random variable. If we know the population variance, we assume the
sample mean is normally distributed. But if we do not know the population variance (typically
the case!), the sample mean is a random variable following a student’s t distribution.
In the Google example above, we can use this to construct a confidence (random) interval:
s
X  t
n
We need the critical (lookup) t value. The critical t value is a function of:
 Degrees of freedom (d.f.); e.g., 10-1 =9 in this example, and
 Significance; e.g., 1-95% confidence = 5% in this example
The 95% confidence interval can be computed. The upper limit is given by:
1.54%
X  (2.262)  1.12%
10
And the lower limit is given by:
1.54%
X  (2.262)  1.08%
10
Please make sure you can take a sample standard deviation, compute the critical t value
and construct the confidence interval.

Both the normal (Z) and student’s t (t) distribution characterize the sampling distribution of
the sample mean. The difference is that te normal is used when we know the population
variance; the student’s t is used when we mus rely on the sample variance. In practice, we don’t
know the population variance, so the student’s t is typically appropriate.
Z
X    X
t
X    X
X SX
n n
F-Distribution
F distribution
10%
8%
6%
4% 19,19
2% 9,9
0%
0.1 0.4 0.7 1.0 1.3 1.6 1.9 2.2
The F distribution is also called the variance ratio distribution (it may be helpful to think of it as
the variance ratio!). The F ratio is the ratio of sample variances, with the greater sample variance
in the numerator:
s x2
F
sy2
Properties of F distribution:
 Nonnegative (>0)
 Skewed right
 Like the chi-square distribution, as d.f. increases, approaches normal
 The square of t-distributed r.v. with k d.f. has an F distribution with 1,k d.f.
 m * F(m,n)=χ2

F-Distribution: For example
For example, based on two 10-day samples, we calculated the sample variance of Google and
Yahoo. Google’s variance was 0.0237% and Yahoo’s was 0.0084%. The F ratio, therefore, is 2.82
(divide higher variance by lower variance; the F ratio must be greater than, or equal to, 1.0).
GOOG YHOO
=VAR() 0.0237% 0.0084%
=COUNT() 10 10
F ratio 2.82
Confidence 90%
Significance 10%
=FINV() 2.44
At 10% significance, with (10-1) and (10-1) degrees of freedom, the critical F value is 2.44.
Because our F ratio of 2.82 is greater than (>) 2.44, we reject the null (i.e., that the population
variances are the same). We conclude the population variances are different.
Describe a mixture distribution and explain the creation and

characteristics of mixture distributions.
A mixture distribution is a sum of other distribution functions but weighted by probabilities. The
density function of a mixture distribution is, then, the probability-weighted sum of the
component density function
n
f ( x )   w i pi ( x ) where p(.) are pdf
i 1
In Miller’s example, the mixture distribution is a
f ( x )  wLfL ( x )  wH fH ( x )
According to Miller, “Mixture distributions are extremely flexible. In a sense they occupy a realm
between parametric distributions and non-parametric distributions. In a typical mixture
distribution, the component distributions are parametric but the weights are based on empirical
(non-parametric) data. Just as there is a trade-off between parametric distributions and non-
parametric distributions, there is a trade-off between using a low number and a high number of
component distributions. By adding more and more component distributions, we can
approximate any data set with increasing precision. At the same time, as we add more and more
component distributions, the conclusions that we can draw become less and less general in
nature.”

Normal mixture distribution
A mixture distribution is extremely flexible.
If two normal distributions have the same mean, they combine (mix) to produce mixture
distribution with leptokurtosis (heavy-tails). Otherwise, mixtures are infinitely flexible.
0.45
Normal 1
0.40
0.35 Normal 2
0.30
0.25 Mixture
0.20
0.15
0.10
0.05
0.00
-10 -5 0 5 10

Miller, Chapter 5:
Hypothesis Testing and
Confidence Intervals
In this chapter…
 Define and estimate the sample mean and sample variance.
 Define, calculate and interpret the mean [sic] and variance of the sample mean.
 Define and construct a confidence interval.
 Define and interpret the null hypothesis and the alternative hypothesis, and
calculate the test statistics.
 Define, interpret, and calculate the t-statistic.
 Describe the process of selecting and constructing a null hypothesis.
 Differentiate between a one-tailed and a two-tailed test and explain the
circumstances in which to use each test.
 Interpret the results of hypothesis tests with a specific level of confidence.
 Describe and apply the principle of Chebyshev’s inequality.
Define and estimate the sample mean and sample variance.
The sample mean is denoted by Y and given by:
1 n n 1
ˆ   i  n xi
n i 1
x 
i 1
The sample variance is given by:
X 
2
n X
S X2  
i
i 1 n 1
The variance of the sample mean (more below) is given by:
Y2 Y
variance(Y )  Std Dev(Y )  Y 
n n

Define, calculate and interpret the mean and variance of the sample
mean.
Sampling distribution of the sample mean
If either: (i) the population is infinite and random sampling, or (ii) finite population and
sampling with replacement, the variance of the sampling distribution of means is:
Y2
E [(Y  Y ) ]   
2 2
Y
n
This says, “The variance of the sample mean is equal to the population variance divided by the
sample size.” For example, the (population) variance of a single six-sided die is 2.92. If we roll
three die (i.e., sampling “with replacement”), then the variance of the sampling distribution =
(2.92  3) = 0.97.
If the population is size (N), if the sample size n  N, and if sampling is conducted “without
replacement,” then the variance of the sampling distribution of means is given by:
Y2  N  n 
 
2
 
Y
n  N 1
Standard error is the standard deviation of the sample mean
The standard error is the standard deviation of the sampling distribution of the estimator,
and the sampling distribution of an estimator is a probability (frequency distribution) of the
estimator (i.e., a distribution of the set of values of the estimator obtained from all possible
same-size samples from a given population). For a sample mean (per the central limit theorem!),
the variance of the estimator is the population variance divided by sample size. The
standard error is the square root of this variance; the standard error is a standard deviation:
 Y2 Y
se  
n n

If the population is distributed with mean  and variance 2 but the distribution is not a
normal distribution, then the standardized variable given by Z below is “asymptotically
normal; i.e., as (n) approaches infinity () the distribution becomes normal.
Z
Y     Y    ~ N(0,1)
Y Y
se Y
n
The denominator is the standard error: which is simply the name for the standard
deviation of sampling distribution.
Define and construct a confidence interval.
The confidence interval uses the product of [standard error х critical “lookup” t]. In the Stock
& Watson example, the confidence interval is given by 22.64 +/- (1.28)(1.96) because 1.28 is the
standard error and 1.96 is the critical t (critical Z) value associated with 95% two-tailed
confidence:
Sample Mean $22.64

Sample Std Deviation $18.14
Sample size (n) 200
Standard Error 1.28
Confidence 95%
Critical t 1.972
Lower limit $20.11 95% CI for Y  Y  1.96SE Y 
Upper limit $25.17
22.64  1.28  1.972

Confidence Intervals: Another example with a sample of 28 P/E ratios
Assume we have price-to-earnings ratios (P/E ratios) of 28 NYSE companies:
Mean 23.25
Variance 90.13
Std Dev 9.49
Count 28
d.f. 27
Confidence (1-α) 95%
Significance (α) 5%
Critical t 2.052
Standard error 1.794
Lower limit 19.6 = 23.25 - (2.052)*(1.794)

Upper limit 26.9 = 23.25 + (2.052)*(1.794)
Hypothesis 18.5
t value 2.65 = (23.25 - 18.5) / (1.794)
p value 1.3%
Reject null with 98.7%
 The confidence coefficient is selected by the user; e.g., 95% (0.95) or 99% (0.99).
 The significance = 1 – confidence coefficient.
To construct a confidence interval with the dataset above:
 Determine degrees of freedom (d.f.). d.f. = sample size – 1. In this case, 28 – 1 = 27 d.f.
 Select confidence. In this case, confidence coefficient = 0.95 = 95%
 We are constructing an interval, so we need the critical t value for 5% significance with
two-tails.
 The critical t value is equal to 2.052. That’s the value with 27 d.f. and either 2.5% one-
tailed significance or 5% two-tailed significance (see how they are the same provided the
distribution is symmetrical?)
 The standard error is equal to the sample standard deviation divided by the square root
of the sample size (not d.f.!). In this case, 9.49/SQRT(28)  1.794.
 The lower limit of the confidence interval is given by: the sample mean minus the
critical t (2.052) multiplied by the standard error (9.49/SQRT[28]).
 The upper limit of the confidence interval is given by: the sample mean plus the
critical t (2.052) multiplied by the standard error (9.49/SQRT[28]).
Sx S
X  t    X  X  t  x
n n
9.49 9.49
23.25   2.052    X  23.25   2.052 
28 28

 This confidence interval is a random interval. Why? Because it will vary randomly with
each sample, whereas we assume the population mean is static.
We don’t say the probability is 95% that the “true” population mean lies within
this interval. That implies the true mean is variable. Instead, we say the
probability is 95% that the random interval contains the true mean. See how the
population mean is trusted to be static and the interval varies?
Define and interpret the null hypothesis and the alternative hypothesis,
and calculate the test statistics.
A two-tailed null hypothesis takes the form:
H0 :   c
H1 :   c
In many cases, in practice, the test is a significance test such that it is often assumed that both
(i) the null is two-tailed and further (ii) that the null hypothesis is that the estimate is equal to
zero. Symbolically, then, the following is a very common test:
H0 :   0
H1 :   0
As Miller says, “in many scientific fields where positive and negative deviations are equally
important, two-tailed confidence levels are the more prevalent. In risk management, more often
than not, we are more concerned with the probability of bad outcomes, and this concern
naturally leads to one-tailed tests.
A one-tailed test rejects the null only if the estimate is either significantly above or significantly
below, but only specifies one direction. For example, the following null hypothesis is not rejected
if the estimate is greater than the value (c); we are here concerned with deviations in one
direction only:
H0 :   c
H1 :   c

Define, interpret, and calculate the t-statistic.
The t-statistic or t-ratio is given by:
Y  Y ,0
t
SE (Y )
The critical t-value or “lookup” t-value is the t-value for which the test just rejects the null
hypothesis at a given significance level. For example:
 95% two-tailed (2T) critical t-value with 20 d.f. is 2.086
 Significance test: is t-statistic > critical (lookup) t?
The critical t-values bound a region within the student’s distribution that is a specific
percentage (90%? 95%? 99%?) of the total area under the student’s t distribution curve. The
student’s t distribution with (n-1) degrees of freedom (d.f.) has a confidence interval given by:
SY S
Y  t   Y  Y   t  Y
n n
For example: critical t
If the (small) sample size is 20, then the 95% two-tailed critical t is 2.093. That is because the
degrees of freedom are 19 (d.f. = n - 1) and if we review the lookup table on the following page
(corresponds to Gujarati A-2) under the column = 0.025/0.5 and row = 19, then we find the cell
value = 2.093. Therefore, given 19 d.f., 95% of the area under the student’s t distribution is
bounded by +/- 2.093. Specifically, P(-2.093 ≤ t ≤ 2.093) = 95%.
Please note, further because the distribution is symmetrical (skew=0), 5% among both tails
implies 2.5% in the left-tail.

Three questions that apply the t-statistic from the bionicturtle question database
209.1. Nine (9) companies among a random sample of 60 companies defaulted. The companies
were each in the same highly speculative credit rating category: statistically, they represent a
random sample from the population of CCC-rated companies. The rating agency contends that
the historical (population) default rate for this category is 10.0%, in contrast to the 15.0%
default rate observed in the sample. Is there statistical evidence, with any high confidence, that
the true default rate is different than 10.0%; i.e., if the null hypothesis is that the true default rate
is 10.0%, can we reject the null?
a) No, the t-statistic is 0.39

b) No, the t-statistic is 1.08
c) Yes, the t-statistic is 1.74
d) Yes, the t-statistic is 23.53
209.2. Over the last two years, a fund produced an average monthly return of +3.0% but with
monthly volatility of 10.0%. That is, assume the random sample size (n) is 24, with mean of 3.0%
and sigma of 10.0%. Are the returns statistically significant; in other words, can we decide the
true mean return is great than zero with 95% confidence?
a) No, the t-statistic is 0.85

b) No, the t-statistic is 1.47
c) Yes, the t-statistic is 2.55
d) Yes, the t-statistic is 3.83
209.3. Assume the frequency of internal fraud (an operational risk event type) occurrences per
year is characterized by a Poisson distribution. Among a sample of 43 companies, the mean
frequency is 11.0 with a sample standard deviation of 4.0. What is the 90% confidence interval
of the population's mean frequency?
a) 10.0 to 12.0
b) 8.8 to 13.2
c) 7.5 to 14.5
d) Need more information (Poisson parameter)

Answers:
209.1. B. No, the t-statistic is only 1.08. For a large sample, the distribution is normally
approximated, such that at 5.0% two-tailed significance, we reject if the abs(t-statistic)
exceeds 1.96.
The standard error = SQRT(15%*85%/60) = 0.046098; please note: if you used
SQRT(10%*90%/60) for the standard error, that is not wrong, but also would not change the
conclusion as the t-statistic would be 1.29
The t statistic = (15%-10%)/0.046098 = 1.08;
The two-sided p value is 27.8%, but as the t statistic is well below 2.0, we cannot confidently
reject.
We don't really need the lookup table or a calculator: the t-statistic tells us that the observed
sample mean is only 1.08 standard deviations (standard errors) away from the hypothesized
population mean.
A two-tailed 90% confidence interval implies 1.64 standard errors, so this (72.8%) is much less
confident than even 90%.
209.2. B. No, the t-statistic is 1.47

The standard error = 10%/SQRT(24) = 0.020412
The t statistic = (3.0% - 0%)/0.020412 = 1.47.
The one-tailed critical t, at 95% with 23 df, is 1.71; two-tailed is 2.07.
(even if we assume normal one-sided, the 95% critical Z is 1.645, of course.)
209.3. A. 10.0 to 12.0

The central limit theorem (CLT) says, if the sample is random (i.i.d.), the sampling distribution of
the sample mean tends toward the normal REGARDLESS of the underlying distribution!
The standard error = SQRT(4^2/43) = 4/SQRT(43) = 0.609994.
The 90% confidence interval = 11.0 +/- 1.645*0.609994 = 11.0 +/- 1.0 = 10.0 to 12.0
... did you realize that a 90% two-side confidence INTERVAL implies the same deviate (1.645) as
a 95% one-sided deviate?

Student’s t Lookup Table
Excel function: = TINV(two-tailed probability [larger #], d.f.)
1-tail: 0.25 0.1 0.05 0.025 0.01 0.005 0.001

d.f. 2-tail: 0.50 0.2 0.1 0.05 0.02 0.01 0.002
1 1.000 3.078 6.314 12.706 31.821 63.657 318.309
2 0.816 1.886 2.920 4.303 6.965 9.925 22.327
3 0.765 1.638 2.353 3.182 4.541 5.841 10.215
4 0.741 1.533 2.132 2.776 3.747 4.604 7.173
5 0.727 1.476 2.015 2.571 3.365 4.032 5.893
6 0.718 1.440 1.943 2.447 3.143 3.707 5.208
7 0.711 1.415 1.895 2.365 2.998 3.499 4.785
8 0.706 1.397 1.860 2.306 2.896 3.355 4.501
9 0.703 1.383 1.833 2.262 2.821 3.250 4.297
10 0.700 1.372 1.812 2.228 2.764 3.169 4.144
11 0.697 1.363 1.796 2.201 2.718 3.106 4.025
12 0.695 1.356 1.782 2.179 2.681 3.055 3.930
13 0.694 1.350 1.771 2.160 2.650 3.012 3.852
14 0.692 1.345 1.761 2.145 2.624 2.977 3.787
15 0.691 1.341 1.753 2.131 2.602 2.947 3.733
16 0.690 1.337 1.746 2.120 2.583 2.921 3.686
17 0.689 1.333 1.740 2.110 2.567 2.898 3.646
18 0.688 1.330 1.734 2.101 2.552 2.878 3.610
19 0.688 1.328 1.729 2.093 2.539 2.861 3.579
20 0.687 1.325 1.725 2.086 2.528 2.845 3.552
21 0.686 1.323 1.721 2.080 2.518 2.831 3.527
22 0.686 1.321 1.717 2.074 2.508 2.819 3.505
23 0.685 1.319 1.714 2.069 2.500 2.807 3.485
24 0.685 1.318 1.711 2.064 2.492 2.797 3.467
25 0.684 1.316 1.708 2.060 2.485 2.787 3.450
26 0.684 1.315 1.706 2.056 2.479 2.779 3.435
27 0.684 1.314 1.703 2.052 2.473 2.771 3.421
28 0.683 1.313 1.701 2.048 2.467 2.763 3.408
29 0.683 1.311 1.699 2.045 2.462 2.756 3.396
30 0.683 1.310 1.697 2.042 2.457 2.750 3.385
The green shaded area represents values less than three (< 3.0). Think of it as the “sweet
spot.” For confidences less than 99% and d.f. > 13, the critical t is always less than 3.0. So, for
example, a computed t of 7 or 13 will generally be significant. Keep this in mind because in
many cases, you do not need to refer to the lookup table if the computed t is large; you can
simply reject the null.

The general framework
The subsequent AIMs breakdown the following general hypothesis testing framework:
Define & interpret the null hypothesis and

the alternative
Distinguish between one‐sided and

two‐sided hypotheses
Describe the confidence interval approach

to hypothesis testing
Describe the test of significance approach

to hypothesis testing
Define, calculate and interpret type I and

type II errors
Define and interpret the p value

Describe the process of selecting and constructing a null hypothesis.
The null hypothesis, denoted by H0, is tested against the alternative hypothesis, which is
denoted by H1 or sometimes HA.
Often, we test for the significance of the intercept or a

partial slope coefficient in a linear regression. Typically, Define & interpret the null
in this case, our null hypothesis is: “the slope is zero” or hypothesis and the alternative
“there is no correlation between X and Y” or “the
regression coefficients jointly are not significant.” In Distinguish between one‐sided
which case, if we reject the null, we are finding the and two‐sided hypotheses
statistic to be significant which, in this case, means
“significantly different than zero.”
Describe the confidence interval
approach to hypothesis testing
H0 : E (Y )  Y ,0
H1 : E (Y )  Y ,0 Describe the test of significance
Please note the null must contain the equal sign
(“=“):
Define, calculate and interpret
type I and type II errors
H0 : E (Y )  $20
H1 : E (Y )  $20
Statistical significance implies our null hypothesis (i.e., the parameter equals zero) was
rejected. We concluded the parameter is nonzero. For example, a “significant” slope
estimate means we rejected the null hypothesis that the true slope is zero.

Differentiate between a one-tailed and a two-tailed test and explain the
circumstances in which to use each test.
Your default assumption should be a two-sided

hypothesis. If unsure, assume two-sided. Define & interpret the null
hypothesis and the alternative
Here is a one-sided null hypothesis:
H0 : E (Y )  Y ,0 Distinguish between one‐sided

and two‐sided hypotheses
H1 : E (Y )  Y ,0
Specifically, “The one-sided null hypothesis is that the approach to hypothesis testing
population average wage is less than or equal to $20.00:”
H0 : E (Y )  $20 Describe the test of significance

H1 : E (Y )  $20
type I and type II errors
The null hypothesis always includes the equal sign (=), regardless! The null cannot include
only less than (<) or greater than (>).

Interpret the results of hypothesis tests with a specific level of
confidence.
In the significance approach, instead of defining the Define & interpret the null
confidence interval, we compute the standardized hypothesis and the alternative
distance in standard deviations from the observed mean
to the null hypothesis: this is the test statistic (or
Distinguish between one‐sided
computed t value). We compare it to the critical (or and two‐sided hypotheses
lookup) value.
If the test statistic is greater than the critical (lookup) Describe the confidence interval
value, then we reject the null. approach to hypothesis testing
Describe the test of significance

Reject H0 at 90% if t act
 1.64 approach to hypothesis testing
Reject H0 at 95% if t act  1.96 Define, calculate and interpret

Reject H0 at 99% if t act
 2.58 type I and type II errors

If we reject a hypothesis which is actually true, we have
committed a Type I error. If, on the other hand, we Define & interpret the null
accept a hypothesis that should have been rejected, we hypothesis and the alternative
have committed a Type II error.
 Type I error = significance level = α = Pr [reject H0 Distinguish between one‐sided

and two‐sided hypotheses
| H0 is true]
 Type II error = β = Pr [“accept” H0 | H0 is false] Describe the confidence interval

 We can reject null with (1-p)% confidence
 Type I: to reject a true hypothesis Describe the test of significance

 Type II: to accept a false hypothesis
Type I and Type II errors: for example type I and type II errors
Suppose we want to hire a portfolio manager who has

produced an average return of +8% versus an index that Define and interpret the p value
returned +7%. We conduct a test statistical test to
determine whether the “excess +1%” is due to luck or “alpha” skill. We set a 95% confidence
level for our test. In technical parlance, our null hypothesis is that the manager adds no skill (i.e.,
the expected return is 7%).
Under the circumstances, a Type I error is the following: we decide that excess is significant and
the manager adds value, but actually the out-performance was random (he did not add skill). In
technical terms, we mistakenly rejected the null. Under the circumstances, a Type II error is the
following: we decide the excess is random and, to our thinking, the out-performance was
random. But actually it was not random and he did add value. In technical terms, we falsely
accepted the null.

Describe and apply the principle of Chebyshev’s inequality
Chebyshev’s inequality provides a shorthand method for specifying a cumulative probability
without our need to know the underlying distribution (conditional on a finite variance):
1 1
P( X    k )  , or P ( X    k )  1 
k2 k2
Example of Chebyshev’s Inequality
 The probability that the random variable X falls at least 3 standard deviations from its
mean (expected value) is less than or equal to 1/3^2 =1/9; i.e., this is just the upper
bound, the probability is likely less than 1/9th
 The probability that the random variable X falls at least 4 standard deviations from its
mean is less than or equal to 1/4^2 = 1/16.

Stock & Watson’s
Probability and
Statistics Review
(Chapters 2 & 3)
Review of Probability (Stock & Watson Chapter 2)
Previously S&W Chapters 2 and 3 were assigned, but replaced in the 2013 FRM. Here we recap
selected highlights.
 Risk measurement is largely the quantification of uncertainty. We quantify uncertainty

by characterizing outcomes with random variables. Random variables have distributions
which are either discrete or continuous.
 In general, we observe samples; and use them to make inferences about a population (in
practice, we tend to assume the population exists but it not available to us)
 We are concerned with the first four moments of a distribution:
o Mean
o Variance, the square of the standard deviation.
Annualized standard deviation is called volatility; e.g., 12% volatility per annum.
o Skew (a function of the third moment about the mean): a symmetrical
distribution has zero skew or skewness
o Kurtosis (a function of the fourth moment about the mean).
 The normal distribution has kurtosis = 3.0
 As excess kurtosis = 3 – kurtosis, a normal distribution has zero excess
kurtosis
 Kurtosis > 3.0 refers to a heavy-tailed distribution (a.k.a., leptokurtosis),
which will also tend to have a higher peaked.
 The concepts of joint, conditional and marginal probability are important.
 To test a hypothesis about a sample mean (i.e., is the true population mean different than
some value), we use a student t or normal distribution
o Student t if the population variance is unknown (it usually is unknown)
o If the sample is large, the student t remains applicable, but as it approximates the
normal, for large samples the normal is used since the difference is not material
 To test a hypothesis about a sample variance, we use the chi-squared
 To test a joint hypothesis about regression coefficients, we use the F distribution

 In regard to the normal distribution:
o N(mu, sigma^2) indicates the only two parameters required. For example,
N(3,10) connotes a normal distribution with mean of 3 and variance of 10 and,
therefore, standard deviation of SQRT(10)
o The standard normal distribution is N(0,1) and therefore requires no parameter
specification: by definition it has mean of zero and variance of 1.0.
o Please memorize, with respect to the standard normal distribution:
 For N(0,1) Pr(Z < -2.33) ~= 1.0% (CDF is one-tailed)
 For N(0,1)  Pr (Z< -1.645)~ = 5.0% (CDF is one-tailed)
 The definition of a random sample is technical: the draws (or trials) are independent and
identically distributed (i.i.d.)
o Identical: same distribution
o Independence: no correlation (in a time series, no autocorrelation)
 The assumption of i.i.d. is a precondition for:
o Law of large numbers
o Central limit theorem (CLT)
o Square root rule (SRR) for scaling volatility; e.g., we typically scales a daily
volatility of (V) to an annual volatility with V*SQRT(250). Please note that i.i.d.
returns is the unrealistic precondition.
Review of Statistics (from Stock & Watson Chapter 3)
 We mostly do not observe population parameters but instead infer them from sample
estimates which are values given by estimators such as sample mean and sample
variance. An estimator is a “recipe” for obtaining an estimate of a population parameter.
o The sample mean is BLUE: Best Linear Unbiased Estimator
 The t-statistic tests the null hypothesis that the population mean equals a certain value.
o If the sample (n) is large (e.g., greater than 30), the t-statistic has a standard
normal sampling distribution when the null hypothesis is true.
o A common test is to test the significance of a regression coefficient. While the
specifics vary, in many cases here the null is “the slope coefficient is zero.”
 The p-value is an “exact (aka, marginal) significance level:” it is the probability of

drawing a statistic at least as adverse to the null hypothesis as the one actually computed
(observed), assuming the null hypothesis is correct.
o p-value is the smallest significance level at which the null can be rejected
o If the p-value is very small (e.g., 0.00x), reject the null. If the p-value is large
(e.g., 0.19 or 19%), accept (fail to reject) the null.
o You will NOT be asked, on the FRM, to calculate a p-value (e.g., you cannot derive
it on the TI BA II+ or HP 12c). You may be asked to interpret a given p-value.

 A 95% confidence interval for is an interval constructed so that it contains the true value
of in 95% of all possible samples:

90% CI for Y  Y  1.64SE Y  
95% CI for Y  Y  1.96SE Y 
99% CI for Y  Y  2.58SE Y 
Where SE is the standard error = sample standard deviation / SQRT (n) =

SQRT (sample variance / n)
1
 Sample covariance: sample  XY 
n 1
 ( X i  X )(Yi  Y )
s XY
 Sample correlation sample   r XY 
s X sY
o Correlation (X,Y)
= covariance (X,Y) / [Std Deviation(X)] * [Std Deviation(Y)]

Stock, Chapter 4:
Linear Regression
with one regressor
In this chapter…
 Explain how regression analysis in econometrics measures the relationship between
dependent and independent variables.
 Define and interpret a population regression function, regression coefficients,
parameters, slope and the intercept.
 Define and interpret the stochastic error term.
 Define and interpret a sample regression function, regression coefficients,
parameters, slope and the intercept.
 Describe the key properties of a linear regression.
 Describe the method and three key assumptions of ordinary least squares for
estimation of parameters.
 Summarize the benefits of using OLS estimators.
 Describe the properties of OLS estimators and their sampling distributions, and
explain the properties of consistent estimators in general.
 Define and interpret the explained sum of squares (ESS), the total sum of squares
(TSS), the sum of squared residuals (SSR), the standard error of the regression
(SER), and the regression R .2
 Interpret the results of an ordinary least squares regression
What is Econometrics?
Econometrics is a social science that applies tools (economic theory, mathematics and statistical
inference) to the analysis of economic phenomena. Econometrics consists of “the application of
mathematical statistics to economic data to lend empirical support to the models constructed
by mathematical economics.”

Methodology of econometrics
 Create a statement of theory or hypothesis
 Collect data: time-series, cross-sectional, or pooled (combination of time-series and
cross-sectional)
 Specify the (pure) mathematical model: a linear function with parameters (but without
an error term)
 Specify the statistical model: adds the random error term
 Estimate the parameters of the chosen econometric model: we are likely to use ordinary
least squares (OLS) approach to estimate parameters
 Check for model adequacy: model specification testing
 Test the model’s hypothesis
 Use the model for prediction or forecasting
Create theory
Estimate parameters
(hypothesis)
Collect data Test model specification
Specify mathematical
Test hypothesis
model
Specify statistical Use model to predict or

(econometric) model forecast
Note:
 The pure mathematical model “although of prime interest to the mathematical

economist, is of limited appeal to the econometrician, for such a model assumes an
exact, or deterministic, relationship between the two variables.”
 The difference between the mathematical and statistical model is the random error
term (u in the econometric equation below). The statistical (or empirical) econometric
model adds the random error term (u):
Yi  B0  B1X i  ui

Three different data types used in empirical analysis
Three types of data for empirical analysis:
 Time series - returns over time for an individual asset
 Cross-sectional - average return across assets on a given day
 Pooled (combination of time series and cross-sectional) - returns over time for a
combination of assets; and
 Panel data (a.k.a., longitudinal or micropanel) data is a special type of pooled data in
which the cross-sectional unit (e.g., family, company) is surveyed over time.
For example, we often characterize a portfolio with a matrix. In such a matrix, the assets are
given in the rows and the period returns (e.g., days/months/years) are given in the columns:
Time → 2006 2007 2008

Asset #1  Returns: Auto or Serial Correlation
Asset #2 Cross –
Asset #3 Sectional  Return Volatility: Auto or Serial Correlation
Asset #4 (or spatial)
correlation
For such a “matrix portfolio,” we can examine the data in at least three ways:
 Returns over time for an individual asset (time series)
 Average return across assets on a given day (cross-sectional or spatial)
 Returns over time for a combination of assets (pooled)
Time Series Cross-sectional Pooled
• Returns over time • Average return • Returns over time

for an individual across assets on a for a combination
asset given day of assets
• Example – Returns • Example – Returns • Includes panel
on a single asset for a data (a.k.a.,
from Jan. through business/family longitudinal,
Mar. 2009 on a given day micropanel)

Explain how regression analysis in econometrics measures the
relationship between dependent and independent variables.
A linear regression may have one or more of the following objectives:
 To estimate the (conditional) mean, or average, value of dependent variable
 Test hypotheses about nature of dependence
 To predict or forecast dependent
 One or more of the above
Correlation (dependence) is not causation. Further, linear correlation is a specific type of

dependence, which is a more general relationship (e.g., non-linear relationship).
In Stock and Watson, the authors regress student test scores (dependent variable) against class
size (independent variable):
TestScore  0  ClassSize  ClassSize other factors

Yi  0  1X i ui
Dependent Independent
(regressand) (regressor)
Variable Variable
Define and interpret a population regression function, regression

coefficients, parameters, slope and the intercept.
Yi  0  1Xi ui

In a linear regression with a single independent variable (aka, univariate, two-variable
regression), there are two important coefficients:
 The slope coefficient. For example, assume we regress average weekly lotto spending
against weekly income. If the slope is 0.080, that tells us that if income goes up by a
dollar, the mean or average weekly lotto spend goes up by 8 cents. The slope is a
measure of the average change in the dependent given an change in the
independent variable.
 Assume an intercept of 7.62. This indicates that if income were zero, the average lotto
spend would be $7.62. The intercept is the predicted value of the dependent if the
independent variable is equal to zero.
Define and interpret the stochastic error term.
The error term contains all the other factors aside from (X) that determine the value of the
dependent variable (Y) for a specific observation.
Yi  0  1Xi ui
The stochastic error term is a random variable. Its value cannot be a priori determined.
 May (probably) contains variables not explicit in model
 Even if all variables included, will still be some randomness
 Error may also include measurement error
 Ockham’s razor: a model is a simplification of reality. We don’t necessarily want to

include every explanatory variable
Define and interpret a sample regression function, regression

coefficients, parameters, slope and the intercept.
In theory, there is one unknowable population and one set of unknowable parameters (B1, B2).
But there are many samples, each sample → SRF → Estimator (statistic) → Estimate
Stochastic PRF Yi  B0  B1X i  ui

Sample regression function (SRF) Yî  b0  b1X i
Stochastic sample regression function (SRF) Yi  b0  b1X i  ei

Each sample produces its own scatterplot. Through this sample scatterplot, we can plot a sample
regression line (SRL). The sample regression function (SRF) characterizes this line; the SRF is
analogous to the PRF, but for each sample.
 B0 = intercept = regression coefficient
 B1 = slope = regression coefficient
 e(i) = the residual term
Note the correspondence between error term and the residual. As we specify the model,
we ex ante anticipate an error; after we analyze the observations, we ex post observe
residuals.
Unlike the PRF which is presumed to be stable (unobserved), the SRF varies with each
sample. So, we expect to get different SRF. There is no single “correct” SRF!
Each Sample returns a different SRF

(sampling variation)
50
Sample #1
40
30 Sample #2
20
Linear (Sample #1)
10
0 Linear (Sample #2)
0 200 400

Describe the key properties of a linear regression.
Linear in parameters No correlation Given value of X,

between explanatory expected value of
•May be non-linear in disturbance (error)
variables variable(s) and
disturbance (error) term is zero
term •E(u|X)=0
Constant variance Error terms are Regression model

•Var(u)=^2. uncorrelated correctly specified
Homoscedasticity •Cov(ui, uj) = 0
Error term is
normally distributed
•u ~ N(0,^2)
 Linear in the parameters (may be non-linear in variables)
 Explanatory variables is uncorrelated with the disturbance (error) term
 Expected value of disturbance (error) term is zero: E(u) = 0
 The variance of the error is constant; i.e., homoscedastic
 There is no correlation between error terms; i.e., no serial- or auto-correlation
 Regression model is correctly specified
 Error term (u) is normally distributed

Describe the method and three key assumptions of ordinary least squares
for estimation of parameters:
The process of ordinary least squares estimation seeks to achieve the minimum value for the
residual sum of squares (squared residuals = e^2).
Estimate (conditional) mean of dependent The conditional distribution of

variable u(i) given X(i) has a mean of
zero
Test hypotheses about nature of

dependence [X(i), Y(i)] are independent
and identically distributed
(i.i.d.)
To forecast the mean value of the
dependent
Correlation (dependence) is not causation Large outliers are unlikely
The three key assumptions of the ordinary least squares (OLS) linear regression model are the
following:
1. Assumption # 1: The conditional distribution of the error term, u(i), has a mean of zero.
This assumption is a formal mathematical statement about the “other factors” contained
in the error term and asserts that these other factors are unrelated to the independent
variable, X(i), in the following sense: given a value of X(i), the mean of the distribution of
these other factors is zero.
2. Assumption #2: X(i), Y(i) are independently and identically distributed (i.i.d.) across
observations. This assumption is a statement about how the sample is drawn. If the
observations are drawn by simple random sampling from a single large population, then
[X(i), Y(i)] are i. i. d. The i. i. d. assumption is a reasonable one for many data collection
schemes.
3. Assumption # 3: Large outliers are unlikely The third least squares assumption is that
large outliers— that is, observations with values of , or both that are far outside the usual
range of the data— are unlikely. Large outliers can make OLS regression results
misleading. Another way to state this assumption is that X and Y have finite
kurtosis. The assumption of finite kurtosis is used in the mathematics that justify the
large- sample approximations to the distributions of the OLS test statistics.

Summarize the benefits of using OLS estimators.
There are both practical and theoretical reasons to use the OLS estimators.
 Commonly accepted and widely familiar: Because OLS is the dominant method used
in practice, it has become the common language for regression analysis throughout
economics, finance and the social sciences more generally. Presenting results using OLS
means that you are “speaking the same language” as other economists and statisticians.
The OLS formulas are built into virtually all spreadsheet and statistical software
packages, making OLS easy to use.
 Desirable theoretical properties (unbiased and consistent): The OLS estimators

also have desirable theoretical properties. They are analogous to the desirable
properties of as an estimator of the population mean. Specifically, the OLS estimator is
unbiased and consistent. The OLS estimator is also efficient among a certain class of
unbiased estimators; however, this efficiency result holds under some additional special
conditions.
Describe the properties of OLS estimators and their sampling

distributions, and explain the properties of consistent estimators in
general.
Given the assumptions of the classical linear regression model, the least-squares (OLS) estimates
possess ideal properties. These properties are contained in the well-known Gauss–Markov
theorem. To understand this theorem, we need to consider the best linear unbiasedness
property of an estimator. An OLS estimator is said to be a best linear unbiased estimator (BLUE)
of if the following hold:
1. It linear, that is, a linear function of a random variable, such as the dependent variable Y
in the regression model.
2. It is unbiased, that is, its average or expected value is equal to the true value
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased
estimator with the least variance is known as an efficient estimator.
In the regression context it can be proved that the OLS estimators are BLUE. This is the gist
of the famous Gauss–Markov theorem, which can be stated as follows:
 Gauss–Markov Theorem: Given the assumptions of the classical linear regression

model, the least-squares estimators, in the class of unbiased linear estimators, have
minimum variance, that is, they are BLUE.

Define and interpret the explained sum of squares (ESS), the total sum of
squares (TSS), the sum of squared residuals (SSR), the standard error of
the regression (SER), and the regression R . 2
We can break the regression equation into three parts:
 Explained sum of squares (ESS),

 Sum of squared residual (RSS), and
 Total sum of squares (TSS).
The explained sum of squares (ESS) is the squared distance between the predicted Y and the
mean of Y:
n
ESS   (Yî  Y )2
i 1
The sum of squared residuals (SSR) is the summation of each squared deviation between
the observed (actual) Y and the predicted Y:
n
SSR   (Yi  Yî )2
i 1
The sum of squared residual (SSR) is the square of the error term. It is directly related to the
standard error of the regression (SER):
n n
SSR   (Yi  Yî )2   uî2  SER 2 (n  2)
i 1 i 1
Equivalently:
ˆ 2   ei2  ˆ   ei2
n2 n2
The ordinary least square (OLS) approach minimizes the SSR.
The SSR and the standard error of regression (SER) are directly related; the SER is the
standard deviation of the Y values around the regression line.

The standard error of the regression (SER) is a function of the sum of squared residual (SSR):
SSR ei2
SER  
n2 n2
Note the use of the use of (n-2) instead of (n) in the denominator. Division by this smaller
number—in this case (n-2) instead of (n)—is referred to as “an unbiased estimate.”
(n-2) is used because the two-variable regression has (n-2) degrees of freedom (d.f.).
In order to compute the slope and intercept estimates, two independent observations are
consumed.
If k = the number of explanatory variables plus the intercept (e.g., 2 if one explanatory
variable; 3 if two explanatory variables), then SER = SQRT[SSR/(n-k)].
If k = the number of slope coefficients (excluding the intercept), then similarly, SER =
SQRT[SSR/(n-k -1)]
Interpret the results of an ordinary least squares regression
In the Stock & Watson example, the authors regress TestScore against the Student-teacher ratio
(STR):
Test Scores versus Student-Teacher Ratio

720.0
700.0
Test Scores
680.0
660.0
640.0
620.0
600.0
10.0 15.0 20.0 25.0 30.0
Student-teacher ratio
The regression function, with standard error, is given by:
TestScore  698.9  2.28  STR

(9.47) (0.48)

The regression results are given by:
B(1) B(0)
Regression coefficients -2.28 698.93
Standard errors, SE() 0.48 9.47
R^2, SER 0.05 18.58
F, d.f. 22.58 418.00
ESS, RSS 7,794 144,315
Please note:
 Both the slope and intercept are both significant at 95%, at least. The test statistics are
73.8 for the slope (699/9.47) and 4.75 (2.28/0.48). For example, given the very high test
statistic for the slope, its p-value is approximately zero.
 The coefficient of determination (R^2) is 0.05, which is equal to 7,794/(7,7794 +

144,315)
 The degrees of freedom are n – 2 = 420 – 2 = 418

Stock, Chapter 5: Single
Regression: Hypothesis Tests
and Confidence Intervals
In this chapter…
 Define, calculate, and interpret confidence intervals for regression coefficients.
 Define and interpret the p-value.
 Define and interpret hypothesis tests about regression coefficients.
 Define and differentiate between homoskedasticity and heteroskedasticity.
 Describe the implications of homoskedasticity and heteroskedasticity.
 Explain the Gauss-Markov Theorem and its limitations, and alternatives to the OLS.
 Define, describe, apply, and interpret the t-statistic when the sample size is small.
Define, calculate, and interpret confidence intervals for regression

coefficients.
 Upper/lower limit = Regression coefficient ± [standard error × critical value @ c%]
 The coefficient is effectively a sample mean, so this is essentially similar to

computing the confidence interval for a sample mean
 In the example from Stock and Watson, lower limit = 680.4 = 698.9 – 9.47 × 1.96
Confidence Interval
Coefficient SE Lower Upper
Intercept 698.9 9.47 680.4 717.5
Slope (B1) -2.28 0.48 -3.2 -1.3

Define and interpret the p-value.
The p-value is the “exact significance level:”
 Lowest significance level a which a null

can be rejected Define & interpret the null
hypothesis and the alternative
 We can reject null with (1-p)% confidence
Distinguish between one‐sided
The p-value is an abbreviation that stands for and two‐sided hypotheses
“probability-value.” Suppose our hypothesis is
that a population mean is 10; another way of
saying this is “our null hypothesis is H0: mean = approach to hypothesis testing
10 and our alternative hypothesis is H1: mean 
10.” Suppose we conduct a two-tailed test, given
Describe the test of significance
the results of a sample drawn from the approach to hypothesis testing
population, and the test produces a p-value of .03.
This means that we can reject the null hypothesis
with 97% confidence – in other words, we can be type I and type II errors
fairly confident that the true population mean is
not 10.
Our example was a two-tailed test, but recall we
have three possible tests:
 The parameter is greater than (>) the stated value (right-tailed test), or
 The parameter is less than (<) the stated value (left-tailed test), or
 The parameter is either greater than or less than (≠) the stated value (two-tailed test).
Small p-values provide evidence for rejecting the null hypothesis in favor of the alternative
hypothesis, and large p values provide evidence for not rejecting the null hypothesis in favor of
the alternative hypothesis.
Keep in mind a subtle point about the p-value and “rejecting the null.” It is a soft rejection.
Rather than accept the alternative, we fail to reject the null. Further, if we reject the null, we are
merely rejecting the null in favor of the alternative.
The analogy is to a jury verdict. The jury does not return a verict of “innocent;” rather they
return a verdict of “not guilty.”
 
p  value  PrH0 Z  t act  1   t act  

Define and interpret hypothesis tests about regression coefficients.
The key idea here is that the regression coefficient (the estimator or sample statistic) is a
random variable that follows a student’s t distribution (because we do not know the population
variance, or it would be the normal):
b1  B1 regression coefficient  null hypothesis [0]

t  ~ tn  2
se(b1) se(regression coefficient)
Test of hypothesis for the slope (b1)
To test the hypothesis that the regression coefficient (b1) is equal to some specified value (),
we use the fact that the statistic
b1  
test statistic t 
se(b1)
This has a student's distribution with n - 2 degrees of freedom because there are two coefficients
(slope and intercept).
Using the same example:
TestScore  698.9  2.28  STR

(9.47) (0.48)
STR: t statistic = |(-2.28 – 0)/0.48| = 4.75
p value 2 Tail ~0%

Define and differentiate between homoskedasticity and
heteroskedasticity.
The error term u(i) is homoskedastic if the variance of

the conditional distribution of u(i) given X(i) is constant
for i = 1,…,n and in particular does not depend on X(i).
Otherwise the error term is heteroskedastic.
Describe the implications of homoskedasticity and heteroskedasticity.
Mathematical Implications of Homoskedasticity
 The OLS estimators remain unbiased and asymptotically normal. “Whether the
errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent,
and asymptotically normal.”
 OLS estimators are efficient if the least squares assumptions are true. This result is
called the Gauss– Markov theorem.
 Homoskedasticity-only variance formula. If the errors are homoskedastic, then there

is a specialized formula that can be used for the standard errors of the slope and
intercept estimates.
Explain the Gauss-Markov Theorem and its limitations, and alternatives to

the OLS.
The Gauss– Markov theorem states that, under a set of conditions known as the Gauss– Markov
conditions, the OLS slope (b1) estimator has the smallest conditional variance, given , of all
linear conditionally unbiased estimators of parameter (B1); that is, the OLS estimator is BLUE.
The Gauss– Markov theorem provides a theoretical justification for using OLS, but has two key
limitations:
 Its conditions might not hold in practice. “In particular, if the error term is
heteroskedastic— as it often is in economic applications— then the OLS estimator is no
longer BLUE… An alternative to OLS when there is heteroskedasticity of a known form,
called the weighted least squares estimator.
 Even if the conditions of the theorem hold, there are other candidate estimators that are
not linear and conditionally unbiased; under some conditions, these other estimators are
more efficient than OLS.

Alternatives to ordinary least squares (OLS)
Under certain conditions, some regression estimators are more efficient than OLS.
 The weighted least squares (WLS) estimator: If the errors are heteroskedastic, then
OLS is no longer BLUE. If the heteroskedasticity is known (i.e., if the conditional variance
of given is known up to a constant factor of proportionality) then an alternate estimator
exists with a smaller variance than the OLS estimator. This method, weighted least
squares (WLS), weights the (i-th) observation by the inverse of the square root of the
conditional variance of u(i) given X(i). Because of this weighting, the errors in this
weighted regression are homoskedastic, so OLS, when applied to the weighted data, is
BLUE.
o Although theoretically elegant, the problem with weighted least squares is that we
must know how the conditional variance of u(i) depends on X(i). Because this is
rarely known, weighted least squares is used far less frequently in practice than
OLS.
 The least absolute deviations (LAD) estimator: The OLS estimator can be sensitive to
outliers. If extreme outliers are “not rare” (or if we can safely ignore extreme outliers),
then the least absolute deviations (LAD) estimator may be more effecitve. In LAD, the
regression coefficients are obtained by solving a minimization that uses the absolute
value of the prediction “mistake” (i.e., instead of its square).
o Because “in many economic data sets, severe outliers are rare,” the use of the LAD
estimator is uncommon in applications.
Define, describe, apply, and interpret the t-statistic when the sample size
is small.
When the sample size is small, the exact distribution of the t- statistic is complicated and
depends on the unknown population distribution of the data. If, however, the three least squares
assumptions hold, the regression errors are homoskedastic, and the regression errors are
normally distributed, then the OLS estimator is normally distributed and the homoskedasticity-
only t- statistic has a Student t distribution. These five assumptions— the three least squares
assumptions, that the errors are homoskedastic, and that the errors are normally distributed—
are collectively called the homoskedastic normal regression assumptions.

Stock: Chapter 6:
Linear Regression
with Multiple
Regressors
In this chapter…
 Define, interpret, and discuss methods for addressing omitted variable bias.
 Distinguish between simple and multiple regression.
 Define and interpret the slope coefficient in a multiple regression.
 Describe homoskedasticity and heterosckedasticity in a multiple regression.
 Describe and discuss the OLS estimator in a multiple regression.
 Define, calculate, and interpret measures of fit in multiple regression.
 Explain the assumptions of the multiple linear regression model.
 Explain the concept of imperfect and perfect multicollinearity and their
implications.
Define, interpret, and discuss methods for addressing omitted variable

bias.
Omitted variable bias occurs if both:
1. Omitted variable is correlated with the included regressor, and
2. Omitted variable is a determinant of the dependent variable
“If the regressor ( the student– teacher ratio) is correlated with a variable that has been
omitted from the analysis ( the percentage of English learners) and that determines, in
part, the dependent variable ( test scores), then the OLS estimator will have omitted
variable bias.” –S&W
The first least squares assumption is that the error term, u(i), has a conditional mean of
zero: E[ u(i) | X(i) ] = 0. Omitted variable bias means this OLS assumption is not true.

Stock and Watson show that omitted variable bias can be expressed mathematically, if we
assume a correlation between u(i) and X(i) denoted by rho(X,u). Then the OLS estimator has the
following limit; i.e., as the sample sizes increases the estimator (B1 carrot) does not tend toward
the parameter (B1) but rather:

ˆ1 
p
 1   X 
X
1. Omitted variable bias is a problem whether the sample size is large or small. Because the
estimator (B1 carrot) does not converge in probability to the true value (B1), the
estimator (B1 carrot) is biased and inconsistent; i.e., [B1 carrot] is not a consistent
estimator of [B1] when there is omitted variable bias.
2. Whether this bias is large or small in practice depends on the correlation (rho) between
the regressor and the error term. The larger is the correlation, the larger the bias.
3. The direction of the bias in depends on whether (X) and (u) are positively or negatively
correlated.
Distinguish between simple [single] and multiple regression.
Multiple regression model extends the single variable regression model:
Yi  0  1X1i  ui
Yi  0  1X1i  2 X2i   k X ki  ui , i  1,..., n
Define and interpret the slope coefficient in a multiple regression.
The B(1) slope coefficient, for example, is the effect on Y of a unit change in X(1) if we hold the
other independent variables, X(2) …., constant.
Yi  0  1X1i  2 X 2i   k X ki  ui , i  1,..., n
B(2) is the effect on Y(i) given a unit change in X(2), if

we hold constant X(1), X(3), … and X(N)

Describe homoskedasticity and heterosckedasticity in a multiple
regression.
If variance [u(i) | X(1i), …. X(ki)] is contant for i = 1, …, n, then model is homoskedastic.

Otherwise, the model is heteroskedastic.
Describe and discuss the OLS estimator in a multiple regression.
TestScore  686  1.10  STR  0.65  PctEL
OLS estimate of the coefficient

on the student-teacher ratio (B1)
Define, calculate, and interpret measures of fit in multiple regression.
Standard error of regression (SER)
Standard error of regression (SER) estimates the standard deviation of the error term u(i). In
this way, the SER is a measure of spread of the distribution of Y around the regression line. In a
multiple regression, the SER is given by:
SSR
SER 
n  k 1
Where (k) is the number of slope coefficients; e.g., in this case of a two variable regression, k = 1.
For the standard error of the regression (SER), the denominator is n – [# of variables], or
n – [# of coefficients including the intercept].
In a univariate regression (i.e., one indepenent variable/regressor), the deonominator is

n – 2 as n – 1 – 1 = n -2
In a regression with two regressors (two independent variables), the denominator is n – 3

as n – 2 – 1 = n – 3.

Coefficient of determination (R^2)
The coefficient of determination is the fraction of the sample variance of Y(i) explained by (or
predicted by) the independent variables”
ESS SSR
R2   1
TSS TSS
In multiple regression, the R^2 increases whenever a regressor (independent variable)

is added, unless the estimated oefficient on the added regressor is exactly zero.
“Adjusted R^2”
The unadjusted R^2 will tend to increase as additional independent variables are added.
However, this does not necessarily reflect a better fitted model. The adjusted R^2 is a
modified version of the R^2 that does not necessarily increase with a new independent
variable is added. Adjusted R^2 is given by:
n  1 SSR su2ˆ
R  1
2
 1 2
n  k  1 TSS sY
“The R^2 is useful because it quantifies the extent to which the regressors (independent
variables) account for, or explain, the variation in the dependent variable. Nevertheless,
heavy reliance on the R^2 can be a trap. In applications, “maximize the R^2” is rarely the
answer to any economically or statistically meaningful question. Instead, the decision
about whether to include a variable in a multiple regression should be based on whether
including that variable allows you better to estimate the causal effect of interest.” –S&W
Explain the assumptions of the multiple linear regression model.
 Conditional distribution of u(i) given X(1i), X(2i),…,X(ki) has mean of zero

 X(1i), X(2i), … X(ki), Y(i) are independent and identically distributed (i.i.d.)
 Large outliers are unlikely
 No perfect collinearity
The regressors exhibit perfect multi-collinearity if one of the regressors is a perfect

linear function of the other regressors. The fourth least squares assumption is that the
regressors are not perfectly multicollinear.

Explain the concept of imperfect and perfect multicollinearity and their
implications.
Imperfect multicollinearity is when two or more of the independent variables (regressors) are
highly correlated: there is a linear function of one of the regressors that is highly correlated with
another regressor. Imperfect multicollinearity does not pose any problems for the theory
of the OLS estimators; indeed, a purpose of OLS is to sort out the independent influences of the
various regressors when these regressors are potentially correlated. Imperfect multicollinearity
does not prevent estimation of the regression, nor does it imply a logical problem with the
choice of independent variables (i.e., regressor).
 However, imperfect multicollinearity does mean that one or more of the regression
coefficients could be estimated imprecisely
“Perfect multicollinearity is a problem that often signals the presence of a log-ical error.
In contrast, imperfect multicollinearity is not necessarily an error, but rather just a
feature of OLS, your data, and the question you are trying to answer. If the variables
in your regression are the ones you meant to include— the ones you chose to address the
potential for omitted variable bias— then imperfect mul-ticollinearity implies that it will
be difficult to estimate precisely one or more of the partial effects using the data at
hand.” –S&W

Stock, Chapter 7:
Hypothesis Tests and
Confidence Intervals
in Multiple Regression
In this chapter…
 Construct, perform, and interpret hypothesis tests and confidence intervals for a
single coefficient in a multiple regression.
 Construct, perform, and interpret hypothesis tests and confidence intervals for
multiple coefficients in a multiple regression.
 Define and interpret the F-statistic.
 Define, calculate, and interpret the homoskedasticity-only F-statistic.
 Describe and interpret tests of single restrictions involving multiple coefficients.
 Define and interpret confidence sets for multiple coefficients.
 Define and describe omitted variable bias in multiple regressions.
 Interpret the R2 and adjusted-R2 in a multiple regression.
Construct, perform, and interpret hypothesis tests and confidence

intervals for a single coefficient in a multiple regression.
The Stock & Watson example adds an additional independent variable (regressor). Under this
three variable regression, Test Scores (dependent) are a function of the Student/Teacher ratio
(STR) and the Percentage of English learners in district (PctEL).

STR = Student/Teacher ratio
PctEL = Percentage (%) of English learners in district
TestScore  686  1.10  STR  0.65  PctEL

(7.41) (0.38) (0.04)
STR: t statistic = (-1.10 – 0)/0.38 = -2.90
p value 2 Tail = 0.40%
PctEL t statistic = (-0.65– 0)/0.04 = 16.52
p value 2 Tail ~ 0.0%
STR: Lower limit = -1.10 – 0.38×1.96 = -1.85
Upper limit = -1.10 + 0.38×1.96 = -0.35
PctEL Lower limit = -0.65 – 0.04×1.96 = -0.73
Upper limit = -0.65 + 0.04×1.96 = -0.57
Construct, perform, and interpret hypothesis tests and confidence

intervals for multiple coefficients in a multiple regression.
The “overall” regression F-statistic tests the joint hypothesis that all slope coefficients are zero
Define and interpret the F-statistic.
F-statistic is used to test joint hypothesis about regression coefficients.
1  t1  t2  2 ˆt 1,t 2t1t2 

2 2
F  
2  1  2 ˆt 1,t 2 
The “overall” regression F- statistic tests the joint hypothesis that all the slope coefficients are
zero; i.e., the null hypothesis is that all slope coefficients are zero. Under this null hypothesis,
none of the regressors explains any of the variation in the dependent variable (although the
intercept can be nonzero).

Define, calculate, and interpret the homoskedasticity-only F-statistic.
If the error term is homoskedastic, the F-statistic can be written in terms of the improvement in
the fit of the regression as measured either by the sum of squared residuals or by the regression
R^2.
F
SSRrestricted  SSRunrestricted  q
SSRunrestricted  n  kunrestricted  1
Describe and interpret tests of single restrictions involving multiple

coefficients.
Approach #1: Test the restrictions directly
Approach #2: Transform the regression
Define and interpret confidence sets for multiple coefficients.
Confidence ellipse characterizes a confidence set for two coefficients; this is the two-dimension
analog to the confidence interval:
9
8
7
6
5
Coefficient on
4
Expn (B2)
3
2
1
0
-1 -2 -1.5 -1 -0.5 0 0.5 1 1.5
Coefficient on STR (B1)

Define and describe omitted variable bias in multiple regressions.
Omitted variable bias: an omitted determinant of Y (the dependent variable) is correlated with
at least one of the regressor (independent variables).
Interpret the R2 and adjusted-R2 in a multiple regression.

There are four pitfalls to watch in using the R^2 or adjusted R^2:
1. An increase in the R^2 or adjusted R^2 does not necessarily imply that an added variable
is statistically significant
2. A high R^2 or adjusted R^2 does not mean the regressors are a true cause of the
dependent variable
3. A high R^2 or adjusted R^2 does not mean there is no omitted variable bias
4. A high R^2 or adjusted R^2 does not necessarily mean you have the most appropriate set
of regressors, nor does a low R^2 or adjusted R^2 necessarily mean you have an
inappropriate set of regressors

Jorion, Chapter 12:
Monte Carlo
Methods
In this chapter…
 Describe how to simulate a price path using a geometric Brownian motion model.
 Describe how to simulate various distributions using the inverse transform method.
 Describe the bootstrap method.
 Explain how simulations can be used for computing VaR and pricing options.
 Describe the relationship between the number of Monte Carlo replications and the
standard error of the estimated values.
 Describe and identify simulation acceleration techniques.
 Explain how to simulate correlated random variables using Cholesky factorization.
 Describe deterministic simulations.
 Describe the drawbacks and limitations of simulation procedures.
Describe how to simulate a price path using a geometric Brownian motion

(GBM) model.
Run trials For all trials, Sort

Specify a (10 or 1 MM) calculate outcomes,
random each a terminal (at best to worst.
process (GBM function of horizon) asset Quintiles
for a stock) random (or portolio) (e.g., 1%ile)
variable value are VaRs
Geometric Brownian Motion (GBM) is the continuous motion/ process in which the randomly
varying quantity (in our example ‘Asset Value’) has a fluctuated movement and is dependent on
a variable parameter (in our example the stochastic variable is ‘Time’). The standard variable
parameter is the shock and the progress in the asset’s value is the drift. Now, the GBM can be
represented as drift + shock as shown below.

dSt  t St dt   t St dz is the infinitesimal (continuous) representation of the GBM
S  St 1( t   t ) is the discrete representation of the GBM
GBM models a deterministic drift plus a stochastic random shock
The above illustration is the shock and drift progression of the asset. The asset drifts upward
with the expected return of  over the time interval t . But the drift is also impacted by shocks
from the random variable  . We measure the standard deviation  by a random variable 
(which is the random shock) here. As the variance is adjusted with time t , volatility is adjusted
with the square root of time t .
10-day GBM simulation (40 trials)

$12.00
$11.00
$10.00
$9.00
$8.00
day 1
day 8
day 10
day 2
day 3
day 4
day 5
day 6
day 7
day 9
Expected Drift is the deterministic component but shock is the random component in this stock
price process simulation. The Y-axis has an empirical distribution rather than a parametric
distribution and can be easily used to calculate the VaR. This Monte Carlo Distribution allows us
to produce an empirical distribution in future which can be used to calculate the VaR.
GBM assumes constant volatility (generally a weakness) unlike GARCH(1,1) which models
time-varying volatility.

Describe how to simulate various distributions using the inverse
transform method.
The inverse transform method translates a random number (under a uniform distribution) into
a cumulative standard normal distribution:
CDF pdf
Random NORMSINV() NORMDIST()
0.10 -1.282 0.18
0.15 -1.036 0.23
0.20 -0.842 0.28
0.25 -0.674 0.32
0.30 -0.524 0.35
0.35 -0.385 0.37
0.40 -0.253 0.39
0.45 -0.126 0.40
0.50 0.000 0.40
A random variable is generated, between 0 and 1. In Excel, the function is =RAND(). This
corresponds to the Y-axis on the first chart below. This will necessarily correspond to standard
normal CDF; e.g., RAND(.4) corresponds to -0.126 because NORMSINV(RAND(.4)) = -0.126.

Describe the bootstrap method.
The bootstrap method is a subclass of (type of) historical simulation (like HS). In regular
historical simulation, current portfolio weights are applied to the historical sample (e.g., 250
trading days). The bootstrap differs because it is “with replacement:” a historical period (i.e., a
vector of daily returns on a given day in the historical window) is selected at random. This
becomes the “simulated” vector of returns for day T+1. Then, for day T+2 simulation, a daily
vector is again selected from the window; it is “with replacement” because each simulated day
can select from the entire historical sample. Unlike historical simulation—which runs the
Standard normal PDF
-5.000 0.000 5.000

current portfolio through the single historical sample—the bootstrap randomizes the historical
sample and therefore can generate many historically-informed samples.
The advantages of the bootstrap include: can model fat-tails (like HS); by generating
repeated samples, we can ascertain estimate precision. Limitations, according to Jorion,
include: for small sample sizes, the bootstrapped distribution may be a poor
approximation of the actual one.
Randomize
Historical Date,
But same indexed
returns within date
(preserves cross-sectional
correlations)

Monte Carlo versus bootstrapping
Monte Carlo simulation is a generation of a distribution of returns and/or asset prices paths by
the use of random numbers. Bootstrapping randomizes the selection of actual historical returns.
Monte Carlo Bootstrapping

Both generate hypothetical future scenario and determine VaR by
th th
“lookup function:” what is 95 or 99 worst simulated loss?
Neither incorporates autocorrelation (basic MC does not)
Algorithm describes return Retrieves set (vector) of actual
path (e.g., GBM) historical returns
Randomizes return Randomizes historical date
Correlation must be modeled Built-in correlation
Uses parametric assumptions No distributional assumption
(does not assume normality)
Model risk Needs lots of data
Monte Carlo advantages include:
 Powerful & flexible
 Able to account for a range of risks (e.g., price risk, volatility risk, and nonlinear
exposures)
 Ban be extended to longer horizons (important for credit risk measurement)
 Can measure operational risk.
 However, Monte Carlo simulation can be expensive and time-consuming to run,

including: costly computing power and costly expertise (human capital).
Bootstrapping advantages include:
 Simple to implement
 Naturally incorporates spatial (cross-sectional) correlation
 Automatically captures non-normality in price changes (i.e., does not impose a

parametric distributional assumption)

Explain how simulations can be used for computing VaR and pricing
options.
Value at Risk (VaR)
Once a price path has been generated, we can build a portfolio distribution at the end of the
selected horizon:
1. Choose a stochastic process and parameters
2. Generate a pseudo-sequence of variables from which prices are computed
3. Calculate the value of the asset (or portfolio) under this particular sequence of prices at
the target horizon
4. Repeat steps 2 and 3 as many times as needed
This process creates a distribution of values. We can sort the observations and tabulate the
expected value and the quantile, which is the expected value in c times 10,000 replications.
Value at risk (VaR) relative to the mean is then:
VaR(c,T )  E(FT )  Q(FT ,C )
Pricing options
Options can be priced under the risk-neutral valuation method by using Monte Carlo simulation:
1. Choose a process with drift equal to riskless rate (mu = r)
2. Simulate prices to the horizon
3. Calculate the payoff of the stock option (or derivative) at maturity
4. Repeat steps as often as needed
The current value of the derivative is obtained by discounting at the risk free rate and averaging
across all experiments:
ft  E * e rt F (ST )
This formula means that each future simulated price, F(St), is discounted at the risk-free rate;
i.e., to solve for the present value. Then the average of those values is the expected value, or
value of the option. The Monte Carlo method has several advantages. It can be applied in
many situations, including options with so-called price-dependent paths (i.e., where the value
depends on the particular path) and options with atypical payoff patterns. Also, it is powerful
and flexible enough to handle varieties of options. With one notable exception: it cannot
accurately price options where the holder can exercise early (e.g., American-style options).

Describe the relationship between the number of Monte Carlo
replications and the standard error of the estimated values.
The relationship between the number of replications and precision (i.e., the standard error of
estimated values) is not linear: to increase the precision by 10X requires approximately 100X
more replications. The standard error of the sample standard deviation:
1 SE (ˆ ) 1
SE (ˆ )    
2T  2T
Therefore to increase VaR precision by (1/T) requires a multiple of about T2 the number of
replications; e.g., x 10 precision needs x 100.
= 10^2 =
100x
replications
se() = 1/10
reduce se() for
better precision
Describe and identify simulation acceleration techniques.
Because an increase in precision requires exponentially more replications, acceleration

techniques are used:
 Antithetic variable technique: changes the sign of the random samples. Appropriate
when the original distribution is symmetric. Creates twice as many replications at little
additional cost.
 Control variates technique: attempts to increase accuracy by reducing sample variance

instead of by increasing sample size (the traditional approach).
 Importance sampling technique (Jorion calls this the most effective acceleration
technique): attempts to sample along more important paths
 Stratified sampling technique: partitions the simulation region into two zones.

Explain how to simulate correlated random variables using Cholesky
factorization.
Cholesky factorization
By virtue of the inverse transform method, we can use =NORMSINV(RAND()) to create standard
random normal variables. The RAND() function is a uniform distribution bounded by [0,1]. The
NORMSINV() translates the random number into the z-value that corresponds to the probability
given by a cumulative distribution. For example, =NORMSINV(5%) returns -1.645 because 5% of
the area under a normal curve lies to the left of - 1.645 standard deviations.
But no realistic asset or portfolio contains only one risk factor. To model several risk factors, we
could simply generate multiple random variables. Put more technically, the realistic modeling
scenario is a multivariate distribution function that models multiple random variables. But the
problem with this approach, if we just stop there, is that correlations are not included. What we
really want to do is simulate random variables but in such a way that we capture or reflect the
correlations between the variables. In short, we want random but correlated variables.
The typical way to incorporate the correlation structure is by way of a Cholesky factorization (or
decomposition) . There are four steps:
1. The covariance matrix. This contains the implied correlation structure; in fact, a
covariance matrix can itself be decomposed into a correlation matrix and a volatility
vector.
2. The covariance matrix(R) will be decomposed into a lower-triangle matrix (L) and an
upper-triangle matrix (U). Note they are mirrors of each other. Both have identical
diagonals; their zero elements and nonzero elements are merely "flipped"
3. Given that R = LU, we can solve for all of the matrix elements: a,b,c (the diagonal) and x,
y, z. That is “by definition.” That's what a Cholesky decomposition is, it is the solution
that produces two triangular matrices whose product is the original (covariance) matrix.
4. Given the solution for the matrix elements, we can calculate the product of the triangle
matrix to ensure the produce does equal the original covariance matrix (i.e., does LU =
R?). Note, in Excel a single array formula can be used with = MMULT().
The lower triangle (LU) is the result of the Cholesky Decomposition. It is the thing we can use to
simulate random variables, that itself is "informed" by our covariance matrix.

Correlated random variables
The following transforms two independent random variables into correlated random variables:
1  1
 2  1  (1   2 )2
1,2 : independent random variables

: correlation coefficient
1,  2 : correlated random variables
Correlated Random Vars Correlated Time Series

2.00 $16.0
1.00 $14.0
$12.0
-
$10.0
-1.00
$8.0
-2.00 $6.0
-3.00 $4.0
-4.00 -2.00 - 2.00 1 4 7 10 13 16 19 22 25 28 31
Snapshot from the learning spreadsheet:
Series Series
Correlation 0.75 Mean #1
1% #2
1%
Volatility 10% 10%
Correlated
N (0,1) N (0,1) Series Series
2.06 1.26 #1
$10.0 #2
$10.0
0.52 (0.73) $10.62 $9.37
1.51 0.99 $12.34 $10.39
(1.44) 0.48 $10.68 $11.00
If the variables are uncorrelated, randomization can be performed independently for each
variable. Generally, however, variables are correlated. To account for this correlation, we start
with a set of independent variables η, which are then transformed into the (). In a two-variable
setting, we construct the following:

1  1
 2  1  (1   2 )1/22
ρ = is the correlation coefficient between the variables ().
This is a transformation of two independent random variable into correlated random variables.
Prior to the transformation, 1 and 2 are random variables that have necessary correlation.
The first random variable is retained (1 = 1) and the second is transformed (recast) into a
random variable that is correlated
Describe deterministic simulations.
Quasi Monte Carlo (QMC) – a.k.a. deterministic simulation
Instead of drawing independent samples, the deterministic scheme systematically fills the space
left by previous numbers in the series.
 Advantage: Standard error shrinks at 1/K rather than 1/ k.

 Disadvantage: Since not independent accuracy cannot be easily determined.
Monte Carlo simulations methods generate independent, pseudorandom points that attempt to
“fill” an N-dimensional space, where N is the number of variables driving the price of securities.
Researchers now realize that the sequence of points does not have to be chosen randomly. In a
deterministic scheme, the draws (or trials) are not entirely random. Instead of random trials,
this scheme fills space left by previous numbers in the series.
Scenario Simulation
The first step consists of using principal-component analysis to reduce the dimensionality of the
problem; i.e., to use the handful of factors, among many, that are most important.
The second step consists of building scenarios for each of these factors, approximating a normal
distribution by a binomial distribution with a small number of states.

Describe the drawbacks and limitations of simulation procedures.
Simulation Methods are flexible
 Either postulate stochastic process or resample historical
 All full valuation on target date
However
 More prone to model risk: need to pre-specify the distribution
 Much slower and less transparent than analytical methods
 Sampling variation (more precision requires vastly greater number of replications)
The tradeoff is speed vs. accuracy
A key drawback of the Monte Carlo method is the computational requirements; a large number
of replications are typically required (e.g., thousands of trials are not unusual).
Simulations inevitably generate sampling variability, or variations in summary statistics due to

the limited number of replications. More replications lead to more precise estimates but take
longer to estimate.

Hull, Chapter 22:
Estimating Volatilities
and Correlations
In this chapter…
 Explain how historical data and various weighting schemes can be used in
estimating volatility.
 Describe the exponentially weighted moving average (EWMA) model for estimating
volatility and its properties, and estimate volatility using the EWMA model.
 Describe the generalized auto regressive conditional heteroscedasticity
[GARCH(p,q)] model for estimating volatility and its properties.
 Estimate volatility using the GARCH(p,q) model.
 Explain mean reversion and how it is captured in the GARCH(1,1) model.
 Explain how the parameters of the GARCH(1,1) and the EWMA models are estimated
using maximum likelihood methods.
 Explain how GARCH models perform in volatility forecasting.
 Describe how correlations and covariances are calculated, and explain the
consistency condition for covariances.
How to Estimate Volatility
Volatility is instantaneously unobservable. In general, our basic approaches either infer an

implied volatility (based on an observed market price) or estimate the current volatility based
on a historical series of returns. There are two broad steps to computing historical volatility:
1. Compute the series of periodic (e.g., daily) returns;

2. Choose a weighting scheme (to translate a series into a single metric)

1. Compute the series of periodic returns (e.g., 1 period = 1 day)
In many cases, we assume one period equals one day. In this case, we are estimating a daily
volatility. We can either compute the “continuously compounded daily return” or the “simple
percentage change.” If Si-1 is yesterday’s price and Si is today’s price,
Continuously compounded return The simple percentage return

(aka, log return): is given by:
 S  Si  Si 1
ui  ln  i  ui 
 Si 1  Si 1
Linda Allen (the next reading) contrasts three periodic returns: continuously
compounded, simple percentage change, and absolute change (she says we should only
use absolute changes in the case of interest rate-related variables). She argues that
continuously compounded returns should be used when computing VAR because these
returns are “time consistent.”
2. Choose a weighting scheme
The series can be either un-weighted (each return is equally weighted) or weighted. A weighted
scheme puts more weight on recent returns because they tend to be more relevant.
The “standard” un-weighted (or equally weighted) scheme
The un-weighted (which is really equally-weighted) variance is a “standard” historical variance.

In this case, the variance is given by:
1 m  n2  variance rate per day

 
2
n 
m  1 i 1
(un  i  u )2
m  most recent m observations
u  the mean/average of all daily returns (ui )
Please note this is the sample formula employed by Stock and Watson for the sample
variance. This is technically a correct variance.

However, in practice the dataset often consists of daily returns, a relatively large sample (e.g.,
one year = 250 trading days), and the mean is often near to zero. Given this, for practical
purposes and because the difference is typically insignificant, Hull makes two simplifying
assumptions:
 The average daily return of zero is assumed to be zero: ̅
 The denominator (m-1) is replaced with m.
This produces a simplified version of the standard (un-weighted) variance:
1 m 2
variance =    un i
2
n
m i 1
According to Hull: “These three changes make very little difference to the estimates that are
calculated, but they allow us to simplify the formula.” Hull’s third change is to switch from the
continuously compounded (log) return to the simple return, but we recommend that you keep
the log return to maintain consistency with the next (Linda Allen) reading.
In the “convenent” versoin, we replace (m-1) with (m) in the denominator. (m-1)
produces an “unbiased” estimator and (m) produces a “maximum likelihood” estimator.
Which is correct? Both are correct, the choice begs the issue of what properties of the
estimator we find more desirable. Estimators are like recipes intended to give estimates
of the true population variance. There can be different “recipes;” some will have more
desirable properties than others.
Explain how historical data and various weighting schemes can be used in
estimating volatility.
The weighted scheme (a better approach, generally)
The simple historical approach does not apply different weights to each return (put another
way, it gives equal weights to each return). But we generally prefer to apply greater weights to
more recent returns:
m
    i un2 i
2
n
i 1
The alpha () parameters are simply weights; the sum of the alpha () parameters must equal
one because they are weights.
The glaring flaw in a simple historical volatility (i.e., an un-weighted or equally-weighted

variance) is that the most distant return gets the same weight as yesterday’s return

We can now add another factor to the model: the long-run average variance rate. The idea here
is that the variance is “mean regressing:” think of it the variance as having a “gravitational pull”
toward its long-run average. We add another term to the equation above, in order to capture the
long-run average variance. The added term is the weighted long-run variance:
m
   VL    i un2i
2
n
i 1
The added term is gamma (the weighting) multiplied by () the long-run variance because
the variance is a weighted factor.
-
formatted ARCH (m) model:
m
      i un2i
2
n
i 1
This is the same ARCH(m) only the product of gamma and the long-run variance is
replaced by a single constant, omega (ω). Why does this matter? Because you may see
GARCH(1,1) represented with a single constant (i.e., the omega term), and you want to
realize the constant will not be the long-run variance itself; rather, the constant will be
the product of the long-run variance and a weight.
Summary: Un-weighted versus weighted
Un-Weighted Scheme
1 m 2
 n2   un  i
m i 1
Weighted Scheme alpha(i) weights must sum to one
m
 n2    i un2 i
i 1

Describe exponentially weighted moving average (EWMA) model for
estimating volatility and its properties, and estimate volatility using the
EWMA model.
In exponentially weighted moving average (EWMA), the weights decline (in constant proportion,
given by lambda). The exponentially weighted moving average (EWMA) is given by:
 n2  (1   ) 0un21  Ratio between any two

(1   ) 1 2
un  2  consecutive weights is
constant: lambda (λ)
(1   ) 2un23 
... infinite series
The infinite series above reduces to (i.e., is equivalent to) the recursive form of EWMA:
 n2   n21  (1   )un21
Infinite series elegantly
RiskMetricsTM is a just a branded version of EWMA: reduces to recursive EWMA
 n2  (0.94) n21  (0.06)un21
“The EWMA approach has the attractive feature that relatively little data need to be
stored. At any given time, only the current estimate of the variance rate and the most
recent observation on the value of the market variable need be remembered. When a
new observation on the market variable is obtained, a new daily percentage change is
calculated … to update the estimate of the variance rate. The old estimate of the
variance rate and the old value of the market variable can then be discarded.” –Hull
Describe the generalized auto regressive conditional heteroscedasticity

(GARCH(p,q)) model for estimating volatility and its properties. Estimate
volatility using the GARCH(p,q) model.
EWMA is a special case of GARCH(1,1) where gamma = 0 and (alpha + beta = 1). GARCH (1,1) is
the weighted sum of a long run-variance (weight = gamma), the most recent squared-return
(weight = alpha), and the most recent variance (weight = beta)
 n2   VL   un21   n21
This GARCH(1,1) is a case of the ARCH(m) above: the first term is constant omega (i.e.,
the weighted long-run variance) and the second and third terms are recursively giving
exponentially decreasing weights to the historical series of returns.

To summarize the key features of the GARCH(1,1):
The mean reversion term is the product of a weight (gamma) and

the long-run (unconditional) variance.
If gamma = 0, GARCH(1,1) “reduces” to EWMA
 n2   VL   un21   n21
An alpha weight applied to the most recent

return^2 (aka, “innovation”). Alpha is
analogous to (1- lambda) in EWMA
A beta weight applied to the most recent variance.

Beta is analogous to lambda in EWMA.
“The ‘(1,1)’ in GARCH(1,1) indicates that on is based on the most recent observation of
u^2 and the most recent estimate of the variance rate. The more general GARCH(p, q)
model calculates on from the most recent p observations on u2 and the most recent q
estimates of the variance rates. GARCH(1,1) is by far the most popular of the GARCH
models.” – Hull
Two worked examples (in two columns) are on the following page.

In the volatility practice bag (learning spreadsheet 2.b.6), we illustrate and compare the
calculation of EWMA to GARCH(1,1):
beta (b) or lambda 0.860 0.898 In both, most weight to lag variance
If EWMA: lambda only
1-lambda 0.140 0.102 In EWMA, only two weights
sum of weights 1.00 1.00
If GARCH (1,1): alpha, beta, & gamma
omega (w) 0.00000200 0.00000176 omega = gamma * long run variance
alpha (a) 0.130 0.063 Weight to lag return
alpha + beta (a+b) 0.9900 0.9602 “persistence” of GARCH
gamma 0.010 0.040 Weight to L.R. var = 1 – alpha – beta
sum of weights: 1.000 1.000
Long Term Variance 0.00020 0.00004 omega/(1-alpha-beta)
Long Term Volatility 1.4142% 0.6650% SQRT()
Updated Volatility Estimate

Assumptions
Last Volatility 1.60% 0.60%
Last Variance 0.000256 0.000036
Yesterday's price 10.00 10.00
Today's price 9.90 10.21
Last Return -1.0% 2.0%
EWMA
Updated Variance 0.000235 0.000074 *lag variance+(1-)*lag return^2
Updated Volatility 1.53% 0.86%
GARCH(1,1)
Updated Variance 0.000236 0.000060 GARCH (1,1) = omega + beta*lag var
Updated Volatility 1.53% 0.77% + alpha * lag return^2
GARCH (1,1) Forecast

Number of days (t) 10 10
Forecast Variance 0.00023218 0.00005463 L.R. var + (alpha+beta)^t*(var-L.R. var)
Forecast Volatility 1.524% 0.739%

Explain mean reversion and how it is captured in the GARCH(1,1) model.
If we are given omega and two of the weights (alpha and beta), we can use our understanding of
GARCH(1,1) ….
 n2     un21   n21   VL   un21   n21

… to solve for the long-run average variance as a function of omega and the weights:
VL  
1   
Explain how the parameters of the GARCH(1,1) and the EWMA models are
estimated using maximum likelihood methods.
In maximum likelihood methods we choose parameters that maximize the likelihood of the
observations occurring.
Max Likelihood Estimation (MLE) for GARCH(1,1)

Avg Return 0.001 mu 0.0006
Std Dev (Returns) 0.006 Omega 0.0000
Alpha 0.0001
Beta 0.8221
mu * 1000 0.646
alpha 0.000
persistence 0.822
variance*10000 0.363
Log likelihood value: 110.94
“It is now appropriate to discuss how the parameters in the models we have been
considering are estimated from historical data. The approach used is known as the
maximum likelihood method. It involves choosing values for the parameters that
maximize the chance (or likelihood) of the data occurring. To illustrate the method, we
start with a very simple example. Suppose that we sample 10 stocks at random on a
certain day and find that the price of one of them declined on that day and the prices of
the other nine either remained the same or increased. What is the best estimate of the
probability of a price decline? The natural answer is 0.1.” –Hull

Explain how GARCH models perform in volatility forecasting.
The forecasted volatility forward (k) days is given by:
E[ n2 k ]  VL  (   )k ( n2  VL )
The expected future variance rate, in (t) periods forward, is given by:
E[ n2t ]  VL  (   )t ( n2  VL )
For example, assume that a current volatility estimate (period n) is given by the following
GARCH (1, 1) equation:
 n2  0.00008  (  0.1)(4%)n2 1  (   0.7)(.0016)

In this example, alpha is the weight (0.1) assigned to the previous squared return (the previous
return was 4%), beta is the weight (0.7) assigned to the previous variance (0.0016).
What is the expected future volatility, in ten days (n + 10)?
First, solve for the long-run variance. It is not 0.00008; this term is the product of the variance
and its weight. Since the weight must be 0.2 (= 1 - 0.1 -0.7), the long run variance = 0.0004.
 0.00008
VL    0.0004
1     1  0.7  0.1
Second, we need the current variance (period n). That is almost given to us above:
 n2  0.00008  0.00016  0.00112  0.00136

Now we can apply the formula to solve for the expected future variance rate:
E [ n2 t ]  (0.0004)  (0.1  0.7)10 (0.00136  0.0004)

 0.0005031
This is the expected variance rate, so the expected volatility is approximately 2.24%. Notice how
this works: the current volatility is about 3.69% and the long-run volatility is 2%. The 10-day
forward projection “fades” the current rate nearer to the long-run rate.

Describe how correlations and covariances are calculated, and explain
the consistency condition for covariances.
Correlations play a key role in the calculation of value at risk (VaR). We can use similar methods
to EWMA for volatility. In this case, an updated covariance estimate is a weighted sum of
 The recent covariance; weighted by lambda
 The recent cross-product; weighted by (1-lambda)
cov n   cov n 1 (1   )xn 1y n 1

Allen, Boudoukh, and
Saunders, Chapter 2:
Quantifying Volatility in
VaR Models
In this chapter…
 Discuss how asset return distributions tend to deviate from the normal distribution.
 Explain potential reasons for the existence of fat tails in a return distribution and
discuss the implications fat tails have on analysis of return distributions.
 Distinguish between conditional and unconditional distributions.
 Discuss the implications regime switching has on quantifying volatility.
 Explain the various approaches for estimating VaR.
 Compare, contrast and calculate parametric and non-parametric approaches for
estimating conditional volatility, including:
 Historical standard deviation
 Exponential smoothing
 GARCH approach
 Historic simulation
 Multivariate density estimation
 Hybrid methods
 Explain the process of return aggregation in the context of volatility forecasting
methods.
 Discuss implied volatility as a predictor of future volatility and its shortcomings.
 Explain long horizon volatility/VaR and the process of mean reversion according to
an AR(1) model.

Key terms
Risk varies over time. Models often assume a normal (Gaussian) distribution (“normality”) with
constant volatility from period to period. But actual returns are non-normal and volatility varies
over time (volatility is “time-varying” or “non-constant”). Therefore, it is hard to use parametric
approaches to random returns; in technical terms, it is hard to find robust “distributional
assumptions for stochastic asset returns”
 Conditional parameter (e.g., conditional volatility): a parameter such as variance that

depends on (is conditional on) circumstances or prior information. A conditional
parameter, by definition, changes over time.
 Persistence: In EWMA, the lambda parameter (λ). In GARCH (1,1), the sum of the alpha
(α) and beta () parameters. High persistence implies slow decay toward to the long-run
average variance.
 Autoregressive: Recursive. A parameter (today’s variance) is a function of itself

(yesterday’s variance).
 Heteroskedastic: Variance changes over time (homoskedastic = constant variance).
 Leptokurtosis: a fat-tailed distribution where relatively more observations are near the
middle and in the “fat tails (kurtosis > 3)
Stochastic behavior of returns
Risk measurement (VaR) concerns the tail of a distribution, where losses occur. We want to
impose a mathematical curve (a “distributional assumption”) on asset returns so we can
estimate losses. The parametric approach uses parameters (i.e., a formula with parameters) to
make a distributional assumption but actual returns rarely conform to the distribution curve. A
parametric distribution plots a curve (e.g., the normal bell-shaped curve) that approximates a
range of outcomes but actual returns are not so well-behaved: they rarely “cooperate.”

Value at Risk (VaR) – 2 asset, relative vs. absolute
Know how to compute two-asset portfolio variance & scale portfolio volatility to derive VaR:
Inputs (per annum(

Trading days /year 252
Initial portfolio value (W) $100
VaR Time horizon (days) (h) 10
VaR confidence interval 95%
Asset A
Volatility (per year) 10.0%
Expected Return (per year) 12.0%
Portfolio Weight (w) 50%
Asset B
Volatility 20.0%
Expected Return (per year) 25.0%
Portfolio Weight (1-w) 50%
Correlation (A,B)
Autocorrelation (h-1, h) 0.30
0.25 Independent, = 0. Mean reverting = negative
Outputs
Annual
Covariance (A,B) 0.0060 COV = (correlation A,B)(volatility A)(volatility B)
Portfolio variance 0.0060
0.0155
Exp Portfolio return 0.0155
18.5%
Portfolio volatility (per year) 12.4%
Period (h days)
Exp periodic return (u) 0.73%
Std deviation (h), i.i.d 2.48%
Scaling factor 15.78 Don’t need to know this, used for AR(1)
Std deviation (h), 3.12% Standard deviation if auto-correlation.
Autocorrelation
Normal deviate (critical z 1.64 Normal deviate
value) future value
Expected 100.73
Relative VaR, i.i.d 100.73
$4.08 Doesn’t include the mean return
Absolute VaR, i.i.d $3.35 Includes return; i.e., loss from zero
Relative VaR, AR(1) $5.12 The corresponding VaRs, if autocorrelation
Absolute VaR, AR(1) $4.39 incorporated. Note VaR is higher!
 Relative VaR, iid = $100 value * 2.48% 10-day sigma * 1.645 normal deviate
 Absolute VaR, iid = $100 * (-0.73% + 2.48% * 1.645)
 Relative VaR, AR(1) = $100 value * 3.12% 10-day AR sigma * 1.645 normal deviate
 Absolute VaR, AR(1) = $100 * (-0.73% + 3.12% * 1.645)

Discuss how asset return distributions tend to deviate from the normal
distribution.
Compared to a normal (bell-shaped) distribution, actual asset returns tend to be:
 Fat-tailed (a.k.a., heavy tailed): A fat-tailed distribution is characterized by having

more probability weight (observations) in its tails relative to the normal distribution.
 Skewed: A skewed distribution refers—in this context of financial returns—to the

observation that declines in asset prices are more severe than increases. This is in
contrast to the symmetry that is built into the normal distribution.
 Unstable: the parameters (e.g., mean, volatility) vary over time due to variability in
market conditions.
NORMAL RETURNS ACTUAL FINANCIAL RETURNS

Symmetrical Skewed
“Normal” Tails Fat-tailed (leptokurtosis)
Stable Unstable (time-varying)
Interest rate distributions are not constant over time
10 years of interest rate data are collected (1982 – 1993). The distribution plots the daily change
in the three-month treasury rate. The average change is approximately zero, but the “probability
mass” is greater at both tails. It is also greater at the mean; i.e., the actual mean occurs more
frequently than predicted by the normal distribution.
Actual returns:
4.5% 1. Skewed
4.0% 2. Fat-tailed
3rd Moment = (kurtosis>3)
3.5%
Skew • 3 3. Unstable
3.0%
2.5%
2.0%
1.5% 4th Moment =
1.0%
2nd Variance kurtosis • 4
“scale”
0.5%
0.0% 1st moment
-3 -2 -1 Mean
0 1 2 3
“location”

Explain potential reasons for the existence of fat tails in a return
distribution and discuss the implications fat tails have on analysis of
return distributions.
A distribution is unconditional if tomorrow’s distribution is the same as today’s distribution. But

fat tails could be explained by a conditional distribution: a distribution that changes over time.
Two things can change in a normal distribution: mean and volatility. Therefore, we can explain
fat tails in two ways:
 Conditional mean is time-varying; but this is unlikely given the assumption that
markets are efficient
 Conditional volatility is time-varying; Allen says this is the more likely explanation!
Normal distribution says:

-10% @ 95th %ile
If fat tails, expected VaR loss is

understated!
Explain how outliers can really be indications that the volatility varies with time.
We observe that actual financial returns tend to exhibit fat-tails. Jorion (like Allen et al) offers
two possible explanations:
 The true distribution is stationary. Therefore, fat-tails reflect the true distribution but the
normal distribution is not appropriate
 The true distribution changes over time (it is “time-varying”). In this case, outliers can in
reality reflect a time-varying volatility.

Distinguish between conditional and unconditional distributions.
An unconditional distribution is the same regardless of market or economic conditions; for
this reason, it is likely to be unrealistic.
A conditional distribution in not always the same: it is different, or conditional on, some
economic or market or other state. It is measured by parameters such as its conditional mean,
conditional standard deviation (conditional volatility), conditional skew, and conditional
kurtosis.
Discuss the implications regime switching has on quantifying volatility.
A typical distribution is a regime-switching volatility model: the regime (state) switches from
low to high volatility, but is never in between. A distribution is “regime-switching” if it changes
from high to low volatility.
The problem: a risk manager may assume (and measure) an unconditional volatility but the
distribution is actually regime switching. In this case, the distribution is conditional (i.e., it
depends on conditions) and might be normal but regime-switching; e.g., volatility is 10% during a
low-volatility regime and 20% during a high-volatility regime but during both regimes, the
distribution may be normal. However, the risk manager may incorrectly assume a single 15%
unconditional volatility. But in this case, the unconditional volatility is likely to exhibit fat
tails because it does not account for the regime switching.

Explain the various approaches for estimating VaR.
Volatility versus Value at Risk (VaR)
Volatility is an input into our (parametric) value at risk (VaR):
VaR$  W$z
VaR%  z
Linda Allen’s Historical-based approaches
The common attribute to all the approaches within this class is their use of historical time series
data in order to determine the shape of the conditional distribution.
 Parametric approach. The parametric approach imposes a specific distributional

assumption on conditional asset returns. A representative member of this class of models
is the conditional (log) normal case with time-varying volatility, where volatility is
estimated from recent past data.
 Nonparametric approach. This approach uses historical data directly, without

imposing a specific set of distributional assumptions. Historical simulation is the
simplest and most prominent representative of this class of models.

Implied volatility based approach.
This approach uses derivative pricing models and current derivative prices in order to impute
an implied volatility without having to resort to historical data. The use of implied volatility
obtained from the Black–Scholes option pricing model as a predictor of future volatility is the
most prominent representative of this class of models.
Jorion’s Value at Risk (VaR) typology
Please note that Jorion’s taxonomy approaches from the perspective of local versus full
valuation. In that approach, local valuation tends to associate with parametric approaches:
Risk Measurement
Local valuation Full valuation
Historical
Linear models Nonlinear models
Simulation
Full Covariance Monte Carlo
Gamma
matrix Simulation
Factor Models Convexity
Diagonal Models

Value at Risk (VaR)
Parametric
 Delta normal
Non parametric
 Historical
Simulation
 Bootstrap
 Monte Carlo
Hybrid (semi-p)
 HS + EWMA
EVT
 POT (GPD)
 Block
maxima
(GEV)

Volatility
1. Implied Volatility
2. Equally weighted
returns or un-
weighted (STDEV)
3. More weight to
recent returns
 GARCH(1,1)
 EWMA
4. MDE (more weight to

similar states!)

Historical approaches
An historical-based approach can be non-parametric, parametric or hybrid (both). Non-

parametric directly uses a historical dataset (historical simulation, HS, is the most common).
Parametric imposes a specific distributional assumption (this includes historical standard
deviation and exponential smoothing)
Compare, contrast and calculate parametric and non-parametric

approaches for estimating conditional volatility, including: HISTORICAL
STANDARD DEVIATION
Historical standard deviation is the simplest and most common way to estimate or predict
future volatility. Given a history of an asset’s continuously compounded rate of returns we take
a specific window of the K most recent returns.
This standard deviation is called a moving average (MA) by Jorion. The estimate requires a
window of fixed length; e.g., 30 or 60 trading days. If we observe returns (rt) over M days, the
volatility estimate is constructed from a moving average (MA):
M
  (1/ M ) rt2i
2
t
i 1
Each day, the forecast is updated by adding the most recent day and dropping the furthest day.
In a simple moving average, all weights on past returns are equal and set to (1/M). Note raw
returns are used instead of returns around the mean (i.e., the expected mean is assumed zero).
This is common in short time intervals, where it makes little difference on the volatility estimate.
For example, assume the previous four daily returns for a stock are 6% (n-1), 5% (m-2), 4% (n-
3) and 3% (n-4). What is a current volatility estimate, applying the moving average, given that
our short trailing window is only four days (m=14)? If we square each return, the series is
0.0036, 0.0025, 0.0016 and 0.0009. If we sum this series of squared returns, we get 0.0086.
Divide by 4 (since m=4) and we get 0.00215. That’s the moving average variance, such that the
moving average volatility is about 4.64%.
The above example illustrates a key weakness of the moving average (MA): since all
returns weigh equally, the trend does not matter. In the example above, notice that
volatilty is trending down, but MA does not reflect in any way this trend. We could
reverse the order of the historical series and the MA estimation would produce the same
result.

The moving average (MA) series is simple but has two drawbacks
 The MA series ignores the order of the observations. Older observations may no
longer be relevant, but they receive the same weight.
 The MA series has a so-called ghosting feature: data points are dropped arbitrarily due
to length of the window.

approaches for estimating conditional volatility, including: GARCH
APPROACH, EXPONENTIAL SMOOTHING (EWMA), and Exponential
smoothing (conditional parametric)
Modern methods place more weight on recent information. Both EWMA and GARCH place
more weight on recent information. Further, as EWMA is a special case of GARCH, both EWMA
and GARCH employ exponential smoothing.
GARCH (p, q) and in particular GARCH (1, 1)
GARCH (p, q) is a general autoregressive conditional heteroskedastic model:
 Autoregressive (AR): tomorrow’s variance (or volatility) is a regressed function of

today’s variance—it regresses on itself
 Conditional (C): tomorrow’s variance depends—is conditional on—the most recent
variance. An unconditional variance would not depend on today’s variance
 Heteroskedastic (H): variances are not constant, they flux over time
GARCH regresses on “lagged” or historical terms. The lagged terms are either variance or
squared returns. The generic GARCH (p, q) model regresses on (p) squared returns and (q)
variances. Therefore, GARCH (1, 1) “lags” or regresses on last period’s squared return (i.e., just 1
return) and last period’s variance (i.e., just 1 variance).
GARCH (1, 1) given by the following equation.
ht  0  1rt21   ht 1 ht or  t2  conditional variance (i.e., we're solving for it)

a or   weighted long-run (average) variance
ht 1 or  2
t-1  previous variance
2
rt-1 or 2
rt-1,t  previous squared return

Persistence is a feature embedded in the GARCH model.
In the above formulas, persistence is = (b + c) or (alpha-1+ beta). Persistence refers to

how quickly (or slowly) the variance reverts or “decays” toward its long-run average.
High persistence equates to slow decay and slow “regression toward the mean;” low
persistence equates to rapid decay and quick “reversion to the mean.”
A persistence of 1.0 implies no mean reversion. A persistence of less than 1.0 implies “reversion
to the mean,” where a lower persistence implies greater reversion to the mean.
As above, the sum of the weights assigned to the lagged variance and lagged squared
return is persistence (b+c = persistence). A high persistence (greater than zero but less
than one) implies slow reversion to the mean.
But if the weights assigned to the lagged variance and lagged squared return are greater
than one, the model is non-stationary. If (b+c) is greater than 1 (if b+c > 1) the model is
non-stationary and, according to Hull, unstable. In which case, EWMA is preferred.
Linda Allen says about GARCH (1, 1):
 GARCH is both “compact” (i.e., relatively simple) and remarkably accurate. GARCH
models predominate in scholarly research. Many variations of the GARCH model have
been attempted, but few have improved on the original.
 The drawback of the GARCH model is its nonlinearity.
For example: Solve for long-run variance in GARCH (1,1)
Consider the GARCH (1, 1) equation below:
 n2  0.2   un21   n21

Assume that:
 the alpha parameter = 0.2,
 the beta parameter = 0.7, and
Note that omega is 0.2 but don’t mistake omega (0.2) for the long-run variance! Omega is the
product of gamma and the long-run variance. So, if alpha + beta = 0.9, then gamma must be
0.1. Given that omega is 0.2, we know that the long-run variance must be 2.0 (0.2  0.1 = 2.0).

EWMA
EWMA is a special case of GARCH (1,1). Here is how we get from GARCH (1,1) to EWMA:
GARCH(1,1)   t2  a  brt21,t  c t21
Then we let a = 0 and (b + c) =1, such that the above equation simplifies to:
GARCH(1,1) =  t2  brt21,t  (1  b) t21
This is now equivalent to the formula for exponentially weighted moving average (EWMA):
EWMA   t2  brt21,t  (1  b) t21

 t2   t21  (1   )rt21,t
In EWMA, the lambda parameter now determines the “decay:” a lambda that is close to one
(high lambda) exhibits slow decay.
RiskMetricsTM Approach
RiskMetrics is a branded form of the exponentially weighted moving average (EWMA) approach.
The optimal (theoretical) lambda varies by asset class, but the overall optimal parameter used
by RiskMetrics has been 0.94. In practice, RiskMetrics only uses one decay factor for all series:
 0.94 for daily data

 0.97 for monthly data (month defined as 25 trading days)
Technically, the daily and monthly models are inconsistent. However, they are both easy to use,
they approximate the behavior of actual data quite well, and they are robust to misspecification.
Each of GARCH (1, 1), EWMA and RiskMetrics are each parametric and recursive.
Advantages and Disadvantages of MA (i.e., STDEV) vs. GARCH
GARCH estimations can provide estimations that are more accurate than MA
Jorion’s Moving average (MA)

= Allen’s STDEV GARCH
Ghosting feature More recent data assigned greater weights

Trend information is not incorporated A term added to incorporate mean reversion
Except Linda Allen warns: GARCH (1,1) needs more parameters and may pose greater
MODEL RISK (“chases a moving target”) when forecasting out-of-sample

Graphical summary of the parametric methods that assign more weight to recent
returns (GARCH & EWMA)

Summary Tips:
GARCH (1, 1) is generalized RiskMetrics; and, conversely, RiskMetrics is restricted case of

GARCH (1,1) where a = 0 and (b + c) =1. GARCH (1, 1) is given by:
 n2   VL   un21   n21
The three parameters are weights and therefore must sum to one:
    1
Be careful about the first term in the GARCH (1, 1) equation: omega (ω) = gamma(λ) *
(average long-run variance). If you are asked for the variance, you may need to divide
out the weight in order to compute the average variance.
Determine when and whether a GARCH or EWMA model should be used in volatility
estimation
In practice, variance rates tend to be mean reverting; therefore, the GARCH (1, 1) model is
theoretically superior (“more appealing than”) to the EWMA model. Remember, that’s the big
difference: GARCH adds the parameter that weights the long-run average and therefore it
incorporates mean reversion.
GARCH (1, 1) is preferred unless the first parameter is negative (which is implied if alpha
+ beta > 1). In this case, GARCH (1,1) is unstable and EWMA is preferred.
Explain how the GARCH estimations can provide forecasts that are more accurate.
The moving average computes variance based on a trailing window of observations; e.g., the
previous ten days, the previous 100 days.
There are two problems with moving average (MA):
 Ghosting feature: volatility shocks (sudden increases) are abruptly incorporated into the
MA metric and then, when the trailing window passes, they are abruptly dropped from
the calculation. Due to this the MA metric will shift in relation to the chosen window
length
 Trend information is not incorporated
GARCH estimates improve on these weaknesses in two ways:
 More recent observations are assigned greater weights. This overcomes ghosting
because a volatility shock will immediately impact the estimate but its influence will fade
gradually as time passes
 A term is added to incorporate reversion to the mean

Explain how persistence is related to the reversion to the mean.
Given the GARCH (1, 1) equation: ht  0  1rt21   ht 1
Persistence  1  
GARCH (1, 1) is unstable if the persistence > 1. A persistence of 1.0 indicates no mean reversion.
A low persistence (e.g., 0.6) indicates rapid decay and high reversion to the mean.
GARCH (1, 1) has three weights assigned to three factors. Persistence is the sum of the
weights assigned to both the lagged variance and lagged squared return. The other
weight is assigned to the long-run variance.
If P = persistence and G = weight assigned to long-run variance, then P+G = 1.
Therefore, if P (persistence) is high, then G (mean reversion) is low: the persistent series
is not strongly mean reverting; it exhibits “slow decay” toward the mean.
If P is low, then G must be high: the impersistent series does strongly mean revert; it
exhibits “rapid decay” toward the mean.
The average, unconditional variance in the GARCH (1, 1) model is given by:
0
LV 
1  1  

approaches for estimating conditional volatility, including: HISTORIC
SIMULATION
Historical simulation is easy: we only need to determine the “lookback window.” The problem is
that, for small samples, the extreme percentiles (e.g., the worst one percent) are less precise.
Historical simulation effectively throws out useful information.
“The most prominent and easiest to implement methodology within the class of
nonparametric methods is historical simulation (HS). HS uses the data directly. The only
thing we need to determine up front is the lookback window. Once the window length is
determined, we order returns in descending order, and go directly to the tail of this
ordered vector. For an estimation window of 100 observations, for example, the fifth
lowest return in a rolling window of the most recent 100 returns is the fifth percentile.
The lowest observation is the first percentile. If we wanted, instead, to use a 250
observations window, the fifth percentile would be somewhere between the 12th and the
13th lowest observations (a detailed discussion follows), and the first percentile would be
somewhere between the second and third lowest returns.” –Linda Allen
Compare and contrast the use of historic simulation, multivariate density

estimation, and hybrid methods for volatility forecasting.
Nonparametric Volatility Forecasting
Historic Simulation Hybrid

MDE
(HS) (HS & EWMA)
• Sort returns • Like ARCH(m) • Sort returns (like HS)

• Lookup worst • But weights based on • But weight them,
• If n=100, for 95th function of [current greater weight to
percentile look vs. historical state] recent (like EWMA)
between bottom 5th • If state (n-50)  state
& 6th (today), heavy
weight to that
return2

Advantages Disadvantages
Historical Easiest to implement (simple, Uses data inefficiently
Simulation convenient) (much data is not used)
Multivariate Onerous model: weighting
Very flexible: weights are
density scheme; conditioning
function of state (e.g., economic
estimation variables; number of
context such as interest rates)
observations
not constant
Data intensive
Hybrid
approach Unlike the HS approach, better Requires model
incorporates more recent assumptions; e.g., number
information of observations

approaches for estimating conditional volatility, including: MULTIVARIATE
DENSITY ESTIMATION
Multivariate Density Estimation (MDE)
The key feature of multivariate density estimation is that the weights (assigned to historical
square returns) are not a constant function of time. Rather, the current state—as
parameterized by a state vector—is compared to the historical state: the more similar the states
(current versus historical period), the greater the assigned weight. The relative weighting is
determined by the kernel function:
K
   ( t  i )ut2i
2
t
i 1
Kernel function Vector describing economic state at

time t-i
Instead of weighting returns^2 by time,

Weighting by proximity to current state

Compare EWMA to MDE:
 Both assign weights to historical squared returns (squared returns = variance

approximation);
 Where EWMA assigns the weight as an exponentially declining function of time (i.e., the
nearer to today, the greater the weight), MDE assigns the weight based on the nature of
the historical period (i.e., the more similar to the historical state, the greater the weight)

approaches for estimating conditional volatility, including: HYBRID
METHODS
The hybrid approach is a variation on historical simulation (HS). Consider the ten (10)
illustrative returns below. In simple HS, the return are sorted from best-to-worst (or worst-to-
best) and the quantile determines the VaR. Simple HS amounts to giving equal weight to each
returns (last column). Given 10 returns, the worst return (-31.8%) earns a 10% weight under
simple HS.
Cum'l
Sorted Periods Hybrid Hybrid Compare
Return Ago Weight Weight to HS
-31.8% 7 8.16% 8.16% 10%
-28.8% 9 6.61% 14.77% 20%
-25.5% 6 9.07% 23.83% 30%
-22.3% 10 5.95% 29.78% 40%
5.7% 1 15.35% 45.14% 50%
6.1% 2 13.82% 58.95% 60%
6.5% 3 12.44% 71.39% 70%
6.9% 4 11.19% 82.58% 80%
12.1% 5 10.07% 92.66% 90%
60.6% 8 7.34% 100.00% 100%
However, under the hybrid approach, the EWMA weighting scheme is instead applied. Since the
worst return happened seven (7) periods ago, the weight applied is given by the following,
assuming a lambda of 0.9 (90%):
Weight (7 periods prior) = 90%^(7-1)*(1-90%)/(1-90%^10) = 8.16%

Note that because the return happened further in the past, the weight is below the 10% that is
assigned under simple HS.
120%
100%
Hybrid 80%
Weights 60%
HS Weights 40%
20%
0%
1 2 3 4 5 6 7 8 9 10
Hybrid methods using Google stock’s prices and returns:

Number
Google (GOOG) Period of days Cumulative Weight
Date Close Return Sorted ago HS Hybrid
6/24/2009 409.29 0.89% 1 -5.90% 76 1.0% 0.2% 0.2%
6/23/2009 405.68 -0.41% 2 -5.50% 94 2.0% 0.1% 0.3%
6/22/2009 407.35 -3.08% 3 -4.85% 86 3.0% 0.1% 0.4%
6/19/2009 420.09 1.45% 4 -4.29% 90 4.0% 0.1% 0.5%
6/18/2009 414.06 -0.27% 5 -4.25% 78 5.0% 0.2% 0.7%
6/17/2009 415.16 -0.20% 6 -3.35% 47 6.0% 0.6% 1.3%
6/16/2009 416 -0.18% 7 -3.26% 81 7.0% 0.2% 1.4%
6/15/2009 416.77 -1.92% 8 -3.08% 3 8.0% 3.7% 5.1%
6/12/2009 424.84 -0.97% 9 -3.01% 88 9.0% 0.1% 5.2%
6/11/2009 429 -0.84% 10 -2.64% 55 10.0% 0.4% 5.7%
In this case:
 Sample includes 100 returns (n=100)
 We are solving for the 95th percentile (95%) value at risk (VaR)
 For the hybrid approach, lambda = 0.96
 Sorted returns are shown in the purple column
 The HS 95% VaR = ~ 4.25% because it is the fifth-worst return (actually, the quantile can
be determined in more than one way)
 However, the hybrid approach returns a 95% VaR of 3.08% because the “worst returns”
that inform the dataset tend to be further in the past (i.e., days ago = 76, 94, 86, 90…).
Due to this, the individual weights are generally less than 1%.

Explain the process of return aggregation in the context of volatility
forecasting methods.
The question is: how do we compute VAR for a portfolio which consists of several positions.
The first approach is the variance-covariance approach: if we make (parametric) assumptions

about the covariances between each position, then we extend the parametric approach to the
entire portfolio. The problem with this approach is that correlations tend to increase (or change)
during stressful market events; portfolio VAR may underestimate VAR in such circumstances.
The second approach is to extend the historical simulation (HS) approach to the portfolio:
apply today’s weights to yesterday’s returns. In other words, “what would have happened if we
held this portfolio in the past?”
The third approach is to combine these two approaches: aggregate the simulated returns and
then apply a parametric (normal) distributional assumption to the aggregated portfolio.
The first approach (variance-covariance) requires the dubious assumption of normality—for the
positions “inside” the portfolio. The text says the third approach is gaining in popularity and is
justified by the law of large numbers: even if the components (positions) in the portfolio are not
normally distributed, the aggregated portfolio will converge toward normality.

Explain how implied volatility can be used to predict future volatility
To impute volatility is to derivate volatility (to reverse-engineer it, really) from the observed
market price of the asset. A typical example uses the Black-Scholes option pricing model to
compute the implied volatility of a stock option; i.e., option traders will average at-the-money
implied volatility from traded puts and calls.
The advantages of implied The shortcomings (or disadvantages) of implied volatility

volatility are: include:
 Truly predictive (reflects  Model-dependent

market’s forward-
looking consensus)  Options on the same underlying asset may trade at
different implied volatilities; e.g., volatility
 Does not require, nor is smile/smirk
restrained by, historical
distribution patterns  Stochastic volatility; i.e., the model assumes constant
volatility, but volatility tends to change over time
 Limited availability because it requires traded (set by

market) price
Explain how to use option prices to derive forecasts of volatilities
This requires that a market mechanism (e.g., an exchange) can provide a market price for the
option. If a market price can be observed, then instead of solving for the price of an option, we
use an option pricing model (OPM) to reveal the implied (implicit) volatility. We solve (“goal
seek”) for the volatility that produces a model price equal to the market price:
cmarket  f ( ISD )
Where the implied standard deviation (ISD) is the volatility input into an option pricing model
(OPM). Similarly, implied correlations can also be “recovered” (reverse-engineered) from
options on multiple assets. According to Jorion, ISD is a superior approach to volatility
estimation. He says, “Whenever possible, VAR should use implied parameters” [i.e., ISD or
market implied volatility].

Discuss implied volatility as a predictor of future volatility and its
shortcomings.
Many risk managers describe the application of historical volatility as similar to “driving by
looking in the rear-view mirror.” Another flaw is the assumption of stationarity; i.e., the
assumption that the past is indicative of the future.
Implied volatility, “an intriguing alternative,” can be imputed from derivative prices using a
specific derivative pricing model. The simplest example is the Black–Scholes implied volatility
imputed from equity option prices.
 In the presence of multiple implied volatilities for various option maturities and exercise
prices, it is common to take the at-the-money (ATM) implied volatility from puts and
calls and extrapolate an average implied; this implied is derived from the most liquid
(ATM) options
The advantage of implied volatility is that it is a forward-looking, predictive measure.
“A particularly strong example of the advantage obtained by using implied volatility (in
contrast to historical volatility) as a predictor of future volatility is the GBP currency
crisis of 1992. During the summer of 1992, the GBP came under pressure as a result of the
expectation that it should be devalued relative to the European Currency Unit (ECU)
components, the deutschmark (DM) in particular (at the time the strongest currency
within the ECU). During the weeks preceding the final drama of the GBP devaluation,
many signals were present in the public domain … This was the case many times prior to
this event, especially with the Italian lira’s many devaluations. Therefore, the market
was prepared for a crisis in the GBP during the summer of 1992. Observing the thick
solid line depicting option-implied volatility, the growing pressure on the GBP
manifests itself in options prices and volatilities. Historical volatility is trailing,
“unaware” of the pressure. In this case, the situation is particularly problematic since
historical volatility happens to decline as implied volatility rises. The fall in historical
volatility is dueto the fact that movements close to the intervention band are bound to
be smaller by the fact of the intervention bands’ existence and the nature of
intervention, thereby dampening the historical measure of volatility just at the time that
a more predictive measure shows increases in volatility.” – Linda Allen
Is implied volatility a superior predictor of future volatility?
“It would seem as if the answer must be affirmative, since implied volatility can react
immediately to market conditions. As a predictor of future volatility this is certainly an
important feature.”

Why does implied volatility tend to be greater than historical volatility?
According to Linda Allen, “empirical results indicate, strongly and consistently, that implied
volatility is, on average, greater than realized volatility.” There are two common explanations.
 Market inefficiency due to supply and demand forces.
 Rational markets: implied volatility is greater than realized volatility due to stochastic
volatility. “Consider the following facts: (i) volatility is stochastic; (ii) volatility is a priced
source of risk; and (iii) the underlying model (e.g., the Black–Scholes model) is, hence,
misspecified, assuming constant volatility. The result is that the premium required by the
market for stochastic volatility will manifest itself in the forms we saw above – implied
volatility would be, on average, greater than realized volatility.”
But implied volatility has shortcomings.
 Implied volatility is model-dependent. A mis-specified model can result in an

erroneous forecast.
“Consider the Black–Scholes option-pricing model. This model hinges on a few

assumptions, one of which is that the underlying asset follows a continuous time
lognormal diffusion process. The underlying assumption is that the volatility parameter is
constant from the present time to the maturity of the contract. The implied volatility is
supposedly this parameter. In reality, volatility is not constant over the life of the options
contract. Implied volatility varies through time. Oddly, traders trade options in “vol”
terms, the volatility of the underlying, fully aware that (i) this vol is implied from a
constant volatility model, and (ii) that this very same option will trade tomorrow at a
different vol, which will also be assumed to be constant over the remaining life of the
contract.” –Linda Allen
 At any given point in time, options on the same underlying may trade at different
vols. An example is the [volatility] smile effect – deep out of the money (especially) and
deep in the money (to a lesser extent) options trade at a higher volatility than at the
money options.
Explain long horizon volatility/VaR and the process of mean reversion

according to an AR(1) model.
Explain the implications of mean reversion in returns and return volatility
The key idea refers to the application of the square root rule (S.R.R. says that variance scales
directly with time such that the volatility scales directly with the square root of time). The
square root rule, while mathematically convenient, doesn’t really work in practice because it
requires that normally distributed returns are independent and identically distributed (i.i.d.).

What I mean is, we use it on the exam, but in practice, when applying the square root rule to
scaling delta normal VaR/volatility, we should be sensitive to the likely error introduced.
Allen gives two scenarios that each illustrate “violations” in the use of the square root rule to
scale volatility over time:
If mean reversion… Then square root rule

In returns Overstates long run volatility
If current vol. > long run volatility, overstates
In return volatility
If current vol. < long run volatility, understates
For FRM purposes, three definitions of mean reversion are used:
 Mean reversion in the asset dynamics. The price/return tends towards a long-run
level; e.g., interest rate reverts to 5%, equity log return reverts to +8%
 Mean reversion in variance. Variance reverts toward a long-run level; e.g., volatility
reverts to a long-run average of 20%. We can also refer to this as negative
autocorrelation, but it's a little trickier. Negative autocorrelation refers to the fact that a
high variance is likely to be followed in time by a low variance. The reason it's tricky is
due to short/long timeframes: the current volatility may be high relative to the long run
mean, but it may be "sticky" or cluster in the short-term (positive autocorrelation) yet, in
the longer term it may revert to the long run mean. So, there can be a mix of (short-term)
positive and negative autocorrelation on the way being pulled toward the long run mean.
 Autoregression in the time series. The current estimate (variance) is informed by (a

function of) the previous value; e.g., in GARCH(1,1) and exponentially weighted moving
average (EWMA), the variance is a function of the previous variance.

Square root rule
The simplest approach to extending the horizon is to use the “square root rule”
 (rt ,t  J )   (rt ,t 1)  J J-period VAR = J  1-period VAR

For example, if the 1-period VAR is $10, then the 2-period VAR is $14.14 ($10 x square root of 2)
and the 5-period VAR is $22.36 ($10 x square root of 5).
The square-root-rule: under the two assumptions below, VaR scales with the square root
of time. Extend one-period VaR to J-period VAR by multiplying by the square root of J.
The square root rule (i.e., variance is linear with time) only applies under restrictive i.i.d.
The square-root rule for extending the time horizon requires two key assumptions:
 Random-walk (acceptable)
 Constant volatility (unlikely)

FRM P1.Quantitative-Analysis

Uploaded by

Copyright:

Available Formats

You might also like

FRM P1.Quantitative-Analysis

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

FRM P1.Quantitative-Analysis

Uploaded by

Copyright:

Available Formats

Quantitative Analysis

FRM 2013 Study Notes – Part1.Topic2

By David Harper, CFA FRM CIPM

Miller, Chapter 2: Probabilities ............................................................................................. 2

www.bionicturtle.com FRM 2012  QUANTITATIVE ANALYSIS  1

Terminology (for discrete/compound, these do not need to be memorized)

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  2

P  X  xi   pi s.t . x  x1, x2, , xn 

In mathematical terms, these conditions are represented as properties of a probability, where

In contrast to the case of a discrete random variable, a continuous random variable

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  3

We characterize (describe) a random variable with a probability distribution. The random

A random variable is a variable whose value is determined by the outcome of an

Continuous random variable

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  4

Notes on continuous versus discrete random variables

 Discrete random variables can be counted. Continuous random variables must be

 Examples of continuous random variables include: distance and time. A common

 All four of the so-called sampling distributions—that each converge to the

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  5

Normal Bernoulli (0/1)

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  6

Probability density functions (pdf): top row below

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  7

Inverse cumulative distribution function

F(x)  p  F 1( p)  x s.t. 0  p  1

Univariate versus multivariate probability density functions

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  8

Probability: Classical or “a priori” definition

The probability of outcome (A) is given by:

Number of outcomes favorable to A

Probability: Relative frequency or empirical definition

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  9

 The a priori (classical) probability of rolling a three (3) is 1/6,

Distinguish between independent and mutually exclusive events.

Mutually exclusive events

Pr( X  x,Y  y )  Pr( X  x )P(Y  y )

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  10

Define joint probability, describe a probability matrix and calculate joint

It is convenient to summarize joint probabilities in a probability matrix (a.k.a., probability table).

 The bonds can either be downgraded, be upgraded, or have no change in rating.

Consider Miller’s table below:

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  11

 The number of times the computer crashes (M)

Joint probability function (Stock & Watson example)

Pr( X  y,Y  y ) general form

Pr( A  0, M  0)  0.35 for example

0 0.35 0.065 0.05 0.025 0.01 0.50

1 0.45 0.035 0.01 0.005 0.00 0.50

Tot 0.80 0.100 0.03 0.030 0.01 1.00

Define and calculate a conditional probability, and distinguish between

Marginal (a.k.a., unconditional) probability functions

www.bionicturtle.com FRM 2013  QUANTITATIVE ANALYSIS  12

0 0.35 0.065 0.05 0.025 0.01 0.50

1 0.45 0.035 0.01 0.005 0.00 0.50

Tot 0.80 0.100 0.03 0.030 0.01 1.00

Conditional probability functions

Conditional is the probability of an outcome given (conditional on) another outcome: