Master Cheatsheet
Descriptive Statistics
Displaying Quantitative Variables Graphically
Discrete Distributions
Continuous Distributions
Sampling Distribution
    Law of Large Numbers (LLN)
    Central Limit Theorem (CLT)
Hypothesis Testing
    Type 1 Error
    Type 2 Error
    p-value
    Power of Test
Comparing Distributions
    Simpson's Paradox
Comparing/Testing Averages
    Choosing Method
    ANOVA (Analysis Of Variance) F-test
    t-test
    z-test
Lecture 4: Linear Regression
    Linear Regression
Lecture 5: Multiple Regression
    Multiple Regression
    Adjusted R²
    Regression Output
Usefulness Tests of Regression Coefficients (2 tests)
    1. Right-sided F test
    2. 2-sided t-test
Variables Selection
Indicator or Dummy Variables
    Interaction among X's
    Missing Values
    Outliers
Logistic Regression
Time Series
Test for Independence Assumption for residuals in time series
    Durbin-Watson Test
Test for Variance assumption
    F-test
    Individual t-test
    To test if models are really a better fit
    Goodness of Fit
    Forecast Accuracy
    Residual Analysis
Forecast Errors
    Bias
    MAD (Mean Absolute Deviation)
    MAPE
    MSE (Mean Square Error)
    RMSE (Root Mean Square Error)
Prediction Intervals
    Point forecast
    Interval forecast
How to model both Trend and Seasonality
    Additive Model (Linear Trend and Seasonality)
    Multiplicative Model (Nonlinear Trend and Seasonality)
    Ratio-to-Moving-Averages (Seasonal Index)
    Summary of Multiple-Regression-based models
How to Model if Underlying Pattern is NOT Apparent (no obvious trend or seasonality)
    Naïve forecast
    Smoothing-out methods
        Simple Moving Average (SMA)
        Weighted Moving Average (WMA)
        Single Exponential Smoothing (SES)
    Holt's exponential smoothing method
    Winters' exponential smoothing method
    Simple Exponential Smoothing Prediction Interval
How to assess the existence of autocorrelation
    Visual analysis: lagged scatterplot (e.g. original vs. lag 1, lag 1 vs. lag 2)
    Quantitative value: autocorrelation function (ACF)
    1st test: Individual Test for Autocorrelation
    2nd test: Joint Test for Autocorrelation (Ljung-Box test / Q test)
    3rd test: Durbin-Watson test (autocorrelation check for lag-1 residuals only)
    Stationarity
Time Series Models if Original Data Have Autocorrelation and Are Stationary
    Autoregressive Model (AR)
        Partial Autocorrelations (PACF)
    Moving Average Model (MA)
    Summary
    ARMA Model
        Information Criteria (IC) → to choose p and q for ARMA
Box-Jenkins Methodology
    SAS Output
    To see if fitted model is good: look at Autocorrelation Check of Residuals (Q-test / Ljung-Box test)
Stationarity Condition
Non-Stationary Models
    Random Walk Model
    Random Walk with Drift
Non-Stationarity
    1. Trend-stationary (TS): deterministic trend
    2. Difference-stationary (DS)
    To make Non-Stationary become Stationary
Unit-root tests to test for stationarity
    Augmented Dickey-Fuller (ADF) Test
        Time plot
        ACF and PACF
        Forecasts for variable Close
ARIMA(p,d,q) model
    Summary of Unit Root Test
    Summary: Forecasting with ARIMA models
Pairs Trading
Model Diagnostics
Cluster Analysis
LECTURE 1: EXPLORING AND COLLECTING DATA
Data Types
Qualitative: categorical, nominal, labels; not ordered (e.g. red, blue, apple, Toshiba)
Quantitative: numerical, values with units, can tell which is bigger or smaller (e.g. height in cm); discrete vs.
continuous, cross-sectional vs. time series
Half-Half: ordinal, ranked (e.g. 1st, 2nd; a 5-star hotel), but without a specific numeric value
Descriptive Statistics
Single Number
Mode: → most repeated value
o Visual: hump in histogram
o =MODE()
o Unimodal (1 mode), bimodal (2 modes), multimodal (many modes), or NO mode may exist
CENTER
Average/Mean: =AVERAGE() → less resistant than the Median in finding the centre for asymmetric data
Median: =MEDIAN() → middle value (average of the 2 centre numbers when n is even); the Median is resistant to
outliers, so it is suitable for finding the centre when the distribution is skewed, contains outliers, or has gaps
SHAPE
Symmetry:
o Symmetric if halves on either side of center look (approximately) like mirror images
o If symmetric, mean and median are close (because median is 50th percentile and mean is the
average). Normal distribution: symmetric, with mean = mode = median
Skewness: skewed to the side of longer tail
o =SKEW()
o Positive skew: long right-tail
o Negative skew: long left-tail
o Zero skew: no skewness
o Magnitude increases as the degree of skewness increases
o If mode is smaller than median and mean, the distribution is right skewed.
o If skewness exists, there is little symmetry.
SPREAD
To approximate the interval, around the centre of the data, within which half of the data points fall:
median ± 0.5 × IQR
Average of Q1 and Q3: (1st quartile + 3rd quartile) / 2
Range: =MAX() − MIN() → covers 100% of the data
(About 34% of the data lies within 1 SD on either side of the average, so 1 SD covers 68% of the data.)
Standard Deviation: =STDEV()
o Six Sigma → not actually 6 sigma, but 4.5 sigma (build a margin of error of 1.5 sigma)
o Mean +/– Standard Deviation gives two-thirds interval
o SD = √variance
Variance: SD² → =VAR() → the average of the squared deviations from the mean
    Population variance: σ² = Σ(y − μ)² / n    Sample variance: s² = Σ(y − ȳ)² / (n − 1)
Interquartile Range (IQR):
o Q1 = 1st quartile (bottom 25%, or 25th percentile point)
o Q3 = 3rd quartile (top 25%, or 75th percentile point)
o IQR = Q3 – Q1 → (range of middle 50%)
o IQR should be close to 2 x MAD, since both cover 50% of the data
o IQR is a better measure for asymmetric distributions
o An indicator of size of variance
Scaled Interquartile Range (SIQR), to make it comparable to the SD:
o SIQR = IQR / (2 × NORMSINV(0.75)) = 0.741 × IQR → approximately equal to the SD (for Normal data)
Median Absolute Deviation (MAD):
o Measures median of absolute deviation, where the absolute deviation is the absolute difference btw
a data point and the median of the data.
o MAD = med [ |y – med(y)| ]
o Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations about 2 are
(1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are
(0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1.
always positive (e.g. deviations −5 & 2 → absolute deviations 5 & 2 → MAD = 3.5)
covers 50% of the data
1 SD ≈ 1.5 × MAD (for roughly Normal data)
o NOT differentiable, not elegant
o Median +/– MAD gives 50% interval
o MAD is resistant to outliers
The presence of outliers does NOT change the value of the MAD
In contrast, the SD is very sensitive to the presence of outliers
Scaled Median Absolute Deviation (SMAD):
o MAD / NORMSINV(0.75) → {=MEDIAN(ABS(?-MEDIAN(?)))/NORMSINV(0.75)}
o If Standard Normal N(0,1): SMAD = SD
o SMAD is resistant to outliers
Which measures of center and spread to be reported for a distribution?
o If the shape is skewed, the median and IQR should be reported.
o If the shape is unimodal and symmetric, the mean and standard deviation (and possibly the median
and IQR) should be reported.
o Always pair the median with the IQR, and the mean with the standard deviation.
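A minimal Python sketch (assuming NumPy and SciPy are available; the data reuse the MAD example above) of these centre and spread measures:

    import numpy as np
    from scipy.stats import norm

    y = np.array([1, 1, 2, 2, 4, 6, 9])        # data from the MAD example above

    mean, median = np.mean(y), np.median(y)
    sd = np.std(y, ddof=1)                     # sample SD, like =STDEV()
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1                              # range of the middle 50%
    siqr = iqr / (2 * norm.ppf(0.75))          # scaled IQR, comparable to the SD
    mad = np.median(np.abs(y - median))        # median absolute deviation
    smad = mad / norm.ppf(0.75)                # scaled MAD, comparable to the SD
    print(mean, median, sd, iqr, siqr, mad, smad)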
Displaying Quantitative Variables Graphically
Divider line at median (50th percentile) within box
Whiskers extend away from box
5-number summary:
o Whiskers at min and max
o Median, Q1, Q3, Min, Max (5 number)
o Lazy version
7-number summary:
o Whiskers end at the 5th and 95th percentiles
o The centre 90% of the data is within the whiskers
o 5% lies between each whisker's end and the extreme (min or max, represented by dots)
o Structure: min, 5th percentile, Q1, median (Q2), Q3, 95th percentile, max (7 numbers)
If the longer part of the box is at the top (bottom), the data are skewed right (left).
Sampling: because population is too huge, sampling is often used
Population
Sample
Statistic: a number computed from the sample, used to estimate parameters of the population (e.g. the mean)
We can never have 100% confidence of correctly estimating the population parameter of interest, because the
sample used is not the whole population. That is why the standard deviation/error of the estimate is not zero.
Designs: (how to use samples)
o Census: Include whole population (eg. to measure world population)
o Convenience sample: (not a proper sampling design)
Take whatever is available, sometimes no choice
Likely to be unrepresentative of the population
o Simple Random Sample (SRS):
Larger samples are more accurate
Need a listing of all members of population (sampling frame); can be expensive
With or Without Replacement
In practice, we sample without replacement
With formula, we assume with replacement, for simplicity
o Stratified Sampling:
Proportional representation, rare stratum never left out
Eg. the Indian minority in Singapore: boost their representation, since their share of the population is small
Sample randomly from each stratum
o Cluster Sampling or multistage sampling:
Identify a microcosm of the population and let that cluster represent the population
Eg. a sample of Serangoon represents Singapore, since it is similar to the whole
Discrete Distributions
BINOMIAL: B(n, p)
n = no. of trials. p = probability of success. X = no. of successes in n trials
o Success or Failure?
prob(X = x) =BINOMDIST(x,n,p,FALSE)
prob(X ≤ x) =BINOMDIST(x,n,p,TRUE)
o Average = np
o Variance = np(1-p)
o Eg. Toss a fair coin 20 times, what’s the probability of at least 18 heads?
= 1 – BINOMDIST(17,20,0.5,TRUE)
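The same coin-toss calculation as a short Python sketch (assuming SciPy):

    from scipy.stats import binom

    # P(X >= 18) for X ~ B(20, 0.5): complement of P(X <= 17)
    p = 1 - binom.cdf(17, 20, 0.5)   # same as =1-BINOMDIST(17,20,0.5,TRUE)
    print(p, binom.sf(17, 20, 0.5))  # sf() gives the upper tail directly, ≈ 0.0002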
Continuous Distributions
NORMAL (N):
o N(µ, σ²), sometimes written N(µ, σ)
Average = µ (affects position of peak)
Variance = σ 2 (affects spread)
About 34% of the data lies on either side of the average within 1 SD
68% within 1 SD of the average
95% within 2 SDs
99.7% within 3 SDs
99.994% within 4 SDs
When asked to find the corresponding percentile values from the mean and SD, use the
68-95-99.7 rule.
Mean = Mode = Median
Symmetrical about the centre
IQR / (2 × NORMSINV(0.75)) = 0.741 × IQR ≈ SD
i.e. SD ≈ 0.75 × IQR, so IQR ≈ SD / 0.75 ≈ 1.35 × SD
o Approximate sample proportion p when population is large
N(p, p(1–p)/n)
Requirements:
p > 5/n & (1 – p) > 5/n
p (1 – p) is max when p = 0.5
Conservatively, use N(p, 0.5(1-0.5)/n) → N(p, 0.25/n) for margin of error
o use Normal to Approximate BINOMIAL, B(n, p)
N(np, np(1–p))
Requirements:
np > 5 & n(1 – p) > 5
OR np ≥ 10 & n(1 – p) ≥ 10 (in some texts, less common)
Often used for: P( B(n, p) ≥ x )
Continuity Correction: approximate P( B(n, p) ≥ x ) by P( N ≥ x − 0.5 )
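A small Python check (assuming SciPy) of the Normal approximation with continuity correction against the exact binomial tail, reusing the coin example:

    import numpy as np
    from scipy.stats import binom, norm

    n, p, x = 20, 0.5, 18
    exact = binom.sf(x - 1, n, p)                   # P(B(n,p) >= x)
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))     # N(np, np(1-p))
    approx = norm.sf(x - 0.5, loc=mu, scale=sigma)  # continuity correction
    print(exact, approx)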
o Standardisation (z-score): z = (y − μ) / σ
t-distribution (t_d):
o t_d = (y − μ) / SE(y), where SE(y) = standard deviation / √n
bell-shape, like Standard Normal N(0, 1), but with df
lower peaks, higher tails (more spread out)
d = degree of freedom
Average = 0
Variance = d/(d – 2) → nearly 1
o As degrees of freedom ↑, the t-distribution tends toward the Standard Normal Distribution
o t-distribution is more accurate than the z-distribution (Standard Normal Distribution). z-distribution
can approximate the more accurate t-distribution if the sample size is large.
o t-value represents the number of standard errors by which the sample mean differs from the
population mean
eg. if a t-value is 2.5, the sample mean is 2.5 standard errors above the population mean
F-distribution:
o Fv, d
F distribution with v, d DF (degrees of freedom)
v degrees of freedom in numerator
d degrees of freedom in denominator
Average = d/(d – 2)
Variance is complicated
o Used in ANOVA (hence regression)
Sampling Distribution
Sampling Distribution: distribution of all the averages of different samples
Law of Large Numbers (LLN):
1. The average of many independent samples is (with high probability) close to the mean of the population
o Average of all the sample averages → expect to get the population average
2. Standard Deviation of the average of many independent samples = σ / √n → smaller than the population SD
3. The relative frequency becomes closer to the true (objective) probability as more trials are performed.
Point estimate
o A single value given to a sample as an estimate of the true value of a population
o Sampling Error/Estimation Error:
Difference between a point estimate and true value of population
Because samples are smaller than population, there will always be sampling error
o Sampling Distribution:
Distribution of point estimates from all possible samples from the population
o Unbiased Estimate:
Point estimate where mean of sampling distribution of that statistic = true value of
population
OTHERWISE, it is biased
Unbiased estimates are desirable because they average out to the correct value
o Standard Error:
How much point estimates vary from sample to sample (SD of sampling distribution)
Standard Deviation of a Sampling distribution (NOT population)
Ideally, estimate should have small Standard Errors
If point estimates vary wildly, then a point estimate from 1 sample is NOT reliable
Standard Error = σ / √n
When we don't know σ, we approximate: Standard Error = sample standard deviation / √n
Diminishing returns: the standard error declines only with the square root of the
sample size, as seen from its formula.
Approximately Normal, so can use Standard Error exactly as you use Standard Deviation
Eg. 2 Standard Errors on either side of mean, 95% confident of capturing mean
Approximate 95% Confidence Interval = sample mean ± 2s / √n
o Within this range/interval, we are 95% confident
Interval Estimate
o An interval/range within which a value has a stated probability of occurring
o Confidence Interval:
Probability that a value will fall within a range
Measure of reliability/how accurate our estimate is
Eg. given a 95% confidence interval, value will be between 5 and 100
Interpretation: we are 95% confident that the value will be between 5 and 100
95% confidence interval for µ: “sample average” ± ME, where ME = t(n−1, (1−95%)/2) × s / √n
i.e. Mean ± t-critical (at the stated confidence level, with n − 1 degrees of freedom) × SE, where SE = s / √n
If the population variance is known, the confidence interval is given by the Normal distribution.
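A minimal sketch of this t-based confidence interval in Python (assuming SciPy; the sample values are hypothetical):

    import numpy as np
    from scipy.stats import t

    y = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3])  # hypothetical sample
    n, conf = len(y), 0.95
    se = np.std(y, ddof=1) / np.sqrt(n)             # SE = s / sqrt(n)
    t_crit = t.ppf(1 - (1 - conf) / 2, df=n - 1)    # t-critical with n-1 df
    me = t_crit * se                                # margin of error
    print(np.mean(y) - me, np.mean(y) + me)         # 95% CI for the mean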
o Randomization Condition: The data values must be sampled randomly, or the concept of a sampling
distribution makes no sense. (This is usually assumed.)
o Independence Assumption: The sampled values must be independent of each other.
o Small-Fraction-Sample Size Condition: The sample size is a small fraction (traditionally, < 10%) of the
population. E.g. population size 100, each sample consists of 10 subjects only. (This is rarely looked
at.)
o Large-Enough-Sample Number Condition: If the population is symmetric (e.g. uniform, as in die-
throwing; unimodal will further help), even a fairly small sample is okay. For highly skewed
distributions, very large samples may be required. Traditionally, a short rule is n > 30. (This is
frequently ignored.)
Hypothesis Testing
Null hypothesis
o “fail to reject” null OR “reject” null (DON’T say “accept”)
o Null is a simple hypothesis: does NOT allow a range
Eg. cannot say null: average is between 4 and 6
o Type 1 Error:
If null is true and you declare it is false (reject). False positive.
Probability α → significance level
Probability of Type 1 error occurring
E.g. continue with further testing when in fact mean tumour weights are the same
for the two groups
In practice, α usually specified by boss/client (fixed)
Always under Null Hypothesis, and drawn towards the Alternative Hypothesis
Use α to calculate c, the critical value for testing (not needed if the p-value is available)
Alternative hypothesis
o Set up against the null hypothesis
o Can be one-sided or two-sided (affects calculation of z-critical using 0.05 or 0.025)
o Alternative is a composite hypothesis: can allow a range
o Type 2 Error: → CANNOT have both Type 1 and Type 2 error at the same time
If null is false and you declare it is true (fail to reject). False negative
Probability β
Probability of Type 2 error occurring
E.g. abandon the drug when in fact the mean tumour weight of the treated mice is
smaller than that of those in the controlled group
Under the Alternative Hypothesis and drawn towards the Null Hypothesis
Results:
o If the sample value falls on the non-rejection side of the critical value: do not reject the Null Hypothesis H0
o If the sample value falls beyond the critical value (in the rejection region): reject H0 in favour of the Alternative Hypothesis Ha
o To reduce both α and β, increase n (sample size)
In practice, people only specify α (fixed); so increase n → reduce β
o α & β Cannot occur at the same time
p-value → also called the significance Probability of the test
o Probability, computed under H0, of seeing something favouring HA at least as much as the observed sample, given that H0 is true
o Intuition: if the p-value is small, a sample this favourable to HA should rarely happen under H0; yet it
happened, so H0 is rejected.
chance of a worse sample for H0 → NOT the probability of H0 being true, which is just 1 or 0
big-p-value favours the non-rejection of H0
o If a method produces a p-value, it is a statistical method; if NOT, it is NOT a statistical method.
o Results:
p ≥ α: do NOT reject H0
p < α: reject H0
CANNOT change p after looking at α
Power of Test
o 1–β
Probability of rejecting H0 when it is false (when HA is true)
Computed using sampling distribution for the Alternative Hypothesis (NOT Null)
Allows us to describe ideal test
Higher power → more reliable test
To achieve higher power, increase sample size n (power depends on sample size)
o Ideal test characteristics:
Small α → hardly reject H0 when it is true
Large power (1 – β) → often reject H0 when it is false
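A minimal power sketch in Python (assuming SciPy), for a right-tailed z-test of H0: µ = µ0 against HA: µ = µ1 > µ0 with known σ; all numbers are hypothetical:

    import numpy as np
    from scipy.stats import norm

    mu0, mu1, sigma, n, alpha = 100, 104, 10, 25, 0.05  # hypothetical values
    se = sigma / np.sqrt(n)
    crit = mu0 + norm.ppf(1 - alpha) * se     # critical value under H0
    power = norm.sf(crit, loc=mu1, scale=se)  # P(reject H0 | HA true) = 1 - beta
    print(power)                              # increasing n raises the power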
Comparing Distributions
Scatterplot (for dependent samples, not suitable for independent samples):
Define which is x-variable and y-variable (defining wrongly will affect results)
o X-variable: independent variable (changing this)
o Y-variable: dependent variable (changing x will affect y)
Results:
o Negative/Positive relationship?
o Strong/Moderate/Weak relationship?
Strong: points are close or on the best fit line (trend line)
Weak: points are far away from the best fit line (trend line)
o Linear/Logarithmic/Exponential (etc.) relationship?
o Outliers?
Lying outside Upper Control Limit or Lower Control Limit
Real data or data error?
Perform one scatterplot with outliers, and one without outliers
Does outlier have any significant effect?
How to detect missing values:
o Blank cells in Excel data set
What to do about missing values:
o Ignore them
Must be aware of how software deals with missing values
Eg. Excel’s AVERAGE function divides by existing values, does not touch missing values
o Filling the gaps in some way
Examining existing values in the row of any missing value to help predict missing value
Fill in all missing values with average of existing values in that column
Not very good option
Observations:
o Comparatively short box plots: data shows high agreement with each other
o Comparatively long box plots: data shows that there are different opinions about this aspect
o Box plot is much higher/lower than another: could suggest differences between groups
o "No overlap in spreads" or "75% is below 75%" so there IS a difference between group 'A' & 'B'
Compare:
o Medians
o Consistency (smaller IQR → more consistent)
Comparing/Testing Averages:
Choosing Method:
1. 1, 2 or more samples?
o 1: 1-sample t-test, 1-sample z-test (test if the sample mean is larger/smaller/different from expected)
o 2: t-test, z-test, ANOVA
o More than 2 independent samples: ANOVA
2. Are samples independent?
3. Are population variances known?
4. Are unknown population variances equal?
For 1 sample
o If population is Normal (or sample large for CLT)
Variance Known: 1-sample z-test
Variance Unknown: 1-sample t-test
For 2 samples
o If populations are Normal, Samples are Independent
Variance Known + Not too different: z-test
Variance Unknown + Equal (but you know they are equal): pooled t-test OR 1-way ANOVA
Variance Unknown + Unequal (BUT not too different): 2-sample t-test *most common
o If Paired Samples across populations: dependent within pairs, independent between pairs
First, compute the paired difference: d = x1 – x2
If difference between pairs is Normal + Variance Unknown: paired t-test
5. A statistical hypothesis is only about a population parameter, not sample estimate.
o Usually < 0.01 → convincing evidence for HA
Next step: Discover which means are significantly different from which other means
Usually done by examining Confidence Intervals
o F statistic/F vs F critical
If F < F crit → variances are the same
If F > F crit → variances are NOT the same
o SS (sum of squares):
Between groups: sum of squares due to treatment
Within groups: sum of squares due to error
o Between Variances (SSR):
measures how much sample means differ from one another
Only if Between Variance > Within Variance, can you conclude with any assurance that
there are differences between population means – and reject null hypothesis
o Within Variances (SSE):
measures how much observations within each sample differ from one another
Large Within Variances:
Difficult to infer whether there are really differences between population means
Small Within Variances:
Easy to infer whether there are really differences between population means
o df (degrees of freedom)
o two variations MS (means squares):
Between groups (Horizontal variation): means squares due to treatment (MST).
Within groups (Vertical variation): means squares due to error (MSE).
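A one-way ANOVA in Python (assuming SciPy; the three groups are hypothetical):

    from scipy.stats import f_oneway

    g1 = [23, 25, 21, 22, 24]   # hypothetical group samples
    g2 = [28, 27, 26, 30, 29]
    g3 = [22, 24, 23, 25, 21]

    f_stat, p_value = f_oneway(g1, g2, g3)  # H0: all group means are equal
    print(f_stat, p_value)                  # small p-value → NOT all means equal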
t-test:
Used to find out whether the means of a dataset are equal according to a certain α level
o To see if there are any significant differences between TWO different groups (unless 1 sample t-test)
o H0: μ1 – μ2 = Δ0 H1: μ1 – μ2 ≠ Δ0
Conditions for using a t-test:
1. Standard Deviation is NOT known
2. n < 30
How to conduct a t-Test:
o 1-tail test: knows what the difference is (eg. group 1 > group 2)
o 2-tail test*: unsure if there is a difference (eg. although there is difference, not sure which sign)
o Type 1 (paired t-test; a one sample t-test for the mean paired difference (μ1 – μ2)):
Dependent + unknown variance + variance equal
dependent paired samples across populations between the 2 groups
dependence within pairs, independence between pairs
Related across columns, but NOT rows
difference between pairs is Normal, and variance is UNKNOWN
Requires 2 samples of equal size
Paired t-test is for the average of the differences between pairs
Result:
o If p-value < stated α level (small p-value) → reject H0 →difference between the 2 groups is
significant
o If p-value ≥ stated α level, do not reject H0 → NOT enough evidence to conclude that the 2 groups are
different; this also implies that the confidence interval for µ1 − µ2, at confidence coefficient 1 − α, will
include zero, since there is not enough evidence to conclude that the 2 groups are different
Confidence interval
o Mean +/- t-critical (with df and confidence coefficient specified) x standard error
o Mean = mean of X1 − mean of X2; t-critical can be calculated from the t-test output; SE is the standard error from the t-test
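The common t-test variants in Python (assuming SciPy; the data are hypothetical):

    from scipy.stats import ttest_ind, ttest_rel

    a = [5.1, 4.9, 5.6, 5.2, 5.0, 5.4]  # hypothetical group 1
    b = [4.6, 4.8, 4.5, 5.0, 4.7, 4.4]  # hypothetical group 2

    t_w, p_w = ttest_ind(a, b, equal_var=False)  # 2-sample t-test, unequal variances
    t_p, p_p = ttest_ind(a, b, equal_var=True)   # pooled t-test, equal variances
    t_d, p_d = ttest_rel(a, b)                   # paired t-test (equal sample sizes)
    print(p_w, p_p, p_d)  # p-value < alpha → reject H0: groups differ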
z-test:
Used to find out whether the means of a dataset are equal according to a certain α level
o To see if there are any significant differences between 2 different groups
Conditions for using a z-test:
o When conditions for t-test NOT satisfied:
Standard Deviation is known
n > 30
Result:
o If p-value < stated α level (small p-value) → reject H0 →difference between the 2 groups is
significant
o If p-value ≥ stated α level, NOT enough evidence to conclude that 2 groups are different
LECTURE 4: LINEAR REGRESSION
o β̂₁: slope term that measures the expected change in y given a unit change in x only
+ve or −ve correlation between x and y
CANNOT infer strength of correlation; magnitude can be changed by the scale of measurement
β̂₁ = r × (s_y / s_x). So, if r = 0, β̂₁ = 0; then β̂₀ = ȳ, and ŷ = β̂₀ + β̂₁x = ȳ
o x: independent variable. y: dependent variable
o ε : error term, residual → sum of residuals = 0
o Residual r (estimate εi) = observed y value - fitted y value
o Unlike correlation, regression is not symmetric in X and Y: the regression of X on Y is a different line
(x̂ = β̂₂ + β̂₃y), not the y-on-x equation ŷ = β̂₀ + β̂₁x solved for x
Standard error of the linear regression (Checking the model with standard deviation of the residuals,
estimate the standard deviation of the error term)
o How much the points spread vertically around the regression line
o s_e = √( Σr² / (n − 2) )   (r is the residuals and n is the number of observations)
Application of the standard deviation of the residuals / standard error of estimate: given s_e, you can find
how many standard errors your fitted value is from the actual value, using the residual for that point
divided by s_e. E.g. if s_e is 3170 and the residual of a particular point is 2086, the fit/prediction is about
2086/3170 = 0.66 standard errors away from the actual value, which is a fairly good prediction.
Conditions:
o Constant Variance Condition
The standard deviation around the regression line should be the same along the whole line
o Quantitative Variables Condition:
Correlation applies only to quantitative variables
o Linearity Condition:
Correlation measures the strength only of the linear association (between 2 variables)
o Outlier Condition:
Outliers can distort the correlation
Line of “Best Fit”
o Sum of residuals is NOT a good assessment of how well line fits data
Because some are positive, some are negative
o Sum of square of residuals better
Smaller sum of squares → better fit
Smallest sum of squares → line of “best fit”, or least squares line
Correlation:
o Correlation is NOT causality, and vice versa. It cannot be computed for more than 2 variables
o Correlation measures extent of clustering around the positively/negatively sloping 45 ° line, for
standardised X and Y variables.
o Correlation treats x and y symmetrically → ρ ( x , y ) =ρ( y , x ). same r, if x and y are interchanged
o Correlation has no units (standardised values are used)
o 0 correlation → not linearly correlated (but doesn’t mean NO relation)
o Correlation always between –1 and +1, not affected by units of measurement, sensitive to outliers.
o For Standard Units:
(X −μ x ) (Y −μ y )
Zx = , Zy =
σx σy
Correlation = Average of Standard Units ρ ( x , y ) =average ( z x , z y )
∑ zx zy
o Correlation, r =
n−1
sy sx
o Correlation r. ^β 1=r r = ^β 1
sx sy
Cov(x , y)
o Correlation, r=
sxsy
o In linear regression, Multiple R = r(y, ^y ) = r(y, ^β 0 + ^β 1 x ) = r(y,x) = r(x,y) = correlation coefficient
Covariance:
o Covariance = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)   (covariance can be of any value)
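These correlation and covariance formulas as a quick NumPy check (hypothetical data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical, roughly linear in x

    sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    r = cov / (sx * sy)                        # r = Cov(x, y) / (sx * sy)
    zx, zy = (x - x.mean()) / sx, (y - y.mean()) / sy
    r2 = np.sum(zx * zy) / (len(x) - 1)        # same r from standard units
    beta1 = r * sy / sx                        # regression slope from r
    print(r, r2, beta1)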
R² (the fraction of the y's variation accounted for by the simple regression model)
o R² = (Multiple R)² = (correlation coefficient r)²
o 1- R2 = fraction of y’s variation left in the residuals. Percentage of y unexplained by the model
o Removing outliers can change or may not affect |r|, and hence R2
Regression Effect:
o In a different round, the corresponding observation tends to be closer to the average
o Works both ways: not just future, but also backwards
o If the current round is b SDs from the mean, then, c rounds (generations) apart, the corresponding
observation will be about b · |r|^c SDs away, where r is the correlation coefficient.
LECTURE 5: MULTIPLE REGRESSION
o Independence Assumption (probabilistic independence of errors)
o Normality Assumption (errors are normally distributed)
How to perform Multiple Regression:
o Check “residual plot” to get Residual Plot (residuals plotted in a scatter plot)
Residuals ε : difference between the observed y values and expected/predicted y values
2 things must be true if regression line captures the overall pattern of data well
1. Residual plot shows no obvious pattern – random
2. Smaller residuals → better
Steps:
o Formulate the Model (specify the variables)
o Estimate the parameters
o Perform model diagnostic testing → to justify your model selection
o Conduct hypothesis testing → to test whether necessary to use your predictive model
o Reformulate the model if necessary, then repeat steps 2 to 4 → If satisfied, use the model to
forecast
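A minimal run of these steps in Python (assuming statsmodels; the variables and data are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=50), rng.normal(size=50)   # hypothetical predictors
    y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=50)

    X = sm.add_constant(np.column_stack([x1, x2]))  # formulate: intercept + x1 + x2
    model = sm.OLS(y, X).fit()                      # estimate the parameters
    print(model.summary())                          # R², Adjusted R², F, p-values
    resid = model.resid                             # residuals for diagnostic checks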
Prediction:
o Prediction is for outside sample data, otherwise we are doing a fit
o We can calculate the error term for the predicted value
o Check assumptions for the error term:
1. Zero average
Check: guaranteed by the presence of the intercept term
2. Equal variance σ² → the sample estimate of σ is the Standard Error
Check using Residuals vs Fit
Check using the Residual Plot (square plots make patterns easier to see)
Hope for no pattern
3. Independent
Check using Residuals vs each X (same as vs. Fit)
Hope for no pattern/linearity
HOWEVER, violations can still be missed by residual plots
4. Normally distributed
Check using a Normal probability plot of the normalized residuals
Hope for a straight line → if NOT straight enough, try a y transformation
o If assumptions are violated, try transforming the data or adding more variables.
Results:
o p-value (Coefficients table)
o Significance F (ANOVA)
p-value of F
Small Significance F → Regression is good → NOT all means equal, populations diff
Large Significance F → Regression is bad
NOT the same as F statistic
F statistic: “signal-to-noise ratio”
o F statistic = [SS(Between Groups) / df_BG] / [SS(Within Groups) / df_WG]
Larger F statistic → better → NOT all means equal, populations different
o Adjusted R2 (Regression Statistics. Goodness of fit)
R2 adjusted of no. of variables in the model
Use it to judge if you should add extra variable to regression
If adjusted R2 increases, add variable
If decreases, don’t add variable
NOT affected by sample size n
Percentage of variance explained by the model
Adjusted R² = 1 − [SS(Error) / (n − k − 1)] / [SS(Total) / (n − 1)]
SS(Error) = Σ(yᵢ − ŷᵢ)², SS(Total) = Σ(yᵢ − ȳ)²
o R Square, (Multiple R)2:
Percentage of variation/variability is explained by the model (Coefficient of determination)
Between 0 and 1
If add extra variables to regression, will always increase R2
∴ Higher R2 NOT always preferred → may have too many variables
As sample size n increases, R² tends to decrease → with only k + 1 data rows, R² = 1 (perfect overfit)
R² = 1 − SS(Error) / SS(Total)
o R, Multiple R
If only linear regression (1 variable), R = correlation coefficient
is r(y, y-hat) when regression has intercept
o Standard error of multiple regression (estimate for the standard deviation of the error term)
How much the points spread vertically around the regression line
s_e = √( Σr² / (n − k − 1) )   (r: residuals, n: no. of observations, k: no. of independent variables)
About 68%, 95%, 99% of predictions made would be within 1, 2, 3 SD of the actual Y.
o F-test:
To test if at least one coefficient is NOT 0 / to test if ALL coefficients are 0 (null hypothesis)
But does NOT tell you which ONE is 0
To do that, use t-test
o t-test:
To test if a specific coefficient is NOT 0 (i.e. whether that particular variable is important)
o df: Degree of Freedom
Larger → better; should NOT be too small
Eg. 100 samples, 99 variables to draw the regression, degree of freedom is 0. No
point in doing regression because every single sample is perfectly fit (over fit)
If you add a variable, it will reduce the degrees of freedom
Use R2 to tell if it is worth it
o SS:
SSR: sum of squares explained by the regression, SSR = Σ(ŷᵢ − ȳ)²
SSE: sum of squares of the residuals/error term, SSE = Σ(yᵢ − ŷᵢ)²
The amount of variation that remains unexplained by the model
SST: sum of squares total = SSR + SSE = Σ(yᵢ − ȳ)²
The total amount of variation in the data
Unlike Significance F, Adjusted R² is NOT much affected by sample size
If the sample size is large, Significance F tends to be small
BUT Adjusted R² can still be very small (it can even be < 0); the gap below R² widens as the sample size decreases
Regression Output
Regression Statistics:
Multiple R = √(R Square)
R Square = SSR/SST → % of variation explained by the model; as the no. of predictors increases, R Square always increases
Adjusted R Square = 1 − (1 − R Square)(n − 1) / (n − k − 1) → usually smaller than R²; it will stop increasing
(and drop) after a certain no. of variables are added, at which point you stop adding variables
Standard Error = √MSE → sample estimate of the standard deviation of the error term (SD of the vertical
distances of the points from the regression line)
Observations = n
ANOVA table:
Regression: df = k, SS = SSR, MS = MSR = SSR/k, F = MSR/MSE, Significance F = p-value of the F-test
(e.g. 9.3996E-37 → good model)
Residual: df = n − k − 1, SS = SSE, MS = MSE = SSE/(n − k − 1)
F is the “signal-to-noise” ratio (the F-statistic)
R square large → good model
Significance F small → good model
All P-values are small → good model
If the overall p-value for the variables is small, but 1 variable has a large p-value → we reject that model (not all p-values are small)
We will want to find a model that removes the variable with the large p-value
Usefulness Tests of Regression Coefficients (2 tests)
1. Right-sided F test: test that all Xs taken together do NOT linearly contribute to Y
o To test if regression model is useful; to test if all = 0, if one deviates then it’s NOT true
H0: β 1 = β 2 = … = β k = 0
Interpretation: all X’s are not contributing → regression is NOT useful
Note: intercept β 0 is NOT included here
Ha: at least one β j ≠ 0
o Right-sided F test with k, n-k-1 degrees of freedom
k = no. of variables in regression
o Results:
Large Significance F → do NOT reject H0
all β ’s = 0 → all X’s are NOT contributing
Regression is NOT useful
Small Significance F → reject H0
All you can say is: at least one x that is at least a little linearly related to Y
o Regression model is useful
Smallest Significance F is most useful
o CANNOT say every x is
o Because can have individual variables with small Significance F, but overall
Significance F is large (need to do individual t-test)
Significance F decreases as sample size increases
Unlike Adjusted R2, which is NOT affected by sample size
2. 2-sided t-test: test if one particular Xj does NOT linearly contribute to Y, in the presence of the other Xs
o Assuming regression has some use, test the usefulness of individual X variables in the regression
H0: β j = 0 → variable is NOT useful
Ha: β j ≠ 0
o 2-sided t-test with n-k-1 degrees of freedom
t-value / t-statistic / t-ratio: t = (β̂ⱼ − 0) / SE(β̂ⱼ) = coefficient / Standard Error → determines the p-value
o Use the results from this test for variable selection (next section)
Throw away variable with the largest p-value
Keep throwing away until overall p-value stops improving
o Results:
Small p-value: reject the null that the variable is not useful.
Large p-value could mean either:
1. X intrinsically NOT linearly related to Y but could have other relationship, OR
2. X linearly related to Y, BUT collinear with other X’s
o Multicollinearity: variables are correlated with one another
o Other X might make an important variable seem less important
Variables Selection
Not all X’s are useful; sometimes putting in useless X’s might hurt the regression.
Therefore, need to select useful variables
p-value: use this to determine which variable can be thrown away; throw largest p-value → see if overall p
improves
If overall p-value improves (lower), then continue throwing away variables
If overall p-value worsens (higher), then add back that variable, and use that model
In the end, model with the smallest overall p-value will be chosen
o All necessary x’s included; all unnecessary x’s excluded
o All x variables should have necessary transformations and interactions
o Abandon the final model if Adjusted R2 < 0.2
LECTURE 6
Dummy variable to capture different intercepts
Predicted Y = a + b1X + b2D
When D=1, Predicted Y = (a+ b2) + b1X
When D=0, Predicted Y = a + b1X
For the 2 different values of D:
o I have 2 different regression lines with the same slope but different intercepts
o In effect, when you include only a dummy variable in a regression equation, you are allowing the
intercepts of the two lines to differ (by an amount b2), but you are forcing the lines to be parallel.
To be more realistic, you might want to allow them to have different slopes, in addition to possibly
different intercepts.
Variable Selection using dummy rows (choose which variables you want to include in your model)
o Small p-value for the dummy/interaction variable: the variable is significant and different slopes are needed
o Large p-value for dummy variable “meat”: the variable is not significant
If the dummy variable is not significant, the two groups have the same intercept
For the 1 dummy variable and 1 interaction variable:
o (If ONLY the p-value for the interaction is SIGNIFICANT) I have 2 different regression lines with the same
intercept but 2 different slopes for the 2 different categories
o OR (if BOTH p-values are SIGNIFICANT) I have 2 different regression lines with 2 different
intercepts and slopes for the 2 different categories
An interaction variable can be the product of any two variables: a numerical and a dummy variable, two dummy
variables, or even two numerical variables. It also works when both variables are dummies from
different categorical variables
Pros and cons to adding interaction variables
o More complex and interesting model, significantly better fits.
o Extremely difficult to interpret.
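A sketch of the dummy-plus-interaction regression in Python (assuming statsmodels; the data and coefficients are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 60)      # numerical predictor
    d = rng.integers(0, 2, 60)      # dummy variable, 0/1 for two categories
    y = 3 + 1.2 * x + 4 * d + 0.7 * d * x + rng.normal(size=60)

    X = sm.add_constant(np.column_stack([x, d, d * x]))  # include interaction d*x
    fit = sm.OLS(y, X).fit()
    print(fit.params)   # a, b1, b2, b3: D=0 line a + b1*x; D=1 line (a+b2) + (b1+b3)*x
    print(fit.pvalues)  # decide whether different intercepts/slopes are needed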
Missing Values
For X missing: fill X with the average and set the missing-value dummy D = 1 → … + b1·(average) + b2·(1) = b1·(average + b2/b1)
Outliers
Variable Transformations
Prediction:
Try to stay within or near the range of each X
Use feasible values of X
o Eg. for dummy variables, don’t use 0.5, use either 0 or 1
Prediction is almost always better with analytics than without
such 1% increase entails the same increase in Cost. This is another way of describing the decreasing marginal
cost property.
Logistic Regression
Time Series
o Objective:
Search for patterns in historical series and extrapolate these patterns into the future
Provide forecasts of future values of the time series, based on past information
o Time sequence of data is an important aspect
Eg. if we change the sequence, we will get a different result
Most time series are equally spaced at roughly regular time intervals
Eg. daily, monthly, quarterly, annually
NOT time series if:
NOT recorded sequentially, OR
Sequence NOT important
o Time series plot:
Time on x-axis (horizontal)
Variable on y-axis (vertical) → eg. quarterly sales
o Components of Time Series: (can exhibit none, or maybe one or two components)
Trend component (T)
Types:
o Linear Trend: α + βt (the change in Y remains constant)
o Quadratic Trend: α + βt + γt²
o Exponential Trend: α·exp(βt) (the percentage change in Y remains constant)
o Polynomial Trend: α₀ + β₁t + … + β_p·t^p
Seasonal component (S)
Short-term (1 year or less), repetitive behaviour
Time between peaks: Period
o Each Period → similar pattern
(Figure: seasonality with an increasing exponential trend)
Cyclical component (C)
Longer than 1 year, more irregular (random fluctuations) and difficult to predict
Additive model:
o Y = Trend + Seasonality + Cycle + Random
Multiplicative model:
o Y = Trend x Seasonality x Cycle x Random
o Always fit the multiplicative model by taking logarithms (transformations)
Ln(yt) = a + bt + et
Why?
Easy to estimate the coefficients
Transformed data more closely satisfy the assumptions of statistical models (eg. normality)
o Better fitted model (smaller Significance F)
To change the form to an Additive model
STEP 1–2:
Linear Trend Model → Salest = a + b(time)
Sales changes by a constant amount each time
need to convert quarter data (Q1, Q2, Q3, Q4, Q1,Q2…) to time= 0, 1, 2,3,4,5,…
regression of sales and time
o Slope, b:
o Expected change of sales
o Salest – Salest-1 = (a+bt) – (a + b (t – 1)) = b
o Because variable is time, and there is only one variable
o Intercept, a:
o Expected value of quarterly sales at the initial time
o Trend line:
o Ignores seasonal variation in the sales
o Using the linear trend equation to forecast sales may result in over-/under-estimates in different quarters
Check to make sure Adjusted R2 is higher for the transformed data (Ln (Sales)) than original
data (Sales)
If higher: transformed data is better model → greater variance explained by model
For Exponential:
o Slope, b:
Approximately the % change of sales per time unit (eg. quarter)
HOWEVER, only holds true if slope is close to 0
If slope is very large, forget it
o Intercept, a:
Expected value of ln(sales) at initial time
o To find the actual % change in sales for a particular month: % change = e^slope − 1
o For t = 0 (month 0): Sales = e^intercept → since exp(0) = 1
Trend line:
o Still ignores seasonal variation in the sales
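Fitting the linear and exponential (log-linear) trends in Python (assuming statsmodels; the sales series is hypothetical):

    import numpy as np
    import statsmodels.api as sm

    sales = np.array([102, 110, 121, 135, 148, 160, 178, 195.0])  # hypothetical
    t = np.arange(len(sales))                # time = 0, 1, 2, ...

    X = sm.add_constant(t)
    linear = sm.OLS(sales, X).fit()          # Sales_t = a + b*t
    loglin = sm.OLS(np.log(sales), X).fit()  # ln(Sales_t) = a + b*t
    b = loglin.params[1]
    print(linear.params)     # [a, b] of the linear trend
    print(np.exp(b) - 1)     # actual % change per period = e^slope - 1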
o Durbin-Watson Test:
o Used to test for Autocorrelation between 2 sequential errors (e.g. lag1 and lag2, lag2 and lag3)
Autocorrelation of errors: correlation between errors; ie. Errors are dependent
Autocorrelation of lag 1 and lag 2 errors means overprediction in Jan will lead to
overprediction in Feb
Test whether errors are independent or not
o DW = Σᵢ₌₂ⁿ (ε̃ᵢ − ε̃ᵢ₋₁)² / Σᵢ₌₁ⁿ ε̃ᵢ²
Numerator: sum of [error at time i (starting from 2), minus the previous error, squared]
Denominator: sum of [squared error for each time i]
o Null hypothesis: No lag-1 autocorrelation
o Alternative hypothesis: there is lag-1 autocorrelation (either positive or negative; two-sided test)
0 ≤ d ≈ 2(1 − ρ̃) ≤ 4
When ρ̃ = 0 → d ≈ 2
When ρ̃ = 1 → d ≈ 0
When ρ̃ = −1 → d ≈ 4
Easier to use p-value
Since tables for critical values are NOT always readily available
If p-value < α , reject null
Reject null when it’s in the 2 yellow regions
There is autocorrelation
When d close to 0: positive autocorrelation
When d close to 4: negative autocorrelation
Do NOT reject null:
No evidence of autocorrelation
Two consecutive errors are NOT correlated with each other
White Area:
Inconclusive region, we don’t know what to do
dL and dU:
critical values, can be found in table for DW test
fixed number, but varies for different no. of:
o Observations
o Variables
o Limitations:
Only tests for the first autocorrelation (between 2 consecutive errors), not others
If fall within inconclusive region (white), we don’t know what to do
However, DW test usually found in business reports, so we learn it
o Interpretation:
If there is Autocorrelation btw the errors (Errors NOT independent):
Formula used to compute Standard Error is wrong
o Thus confidence interval & hypothesis test will be wrong
o Errors are supposed to be useless
Least squares estimator
o Still linear and unbiased (expectation = true parameter)
o BUT it is NOT efficient (less accurate estimator)
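Computing the DW statistic directly from the residuals in Python (assuming NumPy; the residual values are hypothetical; statsmodels also provides a durbin_watson function):

    import numpy as np

    resid = np.array([0.5, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.3])  # hypothetical

    num = np.sum(np.diff(resid) ** 2)  # sum of squared consecutive differences
    den = np.sum(resid ** 2)           # sum of squared errors
    dw = num / den                     # d ≈ 2(1 - rho); near 2 → no lag-1 autocorrelation
    print(dw)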
Check equal variance/spread assumption (linearity/independence assumption also) using residual plot
(Figure: residual plots showing non-equal variance and non-linear, dependent residuals)
o F-test
o Used to test if independent variables can predict the dependent variables
To test how useful the fitted model is
o Null hypothesis:
H0: β₁ = β₂ = β₃ = … = β_k = 0
Coefficients (slopes) are jointly zero
F-test is a joint test. If even one is NOT 0, then reject the null hypothesis
If all are 0 (do NOT reject null), then variables are useless
o Alternative hypothesis:
H1: β i ≠ 0 (at least one coefficient is not zero/can predict the dependent variable)
o Reject Null
When p-value (significance F) < α
Conclude: the fitted model is useful in predicting the dependent variable
o Individual t-test
o If reject null in F-test, may want to check significance of each coefficient
o t-test statistic: t = (β̃ᵢ − 0) / SE(β̃ᵢ) = coefficient / standard error of coefficient
o Null hypothesis:
H0: β i = 0
Coefficient (Slope) of a particular variable (predictor) = 0
It is NOT a joint test like F-test
If the null hypothesis is NOT rejected → that variable is useless
o Reject Null
When the coefficient's p-value < α
Conclude: the variable is useful in predicting the dependent variable
Since t-statistic is significant, then advisable to include this extra variable
o Keep in mind to choose models based on:
o Interpretability & Parsimony
Parsimony: less complex, less predictors (simple structure)
Forecasting methods
Forecast Errors
Forecast Origin: Time at which forecast is made
Forecast Horizon: Time period to which the forecast relates
Forecast Error: Difference between actual value and forecasted value from the fitted model
o Smaller forecast errors → better forecast method
o h-step-ahead forecast: Ft+h → forecast for period t+h made at time t
o h-step-ahead forecast error: et+h = Yt+h – Ft+h → error of forecast (actual – forecast)
o Measures of forecast error: (can choose whichever to minimize; they tend to make the others small)
Bias:
Arithmetic average of the errors
o Bias = (1/n) Σ_{t=1}^n e_t = (1/n) Σ_{t=1}^n (Y_t − F_t)
 Excel: =AVERAGE(actual − forecast) over the two ranges (array formula)
Limitation:
o When errors are 0, 0, 0, 0 → Bias gives us 0 error
o When errors are 10, −10, 10, −10 → Bias also gives us 0 error → Bias cannot differentiate these; use MAD instead
MAD (Mean Absolute Deviation):
o MAD = (1/n) Σ |e_t|
 Excel: =AVERAGE(ABS(actual − forecast)) (array formula)
MSE (Mean Squared Error):
 o MSE = (1/n) Σ e_t²
 Limitation:
 o MSE penalizes large errors because the errors are squared
  NOT in the same unit as the data (it's a squared unit)
RMSE (Root Mean Square Error)
Root Mean Squared Error (similar to standard deviation of the sampling distribution)
o Square Root of the MSE → so RMSE is the same unit as the data
o RMSE = √MSE = √[(1/n) Σ e_t²]
 Excel: =SQRT(AVERAGE((actual − forecast)^2)) (array formula)
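The four error measures above, as a minimal Python sketch (the actual/forecast values are made up):

import numpy as np

actual   = np.array([100, 110, 105, 120], dtype=float)   # hypothetical data
forecast = np.array([ 98, 112, 104, 118], dtype=float)
e = actual - forecast                     # forecast errors

bias = e.mean()                           # can hide offsetting +/− errors
mad  = np.abs(e).mean()                   # same unit as the data
mse  = (e ** 2).mean()                    # squared unit; penalizes large errors
rmse = np.sqrt(mse)                       # back in the data's unit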
Prediction Intervals
Point forecast: sub in values of 26th month to get a single data point (forecasted)
Point forecast may differ from actual value
Chance that they are the same is very small
“Future Expected Sales at time = ___ are [point value]”
Interval forecast: forecasted values will fall within a certain range, with a 95% confidence
Compute lower and upper confidence intervals:
o Ft+1 ± z x SE (lower: Ft+1 – z x SE, upper: Ft+1 + z x SE)
F_{t+1} is the point estimate (eg. of log sales, if the SE comes from the log-sales model)
z: (value from Normal tables)
90% confidence interval → z = 1.645
95% confidence interval → z = 1.960
99% confidence interval → z = 2.576
SE: (estimated standard error of residual/error)
Taken from regression output
Regression Statistics
Multiple R 0.999999996
R Square 0.999999992
Adjusted R Square 0.999999992
Standard Error 0.000462371
Observations 25
o "However, Future Expected Sales may be as low as [lower interval] under bad conditions, and as high as [higher interval] under good conditions."
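A small Python sketch of the interval computation, using the Standard Error from the output above; the point forecast is a made-up value, and the exp() step applies only if the model was fitted on ln(sales):

import math

f_next = 5.321         # hypothetical point forecast, e.g. for ln(sales)
se = 0.000462371       # "Standard Error" from the regression output above
z = 1.960              # 95% confidence

lower, upper = f_next - z * se, f_next + z * se
# If the model was fitted on ln(sales), transform back to the sales scale:
lower_sales, upper_sales = math.exp(lower), math.exp(upper)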
LECTURE 8
How to model both Trend and Seasonality (Whole_foods.xlsx)
1. Deseasonalize the data
 Find "Ratio" (= Sales / Average Sales)
 Seasonal Index (SI): average of all the ratios for a particular season
 o Eg. for Q1, take the average of all Q1 ratios
 o Average of Seasonal Indices = 1:
  If SI > 1, sales in that season are higher than average (eg. 1.157 → 15.7% higher than average)
  If SI < 1, sales are lower than average (eg. 0.919 → 8.1% lower than average)
 o Sum of Seasonal Indices = no. of seasons
  Eg. if 4 quarters, sum of seasonal indices = 4
  Eg. if 12 months, sum of seasonal indices = 12
 Divide Sales by SI → impact of season disappears, only Trend left
2. Get a Forecast for Deseasonalized data
Choose an appropriate Trend model (since now only have trend component) by plotting
Deseasonalised Sales and Time to see the trend
Eg. If observe a linear trend: Data Analysis > Regression
Deseasonalised Sales: y-variable. Time: x-variable
Fitted model: deseasonalized Sales = intercept + slope x Time
Eg. if observe an exponential trend (nonlinear trend): Data Analysis > Regression
ln (deseasonalized Sales): y-variable. Time: x-variable
Fitted model is double transformed data → ln & deseasonalized Sales
ln (deseasonalized Sales) = intercept + slope x Time
To cancel ln, take exponential on both sides:
 o Deseasonalized Sales = e^intercept ∙ e^(slope × Time)
  Expected sales at time = 0 are e^intercept
  Sales expected to increase by e^slope − 1 every period (eg. month/quarter)
3. Reseasonalize: Point Forecast
Multiply forecast with Seasonal Index to get forecast for actual data
Eg. for the exponential trend model: Sales = e^intercept ∙ e^(slope × Time) × SI
 Eg. if the initial time point is Q1 of 2007, to find the point forecast for Q3 of 2008: substitute the corresponding Time value and multiply by the Q3 seasonal index
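A compact Python sketch of the three steps, using the simplified ratio-to-average seasonal index from step 1 (the quarterly sales figures are invented):

import numpy as np

sales = np.array([120, 90, 100, 130, 126, 95, 104, 137], dtype=float)  # made up
quarter = np.arange(len(sales)) % 4

# 1. Deseasonalize: ratio to average sales, averaged per season
ratio = sales / sales.mean()
si = np.array([ratio[quarter == q].mean() for q in range(4)])
si *= 4 / si.sum()                       # force seasonal indices to sum to 4
deseason = sales / si[quarter]           # season removed, only trend left

# 2. Fit a trend model on the deseasonalized data (linear trend here)
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, deseason, 1)

# 3. Reseasonalize: multiply the trend forecast by the seasonal index
t_next = len(sales)
point_forecast = (intercept + slope * t_next) * si[t_next % 4]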
Model Assumption: dependence between Y and X’s is stable overall, in the past and in the future
The MR (multiple regression) model is better for long-term forecasts compared to a time series model.
How to Model if Underlying Pattern is NOT Apparent? (NO obvious trend or seasonality)
Naïve forecast:
Forecast = Last Observation (Ft+1 = Yt)
Naïve forecast only makes sense if history repeats itself
Traces values well, but lags behind
Naïve forecast corresponds to a random walk model (non-stationary)
Limitation 1:
o Only appropriate for immediate/short term forecast
o Always same value for long-term prediction (eg. 10 yrs)
Yt+1 = Yt; Yt+10 also = Yt (because Yt is last known)
So will be a straight line after the last known value
Limitation 2:
o Consists of past errors (random noise) as well
o Use Smoothing Out Method to ‘smooth out’ past errors
How to smooth?
Simple Moving Average (SMA); the value of n can be chosen using Excel (eg. by minimizing error):
o Forecast is the average of past n observations (NOT all observations)
F_{t+1} = (Y_t + Y_{t−1} + … + Y_{t−n+1}) / n
n = number of past observations averaged (how far back we look)
the n observations are treated equally → equal weights
o Larger no. of n → smoother forecast (stable BUT less accurate)
Higher MAPE (percent error)
Use large n if expect there to be little or no change in the future
o Smaller no. of n → more responsive to changes (less stable BUT more accurate)
Lower MAPE (percent error)
Use small n if expect there to be change
Because small n is more responsive to changes
Easily influenced by outliers. Use median instead of mean
o Limitation:
May miss trends (eg. downward trend) as the data gets average out
Weighted Moving Average (WMA); the values of n and the weights w can be chosen using Excel:
o Weighted Average of past n observations (NOT all observations)
o F_{t+1} = w1·Y_t + w2·Y_{t−1} + … + wn·Y_{t−n+1}
 Higher weights assigned to more recent observations (in most cases, but not always)
w1 > w2 > … > wn
weights reflect relative importance of each previous observation
o higher importance given to more recent data → may reveal trends
Weights sum up to 1
w1 + w2 + … + wn = 1
o WMA more flexible than SMA (with equal weights)
Eg. SMA may miss a downward trend
WMA give higher importance to recent data, which can reveal downward trend
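Both moving averages as a minimal Python sketch (series and weights invented; the weights must sum to 1):

import numpy as np

y = np.array([52, 55, 53, 58, 60, 57, 61], dtype=float)   # hypothetical series
n = 3

sma_next = y[-n:].mean()                  # SMA: equal weights on last n values

w = np.array([0.5, 0.3, 0.2])             # WMA: w1 (most recent) ... wn, sum = 1
wma_next = np.sum(w * y[::-1][:n])        # w1*Y_t + w2*Y_{t-1} + w3*Y_{t-2}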
Single Exponential Smoothing (SES):
o Idea: the prediction of the future depends mostly on the most recent observation, and on the
latest forecast
o Weighted moving average with exponentially decreasing weights that are controlled by smoothing
constant α
Ft+1 = α Yt + (1 – α )Ft
Smoothing constant α:
 weight α is given to the most recent observation, weight 1 − α to the previous forecast
 SES is a self-learning procedure → automatically corrects the previous forecast by
 considering the forecast errors it made in the past
Denotes importance of the most recent observation
o How much our forecast will react to previous forecast error
Smaller α :
There is little reaction to previous error
No need to update forecast so much
Smoother and stable to sudden changes
Selection of the initial forecast is more important
Larger α :
There is a lot of reaction to previous error
Update a lot: should depend on more recent observations
Less smoothing effect
Follow historical values closely
α tells us how much we should update our current forecast from previous forecast
o α tells us if previous forecast is trustable or not
o If α = 0:
Previous forecast error has no impact on current forecast
No need to update previous forecast
past forecast = current forecast
Forecasts over time are similar to each other
o Flatter forecast curve
Same as naïve forecast for long-term forecast (last known Ft)
Since we only have 1 most recent observation → if predict
far into the future, it’s the same value
o If α = 1:
Previous forecast is not good at all → need to update
Depend on most recent observation
Most recent observation = current forecast
Same as naïve forecast
Can be rearranged as Ft+1 = Ft + α (Yt – Ft)
Yt – Ft = forecast error in the past
Interpretation: Forecast + Correction on previous forecast error
α tells you whether your previous forecast is trustable or not
Can be rearranged as F_{t+1} = αY_t + α(1−α)Y_{t−1} + α(1−α)²Y_{t−2} + …
 Weights decline exponentially into the past
 Distant values get smaller weights
o How to choose initial forecast? (2 methods)
Naïve forecast
Just copy from previous observation
Simple Moving Average (to better smooth out error)
Take average of previous four or five observations
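SES as a small Python sketch, with the naïve choice of initial forecast (series invented):

import numpy as np

def ses(y, alpha, f0=None):
    """F_{t+1} = alpha*Y_t + (1 - alpha)*F_t."""
    f = np.empty(len(y) + 1)
    f[0] = y[0] if f0 is None else f0   # initial forecast: naive, or pass an SMA
    for t in range(len(y)):
        f[t + 1] = alpha * y[t] + (1 - alpha) * f[t]
    return f                            # f[-1] is the one-step-ahead forecast

y = np.array([52, 55, 53, 58, 60, 57, 61], dtype=float)
forecasts = ses(y, alpha=0.3)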
Disadvantage of SES:
SES does NOT consider trend (or seasonality)
o Eg. If there is a trend in the data:
Regular exponential smoothing will always lag behind the trend
However, SES has Advantages:
o Considers all past available observations (better than SMA & WMA)
o Can find the optimal smoothing constant
Therefore, we modify the method → Holt-Winters’ Method (considers trend)
Holt-Winters' Method: extension of the Single Exponential Smoothing method when there is trend or seasonal variation (note: even if there is NO trend or seasonality, Holt-Winters can still be used)
If there is both trend & seasonality, introduce additional smoothing constants (Beta for trend, Gamma for seasonality):
o α → smoothing constant for the data
o β → smoothing constant for the trend
o γ → smoothing constant for seasonality
Idea:
o Split the effects of level and trend and seasonality
F_{t+1} = (L_{t+1} + D_{t+1}) × S_{t+1}, OR
F_{t+h} = (L_{t+1} + h·D_{t+1}) × S_{t+h−M} (1 < h ≤ M)
Forecast for h periods in future: (new Level + [h x new trend]) x seasonal component
M: length of seasonality (no. of periods in the season)
1. Exponentially Smooth Series: (same as SES)
 o L_{t+1} = α(Y_t / S_{t+1−M}) + (1 − α)(L_t + D_t)
  Actual observation divided by the seasonal index S_{t+1−M} (deseasonalize, to ignore the impact of seasonality) + previous forecast for the level
2. Trend Estimate: (same as Holt’s method)
o Dt+1 = β (Lt+1 – Lt) + (1 – β )Dt
Most recent observation level for trend + previous forecast for trend
3. Seasonality Estimate:
 o S_{t+1} = γ(Y_{t+1−M} / L_{t+1−M}) + (1 − γ)S_{t+1−M}
  Most recent observation divided by the level (to ignore the impact of trend) + previous forecast for seasonality
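One Holt-Winters update step, translated directly from the three equations above into a Python sketch (variable names are mine; a real implementation would loop this over the whole series and needs initial values for level, trend and the seasonal indices):

def hw_update(y_t, y_lagM, L_t, D_t, S_lagM, L_lagM, alpha, beta, gamma):
    """One multiplicative Holt-Winters update; *_lagM denotes index t+1-M."""
    L_next = alpha * (y_t / S_lagM) + (1 - alpha) * (L_t + D_t)     # 1. level
    D_next = beta * (L_next - L_t) + (1 - beta) * D_t               # 2. trend
    S_next = gamma * (y_lagM / L_lagM) + (1 - gamma) * S_lagM       # 3. season
    F_next = (L_next + D_next) * S_next                             # forecast
    return L_next, D_next, S_next, F_next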
Smoothing constants α , β , γ tell us:
o How good their respective previous forecasts are
How much we should update our previous forecast (the greater the values of α, β and ϒ,
the more updating needed)
eg. if α near 0 → previous forecast of level is good, no need to update much
eg. if β = 1 → previous forecast of trend is not good, need to continuously update it
eg. if γ = 0.43 → somewhat neutral, seasonal forecast should be updated over time
o eg. over time, magnitude of seasonal variation becomes larger and larger. If
use constant seasonal index, will miss increase in change in seasonal impact
o If α, β, γ = 0, it does NOT mean that there is no trend or seasonality
 Just means the initial forecasts for level/trend/season are good enough
 No need to update the initial forecasts
o If α, β, γ = 1, all components need to be updated continuously.
If the underlying pattern is not apparent, use smoothing methods, which make no assumptions on the trend and seasonal components. Only appropriate for immediate/short-term forecasts.
If the underlying pattern of the series is clear, use regression-based modelling methods for forecasting. If there is a linear pattern, a linear trend model and an additive model of seasonal patterns are appropriate. Otherwise, use a log transformation to obtain linearity, which leads to an exponential model or a multiplicative model.
If the data show both trend and seasonality, use the ratio-to-moving-averages method and the HW model
Simple Exponential Smoothing Prediction Interval
Forecasting method most effective when parameters for the trend and seasonal components may be changing over
time.
ŷ_{T+1} = αy_T + α(1−α)y_{T−1} + α(1−α)²y_{T−2} + …, and similarly for ŷ_{T+2}
The formula shows the coefficients of the y_t's on the r.h.s. decrease exponentially with time.
LECTURE 9:
Autocorrelation: time series depends on its own past values (serial dependence)
To determine if there is autocorrelation:
o Copy data, then paste it one row down to get Lag-1 data (so that Yt+1 = Yt for lag-1 data)
o =CORREL function to get autocorrelation
o Visual analysis: lagged scatterplot (e.g. original vs. lag 1, lag 1 vs lag 2)
o Copy data, then paste it one row down to get Lag-1 data. Insert > Scatterplot
o from scatterplot:
Highly correlated if
Strongly clustered around straight line (there is autocorrelation)
if Random scattering: indicates that NO autocorrelation
o Value at time t independent of values at other times
o Past values CANNOT be used to predict future values
Sign of correlation:
Downward sloping: negative correlation
Upward sloping: positive correlation
HOWEVER, visual analysis is just a rough idea
For more precise values, calculate the autocorrelation
Quantitative value: autocorrelation function (ACF) (how to compute autocorrelations)
o ρ_k = E[(Y_t − μ)(Y_{t−k} − μ)] / σ²
Interpretation: Covariance of 2 random variables/variance of the time series
Same time series, therefore the 2 variables use the same mean, and same variance
k = time lag
Autocorrelation is a function of the time lag k
o Sample Autocorrelation Function (ACF)
for measuring autocorrelations in samples, instead of population
eg. lag-1: (measure autocorrelation between 2 successive observations)
ρ̂₁ = Σ_{t=2}^T (Y_t − Ȳ)(Y_{t−1} − Ȳ) / Σ_{t=1}^T (Y_t − Ȳ)²
With these results, we can introduce individual test (below)
Higher order autocorrelations:
 ρ̂_k = Σ_{t=k+1}^T (Y_t − Ȳ)(Y_{t−k} − Ȳ) / Σ_{t=1}^T (Y_t − Ȳ)²
 Numerator: T − k terms
 Denominator: T terms
Limitations:
o If k > T, impossible to estimate (k: time lag, T: sample size)
o As k increases, accuracy becomes lower
o Rule of thumb: T ≥ 50 and k ≤ T/4 (¼ of the sample size)
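The sample ACF formula above as a Python sketch:

import numpy as np

def sample_acf(y, k):
    """Sample autocorrelation at lag k (formula above: T - k terms over T terms)."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    num = np.sum((y[k:] - ybar) * (y[:-k] - ybar))
    den = np.sum((y - ybar) ** 2)
    return num / den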
o Why compute autocorrelation?
Our interest: If autocorrelations are always zero
If zero, may not consider time series model like AR or MA (no serial dependence)
If not, can use past value to predict future value
o 1st test: Individual Test for Autocorrelation:
Null Hypothesis: autocorrelation (ACF) = 0
H0: ρk =0 → no autocorrelation, time series is random
Applied to only an autocorrelation at any lag k (individual)
Eg. lag-1 autocorrelation: null hypothesis is ρ1 = 0
Eg. lag-k autocorrelation: null hypothesis is ρk = 0
Because the sample estimator is Normally distributed, reject at the 5% level when:
 |ρ̂_k| > 2/√T
 Only values larger than 2 SDs (2/√T) indicate significance at the 5% level
 o Because 95% of the values should be within 2 SDs
o Reject null hypothesis → can use past values to forecast future values
o Conclude that the time series is NOT random
Correlogram
Each correlation is displayed as a bar, to give us an idea of the sample
autocorrelation
x-axis: time lag k, from 1 to a large number
Each bar tells the autocorrelation btw lag k and original data
Dashed lines: 5% significant limits at ± 2/√ T
o T = no. of observations (eg. daily prices of stock)
o Any bar beyond dashed lines: significant
 Reject the null
o If within the limits: insignificant
 Do NOT reject the null
Limitation:
Need to repeat it many times, time consuming
o What if when more than 1 autocorrelation (eg. lag-1 to lag-100)?
o 2nd test: Joint Test for Autocorrelation (Ljung-Box test / Q test):
Null Hypothesis: first m autocorrelations are jointly 0
H0: ρ1= ρ2=…=ρm=0
If even one ρ ≠ 0, then reject H0
m: no. of autocorrelations you are jointly testing
Q(m) = T(T + 2) Σ_{k=1}^m ρ̂_k² / (T − k)
Q(m) follows a Chi-square distribution: χ²(m)
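A sketch of the Q test, assuming statsmodels is available (white-noise data simulated for illustration):

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(1)
y = rng.normal(size=200)                 # hypothetical series (here: white noise)

lb = acorr_ljungbox(y, lags=[10], return_df=True)   # joint test of first 10 ACFs
print(lb)   # lb_stat is Q(10); lb_pvalue < 0.05 → reject "jointly zero"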
o 3rd test: Durbin-Watson test (autocorrelation check for lag-1 residuals only)
o Used to test for first order Autocorrelation between 2 sequential errors (lag-1)
o Limitations:
Cannot detect for higher order autocorrelations
Possible that lag-1 autocorrelation is insignificant, but how about lag-2, lag-3 etc.
If fall within inconclusive region (white), we don’t know what to do
However, DW test usually found in business reports, so we learn it
o Null hypothesis: No lag-1 autocorrelation (e.g. original and lag 1, lag 1 and lag 2)
H0: ρ1=0
d = Σ_{i=2}^n (ϵ̂_i − ϵ̂_{i−1})² / Σ_{i=1}^n ϵ̂_i²
Numerator: Sum of [error at time i (starts from 2), minus the previous error, squared]
Denominator: Sum of [squared error for each time i]
o Test statistic d is approximately related to autocorrelation of order 1:
0 ≤ d ≈ 2(1 − ρ̂) ≤ 4
 When ρ̂ = 0 → d ≈ 2
 When ρ̂ = 1 → d ≈ 0
 When ρ̂ = −1 → d ≈ 4
Easier to use p-value
Since tables for critical values are NOT always readily available
If p-value < α , reject null
Reject null when d falls in the 2 rejection regions (near 0 or near 4)
There is autocorrelation
When d close to 0: positive autocorrelation
When d close to 4: negative autocorrelation
Do NOT reject null:
No evidence of autocorrelation
Two consecutive errors are NOT correlated with each other
White Area:
Inconclusive region, we don’t know what to do
dL and dU:
critical values, can be found in table for DW test
fixed number, but varies for different no. of:
o Observations
o Variables
o Interpretation:
If there is Autocorrelation (Errors NOT independent):
Formula used to compute Standard Error is wrong
o Thus confidence interval & hypothesis test will be wrong
o Errors are supposed to carry no information (be unpredictable)
Least squares estimator
o Still linear and unbiased (expectation = true parameter)
o BUT it is NOT efficient (less accurate estimator)
Stationarity
o If NOT stable, relationship is changing over time → prediction is NOT dependable anymore
o We want stability/stationarity, so that relationship does NOT change over time
Conclusion:
 If y_t (or z_t) fluctuates with constant variation around a constant mean, it is reasonable to conclude that the time series (or its first differences z_t) is stationary.
ϕ slope: AR coefficient (constant) → take from SAS output (e.g. AR1,1)
o if ϕ is zero, then middle term disappears
Past values CANNOT be used to predict future values
o If ϕ is large, past values strongly influence future values
δ (delta): Intercept (constant) → NOT the mean (don't take MU directly from SAS output)
 μ = δ/(1 − ϕ), i.e. δ = μ(1 − ϕ), where ϕ is the AR coefficient (e.g. AR1,1)
e_t: the residuals should show no autocorrelation (residual autocorrelations ≈ 0)
o Deviation between actual value and fitted value is due to random shock et
o Assumptions for residuals:
1. Zero mean (residual plot of residuals all around the x-axis)
2. Constant Variance
3. Mutually Uncorrelated (Independent, random)
Past errors do NOT depend on current error, vice versa
o AR(1) model is stationary only if: (stationarity condition)
−1< ϕ<1 (note: there is NO equals sign)
Stationary models have:
 Constant mean: μ = δ/(1 − ϕ)
 Constant variance: γ₀ = σ_e²/(1 − ϕ²)
 Constant ACF: corr(Y_t, Y_{t−k}) = ρ_k = ϕ^k, k = 1, 2, …
o Use this to derive true autocorrelations:
Eg. lag-3 correlation = ϕ³
o Sample autocorrelation can differ from the true autocorrelation (ϕ^k)
Due to random noise
However, it is very close, and reflects similar patterns
When the true autocorrelation is zero, the sample autocorrelation
function can be negative.
AR(1) model can also be represented by its demeaned (- μ from both sides) series:
o Y_t − μ = ϕ(Y_{t−1} − μ) + e_t
Stationarity (−1< ϕ<1) is important so that the data shows mean reversion:
Y_t = δ + ϕY_{t−1} + e_t; given Y₁, δ and ϕ (ignoring e_t), Y_t can be calculated forward
Mean reversion:
o Future values of stationary time series always fluctuate around its mean
o If non-stationary: value of time series becomes explosive → confusing
NOT able to estimate future values accurately
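A quick Python simulation of a stationary AR(1), illustrating mean reversion around μ = δ/(1 − ϕ) (parameters made up):

import numpy as np

rng = np.random.default_rng(2)
delta, phi, T = 2.0, 0.6, 500          # |phi| < 1 → stationary
y = np.empty(T)
y[0] = delta / (1 - phi)               # start at the theoretical mean (5.0)
for t in range(1, T):
    y[t] = delta + phi * y[t - 1] + rng.normal()

print(y.mean())                        # fluctuates around mu = 5.0 (mean reversion)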
o To check adequacy of AR(1) model:
Residuals e_t should show NO autocorrelation
Check Autocorrelation for residuals (Ljung-Box test to test autocorrelation of residuals)
Null: there is no autocorrelation of residuals
Check Pr>ChiSq: if it is greater than 0.05, do not reject the null → no autocorrelation of residuals → adequate model
Check Pr>ChiSq: if it is less than 0.05, reject the null → there is autocorrelation of residuals → inadequate model
If residuals are autocorrelated:
Means AR(1) model does NOT successfully capture data’s characteristics
Indicates “left-over” dependence → model is NOT adequate
Try higher order AR models (more lagged values)
AR(p) → higher order AR model
o Depends on p lagged values
Lag order p refers to the last lag value
o Y_t = δ + ϕ₁Y_{t−1} + … + ϕ_p Y_{t−p} + e_t
Same assumptions for residuals, et:
Zero mean
Constant Variance
Mutually Uncorrelated (Independent, random)
Eg. for AR(2):
 Mean: μ = δ/(1 − ϕ₁ − ϕ₂)
 ACF: ρ₀ = 1, ρ₁ = ϕ₁/(1 − ϕ₂), ρ_k = ϕ₁ρ_{k−1} + ϕ₂ρ_{k−2}
o Higher order AR models have more complex ACF patterns
When lag k becomes larger, autocorrelations become smaller → pattern does NOT change
Sample autocorrelations are good estimators of population autocorrelation
o Partial autocorrelations for lag order higher than p are zero: π_kk = 0, for all k > p
Partial Autocorrelations (PACF):
o Amount of correlation between a time series and a lag of itself that is NOT explained by correlations
at all lower order lags
Also Normally distributed, with standard deviation 1/√T
Remove impact of lag-1 autocorrelation
Eg. lag-1 autocorrelation between wife and husband
Husband and mother-in-law also lag-1 autocorrelation
When wife tells husband things, husband may tell it to mother-in-law
Partial autocorrelation is autocorrelation between wife and mother-in-law
o π_kk = Corr(Y_t − P(Y_t | Y_{t+1}, …, Y_{t+k−1}), Y_{t+k} − P(Y_{t+k} | Y_{t+1}, …, Y_{t+k−1}))
 P(W|Z) is the 'best linear projection' of W on Z
No need to compute this manually, but understand the SAS
Regression of data based on AR(p)
o SAS output: autocorrelation bar charts; the blue portion marks the ±2 SD region; reject (significant) if a bar extends beyond the region
o How to choose AR order:
ACF plot: (in below example)
We see when k increase, sample ACF declines over time
But, PACF plot:
The p partial autocorrelations are significant if they are above dashed lines
Eg. Cutoff at lag-2 → indicates the data follows AR(2) model
For larger values of k, (eg. k = 15 in the plot above) might see the bar larger than 2 SDs.
However, remember that for large k, our estimation is NOT that accurate
Can ignore it
o it appears only because of the random noise
o The ±2 SD band is only a rough confidence interval, and such a bar is only weakly significant
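A sketch of the ACF/PACF order-selection plots, assuming statsmodels and matplotlib are available (an AR(2) series is simulated, so the PACF should cut off at lag 2):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(3)
y = np.zeros(300)
for t in range(2, 300):                     # simulate a stationary AR(2)
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

fig, axes = plt.subplots(2, 1)
plot_acf(y, lags=20, ax=axes[0])            # should decline gradually
plot_pacf(y, lags=20, ax=axes[1])           # should cut off after lag 2 → AR(2)
plt.show()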
Moving Average Model (MA):
Motivated by impact of forecast error
o Eg. we find that future exchange rates do not depend on past exchange rates, instead they depend
on previous forecast error (discrepancy of actual from forecast in the past)
MA(1):
o Y_t = μ + e_t + θ₁e_{t−1}
o Current value Yt, depends on 1 previous forecast error (lag-1 error) et-1 and current error et
et: current forecast error
et-1: previous forecast error
μ: intercept (note: SAS reports MA coefficients with the opposite sign; take the negative of the SAS MA estimates)
If slope θ = 0: current value is random (only depends on current random noise)
If slope θ is large (-ve or +ve)
it indicates that the previous forecast error has strong influence on current value
Error:
Current or Past, we require it to have:
o Zero mean
o Constant variance
o NO autocorrelation
o Difference between AR and MA:
AR includes lagged terms of time series itself
MA includes lagged terms of the noise/residuals of the time series
o Link between AR and MA:
MA model can be reformulated as AR(∞ ) model
Advantage of MA model:
Instead of using a higher order of AR model to forecast eg. foreign exchange rate,
can just use a compact and simple model MA(1) model to explain the same thing,
because MA(1) is identical to higher order AR model
o MA(1) only has one unknown parameter, higher order AR has many
o Accuracy also improves when you use a simple model
o Property of MA(1) model:
Has only one non-zero (significant) autocorrelation at k = 1 (NOT partial autocorrelation)
It is a lag-1 autocorrelation (k = 1)
The rest are zero (insignificant)
Higher order MA(q) model:
o Y_t = μ + e_t + θ₁e_{t−1} + … + θ_q e_{t−q}
Same error assumptions:
Zero mean
Constant variance
NO autocorrelation
o MA models are stationary
E(Y_t) = μ (MU in the SAS output; remember the MA coefficients take the opposite sign)
Var(Y_t) = (1 + θ₁² + … + θ_q²)σ_e²
Corr(Y_t, Y_{t−k}) = 0 if k > q
 Autocorrelations are always 0 if the time lag > q
o To determine how many MA terms are needed:
If sample ACF is significant at lag q, and NOT significant at higher lags
Correlogram cuts off at lag q
PACF declines over time
Then, should choose MA(q) model → order q
Summary:*****
PACF cutoff at p, ACF exponentially declines over time → AR(p) model
ACF cutoff at q, PACF exponentially declines over time → MA(q) model
BOTH PACF and ACF exponentially decline over time → ARMA, but have to fit p & q by trial and error
BOTH PACF and ACF, all 0 → white noise
Note: if ACF lag 1 & lag 3 are significant, but lag 2 is insignificant → use MA(3) if lag 3 is strongly significant, otherwise use MA(1)
 In SAS, can just type "1, 3" → to indicate that lag 2 is excluded
Note: if residual autocorrelations are strongly significant, then the model is NOT adequate → consider higher
orders
If weakly autocorrelated → can ignore the autocorrelations
Note: if data is stationary, it does NOT mean data is autocorrelated, vice versa. 2 separate concepts.
Eg. random errors (with zero mean, constant variance) are stationary, but autocorrelations = 0
Eg. non-stationary data may have serial dependence (though that dependence is changing over time)
The 2 ICs (information criteria, AIC and BIC) balance:
Forecast accuracy
Model complexity
o with more predictors, the in-sample forecast error automatically becomes smaller → small in-sample error alone doesn't make sense as a criterion
o penalizes unnecessarily complicated models
How to use IC?
The smaller the IC, the better the fitted model:
Box-Jenkins Methodology
1. Model Identification
Use sample ACF and sample PACF to choose model
If both do NOT have cutoff → consider ARMA model
Use AIC and BIC to choose p and q for ARMA
SAS:
Plots and results > check “Actual values plot”
To get ACF and PACF
2. Model Estimation
Estimate unknown parameter → SAS will do it
SAS:
i. Enable estimation steps > check “Perform estimation steps”
ii. Model definition
Add p for AR model > click "Add" (note: to consider AR(3), type "1,2,3" > Add; NOT just "3", or it won't consider lag-1 and lag-2)
 Add q for MA model > click "Add"
3. Model Validation
 Check the adequacy of the model, with focus on the residuals
If autocorrelation in residual → indicates selected model is NOT adequate in explaining serial
dependence → should consider higher order model
How to check existence of autocorrelation? 3 tests:
1. Individual test
 |ρ̂_k| > 2/√T → reject the null that there is no autocorrelation
2. Joint test (Q-test or Ljung-Box test)
3. DW test
Limitation of DW test → only for lag-1 autocorrelation (not higher order)
Check the statistical significance of the coefficients by checking their p-values. p-value < 0.05 → significant coefficient
SAS output: Conditional Least Squares Estimation (columns: Parameter, Estimate, Standard Error, t Value, Approx Pr > |t|, Lag)
 MU: mean estimate (eg. Estimate 825.850351, Standard Error 25.15334729, t Value 32.83, Pr > |t| 0.0000)
 AR1,1: AR coefficient
 MA1,1: −θ, the MA coefficient → always take the opposite sign of the SAS estimate!
 t Value = Estimate / Standard Error
 Approx Pr > |t| is the p-value of the test of whether each coefficient is significant
 For an AR model: Constant Estimate (intercept) = MU × (1 − sum of AR coefficients)
To see if the fitted model is good: look at the Autocorrelation Check of Residuals (Q-test / Ljung-Box test); columns: To Lag, Chi-Square (test statistic), DF, Pr > ChiSq, Autocorrelations (lag-1, lag-2, …)
 Null hypothesis of the joint test: no autocorrelation (residual autocorrelations are jointly 0)
 DF is reduced by the number of estimated parameters: if the residuals depend on eg. 1 parameter (your AR(1) coefficient), DF is reduced by 1; for eg. ARMA(3,1) there are 4 unknown parameters (AR1, AR2, AR3, MA1), so DF is reduced by 4
 p-value smaller than 5% → reject the null that the autocorrelations are jointly 0 → the residuals are autocorrelated and the model is not adequate
95% confidence interval:
 Upper limit of the forecast for the next period (obs = 61) = 98.152 + 1.96 × 1.025 = 100.161 (Standard Error = 1.025)
 Standard error of the future forecast (obs = 62) based on the current forecast (obs = 61):
 o The standard error increases to 1.052 because the forecast is based on obs = 61, which is itself a forecast. The standard error increases due to the accumulation of forecast error.
Parsimony of model
Higher AR models are complex even though they may be more adequate. Always use the simplest possible model
based on PACF (tells AR) and ACF (tells MA).
LECTURE 10: UNIT ROOT AND PAIRS TRADING
Stationarity Condition:
AR Coefficient: −1< ϕ<1 (note: there is NO equals sign)
There are 3 parts:
o Constant Mean
o Constant Variance
o Constant ARF structure
If |ϕ| < 1 → Stationary data
If ϕ = 1 → Random Walk (non-stationary data)
If |ϕ| > 1 → Explosive data
AR, MA, ARMA can ONLY be applied to stationary data
Non-Stationary Models
Non-Stationarity
Non-stationarity causes problems:
Time Series Analysis extrapolates historical patterns into future via statistical methods
o If non-stationary, history doesn’t necessarily repeat itself, methods may fail
If we see trend in data, it indicates that the data is NOT stationary, because expectation of mean level is changing
over time (another way to check for stationarity besides checking the value of -1 < φ < 1)
2 kinds of non-stationarity are important:
1. Trend-stationary (TS): Y_t = ϑ + βt + ε_t
o Deterministic trend
ϑ + βt
Intercept + slope x time → time dependent
If we can use formula to describe the trend, it is deterministic
Future value of trend is fixed, NOT Random → intercept and slope do not change
o However, it is non-stationary:
Mean is time dependent
o HOWEVER, the demeaned series becomes stationary
 Remove the mean from the data:
 Data − expectation: Y_t − (ϑ + βt) = (ϑ + βt + ε_t) − (ϑ + βt) = ε_t
 The trend terms cancel; what's left is the noise → error with mean 0
o Error is stationary because:
Constant mean
Constant variance
Constant ACF
2. Difference-stationary (DS):
o A process with a stochastic trend or a unit root (ϕ = 1)
Eg. Random Walk (e.g. financial series: stock prices, exchange rates), Random Walk with Drift
o Non-stationary data
o Even if you compute demeaned series, it is still non-stationary
Because what’s left is the SUM of noise
How to obtain stationary data from DS data?
Differencing the data
Eg. I(0):
 Original data is already stationary (no need to do any transformation)
Eg. I(1):
 Original NOT stationary
 First order difference (change of price) is stationary
Eg. I(2):
 Original NOT stationary
 First order difference NOT stationary
 Second order difference (change of change) is stationary
o After making data stationary, select eg. ARMA model:
NOT for original data, but for differenced data
We should NOT use ARMA (etc.) model for non-stationary
After you difference and forecast, you can compute it back to get forecast for original data
Just use forecast from fitted model + lagged value
e.g. first order differenced model. Given current price = 253 and forecast rate of return = 2.3%:
 Forecast price = 253 × 1.023 ≈ 259
 Given forecast interval [1%, 4%]: forecast price interval = [253 × 1.01, 253 × 1.04] ≈ [256, 263]
If second order differencing used
o Model forecasts the change of rate of return
o Given rate of return calculate forecast return of rate calculate price
Unit-root tests to test for stationarity
Consider α like ϕ: if −1 < α < 1, the data is stationary; if α = 1, it is non-stationary
Define: ρ=α −1
First question: is my data stationary or not? Test the estimator α̂:
o Null Hypothesis, H0 in t-test: α = 1 (data is NOT stationary)
o t-statistic = (estimator − target) / SE of estimator = (α̂ − 1) / SE(α̂)
Results:
 o If |α| < 1, the data is stationary
 o If α = 1 (ρ = 0), the data follows a random walk → data is NOT stationary
HOWEVER, if data is non-stationary, test statistic is NOT t-distributed anymore
Instead, it is Dickey-Fuller distributed
Use new critical value (shifts left) from DF distribution, NOT t-distribution
Called a Dickey-Fuller test, instead of a t-test
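A sketch of the (augmented) Dickey-Fuller test via statsmodels (a random walk is simulated, so the test should fail to reject the unit root):

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=300))     # random walk → has a unit root

stat, pvalue, *_ = adfuller(y)
print(pvalue)                           # large p → cannot reject H0: unit root

stat_d, pvalue_d, *_ = adfuller(np.diff(y))   # first difference: stationary
print(pvalue_d)                               # small p → reject H0, stationary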
Then look at the time plot of the differenced data and sample ACF and PACF to select the appropriate model
Time plot:
o Reflects differenced data
o See if mean reversion feature → value fluctuate around the mean
ACF and PACF → select appropriate ARMA model
o Reflects differenced data
o If no clear clue which model to use, use trial and error
o Eg. use MA(1)
Stage 3: Forecasting > Enable forecasting steps > tick “Perform forecasting steps”
HOWEVER, remember model is for the differenced data
Forecasts for variable Close:
o Forecasts are for original data (NOT differenced/stationary data)
Because SAS knows that you considered the differenced
o Gives you the forecast, as well as the 95% confidence interval
Manually: interval forecast = point forecast ± 1.960 x SE
“I” stands for integration → how many times differencing I should conduct to get a stationary model
Check d with ADF test, DON’T over difference
o Eg. if d = 1, p = 1, q = 0 → ARIMA(1,1,0) model
o Eg. if data only becomes stationary at d =2, then → ARIMA(1,2,0) model
p and q find using the stationary (differenced/integrated) data
SAS:
Choose data
Excel SAS > Tick “Difference the response series”
o SAS will base its analysis on first order (or order specified), NOT original data
Eg. Autocorrelations will be for the differenced data
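An ARIMA(1,1,0) sketch in Python via statsmodels, mirroring the SAS workflow (data simulated; note the forecasts come back on the original scale, as described above):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
y = np.cumsum(1 + rng.normal(size=120))   # hypothetical I(1) series with drift

res = ARIMA(y, order=(1, 1, 0)).fit()     # p=1, d=1, q=0: model first differences
fc = res.get_forecast(steps=4)
print(fc.predicted_mean)                  # point forecasts for the ORIGINAL data
print(fc.conf_int(alpha=0.05))            # 95% interval forecasts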
Limitations of ARIMA:
ARIMA identification is difficult and time consuming
o Identification relies on visual analysis of the ACF/PACF, which is subjective
Model may NOT have intuitive interpretation
o Difficult to explain why sales depends on lag 3 sales, why not yesterday’s sales
Identification and estimation can be badly distorted by outlier effects
Models that perform similarly on the historical data may yield quite different forecasts
Box-Jenkins approach does NOT tell us if model is too big/unnecessarily big
o Only tells us if the model is big enough, and whether it is too small
o Eg. if use AR(10) for an AR(1) → still tells us that model is adequate
Problem: more than necessary unknown parameters
Reduces accuracy
o Many users then treat the series as if it has a unit root:
 Difference the data and forecast changes or growth rates
 Stop differencing once the differenced data is stationary, then model the differenced data
Summary: Forecasting with ARIMA models
Spurious Regression:
Have 2 random walk process (non-stationary), both are independent
If regress one data on another, should expect no dependence
o HOWEVER, we see some correlation (dependence)
o 2 non-stationary processes share some non-stationary root together
Called “co-integration” → Spurious dependence (won’t be tested)
“Spurious” because the 2 data are independent from each other
Pairs Trading:
To make profits using stationarity
Nobody knows the true value of security, how to know if overvalued or undervalued?
Don’t consider absolute value → look at relative value instead
1. Pick out 2 financial instruments that are similar to each other
Eg. same product, similar management board
We expect them to have similar price
2. We see if there is a deviation of price of one from the other
If there is a large deviation, one is overvalued and the other undervalued
Can combine the 2 by regressing one on the other
We get the residuals (error = actual – forecast)
o Error is stationary
o We also observe that the data is mean-reverting
Eg. DJ: dependent variable, FTSE: predictor variable
o Portfolio: et = DJt – β x FTSEt
3. We don’t need the true price, we just use the relative idea to see if one is over or undervalued
Then we make our decision
Trading strategy: when we see large deviation from the mean:
Below the mean: we should buy the portfolio
o buy 1 unit of DJ, short-sell β units of FTSE
o Clear the position at the mean
Above the mean: we should short-sell the portfolio
o Short-sell 1 unit of DJ, buy β units of FTSE
o Clear the position at the mean
We are only sure that it has mean-reversion, but NOT if it will continue to go above the
mean or below the mean → so we clear position at the mean
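The strategy as a Python sketch: regress one instrument on the other, form the residual portfolio, and trade on large standardized deviations (both price series are simulated to share a common component; the ±2 thresholds are arbitrary):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
common = np.cumsum(rng.normal(size=500))            # shared non-stationary part
ftse = 100 + common + rng.normal(size=500)          # hypothetical "FTSE"
dj = 50 + 0.8 * common + rng.normal(size=500)       # hypothetical "DJ"

beta = sm.OLS(dj, sm.add_constant(ftse)).fit().params[1]
spread = dj - beta * ftse                           # portfolio e_t = DJ − β·FTSE

z = (spread - spread.mean()) / spread.std()         # standardized deviation
signal = np.where(z < -2, "buy portfolio",          # below mean → buy
         np.where(z > 2, "short portfolio", "hold"))  # above mean → short-sell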
Model Diagnostics
 What can go wrong?
 The data may contain outliers.
 The time series may be non-stationary.
 The errors may be autocorrelated.
 The errors may show changing variances over time.
 The mean of the errors may be non-zero.
 The errors may not be normally distributed.
 Choose the model based on goodness of fit, forecast accuracy and residual analysis.
 Also consider interpretability and parsimony.
Moving Average Model – MA(1)
 a_t: random shock at time t
 a_{t−1}: random shock at time t − 1
 Assumptions on the shocks:
 o Normal distribution
 o Independent of t
 o Independent of other a_t terms
 θ₁: unknown parameter; estimate from data
 δ = μ: constant term (if applicable)
Cluster Analysis
D_m = √[(Z₁ − S₁)² + (Z₂ − S₂)² + … + (Z₇ − S₇)²] (Euclidean distance on standardized values)
D_m = √[(X₁ − Y₁)²/σ₁² + (X₂ − Y₂)²/σ₂² + … + (X₇ − Y₇)²/σ₇²] (σ² is the variance of each variable)
D_m does not depend on the scales of measurement used, as the variables are standardised
D_m will be influenced when new objects are included, as the SDs need to be recalculated
ED and MD are not affected significantly when new objects are included (even if they are outliers)
ED and MD are scale dependent
MD, as compared to ED, can reduce the effect of outliers as differences are not squared
Using different distance measures can lead to very different cluster results.
Two types of clustering
Hierarchical algorithms (using dendrogram)
o No need to specify the number of clusters to begin with (just do it using agglomerative method)
o Interpretation of results (E.g. no. of clusters to use) is very subjective
o Distances cannot be computed for categorical variables (they should not be taken into account)
o Look at the scatterplot (only possible for observations determined by two variables) of the objects to
determine the number of clusters needed (after the SAS cluster analysis)
o Scatterplots and dendrogram can also help identify possible outliers
o Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to
divide the cluster into two. Choose the best division and recursively operate on both sides. i.e.
Weeding out dissimilar observations
o Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge
into a new cluster. Repeat until all clusters are fused together. i.e. joining together similar
observations
SAS
Using SAS, choose Euclidean distance and Ward’s minimum variance method
Look at the resulting dendrogram to determine the number of cluster to use
Rerun the analysis, adding the number of cluster in the results.
The new analysis sorts the observations into the number of clusters specified
Sort the original observations into the clusters assigned
Calculate the cluster mean for each variable for each cluster (e.g. Price mean for
cluster 1’s price)
Plot bar chart to compare the cluster means and for interpretation
Manual
Need to standardise the data when we do cluster analysis as different scales may be
used for variables
Find distance matrix by determining the Euclidean distance for each two observations
Merge the two observations with the smallest distance into a cluster
Consider the newly formed cluster as one observation; update the distance matrix by finding the Ward's linkage between every observation and the new cluster
 Ward's linkage: minimises the variance within the merged clusters
 o Robust to outliers and noise
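The same workflow as a Python sketch with scipy (standardize, Ward linkage on Euclidean distances, cut the dendrogram); the data is random, for illustration:

import numpy as np
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(7)
X = zscore(rng.normal(size=(20, 3)), axis=0)    # standardize each variable

Z = linkage(X, method="ward")                   # Ward's minimum variance method
# dendrogram(Z)                                 # inspect to pick no. of clusters
labels = fcluster(Z, t=3, criterion="maxclust") # e.g. cut into 3 clusters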
o K means algorithm
SAS
Manual
Define a value for K, say 3. (not arbitrarily defined)
Arbitrarily specify K number of cluster centres/means
Compute the distance of each object to the cluster centres, classify object to the
nearest cluster
When all objects have been assigned, recalculate the positions of the centres by
taking average of the observations in the respective 3 clusters
Compute the distance of each observations to the 3 new cluster centres, classify the
observations to the nearest cluster
Find new cluster centre again and reclassify the observations until no observations
changes their positions (the algorithm converges)
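The manual algorithm above as a Python sketch (assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centres
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # assign each object to nearest centre
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):             # converged: no centre moved
            break
        centers = new
    return labels, centers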
o Selecting K
 Scatterplot of the observations to see how many clusters there are (eg. K = 5)
 With more than two variables, use PCA first and scatterplot PC1 against PC2:
 o PCs can be ordered according to the magnitude of their variances, which are the associated eigenvalues
 o PCs are affected by outliers (run PCA to obtain the PCs, scatterplot PC1 against PC2 to check for outliers; if there are any, remove them and rerun PCA to get better PCs, then use the scatterplot to find K, the number of clusters) (eg. K = 2)
o Cumulative column (in the PCA output) gives the percentage of variation of the data explained by the PCs
o Objective function: J = Σ_j Σ_i ‖X_i(j) − C_j‖²
 X_i(j): the ith observation that is classified into the jth group
 C_j: the jth cluster centre
 When K = 1, J is the sum of (squared) distances from C₁, the only mean, to all the data observations
 Plot the objective function values for K = 1, 2, 3, …, 6
 Select the K that corresponds to the abrupt change in the plot ("knee finding"/"elbow finding") (eg. K = 2)