Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 70

KEY CONCEPTS

Descriptive Statistics.......................................................................................................................................................4
Displaying Quantitative Variables Graphically.................................................................................................................5
Discrete Distributions......................................................................................................................................................6
Continuous Distributions.................................................................................................................................................6
Sampling Distribution......................................................................................................................................................8
 Law of Large Numbers (LLN)................................................................................................................................8
Central Limit Theorem (CLT)........................................................................................................................................9
Hypothesis Testing........................................................................................................................................................10
o Type 1 Error.......................................................................................................................................................10
o Type 2 Error.......................................................................................................................................................10
 p-value...............................................................................................................................................................10
 Power of Test....................................................................................................................................................11
Comparing Distributions...............................................................................................................................................11
 Simpson’s Paradox............................................................................................................................................11
Comparing/Testing Averages........................................................................................................................................13
Choosing Method......................................................................................................................................................13
ANOVA (Analysis Of Variance) F-test.........................................................................................................................13
t-test..........................................................................................................................................................................14
z-test......................................................................................................................................................................... 15
Lecture 4 Linear Regression..........................................................................................................................................15
Linear Regression (1......................................................................................................................................................15
Lecture 5 Multiple Regression.......................................................................................................................................16
Multiple Regression (....................................................................................................................................................16
o Adjusted R2........................................................................................................................................................17
Regression Output.....................................................................................................................................................19
Usefulness Tests of Regression Coefficients (2 tests)....................................................................................................20
1. Right-sided F test...............................................................................................................................................20
2. 2-sided t-test.....................................................................................................................................................20
Variables Selection........................................................................................................................................................20
Indicator or Dummy Variables.......................................................................................................................................21
Interaction among X’s...............................................................................................................................................22
Missing Values...........................................................................................................................................................23
Outliers......................................................................................................................................................................24
Logistic Regression........................................................................................................................................................25
Time Series....................................................................................................................................................................25
Test for Independence Assumption for residuals in time series....................................................................................28
o Durbin-Watson Test..........................................................................................................................................28

1
Test for Variance assumption........................................................................................................................................29
o F-test................................................................................................................................................................. 29
o Individual t-test.................................................................................................................................................29
o To test if models are really a better fit..................................................................................................................30
o Goodness of Fit..................................................................................................................................................30
o Forecast Accuracy..............................................................................................................................................30
o Residual Analysis...............................................................................................................................................30
Forecast Errors..............................................................................................................................................................31
 Bias....................................................................................................................................................................31
 MAD (Mean Absolute Deviation).......................................................................................................................31
 MAPE.................................................................................................................................................................31
 MSE (Mean Square Error)..................................................................................................................................31
 RMSE (Root Mean Square Error).......................................................................................................................32
Prediction Intervals.......................................................................................................................................................32
Point forecast............................................................................................................................................................32
Interval forecast........................................................................................................................................................32
How to model both Trend and Seasonality...................................................................................................................33
Additive Model (Linear Trend and Seasonality).........................................................................................................33
Multiplicative Model (Nonlinear Trend and Seasonality)..........................................................................................33
Ratio-to-Moving-Averages (Seasonal Index).............................................................................................................33
Summary of Multiple Regression-based models.......................................................................................................34
How to Model if Underlying Pattern is NOT Apparent? (NO obvious trend or seasonality)..........................................35
Naïve forecast...........................................................................................................................................................35
Smoothing Out method.............................................................................................................................................35
 Simple Moving Average (SMA)......................................................................................................................35
 Weighted Moving Average (WMA)................................................................................................................35
 Single Exponential Smoothing (SES)..............................................................................................................36
Holt’s exponential smoothing method..................................................................................................................37
Winter’s exponential smoothing method..............................................................................................................37
Simple Exponential Smoothing Prediction Interval...............................................................................................39
How to assess the existence of autocorrelation?..........................................................................................................40
o Visual analysis: lagged scatterplot (e.g. original vs. lag 1, lag 1 vs lag 2)............................................................40
 Quantitative value: autocorrelation function (ACF) (how to compute autocorrelations)..................................40
o 1st test: Individual Test for Autocorrelation...................................................................................................41
o 2nd test: Joint Test for Autocorrelation (Ljung-Box test/Q test).....................................................................41
o 3rd test: Durbin-Watson test (autocorrelation check for lag-1 residuals only)...............................................42
Stationarity............................................................................................................................................................43

2
Time Series Models if Original Data have Autocorrelation and Stationary....................................................................44
Autoregressive Model (AR).......................................................................................................................................44
 Partial Autocorrelations (PACF).....................................................................................................................46
Moving Average Model (MA)....................................................................................................................................47
Summary...................................................................................................................................................................48
ARMA Model.............................................................................................................................................................48
Information Criteria (IC) → to choose p and q for ARMA.....................................................................................48
Box-Jenkins Methodology.............................................................................................................................................49
SAS Output................................................................................................................................................................50
To see if fitted model is good: look at Autocorrelation Check of Residuals (Q-test / Ljung-box test)........................50
Stationarity Condition...................................................................................................................................................52
Non-Stationary Models.................................................................................................................................................52
Random Walk Model.................................................................................................................................................52
Random Walk with Drift............................................................................................................................................52
Non-Stationarity............................................................................................................................................................52
1. Trend-stationary (TS).........................................................................................................................................53
o Deterministic trend.......................................................................................................................................53
2. Difference-stationary (DS).................................................................................................................................53
To make Non-Stationary become Stationary.............................................................................................................53
Unit-root tests to test for stationarity...........................................................................................................................54
Augmented Dickey Fuller (ADF) Test:........................................................................................................................55
 Time plot.......................................................................................................................................................55
 ACF and PACF................................................................................................................................................55
 Forecasts for variable Close...........................................................................................................................56
ARIMA(p,d,q) model......................................................................................................................................................56
Summary of Unit Root Test.......................................................................................................................................56
Summary: Forecasting with ARIMA models..............................................................................................................57
Pairs Trading:.................................................................................................................................................................57
Model Diagnostics.........................................................................................................................................................58
Cluster Analysis.............................................................................................................................................................59

3
LECTURE 1: EXPLORING AND COLLECTING DATA

Data Types

 Qualitative: categorical, nominal, labelled, based on ranks (eg. 1st, 2nd), not ordered (eg. red, blue, apple,
Toshiba)
 Quantitative: numerical, values with units, can tell which is big or small (eg. height in cm), discrete v.s.
continuous, cross-sectional v.s. time series
 Half-Half: ordinal, ranked, but not specific value (eg. 5 star hotel)

Descriptive Statistics

Single Number
 Mode: → most repeated value
o Visual: hump in histogram
o =MODE()
o Unimodal (1 mode), bimodal (2 modes), multimodal (many modes), or NO mode may exist

CENTER
 Average/Mean: =AVERAGE() → more resistant than Median in finding the centre for asymmetric data
 Median: =MEDIAN() → average of 2 centre numbers; Median is resistant to outliers  suitable to find the
centre when the distribution is skewed, contains outliers or gaps)

SHAPE
 Symmetry:
o Symmetric if halves on either side of center look (approximately) like mirror images
o If symmetric, mean and median are close (because median is 50th percentile and mean is the
average). Normal distribution: symmetric  mean=mode=median
 Skewness: skewed to the side of longer tail
o =SKEW()
o Positive skew: long right-tail
o Negative skew: long left-tail
o Zero skew: no skewness
o Magnitude increases as the degree of skewness increases
o If mode is smaller than median and mean, the distribution is right skewed.
o If skewness exists, there would be little symmetry.

SPREAD
 To approximate interval, for the centre of the data, within which half of the data points were found:
median +/- 0.5 x IQR.  34% from either side of average is
 Average (Q1 + Q3): average of (1st quartile + 3rd quartile) within the 1SD
 Range: =MAX()-MIN() → covers 100% of data  1SD cover 68% of the data
 Standard Deviation: =STDEV()
o Six Sigma → not actually 6 sigma, but 4.5 sigma (build a margin of error of 1.5 sigma)
o Mean +/– Standard Deviation gives two-thirds interval
o SD = √variance

4
 Variance: SD2 → =VAR() → the average of the squared deviations from the mean

σ 2=
∑ ( y−μ)2 s2 =
∑ ( y− ȳ )2
n or sample variance: n−1
 Interquartile Range (IQR):
o Q1 = 1st quartile (bottom 25%, or 25th percentile point)
o Q3 = 3rd quartile (top 25%, or 75th percentile point)
o IQR = Q3 – Q1 → (range of middle 50%)
o IQR should be close to 2 x MAD, since both cover 50% of the data
o IQR is a better measure for asymmetric distributions
o An indicator of size of variance
 Scaled Interquartile Range (SIQR) to make it equal to SD:
o SIQR = IQR / (2*NORMSINV(0.75)) i.e. IQR * 0.741 = SD almost equal to SD→ to make it comparable
to SD, which is 100%
 Median Absolute Deviation (MAD):
o Measures median of absolute deviation, where the absolute deviation is the absolute difference btw
a data point and the median of the data.
o MAD = med [ |y – med(y)| ]
o Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations about 2 are
(1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are
(0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1.
 always positive (eg. deviations -5 & 2 → MAD = 3)
 covers 50% of the data
 1.5 MAD = 1 SD
o NOT differentiable, not elegant
o Median +/– MAD gives 50% interval
o MAD is resistant to outliers
 The presence of outliers does NOT change the value of the MAD
 In contrast, the SD is very sensitive to the presence of outliers
 Scaled Median Absolute Deviation (SMAD):
o MAD / NORMSINV(0.75) → {=MEDIAN(ABS(?-MEDIAN(?)))/NORMSINV(0.75)}
o If Standard Normal N(0,1): SMAD = SD
o SMAD is resistant to outliers
 Which measures of center and spread to be reported for a distribution?
o If the shape is skewed, the median and IQR should be reported.
o If the shape is unimodal and symmetric, the mean and standard deviation (and possibly the median
and IQR) should be reported.
o Always pair the median with the IQR, and the mean with the standard deviation.
Displaying Quantitative Variables Graphically

 Histogram: (comparing 2 groups)


o Vertical bar charts without gaps (so there is area under the curve)
o Binning, the least significant figures should only differ by 1, 2 or 5
o To plot quantitative data (with units), CANNOT plot qualitative
o General shape (like humps → easier to see asymmetry & mode), BUT may hide outliers
 Boxplot: (comparing many groups)
o Box and Whiskers
 Shows outliers (as dots) and skewness, NOT general shapes
 "Box up" the IQR / middle 50% (from 25th percentile to 75th percentile line)

5
 Divider line at median (50th percentile) within box
 Whiskers extend away from box
 5-number summary:
o Whiskers at min and max
o Median, Q1, Q3, Min, Max (5 number)
o Lazy version
 7-number summary:
o Whiskers end at 5th and 95th percentiles
o centre 90% of data is within whiskers
o 5% between whisker's end and extreme (min or max → represented by
dots)
o Median, Q1, Q3, 5%, 95%, Min, Max (7 number)
o Structure: min, 5th percentile, Q1, median (or Q2), Q3, 95th percentile, max
 Longer part of the box is at the top/bottom, the data is skewed right/left.
Sampling: because population is too huge, sampling is often used
 Population
 Sample
 Statistic: number computed from sample used to estimate parameters of the population (e.g. mean)
 Can never have 100% confidence of correctly estimating the population parameter of interest because the
sample used is not the whole population. That’s why standard deviation/error of the estimate is not zero.
 Designs: (how to use samples)
o Census: Include whole population (eg. to measure world population)
o Convenient sample: (not a proper sampling design)
 Take whatever is available, sometimes no choice
 Likely to be unrepresentative of the population
o Simple Random Sample (SRS):
 Larger samples more accurate
 Need a listing of all members of population (sampling frame); can be expensive
 With or Without Replacement
 In practice, we sample without replacement
 With formula, we assume with replacement, for simplicity
o Stratified Sampling:
 Proportional representation, rare stratum never left out
 Eg. Indian minorities in Singapore, boost their representation since their population is small
 Sample randomly from each stratum
o Cluster Sampling or multistage sampling:
 Identify a microcosm of population and say that your cluster represents population
 Eg. sample of Serangoon represents Singapore since it is similar

Discrete Distributions

 BINOMIAL: B(n, p)
 n = no. of trials. p = probability of success. X = no. of successes in n trials
o Success or Failure?
 prob(X = x) =BINOMDIST(x,n,p,FALSE)
 prob(X ≤ x) =BINOMDIST(x,n,p,TRUE)
o Average = np
o Variance = np(1-p)
o Eg. Toss a fair coin 20 times, what’s the probability of at least 18 heads?

6
 = 1 – BINOMDIST(17,20,0.5,TRUE)

7
Continuous Distributions

 NORMAL (N):
o N(µ, σ 2) or N(µ, σ )
 Average = µ (affects position of peak)
 Variance = σ 2 (affects spread)
 Covers 34% of data from either side of average
 68% within 1 SD of average
 95% within 2 SDs
 99.7% within 3 SDs
 99.994% within 4 SDs
 When asked to find the corresponding percentile values using means and SD, find use the
68-95-99.7 percentiles using SD.
 Mean = Mode = Median
 Symmetrical about the centre
 IQR/(2*normsinv(0.75))=0.741*IQR = SD.
i.e. Standard deviation = 0.75 x IQR. IQR = SD/0.75
o Approximate sample proportion p when population is large
 N(p, p(1–p)/n)
 Requirements:
 p > 5/n & (1 – p) > 5/n
 p (1 – p) is max when p = 0.5
 Conservatively, use N(p, 0.5(1-0.5)/n) → N(p, 0.25/n) for margin of error
o use Normal to Approximate BINOMIAL, B(n, p)
 N(np, np(1–p))
 Requirements:
 np > 5 & n(1 – p) > 5
 OR np ≥ 10 & n(1 – p) ≥ 10 (in some texts, less common)
 Often used for: P( B(n, p) ≥ x )
 Continuity Correction: P( B(n, p) ≥ (x – 0.5) )
o Standardisation (z-score)

 where s is the standard deviation of the sample


 This shows how many standard deviations the value is above or below the overall mean. For
example, a z-score of +2.0 indicates that a data value is two standard deviations above the
mean.
 Allows for comparison of variables measured in different units
o Standard Normal, Z
 N(0, 1)
 Average = 0
 Variance = 1 = SD
X−μ
 N(0, 1) = = Zx Zx represents Standard Normal
σ
o Eg. P(-1 < Z < 1)
 If Z = 2 (+ve), the original value is 2 standard deviations to the right of the mean
 If Z = -0.5 (-ve), the original value is 0.5 standard deviations to the left of the mean
 t-distribution:

8
y −μ standard deviation
o td = SE ( y )=
SE ( y) √n
bell-shape, like Standard Normal N(0, 1), but with df
 lower peaks, higher tails (more spread out)
 d = degree of freedom
 Average = 0
 Variance = d/(d – 2) → nearly 1
o As degrees of freedom ↑, the t-distribution tends toward the Standard Normal Distribution
o t-distribution is more accurate than the z-distribution (Standard Normal Distribution). z-distribution
can approximate the more accurate t-distribution if the sample size is large.
o t-value represents the number of standard errors by which the sample mean differs from the
population mean
 eg. if a t-value is 2.5, the sample mean is 2.5 standard errors above the population mean

 F-distribution:
o Fv, d
F distribution with v, d DF (degrees of freedom)

 v degrees of freedom in numerator
 d degrees of freedom in denominator
 Average = d/(d – 2)
 Variance is complicated
o Used in ANOVA (hence regression)

Sampling Distribution
Sampling Distribution: distribution of all the averages of different samples
 Law of Large Numbers (LLN):
1. The average of many independent samples is (with high probability) close to the mean of the population
o Average of all the sample averages → expect to get the population average
σ
2. Standard Deviation of the many independent samples = → smaller than population average
√n
3. The relative frequency becomes closer to the object probability as more trials are performed.
 Point estimate
o A single value given to a sample as an estimate of the true value of a population
o Sampling Error/Estimation Error:
 Difference between a point estimate and true value of population
 Because samples are smaller than population, there will always be sampling error
o Sampling Distribution:
 Distribution of point estimates from all possible samples from the population
o Unbiased Estimate:
 Point estimate where mean of sampling distribution of that statistic = true value of
population
 OTHERWISE, it is biased
 Unbiased estimates are desirable because they average out to the correct value
o Standard Error:
 How much point estimates vary from sample to sample (SD of sampling distribution)
 Standard Deviation of a Sampling distribution (NOT population)
 Ideally, estimate should have small Standard Errors
 If point estimates vary wildly, then a point estimate from 1 sample is NOT reliable

9
σ
 Standard Error =
√n
 When we don’t know σ, we approximate Standard Error =
sample standard deviation
√n
 Diminishing returns: the standard error declines only with the square root of the
sample size, as seen from its formula.
 Approximately Normal, so can use Standard Error exactly as you use Standard Deviation
 Eg. 2 Standard Errors on either side of mean, 95% confident of capturing mean
2s
 Approximate Confidence Interval = sample mean ±
√n
o Within this range/interval, we are 95% confident
 Interval Estimate
o An interval/range within which a value has a stated probability of occurring
o Confidence Interval:
 Probability that a value will fall within a range
 Measure of reliability/how accurate our estimate is
 Eg. given a 95% confidence interval, value will be between 5 and 100
 Interpretation: we are 95% confident that the value will be between 5 and 100
s
 95% confidence interval for µ is: “sample average” ± ME (= (tn-1, (1-95%)/2 * ))
standard deviation √ n
Mean + t-critical (100% - confidence interval, with n-1 degrees of freedom) x SE (=
√n
If the population mean and variance are known, confidence interval is given by normal distribution.

Central Limit Theorem (CLT):


 The sampling distribution of any population with mean and SD, is approximately Normal, and
approximation improves as the sample size n becomes larger
o HOWEVER, CLT does NOT say that when sample size is large enough, we can assume exactly Normal
 Distribution is approximately Normal, provided n is sufficient large (usually n ≥ 30)
o if population distribution is approximately symmetric, Normal approximation is good, even for low n
o HOWEVER, if very skewed or bimodal, need a really large n for Normal model to work well
 CLT talks about sample means of different samples, but NOT about distribution of data from sample
o If n is large, Xn approximates a Normal distribution with (assuming population mean and standard
deviation are known):
 Average = µ → average same as population average
σ
 SD of the sampling distribution = SE =
√n
( X n−μ)
 σ → N(0, 1) [Standard Normal]
√n
 If we know µ but NOT σ , estimate σ using s, the sample standard deviation:
( X n−μ)
o T-Statistic = s → the sampling distribution follows t-distribution with n – 1 df
√n
s
 = Standard Error of t-distribution (approximate standard error for normal dis.)
√n
 Assumptions and conditions for CLT

10
o Randomization Condition: The data values must be sampled randomly, or the concept of a sampling
distribution makes no sense. (This is usually assumed.)
o Independence Assumption: The sampled values must be independent of each other.
o Small-Fraction-Sample Size Condition: The sample size is a small fraction (traditionally, < 10%) of the
population. E.g. population size 100, each sample consists of 10 subjects only. (This is rarely looked
at.)
o Large-Enough-Sample Number Condition: If the population is symmetric (e.g. uniform, as in die-
throwing; unimodal will further help), even a fairly small sample is okay. For highly skewed
distributions, very large samples may be required. Traditionally, a short rule is n > 30. (This is
frequently ignored.)

Hypothesis Testing
 Null hypothesis
o “fail to reject” null OR “reject” null (DON’T say “accept”)
o Null is a simple hypothesis: does NOT allow a range
 Eg. cannot say null: average is between 4 and 6
o Type 1 Error:
 If null is true and you declare it is false (reject). False positive.
 Probability α → significance level
 Probability of Type 1 error occurring
 E.g. continue with further testing when in fact mean tumour weights are the same
for the two groups
 In practice, α usually specified by boss/client (fixed)
 Always under Null Hypothesis, and drawn towards the Alternative Hypothesis
 Use α to calculate c – critical value for testing (no need if can get p-value)
 Alternative hypothesis
o Set up against the null hypothesis
o Can be one-sided or two-sided (affects calculation of z-critical using 0.05 or 0.025)
o Alternative is a composite hypothesis: can allow a range
o Type 2 Error: → CANNOT have both Type 1 and Type 2 error at the same time
 If null is false and you declare it is true (fail to reject). False negative
 Probability β
 Probability of Type 2 error occurring
 E.g. abandon the drug when in fact the mean tumour weight of the treated mice is
smaller than that of those in the controlled group
 Under the Alternative Hypothesis and drawn towards the Null Hypothesis
 Results:
o if sample value is right side of critical value: do not reject Null Hypothesis, H0
o if sample value is left side of critical value: reject Null Hypothesis, H0 Alternative Hypothesis, Ha

 α & β are always in competition


o Shifting critical value will cause one of α & β to increase and the other to decrease → α + β = 1

11
o To reduce both (Horizontal variation), increase n (sample size)
 In practice, people only specify α (fixed); so increase n → reduce β
o α & β Cannot occur at the same time
 p-value → also called the significance Probability of the test
o probability of seeing something favouring HA more than (or at least equal to) H0, given H0 is true.
o Probability that HA happens given that H0 is true. If p-value is small, HA should not happen, yet it
happened, it means H0 is rejected.
 chance of a worse sample for H0 → NOT the probability of H0 being true, which is just 1 or 0
 big-p-value favours the non-rejection of H0
o if have p-value, it is a statistical method. If NOT, then it is NOT a statistical method.
o Results:
 p ≥ α: do NOT reject H0
 p < α: reject H0
 CANNOT change p after looking at α
 Power of Test
o 1–β
 Probability of rejecting H0 when it is false (when HA is true)
 Computed using sampling distribution for the Alternative Hypothesis (NOT Null)
 Allows us to describe ideal test
 Higher power → more reliable test
 To achieve higher power, increase sample size n (power depends on sample size)
o Ideal test characteristics:
 Small α → hardly reject H0 when it is true
 Large power (1 – β) → often reject H0 when it is false

Comparing Distributions
Scatterplot (for dependent samples, not suitable for independent samples):
 Define which is x-variable and y-variable (defining wrongly will affect results)
o X-variable: independent variable (changing this)
o Y-variable: dependent variable (changing x will affect y)
 Results:
o Negative/Positive relationship?
o Strong/Moderate/Weak relationship?
 Strong: points are close or on the best fit line (trend line)
 Weak: points are far away from the best fit line (trend line)
o Linear/Logarithmic/Exponential (etc.) relationship?
o Outliers?
 Lying outside Upper Control Limit or Lower Control Limit
 Real data or data error?
 Perform one scatterplot with outliers, and one without outliers
 Does outlier have any significant effect?

Missing Values: (not a statistical test)


 Possible reasons:
o Reluctance of people to provide all requested personal information
 Eg. why should you care how old I am when I first drank alcohol?
o Data doesn’t exist
 Eg. stock market prices for time before company went public
o Values are simply unknown

12
 How to detect missing values:
o Blank cells in Excel data set
 What to do about missing values:
o Ignore them
 Must be aware of how software deals with missing values
 Eg. Excel’s AVERAGE function divides by existing values, does not touch missing values
o Filling the gaps in some way
 Examining existing values in the row of any missing value to help predict missing value
 Fill in all missing values with average of existing values in that column
 Not very good option

Box Plot: (not a statistical approach, but a graphical approach)


 Side-by-side box plots are popular for comparing distributions
 Box plots handle missing values well
 Box plots are used to show overall patterns of response for a group. Can also show outliers

 Observations:
o Comparatively short box plots: data shows high agreement with each other
o Comparatively long box plots: data shows that there are different opinions about this aspect
o Box plot is much higher/lower than another: could suggest differences between groups
o "No overlap in spreads" or "75% is below 75%" so there IS a difference between group 'A' & 'B'

 Compare:
o Medians
o Consistency (smaller IQR → more consistent)

13
Comparing/Testing Averages:

Choosing Method:
1. 1, 2 or more samples?
o 1: 1-sample t-test, 1-sample z-test (test if the sample mean is larger/smaller/different from expected)
o 2: t-test, z-test, ANOVA
o More than 2 independant variables : ANOVA
2. Are samples independent?
3. Are population variances known?
4. Are unknown population variances equal?
For 1 sample
o If population is Normal (or sample large for CLT)
 Variance Known: 1-sample z-test
 Variance Unknown: 1-sample t-test
For 2 samples
o If populations are Normal, Samples are Independent
 Variance Known + Not too different: z-test
 Variance Unknown + Equal (but you know they are equal): pooled t-test OR 1-way ANOVA
 Variance Unknown + Unequal (BUT not too different): 2-sample t-test *most common
o If Paired Samples across populations: dependent within pairs, independent between pairs
 First, compute the paired difference: d = x1 – x2
 If difference between pairs is Normal + Variance Unknown: paired t-test
5. A statistical hypothesis is only about a population parameter, not sample estimate.

ANOVA (Analysis Of Variance) F-test:


 Used to find out whether the means (NOT variance) of a dataset are equal according to a certain α level
o To see if there are any significant differences in means between 2 or more different populations
 eg. different university courses, or medical treatments
 NOT a test of variances, instead comparing population averages/means
 Not a method for comparing multiple distributions of variables
o It is a multi-sample extension of the 2 sample 2-sided pooled t-test with unknown, equal variances
 H0: means of all groups/populations (NOT samples) are equal
 HA: at least one mean is different from the others
o ANOVA is a multiple regression with only dummy variables
o ANOVA is regression for categorical variables which are changed to dummy variables
 Assumptions:
1. Independent random samples
2. Normal populations: the populations follow a normal model.
3. Equal population variance (and unknown)
 Results:
o Significance F/p-value: (sample size does NOT affect reliability of p-value, affects size of p-value)
 If p-value ≥ stated α level, NOT enough evidence to reject the null H0
 Data are consistent with the null hypothesis (CANNOT say supports null hypothesis)
 Suggests the groups are not related
 NOT statistically significant → Analysis stops here
 If p-value < stated α level (small p-value) → reject null
 Shows populations are different (thus interesting)
 Smaller p-value → more evidence in favour of alternative hypothesis, HA
o Statistically significant

14
o Usually < 0.01 → convincing evidence for HA
 Next step: Discover which means are significantly different from which other means
 Usually done by examining Confidence Intervals
o F statistic/F vs F critical
 If F < F crit → variances are the same
 If F > F crit → variances are NOT the same
o SS (sum of squares):
 Between groups: sum of squares due to treatment
 Within groups: sum of squares due to error
o Between Variances (SSR):
 measures how much sample means differ from one another
 Only if Between Variance > Within Variance, can you conclude with any assurance that
there are differences between population means – and reject null hypothesis
o Within Variances (SSE):
 measures how much observations within each sample differ from one another
 Large Within Variances:
 Difficult to infer whether there are really differences between population means
 Small Within Variances:
 Easy to infer whether there are really differences between population means
o df (degrees of freedom)
o two variations MS (means squares):
 Between groups (Horizontal variation): means squares due to treatment (MST).
 Within groups (Vertical variation): means squares due to error (MSE).

t-test:
 Used to find out whether the means of a dataset are equal according to a certain α level
o To see if there are any significant differences between TWO different groups (unless 1 sample t-test)
o H0: μ1 – μ2 = Δ0 H1: μ1 – μ2 ≠ Δ0
 Conditions for using a t-test:
1. Standard Deviation is NOT known
2. n < 30
 How to conduct a t-Test:
o 1-tail test: knows what the difference is (eg. group 1 > group 2)

o 2-tail test*: unsure if there is a difference (eg. although there is difference, not sure which sign)
o Type 1 (paired t-test; a one sample t-test for the mean paired difference (μ1 – μ2)):
 Dependent + unknown variance + variance equal
 dependent paired samples across populations between the 2 groups
 dependence within pairs, independence between pairs
 Related across columns, but NOT rows
 difference between pairs is Normal, and variance is UNKNOWN

15
 Requires 2 samples of equal size
 Paired t-test for average of difference between pairs

o Type 2 (pooled t-test):


 Independent + unknown variance + variance equal
 Independent sample (1 does not affect the other), populations are Normal
 both variances are SAME

variance

o Type 3 (2-sample t-test): (most common)


 Independent + unknown variance + variance NOT equal
 Independent samples (1 affects the other), populations are Normal
 both variances are DIFFERENT
 The 2-sample t-test often yields a statistic with non-integer degrees of freedom.

 Result:
o If p-value < stated α level (small p-value) → reject H0 →difference between the 2 groups is
significant
o If p-value ≥ stated α level, do not reject H0  NOT enough evidence to conclude that 2 groups are
different  also implies that the confidence interval for 1 -  at confidence coefficient of 1-α will
include zero since there is not enough evidence to conclude that 2 groups are different
 Confidence interval
o Mean +/- t-critical (with df and confidence coefficient specified) x standard error
o Mean = mean of X1 – mean of X2. t-critical can be calculated from t-test outcome. SE is St-test
z-test:
 Used to find out whether the means of a dataset are equal according to a certain α level

16
o To see if there are any significant differences between 2 different groups
 Conditions for using a z-test:
o When conditions for t-test NOT satisfied:
 Standard Deviation is known
 n > 30
 Result:
o If p-value < stated α level (small p-value) → reject H0 →difference between the 2 groups is
significant
o If p-value ≥ stated α level, NOT enough evidence to conclude that 2 groups are different

Lecture 4 Linear Regression

Linear Regression (1 input X, 1 output Y):

 Equation of the population: y = β 0 + β 1x + ε i


 Scatterplot of sample points: ^y = β^ 0 + β^ 1 x
o ^β : intercept term → y value when x = 0
0

o ^β : slope term  the measures the expected change in y given a unit change in x only.
1
 +ve or –ve correlation between x and y
 CANNOT infer strength of correlation; magnitude can be changed by scale of measurement
sy
 ^β =r So, if r=0, ^β 1= 0. Then ^β 0= y , and ^y = β^ 0 + β^ 1 x = y
1
sx
o x: independent variable. y: dependent variable
o ε : error term, residual → sum of residuals = 0
o Residual r (estimate εi) = observed y value - fitted y value
o Unlike correlation, regression is not symmetrical in X and Y (so the regression equation of X on Y is
not x= β^ 0 + ^β 1 ^y or x= β^ 2 + ^β 3 ^y
 Standard error of the linear regression (Checking the model with standard deviation of the residuals,
estimate the standard deviation of the error term)
o How much the points spread vertically around the regression line

o
o
se =
√ ∑ r2
n−2 (r is the residuals and n is the number of observations)
Application of standard deviation of the residuals/standard error of estimate: Given Se, you can find
how many standard errors away your fitted value is from the actual value using residual for that
value divided by Se. Se is 3170, residual of a particular point is 2086, this indicates that the
fit/prediction is about 2086/3170 = 0.66 SDs away from the actual value, which shows that it is a
quite good prediction.
 Conditions:
o Constant Variance Condition
 The standard deviation around the regression line should be the same along the whole line
o Quantitative Variables Condition:
 Correlation applies only to quantitative variables
o Linearity Condition:
 Correlation measures the strength only of the linear association (between 2 variables)
o Outlier Condition:
 Outliers can distort the correlation
 Line of “Best Fit”
o Sum of residuals is NOT a good assessment of how well line fits data

17
 Because some are positive, some are negative
o Sum of square of residuals better
 Smaller sum of squares → better fit
 Smallest sum of squares → line of “best fit”, or least squares line
 Correlation:
o Correlation is NOT causality, vice versa. Cannot be computed for more than 2 variables
o Correlation measures extent of clustering around the positively/negatively sloping 45 ° line, for
standardised X and Y variables.
o Correlation treats x and y symmetrically → ρ ( x , y ) =ρ( y , x ). same r, if x and y are interchanged
o Correlation has no units (standardised values are used)
o 0 correlation → not linearly correlated (but doesn’t mean NO relation)
o Correlation always between –1 and +1, not affected by units of measurement, sensitive to outliers.
o For Standard Units:
(X −μ x ) (Y −μ y )
 Zx = , Zy =
σx σy
 Correlation = Average of Standard Units ρ ( x , y ) =average ( z x , z y )
∑ zx zy
o Correlation, r =
n−1
sy sx
o Correlation r. ^β 1=r  r = ^β 1
sx sy
Cov(x , y)
o Correlation, r=
sxsy
o In linear regression, Multiple R = r(y, ^y ) = r(y, ^β 0 + ^β 1 x ) = r(y,x) = r(x,y) = correlation coefficient
 Covariance:
n

o Covariance =
∑ (x i−x )( y i− y ) (Covariance can be of any value)
i=1
n−1
 2
R (The fraction of the y’s variation accounted for by the simple regression model)
o R2 = Multiple R2 = correlation coefficient r2, R2
o 1- R2 = fraction of y’s variation left in the residuals. Percentage of y unexplained by the model
o Removing outliers can change or may not affect |r|, and hence R2
 Regression Effect:
o In a different round, the corresponding observation tends to be closer to the average
o Works both ways: not just future, but also backwards
o If currently round is b SDs from mean; then, c rounds (generations) apart, will be b|r|c SDs away,
where r is the correlation coefficient.

Lecture 5 Multiple Regression

Multiple Regression (Multiple input X’s, 1 output Y):


 Equation: y = b0 + b1x1 + b2x2 + … + bkxk + ε i → can use to model nonlinear relationships (by transforming)
 b0 is the intercept term. It could be meaningless if some X variables can never be equal to zero
 ANOVA is regression for categorical variables
 Multiple regression coefficients must be interpreted together with the other predictors in the model.
 Size of the coefficients is affected by scaling issue, and hence does not tell us their importance in the model
 Assumptions: (if assumptions violated, maybe introduce another variable)
o Equal-Variance Assumption (variation of Y about the regression line is the same. Or called Constant
error variance. NO Fan shape)

18
o Independence Assumption (probabilistic independence of errors)
o Normality Assumption (errors are normally distributed)
 How to perform Multiple Regression:
o Check “residual plot” to get Residual Plot (residuals plotted in a scatter plot)
 Residuals ε : difference between the observed y values and expected/predicted y values
 2 things must be true if regression line captures the overall pattern of data well
1. Residual plot shows no obvious pattern – random
2. Smaller residuals → better
 Steps:
o Formulate the Model (specify the variables)
o Estimate the parameters
o Perform model diagnostic testing → to justify your model selection
o Conduct hypothesis testing → to test whether necessary to use your predictive model
o Reformulate the model if necessary, then repeat steps 2 to 4 → If satisfied, use the model to
forecast
 Prediction:
o Prediction is for outside sample data, otherwise we are doing a fit
o We can calculate the error term for the predicted value
o Check assumptions for error term:
1. Zero average
 Check: guaranteed by presence of intercept term Do square plots to better see
2
2. Equal variance σ → sample value of σ is Standard Errors
 Check using Residuals vs Fit
Check using Residual Plot
 Hope for no pattern
3. Independent HOWEVER, violations can still
 Check using Residuals vs Each X (same as vs. Fit) be missed by residual plots
 Hope for no pattern/linearity
4. Normally distributed
 Check using Normal probability plot of normalized residuals
 Hope for straight line → if NOT straight enough, try y transformation
o If assumptions are violated, try transforming the data or adding more variables.
 Results:
o p-value (Coefficients table)
o Significance F (ANOVA)
 p-value of F
 Small Significance F → Regression is good → NOT all means equal, populations diff
 Large Significance F → Regression is bad
 NOT the same as F statistic
 F statistic: “signal-to-noise ratio”
SS(Between Groups)
df BG
o F statistic =
SS (Within Groups)
df WG
 Larger F statistic → better → NOT all means equal, populations different
o Adjusted R2 (Regression Statistics. Goodness of fit)
 R2 adjusted of no. of variables in the model
 Use it to judge if you should add extra variable to regression
 If adjusted R2 increases, add variable

19
 If decreases, don’t add variable
 NOT affected by sample size n
 Percentage of variance is explained by the model
SS(Error )
n−k −1
 1–
SS (Total)
n−1
 SS(Error) = ∑ ¿¿ ¿
 SS(Total) = ∑ ¿¿ ¿
o R Square, (Multiple R)2:
 Percentage of variation/variability is explained by the model (Coefficient of determination)
 Between 0 and 1
 If add extra variables to regression, will always increase R2
 ∴ Higher R2 NOT always preferred → may have too many variables
 If sample size n increase, R2 decrease → if have k+1 data rows → R2 = 1
SS (Error)
 1–
SS(Total )
o R, Multiple R
 If only linear regression (1 variable), R = correlation coefficient
 is r(y, y-hat) when regression has intercept
o Standard error of multiple regression (estimate for the standard deviation of the error term)
 How much the points spread vertically around the regression line


se =
√ ∑ r2
n−k −1 (r: residuals, n: no. of observations, k: no. of independent variables)
 About 68%, 95%, 99% of predictions made would be within 1, 2, 3 SD of the actual Y.
o F-test:
 To test if at least one coefficient is NOT 0 / to test if ALL coefficients are 0 (null hypothesis)
 But does NOT tell you which ONE is 0
 To do that, use t-test
o t-test:
 To test if a specific coefficient is NOT 0 (not important)
o df: Degree of Freedom
 Larger → better; should NOT be too small
 Eg. 100 samples, 99 variables to draw the regression, degree of freedom is 0. No
point in doing regression because every single sample is perfectly fit (over fit)
 If you add a variation, it will reduce the degree of freedom
 Use R2 to tell if it is worth it
o SS:
 SSR: Sum of squares explained by regression SS(Total) = ∑ ¿¿ ¿
 SSE: sum of squares residual/error term SS(Error) = ∑ ¿¿ ¿
 Amount of certainty that remains in the model
 SST: sum of squares total = SSR + SSE SS(Total) = ∑ ¿¿ ¿
 The total amt of variation in the data that cannot be account for by the model

20
 Unlike Significance F, NOT much affected by sample size
If sample size large, Significance F tends to be small
BUT Adjusted R2 can still be very small  % of variation explained by model. As no. of
 Can be < 0 or > 1 Regression Output predictors increase, R square always increases.
 Increases as sample size decreases
Regression Statistics
Multiple R √ R Square % of variance explained by model. Adjusted R square will
R Square SSR/SST stop increasing (and drops) after a certain no. of
(1−R Square)(n−1) variables are added. Then you stop adding variables
Adjusted R Square 1− Usually smaller than R2
(n−k−1)
Standard Error √ MSE
Observations n SE: sample estimate of the standard
deviation (variance) of the error term.
(SD of vertical distances from points
from regression line)

k = no. of variables (x’s), does NOT include intercept

ANOVA
SS
df MS F Significance F
SSR MSR
MSR =
Regression k SSR (k ) MSE 9.3996E-37
MSE =
F is “signal to noise”
SSE ratio. F-statistic p-value of f-test;
Residual n–k–1 SSE (n – k−1)

Total n–1 SST = SSR + SSE


If t Stat is small: there might be collinearity and/or variable not useful

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


β0
Intercept β0 Sβ 0
Sβ 0 6.3723E-11 -120.53457 -75.3223
β1
X1 β1 Sβ 1
Sβ 1 1.0913E-07 0.07971772 0.15287973
β2
X2 β2 Sβ 2
Sβ 2 2.2717E-06 -0.2188881 -0.1012004
Notes:
 Large F gives small p-value. F is the F-statistic Coefficient X gives least square estimate of β j
 Significance F and P-value are the same.
 MSR=SSR/k, MSE=SE2 (SE is the standard error of estimate) or MSE=SSE/n-k-1)
 RMSE (root mean squared error) = √(SSE/n) (by definition) not √MSE=SE
 t Stat gives the computed t-statistic for the individual coefficient test. t-Stat =Coefficient/standard error

Confidence interval for βj


 ^β ± ME= ^β ± t * ^ ^ t* n−k−1 is critical value for the specified
n−k−1 × SE( β j ) , where β j is the j coefficient,
th
j j
confidence coefficient, and SE is the SE given in the regression output. Given confidence interval and
standard error in the regression output table, t-statistics may also be calculated.

Good model

21
 R square large → good model
 Significance F small → good model
 All P-values are small → good model

if overall p-value for variables is small, but 1 variable has large p-value → we reject (not all p-values are small)
 We will want to find a model that removes that variable with large p-value

22
Usefulness Tests of Regression Coefficients (2 tests)

1. Right-sided F test: test that all Xs taken together do NOT linearly contribute to Y
o To test if regression model is useful; to test if all = 0, if one deviates then it’s NOT true
 H0: β 1 = β 2 = … = β k = 0
 Interpretation: all X’s are not contributing → regression is NOT useful
 Note: intercept β 0 is NOT included here
 Ha: at least one β j ≠ 0
o Right-sided F test with k, n-k-1 degrees of freedom
 k = no. of variables in regression
o Results:
 Large Significance F → do NOT reject H0
 all β ’s = 0 → all X’s are NOT contributing
 Regression is NOT useful
 Small Significance F → reject H0
 All you can say is: at least one x that is at least a little linearly related to Y
o Regression model is useful
 Smallest Significance F is most useful
o CANNOT say every x is
o Because can have individual variables with small Significance F, but overall
Significance F is large (need to do individual t-test)
 Significance F decreases as sample size increases
 Unlike Adjusted R2, which is NOT affected by sample size
2. 2-sided t-test: test if one particular Xj does NOT linearly contribute to Y, in the presence of the other Xs
o Assuming regression has some use, test the usefulness of individual X variables in the regression
 H0: β j = 0 → variable is NOT useful
 Ha: β j ≠ 0
o 2-sided t-test with n-k-1 degrees of freedom
^β −0 coefficient j
j
 T-value/ t-statistic/t-ratio/t= =  determines p-value
SE ( ^β ) Standard Error j
j

o Use the results from this test for variable selection (next section)
 Throw away variable with the largest p-value
 Keep throwing away until overall p-value stops improving
o Results:
 Small p-value: reject the null that the variable is not useful.
 Large p-value could mean either:
1. X intrinsically NOT linearly related to Y but could have other relationship, OR
2. X linearly related to Y, BUT collinear with other X’s
o Multicollinearity: variables are correlated with one another
o Other X might make an important variable seem less important

Variables Selection

Not all X’s are useful; sometimes putting in useless X’s might hurt the regression.
Therefore, need to select useful variables

Backward stepwise regression

23
p-value: use this to determine which variable can be thrown away; throw largest p-value → see if overall p
improves
 If overall p-value improves (lower), then continue throwing away variables
 If overall p-value worsens (higher), then add back that variable, and use that model
 In the end, model with the smallest overall p-value will be chosen
o All necessary x’s included; all unnecessary x’s excluded
o All x variables should have necessary transformations and interactions
o Abandon the final model if Adjusted R2 < 0.2

Forward stepwise regression


 Begins with no variables in the equation, addd one a time until no remaining variables make a significant
contribution. At most k+(k-1)+(k-2)+…+3+2+1=k(k+1)/2 regression tried.
The best subset regression
 Only involves 2^k − 1 regressions (each variable is either in or out, minus the case where all are out)

Only check assumptions AFTER selecting the best model

If 2 models are equally good (Compare 2 models using adjusted R2 only):

 Choose the model with fewer variables (parsimony)


 If 2 models have the same no. of variables:
o Choose the model with fewer or simpler transformations

LECTURE 6

Indicator or Dummy Variables

 On-off 0-1 switch


o Dummy variable takes value of 0 or 1 to represent the absence or presence of categorical variables
o Dummy variables also used for: imputing missing values, and identifying outliers
 For Categorical variables with m distinct values (eg. Monday to Friday (1 to 5))
o Include only m-1 dummy variables (eg. here, use 4 dummies)
 Omitted dummy (the “-1” part) is the baseline or reference dummy
 Omit dummy for the first/oldest (or least/most interesting) category
 Coefficient of dummy variable is the amount by which that category differs from the baseline, after allowing
for the effect of other variables (or in the presence of all the other explanatory variables)
 ANOVA is regression with only dummy variables
o ANOVA on income of different areas can be converted to dummy variables (area 1 as the reference,
indicator variables for having or not having certain level of income for areas 2, 3, and 4) and the
ANOVA results will be the same as the regression output for the dummies.
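
A minimal sketch of this idea (illustrative data; pandas and statsmodels are assumed): income is regressed on m − 1 area dummies, with the omitted area as the baseline.

```python
# Sketch: m - 1 dummies for a categorical variable; the dropped category is the baseline.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "Income": [50, 62, 48, 75, 58, 80],
    "Area":   ["A1", "A2", "A3", "A1", "A2", "A3"],
})
dummies = pd.get_dummies(df["Area"], prefix="Area", drop_first=True)  # m - 1 columns, "A1" omitted
X = sm.add_constant(dummies.astype(float))
model = sm.OLS(df["Income"], X).fit()
# Each dummy coefficient = amount by which that area differs from the baseline area "A1".
print(model.params)
```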

For few distinct values:


 Can use dummies
 Discover any non-linear effect of X
 The original quantitative variable CANNOT be used together with its m − 1 dummies, since it is already
represented by the dummies

For continuous variables:


 Can do ‘binning’:
o Divide continuous variables into intervals and treat as categorical variable
o Binning: create ‘bins’ (intervals) out of quantitative variables
 Study non-linear effect of X, OR interactions with other X’s

Dummy variable to capture different intercepts
 Predicted Y = a + b1X + b2D
 When D=1, Predicted Y = (a+ b2) + b1X
 When D=0, Predicted Y = a + b1X
 For the 2 different D:
o I have 2 different regression lines with the same slope but different intercepts
o In effect, when you include only a dummy variable in a regression equation, you are allowing the
intercepts of the two lines to differ (by an amount b2), but you are forcing the lines to be parallel.
To be more realistic, you might want to allow them to have different slopes, in addition to possibly
different intercepts.

Grouping of dummy variables

 Dummies are deleted during variable selection


o And grouped with the omitted (reference) dummy
 Dummies remaining after variables selection may be amenable to grouping

Variable Selection using dummy rows (choose which variables you want to include in your model)

Example: using RegressTemplate.xlsm


 Use area 2 as reference dummy (can pick any)
o Delete Area 2 → because can only have m-1 dummies
o Start with area 1, 3 and 4
 Variable selection: find out if the dummies cluster in some way
 Find 2 areas that have closest coefficients
 Create a combined dummy for the 2 with the closest coefficients (Area 3 + 1)
o This tests whether the 2 intercepts are the same. (Dummy variables result in lines with different
intercepts after inputting ‘1’ or ‘0’; if combining two dummies improves the model, it means the
two line segments actually have the same intercept)
 Delete the original 2 variables used to combine (i.e. delete Area 3 and Area 1)
 Now left with Area 3+1 and Area 4 → does significance F decrease?
o Yes? It’s better
o No? Then use the previous combination
 If yes: again, combine another variable
o Now Create Dummy: Area 3+1+4 (add another variable with closest coefficient)
o Delete Area 3+1 and Area 4 → does Significance F decrease?

Interaction among X’s

If 1 X and 1 dummy interact:

 Put X and dummy separately, and Interaction term = XD (product)


 the effect of one explanatory variable on Y depends on the value of another explanatory variable.
 Predicted Y= a + b1X + b2D + b3XD (a is intercept0, b1 is coefficient1, b2 is dummy intercept, b3 is dummy coeff.)
 Look at 2 separate cases:
o When D = 0, Predicted Y= intercept0 + coefficient1(X)
o When D = 1, Predicted Y= (intercept0 + dummy intercept) + (coefficient1 + dummy coefficient) X
 Check if the model improves (e.g. Adjusted R2 improves)
 Check p-value for these variables after including dummy variables and interaction variables
o Small p-value for interaction variable “carbs*meat”
 the variable is significant and different slopes are needed
o Large p-value for dummy variable “meat”: the variable is not significant
 The dummy variable is not significant, they have the same intercept

 For the 1 dummy variable and 1 interaction variable:
o (ONLY P-VALUE FOR INTERACTION IS SIGNIFICANT) I have 2 different regression lines with the same
intercept but 2 different slopes for the 2 different categories
o OR. (IF BOTH P-VALUES ARE SIGNIFICANT) I have 2 different regression lines with 2 different
intercepts and slopes for the 2 different categories

 The interaction variable can be the product of any two variables: a numerical and a dummy variable, two dummy
variables, or even two numerical variables. It also works when both variables are dummies from
different categorical variables
 Pros and cons to adding interaction variables
o More complex and interesting model, significantly better fits.
o Extremely difficult to interpret.
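
A minimal sketch (simulated data; the names "carbs" and "meat" only echo the example above) of including a dummy and an X·D interaction so the two groups can differ in both intercept and slope:

```python
# Sketch: Predicted Y = a + b1*X + b2*D + b3*X*D fitted by OLS on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
carbs = rng.uniform(10, 60, n)                  # numerical X
meat = rng.integers(0, 2, n)                    # dummy D (0/1)
calories = 100 + 4*carbs + 50*meat + 2*carbs*meat + rng.normal(0, 10, n)

df = pd.DataFrame({"carbs": carbs, "meat": meat})
df["carbs_x_meat"] = df["carbs"] * df["meat"]   # interaction term X*D
X = sm.add_constant(df)
fitted = sm.OLS(calories, X).fit()
# D = 0 line: intercept = const, slope = carbs
# D = 1 line: intercept = const + meat, slope = carbs + carbs_x_meat
print(fitted.params)
```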

Missing Values

 Replace missing values by column average or median


o This assumes that values are missing at random, but they may not be so.
 Replace x variable having missing value(s) by 2 new variables:
o D: dummy, which indicates 1 only at missing value(s)
o X+0: original variable with 0 in place of missing value(s) (original column)
 Treat both variables like regular X’s in variable selection
o If D is thrown away, but X+0 remains:
 Missing values should be replaced by 0 in original variable
 The original column is a useful explanatory variable (assuming small p-value)
o If X+0 is thrown away, but D remains:
 Original variable is unimportant, except when the value is missing. Only the missing values
contribute to explaining the predicted y.
o If both X+0 and D remain:
 y = … + b1X+0 + b2D +…
 For X not missing:
o … + b1·X+0 + b2·D → b1·X + b2·0 → b1·X
 For X missing:
o … + b1·X+0 + b2·D → b1·0 + b2·1 → b2 → b1·(b2/b1)
o Therefore, the missing X can be imputed (replaced) with b2/b1
 If X+0 is lagged by j, then dummy D is also lagged by j
o If both X+average and D remain:
 For X missing: … + b1·X+average + b2·D → b1·(average) + b2·1 → b1·(average + b2/b1)
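
A minimal sketch (illustrative column names) of building the missing-value dummy D and the X+0 column with pandas:

```python
# Sketch: replace an X column with missing values by two new columns, D and X+0.
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [3.2, np.nan, 5.1, 4.4, np.nan, 6.0]})
df["X_missing_D"] = df["X"].isna().astype(int)   # D: 1 only where X is missing
df["X_plus_0"] = df["X"].fillna(0.0)             # X+0: original values, 0 where missing
# Both new columns are then treated like regular X's during variable selection.
print(df)
```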

Outliers

 It has an extreme value for one or more variables


 Outliers change a lot of things:
o Y-outlier (outliers in the Y direction)
 a large residual and no leverage on the slope
 omitting the outlier will not change the slope, but is likely to change the intercept and increase R2
o X-outlier (outliers in the X direction):
 a large leverage (affects the slope) and a small residual
 omitting the outlier changes the slope and is likely to increase R2
 We can use dummy variables to identify outliers: (RegressionTemplate)
o Extra dummy variable column for each row suspected of containing an outlier
 ‘1’ at the outlier and ‘0’ for the rest
o Run the regression and check the p-values of the dummy variables
 Small p-values
 The row dummy that remains AFTER variable selection indicates that the row is an outlier
 Why?
 Model sacrificed one degree of freedom just to fit row 2
 Once you do that the residual for row 2 is 0
 Outlier point = the one which corresponds to the outlier row dummy = “1”
 Dealing with outliers
o We usually leave out outliers and their β ’s
o They may simply do not belong with the rest of the data and ought to be omitted
o These points could be important. They can indicate that the underlying relationship is complicated.
o When in doubt, create and report two models: one with the outlier(s) and one without.

Variable Transformations

 Consider X transforms first


o Look at the shape of the scatterplot of X and Y to determine suitable transformation
o For modelling the 4 basic shapes with no time for looking at data:
 Include only square & cube-root
 If cube-root in final model, see if sqrt, inverse or log can substitute for it
o For modelling the 4 basic shapes when X’s are +ve:
 Include only square and sqrt
 Y transforms trickier to determine, NOT likely used

Prediction:
 Try to stay within or near the range of each X
 Use feasible values of X
o Eg. for dummy variables, don’t use 0.5, use either 0 or 1
 Prediction is almost always better with analytics than without

Interpretation of transformed variable (log independent variable)


 Suppose that Units increases by 1%, for example, from 600 to 606. Then the equation implies that the
expected Cost will increase by approximately 0.01(16654) = 166.54 dollars. In words, every 1% increase in
Units is accompanied by an expected $166.54 increase in Cost. Note that for larger values of Units, a 1%
increase represents a larger absolute increase (from 700 to 707 instead of from 600 to 606, say). But each
such 1% increase entails the same increase in Cost. This is another way of describing the decreasing marginal
cost property.

Logistic Regression

 Frequently used when deciding between only two possible outcomes


 Regression where Y (NOT X) is a 0 to 1 variable (eg. originally Y is a number p: 0 < p < 1)
o In its simplest form, the Y-variable only takes on the values 0 or 1
o Transform Y so that the transformed value can range over −∞ < logit(p) < ∞
 Logit(p) = ln( p / (1 − p) )
o ∴ Regression: ln( p / (1 − p) ) = b0 + b1x1 + b2x2 + … + bkxk + εi
 Then transform back to p, where 0 < p < 1
 It CANNOT be re-expressed as a common multiple regression problem
 Used a lot in financial situations (eg. judgment on whether to invest or not, assessing loan applications)
 Needs specialized software
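
A minimal sketch of such specialized software in use (statsmodels' Logit on simulated data; all names and values are illustrative, not from the notes):

```python
# Sketch: fit ln(p/(1-p)) = b0 + b1*x1 and transform predictions back to probabilities.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5*x)))
y = rng.binomial(1, p_true)                      # 0/1 outcome (eg. approve loan or not)

X = sm.add_constant(x)
logit_model = sm.Logit(y, X).fit(disp=0)
p_hat = logit_model.predict(X)                   # back-transformed probabilities in (0, 1)
print(logit_model.params, p_hat[:5])
```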

LECTURE 7: INTRODUCTION TO TIME SERIES

 Big data types:


o Business data (eg. quarterly sales, quarterly no. of mortgage applications)
o Economic data (eg. quarterly unemployment rates, price of COE)
o Financial data (eg. STI, Dow Jones Industrial Average, FTSE 100 Index)

Time Series
o Objective:
 Search for patterns in historical series and extrapolate these patterns into the future
 Provide forecasts of future values of the time series, based on past information
o Time sequence of data is an important aspect
 Eg. if we change the sequence, we will get a different result
 Most time series are equally spaced at roughly regular time intervals
 Eg. daily, monthly, quarterly, annually
 NOT time series if:
 NOT recorded sequentially, OR
 Sequence NOT important
o Time series plot:
 Time on x-axis (horizontal)
 Variable on y-axis (vertical) → eg. quarterly sales
o Components of Time Series: (can exhibit none, or maybe one or two components)
 Trend component (T)
 Types:
o Linear Trend: α + βt (the change in Y remains constant)
o Quadratic Trend: α + βt + γt²
o Exponential Trend: α·exp(βt) (the percentage change in Y remains constant)
o Polynomial Trend: α0 + β1·t + … + βp·t^p
 Seasonal component (S)
 Short-term (1 year or less), repetitive behaviour
 Time between peaks: Period
o Each Period → similar pattern

(figure: Seasonality + Increasing Exponential Trend)
 Cyclical component (C)
 Longer than 1 year, more irregular (random fluctuations) and difficult to predict

 Irregular component (I)


 Residuals/errors of the model → CANNOT be explained using fitted model
 Irregular unpredictable changes; Tend to hide other components
 Assumed to have Mean = 0 (because nobody expects error)
 Errors are useless, they tell you nothing
 We assume that errors are independent of one another
 We do NOT observe the value of the irregular component. We compute it from the fitted model

To compute the error:

Actual value – Fitted value (forecast) = Residual

 Irregular component → Population


 Residuals/errors → Sample
 Of interest:
o Whether variability of error component changes over time
o Whether any outliers/spikes
 Use Residuals Analysis to justify model:
o Indirectly tells us whether model makes sense
o Random/No pattern

 Multiple Regression vs Time Series:

 Additive model:
o Y = Trend + Seasonality + Cycle + Random
 Multiplicative model:
o Y = Trend x Seasonality x Cycle x Random
o Always fit the multiplicative model by taking logarithms (transformations)
 Ln(yt) = a + bt + et
 Why?
 Easy to estimate coefficients
 Transformed data more closely satisfies assumptions of statistical models (eg. normality)
o Better fitted model (smaller Significance F)
 To change the form to an Additive model

Modelling Procedure of Regression Analysis

1. Formulate the Model (specify the variables)


2. Estimate the parameters
3. Perform model diagnostic testing (residual plots)
 To justify your model selection
4. Conduct hypothesis testing
 To test whether necessary to use your predictive model
5. Reformulate the model if necessary, then repeat steps 2 to 4
 If satisfied, use the model to do forecasting

STEP 1–2:
Linear Trend Model → Salest = a + b(time)
Sales changes by a constant amount each time
need to convert quarter data (Q1, Q2, Q3, Q4, Q1,Q2…) to time= 0, 1, 2,3,4,5,…
regression of sales and time
o Slope, b:
o Expected change of sales
o Salest – Salest-1 = (a+bt) – (a + b (t – 1)) = b
o Because variable is time, and there is only one variable
o Intercept, a:
o Expected value of quarterly sales at the initial time
o Trend line:
o Ignores seasonal variation in the sales
o Using linear trend equation to forecast sales may result in over /underestimate in different quarters

Exponential Trend Model → Ln(Salest) = a + b(time) OR Salest = e^(a + b·time)


Sales changes by a constant percentage
 Excel Steps:
o Create a new variable and name it Ln(Sales) → just the natural log of sales
o After this transformation, the exponential trend model is converted to linear trend model.
o Data Analysis > Regression > use transformed data as variable, and ‘Time’ as predictor
o From output, can write down fitted model from intercept and slope

 Check to make sure Adjusted R2 is higher for the transformed data (Ln (Sales)) than original
data (Sales)
 If higher: transformed data is better model → greater variance explained by model
 For Exponential:
o Slope, b:
 Approximately the % change of sales per time unit (eg. quarter)
 HOWEVER, only holds true if slope is close to 0
 If slope is very large, forget it
o Intercept, a:
 Expected value of ln(sales) at initial time
o To find the actual % change in sales per time unit:
 % change = e^slope – 1
o For t = 0 (time 0):
 Sales = e^intercept → since exp(0) = 1
 Trend line:
o Still ignores seasonal variation in the sales
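
A minimal sketch (simulated quarterly sales, not real data) of fitting both the linear and the exponential (log-transformed) trend models and comparing Adjusted R²:

```python
# Sketch: Sales_t = a + b*time vs ln(Sales_t) = a + b*time, fitted by OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
time = np.arange(20)                              # 0, 1, 2, ... quarters
sales = 100 * np.exp(0.05 * time) * np.exp(rng.normal(0, 0.02, 20))

X = sm.add_constant(time)
linear = sm.OLS(sales, X).fit()
exponential = sm.OLS(np.log(sales), X).fit()      # fit the transformed data

print("Adj R2 linear:", linear.rsquared_adj)
print("Adj R2 exponential:", exponential.rsquared_adj)
print("approx % growth per quarter:", np.exp(exponential.params[1]) - 1)
```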

STEP 3–4: Diagnostics


Test for Independence Assumption for residuals in time series

o Durbin-Watson Test:
o Used to test for Autocorrelation between 2 sequential errors (e.g. lag1 and lag2, lag2 and lag3)
 Autocorrelation of errors: correlation between errors; ie. Errors are dependent
 Autocorrelation of lag 1 and lag 2 errors means overprediction in Jan will lead to
overprediction in Feb
 Test whether errors are independent or not
o d = [ Σ_{i=2..n} (ε̃_i − ε̃_{i−1})² ] / [ Σ_{i=1..n} (ε̃_i)² ]
= [ Σ (ε̃_i)² + Σ (ε̃_{i−1})² − 2 Σ ε̃_{i−1}·ε̃_i ] / Σ (ε̃_i)²

 Numerator: Sum of [Error at time i (starts from 2), subtract previous error, and square it]
 Denominator: Sum of [square error for each time i]
o Null hypothesis: No lag-1 autocorrelation
o Alternative hypothesis: there is lag-1 autocorrelation (either positive or negative  two-sided test)
 0 ≤ d ≈ 2(1 − ρ̃) ≤ 4
 When ρ̃ = 0 → d ≈ 2
 When ρ̃ = 1 → d ≈ 0
 When ρ̃ = −1 → d ≈ 4
 Easier to use p-value
 Since tables for critical values are NOT always readily available
 If p-value < α , reject null
 Reject null when it’s in the 2 yellow regions
 There is autocorrelation
 When d close to 0: positive autocorrelation
 When d close to 4: negative autocorrelation
 Do NOT reject null:
 No evidence of autocorrelation
 Dependence between 2 consequent errors are NOT correlated to each other

 White Area:
 Inconclusive region, we don’t know what to do
 dL and dU:
 critical values, can be found in table for DW test
 fixed number, but varies for different no. of:
o Observations
o Variables

o Limitations:
 Only test for first autocorrelation (between 2 consequent errors), but not others
 If fall within inconclusive region (white), we don’t know what to do
 However, DW test usually found in business reports, so we learn it
o Interpretation:
 If there is Autocorrelation btw the errors (Errors NOT independent):
 Formula used to compute Standard Error is wrong
o Thus confidence interval & hypothesis test will be wrong
o Errors are supposed to be useless
 Least squares estimator
o Still linear and unbiased (expectation = true parameter)
o BUT it is NOT efficient (less accurate estimator)
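
A minimal sketch of computing d directly from a residual series (illustrative residuals; statsmodels' durbin_watson is used only as a cross-check):

```python
# Sketch: Durbin-Watson statistic from residuals; values near 2 suggest no lag-1 autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.array([0.5, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1])    # residuals from a fitted trend model
d = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)         # numerator starts at i = 2
print(d, durbin_watson(resid))
```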

Test for Variance assumption

Check equal variance/spread assumption (linearity/independence assumption also) using residual plot
(figure: residual plots illustrating non-equal variance, non-linearity, and dependent residuals)

Test for whether coefficients in trend model are significant

o F-test
o Used to test if independent variables can predict the dependent variables
 To test how useful the fitted model is
o Null hypothesis:
 H0: β 1=β 2=β 3=…=β k =0
 Coefficients (slopes) are jointly zero
 F-test is a joint test. If even one is NOT 0, then reject the null hypothesis
 If all are 0 (do NOT reject null), then variables are useless
o Alternative hypothesis:
 H1: β i ≠ 0 (at least one coefficient is not zero/can predict the dependent variable)

o Reject Null
 When p-value (significance F) < α
 Conclude: the fitted model is useful in predicting the dependent variable
o Individual t-test
o If reject null in F-test, may want to check significance of each coefficient
o t-test statistic: t = (β̃_i − 0) / SE(β̃_i) → coefficient / standard error of coefficient
o Null hypothesis:
 H0: β i = 0
 Coefficient (Slope) of a particular variable (predictor) = 0
 It is NOT a joint test like F-test
 If the null hypothesis is NOT rejected → that variable is useless
o Reject Null
 When the p-value of the t-test < α
 Conclude: the variable is useful in predicting the dependent variable
 Since t-statistic is significant, then advisable to include this extra variable

Comparing trend models (eg. linear vs exponential trend model)


o For non-linear models:
o DON’T use a linear trend to describe a non-linear pattern
o By trial and error, we transform data to get a linear trend.
o For exponential, it is quite obvious that we use ln
o If start from t = 0 vs t = 1
o Different intercepts
o Same slopes
o To test if models are really a better fit:
o Run regressions on both original and transformed data
o Goodness of Fit (how much variation explained by model)
 R Squared
 Larger R2 (closer to 1) → better model (more variation explained by model)
 Adjusted R Squared
 Larger Adjusted R2 (closer to 1) → better model (more variance explained by model)
o Forecast Accuracy (smaller forecast errors → better model)
 Compare MAPE (or any other measures of forecast eg. MAD, RMSE)
 Smaller MAPE → better model
o eg. 0.036%, means that fitted values only deviate 0.036% from actual values
 we usually read MAPE instead of MSE
o MAPE usually removes the units → makes more sense in terms of %
o Residual Analysis (whether model is adequate to describe underlying pattern of model)
 Residual plot (visual analysis)
1. Check that Expectation (Mean) of residual should be zero
o Fluctuates around 0 quite evenly?
2. Variance of residual should be constant as time increases
o Eg. if variance increases over time → NOT constant
o If first part fixed? It’s ok
3. Should NOT have Autocorrelation (Independence)
o Random → NOT Autocorrelated
o Not Random → Autocorrelated
 If autocorrelated (errors are correlated): NOT independent

o Keep in mind to choose models based on:
o Interpretability & Parsimony
 Parsimony: less complex, less predictors (simple structure)

Forecasting methods

 Qualitative methods: (rely on subjective opinions from one or more experts)


o Grass Roots
 Derive future value by asking person closest to customer
o Market Research
 Trying to identify customer habits; new product ideas
o Panel Consensus
 Deriving future estimations from the synergy of a panel of experts in the area
o Historical Analogy
 Identifying another similar market
o Delphi Method
 Similar to Panel Consensus, but with concealed identities

 Quantitative methods: (rely on data and analytical techniques)


o Time Series
 Models that predict future values based on past history
o Causal Relationship
 Models that use statistical techniques to establish relationships between various items and
variable of interest (Regression Analysis)
o Simulation
 Models that can incorporate some randomness and non-linear effects (try to mimic reality)

Forecast Errors
 Forecast Origin: Time at which forecast is made
 Forecast Horizon: Time period to which the forecast relates
 Forecast Error: Difference between actual value and forecasted value from fitted model
o Smaller forecast errors → better forecast method
o h-step-ahead forecast: Ft+h → forecast for period t+h made at time t
o h-step-ahead forecast error: et+h = Yt+h – Ft+h → error of forecast (actual – forecast)
o Measures of forecast error: (can choose whichever to minimize; they tend to make the others small)
 Bias:
 Arithmetic average of the errors
o Bias = (1/n) Σ_{t=1..n} e_t = (1/n) Σ_{t=1..n} (Y_t − F_t) Excel: average of (actual – forecast)
 Limitation: Bias cannot differentiate → use MAD instead
o When errors are 0, 0, 0, 0 → Bias gives us 0
o When errors are 10, -10, 10, -10 → Bias also gives us 0
 MAD (Mean Absolute Deviation):
o MAD = (1/n) Σ |e_t| Excel: average of ABS(actual – forecast)
 Limitations:
o MAD is scale dependent (depends on units) → use MAPE instead
 Eg. if we change the currency, the forecast error also changes
o MAD may not identify the existence of large errors (cannot differentiate) → use MSE instead
 When errors are 10, -10, 10, -10 → MAD gives 10
 When errors are 5, 15, -5, -15 → MAD also gives 10
 MAPE (Mean Absolute Percentage Error) (best; most often used in business analytics)
o MAPE = 100 × (1/n) Σ |e_t| / |Y_t|
o Excel: average of [ABS(actual – forecast) / actual], × 100
o Standardized by dividing by the actual value
 Impact of unit disappears
 MSE (Mean Square Error) (similar to the simple sample variance)
o MSE = (1/n) Σ_{t=1..n} e_t² Excel: average of (actual – forecast)^2

 Limitation:
o MSE penalizes large errors because the errors are squared
 NOT in the same unit as the data (it’s a squared unit)
 RMSE (Root Mean Square Error) (similar to the standard deviation of the sampling distribution)
o Square root of the MSE → so RMSE is in the same unit as the data
o RMSE = √MSE = √[ (1/n) Σ e_t² ] Excel: SQRT of the average of (actual – forecast)^2
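
A minimal sketch computing the five measures for an illustrative actual/forecast pair (values are made up):

```python
# Sketch: Bias, MAD, MAPE, MSE, RMSE from actual vs forecast values.
import numpy as np

actual = np.array([120.0, 135.0, 150.0, 160.0])
forecast = np.array([118.0, 140.0, 148.0, 155.0])
e = actual - forecast

bias = e.mean()
mad = np.abs(e).mean()
mape = 100 * np.mean(np.abs(e) / np.abs(actual))
mse = np.mean(e ** 2)
rmse = np.sqrt(mse)
print(bias, mad, mape, mse, rmse)
```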

Prediction Intervals

Point forecast: sub in values of 26th month to get a single data point (forecasted)
 Point forecast may differ from actual value
 Chance that they are the same is very small
 “Future Expected Sales at time = ___ are [point value]”

Interval forecast: forecasted values will fall within a certain range, with a 95% confidence
 Compute lower and upper confidence intervals:
o Ft+1 ± z x SE (lower: Ft+1 – z x SE, upper: Ft+1 + z x SE)
 Ft+1 is the point estimate (eg. of log sales, if you use the SE from the log-sales model)
 z: (value from Normal tables)
 90% confidence interval → z = 1.645
 95% confidence interval → z = 1.960
 99% confidence interval → z = 2.576
 SE: (estimated standard error of residual/error)
 Taken from regression output

Regression Statistics
Multiple R 0.999999996
R Square 0.999999992
Adjusted R Square 0.999999992
Standard Error 0.000462371
Observations 25

o However, Future Expected Sales may be as low as [lower interval] under bad conditions, and as high
as [higher interval] under good conditions”

LECTURE 8
How to model both Trend and Seasonality (Whole_foods.xlsx)

Trend ignores seasonality → what if there’s both Trend and Seasonality?


*Seasonal Model does NOT work well since there is trend as well
Additive Model (Linear Trend and Seasonality):
 Create N-1 dummy variables
o To represent seasonal pattern in a time series model
o -1 because 1 season has to be left out to avoid collinearity
 Q4 is our ‘Base’ Season
o Since have 4 predictors → 3 dummies and Time
o For Q4, all dummies are 0
 When the sales happen in the season, it’s represented by dummy variable = 1, if not then dummy = 0
o If Q1 = 1, then Q2 = 0 and Q3 = 0 (by definition)
o Interpretation: Coefficient of dummies: additional increase in sales compared the base season (Q4)
 Data > Data Analysis > Regression
o Y-variable = Sales
o X-variables = the 4 predictors
 From the output:
o Write down fitted model: intercept, slope for time and coefficients for seasonal dummies
o Coefficient of dummies: additional increase in sales compared to the base season (Q4)
 Look at the p-value of each variable
o If insignificant: the coefficients for those seasonal dummies are not significant (that season does not
differ significantly from the base season)

Multiplicative Model (Nonlinear Trend and Seasonality):


 Transform the data (ln )
o Then you can work the multiplicative model into additive model
 Create N-1 dummy variables like before
 Data > Data Analysis > Regression
o Y-variable: ln (Sales)
o X-variable: Time + Seasonality
 Write down fitted model from output
o Growth Rate due to trend: exp(slope of Time) – 1
o Growth Rate due to seasonality in (eg. Q1): exp(slope of Q1 dummy) – 1
 If slope is small, growth rate should be close to slope
 If slope is large, growth rate should NOT be close to slope → not very accurate
o Growth Rate for Q1 = Growth Rate due to Trend + Growth Rate due to Seasonality

Ratio-to-Moving-Averages (Seasonal Index)


 Idea:
o Data has both trend and seasonality
o So just split the trend and seasonality, and model them independently
 Process:
o Introduce a Seasonal Index to model how a particular season compares to the average season
 Average Season: average across your data (instead of base season like in previous model)
1. Deseasonalize:
 Deseasonalized data = Actual Sales / Seasonal Index
 Seasonal Index = average value of the Ratios in a particular season (eg. Q1)
 Find “average sales” of ALL sales (=AVERAGE(sales))

 Find “Ratio” (=Sales/Average Sales)
 Seasonal Index: averaging the all the ratios for a particular season
o Eg. for Q1, take Average for all Q1 Ratios
o Average of the Seasonal Indices = 1:
 If SI > 1, sales in that season are higher than average (eg. 1.157 → 15.7% higher than average)
 If SI < 1, sales are lower than average (eg. 0.919 → 8.1% lower than average)
o Sum of Seasonal Indices = no. of seasons
 Eg. if 4 quarters, sum of season index = 4
 Eg. if 12 months, sum of season index = 12
 Impact of Season disappears, only Trend left
2. Get a Forecast for Deseasonalized data
 Choose an appropriate Trend model (since now only have trend component) by plotting
Deseasonalised Sales and Time to see the trend
 Eg. If observe a linear trend: Data Analysis > Regression
 Deseasonalised Sales: y-variable. Time: x-variable
 Fitted model: deseasonalized Sales = intercept + slope x Time
 Eg. if observe an exponential trend (nonlinear trend): Data Analysis > Regression
 ln (deseasonalized Sales): y-variable. Time: x-variable
 Fitted model is double transformed data → ln & deseasonalized Sales
 ln (deseasonalized Sales) = intercept + slope x Time
 To cancel ln, take the exponential on both sides:
o Deseasonalized Sales = e^intercept · e^(slope × Time)
 Expected sales at time = 0 are e^intercept
 Sales expected to increase by e^slope – 1 every time unit (eg. month/quarter)
3. Reseasonalize: Point Forecast
 Multiply the forecast by the Seasonal Index to get the forecast for the actual data
 Eg. for the exponential trend model: Sales = e^intercept · e^(slope × Time) × SI
 Eg. if the initial time point is Q1 of 2007, to find the point forecast for Q3 of 2008:
use time = 7 (NOT 3), because it is the 7th quarter in the sequence (see the sketch below)
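
A minimal sketch (simulated quarterly data, pandas/statsmodels assumed) following the three steps as described above: compute seasonal indices from ratios to the overall average, deseasonalize, fit a linear trend, then reseasonalize the point forecast.

```python
# Sketch: seasonal index -> deseasonalize -> trend model -> reseasonalized point forecast.
import numpy as np
import pandas as pd
import statsmodels.api as sm

quarters = np.tile([1, 2, 3, 4], 4)
time = np.arange(1, 17)
sales = (200 + 5*time) * np.tile([1.15, 0.95, 0.92, 0.98], 4)
df = pd.DataFrame({"quarter": quarters, "time": time, "sales": sales})

df["ratio"] = df["sales"] / df["sales"].mean()                 # ratio to average sales
seasonal_index = df.groupby("quarter")["ratio"].mean()         # one SI per quarter
df["deseasonalized"] = df["sales"] / df["quarter"].map(seasonal_index)

trend = sm.OLS(df["deseasonalized"], sm.add_constant(df["time"])).fit()
t_next, q_next = 17, 1
point = trend.predict(np.array([[1.0, t_next]]))[0] * seasonal_index[q_next]  # reseasonalize
print(point)
```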

Summary of Multiple Regression-based models

 Linear Trend Model:


o If data shows only linear trend
 Exponential Trend Model
o If data shows only nonlinear trend
 Additive Model of Seasonal Patterns
o If data shows linear trend AND seasonality
 Multiplicative Model of Seasonal Patterns
o If data shows nonlinear trend AND seasonality
 Ratio-to-Moving Averages Method
o If data shows trend AND seasonality → ‘deseasonalize’

Model Assumption: dependence between Y and X’s is stable overall, in the past and in the future
MR model is better for long term forecast as compared to time series model.

How to Model if Underlying Pattern is NOT Apparent? (NO obvious trend or seasonality)

Naïve forecast:
 Forecast = Last Observation (Ft+1 = Yt)
 Naïve forecast only makes sense if history repeats itself
 Trace values well, but lagged behind
 Naïve forecast is random walk (non-stationary)
 Limitation 1:
o Only appropriate for immediate/short term forecast
o Always same value for long-term prediction (eg. 10 yrs)
 Ft+1 = Yt; Ft+10 also = Yt (because Yt is the last known value)
 So will be a straight line after the last known value
 Limitation 2:
o Consists of past errors (random noise) as well
o Use Smoothing Out Method to ‘smooth out’ past errors

Smoothing Out method:


 Take average of past errors so the errors will be smoothed out
 Yt = C + et → E(Yt) = C + E(et) ‘average’ of errors
o C : systematic pattern (true value) et : error
o Model Assumption: C is constant over the period when taking average
o SMA, WMA, SES only effective for short-term forecasting as they require continuous updating

How to smooth?
 Simple Moving Average (SMA) can determine the value of n using Excel:
o Forecast is the average of past n observations (NOT all observations)
 Ft+1 = (Yt + Yt−1 + … + Yt−n+1) / n
 n = forecasting horizon (how far back we look)
 the n observations are treated equally → equal weights
o Larger no. of n → smoother forecast (stable BUT less accurate)
 Higher MAPE (percent error)
 Use large n if expect there to be little or no change in the future
o Smaller no. of n → more responsive to changes (less stable BUT more accurate)
 Lower MAPE (percent error)
 Use small n if expect there to be change
 Because small n is more responsive to changes
 Easily influenced by outliers. Use median instead of mean
o Limitation:
 May miss trends (eg. downward trend) as the data gets average out
 Weighted Moving Average (WMA) can determine the value of n and wn using excel:
o Weighted Average of past n observations (NOT all observations)
o Ft+1 = w1·Yt + w2·Yt−1 + … + wn·Yt−n+1
 Higher weights assigned to more recent (in most cases, but not always)
 w1 > w2 > … > wn
 weights reflect relative importance of each previous observation
o higher importance given to more recent data → may reveal trends
 Weights sum up to 1
 w1 + w2 + … + wn = 1
o WMA more flexible than SMA (with equal weights)

 Eg. SMA may miss a downward trend
 WMA gives higher importance to recent data, which can reveal a downward trend
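
A minimal sketch of one-step-ahead SMA and WMA forecasts (illustrative data and weights):

```python
# Sketch: SMA uses equal weights on the last n observations; WMA uses declining weights.
import numpy as np

y = np.array([52.0, 55.0, 53.0, 58.0, 60.0, 59.0])

n = 3
sma_forecast = y[-n:].mean()                      # equal weights on the last n observations

w = np.array([0.5, 0.3, 0.2])                     # w1 > w2 > w3, sum to 1, w1 on the most recent
wma_forecast = np.dot(w, y[::-1][:n])             # most recent observation gets weight w1
print(sma_forecast, wma_forecast)
```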
 Single Exponential Smoothing (SES):
o Idea: the prediction of the future depends mostly on the most recent observation, and on the
latest forecast
o Weighted moving average with exponentially decreasing weights that are controlled by smoothing
constant α
 Ft+1 = α Yt + (1 – α )Ft
 Smoothing Constant alpha α :
 the weight α is given to the most recent observation, and 1 − α to the previous forecast
 α is a self-learning procedure → automatically corrects previous forecast by
considering the forecast errors it made in the past
 Denotes importance of the most recent observation
o How much our forecast will react to previous forecast error
 Smaller α :
 There is little reaction to previous error
 No need to update forecast so much
 Smoother and stable to sudden changes
 Selection of the initial forecast is more important
 Larger α :
 There is a lot of reaction to previous error
 Update a lot: should depend on more recent observations
 Less smoothing effect
 Follow historical values closely
 α tells us how much we should update our current forecast from previous forecast
o α tells us if previous forecast is trustable or not
o If α = 0:
 Previous forecast error has no impact on current forecast
 No need to update previous forecast
 past forecast = current forecast
 Forecasts over time are similar to each other
o Flatter forecast curve
 Same as naïve forecast for long-term forecast (last known Ft)
 Since we only have 1 most recent observation → if predict
far into the future, it’s the same value
o If α = 1:
 Previous forecast is not good at all → need to update
 Depend on most recent observation
 Most recent observation = current forecast
 Same as naïve forecast
 Can be rearranged as Ft+1 = Ft + α (Yt – Ft)
 Yt – Ft = forecast error in the past
 Interpretation: Forecast + Correction on previous forecast error
 α tells you whether your previous forecast is trustable or not
 Can be rearranged as Ft+1 = αYt + α(1 − α)Yt−1 + α(1 − α)²Yt−2 + …
 Weights decline exponentially into the past
 Distant values get smaller weights
o How to choose initial forecast? (2 methods)
 Naïve forecast

 Just copy from previous observation
 Simple Moving Average (to better smooth out error)
 Take average of previous four or five observations
 Disadvantage of SES:
 SES does NOT consider trend (or seasonality)
o Eg. If there is a trend in the data:
 Regular exponential smoothing will always lag behind the trend
 However, SES has Advantages:
o Considers all past available observations (better than SMA & WMA)
o Can find the optimal smoothing constant
 Therefore, we modify the method → Holt-Winters’ Method (considers trend)
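
A minimal sketch of the SES update Ft+1 = αYt + (1 − α)Ft, with a naïve initial forecast (illustrative data):

```python
# Sketch: single exponential smoothing with smoothing constant alpha.
import numpy as np

y = np.array([52.0, 55.0, 53.0, 58.0, 60.0, 59.0])
alpha = 0.3

f = np.empty(len(y) + 1)
f[0] = y[0]                                       # initial forecast: naive (copy the first observation)
for t in range(len(y)):
    f[t + 1] = alpha * y[t] + (1 - alpha) * f[t]  # forecast + alpha * (previous forecast error)

print(f[-1])                                      # forecast for the next period
```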

Holt-Winter’s Method: Extension of Single Exponential Smoothing method if there is trend or seasonal variation

Holt’s exponential smoothing method


 If there is trend, use 2 smoothing constants (Alpha & Beta):
o α → smoothing constant for the data
 Systematic pattern (fixed)
o β → smoothing constant for the trend
 Controls change of trend
 Trend: Increase/drift over time
 Idea:
o Split the effects of level and trend
 Forecast for the next period: Ft+1 = Lt+1 + Dt+1
 Forecast for h periods into the future: Ft+h = Lt+1 + hDt+1
o h: periods to be forecasted into the future
1. Lt+1: forecast for level → same as SES
 Depends on most recent observation (Yt) + previous forecast
 Lt+1 = α Yt + (1 – α )(Lt + Dt) Yt: actual observation of series in period t
2. Dt+1: forecast for trend
 Dt+1 = β (Lt+1 – Lt) + (1 – β )Dt
 Depends on most recent observation for trend + previous forecast of trend
 Lt+1 – Lt: change from new level from old level (trend)

Winter’s exponential smoothing method (note: even if NO trend or seasonality, still can use Holt-Winter’s)
 IF there is both trend & seasonality, introduce 1 more smoothing constant (Gamma):
o α → smoothing constant for the data
o β → smoothing constant for the trend
o γ → smoothing constant for seasonality
 Idea:
o Split the effects of level and trend and seasonality
 Ft+1 = (Lt+1 + Dt+1) x St+1 OR
 Ft+h = (Lt+1 + hDt+1) x S(t+h–M) (1 < h ≤ M)
 Forecast for h periods in future: (new Level + [h x new trend]) x seasonal component
 M: length of seasonality (no. of periods in the season)
1. Exponentially Smooth Series: (same as SES)
o Lt+1 = α·(Yt / St+1−M) + (1 − α)·(Lt + Dt)
 Actual observation divided by Seasonal Index (St+1-M) (deseasonalise) to
ignore impact of seasonality + previous forecast for level
2. Trend Estimate: (same as Holt’s method)
o Dt+1 = β (Lt+1 – Lt) + (1 – β )Dt
 Most recent observation level for trend + previous forecast for trend
3. Seasonality Estimate:
o St+1 = γ·(Yt+1−M / Lt+1−M) + (1 − γ)·St+1−M
 Most recent observation divided by Level to ignore impact of trend +
previous forecast for seasonality
 Smoothing constants α , β , γ tell us:
o How good their respective previous forecasts are
 How much we should update our previous forecast (the greater the values of α, β and γ,
the more updating is needed)
 eg. if α near 0 → previous forecast of level is good, no need to update much
 eg. if β = 1 → previous forecast of trend is not good, need to continuously update it
 eg. if γ = 0.43 → somewhat neutral, seasonal forecast should be updated over time
o eg. over time, magnitude of seasonal variation becomes larger and larger. If
use constant seasonal index, will miss increase in change in seasonal impact
o If α, β, γ = 0, it does NOT mean that there is no trend or seasonality
 Just means initial forecast for level/trend/season are good enough
 No need to update initial forecast
o If α, β, γ = 1, all components need to be updated.
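
A minimal sketch (simulated quarterly data) using statsmodels' Holt-Winters implementation; the fitted smoothing constants play the role of α, β and γ above, and the data and settings are illustrative assumptions:

```python
# Sketch: Holt-Winters with additive trend and multiplicative seasonality.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

time = np.arange(1, 21)
sales = (200 + 5*time) * np.tile([1.15, 0.95, 0.92, 0.98], 5)
series = pd.Series(sales)

hw = ExponentialSmoothing(series, trend="add", seasonal="mul",
                          seasonal_periods=4).fit()
print(hw.params)          # includes the fitted smoothing constants (alpha, beta, gamma)
print(hw.forecast(4))     # forecasts for the next 4 quarters
```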

Quick Summary of Time Series Forecasting Methods:

If the underlying pattern is not apparent, use smoothing methods, without assumptions on the trend and seasonal
components. Only appropriate for immediate/short term forecasts.

Underlying pattern of the series is clear, use regression-based modelling methods for forecast. If there is linear
pattern, linear trend model and additive model of seasonal patterns are appropriate. Or else use log transformation
to obtain linearity, which leads to exponential model or multiplicative model.

If data show trend and seasonality, use ratio-to-moving-averages method and HW model

Simple Exponential Smoothing Prediction Interval

Forecasting method most effective when parameters for the trend and seasonal components may be changing over
time.

o Assigns more weight to more recent observations

1. Estimate initial level of time series at t=0

= initial estimate of level t=0


2. A point forecast made at time T for yT+T is

3. for yT+1

for yT+2

The above formula shows the coefficients of yt’s on the r.h.s. decrease exponentially with time.

LECTURE 9:

Autocorrelation: time series depends on its own past values (serial dependence)
 To determine if there is autocorrelation:
o Copy the data, then paste it one row down to get the Lag-1 series (so that each row pairs Yt with Yt−1)
o =CORREL function to get autocorrelation

How to assess the existence of autocorrelation?

o Visual analysis: lagged scatterplot (e.g. original vs. lag 1, lag 1 vs lag 2)
o Copy data, then paste it one row down to get Lag-1 data. Insert > Scatterplot
o from scatterplot:
 Highly correlated if
 Strongly clustered around straight line (there is autocorrelation)
 if Random scattering: indicates that NO autocorrelation
o Value at time t independent of values at other times
o Past values CANNOT be used to predict future values
 Sign of correlation:
 Downward sloping: negative correlation
 Upward sloping: positive correlation
 HOWEVER, visual analysis is just a rough idea
 More precise values, may want to calculate the autocorrelation
 Quantitative value: autocorrelation function (ACF) (how to compute autocorrelations)
o ρk = E[ (Yt − μ)(Yt−k − μ) ] / σ²
 Interpretation: Covariance of 2 random variables/variance of the time series
 Same time series, therefore the 2 variables use the same mean, and same variance
 k = time lag
 Autocorrelation is a function of the time lag k
o Sample Autocorrelation Function (ACF)
 for measuring autocorrelations in samples, instead of population
 eg. lag-1: (measure autocorrelation between 2 successive observations)
 ρ̂1 = [ Σ_{t=2..T} (Yt − Ȳ)(Yt−1 − Ȳ) ] / [ Σ_{t=1..T} (Yt − Ȳ)² ]
 Summation starts from 2, corresponding to the first available pair
 Ȳ = average of all Yt
 T = sample size
 Numerator: T – 1 terms (because start from 2)
 Denominator: T terms → don't match? (because start from 1)
 However, we don't change it because of nice theoretical feature of the estimator
o Under some assumptions (normality and randomness), the sampling distribution of the sample ACF
is a Normal distribution with mean 0 and standard deviation 1/√T (T is the number of observations/sample size)
o We know for Normal random variable:
 2/3 of its value are within 1 SD of its mean
 95% of its value are within 2 SD
 Almost all values are within 3 SD

 With these results, we can introduce individual test (below)
 Higher order autocorrelations:
 ρ̂k = [ Σ_{t=k+1..T} (Yt − Ȳ)(Yt−k − Ȳ) ] / [ Σ_{t=1..T} (Yt − Ȳ)² ]
 Numerator: T – k terms
 Denominator: T terms
 Limitations:
o If k > T, impossible to estimate (k: time lag, T: sample size)
o As k increases, accuracy becomes lower
o Rule of thumb: T ≥ 50 and k ≤ T/4 (one quarter of the sample size T)
o Why compute autocorrelation?
 Our interest: If autocorrelations are always zero
 If zero, may not consider time series model like AR or MA (no serial dependence)
 If not, can use past value to predict future value
o 1st test: Individual Test for Autocorrelation:
 Null Hypothesis: autocorrelation (ACF) = 0
 H0: ρk =0 → no autocorrelation, time series is random
 Applied to only an autocorrelation at any lag k (individual)
 Eg. lag-1 autocorrelation: null hypothesis is ρ1 = 0
 Eg. lag-k autocorrelation: null hypothesis is ρk = 0
 Because sample estimator is Normally distributed:
 Reject H0 if | ρ̂k | > 2/√T
 Only values larger than 2 SDs (2/√T) (i.e. beyond the central 95%) indicate significance at the 5% level
o Because 95% of the values should be within 2 SDs
o Reject null hypothesis → can use past values to forecast future values
o Conclude that the time series is NOT random

 Correlogram
 Each correlation is displayed as a bar, to give us an idea of the sample
autocorrelation
 x-axis: time lag k, from 1 to a large number
 Each bar tells the autocorrelation btw lag k and original data
 Dashed lines: 5% significant limits at ± 2/√ T
o T = no. of observations (eg. daily prices of stock)
o Any bar beyond the dashed lines: significant
 Reject the null
o If within the limits: insignificant
 Do NOT reject the null
 Limitation:
 Need to repeat it many times, time consuming
o What if there is more than 1 autocorrelation to test (eg. lag-1 to lag-100)?
o 2nd test: Joint Test for Autocorrelation (Ljung-Box test/Q test):
 Null Hypothesis: first m autocorrelations are jointly 0
 H0: ρ1= ρ2=…=ρm=0

 If even one ρ ≠ 0, then reject H0
 m: no. of autocorrelations you are jointly testing
 Q(m) = T(T+2) Σ_{k=1..m} [ ρ̂k² / (T − k) ] → follows a Chi-square distribution χ²(m)
 because the autocorrelation estimator is a Normal random variable, and the sum of squared
Normals gives you the Chi-square distribution
o degrees of freedom = m (how many autocorrelations you want to test)
o eg. if want to test lag-1 autocorrelation is 0 → m = 1
o eg. if want to test first 3 autocorrelations → m = 3
 remember to sum for each k = 1, 2, 3
 Standardize estimator using sample size – k
o k: takes value from 1 to m
 Reject null hypothesis when:
1. Q-test statistic, Q(m) > critical value
2. OR p-value < α
 If rejected, first m autocorrelations are NOT jointly zero
o Indicates there is autocorrelation in the data
o So can use past values to forecast future values
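
A minimal sketch (simulated AR(1)-like data; statsmodels assumed) of the sample ACF, the individual 2/√T test, and the Ljung-Box joint test:

```python
# Sketch: sample ACF, individual significance test, and Q(6) joint test.
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(3)
T = 200
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.normal()

rho_hat = acf(y, nlags=6)                            # rho_hat[k] is the lag-k sample ACF
significant = np.abs(rho_hat[1:]) > 2 / np.sqrt(T)   # individual tests at the ~5% level
lb = acorr_ljungbox(y, lags=[6])                     # Q(6) statistic and its p-value
print(rho_hat, significant)
print(lb)
```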
 SAS output: (understand output by SAS)
 Output:
o Number of Observations = T (sample size)
o To test adequacy of fitted model, use Autocorrelation check for residuals
o To test original data:
 Autocorrelation check for white noise table:

(table: sample autocorrelations for lag-1 to lag-24, shown in rows of 6)
4 Ljung-Box tests are done to test if the first m ACFs are jointly equal to zero, where m = 6, 12, 18, 24
 Eg. if ask for Q(7) statistic, use the first 7 autocorrelations, plug in
numbers into Q(m) equation
 However, you can directly read the Q(m) test statistic from the
table:
 To Lag: value of m
 Chi-Square: Q(m) test statistic
 DF: degree of freedom = m
 Pr > ChiSq = p-value of Q-statistic (reject when < 0.05)

o 3rd test: Durbin-Watson test (autocorrelation check for lag-1 residuals only)
o Used to test for first order Autocorrelation between 2 sequential errors (lag-1)
o Limitations:
 Cannot detect for higher order autocorrelations
 Possible that lag-1 autocorrelation is insignificant, but how about lag-2, lag-3 etc.
 If fall within inconclusive region (white), we don’t know what to do
 However, DW test usually found in business reports, so we learn it

o Null hypothesis: No lag-1 autocorrelation (e.g. original and lag 1, lag 1 and lag 2)
 H0: ρ1=0
o d = [ Σ_{i=2..n} (ε̃_i − ε̃_{i−1})² ] / [ Σ_{i=1..n} (ε̃_i)² ]
= [ Σ (ε̃_i)² + Σ (ε̃_{i−1})² − 2 Σ ε̃_{i−1}·ε̃_i ] / Σ (ε̃_i)²

 Numerator: Sum of [Error at time i (starts from 2), subtract previous error, and square it]
 Denominator: Sum of [square error for each time i]
o Test statistic d is approximately related to autocorrelation of order 1:
 0 ≤ d ≈ 2(1 − ρ̃) ≤ 4
 When ρ̃ = 0 → d ≈ 2
 When ρ̃ = 1 → d ≈ 0
 When ρ̃ = −1 → d ≈ 4
 Easier to use p-value
 Since tables for critical values are NOT always readily available
 If p-value < α , reject null
 Reject null when it’s in the 2 yellow regions
 There is autocorrelation
 When d close to 0: positive autocorrelation
 When d close to 4: negative autocorrelation
 Do NOT reject null:
 No evidence of autocorrelation
 Dependence between 2 consequent errors are NOT correlated to each other
 White Area:
 Inconclusive region, we don’t know what to do
 dL and dU:
 critical values, can be found in table for DW test
 fixed number, but varies for different no. of:
o Observations
o Variables

o Interpretation:
 If there is Autocorrelation (Errors NOT independent):
 Formula used to compute Standard Error is wrong
o Thus confidence interval & hypothesis test will be wrong
o Errors are supposed to be useless
 Least squares estimator
o Still linear and unbiased (expectation = true parameter)
o BUT it is NOT efficient (less accurate estimator)

Stationarity

o If significant autocorrelation → can use past values to forecast future values


o But in order to get a stable estimation, we require dependence of current value on its past value to be stable

o If NOT stable, relationship is changing over time → prediction is NOT dependable anymore
o We want stability/stationarity, so that relationship does NOT change over time

How to know if Stationary? Should satisfy the 3 conditions:


1. Constant Mean over time
o Refers to population, NOT sample coz samples tend to be different
2. Constant Variance over time
o Spread, possible deviation from mean is constant all the time
3. Constant Autocovariance structure (ACF) over time
o Autocorrelation is the same for the same time lag
 Eg. Lag-1 autocorrelation should always be the same (e.g. autocorrelation between lag1 and
lag2 values being the same as that of lag5 and lag6)
 HOWEVER does NOT mean lag-1 autocorrelation = lag-2 autocorrelation (e.g.
autocorrelation between lag1 and lag2 values does not need to be the same as that of lag1
and lag3)
o Different time lag autocorrelations do NOT have to be same

If strictly stationary series:


o Distribution of values remains the same over time, probability in an interval is the same in past and future
o However, difficult to satisfy in practice, so can ignore

(figure: example of a stationary time series vs a non-stationary time series)

Conclusion:
 The series yt (or the first differences zt) fluctuates with constant variation around a constant mean.
 It is reasonable to conclude that the time series, or its first differences zt, are stationary.

Can test for stationarity using ADF test


 H0: The data is NOT stationary
 Look at the Single Mean row. If the p-value is less than 0.05, reject H0 → the data is likely to be stationary

Time Series Models if Original Data have Autocorrelation and Stationary

Autoregressive Model (AR):


 Autocorrelation: time series depends on its own past values
o Regression of the time series on its own past values → that’s why called autoregressive model
 AR(1):
o Autoregressive model of lag 1
o Yt = δ +ϕ Y t−1 +e t
 Current value of Yt can be predicted using its past value Yt-1

 ϕ slope: AR coefficient (constant) → take from SAS output (e.g. AR1,1)
o if ϕ is zero, then middle term disappears
 Past values CANNOT be used to predict future values
o If ϕ is large, past values strongly influence future values
 δ delta: Intercept (constant) → NOT the mean (don’t take MU directly from the SAS output)
 μ = δ/(1 − ϕ), so δ = μ·(1 − ϕ), where ϕ is the AR coefficient (e.g. AR1,1)
 et: the residuals should have no autocorrelation
o Deviation between actual value and fitted value is due to random shock et
o Assumptions for residuals:
1. Zero mean (residual plot of residuals all around the x-axis)
2. Constant Variance
3. Mutually Uncorrelated (Independent, random)
 Past errors do NOT depend on current error, vice versa
o AR(1) model is stationary only if: (stationarity condition)
 −1< ϕ<1 (note: there is NO equals sign)
 Stationary models have:
 Constant mean: μ = δ / (1 − ϕ)
 Constant variance: γ0 = σe² / (1 − ϕ²)
 Constant ACF: corr(Yt, Yt−k) = ρk = ϕ^k, k = 1, 2, …
o Use this to derive true autocorrelations:
 Eg. lag-3 correlation = ϕ³
o Sample autocorrelation can differ from true autocorrelation (using ϕ k )
 Due to random noise
 However, it is very close, and reflects similar patterns
 When the true autocorrelation is zero, the sample autocorrelation
function can be negative.
 AR(1) model can also be represented by its demeaned (- μ from both sides) series:
o Yt −μ = ϕ (Y t −1−μ)+ e t
Stationarity (−1< ϕ<1) is important so that the data shows mean reversion:
 Yt = δ +ϕ Y t−1 +e t with Y1 and δ (ignore et), Yt can be calculated
 Mean reversion:
o Future values of stationary time series always fluctuate around its mean
o If non-stationary: value of time series becomes explosive → confusing
 NOT able to estimate future values accurately
o To check adequacy of AR(1) model:
 Residuals, et = 0 (NO autocorrelation)
 Check Autocorrelation for residuals (Ljung-Box test to test autocorrelation of residuals)
 Null: there is no autocorrelation of residuals
 Check Pr>ChiSq, if it is greater than 0.05, do not reject the null  no autocorrelation
of residuals  adequate model
 Check Pr>ChiSq, if it is less than 0.05, reject the null  there is autocorrelation of
residuals  inadequate model
 If residuals are autocorrelated:
 Means AR(1) model does NOT successfully capture data’s characteristics
 Indicates “left-over” dependence → model is NOT adequate

 Try higher order AR models (more lagged values)
 AR(p) → higher order AR model
o Depends on p lagged values
 Lag order p refers to the last lag value
o Yt = δ +ϕ 1 Y t −1+ …+ϕ p Y t −p + et
 Same assumptions for residuals, et:
 Zero mean
 Constant Variance
 Mutually Uncorrelated (Independent, random)
 Eg. for AR(2):
 Mean: μ = δ / (1 − ϕ1 − ϕ2)
 ACF: ρ0 = 1, ρ1 = ϕ1 / (1 − ϕ2), ρk = ϕ1·ρk−1 + ϕ2·ρk−2
o Higher order AR models have more complex ACF patterns
 When lag k becomes larger, autocorrelations become smaller → pattern does NOT change
 Sample autocorrelations are good estimators of population autocorrelation
o Partial Autocorrelations for lag order higher than p are zero
 π kk =0 , for all k > p
 Partial Autocorrelations (PACF):
o Amount of correlation between a time series and a lag of itself that is NOT explained by correlations
at all lower order lags
 Also Normally distributed with a standard deviation of 1/√T
 Remove impact of lag-1 autocorrelation
 Eg. lag-1 autocorrelation between wife and husband
 Husband and mother-in-law also lag-1 autocorrelation
 When wife tells husband things, husband may tell it to mother-in-law
 Partial autocorrelation is autocorrelation between wife and mother-in-law
o π kk =Corr (Y t −P ( Y t|Y t +1 , … ,Y t + k−1 ) , Y t +k −P ( Y t +k|Y t+1 , … , Y t +k−1 ))
 P ( W |Z ) is the ‘best linear projection’ of W on Z
 No need to compute this manually, but understand the SAS
 Regression of data based on AR(p)
o SAS output: (figures: Time Series Plot, Autocorrelation plot, and Partial Autocorrelation plot; the blue
portion marks the 2 SD region, and a bar extending beyond it is significant)

 We use the plot (empirical feature) to choose the AR order

o How to choose AR order:
 ACF plot: (in below example)
 We see when k increase, sample ACF declines over time
 But, PACF plot:
 The p partial autocorrelations are significant if they are above dashed lines
 Eg. Cutoff at lag-2 → indicates the data follows AR(2) model

 For larger values of k, (eg. k = 15 in the plot above) might see the bar larger than 2 SDs.
However, remember that for large k, our estimation is NOT that accurate
 Can ignore it
o it appears only because of the random noise
o These 2 SDs is just a rough confidence interval, and it is weakly correlated
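
A minimal sketch (simulated data; the notes use SAS, so this statsmodels version is only an illustration) of fitting an AR(2) model and checking the residual autocorrelations:

```python
# Sketch: fit AR(2) = ARMA(2, 0) and run a Ljung-Box test on the residuals.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(4)
T = 300
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

res = ARIMA(y, order=(2, 0, 0)).fit()
print(res.params)                                # constant, ar.L1, ar.L2, sigma2
print(acorr_ljungbox(res.resid, lags=[6]))       # adequate model: large p-value (no residual autocorrelation)
```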

Moving Average Model (MA):
 Motivated by impact of forecast error
o Eg. we find that future exchange rates do not depend on past exchange rates, instead they depend
on previous forecast error (discrepancy of actual from forecast in the past)
 MA(1):
o Yt = μ+e t +θ 1 e t−1
o Current value Yt, depends on 1 previous forecast error (lag-1 error) et-1 and current error et
 et: current forecast error
 et-1: previous forecast error
 μ: the intercept (for an MA model, intercept = μ); note that SAS reports MA coefficients with the opposite sign
 If slope θ = 0: current value is random (only depends on current random noise)
 If slope θ is large (-ve or +ve)
 it indicates that the previous forecast error has strong influence on current value
 Error:
 Current or Past, we require it to have:
o Zero mean
o Constant variance
o NO autocorrelation
o Difference between AR and MA:
 AR includes lagged terms of time series itself
 MA includes lagged terms of the noise/residuals of the time series
o Link between AR and MA:
 MA model can be reformulated as AR(∞ ) model
 Advantage of MA model:
 Instead of using a higher order of AR model to forecast eg. foreign exchange rate,
can just use a compact and simple model MA(1) model to explain the same thing,
because MA(1) is identical to higher order AR model
o MA(1) only has one unknown parameter, higher order AR has many
o Accuracy also improves when you use a simple model
o Property of MA(1) model:
 Has only one non-zero (significant) autocorrelation at k = 1 (NOT partial autocorrelation)
 It is a lag-1 autocorrelation (k = 1)
 The rest are zero (insignificant)
 Higher order MA(q) model:
o Yt = μ+e t +θ 1 e t−1 + … + θq e t −q
 Same error assumptions:
 Zero mean
 Constant variance
 NO autocorrelation
o MA models are stationary
 E(Yt) = μ → take the -ve μ from SAS output
 Var(Yt) = (1 + θ1² + … + θq²)·σ²
 Corr(Yt, Yt−k) = 0 if k > q
 Autocorrelations are always 0 if time lag > q
o To determine how many MA terms are needed:
 If sample ACF is significant at lag q, and NOT significant at higher lags
 Correlogram cuts off at lag q
 PACF declines over time

 Then, should choose the MA(q) model → order q

Summary:*****
 PACF cutoff at p, ACF exponentially declines over time → AR(p) model
 ACF cutoff at q, PACF exponentially declines over time → MA(q) model
 BOTH PACF and ACF exponentially decline over time → ARMA, but have to fit p & q by trial and error
 BOTH PACF and ACF, all 0 → white noise
Note: if ACF lag 1 & lag 3 are significant, but lag 2 is insignificant → use MA(3) if lag 3 is strongly significant, otherwise use
MA(1)
 In SAS, can just type “1, 3” → to show that lag 2 is insignificant
Note: if residual autocorrelations are strongly significant, then the model is NOT adequate → consider higher
orders
 If weakly autocorrelated → can ignore the autocorrelations
Note: if data is stationary, it does NOT mean data is autocorrelated, vice versa. 2 separate concepts.
 Eg. random errors (with zero mean, constant variance) are stationary, but autocorrelations = 0
 Eg. non-stationary data may have serial dependence (though that dependence is changing over time)

ARMA Model (Autoregressive Moving Average; combination of AR and MA):


 Extension of AR model
 Future values depend on both the p historical values (AR part) and q past forecast errors (MA part)
 ARMA(1,1)
o Depends on the lag-1 value (AR), and the lag-1 past error (MA)
o Yt = δ + ϕ·Yt−1 + et + θ·et−1
o Note: ARMA, AR and MA have a triangle relation: for any one model, the other 2 can be used to represent it
o Why do we choose ARMA?
 Eg. ARMA(1,1) vs AR(10) or MA(9)
 They give similar results, so choose ARMA(1,1), since it has fewer unknown parameters
 δ = μ·(1 − Σ ϕi)
o Properties of ARMA:
 PACF and ACF exponentially decreases over time
 However, does NOT tell you how to choose p and q
 Fitting ARMA is guess work and trial and error
 Consider using information criteria (below)
 ARMA(p, q) (higher order)
o Yt = δ +ϕ 1 Y t −1+ …+ϕ p Y t −p + et + θ et −1+ …+θq e t −q
 MA part learns from error made over time and tries to improve future forecast accuracy
o AR(p) = ARMA(p,0)
o MA(q) = ARMA(0,q)

Information Criteria (IC) → to choose p and q for ARMA

AIC (Akaik Information Criterion)


 AIC = ln(MSE) + 2k/T
o k: no. of parameters
o T: no. of observations
BIC (Bayesian Information Criterion) → also called SBC in SAS
 BIC = ln(MSE) + k × ln(T)/T

The 2 IC balances:

 Forecast accuracy
 Model complexity
o with more predictors, the in-sample forecast error automatically becomes smaller → by itself this doesn’t make sense as a selection rule
o penalizes unnecessarily complicated models
How to use IC?
 The smaller the IC, the better the fitted model:
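A small sketch of using IC to pick p and q (Python with statsmodels; illustrative only — note that statsmodels reports likelihood-based AIC/BIC, i.e. −2 lnL plus a penalty, rather than the ln(MSE) form above, but “smaller is better” applies the same way):

import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = arma_generate_sample(ar=[1, -0.5], ma=[1, 0.4], nsample=500,
                         distrvs=rng.standard_normal)   # true model: ARMA(1,1)

results = []
for p in range(3):
    for q in range(3):
        fit = ARIMA(y, order=(p, 0, q)).fit()
        results.append((p, q, fit.aic, fit.bic))

best = min(results, key=lambda r: r[3])   # smallest BIC
print("best (p, q) by BIC:", best[:2])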

Box-Jenkins Methodology

1. Model Identification
 Use sample ACF and sample PACF to choose model
 If both do NOT have cutoff → consider ARMA model
 Use AIC and BIC to choose p and q for ARMA
 SAS:
 Plots and results > check “Actual values plot”
 To get ACF and PACF
2. Model Estimation
 Estimate unknown parameter → SAS will do it
 SAS:
i. Enable estimation steps > check “Perform estimation steps”
ii. Model definition
 Add p for AR model > click “Add” (note: to consider AR(3): type “1,2,3” > Add, NOT just “3”, or it won’t consider lag-1 and lag-2)
 Add q for MA model > click “Add”
3. Model Validation
 Check adequacy of model with focus on the residuals
 If autocorrelation in residual → indicates selected model is NOT adequate in explaining serial
dependence → should consider higher order model
 How to check existence of autocorrelation? 3 tests:
1. Individual test
 If |ρ̂k| > 2/√T → reject the null that there is no autocorrelation
2. Joint test (Q-test or Ljung-Box test)
3. DW test
 Limitation of DW test → only for lag-1 autocorrelation (not higher order)
 Check statistical significance of the coefficients by checking their p-values. p-value < 0.05 → significant
coefficient (see the residual-check sketch after step 4)

4. Model Forecasting (take note of how many times the data was differenced)


 Generate forecast and 95% confidence limits of the forecasts
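A sketch of the residual autocorrelation check from the validation step above (Python with statsmodels; the simulated AR(2) series is made up so that an under-fitted AR(1) fails the Ljung-Box test; recent statsmodels returns the test result as a DataFrame):

import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(2)
y = arma_generate_sample(ar=[1, -0.6, -0.3], ma=[1], nsample=600,
                         distrvs=rng.standard_normal)   # true model: AR(2)

for p in (1, 2):                                        # under-fitted vs. adequate order
    res = ARIMA(y, order=(p, 0, 0)).fit()
    # model_df = number of fitted ARMA parameters, which reduces the test's DF
    lb = acorr_ljungbox(res.resid, lags=[12], model_df=p)
    print(f"AR({p}): Ljung-Box p-value at lag 12 = {float(lb['lb_pvalue'].iloc[0]):.4f}")
# a small p-value (< 0.05) means the residuals are still autocorrelated -> model not adequate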

SAS Output (Conditional Least Squares Estimation):
 Parameter table columns: Parameter, Estimate, Standard Error, t Value, Approx Pr > |t|, Lag
o MU: estimator for the mean (NOT the intercept) → eg. here MU = 825.850351, standard error = 25.15334729, t Value = 32.83, p-value ≈ 0.0000
o AR1,1: AR coefficient
o MA1,1: MA coefficient → this is −θ, so always take the opposite sign of the printed estimate!!
o t Value = estimate / standard error
o Approx Pr > |t| is the p-value of the test of whether the coefficient is significant
 Intercept (Constant Estimate):
o AR model: intercept = MU × (1 − sum of AR coefficients)
o MA model: intercept = MU
o ARMA model: intercept = MU × (1 − sum of AR coefficients)
 Other output (eg. here): Constant Estimate 3.51E-07, Variance Estimate 632.6909, Std Error Estimate 25.15335, AIC 54753.93, SBC 54767.29, Number of Residuals 5895
o Information criteria: choose the model with small AIC and SBC
 To see if fitted model is good: look at the Autocorrelation Check of Residuals (Q-test / Ljung-Box test)
o Table columns: To Lag, Chi-Square, DF, Pr > ChiSq, Autocorrelations (lag-1 up to lag-24)
o Chi-Square: test statistic of the joint test; null hypothesis: no autocorrelation
o Pr > ChiSq (p-value) smaller than 5% → reject the null that the autocorrelations are jointly 0 → the residuals are autocorrelated and the model is not adequate
o DF (degrees of freedom) is reduced by the number of estimated parameters: eg. the residuals of an AR(1) depend on 1 parameter (the AR(1) coefficient), so DF is reduced by 1; for ARMA(3,1) there are 4 unknown parameters (AR(1), AR(2), AR(3), MA(1)), so DF is reduced by 4
 95% confidence interval: point forecast ± 1.96 × standard error
 Standard error of the forecast (obs=61) = 1.025
 Upper 95% confidence limit for the next period (obs=61) = 98.152 + 1.96 × 1.025 = 100.161
 Standard error of future forecast (obs=62) based on current forecast (obs=61):
o The standard error increases to 1.052 because the forecast is based on obs=61, which is itself a
forecast. The standard error increases due to the accumulation of forecast error.
 Parsimony of model
Higher AR models are complex even though they may be more adequate. Always use the simplest possible model
based on PACF (tells AR) and ACF (tells MA).

LECTURE 10: UNIT ROOT AND PAIRS TRADING

Stationarity Condition:
 AR Coefficient: −1< ϕ<1 (note: there is NO equals sign)
 There are 3 parts:
o Constant Mean
o Constant Variance
o Constant ACF structure
 If |ϕ|< 1 → Stationary data
 If ϕ=1 → Random Walk (non-stationary data)
 If |ϕ| > 1 → Explosive data
 AR, MA, ARMA can ONLY be applied to stationary data
Non-Stationary Models

Random Walk Model:


 Future values of data is just a random step away from the previous value
o Random movement is controlled by the stochastic error
 Yt+1 = Yt + ε t+1
o Special case of AR(1) with AR coefficient ϕ = 1 and δ = 0
 For AR model, AR coefficient should be < 1 if the data is stationary
 Therefore, random walk is NOT Stationary since ϕ = 1 (violates stationarity condition)
 For ϕ = 1: we can say it has Unit Root
 Therefore, random walk model also called Unit Root Process
 Which part of the stationarity conditions (assumptions) is not stationary? (variance)
 Mean: Random Walk model does have a constant mean
o E(Yt) = E(Y0)
 Variance is time dependent (NOT constant) ← violates stationarity condition
o var(Yt) = σ²t
 Forecast based on random walk value will be based on the last available value
o Random walk is identical to naïve forecast method
o Random Walk has a long memory: (it is a DS, see next section)
 No matter how far in future we look, our best prediction is always the initial value
 It memorizes what has happened in the past
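A quick simulation sketch (Python/NumPy; the parameter values are arbitrary) showing that the variance of a random walk grows like σ²t, which is why it violates the constant-variance condition:

import numpy as np

rng = np.random.default_rng(3)
sigma, T, n_paths = 1.0, 500, 2000

e = rng.normal(0.0, sigma, size=(n_paths, T))
Y = np.cumsum(e, axis=1)        # many simulated random-walk paths, starting from Y0 = 0

for t in (10, 100, 500):
    print(f"var(Y_{t}) across paths ~ {Y[:, t - 1].var():.1f}  (theory: sigma^2 * t = {t})")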
Random Walk with Drift
 Yt+1 = δ + Yt + ε t+1, with δ ≠ 0
o It is AR(1) model with coefficient ϕ = 1
o δ : drift; acts as intercept
 Intercept shows expectation at time = 0
 Non-stationary:
o Still violates stationarity condition because ϕ = 1
o Where does the non-stationarity come from? (BOTH mean and variance)
 Mean is time dependent:
 E(Yt) = δt + E(Y0)
 Variance is time dependent:
 var(Yt) = σ²t

Non-Stationarity

Non-stationarity causes problems:
 Time Series Analysis extrapolates historical patterns into future via statistical methods
o If non-stationary, history doesn’t necessarily repeat itself, methods may fail
If we see trend in data, it indicates that the data is NOT stationary, because expectation of mean level is changing
over time (another way to check for stationarity besides checking the value of -1 < φ < 1)
2 kinds of non-stationarity are important:
1. Trend-stationary (TS): Yt = ϑ + βt +ε t
o Deterministic trend
 ϑ + βt
 Intercept + slope x time → time dependent
 If we can use formula to describe the trend, it is deterministic
 Future value of trend is fixed, NOT Random → intercept and slope do not change
o However, it is non-stationary:
 Mean is time dependent
o HOWEVER, Demeaned series becomes stationary
 Remove the mean from the data:
 Data – expectation: Yt − (ϑ + βt) = (ϑ + βt + εt) − (ϑ + βt)
 The deterministic part cancels
 What’s left is the noise εt (zero-mean error)
o Error is stationary because:
 Constant mean
 Constant variance
 Constant ACF
2. Difference-stationary (DS):
o A process with a stochastic trend or a unit root (ϕ = 1)
 Eg. Random Walk (e.g. finance series: stock prices, exchange rates), Random Walk with Drift
o Non-stationary data
o Even if you compute demeaned series, it is still non-stationary
 Because what’s left is the SUM of noise
 How to obtain stationary data from DS data?
 Differencing the data

Important to differentiate TS and DS:


 Different reactions to Shock:
o TS has temporary reaction to shock, data goes back to normal trend quickly
o DS has permanent reaction to shock
 DS has a long memory
 HOWEVER, the shape is similar

To make Non-Stationary become Stationary:


 TS: Demean the series: Data – expectation → stationary
 DS:
o Difference the data
o Eg. Lag-1 Difference
 ∆Yt = Yt – Yt-1 → stationary
 current value – previous value
o If has to be differenced d times to make it stationary:
 It is called “integrated process of order d” → Yt ~ I(d)
 Eg. I(0):

 Original data is stationary (no need to do transformation)
 Eg. I(1):
 Original NOT stationary
 First order difference (change of price) is stationary
 Eg. I(2):
 Original NOT stationary
 First order difference NOT stationary
 Second order difference (change of change) is stationary
o After making data stationary, select eg. ARMA model:
 NOT for original data, but for differenced data
 We should NOT use ARMA (etc.) model for non-stationary
 After you difference and forecast, you can compute it back to get forecast for original data
 Just use forecast from fitted model + lagged value
e.g. first order differenced model. Given current price = 253. Forecast rate of return: 2.3%
Forecast price = 253* 1.023 = 259
Given Forecast interval: [1%, 4%]. Forecast price interval= [253 *1.01, 253 * 1.04] = [256, 263]
 If second order differencing used
o Model forecasts the change of rate of return
o Given the current rate of return → calculate the forecast rate of return → calculate the price
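A tiny numeric sketch of this back-transformation, reusing the numbers from the worked example above (Python; purely arithmetic):

# Convert a forecast made on the differenced (return) series back to the price scale.
current_price = 253.0
forecast_return = 0.023            # model fitted to the return series forecasts 2.3%
lo, hi = 0.01, 0.04                # forecast interval for the return: [1%, 4%]

forecast_price = current_price * (1 + forecast_return)
price_interval = (current_price * (1 + lo), current_price * (1 + hi))
print(round(forecast_price), tuple(round(x) for x in price_interval))   # 259 (256, 263)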
Unit-root tests to test for stationarity
Consider α as the analogue of φ: if −1 < α < 1 the data is stationary, and if α = 1 the data is non-stationary
Define: ρ=α −1
First question: is my data stationary or not?
 Test estimator α :
o Null Hypothesis, H0 in t-test: α = 1 (data is NOT stationary)
o t-statistic = (estimator − target) / (SE of estimator) = (α̂ − 1) / SE(α̂)
 Results:
o if |α| < 1, for sure my data is stationary
o if α = 1 ( ρ=0), data follows random walk → data is NOT stationary
 HOWEVER, if data is non-stationary, test statistic is NOT t-distributed anymore
 Instead, it is Dickey-Fuller distributed
 Use new critical value (shifts left) from DF distribution, NOT t-distribution
 Called a Dickey-Fuller test, instead of a t-test
(Figure: distribution of the test statistic)
 If H0 is stationary → red curve (t-distribution)
 If non-stationary → blue curve (Dickey-Fuller distribution)
o Shifted left
o Corresponding rejection region will change, influencing your decision
o Called a Dickey-Fuller test
 Random Walk case → “Zero Mean”

Random walk with a drift**
 Called “Single Mean”
 Our focus for this module
Drift + Deterministic Trend
 Called “Trend”

Limitation of Dickey Fuller Test:
 It always depends on the lag-1 value only
o It uses lag-1, whether stationary or non-stationary, to estimate alpha
o Problem of DF test: the DF distribution only holds if the errors are independent and identically
distributed. If the errors are autocorrelated → estimation is not accurate
o Possible that the data depends on more than one lagged value
o Will have autocorrelation in residuals → estimation is not accurate

Augmented Dickey Fuller (ADF) Test:


 Null hypothesis for ADF:
o The data is NOT stationary → α = 1 or ρ=0
 ADF difference from Dickey Fuller test:
o Not only one lagged value, but have even higher lagged values
o ∴ Estimator for ρ is accurate
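A minimal sketch of the ADF test outside SAS (Python with statsmodels; the simulated price series is made up, and regression='c' corresponds to the “Single Mean” case that is the focus here):

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
price = np.cumsum(rng.standard_normal(600)) + 100         # random walk -> unit root

for name, series in [("level", price), ("first difference", np.diff(price))]:
    stat, pvalue, *_ = adfuller(series, regression="c")   # H0: unit root (not stationary)
    print(f"{name:16s} ADF stat = {stat:6.2f}, p-value = {pvalue:.3f}")
# level: p > 0.05 -> cannot reject the unit root; first difference: p < 0.05 -> stationary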
SAS Output:
Augmented Dickey-Fuller Unit Root Tests
Type Lags Rho Pr < Rho Tau Pr < Tau F Pr > F
Zero Mean 0 0.0089 0.6829 0.20 0.7432
1 0.0043 0.6818 0.08 0.7071
Single Mean 0 -17.9335 0.0151 -3.16 0.0257 5.01 0.0411
1 -30.1144 0.0009 -3.84 0.0036 7.37 0.0010
2 -15.4853 0.0295 -2.61 0.0953 3.40 0.2141
Trend 0 -25.2457 0.0162 -3.72 0.0257 6.94 0.0383
1 -48.8663 0.0003 -4.81 0.0009 11.57 0.0010
2 -29.1596 0.0060 -3.51 0.0442 6.16 0.0662

 Only depends on data, NOTHING to do with selected model


o Whether data is stationary or not
 Zero mean: Random Walk
 Trend: Random Walk with both Drift and Deterministic Trend
 Single Mean: Random Walk with Drift (our focus)
o Pr < Rho
 If < 0.05 → reject the null → the data is stationary
 If > 0.05 → do NOT reject the null
 If the ADF test results suggest that my data is not stationary:
 What should we do? Consider its first order difference series, then run the ADF test again
o Choose data
o Excel SAS > Tick “Difference the response series”
 SAS will base its analysis on first order (or order specified), NOT
original data
o Eg. Autocorrelations will be for the differenced data
o If < 0.05 → reject the null → current data is stationary
o Accept the current data → the first order differenced data
o Eg. since stationary after d = 1, then use ARIMA(0,1,0)

Then look at the time plot of the differenced data and sample ACF and PACF to select the appropriate model

 Time plot:
o Reflects differenced data
o See if mean reversion feature → value fluctuate around the mean
 ACF and PACF → select appropriate ARMA model

o Reflects differenced data
o If no clear clue which model to use, use trial and error
o Eg. use MA(1)
 Stage 3: Forecasting > Enable forecasting steps > tick “Perform forecasting steps”
 HOWEVER, remember model is for the differenced data
 Forecasts for variable Close:
o Forecasts are for original data (NOT differenced/stationary data)
 Because SAS knows that you considered the differenced
o Gives you the forecast, as well as the 95% confidence interval
 Manually: interval forecast = point forecast ± 1.960 x SE

ARIMA(p,d,q) model (Auto-Regressive Integrated Moving Average)

If non-stationary data differenced to become stationary → use ARIMA

 “I” stands for integration → how many times differencing I should conduct to get a stationary model
 Check d with ADF test, DON’T over difference
o Eg. if d = 1, p = 1, q = 0 → ARIMA(1,1,0) model
o Eg. if data only becomes stationary at d =2, then → ARIMA(1,2,0) model
 p and q find using the stationary (differenced/integrated) data
SAS:
 Choose data
 Excel SAS > Tick “Difference the response series”
o SAS will base its analysis on first order (or order specified), NOT original data
 Eg. Autocorrelations will be for the differenced data
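A minimal sketch of fitting an ARIMA with d = 1 (Python with statsmodels; the simulated price series and the order (1,1,0) are just examples). Like the SAS behaviour described above, the forecasts come back on the original, undifferenced scale:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
close = 100 + np.cumsum(0.1 + rng.standard_normal(400))   # I(1)-like price series with drift

res = ARIMA(close, order=(1, 1, 0)).fit()                  # ARIMA(1,1,0): differences internally

fc = res.get_forecast(steps=5)
print("point forecasts:", np.round(fc.predicted_mean, 2))  # on the original price scale
print("95% intervals  :", np.round(fc.conf_int(alpha=0.05), 2))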

Limitations of ARIMA:
 ARIMA identification is difficult and time consuming
o However, visual analysis is subjective
 Model may NOT have intuitive interpretation
o Difficult to explain why sales depends on lag 3 sales, why not yesterday’s sales
 Identification and estimation can be badly distorted by outlier effects
 Models that perform similarly on the historical data may yield quite different forecasts
 Box-Jenkins approach does NOT tell us if model is too big/unnecessarily big
o Only tells us if the model is big enough, and whether it is too small
o Eg. if use AR(10) for an AR(1) → still tells us that model is adequate
 Problem: more than necessary unknown parameters
 Reduces accuracy

Summary of Unit Root Test


1. Examine your data.
o Does it appear stationary?
o Is it trended?
 if deterministic trend (TS) or stochastic trend (DS) → suggests data is NOT stationary
2. If it may be non‐stationary, apply ADF test
o Include time trend if trended
3. If test rejects hypothesis (null: α =1) of a unit root (i.e. p-value < 0.05)
o The evidence implies that the series is stationary.
4. If the test fails to reject
o The evidence is not conclusive.

o Many users then treat the series as if it has a unit root.
 Difference the data, forecast changes or growth rates.
 we stop when we find stationary data
Summary: Forecasting with ARIMA models
1. Run the ADF test on the data. If it is non-stationary, difference the data, increase d, and redo the ADF test. Stop at the d where the data becomes stationary (eg. d = 1 → data becomes stationary → the “I” in ARIMA = 1).
2. For the differenced data, look at the PACF and ACF to select the ARMA(p, q) part (eg. d = 1, p = 2, q = 0 → use ARIMA(2,1,0)).
3. Validate the model: residual plot, goodness of fit, forecast accuracy.

Real Life Applications (optional)

Dating Financial Bubbles:


 If data is stationary → no bubble, because everything is stationary
 When data becomes non-stationary → possibility of a bubble
 Putting bubbles sequentially over time,
o May be able to tell how bubbles shift from one market to another market

Spurious Regression:
 Take 2 independent random walk processes (non-stationary)
 If we regress one on the other, we should expect no dependence
o HOWEVER, we may still see apparently significant correlation (dependence) → a spurious dependence (won’t be tested)
o “Spurious” because the 2 series are actually independent of each other
o (Contrast: when 2 non-stationary processes genuinely share a common non-stationary root, they are “co-integrated”)

Pairs Trading:
 To make profits using stationarity
 Nobody knows the true value of security, how to know if overvalued or undervalued?
 Don’t consider absolute value → look at relative value instead
1. Pick out 2 financial instruments that are similar to each other
 Eg. same product, similar management board
 We expect them to have similar price
2. We see if there is a deviation of price of one from the other
 If there is a large deviation, one is overvalued and the other undervalued
 Can combine the 2 by regressing one on the other

 We get the residuals (error = actual – forecast)
o Error is stationary
o We also observe that the data is mean-reverting
 Eg. DJ: dependent variable, FTSE: predictor variable
o Portfolio: et = DJt – β x FTSEt
3. We don’t need the true price, we just use the relative idea to see if one is over or undervalued
 Then we make our decision
 Trading strategy: when we see large deviation from the mean:
 Below the mean: we should buy the portfolio
o buy 1 unit of DJ, short-sell β units of FTSE
o Clear the position at the mean
 Above the mean: we should short-sell the portfolio
o Short-sell 1 unit of DJ, buy β units of FTSE
o Clear the position at the mean
 We are only sure that it has mean-reversion, but NOT if it will continue to go above the
mean or below the mean → so we clear position at the mean
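A rough sketch of the pairs-trading spread idea above (Python with statsmodels; the two simulated series, the names DJ/FTSE and the ±2 standard deviation entry threshold are illustrative assumptions, not part of the notes):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
common = np.cumsum(rng.standard_normal(1000))             # shared non-stationary factor
ftse = 100 + common + rng.normal(0, 1, 1000)
dj = 50 + 0.8 * common + rng.normal(0, 1, 1000)           # co-moving pair

ols = sm.OLS(dj, sm.add_constant(ftse)).fit()             # regress one series on the other
beta = float(ols.params[1])
spread = dj - beta * ftse                                 # the "portfolio" e_t = DJ_t - beta * FTSE_t

z = (spread - spread.mean()) / spread.std()               # deviation from the mean of the spread
signal = np.where(z > 2, "short 1 DJ, buy beta FTSE",
         np.where(z < -2, "buy 1 DJ, short beta FTSE", "wait / clear at the mean"))
print(round(beta, 3), signal[-1])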

Model Diagnostics
Moving Average Model – MA(1)
What can go wrong?
 The data may contain outliers.
 The time series may be non-stationary.
 The errors may be autocorrelated.
 The errors may show changing variances over time.
 The mean of the errors may be non-zero.
 The errors may not be normally distributed.
Notation for MA(1):
 at: random shock at time t; at-1: random shock at time t-1
o Assumptions: normal distribution; independent of t; independent of other at terms
 θ1: unknown parameter; estimate from data
 δ = μ: constant term (if applicable)
Choose model based on goodness of fit, forecast accuracy and residual analysis.
Also consider interpretability and parsimony

Cluster Analysis

Dividing objects into separate sub-sets where


 High intra-class similarity - objects in the same group are similar wrt a given set of characteristics
 Low inter-class similarity - objects belonging to different groups are dissimilar wrt the same set of
characteristics
 Cannot deal with missing values (i.e. they must be removed)
Distance measures (smaller the distance, more similar) (all three satisfy the triangular inequality)
 Euclidean Distance (ED)
o For ED of 2 observations defined by the variables (U1, U2) and (V1, V2)
 L = √( (U1 − V1)² + (U2 − V2)² )
o For ED of observations defined by more than 2 variables (X1, X2, … X7) and (Y1, Y2, … Y7)
 L = √( (X1 − Y1)² + (X2 − Y2)² + … + (X7 − Y7)² )

 Manhattan Distance (MD)


o For MD of observations defined by more than 2 variables (X1, X2, … X7) and (Y1, Y2, … Y7)
 L=| X 1−Y 1|+| X 2−Y 2|+…+|X 7 −Y 7|
 Mahalanobis Distance (Dm)
o For Dm of observations defined by more than 2 variables (X1, X2, … X7) and (Y1, Y2, … Y7)
 Divide the original variables by their respective standard deviations (Zi = Xi/σi, Si = Yi/σi), then:
 Dm = √( (Z1 − S1)² + (Z2 − S2)² + … + (Z7 − S7)² )
 Equivalently: Dm = √( (X1 − Y1)²/σ1² + (X2 − Y2)²/σ2² + … + (X7 − Y7)²/σ7² ) (σ² is the variance)
 Dm does not depend on scales of measurement used as they are standardised
 Dm will be influenced when new objects are included as the SD needs to be recalculated
 ED and MD are not affected significantly when including new objects (even if they are outliers)
 ED and MD are scale dependent
 MD, as compared to ED, can reduce the effect of outliers as differences are not squared
 Using different distance measures can lead to very different cluster results.
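A small sketch computing the three distances for one pair of observations (Python/NumPy; the numbers and standard deviations are made up, and the Mahalanobis version is the simplified standardised form used above, i.e. a diagonal covariance):

import numpy as np

X = np.array([3.0, 200.0, 1.5])            # observation 1 (variables on mixed scales)
Y = np.array([4.0, 150.0, 2.5])            # observation 2
sd = np.array([1.0, 50.0, 0.5])            # per-variable standard deviations from the data

euclidean = np.sqrt(np.sum((X - Y) ** 2))
manhattan = np.sum(np.abs(X - Y))
mahalanobis = np.sqrt(np.sum(((X - Y) / sd) ** 2))   # scale-free after standardising

print(round(euclidean, 2), round(manhattan, 2), round(mahalanobis, 2))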
Two types of clustering
 Hierarchical algorithms (using dendrogram)
o No need to specify the number of clusters to begin with (just do it using agglomerative method)
o Interpretation of results (E.g. no. of clusters to use) is very subjective
o Distances cannot be calculated for categorical variables (they should not be taken into account)

o Look at the scatterplot (only possible for observations determined by two variables) of the objects to
determine the number of clusters needed (after the SAS cluster analysis)
o Scatterplots and dendrogram can also help identify possible outliers

o Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to
divide the cluster into two. Choose the best division and recursively operate on both sides. i.e.
Weeding out dissimilar observations
o Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge
into a new cluster. Repeat until all clusters are fused together. i.e. joining together similar
observations
 SAS
 Using SAS, choose Euclidean distance and Ward’s minimum variance method
 Look at the resulting dendrogram to determine the number of cluster to use
 Rerun the analysis, adding the number of cluster in the results.
 The new analysis sorts the observations into the number of clusters specified
 Sort the original observations into the clusters assigned
 Calculate the cluster mean for each variable for each cluster (e.g. Price mean for
cluster 1’s price)
 Plot bar chart to compare the cluster means and for interpretation
 Manual
 Need to standardise the data when we do cluster analysis as different scales may be
used for variables
 Find distance matrix by determining the Euclidean distance for each two observations
 Merge the two observations with the smallest distance into a cluster
 Consider the newly formed cluster as one observation; update the distance matrix by
finding the Ward’s linkage between every observation/cluster and the new cluster
 Ward’s linkage: minimises the variance of the merged clusters
o Robust to outliers and noises

P and Q are clusters with np and nq observations in each


 Single linkage (nearest neighbour): the distance btw two clusters are the
shortest distance from any point in one cluster to any point in the other cluster
 Complete linkage (furthest neighbour): maximum distance btw two clusters
 Merge the observations/cluster with the lowest distance, repeat until all
observations/clusters are fused together, the resulting dendrogram is hierarchical
clustering
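A minimal sketch of the agglomerative procedure with Ward's method (Python with scipy instead of SAS; the two-blob data and the choice of 2 clusters are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])   # two obvious groups
Xz = (X - X.mean(axis=0)) / X.std(axis=0)        # standardise (variables on different scales)

Z = linkage(Xz, method="ward")                   # build the merge tree (dendrogram structure)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters

for k in (1, 2):
    print("cluster", k, "mean per variable:", np.round(X[labels == k].mean(axis=0), 2))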
 Partitional algorithms
o Construct various partitions and then evaluate them by the same criterion
o Non-hierarchical, each object is placed in exactly one of K nonoverlapping clusters
o Relatively efficient, converges fast (compared to hierarchical clustering)
o Cannot deal with categorical data
o The desired number of clusters K needs to be determined (sometimes may be difficult)
o Unable to handle noisy data and outliers (Results may be easily influenced)
o May terminate at a local optimum. The global optimum may be found using techniques such as
deterministic annealing and genetic algorithms

o K means algorithm
 SAS
 Manual
 Define a value for K, say 3. (not arbitrarily defined)
 Arbitrarily specify K number of cluster centres/means
 Compute the distance of each object to the cluster centres, classify object to the
nearest cluster
 When all objects have been assigned, recalculate the positions of the centres by
taking average of the observations in the respective 3 clusters
 Compute the distance of each observations to the 3 new cluster centres, classify the
observations to the nearest cluster
 Find new cluster centre again and reclassify the observations until no observations
changes their positions (the algorithm converges)
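A minimal sketch of this K-means loop (Python/NumPy; the three-group toy data and K = 3 are illustrative, and the sketch assumes no cluster ever becomes empty):

import numpy as np

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])   # 3 obvious groups

K = 3
centres = X[rng.choice(len(X), K, replace=False)]   # arbitrary initial cluster centres
labels = None

while True:
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)                # assign each observation to nearest centre
    if labels is not None and np.array_equal(new_labels, labels):
        break                                        # converged: no observation changed cluster
    labels = new_labels
    centres = np.vstack([X[labels == k].mean(axis=0) for k in range(K)])  # recompute centres

print(np.round(centres, 2))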
o Selecting K
 Scatterplot of observations to see how many clusters there are

(Example: the scatterplot shows 5 groups → K = 5)

 But observations with more than 2 variables cannot be displayed directly in a 2D scatterplot


 Need to find a new set of variables to represent some original ones, so that we can
display the observations in a 2D scatterplot using Principal Component Analysis
 Principal Component Analysis (PCA) (Dimension reduction method)
 To discover new set of variables to represent the original possibly correlated data
 The new variables are called Principal Components (PC)
o PCs are linear combinations of the original ones
o PCs are uncorrelated with one another
o PCs capture as much of the original variance in the data as possible
o Makes use of eigenvector ϒ and eigenvalues λ to come up with PCs

o PCs can be ordered according to the magnitude of their variances, which are
the associated eigenvalues
o PCs are affected by outliers (thus run PCA to obtain PCs, scatterplot PC1
against PC2 to check for outliers, if there are, remove the outliers and rerun
PCA to get better PCs and do scatterplot to find K (the number of clusters))

(Example: the PC1 vs PC2 scatterplot suggests K = 2)
o Cumulative column gives the percentage of variation of the data explained
by the PCs
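A small sketch of the PCA step (Python with scikit-learn; the random correlated data is a stand-in for the real variables):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 7)) @ rng.normal(size=(7, 7))   # 100 observations, 7 correlated variables

Xz = StandardScaler().fit_transform(X)       # PCs are scale-sensitive -> standardise first
pca = PCA(n_components=2)
scores = pca.fit_transform(Xz)               # PC1 and PC2 for every observation

print("variance explained:", np.round(pca.explained_variance_ratio_, 2),
      "cumulative:", round(float(pca.explained_variance_ratio_.sum()), 2))
# scatterplot scores[:, 0] vs scores[:, 1] to eyeball K and to spot outliers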

o Justifying selection of K using J function (objective function)


 Measures the distance of all the data observations from their respective cluster centres

 Xi(j): the ith observation that is classified into the jth group
 Cj: the jth cluster centre
 When K=1, J is the sum of distances from C1, the only mean, to all the data observations
 Plot objective function values for K= 1, 2, 3…6
 Select K that corresponds to the abrupt change in the plot (“knee finding”/ “elbow finding”)
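A small sketch of the elbow idea (Python with scikit-learn; KMeans' inertia_ attribute is the sum of squared distances of observations to their cluster centres, which plays the role of the J objective described above — the data is simulated with 2 true clusters):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])   # 2 true clusters

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}: J = {km.inertia_:.1f}")
# look for the abrupt drop (the elbow) -- here it should appear at K = 2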

(Example: the elbow in the plot occurs at K = 2)
