Professional Documents
Culture Documents
(2022 Full) ASM SRM (Ocr)
(2022 Full) ASM SRM (Ocr)
Actuarial University
Component Your Path to SUCCESS 40mb GOAL
‘1111/ Videos & MorE
Practice_ Quiz_ Test, PASS
of Your Study 0 0
Study Manual Virtual eFlasheards
410nual Program
aistm
Exam SRM Study Manual
a/s/M
Actuarial Study Materials
ISBN: 978-1-64756-515-2
your exam dashboard, and . The (Type II) Pareto distribution with parameters z8> 0 has pa
best match.
and
list of integrated products '
Locked Products are products that you do not own, and are
available For purchase.
*Available standalone, or included with the Study Manual Program Video Bundle
Coming Soon for FAM, ASTAM & ALTAM!
Available now for P, FM, IFM, STAM, SRM, MAS-I, MAS-II„ CAS 5, CAS 6U & CAS 6C
Prepare for your exam confidently with GOAL custom Practice Sessions,
Quizzes, and Simulated Exams.
1111•••••••
At time t =0 year, Donald puts $1,000 into a fund crediting interest at a nominal rate of i Flag problems for
compounded semiannually. review, record
At time t = 2 years, Lewis puts $1,000 into a different fund crediting interest at a force notes, and email
= 1/(5 + t) for all t.
your professor.
At time t = 16 years, the amounts in each fund will be equal.
Calculate i.
View difficulty
pos.sible Answers level.
Helpful strategies
Help Me Start
to get you
started.
Equate the expressions for the AVs at t = 16. Then solve for i 0):
Full solutions,
Equate the expressions for for the AVs at t = 16 and calculate i(2): with detailed
(1 + 021/2)" = 3 explanations
(1 +2(2)/2) = 3111321=1.03493 to deepen your
io)/2 = 0.03493 13..111: 11+ P/39" -(1 4.ir'lf 2342 understanding.
011C) do e6o'l'un —2,117 — 3
= 7.0%
Commonly
Common Questions &Errors encountered
Student Question 1: After solving this problem I got .069855. Are we expected to round to .07? errors.
Answer: The provided answer choices are all rounded to 1 decimal place. So the answer 6.9855%
n luld be rounded to 7.0% to be correct to 1 decimal place. Rate a
Available now for PI FM, IFM, STAM, SRM, MAS-I & MAS-II
Core 00 • 92 / 304
improve.
Advarioni ; 60 169/304
• mastery 70 43/304
Detailed
GOAL Session Prz1MtySominoty Fitter: AS Types
•
. •: AllStatuaeS performance
Created Last Accessed mode Categories Qoeselosse Status
tracking.
0pip4,2022 2147,43 01/24/2022.2157:43 Quiz • Continuous P_ 29 Hew
4,1
a/san
Exam ASTAM Study Manu al
aiSIM aistm
E-xa
xlsoi stud,/4.413&"
Nta"ual Exam FAM-S Study/Manual
rzti.
h.11.1%4 ARA.,
,4,0)617., 71.1101111Iftr
asim
I Linear Regression 9
n• 3.3 t statistic 34
3.4 Added variable plots and partial correlation coefficients 36
Exercises 38
Solutions 50
4 Linear Regression: F 57
Exercises 59
Solutions 68
6 Resampling Methods 95
6.1 Validation set approach 95
6.2 Cross-validation 96
Exercises 99
Solutions 101
257
15 K-Nearest Neighbors
15.1 The Bayes classifier 257
15.2 ICNN classifier 258
323
18 Cluster Analysis
323
18.1 K-means clustering
325
18.2 Hierarchical clustering
18.3 Issues with clustering 330
Exercises 332
Solutions 336
Appendices 441
The syllabus has two links at the bottom. The second one links to sample questions and solutions. There are 28
sample questions.
The syllabus includes the following topics and weights:
Number of
2. Linear models is the largest topic, but includes generalized linear models and various other topics.
3. The distribution of the sample questions is different from the syllabus weights. Part of this may be to provide
more questions on topics which have not appeared on exams before STAM, such as cluster analysis.
About 60% of the sample questions are conceptual; no calculations are needed. Some of these questions are taken
from obscure passages in the Frees textbook. All of the concepts in these questions are covered in this manual, but
in some cases very briefly. However, there is no guarantee that they won't ask a question on something that I didn't
n include. I believe that knowing the information in this manual will be enough to get a 10, but not necessarily enough
to answer every exam question.
You should also download the tables that are linked to the bottom of the syllabus. The tables include the normal
distribution, critical values for the t distribution, and critical values for the chi-square distribution. The SOA hasn't
provided rounding rules for the normal table, but you won't be using it that heavily anyway.
There are two textbooks on the syllabus. There is some overlap between the two textbooks, as they both discuss
linear regression. The styles of the two textbooks are different.
The first textbook is Regression Modeling with Actuarial and Financial Applications by Edward Frees, an actuary.
This book covers the linear models and time series parts of the syllabus. The author comes across as a scholar who
is very familiar with his material and tries to get it across by showing many practical examples of its use. Practical
examples means computer outputs. To make the book more readable, technical detail is usually placed in a section
at the end of each chapter (and those sections are not on the SRM syllabus). Despite the author's good intentions, I
found this book somewhat difficult to read for the following reasons:
1. The book has a fairly large number of errors. The errata list for the book is at
https://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/
RegressionFreesErrata12September.pdf
The errata list must be taken into account, since many formulas in the book are incorrect.
2. The book lacks a useful index. And the table of contents only lists chapters and sections, not subsections. Isn't
it interesting that the index does not have anything starting with F? (The F-ratio is discussed in this textbook,
although the author seems to prefer using t-ratios.) If you wanted to know something about Cook's distance,
where would you find it? (If you were really clever and knew that Cook's distance had something to do with
leverage, you could look under leverage and find Cook's distance there, but hey - an index should be usable
by dummies!) If some practice question mentioned "Dickey-Fuller" and you wanted to know what it is, where
would you look? (If you knew that the full name of the test is the "Dickey-Fuller unit root test", you would still
not find "unit root tests" in the index, but you would find it in the table of contents. Hurrah!) I was frustrated
by the difficulty of finding things in the book.
3. It is often very hard to understand what the author is saying without knowing the technical background, and
often the technical background is not provided in the last section of the chapter.
I have omitted some of the more obscure topics from this textbook, and I don't think they would appear on an
exam.'
The second textbook is An Introduction to Statistical Learning, coauthored by four non-actuarial authors. This
textbook covers all parts of the syllabus except time series, but does not discuss logistic models.2 This book is
available free as a download, and I encourage you to download it! The style of this book is enthusiasm; these
authors are excited about this topic and want you to be excited as well! You can read this book in bed, and I
challenge you to find an error in it. An Introduction to Statistical Learning avoids technical details as much as possible,
and rather than have you do calculations, shows you how to use R to carry out the modeling.
You will find An Introduction to Statistical Learning easier to read than this manual. However, this manual will
still help you in the following ways:
1. It summarizes the material. You can pick up the material faster by reading this manual, although you will not
be motivated as much. You may even want to read An Introduction to Statistical Learning once and then this
manual many times to review the material.
2. It provides you with exam-like examples and questions. An Introduction to Statistical Learning is interested in
teaching you practical uses of the material. It rarely provides simple small-scale examples since they don't
represent realistic situations. But Exam SRM is does not provide you a computer to do calculations; the
calculation questions on it will be simple and small-scale.
3. On rare occasions An Introduction to Statistical Learning is not completely clear.
'In particular, I do not discuss the multinomial logit model
aActually it does discuss logistic models, but the chapter that discusses them is not on the syllabus.
Much of the material on this exam does not lend itself to calculation by hand. Therefore, much of this exam will
contain knowledge questions rather than calculation questions. Knowledge question are 3-way or 4-way true/false
questions. On 3-way true/false questions, the SOA enforces symmetry of the answer choices. The only two sets of
answer choices that you will encounter on an exam are
(A) I only (B) Il only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
and
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
In both cases, there is symmetry among the three statements I, II, and III, and also symmetry in how likely a
statement is to be true. Apparently symmetry is not required for 4-way true/false questions. For either type, choice
E should be the correct choice about 1/5 of the time only.
The online version of this edition has been linked to the Actuarial University.
The Principal Components Analysis lesson has been rewritten in a clearer way.
The four new SOA sample questions have been added to the appropriate lessons.
The R language
This manual does not cover R, and you won't need it for SRM. However, you will need it for the PA exam. You
should read the labs in An Introduction to Statistical Learning to learn how to use R to carry out the statistical learning
methods you learn in this course.
Cross-reference tables
Note that Appendix B has cross-reference tables showing you which section of the manual corresponds to each
section in the textbooks. As discussed in that appendix, these may be helpful when you are studying for Exam PA.
Errata
Please report all errors to the author. You may send them to the publisher at mail@studymanuals . corn or directly to
me at errata@aceyourexams . net. Please identify the manual and edition the error is in. This is the 3rd edition of
the Exam SRM manual.
An errata list will be posted at http : //errat a . aceyourexams . net. Check this errata list frequently.
Acknowledgements
I would like to thank the CAS for allowing me to use questions from their old exams, and the SOA for allowing me
to use its sample questions.
The creators of TX, P:TEX, and its multitude of packages all deserve thanks for making possible the professional
typesetting of this mathematical material.
I'd like to thank Michael Bean for his diligent job proofreading this manual, as well as advice on how to improve
the content.
I'd like to thank the following correspondents who submitted errata: Hiu Tung Chan, Cheng Chen, Joel Cheung,
Maria Doran, Neil Xavier Elpa, Lingyi Fang,Natalie Jacobsen, Boren Jiang, Dan Kamka, Drew Lehe, Yingxin Liu,
Mario Mendiola, Li Kee Ong, Greg Schlottbohm, Aaron Shotkin, Tara Starling, Ryan Talley, Chan Hiu Tung, Isaac
Zhang, Wei Zhao, Dihui Zhu.
Reading: Regression Modeling with Actuarial and Financial Applications 1.2; An Introduction to Statistical Learning
2.1-2.2.2
There is no perfect f, so we have to allow for error. Let E be the error. Then we want a function f for which
Y f X2, X3, • . .) £
We would like to pick the f that makes the error E as small as possible. The remaining error is the irreducible error. NI°
The input variables Xi are called explanatory variables, independent variables, features, or predictors. (Usually, SI
the term "features" is only used if Xi is a discrete random variable with a finite number of possible values.) The
output variable Y is called the dependent variable or the response.
Prediction and inference Statistical learning is used for prediction and inference. Prediction means determining •-ur
what the response will be for some values of Xi, possibly values that have not been observed in the past. Often
you have no control over the values of Xi. Inference means understanding how the explanatory variables influence NI.
the response. You may be able to control the explanatory variables and thus influence the response. For example,
the response variable sales may be influenced by various types of advertising. If a model shows how each type of
advertising influences sales, you may be able to adjust advertising strategy to increase sales.
Parametric and non-parametric There are two types of methods to specify f: parametric and non-parametric.
A parametric method specifies a function, and statistics is used to fit the function's parameters. For example, the Si?
response may be specified as a linear function of the predictors, and then the coefficients of the linear function are
estimated. Once the linear function is estimated, there is no need for the data that generated the linear function.
A non-parametric method does not specify a simple form for the relationship. Instead, a function that is close to
all the points without getting too wiggly is specified. Unlike parametric methods, all of the data that generated
this function is needed to specify the relationship. An example of a non-parametric method (which, however, is
not on the syllabus) is a spline. A cubic spline connects the points with cubic polynomials so that the connections
have continuous curvature. The disadvantage of parametric methods it that they assume a functional form for the
relationship, and the functional form may be wrong. But non-parametric methods have the disadvantage of needing
a large amount of observations to properly specify.
Flexibility versus interpretability A flexible method bends and twists in order to match the observations. An
inflexible method will not do that.
You may think that flexible methods are better. But one problem with them is that just because they fit the
observations well does not mean that they will do a better job predicting the response given new observations.
For example, one may have 10 observations of a response as a function of a predictor. One can fit a ninth-degree
polynomial to these observations, and it will fit the observations perfectly. But it will do a poor job predicting the
response for other values of the predictor.
air Another problem with flexible methods is interpretability. Linear regression specifies the response as a linear
function of the predictors. It is an inflexible method, but it is easy to understand how each predictor influences the
response. Flexible methods are difficult to interpret. In general, the more flexible a method, the less interpretable it
is. If statistical learning is used for prediction, then the lack of interpretability is not critical; a black box will do. But
if it is used for inference, the lack of interpretability is a drawback.
•41` Supervised and unsupervised learning So far we've been discussing supervised learning. Supervised learn-
-f ing has a response variable that is influenced by explanatory variables. Unsupervised learning does not have a
response variable. Unsupervised learning relates the observations to each other or finds patterns in the observa-
tions. It is more difficult than supervised learning since we cannot measure how well we have done. Typically, the
quality of a supervised method is measured by comparing the predicted response with the actual response. This
comparison is not available for unsupervised learning.
149 Regression versus classification problems Sometimes the response variable is continuous and sometimes it is
limited to a small number of values. When it is continuous, we have a regression problem, whereas when it is limited
to a small number of values we have a classification problem. Sometimes in classification problems the values of
the response are not numbers. For example, if the effectiveness of a medicine is being modeled, the response may
be "Effective" or "Not effective".
For regression problems, the function we fit can assume any real value. For classification problems, the function
we fit estimates the probabilities of the various classifications.
We will discuss types of variables in greater detail in the next section.
Quality of Fit The quality of fit for a regression problem is measured using mean squared error' or MSE. MSE is
defined by )
MSE = —
ii
- 902 (1.1)
1=1
Sf where pi is the fitted value. Typically there is training data, data that is used to fit the model, and test data, data
•
that was not used to fit the model. We will discuss in Lesson 6 how to obtain test data. The quality of a model is
determined by its MSE on the test data. A flexible model can reduce the MSE on the training data by using a lot of
parameters, but this does not indicate a high-quality model. In fact, standard error measurements for training data
as: typically divide the sum of squared errors by something less than n to compensate for the fitted parameters.
For classification problems, the quality of the fit is measured by the proportion of cases for which the correct
classification is selected. As with regression problems, the quality of the fit is measured on test data.
Bias versus variance Mean squared error is the sum of squared bias, variance, and irreducible error. There is
nothing we can do about the irreducible error, so we won't discuss it further.
NI° The variance of an estimator measures how much the estimator varies with different random samples of data.
Bias measures the extent to which the expected value of the estimator differs from the true value. Generally there
• is a tradeoff between variance and bias, and the goal is to select the estimator that minimizes the mean squared
error, thus optimizing the tradeoff. Inflexible estimators are not very sensitive to the input data, so they have low
variance. However, they make assumptions for the functional form of the relationship between the explanatory
variables and the dependent variable, and those assumptions may not be true; thus they have high bias. We will
learn later that the coefficient estimates of linear regression are unbiased, but this assumes that the true relationship
is linear! In reality, the relationship is almost surely not linear, so linear regression is a high-bias estimator. However,
it is a low-variance estimator, since it is not so sensitive to each individual point. A spline that goes through every
'Regression Modeling with Actuarial and Financial Applications calls it "mean square error", but An Introduction to Statistical Learning calls it
"mean squared error", which is probably better grammatically.
observation point is very sensitive to the points, so it has high variance, but it has low bias since it does not make
assumptions on the underlying function.
2. Categorical variables, sometimes called qualitative variables. These variables assume a small number of `11
category values. An example of such a variable is a "yes/no" random variable, like "Does the car have an anti-
theft system", "does the house have a sprinkler", and the like. Such variables, that can assume only one of two
values, are called "binary" or "Bernoulli" variables. Another common binary variable is male/female. Some
categorical variables have more than two categories. For example, auto usage may have the three categories
"farm", "pleasure", and 'business".
We assign numbers to the categories so that we can use these variables in equations. For the auto usage
variable, we may assign 0 to farm, 1 to pleasure, and 2 to business. There is no particular order to these
categories, so this variable is a "nominal variable". Sometimes the categories have a meaningful order. For
example, there may be various categories of injury in an accident, with categories ordered from mildest to
most severe injury. If the injury codes are 1 through 5, then a category 4 injury is worse than a category 2
injury, but it is not necessarily twice as bad as a category 2 injury. When the category numbers have a logical
order, we call the variable an "ordinal variable".
3. Count variables. A count variable assumes nonnegative integral values. Number of claims is an example of a
count variable.
There are also discrete variables that assume negative or nonintegral values. However, it is rare that we deal with
them.
1.3 Graphs
This section is based on Regression Modeling with Actuarial and Financial Applications 1.2, which is background reading
only.
We want to get some idea of which explanatory variables to use in our model. Looking at plots is one way
to do this. In the following plots, the x axis is used for an explanatory variable and the y axis is used for the
response variable, the variable we are trying to explain. The sample consists of observed pairs of (xi, yi), where xi
are observations of the explanatory variable and yi are corresponding observations of the response.
Scatter plots If we are considering a continuous variable as an explanation of another continuous variable, we can
graph them using a scatter plot. A scatter plot is a plot of all sample pairs (xi, yi). For example, if a sample has the 1141
5 points
(1,2) (2,4) (4,3) (7,5) (9,4)
10
8-
2-
o
o 2 4 10 0 1 2 3 4 5 6
Figure 1.1: Scatter plot for 5 points Figure 1.2: Scatter plot for 2000 points
30
25 -
—9—
20 -
15
10 -
5-
Male Female
and blue. In this case, if there is a preponderance of red dots at the bottom and blue dots at the top, we can conclude
the category which is coded as blue dots tends to increase the response variable. On the other hand, if the red and
blue dots are randomly distributed around the graph, then the categorical variable is not relevant.
Box plots If we are considering a categorical variable as an explanation of a continuous variable, then we can draw
•
a box plot. A box plot has a rectangle above each category. A thick line in the middle of the box indicates the median.
The bottom line of the box is the first quartile and the top line of the box is the third quartile. Additional lines,
'141 called "fences" are placed above and below the box. Dashed vertical lines between the rectangle and the additional
•
lines, sometimes called "whiskers", are drawn. Different authors and programs put the fences in different places.
The syllabus reading uses the following method, proposed by John Tukey. Let Qi and Q3 be the first and third
quartiles respectively. Then let 11 = 1.5(Q3 — Q1). Place the lower fence at Qi — 11 and the upper fence at Q3 + 11.
•41 Potential outliers, defined as the sample points above and below the fences, are plotted individually, vertically above
.4* the center. The points within the fences but closest to the fences are called adjacent points.
An example of a box plot is Figure 1.3. Let's say the y axis is claim size. Then this plot indicates that for males,
median claim size is 15, the first quartile is 13, and the third quartile is 19, with some high percentile equal to 26
and a few potential outliers above 26. Female claim sizes tend to be lower. The distribution of claim sizes is skewed
upwards, as the solid lines in the middles of the rectangles are below the centers of the rectangles, and the distances
to the high percentiles are greater than the distances to the low percentiles. This plot was not drawn using the Tukey
method; using the Tukey method, the fences would be equidistant from the box.
q plots The 100qth percentile is also called the Cith quantile. For example, the 40th percentile is the 0.4 quantile. A
qq plot compares quantiles of two distributions. A gq plot consists of a plot of coordinate pairs: the x coordinate
is the observed quantile and the y coordinate is the fitted quantile. For example, suppose the observed values
1.-.? are 1, 3, 6, 10, and 15. Suppose they are fitted by maximum likelihood to an exponential distribution of the form .j
Exam SRM Study Manual
Copyright 02022 ASM
1.3. GRAPHS
20
oo 5 10 15 20
Fitted quantiles
•
F(x) = 1 — e '0. The maximum likelihood fit sets t9 equal to the sample mean, which is 7 here. The eh quantile of "Ni
an exponential distribution can be worked out as follows:
F(x) = q
1 — e-x/0 = q
x
—
=— in(1 q)
0
x = —0 ln(1 — q)
There are many possible methods to assign quantiles to observations. For a sample of size n, we will set the quantile
of order statistic j (the ith observation when the observations are ordered from lowest to highest) equal to j/(n + 1).
Then, if we let the x-coordinates be the fitted values and let the y-coordinates be the observed data, then the qq plot
has the five points (-71n 5/6,1), (-71n 4/6,3), (-71n 3/6,6), (-71n 2/6,10), and (-7 ln 1/6,15). Figure 1.4 shows the
qq plot, with the y-coordinates for the observations and x-coordinates for the fitted distribution. In this qq plot, the
points are connected with lines, but q q plots do not have such lines; instead, the number of observations is usually
large, and one can see if they lie on a line.
If a completely specified distribution is being fitted, then the fit is good if the line is close to the 450line. Figure 1.4
has a 459ine. qq plots do not necessarily have a diagonal line, and if they do, they are typically drawn through the
25th and 75th percentiles of the observed and fitted distributions.
Usually, rather than comparing observed data to a completely specified distribution, a qq plot is used to compare
observed data to a family of distributions. One member of the family is used to draw the plot. If the fitted distribution
has cumulative distribution function F and the points of the qq plot lie on a straight line, it suggests that F ((x — a)/b)
is a good fit. in our example, rather than fitting the data to an exponential with mean 7, one may fit them to an
exponential with an arbitrarily selected mean. Then, the sign of a good fit would then be that the points lie on a
straight line. In Figure 1.4, the first four points of the plot lie more or less on a straight line, indicating a good fit,
even though the line is not the diagonal, but the point for the observation 15 is off the line. As long as points lie on a
straight line, even though it is not the diagonal, there is an exponential with some parameter whose quantiles will
match those points.
A special case of a qq plot is a normal probability plot, where the fitted distribution is a normal distribution. This *vs
is the most common type of qq plot.
Figure 1.5 shows how qq plots look with good and bad fits. For each plot, the fitted distribution is standard
normal, and the observations are 100 simulations from a distribution. In Figure 1.5a, the observations are standard
normal, and the fit is very good, a straight line. In Figure 1.5b, the observations are normal with mean 6 and
standard deviation 2, and the fit is just as good as the first one; only the scale of the y axis changes. You see how a
qq plot tells you that the distribution family you picked is a good fit, even if you don't know the parameters of the
i tted distribution.
f
0
T
1 1 1 1 1 1 I I I I I
-3 2 1 0 1 2 -3 -2 -1 0 1
LC, -
'.l o 03 —
0
o
0
CD —
6'
o 0
Lo
0 —
vt —
1
0.1
/
—
00 C- o caum0101111.1111111
I I I i I I I I i r i
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
In Figure 1.5c, the observations are from a Student's t distribution with 3 degrees of freedom. The t distribution
is a symmetric distribution but has wider tails than a normal distribution, especially for low degrees of freedom. I
used R to draw these plots, and R automatically adjusts the axis scales unless you override it. Due to the outliers, the
limits of the y axis have been expanded. The fit is actually a 45° line in the middle. But the extreme values of the t
distribution cause the line to curve at the far left and far right. The intermediate quantiles of t are similar to those
of a normal distribution, but the very high and very low quantiles are respectively higher and Tower than those of a
normal, causing the pattern seen in that plot.
In Figure 1.5d, the observations are from an exponential distribution with mean 1. An exponential is not
symmetric; it is concentrated near 0 and is skewed towards the right. Its median is less than its mean. The low
quantiles of an exponential are very small relative to the symmetric normal's quantiles. The normal distribution
puts too much weight on the lower values and too little weight on the higher values, causing the upward facing
curve seen in this plot.
Exercises
4. Claim count
5. Claim size
Solutions
1.1. Legal representation is either "yes" or "no", so it is categorical. Injury code has a short discrete set of possible
values, so it is categorical. The other variables are not categorical. 1 and 3
1.2. The first quartile is 3 and the third quartile is 58. Then h = 1.5(58 — 3) = 82.5 and the lower fence is placed at
3 — 82.5 —79.5
Linear Regression
Lesson 2
In a linear regression model, we have a variable y that we are trying to explain using variables xi, xk.1 We have
n observations of sets of k explanatory variables and their responses: {y1,x1,xi2, , xik} with i = 1,. ,n. We
would like to relate y to the set of xj, j = 1,. k as follows:
where ei is an error term. We estimate the vector [3 = (n0,131, , fik) by selecting the vector that minimizes E7_1 gss
For statistical purposes, Ei is a random variable. We make the following assumptions about these random
variables:
1. E[r1] = 0 and Var(Ei) := 02. In other words, the variance of each error term is the same. This assumption is
called homoscedasticity (sometimes spelled homoskedasticity).
2. ei are independent.
3. Ei follow a normal distribution.
If these assumptions are valid, then for any set of values of the k variables {x1, x2, , xk}, the resulting value of y
will be normally distributed with mean /30 + pixi and variance 02. Moreover, the estimate of f3 is the maximum
ilkelihood estimate.
Notice that our linear model has k parameters f3i, P2, ,pk in addition to the constant fib.Thus we are really
estimating k + 1 parameters. Some authors refer to "k + 1 variable regression". I've never been sure whether this is
because k 1 ps are estimated or because the response variable is counted as a variable.
Often we use Latin letters for the estimators of Greek parameters, so we can write bi instead of Pi.3
The formula for (3/ can be expressed as the quotient of the covariance of x and y over the variance of x. The
sample covariance is
E(xi — x)(yi —9)
evxy 11 — 1
s2
n—1
The n — is cancel when division is done, so they may be ignored. Then equation (2.1) becomes
„
cpxy
1i= 2
Sx
•
You may use the usual shortcuts to calculate variance and covariance:
Cov(X, Y) = WY] — E[X] ELY]
Var(X) = E[X21 — E[X12
In the context of sample data, if we use the biased sample variance and covariance with division by n rather than
n —1 (It doesn't really matter whether biased or unbiased is used, since the denominators of the sums, whether they
are n or n 1, will cancel when one is divided by the other.), these formulas become
n r1 ti
ni2
n
E xi 1 yi
V.ixiYi n ri=i xiyi — nig
fii — =
Let sx, sy, be the sample standard deviations of x and y, and let rxy be the sample correlation of x and y, defined
as follows:
CVxy
YXy
Sx Sy
rxysxsy
From formula (2.1), we have pi = 2
, Or
Sx
Sy
-40 Pi rXy Sx
-
(2.3)
y 34 38 38 53 50 60 70
SOLUTION: First we calculate E 4 and xiyi, then we subtract n22 and n,U. We obtain:
E 4 = 132
E xiyi = 1510
_
28
x= =4
_
343
y= = 49
You would never go through the calculations of the previous example since your calculator can carry out the
regression. On the TI-30XS, use data, ask for 2-Var statistics. In those statistics, item D is 131 (with the unusual name
a) and item E is Po (with the unusual name b). You can try this out on this quiz:
Quiz 2-1 •N? For a new product released by your company, revenues for the first 4 months, in millions, are:
Month 1 27
Month 2 34
Month 3 48
Month 4 59
yi = ei
More likely, an exam question would give you summary statistics only and you'd use the formulas to get #0 and
pi.
EXAMPLE 2B ¶Z For 8 observations of X and Y, you are given:
2=6 x=408 = 462
Yi = Po + Pixi
Determine ijo. •
y, xiyi — n2y
SourrioN: 131 0
Ex —
462 — 8(6)(8) —065
408 — 8(62)
(io= y - = 8 — 0.65(6) =
The next example illustrates predicting an observation using the regression model.
EXAMPLE 2C Experience for four cars on an automobile liability coverage is given in the following chart:
Miles Driven 7,000 10,000 11,000 12,000
Aggregate Claim Costs 600 2000 1000 1600
SOLUTION: We let xi be miles driven and y, aggregate claim costs. It is convenient to drop thousands both in miles
driven and aggregate claim costs.
7 + 10 + 11 + 12 0.6 + 2 + 1 + 1.6
=10 =1.3
—
4
y= 4
E 4 = 72 + 102 + 112 + 122 = 414 x,y; = (7)(0.6) + (10)(2) + (11)(1) + (12)(1.6) = 54.4
denominator = 414 — (4)(102) = 14 numerator = 54.4— (4)(10)(1.3) = 2.4
2.4 6 6 2900
1
14 35 po = 1300 — (-3) (10000) = 7
The fitted value of yi, or fio + 1x1, is denoted by gi. The difference between the actual and fitted values of
•-: yi, or ei= y —9j, is called the residual. As a result of the equations that are used to solve for #, the sum of the residuals
Z;Li ei on the training set is always 0. As with pi, we may use Latin letters instead of hats and denote the residual by
ei.
13kxik + Li
Let's now discuss multiple regression, the case when k > 1. We then have k explanatory variables plus an
*41 intercept and n values for each one. We can arrange these into an n x (k + 1) matrix:
1 X X12 Xlk
X21 X22 X2k
x=
X ni Xn2 Xnk
13o\
( . and y = IYY21).
31'
Notice how the intercept was turned into a variable of Is. Set /3 = Then equation (*) can be
Pk / kY
written like this:
X/3 = y
• X is called the design matrix.4 The generalized formulas for linear regression use matrices. We will use lower case
boldface letters for column and row vectors and upper case boldface letters for matrices with more than one row
• and column. We will use a prime on a matrix to indicate its transpose. The least squares estimate of fl is
p (XX)-1X'y (2.4)
and then the fitted value of y is9= X. I doubt you'd be expected to use formula (2.4) on an exam, unless you were
given (X/X)-1, since it involves inverting a large matrix. In fact, I doubt you will be asked any questions requiring
matrix multiplication.
The (X'X)-1 matrix is singular (non-invertible) if there is a linear relationship among the column vectors of X.
Therefore, it is important that the column vectors not be collinear. Even if the variables are only "almost" collinear,
9'1? the regression is unstable. We will discuss tests for collinearity in Section 5.3.
'The reason for this name is that in some scientific experiments, the points x are chosen for the experiment. But this will generally not be the
case for insurance studies.
Even though regression is a linear model, it is possible to incorporate nonlinear explanatory variables. Powers of
variables may be included in the model. For example, you can estimate
yi =Po +Piex' + el
Linear regression assumes homoscedasticity, linearity, and normality. If these assumptions aren't satisfied,
sometimes a few adjustments can be made to make the data satisfy these conditions.
Suppose the variance of the observations varies in a way that is known in advance. In other words, we know
that Var(Ei) = c2/w, with wi varying by observation, although we don't necessarily know what cy2 is. Then wi is
the precision of observation i, with wi = 0 for an observation with no precision (which we would have to discard) "41
and WI co for an exact observation. We can then multiply all the variables in observation i by v/7... After this
multiplication, all observations will have the same variance. Let W be the diagonal matrix with wi in the ith position
on the diagonal, 0 elsewhere. Then equation (2.4) would be modified to
13 = (X'WX)-1X'Wy (2.5)
In this model, In yi is assumed to have a normal distribution, which means that yi is lognormal. A lognormal •411
distribution is skewed to the right, so logging y may remove skewness.
A general family of power transformations is the Box-Cox family of transformations: •41
yA _ 1
1
A °
y(A) = (2.6)
my A=0
This family includes taking y to any power, positive or negative, and logging. Adding a constant and dividing by
a constant does not materially affect the form of a linear regression; it merely changes the intercept and scales the
13 coefficients. So (yA — 1)/A could just as well be yA. The only reason to subtract 1 and divided by A is so that as
A —> 0, (yA — 1)/A ln y.
I doubt that the exam will require you to calculate parameters of regression models. Do a couple of the calculation
exercises for this lesson just in case, but don't spend too much time on them.
PO y — filx (2.2)
=
(2.1)
z(xi — x.)2
Y
ht.=,xy—s x
(2.3)
ei = 0
Y
—
A (2.6)
my A=0
Exercises
2.1. •-11 You are given the linear regression model yi = Po +pixi + ei to fit to the following data:
x —2 —1 0 1 2
y 3 5 8 9 10
2.3. al [SRM Sample Question #171 The regression model is y = P0 +16ix + E. There are six observations.
The summary statistics are:
2.4. swii [SRM Sample Question #47] You are given the following summary statistics:
= 3.500
g 2.840
2.5. (SRM Sample Question #53] Determine which of the following statements is NOT true about the equation
=
PiX + e
(A) pa is the expected value of Y.
(B) pi is the average increase in Y associated with a one-unit increase in X.
(C) The error term, E is typically assumed to be independent of X and Y.
(D) The equation defines the population regression line.
(E) The method of least squares is commonly used to estimate the coefficients 130 and Pi.
2.6. •41/ [SRM Sample Question #23] Toby observes the following coffee prices in his company cafeteria:
• 12 ounces for 1.00
• 16 ounces for 1.20
• 20 ounces for 1.40
The cafeteria announces that they will begin to sell any amount of coffee for a price that is the value predicted
by a simple linear regression using least squares of the current prices on size.
Toby and his co-worker Karen want to determine how much they would save each day, using the new pricing, if,
instead of each buying a 24-ounce coffee, they bought a 48-ounce coffee and shared it.
Calculate the amount they would save.
(A) It would cost them 0.40 more.
(B) It would cost the same.
(C) They would save 0.40.
(D) They would save 0.80.
(E) They would save 1.20.
2.7. "I [MAS-I-F18:29] An ordinary least squares model with one variable (Advertising) and an intercept was fit
to the following observed data in order to estimate Sales:
2.8. Nr You are fitting the linear regression model yi = po + pixi + e; to the following data:
x 2 5 8 11 13 15 16 18
y —10 —9 —4 0 4 5 6 8
2.9. 64° You are fitting the linear regression model yi = po + 13xi Ei to the following data:
x 3 5 7 8 9 10
y 2 5.7 8 9 11
2.10. '45 You are fitting the linear regression model pi = po+pixi+ E . You are given:
(i) E 1 xi = 392
(ii) y; = 924
(iii) a3.1 xiy; = 13,272
(iv) pp := —23
Determine En, X.
2.11. ev? [3-F84:51 You are fitting the linear regression model y; =Po +pix; + E; to 10 points of data. You are given:
y; = 200
xiy; = 2000
E 4 = 2000
E y 5000
Calculate the least-squares estimate of pi.
(A) 0.0 (13) 0.1 (C) 0.2 (D) 0.3 (E) 0.4
E= 144
y; = 1,742
E 4 = 2,300
E = 312,674
xiy; = 26,696
71 12
Yi = Po + Aix; + E;
2.13. [120-F90:6] You are estimating the linear regression model yi = flo + f3ixi + Ei. You are given
i 1 2 3 4 5
Determine jdi•
(A) 0.8 (B) 0.9 (C) 1.0 (D) 1.1 (E) 1.2
2.14. LA? [120-S90:11] Which of the following are valid expressions for b, the slope coefficient in the simple linear
regression of y on x?
I.
(A) I and IT only (B) I and III only (C) II and III only (D) I, II and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
2.15. s: [Old exam] For the linear regression model yi = 130 + pixi + E with 30 observations, you are given:
(i) rxy = 0.5
(ii) sx = 7
(iii) sy = 5
where rxv is the sample correlation coefficient.
Calculate the estimated value of Pi.
(A) 0.4 (B) 0.5 (C) 0.6 (D) 0.7 (E) 0.8
2.16. 1L [110-S83:14] In a bivariate distribution the regression of the variable y on the variable x is 1500 + b(x — 68)
for some constant b. If the correlation coefficient is 0.81 and if the standard deviations of y and x are 220 and 2.5
respectively, then what is the expected value of y, to the nearest unit, when x is 70?
(A) 1357 (B) 1515 (C) 1517 (D) 1643 (E) 1738
2.17. Nr [120-82-97:7] You are given the following information about a simple regression model fit to 10 observa-
tions:
E xi =20
E yi = 100
s, = 2
sy = 8
You are also given that the correlation coefficient rxy = —0.98.
Determine the predicted value of y when x = 5.
(A) —10 (B) —2 (C) 11 (D) 30 (E) 37
2.18. s: In a simple regression model yi = flo + 131x1 + ci, you are given
y; = 450
xiyi = 8100
y5 =40
Period y x1 x2
1 1.3 6 4.5
2 1.5 7 4.6
3 1.8 7 4.5
4 1.6 8 4.7
5 1.7 8 4.6
Yi = Po + p2xi2+ i= l,2,...5
(X1X)-1 = ( 26.87
-374.67
0.93
-7.33
-7.33
93.33
Calculate e2.
(A) -0.2 (B) -0.1 (C) 0.0 (D) 0.1 (E) 0.2
2.20. al You are fitting the following data to a linear regression model of the form yi = po+13ix1i+132 X12 +133Xi3 Ei:
y 5 3 10 4 3 5
X1 0 1 0 1 0 1
X2 1 0 1 1 0 1
X3 0 1 1 0 0 0
(-10
20 0 0
pex 1 -1 = 30
I 1 -18 0 24 6
-12 0 6 24
2.21. e'llf [120-82-94:11] An automobile insurance company wants to use gender (xi = 0 if female, 1 if male) and
traffic penalty points (x2) to predict the number of claims (y). The observed values of these variables for a sample \---)
of six motorists are given by:
Motorist xi x2 y
1 0 0 1
2 0 1 0
3 0 2 2
4 1 0 1
5 1 1 3
6 1 2 5
2.22. %I' You are fitting the following data to the linear regression model yi = Po + Pixii + P2Xi2 P3Xi3 Ei:
y 1 2 6 5 1 2 3
Xi 0 0 1 -1 0 1 1
X2 0 -1 0 0 1 -1 0
X3 1 1 4 0 0 0 1
2.23. '41 [Old exam] You are examining the relationship between the number of fatal car accidents on a tollway
each month and three other variables: precipitation, traffic volume, and the occurrence of a holiday weekend during
the month. You are using the following model:
where
y = the number of fatal car accidents
xi = precipitation, in inches
x2 = traffic volume
X3 = 1, if a holiday weekend occurs during the month, and 0 otherwise
The following data were collected for a 12-month period:
Month y xi x2 x3
1 1 3 1 1
2 3 2 1 1
3 1 2 1 0
4 2 5 2 1
5 4 4 2 1
6 1 1 2 0
7 3 0 2 1
8 2 1 2 1
9 0 1 3 1
10 2 2 3 1
11 1 1 4 0
12 3 4 4 1
257 -82
QCX)-1 = —
(
6506-446
82 254
-364 -446)
-364
2622)
2.24. el [S-F13:33] You are given a regression model of liability claims with the following potential explanatory
variables only:
• Vehicle price, which is a continuous variable modeled with a third order polynomial
• Average driver age, which is a continuous variable modeled with a first order polynomial
• Number of drivers, which is a categorical variable with four levels
• Gender, which is a categorical variable with two levels
• There is only one interaction in the model, which is between gender and average driver age.
Determine the maximum number of parameters in this model.
(A) Less than 9 (B) 9 (C) 10 (D) 11 (E) At least 12
2.25. %IP [MAS-I-S18:371 You fit a linear model using the following two-level categorical variables:
1 if Account
=
0 if Monoline
1 if Multi-Car
0 if Single Car
Po = —0.10
pi = —0.25
02 = 0.58
03 = —0.20
Another actuary modeled the same underlying data, but coded the variables differently as such:
0 if Account
=
1 if Monoline
{0 if Multi-Car
X2 =1 if Single Car
with the equation
E[Y] = ao aiXi + a2X2 + a3X1X2
Afterwards you make a comparison of the individual parameter estimates in the two models.
Calculate how many pairs of coefficient estimates (ai,13 i) switched signs, and how many pairs of estimates stayed
identically the same, when results of the two models are compared.
(A) 1 sign change, 0 identical estimates
(B) 1 sign change, 1 identical estimate
(C) 2 sign changes, 0 identical estimates
(D) 2 sign changes, 1 identical estimate
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
l
f 2.26. •-(11 [114AS-1-S19:32] You are fitting a linear regression model of the form:
y = Xf3 £;e N(0, a2)
and are given the following values used in this model:
1 0 1 9
1 1 1 15 32 2 3 32 1.38 0.25 0.54 -0.16
C21 (3 51 )'
1 1 1 8 19 2 4 4 36 0.25 0.84 -0.20 -0.06
X =
0 1 1 7 17
; X'X =
3 4 6
x,x-i = 0.54 -0.20 1.75 -0.20
0 1 1 6 15 32 36 51 491 -0.16 -0.06 -0.20 0.04
0 0 1 6 15
f20.93\
0.297 32.03
(-37: ; X(X'X)-1X'y =
19.04
(X1X)-1Xly = 16.89
; a-2 = 0.012657
1.854 15.04
\15.07
2.27. '41 [MAS-I-519:291 Tim uses an ordinary least squares regression model to predict salary based on Experience
and Gender. Gender is a qualitative variable and is coded as follows:
Gender =
11 if Male
0 if Female
Abby uses the same data set but codes gender as follows:
Gender =
{1 if Female
0 if Male
Solutions
2.1.
2.2.
z(Xi 2) 3092 2162 =, 500
18
Ii =
_
0.68
500
2.3. The least squares estimate of 131 is the covariance of x and y divided by the variance of x. In the following
calculation, the numerator is n times the covariance and the denominator is n times the variance; the ns cancel. We
have n = 6 observations.
0.7 (D)
2.7. An exam question like this asking you to carry out a linear regression is rare. You can carry out a linear
regression on your calculator without knowing the formulas. But anyway, here is the calculation, with X being
advertising and Y being sales.
6.38
h = 23.806
= 0.268
29 4
554 — 23.806
a = —
5 5 ) = —29.179
= —29.179 + 23.806(6.0) = 113.657
The third residual is 112 — 113.657 = —1.657. (B)
2.8. In the following, on the third line, because 9 = 0, z(xi — 2)(yi — 9) = E(xi — )7)N.
= 11
(41) _ 197 _
924
5.7941
2.10. if
= = 33
392
=
= 14
28
=
+ /312, so
33 = —23 + (14)
9)
fi t = 4 = Dx1 "Yi
—
2.13.
Exi =35.5
E 4 = 252.25
Ecx, _ 252.25 552
5
— 0.2
—
yi =5.3
xiyi = 37.81
0.18
0.9 (B)
z x,y, E XlnE YI
x? (E XI)2
fl
2.18.
8100- (30)(450)/15 2
lfl = 347
270 - 302/15
450
g =
1.3
1 1 1 1
81 1.581. 577.93
X'y = ( 64.5 4.6
7 7
4.5
8
4.7
4.6) 1.6
1.7
=
36.19)
fi = (VX)-1X'y =
( 9.9107 )
0.2893
-2.2893
(130)2
„
2.20. A
= 24
13
Si = -2
TO 114
72 )
96
2.21. The first coefficient of X'y is the sum of y, or 12. The second is 1 + 3 + 5 = 9 (not needed because (X/X)il = 0),
and the third is 2(2) + 1(3) + 2(5) = 17. Then
1 15
P2 = - ((-3)(12) + 3(17)) = - =
1.25 (C)
12
2.22.
60.5
-0.26
2.23. A little unusual not to have an intercept term f30, but the formulas are the same as usual.
We need to compute X'y:
3
1
2
4
3 2 2 5 4 1 0 1 1 2 1 4
1
X'y = (11
1
1
1
0
2
1
2
1
2
0
2
1
2
1
3
1
3
1
4
0
4
1
3
2
0
= (557)1
20
2
1
Then we multiply the first row of (X'X)-1 by X'y to get the first coefficient of the P's,
257(57) - 82(51) - 446(20) _ 1547 _ 0.2378 (C)
6506 - 6506 -
2.24. A third order polynomial has 3 parameters that are multiplied by x, x2, and x3. A categorical variable with n
levels has n - 1 parameters. Thus there are 3 parameters for vehicle price, 1 for driver age, 3 for number of drivers, 1
for gender, and 1 for interaction. That sums up to 9. Add the intercept, and there are a total of [id] parameters. (C)
2.25. Since the model must produce the same results regardless of the values of the Xis, products of parameters
and variables must be the same. Expressing the second model in terms of the first,
E[Y] = ao + a1(1 - X1) + a2(1 - X2) + a3(1 - X1)(1 - X2)
ao + al + a2 + a3 + (-al - a3)Xi + (-a2 - a3)X2 + a3X1 X2
We see that a3 = p3, but the relationships of the other parameters is not a simple sign change. (E)
2.26. By formula (2.4), the modeled estimate of all parameters is (XTX)-1xTy-.Here the intercept is the third
variable, since the column of X that is all is is the third column, so the estimate of the intercept parameter is 3.943
(E)
2.27. In Tim's model, salary level is 18,169.3 + 1110.233xi + 169.55 for males and 18,169.3 + 1110.233x1 for females.
Abby's model must produce the same result, and for males she has bo + 1110.233xi, so bo = 18,169.3 + 169.55 =
18,338.85 (B)
Quiz Solutions
2-1.
Exi= 1 + 2 + 3 + 4 = 10
E4=,2+22.+32+42=30
Ey1= 27+34+48+59 = 168
xiyi = 27+ 2(34) + 3(48) + 4(59) = 475
475 - (10)(168)/4 11
30 - 102/4
Reading: Regression Modeling with Actuarial and Financial Applications 2.3-2.5, 3.2-3.4; An Introduction to Statistical
Learning 3.1.2-3.1.3, 3.2.2-3.2.3
Z (YI g)2 =
(3.1) •41
i.1
In this expression, we omit the cross term 2 E/1_1(y; 9i)(9i — 9) = 2 E;'_i — 9). In the sidebar, we show that
this cross term is 0.
We will use the notation Total SS for total sum of squares, Regression SS for regression sum of squares, and
Error SS for error sum of squares. In the old days, we used to use the abbreviations RSS for regression sum of
squares, ESS for error sum of squares, and TSS for total sum of squares. But then the R language came along and
used RSS for residual sum of squares (= error sum of squares)! An Introduction to Statistical Learning, a book closely
tied to R, uses RSS for residual sum of squares and TSS for total sum of squares, and if the regression sum of squares
is needed, it writes TSS — ESS.' In some non-syllabus books, you may also see SSE (sum of squared errors) used for
what we are calling Error SS.
An alternative formula for Total SS is
/I
Total SS = — n92
Recall that the sum of the residuals is 0, which means that E ei = E(yi = 0, so z pi = E yi ng. Therefore, a
formula similar to the one for Total SS is available for the regression sum of squares:
Regression SS = n g2
fl II wonder why it doesn't use ESS, for "explained sum of squares".
with respect to every pi, and set the result equal to 0. Thus
k
1=1
( yi —Z fiixii
i.o ) =0
E Xiiti = 0
1=1
for every j. And for the intercept (j = 0), xii = 1, so riLi = 0. Now,
• E7,191.; = 91 ê1 = 0 because gi = 0.
eigj = E Pixq = 0
It follows that the cross term 2 (9i — is 0.
but this formula is not as useful as the one for Total SS since you hardly ever are given 9.
The total sum of squares has n —1 degrees of freedom. One degree of freedom is lost because the sum of squares
is calculated as differences from the sample mean rather than the unknown true mean. The regression sum of
squares has k degrees of freedom, one for each variable not counting the intercept. The error sum of squares has the
remainder of the degrees of freedom, or n — k — 1 degrees of freedom. The quotient of the error sum of squares over
al its number of degrees of freedom is the mean squared error(MSE) of the regression. The square root of this quotient
0
: is called the residual standard error (RSE). It is also called the residual standard deviation (s), or the standard error of the
regression.2 Thus
$ = RSE = =
1./ Error SS
n—k—1
(3.2)
Notice that we divide by n — k — 1, not by n, to calculate MSE. Contrast this with equation (1.1), where MSE is
calculated with division by n. In that context, MSE is calculated using test data, data that was not used to fit the
model, so no degrees of freedom are lost.
When the model is estimated from data and residuals are calculated based on the model's output from
that data, degrees of freedom are lost. You divide by a number less than the number of points in the data.
But when the model is estimated from data, then applied to different data and residuals are calculated
2Regression Modeling with Actuarial and Financial Applications calls it s, the residual standard deviation; An Introduction to Statistical Learning
calls it RSE, the residual standard error.
based on the model's output from that different data, no degrees of freedom are lost. You divide by the number
of points in the different data.
Even though s is called the residual standard deviation, it is not the standard deviation of the residuals. It is the
standard deviation of what you are modeling, Y. So it is the standard deviation of the real residual, of e without a
hat. We will discuss the standard deviation of the residuals—the standard deviation of P with a hat—later.
The sum of squares information is summarized in an analysis of variance (ANOVA) table. The table looks like °:
this:3
s2 is an unbiased estimator of a2, the variance of the response variable y or the variance of ci. *4'
as the coefficient of determination. R2 is the proportion of the sum of squares explained by the regression.
EXAMPLE 3A `411
You are fitting a linear regression model yi = 130+131x, + ei to 18 observations.
You are given the following information:
• x=5
• V-81 4 = 480
•y=4
• Zri = 1056
•
xiyi = 480
You are given that the error sum of squares is 288.
Calculate the coefficient of determination, R2. •
'In most textbooks, although not in the textbooks on the syllabus, an ANOVA table has an additional column with an F statistic. We will
discuss the F statistic later.
In a simple linear regression model, R2 is the square of the correlation between x and y; in other words
If k > 1, the square root of R2 is called the multiple correlation coefficient. R2 is the square of the correlation of yi and
in other words, the square of the correlation between the true value of y and the fitted value of y.
Quiz 3-3 6: You are fitting a linear regression model yi = po + p1X; + ei to 10 observations.
You are given the following information:
48
R2 is an intuitive way to assess the quality of the model, but it has two disadvantages:
1. Adding more variables to the model always increases R2, no matter how irrelevant the variables are.
2. Its sampling distribution is hard to determine, so the R2 statistic cannot be evaluated statistically. There is no
objective way to state a critical value for it.
To address the first disadvantage, an adjusted R2 is often used to compare models with different numbers of
variables. We will discuss adjusted R2 in Section 7.2.
3.3 t statistic
.4' The linear regression estimator b of 13 is an unbiased estimator of 13. As with all estimators, b is a function of the
observations, which are random variables, and is therefore a random variable. It has the minimum variance of all
NI° unbiased estimators. The covariance matrix of b is
The diagonal elements of the matrix are the variances of the components of b. Since a2 is unknown, we use s2,
the square of the standard error of the regression, instead of 0-2 in order to estimate the variance of b. Therefore
the variance of bi is estimated as s = s2lpi, where tpi is the i + 15t diagonal element4 of (X/X)-1. The square root of
the ith component of the estimated variance of b is called the standard error of bi, and is sometimes denoted se(bi), •41
although we will usually use sbi•
For a simple linear regression model yi = 130 + /3ixi, it is easy to compute E = (X0C)-1. The components of this
matrix are
-2
E X? 1
= =
(3.6)
n E(xi - )2 n E(x1 -
1
En = -
(3.7)
E12 = (3.8)
(x— ±-)2
The covariance matrix of (b0, hi) is the product of a2 and E, and we estimate cr2 with s2 so we have the following
approximate covariance matrix .41
COV(bOr bl) 52
(E(x—i i")2)
nAn alternative formula for sb, is
(3.12)
EXAMPLE 3B For a simple linear regression based on 50 observations, you are given
(1) The sample variance of x is 108.
(ii) The residual sum of squares (RSS) is 234.
Calculate the variance of the estimator for pi. •
SournioN: The mean square error is s2 = RSS/(n - 2) = 234/48 := 4.875. The sum of square differences of xi from
its mean is
= 4.875(-) = 5292
0.0009212
s2b1 =
875 ) 0.0009212
To test the null hypothesis thatf3 = 0, we use the I statistic basbi, which has n - k 1 degrees of freedom. More •41
generally, to test the null hypothesis that pi = fl*, we use
b1-
tn-k-1 = (3.13)
Sbi
4The rows and column of the matrix go from 1 to k +1 while the components of b are subscripted from 0 to k, necessitating adding 1 to i
when we refer to matrix elements.
A 100q% confidence interval for fj may be constructed as bi tsb,, where t is the 100(1 + q)/2 percentile of a
t-distribution with n - k - 1 degrees of freedom.
Note that the t distribution tables you get at the exam show percentiles of the t distribution. For a 2-sided
interval with significance a or confidence level 1 - a, you must use the (1 - a)/2 percentile of the t distribution so
that the region outside the interval has probability a. For example, for 5% significance, the critical value is in the
t0.025 column.
EXAMPLE 3C A linear regression model based on 20 observations has 2 explanatory variables and an intercept.
You are given:
0.5 0.2
(i) (X/X)-1 =
( 0.2
-0.1
1.2
0.6
0.6
2.7 0.1)
(ii) The residual standard error is 1.5.
(iii) b2 = 6.4
Using the t statistic, determine an interval for the p-value of b2 * 0. •
SourrioN: The standard error of b2 is 1.5\1 = 2.465. The t statistic is 6.4/2.465 = 2.597. It has 20 - (2 + 1) = 17
degrees of freedom. This is between the 2% significance level (critical value 2.5669) and the 1% significance level
0
(critical value 2.8982), so the p value is between 1% and 2%.
variable and any one of the explanatory variables, and verify the linear relationship. However, this relationship may
be distorted due to other variables in the model.
The example in Regression Modeling with Actuarial and Financial Applications models refrigerator prices in terms
of refrigerator size, features, and energy cost. One would expect that a more efficient refrigerator, one with a lower
energy cost, would have a higher price, all other things being equal. And indeed, the regression shows that the
coefficient of energy cost is negative. Yet, a scatter plot of price on energy cost has a positive slope. The reason for
this is that higher energy cost correlates with larger refrigerator size and more features, and larger refrigerators with
more features tend to have higher prices. To illustrate the true effect of energy cost on price, the other variables must
be removed.
An added variable plot removes the effects of other variables. To construct such a plot for xj:
1. Regress y on the other explanatory variables, excluding xi. Let ey be the residuals from this regression.
2. Regress xi on the other explanatory variables. Let ej be the residuals from this regression.
3. Construct a scatter plot of ey on ej.
The correlation between ey and ej is called the partial correlation coefficient. It is denoted by r(y, xj I
,x j_, x j+1, • • • , xk). It may be calculated directly from the full regression, the regression of y on all k ex-
planatory variables including xj, using the following formula:
t (bj)
/"(Y, xi, • • • , xj-i, xj+i, • • • , xk) (3.14)
t(bi)2 + n - (k + 1)
Here, t(b) is the t statistic of xi in the full regression. While the partial correlation coefficient shows the degree
of correlation of y and xi after eliminating effects of other explanatory variables, it does not show nonlinear
relationships. Only the added variable plot shows nonlinear relationships.
EXAMPLE 3D ••• y is regressed on xi and x2 based on 15 observations. You are given:
(i) b2 = 0.764
(ii) s2 = 0.525b2
Exam SRM Study Manual
Copyright C2022 ASM
3.4. ADDED VARIABLE PLOTS AND PARTIAL CORRELATION COEFFICIENTS 37
0.764
SOLUTION: t(b2) — = 1.05442 0
V0.525
1.05442
r(y, X2 I X1) = 0.29119
V1.054422 + 15 3
1
Total SS = Regression SS + Error SS
2 (Z(x1 — 1)2
2
sb, = 5 (3.10)
It is sometimes abbreviated as TSS. -
Error SS (3.12)
s= =
(3.2)
Regression SS Error SS
R2 — —
(3.3)
Total SS Total SS
Exercises
3.1. •41 [SRM Sample Question #11] You are given the following results from a regression model.
1
Observation „,
number (i)
Yi f(x1)
1 2 4
2 5 3
3 6 9
4 8 3
5 4 6
3.2. 111 You are fitting the linear regression model yi = po + pixi + ei to 20 observations.
You are given:
• E(yi — 9)2 = 12
•
1:1=10
. E(2i — y)2 = 108
Determine R2.
3.3. [3-F85:10] You fit the regression model yi = pa 4- f3ix Ei to 10 observations. You have determined:
R2 = 0.6
yi = 30
Ey = 690
Calculate S2.
3.4. You are given the following excerpt from an ANOVA table for a regression:
Determine R2.
6.8 0.8
7.0 1.2
7.1 0.9
7.2 0.9
7.4 1.5
3.6. [120-589:71 You are interested in the relationship between the price movements of XYZ Corporation and
the "market" during the fourth quarter of 1987.
You have used the least squares criterion to fit the following line to 14 weekly closing values of XYZ stock (Yt)
and the Dow Jones Industrial Average (xi) during the period of interest:
9, = —116.607 + 0.195x, +
E(g, - = 677.1142
t =1
14
Determine the percentage of variation in the value of XYZ stock that was "explained" by variations of the Dow.
(A) 50 (B) 60 (C) 70 (D) 80 (E) 90
3.7. [SRM Sample Question #18] For a simple linear regression model the sum of squares of the residuals is
1=1
e? = 230 and the R2 statistic is 0.64.
Calculate the total sum of squares (TSS) for this model.
(A) 605.94 (D) 701.59 (E) 750.87
3.8. [120-590:14] Which of the following statements are true for a two-variable linear regression?
I. R2 is the fraction of the variation in Y about that is explained by the linear relationship of Y with X.
R2 is the ratio of the regression sum of squares to the total sum of squares.
HI. The standard error of the regression provides an estimate of the variance of Y for a given X based on n 1
degrees of freedom.
(A) I and II only (B) I and III only (C) II and III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) ,or (D) .
3.9. "41 [120-S91:6] A bank is examining the relationship between income (x) and savings (y). A survey of six
randomly selected depositors yielded the following sample means, sample variances, and sample covariance:
= 27.5
= 87.5
2
5 =35
Sxy = 17.0
Determine R2.
(A) 0.1 (B) 0.2 (C) 0.3 (D) 0.7 (E) 0.9
3.10. sf[120-81-95:1] You fit a simple regression model with the dependent variable yi = i for i = 1, ,5. You
determine that s2 = 1.
Calculate R2.
(A) 0.1 (B) 0.3 (C) 0.5 (D) 0.6 (E) 0.7
3.11. Nti [120-83-96:31 You fit a simple regression model to five pairs of observations. The residuals for the first
four observations are 0.4, —0.3 0.0, —0.7, and the estimated variance of the dependent variable y is 1.5.
Calculate R2.
(A) 0.82 (B) 0.84 (C) 0.86 (D) 0.88 (E) 0.90
3.12. sir [S-F17:341 For an ordinary linear regression with 5 parameters and 50 observations, you are given:
• The total sum of squares, TSS = 996.
• The unbiased estimate for the constant variance t72 is 52 = 2.47.
Calculate the coefficient of determination.
(--- 3.13. svie [MAS-I-S19:30] You are given the following information about a linear model:
• Y=P0+131X1+p2x,±E
Observed Estimated
Y's Y's
2.441 1.827
3.627 3.816
5.126 5.806
7.266 7.796
10.570 9.785
3.14. •16. For a linear regression model for claim sizes, the model output is
Variable Coefficient
Intercept 22
Age of driver
18-24 15
25-64 U
65 and up 13
Income group
Under 50000 12
50000-100000 0
Over 100000 —3
The residual standard error of the regression is 2. Use this as the estimate of the standard deviation of E.
The response variable is transformed using a Box-Cox transformation with A = 1/2.
Calculate expected claim sizes for a 30-year old driver earning 150,000.
t test
3.15. [4-S01:401 For a classical linear model of the form yi = po + + E based on seven observations, you are
given:
(i) E(xi — 1)2 = 2000
(ii) = 967
3.16. '11. You are given the following excerpt from regression output
Regression of LOSS on AGE + MALE + MARRIED
Estimate Std. Error
With regard to the hypothesis that AGE has no effect on LOSS, which of the following statements is correct?
(A) Reject at 1% significance
(B) Reject at 2% significance but not at 1% significance
(C) Reject at 5% significance but not at 2% significance
(D) Reject at 10% significance but not at 5% significance
(E) Do not reject at 10% significance
3.17.
%, [ST-F15:211 You wish to explain Y using the following multiple regression model and 32 observations:
Y = 13o + +p2x2+133x3+ E
A linear regression package generates the following table of summary statistics:
Estimated Standard
Coefficient Error
44.200 5.960
Intercept
pi -0.295 0.118
P2 9.110 6.860
P3 -8.700 1.200
For the intercept and each of the betas, you decide to reject the null hypothesis which is that the estimated
coefficient is zero at a = 10% significance.
Which variables have coefficients significantly different from zero?
(A) Intercept
(B) Intercept, X1
(C) Intercept, X2
(D) Intercept, X1, X3
(E) Intercept, X2, X3
Determine the number of coefficients in the table above for the five explanatory variables that are not statistically
different from zero at a significance level of a = 10%, based on a two-tailed test.
(A) 1 (B) 2 (C) 3 (0)4 (E) 5
5The table, as it appeared on the exam, mistakenly has 1.661 on the 151 line, but 0.02/0.012 :-- 1.667.
3.20. '11? [120-81-98:4] You fit the regression model yi = 132Xi2 Ei to a set of data.
You determine:
(A) 1.9 (B) 2.2 (C) 2.5 (D) 2.8 (E) 3.1
3.21. [120-83-98:6] You fit the multiple regression model yi = 131 + P2X12 f33Xi3 El to 30 observations.
You are given:
yTy = 7995
0.0286 0.0755 0.0263
426411
0.55
X/Y = ( 6177.5
5707.0
( 5.22 )
1.62
b =
0.21
-0.45
Determine the length of the symmetric 95% confidence interval for 132.
(A) 0.3 (B) 0.6 (C) 0.7 (D) 1.5 (E) 1.8
3.22. alr [4-F03:36] For the model yi = Po + ixi + fi2xi2 P3Xi3 Ei, you are given:
(i) There are 15 observations.
(A) 6.4 (B) 6.8 (C) 7.1 (D) 7.5 (E) 7.8
(---N 3.23. ikir [4-F00:5] You are investigating the relationship between per capita consumption of natural gas and the
price of natural gas. You gathered data from 20 cities and constructed the following model:
3T =Po +Pix+
where
y is per capita consumption
x is the price, and
c is a normal random error term
You have determined:
bo = 138.561
= —1.104
E 4 = 90,048
116,058
3.25. el? [4-F03:51 For the model yi = Po + Pia:: + ei, where i = 1, 2, ... , 10, you are given:
1, if the ith individual belongs to a specific group
xi =
1 0, otherwise
40 percent of the individuals belong to the specified group
The least squares estimate offlu is b1 = 4
E(yi — bo — b1xi)2 92
l
f
3.26. •a . [ST-515:22] You are given the following linear regression model fitted to 12 observations:
Y = 13o + pix + £
The results of the regression are as follows:
Determine the results of the hypothesis test Ho: pi = 0 against the alternative Hi: pi # 0.
(A) Reject at a = 0.01
(B) Reject at a = 0.02, Do Not Reject at a = 0.01
(C) Reject at a = 0.05, Do Not Reject at a = 0.02
(D) Reject at a = 0.10, Do Not Reject at a = 0.05
(E) Do Not Reject at a = 0.10
3.27. '41 [ST-514:20] The model yi = Po + Aix; + ei was fit using 6 observations. The estimated parameters are as
follows:
•
1)0 = 2.31
• b1 = 1.15
• se(bo) = 0.057
• se(bi) = 0.043
3.28. [4-F02:381 You fit a simple linear regression model to 20 pairs of observations.
You are given:
(i) The sample mean of the independent variable is 100.
(ii) The sum of squared deviations from the mean of the independent variable is 2266.
(iii) The ordinary least-squares estimate of the intercept parameter is 68.73.
(iv) The error sum of squares (Error SS) is 5348.
Determine the lower limit of the symmetric 95% confidence interval for the intercept parameter.
(A) —273 (B) —132 (C) —70 (D) —8 (E) —3
n 3.29. [ST-F14:20] For the linear model yi = f30 + Ej, you are given:
• n = 6.
• b1 = 4.
• ri._1(xi - 2)2 = 50.
• Error SS = 25
Calculate the upper bound of the 95% confidence interval for Pi.
(A) Less than 5.1
(B) At least 5.1, but less than 5.3
(C) At least 5.3, but less than 5.5
(D) At least 5.5, but less than 5.7
(E) At least 5.7
3.30. r[120-81-98:3] You fit the model yi = Ei to 10 observed values (xi, yi).
You determine:
Determine the width of the shortest symmetric 95% confidence interval for 130.
(A) 1.1 (B) 1.2 (C) 1.3 (D) 1.4 (E) 1.5
3.31. ' [120-81-95:5] Performing a regression of y on xi and x2 with 12 observations, you determine that the
regression equation is
= 1.2360 + 0.8683xii + 0.8517x12
(i i) OC X)-1 = (-0.82314
-1.05459
0.10044
0.10480
0.10480
0.15284
Determine k such that a 95% confidence interval for 132 is given by b2± k.
(A) 0.16 (B) 0.18 (C) 0.33 (D) 0.37 (E) 0.40
(X'X)-1 = (-31, 1)
i
(iv) 0
T 2
-3 0 3
Calculate the value of the t statistic for testing the null hypothesis Ho : p2 . 1.
(A) -0.9 (B) -1.2 (C) -1.8 (D) -3.0 (E) -5.0
3.33. "Nr You perform a multiple-regression analysis on Y's relationship to three explanatory variables
3
y =flo + E pixi + £
i=1
3.34. •-: [SRM Sample Question #27] Trevor is modeling monthly incurred dental claims. Trevor has 48 monthly
claims observations and three potential predictors:
• Number of weekdays in the month
• Number of weekend days in the month
• Average number of insured members during the month
Trevor obtained the following results from a linear regression:
Determine which of the following variables should be dropped, using a 5% significance level.
I. Intercept
II. Number of weekdays
III. Number of weekend days
IV. Number of members.
3.35. You wish to test the hypothesis Ho: 162 = 0 against Hi: P2 <0 using the t statistic.
Which of the following is true regarding Ho?
(A) Reject at 0.01 significance.
(B) Reject at 0.025 significance, do not reject at 0.01 significance.
(C) Reject at 0.05 significance, do not reject at 0.025 significance.
(D) Reject at 0.10 significance, do not reject at 0.05 significance.
(E) Do not reject at 0.10 significance.
3.36. 'I Calculate the partial correlation coefficient of the price of gasoline.
Use the following information for questions 3.37 and 3.38:
You are given:
(i) y is the annual number of discharges from a hospital.
(ii) x is the number of beds in the hospital.
(iii) Dummy variable d is 1 if the hospital is private and 0 if the hospital is public.
(iv) The classical three-variable linear regression model y = poi-pix+p2d+ e is fitted to n cases using ordinary
least squares.
(v) The matrix of estimated variances and covariances of bo, bi, and b2 is:
1.89952 -0.00364
( -0.00364
-0.82744
0.00001
-0.00041 -0.82744)
-0.00041
2.79655
3.37. 44° [VEE Applied Statistics-Summer 05:2] Determine the standard error of 1)0 + 600b1.
(A) 1.06 (B) 1.13 (C) 1.38 (D) 1.90 (E) 2.35
3.39. A regression model yi = flo Pi xii fl2xi2 g3Xi3+ Ei is fitted to 48 observations. The partial correlation
coefficient of b1 is 0.6540.
Calculate the t ratio for
Solutions
3.1. How can a sum of squares be negative? You can eliminate (A) and (B) with no calculation.
302
3.3. Total SS = 690 — —
600
10
E (xi -17)2 = X? r
(I Xi)2
5
= 252.25 3552
5
— 0.2
(35'5 (5'3)
E(xi — 2)(yi — g) xiyi = 37.81
5) 0'18
0.18
131 = = 0.9
2
(s i") 5.32
Total SS = (yi — = - —
= 5.95 — = 0.332
5 5
3.6. Total SS = 949.388. Regression SS = 677.1142. The question is requesting R2, which is
R2 Regression SS _ 677.1142 _ —
0.7132 (C)
Total SS 949.388
3.7. Total SS = Error SS + Regression SS, so R2 = Regression SS/Total SS = 1 — Error SS/Total SS. Using this
equation
230
0.64 = 1
Total SS
230
Total SS = — =
638.89 (B)
0.36
3.8. I and II are true. III would be true for the square of the standard error, and based on n —2 degrees of freedom.
(A)
3.10. y=3
R2 Regression SS _ 1 Error
— Total SS
—
SS
Total SS
You are given that Total SS = 996. The estimate s 2 = Error SS/(n (k + 1)), and n = 50, k + 1 = 5 (The intercept Po is
counted as a parameter), so Error SS = 2.47(45) = 111.15. Therefore
R2 _ 111.15 _ 0.8884 (D)
—
996
3.13. R2 is 1 LiTt:12. We are given that Error SS = 1.772, and Total SS is the sum of the square difference
between observed Ys and their mean, or 4 times the unbiased sample variance. The unbiased sample variance of
the Ys, 2.441, 3.627, 5.126, 7.266, and 10.570, is 10.3402. So the total sum of squares is 4(10.3402) = 41.361. Then
R2 = 1 — 1.772/41.361 = 0.9572. (E)
3.14. The linear expression is the intercept, 22, plus 0 for age of driver, plus —3 for the income group, and
22 + — 3 = 19. Performing the Box-Cox transformation, y* = 2(y1/2 — 1) is a normal variable with variance equal to
the variance of E, 22 = 4. Then
y=2
5 =
n k 1 5
S2
0.311 (C)
3.16. The absolute value of the t statistic is 2.503/1.042 = 2.402. At 15 degrees of freedom, this is between 2.131
(0.05) and 2.602 (0.02), making the answer (C).
3.17. Divide the estimated coefficients by their standard errors to obtain the t statistic. The t statistic has n—(k +1) =
32— (3 +1) = 28 degrees of freedom. The critical value at 10% of the t statistic for a 2-sided test is 1.7011. The absolute
values of the quotients of coefficients over standard errors are 7.4161, 2.5, 1.3280, 7.25. Therefore the answer is (D):
accept the intercept and the first and third variables as significant.
3.18. It is not clear whether there is a po coefficient for the intercept in the regression, but let's assume there is,
so that there are n — (k + 1) = 20 — (5 + 1) = 14 degrees of freedom. The critical value at 10%, which is 10.050 for a
two-tailed test, is then 1.7613, so 13 )321 and /34 are not significant. (C)
3.19. The true variance can be calculated, since we have 02 (instead of the usual s2).
Var(bi — b2) = Var(bi) + Var(b2) — 2 Cov(bi, b2)
= 4(0.7 + 0.2— 2(0.4)) = 0.4
,•2 = 62(2.14)
We need 52, and it is Error SS/(n k —1) = 282.82/11 = 25.7109.
s2 = (25.7109)(2.14) = 55.0213
s2 = (25.7109)(0.03) = 0.7713
Cov(b2, b1) = (25.7109)(0.11) = 2.8282
The standard error is the square root of the variance, The final answer is
3.23. We will use formula (3.10) for sb,. First compute s2.
s2 = Zq =
7,832
= 435.111
n — (k + 1) 20-2
1435.111
= 0.2020
_
sal —
10,668
The t coefficient for 95% at 18 degrees of freedom is 2.101. The confidence interval is
—1.104 2.101(0.2020) = (-1.52, —0.68) (D)
3.25. We will use formula (3.10) for 5b1. Notice that (iv) is the error sum of squares.
2 92
S
=10-2= 11.5
Since 40 percent belong to the group, .54.' = 0.4 and xi equals 1 four times and 0 six times.
E(xi =
6((_0.4)2)+ 4(0.6)2 = 2.4
11.5
sbi = = 2.189
2 —
Error SS 25
=
s
n—2 4
The third bullet provides the sum of the squared deviations of xis from their means. By formula (3.10),
s2 25/4 „ „
s2122 = = = U .1LD
VA' )2 50
The t coefficient for 4 degrees of freedom and 0.05 in both tails is 2.7764. The upper bound of the 95% confidence
interval for f3i is 4 + 2.7764 25 = 4.9816. (A)
2.79
3.30. s2= Error
n—2
SS = = 0.34875
3.31. The estimated variance of the regression isrrc;r SS = 1 87399= 0.2082, since there are 9 degrees of freedom.
The standard error of b2 is
= V0.2082(0.15284) = 0.1784
The t coefficient for 95% confidence, 9 degrees of freedom, is 2.2622, and 2.2622(0.1784) = 0.4036 . (E)
3.32. The estimated variance of the regression, which has 3 degrees of freedom, is y = 4. Then
sh, = 4(2/3) = 1.633
Since we're testing p2 = 1, the difference between the fitted value and the null hypothesis is —2.0 — 1 = —3 and the t
statistic is = —1.837 (C)
3.33. The error has n k —1 = 15 — 3 — 1 = 11 degrees of freedom. The t value for 0.05 significance with 11 degrees
of freedom is 2.2010. So the confidence interval is 1.372 ± 2.2010(0.258) = (0.804,1.940)
3.34. There are 44 degrees of freedom, so you would not be able to look up critical values in the tables you get at
the exam. But they give you the p-values, so there is no need to look up tables. If the p-value is 5% or less, the
variable is accepted. Only number of weekend days has a p-value that is too high. (C)
3.35. The t statistic for p2 is —40.50/15.10 = —2.6821. There are 5 data points and 3 variables, so there are 2 degrees
of freedom. The test is a one-sided test, so we use the percentiles in the table. 2.6821 is between 1.8856, the 10th
percentile, and 2.9200, the 5th percentile. (D)
3.36. We calculated the t statistic as —2.682. So the partial correlation coefficient is
—2.682
—0.8846
V2.6822 + 2
3.37. The standard error is the square root of the variance, and
3.38. Let t(b2) be the t statistic for b2 and r the partial correlation coefficient of b2.
t (b2) =
,v —12r5 = —7.17579
—7.17579
r— 0.61421
V7.175792 + 88 — 3
0.6540 =
t2 +48 — (3 + 1)
0.65402(t2 + 44) = t2
—0.57228t2 + 18.8195 = 0
t 2 = 32.8849
t = 5.7345
Since the partial correlation coefficient is positive, the t ratio must also be positive; we do not select the negative
square root.
Quiz Solutions
3-1. The Error SS has 20-2 = 18 degrees of freedom, so Error SS = 52(18) = 450. Total SS is 884, so the regression
sum of squares is 884 — 450 = 434
3-2. Z(yi — 902 is the error sum of squares and Z(9i - 9)2 is the regression sum of squares.
122
R2 = 0.2998
285 + 122
Linear Regression: F
The first F statistic we discuss tests the significance of the entire regression. In other words, H0 is the model y = p0+E
and H1 is the model under consideration. The error sum of squares for H0 is the total sum of squares. So our
F statistic is the quotient of the mean square of the regression over the mean square error:
Regression SS/k
Fk,n—k—l= (4.1)
Error SS/(n — k — 1)
with k and n — k —1 degrees of freedom. Most textbooks, but not the textbooks on the syllabus, place F in a separate
column of the ANOVA table:
You may be wondering, why test the whole model? Can't we just check the t statistic of every variable? If at •-••
least one t statistic is significant, then the model is significant. The answer is that if there are a lot of variables,
insignificant variables may accidently have significant t statistics. If you have 20 insignificant variables and are
evaluating t statistics using 5% significance, on the average one of the variables will have a significant t statistic.
That's why we need an F test.
In a model of the form y = 0+ pix ± e, the F1,)1-2 statistic is the square of the t,,_2 statistic for 13i.
The F statistic is related to R2 by
Don't memorize this; just remember the trick: to go from F to R2, divide numerator and denominator by Total SS.
EXAMPLE 4A •-: In a linear regression model yi = Po + pixi + Ei you are given:
• There are 15 observations.
• bi = 3.1
("-----'
SoLurioN: The standard error of b1 is 10/6, so the t statistic is 3.1/(10/6) = 1.86. The F statistic is 1.862 = 3.4596
If additional independent variables are added to a regression model, the error sum of squares will go down,
since the added variables will improve the fit. However, if the decrease in Error SS is small, the additional variables
may not be justified. Suppose that the model with fewer variables fits k — q +1 ps, where the intercept /30 is counted
as one of the ps. If we add q more variables, we should test the hypothesis $3k-q+1 = Pk-q+2 -= • • • = = 0 using an F
statistic. To perform this test, estimate the model with and without the additional q variables. The model without
the additional variables is called the restricted or reduced model, since the coefficients of the omitted variables are
being forced to be 0. It will have a higher Error SS; call it Error SSR. The Error SS of the model with the additional
sir variables, the unrestricted or full model, is called Error SSuR. Then the F statistic to test the significance of the
variables is
(Error SSR — Error SSuR)/q
=
(4.3)
Error SSuR /(n — k 1)
where q is the number of restrictions, the number of coefficients forced to be 0. Note that this generalizes equa-
tion (4.1), where q = k and Error SSR = Total SS.
EXAMPLE 4B gki? You are considering the model y •• +Pixi +P2x2+P3x3+P4x4+p5x5+ E based on 60 observations.
You are testing the hypothesis P2 = P3 = g4 := P5 = 0 and y = po• + plxi + E'.
You have the following statistics for these models:
Model Residual standard error
Original model 4,506
P2 = P3 = g 4 .135 = 0 10,321
SOLUTION: The standard error s — EnTIT sis , so Error SS = s2(n — k — 1). Here, n = 60, k = 5, and the number of
restrictions q =4.
Error SSuR = 4,5062(54) = 1,096,417,944
Error SSR = 10,3212(58) = 6,178,336,378
F—
(6,178,336,378 — 1,096,417,944)/4 62.57 0
1,096,417,944/54
Regression output may only give you R2 rather than Error SS. Use the method of equation (4.2) to express F in
terms of R2; the Total SS is the same for both the restricted and the unrestricted models:
(Error SSR — Error SSuR)iq
Fq,n-k-i
Error SSuR /(n — k — 1)
(Error SSR /Total SS — Error SSuR /Total SS)/q
(Error SSuR /Total SS)/(n — k —1)
((1 — Error SSuR /Total SS) — (1 — Error SSR /Total SS)) q
(Error SSuR /Total SS)/(n k —1)
— R2R)/g
(1—RR)/(n — k —1)
The tables you get on the exam do not include tables of the F distribution. Apparently they will not ask you
to determine the significance of variables using the F statistic. A possible exception would be when there is only
one degree of freedom in the numerator; in that case, the square root of the F statistic is the t statistic having the '41
same number of degrees of freedom as the denominator, and then you can use the t table to look up the critical
value.1 But keep in mind that the square root is a 2-sided variable (it can be positive or negative), so the probability
of Fid, > 1 — a corresponds to the probability of Ili >1—
Quiz 4-1 ssi? You are considering a model with 12 explanatory variables and an intercept. The model is based
on 35 observations. The regression sum of squares for the model is 6520 and the error sum of squares is 349.
You wish to test the hypothesis 1310 = pH = P12 = 0. When the model is run without the corresponding
three variables, the regression sum of squares is 6410.
Calculate the F ratio to test the hypothesis.
A special case of this F test is a categorical variable. Usually you would want to test all the dummy variables
arising from a single categorical variable as a group for significance, rather than testing each dummy variable
separately.
Regression SS/k
(4.1)
Fk,n-k-1 =Error SS/(n — k — 1)
n k—1 R2
(4.2)
k 1—R2
Exercises
4.1. •411 For a linear regression model of the form yi = 10+ f3txit + 02x12 P3Xi3ti you are given:
(i) There are 25 observations of the variables.
E7!1(Yi — 902 = 30
(iii) EP1(Y/ = 40
Calculate the F statistic for the model.
1But the fact that F(1, q) is the square of 1(q) is only mentioned in a footnote of An Introduction to Statistical Learning, so I don't know whether
they'll test on this.
4.2. al? [120-82-94:10] A sample of size 20 is fitted to a linear regression model of the form
Yi + P2x12 + P3xi3 + p4xi4 + p5xi5 + Ei
The resulting F ratio used to test the hypothesis Ho : f3i —P2 — /33 — p4 - P5 0 is equal to 21.
Determine R2.
Y Po Pix + E
4.5. 'Li? [MAS-I-S18:33] Consider a multiple regression model with an intercept, 3 independent variables, and 13
observations. The value of R2 = 0.838547.
Calculate the value of the F-statistic used to test the hypothesis Ho : =
= /33 = 0.
(A) Less than 5
(B) At least 5, but less than 10
(C) At least 10, but less than 15
(D) At least 15, but less than 20
(E) At least 20
4.6. [ST-S16:21] You are given the following linear regression model which is fitted to 11 observations:
Y = Po + Pix +
The coefficient of determination is R2 = 0.25.
4.7. For the linear regression model yi = Po + Pixi + e you are given the following ANOVA table:
4.8.
•41 A linear regression model with 22 observations is fitted as yi = —2 + 2.5xi + Ei. You are given R2 = 0.9.
Determine the width of the shortest symmetric 90% confidence interval for 13, the coefficient of X.
4.9. [3-F84:61 You are fitting the linear regression model yi = Po + f3ixi + Ei. You have determined:
io
— 27)2 = 225
i=1
4.10. -11. [120-F89:41 You are studying the average return on sales as a function of the number of firms in an industry.
You have collected data for 1969-88 (20 years) and performed a two-variable regression of the form yi = /30 + x; +E.
You have obtained the following summary statistics from these data:
Determine the upper bound of the shortest 95-percent confidence interval for the regression coefficient pi.
(A) —0.0010 (D) —0.0004 (E) —0.0002
4.11. 'lb [120-81-98:2] You fit a simple regression model to 47 observations and determine y`i = 1.0 + Lai. The
total sum of squares (Total SS), corrected for mean, is 54, and the regression sum of squares (Regression SS) is 7.
Determine the value of the t statistic for testing H0:131 = 0 against Hi: Pi 0.
(A) 0.4 (B) 1.2 (C) 2.2 (D) 2.6 (E) 6.7
4.12. r[120-83-98:3] You fit a one-variable plus intercept regression model_ to seven observations.
You determine:
Error SS = 218.680
F = 2.088
Calculate R2.
(A) 0.3 (13) 0.4 (C) 0.5 (D) 0.6 (E) 0.7
4.13. %Ii[SRM Sample Question #441 Two actuaries are analyzing dental claims for a group of n = 100 participants.
The predictor variable is gender, with 0 and 1 as possible values. Actuary 1 uses the following regression model:
Y=p+e
Y = go f31 x Gender + £
The residual sum of squares for the regression of Actuary 2 is 250,000 and the total sum of squares is 490,000.
Calculate the F-statistic to test whether the model of Actuary 2 is a significant improvement over the model of
Actuary 1.
(A) 92 (B) 93 (C) 94 (D) 95 (E) 96
4.14. *-1? A company analyzes sales of its products by its agents. It considers the following explanatory variables:
• xi is the amount of time the agent has been with the company.
• x2 is the population of the agent's territory.
• x3 is the number of continuing education hours for the agent during the year.
Let y be total sales for an agent in one year. The company fits the following regression model:
The company uses data from 18 agents. Summary statistics from this model are:
18
— 902 = 1060
18
Dyi — 902
i=1
1820
4.15. *-411 You are given the following excerpt from an ANOVA table for a regression:
Source df Sum Sq Mean Sq F value
Regression 4 20.79
Error
Total 22 14,230
4.16. 41 [MAS-I-F18:32] An actuary uses a multiple regression model to estimate money spent on kitchen equip-
ment using income, education, and savings. He uses 20 observations to perform the analysis and obtains the
following output:
Sum of
Squares
Regression 2.65376
Total 7.62956
4.17. "41 A regression model has 5 variables: xi, x2, ,x5, and an intercept. An excerpt from the output from the
regression is
Residual standard error: 3.85 on 44 degrees of freedom.
Multiple R-squared: 0.82.
The variable x5 is removed from the model and the regression is run again. In the resulting regression, the
residual standard error is 4.02.
Calculate the F statistic to test the significance of the variables x.3 and .x4.
4.19. •121 [ST-S15:21) The following two linear regression models were fit to 20 observations:
• Model 1: Y = 130 +131X1 +132X2 + E
• Model 2: Y = flo +131X1 132X2 )33X3 p4x4 + E
The results of the regression are as follows:
Model Error Sum Regression Sum
Number of Squares of Squares
1 13.47 22.75
2 10.53 25.70
The null hypothesis is H0: p3 = /34 = 0 with the alternative hypothesis that the two betas are not equal to zero.
Calculate the statistic used to test H0.
(A) Less than 1.70
(B) At least 1.70, but less than 1.80
(C) At least 1.80, but less than 1.90
(D) At least 1.90, but less than 2.00
(E) At least 2.00
4.20. sir [ST-S16:22) The following two models were fit to 18 observations:
• Model 1: Y = po + + p2x2 + E
• Model 2: Y = PO + + p2x2 + p3x1x2 + p4x? + 135x1 + e
The results of the regression are:
Model Error Sum Regression Sum
Number of Squares of Squares
1 102 23
2 78 39
Calculate the value of the F-statistic used to test the hypothesis that p3 = p5 = 0.
(A) Less than 130
(B) At least 1.30, but less than 1.40
(C) At least 1.40, but less than 1.50
(D) At least 1.50, but less than 1.60
(E) At least 1.60
4.21. s: You are given the following data regarding two models based on 15 observations:
Model Error sum of squares
Y Po +Pixi +132x2 + Opq +)34x1+ £ 22.8
Y= yo + yjx + y2x2 + E 57.4
4.22. "41' [S-F17:35] Consider the following 2 models, which were fit to the same 30 observations using ordinary
least squares:
Modell Model 2
SS Total 19851 SS Total 19851
SS Error 2781 SS Error 2104
Parameter if df Parameter p df
X3 35 1
X4 —2.3 1
You test Ho: f_33 = /34 = 0 against the alternative hypothesis that at least one of f33, f34 0.
Calculate the F statistic for this test.
4.23. ".41 [120-81-98:6] You wish to find a model to predict insurance sales, using 27 observations and 8 variables
labeled xi, x2,... , x8 and an intercept. The analysis of variance tables for two different models from these data follow.
Model A contains all 8 independent variables; Model B contains xi and x2 only. Both models include an intercept.
Model A
Source SS df MS
Regression 115,175 8 14,397
Error 76,893 18 4,272
Total 192,068 26
Model B
Source SS df MS
Regression 65,597 2 32,798
Error 126,471 24 5,270
Total 192,068 26
Calculate the F ratio for testing the hypothesis Ho: P3 —/34 — /35 — /36 — 137 — P8 — 0
(A) 5.8 (B) 4.5 (C) 2.6 (D) 1.9 (E) 1.6
4.24. 4kili [120-82-97:4] You apply all possible regression models to a set of five observations with three explanatory
variables and an intercept. You determine Error SS, the sum of squares due to error (or residual), for each of the
models:
Variables
Model in the Model Error SS
I xi 5.85
II X2 8.45
III X3 6.15
IV xi, x2 5.12
V xi, x3 4.35
VI X2i X3 1.72
VII xl, X2, X3 0.07
You also determined that the estimated variance of the dependent variable y is Var(y) = 2.2.
Calculate the value of the F statistic for testing the significance of adding the variable x3 to the model y =
13o + pixii + Ei.
(A) 0.3 (B) 0.7 (C) 1.0 (D) 1.4 (E) 1.7
4.25. IN? [120-81-98:5, Sample C4:35] You are determining the relationship of salary (y) to experience (xi) for both
men (x2 = 1) and women (x2 = 0). You fit the model yi =160 + Pixn + /32xi2 +133xiixi2 + ri to a set of observations for
a sample of employees.
You are given:
(i) There are 11 observations.
(ii) Regression SS is 330.0117 for this model and Error SS is 12.8156.
(iii) For the model yi = l ix + r, Regression SS is 315.0992 and Error SS is 27.7281.
f
Determine the F statistic to test whether the linear relationship between salary and experience is identical for
men and women.
(A) 0.6 (B) 2.0 (C) 3.5 (D) 4.1 (E) 6.2
4.26. '41 [4-S01:5] A professor ran an experiment in three sections of a psychology course to show that the more
digits in a number, the more difficult it is to remember. The following variables were used in a multiple regression:
xi = number of digits in the number
x2 = 1 if student was in section 1, 0 otherwise
X3 = 1 if student was in section 2, 0 otherwise
y = percentage of students correctly remembering the number
You are given:
(i) A total of 42 students participated in the study.
(ii) The regression equation y = /30 + p1x1 +1324 + /33x2 + 0(3 + e was fit to the data and resulted in R2 = 0.940.
(iii) A second regression equation y = 130 + Pixi + f39q. was fit to the data and resulted in R2 = 0.915.
Determine the value of the F statistic used to test whether class section is a significant variable.
(A) 5.4 (B) 7.3 (C) 7.7 (D) 7.9 (E) 8.3
4.27. Nsii A company is modeling auto collision losses. In addition to two other variables xi and x2, the company
is considering territory as a categorical variable. Three binary variables are used for territory: x3, x4, x5. There are
820 observations.
4.28. 41 [SRM Sample Question #24] Sarah performs a regression of the return on a mutual fund (y) on four
predictors plus an intercept. She uses monthly returns over 105 months.
Her software calculates the F statistic for the regression as F = 20.0, but then it quits working before it calculates
the value of R2. While she waits on hold with the help desk, she tries to calculate R2 from the F-statistic.
Determine which of the following statements about the attempted calculation is true.
(A) There is insufficient information, but it could be calculated if she had the value of the residual sum of squares
(RSS).
(B) There is insufficient information, but it could be calculated if she had the value of the total sum of squares
(TSS) and RSS.
(C) R2 = 0.44
(D) R2 = 0.56
(E) R2 = 0.80
Regression SS = 61.3
Total SS = 128
You then fit the following new model, with an additional variable x3, to the same data:
Regression SS = 65.6
Total SS = 128
4.30. •••• [4-F02:27] For the multiple regression model y = Po + Pixi + p 2X2 + 83X3 + /34x4 +135x5 + E, you are given:
(i) There are 3,120 observations.
(ii) The total sum of squares is 15,000.
(iii) 110: P3 = P4 = 35 = 0
(iv) R2 for the unrestricted model is 0.38.
(v) The regression sum of squares for the restricted model is 5,565.
Determine the value of the F statistic for testing Ho.
(A) Less than 10
(B) At least 10, but less than 12
(C) At least 12, but less than 14
(D) At least 14, but less than 16
(E) At least 16
Solutions
= —
Error SS/(n — k — 1) 30/21 j3
4.2. 21 =
R2/5
(1 — R2)/(20 — 6)
15 _ R2
T — 1— R2
15
R2 = (C)
17
4.3. Regression
Total SS
SS , 0.95
Error SS
= 0.05
Total SS
Regression SS 0.95
F— (n 2) = —(8) = 152
Error SS 0.05
4.4.
Regression 88/k R2/1 0.64
F—
Error SS/(n — k — 1) (1— R2)/18
= —
0.36/18
— I .. 1 (B)
4.5. Notice that Ho eliminates all variables other than the intercept, so this is the F test for the entire regression.
By equation (4.1),
R2/k 0.838547/3
F3,9 — = 15.58 (D)
—
4.7.
12,235/1
F1'8 = = 32.48
3,014/8
4.8. We'll calculate F and then use the fact that t is the square root of F.
F
Regression SS 0.9
=
(n 2) = —(20) = 180
Error SS 0.1
= tp = 'VT8-1:1= 13.4164
so
2.5
s = = 01863
P 13.4164
The t coefficient with 20 degrees of freedom for 90% confidence is 1.725. The width of the confidence interval is
2(1.725)(0.1863) = 0.6429
4.9. The second sum is Total SS and the third sum is Regression SS, so R2 = E = *7.. Then
R2(8) (9/17)(8) _9
F1,8 = = -
1 - R2 8/17
t = =111 (D)
4.10. The t-statistic, which is pliss,, is the square root of the F ratio, or Vig-.LB' = 4.4102. The coefficient of the t
distribution with 18 degrees of freedom and 95% confidence is 2.101. Then
(4860)(2.452)
- 9) = 539.309 20
= 56.527
48602
1,330,224 =
149 244
'
20
-56.527
p" -0 .000379
1'1 = 149,244
=
iPli _ 0.000379 = 0 0000859
sPi 4 .4102
1'13,
The upper bound of the confidence interval is -0.000379 + 2.101(0.0000859) = —0.0002 . (E)
(2.088)(218.680) = 91.321
Regression SS = 5
91.321
R2 = Regression
Total SS
SS = 218.680 + 91.321
= 0.295 (A)
4.13. Actuary l's model is intercept only, so we're just calculating the significance of a non-null model. Use
formula (4.1). Total sum of squares is 490,000 and residual sum of squares (Error SS) is 250,000, so regression sum
of squares is 490,000 - 250,000 = 240,000. There are 100 observations and 1 variable, so n = 100, k = 1.
240,000/1 94.08 (C)
_
4.14. E181(9i- p1)2 is the regression sum of squares, while Er i(yi — gi)2 is the total sum of squares. The error sum
of squares is the difference, or 1820 — 1060 = 760. The regression sum of squares has 3 degrees of freedom, while
the error sum of squares has 18 —4 = 14 degrees of freedom. The F statistic is
1060/3
F3,14 —
6.51
760/14 —
4.15. Since the total sum of squares has 22 degrees of freedom and Regression SS has 4, Error SS must have 18
degrees of freedom.
4.16. The error sum of squares is Total SS — Regression SS = 7.62956 — 2.65376 = 4.97580. The regression SS has
p —1 = 1 = 3 degrees of freedom and the error SS has n —p = 20-4 = 16 degrees of freedom. Here's the ANOVA
table:
Total 7.62956 20
(B)
4.17. Since s = .\/Error SS /(n — k — 1), it follows that Error SS = (n — k — 1)s2 . The Error SS of the unrestricted
model is 44(3.852) = 652.19. For the restricted model, Error SS = 45(4.022) = 727.22. Then the F statistic is
(Error SSR — Error SSuR)/q 727.22 — 652.19
5.062
Fi,44 =
Error SSuR /(n — k — 1) 652.19/44
4.18. The numerator has 2 degrees of freedom for q = 2 constraints; the denominator has 15 degrees of freedom
for n = 20 observations with k = 4 explanatory variables for the unrestricted model.
(Error SSR — Error SSuR)/q (10 — 8)/2 15
1.875
F2,15 — = =
4.22. Notice that there is an intercept in addition to four variables. Thus the number of degrees of freedom of the
=restricted model is 30 — 5 = 25. We are testing 2 restrictions. The F ratio is
(2781 — 2104)/2 4.022
F2,25
2104/25
4.23. To compare the two models, we compare the error sum of squares. Notice that n = 27 while k = 8, so
n k — 1 = 18; but the degrees of freedom for the Error SS is explicitly stated as 18 anyway. The number of
coefficients set equal to 0 is q = 6, so
(Error SSR — Error SSuR)/q (126,471 — 76,893)/6
F = 1.9 (D)
Error SSuR/(n — k — 1) 76,893/18
4.24. The unrestricted model with two variables x1 and x3 and a constant has n — k —1 = 5 — 2 — 1 = 2 degrees of
freedom and we add q = 1 restriction. The F statistic is
The estimated variance of y was not needed, although it was used in the official solution.
4.25. There are n = 11 observations. The unrestricted model has k =3 variables and there are q = 2 restrictions.
(27.7281 — 12.8156)/2
F2,7 = 3.5(1.163621) = 4.0727 (D)
12.8156/7
4.26. There are q = 2 restrictions, n = 42 observations, and k = 4 variables in the unrestricted model. In terms of
R2, the F statistic is
(RtR —4)/q (0.940 — 0.915)/2 —
7.708 (C)
F2,37 =
(1 — 4R)/(n — k — 1) (1 — 0.940)/37
R2 _ 419(5) = 2.57371
1 — R2 814
2 57371
R2 = = 0.72018
3.57371
F—
R2/2 = 759
(1 — R2)/817
R2 759(2) = 1.85802
1 — R2 = 817
R2 = 1'85802
2.85802
= 0.65011
There are q = 3 restrictions in the 2-variable model. The F statistic to test the significance of territory is
67.95
F3814—
(1 — qR)/(n k 1) = (1 — 0.72018)/814
4.28. You can use equation (4.2) to obtain the answer immediately, but since I suggested not memorizing that
formula, we'll work this out from first principles.
F =
Regression SS/k
Error SS/(n — k — 1)
(Total SS — Error SS)/4
Total SS
20 =
Error SS/100
= 25
( Total SS )
Error SS
1
La+1 = 1.8
= 25
Error SS
Error SS
R2 —1— = 1— =
0.4444 (C)
Total SS 1.8
4.30. Error SSR = Total SS — Regression SSR = 15,000 — 5,565 = 9,435. For the unrestricted model,
Error SSUR
Total SS = 1 — R?.JR = 0.62 Error SSuR = 15,000(0.62) = 9,300
The F statistic is
(9,435 — 9,300)/3
F3,3114 — 15.068 (D)
9,300/(3,120 — 6)
Quiz Solutions
4-1. The error sum of squares of the smaller model is the total sum of squares, 6520 + 349, minus 6410, or 459. The
F ratio is
(459 — 349)/3
F3,22 — 2.311
349/22
BREAK. #asmstudybreak
Exercise is a proven way to boost your verbal memory, thinkin and learninEli
V
0
V
0
'L
E
Note that we are mathematicians, not yogi: attempt poses at your own risk!
atsim
Actuarial Study Materials
Lesson 5
Reading: Regression Modeling with Actuarial and Financial Applications 2.5,5.3-5.7; An Introduction to Statistical Learning
3.3.3
where hii is the ith diagonal element of the H matrix. As usual, o2 is estimated by $2, the residual variance of the
regression. The variance of 2; is estimated by (1 - h11)s2, where hi; is the ith diagonal element of the H matrix and
s2 is the square of the standard error of the regression. The leverage of the ith observation is defined as hii. We'll
discuss the use of leverage later in this lesson.
To best use the residuals, we should standardize them. Their mean is already 0, so standardizing consists of
dividing by their standard deviations. There are three possible formulas to standardize them:
ei
1. —.
s
This is simple, but not precise, since s is the estimated standard deviation of the true error, not of the
residual.
P
2. ri = This is the standardized residual, the residual divided by its estimated standard deviation.
s
3. ei , where s(i) is the residual standard error of a regression that excludes observation i. If yi is
s(;) 11177Z
unusually high or low, it may artificially increase s and thus reduce the standardized residual. By excluding
observation i, we avoid this problem. This residual follows Student's t distribution with n - (k +1) degrees of
freedom, so it is called the studentized residual. Sr
'The syllabus textbook Regression Modeling with Actuarial and Financial Applications mentions that H puts the hat on y, but does not call it the
hat matrix, so you are not responsible for knowing this term. However, I will use it for convenience.
= My
and
H2 = X(X1X)-1X1X(X/X)-1X' = X(VX)-1X` = H
so M2 = M. It follows that
Var(E) = M2 Var(e) = MG2
This is the covariance matrix for E. Since M = H, the diagonal elements of M are 1 minus the diagonal elements
of H, while the other elements are negative the corresponding elements of H. In particular, the variance of ti is
(1 -
EXAMPLE 5A '141 A regression with three parameters including the intercept is based on 6 observations. The hat
matrix is
0.458 0.376 0.232 0.132 0.009 0.189
0.376 0.411 0.255 0.037 0.164 0.085
0.232 0.255 0.200 0.112 0.036 0.165
H=
0.132 0.037 0.112 0.288 0.417 0.014
-0.009 0.164 0.036 0.417 0.719 0.001
\ -0.189 0.085 0.165 0.014 0.001 0.924
SOLUTION: 1. The covariance matrix of 2 is s2 times I - H. The variances of Pi and E2 are (1 - 0.458)s2 = 0.542s2
and (1 - 0.411)s2 = 0.58952. The covariance is -0.376s2.
n- Let's now discuss how to use residuals to check whether regression assumptions are satisfied.
Linear regression makes the following assumptions:
1. The response is a linear function of the explanatory variables.
2. The response is distributed normally.
3. The response has constant variance. A variable is homoscedastic if its variance is constant and heteroscedastic
otherwise.
10
•
0
—5
—10
•
—15 Pi
0 5 10 15 20 25 30 35 40
The absolute values of the residuals grow as the fitted value grows, indicating that variance is not constant. To
correct this problem, if variances are known, use weighted least squares. The weights should be the reciprocals of
the standard deviations.
To check linearity of the model, plot the response or the residuals against each explanatory variable. A pattern
in the plot indicates that the relationship is not linear. For example, the graph in Figure 5.1 indicates a quadratic
relationship to x2 that has not been incorporated into the model:
To check that observations are independent, plot the ei in the order for which correlation is expected, such as
order of occurrence. Observe whether values of consecutive residuals are close to each other.
E
30
•
25
20
15
10
—5 •
•
—10
•
—15
X2i
—200 2 4 6 8 10 12 14
An observation is unusual vertically if the value of the response is unusually high or low. Then the residual has an
unusually high absolute value. If the errors are normally distributed, very few standardized residuals should have
0
1 absolute value greater than 3. Observations with unusually high (in absolute value) residuals are called outliers.
An outlier can be handled in one of the following ways:
1. Include it in the model but comment on it.
We see that hii is high if and only if lxi — 11, the absolute distance of the explanatory variable from its mean, is
disproportionately high. This contrasts with outliers, which are observations for which the response variable is
disproportionately far from its fitted mean.
A high leverage point can be handled in one of the following ways:
1. Include it in the model and comment on it.
3. Replace the variable causing the high leverage with another variable.
Di =
7= 1(91 — "i(i))2
= Ti (5.2)
(k +1)s2 (k + 1)(1 — hii)
where ri is the ith standardized residual. The first expression is a definition. We measure the impact of observation i
by squaring the difference between the prediction of the model with and without that observation. We sum up these
impacts and divide by a normalizing constant. The second expression is a product of the standardized residual, a
measure of how much of an outlier the observation is, and an expression depending only on leverage.
What value of Di indicates an unusual observation? The first factor, the standardized residual squared, should
average about 1 (the variance of a standard normal distribution) and average leverage is (k + 1)/n so the second
factor should be about 1/n if we ignore 1 — hii in the denominator. So the product should be about 1/n. Another
way to evaluate Di is to compare it to an F distribution with k +1,n — (k + 1) degrees of freedom. If Di is unusually
high, the observation should be checked.
EXAMPLE 5B c: (Continuation of Example 5A) In Example 5A, calculate Cook's distance for the first two residuals..
SOLUTION: The first factor in equation (5.2) is the standardized residual squared, and we calculated the standardized
residual in Example 5A. In that example there are k +1 = 3 parameters.
Quiz 5-2 •41 For a linear model with 4 variables and an intercept, there are 100 observations. The first
standardized residual is 1.24 and the leverage of the first observation is 0.48.
Calculate Cook's distance for the first observation.
VIFi = (1 — (5.3)
Since R2 > 0, VIF • must be at least 1. The larger it is, the more collinear variable xi is. Values above 10 indicate
severe collinearity.
Quiz 5-3 %II Consider the linear regression model based on 21 observations:
Yi = Po + Plr1l+132xi2 +p3xi3 + El
It is suspected that x3 is collinear with the other variables. A regression of x3 on the other two variables is
performed. The resulting F statistic is 6.48.
Calculate the VIF for x3.
There is a relationship between the standard error of 1)1 and the VIF of xj:
VVIFi
.4* Ski = S (5.4)
Sx)
In this equation, sbj and s are based on the full regression of y on the k variables, not omitting xj. We see that all
other things being equal, a higher VIF corresponds to a higher standard error of the coefficient estimate. That means
it is more difficult to detect the significance of variables in the model.
High leverage may induce or mask collinearity.
••41 An explanatory variable may improve the model even if it is mildly collinear with others. A suppressor variable is
an explanatory variable that increases the significance of other variables when included in the model.
Two matrices X1 and X2 with equal numbers of rows are orthogonal if ViX2 = 0. If xi is the 1 x n vector of an
explanatory variable and X2 has the other explanatory variables, then xi is an orthogonal variable. In a sense, it
is the opposite of a collinear variable. Its VIF is 1. Adding it to the model does not affect the other (3 estimates.
We will later discuss principal components regression, a method of creating orthogonal variables. However, while
orthogonal variables remove the collinearity problem, they are difficult to interpret.
H = X(VX)-1X'
Estimated variance of residuals
ri =
s-VT. 1TH
Studentized residuals
r. =
S(i)
Leverage hii.
Properties of leverage
Cook's distance
E7=1 (Pi — 2 hi;
D, = ' =r. (5.2)
(k +1)52 ' (k +1)(1— hii)
viF1=0_eu,y1 (5.3)
(5.4)
Exercises
Residuals
(
0.4407 0.3559 0.1864 0.0169
0.1356 0.1864 0.2881 0.3898
—0.1695 0.0169 0.3898 0.7627
5.2. °call For a linear regression with 11 observations and 2 variables plus an intercept, you are given:
•
— 902 = 32
• y3 — 93 = 1.2
• The leverage of the third observations is 0.6.
Calculate the standardized third residual.
5.3. •-.11 For a linear regression with 25 observations and 3 variables plus an intercept,
(i) The residual standard error is 22.4.
(ii) If the 21st observation is removed, the residual standard error is 17.2.
(iii) The standardized residual for the 215t observation is 0.83.
Calculate the studentized residual for the 21st observation.
5.4. [MAS-L-F18:351 You are fitting a linear regression model of the form:
(6
1 1 1 8 19 4 4 2
X=
y=117;; "
=
1 1 0 7 3 2 3 321
1
1
1
0
0
0
6
6
1 15/
13 51 36 32 491
oc,xyi .= ( 0.20 )
0.84 0.25 -0.06 0.297
(rX)-1X'Y =
,
5.5. A regression model has the form yi = flo + pixi + e1. It is based on 4 observations of xi: {3,7,11,15). You
are given
_ ( 1.2625 -0.1125)
-
-0.1125 0.0125
5.6-7. (Repeated for convenience) Use the following information for questions 5.6 and 5.7:
A regression model with 5 variables and an intercept is fitted to 60 observations. The first 5 diagonal elements
of the hat matrix are 10.1612, 0.0808, 0.2159, 0.0259, 0.13071.
5.8. A regression model with 4 variables and an intercept is fitted to 32 observations. For the sixth observation,
you are given:
(i) The standardized residual is 0.824.
(ii) The leverage is 0.471.
Calculate Cook's distance for the sixth observation, D6.
5.9. *-41° [MAS-I-F19:311 In order to predict individual candidates' test scores a regression was performed using
one independent variable, Hours of Study, plus an intercept. Below is a partial table of data and model results:
Hours of Standardized
Candidate Test Score Study Leverage Residuals
1 2,041 538 0.6205 —1.3477
2 2,502 548 0.2018 —0.4171
3 2,920 528 0.6486 —1.1121
4 2,284 608 0.2807 1.1472
Calculate the number of observations above that are influential using Cook's Distance with a unity threshold.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
5.10. 4' You are given the following information about 5 observations from a linear regression:
i Residual Leverage
1 0.153 0.305
2 0.274 0.223
3 —0.211 0.190
4 —0.414 0.101
5 0.352 0.176
Using Cook's distance as a measure, which of these five points is most influential?
5.11. 14. A regression model with 3 variables and an intercept is fitted to 48 observations. You are given
(i) El = 7.624
(ii) The standard error of the regression is 4.823.
(iii) Cook's distance for the first observation is 0.804.
Determine the leverage of the first observation.
5.12. s: [S-F17:361You are analyzing a database and have fit a multiple regression with 12 continuous explanatory
variables and one intercept. You are given the following:
(i) Sum of square errors = 618
(ii) The 10th diagonal entry of the hat matrix, hum = 0.35
(iii) The 10t11residual value = io = 9.5
(iv) Cook's Distance = D10 = 5.0
Calculate the number of data points this model was fit to.
(A) Less than 200
(B) At least 200, but less than 400
(C) At least 400, but less than 600
(D) At least 600, but less than 800
(E) At least 800
5.13. A regression model of the form yi = Po + xi + ei is fitted to 80 observations. You are given that 1 = 51
and Er 1(xi - 1)2 = 60,919.
An observation is considered a high leverage observation if its leverage is greater than 2 times average. In other
words, an observation xi is high leverage if xi <a or xi > b.
Determine a and b.
5.14. s: [MAS-I-S19:31] You are fitting a linear regression model of the form:
y= + E; ei N (0, a2)
and are given the following values used in this model:
/1 0 1 9 21
1 1 1 15 32 3 2 3 32 1.38 0.25 0.54 -0.16
1 1 1 8 19 2 4 4 36 -0.20 -0.06
..Y = x,,, = 00..5245 _0.84
,,,
x=
0 1 1 7
'
17
' A'
3 4 6 51
;
v`'A)-1 =. 0.20 1.75 -0.20
0 1 1 6 15 32 36 51 491 -0.16 -0.06 -0.20 0.04
k0 0 1 6j \15/
0.684 0.070 0.247 -0.171 0.146 0.316
0.070 0.975 -0.044 0.108 -0.038 0.070
0.247 -0.044 0.797 0.063 0.184 -0.247
H = X(VX)-1X1 = -0.171 0.108 0.063 0.418 0.411 0.171
-0.146 -0.038 0.184 0.411 0.443 0.146
\ 0.316 -0.070 0.247 0.171 0.146 0.684
/20.93
32.03
0.293720 0
( 1943 );
19.04
(x/X)-IX'Y X(X'X)-1X'y = 16.89 '
a- = 0.012657
1.854 15.04
15.07
Calculate how many observations are influential, using a unity threshold for Cook's distance.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
5.15. s: [S-F17:371 You are given the following residual rig plot for a fitted linear regression model:
Normal Q-Q
-2 -1 0 2
Theoretical Quantiles
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
VIF
5.16. "41 Some authors suggest that a VIFj of 5 or higher indicates high collinearity of xi to the other variables in
a regression model.
Suppose xj is regressed against the other explanatory variables in a model. Let be the coefficient of
determination for this regression. Which values of R2(i) indicate high collinearity according to those authors?
5.17.
•1? A linear regression model with an intercept is used to model y as a function of xi and x2. The correlation
coefficient of the two explanatory variables is 0.4.
Determine VIF2.
5.18. "41 Auto collision losses are modeled using two categorical variables: USE (business or pleasure) and MILES
(less than 10,000 miles driven per year or more than 10,000 miles driven per year). You have 5 observations of the
two variables:
USE 1 0 1 1 0
MILES 1 0 1 0 0
You suspect that the two variables are almost collinear. To test this hypothesis, you calculate the VIF.
Determine the VIF of MILES.
5.19. "4' [S-F15:401 You are given the following information related to the linear model:
Y +P2x2 + p3x3 e
•
,c1 yo + Y1X2 + y2x3 + E
Degrees of Sum of
Source Freedom Squares
Regression 2 24.74
Error 503 226.34
• x2 = 60 + (52x3 + E
Degrees of Sum of
Source Freedom Squares
Regression 2 3.73
Error 503 3.04
E
• X3 = ÷ 171X1 172X2
Degrees of Sum of
Source Freedom Squares
Regression 2 214,2.58
Error 503 85,883
Calculate the variable inflation factor for the variable which exhibits the greatest collinearity in the original
model.
5.20. `11. [MAS-I-S18:32] Two different data sets were used to construct the four regression models below. The
following output was produced from the models:
Residual
Dependent Independent Total Sum of
Data Set Model Sum of
variable variables Squares
Squares
1 YA X/11, X./12 35,9030 2,823
A
2 XA1 XA2 92,990 7,070
1 YB XBi, X22 27,5700 13,240
B
2 Xsi XB2 87,020 34,650
The threshold of the Variance Inflation Factor for variable j (VIFi) for determining excessive collinearity is
VIFi > 5.
Determine which one of the following statements best describes the data.
(A) Collinearity is present in both data sets A and B.
(B) Collinearity is present in neither data set A nor B.
(C) Collinearity is present in data set A only.
(D) Collinearity is present in data set B only.
(E) The degree of collinearity cannot be determined from the information given.
5.24. Nei For a linear regression with 3 explanatory variables based on 24 observations, you are given:
(i) —11)2 = 64.4284.
(ii) zp1oi-y02. 196.0310.
(iii) The standard error of bi is 0.8955.
Calculate VIFi.
Miscellaneous
5.25. 6: [S-S16:361 You are given the following two graphs comparing the fitted values to the residuals of two
different linear models:
Graph 1 ,Graph 2
•
•
•
•
• a
• 0
41D
•
•
—
• a •• •
• • ..
•
S
•
in ••
•
t •
•
4
I .
• • •• •
•
•• 1.
•
• t• • • e
•
.I, • llio • •
7 i
• ...• •S •
• *
• o.
•-• ,
••• •• it
• - •
••
• •,,, . .0.
,,,, •
ea ••
_
'4
I •
00-r
00 0.2 0.4 08 08 1.0 18 20 22 24 26 28 30
Pitied fitted
(A) I only (B) II only (C) III only (D) I and III (E) II and III
Solutions
902 32
s=
\IE(Yi n + 1) y (2 + 1)
=2
1.2
T3 = 0.948683
s A,/ —1 - - h3; 2 VT-75
(---\ 5.3. For the studentized residual, we divide the residual by 17.2 instead of 22.4 and get 0.83(22.4/17.2) = 1.0809
0.2)
1 7 1 1 1 0.4 0.3 0.2 0.1
X(X'X)-1-X'
( 1.2625 -0.1125) (1
_
5.6. Average is (k + 1)/n = 6/60 = 0.1. Leverage is higher than 0.2 for the third observation only
5.7. Standardized residuals are
2
r
1 P(1 - hii)
where p, the number of parameters, is 2 here: the intercept and the slope. Candidates 1 and 3 should be calculated
i rst, since they have high leverage and high standardized residuals. Then calculate Candidate 4. If Cook's distance
f
for candidate 4 is less than 1, it is certainly less than 1 for Candidate 2, whose leverage and absolute value of
standardized residual are both lower.
For Candidate 1,
1 0.6205
DI = 1.34772 = 1.485 > 1
2(1 - 0.6205)
For Candidate 3,
0.6486
D3 = 1.112122(1( - 0.6486) ) = 1.141 > 1
For Candidate 4,
1 0.2807
D4 = 1.14722
2(1 - 0.2807) ) = 0.257 < 1
and we conclude that since Candidate 4 is not influential, neither is Candidate 2. (But if you are curious, D2 = 0.022.)
(C)
D, = ?r
hii I E`‘ 11
(k + 1)(1 — hii) s2(1 h) (k + 1)(1 — hi1))
Since we are comparing the observations, we can ignore multiplicative factors such as s and k + 1, and simply
calculate
(1 — hii)2
For the five observations, this works out to 0.014781, 0.027731, 0.012893, 0.024636, 0.032118. The fifth observation
is the most influential.
5.11. Let Di be Cook's distance for the first observation.
\2 iln
= Air (k +1)(1 — hu)
7.6242h11
0.804 =
4.8232(4)(1 — h11)2
74.8084(1 — 1402 = 58.1254h11
74.8084141 — 207.7422h11 + 74.8084 = 0
hil = 0.4252 ,2.3518
2 Error SS 618
=
5
n — (k + 1) n-13
2
E210 9.52(n — 13)
rio
=
— 0.22467(n — 13)
,S2(1 h10,10) 618(1 —035)
hi0:10
D10—r10(k + 1)(1 — hio,10)
0.35
5.0 = 0.22467(n — 13)
13(0.65)
5
=537
(0.22467)(0.04142012)
n = 550 (C)
5.13. The average leverage is (k + 1)/n = 2/n. From formula (5.1), leverage is greater than 2 times average if
1 (xi — I)2 2
+
80
n Ei =i(xi _ x)2 > 2 (— )
n
and n = 80
1 _ 3
(xi — 51)2 > 804 80 — aid
60,919
(xi —51)2>2284.46
ixi — 511 > 47.80
5.14. There's a lot to calculate here. The formula for Cook's distance, (5.2), has two factors. The first factor of
Cook's distance is the square of the standardized residual, ri. The fitted values are given by X(X'X)-1X/y, the hat
matrix applied to y, and the observed values are y. The differences between the two, observed minus fitted, are 0.07,
-0.03, -0.04, 0.11, -0.04, and -0.07. To standardize, we divide the squares of these by o-2(1 1111), where hii are the
diagonal elements of the hat matrix II, so
0.072
T2
1
= = 1.225119
0.012657(1 - 0.684)
and similarly the other qs are 2.844276, 0.622721, 1.642599, and 1.225119 respectively. The other factor of Cook's
distance is hid (p(1 - hii)) where p = 4 is the number of variables including the intercept. This factor will have to
be large to make the product greater than 1. For the second observation, the second factor is
0.975
= 9.75
4(1 - 0.975)
and multiplying this by 2.844276 results in a product greater than 1, making it an influential observation. For the
second highest hil, h33, we get 0.797/ (4(1 - 0.797)) = 0.981527, and multiplying this by 0.622721 results in a number
less than 1. For the first and sixth observations, hi; = 0.684, and the quotient is 0.684/(4(1 - 0.684)) = 0.541139. The
quotients for the fourth and fifth observations are even lower. Multiplying these quotients by rs that are no higher
than 1.225119 results in numbers less than 1. So only the second observation is influential.
The following table summarizes the calculations:
h-•
ei hii ri = 02(i-h) 4(1-hu)
Di
1 0.07 0.684 1.225119 0.541139 0.66296
2 -0.03 0.975 2.844276 9.75 27.73169
3 -0.04 0.797 0.622721 0.981527 0.61122
4 0.11 0.418 1.642599 0.179553 0.29493
5 -0.04 0.443 0.226952 0.198833 0.04513
6 -0.07 0.684 1.225119 0.541139 0.66296
(3)
5.15.
I. The standardized residual distribution has several very low observations, observations below -1. The q q plot
indicates that the number of low observations is greater than one would expect for a normal distribution, so
the distribution is skewed to the left. t/
II. Nothing can be deduced about serial correlation, since we aren't given the order of the residuals, nor even that
there is an order. X
III. Influential points have high leverage. We are not given the leverage, so we can't determine influential points. X
(A)
1
5.16. >5
1 - R2
(i)
1 > 5 - 5R2
(i)
R2 > 0.8
(i )
5.17. When regressing x2 on xi, since it is a simple regression, R2 is the square of the correlation coefficient. So
1
VIF2 = -
1.1905
1 - 0.42 =
fc =O.6 y =0.4
E (xi - 702 = 3(0.42) + 2(0.62) = 1.2
E(xi - Fc)(yi = (0.4)(0.6) + (-0.6)(-0.4) + (0.4)(0.6) + (0.4)(-0.4) + (-0.6)(-0.4) = 0.8
E(yi - y)2 = 2(0.62) + 3(0.42) = 1.2
R2 is the square of the correlation coefficient of x and y, or
R2 . (0.8)2
k1.2
0.44444
- 0.713858
214,258 + 85,883
1
VIF, =
= -3.495 (D)
1 - 0.713858
5.20. We will compute the VIF, which is (1 - R2)-1, when regressing one predictor on another. And 1 - R2 is the
quotient of the error sum of squares over the total sum of squares. For Data Set A, 1- R2 for model 2 is 7,070/92,990 =
0.07603 making the VIF 0.07603-1 = 13.153. For Data Set B, 1- R2 for model 2 is 1 34,650/87,020 = 0.39818 making
the VIF 0.39818-1 = 2.511. The first VIF is greater than 5 but the second isn't, making the answer (C).
5.21. Use equation (5.3) to compute the VIF.
VVIF3
Sb3 = S
Sx3
V196.0310 = 3.1307
(sb'SX1"1
= (0.893551(380.072 67) )2 5.2713
Quiz Solutions
5-1. The standardized residual is e1/(s — hii). We are given el = 3.2, s = 67, and h11 = 0.04, so the first
standardized residual is 3.2/(67V0T ) = 0.04875
0.28386
k5(1 — 0.48)) —
5-3. We derive R2 from F.
Regression SS/2 R2
9.5 Regression SS/Total SS
F218 =
6.48
Error SS/(21 —2 — 1)
R2
( Error SS/Total SS )
9
(1 — R2)
9 1 — R2
0.72 — 0.72R2 = R2
, 0.72
R.-
—
(3) 1.72
The VIP is
VIF3 — 1.72
1 — 0.72/1.72
Resampling Methods
Reading: Regression Modeling with Actuarial and Financial Applications 5.6.2-5.6.3; An Introduction to Statistical Learning
5.1
The previous lessons have discussed measuring the quality of a linear regression model using classical statistical
concepts. These methods make assumptions about the model and then test the assumptions. They are designed for
linear regression models.
More modern methods use the heavy computational power that has become available within the last few decades
to directly calculate how good the model predictions are, as measured by mean squared error. These methods may
be used for any model, not just for linear regression.
A repeated theme in An Introduction to Statistical Learning is the fact that mean squared error equals bias squared
plus variance. Many times an estimation method may decrease bias only at the cost of increasing variance. For
example, some methods have less flexibility. They may fit the model to a straight line. Other methods have more
l exibility. They may fit a curve that goes through every point of the data. Methods with more flexibility will have
f
less bias but more variance. Figure 6.1 shows how the MSE decreases to some extent as the flexibility is increased,
but reaches a minimum and then starts increasing due to higher variance. Our job is to pick the level of flexibility
that minimizes the MSE. (Figure 6.1 is similar to Figure 2.12 in An Introduction to Statistical Learning.)
The quality of a model is determined by how good its predictions are. We can create a model that fits the training
data perfectly by using a large number of variables. But such a model will probably perform poorly when used to
make predictions on other data. To test a model, we must measure differences between the values it predicts and the
actual values. We have actual data. But from where do we get test data so that we can see what the model predicts?
The way we get test data is we split our data into training data and test data. Training data is used to fit the
model, and then the model is applied to the test data. The rest of this lesson discusses different ways to perform
this split.
bias squared
variance
MSE
••••
.................... •
l exibility
f
MSE is calculated using the validation set. If we start with n points, let's say that n1 points are training data and
n2 = n — n1 are test data. Then we may calculate the MSE:
no-n2
1
MSE = —
n2 E (yi —g•)2
i=n3 +1
which is just the MSE times n2. Select the model with the lowest MSE or SSPE.
As a rule of thumb, use 25-35% of the sample for the validation set with 100 or fewer observations; with 500 or
more observations, use 50%.
el? Problems with this approach are:
1. The validation set is selected at random. Therefore the resulting MSE is highly variable.
2. The training set is significantly reduced in size. Fitting models to smaller data sets leads to higher variance
and poorer statistical results. Therefore the MSE of the test data is likely to be an overestimate of the true MSE
of the fit.
6.2 Cross-validation
Cross-validation methods select test data sets in a systematic non-random fashion, and then average the MSE results
from all runs. These methods usually require multiple runs, but fast computers have made these methods feasible.
We'll discuss two methods.
The first cross-validation method we'll discuss is Leave One Out Cross-Validation (LOOCV). In this method, in
each fit, one observation is held out as the test data set. The other observations are used as the training data set. For
each observation, one fit is performed removing that observation and using that observation as the test data set. Let
n be the number of observations. After n fits are made, the MSEs of the fits as measured on the test observation are
averaged. Let CV() be the LOOCV statistic. Then
1
CV(n) = —
n 1=1
L MSEi (6.2)
where MSE; is the mean square error of the test observation from the fit that removes that test observation) This
approach avoids the problems with the validation set approach that we mentioned above. The result is not random.
And since only one observation is omitted, the MSE is not significantly overestimated.
When the model is fitted by least squares, as in a standard linear regression model, it is unnecessary to perform
multiple fits. There is a simple formula for the LOOCV statistic:
(6.3)
1=1
where Ej = yj — 9; is the residual from the least squares fit on the entire n-observation data set. It is interesting
comparing this formula to the formula for the standard error of the regression, namely E 1(n — 1). The LOOCV
is calculated by not just summing up squares of residuals, but by weighting them with the complements of leverages:
the higher the leverage, the heavier the weight of the residual and the higher the estimate of the MSE is. The division
here is similar to the division for standardizing the residual, where we divide by s 1
If CV(„) is not divided by n, the resulting statistic is called PRESS, for predicted residual sum of squares. So the
'Notice that MSE, is just the square difference between the generated II; and the fitted yi; MSEi = (yi — 9j)2. There is no sum and no division
(division is by 1), since there is only one observation.
PRESS statistic is
11
(6.5)
where 90) is the fitted value when observation i is omitted from the training data and used as test data.
The second cross-validation method we'll discuss is k-fold cross-validation.2 In this method, k fits are done. The 'Ir
data set is randomly partitioned into k subsets, each of approximately equal size. Then for each of the k subsets, one
fit is performed. In this fit, that subset is removed and used as test data. After the k fits are performed, the test data
MSEs are averaged:
Usually k = 5 or k = 10. There is some variability in the result since the split into subsets is random, but the
variability is less than in the validation set approach. LOOCV is the special case of k =
Even if cross-validation does not calculate the MSE accurately, it still can identify the right amount of flexibility
to use, the level of flexibility that minimizes MSE.
k-fold cross-validation may lead to lower MSE than LOOCV because of its lower variance. In general, the higher the
k, the higher the variance and the lower the bias. LOOCV is n-fold cross validation, and n is the highest possible k.
2This k has nothing to do with the number of variables. Fortunately the number of variables does not enter into our discussion of cross-
validation.
LOOCV statistic
CV(H) = (6.2)
PRESS = (6.4)
71 n P=1
ei
PRESS = —
1 — hii )2 (6.5)
n Exercises
6.1. "4/ A regression of the form yi = Po + flix+ e is performed on the following data:
xi 8 12 16 18 24
yi 10 20 30 50 55
To validate the model, an out-of-sample validation procedure is used. The validation set consists of the last two
points, (18,50) and (24,55).
Calculate the SSPE statistic.
6.3. 'kJ' A linear regression model is of the form yi = /30 + ix + el. There are 5 observations. You are given the
following residuals and leverages:
Residual Leverage
3 0.6
—1 0.3
—3 0.2
—3 0.3
4 0.6
6.4. '41 Using the 5-fold cross validation method, you obtain the following mean square errors: 842, 759, 805,
738, 824.
Calculate the CV(5) statistic.
6.5. Rank the bias and variance of the following cross-validation methods from lowest to highest:
1. LOOCV
2. 5-fold cross-validation.
3. 10-fold cross-validation.
6.6. 641° [MAS-I Sample:31 You are given the following statements about different resampling methods:
I. Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation
k-fold cross-validation has higher variance than LOOCV when k < ii
III. LOOCV tends to overestimate the test error rate in comparison to validation set approach
Determine which of the above statements are correct.
(A) I only (B) H only (C) HI only (D) I, II, and III
(E) The correct answer isn't given by (A), (B), (C), or (D)
6.7. I': [MAS-I-S18:301 You are considering using k-fold cross-validation (CV) in order to estimate the test error
of a regression model, and have two options for choice of k:
• 5-fold CV
• Leave-one-out CV (LOOCV)
Determine which of the following statements makes the best argument for choosing LOOCV over 5-fold CV.
(A) 1-fold CV is usually sufficient for estimating the test error in regression problems.
(B) LOOCV and 5-fold CV usually produce similar estimates of test error, so the simpler model is preferable.
(C) Running each cross-validation model is computationally expensive.
(D) Models fit on smaller subsets of the training data result in greater overestimates of the test error.
(E) Using nearly-identical training data sets results in highly-correlated test error estimates.
6.8. [MAS-I-F18:33] Two ordinary least square models were built to predict expected annual losses on
Homeowners policies. Information for the two models is provided below:
Model 1 Model 2
Replacement Cost (000s) 0.03 <0.001 Replacement Cost (000s) 0.02 <0.001
Roof Size 0.15 <0.001 Roof Size 0.17 <0.001
R2 0.91 R2 0.94
Adj R2 0.87 Adj R2 0.89
MSE 31,765 MSE 30,689
AIC 25,031 AIC 25,636
1 33,415 1 26,666
2 38,741 2 38,554
3 32,112 3 39,662
4 37,210 4 36,756
5 29,501 5 30,303
You use 5-fold cross validation to select the best of the two models.
Calculate the predicted expected annual loss for a homeowners policy with a 500,000 replacement cost, a 2,000
roof size, a 0.89 precipitation index, and three bathrooms, using the selected model.
(A) Less than 1,000
(B) At least 1,000, but less than 1,500
(C) At least 1,500, but less than 2,000
(D) At least 2,000, but less than 2,500
(E) At least 2,500
6.9.
NI' [MAS-I-S19:34] A statistician has a dataset with n = 50 observations and p = 22 independent predictors.
He is using 10-fold cross-validation to select from a variety of available models.
Calculate the number of times that the first observation will be included in the training dataset as part of this
procedure.
(A) 0
6.10. %II [MAS-I-F19:29] An actuary has a dataset with four observations and wants to use Leave-One-Out Cross
Validation (LOOCV) to determine which one of the two competing models fits the data better. The model preference
will be based on minimizing the mean squared error.
The values of the dependent variable are:
Corresponding fitted values under each model and training data subset are:
Training Model I Model 2
Obs. Used pi 92 93 94 91 92 93 94
1,2,3 1.50 1.60 1.20 1.80 1.60 1.70 1.60 Z
1,2,4 2.00 1.50 1.10 1.90 1.80 1.40 1.30 1.70
1,3,4 1.75 1.55 1.70 2.10 1.40 1.30 1.50 1.95
2,3,4 1.70 1.65 1.60 2.00 1.60 1.70 1.20 2.00
Calculate the maximum value of Z for which the actuary will prefer Model 2.
(A) Less than 1.5
(B) At least 1.5, but less than 1.8
(C) At least 1.8, but less than 2.1
(D) At least 2.1, but less than 2.4
(E) At least 2.4
Solutions
6.1. Fitting the other three points, which are on a straight line with slope (20 - 10)/(12 - 8) = 2.5, gives yi =
-10 + 2.5xi. Applying this fit to the validation set, we get 94 = 2.5(18) - 10 = 35 and ij5 = 2.5(24) - 10 = 50. The
SSPE is
SSPE = (50 - 35)2 + (55 - 50)2 = 250
6.2. Leverage sums up to k + 1, the number of parameters, so the average leverage is (k + 1)/n. Here, every
observation has leverage 4/84 = 1/21. Then the LOOCV is
64
CV(54) = 84 E
i=1 )2 84(1 - 1/21)2
(1090) = 14.30625
2 2 2 2 2
3 -1 -3 -3 4
) (
l
f
6.3. PRESS =
(1 - 0.6)
-
+
- 0.3) + ‘1 + (1 0.3)
-
+
- 0.6)
190.72
6.4. The CV(5) statistic is the average of the five MSEs, or 793.6
6.5. The fewer elements left out of the training set, the less bias. Thus LOOCV has the lowest bias; 10-fold is
second; and 5-fold, which leaves out 1/5 of the elements, has the highest bias.
Variance works the other way around.
6.6.
6.9. Each cross-validation will have 45 training observations and 5 test observations, so each observation will be
in the training set 9 times and in the test set 1 time. (C)
6.10. We compute sums of square differences between fitted values in the one-observation test data sets and actual
values of the dependent variable. We need not divide by 4 to average, since the division is the same for both models , ,,)
and we just want to equate the results.
For training observations 11,2,3} in Model 1, the test observation 94 is 1.80 versus the actual 1.95, for a difference
of —0.15. For training observations {1,2,4} we compare 1.10 to actual 1.60, a difference of —0.50. And the same for
the third and fourth rows:
Reading: Regression Modeling with Actuarial and Financial Applications 5.1-5.2; An Introduction to Statistical Learning
6.1
Modern models may have large numbers of predictors, sometimes more predictors than observations. Using all
predictors will result in lower standard error, even 0 standard error, on the training data but will lead to poor
predictions. It is necessary to select predictors that truly impact the response; mechanically fitting coefficients does
not guarantee a true relationship.
This lesson discusses methods for selecting the important predictors. The next lesson discusses other techniques
for reducing the number of variables.
If there are k possible predictors, one can fit 2k models containing every subset of the predictors. One can then select
the best model. The method for determining which model is best depends upon whether or not the models being
compared have the same number of predictors:
• When comparing two models with the same number of predictors, the one with the lower RSS is better.'
• When comparing two models with different numbers of predictors, RSS cannot be used directly, since adding
a predictor to a model, no matter how irrelevant, cannot increase the RSS and will almost surely decrease
it.2 Instead, cross-validation or one of these four statistics: Mallow's Cp, AIC, BIC, or adjusted R2, is used to 141
compare models.
This method is sometimes called "Best Subset Selection" to distinguish it from the heuristic methods we will s:
discuss next.
When k is high, it is impractical to fit 2k models. For example, if k = 20, more than 1,000,000 models are possible.
A heuristic approach is then needed. We will discuss forward stepwise selection and backward stepwise selection.
Forward stepwise selection consists of starting with the empty model, the one that sets y equal to the sample Nr
mean. Then k models are formed by adding one predictor from the available k predictors. The best model is selected
using RSS. This process is repeated: k — 1 models are formed by adding one predictor from the remaining k — 1
predictors, and the model with the lowest RSS is selected. This is continued until all k predictors are added to the
model. The k +1 resulting best models, one for each number of predictors, are then compared using cross-validation
or one of the four statistics. Thus a total of 1 + Erc(k — 1) = 1 + k(k + 1)/2 models are fitted.
This algorithm is greedy in the sense that it picks the immediately best model, not considering that picking an s:
inferior model at the current step may lead to a better model ultimately. There is no guarantee that the best model
is selected. In a situation like this:
'Remember, RSS is the sum of squared residuals. It is exactly the same as what we called Error SS. We ll use the symbol RSS in most of this
lesson.
2It is possible for a model with d +1 predictors to have a larger RSS than a model with d predictors if the latter d predictors are not a subset
of the d +1 predictors. Still, the fact that RSS must decrease when the models are nested indicates that comparing RSS for models with different
numbers of parameters is not a good way to measure model quality.
Xi would be the preferred model with 1 variable and then only the {X1, X2} and {X1, X3} models would be
considered. The {X2, X3} model would be bypassed even though it is better than the {X1, X2} and {X1, X31 models.
Forward stepwise selection can be used even when n <k, but only models with n —1 or fewer parameters would
be considered.
Backward stepwise selection consists of starting with the full model, the one having all predictors. Then for
each predictor a model is fitted by removing that predictor and fitting using the other k 1 predictors. From these
k models the one with lowest RSS is selected. This process is repeated: k 1 models are fitted by removing one
predictor at a time. This is continued until the model has no predictors. The k +3. resulting best models, one for each
number of predictors, are then compared using cross-validation or one of the four statistics. A total of 1 + k(k + 1)/2
models are fitted.
As with forward stepwise selection, there is no guarantee that the best model is selected. And backward stepwise
selection cannot be used if k > n, since a fit with more than n predictors generates meaningless statistics.
441 There are hybrid versions of stepwise selection, which are called mixed selection methods, in which variables
are added sequentially but variables may also be removed if they no longer improve the model fit. The steps for
mixed selection are:
2. For each unused variable, create a model by adding it. Select the model with the best RSS.
3. Look at each variable in the model. If there are variables with t ratios below a predetermined threshold (that
you set before starting selection), remove the variable with the lowest t ratio. That is the algorithm presented
in Regression Modeling with Actuarial and Financial Applications. In An Introduction to Statistical Learning, they
instead remove the variable with the highest p value if it is above the predetermined threshold. (The two
methods are similar but not equivalent since the t ratio is a function of the degrees of freedom.)
4. Repeat steps 2 and 3 until no more variables satisfying the t ratio or p value threshold can be added.
In all subset selection methods, one parameter at a time is added or removed. For a categorical variable with
more than two categories, only one category is added at a time, so it is possible that the final model will have only
some of the categories as variables. An individual not having the characteristics of the accepted categories would
effectively be placed in the base category.
For example, suppose the categorical variable "type of vehicle", with categories coupe, sedan, SUV, and van was
under consideration, and "sedan" was the base category. Forward subset selection is used. At the first iteration,
SUV is added to the model. Then this model effectively puts coupe and van in the base category.
Regression Modeling with Actuarial and Financial Applications also mentions a "best regressions" routine, which
finds the best model having a specific number of variables.
Regression Modeling with Actuarial and Financial Applications lists 7 problems with stepwise regression:
1146 1. Data snooping. This means fitting a large number of models to one set of data. If one fits large numbers of
models, one of them is likely to look good even though it is false. Statistics typically uses 95% confidence
intervals, so that 1 out of 20 times a model is accepted even though it is false. If we go through 100 models,
we're likely to find 5 that look good even though they are false.
2. It ignores the possibility that none of the models are right, either because the correct model is nonlinear or
because of outliers and high leverage points in the data.
3. Only some of the 2k models are considered; one of the models not considered may be better.
4. Rather than using I, another statistic should be used for determining which variables are added or removed.
5. The true significance of the model is greater than the significance level of the t statistic used as the addi-
i on/removal criterion, since separate additions are made. If one variable addition is good 95% of the time
t
and a second variable addition is good 95% of the time, the probability that both variables belong in the model
is less than 95%.
6. Since variables are added one by one, the joint effect of adding two variables is not considered. However,
backward stepwise regression does consider joint effects since only one variable is removed.
7. Automatic procedures don't use additional information that an investigator may have. For example, data from
year 2015 may be unusual because something special happened in that year.
Mallow's Cp Suppose the full model has k predictors, and we are considering a subset of that model with p < k
predictors. Let S2 be the residual variance of the k-predictor regression:
2 _
r1=1 r
s —
n—k 1
Let (RSS)p be the residual sum of squares of the p-predictor regression. Then Mallow's Cp is defined by
RSSp
fl Cp /1 + 2p
That is the definition given in Regression Modeling with Actuarial and Financial Applications. The definition of Cp
given in An Introduction to Statistical Learning is
These two definitions are quite different, but they lead to the same results. Remove —n from equation (7.1); it is a
constant that does not depend on the model. Then bring 2p into the fraction:
RSSp + 2ps2
s2
Now change the denominator from s2 to n; both s 2 and n are independent of model, so this doesn't affect the
comparison. Now you have An Introduction to Statistical Learning's version of the formula.
If $2 is an unbiased estimate of the residual variance, then the C defined in An Introduction to Statistical Learning
is an unbiased estimate of the test MSE.
AIC and BIC Linear regression maximizes the likelihood of the data. If additional variables are added to a model,
the maximum likelihood cannot decrease, since setting the additional ps equal to 0 will result in the same likelihood
as the smaller model. The maximum likelihood will probably increase, regardless of whether the added variables
are significant or not. So merely comparing likelihoods of different models is not adequate.
Later on in the course we'll discuss likelihood ratio tests. Those tests set a threshold for adding variables to a
model. However, they are only available for comparing nested models.
The Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) are penalized loglikelihood
measures. They may be used to compare any models whether or not they are nested. Both statistics start out with
twice the negative loglikelihood. The AIC adds 2d for a d-parameter model. The BIC adds d ln n for a d-parameter
model with n observations. Thus the penalty per parameter of BIC is almost always higher than AIC. A model
selected using BIC will tend to have fewer parameters than a model selected by AIC.
In Regression Modeling with Actuarial and Financial Applications, a formula is developed for AIC of a linear regres-
sion. For linear regression with k explanatory variables, the likelihood L of the set of n observations with errors
following a normal distribution with mean 0 and variance a2 is
L=
n exp Hy; Pixii)2/2c2)
Let 1 = In L be the loglikelihood. Twice the negative loglikelihood, —2 In L = —2/, is
RSS
—21= + 2n ln •Nir
0.2 + 2n In
If the usual estimate of a2, namely s 2, is used, then we get
RSS
52 =n — p — 1
—2/ =n—p-1+2n ln s + 2n ln V = n ln s2 + n ln(2n) +n—p-1
For AIC, we add twice the number of parameters that are estimated. There are k + 1 13s, and a2 is considered a
parameter as well, so we add 2(p + 2) to —2/ and get
Since AIC is only used to compare models, we can ignore the constants n In(2n) and n + 3. We see AIC balances
improvements in s2 against number of parameters k.
Formula (7.3) uses S2 to estimate a2.
An Introduction to Statistical Learning, on the other hand, does not assume that we use S2 as an estimate of cr.
Instead, we use s 2 from the full model, as we did for Mallow's C. Then s is independent of model, and we can
therefore ignore n ln 62 as well as n ln(2n) and n + 3. For some reason, An Introduction to Statistical Learning divides
the resulting formulas for AIC and BIC by n. (Dividing by a constant does not affect comparisons of models.) The
resulting formulas for AIC and BIC in An Introduction to Statistical Learning are
1
AIC =---(RSSp + 2ps2)
ns
(7.4)
1 ,
BIC = ---(RSS + (ln n)ps2)
n52
(7.5)
where p is the number of predictors in the subset model. Thus AIC is a multiple of Mallow's Cp as defined in
An Introduction to Statistical Learning and therefore leads to identical results. BIC puts a higher penalty on adding
parameters.
Presumably any Exam SRM question asking you to calculate AIC or BIC will tell you which formula to use.
Adjusted R2 = 1 (7.6)
Sometimes the symbol .IZ, is used for adjusted R2. Increasing k decreases the subdenominator n — p — 1, which
increases the numerator of the fraction and decreases adjusted R2. Thus RSS must decrease sufficiently to justify the
increase in p. •,.....)
Exam SAM Study Manual
Copyright C24:122 ASM
7.2. CHOOSING THE BEST MODEL 107
For Mallow's Cp, AIC, and BIC, the lower the statistic the better the model. Each of them has theoretical
justification based on asymptotic arguments. Adjusted R2, on the other hand, is an ad hoc measure without strong
theoretical justification, and higher values indicate better models.
EXAMPLE 7A al For a linear model with 30 observations and 5 parameters (including the intercept), the residual
sum of squares is 100. The unbiased sample variance of the response variable is 40. The residual variance of the
regression is 10.
Calculate Mallow's Cp, AIC, BIC, and adjusted R2, using the formulas of An Introduction to Statistical Learning.
SOLUTION:
Cp = -n1- (RSSp + 2ps2) = (100 + 2(4)(10)) = E 0
cp 6
AIC = 7.. (RSSp + 2ps2) = 72- = -
f o= 0.6
1 1
n Exercises
7.1. 4' A least squares model is fitted to 62 observations. 22 different predictors are considered. The best
predictors are selected using backward stepwise selection.
Determine the number of models that are fitted.
7.2. ktr [SRM Sample Question #541 For a regression model of executive compensation, you are given:
(i) The following statistics:
Executive Compensation
Coefficients Estimate Std. Error t-statisic p-value
(INTERCEPT) —28,595.5 220.5 —129.7 <0.001
AGEMINUS35 7,366.3 12.5 588.1 <0.001
TOPSCHOOL 50.0 119.7 0.4 0.676
LARGECITY 147.9 119.7 1.2 0.217
MBA 2,490.9 119.7 20.8 <0.001
YEARSEXP 15,286.6 7.2 2132.8 <0.001
7.3. A normal linear model is fitted to 30 observations. The model has the following predictors:
AGE Categorical variable with categories "Under 30", "30 to 39", "40 to 49", "50 to 59", "60 and over".
SEX Male or female
BLOOD PRESSURE Real-valued variable
7.4. Nr A normal linear model is used to estimate the sales of oranges. The explanatory variables are various
characteristics of the oranges:
SIZE Small, medium, large
TYPE Navel, temple, juice
STATE Florida, Arizona, California
SEASON Fall, winter, spring, summer
The best explanatory variables are selected using backward stepwise selection.
Determine the number of models that are fitted.
7.5. •-: For a normal linear model based on 26 observations, 100 predictors are under consideration. The best
predictors are selected using forward stepwise selection.
Determine the number of models that are fitted.
7.6. "41 [MAS-I-F19:401 You have p = 10 independent variables and would like to select a linear model to fit the
data using the following two procedures:
• Best Subset Selection (BSS)
• Forward Stepwise Selection (FSS)
Let 1\11 be the maximum number of models fit by model selection procedure i.
NFSS
Calculate —
N1355
(A) Less than 0.005
Xi 152 {Xi, X2} 118 {X2, X4} 122 {Xl, X3/ X4} 101
X2 145 {Xi, X3} 144 {X3, X4} 135 {X21 X3, X4} 107
X3 160 {Xi, X4} 129 {Xi, X2, X3} 110 {Xi, X2, X3, X4} 85
X4 138 {X2, X3} 131 {Xi, X2, X4} 105 None 258
7.7. •': Which three variable model is selected by best subset selection?
7.8.
'4" Which three variable model is selected by forward stepwise selection?
7.9. *-: Which two variable model is selected by backward stepwise selection?
7.10. N. [MAS-I-F19:39] An actuary has a dataset with one independent variable, Y, and five independent
variables, X1, X2, X3, X4, X51. She is trying to determine which subset of the predictors best fits the data, and is
using a Forward Stepwise Selection procedure with no stopping rule. Below is a subset of the potential models:
Dependent Independent
Model variable RSS variable p-value
X1 0.0430
1 Y 9,823
X2 0.0096
Xi 0.0464
2 Y 7,070 X2 0.0183
X3 0.0456
Xi 0.0412
3 Y 6,678 X2 0.0138
X4 0.0254
Xi 0.0444
4 Y 4,800 X2 0.0548
X5 0.0254
X1 0.0333
X2 0.0214
5 Y 3,475 X3 0.0098
X4 0.0274
X5 0.0076
Exam SRM Study Manual Exercises continue on the next page ...
Copyright 132022 ASM
112 7. LINEAR REGRESSION: SUBSET SELECTION
7.11. "41. [MAS-I-F18:31] An actuary fits two GLMs, M1 and M2, to the same data in order to predict the probability
of a customer purchasing an automobile insurance product. You are given the following information about each \---)
model:
Degrees of Log
Model Explanatory Variables Included in Model Freedom Used Likelihood
• Offered Price
• Number of Vehicles
Mi 10 —11,565
• Age of Primary Insured
• Prior Insurance Carrier
• Offered Price
• Number of Vehicles
7.12.
•41 Calculate Mallow's Cr, using the formula in James et al.
7.13. NI. Calculate Mallow's Cp using the formula in Frees.
7.14.
11-: A linear regression is fitted to 20 observations. The model has 5 explanatory variables and an intercept.
The RSS is 84.
Calculate Mallow's Cp for the full model using the formula in James et al.
7.15. •41. Various normal linear models are fitted to 60 observations. The models with the lowest residual sum of
squares (RSS) for each fixed number of explanatory variables are:
Number of
Explanatory Lowest
Variables RSS
0 326
1 314
2 303
3 293
4 284
7.16. Various normal linear models are fitted to 29 observations. The models with the lowest residual sum of
squares (RSS) for each fixed number of explanatory variables are:
Number of
Explanatory Lowest
Variables RSS
0 162
1 145
2 140
3 136
4 132
7.17. "1r For a linear regression model with 100 observations, you are given:
• The model has 10 variables and an intercept.
• The residual sum of squares is 64.8.
• The estimated variance of the residual term is 5.5.
7.18. •••• A linear regression model has n observations and 8 predictors. You are testing a model having a subset
of the 8 predictors. The subset has 4 predictors.
You are given:
• Both models have intercepts.
• The training RSS of the original model is 82.8.
• The training RSS of the subset model is 116.2.
• The variance of the residuals is estimated using the full model.
• AIC and BIC of the subset model are calculated using the formulas in James et al.
• The AIC of the subset model is 1.271084.
Calculate the BIC of the subset model using the formula in James et al.
7.19. *NI° [MAS-I-F18:39] Two actuaries were given a dataset and asked to build a model to predict claim frequency
using any of 5 independent predictors 11,2,3,4,5} as well as an intercept III.
• Actuary A chooses their model using Best Subset Selection
• Actuary B chooses their model using Forward Stepwise Regression
• When evaluating the models they both used R-squared to compare models with the same number of parameters,
and AIC to compare models with different numbers of parameters.
Below are statistics for all candidate models:
# of Non # of Non
7.20. rYou have fit 5 models using linear regression. The loglikelihoods of the models are
7.21. 1": [MAS-I-S19:391 Two actuaries were given a dataset and asked to build a model to predict claim frequency
using any of 5 independent predictors 11,2,3,4,5) as well as an intercept {I).
• Actuary A chooses their model using Best Subset Selection
• Actuary B chooses their model using Forward Stepwise Regression
• Actuary C chooses their model using Backwards Stepwise Regression
• When evaluating the models they all used R-squared to compare models with the same number of parameters,
and AIC to compare models with different numbers of parameters.
Below are statistics for all possible models:
# of Non # of Non
Intercept Intercept
Model Parameters Parameters R2 AIC Model Parameters Parameters R2 AIC
1 0 I 0 1.9 17 3 1,1,2,3 0.73 1.3
2 1 1,1 0.56 1.4 18 3 1,1,2,4 0.71 1.5
3 1 1,2 0.57 1.2 19 3 1,1,2,5 0.72 1.4
4 1 1,3 0.55 1.6 20 3 1,1,3,4 0.75 1.0
5 1 1,4 0.52 1.7 21 3 1,1,3,5 0.76 0.8
6 1 1,5 0.51 1.8 22 3 1,1,4,5 0.79 0.2
7 2 1,1,2 0.61 1.0 23 3 1,2,3,4 0.78 0.6
8 2 1,1,3 0.64 0.5 24 3 1,2,3,5 0.74 1.2
9 2 1,1,4 0.63 0.8 25 3 1,2,4,5 0.75 1.1
10 2 1,1,5 0.69 0.0 26 3 1,3,4,5 0.73 1.3
11 2 1,2,3 0.61 1.0 27 4 1,1,2,3,4 0.88 1.6
12 2 1,2,4 0.62 0.9 28 4 1,1,2,3,5 0.80 2.1
13 2 1,2,5 0.68 0.2 29 4 1,1,2,4,5 0.87 1.8
14 2 1,3,4 0.66 0.4 30 4 1,1,3,4,5 0.83 2.0
15 2 1,3,5 0.64 0.5 31 4 1,2,3,4,5 0.85 1.9
16 2 1,4,5 0.60 1.1 32 5 1,1,2,3,4,5 0.90 3.5
7.22. NI" You have fit 5 models based on 70 observations using linear regression. The loglikelihoods of the models
are
7.24. *-411 A regression is performed based on 28 observations. The form of the regression is
Yi = pix1i+p2xi2+ 83xi3+p4xi4+ ri
The AIC, using the formula in Frees, is 111.03.
Determine the BIC.
7.25. So [S-F15:39] You are given the following output from five candidate models:
7.26. You are given the following ANOVA table from a regression:
7.28.
•••• [120-83-98:4] You fit the regression model Y=i +18X; + ei to 11 observations.
You are given that R2 = 0.85.
Determine adjusted R2.
(A) 0.77 (B) 0.79 (C) 0.80 (D) 0.81 (5) 0.83
7.29. 46 [MAS-I-F18:301 An actuary uses statistical software to run a regression of the median price of a house on
12 predictor variables plus an intercept. He obtains the following (partial) model output:
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared: 0.7406
F-statistic: 117.3 on 12 and 493 DF
p-value: <2.2e-16
7.31. •-: For a linear regression model of the form yi = 13o + Pixit + f32Xi2 -I- ei you are given:
(i) F = 96
(ii) Adjusted R2 = 0.95
(iii) E(gi — 02 = 1000
3 8
(iv)
(XCX)-1 = ( 8 12)
12
2
0
0
1
Determine the width of the shortest 95% symmetric confidence interval for pi.
7.32. s: For a linear regression model yi = f30 + Pixii + 132 Xi2p2X13 /34Xi4 Ei with 9 observations you are given
that s2 = 20. The values of gi are 1, 2, 3, 4, 5, 6, 7, 8, 9.
Determine adjusted R2.
l
f
Exam SRM Study Manual Exercises continue on the next page . .
Copyright 02022 ASM
118 7. LINEAR REGRESSION: SUBSET SELECTION
s
7.33. You are given the following regression models:
(A) yi = + PiXil PkXik E.;
(B) y; = pi° + 161 xii + • - • + + Ei
where y; = 2yi.
Which of the following statements are true?
1. The standard error of the regression will be the same in both models.
2. The adjusted R2 will be the same in both models.
3. Both models will have the same F statistic.
Solutions
7.1. We fit the full model, then 22 models with 1 predictor removed, 21 models with 2 predictors removed, etc.,
until we fit the model with just an intercept. Total number of models fitted is
22
1+ =1+
22(23)
254
2
1=1
7.2. TOPSCHOOL and LARGECITY both have p-values greater than the significance level of 0.10. However, only
TOPSCHOOL, which has the greater p-value, is removed. Sometimes a variable that appears to be not significant
may become significant after another variable is removed. (C)
7.3. We need 4 variables for AGE, 1 for SEX, 1 for BLOOD PRESSURE, and 1 for CHILDREN, a total of 7 variables.
We start with 1 model with just the intercept, then consider 7 models for the first variable to add, 6 for the second,
and so on. Total number of models considered is
(7)(8)= 29
2
k=0
7.4. There are 2 variables for each of SIZE, TYPE, and STATE, and 3 variables for SEASON, for a total of 9 variables.
We start with the model having all variables, then 9 models with 1 variable removed, 8 models with 2 variables
removed, and so on, for a total of
8
(9)(10) 1 1
1 +E(9 —k)= 1 +
k=0
2
=1
7.5. The model has an intercept, and can contain at most 25 predictors; otherwise the variables will not be linearly
independent. We start with the model with an intercept only, then 100 models with 1 predictor, 99 with 2 predictors,
and so on. The total number of models considered is
24
1 + E (100 — k)
k=0
As usual, the sum of an arithmetic sequence is the average of first and last terms (100 and 76 here) times the number
of terms (25 here), so
24
(76 + 100)(25)
1 + E(100 — k) = 1 +
k=0
2
=
2201
7.6. BSS considers every possible model. Each variable may be included or excluded, so there are 210 = 1024
models to consider. FSS starts with the empty model, then considers 10 choices to add, followed by 9 choices, and so
on. The last step considers adding the 1 variable that hasn't entered the model yet. Thus 1 + 10 + 9 + 8 + • • • + 1 = 56
models are considered. The ratio is 56/1024 = 0.055. (D)
7.7. The three variable model with the lowest RSS is {Xi, X3, X4}
7.8. The best one variable model, the one with lowest RSS, is X4. Among the two variable models with X4, the
one with {X2, X4} has the lowest RSS. Among the three variable models with {X2, X4}, the one with pri, x2, x4)
has the lowest RSS.
7.9. Among the models with three variables, the one with (X1, X3, X4} has the lowest RSS. Among the models
with two of those three variables, {X1, X4} has the lowest RSS.
7.10. Forward Stepwise Selection adds one variable at a time, so answer choices (A) and (E) cannot be right. It
selects the variable that lowers RSS the most, and that is the one added for Model 4, namely (D)
7.11. The models are not nested; M2 does not have Prior Insurance Carrier of M1 and does have 2 explanatory
variables that M1 does not have. Thus (A) and (C) are not available, and (B) is only for linear models. (E) is not
relevant. That leaves (D).
7.12. Here p = 4.
cp =2.(Rss 2ps2) = 25
(132 + 2(4)(8)) = 7.84
7.2
7.15. The mean squared error of the full model is 284/(60 — 5) = 5.163636. Then, using the James et al version of
the Cp formula,
326
Cr (0) = 60
— = 5.4333
314 + 2(5.163636)
Cp(1) = — 5.4055
60
303+ 2(2)(5.163636) = 5.3942
Cp (2) = 60
293 + 2(3)(5.163636)
C,,(3) = 5.4000
60
284 +2(4)(5.163636) = 5.4218
C(4) = 60
7.16. The estimated value of the mean square error of the model with 4 explanatory variables is
2 RSS 132
S -
= 5.5
n—k—1 29-5
We calculate Mallow's Cp for each model using the An Introduction to Statistical Learning formula, but you will
get the same final answer with the Regression Modeling with Actuarial and Financial Applications formula. j
162
C (O) = = 5.586
The model with1 explanatory variable has the lowest Cp and is therefore the best.
1
7.17. AIC — (64.8+ 2(10)(5.5)) = 0.3178
100(5.5)
1
BIC — (64.8 + (ln100)(10)(5.5)) = 0.5783
100(5.5)
82.8
7.18. s 2=
n-9
(n_—
AIC =
1.271084 —
116.2(n — 9) + 8(82.8)
82.8n
105.2458n = 116.2n — 383.4
383.4
n= = 35
10.9542
26 (1n35)(4)(82.8)
BIC =
7.19. With best subset selection, for each number of parameters we select the model with the highest R2. The
models Actuary A selects are Models 1, 3, 10, 22, 27, 32. In each case, the AIC is —2(/ — p) where p is the number of
parameters including the intercept, or 1.9, 1.2, 0, 0.2, 1.6, and 3.5 respectively; 0 is best.
With forward stepwise regression, we add one variable each time that maximizes R2. We start with I, then add
variable 2, then variable 5, then variable 4, then variable 1, then variable 3. In other words, the models are 1, 3, 13,
25, 29,32 with AICs of 1.9, 1.2, 0.2, 1.1, 1.8, and 3.5 respectively; 0.2 is best. The difference of AICs is [rill. (B)
7.20. We will use the formula in Frees, but you will get the same result using the formula in James et al. The
AIC is —2/ + 2p, where p is the number of parameters, which is the number of variables plus the intercept and the
variance, k 2. There is no need to consider the second and fourth models, since the third and fifth models have
higher loglikelihoods with the same numbers of parameters. The resulting AICs for the remaining models are
2(-15.785) + 2(4) = 39.570
— 2(-13.015) + 2(5) = 36.030
— 2(-12.021) + 2(6) = 36.042
The third model has the lowest AIC and is therefore preferred.
7.21. With best subset selection, A ends up choosing the model with the lowest AIC, model 10 (AIC = 0.0).
With forward stepwise regression, first we select model 1, then add in parameter 2 which generates the 1-
parameter model with the highest R2. That is model 3. Then add in variable 5, which among models 11, 12, and 13
(---' generates the highest R2. The AIC of that model, model 13, is 0.2. We see that the models with more parameters do
not have a lower AIC, so this is optimal.
With backward stepwise regression, first we select model 32, then model 27 which removes parameter 5 and
has the highest R2 among 4-parameter models. Then remove parameter 1, model 23, the model with the highest R2
among 3-parameter models without parameter 5. Then among the 2-parameter models without parameters 1 and 5,
models 11, 12, and 14, the best is model 14. We can't improve the AIC of 0.4 by removing additional parameters, so
that one is selected. (A)
7.22. Using the formula in Frees, the BIC is -2/ + p In n, where p = k + 21s the number of parameters, which is the
number of variables plus the intercept and variance parameters. In this case, In 70 = 4.2485. The second model can
be skipped since the third model has the same number of parameters and higher loglikelihood.
- 2(-120.1) + 4(4.2485) = 257.194
- 2(-118.2) + 5(4.2485) = 257,642
- 2(-116.6) + 6(4.2485) = 258.691
- 2(-113.8) + 7(4.2485) = 257.339
The first model has the lowest BIC and is therefore preferred.
7.23. Using equation (7.3),
AIC = 501n82.42 + 501n 2n +50+3+3 = -589.05
7.24. We subtract 2p and add p Inn, where p is the number of parameters, which is 6 including the constant and
az.
BIC = 111.03- 2(6) + 61n28 = 119.02
(----'
7.25. Higher R2 and lower AIC and BIC are better. So Model 5 is best according to R2 and Model 4 is best according
to AIC and BIC. (D)
7.26. Let p be the number of predictors. The regression has p degrees of freedom, and the error has n - p - 1
degrees of freedom. Here, p = 3 and n - p - 1. = 12. It follows that n - 1 = 15. We will use formula (7.7).
21
1 - R2 ,.. 21
= 348 + 21 - 369
21 15
Adjusted R2 = 1 - (1 - R2)
( "-1 ) - 1 0.9289
_
-
n -
p 1 369 T.2
138.89/5
7.27. Adjusted R2 =1 = 0.8449 (A)
(1115.11 + 138.89)/7
7.28. We will use formula (7.7).
7.29. We'll use formula (7.7) to calculate adjusted R2. Here, n - p - 1 is the number of degrees of freedom, 493,
and n is total of degrees of freedom, predictors, plus 1 for the intercept, or 506.
505
Adjusted R2 = 1 - (1 - 0.7406)
(493-) = 0.7343 (C)
7.31.
n—1
0.95 = Adjusted R2 = 1— (1 R2)n _ 3
0.05(n — 3) = 1 — R2
(1
n—1
The regression has 2 variables and therefore 2 degrees of freedom. So the F statistic has 2,n —3 degrees of freedom.
96= F —
Regression SS/2
RSS/(n —3)
Divide numerator and denominator by Total SS.
96=
R2/2 R2(n — 3)
(1— R2)/(n —3) — (1— R2)(2)
Now replace R2 and 1 — R2 using (*)
0.05(n — 3) n — 1 — 0.05n + 0.15
R2 =1
n—1 n—1
(n — 1 — 0.05n + 0.15)(n —3) 96
2(0.05(n — 3))
0.95n — 0.85 = 9.6
n 11
—
0.04
10
R2 = 0.96 Regression
Total SS
SS
Regression SS = 1000
1000
Total SS =
0.96
1000
Error SS = 1000(0.04\)
1000
0.96 24
2
s = 5.20833
(24)(8)
Now we make the only usage of the (VX)-1 matrix. The variance of Pi is S2 times the (2,2) coefficient of that matrix.
s2n = (5.20833)(2) = 10.4167
The t critical value at 5% significance for 8 degrees of freedom is 2.306. The width of the confidence interval is
2(2.306) 67 = 14.885
n7.32. R2——
Error SS = 20(9 — 5) = 80. Regression SS = E(9i - g)2, and g = 5 since g = 9, so Regression SS = 60.
6°
— 2 There are p =4 predictors and n = 9 observations.
60+80 — 7
n—1 1
141
Adjusted R2 = 1 — (1 — R2)
(12 p — 1) — k9 1) — 7
i=1
(yi — fio — Efiixii
j=1
+ A Xi p?
j=1
The shrinkage penalty function is A >2 . This is A times the square of the 6 norm of the vector (pi,. , We
denote the 6 norm by 11p112. Notice that the sum starts at 1, not 0; there is no penalty for the intercept.
A is a tuning parameter. As A goes to infinity, the coefficients go to 0. The tuning parameter is selected using 'I
cross-validation.
An equivalent formulation of ridge regression is
Minimize
s is called the budget parameter.' For every A, there is an $ that makes this formulation equivalent to the one stated
earlier. Higher values of A correspond to lower values of s.
EXAMPLE 8A '141 For a set of 10 observations, two predictors (xi and x2), and one response (y), the residual sum of
squares has been calculated for several different estimates of a linear model and an intercept. Only integer values
from 1 to 3 were considered for the estimates of Po (the intercept), pi, and /32.
The following table shows the residual sum of squares for every combination of the parameter estimates:
'This s has nothing to do with standard error of a regression.
1. The parameters pi are estimated using ridge regression with a tuning parameter of A = 20.
Determine the resulting estimates of phi = 1, 2, 3.
2. The parameters f3i are estimated using ridge regression with a budget parameter of s = 10.
Determine the resulting estimates of phi = 1, 2, 3. •
SourrioN: 1. We have to add 20(ft T. + AD (but do not include Tql in that sum) to each RSS. The resulting table (but
you don't have to calculate every value; some are clearly out because RSS is larger than for a lower value of 13i;
for example, (Ao, Ai, Az) = (1, 3,3) versus (Ao, Ai, A2) = (1, 3,1)) is
Ao = 1 Ao =2 4' = 3
if2 = 1 fi2 = 2 132 = 3 132 = 1 /32 = 2 /32 = 3 fi2 = 1 fi2 = 2 132 = 3
Ai = 1 812 730 725 667 705 698 651 709 912
The minimum value in the table is for (P09 Ail #2) = (3,1,1)
2. With s = 10, we must have pT f4 10, which disallows (Ai ,J62) = (3,2), (3. 3), (2,3). Among the remaining
RSS values, the minimum is 489 at (Po, #1, Az) = (1,3,1) 0
Ridge regression shrinks the coefficients but does not set them equal to O. Thus all variables are left in the regression,
but the less important ones have small coefficients.
In standard regression, the scale of the predictors does not affect the solution. But in ridge regression, it does,
since each p, and therefore the penalty function, depends on scale. It is therefore best to standardize the predictors
by dividing by their standard deviation:
xij
(8.2)
— i)2
(1/n instead of 1/(n —1) is used in the denominator, but as long as it is used consistently for all predictors they will
all be on the same scale.)
You may get a better feel for ridge regression if you try it out by hand on a simple (1-predictor) linear regression.
Suppose you want to minimize
E(Yi — Po /3ixi)2 + /1/3
Differentiate with respect to po and set the derivative equal to 0 and you get the usual normal equation,
Differentiate with respect to /3i and set the derivative equal to 0 and you get
E
(yi -130 - j=1
+ A I/3i I
rfa
(8.3)
The shrinkage penalty function is A -1 1/311. This is A times the 4 norm of the vector (/31, , pk). We denote the el "41'
norm by 1116111. As with ridge regression, the sum starts at 1, not 0; there is no penalty for the intercept.
A is a tuning parameter. As A goes to infinity, the coefficients go to 0. It is selected using cross-validation. sr
As in ridge regression, the variables should be standardized.
An equivalent formulation of the lasso is
Minimize
2
ZIP 5
j=1
Once again, s is called the budget parameter. For every A, there is an s which makes this formulation equivalent '41
to the one stated earlier. Unlike ridge regression, the lasso forces coefficients to equal 0, dropping those variables
from the model. In other words, the lasso performs feature selection.
EXAMPLE 8B For a set of 10 observations, two predictors (xi and x2), and one response (y), the residual sum of
squares has been calculated for several different estimates of a linear model and an intercept. Only integer values
from -1 to 1 were considered for the estimates of /30 (the intercept), pi, and 132.
The following table shows the residual sum of squares for every combination of the parameter estimates:
ifi = 0 610 559 572 707 665 601 578 521 562
=1 489 495 512 722 705 651 549 498 503
1. The parameters /3; are estimated using the lasso with a tuning parameter of A = 60.
Determine the resulting estimates of pi, i = 1,2,3.
2. The parameters f3j are estimated using the lasso with a budget parameter of s = 1.
Determine the resulting estimates of pi, i= 1,2,3.
SOLUTION: 1. We have to add 60(Iiii I + 021) (but do not include 001 in that sum) to each RSS. The resulting table
is
Po = —1 110 = 0
P2 = —1 132 = 0 132 = 1 P2 = —1 132 = 0 132 = 1 132 = —1 132 = 0 132 = 1
Psi = —1 892 690 645 747 665 618 731 669 832
The minimum value in the table is for (#0, ch, #2) = (1, 0, 0)
2. With s = 1, we must have oil ± 021 1, which disallows any solution in which 131 and Psz are both nonzero.
Among the remaining RSS values, the minimum is 495 at (X, #1,13z) (-1,1,0)
You may get a better feel for the lasso if you try it out by hand on a simple (1-predictor) linear regression. Suppose
you want to minimize
Z (yi — flo — Pi xi)2 + A11311
To make the differentiation easier, assume pi > 0. Then differentiate with respect to Po and set the derivative equal
to 0 and you get the usual normal equation,
— E yi 4- npo + pi Z xi =0
npo + (E xi) pi = E yi
Differentiate with respect to /31 and set the derivative equal to 0 and you get
it = n () xiyi — 0.5A) — E x; E yi
— (Z x 02
For standard linear regression, A =0. Increasing A will eventually make the numerator 0. If /31 is negative for A = 0,
iPil = —131, causing the sign of A to switch in the solution forf3i. So PI in this case will increase to 0 as A is increased.
An Introduction to Statistical Learning gives a different simplified example for ridge regression and the lasso.
Consider a regression with no intercept and with n = k (so there are n variables and no intercept). Let the X matrix
be the identity matrix, with is on the diagonal and Os elsewhere. Then standard regression results in 13‘j = y/. Ridge
regression results in P;i = yi/(1 + A). The lasso results in
yi — A/2 yi > .A/2
Pil =
lyi + A/2 yi <—A/2
0 lyil - A/2
0.5 0.5
0.4 - 0.4 -
0.3 - 0.3 -
0.2 - 0.2 -
0.1 - 0.1 -
0 i
0 1 1
—3 —2 —1 o 1 2 3 —3 —2 —1 0 1 2 3
Figure 8.1: Probability density functions for standard normal distribution (left) and standard double-exponential distribution
(right)
We see that ridge regression reduces all coefficients whereas the lasso selects features.
Both ridge regression and the lasso decrease the MSE of the estimate on the test data. When A = 0, there is no el
bias but variance is high. As A increases, the squared bias increases but the variance decreases. Initially the variance
decreases by more than the squared bias increases, until the MSE reaches its minimum at the optimal value of A. As
A increases above its optimal value, the decrease in variance does not offset the increase in squared bias. Remember
that s moves in the opposite direction of A, so higher s leads to less squared bias and more variance. Bias and
variance can also be measured against R2; the unadjusted model has the highest R2, so squared bias and variance
have the same relationship to R2 as s: higher R2 lowers bias and raises variance.
Both ridge regression and the lasso can be interpreted in a Bayesian manner. In Bayesian statistics, a prior must
be stated for /3.
• For ridge regression, the prior for each p is a normal distribution with mean 0 and standard deviation a svii
function of A. The ridge regression solution is the posterior mode for p . It is also the posterior mean.
• For the lasso, the prior for each p is a double-exponential distribution. The density function for a double-
exponential is
1
x1/9
f (x) =_e-l
26
— c <x <00
Figure 8.1 graphs the probability density functions of the standard normal distribution and the standard double-
exponential distribution.
We have been discussing methods for reducing the number of variables in a linear model. As an alternative to
selecting the most important variables, we can create new variables that are linear combinations of the original
variables. These new variables capture the most important information from the original variables, so that fewer
variables are needed. We will discuss two dimension reduction methods: principal components regression (PCR)
and partial least squares (PLS).
is maximized. The direction of the principal component is the one that minimizes the distance of the data from the
line.
The second principal component is selected to maximize the variance and to be uncorrelated to the first principal
component. It is perpendicular to the first principal component direction. This process can be repeated to generate
additional principal components.
With n observations, the Zi are vectors with n components just like the Xi. The components of Zi are
Zji = E(Pii(X —
j=1
and the same holds for Z21., • Zk, replacing subscripts 1 in the formula with the subscript of Z. The Op are called
loadings and zil are called principal component scores.2 The scores are the distances between the points and the
principal component.
EXAMPLE 8C "•41° For two variables X and Y:
(i) 5C =5
(ii) Y = 9
(iii) The principal component is Z = 0.6X + 0.8Y.
Calculate the principal component score of (X, Y) = (5.5, 8.5).
In principal components regression (PCR), the regression is performed on principal components. Since the
components are weighted averages of all the variables, PCR does not do feature selection. It is analogous to ridge
regression in this way. However, since each principal component explains less variance than the previous one, only
the first few components are used, reducing the number of variables in the model. The more components used, the
lower the bias and the higher the variance, so test MSE has a U shape.
It is advisable to standardize the variables, using equation (8.1.1), so that the maximization of variance considers
all variables equally.
Since the response is taken into account, the directions of the predictors aren't fitted as well as they are by
principal components analysis. However, the predictors generated by PLS do a better job in explaining the response.
Since this approach is supervised, it reduces bias relative to PCA but increases variance, so overall PLS does not
perform better than PCA.
Table 8.1: Summary of concepts and formulas from this lesson, Part 1
RIDGE REGRESSION
Minimize:
2
(8.1)
i.1 j=1 j.=.1
Equivalently: Minimize
2
if
yi _ Efi;x41=1
subject to the constraint
Table 8.2: Summary of concepts and formulas from this lesson, Part 2
THE LASSO
Minimize:
2
21 k
Equivalently: Minimize
yi —13o —
.1=1 ) +A
Z Iflii
j.1
(8.3)
j=1
l
f Exercises
8.1. "-ir For ridge regression, which of the following patterns does the test MSE follow as the tuning parameter
is increased?
(A) Flat.
(B) Decreasing.
(C) Increasing.
(D) First decreasing, then increasing.
(E) First increasing, then decreasing.
If ridge regression is used, for which values of A is the second model preferred?
8.5. ser When performing ridge regression, the predictors should be standardized.
The observations for one of the predictors, xi, are:
4 5 5 6 8 10 11 12
8.6. s: [MAS-I-S19:281 Determine which one of the following statements about ridge regression is false.
(A) As the tuning parameter A —) 00, the coefficients tend to zero.
(B) The ridge regression coefficients can be calculated by determining the coefficients g, 1§1R, that
2
8.11. `141 WAS-1419:361 You are given the following three statements regarding shrinkage methods in linear
regression:
I. As tuning parameter A increases towards co, the penalty term has no effect and a ridge regression will result in
the unconstrained estimates.
For a given dataset, the number of variables in a lasso regression model will always be greater than or equal to
the number of variables in a ridge regression model.
The issue of selecting a tuning parameter for a ridge regression can be addressed with cross-validation.
Determine which of the above statements are true.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
8.12. rYou are considering the following two linear models for a set of observations:
1.
yi = 4.2— 1.7xii + 2.5x12 + 1.2xj3 + E;, residual sum of squares is 26.
2. yi = 5.4 + 3.1x12 + el residual sum of squares is 50.
For which values of the tuning parameter A would the second model be preferred when using the lasso?
Yi = Po + Pixii + g2xi2
For 131 and /32, only integer values between 1 and 4 are considered.
The resulting lowest residual sum of squares for the optimal value of #0 and each combination of ;61 and #2 is:
/32 1 2 3 4
1 56 52 46 52
2 48 41 34 16
3 42 37 31 33
4 36 40 44 48
8.14. "411 A ridge regression is performed based on two predictors and an intercept:
Yi = /3o + Pixii +132x12 + Ei
For pi_ and 132, only integer values between 1 and 4 are considered.
The resulting lowest residual sum of squares for the optimal value of )(30 and each combination of r31 and [32 is:
pi
162 1 2 3 4
1 56 52 46 52
2 48 41 34 30
3 42 37 31 33
4 36 40 44 48
8.15. Ls [MAS-I-F19:371 For a set of data with 40 observations, 2 predictors (Xi and X2), and one response (Y),
the residual sum of squares has been calculated for several different estimates of a linear model with an intercept. \--1
Only integer values from 1 to 3 were considered for estimates of po (the intercept), pi, and P2.
The grid below shows the residual sum of squares for every combination of the parameter estimates, after
standardization:
o=1 130 = 2
S2 S2 Sz
1 2 3 1 2 3 1 2 3
1 3,924 1,977 1,250 3,949 1,822 1,174 3,784 1,671 1,107
2 1,858 1,141 711 1,907 1,187 717 1,827 1,128 668
3 1,386 822 369 1,363 711 349 1,294 700 344
Let be the estimate of pi using a ridge regression with budget parameter s = 5. Assume the intercept is not
subjet to the budget parameters.
Calculate the value of pg
(A) Less than 6 (B) 6 -
8.16. 'kir [MAS-I-S18:34] You are estimating the coefficients of a linear regression model by minimizing the sum:
2
From this model, you have produced the following plot of various statistics as a function of the budget parameter, s:
0 le+10
8.17. [MAS-I-F18:37] You are estimating the coefficients of a linear regression model by minimizing the sum:
2
n
( +AE
From this model you have produced the following plot of various statistics as a function of tuning parameter A:
c:!
0 le+10
("--- Use the following information for questions 8.18 and 8.19:
You are given the following observations of two variables:
i xi yi
1 1 2
2 2 5
3 3 12
4 4 13
5 5 18
E xi= 15 y; = 50
8.18. 'I? You are fitting the linear model yi = /30 + Aix; + Ei using ridge regression with A = 4.
Determine Pi.
8.19. •111 You are fitting the linear model y; = P0 + Plx; + Ei using the lasso with A = 4.
Determine /i.
8.20. s-i? A linear regression is performed based on two predictors and an intercept:
lf Yi = Po + Pixn +132xi2 + Ei
For /31 and P2, only integer values between 0 and 3 are considered.
The resulting lowest residual sum of squares for the optimal value of po and each combination of pi and 132 is:
P2 0 1 2 3
82 77 70 62
1 74 70 65 61
2 69 63 62 60
3 67 60 59 58
Yi = Po + pixii p2xi2 + £1
For /31 and p2, only integer values between 0 and 3 are considered.
The resulting lowest residual sum of squares for the optimal value of po and each combination of pi and p2 is:
132 1 2 3
0 82 77 70 67
1 74 70 65 62
2 69 64 62 57
3 67 63 59 56
8.22. [MAS-I-S18:361 For a set of data with 40 observations, 2 predictors (X1 and X2), and one response (Y),
the residual sum of squares has been calculated for several different estimates of a linear model with no intercept.
Only integer values from 1 to 5 were considered for estimates of /31 and (32.
The grid below shows the residual sum of squares for every combination of the parameter estimates, after
standardization:
/32
1 2 3 4 5
1 2,855.0 870.3 464.4 357.2 548.6
2 1,059.1 488.4 216.3 242.8 567.9
8.23. it: You are given the following statements regarding principal components regression.
I. The principal components are weighted averages of the explanatory variables.
A principal component Zi = E opiiXij where Zi ciqj = 1 and the coefficients 4ii are selected to minimize the
variance of Zi.
The second principal component is selected to be uncorrelated with the first principal component.
Which statements are true?
(A) None (B) I only (C) II only (D) III only (E) I, n, and III
8.24. 'se You are given the following statements regarding partial least squares:
I. Partial least squares is a supervised alternative to principal components analysis.
II. The direction of a partial least squares variable does not fit the predictors as well as the direction of a principal
components regression.
The bias of partial least squares is higher than that of principal components regression.
Which statements are true?
(A) I only (B) II only (C) I and II (D) I and III (E) I, II, and III
8.25. •-•• You are given the following observations of two explanatory variables and a response:
Xi X2 Y
1 2 10
2 6 2
3 4 4
4 1 11
5 5 17
xi = 15 xi2 = 18 yi = 45
8.26. 64 [SRM Sample Question #8] Determine which of the following statements describe the advantages of
using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares.
I. Doing so will result in a simpler model
II. Doing so will improve prediction accuracy
III. The results are easier to interpret
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
8.27. NI° [MAS-I-S19:371 You want to perform a regression of Y onto predictors X1, X2, Xp, using a large
number of observations, and are considering the following modelling techniques:
• Lasso Regression
• Partial Least Squares
• Principal Component Analysis
• Ridge Regression
Determine how many of the above modelling procedures perform variable selection.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
8.28. [MAS-I-F19:28] Determine which of the following statements about Principal Component Regression
(PCR) is false.
(A) When performing PCR it is recommended that the modeler standardize each predictor prior to generating
the principal component.
(B) PCR is useful for performing feature selection.
(C) PCR assumes tha the directions in which the features show the most variation are the directions that are
associated with the target.
(D) PCR can reduce overfitting.
(E) The first principal component direction of the data is that along which the observations vary the most.
Solutions
8.1. At first the test MSE decreases, since variance decreases more rapidly than bias increases. The test MSE
attains a minimum and then increases, as variance decreases less rapidly than bias increases. (D)
8.2. 3(2.52 + 3.12 + 0.82) = 49.5
8.3. After applying the shrinkage penalty, the adjusted residual sum of squares is 4 + (1.12 + 0.82)A = 4 + 1.85A
for the first model, 9 + (12 + 0.12)A = 9 + 1.01A. We want
9 + 1.01A < 4 + 1.85A
0.84A > 5
A> 5.9524
8.6. Ridge regression increases bias but decreases variance, making (E) false.
8.7. -\/32 + + 42 + 12 ÷ 62 + 92
8.8. 3 +1+4 +1 +6+9 = 24
8.9. Subtract 5 from In > 5 and add 5 to yi <-5; otherwise 0. b = (This only works for the
special situation mentioned in the box before the question.)
8.10. Multiply each value of yi by 1/(1 + A) = 1/3. b = (1, —5/3,10/3, 14/3, —4,7/3,1, —4/3) (This only works
for the special situation mentioned in the box before the question.)
8.11.
8.12. The adjusted RSS for the first model is 26 + (1.7 + 2.5 + 1.2)A = 26 + 5.4A. The adjusted RSS for the second
model is 50+ 3.1,1. We want the latter to be smaller, and solve for A.
,
24
10.43478
A> —2.3
8.13. pl+fl < 12. That means neither variable may be 4, and if one is 3, the other must be 1. With these constraints,
the lowest RSS is 41, which occurs at (fi1,132) = (2,2)
8.14. The RSS at the fitted values is 34, and the penalty function is (32 + 22)A = 13A. At the three lower values of
RSS in the table, we have:
513o + 15131 = 50
1 df
2d13
1513o + 5913i = 190
14/31 = 4°
pi = 20 — 2677
8.19. We minimize f (Poi Pi) = Z(yi — (Po + fl i Xj))2 +41131 F. Differentiating, assuming /31 > 0, the partial with respect
to /30 is the same as in ridge regression, and
1 df
2 df3i
15/30 + 55/31 = 188
Subtracting three times the equation for the partial with respect to go,
iopi = 38
131 = En
8.20. We require ir3iI 1121 3. With that constraint, (r31, ft2) = (3,0) results in the lowest RSS.
8.21. We add 4(113ii + ifizi)to each RSS. The lowest value of that sum is 76, obtained at = (1,2)
8.22. The budget parameter s is the parameter such that E s. Here you want 11311+ 021 5, and the smallest
RSS with that property is 216.3, with i = 2 and /32 =3, and then fi1/1q2 = 2/3 (B)
8.23.
I. The principal components are linear combinations of the explanatory variables but not weighted averages
necessarily; the sum of the coefficients need not be 1. X
They are selected to maximize, not minimize, the variance. X
This is true./
(D)
8.24.
8.26. All three are advantages. Fewer variables result in a simpler model and an easier explanation. And the
variance of the prediction is lowered. (D)
8.27. Ridge regression reduces the values of the coefficients but does not make them 0, so it does not perform
variable selection. The lasso sets coefficients equal to 0, which eliminates the associated variables. Principal
component regression and partial least squares do not select variables; instead, they reduce the dimension of the
model by creating new variables that are functions of the original variables. (B)
8.28. Statement (B) is false. While PCR reduces dimension, the variables it creates are functions of all of the
predictors. Thus it does not select predictors.
BREAK. #asmstudybreak
Self-reflection is a means to observe and anal ze oneself in order to grow as a person and actua
aisim
Actuarial Study Materials
Lesson 9
Reading: Regression Modeling with Actuarial and Financial Applications 2.5.3, 6.1.2
One of the purposes of a model is to predict the response when given a set of predictors. However, note the following
cautions:
Notice the similarity between this formula and the formula for leverage, (5.1).
The variance of the realized value of y is the variance of y* plus the variance of the error term, which is estimated
by s2. Accordingly, a 1 — a confidence interval for the predicted value y" is "Ls.
1 (x" — 2)2
g* ± ti-ans111 + n
A prediction interval can then be calculated by multiplying the square root of this formula by an appropriate t
coefficient and adding/subtracting the product to/from the prediction.
I doubt you will be expected to carry out a matrix calculation like this on the exam.
Exercises
•
9.1. For the linear model yi = Po + Pixii + P2xi2 + e,, you are given that bo = 10.29, b1 = -0.29, and b2 = 0.91.
Calculate the forecasted value when x1 = 10 and x2 = 5.
y 10 9 15 8 12
You estimate the linear model yi = Po + Pixii + 132xi2 ei. You are given that
0.9254 -0.1503 0.0068
(X1X)-1 =
(-0.1503
-0.0068
0.1002
0.0621
-0.0621
0.0585
9.3. For the linear model yi = Po + Pixi + ri based on 25 observations, you are given:
(i) 2 = 223
(ii) 9 = 1160
(iii) The fitted values of the parameters are 1;0 = 47.26, b1 = 5.00.
(iv)
Source Sum of Squares
Regression 23,958,920
Error 5,245
Calculate the upper bound of a 95% prediction interval for y when x = 100.
9.4. `-ir For the linear model yi = Po + Pixi + Ei based on 10 observations, you are given:
(i) 2 = 7.1
(ii) The standard error of the regression is 3.7981.
(iii) The variance of the prediction for y when x = 3 is 17.3568.
Calculate the unbiased sample variance of x.
9.5. a: [SRM Sample Question #131 Determine which of the following statements is/are true for a simple linear
relationship, y= Po + Plx + e.
I. If r = 0, the 95% confidence interval is equal to the 95% prediction interval.
The prediction interval is always at least as wide as the confidence interval.
The prediction interval quantifies the possible range for Efy
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
9.6.
f [SRM Sample Question #561 Determine which of the following statements about prediction is true.
(A) Each of several candidate regression models must produce the same prediction.
(B) When making predictions, it is assumed that the new observation follows the same model as the one used
in the sample.
(C) A point prediction is more reliable than an interval prediction.
(D) A wider prediction interval is more informative than a narrower prediction interval.
(E) A prediction interval should not contain the single point prediction.
9.7. [S-S17:361 You are given the following information for a model fitted using ordinary least squares (OLS):
Calculate the upper bound of the 95% prediction interval for Rating, for an observation with a Complaints value
of 50.
9.8. "lb [SRM Sample Question #491 Trish runs a regression on a data set of n observations. She then calculates
a 95% confidence interval (t, u) on y for a given set of predictors. She also calculates a 95% prediction interval (v, w)
for the same set of predictors.
Determine which of the following must be true.
I. lirn„,(u — =0
II. lirn„,(w — v) = 0
III. w—v>u—t
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
9.9. •-: For the linear model yi = Po + pixii+p2xi2+ El, you are given:
0.4465 -0.0426 0.0064
(i) (X/X)-1 =
( 0.0426
-0.0064
0.0169
0.0112
-0.0112
0.0125
(ii) The square of the residual standard error of the regression is 20.439
(iii) y* is a forecasted value based on x; = 18, x;' = 10.
Calculate the variance of the forecasted value.
9.10. A linear regression model yi = Po + 13.2X i2 P3Xi3 Ei is used to forecast values of the independent
variable. Let 2 be the forecasted value given that x, is the column vector of values of the explanatory variables. You
are given:
(i) The regression is based on 15 observations.
(ii) 34 (X1X)-1x2 = 5.662.
(iii) The standard error of the regression is 1.290.
(iv) 2 = 26.500.
9.11. asir [MAS-I-S19:36] An ordinary least squares regression model is fit with the following model form:
E[Y] = po + pixi
After fitting the model, the following plot with the original data (points) and three sets of 95% intervals are provided:
AO.
0
,..• • ' •
.•"
"
0.- •
8
... -
0
O•. -Ø.00000
- -- —
-
• *"
•••
=is
••••
- - - - Interval 1
Interval 2
—
Interval 3
Let "CI" be the 95% confidence interval for E[Yi], and let "PI" be the 95% prediction interval for Y.
Determine which of the following best describes the intervals shown above.
(A) Interval 1 = CI Interval 2= PI
lf Solutions
9.1.
y" = 10.29- 0.29(10) + 0.91(5) = 11.94
9.2. 13 = (XIX)-1)Cy
10
0.9254 -0.1503 1 1 1 1 1 9 10.732
= ( -0.1503
-0.0068
0.1002
0.0621 -0.0068)
-0.0621
0.0585
5
4
4
2
2
5
3
3
9
11
15
8
12
=
-1.2135
1.1385
V5245 23
= 15.101
1 (100 - 223)2
15.101V1 + 25
+
958,356.8
= 15.517
The t coefficient at 23 degrees of freedom is 2.069. The upper bound of the prediction interval is 547.26 +
2.069(15.517) = 579.4
9.4. Use formula (9.3) for the variance. The variance of the prediction as we see in that formula is
(1* 1)2
52 (1 + 1 + n
)
S2 (1 + 1 + n
1
)7)2 ) = 17.3568
3.79812 (1 + 1
10
1-
7.1)2 )
E(xi - 1)2
16.81
- 17.3568
1+ + = 1.2032
10 E(xi - 502
Eoc i
0161.08312 .= 162.89
I. The prediction interval takes into account the variance of the forecasted mean, which the confidence interval
also takes into account, and the random error given the mean, which the confidence interval ignores. If the
random error E = 0, then the two intervals are equal. /
II. The prediction interval takes into account the random error that the confidence interval ignores, so it is at least
as large as the confidence interval./
III. The confidence interval is for the expected value of y given x, The prediction interval is for the value of y give x,
y I x.X
(E)
9.6.
9.7. The predicted value is 14.37632 + 0.75461(50) = 52.10682. The standard deviation of it is
s111. n
Z(xi — 2)2
(xi —
= 13.3147572
n—1
9.8.
I. The confidence interval measures the uncertainty of the predicted mean value. This goes to 0 as the sample
size goes to infinity.
We can see this clearly for a simple linear regression by looking at the confidence interval formula. Looking at
formula (9.2), we see that 1/n —> 0. The denominator of the other summand under the radical is E(x1 — 1)2,
which is n times the variance of x. As n —> oo, this goes to infinity, making the fraction go to 0. /
The prediction interval includes the intrinsic variance of the dependent variable. This variance never changes
regardless of n, so II is false. X
HI. The prediction interval includes the variance of y in addition to the variance of the forecast of its mean, so it
must be larger than the confidence interval, which only includes the latter variance.
(C)
10) (—0.0426
—0.0064
0.0169
—0.0112 —0.0064) ( 1
—0.0112
0.0125
18
10
= 1.4785
9.11. The prediction interval must be wider than the confidence interval since it includes variation of Y1 itself (not
just variation of its mean), and only (B) has that property.
This lesson is a brief summary of the material in the reading, which is non-mathematical material. Refer to the
textbook for more details or for amusing examples.
Sb, S (5.4)
Sxj
Overfitting a model by adding an extraneous variable may increase the residual standard error due to loss of one
degree of freedom. Disadvantages of overfitting models are:
'I'm not sure why the textbook doesn't count sxj as a fourth factor. By using explanatory variables that are more spread out we can increase
significance.
Sampling frame error means obtaining data from the wrong group. An example is using mortality data from life
insurance policyholders to estimate annuity mortality. People who buy annuities expect to live longer, whereas life
insurance purchasers may adversely select against the company.
The sampling region may be limited, and thus forecasts outside the region may not be appropriate. A quadratic
curve may look like a straight line in a small region. The earth appears flat for someone traveling in a small area.
Dependent variables may be censored, meaning that observations outside a certain region are not known exactly.
For example, for insurance with a policy limit, the exact amount of claims higher than the policy limit may be
unknown. An observation of a claim at the policy limit, where all that is known is that the underlying loss was at
least as high as the policy limit, is a censored observation. Severe censoring results in biased estimates.
Dependent variables may be truncated, meaning that observations outside a certain region are not observed.
For example, if insurance has a deductible, losses below the deductible may not be reported. Every observation
is conditional on the loss being above the deductible. Generally truncation is a more serious source of bias than
censoring, since nothing is known about amounts that are truncated.
Omitting variables leads to bias. Sometimes variables must be omitted due to legal reasons; the law may prohibit
using certain factors for rating.
Omitting variables may lead to inclusion of endogenous variables in the model. An exogenous variable is a variable
‘49 whose values are specified outside the model. An endogenous variable is a variable that is a function of other variables
already in the model. Usually an endogenous variable is another variable in the model that has been lagged. Time
series usually include endogenous variables. For example, suppose we are given the time series {lit}, where t
represents a month. Then the model
Yt = Po + + et
has one endogenous variable which is the response variable lagged one month. A more complicated example
is
2. If data is mostly missing from one variable, delete the variable. This may lead to less data loss than deleting
observations with missing values.
3. Impute missing data. Fill it in based on some algorithm. However, the filled in data will have less variability
than true data.
Reading: Regression Modeling with Actuarial and Financial Applications 13.1-13.3.2, 13.6
Linear regression models assume that the response variable is normally distributed with constant variance. They
are inadequate for many situations.
For example, we often deal with "Yes" or "No" response variables. The response variable may be "Will the
insured be hospitalized within a year?" "Will the person survive one year?" To turn the response into a number,
we could let "No" be 0 and "Yes" be 1. But then we're stuck. No matter how we transform the response, it will
have only two or less possible values. No way a normal distribution with mean p and variance (32 can have just two
values.
So we give up on modeling the variable directly, and instead we model its mean. Let the response be the
probability of "Yes", which we'll denote by it. We can then model it as a normal random variable whose mean is a
linear expression of predictors. Such a model is called a linear probabilihj model. But there are many shortcomings to a:
such a model:
1. The response does not have a normal distribution. So residual analysis is meaningless.
2. The linear function may assume values less than 0 or greater than 1, which are impossible for it.
3. The variance is not constant; in fact, it is n(1 — n), a function of the mean.
In a linear model, we have a linear expression qi LJf3 jx1, where x01 = 1. This expression is the systematic tf
component of the model. The mean of the response yi is the systematic component, and yi has a normal distribution
with fixed variance a2.
In a generalized linear model, the systematic component is the same linear expression. However, rather than
setting the mean of In equal to it, a function of the mean, g(E[yi j), is set equal to qi. And yi may have any distribution
in the linear exponential family, which we will soon define. The function g(x) is called the link function, or just the •,.?
link for short. A linear regression model is a special case of a generalized linear model with a normally distributed
response and an identity link.
To repeat, for a generalized linear model:
g(E[Y]) = Epixi
Here, 0 is the parameter of interest and 0 is a scale parameter. S(y, 0) is a function of y and cp only, not of 0. y and
0 appear together only in the numerator of the first fraction. The fact that y is alone there, not raised to a power or
otherwise transformed, is what makes this family linear.
Distributions in the linear exponential family may be discrete as well as continuous. For a discrete distribution,
f(y; 0,4)) is the probability function.
Some distributions that are members of the linear exponential family are:
• binomial
• normal
• Poisson
• exponential
• gamma
• inverse Gaussian
• negative binomial
Regression Modeling with Actuarial and Financial Applications, in Table 13.8, shows how to parametrize these
distributions as linear exponential. Let's do two examples:
Y!
First step is to bring the entire function into an exponential, logging it as needed:
f (y) = exp (—A ÷ y ln A — ln y!)
Since we want y multiplied by the parameter 0, we set 0 = In A, so A = e°, and we get
f (y) =
crA,ITrr
There are two parameters, and we must choose one to transform into the parameter of interest. Our choice will be
p. Let's bring the right side into an exponential.
2 n
(Y — 02 112
f (y) = exp 2a2
ln
0.51n 2n) = exp( Y 2a2 Inc — 0.5 In 2n)
We can cancel the 2s and get yp in the numerator, so let U = p and 0 = 02. Then
y — 0.502 — 0.5y2
f (y) = exP 0.51n
— 0.51n 2n)
b(6) = 0.562 and we can stick the rest into S(y,
u2
S(y, =— — 0.5 In — 0.51n 271
Here are formulas for the mean and variance of members of the linear exponential family:
E[Y] = b' (0) (11.2)
Var(Y) = 4)b" (0) (11.3)
S=
E xi
The Tweedie distribution is a compound distribution with the number of claims having a Poisson distribution and
the claim sizes having gamma distributions; claim sizes are mutually independent and claim sizes are independent 0:
of claim counts. The Tweedie distribution is a member of the linear exponential family with Var(Y) = cp yP , and
1 <p <2. The Tweedie distribution is a mixed distribution: it has a point mass at 0 (the probability of 0 is Pr(N = 0))
and is otherwise continuous. For a Tweedie distribution, 1 <p <2. A Tweedie distribution can also be viewed as a
mixture distribution. Given that n claims occur, the Tweedie variable Y is a sum of n gamma distributions, which is
a gamma distribution. So Y is a weighted mixture of gamma distributions, with weights equal to the probabilities
of n claims, n = 0,1, 2,
Vehicle body
Sedan a
Coupe 0.7
SUV 0.9
Therefore, E[Y] . e6.51 = 671.826 The variance is cp E[Y]2 for a gamma, so the variance here is
is."\ Quiz 11-1 •-.11 For a generalized linear model of claim sizes,
(i) The gamma distribution is selected.
(ii) The link g(p) = VII is selected.
(iii) The model output is:
Variable
Intercept 18.4
Gender—male 3.1
Area
1.0
3.5
(iv) The variance of claim sizes is 3 times the square of the mean.
Calculate the variance of claim sizes for a male in Area B.
1 1.3 Estimation
The best parameters b for a generalized linear model are estimated using maximum likelihood. For a standard linear
regression model, maximum likelihood leads to the same result as least squares.
To perform maximum likelihood estimation, we log the density function f (y) and take partial derivatives with
respect to each parameter /3i. These partial derivatives are called scores.' The scores are set equal to 0. This gives us
k +1 equations in k +1 unknowns. Although they usually cannot be solved in closed form, they can be approximately
solved using iterated weighted least squares. The partial second derivatives form a matrix. The negative expected value
of this matrix is called the information matrix. Maximum likelihood estimators are consistent and asymptotically
normal, with covariance matrix equal to the inverse of the information matrix. Thus the inverse of the information
matrix is used to test goodness of fit.
Let's illustrate the solution process for simple regression. In simple regression, we're given pairs (xi, yi) and yi is
normally distributed with mean po + f3 ix; and variance a2. We want to select pp and )61 to maximize the likelihood.
To solve this, let's determine the likelihood. The density function for each yi is
( (Yi
f (yi) =
AlTrt
exp
(Po + Pixi))2)
20-2
Since a is a constant, we can ignore 1/a1/ . The likelihood function is then •41'
(Yi — Pixi))2
L(po, pi) = exp (— i•=1
2o2
'These scores have nothing to do with the scores related to principal component analysis defined on page 130.
In other words, we minimize the sum of the squared difference between In and the fitted value ho + pixi. We see
that the least squares solution is the maximum likelihood solution.
Let's maximize /(f30, hi) by differentiating with respect to ho and /31 and solving.
al E;'=i(Yi (ho Pi xi))
apo 0-2
NIP The two expressions on the right are the scores. We will set them equal to 0. To do this, we just have to set the
numerators equal to O. Rearranging, we have
11
/I
n[30 + (E xi) 13i = E 12
Y;
( E Xi) 4)hi=Exiyi
1=1
PO +
Cov(x, y)
Var x
where the last line is derived by dividing numerator and denominator of the previous line by n2, and variance and
covariance are the empirical variance and covariance.
The estimate for f30 can be backed out from the first normal equation; since (yi — (Po + Pixi)) = 0, it follows that
Po g
Now let's look at the information matrix.
d21
(3132 02
d21 E xi
19/3041 02
d2i E4
4 21 — a2
There are no ys in these expressions, so the negative expected value of each expression is just negative the value. So
the information matrix is
3=
( E xi
a2
The information matrix is the covariance matrix for the scores. And its inverse is the asymptotic covariance matrix
of the maximum likelihood estimates. The inverse of the information matrix is
0.2 ( —Exi)
n — (E xi)2- Exi ) n
and we recognize the components of the matrix as the variances and covariances of g'o and pi, as in equations (3.9),
(3.10), (3.11).
11.4 Overdispersion
The distribution used for the response determines the variance. Sometimes we need more flexibility, because the
variance of the data is greater than indicated by the model. For example, if a Poisson distribution is used, the
variance should equal the mean, but the data may indicate that the variance is greater than the mean. This is called
overdispersion. GLM estimation only requires a mean function, not a fully specified distribution, so we can arbitrarily
set the variance to be a multiple of what it would be for the specified distribution:
Var(yi) = a26/9b"(ei)
where (pi is the scale parameter for N. The extra parameter a2 allows for overdispersion. It is unnecessary for
distributions such as the normal distribution, where .1); already equals the variance.
The overdispersion parameter is estimated using the Pearson chi-square statistic divided by the number of "I
degrees of freedom:
1 j1- ,(y — E[yi])2
(11.4)
a2 =N — p Var(yi)
where N is the number of independent cells, p is the number of parameters being estimated (usually p = k + 1 in
our models), and Var(yi) is the theoretical variance of the distribution.
f (y; 0, 0) = exp
ye —:(0) S(y , op)) (11.1)
Exercises
11.1. eir [MAS-I-F19:341 You are given the following statements comparing k-fold cross validation (with k < n)
and Leave-One-Out Cross Validation (LOOCV), used on a GLM with log link and gamma error.
1. k-fold validation has a computational advantage over LOOCV
II. k-fold validation has an advantage over LOOCV in bias reduction
III. k-fold validation has an advantage over LOOCV in variance reduction
Determine which of the above statements are true.
(A) None are true (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
11.3. "41 Consider a random variable Y with the following probability density function:
y3e—YiY
f (Y) = y>0
24y4
11.4. `41 [S-F16:311 Within the context of Generalized Linear Models, suppose that y has an exponential distribution
with probability density function expressed as:
11.5. Nr For a distribution in the linear exponential family, you are given
f (y; xi , = exp
± 750 + s (y 0))
Determine Var(Y), the variance of the distribution, when u = 10.
11.6. ": [Based on S-F16:331 You are given the following two probability density functions:
(i) f(y; 6) = 0y-e-1 for y>
(ii) f (y; 6) = Oe-Y8 for y > 0 and 6 > 0
Which of these distributions are in the linear exponential family?
11.7. mIr [MAS-I-F18:27) You are given the following three functions of a random variable, y, where —00 < y <o0.
I. g(y)= 2 + 3y + 3(y — 5)2
IL g(y)=4-4y
III. g(y) = lyl
Determine which of the above could be used as link functions in a GLM.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
11.8. ¶P [MAS-I-F19:271 An actuary is asked to model a non-negative response variable and requires that the
model produce an unbiased estimate.
Determine which error structure and link function combination would be the best choice for the modelling
request.
(A) Poisson and Identity
(B) Compound Poisson-Gamma and Log
(C) Normal and Identity
fl (D) Gamma and Log
(E) Poisson and Log
11.9. ikir [MAS-I-F19:261 A number of candidate models were fit using the following variables:
• An intercept term
• Variable A—a Yes/No indicator
• Variable B—a Yes/No indicator
• An interaction of Variables A and B
There are four observations, which were arranged into the following design matrix:
0 0
(1 01
1 1 0 0
X=
1 0 1 0
1 1 1 1
III. Log
The predicted values, given below, were the same under all three models:
_ (C0.}.8500)
Y— 0.40
0.70
Determine for which of the above link functions the estimated interaction coefficient is non-zero.
Exam SRM Study Manual Exercises continue on the next page ...
Copyright ©2022 ASM
EXERCISES FOR LESSON 11 171
11.10. al For a generalized linear model for claim sizes, you are given
Response variable: Claim sizes
Response distribution: Normal
Variable Coefficient
Intercept 22
Age of driver
18-24 15
25-64 0
65 and up 13
Income group
Under 50000 12
50000-100000 0
Over 100000 —3
Calculate expected claim sizes for a 20-year old driver earning 40,000.
Variable Estimated /3
Intercept —2.45
Gender—female —0.85
Income (000) —0.04
Age 0.10
Calculate expected number of claims for a 30-year old male with income of 100,000.
11.12. q (S-F15:341 You are given the following information for a model of vehicle claim counts by policy:
(i) The response distribution is Poisson and the model has a log link function.
(ii) The model uses two caegorical explanatory variables: Number of Youthful Drivers and Number of Adult
Drivers.
Degrees of
Parameter Freedom 13
Intercept 1 —2.663
Number of Youthful Drivers
0
1 1 0.132
Number of Adult Drivers
1
2 1 —0.031
Calculate the predicted claim count for a policy with one adult driver and one youthful driver.
(A) Less than 0.072
(B) At least 0.072, but less than 0.074
(C) At least 0.074, but less than 0.076
(D) At least 0.076, but less than 0.078
(E) At least 0.078
11.13. •41` [S-F16:32} You are given the following GLM output:
Response variable Pure Premium
Response distribution Gamma
Link log
Parameter df 13
Intercept 1 4.78
Risk Group 2
Group 1 0 0.00
Group 2 1 —0.20
Group 3 1 —0.35
Vehicle Symbol 1
Symbol 1 0 0.00
Symbol 2 1 0.42
Calculate the predicted pure premium for an insured in Risk Group 2 with Vehicle Symbol 2.
(A) Less than 135
(13) At least 135, but less than 140
(C) At least 140, but less than 145
(D) At least 145, but less than 150
(E) At least 150
Intercept 1 5.26
Risk Group 2
Group 1 0 0.00
Group 2 1 0.18
Group 3 1 0.37
Territory Code 2
Region 1 0 0.00
Region 2 1 0.12
Region 3 1 0.25
Calculate the predicted pure premium for an insured in Risk Group 3 from Region 2.
(A) Less than 250
(B) At least 250, but less than 275
(C) At least 275, but less than 300
(D) At least 300, but less than 325
(E) At least 325
l
f
11.15. L You have estimated the following generalized linear model:
You used a gamma distribution as the response distribution, and the link function you used is g(p) = 1/ p.
Determine the mean of the response when x1 = 5 and x2 = 6.
11.16. N. [MAS-I-S18:24] You are given the following output from a model constructed to predict the probability
that a Homeowner's policy will retain into the next policy term:
Intercept 1 0.6102
Tenure
<5 years 0 0.0000
5 years 1 0.1320
Let it be the probability that a policy with 4 years of tenure that experienced a 12% prior rate increase and has
225,000 in amount of insurance will retain into the next policy term.
Calculate the value of 71.
11.17. Nr [S-S16:31] You are given the following information for a fitted GLM:
Response variable Claim size
Response distribution Gamma
Link Log
Dispersion parameter 1
Parameter df
Intercept 1 2.100
Zone 4
1 1 7.678
2 1 4.227
3 1 1.336
4 0 0.000
5 1 1.734
Vehicle Class 6
Convertible 1 1.200
Coupe 1 1.300
Sedan 0 0.000
Truck 1 1.406
Minivan 1 1.875
Stationwagon 1 2.000
Utility 1 2.500
Driver Age 2
Youth 1 2.000
Calculate the predicted claim size for an observation from Zone 3, with Vehicle Class Truck and Driver Age Old.
(A) Less than 650
(B) At least 650, but less than 700
(C) At least 700, but less than 750
(D) At least 750, but less than 800
(E) At least 800
11.18. [SRM Sample Question #4.5] The actuarial student committee of a large firm has collected data on exam
scores. A generalized linear model where the target is the exam score on a 0-10 scale is constructed using a log link, N--)
resulting in the following estimated coefficients
Predictor Variables Coefficient
Intercept —0.1
The company is about to offer a job to Patricia, who is a female with a Master's degree. It would like to offer her
half of the study time that will result in an expected exam score of 6.0.
Calculate the amount of study time that the company should offer Patricia.
(A) 123 hours (B) 126 hours (C) 129 hours (D) 132 hours (E) 135 hours
11.19. *I? You have estimated the following generalized linear model:
= 0.8 + 0.6xii + 1.4x12
11.20. 'v. [S-S17:29] You are given the following GLM output:
Response variable Pure Premium
Response distribution Gamma
Link log
Scale parameter (0) 1
Parameter df
Intercept 1 3.25
Risk Group 2
Group 1 0 0.00
Group 2 1 0.30
Group 3 1 0.40
Vehicle Symbol 1
Symbol 1 0 0.00
Symbol 2 1 0.45
Calculate the variance of the pure premium for an insured in Risk Group 3 with Vehicle Symbol Group 2.
(A) Less than 3,000
(13) At least 3,000, but less than 4,000
(C) At least 4,000, but less than 5,000
(D) At least 5,000, but less than 6,000
(E) At least 6,000
11.21. "-st. [S-F16:381 You are given the following probability density function for a single random variable, X:
e
11/2 e(x — 1)2 \
f (x I 0) =
(27/X3) exp ( 2x )
Consider the following statements:
I. f (x) is a member of the linear exponential family of distributions.
II. The score function Me) is
1
me). 20 2x
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) ,(B) , (C) , or (D) .
11.22. "1:1 [S-S16:321 You are given the following information for a fitted GLM:
Response variable Claim size
Response distribution Gamma
Link Log
Scale parameter (0) 1
Parameter df S
Intercept 1 2.100
Zone 4
1 1 7.678
2 1 4.227
3 1 1.336
4 0 0.000
5 1 1.734
Vehicle Class 6
Convertible 1 1.200
Coupe 1 1.300
Sedan 0 0.000
Truck 1 1.406
Minivan 1 1.875
Stationwagon 1 2.000
Utility 1 2.500
Driver Age 2
Youth 1 2.000
Middle age 0 0.000
Old 1 1.800
Calculate the variance of a claim size for an observation from Zone 4, with Vehicle Class Sedan and Driver Age
Middle age.
(A) Less than 55
(B) At least 55, but less than 60
(C) At least 60, but less than 65
(D) At least 65, but less than 70
(E) At least 70
11.23. s: A generalized linear model with an inverse Gaussian response uses the link g(p) = 1/112. The model
has one explanatory variable x and an intercept. The estimated parameters are b = (0.0135,0.0582)'.
Let p1 be the mean of the response when x = 10.
Let 1./ 2 be the mean of the response when x = 11.
Determine p2 — pi •
(y) = exp
(yt9 — b(6) + S(y,,
0))
(ii) b(19) =
(iii) 0 = —0.3
(iv) = 1.6
Calculate E[Y].
(A) Less than —1
(B) At least —1, but less than 0
(C) At least 0, but less than 1
(D) At least 1, but less than 2
(E) At least 2
11.25. Nr [Version of S-F15321 A GLM is used to model claim size. You are given the following information about
the model:
Variable b
(Intercept) 2.32
Location—Urban 0.00
Location—Rural —0.64
Gender—Female 0.00
Gender—Male 0.76
Calculate the variance of the predicted claim size for a rural male.
(A) Less than 25
(B) At least 25, but less than 100
(C) At least 100, but less than 175
(D) At least 175, but less than 250
(E) At least 250
11.26. Nr [MAS-I-S18:25] Three separate GLMs are fit using the following model form:
+ pixi+p2x2
The following error distributions were used for the three GLMs. Each model also used their canonical link functions:
Model 1: gamma
Model II: Poisson
Model III: binomial
When fit to the data, all three models resulted in the same parameter estimates:
Po 2.0
Pi 1.0
P2 —1.0
Determine the correct ordering of the models' predicted values at observed point (Xi, X2) = (2,1).
(A) I < II < III (B) I < III < II (C) II < I < III (D) II < III <I
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
11.27. `-ir [MAS-I-S18:39] A GLM was used to estimate the expected losses per customer across gender and
territory. The following information is provided:
(i) The link function selected is log.
(ii) Q is the base level for Territory.
(iii) Male is the base level for Gender.
(iv) Interaction terms are included in the model.
The GLM produced the following predicted values for expected loss per customer:
Territory
Q R
Male 148 545
d Female 446 4,024
Calculate the estimated beta for the interaction of Territory R and Female.
(A) Less than 0.85
(B) At least 0.85, but less than 0.95
(C) At least 0.95, but less than 1.05
(D) At least 1.05, but less than 1.15
(E) At least 1.15
11.28. %. [S-F17:32] Given a family of distributions where the variance is related to the mean through a power
function:
Var(Y) = a E[Yr
One can characterize members of the exponential family of distributions using this formula.
You are given the following statements on the value of p for a given distribution:
I. Normal (Gaussian) distribution, p = 0
II. Compound Poisson-gamma distribution, I<p<2
III. Inverse Gaussian distribution, p = —1
Determine which of the above statements are correct.
(A) I only (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) or (D) .
11.29. %I' IMAS-I-S18:401 Art actuary fits a Poisson distribution to a sample of data,
-o
f (x) = x!
To assure convergence of the maximum likelihood fitting procedure, the actuary plots three quantities of interest
across different values of B.
0 0
Determine which of the three plots the actuary can use to visually approximate the maximum likelihood estimate
for O.
(A) None can be used (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Solutions
11.1. This question is based on Lesson 6, but was placed here because of its reference to GLMs.
LOOCV is n-fold validation, k-fold validation requires less computation since it is only creating k validation
sets, whereas LOOCV creates n validation sets. The only exception to this is for a standard linear regression,
where LOOCV has a simple formula, but the question specifies that this is a GLM, not a standard linear regression.
Increasing k increases variance but decreases bias; thus k-fold validation has less variance than LOOCV. (C)
11.2. As usual, write f (y) as the exponential of something.
The parameter of interest is 0 = —1/y, and then 41n y = —4 ln(-8), so b(0) = —41n(-0) , or a multiple thereof.
11.4. Whether in the context of Generalized Linear Models or not, the variance of an exponential random variable
is the square of its mean. Here the mean is p, as you can verify in the distribution tables, so the variance is p2 . (D)
But if you wanted to do it using the GLM formula for variance, you can set 0 = —1/p and note that cp = 1. Since
exponential is a special case of gamma, and for a gamma Var(Y) = 0E1112, the result follows.
11.5. To get it in the form of (11.1), let B = —25/u2, so u = 5/1r-7. Then
f (y; ,
exp (ye + 10/7) + S(y ,, 4)))
We see that oP = 1 and b(e) = , so
11.11. Gender is a categorical variable with male as the base level, so nothing is added to the linear expression for
gender.
g(y) = -2.45 - 0.04(100) + 0.10(30) = -3.45
Since g(p) = in p, we have p = e-3'45 = 0.031746 . We can then calculate the probability of n claims using the
Poisson distribution. For example, the probability of 1 claim is 0.031746e-0.031746= 0.030754.
11.12. For the log link,
In p = = -2.663 + 0.132 = -2.531
where the base class for adult drivers is 1 so the value of that variable is 0. Then p = e-2.531 = 0.07958 (E)
11.13. The systematic component is g(p) = 4.78 - 0.20 + 0.42 = 5. The link is g(p) = in p, so p = es = 148.41 (D)
11.14. The systematic component is 5.26 + 0.37 + 0.12 = 5.75. This is the logarithm of pure premium, so pure
premium is e5'75 = 314.19 . (D)
11.15. g(p) = 2 + 3(5) + 4(6) = 41, so p = 1/41
11.16. For a binomial response, the predicted value fi is the probability as well as the mean of the distribution.
Here, the systematic component g(n) = -Fr is
g(n) = VTE = 0.6102 + 0 - 0.0920 + 225(0.0015) = 0.8557
so fr = 0.85572 = 0.7322. (C)
11.17. The systematic component for 3-Truck-Old is 2.1 + 1.336 + 1.406 + 1.8 = 6.642. With the log link, the result
is e6.642 = 766.63 . (D) The dispersion parameter only affects the variance and is not relevant for this question.
11.18. The linear component, with x being study time in hundreds of hours, is -0.1+0.5x +0.5-0.1+0.2 = 0.5+0.5x.
We want e°31-°-5x = 6.
Since we offer half the study time, we multiply x by 0.5, and also by 100 to get study time in hours. The result is
129.18 . (C)
11.19. g(p) = 0.8 + 0.6(2) + 1.4(1) = 3.4
Therefore, p = 3.42 = 11.56. The variance is 0.5p3 = (0.5)(11.50) = 772.4022
11.21.
0(x— 1)2
ln f (x Ff9)= 0.51n (9— 0.51n(2nx3) 2x
and we see that in the term that combines x and 0, the last term, there is not just x, so f (x) is not a member of
the linear exponential family.X
The score function is the derivative of the logarithm with respect to 0, or
1 (x
u(e).
20
1)2 t/
2x
The information matrix 1(0) is a 1 x 1 matrix (since there's only one variable). It is negative the expected value of
the second derivative. The second derivative is —1/(202). Since X drops out, the second derivative is constant
with expected value —1/(202). Negating we get 1(0) = 1/(2(92), not 202.X
(B)
11.22. The systematic component for 4-Sedan-Middle age is 2.100 +0 + 0 + 0 = 2.100, so the mean is p = e2.100 =
8.16617. For a gamma, the variance is cpp2, so the variance in our case is 8.166172 = 66.6863 . (0)
11.23. The model is
1
= 0.0135 + 0.0582x
11
Or
1
0.0135 + 0.0582x
11.26. g(Y) = 2.0 + 1.0(2) — 1.0(1) = 3. Use the inverses of the canonical link functions, the mean functions. For
gamma, g(x) = —1/x so the g-1(3) = —1/3. For Poisson, g(x) = In x so g-1(3) = e3 = 20.086. For binomial,
g(x) = In (x /(1 — x)) so g-1(3) = = 0.9526. (B)
Let pi be the coefficient of territory, 132 the coefficient of gender, and p3 the coefficient of interaction. Then
Po= 4.9972
Po + = 6.3008
Po + 132 = 6.1003
Po + 131 +P2+133 = 8.3000
Then the estimate for Pi is 6.3008 — 4.9972 and the estimate for 132 is 6.1003 — 4.9972. Adding these to Po we get
Po + 01+132 = 6.3008 + 6.1003 — 4.9972 = 7.4039. Therefore the interaction beta,133, must be 8.3000— 7.4039 = 0.8961
(B)
11.28. As discussed in this manual p = 3 for inverse Gaussian. But I and II are true. (B)
11.29. The score function is 0 at the maximum likelihood estimate for 19 and the deviance is minimized. However,
information is just the reciprocal of the variance, and does not indicate the maximum likelihood estimate. (B)
Quiz Solutions
p = 22.52 = 506.25
The variance is 3(506.252) = 768,867
Reading: Regression Modeling with Actuarial and Financial Applications 11.1-11.2, 11.4-11.6
One of the most important uses of a generalized linear model is to model categorical responses.
Categorical responses are of three types:
1. Binomial: there are two categories. These are "Yes/No" variables: Is the drug effective? Will the policyholder
submit a claim? Will a student pass SRM?
2. Nominal: there are multiple categories with no particular order. For example: What type of vehicle will a
person buy? Which investment will perform best?
3. Ordinal: there are multiple categories that follow a logical order. For example: How severe was an accident?
(Property damage only, Bodily injury but no fatality, Fatality)
Sometimes the categories are called levels.
In the following discussion, let -r7 be the systematic component Vi=0 piXi.
If the response variable Y has only two possible values, then the two values are coded as 0 and 1, and the expected
value of Y is the probability of the value coded as 1. Let 77 be that probability. it must be between 0 and 1. However,
the systematic component ri can be any real number. Therefore, the link g(ni) =ij must be a function going from
the interval [0,11 to (-co, co). The inverse of the link, g-1(1ii) = rig, takes any real number to the interval [0,11.
The most popular link, which is also the canonical link, is the logit link:
(12.1)
The function g(n) is called the logit function, and we will write logit(n). When this link is used, we say that we've •
performed logistic regression.
Two other links that are commonly used are
• The probit link, which is the inverse of the standard normal distribution function: •41
0.9 —
0.8 —
0.7 —
0.6 —
0.5 —
0.4 —
0.3 —
0.2
0.1 —
-5 -4 -2 -1 0 1
The ratio o = n/(1 n) is the odds ratio. The odds ratio is defined as follows: the odds of an event are o if and
only if the probability of the event 77 is o/(1 o). For example, when we say the odds are 2 to 1, or in other words
o = 2/1 = 2, we are saying the probability is 2/3. In a gambling context, if odds of an event are 2 to 1, then in a fair
gamble the gambler wins 1 when the event occurs and loses 2 when it doesn't. In the logistic model, the systematic
component is the logarithm of the odds. The logit is the log of the odds ratio. Given odds of o, the probability is
o/(1 o).
For the probit link, the inverse of the link is the cumulative distribution function of the standard normal
distribution:
01) = VT1) (12.5)
For the complementary log-log link, the inverse is the extreme value distribution:
71 = 1 — (12.6)
where E has a certain distribution. When y" is less than or equal to a threshold, which we set equal to 0, we observe
0; when y* is greater than 0, we observe 1. Thus ri(rii) is the probability that y; is positive. The higher qi is, the
more likely that y; is positive.
For example, if we assume Ei has a logistic distribution with distribution function:
1
F(x) = 1 + e_x
then the density function is
e-x
(1 + e-x)2
Then
1
Pr(yi = 1 I xi) = Pr(y; > 0 I xi) = Pr(ei > -rp) =1
0.9 —
0.8 —
0.7 —
0.6 —
0.5 —
0.4 —
0.3 —
0.2 —
0.1 —
o 11
—5 —4 —3 —2 —1 0 1 2 3 4 5
0.9 —
0.8 —
0.7 —
0.6 —
0.5 —
0.4 —
0-3 —
0.2 —
0.1 —
0 I I 1 I 11
—5 _
—3 —2 —1 0 4 5
which is the mean function for the logit link that we developed above.
For the density function of the logistic distribution, if you replace x with —x, you get
ex
=
(1 + ex)2
and dividing the numerator and denominator by e 2x, this equals -(ee-ixt )r, the same as f (x), so the density is symmetric
around 0, as we mentioned above.
If we assume E has a standard normal distribution, then the threshold calculation gives:
pi = in 01 — In 02 = in 6'1
02
e' = °1
02
So di is the ratio of the odds. For example, if = 1, then the odds of an event when xi = 1 are e times the odds
when xi = 0.
EXAMPLE 12A The following logistic model for the probability of passing Exam SRM is fitted:
Response variable: Probability of passing Exam SRM
Response distribution: Binomial
Link: Logit
Parameter
Intercept —1.3
Familiar with R
Yes 1.0
No 0.0
5 or less —2.5
6 0.0
7 1.2
8 or more 1.6
Calculate the odds of passing Exam SRM for someone who is not familiar with R, has studied the ASM manual
for 150 hours, and who passed Exam STAM with a 7. •
The odds are elA = 4.0552 . The question didn't ask for the probability of passing, but in case you are interested, it
is 4.0552/5.0552 = 0.80218.
EXAMPLE 12B Ler In the previous example, calculate the probability of passing if the probit link is used. •
SOLUTION: The systematic component was calculated in the previous example and was found to equal 1.4. With the
probit link,
it (1)(1.4) = 0.9192
If a model were fitted for this link, the coefficients would vary by link and the result for probability might not be so
different. 0
I expect few if any exam questions on nominal response and ordinal response models. An Introduction to Statistical
Learning says that these models are rarely used. The CAS has not asked any questions on these models since they
added GLM to their syllabus in 2015. And none of the SRM sample questions ask about these models. If you are
in a hurry, you may skip the rest of this lesson.
We've so far talked about response variables having one of two possible values. Now let's talk about response
variables having one of c > 2 values. In the generalized logit model, we select one category as the base category or al
reference category. Let's say it's category c. For each of the other categories, the model has one equation for the If
relative odds. The odds of category j relative to category i is the probability of category j divided by the probability k:
of category i. The nominal logistic model has an equation for the logarithm of the odds of each category j relative es:
to the reference category:
fl ln .jTr '
ne i=0
Epux, J.-, 1,2,3, ... , c — 1
Let rij = 011x1. Then ni = ncert j , and the probabilities must sum up to 1, so
1
nc
eqj
j — 1, 2,3,...,c — 1
—
1± ei
EXAMPLE 12C k? You use a logistic model to predict the color of the car covered by an insurance policy, as a function
of certain characteristics of its owner. The color may be white, silver, or red. White is the reference category. The
iftted model has the following coefficients:
Parameter Silver /3 Redi3
Sex
Male 0 0
Female —1.2 0.8
Income
Calculate the probability that a policy on a female with 40,000 income covers a white car. •
SourrioN: Let white be category 3. We have to calculate the relative odds of silver (category 1) and red (category
2) cars first.
ln = -1.2 - 0.5 = -1.7
n3.
1
0.353182
7r3 = 1 ± e-1.7 e0.5
Quiz 12-1 slir A student entering college may ultimately become a physician (category 1), an actuary (category
2), or a professor (category 3). To calculate the probabilities of these three careers, a generalized logit model is
constructed with intercepts only. Category 3 is the base category and pa = 1.3 for category 1,0.8 for category 2.
Calculate the relative odds of the student becoming an actuary versus becoming a physician.
We are now going to provide an interpretation for Al in a generalized logit model. Suppose there is a binary
•-.11 explanatory variable x. Then for that variable, the odds ratio of category j to the base category is the ratio of the
probability of category j when x = 1 (the explanatory variable is present) over the probability of category j when
x = 0 (the explanatory variable is absent), divided by the same ratio for the base category:
rtiP
ln = Poi + pi) Jefixi
i=2
The odds ratio is e1311. hi other words, pi; is the logarithm of the odds ratio. The odds ratio does not vary with the
values of the other explanatory variables.
The generalized logit model has a nested structure. If we condition on yi #1, for example, the conditional value
of yi follows a generalized logit. If we condition on yi equalling one of two values, the conditional model is the
logistic model we studied for binary variables:
Pr(y; =a IY = a or yi = b) =
Oa I (1+ eqi)
(e,747 eb) (1 j_ vc-1 z,71;)
eq.
+
eria-rm
1 + erin-qt,
Compare this to equation (12.4). If we treat a as 1 and b as 0, this probability is a logistic model with ps equal to the
excess of the ps for value a minus the ps for value b.
Quiz 12-2 A student entering college may ultimately become a physician (category 1), an actuary (category
2), or a professor (category 3). To calculate the probabilities of these three careers, a generalized logit model is
constructed with intercepts only. Category 3 is the base category and f30 = 1.3 for category 1,0.8 for category 2.
Calculate the probability that the student becomes a professor, given that she didn't become a physician.
The multinomial logit model generalizes the generalized logit model. In the generalized logit model, f3 varies '11
by alternative for y but the explanatory variables don't vary. In the multinomial logit model, f3 does not vary by
alternative but the explanatory variables do.
In the logit models, the relative odds of alternative 2 versus alternative 1, Pr(yi = 2)/Pr(yi = 1), equals
Pr(yi = 2) erii2
Pr(yi = 1) —
This ratio does not vary by qij for j 1,2. This property is known as independence of irrelevant alternatives. In some 11-:
cases this property is not desirable.
One way to model an ordinal response variable is the cumulative logit model. This model estimates the cumulative s:
odds of each category. The cumulative odds of category in are
1 — Pr(Y in) 1— ni + • • • + nc
Thus
" rtm
In PimXi
7Im+1+ - - - + 7Ic
i=0
No equation is needed or possible for m c, since the cumulative probability of category c is 1 and thus the
cumulative odds are infinite. Therefore, such a model would have (k 1)(c —1) parameters. In the proportional odds
model, all parameters except for the intercept are assumed to be the same for all categories. In other words
771 + • • • + 17m
In —
PiXi
nm+1 + • • • + ne
1=1
does not vary by category in for i > 0; the categories differ only through km. The reason this model is called
the proportional odds model is that the cumulative odds of Y < m are
If we fix the category but consider two sets of values of the variables Xi, say Xil. and X12, the relative odds of Y
given each of the two sets of values are
dow+V., ii
efiwn+VLI pixi2
This quotient is independent of the category in; the letter in cancels out in the quotient. The level of the odds
depends on the category, but the ratio of the cumulative odds for different inputs does not vary by category.
EXAMPLE 12D Type of accident (1=property damage, 2=bodily injury, 3=fatality) is modeled using a cumulative
proportional odds model. The explanatory variable is type of vehicle. You are given the following based on the
model:
SOLUTION: For a car, the odds of property damage is 0.3/0.7 = 3/7 and the odds of property damage or bodily
injury is 0.4/0.6 = 4/6. For a truck, the odds of property damage is 0.45/0.55 = 9/11. The relative odds of property
4/6 14
damage or bodily injury to property damage for a truck are the same as for a car, 7 = 7 . So the odds of property
damage or bodily injury is
9 (14) 14 _
Ti 9 11
77
14/11 0.56
1 -1- 14/11
Odds are not additive. Suppose events 1 and 2 are mutually exclusive. If the odds of event 1 are oi and the odds
of event 2 are 02, it does not follow that the odds of event 1 or 2 are 01 + 02; in fact, unless one of the odds is 0, you
can be sure the odds are not 01 + 02. So in a cumulative proportional odds model, if you have the cumulative odds of
event 1 and the cumulative odds of event 2 (which means the odds of events 1 or 2), you cannot calculate the odds
of event 2 by subtracting the former from the latter. You must convert them to probabilities. We saw how to convert
odds to probabilities in the previous example. In general, to convert odds to probabilities, since
71
o =
1—n
Quiz 12-3 The odds of event 1 are 0.5 and the odds of events 1 or 2 are 2. The two events are mutually
exclusive.
Calculate the odds of event 2.
It is possible to use probit and complementary log-log links instead of logit for the cumulative ordinal model.
rc ell
Logit i=11-1 rz -
7 1 + el
Probit 77 _ 0-1(n) TI = cD(11)
Complementary log-log r! = ln(-1n(1 - it)) it = 1 - exp(- exp(ri))
Logistic model for nominal response variable
In =
EpijXi j=1,2,...,c -1
i=o
1
nc =
1 + E=1 eqi
eqi
n = j 1,2,3,...,c -1
1
In
Tri÷i + • • • + ne 1=0
If 13 does not vary by category except for intercept, then the model is the proportional odds model:
rti + • • • +
In Poi +
hTc
Ti'
=1.
eP°0-4113'xi
1 - EL ni
Exercises
Binary response
12.1. %. [SRM Sample Question #42] Determine which of the following statements is NOT true about the linear
probability, logistic, and probit regression models for binary dependent variables.
(A) The three major drawbacks of the linear probability model are poor fitted values, heteroscedasticity, and
meaningless residual analysis.
(B) The logistic and probit regression models aim to circumvent the drawbacks of linear probability models.
(C) The logit function is given by 71(z) = ez /(1 + ez).
(D) The probit function is given by m(z) = (1)(z), where (1) is the standard normal distribution function.
(E) The logit and probit functions are substantially different.
l
f
12.2. 41-1° [SRM Sample Question #14] From an investigation of the residuals of fitting a linear regression by
ordinary least squares it is clear that the spread of the residuals increases as the predicted values increase. Observed
values of the dependent variable range from 0 to 100.
Determine which of the following statements is/are true with regard to transforming the dependent variable to
make the variance of the residuals more constant.
I. Taking the logarithm of one plus the value of the dependent variable may make the variance of the residuals
more constant.
II. A square root transformation may make the variance of the residuals more constant.
IlL A logit transformation may make the variance of the residuals more constant.
(A) None of I, II, or III is true
(B) I and II only
(C) I and III only
(D) II and III only
(E) The answer is not given by (A), (B), (C), or (D)
12.3. tr [SRM Sample Question #7] Determine which of the following pairs of distribution and link function is
the most appropriate to model if a person is hospitalized or not.
(A) Normal distribution, identity link function
(B) Normal distribution, logit link function
(C) Binomial distribution, linear link function
(D) Binomial distribution, logit link function
(E) It cannot be determined from the information given.
12.4. [SRM Sample Question #20] An analyst is modeling the probability of a certain phenomenon occurring.
The analyst has observed that the simple linear model currently in use results in predicted values less than zero and
greater than one.
Determine which of the following is the most appropriate way to address this issue.
(A) Limit the data to observations that are expected to result in predicted values between 0 and 1.
(B) Consider predicted values below 0 as 0 and values above 1 as 1.
(C) Use a logit function to transform the linear model into only predicting values between 0 and 1.
(ID) Use the canonical link function for the Poisson distribution to transform the linear model into only predicting
values between 0 and 1.
(E) None of the above.
12.5. 4: [S-S16:30] You are given the following information for a fitted GLM:
Response variable Occurrence of Accidents
Response distribution Binomial
Link Logit
Parameter df if se
Area 2
Suburban 0 0.000
Urban 1 0.905 0.062
Rural 1 —1.129 0.151
12.6. al' For a binary response, there is an underlying linear model. The variable in the underlying linear model
has the following density function:
e-lx1
f (x) = 2
-00 <X <00
Determine the link function g(it) based on the threshold interpretation of the model.
Parameter
Intercept —1.485
Vehicle Body
Coupe —0.881
Roadster —1.047
Sedan —1.175
Station wagon —1.083
Truck —1.118
Utility —1.330
Driver's Gender
Male —0.025
Area
0.094
0.037
—0.101
n12.9. •Ir [MAS-I-F18:261 You are given the following output from a model constructed to predict the probability
that a Homeowner's policy will retain into the next policy term:
Intercept 1 0.4270
Tenure
<5 years 0 0.0000
5 years 1 0.1320
Let ft be the modeled probability that a policy with 4 years of tenure that experienced a +12% prior rate change
and has 225,000 in amount of insurance will be retained into the next policy term.
Calculate the value of ft.
12.10. •-: [S-S16:33] You are given the following information for a GLM of customer retention:
Response variable Retention
Response distribution Binomial
Link Logit
Parameter df 13
Intercept 1 1.530
Number of Drivers 1
1 0 0.000
>1 1 0.735
Calculate the probability of retention for a policy with 3 drivers and a prior rate change of 5%.
(A) Less than 0.850
(B) At least 0.850, but less than 0.870
(C) At least 0.870, but less than 0.890
(D) At least 0.890, but less than 0.910
(E) At least 0.910
12.11.
•kir [MAS-I-F18:28] In a study 100 subjects were asked to choose one of three election candidates (A, B, or N..._.,
C). The subjects were organized into four age categories: (18-30, 31-45, 45-61, 61+).
A logistic regression was fitted to the subject responses to predict their preferred candidate, with age group (18-
30) and Candidate A as the reference categories.
For age group (18-30), the log-odds for preference of Candidate B and Candidate C were —0.535 and —1.489
respectively.
Calculate the modeled probability of someone from age group (18-30) preferring Candidate B.
(A) Less than 20%
(B) At least 20%, but less than 40%
(C) At least 40%, but less than 60%
(D) At least 60%, but less than 80%
(E) At least 80%
12.12.
`41' [MAS-I-819:27] A statistician uses a logistic model to predict the probability of success, it, of a binomial
random variable.
12.13. '146 [MAS-1-F19:25] A bank uses a logistic model to estimate the probability of clients defaulting on a loan,
and it comes up with the following parameter estimates:
Variable Pi
0 Intercept —1.6790
1 Income(in 000's) —0.0294
2 Student [Yes] —0.3870
3 Number of credit cards 0.7710
The following four clients applied for loans from the bank:
1 25,000 Y 1
2 10,000 Y 3
3 20,000 N 0
4 75,000 N 3
The bank will reject any loan if the probability of default is greater than 10%.
Calculate the number of clients whose loan requests are rejected.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
12.14. s: A binary response is modeled with a generalized linear model and a logit link. The fitted model is
g (n) 3.56 + 0.42xi
12.15. 'Lir [MAS-I-S18:27] You are given the following information about an insurance policy:
(i) The probability of a policy renewal, p(X), follows a logistic model with an intercept and one explanatory
variable.
(ii) Po = 5
(iii) 131 = —0.65
Calculate the odds of renewal at x = 5.
12.16. [S-S16:291 You are given the following information for a fitted GLM:
Intercept 1
Driver's Age 2
1 1 0.288
2 1 0.064
3 0 0
Area 2
A 1 —0.036
1 0.053
0 0
Vehicle Body 2
Bus 1 1.136
Other 1 —0.371
Sedan 0 0
The probability of a driver in age group 2, from area C and with vehicle body type Other, having an accident is
0.22.
Calculate the odds ratio of the driver in age group 3, from area C and with vehicle body type Sedan having an
accident.
12.17. NI' You are given the following information about an insurance policy:
(i) The probability of a policy renewal is modeled using a generalized linear model with a probit link.
(ii) The model has an intercept and one explanatory variable.
(iii) Po = —3
(iv) 131 = 0.8
Calculate the odds of renewal at x = 2.
12.19. ev? You are modeling the probability that a policy will be renewed. You use a generalized linear model and
a logit link. The output of the mode! is:
Variable Coefficient
Intercept 0.22
Number of years policy is in force 0.15
Number of claims submitted —0.25
Age of policyholder
Under 30 —0.30
30-49 0
50 and up 0.18
Calculate the predicted probability that a 55-year old policyholder for whom the policy was in force for 5 years
and who has not submitted any claims will renew the policy.
12.20. •41 A binary response is modeled with a generalized linear model and a logit link. The fitted model is
g (T) = 0.24 + 0.57xi
Calculate the amount by which the probability of the event increases if X1 is increased from 2 to 3.
12.21. "41i A binary response is modeled with a generalized linear model and a probit link. The fitted model is
g(n) = 0.2 + 0.4x1 + 0.6x2
12.22. A binary response is modeled with a generalized linear model and a complementary log-log link. The
fitted model is
g(n) = 0.02 + 0.04x1 + 0.06x2
Determine the odds of an event when xi = 1 and x2 = 2.
12.23. A binary response is modeled with a generalized linear model and a complementary log-log link:
Nominal response
12.24. '41 A nominal logit model is used to predict the political party of a person. The base category is "Demo-
cratic". The fitted model is
Income Level
Under 50,000 0 0 0
50,000-99,999 0.51 0.35 0.45
100,000 or more 0.37 0.52 0.22
Calculate the odds, relative to the base category, that a person who completed college but not graduate school
and has an income of 100,000 or more is Republican.
Exam SRM Study Manual Exercises continue on the next page ...
Copyright ©2022 ASM
EXERCISES FOR LESSON 12 205
(--, Use the following information for questions 12.25 and 12.26:
A nominal logistic model is used to predict the type of car one buys. The base category of car is "sedan".
The other categories are "van" and "SUV". The explanatory variables are "gender" and "age group". The fitted
coefficients are:
Gender
Male 0 0
Female —0.18 —0.06
Age group
Under 25 —0.11 0.18
25-54 0 0
55 and up 0.06 0.04
12.26. ..11 Calculate the probability that a male age 20 buys an SUV.
12.27.
•-it For a nominal response variable with 3 categories, a logistic model with 1 explanatory variable and an
intercept is fitted:
g(n1) Poi +
l
f The estimated parameters for the two non-base categories are
2. Beach
3. Visiting family
Vacation category is modeled as a function of two explanatory variables: number of family members and length
of vacation. A nominal logit model is constructed. Category 3 is the reference category. The following are the fitted
coefficients:
Vacation length
1 week or less —0.23 0.04
8 days-2 weeks 0.13 0.08
More than 2 weeks 0.32 —0.05
Calculate the predicted probability that a 2-week vacation for a family of 4 is active.
12.29. A nominal generalized logit model is used to model income level as a function of education. Income
level of 50,000-99,999 is the reference category. The fitted model is:
Calculate the probability that a person who completed graduate school earns 100,000 or more.
n Ordinal response
12.30. al [MAS-II-F19:25] A book of 122 commerical policies are observed for one year. The observed claim
counts distribution is shown below:
0 45
1 32
2 27
3 10
4 7
5 1
j Category
1 Low risk
2 Medium risk
3 High risk
A cumulative logit model results in the following coefficients for the explanatory variables:
Category Low Risk Medium Risk
Gender
Male 0 0
Female 0.75 0.80
Age
Under 25 —1.80 —0.65
25-44 0 0
45-64 1.00 1.47
65 and over 0.23 —0.12
Calculate the probability that a male driver age 65 or over is a medium risk.
j Degree of crash
1 Non-casualty
2 Injury
3 Fatal
In
Ei<j ni
— bcv + 0.056x1 + 0.082x2
— 24<i 77i
The fitted values of the intercept are boi = 0.4, b02 = 0.6.
Xi is a binary variable.
12.32. '41 Calculate the relative odds of an injury or non-casualty for someone with x1 = 1 relative to someone
with xi = 0.
12.34. NI° In a cumulative proportional odds model for an ordinal variable, the fitted model is
In = boi bixi
1 — Vi=1 iti
and brrii =0.05] for) 1,2,3,4.
For xi = 1, the odds that the response is 1 relative to xi = 0 are 1.5.
Calculate the odds that the response is 1 or 2 for xl= 1 relative to xi = 0.
Use the following information for questions 12.35 and 12.36:
A cumulative proportional odds model is used for a 4-category ordinal response variable. There is one
explanatory variable. Therefore, the form of the model is
12.37. evf A survey is done among drivers to determine the importance of rear window wipers on cars. The
response variable has the following values:
j Importance
1 Not important
2 Important
3 Very important
A cumulative proportional odds model is used to predict the response. The model has an intercept and one
explanatory variable: number of inches of rain per year in the driver's region. The fitted coefficients are
I Category
1 Low risk
2 Medium risk
3 High risk
A cumulative proportional odds model estimates category using several explanatory variables. One of the
explanatory variables is city.
For two insureds, values of all explanatory variables except for city are identical. The values of city, along with
the probabilities of the three categories, are as follows:
Category 1 2 3
New York 0.724 0.220 0.056
Chicago 0.698
Exam SRM Study Manual Exercises continue on the next page ...
Copyright ©2022 ASM
210 12. GENERALIZED LINEAR MODEL: CATEGORICAL RESPONSE
Car size
Compact 0
Midsize 0.35
Full size 0.71
Speed of car
Under 30 mph
30-59 mph -0.12
60 mph and over -0.64
Calculate the probability of a fatality for a driver in a full size car driving at 40 mph who has an accident.
12.40. "IP A cumulative probit model for a categorical variable with 4 categories results in the following fit:
0-1(70 = boi + 0.1xi + 0.5x2
with bol = 0.060, b02 = 0.150, and b03 = 0.370.
Calculate the probability of category 3 for someone with x1 =2, x2 = 1.5.
12.41. rThe grade a student gets on an actuarial exam (0-10) is predicted using a cumulative complementary
log-log model. Explanatory variables are hours of study (xi, a continuous variable) and use of the ASM manual (x2,
a binary variable). The base category for x2 is that the ASM manual was not used. The form of the model is
Solutions
12.1. The list of statement (A) can be found at the beginning of the previous lesson. (B), (C), and (D) are found in
this lesson, although (C) and (D) are not accurate; the logit and probit functions are actually the inverses of what is
stated. Comparing Figures 12.1 and 12.2 shows that (E) is false.
12.2.
At i7 = 0, TE = 0.5, so that is the borderline between the two cases of the inverted function. Inverting this function,
for TE > 0.5:
n = 1 -
TE = 0.50
q = in 2n
We conclude
n < 0.5
1 -
Here female is the base class for gender so its value is 0, whereas the other two variables have nonzero values.
e-2.761
= 0.05947 (D)
= 1 ±e2761
(1 )
0.88877
0.96562
= 2.078238 = /4) + 4b1
ln
(1 - 0.96562) = 3.335295 = bo + 6b1
Subtracting the first from the second, b1 = (3.335295 -.2.078238)/2 = 0.628529, and therefore 1)0 = 2.078238 -
4(0.628529) = -0.43588. (B)
12.13. logit(0.1) = In = -2.197. We will calculate the systematic component for each client and reject if it is
greater than -2.197.
#1 : -1.679 - 25(0.0294) - 0.3870 + 0.7710 = -2.03 > -2.197
#2 : -1.679 - 10(0.0294) - 0.3870 + 0.7710(3) = -0.047> -2.197
#3 : -1.679 - 20(0.0294) = -2.267 < -2.197
#4 : -1.679 - 75(0.0294) + 0.7710(3) = -1.571 > -2.197
Three clients are rejected. (D)
12.14. In this model, the odds are the exponential of the systematic component, which here is e356+°'42(5) = 287.15
12.15. In a logistic model, odds are the exponential of the systematic component, so here the odds are
1.75
e54)•65(5) = e = 5.7546 (C)
12.16. The odds for the 2-C-other driver are 0.22/(1 - 0.22) = 0.282051, with logarithm -1.2657. This is equal to
x + 0.064 + 0- 0.371, so x = -0.9587. Then the odds for the 3-C-sedan driver are e-119587 = 0.3834. (E)
12.17. Probit links give you probabilities, not odds, but it's easy to convert a probability into an odds ratio. We
have
17 = -3 + 0.8(2) = -1.4
it = 4'(-1.4) = 0.0808
12.18. The systematic component is 0.02 + 0.3(4) = 1.22. For the logistic model
e 1.22
77 = = 0.7721
1 + e1.22
= 0.87545
1 4. 0.24+0.57(3)
The increase is 0.07646
12.21.
= 11)(0.2 + 0.4(1) + 0.6(2)) = 0(1.8) = 0.9641
12.22. Just because it isn't a proportional odds model doesn't mean you can't calculate the odds.
The probability is
e119 17
77 = 1 - exp (_e0.02+0.04(1)+0.06(2)) = 1 =0.6980
12.25. In the nominal logistic model, the odds ratio for a binary variable, a variable that is either present or absent,
is di, where Pi is the coefficient for the binary variable. See page 192 for a discussion of this. In this question,
pi = -0.18, so the odds ratio is e-0'18 = 0.83527
12.26. The systematic component of male age 20 is 0.10 - 0.11 = -0.01 for van and -0.02 + 0.18 = 0.16 for SUV.
Thus the probability of buying an SUV is
e0.16
0.370946
TCSLIV = 1 + e-0.01 0.16
713
1/2
e0.307+0.135(2) = 1.78069
113
So it3 is
1
It3 0.21146
1 + 1.94838 + 1.78069
The cumulative probabilities are 4.61818/5.61818 = 0.82201 and 6.88951/7.88951 = 0.87325. The probability of
medium risk is 0.87325 - 0.82201 = 0.05124
12.32. As we discussed, the relative odds are the exponential of the corresponding (3 coefficient.
e0.056 1.0576
12.33. The model is cumulative, so we'll have to calculate the odds of non-casualty and the odds of non-casualty
or injury, translate odds into probabilities, and then take the difference. The odds of non-casualty are
eo.4+0.o82(5) =2.2479
12.34. Since it is a cumulative proportional odds model, the relative odds of response 1 to response 0 for cumulative
category 2 (which is categories 1 and 2 combined) equals the relative odds of responsel to response 0 for cumulative
category 1 (which is category 1 alone), and we are given that the relative odds of response 1 to response 0 for
category 1 is 1.5
12.35. The cumulative probability of category 2 is 0.627148 + 0.126841 = 0.753989. The cumulative odds of
categories 1 and 2 are 0.627148/(1 - 0.627148) = 1.68203 and 0.753989/(1 - 0.753989) = 3.06486 respectively. The
logged odds of categories 1 and 2 are 0.52 and 1.12 respectively. From the form of the model,
Poi + pi = 0.52
po2 + p1 = 1.12
Thus pO2 I301 = 1.12 -0.52 = 0.6
12.36. The logged odds of category 1 are In (0.719100/(1 - 0.719100)) = 0.94. From the form of the model,
Poi +43i =0.94
From the previous exercise, we know that poi + pi = 0.52. Thus pi = 0.42 and j301 =171)71 1 .
12.37. We need the cumulative probability of categories 1 and 2, and then the probability of category 3 will be the
complement of that cumulative probability.
The linear component is -0.07- 0.01(40) = -0.47. The cumulative odds of category 2 are e-(147 = 0.625002. The
cumulative probability of category 2 is 0.625002/1.625002 = 0.384616. The probability of category 3 is 1 - 0.384616 =
0.615384
12.38. We use the fact that the relative cumulative odds are the same for categories 1 and 2. For category 1, the
odds for New York are 0.724/(1 - 0.724) = 2.62319. For Chicago, they are 0.698/(1 - 0.698) = 2.31126. The relative
odds of Chicago to New York are 2.31125/2.62319. Therefore the same proportion applies to the sum of the first two
categories. For New York, the odds of the first two categories are 0.944/(1 - 0.944) = 16.85714, so for Chicago the
odds of the first two categories are 16.85714(2.31125/2.62319) = 14.85262. The probability of the first two categories
for Chicago is 14.85262/15.85262 = 0.93692. The probability of medium risk for Chicago is 0.93692 - 0.698 = 0.239
12.39. For category 2, the systematic component is 0.62 + 0.71 - 0.12 = 1.21. Thus the cumulative probability of
category 2 is (1)(1.21) = 0.8869. The probability of a fatality, category 3, is 1 - 0.8869 = 0.1131
12.40. The cumulative probability of 2 or lower is
0(0.150 + 0.1(2) + 0.5(1.5)) = 0(1.10) = 0.8643
and the cumulative probability of 3 or lower is
0(0.370 + 0.1(2) + 0.5(1.5)) = (I)(1.32) = 0.9066
The probability of exact category 3 is the difference of the two cumulative probabilities, or 0.0423
12.41. We'll calculate the probability of 5 or less; the probability of passing is the complement. The linear component
is -2 -4- 0.5(5) - 0.01(200) - 0.4 = -1.9. Using the inverse of the complementary log-log link,
n = 1 - exp(-e-1.9) = 0.138921
The probability of passing is 1 - 0.138921 = 0.861079
Quiz Solutions
12-1. The logistic model gives the logarithm of the relative odds as 0.8 - 1.3 = -0.5, so the relative odds are
e-0.5 0.6065
12-2. The model for the conditional probability of becoming a professor is logistic. pp for this model is the excess
of 130 for professor over the po for actuary, or 0 - 0.8 = -0.8, since for the base category professor po = 0. Using
equation (12.4), the requested probability is
e-0.8
0.3100
+
12-3. The probability of event 1 is 0.5/1.5 = 1/3. The probabilit events 1 or 2 is 2/3. Therefore, the probability
of event 2 is 1/3 and the odds of event 2 are (1/3)/(1 — 1/3) = 0.51.
BREAK #asmstudybreak
Inspiration for when you need a little push and some fuel to help you keep • am
\
•••••=,
John D. Rockefeller
atstm
Actuarial Study Materials
,)
Lesson 13
When Poisson regression is performed, usually the canonical link function, g(y) = In y, is used. Then if xi •41
assumes two values, x11 and .7c21 and all other predictors stay the same, then
E[yi]= efijx1J+•••
E[y2I = efijx21+...
E[y2] = epi(x2)-xii)
ELM
We see that ePj can be interpreted as the proportional change in E[iii] per unit change in xii.
We will sometimes use the notation y i for9,the fitted value in the GLM model.
Sometimes observations may need to be weighted for exposure. For example, each observation of y may be
total claims submitted by a group with n members, and you may want to assume that the number of claims per
member is Poisson with mean A. Or you may have number of claims per policyholder, but not every policyholder
had a policy for a full year; some may have had it for only a couple of months, just having exposure less than 1. To
incorporate exposure into the model, let Ei be the exposure of observation L The model form is then:
ln = ln .E; Epixii
i=0
The additional term In Ei, which is known in advance (unlikefi,which has to be estimated), is called the offset. •41
Even though the distribution of yi is not a real probability distribution, the estimates of pi are still consistent.
Typically is estimated by
(yi pi)2
cl" = n — (k +1) fri.J jui
(13.1)
The rationale of this formula is that the sum in this formula is assumed to have a chi-square distribution with n—(k+1)
degrees of freedom, and the mean of that chi-square distribution is n — (k + 1). Compare this to equation (11.4);
here, the Poisson variance equals the mean, each observation is considered as an independent cell, and k +1 ps are
estimated.
The overdispersion parameter affects the standard error calculations of the Pis, which get divided by it.
The drawback of using cp is that since no probability distribution corresponds to the model, one cannot estimate
probabilities of having specific numbers of claims; only moments can be estimated.
An alternative to this method is to use a probability distribution whose variance is greater than its mean. The
'I negative binomial distribution, a member of the linear exponential family, fits the bill. It has the following probability
mass function:
j+r—1
Pr(y = j) = ( r—1 )pr(1 — p)i j = 0, 1,2, ...
with E[Y] r(1 — p)/p and Var(Y) = r(1 — p)/p2. We setp = r(1 — p)/p and Var(Y) = goy where 1/p. The Poisson
distribution is the limit of the negative binomial as p 1 and rp —* A. Typically the log link is used even though it
is not the canonical link for this distribution.
Four other count distributions are popular. The first two provide flexibility for the probability of 0.
„ + (1 — n)h(0) j = 0
Pr(y= P = (13.2)
701(j) 1>0
For example, suppose the count distribution is Poisson with mean 0.2 and the 0 component has weight 0.1. Then
Pr(y = 0) = 0.1 + 0.9e-0-2
Pr(y = 1) = 0.9(0.2)e-12
0 22
Pr(y = 2) = 0.9
H 2
e-°.2
ni is a function of the predictors xi, and is estimated using a binary model such as logit.
II am following the textbook's notation of calling it (if) here even though it was called 472 in Section 11.4, but I don't understand why a2 is
necessary. For 2-parameter distributions, there is no need for 1
4
2 since the second parameter cp can already be adjusted to account for variance,
while for 1-parameter distributions there is no second parameter cp, so op can be used as the overdispersion parameter.
2The textbook uses g(y) for this, but this is confusing since we use g(y) for the link function.
For a Poisson distribution with mean pi and probability of the zero component rz, the double expectation formula
gives:
E[y] = (1 — ni)pi (13.3)
and the variance of yi is computed by the conditional variance formula (with I indicating the component of the
mixture):
Var(yi) = E[Var(yi I I)] + Var(E[yi I I]) = (1— Tri)pi + pN(1 — ni) (13.4)
We see that the variance has an extra term that the mean doesn't, so the variance is greater than the mean.
The rationale for hurdle models is that the response is a result of a two-step process. The first step is the hurdle; the Ikt
decision to make the count greater than 0 is the first decision. The second step is actually determining the non-zero
count. For example, a patient decides whether to go to the hospital, and in the second step the hospital decides how
many days the patient is there.
In these models, there is a specified probability of 0 and a specified count distribution h (y). Given that the count
is not 0, its distribution is the count distribution truncated at 0. In other words,3
(13.5)
where k =
.1-h(o) • Those who took STAM will recognize this distribution as a zero-modified (a, b, 1) distribution.
For example, suppose the count distribution is Poisson with mean 0.2, and n = 0.1. Then k = 0.9/(1 — e—(12) and
Pr(y = 0) = 0.1
Pr(y = 1) =
pr(y 2) =
0 22)
Probabilities of nonzero integers are multiplied by k, so first and second moments are multiplied by k. If h(y) is a
Poisson distribution with mean pi, it follows that
k may be greater or less than 1. When k < 1, then 1 — k> 0 and the second term in positive, so variance is greater
than mean. When k> 1, then 1 — k <0 and variance is less than mean. So this model can handle underdispersion
as well as overdispersion.
In yi I = ai
3Even though we usually use k for number of variables, the textbook uses k here for the scaling constant. I don't think you will confuse this
non-integral k with the other k, so I followed the textbook.
or
yi I ai = exp(ai + Pixi)
The distribution of the random variable a• is allowed to vary by observation i. Adding a constant to ai leads to an
equivalent model by adjusting the intercept, so we specify E[e] = 1 to make the intercept unique. Let pi =
Assuming yiIai has a Poisson distribution, so that its variance equals its mean, then the moments of yi are
E[y1] = E[E[yi I ad]= pi Ejeall = (13.8)
Var(yi) = E[Var(E[yi I as])] + Var(E[yi I ail)
= E[e'l+q3]+ Var(eai*O)
=
+ 1.4 Var(e) (13.9)
The variance is greater than the mean. We model overdispersion by selecting a distribution for ai that results in the
desired ratio of variance to mean.
The textbook discusses two distributions that may be used for ai. The first distribution for a is a gamma
distribution. This means that eal follows a loggamma distribution. Multiplying a gamma random variable by a
constant results in a gamma distribution. So e'''''113 , the mean of the conditional Poisson distribution for yi, has
a gamma distribution as well. And as you learned in Exam STAM4, a gamma mixture of Poisson distributions is
negative binomial. So by using a gamma distribution for eai, yi will have a negative binomial distribution for all
observations having the same distribution for ai.
The second distribution for a is normal. Then e", follows a lognormal distribution. Normal distributions are
used very frequently when we're not sure what distribution is best. Claim count probability does not have a closed
form expression if a lognormal distribution is used, but can be approximated by computer algorithms.
In a latent model, we assume that there is some unobservable discrete random variable that affects yi. For example,
this random variable may have two values: "low risk" and "high risk". Thus yi is modeled as a discrete mixture,
which for conditionally Poisson yi results in variance higher than mean. While this model is intuitively appealing,
it has computational issues; mixtures may have more than one maximum, complicating maximum likelihood
estimation, and convergence is slow.
Exercises
13.1. al? You are given the following information for a fitted GLM:
Response variable Number of cars
Response distribution Poisson
Link log
Parameter df #
Intercept 1 0.186
In(Income) 1 0.009
Family size 2
1 or 2 0 0.000
3 or 4 1 0.137
5 or more 1 0.355
Calculate the variance of the number of cars for a family of size 4 with 150,000 income.
4If you didn't learn this, either prove it yourself or take it on faith.
Overdispersion estimate
1 (yi
43 =
(13.1)
i=1
ri
Zero-inflated models
n + (1 — n)h(0) j=0
Pr(Y
;\
{(1 — 77 )11(j)
E[Yi] =(1— ni)pi
j>0
(13.2)
(13.3)
Var(p) = (1 — ni)pi + /47(1 — ni) (13.4)
Mean and variance formulas assume that h(j) is the probability mass function of a Poisson with mean pi.
Hurdle models
j 0
E[y] =
{Ich(j) j > 0 (13.5)
(13.6)
Var(yi) = k + k(1 — (13.7)
1 —
where k = Mean and variance formulas assume a base Poisson distribution before the hurdle.
1 —
h(0)
Heterogeneity models
E[N] = (13.8)
Var(yi) = +p Var(ei) (13.9)
These mean and variance formulas assume If;Iai follows a Poisson distribution.
13.2. sil° You are given the following information for a fitted GLM:
Response variable Claim count
Response distribution Poisson
Link log
Parameter df
Intercept 1 -1.512
Rating class 1
Standard 0 0.000
Preferred 1 —0.301
Calculate the probability that a preferred driver who drives 10,000 miles submits at least 1 claim.
13.3. [S-S16:411 Calculate the expected number of deaths for a population of 100,000 females age 25.
(A) Less than 3
(B) At least 3, but less than 5
(C) At least 5, but less than 7
(D) At least 7, but less than 9
(E) At least 9
13.4. "-ir For a 60 year old, calculate the ratio of expected number of diabetes deaths for males over expected
number of diabetes deaths for females.
13.5. %. [S-S17:371 Let Y1, , Y be independent Poisson random variables, each with respective mean pi for
= 1, 2, ... ,n, where:
a, for i = 1,2, ... ,rn
In pi =
{ p, for i = m + 1, m + 2, ... , n
The claims experience for a portfolio of insurance policies with in = 50 and n = 100 is:
50
yi = 563
100
E yi
1=51
1,261
13.6. •46 [S-S17:30] You are given the following information for a fitted GLM:
Response variable
Response distribution Poisson
Link log
AIC 221.254
Parameter s.e.(s)
Intercept 5.421 0.228
Gender
Male 0.000 0.000
Female —0.557 0.217
Calculate the predicted value of Y for a Female with an Age value of 22.
(A) Less than 1,000
(B) At least 1,000, but less than 1,500
(C) At least 1,500, but less than 2,000
(D) At least 2,000, but less than 2,500
(E) At least 2,500
13.7. a: You are given the following data for 5 groups, each of which is covered by workers compensation policy.
Group Number of Claim Hazard
Name Employees Count Class
A 14 5
29 3 1
16 7 3
24 6 2
42 8 1
You are building a generalized linear model in which hazard class will be an explanatory variable and the
response will be claims per employee. You will use a log link.
Determine the offset for the first group.
yi 1 2 4 7
13.9. & You are given the following output from a generalized linear model:
Based on this model, if a driver drives an additional 1000 miles, what is the percentage increase of claim
frequency?
13.10.
For a zero-inflated Poisson regression model with an intercept only:
(i) The log link is used
(ii) The intercept isJo= —0.3
(iii) The probability of 0 is 0.8.
Calculate the probability of 1.
Intercept 1 —0.300
Gender
Male 0 0.000
Female 1 —0.200
A zero-inflated Poisson distribution is a mixture distribution. The component of the mixture that is the constant 0
has a weight of 0.4.
Calculate the variance of Y for a female.
13.12. 'I For a hurdle model, the base distribution of the response is Poisson and a log link is selected. The
model has two variables, xi and x2, and an intercept. The estimated parameters are bo = 1.5, bi = 0.4, b2 = 0.2. The
probability of 0 is 0.25.
Calculate the fitted probability of 2 when xi = 0.5 and x2 = 0.7.
13.13. You are given the following information for a fitted GLM:
Intercept 1 —1.135
Gender
Female 0 0.000
Male 1 0.408
Rating class 1
Standard 0 0.000
Preferred 1 —0.677
The model is a hurdle model. For a preferred male driver, the probability of 0 claims is 0.8.
Calculate the variance of the claim count from a preferred male driver.
13.14. svr A hurdle model is used. The base distribution is Poisson and the log link is selected.
For observation 1, the mean is 0.250. You wish to set the variance equal to 1.1 times the mean.
Determine the probability of 0.
13.15. In a heterogeneity model for y, the distribution of yi I ai is Poisson and a log link is used. The fitted
value of an observation yi is pi = 1.35. The random component a; follows a gamma distribution with mean 0.4 and
variance 0.3.
Solutions
The mean of Y is e-"13 = 0.199289. The probability of 0 claims is C0199289 = 0.819313. The probability of at least 1
claim is 1 - 0.819313 = 0.1807
13.6. The inverse of the link is exponentiation, so we exponentiate the systematic component.
exp (5.421 - 0.557 + 0.107(22)) = 1363.76 (13)
13.7. As discussed in the lesson, the offset is the logarithm of exposure, in 14 = 2.639
13.8. There are n = 4 observations and k = 1 parameter plus an intercept. The estimated overdispersion is
1 1(1 - 1.0699)2 (2 -2.3008
2.3008)2 + (4 -3.3739
3.3739)2 + (7 - 7.2553
7.2553)2
(t)
-4- (1 + 1) k 1.0699
+
) _
0.0845
13.13. The systematic component is -1.135 + 0.408 - 0.677 = -1.404. Then pi = = 0.245613. The k ratio
(actual probability of greater than 0 divided by fitted probability of greater than 0 in Poisson) is
k =1 e-0.245613= 0.918380
Using formula (13.7), the variance is
Var(y1) = (0.918380)(0.245613) + (0.918380)(0.2456132)(1 - 0.918380) = 0.23009
13.14. Dividing formula (13.7) by formula (13.6) and setting the quotient equal to 1.1, we get
1.1=
kipi + kq4(1 - kt)
=1+pi- kipi
0.1 = pi-0.25
pi = 0.35 k1=
Reading: Regression Modeling with Actuarial and Financial Applications 11.3.2, 11.4, 12.1.4, 12.2, 13.3.3, 13.4, 13.5
Once we have estimated the coefficients of a generalized linear model, we have to determine whether it is the best
model.
Many of the tests we discuss here are for nested models. That means that we compare one model that has several Nr
explanatory variables to another that has a subset of those variables. The second model may even have only one
variable, the intercept, in which case we're testing the significance overall of our first model. Typically we compare a
model to another model with one variable removed in order to determine the significance of that variable. However,
with categorical variables, we may want to remove all associated dummy variables at once.
Q has degrees of freedom equal to the number of cells (number of distinct claim counts), minus 1 because actual
and expected number of observations are constrained to be equal, and minus the number of parameters fitted from
the data.
An alternative formula is
n
Q= —n (14.2) 41
npi
Formula (14.2) only works when the following two conditions apply:
1. Sum of fitted values equals sum of actual values.
2. The denominators equal the fitted values.
0 2018
1 428
2 45
3 9
4 or more 0
Total 2500
Assume that claim counts follow a Poisson distribution, and the Poisson parameter is estimated using maximum
likelihood.
SOLUTION: The maximum likelihood estimate is the sample mean, 0.218. The expected number of policies with k
claims is 2500pk = 2500e-11218(0.218k RD. This works out to
Number Number Expected Number
of Claims of Policies of Policies
0 2018 2010.31
1 428 438.25
2 45 47.77
3 9 3.47
4 or more o 0.20
Total 2500
Notice that the expected number of claims in the "4 or more" cell is 2500 minus the sum of the expected numbers
of claims in the other cells. Then the chi-square statistic is
(2018 — 2010.31)2 1-
(428 — 438.25)2 +
(45 — 47.77)2 (9 — 3.47)2 (0 — 0.2)2 = 9.443
2010.31 438.25 47.77 3.47 0.2
Everything we've said so far applies regardless of what distribution is fitted. If you fitted a negative binomial
and used a log link, formula (14.1) would still apply.
But sometimes the fit is tested observation by observation. In that case, you would use the following formula to
calculate the Pearson chi-square statistic:
- PO'
(14.3)
1=1 Gpv(pi)
where cpv(f.ii) is the estimated variance of yi using the variance formulas in Table 11.1 on page 163. The number \—)
of degrees of freedom, if we assume all observations are independent (and we usually do), is n — p, where p is the
number of fitted parameters, usually k + 1.2
Back in Lesson 4 we studied the F test for testing the whether a linear regression model or a set of its parameters is
significant. The corresponding test for a generalized linear model is the likelihood ratio test.
Let 7be the loglikelihomi of the unconstrained model and f the loglikelihood of the constrained model. Then the
likelihood ratio statistic
LRT = 2(1 — f) (14.4)
It has an approximate chi-square distribution with q degrees of freedom, where q is the number of constraints.
The likelihood ratio test compares nested models, cases where one model is included in the other model.
14.3 Deviance
'Those who have studied the chi-square test know that we'd merge the cell with 4 or more with the cell with 3, and perhaps merge both
of these into the cell with 2 claims, due to the low number of expected claims in these cells. But Regression Modeling with Actuarial and Financial
Applications, on page 344, does not merge cells despite having 0.40 and 0.01 expected claims in two cells.
2This is what the textbook seems to say, but the chi-square statistic assumes a normal distribution for each group, so responses should be
grouped to calculate the statistic; it shouldn't be done observation by observation.
under consideration.
D(ê) = 2 (Osaturated) — 0)) (14.5)
The deviance is the scaling factor rp times the scaled deviance: D(e) = cpD*(6). For Bernoulli and Poisson models,
where = 1, the deviance equals the scaled deviance.3 The lower deviance is, the better the model.
For nested models, the likelihood ratio statistic may be calculated as the difference of the scaled deviances. When
calculating the difference of two scaled deviances, the loglikelihood of the saturated models cancels.
EXAMPLE 14B •41 A Poisson regression with 5 variables has a deviance of 20.56. When 2 of the variables are removed
from the model, the deviance is 26.78.
Determine the p level of the hypothesis that the 2 variables that were removed from the model are significant..
SOLUTION: The loglikelihood ratio statistic is the difference of the (scaled) deviances, 26.78 — 20.56 = 6.22. The log-
likelihood ratio statistic is chi-square with 2 degrees of freedom, or exponential with mean 2, and for an exponential,
the distribution function is F(6.22) = 1 — C62212 = 0.955399, so the p level is 1 — 0.955399 = 0.044601. Even though
you learn in probability that a chi-square with v degrees of freedom is a gamma distribution with a = v/2 and f3 = 2,
or an exponential with mean 2 when v = 2,1 doubt an exam would expect you to know this. You'd probably just be
expected to look up 6.22 in the tables and note that it is between the 95th percentile (5.991) and the975thpercentile
(7.378).
We will now derive deviance formulas for the normal, Bernoulli, and Poisson models. We assume all observations
are unique, so that the fitted values in the saturated model equal the observations.
Normal Ignoring the constant 1/(o-VTn) in the normal density, which will cancel when we take the difference of
loglikelihoods, the loglikelihood is
—Oh —
2a2
For the saturated model, pi yi, so yi — = 0, making the loglikelihood 0. For the model under consideration,
pi =9.So the scaled deviance, twice the difference in loglikelihoods, is
- .902
D(0) = C2
(14.6)
D(0) = (14.7) .1
1.1
Quiz 14-1 For a normal linear regression model, the deviance is 125. The model has 22 observations and
2 parameters, Po and Pi.
Calculate the residual standard error of the regression.
Bernoulli Assume that the parameters of the Bernoulli are 7Ii for each observation of the explanatory variables.
The mean of the distribution is ni. In the saturated model, yi = ni. The loglikelihood, ignoring binomial coefficients,
which don't vary by model and that will therefore cancel when taking differences, is
/(b) =
3This is the way it is defined in the textbook on the syllabus. But many other authors reverse the definitions, calling "deviance" what we call
"scaled deviance" and vice versa.
In the saturated model, ni = yi while in the model under consideration fri pi. Notice that
ln yi — lngi = ln —Yi
9i
and similarly
ln (1 — In) — ln (1 — 9i) lri
In this formula, the convention is that y In y = 0 whenever y = 0. That means that if all the yis are 0 or 1 (in other
words, the data are not grouped), the only terms in the sum are the ones that have yi = 1 or 1— yi = 1, and D reduces
to
D = —2 (E In + ln(1 — 9i))
yi=1 yi=0
(14.9)
EXAMPLE 14C I': Drivers are classified as low-risk (class 0) and high-risk (class 1). You use a generalized linear
model to predict the class. For 6 drivers, the results are
Actual class 0 0 1 1 1
Fitted class 0.25 0.35 0.12 0.47 0.84 0.52
SOLUTION: The class is a Bernoulli random variable. If you look at how the formula is derived, you'll see that we
can drop all terms involving logarithms of 0. In other words, we sum up yi ln t only for yi = 1 and (1 — yi) In
only for yi = 0.
1 1 1 1 1
D = 2 (ln 1 — 0.25
+ +
in 1 — 0.35 In 1 — 0.12 0.47 0.84 0.52
4.8592
Poisson If the means are A.i, the loglikelihood, ignoring ln yi! terms, which will cancel when taking differences, is
1(b) = yi ln Ai)
If the model has an intercept and the log link is used, E yi = Epi so the second term drops out, and we're left
with
D =2Eyiln Yi
1=1
(14.11)
To put it differently,for a Poisson regression with the log link and an intercept, the sum of the residuals 9.; - yi is 0.
EXAMPLE 140 "411 A Poisson regression with a log link is run. The results are
Actual 0 1 1 2 2
Fitted 0.35 0.85 1.22 1.74 1.84
SourrxoN: Since a log link is used, we can use the simpler formula (14.11) and skip the first summand which
includes a logarithm of 0.
1 1 2 2
D = 2 (In — 0.85
—
1.22
2In —
1.74
21n
1.84
= 0.8179
1. Akaike Information Criterion. The penalty is 2 for each parameter. The formula is
where / is the loglikelihood and p is the number of parameters estimated, usually k + 1. The lower the AIC is,
the better the model is.
2. Bayesian Information Criterion. The penalty varies with the number of observations, and is In n for each 41/4:
parameter. The formula is
BIC = —2/ + p inn (14.13)
Quiz 14-2 sr For a generalized linear model, the negative loglikelihood is 158.06. You are considering adding
a 4-category categorical variable to the model.
Determine the highest value of the negative loglikelihood for which the additional variable is accepted if
the AIC is used to select models.
R2 1 exp(/o/n) 12 (14.14)
exp(/(b)/n))
The problem is that while this R2 is never less than 0, it can never be as high as 1. Max-Scaled R2 divides R2 by its
maximum value so that the statistic is 1 for a perfect model:
R2
max-scaled R2 = 2 (14.15)
1 — (exp(/0/n))
For the pseudo-R2, let /max be the loglikelihood of the saturated model. The pseudo-R2 statistic is defined by
—
pseudo-R2 = /max — lo
For a linear regression model, this reduces to R2, as we will now show. For a normal distribution, the minimal
model sets fp = y. Then
and
E(yi - 9i)2 n 2no-2
1(b) =
az 2
ln 2no-2 = —Error SS/412 — 2
In equation (14.16), the 3- In 2no-2 terms cancel in the numerator and denominator, and we are left with
Total SS/c2 — Error SS/a-2 Regression SS
pseudo-R 2 = Total SS
Total SS/a2
Some authors (such as the author of the textbook on the CAS MAS-I syllabus) define pseudo-R2 differently.
14.6 Residuals
'If One way to improve a model is to look at residuals of the current model. Residuals are useful for
2. Identifying outliers
3. Displaying effects of individual observations on the model
4. Displaying heteroscedasticity and time trends
Raw residuals are meaningless in nonlinear models. To create meaningful residuals, three methods are briefly
discussed in the textbook:
1. Define a function Ej = R(yi; xi, 0), where U is a vector of model parameters f3i and scale parameters. The
function is defined so that Ej are independent and identically distributed. This is the Cox-Snell method.
2. Define a function Ei = R(yi; x,, 49) that is based on transforming yi.
3. Use deviance residuals.
where f(y) is the distribution of yi and "sat" stands for saturated. For example, for a Poisson regression, the 6:
deviance residual is
•41
Anscombe residuals An Anscombe residual is an example of the second type of residual mentioned above, the *.:
type based on transforming yi. It is of the form
h(y1) — E [h (NA
r
n
Vai
A, (yi))
where 11 is a transformation that makes h(y) approximately normally distributed. (It is a function of both yi
and p.i.) The textbook does not state how to derive h, but just provides a short table with the transformations
for binomial, Poisson, and gamma distributions. Deviance residuals and Anscombe residuals are close in
many cases.
The textbook also describes another unnamed residual of the second type, where h is selected to stabilize
variance.
Poisson
-2( E h, 9, + z _9))
yi=1 yi=o
(14.9)
di = sign(Yi — 942 On f(yi; of,sai — f(y1; 61))
(14.20)
Poisson:
Yi
(14.10)
Bernoulli:
2(/ — i) = /5 — 15
where hat indicates the unconstrained model, tilde
indicates the constrained model, and there are q
constraints.
di = sign(yi — 91).\12 (yi lnYi+ (1 — y-)ln
Yi — gi
(1422)
n Exercises
Pearson chi-square
14.1. a: ISRM Sample Question #281 Dental claims experience was collected on 6480 policies. There were a total
of 9720 claims on these policies. The following table shows the number of dental policies having varying numbers
of claims.
Calculate the chi-squared statistic to test if a Poisson model with no predictors provides an adequate fit to the
data.
0 0.2
0 0.4
1 0.7
1 0.8
1 1.2
2 1.5
2 2.1
2 2.1
14.3. 41 A generalized linear model has a gamma response variable with scale parameter 119 = 1/3. There are 4
observations with actual and fitted values as follows:
Actual Fitted
2.1 2.5
1.8 1.4
3.9 3.8
3.4 3.1
14.4. '1 [S-F16:41] A modeler is considering revising a linear model for claim counts with age as an explanatory
variable. It is currently being included in the model as a continuous variable with no interactions.
Determine which of the following statements is false.
(A) Including age as a categorical variable with more than two levels would increase the model degrees of
freedom.
(E) Including several interaction terms involving age may make the model more parsimonious.
14.5. al For a generalized linear model, the loglikelihood of the saturated model is —42.51 and the deviance of
the model under consideration is 21.20.
14.6. A generalized linear model with Bernoulli response is fitted to 15 observations. All the observations
except for the first, ninth, and twelfth are 0. The model sets the Bernoulli parameter 7I = 0.2 for all 15 observations.
Calculate the deviance.
14.7. "41 For a normal linear model, the deviance is 22.52. The model has 25 observations and 5 parameters (4
explanatory variables and an intercept).
Calculate the residual standard error of the regression.
14.8. .4' A generalized linear model with Poisson response is fitted to 5 observations. There is 1 explanatory
variable and an intercept. The results of the model are
Yi Pi
10 7
12 12
15 20
18 19
20 25
fl 14.9. *It [S-F16:35] You are given ,y„, independent and Poisson distributed random variables with respec-
tive means pi for i 1, 2, ... , n.
A Poisson GLM was fitted to the data with a log-link function expresses as
E[y1] eflo+flixi
xi yi gi yi in(yilgi)
0 7 6.0 1.0791
0 9 6.0 3.6492
0 2 6.0 —2.1972
1 3 6.6 —2.3654
1 10 6.6 4.1552
1 8 6.6 1.5390
1 5 6.6 —1.3882
1 7 6.6 0.4119
Calculate the observed deviance for testing the adequacy of the model.
(A) Less than 4.0
(B) At least 4.0, but less than 6.0
(C) At least 6.0, but less than 8.0
(D) At least 8.0, but less than 10.0
(E) At least 10.0
l
f
14.10. [MAS-I-F19:331 You have a sample of five independent observations, xi, • • • , x5, each with exponential
distribution:
(A) I only (8) II only (C) III only (D) All but III (E) All
14.12. '41 [MAS-I-S18:291 You are given the following statements relating to deviance of GLMs:
I. Deviance can be used to assess the quality of fit for nested models.
II. A small deviance indicates a poor fit for a model.
III. A saturated model has a deviance of zero.
(A) None are true (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) ,or (D)
14.13. ": The loglikelihood of a fitted generalized linear model with 6 parameters 130,161, • • • s P5 is —130.52. The
null hypothesis is Ho: 134 = P5 = 0. Under the null hypothesis, the loglikelihood is —134.88.
Under the likelihood ratio test, which of the following statements is correct?
(A) Reject Ho at 0.5% significance.
(B) Reject Ho at 1% significance but not at 0.5% significance.
(C) Reject Ho at 2.5% significance but not at 1% significance.
(D) Reject Ho at 5% significance but not at 2.5% significance.
(E) Do not reject Ho at 5% significance.
14.14. NI. [S-S17:421 A study was commissioned on the effect of type of fertilizer and type of seed on crop yield.
There are four types of fertilizers and five types of seed included in the study.
Two separate Poisson GLMs with log link functions were fit to the dataset:
1) Using type of fertilizer and type of seed without an interaction term, the log-likelihood of this GLM is —283.
2) Using type of fertilizer and type of seed, and all interaction terms between those two main effect variables, the
log-likelihood of this GLM is —272.
Let:
• Ho: The effect of type of fertilizer is independent of type of seed on crop yield.
• Hi: The effect of type of fertilizer is not independent of type of seed on crop yield.
Calculate the smallest significance level at which you reject Ho.
(A) Less than 0.5%
(B) At least 0.5%, but less than 1.0%
(C) At least 1.0%, but less than 2.5%
(D) At least 2.5%, but less than 5.0%
(E) At least 5.0%
14.15. The scaled deviance for a fitted generalized linear model with 5 parameters po, 434 is 52.08. The
null hypothesis is Ho: P4 -= 0. If P4 is set equal to 0, the scaled deviance is 55.98.
Using the likelihood ratio test, which of the following statements is correct?
(A)Reject Ho at 0.5% significance.
(B)Reject Ho at 1% significance but not at 0.5% significance.
(C) Reject Ho at 2.5% significance but not at 1% significance.
(D)Reject Ho at 5% significance but not at 2.5% significance.
(E) Do not reject Ho at 5% significance.
14.16. [ SRM Sample Question #191The regression model Y = pol-pixi+,62x2+,53x1 X2+ E is being investigated.
The following maximized log-likelihoods are obtained:
• Using only the intercept term: —1126.91
• Using only the intercept term, Xi, and X2: —1122.41
• Using all four terms: —1121.91
The null hypothesis Pi = 132 = f33 = 0 is being tested at the 5% significance level using the likelihood ratio test.
Determine which of the following is true.
(A) The test statistic is equal to 1 and the hypothesis cannot be rejected.
(B) The test statistic is equal to 9 and the hypothesis cannot be rejected
(C) The test statistic is equal to 10 and the hypothesis cannot be rejected.
(D) The test statistic is equal to 9 and the hypothesis should be rejected.
(E) The test statistic is equal to 10 and the hypothesis should be rejected.
14.17. You are given the following results for two generalized linear models fit to the same data:
Model AIC
g(Y) = Po 89.2
g(P) = PO ± P1X1 P2X2 + P3x3 88.4
14.18. f [S-F15:361 You are given the following information for two potential logistic models used to predict the
occurrence of a claim:
Parameter Parameter
(Intercept) —3.264 (Intercept) —2.894
Vehicle Value ($000s) 0.212 Gender—Female 0.000
Gender—Female 0.000 Gender—Male 0.727
Gender—Male 0.727
14.19. [S-F15:371 You are given the following table for model selection:
Scaled Number of
Model Deviance Parameters (k + 1) AIC
Intercept + Age A 5 435
Intercept + Vehicle Body 392 11 414
Calculate Y
14.20. ": [S-F15:38) You are testing the addition of a new categorical variable into an existing GLM. You are given
the following information:
(i) The change in scaled deviance after adding the new variable is —53.
(ii) The change in AIC after adding the new variable is —47.
(iii) The change in BIC after adding the new variable is —32.
(iv) Prior to adding the new variable, the model had 15 parameters.
Calculate the number of observations in the model.
14.21. kir [S-S16:35] You are given the following information about three candidates for a Poisson frequency GLM
on a group of condominium policies:
Model Variables in the Model df loglikelihood AIC BIC
1 Risk Class 5 —47,704 95,418 95,473.61182
2 Risk Class + Region —47,495
3 Risk Class + Region + Claim Indicator 10 —47,365 94,750
Insureds are from one of five risk classes: A, B, C, D, E.
Condominium policies are located in several regions.
Claim Indicator is either Yes or No.
All models are built on the same data.
Calculate the absolute difference between the AIC and the BIC for Model 2.
14.22. [S-S16:37] Determine which of the following GLM selection considerations is true.
(A) The model with the largest AIC is always the best model in model selection process.
(B) The model with the largest BIC is always the best model in model selection process.
(C) The model with the largest deviance is always the best model in model selection process.
(D) Other things equal, when the number of observations>1000, AIC penalizes more for the number of param-
eters used in the model than BIC.
(E) Other things equal, when the number of observations>1000, BIC penalizes more for the number of param-
eters used in the model than AIC.
14.23. •41 IS-S16:38) You are testing the addition of a new categorical variable into an existing GLM, and are given
the following information:
(i) A is the change in AIC and B is the change in BIC after adding the new variable.
(ii) B > A + 25
(iii) There are 1500 observations in the model.
Calculate the minimum possible number of levels in the new categorical variable.
(A) Less than 3 (5) 3 (C) 4 (D) 5 (E) More than 5
Exam SRM Study Manual Exercises continue on the next page ...
Copyright 02022 ASM
246 14. GENERALIZED LINEAR MODEL: MEASURES OF FIT
14.24. •.+? [MAS-I-F19:321 You have three competing GLMs that each predict the number of claims under an
insurance policy, and are evaluating the models using AIC and BIC. All models are trained on the same dataset of \--)
300 observations. These models are summarized below:
Number of
Model Likelihood Parameters
0.0456 4
2 0.0567 5
3 0.0575 6
The following are three statements about the fit of these models:
I. Model #1 is best based on BIC
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Use the following information for questions 14.25 through 14.27:
You are given a set of 65 observations. Two models are proposed for the underlying distribution. The models
only differ in that the first model includes a categorical variable with 4 categories and the second one doesn't.
Let 11 be the loglikelihood of the first model and let 12 be the loglikelihood of the second model.
14.25. P Calculate the most by which h may exceed /2, using the Akaike Information Criterion, if the second
model is preferred.
14.26. '''‘r Calculate the most by which /1 may exceed /2, using the Bayesian Information Criterion, if the second
model is preferred.
14.27. rCalculate the most by which /1 may exceed 12, using the likelihood ratio test at 5% significance, if the
second model is preferred.
14.28. •41 You are given two models based on the same n observations. The loglikelihood of the first model is
—110.52 and the loglikelihood of the second model is —105.34. The second model has 2 more parameters than the
i rst model.
f
14.29. N? A generalized linear model is used for claim frequency. One of the explanatory variables is age. Age is
a categorical variable. There are 5 age groups,
Model I includes age as an explanatory variable and Model II does not. The models are otherwise identical.
You are given:
Yi 9i
1 1.2
3 2.6
4 4.2
6 6.8
8 7.2
Actual Fitted
1 0.80
2 1.23
3 3.57
4 5.00
5 4.40
The minimal model sets the fitted value equal to the mean.
Residuals
14.34. 14 [SRM Sample Question #52] Determine which of the following statements is/are true about Pearson
residuals.
14.37. 'kJ° For a generalized linear model with 5 observations, the deviance residuals are 0.125, 0.342, -0.207,
0.408, -0.603.
Calculate the scaled deviance.
14.38. •46 For a generalized linear model, the response distribution is Poisson. You are given that y5 = 10 and
gs = 8.
Calculate the deviance residual d.
Solutions
14.1. We calculate the expected number of policies with each number of claims, using Poisson probabilities
p„ = e-A /1"/n!, where A is the mean. With no predictors, the sample mean is used as the mean, and 9720/6480 = 1.5.
Let EL be the expected number of policies with k claims.5 You may calculate Ek directly, but we will use (a, b, 0)
methods that you learn in Exam STAM.
E0 = 6480C1.5 1446
1.5E0 = 2169
1.5
E2 = —El = 1627
2
1.5
Es T E2 = 813
1.5
E4 = - E3 305
1.5
Es = —E4 = 91
5
5
E6 = 6480 - EA, = 29
k=0
Q z(Ok Ek
Ek )2 = Ek
n
sDo not confuse this with exposure, which is not used in this question.
14.2. For Poisson, where variance equals mean and E yi = we may use the alternative formula (14.2). There
are a total of 9 observations; n = 9.
02 02 12 22
Q= + + + • • • + - - 9 = 0.9881
0.2 0.4 0.7 2.1
14.4.
(A) There is one variable for each category in excess of 1, so including more than 2 categories will increase the
number of variables in the model and the number of degrees of freedom.
(B) Increasing the number of variables in the model may decrease deviance and cannot increase it.
(C) The plot will show whether the residuals are randomly distributed.
(D) The VIF is calculated by regressing a variable against the others. VIF is R2 for that regression. High VIF
indicates collinearity.
(E) A parsimonious model has fewer variables, not more variables.
(E)
1.06113
14.10. The maximum likelihood estimate is the sample mean, or 6 = (100+100+500 +800 +1000)/5 = 500. (You may
have learned this in your statistics course, which is preliminary to this course, or while taking STAM. But perhaps this
question is not suitable for SRM.) The loglikelihood of each observation is — ln ei — xi/Oi. Thus for the fitted model,
with 0 = 500, the loglikelihood is —51n 500-2500/500 = —36.073. In the saturated model, each observation is fitted to
its own model, and the fitted value is 0; = xi. Then the loglikelihood is —21n 100—In 500—in 800—In 1000-5 = —34.017.
The difference between double the negative log likelihoods is 2(-34.017 + 36.073) = 4.111 . (C)
14.11. Statement I is true since differences in loglikelihoods can be deduced from differences in deviances, and
then the likelihood ratio test may be used.
The deviance for a normal distribution is E(yi — 9i)2, making II true.
Statement III is true. Refer to the definition of deviance. (E)
14.12. I is true. II is false, since the smaller deviance, the better the fit. III is true since the deviance is a multiple of
the difference from the loglikelihood from the saturated model's loglikelihood. (C)
14.13. Twice the difference in loglikelihoods is 2(134.88 — 130.52) = 8.72. There are 2 constraints, so this statistic
is chi-square with 2 degrees of freedom. For chi-square, 8.72 is between the critical value 7.378 at 2.5% significance
and 9.210 at 1% significance, making (C) the correct answer.
14.14. One category of fertilizers and one category of seed type is baseline, leaving 3 fertilizers and 4 seed types to
interact, a total of 12 interactions. Twice the loglikelihood increases by 2(-272 + 283) = 22 with interactions, and is
chi-square with 12 degrees of freedom. The critical values are 21.026 at 5% and 23.337 at 2.5%, making the answer
(D).
14.15. The difference in deviances is 55.98 — 52.08 = 3.9, and this is the likelihood ratio test statistic. One parameter
is being constrained, so there is 1 degree of freedom. For chi-square with 1 degree of freedom, the critical values are
3.841 at 5% significance and 5.024 at 2.5% significance. So (D) is correct.
14.16. The test statistic is twice the difference in log-likelihoods between using all 4 terms and using only the
intercept:
2(-1121.91 + 1126.91) = 10
There are 3 additional parameters in the bigger model, hence 3 degrees of freedom for the chi-square distribution.
At 5% significance, the critical value is 7.815. Thus the null is rejected. (E)
14.17. The AIC is —21+ 2p, where p is the number of parameters, k + 1. The first model hasp = 1; the second model
has p = 4. For the first model:
—2/ +2 = 89.2
—2/ = 87.2
2/ = —87.2
—2/ + 8 = 88.4
—21 = 80.4
r-, 21 = —80.4
The likelihood ratio statistic is twice the difference in loglikelihoods, or (-80.4) — (-87.2) =
14.18. The lower the AK the better, so Model 1 is selected. From the given coefficients,
14.19. The AIC is twice the negative loglikelihood plus twice the number of parameters. The scaled deviance is
twice the excess of the loglikelihood of the saturated model over the loglikelihood of the model. Since the saturated
model is the same for all four models, the difference in AICs is the difference in scaled deviances phis twice the
difference in the number of parameters.
From the first model we see that Age has 4 parameters (although this is extraneous), and from the second model
we see that Vehicle Body has 10 parameters. Since the AIC of the third model is 32 more than the AIC of the second
model and the scaled deviance is the same, the third model must have 16 more parameters than the second model,
so X = 27 and Age + Vehicle Value has 26 parameters. The fourth model, Intercept + Age + Vehicle Value + Vehicle
Body must have 1 parameter for Intercept, 26 for Age + Vehicle Value, and 10 for Vehicle Body, for a total of 37
parameters. (C)
14.20. Statement (iv) is extraneous.
Twice the negative loglikelihood increases by the same amount the scaled deviance increases, so twice the
negative loglikelihood increased by —53. The AIC is —2/ + 2q where q is the number of parameters. Let p be the
number of parameters for the new variable. We see that the AIC increased by 6 more than —21, so 2p = 6 and p = 3.
The BIC, which is —21 + q Inn increased 21 more than —2/, so p In n = 21 and Inn = 7. The number of observations
is n = e7 = 1097. (B)
14.21. Since claim indicator has one degree of freedom, eliminating it lowers the degrees of freedom from 10 to 9.
From Model 1, we see that
Inn = 95'473.61182
5
— 95,408= 13.12236
It is unnecessary to calculate n, but if you're curious, it is 500,000.
The difference between BIC and AIC in Model 2 is
14.22. (E) is true, since the penalty function per parameter is hi n with BIC and 2 with MC, and Inn > 2 for
n > 1000. That makes (D) false. In the other 3 statements, the opposite of each statement is true: The model with
the smallest AIC/BIC/deviance is best.
14.23. The difference in penalties per parameter between BIC and AK is In n — 2 = In 1500 — 2 = 5.3132. If the
difference is greater than 25, then the number of parameters must be at least 25/5.3132, which rounds up to 5. Since
the number of parameters is 1 less than the number of levels, there must be at least 6 levels. (E)
14.24. Twice the negative loglikelihoods are
Model 1 —21n 0.0456 = 6.176
Model 2 —21n 0.0567 = 5.740
Model 3 —21n 0.0575 = 5.712
After adding the AIC penalty of 2 per parameter, Model 1 is best. The BIC penalty is in 300 = 5.704 per
n parameter, so once again Model #1 is preferred. (A)
14.25. A 4-category variable generates 3 parameters. Model II has 3 parameters less than Model I. The AIC is
obtained by doubling the loglikehood, negating, and adding 2 for each parameter. If Model II's loglikelihood is less
than Model I's by the number of parameters, El, then the two AICs will be the same.
14.26. The BIC is obtained by doubling the loglikelihood, negating, and adding In n for each parameter. The BICs
will be equal if the loglikelihood for Model II is less than Model Is by 3(ln 65)/2 = 6.262
14.27. Twice the difference in loglikelihoods is chi-square with 3 degrees of freedom. The critical value for x2(3) at
95% is 7.815, so we are indifferent between the two models if 2(l1 - /2) = 7.815, or I - 12 3.908
14.30. We must calculate /(b), and 'max. The likelihood and loglikelihood for Poisson observations with means
Ai is
L= e ' ''n
/=-ZAi+EnilnA1-In n ni !
For all models, n, = yi. For the model under consideration, Ai = 9i. For the minimal model, Ai = 4.4. For the
saturated model, A1 yi. We can ignore In n!, which cancels. In fact, we can even ignore E Ai, since Eyi = z 91,
but we won't ignore it.
14.31. We'll use the calculations from the previous exercise. But we'll have to respect signs and subtract inn ni!
from each one since it doesn't cancel in the R2 formula.
R2 =
(exp(-11.5584/51 2 =0.752224
exp(-8.0703/5)
0.752224
max-scaled R2 - 0.7597
1 - exp(-11.5584/5)2
14.32. The density function for an exponential is f(x; p) = II' p. Setting pi = pi, the loglikelihood is
)
-E(ing,+
For the minimal model, y = = 3. For the saturated model, y = y. With the given fitted values,
With pi = = 3,
15
/0 -7- -51n3- - = -10.4931
3
With 9; = yi,
/max = -1n(1 2 • 3 • 4 • 5) - 5 = -9.7875
The pseudo-R2 statistic is
10A931 - 10.0002 _ 0.6985
10.4931 - 9.7875 -
R2 = 1
texp(-10.4931/5)) 2
exp(-10.0002/5)
0.17893
0.17893
0.17893
max-scaled R2 = 0.18166
1 - (exp(-10.4931/5))2 0.98496
14.34. The sum of the residuals is the Pearson chi-square goodness-of-fit statistic, making I true. II and III are listed
in the list of four items that residuals are useful for at the beginning of Section 14.6. (D)
14.35. residual =
yi - fEi
-0.46852
V(0.180)(0.820)
Yi 1-yi
14.36. residual = sign(yi -
) 2 (yi In + (1 - yi) In ist )
42 (In 0.8120 = -0.63000
d5=11 Yi
10
Quiz Solutions
14-1. Comparing formulas (3.2) and (14.7), the residual standard error of the regression is ./D(e)/(n — (k + 1)).
s = V125/(22 — 2) = 2.5
14-2. A 4-category variable adds 3 parameters to the model. The AIC penalty adds twice the number of parameters,
or 6, to twice the negative loglikelihood. To compensate for this, the negative loglikelihood must decrease by at least
3, to 155.06
e955185'
14-3. R2 = 1
(e-743/85) 2 = 0.392755
0.392755 0.392755
max-scaled R2 = = = 0.439181
1 — (e-953/85)2 0.894290
K-Nearest Neighbors
As mentioned in the preface, none of the SOA sample questions are on the topic of this lesson. Possibly they never
test on it. So feel free to skip this lesson if you're short on time, unless your friends tell you that they got questions
on it.
We are now going to discuss alternatives to regression. We begin by discussing categorical responses, the
classification setting.
EXAMPLE 15A Z A variable Y may have the value 1 or 2. Given the value of two explanatory variables, Xi. and X2,
the probability that Y is 1 is
Ixi
Pr(Y = 1 I Xi = X1 . X2 = X2) =
IX1 I + IX21
•
iThe textbook is silent on what to do if there is more than one maximum probability; for example, if Pr(Y = j I xo) = 0.5 and there are two
classes. You'd have to pick one of the maxima randomly then.
That is the boundary, with points having Ixil > Ix21 mapping to 1. Here is a graph:
xi
The Bayes classifier is a theoretical method_ We do not know the conditional probabilities Pr(Y = j I xo). The
•••• K-nearest neighbors (KNN) classifier is one way to estimate those probabilities. To carry out KNN, select an integer
K. Then look at the values of Y at the K observations nearest to the point of interest. Set Pr(Y = j I xo) equal to
the proportion of those points with Y = j. Use the Bayes classifier to assign a value to Y; in other words, the value
assigned to Y is the most common value at the K nearest points.
EXAMPLE 15B Y is a classification variable with two possible values: 1 and 2. X is an explanatory variable. You
are given the following observations (X, Y):
Using K-nearest neighbors with K = 3, determine the Xs that go to each value, 1 and 2. Then calculate the error
rate. •
SOLUTION: The majority of the points, at least 2 out of 3, will go to 1 as long as X <8.5. When X > 8.5, then X = 12
is closer than X = 5 and the three nearest points are 7, 10, and 12, so points with X > 8.5 go to 2.
The assignment is incorrect for X = 2 but is correct for all other points, so the error rate is 1/6 0
In the previous example, if K = 1, then each point would go to the Y at the nearest point; the assignment rule would
be
X < 1.5 1
1.5 < X < 3.5 2
3.5 < X < 8.5 1
X > 8.5 2
If K = 5, then points closer to 1 would go to 1 and points closer to 12 would go to 2. The rule would be X <6.5
goes to 1 and X > 6.5 goes to 2, with error rate 1/3.
We see that the higher K is, the less flexible the method is, and the higher the training error is. Higher K
means higher bias. However, higher K also means lower variance. For test data, the error rate is minimized for an
intermediate value of K. Using a value of K too low is overfitting the model.
Sometimes we analyze the error rate as a function of 1/K, in which case low values of 1/K mean high bias and
low variance while high values of 1/K mean low bias and high variance.
y 34 38 53 50 70
SoLuTioN: With K = 1, the value of Y at each point is the value at the closest observation. Thus
X <2.5 Y =34
2.5 < X <3.5 Y =38
3.5 <X <4.5 =53
4.5 < X <6 Y = 50
X>6 Y = 70
With K = 3, the three closest Xs are 2,3, and 4 as long as X <3.5. The closest Xs are 3,4, and 5 when 3.5 < X <5.
The closest Xs are 4,5, and 7 when X > 5. Thus
X <3.5 Y =41
3.5 <X <5 Y =47
X > 5 Y
l
f With K = 5, all five points are closest, and Y is always set equal to its mean, 49.
As we see from the example, the method produces a step function. It becomes less flexible as K increases. What we
said for KNN classification applies here as well; higher K increases the MSE for training data. For test data, higher gl
K increases bias but decreases variance; the value of K minimizing test MSE is in the middle.
How does KNN regression compare to linear regression? If the underlying variable follows the assumptions
of linear regression, then linear regression will give a better fit. If not, KNN may give a better fit, particularly if
the number of predictors is small. But if the number of predictors k is large, there may be no nearby neighbors in
k-dimensional space, and KNN may be based on faraway values of observed Ys. Parametric methods are superior
when there are only a small number of observations per predictor.
In addition, parametric methods such as linear regression are easier to interpret than KNN.
2The textbook is silent on what to do if there is a tie for nearest point. Perhaps you'd average the responses at all the tied points in that case.
Exercises
15.1. ": You are predicting whether a policyholder will renew his policy based on how many years the policyholder
has had the policy with the company. Based on your experience, the probability of renewal is:
1 450 40%
2 240 60%
3 200 70%
4 or more 110 75%
Calculate the Bayes error rate of the renewal prediction based on number of years the policyholder had the policy
with the company.
15.2. Joe is a chess player. The probability that a player wins or draws a chess game against Joe, given the class
of that player, is as follows:
If a player does not win or draw against Joe, then the player loses.
Calculate the Bayes error rate of the predicted outcome of the game based on the player's class.
15.3.
%. The probability of passing Exam SRM is modeled as a function of the number of hours of study, h. The
probability of passing is h/(150 h). The Bayes classifier is used to predict whether a person passes.
The number of hours that people study is uniformly distributed between 100 and 900.
Calculate the Bayes error rate.
15.4. •-ir Y is a classification variable with two classes, I and II. A model for Y has two explanatory variables, X1
and X2. The probability that Y is class I is 4/(4 + 44).
Determine the Bayes decision boundary, and draw a graph with X1 and X2 as the axes, showing the regions in
which class I and class IT are predicted.
15.5. ',11` There are 3 candidates running for mayor, Susan, Jack, and Mae. The probability that a voter will vote
for a candidate is modeled as a function of the voter's income (in thousands). Income is always positive. Let x be
the voter's income. The probability that a candidate will vote for Susan is x/(100 + x). The probability that a voter
will vote for Jack is (100x + 3600)/(100 + x)2. Otherwise the voter will vote for Mae.
Assume x is positive.
Determine the Bayes decision boundaries between the candidates a voter votes for.
15.7. •41 There are two real-valued predictor variables, X1 and X2, and a classification response variable Y with
two classes, A and B. You are given the following data:
)(2
1 2 A
1 6 A
2 2 A
3 4 B
4 1 A
4 5 B
5 3 B
15.8. N? Using KNN with K = 3, determine the regions for which 0 is predicted for Y.
15.9. Using KNN with K = 7, determine the classification error rate.
15.10. sir [MAS-II-F18:361 A training data set contains eight observations for two predictor variables, X1 and X2,
and a response variable, Y. The response Y has three possible classes: P. N, and U.
Distance from
i Xi X2 Y
(xii,xi2) to (3,2)
1 4.1 3.0 P 1.5
2 —2.6 —3.0 N 7.5
3 —1.1 1.3 U 4.2
4 0.0 1.2 U 3.1
5 —3.0 —5.0 N 9.2
6 2.0 2.0 U 1.0
7 —3.1 —2.0 N 7.3
8 3.2 3.1 P 1.1
Three models are constructed using K-Nearest Neighbors and the data set above to predict Y for the two-
dimensional space of predictors Xi and X2.
• Model 1: K-Nearest Neighbors with K = 1.
• Model I: K-Nearest Neighbors with K = 3.
• Model I: K-Nearest Neighbors with K = 7.
Each model is used to classify the point (3,2).
Determine the predicted response Y at this point using each of the three models.
(A) Model I: Y =I', Model II: Y =I; Model III: Y =P
(B) Model I: Y =P, Model II: Y =P, Model III: Y =U
(C) Model I: Y =U, Model II: Y =P, Model III: Y =P
(D) Modell: Y =U, Model H: Y =P, Model III: Y =U
(E) The answer is not given by (A), (B), (C), or (D)
15.11. [MAS-II Sample:11] You are given the following data to train a K-Nearest Neighbors classifier with
K = 5:
Distance to
X1 X2 Y Xi = 0, X2 = 5
4 4 Yes 4.1
1 6 No 1.4
7 5 No 7.0
5 5 Yes 5.0
2 7 Yes 2.8
7 2 Yes 7.6
8 4 Yes 8.1
8 6 Yes 8.1
2 3 Yes 2.8
2 5 No 2.0
2 2 Yes 3.6
6 6 No 6.1
1 8 No 3.2
0 5 Yes 0.0
15.12. NIP [MAS-H-S19:381 You are provided with training and test data samples consisting of a single variable X,
and an observation Y consisting of two possible classes, T and F.
Training Data Test Data
xi y I xi yi
1 —1.60 T 1 —1.3 F
2 —1.50 F 2 0.9 F
3 —0.60 T 3 1.2 T
4 —0.30 T
5 0.40 F
6 0.40 F
7 0.70 T
8 1.20 T
9 1.30 T
10 2.10 F
15.13. N. [MAS-II-F18:37] You are given a data set consisting of 150 data points. There are three possible
classifications for each data point, as shown in the leftmost graph, with 50 data points falling into each classification.
Based on this data, you train a K-Nearest Neighbors model using 95 of the data points. You evaluate k ranging
from 1 to 95. For each k, you calculate the test error rate on the remaining 55 data points with results shown in the
rightmost graph.
+4
•o 00
o 0 4-
000 o
o 0 0 0 0 0 ++
o +
O00 a A di+ ++A+
O 0 0 + 44
re)
00 000 • AA 4+4 +A4+ ++ -H-
o AA AA A+A A
+11+ A+-1-3-6 a + +
a 4, a ++
AAA +
+A AA+ 4 +
a A a
4 A
1 1
Feature I
Determine the cause of the rapid increase in error rate between k equals 70 and k equals 95.
(A) As k increases the K-Nearest Neighbors algorithm performs better.
(B) As k increases the K-Nearest Neighbors algorithm performs worse.
(C) As k approaches 95, all data points are predicted to have the same classification.
(D) As k approaches 95, all data points are incorrectly classified.
(E) There is no clear relationship between the value of k and the error rate.
15.14. Nr A continuous variable Y is modeled as a function of X using KNN with K = 3. You are given following
data:
X 5 8 15 22 30
Y 4 1 10 16 30
15.15. •11. A continuous variable Y is modeled as a function of X using KNN with K = 2. You are given the
following data:
X 4 7 12 14 15 21 22
Y 3 8 15 22 , 30 40 53
Cross-validation is used to test the fit. The training set consists of the observations with X = 4,12,14,15,22.
(Th Calculate the mean square error on the training data and on the test data.
Solutions
15.1. The Bayes rule predicts nonrenewal for policies with the company for one year and renewal otherwise. There
are 1000 policies total; 45% with the company for one year, 24% for two years, 20% for three years, and 11% for four
or more years. The error rate is 1 minus the probability of the right prediction, the probability of nonrenewal for
policies with the company for one year or renewal for policies with the company for two years or more. The Bayes
error rate is
0.45(0.4) + 0.24(0.4) + 0.20(0.3) + 0.11(0.25) = 0.3635
15.2. The Bayes prediction is the most likely outcome given the class: win for Class B and up, draw for Class C,
lose for Class D and Class E. The average error rate is the sum of the probabilities of error (not selecting the correct
outcome) times the proportion in the class:
0.01(0.01) + 0.05(0.10) + 0.10(0.21) + 0.20(0.50) + 0.30(0.63) + 0.20(0.51) + 0.14(0.16) = 0.4395
15.3. The Bayes decision boundary is h = 150, since the probability of passing is 50% when h = 150. Above h = 150
we predict the person passes; below h = 150 we predict the person fails. Notice that yevh- = 1 151g4°.1„ and that the
density function of the uniform distribution between 100 and 900 is 1/800. The Bayes error rate is
150
1 150
Error Rate =
800
1
(f (1
too ) 1.50900150 dh )
150 + h
dh +
150
150 + h
1
800 (50 — (1501n(150 + /2)1 ) + (1501n(150 + h)19°°))
100 150
= —
800 (50 — 150(ln 300— In 250) + 150(In 1050— In 300))
1
=
(50 — 150 In 1.2 + 1501n 3.5) = 0.2632
800
15.4. The decision boundary is the set of points for which the Pr(Y = I) = 0.5.
x21
=0.5
x2 + 44
I
X2 = 0.5X2 + 2411
0.54 = 24
x2 = ±0.5xi
The Bayes decision boundary is the pair of lines x2 = ±0.5x1. Here is the graph showing the Bayes decision boundary
and the predictions:
x2
X1
15.5. Notice that the probability that a voter will vote for Mae is
x 100x + 3600 100x + X2 + 100X + 3600 6400
1 = 1
100 + x (100 + x)2 x2 + 200x + 10,000 (100 + x)2
We have to determine when each of the probabilities is the largest.. The probability of Susan is greater than the
probability of Jack when
100x + 3600
100 + x (100 + x)2
100x + x2 > 100x + 3600
x2 > 3600
x > 60
We conclude that one decision boundary is E1; below 28 Mae is most likely. The other decision boundary is
!
Ifil; above 60, Susan is most likely. Between 28 and 60, Jack is most likely.
15.6.
h
T e three closest points are (2,2) (A), (3,4) (B), and (5,3) (B), so E is assigned.
15.8. For X <4, the three closest points are 1, 2, and 4, and the majority of those have Y = 0, so 0 is assigned. For
4 < X <6.5, the closest points are 2, 4, and 7, and once again 0 is assigned. For 6.5 < X <8, the closest points are
4, 7, and 11, and the majority of those have Y = 1, so 1 is assigned. For 8 <X <11, the closest points are 7, 11, and
12, and the majority of those have Y = 0, so 0 is assigned. For X > 11, the closest points are 11, 12, and 15, so 1 is
assigned.
15.9. All 7 points are used, and the majority of Y values is 0, so 0 is always assigned. This is correct at 4 of the
points and incorrect at the other 3, so the error rate is 3/7
15.10. The nearest point to (3,2) is i = 6, and Y is U there, so in Model I Y = U. The 3 nearest points are i = 1,6,8,
with two Ps and one U, so in Model II Y = P. The 7 nearest points are every point except i = 5, with two Ps, three
Us, and one N, so in Model III Y = U. (D)
15.11. The 5 closest points, based on the distances in the last column, are at distances 0.0, 1.4, 2.0, 2.8, 2.8, and the
Y column has Yes, No, No, Yes, Yes for those 5 lines respectively, so the probability is 3/5= 0.6 . (C)
15.12. The Bayes error rate is the average of 1 - max(Pr(Y = j I X)) over all values of X. Here, the Bayes decision
boundary is
e'2= 0.5
lx1=V =0.832555
All 3 values of X in the test data set are greater than 0.832555 in absolute value, so Pr(Y = 1 X) is maximized for
j = T. We therefore need to average the complements of those probabilities, or e_'2,over the three xis:
e-(-1.3)2 c0.92 e-1.22
Bayes error rate = 0.184520 + 0.444858 + 0.236928 = 0.288768
3 3
Notice that the Bayes error rate does not depend on the yi observations in the test data.
The three nearest neighbors of -1.3 in the training data are -1.60, -1.50, and -0.60, with two Ts and one F, or
a prediction of T. The three nearest neighbors of 0.9 are 0.70, 1.20, and 1.30, with three Ts, or a prediction of T. The
three nearest neighbors of 1.2 are 0.70, 1.20, and 1.30, with three Ts, or a prediction of T. Only the prediction for 1.2
is correct, so the error rate is 2/3. Then 2/3 - 0.288768 = 0.377898. (B)
15.13. Since the training data set is 95 points, all points are classified based on distance to some of those 95 points.
If k = 95, then the 95 nearest points are the entire training data set and all points are classified as belonging to the
class of the majority of those 95 points. (C)
15.14. The three nearest X values are 5, 8, and 15, and Y = 4,1,10 at those points, so the fitted value is the average
of those three numbers, or (4+ 1 + 10)/3 = E.
15.15. We calculate the fitted values at all seven Xs using only the five Xs in the training set.
4 4,12 3+15 _ a 3 6
2 — '
7 4,12 3+15 _ a 8 1
2 — '
12 12,14 15.22 = 18.5 15 3.5
14 14,15 2213° = 26 22 4
15 14,15 22-r° = 26 30 -4
21 15,22 3°42.53 = 41.5 40 1.5
22 15,22 313+53
2
- 41.5 53 -11.5
BREAK. #asmstudybreak
Not all important questions are mathematical; creative thinkins is an important skill to practice!
In order to protect
themselves from poachers,
African Elephants have been
evolving without tusks, which
unfortunately also hurts their
species.
aisim
Actuarial Study Materials
Lesson 16
Decision Trees
Reading: An Introduction to Statistical Learning 8. If using second edition, skip subsections 8.2.4, 8.25., and 8.3.5
Rapch
Bedro s > 2
The tree is drawn upside down. The nodes on the bottom are called leaves or terminal nodes. The nodes in the tree
that are split into two branches are called intermediate nodes. Notice the following regarding the tree:
• The predictors can be categorical, count, or continuous variables. For continuous variables, a cut point must
be selected. Here we arbitrarily selected 1500 for the cut point of square footage. We will discuss how to select
the cut point.
• Every split is binary. To split the 3-category categorical variable House Style, we first split it into two categories,
then split one of those two into two categories.
• Questions may be different at each node. We can ask for square footage of ranches and number of bedrooms
of colonials, and not ask any question for Cape Cods. If we had asked about square footage for Cape Cods,
the cut point may be different than the cut point for ranches.
Drawing the regions of the above tree requires three dimensions, since there are three variables. If we only consider
House Style and footage, arid prune the bedroom node off the tree, a graph of the regions would look like this:
1500
8
(Yi — )2
j=1.
where Ri is the ith region and there are J regions. However, it would not be possible to calculate MSE for every
Lir possible split into regions due to the huge number of possible splits. Instead, trees are grown by recursive binary
e-r splitting. The algorithm selects a binary split that minimizes the MSE. This algorithm is greedy in that it only
optimizes the current split, and does not take into account that a split that is not optimal at the current iteration may
lead to a better overall tree at a later step.' At each step, we select a region Ri to split, a predictor Xk, and a cut point
s so that the split into Rh and Ri, minimizes
The algorithm continues until the number of observations in a region is below a prespecified number; for example,
until there are fewer than 5 observations.
The resulting tree is probably too big. More splits means more flexibility, lower bias, higher variance, and as
usual, there is an optimal number of splits that minimizes test MSE. We therefore prune the tree, using a method
similar to the one discussed in Section 8.1 for the lasso. This pruning method is called cost complexity pruning or
446 weakest link pruning. We specify a tuning parameter a. The cost of a tree is a per terminal node. For each value of a,
we prune the tree to minimize
IT'
0.9 —
0.8 —
0.7 — •M1
cross-entropy
0.6 —
0.5
0.4 —
0.3 —
Gini index
0.2—
0.1 —
0
0 0.1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 1
Figure 16.1: Gini index and cross-entropy as a function of pink when there are two classifications.
the MSE for each number of terminal nodes. After we determine the best a (or the optimal number of terminal
nodes), we calculate the MSE of the pruned tree on the test data.
We'll now discuss optimizing classification trees.
For classification trees, instead of mean squared error, one might consider using the classification error rate as
the number to minimize:
EE Pr(yjgizi)
j=1 iERj
For a single region Rj and classes 1, 2, ... , K, y'Rj is the most common class. Let pjk be the proportion of observations
in R./ for which yj = k. Then the classification error rate for region Rj is
Ej = 1 — max /3 jk (16.2)
But this measure is not sufficiently sensitive for tree growing. Instead, either the Gini index or the cross-entropy is
used. For a single region Rj, the Gini index is the variance of the observations: •41
(16.3)
Dj = —Epikinpik
k=1
(16.4)
Since the logarithm of Pik, a number between 0 and 1, is negative, D3 is positive. Cross-entropy and Gini index are
close numerically. Both of them measure node purity; they are minimized when pik is close to 0 or 1, so they are
minimized when all the classes are right or when they are all wrong. Neither measure takes into account the fitted
classifica tion.2
Figure 16.1 shows how the Gini index and cross-entropy vary with pink when there are two classifications.3
These are the definitions of Gird index and cross-entropy for a single region. When splitting a tree at a node, you
need to compute these measures for both split regions overall. The measure for a set of regions equals the weighted
average of the measures for each of the regions, with weights being the proportions of observations in each region.
21n cross-entropy, Pik in pik is treated as 0 when pik = 0.
3This is part of An introduction to Statistical Learning exercise 8.3.
EXAMPLE 16A A categorical response variable Y has values A, B, and C. There is one explanatory variable X.
You are given the following observations:
X 0 0 1 2 3 5 6 9 10 12
Y A B A A B B C B C C
SoLurioN: 1. The proportions of observations in each class are 0.3 in A, 0.4 in B, and 0.3 in C. Then
2. The region X <6 has 6 observations and gets 0.6 weight the region X 6 has 4 observations and gets 0.4
weight. For X <6, the proportions of observations in each class are 0.5 in A, 0.5 in B. For X 6, the proportions
of observations in each class are 0.25 in 13, 0.75 in C.
The Gini index or cross-entropy are used for splitting the tree. But for pruning the tree, while any of the three
measures may be used, it is best to use classification error rate as the criterion if predictive accuracy is desired. The
cross-validation method, which involves a tree built with training data and cross-validation for the pruning of this
tree, is the same as the method for regression trees.
The use of the Gini index or cross-entropy for splitting may result in split nodes with the same predicted class.
Even though the predicted class is the same, node purity may be increased. For example, suppose there are two
classes A and B, with 16 As and 4 Bs. Split it into one region with 10 As and another region with 6 As and 4 Bs. Both
splits will predict A. The Gini index before the split is
(0.8)(0.2) + (0.2)(0.8) = 0.32
After the split, each region has half the observations and gets a weight of 1/2. The Gini index is
where n,„k is the number of observations in class k in terminal node in and /3„, is n,„k/n (as it was when we
discussed the cross-entropy.)
el? How do decision trees compare with linear regression? If the assumptions of linear regression are satisfied,
linear regression will do a better job. In fact, linear regression usually does a better job than decision trees. But if
the underlying variable is complex and nonlinear, decision trees may be better. If the decision boundary is a set of
horizontal and vertical lines, decision trees will capture that relationship better than a linear model.
Decision trees have the following advantages over linear models:
1. Easier to explain.
2. Closer to the way human decisions are made.
3. Tree can be graphed, making it easier to interpret.
4. Easier to handle categorical predictors; linear regression requires dummy variables.
However, decision trees suffer from two shortcomings: they do not predict as well, and they are not robust. Small
changes to input data can have big effects on trees. The methods of the next section address these shortcomings.
16.2.1 Bagging
Bagging is a form of bootstrapping. In fact, bagging stands for bootstrap aggregation. Strangely, bootstrapping is not •Nr
on the syllabus, but you need to understand the bootstrapping concept to understand bagging, so here's a short
explanation.
The basic idea of bootstrapping is that we often want to know something about the underlying distribution,
but we don't know the underlying distribution. To get around that problem, use the empirical distribution, the
observations that we have, as the underlying distribution. To learn something about the underlying distribution,
simulate off the sample. (This sounds crazy at first, since the sample is already random, but get used to it!)
Simulating off the sample of size n means drawing n items from the sample with replacement. (If it weren't
with replacement, we know what those n items would be.) For example, if you're observing claim sizes from some
insurance, and you have 5 claim sizes: (1000, 2000, 4000, 6000, 9000), you may randomly draw bootstrap samples
like:
Of course in real situations, the sample has hundreds or thousands of items, and you draw hundreds or thousands
of samples.
And that's what bagging is about! Take your training set, with n observations, and select B bootstrap samples
from it. Construct trees, using the algorithms we discussed in the previous section. If you are interested in the
response for x = {x1, x2, , xk}, a vector of values for the k predictors, calculate the corresponding y for each of the
B trees and average them.
1 B
Aas(x) = f (x)
1)=1
where ](x) is the value of the response for tree b. By averaging the trees, variance is reduced. Remember what you
learned in probability: the variance of the sample mean for an independent sample is the variance of the distribution
divided by the size of the sample. Bagging divides the variance of a single tree by B, assuming that the trees are
independent. They probably are not independent, and we'll discuss how to fix that problem in the next subsection.
If the response variable is categorical, we can set the bagged value equal to the most commonly predicted value
in the B trees.
There is no danger of overfitting by making B too large.
Income
Age
CreditHist
Experience
0 20 40 60 80 100
One can show that for n sufficiently large, about 1/3 of the items in the sample are not used in a bootstrap sample.
This allows "out-of-bag (008) validation". For each tree, the test MSE or error rate may be computed using the
out-of-bag part of the sample, the items that were not used to build the tree. This is very convenient, eliminating
the need for cross-validation. It can be shown that for B sufficiently large, the 00B error is virtually equivalent to
the leave-one-out cross-validation error.
Unfortunately, bagging makes the model difficult to interpret. However, to measure the importance of predictors,
one can calculate the amount that the RSS (for regression trees) or Gini index (for classification trees) is decreased as
64/ a result of bringing in that predictor, and average that amount over all trees. A variable importance plot summarizes
this information. A sample of such a plot for a hypothetical tree based on 4 variables is shown in Figure 16.2. In
this graph, the predictor with the largest average decrease is shown as a bar going to 100, and the bars for the other
predictors are proportionate to that bar. For example, if the best predictor's average decrease in RSS was 300 and
another predictor's average RSS decrease was 99, that predictor's average would be shown with a bar going to 33.
As stated above, the average decrease in Gini index would be used for classification trees.4
Bagged trees may be correlated. An important predictor may appear in all trees regardless of which sample values
s: were used to build the tree. To correct this problem, in a random forest, a positive integer in is specified. At
each split, In predictors are selected randomly, and those are the only predictors that are considered for splitting.
•
The trees are decorrelated. Typically, in -6., where k is the number of predictors. If in = k, the random forests
method reduces to bagging. There is no danger of overfitting by making B too large.
16.2.3 Boosting
We will only discuss boosting in a regression setting, that is, for continuous response variables.
Boosting starts off with a small tree, and adds in a portion of the response. It then recursively builds small trees
based on the residuals from the existing model, and adds a portion of those results to the response. So it learns
slowly.
air Boosting has three parameters:
1. B, a positive integer, the number of cycles. Unlike bagging and random forests, making B too large will overfit
the model; it is selected using cross-validation.
2. A, a positive real number no greater than 1, the shrinkage parameter. It is the proportion of the response that
is added in at each cycle.
3. d, a positive integer, the number of splits for each tree. It is the number of terminal nodes minus 1. If d = 1,
there will be only one split, and the resulting model will be an additive model with no interaction between
predictors. So d is the depth of interaction.
4In An Introduction to Statistical Learning Figure 8.9 and the accompanying discussion, it says that the mean decrease in Gini index is plotted.
On the other hand, on page 330, it says that the average of the decrease in deviance is plotted, and on page 325, the deviance for a classification
tree is defined as the cross-entropy.
k=1
To compute overall Gini index or cross-entropy, use a weighted average of their values in each region. The
weights are the proportions of observations in each region.
Residual mean deviance for classification trees
Bagging Build B trees using B bootstrap samples from the data, then use their average.
Out-of-bag validation Use the observations outside the bootstrap sample to calculate error measures.
Random forests Use bagging, but at each split only allow in predictors, with in;LI IN typically.
Boosting Gradually obtain prediction by building trees on residuals with d splits, shrinking results by factor A,
adding results to current prediction, and subtracting results from residuals.
r"-- Exercises
16.1. 'sir [SRM Sample Question #291 Determine which of the following considerations may make decision trees
preferable to other statistical methods.
I. Decision trees are easily interpretable.
II. Decision trees can be displayed graphically.
III. Decision trees are easier to explain than linear regression models.
(A) None (13) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
16.2. '141 [SRM Sample Question #33] The regression tree shown below was produced from a dataset of auto claim
payments. Age Category (1, 2, 3, 4, 5, 6) and Vehicle Age (1, 2, 3, 4) are both predictor variables, and log of claim
amount (LCA) is the dependent variable.
> 2.5
veh_a >3.5
l
f
16.3. '',46 [SRM Sample Question #511 You are given the following regression tree predicting the weight of ducks
in kilograms (kg): \--2
1.10 kg 1.25 kg
16.4. f A regression tree is being constructed. There is one explanatory variable X and the response variable
is Y. Four observations of (X, Y) are:
After one iteration of recursive binary splitting, the observations are split into two groups.
Determine the members of the two groups.
16.5. alir [MAS-II Sample:12] A data set contains six observations for two predictor variables, X1 and X2, and a
response variable, Y.
Xi X2 Y
1 0 1.2
2 1 2.1
3 2 1.5
4 1 3.0
2 2 2.0
1 1 1.6
X2 <t2
A
Si <t3
A A
t3
Xi C Xi
(A) ti B (B) t3
t2 t2
X2 X2
A A
t3
X2 D
(D) t1 B
t2
xl t2xi
t2
X2
(---- 16.7. 1141 [SRM Sample Question #571 You are given
(i) The following observed values of the response variable, R, and predictor variables X, Y, Z:
R 4.75 4.67 4.67 4.56 4.53 3.91 3.90 3.90 3.89
X M F M F M F F M M
Y A A D ID B C B D B
Z 2 4 1 3 2 2 _
5 5 1
Y = A,B
X=F
Ti T2
T3 T4
Calculate the Mean Response (MR) for each of the end nodes.
(A) MR(T1) = 4.39, MR(T2) = 4.38, MR(T3) = 4.29, MR(T4) = 3.90
(B) MR(T1) = 4.26, MR(T2) = 4.38, MR(T3) = 4.62, MR(T4) = 3.90
n (C) MR(T1) = 4.26, MR(T2) = 4.39, MR(T3) = 3.90, MR(T4) = 4.29
(D) MR(T1) = 4.64, MR(T2) = 4.29, MR(T3) = 4.38, MR(T4) = 3.90
(E) MR(T1) = 4.64, MR(T2) = 4.38, MR(T3) = 4.39, MR(T4) = 3.90
16.8. '45 [MAS-II-F18:38] You are given the following unpruned decision tree:
82
20 58
The values at each terminal node are the residual sums of squares (RSS) at that node. The table below gives the
RSS at nodes S, T, and X if the table was pruned at those nodes:
Node RSS
251
209
X 86
The RSS for the null model is 486. You use the cost complexity priming algorithm with the tuning parameter, a,
equal to 9 in order to evaluate the following pruning strategies:
No nodes pruned
Prune node S only
Prune node T only
Prune node X only
Prune both nodes S and X
16.9.s: A classification tree is constructed to predict whether students will pass Exam SRM on their first try.
Two binary explanatory variables are used: X1 (Did the student take a statistics course in college?) and X2 (Did the
student pass STAM?) Available data is as follows:
= 0, X2 = 0 5 passed on their first try, 20 didn't
= 1,X2 = 0 10 passed on their first try, 10 didn't
Xi = 0,X2 = 1 7 passed on their first try, 9 didn't
Xi = 1,X2 = 1 6 passed on their first try, 4 didn't
Determine the first split made in the tree using the Gini index.
Exam SRM Study Manual Exercises continue on the next page ...
Copyright ©2022 ASM
EXERCISES FOR LESSON 16 287
16.10. •41 For a regression tree, two nodes of the tree have the following response values:
R4: 2, 3, 3, 4, 5
R5: 2, 4, 6
The tree is pruned using cost complexity pruning. The split of R4 and R5 is the optimal one to prune.
Determine the smallest value of a for which pruning will occur.
16.11. •-: [SRM Sample Question #25] Determine which of the following statements concerning decision tree
pruning is/are true.
I. The recursive binary splitting method can lead to overfitting the data.
II. A tree with more splits tends to have lower variance.
III. When using the cost complexity pruning method, a = 0 results in a very large tree.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
16.12. "4" [SRM Sample Question #411 For a random forest, let p be the total number of features and in be the
number of features selected at each split.
Determine which of the following statements is/are true.
I. When in = p, random forest and bagging are the same procedure.
IL is the probability a split will not consider the strongest predictor.
P
16.16. [MAS-II-F18:401 You are given the following classification decision tree and data set:
Xi <21 i Y
X1 X2
1 12 Y T
2 23 N'F
= Y 3 4 Y F
4 32 Y F
5 22 N T
6 30 Y T
7 18 N T
Determine the relationship between the classification error rate, the Gini index, and the cross-entropy, summed
across all nodes.
16.17. "I [SRM Sample Question #9] A classification tree is being constructed to predict if an insurance policy
will lapse. A random sample of 100 policies contains 30 that lapsed. You are considering two splits:
Split 1: One node has 20 observations with 12 lapses and one node has 80 observations with 18 lapses.
Split 2: One node has 10 observations with 8 lapses and one node has 90 observations with 22 lapses.
The total Gini index after a split is the weighted average of the Gini index at each node, with the weights
proportional to the number of observations in each node.
The total entropy after a split is the weighted average of the entropy at each node, with the weights proportional
to the number of observations in each node.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
16.18. [SRM Sample Question #501 Determine which of the following statements regarding statistical learning
methods is/are true.
I. Methods that are highly interpretable are more likely to be highly flexible.
II. When inference is the goal, there are clear advantages to using a lasso method versus a bagging method.
III. Using a more flexible method will produce a more accurate prediction against unseen data.
(A) I only
(B) II only
(C) III only
(D) I, II, and III
(E) The answer is not given by (A), (B), (C), or (D)
16.19. [SRM Sample Question #10) Determine which of the following statements about random forests is/are
true.
I. If the number of predictors used at each split is equal to the total number of available predictors, the result is
the same as using bagging.
II. When building a specific tree, the same subset of predictor variables is used at each split.
III. Random forests are an improvement over bagging because the trees are decorrelated.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
16.20. ": [SRM Sample Question #39] You are given a dataset with two variables, which is graphed below. You
want to predict y using x.
Determine which statement regarding using a generalized linear model (GLM) or a random forest is true.
o 0 o
U, o(93
o 0
oo o0 0 8)88 0 0 00 8 o
o2_ 0 o
CV 0 g°% ° o GEo-v
oo
0, co
ci2P'
>. O o 0o
o o co
-4 -2 0 2 4
(A) A random forest is appropriate because the dataset contains only quantitative variables.
(B) A random forest is appropriate because the data does not follow a straight line.
(C) A GLM is not appropriate because the variance of y given x is not constant.
(D) A random forest is appropriate because there is a clear relationship between y and x.
(E) A GLM is appropriate because it can accommodate polynomial relationships.
n'16.21.'141' EMAS-II-F18:391 An actuary creates three tree-based models using bagging, boosting, and random
forests. The error on a test data set, as a function of the number of trees in each model, is plotted on the graph below.
0.55
0.5
%
gimpapjaira, eta 0. • w VP fp& lib Oita sir elm es%
0.35
0.3
Determine the type of model most likely to have created each of the lines on the graph.
(A) I: Boosting, II: Bagging, III: Random forest
(B) I: Bagging, II: Boosting, III: Random forest
(C) I: Bagging, II: Random forest, III: Boosting
(D) I: Random forest, II: Bagging, III: Boosting
(E) The answer is not given by (A), (B), (C), or (D)
X2 < 1 X4 <2.67
6.4 4.34
X3 = Y X3 = Y
1
4.83 4.75
Xi = Y X4 <3
10 3.14 6.67
X7 <8
31.5 8.81
X7 <6.5 I X&
0.34 5.89
=Y X3 = Y
2.81 5.9
X3 = Y X4 <2
8.1 6.3 7.41 14.1
X1 X2 X3 X4. X5 X6 X7
N 6 Y 4 0.5 0.25 6
Calculate the prediction of the boosted tree model for this record.
(A) Less than 2
(B) At least 2, but less than 5
(C) At least 5, but less than 8
(D) At least 8, but less than 11
(E) At least 11
(---- 16.23. a•: [SRM Sample Question #261Each picture below represents a two dimensional space where observations
are classified into two categories. The categories are represented by light and dark shading. A classification tree is
to be constructed for each space.
Determine which space can be modeled with no error by a classification tree.
I. II.
(A) I only (B) II only (C) III only (D) 1, II and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
16.24. s: [SRM Sample Question #12] Determine which of the following statements is true.
(A) Linear regression is a flexible approach
(B) Lasso is more flexible than a linear regression approach
(C) Bagging is a low flexibility approach
(D) There are methods that have high flexibility and are also easy to interpret
(E) None of (A), (B), (C), or (D) are true
I. The main difference between bagging and random forests is the number of predictors considered at each step
in building individual trees.
II. Single decision tree models generally have higher variance than random forest models.
III. Random forests provide an improvement over bagging because trees in a random forest are less correlated than
those in bagged trees.
Determine which of the statements I, II, and III are true.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Solutions
16.1. All three statements are true. Decision trees do not require appreciation of the effects of coefficients, link
functions, etc., making them easily intepretable and easier to understand than linear regression. And they are
displayed graphically. (E)
16.2. For I, you immediately go right and end up at 8.146.
For II, you go left, then right since vehicle age is at least 2.5, then left because age category is at least 4.5, then
right since vehicle age is at least 3.5, and end up at 8.028.
For III, you do the same as II, except at the last juncture you go left since vehicle age is less than 3.5, and end up
at 7.771. (E)
16.3. For X, we go left at the Age node and then right at the Gender node, getting 0.90 kg.
For Y, we go left at the Age node and left at the Gender node, getting 0.8 kg.
For Z, we go right at the Age node, right at the Gender node, and right at the Wing Span node, getting 1.25 kg.
(C)
16.4. We minimize the RSS from using the mean of each group.
We can split at x = 0.5, at x = 2, or at x = 4.5. (Other splits can be made, such as x = 2.5, but they would be
equivalent to one of these three in that they split the 4 points up into the same groups.)
If the x = 0.5 split is made, (0,8) is split from the other observations. The mean response for the other
observations is 19/3 and the RSS is 3 times the variance of the three responses:
2
(5 213 ) + (8 \2± (6
) k 3) 19)2= 42
3
If the x = 2 split is made, (0,8) and (1,5) are put in the first group. Then the prediction is .84 5-= 6.5 for the first
group and = 7 for the second group, and the RSS is
2
2
± 2(12) = 6.5
If the x = 4.5 split is made, (6,6) is split from the other observations. The mean of the other observations is 7
and the RSS is
2(8 — 7)2 + (5 — 7)2 =6
So the first option is taken: (0,8) is in the first group and the other three observations are in the second group.
16.5. Splits I and III don't split at all; all observations go into R2. Split II puts (4,1) into R2 and everything else
into Ri. There is no error for (4,1), whereas the error of the other 5 is the square difference from the mean, or the
population (division by 5) variance, 0.1096, times 5/6. Split IV puts (1,0) into Ri and everything else into R2. Once
again, we can compute the MSE as the variance in R2, or 0.2824, times 5/6. Split V puts two observations, (3,2) and
(2,2), into R2 and the others into Ri. The variance of the observations in Ri is 0.451875 so the sum of squares is
4(0.451875) = 1.975. Dividing by 6 gets a mean square error higher than for Split II, and this is even before adding
in the error for R2. (B)
16.6. If xi > ti then A is chosen. If ti <Xl <t3 then D is chosen. That is enough to select (B). You can then verify
that it does the correct selection for x2 as well.
16.7. There are nine observations, and the nodes they correspond to are:
R 4.75 4.67 4.67 4.56 4.53 3.91 3.90 3.90 3.89
X M F M F M F F M M
Y A A D D B C B D B
Z 2 4 1 3 2 2 5 5 1
Ti T3 T2 T2 Ti T2 13 T4 Ti
Using X2:
When X2 .= 0, 15 pass and 30 don't.
When X2 = 1, 13 pass and 13 don't.
The Gini index is
16.10. If the rectangles are kept separate, the mean of each rectangle (3.4 for R4, 4 for R5) will be the fitted value,
and the square differences between the values and their means add up to:
16.13. The most common class for Region I is Class A, so the classification error rate is 1 — 72/(72 + 22 + 6) = 0.28
16.16. In this question, they expected you to sum the Gini index and cross-entropy over the nodes, rather than
compute a weighted average. It's a bit confusing that Y is used both as a variable name and as a value of X2.
Observations i = 1, 3, and 7 are in the Xi <21 node, and the most common value of Y for those three values of
i is T, which occurs twice, so the classification error is 1 — 2/3 = 1/3. Observations i = 4 and 6 are at the Xi 21,
X2 = Y node, with Y equal to T once and to F once, making the classification error 1/2. Observations i = 2 and i = 6
are at the X1 21, X2 = N node with Y equal to T once and to F once, making the classification error 1/2. Total
classification error is 1/3 + 1/2 + 1/2 = 4/3.
Gini index at first node is (1 /3)(2 /3)÷(2 /3)(1/3) = 4/9. At the second and third nodes it is (1/2)(1/2)+(1/2)(1/2) =
1/2. Sum is 4/9 + 1/2 + 1/2 = 13/9.
Cross entropy at first node is —(1/3) ln(1 /3)— (2/3) In(2/3), and at the second and third nodes it is —(1/2) ln(1 /2)—
(1/2) ln(1/2). The sum over the three nodes is 2.022809. (A)
16.17. For the Gini index, we start with Split 1. For the first node, for 20 observations, the majority are lapses, so
this node will be predicted to lapse. Then 12 of the 20 observations observations, or 0.6, are properly classified and
8 observations, or 0.4, are not. Thus the Gini index is (0.6)(0.4) + (0.4)(0.6) = 0.48. For the second node, the majority
are non-lapses, so non-lapse will pe predicted. 62 observations are non-lapse and are properly classified and 18 are
lapses and are not properly classified, so the Gini index is (0.775)(0.225) + (0.225)(0.775) = 0.34875. Weighting these
by proportions of observations, 0.2 in the first node and 0.8 in the second node, we get
0.2(0.48) + 0.8(0.34875) = 0.375
For Split 2, the node with 8 lapses will be classified as lapse and 8 of the 10 observations will be properly classified,
while the node with 22 lapses will be classified as non-lapse and 68 of the 90 observations will be properly classified.
We get
Split 2 is preferred.
Split l's cross-entropy is
—0.2(0.6 In 0.6 + 0.4 in 0.4) — 0.8(0.775 In 0.775 + 0.225 In 0.225) = 0.5611
Split 2's cross entropy is
68 68 22 22
L Methods that are highly interpretable tend to be simple and have few parameters. For example, linear regression
is easier to interpret than splines. Thus methods that are highly interpretable are less flexible.X
IT. Lasso is a simple method; it simply selects parameters. Bagging is hard to use since it uses an average of lots
of models./
III. Flexible methods may fit the training data better but may not fit test data so well.X
(B)
16.19.
16.20. The plot seems to show a quadratic relationship between y and x, and we know that GLM (or even linear
regression) can handle such relationships by using x2 as a predictor, so on an exam I would answer (E) and move
on. But let's look at the other four statements.
(A) is wrong because tree-based methods are better at qualitative variables and weaker with quantitative
variables.
(B) is wrong because tree-based methods do not have a special technique to handle linear versus non-linear
relationships.
(C) is wrong because the variance does look constant in the plot.
(D) is wrong because tree-based methods don't produce any formula relationship between variables. GLM
does that.
16.21. Bagging selects bootstrap samples from the nodes. As the number of those samples goes to infinity, the
method converges to its lowest error rate and cannot improve after that. Random forests are like bagging except
that they remove correlation between samples by forcing the selection to only consider some of the nodes; thus the
results will once again converge after a while, but the removal of correlation should improve the error rate. Boosting
builds a lot of small trees and sets the response equal to a function of the results; thus it can result in improvements
even as the number of trees grows large, although if the number of trees is too large there may be overfilling. Thus
the answer is (C), even though line III doesn't increase in the graph; perhaps it increases when the number of trees
is higher than 750.
16.22. For the first tree, X2 > 1, then X3 = Y, then X1 Y, so we end up at the 3.14 node.
For the second tree, X4 2.67, then X3 = Y, then X4 3, then X7 <8, so we end up at the 3.5 node.
For the third tree, X7 <6.5, then X1 # Y, so we end up at the 2.81 node.
For the fourth tree, X6 <1, so we end up at the 0.34 node.
The prediction is 0.2(3.14 + 3.5 + 2.81 + 0.34) = 1.958 . (A)
16.23. Only a function that is piecewise constant can be modeled with no error by a classification tree. For such a
function, the regions are rectangles parallel to the axes. Thus only I can be modeled with no error by a classification
tree. (A)
16.24. Linear regression forces a straight line, so it is not flexible. The lasso limits the variables further, so it is less
l exible than linear regression. Bagging uses an average of many models so it is more flexible than a single tree.
f
Methods with high flexibility require complicated relationships to the explanatory variables, making them difficult
to interpret. (E)
16.25. All three statements are true, as discussed in this lesson. (D)
Reading: An Introduction to Statistical Learning 10.2 (first edition) or 12.2 (second edition)
= 0 for j = 1, 2,... ,p
zm
j=1
In other words,
Zim = Eq5jmXij
j=1
Since xi/ = 0, if follows that
n P n
E
:=1
Zim .7.:-.
E (ping z xii = 0
j.i i.i.
For the first principal component Z1, Oil are selected to maximize the variance of Z1. In other words,
Maximize
2
TI E Zi — TIE DpfiXii
1=1 1=1 j=1
—1
CPA
—
j=1
Thus the most important direction of the Xj, the one on which they vary the most, is selected. Then
j=1
For the other principal components Zr,,, goi,„ are selected to maximize the variance of Z„, with the constraint that
Zm is uncorrelated to any of the previous principal components. That is equivalent to it being orthogonal to all
previous components. Thus all the equations stated above for Zi hold generally for Z„1:
Elb? 171
=1
.1=1
To illustrate this, consider two variables that have already been centered at mean 0, X1 and X2, with values shown
in the following table:
i xii xi2 zil zi2 i Xil X12 Zil Zi2
The loadings of the first principal component are 0.9143 and 0.4050. The loadings of the second principal
component are -0.4050 and 0.9143. As you can see, the second loading component is perpendicular to the first
loading component. The table shows the scores of the two components for each observation. For example, for the
i rst observation
f
Figure 17.1 shows the points and the first principal component line, the line defined by
021
Y-
7
To understand what the scores represent, look at the figure and project the observations perpendicularly to the
principal component line. They hit the line at a point. The score of the observation is the distance of that point
to (0,0). In other words, if you computed both principal components, and then drew a graph with axes Zi and Z2, the
coordinate of that point would be the two scores, (zii , z12). For example, Figure 17.2 projects observation 12 on the
two principal components. The scores are 9.83699 for the first component and —9.86072 for the second component.
After looking at this graph, you can understand why observation 13's score on the first principal component is lower
than observation 12's score even though its x coordinate is higher.
14-7"\.JQuiwords,
z 17-1theNIPlength
Consiofdera perpendicular
the 20 point exampl e we just discussed and il ustrated in Figure 17.1.
Calculate the Euclidean distance of the first observation (-24, —21) to the principal component line, in other
line dropped from the first observation to the principal component line.
The second principal component is the vector of loadings Op subject to the constraint Ei 4:12 = 1 that maximizes 111
the variance of zi2 = E cpprij such that Z2 is uncorrelated to Z1. And all other principal components are defined
the same way, to maximize variance and be uncorrelated to the previous principal components. Each principal
component is orthogonal (perpendicular) to the hyperplane of the previous principal components. Thus principal
component m is the set of cpims, j = 1,. . , p, that solves the following:
Maximize Z1 opi i subject to
Ei!j=1 4)2int —
—1
17.2 Biplots
l
f One way to visualize the principal components is through a biplot. A biplot plots two things, using labels on the left 61
Exam SRM Study Manual
Copyright ©2022 ASM
302 17. PRINCIPAL COMPONENTS ANALYSIS
and bottom for one and labels on the top and right for the other. For principal components, the biplot shows which
variables are correlated with each other and how the observations vary with the two principal components shown.
Our biplot will plot the first two scores of our 20 observations, and the first two loadings of each variable. The
biplot is more interesting with more than 2 variables, so we'll add a third variable to the 20 point example given
in the previous section. Table 17.1 lists the three variables and the three scores, but we will use only the first two
scores.
1 2 3
The biplot is Figure 17.3. The three variables are labeled X1, X2, and X3, and the 20 observations are labeled
pi., • • , P20 • The biplot shows the first two components of the loading vectors by variable]; it shows the vector of the
principal component 1 and principal component 2 loadings for each variable. The numbers on the bottom and left
axes are scores. The numbers on the top and right axes are loadings. For example, the line labeled "Xi" goes from
(0,0) to (0.16741,0.90061). We see that variables 1 and 2 each have loadings that are virtually in the same direction;
they are highly correlated. Variable 3, however goes in a quite distinct direction, and consists almost entirely of
component 1. The observations are all around the place. Some examples: Observation 20 (shown as p20) is high
in variables 1 and 2 but only average in variable 3. Observation 12 is average in variables 1 and 2 and very high in
variable 3. Observation 10 is low in variable 3 and a little above average in variables 1 and 2. .)
lf Table 17.1: Variables and their scores on the three principal components
-0.5 0 0.5
toD
Lt
C•1
Cg
17.3 Approximation
Another interpretation of principal components is that they are the best linear approximation of the observations.
Nr They constitute the hyperplane with minimum Euclidean distance to the observations. If M principal components
are used, then the score vector of an observation times the loading vector for a variable approximates the observation's
value for that variable:
111? x
E zin, (pin,
m=1
(17.1)
assuming that the variables are centered at 0. The approximation is exact if M = min(7/ — 1, p), as the following
shows for M = p:
But
SO
E ofinzi,„ = Xii
ni=1
EXAMPLE 17A 11 For the 3-variable example above, the scores of the first observation in the first two components,
to four decimal places, are —25.5751 and —26.0207.
Approximate the three components of the first observation. •
SOLUTION:
Actual values are X1,i= —24, x/,2 = —21, and x13 = —20.
17.4 Scaling
In linear regression, the scale of variables does not matter; if a variable is multiplied by r, then the corresponding
'sir coefficient p is divided by r. But the scale does matter in principal components analysis. If a variable is multiplied by
a constant greater than 1, its variance increases, and PCA puts a higher loading on the variable in order to maximize
variance. To avoid giving some variables spurious importance, the variables are usually scaled so that their standard
deviations are 1. However, if the variables are expressed in the same units and the relative scale of the variables has
meaning, one may choose not to scale the variables.
The example we just did shows how variance affects the accuracy of the approximation. The variances of the
three variables are 280.5, 76, and 680; we see how x1,3 is estimated very well and x1,2 is estimated poorly due to its
low variance.
We will now scale the variables by dividing by their standard deviations; in other words, we'll standardize the
variables. The resulting loadings are
Table 17.2: Standardized variables and their scores on the three principal components
i Xil Xi2 Xi3 Zil Zi2 Zi3
1 -L4330 -2.4089 -0.7670 -2.8248 0.0827 -0.6766
2 -1.3733 -2.0647 0.0000 -2.3534 -0.5948 -0.5065
3 -1.1942 -1.3765 0.7670 -L5644 -1.1962 -0.1752
4 -1.1345 -0.9177 -1.1504 -1.6947 0.7374 0.1918
5 -1.0747 -0.6882 0.0000 -1.2049 -0.3278 0.2638
6 -0.8956 -0.1147 1.1504 -0.3974 -1.3168 0.4970
7 -0.7165 0.0000 -1.5339 -0.8770 1.3320 0.5682
8 -0.5374 0.2294 0.0000 -0.2086 -0.0787 0.5401
9 -0.2985 0.1147 1.5339 0.2642 -1.5280 0.2251
10 -0.1194 0.4588 -1.9174 -0.2524 1.8960 0.4926
11 0.0597 0.0000 0.0000 0.0407 0.0125 -0.0419
12 0.2985 1.4912 1.9174 1.7134 -1.5689 0.7696
13 0.3582 0.4588 -0.7670 0.3647 0.8845 0.1083
14 0.5971 0.5735 0.0000 0.8007 0.2105 -0.0105
15 0.7165 0.8030 0.7670 1.2340 -0.4716 0.0361
16 0.9553 1.0324 -1.1504 1.0682 1.4656 0.1142
17 1.2539 1.1471 -0.3835 1.5448 0.8041 -0.0464
18 L3733 0.6882 1.1504 1.6999 -0.7214 -0.5225
19 1.4927 0.2294 0.3835 1.2718 -0.0233 -0.8999
20 1.6718 0.3441 0.0000 1.3754 0.4020 -0.9274
1 2 3
We see that the loadings of the first principal vector are much higher for the first and second variables.
The standardized variables and their scores are shown in Table 17.2.
The biplot for standardized variables is shown in Figure 17.4. It is quite different from the unstandardized biplot.
X1 and X2 are still shown as highly correlated, but are about the same size now. They have much higher loadings
on the first principal component so they point rightward rather than upward, and as a result X3's loading on the
i rst principal component is lower.
f
We'll redo the Example 17A with standardized variables. The scores of the first observation in the first two
components, to four decimal places, are now -2.8248 and 0.0827.
Actual values are x1,1 = -1.4330, x1,2 = -2.4089, and x13 = -0.7670.
The vector of loadings is unique up to sign. In other words, one may obtain an equivalent solution by flipping •-dr
the sign on all of the loadings. The loadings indicate the direction of the principal component, and direction is not
affected by reversing the sign. Flipping the sign of the loading vector results in flipping the sign on all of the scores.
The flippings cancel in the approximation formulas.
—0.5 0 0.5
Lc?
—2 —1 0 1 2 3
j=1
Var(Xj) =
j=1 I1xI (17.2)
If the variables are standardized (scaled so that their variances are 1), the total variance is p.
The variance of the mth principal component is
2
" n
( .IPJmXiJ
P
Var(Z,n) = — = lv (17.3)
'
n n
1=1 j=1
0.9
0.8
XL' 0.7
V, 0.3
z
0.2
c13
0.1
0
1 2 3 4 5
Number of principal components
points at which the plot bends so much that very little is explained by further principal components. Such a plot is
called a scree plot. Figure 17.5 is an example of a scree plot for a hypothetical principal components analysis. In this 'sr
plot, the elbow is at the point representing the variance of the second principal component, so we would use two
principal components. Obviously this method is ad hoc.
Here are the percentages of variance explained in our 3-variable example, both the original and the scaled
versions:
In the unsealed version the first two components explain 97.5% of the variance, which is probably good enough.
In the scaled version 92.4% of the variance is explained by the first two components, which may still be good enough.
Znr =
çbjmXj
j=1
subject to
1.
E Zinic
i=1
(17.1)
Proportion of variance explained (PVE) The following two are variance formulas:
p n
(17.2)
)2
1 2 1
Var(Z„,) = —
n
zim = — ripjrn 1.1 (17.3)
n
i=1 i=1 j=1
l
f Exercises
17.1. "4/ [SRM Sample Question #5] Consider the following statements:
I. Principal Components Analysis (PCA) provides low-dimensional linear surfaces that are closest to the obser-
vations.
II. The first principal component is the line in p-dimensional space that is closest to the observations.
PCA finds a low dimension representation of a dataset that contains as much variation as possible.
IV. PCA serves as a tool for data visualization.
17.2. el [MAS-II-S19:41] You are reviewing a dataset with 100 observations in four variables: X1, X2, X3, and X4.
You analyze this data using two principal comopnents:
2=
1. Eii2i° (zt=i i2xii
El=i ipoi2 = 0
)=1 cp2
j1 + Ei)=1 j2 —
iii -1
(A) I only (B) II only (C) III only (D) 1,11, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
The loadings of the first principal component are —0.8297 and 0.5583.
Calculate the first principal component's score for the first observation.
17.4. For a principal components analysis on two variables, the means of the two variables are 0. You are given
the following observation and its score on the first principal component:
Xi X2 Score
4 10 8
Calculate the first principal component's loading on the first variable, assuming that it is positive.
17.5. [MAS-I-S18:26 fixed] You are given the following daily temperature readings, which are modeled as a
function of two independent variables:
Independent variables
Observation Temperature Xi X2
1 5 4 6
2 10 8 2
•
17.6. •-•
[MAS-I-F18:381 You are given a series of plots of a single data set containing two variables:
Plot 1 Plot II
•
—
PC I
•
-- PC 2
_...----- • •
• ••
4
• •
•
—
PC 1
—
PC 1
•
--- PC 2 --- PC 2
•
• • • • •
•
S.
• .
•
Plot V
mi
•
—
PC 1
---
PC 2
-
•
•
ci 4. a • •
el -
el.
Determine which of the above plots accurately represent the first and second principal components (PC1 and
PC2, respectively) of this dataset.
(A) I (B) IT (C) III (D) IV (E) V
17.7. sir A department store performs principal components analysis to visualize the purchases of customers in
four categories.
You are given the following biplot, with the scores for two customers.
—0.5 0 0.5
Food
3
Bradley
2 0.5
Clothes
1
Linens
0 0
—1
Abigael
—2 Applicances —0.5
—3
—3 —2 —1 0 1 2 3
n 17.8. s: (MAS-1I-S19:40) You perform two separate principal component analyses on the same four variables in
a particular data set: Xl, X2, X3, and X4. The first analysis centers but does not scale the variables, and the second
analysis centers and scales the variables. The biplots of the first two principal components produced from these
analyses are shown below. The location of Observation 24 is labeled on the plots as well.
CNI
0
0—
PC1 PC1
l
f Given the following statements:
I. X1 is more highly correlated with X2 than with X3.
IL X3 has the highest variance of these four variables.
III. Observation 24 has a relatively large, positive value for X4.
Determine which of the preceding statements are demonstrated in the biplots shown above.
(A) None of I, II, or III are demonstrated in the biplots
(B) I and II only
(C) I and III only
(D) II and III only
(E) The answer is not given by (A), (B), (C), or (D)
17.9. 111 You are performing a principal components analysis of three variables, Xi, X2, and X3. Their loadings
on two principal components Zi and Z2 are
X1 X2 X3
Z1 —0.85956 —0.18960 0.47457
Z2 0.25890 0.63908 0.72426
The scores of the third observation on the two components are, in order, —22.5603 and —45.9338.
Calculate an approximation of the third observation of X/.
17.10. si? You are performing a principal components analysis of three variables, X1, X2, and X3. Their loadings
on two principal components Z1 and Z2 are
.X1 x2 x3
Z1 —0.97932 —0.18906 0.70210
Z2 0.03669 0.18452 0.98214
The scores of the fourth and fifth observations on the two components are
I zii zi2 zia
4 —37.1131 15.4973 14.6442
5 —46.5435 —14.2271 15.3719
17.11. 6.: [SRM Sample Question #371Analysts W, X, Y, and Z are each performing principal components analysis
on the same data set with three variables. They use different programs with their default settings and discover that
they have different factor loadings for the first principal component. Their loadings are:
Determine which of the following is/are plausible explanations for the different loadings.
I. Loadings are unique up to a sign flip and hence X's and Y's programs could make different arbitrary sign
choices.
Z's program defaults to not scaling the variables while Y's program defaults to scaling them.
Loadings are unique up to a sign flip and hence W's and X's programs could make different arbitrary sign
choices.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
17.12. •111 [MAS-II-F18:41] You are provided with the following normalized and scaled data set:
i Xi X2 X3
1 —0.577 1 —1
2 —0.577 1 1
3 —0.577 —1 1
4 1.732 —1 —1
The first principal component loading vector of the data set is (0.707, —0.500, —0.500).
Calculate the proportion of variance explained by the first principal component.
(A) Less than 53%
(B) At least 53%, but less than 58%
(C) At least 58%, but less than 63%
(D) At least 63%, but less than 68%
(E) At least 68%
1
17.14. [MAS-II-F19:41] You are given:
(i) A data set contains 500 observations for five predictor variables {Xi, X2, X3/ X4/ X5}.
(ii) Each predictor has been standardized to have mean 0 and standard deviation 1.
(iii) Each of the 500 observations takes the form (x11, xj2/ 743, X14/ X15) for i ranging from 1 to 500.
(iv) For each observation, a new predictor Z is calculated by projecting onto the first principal component.
(v) The projection for the ith observation is denoted by zi.
The total variance present in the data set is equal to 4 and Er? = 750.
Calculate the proportion of variance explained for the first principal component.
(A) Less than 0.20
(B) At least 0.20, but less than 0.30
(C) At least 0.30, but less than 0.40
(D) At least 0.40, but less than 0.50
(E) At least 0.50
17.15. i'l [MAS-H-F19:391 Dataset Z contains 4 variables and 100 records and has the following correlation matrix. j ,.....
(
0.93 1.00 0.10 —0.93
0.08 0.10 1.00 —0.08
—1.00 —0.93 —0.08 1.00
Determine which of the following plots of cumulative proportional variance is produced by principal components
analysis on dataset Z.
(A) (3)
•
1 2 3 4 1 2 3 4
1 2 3 4 1 2 3 4
/
0
1 2 3 4
Principal Component
(----‘ 17.16. '411 You are given the following four observations for two variables:
x1 x2
—2 I
-1 —2
-1 0
4 1
The first principal component has loadings of 0.973402 on X1 and 0.229106 on X2.
Calculate the proportion of variance explained by the first principal component.
17.17. I': [SRM Sample Question #301 Sarah is applying principal component analysis to a large data set with
four variables. Loadings for the first four principal components are estimated.
Determine which of the following statements is/are true with respect to the loadings.
I. The loadings are unique.
II. For a given principal component, the sum of the squares of the loadings across the four variables is one.
III. Together, the four principal components explain 100% of the variance.
(A) None (B) I and II only (C) I and III only (0)11 and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
17.18. '41' [SRM Sample Question #35] Using the following scree plot, determine the minimum number of principal
components that are needed to explain at least 80% of the variance of the original dataset. ,.._....)
1-1
Cq
C3
cl
o
•
•
1 2 3 4 5
Principal Component
Solutions
I. The two expressions (after division by 100) represent the variance of the principal components 1 and 2
respectively, and there is no reason that they should be equal. On the contrary, the first principal component
is selected to have the highest variance. X
II. This formula says that the principal components are orthogonal, and indeed they are. i
III. The sum of the squares of the loading for each principal component is 1. Summing up two is results in 2,
not 1. X
(B)
17.3. The mean of the first variable is 4.4. The mean of the second variable is 4. So the score is
17.4. Let the loadings be 491 and 02. Then og = 1 — 44. From the score, we have
44), + 1002 = 8
So
44)1 + 101/1 — =8
17.5. The squares of the loadings must add up to 1, so 021 = — (1/V)2 = —1hri, where the negative square
root was used since the loading for X2 is negative. The observations are centered around 0. The mean of X1 is 6
and the mean of X2 is 4, so we subtract the mean 6 from the first observation of X1, 4, and the mean 4 from the first
observation of X2, 6.
Then the score of observation 1 is (1/ )(4 —6) + (-1/V)(6 —4) = —2.828
17.6. Principal components are the lines closest to the data. Plot I seems like the plot that accomplishes this.
Plot II reverses the lines, but the solid line in Plot I looks closer to most of the data. Principal components must be
perpendicular to each other, eliminating Plot III. In Plot IV the first line is not so close to the data. In Plot V the
principal components are not even lines. (A)
17.7. Clothes and linens have similar loadings, so I is correct.
Abigael has a negative score from the first component. But we cannot tell whether that is because she purchases
l
f more food than the average customer or because she purchases less clothes, linen, ana appliances than the average
customer. We cannot conclude
Bradley has a positive score on both components. If Bradley purchases more food than the average customer,
this would probably lead to a negative first component, unless offset by greater than average purchases of the other
items. We cannot conclude III. (E)
17.8.
I. Since X1 and X2 go in virtually the same direction, the loadings for PC1 and PC2 on the variables must be
similar, which implies high correlation. /
IL When a variable is scaled, it is divided by its standard deviation to make the variance 1. Since the first principal
component has maximal variance, it will put lower loading on variables with lower variance. The higher the
variance of the original variable, the greater the reduction in loading.
Comparing the unscaled and scaled biplots, we see that X3's loading on the first principal component was
significantly decreased whereas the loadings of the other variables on the first principal component were
increased. We conclude that X3 has the highest variance. /
III. X4's loadings and the scores of observation 24 are in the same direction, but this does not necessarily imply
observation 24 has a high coefficient for X4. It may have high negative coefficients for X1, X2, and X3 and have
a high score in PC1 as a result. X
(13)
17.9. xi3 —0.85956(-22.5603) + 0.25890(-45.9338) = 7.4997
17.10. (-0.97932)(-37.1131) + (0.03669)(15.4973) + 013(14.6442) = 34
14.6442013 = 34 — 36.9142 = —2.9142
013 = —0.19900
= (-0.97932)(-46.5435) + (0.03669)(-14.2271) + (-0.19900)(15.3719) =
17.11. Loadings are unique up to sign, but all signs must be flipped. Thus I is correct but III is not, since there is
only one sign flip between W and X. II is another possible explanation of differences in loadings. (B)
17.12. Since the data are scaled, the variance of each Xi is 1, and the sum of the variances of the three Xis is 3.
The principal component is computed from the loading vector as zi = E,Pk=i OkXki, and we get zi = (0.707)(-0.577) —
0.5 + 0.5 = —0.40794, and similarly z2 = —1.40794, z3 = —0.40794, and z4 = 2.224524. The sum of the squares
of these 4 numbers is 7.263628, so the variance of the first principal component is 7.263628/4 = 1.815907. Then
1.815907/3 = 60.53% of the variance is explained. (C)
17.13.
I. The principal component explaining the greatest amount of variance is added first, and each additional one
explains less variance. X
II. Each added principal component explains additional variance. t/
III. Data is best understood with fewer principal components, making the model simpler. X
IV. A scree plot shows how much additional variance is explained by each principal component; one can limit the
number of principal components used to the ones that explain a significant amount of variance.
(E)
17.14. The total variance is 4. The variance of the first principal component, by formula (17.3), is
1 2 750
'
n 500
The proportion of variance explained by the first principal component is 1.5/4 = 0.375 . (C)
Incidentallyl, it is hard to see how the total variance present in standardized data could be 4. The variance
of a standardized variable is 1, and there are 5 predictors, so either the variance of the data set is 5, or if the
standardization used division by 499 instead of 500, it is 5(499/500).
17.15. From the matrix, variables 1 and 4 are perfectly correlated with correlation —1.00, so the fourth principal
component cannot explain any variance, while the other components explain some variance. Thus (C) is the only
possible plot.
17.16. The two variables are already centered at 0 (their means are 0). The sum of their squares is
(-2)2 + (-1)2 + (-1)2 + 42 + 12 -I- (-2)2 -I- 02 + 12 = 28
The scores are
I. The loadings are unique up to sign, but the signs may be reversed. X
II. The loadings are defined under the constraint that the sum of their squares must equal one. *7
The four principal components have the same dimension as the original variables so they must explain all of
their variance.
(D)
(---- 17.18. Looking at the scree plot, the first principal component explains more than 60% of the variance, but less
than 80%. The second principal component explains a little more than 20%. So the first two principal components
explain more than 80% of the variance, while the first principal component alone does not explain 80% of the
variance. (B)
Quiz Solutions
17-1. The distance from (-24, —21) to (0,0) is V242 + 212. The score, whose absolute value is the distance from the
projection of (-24, —21) onto the principal component line to (0,0) is —30.45. By Pythagoras' Theorem, the length of
the perpendicular line, one of the legs of the right triangle, is the square root of the difference between the square
of the hypotenuse and the square of the other leg:
Cluster Analysis
Reading: An Introduction to Statistical Learning 10.3 (first edition) or 12.4 (second edition)
Cluster analysis is an unsupervised learning method. It groups the observations into a small number of homo-
geneous clusters, groups of observations that are similar to each other. Contrast this with principal components
analysis:
• Principal components analysis looks for a low-dimensional representation that explains most of the variance. •41/
• Cluster analysis tries to group the observations into a small number of groups of similar observations.
Marketing Customers are grouped based on what they buy. Advertising can be targeted to the groups that are
most interested in the advertised products.
Medical Patients with a certain disease may be grouped based on clinical measurements, to determine the best
therapy.
minimize
E w(co
k=1
(18.1)
As usual, let Xi be vectors of the variables or features, withj= 1, 2, ... , p. There are n observations, so Xi =
zii, x211 ,x,,1}. So a single observation is X12,. xip 1. The most common W(Ck) used involves squared
Euclidean distance:
1
W(Ck) = z z (xii — xi,j)2 (18.2) •-•°
iCk i,i,Eci j.1
where ICk I is the number of observations in Ck. Notice that the sum is over all pairs {i, PI in both orders. (This
doesn't matter for optimization, since doubling W(Ck) does not affect where the minimum occurs. But it makes
proving the minimizing algorithm easier.)
EXAMPLE 18A kir There are two variables. We are given the four-point cluster (1,7), (3,5), (6,2), and (2,10). Calculate
W(Ck) for this cluster. •
SOLUTIO14: As an example, the distance between (1,7) and (3,5) is (1 — 3)2 + (7 — 5)2 = 8. The distances between all
pairs of points is given in the following table:
(1,7) (3,5) (6,2) (2,10)
(1,7) 0 8 50 10
(3,5) 8 0 18 26
(6,2) 50 18 0 80
(2,10) 10 26 80 0
The sum of all the numbers in this table is 384. Then W(Ck) = 384/4 = 96
Putting (18.1) and (18.2) together, our objective is
K
1
minimize EI—
Ck IX ( — Xpi)2 (18.3)
k=1 i,PEck j=1
It would be computationally infeasible to exhaustively go through every partition of the points into K clusters and
determine which one has minimal distance. But there is a simple algorithm to find local minima. This algorithm
is based on centroids of clusters. The centroid of a cluster is the point whose coordinates are the means of the
coordinates of the cluster. For example, for the 4-point cluster in Example 18A, the centroid is
(1 +3+6+2 7 + 5 + 2 + 10)
4 4
= (3, 6)
The algorithm is
1. Split the observations arbitrarily into K clusters.
2. For each cluster, calculate the centroid.
3. Create new clusters by associating each point with the nearest centroid.
4. Repeat steps 2-3 until cluster assignments do not change.
This algorithm finds a local minimum, since the distance cannot increase at each iteration. The following identity
demonstrates this:
1
l
f
Proof of equation (18.4)
Start with the expression in the inner sum, adding and subtracting
Keep in mind that the sum on the left side of equation (18.4) is over all (i, i'), including i = i'. So the sum has IC1,12
terms. Consider the first two summands of (1. Each one of them will be summed ICk times, once for each i (for
the first summand) or for each i' (for the second summand). So (xii — ici)2 will be summed 2ICkl times. The sum
on the left side is divided by ICk I, so we end up with twice the sum of those squares, which is what the right side
of equation (18.4) is. So we just have to show that the sum of the third summand, the cross products, is 0. That
sum is double the following:
E(x„ -
ieCk
z
PcCk
Now, by the definition of the mean gkj shown in equation (18.5), Eieck xii = ICkliki, so the sum of the differences
xfi — which is the difference of the sums, is 0. The proof is complete.
SOLUTION: 1. The centroids are 2 and 11. Points less than 6.5 go to the first duster and points above 6.5 go to the
second cluster. No assignments changed, so we are done.
2. The centroids are 5 and 14. Anything below 9.5 goes to the first cluster, so the new clusters are {0, 1, 5, 7} and
{12,141. The new centroids are 31 and 13. Anything below 8i goes to the first cluster. Assignments don't
change, so we are done.
This solution is better than the one for the first part.
As the example shows, one must perform the algorithm multiple times with different starting assignments to have
a good chance of finding the best cluster assignments. Another issue is choosing K.
difference when we are dealing with distances between clusters having more than one point for one of the distance
measures we will discuss.'
We start with one cluster for each point. At each iteration of the algorithm, we compare every pair of clusters. If
there are k clusters, we make k(k — 1)/2 comparisons. We select the pair of clusters with the smallest dissimilarity
and fuse them. We keep repeating this algorithm until we're left with one cluster.
We know how the dissimilarity between two points is defined. But how is the dissimilarity between two clusters
'49 defined? Dissimilarity between clusters is called linkage. Four types of linkage are popular: complete, single,
•,? average, centroid.
Complete linkage Calculate the dissimilarity between every point of cluster A and every point of cluster B. If there
are a points in A and b points in B, do ab calculations. The dissimilarity is the maximum of these numbers.
Single linkage Similar to complete linkage, except the dissimilarity is the minimum of the ab calculations. This
linkage leads to trailing clusters, clusters in which one point at a time is fused to a single cluster.
Average linkage As in complete linkage, calculate ab dissimilarities. Then use the average.
Centroid linkage Calculate the centroid of each cluster and use the dissimilarity between the centroids. This
method has the disadvantage that it may result in inversions. An inversion occurs when a later fusion occurs
at a height lower than an earlier fusion— the dissimilarity of a later fusion is less than the dissimilarity of an
earlier fusion involving the same points.
The textbook has a 9-point example in 2 dimensions using complete linkage. Before we do a 2-dimensional
example, let's do an even simpler example with 6 points in 1 dimension.
EXAMPLE 18C 41 You are given the points 0, 1,5, 7, 10, 13.5.
Carry out hierarchical clustering using each of the four linkages. •
SoLurioN: The first two links are the same regardless of linkage. The points closest together are 0 and 1, so we link
them into 1 cluster.
Next (5,7) are closest and are linked, so we have (0,1), (5,7) , (10), (13.5).
Complete linkage With complete linkage, the distance from (0,1) to (5,7) is the largest difference, or 7, and the
distance from (5,7) to (10) is 5. So we link (10,13.5), since they are only 3.5 apart.
The distance from {0,11 to (5,7) is 7 and the distance from (5,7) to {10,13.51 is 8.5, so we link the first two and have
(0,1,5,7) and (10,13.5). Then we link these two into one cluster.
As we see, complete linkage prefers fusing smaller clusters to fusing larger clusters. More points in a cluster
means more numbers to maximize over. Here, it first fused all the points into three 2-point clusters before it
considered fusing those clusters.
Single linkage With single linkage, starting with (0,1), {5,7} , (10), (13.5), distance is the smallest difference, so the
distance from (OM to {5,71 is 4 and the distance from (5,7) to 10 is 3, which is the smallest, so we link them. We have
(0,1), (5,7,10), and {13.5}. The distance between the first two is 4 and the distance between the last two is 3.5, so we
link them and get {0,11 and (5,7,10,13.5). Then we link these into one cluster.
Average linkage With average linkage, starting with (OM, (5,7), f101, (13.5), average distance between the first two
clusters is 5.5. Average distance between (5,7) and (10) is 4. Both of these are higher than 3.5, so we link (10,13.5),
and have (0,1), (5,7), and {10,13.51. Average distance between first two is still 5.5 and average distance between the
last two is 5.75, so we link the first two. We end up with the same hierarchy as with complete linkage.
35 -
010
30
09 08
07
25- 0
5
06
20- 0
15- 03
10 -
01 02
5 -
0 1 I I 1 I 1 I
0 5 10 15 20 25 30 35
Centroid linkage with centroid linkage, starting with {0,11, {5,7} , {10}, {13.5}, distance between the first two is
6 - 0.5 = 5.5. Distance between {5,7} and {10) is 10 - 6 = 4. Distance between (10) and {13.5) is 3.5. So we link
110,13.51, and have {0,1), (5,7), and (10,13.5). Distance between the first two is still 5.5. Distance between {5,7) and
(10,13.5) is 11.75 - 6 = 5.75, so we link the first two. We end up with the same hierarchy as with complete linkage
and average linkage.
As an example of an inversion using centroid linkage, consider the following three points: (0,0), (40,0), (20,38).
The first link is between (0,0) and (40,0), since the distance to (20,38) from either point is 11202 + 382 = 42.94 > 40.
The centroid of {(0,0),(40,0)) is (20,0). The distance from the centroid to (20,38) is 38, which is less than the distance of
the first link, which is 40. While there is nothing per se wrong with an inversion, it produces a strange dendrogram,
and cutting the dendrogram in the inverted area produces unusual clusters.
Let's now look at dendrograms. We will work with the following 10 observation points:
1. (10,10) 6. (15,24)
2. (18,10) 7. (17,27)
3. (18,16) 8. (13,30)
4. (20,20) 9. (8,30)
5. (8,25) 10. (12,34)
They are graphed in Figure 18.1.
The first link is the same regardless of linkage, as it must be. It links point 6 to point 7. The distance between
them is V(17 - 15)2 + (27 - 24)2 = 3.606.
It turns out that the second link is the same regardless of linkage. It links point 8 to point 10. The distance
between them is 1112 + 42 = 4.123.
The third link is also the same regardless of linkage. It links points 3 and 4, with distance V22 42 = 4.472.
After the third link, the links differ by linkage. Let's start with complete linkage. Complete linkage continues
to link single points: first 5 and 9, and then 1 and 2. Complete linkage tends to prefer linking smaller groups since
4
10
the distance is the maximum of the point-to-point distance; the more points, the higher the maximum tends to be.
Figure 18.2 shows the dendrogram and table of distances for complete linkage.
Next let's look at average linkage. The fourth link is the same as for complete linkage, but then it links 15,9) with \J
18,10). Average linkage is not as biased towards linking single points as complete linkage is. Figure 18.3 shows the
dendrogram and table of distances for average linkage.
Single linkage is the most unusual, so let's discuss centroid linkage next. Fortunately there are no inversions.
But the fourth link is already different from complete and average linkage. The link of (9) with (8,10) takes priority
over the link of (9) with 151 indicated by the other two methods. 151 does get linked with (8,9,10) at the next step,
making it like average linkage. But the later links are in a different order. See Figure 18.4 for the dendrogram and
table of distances.
Single linkage suffers from having some ties for distance. The dendrogram here breaks the ties by following
average linkage through link 5. At the three tied heights, the dendrogram has been distorted a bit to show the
sequence of links. See Figure 18.5 for the dendrogram and the table of distances. Notice how it behaves the
opposite of complete linkage; it prefers to fuse groups with greater numbers of observations, so that there are more
observations to minimize over, and saves the single point 111 to the end before fusing it to the clusters.
One thing to be careful about is that distance is determined vertically, not horizontally. For example, in centroid
linkage, point 5 is no closer to point 9 than it is to points 8 and 10. It is doser to those points than to points 6 and 7
by a whisker, because the link of that group to the group with 5 is higher than the link of 5 with (8,9,10).
Clusters are formed by cutting the tree at a specific height. For example, if you wanted 3 clusters, you'd cut the
trees below the first two links. The clusters you'd get would then be
Complete (1,2), 13,41,(5,6,7,8,9,10)
Average (1,2), (3,4,6,71, (5,8,9,10)
Centroid (1,2), 0,41,(5,6,7,8,9,10)
Single (1), (2,3,4), (5,6,7,8,9,10)
Look back at Figure 18.1. Which linkage leads to the most plausible clusters? I think in this case single linkage
is quite logical! But if we split the data into two clusters, I think single linkage is the least plausible in splitting 111
from the other nine observations. Splitting it as {1,2,3,4} and (5,6,7,8,9,10), as the complete and centroid linkages do,
seems more logical. You may have a different opinion.
10
1 {61-17) 3.606
2 {81-110} 4.123
3 {31—(4) 4.472
4 19)—(8,10} 4.924
5 151—{8,9,10} 7.008
6 {6,7}—{5,8,9,10) 7.150
7 111—(21 8.000
8 {1,2}-13,4) 9.434
9 {1,2,3,4}-0,6,7,8,9,10) 14.974
10
1 (61-171 3.606
2 {8}—{10) 4.123
3 {3)—(41 4.472
4 (51—{9} 5.000
5 (5,9)—(8,10) 5.000
6 {6,7}-15,8,9,10) 5.000
7 {2}—(3.4) 6.000
8 {2,3,4)-15,6,7,8,9,101 6.403
9 (1)—(2,3,4,5,6,7,8,9,10) 8.000
5 9
3 4
8 10
6 7
Practical clustering links large numbers of observations and has thick dendrograms. Refer to the textbook for
such dendrograms.
1
minimize vi ti aI (18.3)
ICkl LiteCk j=1
(18.4)
Exercises
18.1. 41 (MAS-I-518:351 You are given the following three statistical learning tools:
I. Cluster Analysis
II. Logistic Regression
III. Ridge Regression
Determine which of the above are examples of supervised learning.
(A) None are examples of supervised learning
(B) land II only
(C) 1 and III only
(D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
18.2. 41 [SRM Sample Question #32] Determine which of the following statements is/are true with respect to
clustering methods.
I. We can cluster the n observations on the basis of the p features in order to identify subgroups among the
observations.
II. We can cluster p features on the basis of the n observations in order to discover subgroups among the features.
III. Clustering is an unsupervised learning method and is often performed as part of an exploratory data analysis.
(A) None (13) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (0) .
18.3. sur [MAS-II-F19:40] You are given three statements about the k-means clustering algorithm.
I. The k-means clustering algorithm requires that observations be standardized to have mean zero and standard
deviation one.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
18.4. ": [SRM Sample Question #43] Determine which of the following statements is NOT true about clustering
methods.
l
f 18.5. A K-means clustering process with K =2 produced the following clusters:
A: (0,0), (1,2), (2,0), (3,2)
B: (0,3), (1,5), (2,4)
Clustering is based on squared Euclidian distance.
Calculate the value of the objective function, the function that is minimized by the clustering algorithm.
18.6. [MAS-II-S19:42] You have decided to perform K-means clustering with K 2 on the following dataset
and have already randomly assigned clusters as follows:
Observation 11 x2 Initial Cluster
1 5 5 2
2 4 6 2
3 3 0 1
4 5 3 1
5 5 1 2
6 3 6 1
7 2 5 2
Calculate the Euclidean distance of Observation 5 from the final centroid of Cluster 2.
18.7. Nr A professor with 10 students gives a final, and the grades are
99 98 95 93 91 89 87 85 79 77
The professor would like to divide these grades into 3 clusters. Grades in the highest cluster will be A; grades in
the middle cluster will be B; and grades in the lowest cluster will be C.
The professor uses K-means clustering with Euclidean distance squared. Initially, all grades above 90 are put in
the first cluster; grades between 80 and 89 are put in the second cluster; and grades below 80 are put in the third
cluster.
Determine the three clusters ultimately resulting from the clustering algorithm.
18.8. rYou are performing K-means clustering on a set of data. The data has been initialized randomly with 3
clusters as follows:
A single iteration of the algorithm is performed using Euclidean distance between points.
Determine the three clusters resulting after the iteration.
18.9. q [SRM Sample Question #15] You are performing a K-means clustering algorithm on a set of data. The
data has been initialized randomly with 3 clusters as follows:
Cluster Data Point
A (2,—i)
A (-1,2)
A (-2,1)
A (1,2)
B (4,0)
B (4,—i)
B (0,-2)
B (0,-5)
C (-1,0)
C (3,8)
C (-2,0)
C (0,0)
A single iteration of the algorithm is performed using the Euclidian distance between points and the cluster
containing the fewest number of data points is identified.
Calculate the number of data points in this cluster.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Use the following information for questions 18.10 and 18.11:
You are given the following four pairs of observations:
18.10.
•-.11 [SRM Sample Question #11 A hierarchical clustering algorithm is used with complete linkage and
Euclidean distance.
18.11. q A hierarchical clustering algorithm is used with average linkage and Euclidean distance.
Calculate the intercluster dissimilarity between {xi, x21 and {x4}.
18.12.
`-‘11.[MAS-II-F18:421 You are provided the following data set with a single variable X:
i X
1 9
2 15
3 4
4 2
5 18
A dendrogram is built from this data set using agglomerative hierarchical clustering with complete linkage and
Euclidean distance as the dissimilarity measure.
Calculate the tree height at which observation i = 1 fuses.
(A) Less than 6 (B) 6 (C) 7 (D) 8 (E) At least 9
18.13.
1 A hierarchical clustering algorithm is used with single linkage and Euclidean distance.
The observations are split into three clusters.
Determine the three clusters.
18.14. "All A hierarchical clustering algorithm is used with complete linkage and Euclidean distance.
The observations are split into three clusters.
Determine the three clusters.
A hierarchical clustering algorithm is used with average linkage and Euclidean distance.
Calculate the intercluster dissimilarity between {x1, x2} and {x3, x4}.
18.16.
[MAS-II-F19:42] An actuary is using hierarchical clustering to group the following observations:
13
40
60
71
The actuary recalculates the clustering using two linkage methods: complete and average.
(i) hump is the height of the final fuse using complete linkage.
(ii) hayg is the height of the final fuse using average linkage.
Calculate I//comp — havg •
(A) Less than 10
(B) At least 10, but less than 15
(C) At least 15, but less than 20
(D) At least 20, but less than 25
(E) At least 25
18.17.
r[SRM Sample Question #2] Determine which of the following statements is/are true when deciding on
the number of clusters.
I. The number of clusters must be pre-specified for both K-means and hierarchical clustering.
II. The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering
algorithm.
The K-means clustering algorighrn requires random assignments while the hierarchical clustering algorithm
does not.
(A) I only (13) II only (C) III only (D) I, II and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
18.18. [SRM Sample Question #341 Determine which of the following statements is/are true about clustering
methods:
I. If K is held constant, K-means clustering will always produce the same cluster assignments.
Given a linkage and a dissimilarity measure, hierarchical clustering will always produce the same cluster
assignments for a specific number of clusters.
IlL Given identical data sets, cutting a dendrogram to obtain five clusters produces the same cluster assignments
as K-means clustering with K = 5.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
18.19. NI° [SRM Sample Question #16] Determine which of the following statements is applicable to K-means
clustering and is not applicable to hierarchical clustering.
(A) If two different people are given the same data and perform one iteration of the algorithm, their results at
that point will be the same.
(B) At each iteration of the algorithm, the number of clusters will be greater than the number of clusters in the
previous iteration of the algorithm.
(C) The algorithm needs to be run only once, regardless of how many clusters are ultimately decided to use.
(D) The algorithm must be initialized with an assignment of the data points to a cluster.
(E) None of (A), (B), (C), or (D) meet the meet the stated criterion.
18.20. [SRM Sample Question #40] Determine which of the following statements about clustering is/are true.
I. Cutting a dendrogram at a lower height will not decrease the number of clusters.
II. K-means clustering requires plotting the data before determining the number of clusters.
III. For a given number of clusters, hierarchical clustering can sometimes yield less accurate results than K-means
clustering.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
18.21. [SRM Sample Question #361 Determine which of the following statements about hierarchical clustering
is/are true.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Solutions
18.1. Cluster analysis has no response variable, so it is unsupervised. The two regression techniques have a
response variable, so they are supervised. (D)
18.2. I and II are true; clustering can be done on observations or on features. And it is an unsupervised method.
So all three statements are true. (E)
n. 18.3. Statement I sounds like it came from principal component analysis, or perhaps ridge regression or the lasso,
although none of them absolutely require standardizing. And statement III sounds like it came from principal
component analysis. Only H is true. (B)
18.4. Clustering does not reduce the dimensionality of a data set. It breaks a data set into clusters. (D)
18.5. We can measure distances between pairs of points or use the centroid formula. We'll use the centroid formula.
The centroid of A is
The centroid of B is (1,4). The objective function is the sum of squared differences from the centroid, doubled, or
2((1.52 + 12) +(0.52 + 12) +(0.52 + 12) + (1.52 + 12) + (12 + 12) + (02 1- 12) + (12 + 02)) = 26
Thus observations 1, 2, 6, and 7 go to cluster 2 and observations 3,4, and 5 go to cluster 1. The new centroids
are (4.333,1.333) for cluster 1 and (3.5,5.5) for cluster 2. The new squared Euclidean distances are
Observation Distance from Centroid 1 Distance from Centroid 2
1 13.8889 2.5
2 21.8889 0.5
3 3.5556 30.5
4 3.2992 8.5
5 0.5556 22.5
6 23.5556 0.5
7 18.8889 2.5
No cluster assignments change, so this is the final clustering. The squared distance of Observation 5 from
cluster 1 is 22.5; -k,/2 = 4.7434 . (E)
18.7. The means of the clusters are 95.2, 87, and 78. Then 91 is moved to the second cluster since it is closer to 87
than to 95.2. The resulting clusters have means 96.25, 88, and 78, and no improvement is possible, so they are the
final clusters.
18.9. The centroid of A has x coordinate (2 + (-1) + (-2) + 1)/4 = 0 and y coordinate (-1 + 2 + 1 + 2)/4 = 1. The
centroid of B is (2, —2). The centroid of C is (0,2). Looking at these centroids on a graph:
•
*C
A
*B
it's clear that the only points near (0,2) are the three upper ones on the graph; the other points are closer to the
centroids of A or B. But if you want to be sure, you have to calculate all the distances between points and their closest
centroids. (D)
18.10. For complete linkage, we calculate the maximum distance from the two pairs of observations in the group.
1x4— =J(5_(_1))2+(10_0)2= VT )
I X4 — X21 = 1/(5 1)2 + (10 — 1)2 = A/T7
The larger distance is Ji = 11.6619 . (E)
18.11. The distance is now based on the average of the two distances, 0.5(.0737; + -\17") = 10.7554
18.12. Complete linkage means we consider the maximum difference between clusters. The difference between 2
and 4 is the minimum of the differences of two numbers, so i = 3 and 4 fuse first. Then 15 and 18 are closest, so i = 2
and 5 fuse. Now the two clusters of two elements are 16 apart (18 — 2), so i = 1 fuses next, and it is closer to 12,41
(distance 7) than to 115,181 (distance 9), so it fuses at height E. (0
18.13.
44 and 45 are fused. With single linkage, the distance from 41 is 3, so 41 is fused into (44,45). The distance
of (41,44,45) from 49 is 4, s049 is fused into this group. Now 36 is fused into this group, and (64,69) is fused, since
they have distance 5. So far we have
29 {36,41,44,45,49) 55 {64,69}
Now 55 is fused into 136,41,44,45,491. The three clusters are (29), {36,41,44,45,49,551, and (64,69).
18.14.
44 and 45 are fused. Then 41 is fused with them since it is 4 away. The distance of (41,44,45) to 49 is 8, so
{64,69) are fused next. Then {49,55) are fused. Then (29,36) are fused. So far we have
Distances between groups are 16 from first to second, 14 from second to third, and 20 from third to fourth, so second
and third are fused. The three clusters are (29,36), (41,44,45,49,55), and (64,69).
18.15. We must compute the four distances between the two points in each cluster.
I. K-means clustering starts out with a random assignment to K clusters, and the final result may vary depending
on the initial assignment. X
II. True; there is nothing random in hierarchical clustering./
III. Since I is false, this is certainly false; moreover, different linkages may lead to different results from hierarchical
clustering. X
(B)
18.19.
(A) With K-means clustering, one must choose a random initial distribution of observations into clusters, so the
statement is not applicable to K means clustering.
(B) With K-means clustering, the number of clusters is fixed in advance, so the statement is not applicable to K
means clustering.
(C) With K-means clustering, the algorithm produces a local minimum and must be run several times to obtain a
global minimum.
(ID) This statement is true for K-means clustering but not for hierarchical clustering.
(D)
18.20.
I. The dendrogram shows fusions at their height, and lower heights are fused before higher heights (at least if
centroid linkage is not used), so this statement is true.
II. There is no requirement to plot data before deciding on K. X
III. This is true, since K-means clustering starts off with a random assignment to clusters which may work out
well, whereas hierarchical clustering is force to fuse based on linkage and may miss better alternatives for a
specific number of clusters. s/
(C)
18.21.
I. Every point is assigned to its own cluster right from the start; then the clusters are fused together. No point
drops out. X
The dendrograrn may be cut at arty height to obtain different numbers of clusters. /
III. This is true. i/
(D)
Time Series
Lesson 19
19.1 Introduction
A time series is a series of observations y, y2, , YT over consecutive time periods, such as months or years.
Examples of time series are:
• Daily stock prices
• Volume of stock trades, by day
• Monthly sales
• Population of a country, by year
Time series analysis attempts to find patterns in a time series that can help predict future values. The patterns may •41
relate the terms yi to previous terms in the series or to the time variable t. We'll now briefly discuss models that
relate yt to t.
Longitudinal data is data from a process that varies with time. Cross-sectional data is the opposite: data that is not •1?
organized by time. A regression model in which a dependent variable is a function of explanatory variables other
than time is a causal model, in the sense that it states that the dependent variable y is caused by explanatory variables ikit
xi. However, statistical models only find correlation, not causation. When several variables grow with time, a causal
model may find a spurious relationship between the variables. On the other hand, a regression of a time series
variable against time (this is not a causal model) may find a legitimate deterministic relationship between the time
series and time. Causal models have the additional drawback that to forecast the dependent variable, you must have
a forecast of the independent variables.
A time series can be decomposed into three parts: trend, seasonal factors, and random patterns. Trend is the
long-term pattern of the data, while seasonal factors are the cyclical pattern. Let 7; be trend and St the seasonal
pattern. Then an additive model would be "41
+ Sj + Et
while a multiplicative model would be
yi = T, x S1 ±
To help analyze a time series, one may draw a times series plot. A time series plot is a Scatter Plots of a time
series against time, with the consecutive points connected with lines.
A regression model for a time series may simply have a linear trend in time:
yi = Igo+ it 4- el
One may also use a polynomial. Seasonal patterns may be modeled by dummy variables for the seasons or by
trigonometric functions. Sometimes seasonal adjustment is done; a new time series is created with the seasonal '46
pattern removed. It is also possible to model a regime change, a change in the behavior of the time series starting at
a certain point of time, by using a dummy variable set equal to 0 before that time and 1 after that time.
Regression models are naive in the sense that they ignore information other than the time series being modeled.
Another shortcoming of regression techniques is that they give the highest weight to the earliest and latest observa-
tions, the ones with the t variable furthest away from the mean. Generally, when forecasting, one wants to give the
highest weight to the latest forecasts and the lowest weight to the earliest forecasts.
• If the mean of a time series does not vary by t, the series is said to be stationary in the mean. For such a time series,
we can estimate the mean as the sample mean of the observed values.
Let p(t) be the mean of y,, the tth term of a time series. The variance of a time series at time t is
0-2(t) = ERY, — P(0)2] (19.1)
"-.16 If the variance does riot vary with t, the series is said to be stationary in the variance. If the series is stationary in the
*1:1 variance, we can estimate the variance using the usual forrnula for the sample variance:
2 Z;1=1(Yt fi)2
s =
(19.2)
n —1
Time series terms tend to be correlated with each other (otherwise the time series wouldn't be interesting). The
sample variance will tend to underestimate the true variance for this reason. However, this bias reduces rapidly as
the size of the series increases.
The correlation of terms in a time series is of great interest. The correlation of a time series with itself is called
1141 autocorrelation or serial correlation. We will look at the correlation of terms in a time series at times t to terms at times
t k, the correlation of yi with yt,k. The distance between terms, k, is called the lag. If a series is stationary in mean
•
and variance and autocorrelation is a function only of the lag, not of the time, we say that the time series is (weakly)
stationary. The higher moments of such a series may still vary with t. If none of the moments vary with time, then
the series is strongly stationary. We will assume, in our discussion of autocorrelation, that the time series is weakly
stationary.
We calculate the sample autocorrelation at lag k with this formula:
(19.3)
Notice that the numerator and denominator do not have the same number of terms. The textbook refers to rk as the
autocorrelation, leaving out the word "sample". When the textbook deals with the underlying autocorrelation of
the time series, it calls it "the correlation between terms k apart" and uses the symbol pk. In this manual, we will
refer to the underlying autocorrelation as the true autocorrelation.
The (true or sample) autocorrelation at lag 0 is 1.
EXAMPLE 19A "Ili You are given the following time series:
5, 9, 3, 6, 7
(_1)2 4. 32 + ± 02 + 12
You can use your statistical calculator for this calculation. Use the E x2 statistic for the denominator and the E XY
statistic for the numerator, where X for the numerator calculation is the first 4 terms and Y is the last four terms. 0
Quiz 19-1 ki? In Example 19A, calculate the sample autocorrelation at lag 3.
where tT...1,1_02 is the f-distribution critical value with T — 1 degrees of freedom and confidence level a, and S2y is
the unbiased sample variance of the T observations. The width of the forecast interval is independent of 1. Often
we use the approximate 95% forecast interval 9 ± 2sy.
Quiz 19-2 N? You are given the white noise series{yi,...ys} = {1, —1, 2, 0,3}.
Calculate the upper bound of a 90% forecast interval for y7.
We try to reduce any time series to white noise by finding patterns, leaving the unexplained part as white noise.
The procedure for reducing a time series to white noise is called a filter. The uncertainty that cannot be explained
by patterns is called irreducible.
A white noise series may be identified if a series has a more or less constant mean and variance and doesn't move
around much.
Let pc and ac2 be the mean and variance of the white noise series C. Then E[yll = yo + tpc and Var(y1) = tcf. Even
if pc = 0, a random walk is nonstationary because the variance increases with time. (Many other authors consider
= 0 to be part of the definition of random walk. If pc 0, they call the series a "random walk with drift", and pc
is called the "drift")
The forecast of a random walk is
YT+I = YT + re (19.5)
where ë is the sample mean of cf. The standard error of an /-period lookahead forecast is se Nil, where se is the
estimate of the standard deviation of c. Therefore, the forecast has approximate 95% confidence interval sir
2. Differencing the time series should result in white noise. The pattern of values should be constant with
constant variance over time.
3. Standard deviation of differences should be significantly lower than standard deviation of the original series.
•41 You should distinguish between a random walk and a linear trend in time. A linear trend in time is
= yo + kt et
whereas a random walk is
yr = yo+pct Ej
j=1
• Whereas they both have the same mean if k = pc, the variance is different. A linear trend in time has stationary
variance, whereas a random walk's variance increases with time. The linear trend in time's error term is white noise,
making it stationary, whereas the random walk with drift's error term u = Eit=i ei is a random walk, making it
nonstationary. If the drift of the random walk pc = 0, the comparison between it and a linear trend in time is not so
clear since the random walk's mean does not increase (or decrease) in time.
Differencing a random walk is an example of filtering. Another filtering method is to take the logarithm of a
time series, which may stabilize variance. If that doesn't work, one may take differences of logarithms. Difference
of logarithms correspond to approximate proportional changes;
y1-1 Yi YI-1
- in Yi-i = in
yt—i = In (1 +
19.5 Control charts
'41' A control chart for a time series is a chart upon which control limits are superimposed. The limits are typically
• UCL = 7+3sy and LCL = 17— 3sy, where UCL is the upper control limit and LCL is the lower control limit. Examples
of control charts are
Xbar charts calculate the averages of series of k observations. For example, if k = 5, compute the average of
observations 1-5, observations 6-10, observations 11-15, etc. The variance of an average is lower than the
variance of the series, so unusual patterns should stick out.
R charts calculate ranges of series of k observations. The range is the maximum observation minus the minimum
observation. The range is a simple measure of variability, and the chart helps evaluate patterns.
To evaluate a forecast, we use out-of-sample validation techniques, similar to the ones we learned in Lesson 5. We
may split the data, which goes through time T, at time T1 <T. The data up to time Ti is used as a training data set
and the subsequent data is a test data set. We fit the model using the training data set. The data from time Ti + 1
to time T is forecasted using the model, and the residuals (excess of actual over forecast) et = yt - 91 are computed.
The following statistics may be used to evaluate the differences; for all of them, the smaller the better.
1. Mean error statistic
1
ME =
T Eei
1=3'1+1
n—1
(19.2)
9 ± tT-1,1-42 4 1+ (19.4)
yr + 2sc1/7 (19.6)
Quality measures for validation of models ME, MPE, MSE, MAE, MAPE. See Section 19.6 for formulas.
=71+1
ME and MPE can reveal trend patterns, but they will not reveal problems when the residuals are positive and
negative but have a low average. The other measures will reveal such problems. MPE cannot be used if the series
has Os and may not be logical if the series has negative terms. The same applies to MAPE if differences are 0 or
negative.
Exercises
yr
1 8
2 5
3 12
4 18
5 7
6 10
19.2. r[S-F16:42) You are given the following ordered sample of size 6 from a time series:
1 1.5 1.6 1.4 1.5 1.7
19.3. [S-F17:431 You are given the following information from a time series:
xi
1 4.0
2 3.5
3 2.5
4 5.5
5 4.5
6 4.0
4.0
19.4. air [MAS-I-818:41] You are given the following information from a time series:
yi
1 3.0
2 2.5
3 2.0
4 3.5
5 4.0
6 3.0
3.0
19.5. [120-F90:16] A mutual fund has provided investmentyield rates for five consecutive years as follows:
Year Yield
1 0.07
2 0.06
3 0.07
4 0.10
5 -0.05
19.7. •-: [MAS-1-F18:45] You are given the following quarterly rainfall totals over a two-year span:
Quarter Rainfall
2016q1 25
2016q2 19
2016q3 10
2016q4 32
2017q1 26
2017q2 38
2017q3 22
2017q4 20
19.8. s: [MAS-I-F19:44] You are given the following annual sales totals for a department store.
Year Sales
2013 400
2014 375
2015 410
2016 420
2017 410
2018 525
•
19.9. e'li
[MAS-I-S19:421 Consider the following time-series data for price of a stock on January 1 for the last 5
years:
Date Jan. 1, 2013 Jan. 1, 2014 Jan. 1, 2015 Jan. 1, 2016 Jan. 1, 2017
Price 63.18 81.89 103.43 123.90 133.53
where ct, t = 0,1, 2, ... , T denote observations from a white noise process.
(ii) The following nine observed values of Ct
t 11 12 13 14 15 16 17 18 19
ci 2 3 5 3 4 2 4 1 2
19.11. •••• [SRM Sample Question #311 Determine which of the following indicates that a nonstationary time
series can be represented as a random walk
I. A control chart of the series detects a linear trend in time and increasing variability.
II. The differenced series follows a white noise model.
III. The standard deviation of the original series is greater than the standard deviation of the differenced series.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
19.12. kir [SRM Sample Question #38] You are given two models:
Model L: y, = ho + f31f ± el
where fell is a white noise process, for t = 0,1, 2, ...
Model M: y = yo + pct + ut
if, — Ys-i
\-1
ut
2,i=1 ei
where { et} is a white noise process, for t = 0,1, 2, ...
Determine which of the following statements is/are true.
I. Model L is a linear trend in time model where the error component is not a random walk.
Model M is a random walk model where the error component of the model is also a random walk.
III. The comparison between Model L and Model M is not clear when the parameter pc = 0.
(A) I only (B) II only (C) HI only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
yi 2 5 10 13 18 20 24 25 27 30
(iii) yo = 0
Calculate the standard error of the 9 step-ahead forecast, 919.
(A) 4/3 (B) 4 (C) 9 (D) 12 (E) 16
where
E[ci] = pc and Var(ci) = a, t = 1,2,...
Determine which statements is/are true with respect to a random walk model.
I. If 1.1, = 0, then the random walk is nonstationary in the mean.
II. if cr = 0, then the random walk is nonstationary in the variance.
III. If cr > 0, then the random walk is nonstationary in the variance.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
19.15. s: You are given that ct is white noise with variance 16.
Determine the width of an approximate 95% confidence interval for a three step ahead forecast.
19.16. "411 You are given that xi is a random walk. The variance of .x-5 is 35.
Determine the variance of x6.
19.17. a: For a random walk { yi}, differences of consecutive terms are white noise with mean 2 and variance 16.
Assume that the terms of the white noise series are normally distributed.
You are given that yi = 3.
Calculate the probability that y2 is positive.
19.18. sfr You are given that yt is a random walk. Assume all terms of the series are normally distributed. o-,2 = 3.
Determine the approximate width of a 95% confidence interval for a three step ahead forecast.
19.19. 441 You are given that yi is a time series satisfying the following equation:
Yi =
+4+ ci
where ci is a normally distributed error term. The variance of the error term is estimated to be 5. You are also given
that VT 25.
Determine the lower bound of a 95% confidence interval for a three step ahead forecast of yt, 9T+3.
19.20. a'? [4-F00:40] You are given two random walk models. These models are identical in every respect, except
that for one of them pc = 0 and for the other one Uc > 0.
Which of the following statements about these random walk models is incorrect?
(A) For the random walk with pc = 0, all forecasted values from time T are equal.
(13) For the random walk with pc = 0, the standard error of the forecast from time T increases as the forecast
horizon increases.
(C) For the random walk with pc # 0, the forecasted values from time T will increase linearly as the forecast
horizon increases.
(D) For the random walk with pc 0, the standard error of a forecasted value from time T is equal to the
standard error of the corresponding forecasted value for the random walk with pc = 0.
(E) For the random walk with pc 0, the standard error of the forecast from time T increases or decreases,
depending on pc, as the forecast horizon increases.
Use the following information for questions 19.21 and 19.22:
You are given the following time series:
2,4,1,3,5,3,2,1,6,4
It is modeled as a white noise series. To test this model, out-of-sample validation is done, with the first 6 terms
used for the model development subsample.
It is modeled as a random walk. To test this model, out-of-sample validation is done, with the first 6 terms used for
the model development subsample.
Calculate the mean square error.
It is modeled as a random walk. To test this model, out-of-sample validation is done, with the first 6 terms used for
the model development subsample.
Calculate the mean absolute error.
19.25. [SRM Sample Question #55) You are given the following eight observations from a time series that
follows a random walk model:
Time (t) 0 1 2 3 4 5 6 7
Observation (y,) 3 5 7 8 12 15 21 22
You plan to fit this model to the first five observations and then evaluate it against the last three observations
using one-step forecast residuals. The estimated mean of the white noise process is 2.25.
Let F be the mean error (ME) of the three predicted observations.
Let G be the mean square error (MSE) of the three predicted observations.
Calculate the absolute difference between F and G, IF - G.
(A) 3.48 (B) 4.31 (C) 5.54 (D) 6.47 (E) 7.63
Solutions
= 706
19.2. The sample mean is 1.45. Subtracting the sample mean, the terms are -0.45, 0.05, 0.15, -0.05, 0.05, 0.25. The
sum of their squares is 0.295. The lagged sum (xi - 1)(xi+2 - R) is
19.3. The numerator of the autocorrelation formula at lag 4 is (xi -5) (x5-2)+ (x2-1)(x6-1) = (0)(0.5)+ (-0.5)(0) = 0,
so the autocorrelation is E. (C)
19.4.
(3.5 - 3)(3 - 3) + (4 - 3)(2.5 - 3) + (3 - 3)(2 - 3) -0.5
r3 - = -0.2 (B)
-
Y2 —
EL3(Yi - 2)(yr-2 - Y)
V=1 (Yt g)2
After subtracting 0.05 from each term, the series is 0.02, 0.01, 0.02 0.05, -0.10.
The denominator is 0.022 + 0.012+ 0.022 + 0.052 + (-0.10)2 = 0.0134. The numerator is (0.02)(0.02) + (0.05)(0.01) +
(-0.10)(0.02) = -0.0011. So r2 =
= -0.08209 . (E)
12 =
V.3(Yt - 0)(Yt-2 - 9)
ELi(Yi - 9)2
After subtracting 1.2 from each term, the series is 0, -0.1, -0.3, 0.1, 0.3.
The denominator is 2(0.12 + 0.32) = 0.2. The numerator is (0.1)(-0.1) + (0.3)(-0.3) = -0.1. So r2 = -0.5
(A)
19.7. You'd use your calculator to work this out. The average of the 8 numbers is 24; the sum of square differences
of the numbers from 241s 506. Then the lag 4 product is
while the denominator before division by 5 is the biased sample variance times 5 or the unbiased sample variance
times 4: 3383.887. The quotient is 1475.695/3383.887 = 0.4361. (C)
19.10. The forecast is yio plus 9 times the drift p(c), the mean of the underlying white noise, and the drift
is estimated as the average value of the past observations, or 2, so 919 = Y10 + 18. The actual value of y19 is
Yio + + 3 + 5 + • • • + 2 = yio + 26. The forecast error is 26 - 18 = n. (D)
19.11. See the enumerated list on page 345, taken from the Frees textbook (except that "control chart" is not
mentioned); it includes all three statements. (D)
19.12. See page 346, which summarizes an obscure passage in Frees. Note that statement III means that the
comparison is not clear because L has trend and M doesn't, so you wouldn't think of comparing them; it doesn't
mean that you can't compare and contrast them. (D)
19.13. The 10 observed changes, el = yi _1, are 2, 3, 5, 3, 5, 2, 4, 1,2, 3. Their mean is 3 and their unbiased sample
variance is 16/9. So the standard deviation is 4/3. The standard error of 919, is 1/§. times the standard deviation, or
(B)
19.14. The mean of a random walk is yo pct, which is not constant unless pc = 0, so I is true. The variance of a
random walk is ta., which is not constant unless cy = 0, so II is false and III is true. Of course if a = 0 the walk is
not really random. (C)
19.15. The variance is 16 at all future times, so the confidence interval is ±2V3.-6, and 2(2)A/ 17 =
19.16. The variance of each term is t a2. If52= 35, then 6a2 =
19.17. y2 is a normal random variable with mean yi + pc = 3 + 2 = 5 and variance 16. The probability that it is
greater than 0 is 1 - 1)((0 5)/-{1-6) = 1 - (13(-1.25) = 0.89441
Exam SRM Study Manual
Copyright 02022 ASM
356 19. TIME SERIES: BASICS
19.20.
A. Since the mean change at each time unit is E[mr+ki = 0, the forecast is xT. for all future times. t/
B. Since a white noise term is added at each time, the standard error for k periods ahead is "sffc.
C. For the random walk with pc at 0, the forecasted value k periods ahead is XT + kp. 1/
D. The standard error is the square root of the variance of the sum of k white noise terms.
E. pc is non-stochastic so it has no effect on the standard error, which always increases. X
(E)
19.21. The average of the first 6 terms is 3, so the forecast for all future times is 3. The mean percentage error is
100(1 2 —3 —1
= 43.75
4 2 1 6 4
19.22.
(_1)2 (_2)2 + 32 ± 12 3.75
4
19.23.(yo — yi)/5 = (17— 2)/5 = 3, so the forecasted increase per period is 3. The forecasted series is 20, 23, 26, 29.
The mean square error is
02 + 12 + 12 + 22
1.5
4
19.24. (y6 — yi)/5 = (22— 2)/5 = 4, so the forecasted increase per period is 4. The forecasted series is26,30, 34,38.
The mean absolute error is (1 + 2 + 1 + 2)/4 = 1.5
19.25. For each forecast period, the forecast is pt =9 + 2.25. Thus 95 = 14.25, 96 = 17.25, and 97 = 23.25. The
forecast errors are 0.75, 3.75, and —1.25 respectively.
The mean error is
0.75 + 3.75 — 1.25
F= = 1.08333
3
Quiz Solutions
19-1.
(-1)(0) + (3)(1)
r3 —
(_1)2 + 32 + (_3)2 4. 02 + 12
—
[0.15
19-2. The sample mean is 1 and the unbiased sample variance is (02 + 22 + 12 + 12 + 22)/4 = 2.5. A forecast interval
is two-sided, so we use the 95thpercentile of a t distribution with 4 degrees of freedom, or 2.1318. The upper bound
of a 90% forecast interval is
1
1 + 2.1318 -k./ 5-\/1 + — = 5
4.6924
BREAK. #asmstudybreak
Never stop testin your knowledge, whether it be relevant to our career, or ust trivia.
QUESTION:
Which car is often called the first
muscle car?
atstm
Actuarial Study Materials
Lesson 20
An autoregressive model of order 1, or AR(1) is a time series where each term may be expressed in terms of the '46
previous term plus white noise:
pk (20.2)
To test whether a series is white noise, in other words Pi = 0, one may check whether the sample autocorrelations
rk are significant. The standard error of rk is 1/Vf. So if Irti > 2/VT, the autocorrelation may be regarded as
significant, rejecting the white noise model.
The coefficients po and pi may be estimated using the method of conditional least squares, which is simply
regression of yi on The coefficients are approximately b1 = 1'1 and bo = 9(1 — b1). Let et be the residuals,
et=y' — (ho +1)12/1-1)
1
S2 —
T—3 E (et —
1=2
1)2 (20.3)
Here, there are T 1 terms in the sum, but 2 degrees of freedom are used up to estimate po and (31, leaving T —3
degrees of freedom.
EXAMPLE 20A a-.? You are given the time series {55, 35, 52, 40, 46, 42). You fit an AR(1) model to it using the
autocorrelation at lag 1 to approximate pi.
Calculate the estimated variance of the error term.
SOLUTION: First calculate ri. The mean of the six terms is 45. After subtracting 45 from each term, the remainders
of the time series terms are {10, —10,7, —5,1, —3}. The autocorrelation at lag 1 is
Then P'.0 = 45 + 0.75(45) = 78.75. The estimated values of the time series are gi = 78.75 — 0.75y,_, and the error is
Et =ys— 9, — (78.75 — 0.75y/-4)
We cannot calculate9,so we can't calculate the first error. The other errors are —2.5, —0.5, 0.25, —2.75, and —2.25.
The mean error is —1.55. The variance of the error is
(( 2.5 — (-1.55))2 + (-0.5 — (-1.55))2 + (0.25 — (-1.55))2 + (-2.75 — (-1.55))2 + (-2.25 — (-1.55))2)
:= 2.391667 0
Notice that the variance of yt is greater than the variance of r. By taking the variance of the two sides of
equation (20.1), we get
Var(si)
Var(yt) = (20.4)
One may define autoregressive models of higher order, but they are not on the syllabus.
To forecast values in an AR(1) model, start at time T and express them in terms of the previous values recursively,
omitting the El term:
9t+1 = PO + poi
Var(gT+k) = 62 E p321(1-1)
t=1
(20.5)
where s 2 is the estimated variance of the error term. Notice that for a 1-step ahead forecast, the sum has only one
term, 1, so Var(frT+1) = s2.
The approximate 95% forecast interval is
Quiz 20-1 s:
For an AR(1) model, Po = 5,131 = 0.6, and yy, = 32.
Calculate the forecast of YT+2.
Definition of process:
yt =p0 +piyi_i + (20.1)
The process is stationary if and only if VII <1
pk = (20.2)
S2 = Var(el ) = T—3
(et — (20.3)
t =2
Var(ei)
Var(yl) = 2 (20.4)
1 — P1
k
Var(9T+k) = 62 E P21(1-1)
1=1
(20.5)
Exercises
20.1. si? You are given the AR(1) process yi = + Ei. Also, icq = 4.
r." Determine Var(yi ).
20.2. For an AR(1) process yi = 0.6y1_1 + el, you are given that Var(yt) = 5.
Determine
20.3. %. [VEE Applied Statistics-Summer 05:1] You are given the following information about an AR(1) time-series
model:
p2 = 0.8
Determine 'Pi I.
(A) 0.5 (B) 0.6 (C) 0.7 (D) 0.8 (5) 0.9
• E[ei] = 0
• Var(Ei) = a2
Y20 = 60. Forecasts are based on y20.
Determine the lowest I for which 920+t <53.
20.5. NIP [MAS-1-519:43] You are given an autoregressive time series of order 1:
= +
S1 $2
0 20 40 00 80 100 0 20 40 00 SO 100
Time Time
$3 Si4
0 20 40 ao 80 100 0 20 40 ao 80 100
Time Time
Exam SRM Study Manual Exercises continue on the next page ...
Copyright ©2022 ASM
EXERCISES FOR LESSON 20 363
20.6. [SRM Sample Question #22] A stationary autoregressive model of order one can be written as
20.7. am? [S-F15:421 An AR(1) model is fitted to time series data through time t = 7. The resulting model is
yt = 14.3379 — 0.79yt_i + Ei
(---- 20.8. r[MAS-I-S18:44] You are given the following fitted AR(1) model:
ye = 5 + 0.85y/...1 + Et
20.9. You are given the following fitted AR(1) model based on 20 observations:
ye = 4 — + El
20.10. You are given the following fitted AR(1) model based on 15 observations:
= 10 + 0.4y, + et
Solutions
a2
20.2. By formula (20.4) for Var(y/), 5 = 1-0r6 ' Therefore, 4 = 3.2
20.3. By formula (20.2), p2 = fq, so 131 = 1/U with absolute value approximately 0,9. (E)
20.4. The mean of the series is 50 and the current value y20 = 60, which is 10 higher than the mean. In the forecast \---)
of an AR(1) series, each excess of a term over the mean is 131 times the excess of the previous term over the mean.
Here, pi = 0.75. We want the t such that yi = 53, an excess of 3 over the mean, so we want 10(0.751) <3. Solving
fort,
10(0.751) <3
0.75' <0.3
t In 0.75 < 1n0.3
In 0.3
t > = 4.185
ln0.75
916 10 + 0.4(15) = 16
917 = 10 + 0.4(16) = 16.4
The upper bound of the 99% forecast interval is 16.4 + 3.0545-V5(1 + 0.42) = 23.7562
20.11. From (i), S2 = 10. From (ii), s2(1 + fiT) = 15, so g 0.5. Then pl = 0.25, and 10(1 + 0.5 + 0.25) = 173
Quiz Solutions
20-1.
19.52
If there are no trends in the data, then we can ignore the second term on the right and forecast 3r = g.T. for all I.
If a time series has trend, we can perform double moving average smoothing. In other words, we can smooth st: 6-r
..(2) =
gi—k+1 g(—k+2 " • gl
SI
(2)
The estimate of trend is bi,r 2(gr — sT )/(k — 1), and the forecast is 9T+1 =
Moving average smoothing is weighted least squares, with weights of 1 on the most recent k periods and 0 on
earlier periods.
Setting w low leads to very little smoothing, whereas setting it high leads to a lot of smoothing. Using this equation,
the forecast using exponential smoothing is that future values will equal the last known value of s:
9T+k = gT (21.3)
'The weights do not add up 1. I would prefer to pot the balance of the weight on yo, but the textbook does not do that. It doesn't make a big
difference ill is large. It makes no difference if yo = 0.
Quiz 21-1 I': You are given the following time series:
12, 18, 20, 21, 25
You are to exponentially smooth it with smoothing parameter w = 0.6 and starting value yo = 0.
Calculate 55.
One way to evaluate this model is to check what this model would have predicted in the past versus the data we
have. The one-step prediction error is et = y, — gi_l. Thus the sum of squared one-step prediction errors is
(where presumably go = yo, although the textbook doesn't say). You should select w to minimize SS(w).
If the time series has trend, one may perform double exponential smoothing; calculate gi as above, then expo-
nentially smooth gi to obtain g1(2). Then, let bo,T be the intercept and bix be the slope. They are
bo,T = 2gr g.?)
1-W .(2)
(21.5)
bi,T (gT sT
Fixed effects
Fixed seasonal effects can be modeled using trigonometric functions or dummy variables for each season. The most
convenient form is a linear regression model with the variables being trigonometric functions. Let SB be the seasonal
base—the number of terms in the time series per year, or per whatever seasonal cycle is of interest. For example,
if a time series has annual seasonal cycles and the terms in the series are monthly observations, then SB = 12. Let
fi = 271f ISB, where j is a positive integer. Then the linear regression model for the seasonal time series is
j=1
If SB is even then nz is at most SB 12; any additional variables would be collinear with the ones already in the model.
In fact, sin fm t is identically 0 for m = SB /2 and would be omitted. The optimal value of 171 is selected by comparing
models, possibly using F tests.
When there are P lagged terms, this model is called an SAR(P) model.
= (1 - wi)(Yt - St-sB)+ +
SOLUTION: Projection is 121 - 90 = 31 periods into the future. The cycle is annual and there are 12 months, so
ST(121) = ST(109) = ST(97) = ST(85).
9121 = 122 + 31(5) -3 274
Sometimes a time series may show trend, but it is not clear whether the time series is a random walk, has a linear
trend, or is autoregressive. To test this, we set up the regression model
ill - y1-1= Po ( - 1)yr-i + pit + et
yt is a random walk if rp = 1. We use the t statistic to determine the significance of the hypothesis cp = 1 versus Gp <1,
but we cannot use the standard critical values for t under the null random walk hypothesis. Instead, we use critical
values compiled by Dickey and Fuller, which are higher, so it is harder to reject the random walk hypothesis.3
The Dickey-Fuller test assumes that El are not serially correlated. The augmented Dickey-Fuller test allows for
serial correlations of Ci. In the augmented Dickey-Fuller test, we add lagged variables to the regression:
yr - =
It is not clear how many lagged variables should be added, so typically the test is performed for several values of p.
2There is also a multiplicative model, but we will not discuss it.
3To partially make up for this, it is traditional to use 10% significance instead of 5% significance.
Nr The autoregressive changing heteroscedasticity model, or the ARCH model, allows the conditional variance of a time
series to vary based on previous errors, yet leaving the unconditional variance constant, and therefore allowing the
1-1? time series to be weakly stationary. The formula for the variance at any time t is
2 2
at -w+Eyj Et-1 •
( 21.8)
j=1
4 =W + (21.9)
•411 with bi 0, yi 0, and yi +Ebi <1. A GARCH model is stationary with unconditional variance
Var(Ei) = w (1 - yj ai) (21.10)
=
- zu3)(Y1 - bo,t)+ W3g i—SB
(21.6) Var(ei) = w
(1 - _ E oi) (21.10)
lf Exercises
Yt 50 50 60 44 56
You forecast this series using an exponentially weighted moving average model with w = 0.6.
Determine the forecast of y6.
21.2. '41 You are applying exponential smoothing to a time series. You are given that Y20 = 100, yn = 101,
= 110, w = 0.6.
Determine g 21 .
21.3. *se? You apply exponential smoothing to a time series. You are given
y5 = 90 y6 = 81 g4 = 70 55 = 82
21.4. "Lir [Old exam] You are given the following information about a series of observations, y, t = 1, 2, 3, ... ,100:
td< io for all t
Y96 = 7.1
Y97 = 8.9
Y98 = 5.5
y99 = 8.2
Yioa = 7.3
You are to use exponential smoothing with w = 0.1 to forecast values beyond the end of the observation period.
You are given that the smoothed value g95 = 7.1.
Calculate the forecasted value of yioi.
(A) 7.1 (B) 7.2 (C) 7.3 (D) 7.4 (B) 7.5
21.3. NI° [120-81-95:13] You use exponential smoothing with w = 0.9 and go = 27 to forecast the following time
series:
Yt
27
1 20
2 30
3 10
4 60
5 15
Determine 55.
21.6. '41 [120-83-98:121 You are given the following observations from a time series:
t 1 2 3 4 5
and go = 15.6.
Use exponential smoothing with smoothing constant w = 0.40 to calculate the smoothed value gs of the fifth
observation.
(A) -1.8 (B) -0.4 (C) 1.2 (D) 5.2 (E) 11.3
21.7. 82 [VEE-Applied Statistics-Summer 05:3] You use exponential smoothing with w = 0.9 to make one-step-
ahead forecasts, that is, 9t+1 = gl. t = 0,1,2, 3, for the following data:
Y! 91+1
81
1 80
2 81
3 89
4 74
:lit 1 4 8 11
.(2)
You perform double exponential smoothing with smoothing constant w = 0.8 and starting values so = so = 0.
Calculate the estimated trend.
21.9. [SRM Sample Question #461 A time series was observed at times 0, 1,..., 100. The last four observations
along with estimates based on exponential and double exponential smoothing with w = 0.8 are:
All forecasts should be rounded to one decimal place and the trend should be rounded to three decimal places.
Let F be the predicted value of yin using exponential smoothing with w = 0.8.
Let G be the predicted value of y102 using double exponential smoothing with w = 0.8.
Calculate the absolute difference between F and G, IF - GI.
(A) 0.0 (B) 2.1 (C) 4.2 (D) 6.3 (E) 8.4
21.13. "-.11' You apply the Holt-Winters method to a time series. You are given box-1
and you forecast bo,T+i = 536. The smoothing constant zy2 = 0.5.
Determine the smoothing constant w1.
Use the following information for questions 21.14 and 21.15:
You apply the Holt-Winters method to a time series. You are given that the data are monthly, y2 = 19,1/14 = 15,
yi5 = 20, b0,14 = 17, bi,14 = 2, ,. 3 = 3, zvi = 0.8, w2 = 0.7, w3 = 0.6.
21.14. 6.1? Calculate S15.
21.15. ••• g4 = -2.
Calculate the forecasted value 916.
Solutions
21.1. The forecast of yo is the same as forecast of y5, namely $4. We use
S4
EL0 wk(1 - W)Y4-k
1/(1 - w)
„
56 + 0.6(44) + (0.62)(60) + (0.63)(50) + (0.64)(50) 48.512
Y6 -
1/0.4
g6 = y6(1 - w) g5w
82 = 81(1 - w) + (90- 20w)w = -20w2 + 9w + 81
20w2 - 9w + 1 = 0
9 ± A/81 :1
w= = 0.20,0.25
40
21.4. The forecasted value is 9ioi. = gioo• Note that gge, = 0.1(7.1) + 0.9(7.1) = 7.1.
21.5. This can be done recursively by calculating i,g2. etc., or in one calculation with weights of 0.1 on the current
observation y5, (0.9)(0.1) on the previous one y4, (0.92)(0.1) on y3, etc., with the leftover weight 0.95 placed on h:
g5 = 0.1(15) + 0.09(60) + 0.081(10) + 0.0729(30) + 0.06561(20) + 0.59049(27) 27.15243 (E)
21.6.
4
-0.13225 (B)
You would've ended up with the same answer choice if by mistake you summed up yi - gi instead of yt -9,.
0.6528
=
0.2(11) + 0.8(2.368) = 4.0944 042) = 0.2(4.0944) + 0.8(0.6528) = 1.34112
Estimated trend is
(
_ w)/w)(41) _42)54 ) 3:28 (4.0944 - 1.34112) =
0
0.68832
= 0.8(7.5) + 0.2(10) =8
4
21.16. Var(et) =
yi = 1 -0.2
Quiz Solutions
Practice Exams
Practice Exam 1
1. g'ir A life insurance company is underwriting a potential insured as Preferred or Standard, for the purpose of
determining the premium. Insureds with lower expected mortality rates are Preferred. The company will use
factors such as credit rating, occupation, and blood pressure. The company constructs a decision tree, based on its
past experience, to determine whether the potential insured is Preferred or Standard.
Determine, from a statistical learning perspective, which of the following describes this underwriting method.
I. Classification setting
IL Parametric
III. Supervised
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
2. An insurance company is modeling the probability of a claim using logistic regression. The explanatory variable
is vehicle value. Vehicle value is banded, and the value of the variable is 1, 2, 3, 4, 5, or 6, depending on the band.
Band 1 is the reference level.
(A) 0.30 (B) 0.35 (C) 0.40 (D) 0.45 (E) 0.50
3. kor Auto liability claim size is modeled using a generalized linear model. Based on an analysis of the data, it is
believed that the coefficient of variation of claim size is constant.
Exam SRM Study Manual 379 Exam questions continue on the next page . . .
Copyright 02022 ASM
380 PRACTICE EXAM 1
4. a: You are given the following output from a GLM to estimate loss size:
(i) Distribution selected is Inverse Gaussian.
(ii) The link is g(p) = 1I2.
Parameter
Intercept 0.00279
Vehicle Body
Coupe 0.002
Sedan —0.001
SUV 0.003
Area
—0.025
0.015
0.005
Calculate mean loss size for a sedan with value 25,000 from Area A.
6. N. In a principal components analysis, there are 2 variables. The loading of the first principal component on the
first variable is —0.6 and the loading of the first principal component on the second variable is positive. The variables
have been centered at 0.
For the observation (0.4, x2), the first principal component score is 0.12.
Determine x2.
(A) 0.25 (B) 0.30 (C) 0.35 (D) 0.40 (E) 0.45
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright Q2022 ASM
PRACTICE EXAM 1 381
A generalized linear model for automobile insurance with 40 observations has the following explanatory
variables:
Model I includes all of these variables and an intercept. Model II is the same as Model I except that it excludes
USE. You have the following statistics from these models:
Deviance AIC
Model I 23.12 58.81
Model II 62.61
Using the likelihood ratio test, which of the following statements is correct?
(A) Reject Model II at 0.5% significance.
(B) Reject Model II at 1.0% significance but not at 0.5% significance.
(C) Reject Model II at 2.5% significance but not at 1.0% significance.
(D) Reject Model II at 5.0% significance but not at 2.5% significance.
(E) Do not reject Model II at 5.0% significance.
Calculate the dissimilarity measure between the clusters using Euclidean distance and average linkage.
(A) 3.6 (B) 3.7 (C) 3.8 (D) 3.9 (E) 4.0
10.
A normal linear model with 2 variables and an intercept is based on 45 observations. gi is the fitted value of yi,
and 9i(i) is the fitted value of yi if observation i is removed. You are given:
(i) — 9)(02 = 4.1.
(ii) The leverage of the first observation is 0.15.
Determine IP' I, the absolute value of the first residual of the regression with no observation removed.
(A) 3.9 (B) 4.4 (C) 4.9 (D) 5.4 (E) 5.9
11.
11411 A least squares model with a large number of predictors is fitted to 90 observations. To reduce the number of
predictors, forward stepwise selection is performed.
For a model with k predictors, RSS = ck•
The estimated variance of the error of the fit is 82 = 40.
Determine the value of cd cd+1 for which you would be indifferent between the d + 1-predictor model and the
d-predictor model based on Mallow's C.
(A) 40 (B) 50 (C) 60 (D) 70 (E) 80
Exam SRM Study Manual Exam questions continue on the next page .
Copyright ©2022 ASM
382 PRACTICE EXAM I
12. 11-: A classification response variable has three possible values: A, B, and C.
A split of a node with 100 observations in a classification tree resulted in the following two groups:
13. q Determine which of the following statements are true regarding cost complexity pruning.
I. A higher a corresponds to higher MSE for the training data.
II. A higher a corresponds to higher bias for the test data.
III. A higher a corresponds to a higher IT i•
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
15. 11 Determine which of the following statements are true regarding K-nearest neighbors (KNN) regression.
I. KNN tends to perform better as the number of predictors increases.
II. KNN is easier to interpret than linear regression.
III. KNN becomes more flexible as 1/K increases.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
16. *-: A department store is conducting a cluster analysis to help focus its marketing. The store sells many different
products, including food, clothing, furniture, and computers. Management would like the clusters to group together
customers with similar shopping patterns.
Determine which of the following statements regarding cluster analysis for this department store is/are true.
I. The clusters will depend on whether the input data is units sold or dollar amounts sold.
II. Hierarchical clustering would be preferable to K-means clustering.
III. If a correlation-based dissimilarity measure is used, frequent and infrequent shoppers will be grouped together.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page ...
Copyright C2022 ASM
PRACTICE EXAM 1 383
'I Determine which of the following statements regarding principal components analysis is/are true.
I. Principal components analysis is a method to visualize data.
II. Principal components are in the direction in which the data is most variable.
III. Principal components are orthogonal.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
18. '41. A random walk is the cumulative sum of a white noise process ct. You are given that ci is normally distributed
with mean 0 and variance a2.
Which of the following statements are true?
I. The mean of the random walk does not vary with time.
II. At time 50, the variance is 50a2.
III. Differences of the random walk form a stationary time series.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
19. •41' You are given the following regression model, based on 22 observations.
y = )30 +isixi +02x2 + P3X3 134X4 ± /35 X5 ± e
(A) 1.3 (B) 1.7 (C) 2.1 (D) 2.5 (E) 2.9
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page...
Copyright ©2022 ASM
384 PRACTICE EXAM /
22. sis° Determine which of the following statements about boosting is/are true.
I. Selecting B too high can result in overfitting.
II. Selecting a low shrinkage parameter tends to lead to selecting a lower B.
III. If d =1, the model is an additive model.
(A) None (B) I and H only (C) I and HI only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
23. To validate a time series model based on 20 observations, the first 15 observations were used as a model
development subset and the remaining 5 observations were used as a validation subset. The actual and fitted values
for those 5 observations are
Yi Pt
16 7 10
17 9 12
18 12 14
19 18 16
20 22 18
(A) 7.4 (B) 8.4 (C) 9.5 (D) 10.5 (E) 11.5
24. In a hurdle model, the probability of overcoming the hurdle is 0.7. If the hurdle is overcome, the count
distribution is kg(j), where go]) is the probability function of a Poisson distribution with parameter A = 0.6.
Calculate the probability of 1.
(A) 0.23 (B) 0.31 (C) 0.39 (D) 0.45 (E) 0.51
26.
`16. The number of policies sold by an agent in a year, y, is modeled as a function of the number of years of
experience, x. The model is a Poisson regression with a log link. The fitted coefficient of x is pi = 0.06.
The expected number of policies sold after 2 years of experience is a and the expected number of policies sold
after 5 years of experience is b.
Calculate b /a.
(A) 1.18 (B) 1.19 (C) 1.20 (D) 1.21 (E) 1.22
Exam SRM Study Manual Exam questions continue on the next page .
Copyright 02022 ASM
PRACTICE EXAM 1 385
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
28.
Disability income claims are modeled using linear regression. The model has two explanatory variables:
1.
Occupational class. This may be (1) professional with rare exposure to hazards, (2) professional with some
exposure to hazards, (3) light manual labor, (4) heavy manual labor.
2. Health. This may be (1) excellent, (2) good, (3) fair.
The model includes an intercept and all possible interactions.
Determine the number of interaction parameters pi in the model.
(A) 6 (B) 8 (C) 9 (D) 11 (E) 12
Exam SRM Study Manual Exam questions continue on the next page .. .
Copyright Q2022 ASIs.4
386 PRACTICE EXAM I
32. Determine which of the following statements about classification trees is/are true.
I. Classification error is not sensitive enough for growing trees.
II. Classification error is not sensitive enough for pruning trees.
HI. The predicted values of two terminal nodes coming out of a split are different.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
III. Observations 3 and 4 are closer to each other than observations 1 and 2.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
34. 'I? For a simple linear regression of the form y =Po +13pci Ei, you are given
(i) 9 =100
(ii) y = 81,004
(iii) Vi` 9= 80,525
Calculate R2.
(A) 0.46 (B) 0.48 (C) 0.50 (D) 0.52 (E) 0.54
Exam SRM Study Manual Exam questions continue on the next page
Copyright 02022 ASM
PRACTICE EXAM 1 387
(----.. 4.411 Determine which of the following are results of overfitting models.
I. The residual standard error may increase.
II. The model may be more difficult to interpret.
III. The variables may be collinear.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
1.
Nr With regard to statistical learning, determine which of the following is/are parametric approaches.
I. Ridge regression
II. K-nearest neighbors regression
III. Principal components analysis
(A) I only (B) II only (C) III only (D) 1, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
2. 141 For a linear regression model of the form
yi = Po +p2xi2+ p3xnxi2+ Ej
3. •11? Determine which of the following factors are drawbacks to using causal models for time series.
I. Time series patterns may induce or mask relationships between variables.
II. Causal models cannot properly handle non-linear relationships.
III. Forecasting the variable of interest requires forecasting independent variables.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
Exam SRM Study Manual 389 Exam questions continue on the next page...
Copyright 612022 ASM
390 PRACTICE EXAM 2
5. 'Lf A medical research project is studying the probability of getting a certain type of cancer, based on genetic traits.
The probability is modeled using logistic regression. The number of traits is greater than the number of individuals
in the study, so it is necessary to select a subset of the traits.
Determine which of the following characteristics of statistical learning pertain to this study.
I. Classification setting
II. Supervised learning
III. Parametric
(A) I only (B) II only (C) III only • (D) I, H, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
6. A survey is made of the importance of an automatic high beam system in a car. Importance levels are 1 (not
important), 2 (important), and 3 (very important).
A proportional cumulative odds model is used to model the responses. Explanatory variables are sex (male or
female) and age group (18-23, 24-40,> 40). The model is of the form
of 4.
Calculate the double-smoothed moving average at the last observation time, 432), using a running average length
(A) 8.5 (B) 8.6875 (C) 8.75 (D) 8.875 (E) 9.75
8.
For a linear regression based on 28 observations, there are 4 explanatory variables and an intercept. You are
given:
(i) The residual standard deviation is 12.4.
(ii) The leverage of the first observation is 0.04.
(iii) The first residual el = 6.2.
Calculate the first standardized residual.
(A) 0.50 (B) 0.51 (C) 0.52 (D) 1.1 (E) 2.5
Exam SRM Study Manual Exam questions continue on the next page...
Copyright ©2022 ASM
PRACTICE EXAM 2 391
The observations are to be grouped into two clusters. Initially, the first three points are grouped into one cluster
and the last three points into the other cluster.
Calculate the initial value of the objective function that is minimized by the clustering algorithm.
(A) 27 (B) 111 (C) 221 (ID) 332 (E) 664
64/ A logistic regression models the probability of a claim. The model includes the following explanatory variables:
Driver's age A categorical variable with 6 bands. Level 3 is the base level.
Area A categorical variable with 3 bands. Area A is the base level.
Vehicle value A continuous variable.
Parameter Estimate(b)
Intercept —2.521
Driver's age
1 0.345
2 0.102
3 0.000
4 —0.050
5 —0.173
6 —0.124
Area
A 0.000
0.155
0.374
Calculate the odds of a claim by a driver in Area B, age band 4, driving a vehicle with value $30,000.
(A) 0.11 (B) 0.13 (C) 0.15 (D) 0.17 (E) 0.19
Exam SRM Study Manual Exam questions continue on the next page.. .
Copyright ©2022 ASM
392 PRACTICE EXAM 2
12. 11/4: A classification tree is built based on 9 observations. There is a response and one predictor. The values of the
variables are:
X 2 5 6 8 12 16 19 25 30
Y No No Yes No Yes No Yes Yes Yes
Determine which of these splits is/are best using classification error as the criterion.
I. Between 6 and 8
(A) I only (B) H only (C) I and II only (D) I and III only (E) II and III only
13. 1141 You are using cost complexity pruning to prune a regression tree. One of the terminal nodes has the following
values for the response variable: {3,6, 8,9}. Another terminal node has the following values for the response
variable: {6,10, 12, 14}. These two terminal nodes are on branches from an intermediate node. We are considering
pruning these branches.
Determine the lowest value of a for which these branches are pruned.
(A) 12 (B) 18 (C) 22 (D) 28 (E) 32
14. u: Determine which of the following statements regarding subset selection is/are true:
I. At each iteration of forward subset selection, a variable is added to the model. The variable chosen is the one
that minimizes the test RSS based on cross-validation.
IL Forward subset selection may be used even if the number of variables is greater than the number of observations.
III. Adjusted R2 is not as well motivated in statistical theory as AlC and BIC are.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
15. '1/4: For a set of 35 observations, the following two models are under consideration:
Model I
Yi = Po Pixii + 132X12+ 3x23+ P44 + p5Xi1Xi2 + Et
Model H
Yl = yo + y2x22 y3x5 +
Exam SRM Shady Manual Exam questions continue on the next page...
Copyright ©2022 ASM
PRACTICE EXAM 2 393
n a-dr The probability of a claim on a policy, it, is modeled with probit regression, using several predictors. The
probability of a claim given specific values of x/, x2, ,xk is 0.2.
The fitted value of gi is 0.25.
Calculate the probability of a claim if x1 is increased by 1 and the other variables are unchanged.
(A) 0.24 (B) 0.26 (C) 0.28 (D) 0.30 (E) 0.32
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
19.
Determine which of the following statements about boosting is/are true.
I. The number of terminal nodes in each tree is d, the number of splits parameter.
II. Each tree depends on the previous trees.
III. At each node, all available predictors are considered.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
l
f
A time series follows the process
= 0.61.1_1 + Et
with Var(Et) = 6.
Determine the variance of terms in this series.
21.
"-.1. A Poisson regression model uses a log link. The systematic component is an intercept only; g(12)= Po.
Five observations are 0, 2, 0, 1, 0.
Calculate the fitted value of go using maximum likelihood.
(A) —0.5 (B) —0.1 (C) 0.1 (D) 0.6 (E) 1.8
Exam SAM Study Manual Exam questions continue on the next page. ..
Copyright Q2022 ASM
394 PRACTICE EXAM 2
based on 15 observations:
24. 641 Linear regression models are fitted using subsets of 4 variables. The resulting values of RSS are:
Variables RSS Variables RSS Variables RSS Variables RSS
None 160 X4 72 X2, X3 63 Xi, X2, X4 55
Xi 66 Xi, X2 59 X2, X4 61 Xi, X3, X4 52
X2 67 Xi, X3 62 X3, X4 67 X21 X3, X4 50
X3 69 Xi, X4 60 Xi, X2, X3 57 Xi, X2, X3, X4 30
25. •••• Determine which of the following statements regarding hierarchical clustering is/are true.
I. Complete linkage is based on minimal intercluster dissimilarity.
II. Single linkage is based on maximal intercluster dissimilarity.
III. Single linkage may result in extended trailing clusters.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
26. NI° Determine which of the following statements regarding principal components analysis is/are true.
I. Each principal component has n loadings, one for each data point.
II. Each principal component has p scores, one for each variable.
III. The sum of the squares of the loadings equals 1.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SAM Study Manual Exam questions continue on the next page . . .
Copyright 02022 ASM
PRACTICE EXAM 2 395
A classification variable Y can have the value 0 or 1. It is modeled as a function of two variables, Xi and X2. You
are given the following observations:
X1 X2 Y
2 4 0
3 2 0
3 5 1
4 1 0
4 4 1
4 6 1
5 3 0
5 6 1
6 5 0
III. X1 = 3, X2 = 6: Y = 0
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) ,(C) , or (D) .
E y, = 1035 gi = 1030
ln yi = 92 >lngi=90
ln = 2047
E yi in = 2015
Calculate the deviance.
Exam SRM Study Manual Exam questions continue on the next page...
Copyright 02022 ASM
396 PRACTICE EXAM 2
29. A generalized linear model for amount of sales by agent, based on 65 observations, has the following explanatory
variables: j
REGION: North, South, East, West
AGE OF AGENT: Under 30, 30-39, 40-49, 50-59, 60 and over
YEARS OF EXPERIENCE OF AGENT: 1-5,6-10, over 10
All variables are categorical.
The model is run with and without the REGION variable. The BIC is 123.52 with the REGION variable and
121.08 without the REGION variable.
Determine which of the following statements is true based on the likelihood ratio statistic.
(A) Accept REGION at 1% significance
(B) Accept REGION at 25% significance but not at 1% significance
(C) Accept REGION at 5% significance but not at 2.5% significance
(D) Accept REGION at 10% significance but not at 5% significance
(E) Reject REGION at 10% significance
30. sk: Determine which of the following statements regarding collinearity are true.
I. Collinearity causes the residual standard error to increase.
II. Collinearity causes t statistics of the collinear variables to be low.
III. Collinearity is indicated by a low VIE
(A) I only (B) II only (C) HI only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
(0.293
0.075 0.425
0.425
0.293
0.091
(A) —0.5 (B) —0.4 (C) —0.3 (D) —0.2 (E) —0.1
32. Determine which of the following purposes a scree plot can serve.
I. Visualizing the directions of the principal components.
II. Understanding how the data relates to the principal components.
III. Deciding how many principal components to use.
(A) I only (B) H only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
Exam SRM Study Manual Exam questions continue on the next page.. .
Copyright ©2022 ASM
PRACTICE EXAM 2 397
34. 6-411 A generalized linear model for claim counts uses a Poisson distribution. You are given:
(i) The link function is log.
(ii) The output of the model is
Parameter
Intercept —0.73
Gender—Male 0.07
Marital Status—Single 0.03
Interaction of Gender and Marital Status 0.02
Area
0.12
0.05
—0.04
35. Determine which of the following statements are true regarding residuals of non-linear models.
I. Anscombe residuals are based on transforming the response variable to make it approximately normal.
II. Pearson residuals are based on defining a residual function such that the residuals are approximately normal.
III. Deviance residuals are close to Anscombe residuals in many cases.
(A) I only (B) LI only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
1. NI° A regression tree is built based on two predictors using cost complexity pruning. Cross-validation is used to
select the best value of a.
Xi<1001
X. <60
20 55 72
10.4 12.0
70 45 4
110 50 10
160 70 11
180 40 13
2. *-41 Determine which of the following statements regarding principal components analysis is/are true.
I. The first two principal components span the plane that is closest to the data.
II. The coordinates of the projected values of the data are the principal component scores.
III. Principal components are independent of the scale of the variables.
Exam SRM Study Manual 399 Exam questions continue on the next page . . .
Copyright CP2022 ASM
400 PRACTICE EXAM 3
3. 1.41 A Poisson regression model is used to model claim counts on auto insurance. Explanatory variables are the
following categorical variables:
(A) I only (B) H only (C) III only (D) I, II, and HI
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
6. '"41/ Determine which of the following statements are true with regard to AR(1) processes.
I. A random walk is a special case of an AR(1) process.
II. White noise is a special case of an AR(1) process.
III. For stationary AR(1) processes, autocorrelation is a decreasing function of lag.
IV. For stationary AR(1) processes, the variance of the terms equals the variance of the error.
(A) 1,11, and III only (B) I, II, and IV only (C) I, III, and IV only (D) II, IH, and IV only
(E) I, II, III, and IV
Exam SRM Study Manual Exam questions continue on the next page ...
Copyright ©2022 ASM
PRACTICE EXAM 3 401
For an inverse Gaussian regression with Var(Y) = E[Y]3/5, you are given that yi = 68.5 and 9, = 57.3.
Calculate the Pearson chi-square residual.
(A) Less than 0.05
(B) At least 0.05, but less than 0.06
(C) At least 0.06, but less than 0.07
(D) At least 0.07, but less than 0.08
(E) At least 0.08
0.09
8. Claim severity is modeled with a normal linear model. The current model does not have AGE as an explanatory
variable. You are considering adding AGE as an explanatory variable.
The model is run both with and without AGE You are given the following excerpt from an ANOVA table
compiled from the two runs:
10. •41 With regard to dimension reduction methods, determine which of the following statements are applicable to
principal component regression but not to partial least squares.
I. The new predictors are linear combinations of the original predictors.
IT. The method of creating predictors is unsupervised.
III. The original predictors should be standardized.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on The next page. ..
Copyright 02022 ASM
402 PRACTICE EXAM 3
11. %ID Determine which of the following statements regarding hierarchical clustering is/are true.
I. The height of the cut serves the same purpose as K in K-means clustering.
II. The centroid linkage may lead to inversions.
There are 2" reorderings of a dendrogram, where n is the number of leaves.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
12. •-• An AR(1) process is fitted to a set of 10 observations. To test this model, the first 6 observations are used as a
model development set and the last 4 observations are the validation set. The resulting model is yt = 6 — Ei.
The last 5 observations are 7, 6, 4, 5, 5.
Calculate the MAE statistic.
(A) 0.2 (B) 0.4 (C) 0.6 (D) 0.8 (E) 1.0
13.
6.-1? A classification response variable has three possible values: A, B, and C.
A split of a node with 100 observations in a classification tree resulted in the following two groups:
14.
*-4? A count variable is modeled as a function of two predictors using Poisson regression with a log link. The model
has an intercept. There are 5 observations. The results of the model are
Yi
1 2 1.2
2 2 2.2
3 1 1.8
4 3 4.0
5 2 0.8
It is believed that variance is greater than the mean, so an overdispersion parameter q5 is specified.
Calculate the estimate of cp.
(A) 1.0 (B) 1.2 (C) 1.3 (D) 1.5 (E) 1.6
15.
"-: Determine which of the following are advantages of regression trees over linear models.
I. Easier to interpret
II. More robust
(A) I and II only (B) I and III only (C) I and IV only (D) II and III only (E) II and IV only
Exam SRM Study Manual Exam questions continue on the next page...
Copyright 02022 ASM
PRACTICE EXAM 3 403
n.. 41 For a linear regression with 2 explanatory variables and an intercept, you are given:
(i) There are 50 observations.
(ii) The sample standard deviation of x2 is 2.204.
(iii) If x2 is regressed on x1, the standard error of the regression is 1.284.
Determine the VIF of x2.
(A) 2.5 (B) 3.0 (C) 3.5 (D) 4.0 (E) 4.5
17. House sales is the continuous response variable of a normal model that is based on 11 observations. Explanatory
variables are:
• Interest rates
• Unemployment rate
Model I uses only interest rates as an explanatory variable, while Model II uses both interest rates and unem-
ployment rates. Both models have an intercept. The results of the models are:
Model I Model IT
Source of Variation Sum of Squares Source of Variation Sum of Squares
Regression 14,429 Regression 17,347
Error 12,204 Error 9,286
18. An insurance company is studying its field force, the agents that sell its products. There are many factors that
may characterize agents: age, sex, number of years at the company, number of years of experience as an agent,
annual production, region, etc. The insurance company would like to summarize all of these characteristics into a
small number of variables.
Determine which of the following characteristics of statistical learning that this study has.
I. Classification setting
II. Supervised learning
III. Parametric
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright Z2022 ASM
404 PRACTICE EXAM 3
19. A linear regression based on 52 observations has 5 explanatory variables plus an intercept You are given:
(i) The residual standard error is 8.25.
(ii) The studentized first residual is 0.895.
(iii) The standardized first residual is 0.823.
Calculate the residual standard error if the first observation is removed.
(A) 7.59 (B) 7.91 (C) 8.18 (D) 8.60 (E) 8.97
20. A classification tree is built based on 9 observations. There is a response and one predictor. The values of the
variables are:
X 2 5 6 8 12 16 19 25 30
Y No No Yes No Yes No Yes Yes Yes
Determine which of these splits is/are best using the Gini index as the criterion.
I. Between 6 and 8
(A) I only (B) II only (C) III only (D) I and II only (E) II and III only
21. rFor cross-validation, determine which of the following are advantages of k-fold cross-validation with k < n
over LOOCV.
I. k-fold cross-validation does not overestimate the test error rate as much as LOOCV.
III. Performing k-fold cross-validation multiple times produces the same results.
(A) I only (B) IT only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Principal component analysis results in a principal component with loadings 0.836 on the first variable and 0.549 on
the second variable.
23.
%/6 Determine which of the following are drawbacks of regression models for time series that fit trends in time.
I. Too much weight is placed on early observations.
II. Seasonal patterns cannot be incorporated.
III. Other sources of information are not considered.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page. ..
Copyright 02022 ASM
PRACTICE EXAM 3 405
n.. For a linear regression model based on 4 observations, you are given:
(i) There is one explanatory variable and an intercept.
(ii) The residuals are -2.3, 0.1, -2.1, and 4.3.
(iii) The hat matrix is
0.3 0.4 0.1 0.2
(
0.4 0.7 -0.2 0.1
0.1 0.2 0.7 0.4
0.2 0.1 0.4 0.3
(iv) 2 = 14.1
(A) 1.2 (13) 2.3 (C) 2.7 (D) 3.2 (E) 3.6
(A) 1.0 (B) 1.1 (C) 1.2 (D) 1.3 (E) 1.4
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
27. •••• A classification variable assuming the values of 1 and 2 is modeled as a function of X using K-nearest neighbors
with K = 3. The training data is
X 10 19 30 39 43 45
Y 1 1 2 2 1 2
X 15 25 37 50
Y 1 2 2 2
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright 1P2022 ASM
406 PRACTICE EXAM 3
28. 4411 A binomial generalized linear model for the probability of a claim has an intercept and the following variables:
(i) Deductible: can be 250, 500, or 1000.
(ii) Gender: can be male or female.
(iii) Age group: there are 5 age groups.
There are 85 observations.
The AIC for the best fit is 261.53.
Calculate the BIC.
(A) 277 (B) 279 (C) 281 (D) 283 (E) 285
Exponential smoothing with smoothing parameter w = 0.4 is performed on this series. The resulting value of g5
is 108.78.
Determine gi.
(A) 102 (B) 104 (C) 106 (D) 108 (E) 110
30. Determine which of the following statements regarding K-means clustering is/are true.
I. Clusters are selected to maximize the sum of the distances between points of different clusters.
II. A simple algorithm finds a local optimum.
III. The number of clusters must be specified in advance.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (0).
31. P A generalized linear model uses a gamma distribution. The model is based on 8 observations. The results of
the model are
1 5 3
2 7 15
3 8 10
4 11 5
5 13 18
6 15 10
7 17 20
8 20 18
Exam SRM Study Manual Exam questions continue on the next page
Copyright ©2022 ASM
PRACTICE EXAM 3 407
The observations are to be grouped into two clusters. Initially, the first three points are grouped into one cluster
and the last three points into the other cluster.
Determine the number of points that move between clusters in the first iteration of the algorithm.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Determine Var(Y3).
(A) 9.48 (B) 9.58 (C) 9.68 (D) 9.78 (E) 9.88
Exam SRM Study Manual Exam questions continue on the next page...
Copyright ©2022 ASM
408 PRACTICE EXAM 3
35.
N? For a set of data with 40 observations, 2 predictors (Xi and X2), and one response (Y), the residual sum of
squares has been calculated for several different estimates of a linear model with no intercept. Only integer values
from 1 to 5 were considered for estimates of pi and P2.
The grid below shows the residual sum of squares for every combination of the parameter estimates, after
standardization:
1 2 3 4 5
1 2,855.0 870.3 464.4 357.2 548.6
2 1,059.1 488.4 216.3 242.8 567.9
3 657.0 220.0 81.6 241.9 700.8
4 368.4 65.1 60.5 354.5 947.1
5 193.2 23.7 152.8 580.6 1,307.0
Let:
1. N?I With regard to statistical learning, determine which of the following statements is true.
I. Increasing flexibility results in decreasing training MSE.
II. Increasing flexibility results in decreasing test bias.
III. Increasing flexibility results in decreasing test variance.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
2. In a generalized linear model, the response distribution is Poisson. You are given
(i) y =2
(A) —0.39 (B) —0.36 (C) —0.33 (D) —0.30 (E) —0.27
3. •: The first ten observations of a white noise series, yi,. ,yio, are
21 24 22 18 23 22 16 18 21 22
4. 1`)? For a simple linear regression based on 18 observations of the form Y = Pa + /31X + e, the width of a 95%
prediction interval for Y when X = lc is 30.
Let s, be the unbiased standard deviation of X.
Calculate the width of a 95% prediction interval for Y when X = 2 + sx.
(A) 30.4 (B) 30.8 (C) 31.3 (D) 31.7 (E) 32.1
Exam SRM Study Manual 409 Exam questions continue on the next page . . .
Copyright Q202.2 ASM
410 PRACTICE EXAM 4
5. el For a logistic regression model for a binary response, actual and fitted values are as follows:
yi ñi
0 0.32
0 0.47
0 0.02
0 0.09
0 0.15
1 0.58
1 0.82
1 0.73
1 0.64
1 0.98
6. .4' A linear regression model with 5 variables and an intercept is fitted to 105 observations. You are given
(i) The 21s standardized residual is 0.934.
(ii) The leverage of the 21st observation is 0.13.
Calculate Cook's distance for the 215t observation.
7. '41/ For a set of 10 observations in a regression setting, a response variable's values are 8, 7,4, 6, 10, 14, 16, 13, 17, 15.
The response variable is modeled using a regression tree with boosting. The boosting parameters are d = 1,
= 0.1, and B = 100.
At the first iteration, the tree is split into two groups: the first 5 observations and the second 5 observations.
Calculate the revised value of the first observation that is used for the second tree.
(A) 6.9 (8) 7.0 (C) 7.1 (D) 7.2 (E) 7.3
8. 1141 Hierarchical cluster analysis with centroid linkage and Euclidean distance dissimilarity is performed. The
following four clusters result:
Determine the two clusters that get fused at the next iteration of the algorithm.
(A) 1(10,10),(18,10)1 and {(18,16),(20,20)I
(B) 1(10,10),(18,10)} and 1(8,25),(8,30))
(C) R10,10),(18,10)) and (15,24)
(D) 1(18,16),(20,20)) and (15,24)
(E) 1(8,25),(8,30)) and 1(15,24))
Exam SAM Study Manual Exam questions continue on the next page . . .
Copyright 02022 ASM
PRACTICE EXAM 4 411
n. A set of 100 observations of 4 variables is analyzed using principal components analysis. The loadings of the
ifrst three variables on the first principal component are 0.68, 0.65, and 0.32. The fourth loading is negative.
Calculate the first principal component score of the observation (2, —1,3,5).
(A) 0.3 (B) 0.7 (C) 1.1 (D) 1.3 (E) 1.7
10. 'kir A linear regression of the form yi = Po + flux11 +i32xi2 + P3Xi3Ei is based on 35 observations.
You are given:
The residual standard deviation is 8.25.
(ii) se(bi) = 0.86.
(Hi) The sample standard deviation of x1 is 2.70.
Calculate the VIF of xi.
(A) 1.2 (B) 1.6 (C) 2.0 (D) 2.3 (E) 2.7
The observations are to be grouped into two clusters. Initially, the first three points are grouped into one cluster
and the last two points into the other cluster.
Calculate the initial value of the objective function that is minimized by the clustering algorithm.
(A) 87 (B) 113 (C) 142 (D) 175 (E) 236
l
f
Exam SRM Study Manual Exam questions continue on the next page .. .
Copyright (g2022 ASM
412 PRACTICE EXAM 4
12. 111 You are given the following biplot for a principal components analysis of sales of life insurance, health insurance,
dental insurance, and disability insurance by agents.
—0.5 0 0.5
—3 —2 —1 0 1 2 3
You are given the following possible inferences from the biplot:
I. Life, health, and dental insurance sales are correlated.
II. Bob did not sell a lot of life insurance.
(A) I only
(B) II only
(C) III only
(D) I, II, and III
(E) The answer is not given by (A), (B), (C), or (D)
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright ©2022 ASM
PRACTICE EXAM 4 413
14.
'41/1 The relationship between type of claim on an auto insurance policy and type of vehicle is modeled using a
cumulative proportional odds model.
Type of claim is an ordinal variable with the following categories:
1. Property damage only
2. Bodily injury, no fatality
3. Fatality
Type of car is a categorical variable with values coupe, sedan, SUV, and van.
Based on this model:
l
f (i) The probability of claim type 1 for a sedan is 0.21.
(ii) The probability of claim type 2 for a sedan is 0.06.
(iii) The probability of claim type 1 for an SUV is 0.28.
Determine the probability of claim type 2 for an SUV.
(A) 0.071 (B) 0.074 (C) 0.077 (D) 0.080 (E) 0.083
15. s: A generalized linear model uses a complementary log-log link. The form of the model is
g(n) = Po + Pixi
The estimated values are bo .= 0.6, bi = 0.4.
Calculate the estimated mean when xi = 0.8.
(A) 0.92 (B) 0.93 (C) 0.94 (D) 0.95 (E) 0.96
(A) =
(B) g ( p ) =
(C) g(p) =ph3 (D) = Al2/5 (E) g(p)= 111/2
17. stt° At a node in a classification tree, there are 30 observations of "Yes" and 10 observations of "No". This node is
split into two groups, one with 20 observations of "Yes" and 4 observations of "No" and the other group with the
remainder of the observations.
Calculate the reduction in the overall Gird index from this split.
(A) 0.02 (B) 0.04 (C) 0.06 (D) 0.08 (E) 0.10
Exam SRM Study Manual Exam questions continue on the next page. ..
Copyright 02022 ASM
414 PRACTICE EXAM 4
19.
sli. A generalized linear model for a count variable is based on 8 observations. The response distribution is Poisson.
The observed values are
0 0 0 0 1 1 2 4
Exam SRM Study Manual Exam questions continue on the next page . ..
Copyright ©2022 ASM
PRACTICE EXAM 4 415
10 4 15
12 2 14
12 10 19
12 16 24
15 6 8
15 17 30
17 4 27
17 13 15
19 15 30
20 3 18
20 10 24
12
00 13 16 25x1
21. •111 Determine which of the following statements regarding dendrograms is/are true.
I. The number of clusters is determined by the height of the split.
II. The closeness of observations is determined by their horizontal distance.
III. Observations that fuse at the bottom of the tree are similar.
(A) None (B) 1 and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page .. .
Copyright 02022 ASM
416 PRACTICE EXAM 4
L
22.
A real-valued variable Y is modeled as a function of two variables, X1 and X2. You are given the following
observations:
X1 X2
2 4 23
3 2 32
3 5 36
4 1 27
4 4 39
4 6 44
5 3 28
5 6 50
6 5 41
(A) 0.63 (B) 0.64 (C) 0.65 (D) 0.66 (E) 0.67
Exam SRM Study Manual Exam questions continue on the next page .
Copyright ©2022 ASM
PRACTICE EXAM 4 417
Parameter
Intercept —2.05
Gender—Male 0.40
Age group
Under 25 1.55
Over 65 0.25
Vehicle body
Coupe —0.07
SUV 0.32
Calculate the odds of a claim from a 34-year old male driving an SUV.
(A) 0.25 (B) 0.30 (C) 0.35 (D) 0.40 (E) 0.45
25. For a generalized linear model, which of the following would make it more likely that the model is accepted?
I. Higher AIC.
II. Higher BIC.
III. Higher deviance.
IV. Higher Pearson chi-square statistic.
(A) None (B) I only (C) II only (D) III only (E) IV only
26. .11 Determine which of the following statements regarding K-means clustering is/are true.
I. K-means clustering is robust to perturbations in the data.
II. At each iteration of the algorithm, Ki7 distances are calculated, where n is the number of observations.
III. The algorithm may be used for distance measures such as correlation.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SAM Study Manual Exam questions continue on the next page...
Copyright 02022 ASM
418 PRACTICE EXAM 4
28. A time series with 15 terms has been double exponentially smoothed. You are given
1. The smoothing parameter w = 0.3.
2. S'15 = 72.
(2)
3.
s15 = 56'
Calculate the forecast of yi8.
(A) 136 (B) 152 (C) 168 (D) 184 (E) 200
29. •-ir A zero-inflated model consists of a Poisson distribution with mean A and a 0 component. The Poisson component
has a weight of 0.8.
Determine which of the following statements is true for this model.
I. The probability of 0 is 0.2.
II. The mean of the distribution is 0.8A.
(A) I and H only (B) I and III only (C) II and III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
30. NI' Linear regression models are fitted based on 15 observations. Four predictors are considered. Subset selection
is used to select the most significant variables. The best models for each number of predictors are
Determine the number of predictors in the model selected using adjusted R2.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Exam SRM Study Manual Exam questions continue on the next page...
Copyright ©2022 ASM
PRACTICE EXAM 4 419
Determine which of the following statements about random forests is/are true.
I. Overfitting does not result no matter how high B is.
II. A fixed number of predictors is considered at every split.
III. Random forests reduce variance.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
32. *-11 The random variable X has four possible values: A, B, C, and D. The random variable Y has three possible
values: 1, 2, and 3. The probabilities of these values are as follows:
Pr(Y=yIX= x)
x Pr(X = x) 1 2 3
A 0.40 0.45 0.25 0.30
B 0.30 0.35 0.40 0.25
C 0.20 0.30 0.35 0.35
D 0.10 0.20 0.50 0.30
yT = 4
Calculate the 4-step ahead forecast, 9T+4.
(A) 1.18 (B) 1.40 (C) 1.62 (D) 1.85 (E) 2.02
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright ©2022 ASM
420 PRACTICE EXAM 4
34. •-ir For a generalized linear model for claim frequency, you are given:
Response variable: Claim frequency
Response distribution: Negative binomial, overdispersion parameter = 1.3
Link: log
Parameter df b
Intercept 1 0.205
Territory 2
A 0.000
0.132
0.198
Number of Claims,
Previous Year 2
0 0.000
1 0.405
2+ 1.101
Calculate the variance of the number of claims by an insured in Territory B who submitted one claim in the
previous year.
(A) 2.1 (B) 2.3 (C) 2.5 (D) 2.7 (E) 2.9
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
(A) I only (B) H only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) ,(C) , or (D) .
2.
•41 An actuarial department is estimating the probability that students will end up in one of the following areas:
1. Pricing
2. Financial Reporting
3. Investments
A nominal logistic model is constructed. Explanatory variables are major in college and score on Exam SRM.
Pricing is the reference level. The estimated parameters are:
College Major
Math 1.2 0.8
Economics 0.2 1.8
Other 0.7 0.4
Calculate the probability that a math major who scored 8 on Exam SRM will end up in Investments.
(A) 0.40 (B) 0.45 (C) 0.50 (D) 0.55 (E) 0.60
3. '11 You are given the first ten observations of a random walk, • , Yio:
15 18 23 25 24 28 31 32 37 40
Exam SRM Study Manual 421 Exam questions continue on tire next page .. .
Copyright 02022 ASM
422 PRACTICE EXAM 5
4. 11: The grade on Exam SRM is modeled as an ordinal random variable with four categories: Fail, 6, 7, 8+. The
explanatory variable is hours of study (x) and the proportional cumulative odds model is used. The fitted parameters
are
6. *-1? Determine which of the following is/are differences between K-means clustering and hierarchical clustering.
I. One method forces every observation into a cluster; the other method allows for outliers.
One method uses of within-cluster similarity; the other method uses between-cluster dissimilarity.
One method requires an initial assignment of clusters; the other method does not.
(A) I only (B) H only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
7.
%I° The variable Y is modeled as a function of X1 and X2 using K-nearest neighbors regression with K = 2. The
model is based on the following three pairs of observations:
(A) 87.0 (B) 91.0 (C) 95.0 (D) 99.5 (E) 104.5
Exam SRM Study Manual Exam questions continue on the next page. ..
Copyright ©2022 ASM
PRACTICE EXAM 5 423
n.. For a binomial generalized linear model based on 45 observations, the explanatory variables are
Time in system (continuous)
Time in system squared
Sex (male or female)
Department (4 levels)
Interaction of sex and department
he
T model has an intercept.
The Ioglikelihood of the minimal model is —182 and the max-scaled R2 is 0.361774.
Calculate the AIC.
9.
qi Observations of 3 variables are studied using principal component analysis. The loading matrix of 4;19ii is
0.732 0.307 0.609
( 0.437
—0.523
0.475
0.825
—0.764
0.213 )
where Oji is the loading of xi on the ith principal component.
The scores of the three components on the first observation are, in order, 1.220, 0.002, —1.279.
Calculate an approximation of the first component of the first observation, xii.
(A) 0.11 (B) 0.22 (C) 0.25 (D) 0.58 (E) 1.56
10. A response variable has 2 possible values, A and B. A node of a classification tree has 100 observations with
65 As and 35 Bs. It is split into two groups. The first group has 60 As and 15 Bs and the second group has the
remaining observations.
Calculate the decrease in cross-entropy resulting from the split.
(A) 0.06 (B) 0.09 (C) 0.12 (D) 0.15 (E) 0.18
11. 4' A binomial generalized linear model uses a probit link. The form of the model is
g(n) = Po + Fix].
The estimated values are 1)0 = 1, b1 = —0.1.
Calculate the estimated mean when x1 = 4.
(A) 0.1 (B) 0.3 (C) 0.5 (D) 0.7 (E) 0.9
Exam SRM Study Manual Exam questions continue on the next page . .
Copyright 02022 ASM
424 PRACTICE EXAM 5
12. f To validate a time series model based on 20 observations, the first 16 observations were used as a model )
development subset and the remaining 4 observations were used as a validation subset. The actual and fitted values
for those 4 observations are
Yr Pi
17 8 9
18 12 15
19 14 18
20 22 20
(A) 14.8 (B) 15.8 (C) 16.8 (D) 17.8 (E) 18.8
13.
sir The time series yi follows an AR(1) process of the form
yt =0.6y1-1+ Et
Var(si ) = 100
Calculate the variance of a 3-step ahead forecast.
(A) 136 (B) 149 (C) 153 (D) 160 (E) 196
14.
A generalized linear model for drivers is based on 5 observations. Drivers are in Class 0 or Class 1.The response
variable is Bernoulli. Actual and fitted values are
Actual class 0 0 0 1 1
(A) 0.22 (B) 0.23 (C) 0.24 (D) 0.25 (E) 0.26
Exam SRM Study Manual Exam questions continue on the next page „ .
Copyright t2022 ASM
PRACTICE EXAM 5 425
You are performing a K-means cluster analysis on a set of data. The data has been initialized with 3 clusters as
follows:
A single iteration of the algorithm is performed using squared Euclidean distance between points.
Calculate the number of data points that move from one cluster to another.
(A) 4 (B) 5 (C) 6 (D) 7 (E) 8
16. •••• Principal component analysis is applied to a set of observations of 2 variables. The score of the observation
(4,3) on the first principal component is 4.3077. The loading of the first principal component on the first variable is
greater than 0 and less than 0.5.
Calculate the loading of the first principal component on the first variable.
(A) 0.32 (B) 0.34 (C) 0.36 (D) 0.38 (5) 0.40
17.
•41 Determine which of the following statements about bagging is/are true.
I. In out-of-bag validation, approximately B/3 predictions are made for each observation.
II. The test MSE for out-of-bag validation is U shaped as a function of B.
III. For B sufficiently large, out-of-bag error is virtually equivalent to leave-one-out-cross-validation error.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) ,(C) , or (D) .
EXAM SRM Study Manual Exam questions continue on the next page...
Copyright C2022 ASM
426 PRACTICE EXAM 5
19. rIn a linear model with an intercept, variables are selected using subset selection. The model is based on 100
observations. The total sum of squares is 4184. The RSS when all 15 predictors are included in the model is 252.
Mallow's Cr, is used to select the best model among models with different numbers of predictors.
The RSS for the best model with 5 predictors is 474.
The RSS for the best model with 6 predictors is less than x.
The model with 6 predictors is selected.
Determine the highest possible value for x.
(A) 422 (B) 430 (C) 442 (D) 454 (E) 468
20. 44/ For a linear regression with 2 variables and an intercept based on 5 observations, you are given
i Leverage Residual
1 0.8831 0.7737
2 0.3534 —1.8664
3 0.5246 0.9216
4 0.3670 0.5549
5 0.8738 —0.3838
21. i With respect to shrinkage methods, which of the following statements are true?
I. The Bayesian interpretation of the lasso is to assign the double-exponential distribution as a prior for the
coefficients pi. ._)
Ridge regression may be used for feature selection.
For all shrinkage methods, variance is a non-increasing function of the tuning parameter.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
22.
"41 For a Poisson regression, you are given the following actual and fitted values:
k 1 2 3 4 5
ilk 1 2 3 3 2
0.7 1.6 2.5 3.5 2.5
Determine the observation with the highest deviance residual in absolute value.
(A) k=1 (B) k = 2 (C) k= 3 (D) k = 4 (E) k = 5
23.
•-11 For a regression model of the form y o + Pix + e based on 89 observations, you are given:
(i) I = 85.19
(ii) s, = 9.02
(iii) x35 = 92.03
Calculate the leverage of x35.
(A) 0.015 (B) 0.016 (C) 0.017 (D) 0.018 (E) 0.019
Exam SRM Study Manual Exam questions continue on the next page. ..
Copyright ©2022 ASM
PRACTICE EXAM 5 427
4: A linear regression model based on 5 observations is of the form yi = po+ pixii+132x,2+ Ei. The values of the
variables are:
xi Xi2 Yi
1 3 24 60
2 5 29 64
3 8 34 70
4 9 38 79
5 10 50 87
=7 = 35 = 72
(A) 5.2 (B) 5.7 (C) 6.4 (D) 6.8 (E) 7.3
(A) 0.46 (8) 0.56 (C) 0.66 (D) 0.76 (E) 0.86
(A) 0.09 (B) 0.20 (C) 0.31 (D) 0.42 (E) 0.53
Exam SRM Study Manual Exam questions continue on the next page...
Copyright (D2022 ASM
428 PRACTICE EXAM 5
(i i) (X/X)-1 = (-0.029
-0.065
0.010
0.014
-0.014 .
0.035
29. •-i? Determine which of the following statements regarding bagging is/are true.
I. Each training data set used in bagging has n components.
II. Choosing B too high can result in overfitting.
III. Bagging is not useful for classification settings.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) ,or (D) .
30. a: A least squares model with a large number of predictors is fitted to 92 observations. To reduce the number of
predictors, forward stepwise selection is performed.
For a model with k predictors, RSS = ck.
The estimated variance of the error of the fit is 52 = 25.
Determine the value of cd - cd+i for which you would be indifferent between the d + 1-predictor model and the
d-predictor model based on BIC.
(A) 108 (B) 113 (C) 118 (D) 123 (E) 128
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright ©2022 ASM
PRACTICE EXAM 5 429
Li•near regression is performed based on 5 observations. The regression is based on 2 predictors and an intercept.
To reduce the dimension, principal component regression is performed. Only the first principal component is used.
You are given
i xii xi2 yi
1 —1 0 —4
2 —2 4 —2
3 —1 1 0
4 1 —3 2
5 3 —2 4
The loadings of the principal component are —0.6 on xi, 0.8 on x2.
Calculate the fitted coefficient of zi, the first principal component, in the regression.
(A) —0.73 (B) —0.62 (C) —0.51 (D) —0.42 (E) —0.31
32. rA regression model for auto collision claim costs is of the form y =10+ 13ixi +132x2 + e, where xi is the CPI and
x2 is an index of the cost of gasoline. The model is based on the following 5 observations:
To test whether the two explanatory variables are collinear, the VIP is calculated. You are given:
(i) S1 = 0.0096625
(ii) 42= 0.0209767
(iii) Z(x1i — R1)(xi2— )= 0.029492
Calculate the VIP of xi.
(A) 1.4 (B) 1.6 (C) 1.8 (D) 2.0 (E) 2.2
33. %I. You are given the following six observations of a single variable:
21, 30, 40, 51, 63, x
They are analyzed using hierarchical cluster analysis with average linkage and Euclidean distance. After two fusions,
the clusters are 121,301, 140,511, (631, and fx}.
Determine the highest integer value of x for which the next fusion would be {63, x}.
(A) 76 (B) 78 (C) 80 (D) 82 (E) 84
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright Q2022 ASM
430 PRACTICE EXAM 5
34. a: The Holt-Winters model is used to forecast a time series. The last available value of the series is for period 20.
You are given:
y19 = 185 and y20 = 199
b0,19 190 and b1,19 = 2
wi = 0.6 and zv2 = 0.7.
The series is not seasonal; zv3 = 0.
35. Determine which of the following statements regarding hierarchical clustering is/are true.
I. If two different people are given the same data and perform one iteration of the algorithm, their results at that
point will be the same, regardless of the linkage they used.
At each iteration of the algorithm, the number of clusters will be greater than the number of clusters in the
previous iteration of the algorithm.
The algorithm needs to be run only once, regardless of how many clusters are ultimately decided to use.
(A) None (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
1. "Ir You are given the following results for the regression model y = po flix E:
Source of Variation Degrees of Freedom Sum of Squares
Regression 1 5,012
Error 9 4,296
You are also given that for the explanatory variable x, Z(x1 —5-02 = 120.
Determine the length of the symmetric 95% confidence interval for 131.
(A) 9.0 (B) 9.1 (C) 9.2 (D) 9.3 (E) 9.4
2. *41' A logistic regression model is used to model the color of a car as a function of the age of the owner (x2). The
color white is the base level. Other colors are black, gray, red, and blue. The fitted coefficients for the model are
Color b1 b2
Black 0.33 —0.009
3. rY is a classification variable that may be 0 or 1. X is a random variable uniformly distributed on [0,1]. You are
given
Pr(Y = 1 I X = x)= x
4. A classification tree is used for a variable with two classes. The tree has 5 terminal nodes. The number of
observations in each class at each terminal node are as follows:
1 45 10 55
2 32 8 40
3 10 30 40
4 5 45 50
5 3 52 55
(A) —0.82 (5) —0.80 (C) —0.78 (D) —0.72 (E) —0.70
Exam SRM Study Manual 431 Exam questions continue on the next page
Copyright =022 AS1VI
432 PRACTICE EXAM 6
5. *--:1.1 A regression model is of the form y = po + P1x1 + f32x2 + E. It is based on 6 observations. You are given
(i) The leverages are 0.4514, 05626, 0.3111, 0.5584, 0.3732, and 0.7433.
(ii) The residuals are —1.3009,0.0923, 0.9145,0.8649, —0.0974, and —0.4734.
Calculate PRESS.
(A) 10.3 (B) 11.4 (C) 12.5 (D) 13.6 (E) 14.7
6. iS The probability of a strike in an industry, n, is modeled with logistic regression, using an economic index xi as
a predictor. In the fitted model, f31 = 0.1. When x1 = 3, the probability of a strike is 0.1.
Calculate the probability of a strike when X1 = 7.
(A) 0.12 (B) 0.13 (C) 0.14 (D) 0.15 (E) 0.16
7. For a simple linear regression y=a+bx+E based on 12 observations, you are given:
(i) The fitted model is y = 2.1 + 1.20x + E.
=25
(B) ln(1 + ce)2 (C) — ln(1 + ce)2 (D) ln(1 — C61)2 (E) — ln(1 e0)2
9. el' You are given the following statements regarding principal component analysis.
I. The sum of the loadings for each principal component must be 0.
II. The sum of the squared loadings for each principal component must be 1.
III. The sum of the scores of each principal component must be 0.
Determine which of these statements are true.
(A) I only
(B) II only
(C) III only
(D) I, II, and III
(E) The answer is not given by (A), (B), (C), or (D)
Exam SRM Study Manual Exam questions continue on the next page .. .
Copyright CO202.2 ASM
PRACTICE EXAM 6 433
House sales is the response variable of a normal model that is based on 11 observations. Explanatory variables
are:
• Interest rates
• Unemployment rate
Model I uses only interest rates as an explanatory variable, while Model II uses both interest rates and unem-
ployment rates. Both models have an intercept. The results of the models are:
Model I Model II
Source of Variation Sum of Squares Source of Variation Sum of Squares
Regression 14,429 Regression 17,347
Error 12,204 Error 9,286
11. "-st. Determine which of the following statements regarding bagging is/are true.
I. It is difficult to interpret the model arising from bagging.
II. At each split, all available predictors are considered.
HI. Cost complexity pruning is used to reduce the variance of the trees.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
12. Y is a real-valued variable. It is modeled as a function of Xi and X2 using KNN regression, with K = 2. You are
given the following training data:
X1 10 12 12 15
X2 13 11 16 13
Y 2 5 10 6
(A) 5.2 (B) 5.4 (C) 5.6 (D) 5.8 (E) 6.0
13. al? Determine which of the following statements regarding hierarchical clustering of 11 observations is/are true.
I. At each iteration of the algorithm, the number of clusters is reduced by 1.
II. At iteration i of the algorithm, a comparison of (n i + 1)(n — i)/2 dissimilarities determines which clusters are
fused.
(A) None (B) I and II only (C) I and III only (D) II and III only
rTh- (E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page. ..
Copyright ©2022 ASM
434 PRACTICE EXAM 6
14. I': Determine which of the following methods may be used to model stochastic seasonal effects in time series.
I. Trigonometric functions.
II. Seasonal autoregression
III. Holt-Winter
(A) I and II only (B) I and III only (C) I and IV only (D) II and III only (E) II and IV only
15. al In a hierarchical clustering analysis using Euclidean distance as the dissimilarity measure, the following 3
clusters have been formed:
Determine the linkages for which 1(60,22)) is fused with 1(40,30), (40,40)1 at the next iteration.
(A) I and II only
(B) I and III only
(C) land IV only
(D) I, II, and IV only
(E) II, III, and IV only
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright ©2022 ASM
PRACTICE EXAM 6 435
. r Salaries of actuaries are modeled using a regression tree. Three variables are used:
1. X1 is the credentials of the actuary, which May be FSA, ASA, or student.
2. X2 is number of years of experience.
3. X3 is the region where the actuary works, which may be E, W, N, or S.
This is the regression tree:
= FSA
X2 < 10 (0 X3 = E W
X3 = In X3 =
The total sum of squares is 9,865. After each split, the RSS is:
C) 9,075
8,302
® 7,845
CD 7,411
0 7,026
® 6,798
CD 6,502
6,398
Using the deviance-related variable importance measure, rank the importance of the three variables from highest
to lowest.
(A) Xi, X2, X3 (B) X11X3,X2 (C) X2, Xi, X3 (D) X21 X3, X1 (E) X31 Xl, X2
Exam SRM Study Manual Exam questions continue on the next page ..
Copyright 02022 ASM
436 PRACTICE EXAM 6
17. •41 You have fit a multiple regression model with k explanatory variables and an intercept to 80 observations. You
are given:
(i) Residual standard deviation is 8.
(ii) Leverage of 10th observation is 0.2.
(iii) 10th residual is 6.8.
(iv) Cook's distance D10 = 0.0376.
Determine k.
(A) 0.65 (B) 0.75 (C) 0.85 (D) 0.95 (E) 1.05
19. Sales of insurance are modeled as a function of the number of agents in the field force. Let Y be sales and let X
be the number of agents. The fitted model is
Yi = Po + ei
20. •41 You are given the time series yt = 11, 0, 2, 0, 01.
Calculate the autocorrelation of yi at lag 1.
(A) -0.49 (B) -0.42 (C) -0.35 (D) -0.28 (E) -0.21
21. al A zero-inflated model consists of a mixture of a Poisson distribution with mean 0.8 and the constant 0. The
constant 0 has a weight of 0.3.
Calculate the overdispersion of this model relative to a Poisson model.
(A) 1.06 (B) 1.12 (C) 1.18 (D) 1.24 (E) 1.30
Exam SRM Study Manual EXCI711 questions continue on the next page . . .
Copyright ©2022 ASM
PRACTICE EXAM 6 437
A generalized linear model is used to model claim size. You are given the following information about the
model:
Variable
Intercept 500
Vehicle value (000) 15
Number of violations 200
Number of accidents 1000
Gender—Male 600
Calculate the standard deviation of claim size for a vehicle with value 30,000 belonging to a male with 1 violation
and no accidents.
(A) 1800 (B) 2300 (C) 2800 (D) 3300 (E) 3800
23. ikir For a linear regression with k variables and an intercept:
(i) There are 30 observations.
Dyi - p)2 = 8,500
(iii) The residual sum of squares is 1,825.
(iv) The F ratio for the regression is 22.860.
Determine k.
24. '1•4r A hurdle model with a base Poisson count has 2 predictors. When x1 = 2 and x2 = 3, the Poisson parameter is
0.8 and overdispersion is 1.2.
Calculate the probability of 0 given that x1 = 2 and x2 = 3-
(A) 0.59 (B) 0.63 (C) 0.67 (D) 0.71 (E) 0.75
(A) 32.3 (B) 35.3 (C) 38.3 (D) 41.3 (E) 44.3
26. •41 With regard to shrinkage methods, determine which of the following statements is true.
I. The penalty function in ridge regression is a function of the £2 norm of 13.
IL It is best to standardize predictors when using ridge regression or the lasso.
IlL For both ridge regression and the lasso, the higher the budget parameter, the higher the variance.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
Exam SRM Study Manual Exam questions continue on the next page...
Copyright ©2022 ASM
438 PRACTICE EXAM 6
27.
Determine which of the following characteristics distinguish a linear trend in time from a random walk.
I. In a linear trend in time, consecutive terms are correlated but in a random walk they are not.
II. A linear trend in time has stationary variance but a random walk doesn't.
III. The differences of a linear trend in time are not stationary in the mean but the differences of a random walk are
stationary in the mean.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) (C) , or (U).
28.
N? A principal components analysis is performed on the following 4 observations of 3 variables:
(1,0.4, —0.6) (0,0.4, —0.1) (-0.5, —0.4,0) (-0.5, —0.4,0.7)
The scores of the four observations on the second principal component are 0.0016, 0.1074, —0.3432, and 0.2342.
Calculate the proportion of variance explained by the second principal component.
(A) 0.04 (B) 0.05 (C) 0.06 (D) 0.07 (E) 0.08
29.
ai? For a set of 6 observations, you perform the following two regressions:
I. yi = 130 +131.xii.
II. Xi = yo + E;
Residuals from these regressions are:
Observation Residual from Residual from
Number Regression I Regression II
1 2.022 —0.455
2 —2.587 —0.364
3 —2.391 —5.545
4 0.196 2.545
5 4.391 1.364
6 —1.630 2.455
The residual standard deviation is 3.102 for Regression I and 3.371 for Regression II.
Calculate the partial correlation coefficient of y and x2.
(A) 0.342 (13) 0.360 (C) 0.377 (D) 0.395 (E) 0.412
30.
Niel For a time series with 10 terms, you are given
yi =149
ZPIy2= 2287
V°2 yip-1 = 1994
= 20
Yio = 11
The series is fitted to an AR(1) model, yi = Po + pi yi_i
Calculate the fitted value of 13i.
(A) 0.26 (B) 0.28 (C) 0.30 (D) 0.32 (E) 0.34
Exam SRM Study Manual Exam questions continue on the next page . .
Copyright ©2022 ASM
PRACTICE EXAM 6 439
NI° You are given 4 potential explanatory variables for a linear model, xi, x2, x3, and x4. The model is based on 32
observations. When x4 is regressed with an intercept against the other three variables, the F ratio for the model is
2.532.
(A) 1.17 (B) 1.27 (C) 1.37 (D) 1.47 (E) 1.57
33.
The variable Y is modeled as function of X using a regression tree. You are given the following observations:
X 1 2 3 4 5 6
Y 10 12 17 21 22 24
Determine the first split of the feature space into two regions.
(A) Between 1 and 2
(B) Between 2 and 3
34. ••.? For a linear regression with 5 observations, you are given
y 9i
10 0.77 0.5403
9 10.78 0.3289
15 24.13 0.2036
30 44.16 0.3512
70 54.17 0.5760
Exam SRM Study Manual Exam questions continue on the next page „ .
Copyright ©2022 ASM
440 PRACTICE EXAM 6
35.
%. Determine which of the following are reasons to use parametric approaches.
I. Parametric approaches are more flexible than non-parametric approaches.
Parametric approaches require fewer observations than non-parametric approaches.
Parametric approaches are easier to interpret than non-parametric approaches.
(A) I only (B) II only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D)
Practice Exam 1
1. [Lesson 11 Classification setting—the company is choosing a class. Supervised—there is something being predicted.
But decision trees are not parametric. (C)
2. [Lesson 12] In logistic regression, g(p) is the logarithm of the odds, so we must exponentiate f3 to obtain odds ratio.
e-0.69.5 0.4991 (E)
3. [Section 11.1] The square of the coefficient of variation is the variance divided by the square of the mean. If it is
constant, then variance is proportional to mean squared. This is true for a gamma distribution. (C)
4. [Section 11.11 Area A is the base level, so nothing is added to g(p) for it.
g(p) = 0.00279 - 0.001 + 25(-0.00007) = 0.00004
1
7 = 0.00004
1
158.11 (B)
/-1- V0.00004
5. [Section 14.2] A cubic polynomial adds 3 parameters. The 99th percentile of chi-square at 3 degrees of freedom is
11.345. Twice the difference in loglikelihoods must exceed 11.345, so the loglikelihood must increase by 5.67. Then
-361.24 + 5.67 = -355.57 (B)
6. [Section 17.1] The loading of the first principal component on the second variable is 1/1 - 0.62 = 0.8. We are given
-0.6(0.4) + 0.8x2 = 0.12
7. [Lesson 11
I. The lasso is more restrictive than linear regression.X
Flexible approaches may not lead to more accurate predictions due to overfitting.X
This sentence is lifted from An Introduction to Statistical Learning page 35./
(C)
8. [Lesson 14] USE has 3 levels, so Modern has 2 parameters fewer than Model I. Thus the AIC penalty on Model II is
4 less than for Model I. The AIC for Model I is 3.80 less than for Model II, but before the penalty, twice the negative
loglikelihood of Model I is 7.80 less than for Model IL The critical values for chi-square with 2 degrees of freedom
are 7.378 at 2.5% and 9.210 at 1%, making (C) the correct answer choice.
9. [Section 18.2] We have to calculate all 6 distances between points and average them.
= 0E.815) C0):1855)
E1—
= 4.1(0.852) 19.74 83
0.15
=
4.4439 (B)
11.
[Section 7.2] We will use the definition of Mallow's C1, from An Introduction to Statistical Learning, but you would
get the same result using the definition in Regression Modeling with Actuarial and Financial Applications.
C n (RSS + 2d62), and we can ignore 1/n. So we want
This implies
Cd Cd+1 = 2(40) = 80 (E)
12.
[Section 16.1] We weight the cross-entropies for the two groups with the proportions of observations in each group,
0.6 and 0.4
2 2 1 1 1 1 1 1 5 5 1 1
D
—0.6 kit-15 gln gln -g) — 0.4 kin •§ + gln •
0.88064 (E)
[Section 16.11 Higher a means more tree pruning and fewer nodes. That will increase the MSE on the training data
and raise bias on the test data. I TI is the number of terminal nodes, which decreases. (B)
21. [Section 7.1, Lessons 16, and Section 18.2] II and III are greedy in that they select the best choice at each step and
don't consider later steps. While hierarchical clustering selects the least dissimilar cells at each iteration, there is no
("--"` particular measure that would indicate whether a better clustering is possible with a different choice, so it is not
considered greedy. (E)
22. [Section 16.21 I and III are true. The opposite of II is true: a low shrinkage parameter leads to selecting a higher B
since less is learned at each iteration, so more time is needed to learn (C)
23.
[Section 19.6] MSE is the mean square error, with division by 5 rather than 4, since the fit is not a function of the
validation subset. The residuals are -3, -3, -2, 2, 4
32 + 32 +22 +22 + 42
MSE 8.4 (B)
5
24.
[Subsection 13.3.2] k is the quotient (1 - n)/(1 g(0)), where it is the probability of 0 (0.3 here) and g(0) is the
Poisson probability of 0, which is e-0.6 here. The probabilityof 1 is
1
- ( -C103.6)0.6e-°-6
1
0.510875 (E)
25. [Section 14.5] The deviance is twice the excess of the loglikelihood of the saturated model, 'max, over the loglikelihood
of the model under consideration, /(b), so
26.
[Section 13.11 In a Poisson regression with a log link, the ratio of expected values is the exponential of the difference
of the xs. Here, that is el:P.1360-2) = 1.1972 . (C)
28.
[Lesson 21 For each explanatory variable there is a base level. There are 3 non-base occupational classes and 2
non-base health classes. Thus there are 3 x 2 = ]interaction parameters. (A)
29. [Section 8.1] Let v be the vector.
1lviii = 5 + 3 + 8 + 2 + 4 = 22
11v112 = ,52 +32 + 82 +22 + 42 ,10.8628
0.24833 (C)
x211 = 182
x312 = 0.2(138) + 0.8(182) = 173.2
x413 = 0.2(150) + 0.8(173.2) = 168.56
x514 = 0.2(192) + 0.8(168.56) = 173.248
The sum of squared errors is (-44)2 + (-23.2)2 + 23.442 + (-3.752)2 = 3037.751 (C)
32. [Section 16.11 I is true. But classification error is preferred for pruning tree since that is the measure of predictive
accuracy. And the predicted values of two terminal nodes coming out of a split may be the same, due to different
levels of node purity. (A)
[Section 18.2]
rag., I. There is an inversion; the split between {4) and (5,6,7) is at a lower level than the split between (5) and (6,7), and
of the four linkages we studied, only centroid has inversions./
II. All we know is that when the clusters were (1), (2), (3), (4), and (5,6,7), (4) was fused with (5,6,7). So (4) is closer
to the centroid of (5,6,7) than (3) is, and (4) is closer to the centroid of (5,6,7) then it is to (3). None of these
imply II. X
III. All we know is that (3) is closer to the centroid of (4,5,6,7) than (1) is to (2), since it was fused first. That doesn't
imply III. X
(A)
Practice Exam 2
1. [Lesson 1] Ridge regression is parametric, but K-nearest neighbors regression depends on each point of data.
Principal components is not a supervised method so it is not parametric. (A)
Error SS
2. [Section 3.2] R2=1
Total SS
Error SS
0.625 = 1
8016
Error SS = 3006
There are 4 parameters, so there are 65 — 4 = 61 degrees of freedom. The RSE is V3006/61 = 7.0199 (C)
3. [Lesson 191 I and III are true. Causal models can include polynomials and other functions, so Ills false. (E)
4. [Lesson 2] We will use bo = g biX. First we calculate 17. From the formula for b1, we have
3.2465 =
72,559 — 20(9)(31.5)
23,720 — 20(31.52)
3875
Then
72,559 — 3.2465(3875)
= 95.2045
20(31.5)
So bo = 95.2045 — 3.2465(31.5) = —7.0603 (A)
5. [Lesson 1] The study is a classification setting: either someone gets cancer or doesn't. It is supervised; there is a
response variable. Logistic regression is a parametric approach. (D)
6. [Lesson 12] The odds of 1 are 0.20/(1 — 0.20) = 0.25. From the form of the logistic model,
p-1
In oi = 0.64 +
j=1
It follows that
02
e1.22-0.64
01
and therefore
02 e1,22-0.64,0.25‘= 0.446510
The probability of 1 or 2 is
0.446510
71 4- 7t2 = = 0.308681
1.446510
7.
[Section 21.11 First calculate moving averages at times 5, 6, 7, 8.
5 + 6 +12+ 5
s•5 = - 7
4
„
s6 = 6+12+5+10
4
= 8.25
A 12+5+10+9
S7 = = 9
4
5+10+9+15
s = 9.75
4
• 12.41 - 0.04
9. [Section 18.11 The algorithm minimizes the sum of within-cluster squared Euclidean distance divided by cluster
size. Let's calculate that. We'll calculate distances between earlier and later points, then double at the end to take
care of distances between later and earlier points.
(5,15)-(6,12): 12 + 32 = 10 (5,15)-(6,20): 12 + 52 = 26 (6,12)-(6,20): 82 = 64
(9,12)-(11,4): 22 + 82 = 68 (9,12)-(11,16): 22 + 42 = 20 (11,4)-(11,16): 122 = 144
The sum of these six numbers divided by 3 (the sizes of the two clusters) and doubled, is 2211. (C)
10. [Section 3.31 Use formula (3.12) for the standard error of b1.
50.24
se(bi) = - 1.9403
6.280VT7
The t ratio is 4.637/1.9403 = 2.3900. There are n - k - 1 = 16 degrees of freedom. 2.3900 is between the 975th
percentile of t, which is 2.1199, and the 99th percentile of t, which is 2.5835. For a two-sided test, this implies (C).
11. [Lesson 12] We add up the factors, including the intercept, and exponentiate them.
-2.521 - 0.050 + 0.155 + 30(0.007) = -2.206
e-2.206= 0.1101 (A)
13. [Section 16.1] The sum of square differences of {3,6, 8,9} from their mean is 21 and the sum of square differences
of {6,10,12,14} from their mean is 35, for a total of 56. Putting the two sets of numbers together, the sum of square
differences from their mean is 88. Thus
a 88 - 56 = 32
15. [Lesson 4] The F ratio with 35 observations, 6 parameters in the unrestricted model, and 2 constraints is
(Error SSII - Error SS)/2
F2,29 =
Error SS! /29
= 14.5
(/q - Iq1)
1- R2
= 14.5
(0.900 - 0.860)
- 0.900
= 5.8 (B)
4.167 (B)
17.
[Section 12.11 In a probit model, 77 = C10(11). Initially, ij = 0-1(0.2) = -0.84. The systematic component 77 is increased
by 0.25(1) = 0.25, to become -0.59. The new probability is 0(-0.59) = 0.2776 (C)
Exam SRM Study Manual
Copyright 02022 ASM
PRACTICE EXAM 2, SOLUTIONS TO QUESTIONS 18-28 451
[Lessons 15, 16, and 181 For K-nearest neighbor regression, lower K leads to not averaging in far away points,
lowering bias. Clustering is not a supervised method, so it doesn't make sense to talk about bias. For regression
trees, a greater number of terminal nodes allows a closer estimate, lowering bias. (A)
19.
[Section 16.21 The number of terminal nodes is 1 more than the number of splits, or d + 1. But the other two
statements are true. (D)
a2 6
20. [Lesson 20] Var(P) 9.375 (B)
p2 1 - 0.62 =
21. [Section 13.1] The observations follow a Poisson distribution, and their likelihood is maximized if the Poisson
parameter is the sample mean, or 0.6. Now, g(p) = In p= 130, and the fitted p is 0.6, so po = In 0.6 = -0.51083 (A)
22. [Section 18.2] 21 and 30 are closest and are linked first. Complete linkage looks at maximum distance, and that
means 40 is 19 away from 121,301, so it is linked to 51 instead. Similarly, the next link is {63,76}. At that point, {21,30)
and {40,51] are at distance 30 while 140,511 and {63,76} are at distance 36, so 121,30,40,511 is fused, making the answer
(D)
[Section 18.21 I and II are reversed; complete linkage is based on maximal dissimilarity and single linkage is based
on intercluster dissimilarity. But III is true. (C)
26. [Section 17.1] The statements about loadings and scores should be switched; there are p loadings and n scores
for each principal component. Statement III is true since otherwise there would be no limit to the variance of the
components. (C)
27. [Lesson 15]
I. The points closest to (4,4) are (4,4) itself, (3,5), and (5,3), and 2 of those 3, (3,5) and (4,4), have Y = 1./
The points closest to (4,2) are (4,1), (3,2), and (5,3), and 2 of those 3, (4,1) and (3,2), have Y = 0.X
The points closest to (3,6) are (3,5), (4,6), and (4,4), all having Y = 1.X (A)
28. [Section 14.31 Based on b(0) and4), the response distribution is Poisson. But if you didn't recognize it, you could
work it out:
D=2
Dian yi - In fif) - (yi - 9i))
1=i
= 2 E(Yi in Yi Yi In
1=1
9i))
30. [Section 5.3] Collinearity does not affect the residual standard error, but causes higher standard errors for coefficients
which means lower t statistics. Collinearity is indicated by a high VIF. (B)
31. [Section 5.11 = 80 - 81 = -1
-1
-0.2379 (D)
s- = 6.1001/1 - 0.525
32.
[Section 17.5] A scree plot shows the percentage of variance explained by each principal component, so it is suited
for III only. (C)
2 2
W
34. [Lesson 13] Interaction is the product of gender and marital status and thus only applies to male single.
g(p) = -0.73 + 0.07- 0.04 = -0.70
p = e-0•70 = 0.496585
For a Poisson distribution with mean 0.496585, po + pi = e—M96585(1 + 0.496585) = 0.910830. The probability of 2 or
more claims is 1 - 0.910830 = 0.0892 . (A)
Practice Exam 3
1. [Section 16.11 The fitted values for the four points of test data are 5.5, 7.2, 12.0, and 10.4 respectively. The mean
squared error is
(4— 5.5)2 + (10 — 7.2)2 + (11 — 12.0)2 + (13 — 10.4)2 4.4625 (A)
4
[Section 17.1] The first two statements are true. But principal components are affected by the scale of the variables.
(E)
3. [Section 14.21 The likelihood ratio statistic is 4865.0 —4856.6 = 8.4. AGE GROUP has 4 categories, hence 3 indicator
variables, so there are 3 degrees of freedom. The critical values for chi-square at 3 degrees of freedom are 7.815 at
5% and 9.348 at 2.5%, so (D) is correct.
4. [Section 6.21
I. The more repeated observations, the more variance. 10-fold cross-validation uses 9/10 of the observations
in its test sets and therefore repeats observations more frequently, giving it higher variance than 5-fold cross-
validation. X
5-fold cross-validation requires 5 runs while 10-fold cross-validation requires 10 runs, so the former is more
efficient. s/
For polynomial or linear regression, only one expression needs to be evaluated to calculate the LOOCV
statistic. tt
(E)
5. [Section 7.21 The deviance for a normal model is the residual sum of squares, and the unbiased sample variance is
Total SS/(n — 1). Thus
RSS/(n — p) 44/(20 —4)
Adjusted R2 = 1 Total SS/(n — 1)
—1
88
0.96875 (E)
[Lesson 201 Statement IV is false; the variance of the terms is greater than the variance of the error. Letting 62, be
the variance of the error, the variance of the terms is aL2./(1 —132). (A)
7. [Section 14.6] The variance of the observation is p3/5 = 57.33/5. So the chi-square residual is
Oi - Ei 68.5 - 57.3
0.0577 (B)
VT V57.33/5
8. [Lesson 4] Residual has 38 - 12 - 3 = 23 degrees of freedom. Mean square is 92.55/3 = 30.85 for AGE and
302.88/23 = 13.169 for residual. The F ratio is 30.85/13.169 = 2.343 , at (3,23) degrees of freedom. (A)
9. [Section 18.21 21 and 30 are closest and are linked first. Single linkage looks at minimum distance, and that means
40 is 10 away from (21,30), and since it is 11 away from 51, it is linked to {21,30}. We get a trailing cluster; 51 is linked
to 121,30,401, 63 is linked to 121,30,40,631, and we get (E).
10. [Section 8.21 I and IJI apply to both principal component regression and to partial least squares. Only II is unique
to principal component regression. (B)
11. [Section 18.21 I and II are true. In statement Ill, r should be 2-1`; there are n 1 fusions, and the two branches of
each fusion may be reversed without affecting the clusters. (B)
12. [Section 19.6 and Lesson 20] The MAE is the mean absolute error. We must compute the 4 residuals in the
validation set.
= 6 - 4.6 = 1.4
e9 = 5 5.2 = -0.2
13. [Section 16.11 We weight the Gini indices of each group 0.6 and 0.4, the proportions of the 100 observations in each
group.
0.5125 (C)
14. [Section 13.21 Use formula (13.1). There are n = 5 observations and k = 2 variables.
15. [Section 16.11 Regression trees are less robust than linear models and predictions are less accurate. (B)
16. [Section 5.31 For the regression of x2 on xi, the total sum of squares (x2 - is the square of the standard deviation
of x2, multiplied by n - 1 = 50 - 1 = 49:
The VIF is
R2 = 1
(2)
79'1355 = 0.667530
238.0232
1
VIF2 = 3.008 (B)
1 — 0.667530
17.
(12,204 — 9,286)/1
[Section 4] F1,8 7- = 2.514
9,286/8
The t statistic is the square root of the F ratio, or 1.5855. At 8 degrees of freedom, this is less than the 95th
percentile of a t distribution (1.8595), so for a 2-sided test, unemployment rate is not included in the model at 10%
significance. (E)
18. [Lesson 1] It sounds like the company is going to use principal components analysis. That is unsupervised learning,
so none of the characteristics apply. (A)
19.
[Section 5.1] The standardized residual r1 has s in the denominator and the studentized residual r has s(i), the
residual standard error with observation 1 removed, in the denominator. So
S
—ri
so.)
8.25
0.895 = (0.823)
s(1)
8.25(0.823) = 7.586 (A)
5(1) =. 0.895
[Section 16.1] As usual, when there are only two classifications, the Gini index is double the proportion of one of
the classes in the region. Since we're only comparing Gini indices, we won't bother doubling.
For the split between 6 and 8, half the Gini index of the first region is (2/3)(1/3) = 2/9. Half the Gird index of the
second region is also (2/3)(1/3) = 2/9. The weighted average is 2/9.
For the split between 8 and 12, we get
21. [Lesson 6] k-fold cross-validation has smaller training sets than LOOCV, so it will overestimate the test error rate
more than LOOCV and has higher bias. And the k folds are chosen randomly, so it may not yield the same results
if different folds are created. (E)
22. [Section 17.5] The points are centered at 0, so the variance is the just the sum of squares. The variance of the two
variables is (1/3)(12 + 02 1- (-1)2 ± 0.42 ± 0.42 + (-0.8)2) = 2.96/3. The scores of the three points are
0.836(1) + 0.549(0.4) = 1.0556
0.836(0) + 0.549(0.4) = 0.2196
0.836(-1) + 0.549(-0.8) = —1.2752
The variance of these scores is (1/3)(1.05562+0.21962+ (-1.2752)2) = 2.78865/3. The proportion of variance explained
is 2.78865/2.96 = 0.9421. (D)
23. [Lesson 19] I is true since the influential points are the ones furthest from the mean time. II is not true; seasonal
patterns may be incorporated with dummy variables for the seasons or trigonometric functions. III is true. (C)
24. [Section 5.21 The diagonal of the hat matrix sums to k 1, so we see that k = 1. Using formula (5.2),
2
r2 _ e3 _
(-2.1)2 = 1.04255
-3- 82(1 - ha,) 14.1(1 -0,7)
0.7
0.7)
26. [Lesson 1] Any of these methods may be used in a classification setting. (D)
27. [Lesson 15] For X = 15, in the training data, 10, 19, and 30 are closest, so 1 is predicted, which is correct.
For X = 25, in the training data, 19, 30, and 39 are closest, so 2 is predicted, which is correct.
For X = 37, in the training data, 30, 39, and 43 are closest, so 2 is predicted, which is correct.
For X = 50, in the training data, 39, 43, and 45 are closest, so 2 is predicted, which is correct. (A)
28. [Section 14.4] We subtract the AIC penalty and add the BIC penalty.
There are 1 + 2 + 1 + 4 = 8 parameters, since the intercept is one parameter and there are k -1 parameters for each
k-way categorical variable. So the penalty function for AIC is 16, and the penalty function for BIC is 8 In 85 = 35.54.
The BIC is 261.53 - 16 + 35.54 = 281.07 (C)
30. [Section 18.11 Clusters are selected to minimize the average squared distance of points within each cluster, not the
sum of distances, and maximizing the sum of distances between clusters is not equivalent, so I is false. But II and III
are true. (0)
31. [Section 14.1] For a gamma distribution, the variance function is p2, so the denominators of the chi-square sum are
42. The statistic is
32.
[Section 18.1] The centroid of the first cluster is ((5 + 6 + 7)/3, (15 + 11 + 10)13) = (6, 12). The centroid of the second
cluster is ((5 + 6 + 7)/3, (18 + 14+ 7)/3) = (6,13). Thus observations with second coordinate less than 12.5 go to the
first cluster and those with second coordinate greater than 12.5 go to the second cluster. (5,15) moves to the second
cluster and (7,7) moves to the first cluster. (C)
[Section 11.1] By formulas (11.2) and (11.3), the variance is the derivative of the mean times 475, or 1/62. since
p = —1/0, it follows that 1/02 = p2 (D)
34. [Section 11.1] Since Yi is a Tweedie distribution, Var(Yi) = a E[YdP.
1n7.0711 =lna+pin2
In12.9904 = ln a + p In 3
0.608195 = 0.405465p
p = 1.5
ha = In 7.0711 — 1.51n 2
a = e0.916295 2.5
35.
[Section 8.1] We are constrained by jq, +133 25. That means we may not consider (1, $2) = (4,4) nor = 5 nor
S2 = 5. Among the remaining choices, 60.5 is the lowest RSS, with g = 4 and = 3. The answer is 4/3 (C)
Practice Exam 4
1. [Lesson 11 Increasing flexibility results in increasing test variance, but the other two statements are true. (B)
2. [Section 14.61 Using formula (14.21), with negative sign since y22 — 922 <0,
2.5
(C)
3. [Section 19.31 The sample mean is :17 20.7 and the sample standard deviation (with division by 9) is 2.5408. The
appropriate t coefficient is 41025 for a two-sided interval with 9 degrees of freedom, or 2.2622. The upper bound of
the forecast interval is
1
20.7 + (2.2622)(2.5408)V1 + —
10
= 26.73 (E)
1 (x• .t)2
2ti_a/2sAl1 + +
(2t1_02s)2 (1 +
When x* =1 + sx, then (e 1)2 = s = ((x, -1)2) /(n — 1), so the squared width of a prediction interval is
(214_02s)2 + +
302 = (2t1_42s)2 +
5. [Section 14.11 Use formula (14.3). The variance of a Bernoulli distribution is rri(1 -
0.322 0.472 (1 - 0.64)2 (1 - 0.98)2
+ + 3.550 (E)
(0.32)(0.68) + (0.47)(0.53) (0.64)(0.36) (0.98)(0.02)
7. [Section 16.2] The fitted value for the first 5 observations is the average of those observations, or (8+7+4+6+10)/5 = 7.
Subtracting 7A from 8, the residual entering the next tree is 8 - 0.1(7) = 7.3 (E)
[Section 18.2] The centroids for the four clusters are, in order, (14,10), (19,18), (8,27.5), and (15,24). We calculate the
distances between these four centroids.
(14,10)-(19,18): lig
(14,10)-(8,27.5); -N/34
(14,10)-(15,24): -‘/1
(19,18)-(8,27.5): V21.1 5
(19,18)-(15,24): '1,4
(8,27.5)-(15,24): —Nff6Y5
We could have skipped the fourth calculation, which was not one of the answer choices. The fifth calculation is
the lowest, making (D) the answer.
9. [Section 17.1] The fourth loading is --V1 - 0.682 - 0.652 - 0.322 = -0.1127. The score is
0.68(2) + 0.65(-1) + 0.32(3) - 0.1127(5) = 1.107 (C)
s
VVIFi
sh;
sxjAfli f.
0.86 = 8.25
2.70A/571
2
VIFJ
(0.86(2.70)VT4) 2.693 (E)
—
8.25
11. [Section 18.11 The squared Euclidean distances in the first cluster are
(10,0)—(8,3): 13 (10,0)—(5,6): 61 (8,3)—(5,6): 18
In the second cluster, the points are 26 apart.
The objective function's value is
odds of a claim of type 2 for an SUV are 1.46296(0.369863) = 0.541096. Let T2 be the cumulative probability of type 2
for an SUV. Then
T2
= 0.541096
1 — T2
0.541096
T2 = = 0.351111
1 + 0.541096
The probability of claim type 1 is 0.28, so the probability of claim type 2 is 0.351111 — 0.28 = 0.071111 . (A)
15.
[Lesson 12] Under the complementary log-log link, g(n) ln(-1n(1 — n)), so n = 1 — exp (—ebo+b'x'). For our
parameters, that is
16.
[Section 11.1] The canonical link function is the inverse of b'(0), the mean. The variance is Ob"(0), which here is
0(1;11(0)1.5. So the derivative of b' (0) is 13'(9)1-5. Let f(6) = b' (0) and solve a differential equation forf.
df 1.5
=f
df —
de
Thus ignoring the multiplicative constant, b'(19) = 1/02, with inverse g(p)= 1/1/1.i . (B)
18.
[Section 19.2] The sample mean is 7. Use formula (19.3). The denominator of the fraction in that formula is
_ 7)2 + (4 _/) + • • • ± (16 - 7)2 = 130
The numerator is
(-5)(0) + (-3)(2) + (0)(-1) + (2)(-3) + (-1)(1) + (-3)(9) = -40
The lag 2 autocorrelation statistic is -40/130 = -0.307692. (C)
19. [Section 14.5] The minimal model has only an intercept, and for a Poisson model the likelihood is maximized at
the mean of the observations, which is A = 1. The loglikelihood of each observation ni is -A + n in A - In ni!. The
loglikelihood of all 8 observations is
-8(1) + 81n 1 - 41n 1 - 21n 1 - In 2 - ln 24 = -11.8712 (A)
20. [Section 16.1] (18,11) is in the bottom right rectangle. The average of the training values in this region is (27 + 18+
24)/3 = 23 (D)
[Section 18.2] The closeness of observations is determined by how high they fuse vertically, so II is false. I and III
are true. (C)
22. [Lesson 15] The four nearest points are (4,1), (3,2), (5,3), and (4,4). The average of the values of Y at those points is
(27 + 32 + 28 + 39)/4 = 31.5. (B)
1.875 = 7.5
(Error SSI/Total SS - Error SS2/Total SS) 7.5( - R2, )
Error SS2/T SS 1 -
24. [Lesson 12] First compute g(71), where n is the probability of a claim.
g(n) = -2.05 + 0.4 + 0.32 = -1.33
For this link, n = 1 - exp(-&(P)), so
n = 1 - exp(-e-133) = 0.232393
77 is the probability. The odds are n/(1 n) = 0.232393/(1 - 0.232393) = 0.30275. (B)
25. [Lesson 14] For all of these statistics, the lower the value, the better the model. (A)
27. [Section 3.3] The regression has n k — 1 = 21 — 1 — 1 = 19 degrees of freedom. The standard error of S'i is
V25.882/19
se(i) = = 0.2127
s, V t
The 97.5th percentile of t at 19 degrees of freedom is 2.0930. The upper bound of the confidence interval is
1.674 + 2.0930(0.2127) = 2.1192. (E)
28. [Section 21.2] Use formulas (21.5) for the coefficients of the linear expression.
30.
[Section 7.2] We have the Total SS: it is the RSS for the model with no predictors, 82.4. However, it suffices to
compute RSS/(n — d — 1) where d is the number of explanatory predictors, since that is the only term that varies in
the formula for adjusted R2.
41.3 37.5 33.8
15 — 2
— 3.177
15 — 3
= 3.125
15 — 4
= 3.073 30.3
10
_ 3 03
—
•
To make sure adjusted R2 is not negative (and therefore 0 predictors is preferred), we'll calculate adjusted R2 for
the 4-predictor model.
3.03
1 = 0.485194
82.4/14
The El predictor model is selected. (E)
31.
[Section 16.2] All three statements are true. III is true since the variance of the average of decorrelated trees is
lower than the variance of the average of correlated trees. (D)
(-----'. [Lesson 151 We calculate the error rate for each X = x and then weight them with the probabilities of X = x. For
A, we choose Y = 1 with error rate 0.55. For B, we choose Y = 2 with error rate 0.6. For C, we choose Y = 2 or Y = 3
with error rate 0.65. For D, we choose Y = 2 with error rate 0.5.
35. [Lesson 10] Overfitting a model does not result in biased estimates, but underfitting and severe censoring does
result in biased estimate. (C)
Practice Exam 5
2. [Section 12.2] We'll have to calculate the linear expression for both categories 2 and 3 so that we can get the
probability of category 1, and then use the relative odds to get the probability that we want for category 3.
ri3
In — = -1.1 + 0.8 + 0.9 = 0.6
ill
1
hi
T = = 0.247309
1 + Ca2 ± e0.6
713 = 0.247309e" = 0.450627 (B)
3. [Section 19.3] We need the standard deviation of the differences of the series, which are 3, 5, 2, -1,4, 3,1, 5,3.
\IV-)2(c1— 02
8
=1.9221
The approximate width of a 95% prediction interval is 4(1.9221)V5 = 13.3167 where V3- is the square root of the
forecast period. (17)
4.
[Section 12.3] We will calculate the cumulative odds of 6 or less, and then the probability of 6 or less. The probability
of 7 or more is the complement of that probability.
The systematic component for 6 is 1.6 - 0.01(150) = 0.1. The odds are AI = 1.105171. The probability is
1.105171/2.105171 = 0.524979. Thus the probability of 7 or more is 1 - 0.524979 = 0.475021. (E)
J
Exam SEM Study Manual
Copyright ©2022 ASM
PRACTICE EXAM 5, SOLUTIONS TO QUESTIONS 5-8 465
n. [Lesson 21 The model expresses In Yi as a linear function of Xi. Use the usual formulas for bo and b1 in terms of Xi
and ln Yi.
lnYi nfan Y
b1=
— rig2
13.697 — (13.5)(9.681)/10
0.46665
19.57— 13.52/10
9.681 13.5
1)0 = ln Y — b1X=
10 0.46665( ) = 0.33812
10
6. [Lesson 181
I. Both methods force every observation into a cluster.X
II. K-means clustering looks at within-cluster similarities while hierarchical clustering looks at between-cluster
dissimilarity./
III. K-means clustering requires an initial assignment of clusters while hierarchical clustering does not./
(E)
7.
[Section 15.31 Let x, y, and z be the values of Y at the three points (5,9), (6,15), and (8,12). The point (4,12) is-%5 1
away from (5,9), /71 away from (6,15), and 4 away from (8,12); the average of the two closest points is (x + y)/2. The
point (5,12) is 3 away from (5,9),1,/ 71 away from (6,15), and 3 away from (8,15); the average of the two closest points
is (x z)/2. And the closest points to (7,12) are (6,15) and (8,12), so Y = (y + z)/2 at that point. Solving for x, y, and
z:
x+y
= 91
2
102 + y
= 91
2
= 80
x+z
=98
2
102 + z
=98
2
z = 94
y+z —
87 (A)
2
8. [Section 14.41 There are an intercept, 2 continuous variables, 1 variable for sex, 3 for department. There are
(2— 1)(4 — 1) = 3 interaction variables. That is a total of 10 parameters.
From the max-scaled R2 statistic and the loglikelihood of the minimal model, with 1(b) being the loglikelihood
of the model,
1(b) = -171.9
9.
[Section 17.31 We use the first column of the loading matrix, the loadings of the first variable on the three principal
components.
1.220(0.732) + 0.002(0.307) - 1.279(0.609) = 0.1147 (A)
10.
[Section 16.1] The cross-entropy at the node is -0.65 ln 0.65-0.351n 0.35 = 0.64745. After the split, the cross-entropy
is
11.
[Lesson 121 With the probit link, g(n) = 43-1(71), so n =4:13(xT/3). With our parameters, that is
(I) (1 - 0.1(4)) = 4:1)(0.6) = 0.7257 (D)
12. [Section 19.61 MAPE is mean absolute percentage error. The residuals are -1, -3 -42.
1
MAPE =
100 (
4 8
+
12
+
14
+
2) =
-
22
18.79 (E)
C: (7,3.5)
The closest points are (* indicates moved)
A: (10,7), (6,6)*, (8,4)*
B: (3,5)*, (2,6), (2,8), (4,3)
C: (6,2)*, (9,2)*, (8,1)*, (5,1), (9,3)
(C)
[Section 17.11 Let 0 be the loading we are solving for. The other loading is -\/i p2. We are given
4.3077 = 4c + 3J1 -
Let's solve for 0.
17. [Section 16.21 I and III are true, but not 11. As B grows larger, the test error settles down and becomes flat. (C)
18. [Subsection 13.3.31 By equations (13.8) and (13.9),
E[y] = 0.2
Var(N) = 0.2 + 0.22(0.6) = 0.224
Dividing the mean into the variance, we get an overdispersion factor of 0.224/0.2 = 1.12 (B)
19.
[Section 7.2] The answer does not depend on which formula you use for Cp, since they're equivalent. Using the
formula in Regression Modeling with Actuarial and Financial Applications:
adding 1 to p increases Cp by 2, so Error SS,/s2 must decrease by 2, or the RSS must decrease by 2s2. s 2 is based on
using all predictors. Since s 2 = RSS/(n k - 1), we have
s2 = 252/(100 - 15 - 1) = 3
So the RSS must decrease by 6. The highest possible value for x is 468. (E)
) (1 - 0.8738)\2
0.7737
\ I -1.8664 1 0.9216 0.5549 -0.3838
PRESS =
k1 - 0.8831) k1 - 0.3534) k1 - 0.5246) ( 1 - 0.3670
65.9115 (E)
22.
[Section 14.6] For a Poisson, the deviance residual is sign(yk - 9k)12(yk in(Yk jIk) (Yk 9k))• We can ignore
the 2 if we wish and use the textbook's incorrect formula since we just want to determine the maximum, but we
won't. However, we will ignore the sign and the square root, since we just want the maximum absolute value. The
calculations work out to:
24.
[Section 5.31 We need R). the coefficient of determination of regressing xi on x2. That is the square of the
correlation coefficient of x1 and x2. You can use your calculator to calculate the correlation, but we'll work it out. We
don't need any of the information provided for y, but we need Z(1:11 - FC1)(x12 - R2).
25.
[Section 3.11 The error sum of squares is 302. The samplevariance is the total sum of squares divided by n -1, so
the total sum of squares is 32.8(17) = 557.6.
302
R2 - 1 0.4584 (A)
557.6
220 -
1 -0.22-0.08-0.35- 0.15 = 7(
.2
w = 0.2(220) = 44 (A)
[Section 3.41 We will calculate the t ratio of b2 and then use equation (3.14). The standard error of b2 is 8.0882 1=
3.9706. The t ratio is 12.456/3.9706 = 3.1371. The partial correlation coefficient is
3.1371
0.5315 (E)
V3.13712 + 28 - (2 + 1)
28. [Section 3.3] The standard error of bi, based on the second diagonal entry of the (X1X)-1 matrix, is 11.155 10 =
1.1155. The t ratio is
4.986 - 3
= 1.7804
1.1155
There are 15 - 3 = 12 degrees of freedom. 1.7804 is between 1.7823, the 95th percentile of t12, and 1.3562, the 90th
percentile of t12, making (B) the correct answer.
29. [Section 16.2] I is true, although the n components may (and usually include) duplicates. II and III are not true.
(A)
30. [Section 7.2] Using the definition of BIC in James et al (you'll come to the same conclusion even if you use the usual
definition of BIC), BIC = (RSS + (In n)d82), and we can ignore 1/(n 82). So we want
cd + d(ln 92)(25) = cd+1 + (d + 1)(1n 92)(25)
This implies
cd - cd-Fi = (1n92)(25) = 113.0 (B)
-0.72917
32. [Section 5.3] The VIF is calculated by regressing xi on x2 and evaluating R2. For a two-variable model, R2 is the
square of the correlation coefficient of the dependent and independent variables, so let's calculate the correlation
coefficient squared. Since there are 5 observations, the sum in the third bullet is divided by 5 - 1 = 4.
2 (0.029492/4)2 - 0.26820
P = (0.0096625)(0.0209767)
That is R2 of the regression of xi on x2. The VIF is VIF = 1/(1 - 0.26820) = 1.3665 (A)
33. [Section 18.2] The distance between 121,301 and 140,511 is probably greater than the distance between {40,51} and
{63), but just to make sure, we'll calculate it:
0.25(19 + 30 + 10 + 21) = 20
whereas the distance from {40,51) and {63} is 0.5(23 + 12) = 17.5. As long as x 63 + 17.5 = 80.5, it will be linked to
{631 at the next iteration. (C)
35. [Section 18.21 Statement I is true since the distance between single points is the same regardless of linkage. IT is 1
false; the number of clusters goes down by 1 at each iteration of the algorithm. III is true. (C)
Practice Exam 6
s2 477.333
s2 =
= 3.9778
bl 20
E(Xi - F02
With 9 degrees of freedom, the 2-sided 95% confidence interval has 5% of the t distribution in the tails, so the
coefficient is 2.262. The width of the confidence interval is 2(2.262)V3. 9778 = 9.023 .(A)
2. [Lesson 12] Calculate the relative odds of the 4 colors to white for x = 35.
black
Eu3-0.009(35) =1.0151
nwhite
Trgray
e -1.05+0.01(35) =0.4966
white
nred
=„1.50-0.10(35) =0.1353
nwhite
nblue
= eos5+omo505) =1.2523
nwhite
1
The probability of a white car is 1 + 1.0151 + 0.4966 + 0.1353 + 1.2523
0.2564
3.
[Lesson 15] The Bayes decision rule selects 0 when X <0.5 and 1 when X > 0.5. The Bayes error rate is then X for
X <0.5 and 1 - X when X > 0.5. The density function of X is f (x) = 1,0 x 1. The Bayes error rate is
0.5 1
fo X dx +
0.5
(1 - x)dx = 0.125 + 0.125 = 0.25 (C)
-2
+51n0.1 +451n0.9 +3h4+521ng -0.8211 (A)
240-5
PRESS =
ei \z
2 2 2 2 2
-1.3009
( 0.0923
1 0.9145
1 0.8649
1 ( -0.0974 1 ( -0.4734 12
(1 - 0.4514) - 0.5626)
±
- 0.3111) ± 1 - 0.5584) ± ‘1 - 0.3732) ± 1 - 0.7433)
14.69 (E)
6. (Section 12.1) Since a logistic model is a logged odds model, the odds of a strike when xi = 7 is e(73)PI times the
odds of a strike when x1 = 3. The odds of a strike when xi = 3 is 0.1/(1 - 0.1) = 1/9. So the odds of a strike when
xi = 7 is e04 = 0.1658 and the probability of a strike is 0.1658/1.1658 = 0.1422. (C)
7.
[Lesson 9] Use formula (9.3). The predicted value of y is 2.1+1.2(32) = 40.5. Based on the RSS, s2 = 100/(12-2) = 10.
Based on se(b),
1.425 -
The t coefficient for a 2-sided 95% interval with 10 degrees of freedom is 2.2281. The lower bound of a 95% prediction
interval is
0
40.5 - 2.2281V10 + 12
+ 1.4252(32 - 25)2 = 17.0960 (B)
9.
[Section 17.1] I is not true. II is true, since some constraint is needed when maximizing the variance and solving for
the loadings. III is true since the principal components are linear combinations of the variables and all the variables
are assumed to be centered at 0. (E)
10. [Section 41 There are n (k +1) = 11 -3 = 8 degrees of freedom in the unconstrained model, Model II. The F ratio
is
(12,204- 9,286)/1
F1,8= = 2.5139
9,286/8
The t statistic is .1,39 = 1.5855. The critical value for t at 8 degrees of freedom is 1.8595 at 10% significance for a
2-sided test, so Model II is not accepted at 10% significance. (E)
11.
[Section 16.2] I is true, since it's difficult to interpret an average of B trees. II is true. III is false; in bagging, trees
are not pruned. (E)
[Lesson 15] At (10,13), the closest two points are (10,13) and (12,11), with average Y of (2 + 5)/2 = 3.5.
At (12,11), the closest two points are (12,11) and (10,13), with average Y = 3.5.
At (12,16), the closest two points are (12,16) and (10,13), with average Y of (2 + 10)/2 = 6.
At (15,13), the closest two points are (15,13) and (12,11), with average Y of (6 + 5)/2 = 5.5.
The MSE is
(3.5 - 2)2 + (3.5 - 5)2 + (6 - 10)2 + (5.5 - 6)2 (A)
4
13. [Section 18.2] I and II are true. With centroid linkage, dissimilarity between clusters may decrease at an iteration.
(B)
14. [Section 21.3] 11 and III are for stochastic seasonal effects; J and IV are for fixed effects. (D)
15. [Section 18.2] The distance between 160,221 and 175,411 is V152 192 = -\/ 5 for any linkage. The distance from
{(40,30), (40,40)1 and {60,221 is distance to the closest point (40,30) or V202 + 82 = VTle.4 for single linkage, distance to
the furthest point (40,40) or 182 = -V:1 for complete linkage, average distance or 0.5(V + iifg) = 24.22395,
while VW = 24.20744, for average linkage, and distance to the midpoint (40,35) or 11202 + 132 = for centroid
linkage. We see that complete linkage will prefer to fuse (60,22) and (75,41), and average linkage also prefers that by
a narrow margin. Single and centroid linkages will fuse (60,22) with {(40,30), (40,40)1. (C)
16. [Section 16.2] The reduction in RSS for the variables is:
X1: (9,865 - 9,075) + (7,411 - 7,026) = 1,175
X2: (9,075 - 8,302) + (7,026 - 6,798) = 1,001
X3: (8,302 - 7,845) + (7,845 - 7,411) + (6,798 - 6,502) + (6,502 - 6,398) = 1,291
The highest reduction in RSS is from X3, followed by X1, followed by X2. (E)
l
f
J.i. [Section 5.2] The standardized 10th residual is 6.8/(8111 - 0.2), and the square of this is 0.903125. Cook's distance is
0.2
0.0376 = 0.9031251'
1
(k + 1)(1 - 0.2) )
0.1665 -
k+1
k=11 (B)
18. [Section 5.1] Standardized residuals are ti/(sAii ii). First we compute Pi. Since Sr = Hy, we have ei'Sr-y=
(1 - H)y. The residuals are
0.65 -0.45 0.05 2.85
( 0.15 ) 14
-0.45 0.35 0.15 -0.05 12 _ -1.05
-0.05 0.15 0.35 -0.45 8 2.55
-0.15 -0.05 0.45 0.65 3 -4.35
=
sY
SX
sx = /105.6
87.36 = 0.78
(1./sY
87.361/1757
sy = 1150.93
0.78
= 1,324,646
Dyt
t=1
9)2 = 0.42 ± 1.42 ^I_
Ok 0.6) = 3.2
—1.56
r1 = —0.4875 (A)
3.2
21. [Subsection 13.3.1] The overdispersion is the ratio of variance to mean. The weight on the Poisson distribution is
0.7, so the mean is (0.7)(0.8) = 0.56. If you memorized the formula for variance you can use it. Otherwise, calculate
the second moment (N is the response):
and the variance is 1.008 — 0.562 = 0.6944. The overdispersion is 0.6944/0.56 = 1.24 .(D)
k 106.0685 = [7.1
12-1 (C)
26.5175
n. [Subsection 13.3.2] Overdispersion is variance divided by mean. We'll use equations (13.6) and (13.7) to compute
mean and variance. The quotient of those two equations, variance divided by mean, is:
1.2 = 1 + (1 - k)p
and y = 0.8. So
1.2 = 1 + 0.8(1 - k)
0.2
0.8
k = 0.75
1 -
ti = 1 - (0.75)(1 - C°.8) = 0.5870 (A)
25. [Section 14.5] First we have to back out the loglikelihood of the model. Using equation (14.15),
1 R2 exp(k/n) )2
kexp(/(b)/n))
e-0.4588
- 0.480419 =
er(b)/loo
0.632042
eI(b)/100= - 0.876837
0.720820
/(b) = 1001n 0.876837 = -13.1434
26. [Section 8.1] All three statements are true. In statement III, notice that increasing the budget parameter means
decreasing the tuning parameter, bringing them closer to unadjusted regression. (D)
27. [Section 19.3] II is true, but the other two statements are not. Consecutive terms are correlated in both types of
time series. Differences of a linear trend in time have mean pi and differences of a random walk have a mean equal
to the mean of the white noise series underlying the random walk. (B)
28. [Section 17.5] The sum of squared scores is 0.00162 + 0.10742 + (-0.3432)2 + 0.23422 = 0.1842. That number divided
by 4 is the variance of the second principal component. The total variance of the three variables is the sum of squares
of all coefficients divided by 4, and the sum of squares is 12 +02 + (-0.5)2 ± (-0.5)2 + 0.42 ± 0.42 -I- • • • -I- 0.72 = 3. The
proportion of variance explained is 0.1842/3 = 0.0614 . (C)
29. [Section 3.4] The partial correlation coefficient is the correlation between the residuals of the two regressions. Since
Ej = 0 for any regression, we only need to stun up squares and products. For the sum of squares, note that the
residual standard deviation is the square root of the sum of squares divided by the number of degrees of freedom,
4, so we multiply each one by 2 to get the square root of the sum of squares.
Ej< = 15.767
15.767
Partial correlation — 0.377 (C)
4(3.102)(3.371)
30. [Lessons 2 and 201 We use the usual simple linear regression formulas with yi as the response and yt_i as the
explanatory variable. Since the letter y is in use for the time series, we'll use z for the response. The regression is
based on 9 observations; n = 9.
E xi = 149 — 11 =138
E 4 := 2287 — 112 = 2166
zi = 149 — 20 = 129
Exfzi = 1994
1994 — (138)(129)/9
b1 = 0.32 (D)
2166 — 1382/9
31. [Section 5.3] For the regression, k = 3 and n = 32. The error sum of squares has 32 — 3 — 1 = 28 degrees of freedom.
Then
32. [Section 18.1] The clustering algorithm ends up with two candidates for the minimum: 12,4,7,111 for the first
cluster and (2,4,7,11,161 for the first cluster. The former has centroids at 6 and 25.75, so 16 stays in the second cluster,
whereas the latter has centroids at 8 and 29, so 16 stays in the first cluster.
It's probably easier to use formula (18.4) to calculate the objective function. In fact, you can use your statistical
calculator to calculate the required sums of squares, using the variance. For the split into 4 points and 4 points, the
objective function is
2((2 — 6)2 + (4 — 6)2 ± (7 — 6)2 + (11 — 6)2 + (16— 25.75)2 + (22— 25.75)2 + (29 — 25.75)2 + (36 — 25.75)2) = 541.5
For the split into 5 points and 3 points, the objective function is
2 ((2 — 8)2 + • • • + (16 — 8)2 + (22 — 29)2 + (29 — 29)2 + (36 — 29)2) = 448 (B)
n . [Section 16.11 You can try all 5 splits, but the splits between 2 and 3 and between 3 and 4 are the ones most likely
to minimize RSS. You can use your calculator to calculate the RSS, which is a multiple of the variance.
For a split between 2 and 3, the RSS of 1 and 2, for which 9 11, is (10— 11)2 (12— 11)2 = 2, and the RSS of
13,4,5,6), for which 9 = (17 + 21 + 22 +24)/4 = 21, is (17-21)2 + (22 — 21)2 + (24 —21)2 = 26. The sum of the RSSs is 28.
For a split between 3 and 4, the RSS of 11,2,3), for which9= 13, is 26, and the RSS of {4,5,6}, for which 9 = 221,
is 4Z'for a total RSS of 30Z'Thus the split between 2 and 3 yields a lower RSS. (B)33
2 2
k1 — 0.3512) k1 — 0.5760)
35.
[Lesson 1] Parametric approaches are less flexible than non-parametric approaches. But they require fewer
observations since they only estimate a small number of parameters, and they are easier to interpret since they relate
the response to the predictors in a simple way. (E)
Q Practice Exams
# 1 2 3 4 5 6
1 1 1 16 1 1 3
2 12 3 17 14 12 12
3 11 19 14 19 19 15
4 11 2 6 9 12 16
5 14 1 7 14 2 6
6 17 12 20 5 18 12
7 1 21 14 16 15 9
8 14 5 4 18 14 11
9 18 18 18 17 17 17
10 5 3 8 5 16 4
11 7 12 18 18 12 16
12 16 16 20 17 19 15
13 16 16 16 3 20 18
14 7 7 13 12 14 21
15 15 4 16 12 18 18
16 18 20 5 11 17 16
17 17 12 4 16 16 5
18 19 18 1 19 13 5
19 4 16 5 14 7 2
20 20 20 16 16 6 19
21 18 13 6 18 8 13
22 16 18 17 15 14 11
23 19 4 19 4 5 4
24 13 7 5 12 5 13
25 14 18 11 14 3 14
26 13 17 1 18 21 8
27 8 15 15 3 3 19
28 2 14 14 21 3 17
29 8 14 21 13 16 3
30 5 5 18 7 7 20
31 21 5 14 16 8 5
32 16 17 18 15 5 18
33 18 21 11 20 18 16
34 3 13 11 11 21 6
35 10 14 8 10 18 1
When you study the modules for Exam PA, they will reference the textbooks on Exam SRM. You may have to
refer to the textbooks. However, if you would rather refer to this manual, you can use the following cross-reference
lists to find the corresponding reading in this manual. This map is rough, since in many cases material is organized
differently.
Some material is not covered in this manual. This material may be
• Introductory material that doesn't say much
• Examples of topics covered in other sections provided by the textbooks
• Obscure topics I don't expect to be on the exam
Chapter in Lesson in
RM this manual
2.1-2.2 2.1
2.3 3.1-3.2
2.4-2.5.2 3.3
2.5.3 9
2.6 5.2
2.7-2.8 not covered
3.1-3.2 2.2
3.3 3.1-3.2
3.4.1-3.4.2 3.1-3.3
3.4.3-3.4.4 3.4
3.5 2.3
5.1 not covered
5.2 7.1
5.3-5.4 5.1-5.2
5.5 5.3
5.6 6
5.7 2.3
6 10
7.1-7.2 19.1
7.3-7.5 19.3-19.5
7.6 19.6
8.1 19.2
8.2-8.4 20
9.1 21.1
9.2 21.2
9.3 21.3
9.4 21.4
9.5 21.5
11.1-11.2 12.1
11.3 14.2,14.5
11.4 not covered
11.5 12.2
11.6 12.3
12.1-12.2 13.1
12.3 13.2
12.4 13.3
13.1-13.2 11.1-11.2
13.3-13.4 11.3-11.4,14.1,14.3
13.5 14.6
13.6 11.1
Table B.4: Correspondence between An Introduction to Statistical Learning and this manual
Chapter in Lesson in
ISL this manual
2.1-2.2.2 1.1
2.2.3 15.1-15.2
3.1 2.1,3.1-3.3
3.2,3.3.1 2.2,4,3.1,7.1
3.3.2 2.3
3.3.3 2.3,5
3.4 not covered
3.5 15.3
4.1-4.3 12.1
5.1 6
6.1.1-6.1.2 7.1
6.1.3 7.2
6.2.1 8.1.1
6.2.2 8.1.2
6.2.3 8.1
6.3.1 8.2.1
6.3.2 8.2.2
6.4 8.3
8.1 16.1
8.2 16.2
10.1 not covered
10.2 17
10.3 18