(2022 Full) ASM SRM (Ocr)

An Integrated
Actuarial University
Component Your Path to SUCCESS 40mb GOAL
‘1111/ Videos & MorE
Practice_ Quiz_ Test, PASS
of Your Study 0 0
Study Manual Virtual eFlasheards
410nual Program
aistm
Exam SRM Study Manual
Abraham Weishaus, Ph.D., FSA, CFA, MAAA

NO RETURN IF OPENED
a/s/M
Actuarial Study Materials
Learning Made Easier

J
wsim
Abraham Weishaus, Ph.D., FSA, CFA, MAAA

NO RETURN IF OPENED
TO OUR READERS:
Please check A.S.M.'s web site at www.studymanuals.com for errata

and updates. If you have any comments or reports of errata, please
e-mail us at mail@studymanuals.com.
ISBN: 978-1-64756-515-2
©Copyright 2022, Actuarial Study Materials, a division of ArchiMedia Advantage Inc.
All rights reserved. Reproduction in whole or in part without express written

permission from the publisher is strictly prohibited.
ITT1 Welcome to Actuarial University
Actuarial University is a reimagined platform built around a more
simplified way to study. It combines all the products you use to
study into one interactive learning center.
You can find integrated topics using this network icon.
*Jo When this icon appears, it will be next to an important

topic in the manual! Click the link in your digital manual,
or search the underlined topic in your print manual.
1. Login to: www.actuariatuniversity.com
2. Locate the Topic Search on a Pardo Distribution
your exam dashboard, and . The (Type II) Pareto distribution with parameters z8> 0 has pa
enter the word or phrase into

the search field, selecting the and cdf
best match.
If X is Type II Pareto with parameters tie,13, then
3. A topic "Hub" will display a Epri —1

if a 1,
and
list of integrated products '
that offer more ways to

study the material. GOAL for SRM
4. Here is an example of the

topic Pareto Distribution:
Probability for Risk Management, 3rd Edition a
ASM Manual for !FM a

Within the Hub, there will be unlocked and locked products.
Unlocked Products are the products that you own.
Locked Products are products that you do not own, and are
available For purchase.
Probability for Risk Management, 3rd Edition ill
Many oF Actuarial University's Features are already unlocked with your

study program, including:
GOAL Practice Tool Instructional Videos*

Virtual Flashcards Formula & Review Sheet
GOAL: Guided Online Actuarial Learning

Your Adaptable, Customizable, Online Actuarial Exam Prep Tool.
GOAL is an eLearning test preparation tool for students to practice skills learned in class
or from independent study. The online platform offers a massive database of SOA & CAS
exam-style problems with detailed solutions.
GOAL offers instructor support as well as performance reporting and tracking to monitor
your progress. Don't forget to check out GOAL Score to see how ready you are to sit for
your exam!
Practice. Quiz. Test. Pass!

. 10,000+ Exam-style problems with detailed solutions!
• Adaptive Quizzes
• 3 Learning Modes
• 3 Difficulty Modes
*Available standalone, or included with the Study Manual Program Video Bundle
Coming Soon for FAM, ASTAM & ALTAM!
Free with your ACTEX or

ak7 ASV Study Manual Procram.
GOAL
Available now for P, FM, IFM, STAM, SRM, MAS-I, MAS-II„ CAS 5, CAS 6U & CAS 6C
Prepare for your exam confidently with GOAL custom Practice Sessions,
Quizzes, and Simulated Exams.
1111•••••••
Actuarial University Quickly access the

Hub for additional
QUESTION 14 0P62 Question # 4 Nev X
Learning
Que,tion Difficulty: Mastery 8
At time t =0 year, Donald puts $1,000 into a fund crediting interest at a nominal rate of i Flag problems for
compounded semiannually. review, record
At time t = 2 years, Lewis puts $1,000 into a different fund crediting interest at a force notes, and email
= 1/(5 + t) for all t.
your professor.
At time t = 16 years, the amounts in each fund will be equal.
Calculate i.
View difficulty
pos.sible Answers level.
D 6.9% 1:117.o% 1117.1% 1137.2% (
Helpful strategies
Help Me Start
to get you
started.
Equate the expressions for the AVs at t = 16. Then solve for i 0):
Full solutions,
Equate the expressions for for the AVs at t = 16 and calculate i(2): with detailed
(1 + 021/2)" = 3 explanations
(1 +2(2)/2) = 3111321=1.03493 to deepen your
io)/2 = 0.03493 13..111: 11+ P/39" -(1 4.ir'lf 2342 understanding.
011C) do e6o'l'un —2,117 — 3
= 7.0%
Commonly
Common Questions &Errors encountered
Student Question 1: After solving this problem I got .069855. Are we expected to round to .07? errors.
Answer: The provided answer choices are all rounded to 1 decimal place. So the answer 6.9855%
n luld be rounded to 7.0% to be correct to 1 decimal place. Rate a
Rate this problem I6 Excellent c inadequate problem or give

feedback
Coming Soon for FAM, ASTAM & ALTAM!
Track your exam readiness

with GOAL Score!
GOAL
Available now for PI FM, IFM, STAM, SRM, MAS-I & MAS-II
GOAL Score tracks your performance through GOAL Practice Sessions,

Quizzes, and Exams, resulting in an aggregate, weighted score that gauges
your exam preparedness.
By measuring both your performance, and the consistency of your
performance, GOAL Score produces a reliable number that will give you
confidence in your preparation before you sit for your exam.
Your COAL Store IF your GOAL

74
Score is a 70 c,
above, you are
well-prepared
to sit For your
exam!
62 74 78
[plant OrobobAO windarnikriobles•Ithlkeuelat•ITPUbtft Irdlo.Virloblam with 1)14.64.e•Proballity 11:61011pAlen
r yhdra n
See key areas

General Probability Set functions 1ndutlieg Setnotation and bask Momenta 01PlobablItY v
where you can
Diffkulty PrePortIon
Core 00 • 92 / 304
improve.
Advarioni ; 60 169/304
• mastery 70 43/304
Detailed
GOAL Session Prz1MtySominoty Fitter: AS Types
•
. •: AllStatuaeS performance
Created Last Accessed mode Categories Qoeselosse Status
tracking.
0pip4,2022 2147,43 01/24/2022.2157:43 Quiz • Continuous P_ 29 Hew
0S/24/2022 111340S 05/24/2022 14L57:A9 05/24/2022 14;57:40 PrtkoSession 'Addition:and DO Complete
05/21/202219:3250 05/23/702220/0200 5imu1ate-0 WM !11m2 30 StrAewing

Quickly return
05/17/2022 15:1919 05/17/2022 1511645 05/23/202214.1929 Simulated 0MM Lam 6 30 Complete
to previous )
05/14/2022 1126159 4P00 125)L36 05/2312022 11:57547 Q012 O04CIN00111 0— 20 Complete
sessions.
Check out my exclusive
video solutions for the FAM,
ALTAM, and ASTAM Manuals!
1,Pr*MIEr
Have you heard of GOAL: Guided

Online Actuarial Learning? Use
it for practice, quizzes, creating
custom exams, and tracking your
study progress with GOAL Score.
Connect with me on the

ASM discussion group in
Actuarial University.
4,1
a/san
Exam ASTAM Study Manu al
aiSIM aistm
E-xa
xlsoi stud,/4.413&"
Nta"ual Exam FAM-S Study/Manual
rzti.
h.11.1%4 ARA.,
,4,0)617., 71.1101111Iftr
asim
aisim Study Manuals Learning Made Easier

Contents
1 Basics of Statistical Learning 1

1.1 Statistical learning 1
1.2 Types of variables 3
1.3 Graphs 3
Exercises 7
Solutions 7
I Linear Regression 9
2 Linear Regression: Estimating Parameters 11

2.1 Basic linear regression 11
2.2 Multiple linear regression 14
2.3 Alternative model forms 15
Exercises 16
Solutions 26
3 Linear Regression: Standard Error, R2, and t statistic 31

3.1 Residual standard error of the regression 31
3.2 R2: the coefficient of determination 33
n• 3.3 t statistic 34
3.4 Added variable plots and partial correlation coefficients 36
Exercises 38
Solutions 50
4 Linear Regression: F 57
Exercises 59
Solutions 68
5 Linear Regression: Validation 75

5.1 Validating model assumptions 75
5.2 Outliers and influential points 77
5.3 Collinearity of explanatory variables; VIP 79
Exercises 82
Solutions 89
6 Resampling Methods 95
6.1 Validation set approach 95
6.2 Cross-validation 96
Exercises 99
Solutions 101
7 Linear Regression: Subset Selection 103

7.1 Subset selection 103
7.2 Choosing the best model 105
Exercises 109
Solutions 118
Exam SRM Study Manual xi

Copyright 02022 ASM
xii CONTENTS
8 Linear Regression: Shrinkage and Dimension Reduction 125

8.1 Shrinkage methods 125
8.1.1 Ridge regression 125

8.1.2 The lasso 127
8.2 Dimension reduction methods 129
8.2.1 Principal components regression 130
8.2.2 Partial least squares 130

8.3 The curse of dimensionality 131
Exercises 133
Solutions 142
9 Linear Regression: Predictions 149

Exercises 150
Solutions 153
10 Interpreting Regression Results 157
10.1 Statistical significance 157

10.2 Uses of regression models 157
10.3 Variable selection 157
10.4 Data collection 158
II Generalized Linear Model 159
11 Generalized Linear Model: Basics 161

11.1 Linear exponential family 161
11.2 Link function 163
11.3 Estimation 165
11.4 Overdispersion 167
Exercises 168
Solutions 182
12 Generalized Linear Model: Categorical Response 187
12.1 Binomial response 187
12.2 Nominal response 191

12.3 Ordinal response 193
Exercises 195
Solutions 210
13 Generalized Linear Model: Count Response 219

13.1 Poisson response 219
13.2 Overdispersion and negative binomial models 219
13.3 Other count models 220
13.3.1 Zero-inflated models 220
13.3.2 Hurdle models 221
13.3.3 Heterogeneity models 221
13.3.4 Latent models 222
Exercises 222
Solutions 228
14 Generalized Linear Model: Measures of Fit 231

14.1 Pearson chi-square 231
14.2 Likelihood ratio tests 232 ,
14.3 Deviance 232

Copyright 02022 ASM
CONTENTS xiii
14.4 Penalized loglikelihood tests 235

14.5 Max-scaled R2 and pseudo-R2 236
14.6 Residuals 236
Exercises 239
Solutions 248
III Other Statistical Learning Methods 255
257
15 K-Nearest Neighbors
15.1 The Bayes classifier 257
15.2 ICNN classifier 258
15.3 KNN regression 259

Exercises 260
Solutions 266
16 Decision Trees 273

273
16.1 Building decision frees
277
16.2 Bagging, random forests, boosting
277
16.2.1 Bagging
16.2.2 Random forests 278
16.2.3 Boosting 278

Exercises 281
Solutions 294
17 Principal Components Analysis 299
17.1 Loadings and scores 299
17.2 Biplots 301
17.3 Approximation 304

304
17.4 Scaling
17.5 Proportion of variance explained 306
Exercises 309
Solutions 318
323
18 Cluster Analysis
323
18.1 K-means clustering
325
18.2 Hierarchical clustering
18.3 Issues with clustering 330
Exercises 332
Solutions 336
IV Time Series 341
19 Time Series: Basics 343

19.1 Introduction 343
19.2 Mean and variance 344
19.3 White noise 345
19.4 Random walks 345
19.5 Control charts 346
346
19.6 Evaluating forecasts
Exercises 348
Solutions 354

Copyright 02022 ASM
xiv CONTENTS
20 Time Series: Autoregressive Models 359

Exercises 361
Solutions 364
21 Time Series: Forecasting Models 367
21.1 Moving average smoothing 367

21.2 Exponential smoothing 367
21.3 Seasonal models 368
21.4 Unit root tests 369
21.5 ARCH and GARCH models 370
Exercises 371
Solutions 374
V Practice Exams 377
1 Practice Exam 1 379
Appendices 441
A Solutions to the Practice Exams 443

Solutions for Practice Exam 1 443
B Cross Reference Tables 479

Copyright 02022 ASM
Preface
Welcome to Statistics for Risk Modeling!

This course gives you an introduction to statistical learning and data science. It is a prerequisite to the Predictive
Analytics exam.
You should have some knowledge of calculus, probability, and mathematical statistics, and you should know
what matrix multiplication, transposition, and inversion means. However, the technical part of this course is light;
you should know what we're talking about when we mention testing a null hypothesis Ho against an alternative,
or Student's t test, and it would be nice if you know what maximum likelihood estimation is, but we don't go very
deeply into mathematical statistics.
Download the syllabus for the exam. At this writing (January 2022), you can find it at
hups://www.soa.org/Education/Exam-Req/edu-exam-srm-detail.aspx
The syllabus has two links at the bottom. The second one links to sample questions and solutions. There are 28
sample questions.
The syllabus includes the following topics and weights:
Number of
Topic Weight Lessons Sample Questions Sample Questions

Basics of Statistical 7.5-12.5% 1 0
Learning
Linear Models 40-50% 2-15 7, 8, 11, 12, 13, 14, 17, 18, 23
19, 20, 23, 24, 27, 28, 42,
44, 45, 47, 49, 52, 53, 54,
56
Time Series Models 12.5-17.5% 19-21 3,4, 21, 22,31, 38,46, 55 8
Principal Components 2.5-7.5% 17 5, 6, 30 35, 37 5

Analysis
Decision Trees 10-15% 16 9, 10, 25, 26, 33, 39, 41, 11
48, 50, 51, 57
Cluster Analysis 10-15% 18 1, 2, 15, 16, 29, 32, 34, 36, 10
40,41
The sample questions are classified by topic in this table.

Here are some comments on the sample questions:
1. It is interesting that the topic "K nearest neighbors" is not a separate topic of the syllabus; rather, it is included
in linear models. In this manual, there is a separate lesson on K nearest neighbors. None of the sample
questions are on this topic. It is possible that they never test on it.
2. Linear models is the largest topic, but includes generalized linear models and various other topics.
3. The distribution of the sample questions is different from the syllabus weights. Part of this may be to provide
more questions on topics which have not appeared on exams before STAM, such as cluster analysis.
About 60% of the sample questions are conceptual; no calculations are needed. Some of these questions are taken
from obscure passages in the Frees textbook. All of the concepts in these questions are covered in this manual, but
in some cases very briefly. However, there is no guarantee that they won't ask a question on something that I didn't
n include. I believe that knowing the information in this manual will be enough to get a 10, but not necessarily enough
to answer every exam question.
Exam SRM Study Manual XV
Copyright 02022 ASM

xvi CONTENTS
You should also download the tables that are linked to the bottom of the syllabus. The tables include the normal
distribution, critical values for the t distribution, and critical values for the chi-square distribution. The SOA hasn't
provided rounding rules for the normal table, but you won't be using it that heavily anyway.
There are two textbooks on the syllabus. There is some overlap between the two textbooks, as they both discuss
linear regression. The styles of the two textbooks are different.
The first textbook is Regression Modeling with Actuarial and Financial Applications by Edward Frees, an actuary.
This book covers the linear models and time series parts of the syllabus. The author comes across as a scholar who
is very familiar with his material and tries to get it across by showing many practical examples of its use. Practical
examples means computer outputs. To make the book more readable, technical detail is usually placed in a section
at the end of each chapter (and those sections are not on the SRM syllabus). Despite the author's good intentions, I
found this book somewhat difficult to read for the following reasons:
1. The book has a fairly large number of errors. The errata list for the book is at
https://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/
RegressionFreesErrata12September.pdf
The errata list must be taken into account, since many formulas in the book are incorrect.
2. The book lacks a useful index. And the table of contents only lists chapters and sections, not subsections. Isn't
it interesting that the index does not have anything starting with F? (The F-ratio is discussed in this textbook,
although the author seems to prefer using t-ratios.) If you wanted to know something about Cook's distance,
where would you find it? (If you were really clever and knew that Cook's distance had something to do with
leverage, you could look under leverage and find Cook's distance there, but hey - an index should be usable
by dummies!) If some practice question mentioned "Dickey-Fuller" and you wanted to know what it is, where
would you look? (If you knew that the full name of the test is the "Dickey-Fuller unit root test", you would still
not find "unit root tests" in the index, but you would find it in the table of contents. Hurrah!) I was frustrated
by the difficulty of finding things in the book.
3. It is often very hard to understand what the author is saying without knowing the technical background, and
often the technical background is not provided in the last section of the chapter.
I have omitted some of the more obscure topics from this textbook, and I don't think they would appear on an
exam.'
The second textbook is An Introduction to Statistical Learning, coauthored by four non-actuarial authors. This
textbook covers all parts of the syllabus except time series, but does not discuss logistic models.2 This book is
available free as a download, and I encourage you to download it! The style of this book is enthusiasm; these
authors are excited about this topic and want you to be excited as well! You can read this book in bed, and I
challenge you to find an error in it. An Introduction to Statistical Learning avoids technical details as much as possible,
and rather than have you do calculations, shows you how to use R to carry out the modeling.
You will find An Introduction to Statistical Learning easier to read than this manual. However, this manual will
still help you in the following ways:
1. It summarizes the material. You can pick up the material faster by reading this manual, although you will not
be motivated as much. You may even want to read An Introduction to Statistical Learning once and then this
manual many times to review the material.
2. It provides you with exam-like examples and questions. An Introduction to Statistical Learning is interested in
teaching you practical uses of the material. It rarely provides simple small-scale examples since they don't
represent realistic situations. But Exam SRM is does not provide you a computer to do calculations; the
calculation questions on it will be simple and small-scale.
3. On rare occasions An Introduction to Statistical Learning is not completely clear.
'In particular, I do not discuss the multinomial logit model
aActually it does discuss logistic models, but the chapter that discusses them is not on the syllabus.

Copyright 02022 ASM
CONTENTS xvii
Much of the material on this exam does not lend itself to calculation by hand. Therefore, much of this exam will
contain knowledge questions rather than calculation questions. Knowledge question are 3-way or 4-way true/false
questions. On 3-way true/false questions, the SOA enforces symmetry of the answer choices. The only two sets of
answer choices that you will encounter on an exam are
(A) I only (B) Il only (C) III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) , or (D) .
and
(A) None (B) I and II only (C) I and III only (D) II and III only
In both cases, there is symmetry among the three statements I, II, and III, and also symmetry in how likely a
statement is to be true. Apparently symmetry is not required for 4-way true/false questions. For either type, choice
E should be the correct choice about 1/5 of the time only.
Exercises and practice exams

All questions on the practice exams in this manual are original. However, a small number of the calculation questions
are slight variations of the SRM sample questions. For the knowledge questions, there was a limit to the number of
different tricky statements I could come up, so you may find them slightly repetitive.
Many exercises are original, but I also used SRM sample questions and questions from old CAS MAS-I, MAS-II,
and S exams. Let's discuss how appropriate CAS questions are.
First of all, the CAS style is in general different from the SOA style. It tends to be less precise. And answer
choices are usually ranges rather than specific answers. On three-way true/false questions, they used to not insist
on symmetry, but for last few years they do; all the MAS-I and MAS-II three-way true/false questions follow the
symmetry rules.
In terms of syllabus, the CAS uses the same textbook, An Introduction to Statistical Learning, for the topics that are
covered in that textbook. They use it even for logistic regression. However, the CAS uses textbooks for linear models
and time series different from the ones the SOA uses. The time series coverage on MAS-I has very little overlap
with the time series coverage on SRM. For generalized linear models, different authors use different terminology
and different parametrizations. For example, what we call "linear exponential family" is called "exponential family
in canonical form" in the CAS textbook, and is parametrized differently as well. When different terminology and
parametrizations were used, I translated CAS questions whenever possible, otherwise I did not use the question.
New for this edition
The online version of this edition has been linked to the Actuarial University.
The Principal Components Analysis lesson has been rewritten in a clearer way.
The four new SOA sample questions have been added to the appropriate lessons.
The R language
This manual does not cover R, and you won't need it for SRM. However, you will need it for the PA exam. You
should read the labs in An Introduction to Statistical Learning to learn how to use R to carry out the statistical learning
methods you learn in this course.
Cross-reference tables
Note that Appendix B has cross-reference tables showing you which section of the manual corresponds to each
section in the textbooks. As discussed in that appendix, these may be helpful when you are studying for Exam PA.

Copyright Q2022 ASM
xviii CONTENTS
Errata
Please report all errors to the author. You may send them to the publisher at mail@studymanuals . corn or directly to
me at errata@aceyourexams . net. Please identify the manual and edition the error is in. This is the 3rd edition of
the Exam SRM manual.
An errata list will be posted at http : //errat a . aceyourexams . net. Check this errata list frequently.
Acknowledgements
I would like to thank the CAS for allowing me to use questions from their old exams, and the SOA for allowing me
to use its sample questions.
The creators of TX, P:TEX, and its multitude of packages all deserve thanks for making possible the professional
typesetting of this mathematical material.
I'd like to thank Michael Bean for his diligent job proofreading this manual, as well as advice on how to improve
the content.
I'd like to thank the following correspondents who submitted errata: Hiu Tung Chan, Cheng Chen, Joel Cheung,
Maria Doran, Neil Xavier Elpa, Lingyi Fang,Natalie Jacobsen, Boren Jiang, Dan Kamka, Drew Lehe, Yingxin Liu,
Mario Mendiola, Li Kee Ong, Greg Schlottbohm, Aaron Shotkin, Tara Starling, Ryan Talley, Chan Hiu Tung, Isaac
Zhang, Wei Zhao, Dihui Zhu.

Copyright ©2022 ASM
Lesson 1
Basics of Statistical Learning
Reading: Regression Modeling with Actuarial and Financial Applications 1.2; An Introduction to Statistical Learning
2.1-2.2.2
There is no perfect f, so we have to allow for error. Let E be the error. Then we want a function f for which
Y f X2, X3, • . .) £
We would like to pick the f that makes the error E as small as possible. The remaining error is the irreducible error. NI°
The input variables Xi are called explanatory variables, independent variables, features, or predictors. (Usually, SI
the term "features" is only used if Xi is a discrete random variable with a finite number of possible values.) The
output variable Y is called the dependent variable or the response.
Prediction and inference Statistical learning is used for prediction and inference. Prediction means determining •-ur
what the response will be for some values of Xi, possibly values that have not been observed in the past. Often
you have no control over the values of Xi. Inference means understanding how the explanatory variables influence NI.
the response. You may be able to control the explanatory variables and thus influence the response. For example,
the response variable sales may be influenced by various types of advertising. If a model shows how each type of
advertising influences sales, you may be able to adjust advertising strategy to increase sales.
Parametric and non-parametric There are two types of methods to specify f: parametric and non-parametric.
A parametric method specifies a function, and statistics is used to fit the function's parameters. For example, the Si?
response may be specified as a linear function of the predictors, and then the coefficients of the linear function are
estimated. Once the linear function is estimated, there is no need for the data that generated the linear function.
A non-parametric method does not specify a simple form for the relationship. Instead, a function that is close to
all the points without getting too wiggly is specified. Unlike parametric methods, all of the data that generated
this function is needed to specify the relationship. An example of a non-parametric method (which, however, is
not on the syllabus) is a spline. A cubic spline connects the points with cubic polynomials so that the connections
have continuous curvature. The disadvantage of parametric methods it that they assume a functional form for the
relationship, and the functional form may be wrong. But non-parametric methods have the disadvantage of needing
a large amount of observations to properly specify.
Flexibility versus interpretability A flexible method bends and twists in order to match the observations. An
inflexible method will not do that.
You may think that flexible methods are better. But one problem with them is that just because they fit the
observations well does not mean that they will do a better job predicting the response given new observations.
Exam SFtM Study Manual

Copyright 02022 ASM
2 1. BASICS OF STATISTICAL LEARNING
For example, one may have 10 observations of a response as a function of a predictor. One can fit a ninth-degree
polynomial to these observations, and it will fit the observations perfectly. But it will do a poor job predicting the
response for other values of the predictor.
air Another problem with flexible methods is interpretability. Linear regression specifies the response as a linear
function of the predictors. It is an inflexible method, but it is easy to understand how each predictor influences the
response. Flexible methods are difficult to interpret. In general, the more flexible a method, the less interpretable it
is. If statistical learning is used for prediction, then the lack of interpretability is not critical; a black box will do. But
if it is used for inference, the lack of interpretability is a drawback.
•41` Supervised and unsupervised learning So far we've been discussing supervised learning. Supervised learn-
-f ing has a response variable that is influenced by explanatory variables. Unsupervised learning does not have a
response variable. Unsupervised learning relates the observations to each other or finds patterns in the observa-
tions. It is more difficult than supervised learning since we cannot measure how well we have done. Typically, the
quality of a supervised method is measured by comparing the predicted response with the actual response. This
comparison is not available for unsupervised learning.
149 Regression versus classification problems Sometimes the response variable is continuous and sometimes it is
limited to a small number of values. When it is continuous, we have a regression problem, whereas when it is limited
to a small number of values we have a classification problem. Sometimes in classification problems the values of
the response are not numbers. For example, if the effectiveness of a medicine is being modeled, the response may
be "Effective" or "Not effective".
For regression problems, the function we fit can assume any real value. For classification problems, the function
we fit estimates the probabilities of the various classifications.
We will discuss types of variables in greater detail in the next section.
Quality of Fit The quality of fit for a regression problem is measured using mean squared error' or MSE. MSE is
defined by )
MSE = —
ii
- 902 (1.1)
1=1
Sf where pi is the fitted value. Typically there is training data, data that is used to fit the model, and test data, data
•
that was not used to fit the model. We will discuss in Lesson 6 how to obtain test data. The quality of a model is
determined by its MSE on the test data. A flexible model can reduce the MSE on the training data by using a lot of
parameters, but this does not indicate a high-quality model. In fact, standard error measurements for training data
as: typically divide the sum of squared errors by something less than n to compensate for the fitted parameters.
For classification problems, the quality of the fit is measured by the proportion of cases for which the correct
classification is selected. As with regression problems, the quality of the fit is measured on test data.
Bias versus variance Mean squared error is the sum of squared bias, variance, and irreducible error. There is
nothing we can do about the irreducible error, so we won't discuss it further.
NI° The variance of an estimator measures how much the estimator varies with different random samples of data.
Bias measures the extent to which the expected value of the estimator differs from the true value. Generally there
• is a tradeoff between variance and bias, and the goal is to select the estimator that minimizes the mean squared
error, thus optimizing the tradeoff. Inflexible estimators are not very sensitive to the input data, so they have low
variance. However, they make assumptions for the functional form of the relationship between the explanatory
variables and the dependent variable, and those assumptions may not be true; thus they have high bias. We will
learn later that the coefficient estimates of linear regression are unbiased, but this assumes that the true relationship
is linear! In reality, the relationship is almost surely not linear, so linear regression is a high-bias estimator. However,
it is a low-variance estimator, since it is not so sensitive to each individual point. A spline that goes through every
'Regression Modeling with Actuarial and Financial Applications calls it "mean square error", but An Introduction to Statistical Learning calls it
"mean squared error", which is probably better grammatically.

1.2. TYPES OF VARIABLES 3
observation point is very sensitive to the points, so it has high variance, but it has low bias since it does not make
assumptions on the underlying function.
1.2 Types of variables

Our models relate explanatory variables to response variables. There are three major types of variables:
1. Continuous variables. Continuous variables may assume any real value. Claim amount and time to settlement "41
of claim are examples of continuous variables. Although claim amount has to be a specific number of dollars
and cents rather than any real number, a cent is such a small unit that treating a claim amount as a continuous
variable is reasonable.
2. Categorical variables, sometimes called qualitative variables. These variables assume a small number of `11
category values. An example of such a variable is a "yes/no" random variable, like "Does the car have an anti-
theft system", "does the house have a sprinkler", and the like. Such variables, that can assume only one of two
values, are called "binary" or "Bernoulli" variables. Another common binary variable is male/female. Some
categorical variables have more than two categories. For example, auto usage may have the three categories
"farm", "pleasure", and 'business".
We assign numbers to the categories so that we can use these variables in equations. For the auto usage
variable, we may assign 0 to farm, 1 to pleasure, and 2 to business. There is no particular order to these
categories, so this variable is a "nominal variable". Sometimes the categories have a meaningful order. For
example, there may be various categories of injury in an accident, with categories ordered from mildest to
most severe injury. If the injury codes are 1 through 5, then a category 4 injury is worse than a category 2
injury, but it is not necessarily twice as bad as a category 2 injury. When the category numbers have a logical
order, we call the variable an "ordinal variable".
3. Count variables. A count variable assumes nonnegative integral values. Number of claims is an example of a
count variable.
There are also discrete variables that assume negative or nonintegral values. However, it is rare that we deal with
them.
1.3 Graphs
This section is based on Regression Modeling with Actuarial and Financial Applications 1.2, which is background reading
only.
We want to get some idea of which explanatory variables to use in our model. Looking at plots is one way
to do this. In the following plots, the x axis is used for an explanatory variable and the y axis is used for the
response variable, the variable we are trying to explain. The sample consists of observed pairs of (xi, yi), where xi
are observations of the explanatory variable and yi are corresponding observations of the response.
Scatter plots If we are considering a continuous variable as an explanation of another continuous variable, we can
graph them using a scatter plot. A scatter plot is a plot of all sample pairs (xi, yi). For example, if a sample has the 1141
5 points
(1,2) (2,4) (4,3) (7,5) (9,4)
then Figure 1.1 would be the scatter plot.

Insurance data sets usually have thousands of observations, and the scatter plot may look like a gray cloud of
points. Figure 1.2 shows a scatter plot with 2000 points. In this figure, based on the upwardly sloping pattern of the
cloud, it appears that the response is a linear function of the explanatory variable.
Sometimes a single scatter plot can be used to examine the relationship of a categorical variable and a continuous
variable with another continuous variable. This is done by coloring the dots differently based on category, like red
Exam Sii.M Study Manual

Copyright 02022 ASM
4 1, BASICS OF STATISTICAL LEARNING
10
8-
2-
o
o 2 4 10 0 1 2 3 4 5 6
Figure 1.1: Scatter plot for 5 points Figure 1.2: Scatter plot for 2000 points
30
25 -
—9—
20 -
15
10 -
5-
Male Female
Figure 1.3: Box plot
and blue. In this case, if there is a preponderance of red dots at the bottom and blue dots at the top, we can conclude
the category which is coded as blue dots tends to increase the response variable. On the other hand, if the red and
blue dots are randomly distributed around the graph, then the categorical variable is not relevant.
Box plots If we are considering a categorical variable as an explanation of a continuous variable, then we can draw
•
a box plot. A box plot has a rectangle above each category. A thick line in the middle of the box indicates the median.
The bottom line of the box is the first quartile and the top line of the box is the third quartile. Additional lines,
'141 called "fences" are placed above and below the box. Dashed vertical lines between the rectangle and the additional
•
lines, sometimes called "whiskers", are drawn. Different authors and programs put the fences in different places.
The syllabus reading uses the following method, proposed by John Tukey. Let Qi and Q3 be the first and third
quartiles respectively. Then let 11 = 1.5(Q3 — Q1). Place the lower fence at Qi — 11 and the upper fence at Q3 + 11.
•41 Potential outliers, defined as the sample points above and below the fences, are plotted individually, vertically above
.4* the center. The points within the fences but closest to the fences are called adjacent points.
An example of a box plot is Figure 1.3. Let's say the y axis is claim size. Then this plot indicates that for males,
median claim size is 15, the first quartile is 13, and the third quartile is 19, with some high percentile equal to 26
and a few potential outliers above 26. Female claim sizes tend to be lower. The distribution of claim sizes is skewed
upwards, as the solid lines in the middles of the rectangles are below the centers of the rectangles, and the distances
to the high percentiles are greater than the distances to the low percentiles. This plot was not drawn using the Tukey
method; using the Tukey method, the fences would be equidistant from the box.
q plots The 100qth percentile is also called the Cith quantile. For example, the 40th percentile is the 0.4 quantile. A
qq plot compares quantiles of two distributions. A gq plot consists of a plot of coordinate pairs: the x coordinate
is the observed quantile and the y coordinate is the fitted quantile. For example, suppose the observed values
1.-.? are 1, 3, 6, 10, and 15. Suppose they are fitted by maximum likelihood to an exponential distribution of the form .j
Copyright 02022 ASM
1.3. GRAPHS
20
oo 5 10 15 20
Fitted quantiles
Figure 1.4: qq plot of data fitted to an exponential
•
F(x) = 1 — e '0. The maximum likelihood fit sets t9 equal to the sample mean, which is 7 here. The eh quantile of "Ni
an exponential distribution can be worked out as follows:
F(x) = q
1 — e-x/0 = q
x
—
=— in(1 q)
0
x = —0 ln(1 — q)
There are many possible methods to assign quantiles to observations. For a sample of size n, we will set the quantile
of order statistic j (the ith observation when the observations are ordered from lowest to highest) equal to j/(n + 1).
Then, if we let the x-coordinates be the fitted values and let the y-coordinates be the observed data, then the qq plot
has the five points (-71n 5/6,1), (-71n 4/6,3), (-71n 3/6,6), (-71n 2/6,10), and (-7 ln 1/6,15). Figure 1.4 shows the
qq plot, with the y-coordinates for the observations and x-coordinates for the fitted distribution. In this qq plot, the
points are connected with lines, but q q plots do not have such lines; instead, the number of observations is usually
large, and one can see if they lie on a line.
If a completely specified distribution is being fitted, then the fit is good if the line is close to the 450line. Figure 1.4
has a 459ine. qq plots do not necessarily have a diagonal line, and if they do, they are typically drawn through the
25th and 75th percentiles of the observed and fitted distributions.
Usually, rather than comparing observed data to a completely specified distribution, a qq plot is used to compare
observed data to a family of distributions. One member of the family is used to draw the plot. If the fitted distribution
has cumulative distribution function F and the points of the qq plot lie on a straight line, it suggests that F ((x — a)/b)
is a good fit. in our example, rather than fitting the data to an exponential with mean 7, one may fit them to an
exponential with an arbitrarily selected mean. Then, the sign of a good fit would then be that the points lie on a
straight line. In Figure 1.4, the first four points of the plot lie more or less on a straight line, indicating a good fit,
even though the line is not the diagonal, but the point for the observation 15 is off the line. As long as points lie on a
straight line, even though it is not the diagonal, there is an exponential with some parameter whose quantiles will
match those points.
A special case of a qq plot is a normal probability plot, where the fitted distribution is a normal distribution. This *vs
is the most common type of qq plot.
Figure 1.5 shows how qq plots look with good and bad fits. For each plot, the fitted distribution is standard
normal, and the observations are 100 simulations from a distribution. In Figure 1.5a, the observations are standard
normal, and the fit is very good, a straight line. In Figure 1.5b, the observations are normal with mean 6 and
standard deviation 2, and the fit is just as good as the first one; only the scale of the y axis changes. You see how a
qq plot tells you that the distribution family you picked is a good fit, even if you don't know the parameters of the
i tted distribution.
f

Copyright Q2022 ASM
6 1. BASICS OF STATISTICAL LEARNING
0
T
1 1 1 1 1 1 I I I I I
-3 2 1 0 1 2 -3 -2 -1 0 1
Theoretical Quantiles Theoretical Quantiles
(a) Observations are standard normal (b) Observations are Normal(6,o = 2)
LC, -
'.l o 03 —
0
o
0
CD —
6'
o 0
Lo
0 —
vt —
1
0.1
/
—
00 C- o caum0101111.1111111
I I I i I I I I i r i
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
Theoretical Quantiles Theoretical Quantiles

(c) Observations are t, 3df (d) Observations are exponential, mean 1
Figure 1.5: Examples of qg plots

Copyright 02022 ASM
EXERCISES FOR LESSON 1 7
In Figure 1.5c, the observations are from a Student's t distribution with 3 degrees of freedom. The t distribution
is a symmetric distribution but has wider tails than a normal distribution, especially for low degrees of freedom. I
used R to draw these plots, and R automatically adjusts the axis scales unless you override it. Due to the outliers, the
limits of the y axis have been expanded. The fit is actually a 45° line in the middle. But the extreme values of the t
distribution cause the line to curve at the far left and far right. The intermediate quantiles of t are similar to those
of a normal distribution, but the very high and very low quantiles are respectively higher and Tower than those of a
normal, causing the pattern seen in that plot.
In Figure 1.5d, the observations are from an exponential distribution with mean 1. An exponential is not
symmetric; it is concentrated near 0 and is skewed towards the right. Its median is less than its mean. The low
quantiles of an exponential are very small relative to the symmetric normal's quantiles. The normal distribution
puts too much weight on the lower values and too little weight on the higher values, causing the upward facing
curve seen in this plot.
Exercises
1.1. 11-1/ Which of the following are categorical variables?

1. Legal representation
2. Settlement delay
3. Injury code
4. Claim count
5. Claim size
1.2. NI° A sample consists of the numbers {1,3,7,13,25,58,99)

Determine the lower fence of a box plot using Tukey's method.
Solutions
1.1. Legal representation is either "yes" or "no", so it is categorical. Injury code has a short discrete set of possible
values, so it is categorical. The other variables are not categorical. 1 and 3
1.2. The first quartile is 3 and the third quartile is 58. Then h = 1.5(58 — 3) = 82.5 and the lower fence is placed at
3 — 82.5 —79.5
Ready for more practice? Check out GOAL!

GOAL offers additional questions, quizzes, and simulated
exams with helpful solutions and tips. Included with GOAL
are topic-by-topic instructional videos! Head to
ActuarialUniversity.com and log into your account.

Part 1
Linear Regression
Lesson 2
Linear Regression: Estimating Parameters

Reading: Regression Modeling with Actuarial and Financial Applications 1.3, 2.1-2.2, 3.1-3.2; An Introduction to Statistical
Learning 3.1-3.2,3.3.2, 3.3.3
In a linear regression model, we have a variable y that we are trying to explain using variables xi, xk.1 We have
n observations of sets of k explanatory variables and their responses: {y1,x1,xi2, , xik} with i = 1,. ,n. We
would like to relate y to the set of xj, j = 1,. k as follows:
Yi = PO ± 131Xil P2Xi2 ± • • • ± PkXik Ei
where ei is an error term. We estimate the vector [3 = (n0,131, , fik) by selecting the vector that minimizes E7_1 gss
For statistical purposes, Ei is a random variable. We make the following assumptions about these random
variables:
1. E[r1] = 0 and Var(Ei) := 02. In other words, the variance of each error term is the same. This assumption is
called homoscedasticity (sometimes spelled homoskedasticity).
2. ei are independent.
3. Ei follow a normal distribution.
If these assumptions are valid, then for any set of values of the k variables {x1, x2, , xk}, the resulting value of y
will be normally distributed with mean /30 + pixi and variance 02. Moreover, the estimate of f3 is the maximum
ilkelihood estimate.
Notice that our linear model has k parameters f3i, P2, ,pk in addition to the constant fib.Thus we are really
estimating k + 1 parameters. Some authors refer to "k + 1 variable regression". I've never been sure whether this is
because k 1 ps are estimated or because the response variable is counted as a variable.
2.1 Basic linear regression

When k = 1, the model is called "basic linear regression" or "simple linear regression".2 In this case, the formulas '1/4411
for the estimators of Pb and ph are
A _ E(x1 k)(Yi — 9)
rl
Often we use Latin letters for the estimators of Greek parameters, so we can write bi instead of Pi.3
The formula for (3/ can be expressed as the quotient of the covariance of x and y over the variance of x. The
sample covariance is
E(xi — x)(yi —9)
evxy 11 — 1
and the sample variance is 'kJ"

'Regression Modeling with Actuarial and Financial Applications uses k for the number of variables, but An Introduction to Statistical Learning
uses p.
2Regression Modeling with Actuarial and Financial Applications calls it basic linear regression and An Introduction to Statistical Learning calls it
simple linear regression. As indicated in the previous paragraph, some authors call it "2 variable regression", and while this terminology is not
used by either textbook, you may find it on old exam questions.
'Regression Modeling with Actuarial and Financial Applications uses 14, while An Introduction to Statistical Learning uses
Exam SIZM Study Manual 11

Copyright 02022 ASM
12 2. LINEAR REGRESSION: ESTIMATING PARAMETERS
s2
n—1
The n — is cancel when division is done, so they may be ignored. Then equation (2.1) becomes
„
cpxy
1i= 2
Sx
•
You may use the usual shortcuts to calculate variance and covariance:
Cov(X, Y) = WY] — E[X] ELY]
Var(X) = E[X21 — E[X12
In the context of sample data, if we use the biased sample variance and covariance with division by n rather than
n —1 (It doesn't really matter whether biased or unbiased is used, since the denominators of the sums, whether they
are n or n 1, will cancel when one is divided by the other.), these formulas become
n r1 ti
E (Xj - .i)(yi — g) = Exiyi Z xinZ yi = E xiyi — nig

n n
ni2
n
E xi 1 yi
V.ixiYi n ri=i xiyi — nig
fii — =
vn= _ (E Xi)2 2 7=1 4 - n12

, n
Let sx, sy, be the sample standard deviations of x and y, and let rxy be the sample correlation of x and y, defined
as follows:
CVxy
YXy
Sx Sy
rxysxsy
From formula (2.1), we have pi = 2
, Or
Sx
Sy
-40 Pi rXy Sx
-
(2.3)
so Pi is proportional to the correlation of x and y.

EXAMPLE 2A "lie You are given the linear regression model yi = po + pixi Ei to fit to the following data:
x 2 2 3 4 5 5 7
y 34 38 38 53 50 60 70
Determine the least squares estimate of pi. •
SOLUTION: First we calculate E 4 and xiyi, then we subtract n22 and n,U. We obtain:
E 4 = 132
E xiyi = 1510
_
28
x= =4
_
343
y= = 49
E(xi )2 = 132 — 7(42) = 20

E(xf — 1)(yi — g) = 1510 — 7(4)(49) = 138
••
138
= 6.9
Pi =
Copyright 024322 ASM
2.1. BASIC LINEAR REGRESSION 13
Although not required by the question, we can easily calculate $o:
= 49— (6.9)(4) = 21.4
You would never go through the calculations of the previous example since your calculator can carry out the
regression. On the TI-30XS, use data, ask for 2-Var statistics. In those statistics, item D is 131 (with the unusual name
a) and item E is Po (with the unusual name b). You can try this out on this quiz:
Quiz 2-1 •N? For a new product released by your company, revenues for the first 4 months, in millions, are:
Month 1 27
Month 2 34
Month 3 48
Month 4 59
Revenues are assumed to follow a linear regression model of the form
yi = ei
where xi is the month and yi is revenues.

Estimate pi for this model.
More likely, an exam question would give you summary statistics only and you'd use the formulas to get #0 and
pi.
EXAMPLE 2B ¶Z For 8 observations of X and Y, you are given:
2=6 x=408 = 462
Perform a simple linear regression of Y on X:
Yi = Po + Pixi
Determine ijo. •
y, xiyi — n2y
SourrioN: 131 0
Ex —
462 — 8(6)(8) —065
408 — 8(62)
(io= y - = 8 — 0.65(6) =
The next example illustrates predicting an observation using the regression model.
EXAMPLE 2C Experience for four cars on an automobile liability coverage is given in the following chart:
Miles Driven 7,000 10,000 11,000 12,000
Aggregate Claim Costs 600 2000 1000 1600
A least squares model relates aggregate claims costs to miles driven.

Calculate predicted aggregate claims costs for a car driven 5000 miles. •

SOLUTION: We let xi be miles driven and y, aggregate claim costs. It is convenient to drop thousands both in miles
driven and aggregate claim costs.
7 + 10 + 11 + 12 0.6 + 2 + 1 + 1.6
=10 =1.3
—
4
y= 4
E 4 = 72 + 102 + 112 + 122 = 414 x,y; = (7)(0.6) + (10)(2) + (11)(1) + (12)(1.6) = 54.4
denominator = 414 — (4)(102) = 14 numerator = 54.4— (4)(10)(1.3) = 2.4
2.4 6 6 2900
1
14 35 po = 1300 — (-3) (10000) = 7
Notice that we multiplied back by 1000 when calculating ijo.

The predicted value is therefore —.2es + A.(5000) 442.8571 0
The fitted value of yi, or fio + 1x1, is denoted by gi. The difference between the actual and fitted values of
•-: yi, or ei= y —9j, is called the residual. As a result of the equations that are used to solve for #, the sum of the residuals
Z;Li ei on the training set is always 0. As with pi, we may use Latin letters instead of hats and denote the residual by
ei.
2.2 Multiple linear regression

We started the lesson with this equation:
Yi = Po pixii "
13kxik + Li
Let's now discuss multiple regression, the case when k > 1. We then have k explanatory variables plus an
*41 intercept and n values for each one. We can arrange these into an n x (k + 1) matrix:
1 X X12 Xlk
X21 X22 X2k
x=
X ni Xn2 Xnk
13o\
( . and y = IYY21).
31'
Notice how the intercept was turned into a variable of Is. Set /3 = Then equation (*) can be
Pk / kY
written like this:
X/3 = y
• X is called the design matrix.4 The generalized formulas for linear regression use matrices. We will use lower case
boldface letters for column and row vectors and upper case boldface letters for matrices with more than one row
• and column. We will use a prime on a matrix to indicate its transpose. The least squares estimate of fl is
p (XX)-1X'y (2.4)
and then the fitted value of y is9= X. I doubt you'd be expected to use formula (2.4) on an exam, unless you were
given (X/X)-1, since it involves inverting a large matrix. In fact, I doubt you will be asked any questions requiring
matrix multiplication.
The (X'X)-1 matrix is singular (non-invertible) if there is a linear relationship among the column vectors of X.
Therefore, it is important that the column vectors not be collinear. Even if the variables are only "almost" collinear,
9'1? the regression is unstable. We will discuss tests for collinearity in Section 5.3.
'The reason for this name is that in some scientific experiments, the points x are chosen for the experiment. But this will generally not be the
case for insurance studies.

2.3. ALTERNATIVE MODEL FORMS 15
As with simple linear regression, the sum of the residuals is 0.

When an explanatory variable is a categorical variable with m possible values, you must include m — 1 indicator k:
variables in the model. Sometimes indicator variables are called "dummy variables". Each indicator variable
corresponds to one possible value of the categorical variable. It is equal to 1 if the variable is equal to that value, 0
otherwise.
For example, if one of the explanatory variables is sex (male or female), you would set up one indicator variable
for either male or female. If the indicator variable is for female, it would equal 0 if male and 1 if female. If one of
the explanatory variables is age bracket and there are 5 age brackets, you would set up 4 indicator variables for 4 of
the 5 age brackets. Notice that if you set up 5 variables, their sum would equal 1. The sum would be identical to
xo, the first column vector of X, resulting in a linear relationship among columns of the matrix, which would make
it singular. Thus one variable must be omitted. The omitted variable is called the base level or reference level. You s:
should select the value that occurs most commonly as the base level. If you select a value that is almost always 0,
then the sum of the other indicator variables will almost always be 1, making the computation of the inverse of VX
less stable.
A special case is a variable with only two categories. The indicator variable is then a binary variable.
2.3 Alternative model forms
Even though regression is a linear model, it is possible to incorporate nonlinear explanatory variables. Powers of
variables may be included in the model. For example, you can estimate
Yi = Po + flirt]. + p24. + /334 + £i

You can include interaction between explanatory variables by including a term multiplying them together: "45
Another possibility is a regression with an exponential:
yi =Po +Piex' + el
Linear regression assumes homoscedasticity, linearity, and normality. If these assumptions aren't satisfied,
sometimes a few adjustments can be made to make the data satisfy these conditions.
Suppose the variance of the observations varies in a way that is known in advance. In other words, we know
that Var(Ei) = c2/w, with wi varying by observation, although we don't necessarily know what cy2 is. Then wi is
the precision of observation i, with wi = 0 for an observation with no precision (which we would have to discard) "41
and WI co for an exact observation. We can then multiply all the variables in observation i by v/7... After this
multiplication, all observations will have the same variance. Let W be the diagonal matrix with wi in the ith position
on the diagonal, 0 elsewhere. Then equation (2.4) would be modified to
13 = (X'WX)-1X'Wy (2.5)
The estimator ir is called the weighted least squares estimator. •41"

One may also transform y to levelize the variance or to remove skewness. If variance appears to be proportional
to y, logging y may levelize the variance:
In Yi = Po + Plx; + Ei
which is equivalent to
130-1-gixi
Yl = e
In this model, In yi is assumed to have a normal distribution, which means that yi is lognormal. A lognormal •411
distribution is skewed to the right, so logging y may remove skewness.
A general family of power transformations is the Box-Cox family of transformations: •41

Copyright C2022 ASM
yA _ 1
1
A °
y(A) = (2.6)
my A=0
This family includes taking y to any power, positive or negative, and logging. Adding a constant and dividing by
a constant does not materially affect the form of a linear regression; it merely changes the intercept and scales the
13 coefficients. So (yA — 1)/A could just as well be yA. The only reason to subtract 1 and divided by A is so that as
A —> 0, (yA — 1)/A ln y.
I doubt that the exam will require you to calculate parameters of regression models. Do a couple of the calculation
exercises for this lesson just in case, but don't spend too much time on them.
Table 2.1: Summary of Linear Model Formulas
For a simple regression model yi = f30 f3ixi Ei
PO y — filx (2.2)
=
(2.1)
z(xi — x.)2
Y
ht.=,xy—s x
(2.3)
For a multiple variable regression model

A. pcxylry (2.4)
For any regression
ei = 0
For a weighted least squares model

# = wwxylxvy (2.5)
Box-Cox power transformations

A
Y 3 n
Y
—
A (2.6)
my A=0
Exercises
2.1. •-11 You are given the linear regression model yi = Po +pixi + ei to fit to the following data:
x —2 —1 0 1 2
y 3 5 8 9 10
Determine the least squares estimate of Po-
Exam SRM Study Manual Exercises continue on the next page . . .

Copyright 02022 ASM
2.2. 1rYou are fitting a linear regression model yi = po + 13ixi + Ei to 18 observations.

You are given the following:
(i) E1/21 xi = 216

(ii) Z 4 = 3092
yi = 252
(iv) ZPiy = 4528
(v) xiyi = 3364
Determine the least squares estimate of pi.
2.3. al [SRM Sample Question #171 The regression model is y = P0 +16ix + E. There are six observations.
The summary statistics are:
y=8.5 Exi= 6 E x=16 xiyi = 15.5 y=17.25
Calculate the least squares estimate of Pi.

(A) 0.1 (B) 0.3 (C) 0.5 (D) 0.7 (E) 0.9
2.4. swii [SRM Sample Question #47] You are given the following summary statistics:
= 3.500
g 2.840
Z(xi — 2)2 = 10.820

E(xi — 2)(yi — 9) = 2.677
Z(yi — 9-)2 = 1.125
Determine the equation of the regression line, using the least squares method.
(A) y = 1.97 + 0.25x
(B) y = 0.78 + 0.59x
(C) y = 0.57 + 0.65x
(D) y = 0.39 + 0.70x
(E) The correct answer is not given by (A), (B), (C), or (D).
2.5. (SRM Sample Question #53] Determine which of the following statements is NOT true about the equation
=
PiX + e
(A) pa is the expected value of Y.
(B) pi is the average increase in Y associated with a one-unit increase in X.
(C) The error term, E is typically assumed to be independent of X and Y.
(D) The equation defines the population regression line.
(E) The method of least squares is commonly used to estimate the coefficients 130 and Pi.
Exam SRM Study Manual Exercises continue on the next page. ..

2.6. •41/ [SRM Sample Question #23] Toby observes the following coffee prices in his company cafeteria:
• 12 ounces for 1.00
The cafeteria announces that they will begin to sell any amount of coffee for a price that is the value predicted
by a simple linear regression using least squares of the current prices on size.
Toby and his co-worker Karen want to determine how much they would save each day, using the new pricing, if,
instead of each buying a 24-ounce coffee, they bought a 48-ounce coffee and shared it.
Calculate the amount they would save.
(A) It would cost them 0.40 more.
(B) It would cost the same.
(C) They would save 0.40.
(D) They would save 0.80.
(E) They would save 1.20.
2.7. "I [MAS-I-F18:29] An ordinary least squares model with one variable (Advertising) and an intercept was fit
to the following observed data in order to estimate Sales:
Observation Advertising Sales '

1 5.5 100
2 5.8 110
3 6.0 112
4 5.9 115
5 6.2 117
Calculate the residual for the third observation.
(A) Less than —2

(B) At least —2, but less than 0
(C) At least 0, but less than 2
(D) At least 2, but less than 4
(E) At least 4
2.8. Nr You are fitting the linear regression model yi = po + pixi + e; to the following data:
x 2 5 8 11 13 15 16 18
y —10 —9 —4 0 4 5 6 8
Determine the least squares estimate of pi.
2.9. 64° You are fitting the linear regression model yi = po + 13xi Ei to the following data:
x 3 5 7 8 9 10
y 2 5.7 8 9 11
Determine the fitted value of y corresponding to x = 6.

2.10. '45 You are fitting the linear regression model pi = po+pixi+ E . You are given:
(i) E 1 xi = 392
(ii) y; = 924
(iii) a3.1 xiy; = 13,272
(iv) pp := —23
Determine En, X.
2.11. ev? [3-F84:51 You are fitting the linear regression model y; =Po +pix; + E; to 10 points of data. You are given:
y; = 200
xiy; = 2000
E 4 = 2000
E y 5000
Calculate the least-squares estimate of pi.
(A) 0.0 (13) 0.1 (C) 0.2 (D) 0.3 (E) 0.4
2.12. [31,-505:27] Given the following information:
E= 144
y; = 1,742
E 4 = 2,300
E = 312,674
xiy; = 26,696
71 12
Determine the least squares equation for the following model:
Yi = Po + Aix; + E;
(A) 9 = —0.73 + 12.16x;

(13) 9; = —8.81 + 12.16x;
(C) 9= 283.87 + 10.13x;
(D) 9 = 10.13 + 12.16x;
(E) pi = 23.66 + 10.13x;
Exam SIZM Study Manual Exercises continue on the next page . . .

2.13. [120-F90:6] You are estimating the linear regression model yi = flo + f3ixi + Ei. You are given
i 1 2 3 4 5
xi 6.8 7.0 7.1 7.2 7.4
yi 0.8 1.2 0.9 0.9 1.5
Determine jdi•
(A) 0.8 (B) 0.9 (C) 1.0 (D) 1.1 (E) 1.2
2.14. LA? [120-S90:11] Which of the following are valid expressions for b, the slope coefficient in the simple linear
regression of y on x?
I.
(A) I and IT only (B) I and III only (C) II and III only (D) I, II and III
2.15. s: [Old exam] For the linear regression model yi = 130 + pixi + E with 30 observations, you are given:
(i) rxy = 0.5
(ii) sx = 7
(iii) sy = 5
where rxv is the sample correlation coefficient.
Calculate the estimated value of Pi.
(A) 0.4 (B) 0.5 (C) 0.6 (D) 0.7 (E) 0.8
2.16. 1L [110-S83:14] In a bivariate distribution the regression of the variable y on the variable x is 1500 + b(x — 68)
for some constant b. If the correlation coefficient is 0.81 and if the standard deviations of y and x are 220 and 2.5
respectively, then what is the expected value of y, to the nearest unit, when x is 70?
(A) 1357 (B) 1515 (C) 1517 (D) 1643 (E) 1738
2.17. Nr [120-82-97:7] You are given the following information about a simple regression model fit to 10 observa-
tions:
E xi =20
E yi = 100
s, = 2
sy = 8
You are also given that the correlation coefficient rxy = —0.98.
Determine the predicted value of y when x = 5.
(A) —10 (B) —2 (C) 11 (D) 30 (E) 37
Exam SRM Study Manual Exercises continue on the next page ..

2.18. s: In a simple regression model yi = flo + 131x1 + ci, you are given
y; = 450
xiyi = 8100
y5 =40
Calculate the fifth residual, 25.
2.19. [120-F89;13] You are given:
Period y x1 x2
1 1.3 6 4.5
2 1.5 7 4.6
3 1.8 7 4.5
4 1.6 8 4.7
5 1.7 8 4.6
• You are to use the following regression model:
Yi = Po + p2xi2+ i= l,2,...5
You have determined:

1522.73 26.87 -374.67
(X1X)-1 = ( 26.87
-374.67
0.93
-7.33
-7.33
93.33
Calculate e2.
(A) -0.2 (B) -0.1 (C) 0.0 (D) 0.1 (E) 0.2
2.20. al You are fitting the following data to a linear regression model of the form yi = po+13ix1i+132 X12 +133Xi3 Ei:
y 5 3 10 4 3 5
X1 0 1 0 1 0 1
X2 1 0 1 1 0 1
X3 0 1 1 0 0 0
You are given that

26 -10 18 12
(-10
20 0 0
pex 1 -1 = 30
I 1 -18 0 24 6
-12 0 6 24
Determine the least squares estimate of

Copyright 02022 ASM
2.21. e'llf [120-82-94:11] An automobile insurance company wants to use gender (xi = 0 if female, 1 if male) and
traffic penalty points (x2) to predict the number of claims (y). The observed values of these variables for a sample \---)
of six motorists are given by:
Motorist xi x2 y
1 0 0 1
2 0 1 0
3 0 2 2
4 1 0 1
5 1 1 3
6 1 2 5
You are to use the following model:
yi = Po + + P2xi2 + Ei, = 1, 2, ... 6
You have determined

7 -4
1
(rX)-1 = —
12
(-4
-3
8
0 3 3)
0
(13) 0.25 (C) 1.25 (D) 2.00 (E) 4.25
2.22. %I' You are fitting the following data to the linear regression model yi = Po + Pixii + P2Xi2 P3Xi3 Ei:
y 1 2 6 5 1 2 3
Xi 0 0 1 -1 0 1 1
X2 0 -1 0 0 1 -1 0
X3 1 1 4 0 0 0 1
You are given that

7 0 1.5 -2.5
11 0 1.2 3 -3
(VX)-1 = —
30 1.5 3 11.25 -0.75 '
-3 -0.75 3.25
Determine the fitted value of y for xi = x2 = x3 = 1.
Exam SRM Study Manual Exercises continue on the next page . .

2.23. '41 [Old exam] You are examining the relationship between the number of fatal car accidents on a tollway
each month and three other variables: precipitation, traffic volume, and the occurrence of a holiday weekend during
the month. You are using the following model:
where
y = the number of fatal car accidents
xi = precipitation, in inches
x2 = traffic volume
X3 = 1, if a holiday weekend occurs during the month, and 0 otherwise
The following data were collected for a 12-month period:
Month y xi x2 x3
1 1 3 1 1
2 3 2 1 1
3 1 2 1 0
4 2 5 2 1
5 4 4 2 1
6 1 1 2 0
7 3 0 2 1
8 2 1 2 1
9 0 1 3 1
10 2 2 3 1
11 1 1 4 0
12 3 4 4 1
257 -82
QCX)-1 = —
(
6506-446
82 254
-364 -446)
-364
2622)
(B) 0.15 (C) 0.24 (D) 0.70 (E) 1.30
2.24. el [S-F13:33] You are given a regression model of liability claims with the following potential explanatory
variables only:
• Vehicle price, which is a continuous variable modeled with a third order polynomial
• Average driver age, which is a continuous variable modeled with a first order polynomial
• Number of drivers, which is a categorical variable with four levels
• Gender, which is a categorical variable with two levels
• There is only one interaction in the model, which is between gender and average driver age.
Determine the maximum number of parameters in this model.
(A) Less than 9 (B) 9 (C) 10 (D) 11 (E) At least 12

Copyright 02022 ASM
2.25. %IP [MAS-I-S18:371 You fit a linear model using the following two-level categorical variables:
1 if Account
=
0 if Monoline
1 if Multi-Car
0 if Single Car
with the equation

E[Y] = po +pixi +fl2x2+133X1X2
This model produced the following parameter estimates:
Po = —0.10
pi = —0.25
02 = 0.58
03 = —0.20
Another actuary modeled the same underlying data, but coded the variables differently as such:
0 if Account
=
1 if Monoline
{0 if Multi-Car
X2 =1 if Single Car
with the equation
E[Y] = ao aiXi + a2X2 + a3X1X2
Afterwards you make a comparison of the individual parameter estimates in the two models.
Calculate how many pairs of coefficient estimates (ai,13 i) switched signs, and how many pairs of estimates stayed
identically the same, when results of the two models are compared.
(A) 1 sign change, 0 identical estimates
(B) 1 sign change, 1 identical estimate
(C) 2 sign changes, 0 identical estimates
(D) 2 sign changes, 1 identical estimate
Exam SRM Study Manual Exercises continue on the next page

l
f 2.26. •-(11 [114AS-1-S19:32] You are fitting a linear regression model of the form:
y = Xf3 £;e N(0, a2)
and are given the following values used in this model:
1 0 1 9
1 1 1 15 32 2 3 32 1.38 0.25 0.54 -0.16
C21 (3 51 )'
1 1 1 8 19 2 4 4 36 0.25 0.84 -0.20 -0.06
X =
0 1 1 7 17
; X'X =
3 4 6
x,x-i = 0.54 -0.20 1.75 -0.20
0 1 1 6 15 32 36 51 491 -0.16 -0.06 -0.20 0.04
0 0 1 6 15
0.684 0.070 0.247 -0.171 -0.146 0.316

0.070 0.975 -0.044 0.108 -0.038 -0.070
0.247 -0.044 0.797 0.063 0.184 -0.247
H = X(XX)-1X' = -0.171 0.108 0.063 0.418 0.411 0.171
-0.146 -0.038 0.184 0.411 0.443 0.146
k 0.316 -0.070 0.247 0.171 0.146 0.684
f20.93\
0.297 32.03
(-37: ; X(X'X)-1X'y =
19.04
(X1X)-1Xly = 16.89
; a-2 = 0.012657
1.854 15.04
\15.07
Calculate the modeled estimate of the intercept parameter.

(A) Less than 0
(B) At least 0, but less than 1
(E) At least 3

2.27. '41 [MAS-I-519:291 Tim uses an ordinary least squares regression model to predict salary based on Experience
and Gender. Gender is a qualitative variable and is coded as follows:
Gender =
11 if Male
0 if Female
His analysis results in the following output:
Coefficients Estimate Std. Error t-value Pr(> it i)

Intercept 18169.300 212.2080 85.62027 2.05E-14
Experience 1110.233 59.8224 18.55881 1.75E-08

Gender 169.550 162.9177 10.38285 2.62E-06
Abby uses the same data set but codes gender as follows:
Gender =
{1 if Female
0 if Male
Calculate the value of the Intercept in Abby's model.

(A) At most 18,169.3
(B) Greater than 18,169.3, but at most 18,400.0
(C) Greater than 18,400.0, but at most 18,600.0
(D) Greater than 18,600.0
(E) The answer cannot be computed from the information given
Solutions
2.1.
2.2.
z(Xi 2) 3092 2162 =, 500
18
z(xi_.,_)(yi_g). 3364 340

(2161)8(252) = 340
Ii =
_
0.68
500
2.3. The least squares estimate of 131 is the covariance of x and y divided by the variance of x. In the following
calculation, the numerator is n times the covariance and the denominator is n times the variance; the ns cancel. We
have n = 6 observations.
0.7 (D)
2.4. Using equations (2.1) and (2.2),

2.677
0.2474
pi =10.820
EXERCISE SOLUTIONS FOR LESSON 2 27
= 2.840 — 0.2474(3.500) = 1.9741 (A)
2.5. 130 is not the expected value of Y. In fact, po = Y — pi X. (A)

2.6.
The observations already lie in a straight line; each 4 ounce increase raises the price 0.20. The slope is therefore
0.2/4 = 0.05 and the intercept (using 12 ounces = 1 = 0.05(12) + 130) is 0.4. By buying 48 ounces, one intercept, or
0.40 , is saved. (C)
2.7. An exam question like this asking you to carry out a linear regression is rare. You can carry out a linear
regression on your calculator without knowing the formulas. But anyway, here is the calculation, with X being
advertising and Y being sales.
6.38
h = 23.806
= 0.268
29 4
554 — 23.806
a = —
5 5 ) = —29.179
= —29.179 + 23.806(6.0) = 113.657
The third residual is 112 — 113.657 = —1.657. (B)
2.8. In the following, on the third line, because 9 = 0, z(xi — 2)(yi — 9) = E(xi — )7)N.
= 11
E 4 = 1188 E(x, _1)2 = 1188 — 8(112) = 220

y=0 E(xi — 1)(yi — 9) = 2(-10) + 5(-9) +... + 18(8) = 270
270
a 1.2273
= 220 —
2.9. 1=9=7
E(xi — 1)2= 42+ 22+02+12+22+32...34

xiyi = (3)(2) + (5)(5) + (7)(7) + (8)(8) + (9)(9) + (10)(11) = 335
E(xi i)(yi — 9) = 335 — 6(7)(7) = 41

41
lu =
f
41 49
Po = 7 — (7) =
49
(41) _ 197 _
924
5.7941
2.10. if
= = 33
392
=
= 14
28
(xi — 2)(yi — 9) = 13272 — 28(33)(14) = 336

=
+ /312, so
33 = —23 + (14)
9)
fi t = 4 = Dx1 "Yi
—
Dxi _07)2 _ Dri — 4 4

_ 84
E 4 = E(xi — 2)2 ± 2812 = 84 + 28(142) = 5572

E xiEyi
2.11.
ZOci 1)(yi = xiy, 10
(200)(100) —0
= (2000) 10
It doesn't matter what the denominator E(xi — 2)2 is; fit = E(xi 1)(yi — 9)/ z(xi —2:)2 =E. ( A)
2.12. By equation (2.1)
12(26,696) — (144)(1,742) = 69,504 = 10.12587

12(2,300) — 1442 6,864
1,742
Po = 10.12587 4 = 145. 1667— 10.12587(12) = 23.6562 (E)
12
2.13.
Exi =35.5
E 4 = 252.25
Ecx, _ 252.25 552
5
— 0.2
—
yi =5.3
xiyi = 37.81
E(xi — i)(yi — 9) = 37.81 (35.5)(53)—

5
0• 18
0.18
0.9 (B)
2.14. The first one is correct, since it is equivalent to our formula
z x,y, E XlnE YI
x? (E XI)2
fl
The second one is incorrect since 12 in the denominator should be multiplied by n.

The denominator of the third one is E(xi — 1)2, like our formula. The numerator is the same as I, which is
correct as we said above. (B)
2.15. Using equation (2.3),
0.3571 (A)
2.16. Use equation (2.3).

Sy 220
b=
s, = 0.81 (.1 = 71.28
2.5
The predicted value is 1500 + 71.28(70 — 68) = 1642.56 (D)

2.17. Let the predicted value of y be y5.

8
pi= = -3.92 by equation (2.3)
100
So = 3.92.t = -
10 + 3.94-20) = 17.84
10
y5 = 17.84 - 3.92(5) = -1.76 (B)
2.18.
8100- (30)(450)/15 2
lfl = 347
270 - 302/15
450
g =
347.2 (30715-) = -38-)

E5 = 40- (-314 + 34(3)) = -2427
2.19. We calculate
1.3
1 1 1 1
81 1.581. 577.93
X'y = ( 64.5 4.6
7 7
4.5
8
4.7
4.6) 1.6
1.7
=
36.19)
fi = (VX)-1X'y =
( 9.9107 )
0.2893
-2.2893
Then e2 = 1.5 - 9.9107 - 0.2893(7) + 2.2893(4.6) = 0.09498 .(D)
(130)2
„
2.20. A
= 24
13
Si = -2
TO 114
72 )
96
2.21. The first coefficient of X'y is the sum of y, or 12. The second is 1 + 3 + 5 = 9 (not needed because (X/X)il = 0),
and the third is 2(2) + 1(3) + 2(5) = 17. Then
1 15
P2 = - ((-3)(12) + 3(17)) = - =
1.25 (C)
12
2.22.
60.5
'3 = ocxylx, y = .:,

y(1, 1,1) = th(60.5 - 91.8 - 8.25 + 31.75) =
(19812'58)
31.75
-0.26

2.23. A little unusual not to have an intercept term f30, but the formulas are the same as usual.
We need to compute X'y:
3
1
2
4
3 2 2 5 4 1 0 1 1 2 1 4
1
X'y = (11
1
1
1
0
2
1
2
1
2
0
2
1
2
1
3
1
3
1
4
0
4
1
3
2
0
= (557)1
20
2
1
Then we multiply the first row of (X'X)-1 by X'y to get the first coefficient of the P's,
257(57) - 82(51) - 446(20) _ 1547 _ 0.2378 (C)
6506 - 6506 -
2.24. A third order polynomial has 3 parameters that are multiplied by x, x2, and x3. A categorical variable with n
levels has n - 1 parameters. Thus there are 3 parameters for vehicle price, 1 for driver age, 3 for number of drivers, 1
for gender, and 1 for interaction. That sums up to 9. Add the intercept, and there are a total of [id] parameters. (C)
2.25. Since the model must produce the same results regardless of the values of the Xis, products of parameters
and variables must be the same. Expressing the second model in terms of the first,
E[Y] = ao + a1(1 - X1) + a2(1 - X2) + a3(1 - X1)(1 - X2)
ao + al + a2 + a3 + (-al - a3)Xi + (-a2 - a3)X2 + a3X1 X2
We see that a3 = p3, but the relationships of the other parameters is not a simple sign change. (E)
2.26. By formula (2.4), the modeled estimate of all parameters is (XTX)-1xTy-.Here the intercept is the third
variable, since the column of X that is all is is the third column, so the estimate of the intercept parameter is 3.943
(E)
2.27. In Tim's model, salary level is 18,169.3 + 1110.233xi + 169.55 for males and 18,169.3 + 1110.233x1 for females.
Abby's model must produce the same result, and for males she has bo + 1110.233xi, so bo = 18,169.3 + 169.55 =
18,338.85 (B)
Quiz Solutions
2-1.
Exi= 1 + 2 + 3 + 4 = 10
E4=,2+22.+32+42=30
Ey1= 27+34+48+59 = 168
xiyi = 27+ 2(34) + 3(48) + 4(59) = 475
475 - (10)(168)/4 11
30 - 102/4

Lesson 3
Linear Regression: Standard Error, R25 and

t statistic
Reading: Regression Modeling with Actuarial and Financial Applications 2.3-2.5, 3.2-3.4; An Introduction to Statistical
Learning 3.1.2-3.1.3, 3.2.2-3.2.3
3.1 Residual standard error of the regression

In a linear regression model with no variables other than the intercept,
Yi = Po + Li
the least-squares estimate of 130 is 9. To measure the improvement in the model from adding variables, we measure
the error of the model against the error in a model with just an intercept. To do this, we partition the difference
between yi and the sample mean into the difference between yi and the fitted value pi, and the difference between
the fitted value gi and the sample mean:
yi - = (yi pi) + (pi - p)
The second summand is the part explained by the explanatory variables and the first summand is the remaining
error. Squaring both sides and adding up over all i, we get
11
Z (YI g)2 =
(3.1) •41
i.1
Total sum of squares Error sum of squares Regression sum of squares
In this expression, we omit the cross term 2 E/1_1(y; 9i)(9i — 9) = 2 E;'_i — 9). In the sidebar, we show that
this cross term is 0.
We will use the notation Total SS for total sum of squares, Regression SS for regression sum of squares, and
Error SS for error sum of squares. In the old days, we used to use the abbreviations RSS for regression sum of
squares, ESS for error sum of squares, and TSS for total sum of squares. But then the R language came along and
used RSS for residual sum of squares (= error sum of squares)! An Introduction to Statistical Learning, a book closely
tied to R, uses RSS for residual sum of squares and TSS for total sum of squares, and if the regression sum of squares
is needed, it writes TSS — ESS.' In some non-syllabus books, you may also see SSE (sum of squared errors) used for
what we are calling Error SS.
An alternative formula for Total SS is
/I
Total SS = — n92
Recall that the sum of the residuals is 0, which means that E ei = E(yi = 0, so z pi = E yi ng. Therefore, a
formula similar to the one for Total SS is available for the regression sum of squares:
Regression SS = n g2
fl II wonder why it doesn't use ESS, for "explained sum of squares".
Exam SRM Study Manual 31

32 3. LINEAR REGRESSION: STANDARD ERROR, R2, AND t STATISTIC
Proof that the cross term in equation (3.1) is 0
To derive the least square estimators of pi, we partially differentiate
with respect to every pi, and set the result equal to 0. Thus
k
1=1
( yi —Z fiixii
i.o ) =0
E Xiiti = 0
1=1
for every j. And for the intercept (j = 0), xii = 1, so riLi = 0. Now,
• E7,191.; = 91 ê1 = 0 because gi = 0.
. 9, fif=ijxq. And ri1 jXjj= 0 for every j. So

11
eigj = E Pixq = 0
It follows that the cross term 2 (9i — is 0.
but this formula is not as useful as the one for Total SS since you hardly ever are given 9.
The total sum of squares has n —1 degrees of freedom. One degree of freedom is lost because the sum of squares
is calculated as differences from the sample mean rather than the unknown true mean. The regression sum of
squares has k degrees of freedom, one for each variable not counting the intercept. The error sum of squares has the
remainder of the degrees of freedom, or n — k — 1 degrees of freedom. The quotient of the error sum of squares over
al its number of degrees of freedom is the mean squared error(MSE) of the regression. The square root of this quotient
0
: is called the residual standard error (RSE). It is also called the residual standard deviation (s), or the standard error of the
regression.2 Thus
$ = RSE = =
1./ Error SS
n—k—1
(3.2)
Notice that we divide by n — k — 1, not by n, to calculate MSE. Contrast this with equation (1.1), where MSE is
calculated with division by n. In that context, MSE is calculated using test data, data that was not used to fit the
model, so no degrees of freedom are lost.
When the model is estimated from data and residuals are calculated based on the model's output from
that data, degrees of freedom are lost. You divide by a number less than the number of points in the data.
But when the model is estimated from data, then applied to different data and residuals are calculated
2Regression Modeling with Actuarial and Financial Applications calls it s, the residual standard deviation; An Introduction to Statistical Learning
calls it RSE, the residual standard error.

Copyright 02022 ASM
3.2. R2: THE COEFFICIENT OF DETERMINATION 33
based on the model's output from that different data, no degrees of freedom are lost. You divide by the number
of points in the different data.
Even though s is called the residual standard deviation, it is not the standard deviation of the residuals. It is the
standard deviation of what you are modeling, Y. So it is the standard deviation of the real residual, of e without a
hat. We will discuss the standard deviation of the residuals—the standard deviation of P with a hat—later.
The sum of squares information is summarized in an analysis of variance (ANOVA) table. The table looks like °:
this:3
Source Sum of Squares df Mean Square

Regression Regression SS k Regression SS/k
Error Error SS n — k —1 Error SS/(n — k 1)
Total Total SS n—1 Total. SS/(n — 1)
s2 is an unbiased estimator of a2, the variance of the response variable y or the variance of ci. *4'
Quiz 3-1 In a simple linear regression

(i) — 9)2 = 884
(ii) The standard error of the regression is 5.
Calculate the regression sum of squares.
3.2 R2: the coefficient of determination

One measure of fit for a regression is the proportion of the deviation from the mean that is explained by the
regression.
We define
Regression SS Error SS
R2 =
=1
Total SS Total SS
as the coefficient of determination. R2 is the proportion of the sum of squares explained by the regression.
EXAMPLE 3A `411
You are fitting a linear regression model yi = 130+131x, + ei to 18 observations.
You are given the following information:
• x=5
• V-81 4 = 480
•y=4
• Zri = 1056
•
xiyi = 480
You are given that the error sum of squares is 288.
Calculate the coefficient of determination, R2. •
'In most textbooks, although not in the textbooks on the syllabus, an ANOVA table has an additional column with an F statistic. We will
discuss the F statistic later.

Copyright 02022 ASM
SOLUTION: The total sum of squares is

18
Total SS = 1892 = 1056 — 18(42) = 768
Therefore, R2 = 1 — Error SS/Total SS 288/768 = 0.625
Quiz 3-2 e: For a linear regression, you are given:

(i) E(yi -9i)2 = 285
(ii) — y)2= 122
Compute R2.
In a simple linear regression model, R2 is the square of the correlation between x and y; in other words
R2 = (E(xi —1)(yi — g))2 (3.4)

E(xi — 1)2 Z(yi — 9)2
If k > 1, the square root of R2 is called the multiple correlation coefficient. R2 is the square of the correlation of yi and
in other words, the square of the correlation between the true value of y and the fitted value of y.
Quiz 3-3 6: You are fitting a linear regression model yi = po + p1X; + ei to 10 observations.
48
• Vi (.1/i 9)2 = 336

• En (xi — .17)(yi—9)= —120
Calculate R2, the coefficient of determination for the regression.
R2 is an intuitive way to assess the quality of the model, but it has two disadvantages:
1. Adding more variables to the model always increases R2, no matter how irrelevant the variables are.
2. Its sampling distribution is hard to determine, so the R2 statistic cannot be evaluated statistically. There is no
objective way to state a critical value for it.
To address the first disadvantage, an adjusted R2 is often used to compare models with different numbers of
variables. We will discuss adjusted R2 in Section 7.2.
3.3 t statistic
.4' The linear regression estimator b of 13 is an unbiased estimator of 13. As with all estimators, b is a function of the
observations, which are random variables, and is therefore a random variable. It has the minimum variance of all
NI° unbiased estimators. The covariance matrix of b is

3.3. t STATISTIC 35
The diagonal elements of the matrix are the variances of the components of b. Since a2 is unknown, we use s2,
the square of the standard error of the regression, instead of 0-2 in order to estimate the variance of b. Therefore
the variance of bi is estimated as s = s2lpi, where tpi is the i + 15t diagonal element4 of (X/X)-1. The square root of
the ith component of the estimated variance of b is called the standard error of bi, and is sometimes denoted se(bi), •41
although we will usually use sbi•
For a simple linear regression model yi = 130 + /3ixi, it is easy to compute E = (X0C)-1. The components of this
matrix are
-2
E X? 1
= =
(3.6)
n E(xi - )2 n E(x1 -
1
En = -
(3.7)
E12 = (3.8)
(x— ±-)2
The covariance matrix of (b0, hi) is the product of a2 and E, and we estimate cr2 with s2 so we have the following
approximate covariance matrix .41
COV(bOr bl) 52
(E(x—i i")2)
nAn alternative formula for sb, is
(3.12)
EXAMPLE 3B For a simple linear regression based on 50 observations, you are given
(1) The sample variance of x is 108.
(ii) The residual sum of squares (RSS) is 234.
Calculate the variance of the estimator for pi. •
SournioN: The mean square error is s2 = RSS/(n - 2) = 234/48 := 4.875. The sum of square differences of xi from
its mean is
Z (xi — = s(n —1) = (108)(49) = 5292

Using formula (3.10):
1
= 4.875(-) = 5292
0.0009212
Alternatively, using formula (3.12),

2
s2b1 =
875 ) 0.0009212
To test the null hypothesis thatf3 = 0, we use the I statistic basbi, which has n - k 1 degrees of freedom. More •41
generally, to test the null hypothesis that pi = fl*, we use
b1-
tn-k-1 = (3.13)
Sbi
4The rows and column of the matrix go from 1 to k +1 while the components of b are subscripted from 0 to k, necessitating adding 1 to i
when we refer to matrix elements.
Exam 5RM Study Manual

A 100q% confidence interval for fj may be constructed as bi tsb,, where t is the 100(1 + q)/2 percentile of a
t-distribution with n - k - 1 degrees of freedom.
Note that the t distribution tables you get at the exam show percentiles of the t distribution. For a 2-sided
interval with significance a or confidence level 1 - a, you must use the (1 - a)/2 percentile of the t distribution so
that the region outside the interval has probability a. For example, for 5% significance, the critical value is in the
t0.025 column.
EXAMPLE 3C A linear regression model based on 20 observations has 2 explanatory variables and an intercept.
You are given:
0.5 0.2
(i) (X/X)-1 =
( 0.2
-0.1
1.2
0.6
0.6
2.7 0.1)
(ii) The residual standard error is 1.5.
(iii) b2 = 6.4
Using the t statistic, determine an interval for the p-value of b2 * 0. •
SourrioN: The standard error of b2 is 1.5\1 = 2.465. The t statistic is 6.4/2.465 = 2.597. It has 20 - (2 + 1) = 17
degrees of freedom. This is between the 2% significance level (critical value 2.5669) and the 1% significance level
0
(critical value 2.8982), so the p value is between 1% and 2%.
3.4 Added variable plots and partial correlation coefficients
variable and any one of the explanatory variables, and verify the linear relationship. However, this relationship may
be distorted due to other variables in the model.
The example in Regression Modeling with Actuarial and Financial Applications models refrigerator prices in terms
of refrigerator size, features, and energy cost. One would expect that a more efficient refrigerator, one with a lower
energy cost, would have a higher price, all other things being equal. And indeed, the regression shows that the
coefficient of energy cost is negative. Yet, a scatter plot of price on energy cost has a positive slope. The reason for
this is that higher energy cost correlates with larger refrigerator size and more features, and larger refrigerators with
more features tend to have higher prices. To illustrate the true effect of energy cost on price, the other variables must
be removed.
An added variable plot removes the effects of other variables. To construct such a plot for xj:
1. Regress y on the other explanatory variables, excluding xi. Let ey be the residuals from this regression.
2. Regress xi on the other explanatory variables. Let ej be the residuals from this regression.
3. Construct a scatter plot of ey on ej.
The correlation between ey and ej is called the partial correlation coefficient. It is denoted by r(y, xj I
,x j_, x j+1, • • • , xk). It may be calculated directly from the full regression, the regression of y on all k ex-
planatory variables including xj, using the following formula:
t (bj)
/"(Y, xi, • • • , xj-i, xj+i, • • • , xk) (3.14)
t(bi)2 + n - (k + 1)
Here, t(b) is the t statistic of xi in the full regression. While the partial correlation coefficient shows the degree
of correlation of y and xi after eliminating effects of other explanatory variables, it does not show nonlinear
relationships. Only the added variable plot shows nonlinear relationships.
EXAMPLE 3D ••• y is regressed on xi and x2 based on 15 observations. You are given:
(i) b2 = 0.764
(ii) s2 = 0.525b2
Copyright C2022 ASM
3.4. ADDED VARIABLE PLOTS AND PARTIAL CORRELATION COEFFICIENTS 37
Calculate the partial correlation coefficient of x2. •
0.764
SOLUTION: t(b2) — = 1.05442 0
V0.525
1.05442
r(y, X2 I X1) = 0.29119
V1.054422 + 15 3
Table 3.1: Formula Summary for Regression Measures of Fit
• Error SS is the Error Sum of Squares: • R2 in simple linear regression:

11
Error SS = yi)2 (3.4)
It is also called RSS, the residual sum of squares.

It has n k — 1 degrees of freedom. • Variance of b:
2 2
s
• Regression SS is the Regression Sum of Squares:
l
f
where 1p is the j + 1st diagonal element of (X1X)-1.

Regression SS = Z4(9; — 2)2 s2(X1X)-1 is the (estimated) covariance matrix.
It has k degrees of freedom. • Covariance matrix for b in simple linear regres-

sion:
• Total SS is the Total Sum of Squares:
_ 2' EX
S2ho
Z(Xj — .7)2
It has n —1 degrees of freedom. (1

= S2
— 1)2)
n
+ (3.9)
1
Total SS = Regression SS + Error SS
2 (Z(x1 — 1)2
2
sb, = 5 (3.10)
It is sometimes abbreviated as TSS. -
• s is the residual standard deviation. It is also called

RSE, the residual standard error. It equals
Cov(bo, hi) = sz
(E(x; x— .)2) (3.11)
Error SS (3.12)
s= =
(3.2)
• R2 is the coefficient of determination: • t statistic:
Regression SS Error SS
R2 — —
(3.3)
Total SS Total SS
• Partial correlation coefficient
r(y, xi I xj+i, • • • xk) = (3.14)

111-(bi)2 + n (k +1)

Copyright Q2022 ASM
Exercises
Standard error of regression, R2
3.1. •41 [SRM Sample Question #11] You are given the following results from a regression model.
1
Observation „,
number (i)
Yi f(x1)
1 2 4
2 5 3
3 6 9
4 8 3
5 4 6
Calculate the sum of squared errors (SSE).

(A) —35 (B) —5 (C) 5 (D) 35 (E) 46
3.2. 111 You are fitting the linear regression model yi = po + pixi + ei to 20 observations.
You are given:
• E(yi — 9)2 = 12
•
1:1=10
. E(2i — y)2 = 108
Determine R2.
3.3. [3-F85:10] You fit the regression model yi = pa 4- f3ix Ei to 10 observations. You have determined:
R2 = 0.6
yi = 30
Ey = 690
Calculate S2.
(A) 27 (B) 30 (C) 36 (D) 45 (E) 50
3.4. You are given the following excerpt from an ANOVA table for a regression:
Source df Mean Square

Regression 4 1058.32
Error 20 84.58
Determine R2.

Copyright C2022 ASM
3.5. '4° [120-F89:81 You are given:
6.8 0.8
7.0 1.2
7.1 0.9
7.2 0.9
7.4 1.5
Determine R2 for the regression of y on x.

(A) 0.3 (B) 0.4 (C) 0.5 (D) 0.6 (E) 0.7
3.6. [120-589:71 You are interested in the relationship between the price movements of XYZ Corporation and
the "market" during the fourth quarter of 1987.
You have used the least squares criterion to fit the following line to 14 weekly closing values of XYZ stock (Yt)
and the Dow Jones Industrial Average (xi) during the period of interest:
9, = —116.607 + 0.195x, +
You are given:

14
E(g, - = 677.1142
t =1
14
(y, _p)2 = 949.388

1=1
Determine the percentage of variation in the value of XYZ stock that was "explained" by variations of the Dow.
(A) 50 (B) 60 (C) 70 (D) 80 (E) 90
3.7. [SRM Sample Question #18] For a simple linear regression model the sum of squares of the residuals is
1=1
e? = 230 and the R2 statistic is 0.64.
Calculate the total sum of squares (TSS) for this model.
(A) 605.94 (D) 701.59 (E) 750.87
3.8. [120-590:14] Which of the following statements are true for a two-variable linear regression?
I. R2 is the fraction of the variation in Y about that is explained by the linear relationship of Y with X.
R2 is the ratio of the regression sum of squares to the total sum of squares.
HI. The standard error of the regression provides an estimate of the variance of Y for a given X based on n 1
degrees of freedom.
(A) I and II only (B) I and III only (C) II and III only (D) I, II, and III
(E) The correct answer is not given by (A) , (B) , (C) ,or (D) .
Exam SRM Study Manual Exercises continue on the next page...

Copyright (P2022 ASM
3.9. "41 [120-S91:6] A bank is examining the relationship between income (x) and savings (y). A survey of six
randomly selected depositors yielded the following sample means, sample variances, and sample covariance:
= 27.5
= 87.5
2
5 =35
Sxy = 17.0
Determine R2.
(A) 0.1 (B) 0.2 (C) 0.3 (D) 0.7 (E) 0.9
3.10. sf[120-81-95:1] You fit a simple regression model with the dependent variable yi = i for i = 1, ,5. You
determine that s2 = 1.
Calculate R2.
(A) 0.1 (B) 0.3 (C) 0.5 (D) 0.6 (E) 0.7
3.11. Nti [120-83-96:31 You fit a simple regression model to five pairs of observations. The residuals for the first
four observations are 0.4, —0.3 0.0, —0.7, and the estimated variance of the dependent variable y is 1.5.
Calculate R2.
(A) 0.82 (B) 0.84 (C) 0.86 (D) 0.88 (E) 0.90
3.12. sir [S-F17:341 For an ordinary linear regression with 5 parameters and 50 observations, you are given:
• The total sum of squares, TSS = 996.
• The unbiased estimate for the constant variance t72 is 52 = 2.47.
Calculate the coefficient of determination.
(A) Less than 0.65

(B) At least 0.65, but less than 0.75
(C) At least 0.75, but less than 0.85
(D) At least 0.85, but less than 0.95
(E) At least 0.95

(--- 3.13. svie [MAS-I-S19:30] You are given the following information about a linear model:
• Y=P0+131X1+p2x,±E
Observed Estimated
Y's Y's
2.441 1.827
3.627 3.816
5.126 5.806
7.266 7.796
10.570 9.785
• Residual Sum of Squares = 1.772

Calculate the R2 of this model.
(A) Less than 0.6
(E) At least 0.9
3.14. •16. For a linear regression model for claim sizes, the model output is
Variable Coefficient
Intercept 22
Age of driver
18-24 15
25-64 U
65 and up 13
Income group
Under 50000 12
50000-100000 0
Over 100000 —3
The residual standard error of the regression is 2. Use this as the estimate of the standard deviation of E.
The response variable is transformed using a Box-Cox transformation with A = 1/2.
Calculate expected claim sizes for a 30-year old driver earning 150,000.
Exam SRM Study Manual Exercises continue on the next page .. .

t test
3.15. [4-S01:401 For a classical linear model of the form yi = po + + E based on seven observations, you are
given:
(i) E(xi — 1)2 = 2000
(ii) = 967
b1 is the least squares estimate of pl.

Calculate 41, the standard error of b1.
(A) 0.26 (B) 0.28 (C) 0.31 (D) 0.33 (E) 0.35
3.16. '11. You are given the following excerpt from regression output
Regression of LOSS on AGE + MALE + MARRIED
Estimate Std. Error
(Intercept) 20.850 5.483

AGE —2.503 1.042
MALE 3.205 1.076
MARRIED —1.082 0.004
Residual standard error: 15.20 on 15 degrees of freedom
With regard to the hypothesis that AGE has no effect on LOSS, which of the following statements is correct?
(A) Reject at 1% significance
(B) Reject at 2% significance but not at 1% significance
(C) Reject at 5% significance but not at 2% significance
(D) Reject at 10% significance but not at 5% significance
(E) Do not reject at 10% significance

Copyright 02022 ASM
3.17.
%, [ST-F15:211 You wish to explain Y using the following multiple regression model and 32 observations:
Y = 13o + +p2x2+133x3+ E
A linear regression package generates the following table of summary statistics:
Estimated Standard
Coefficient Error
44.200 5.960
Intercept
pi -0.295 0.118
P2 9.110 6.860
P3 -8.700 1.200
For the intercept and each of the betas, you decide to reject the null hypothesis which is that the estimated
coefficient is zero at a = 10% significance.
Which variables have coefficients significantly different from zero?
(A) Intercept
(B) Intercept, X1
(C) Intercept, X2
(D) Intercept, X1, X3
(E) Intercept, X2, X3
3.18. [ST-S16:20j You are given:

(i) A multiple linear regression model was fit to 20 observations.
(ii) There are five explanatory variables with fitted values for pi through 135.
(iii) The following summarizes the fitted coefficients excluding the intercept:5
Point Estimate Standard Error t statistic
Pi 0.020000 0.012000 1.661
P2 -0.004950 0.008750 -0.565

P3 0.216000 0.043200 5.000
P4 -0.034600 0.115000 -0.301
35 -0.000294 0.000141 -2.090
Determine the number of coefficients in the table above for the five explanatory variables that are not statistically
different from zero at a significance level of a = 10%, based on a two-tailed test.
(A) 1 (B) 2 (C) 3 (0)4 (E) 5
3.19. For a regression model y = X13 E you are given

(i) r2-4
0.2 0.1 0.5
(ii) (X/X)-1 =
(0.1
0.5
Determine Var(bi - b2).

0.7
0.4
0.4
0.2
5The table, as it appeared on the exam, mistakenly has 1.661 on the 151 line, but 0.02/0.012 :-- 1.667.

Copyright Q21:122 ASM
3.20. '11? [120-81-98:4] You fit the regression model yi = 132Xi2 Ei to a set of data.
You determine:
6.1333 -0.0733 0.1933

(CX)-1 =
(-0.0733
s2 = 280.1167
-0.1933
0.0087
0.0020
-0.0020
0.0087
Determine the estimated standard error of b1 - b2.
(A) 1.9 (B) 2.2 (C) 2.5 (D) 2.8 (E) 3.1
3.21. [120-83-98:6] You fit the multiple regression model yi = 131 + P2X12 f33Xi3 El to 30 observations.
You are given:
yTy = 7995
0.0286 0.0755 0.0263
oexri 2( 0 .8:0 1 279585 6

-0.0263
0.0027
0.0010
0.0014
0.0010
0.0035
0.0010
-0.0014
-0.0010
0.0032
426411
0.55
X/Y = ( 6177.5
5707.0
( 5.22 )
1.62
b =
0.21
-0.45
Determine the length of the symmetric 95% confidence interval for 132.
(A) 0.3 (B) 0.6 (C) 0.7 (D) 1.5 (E) 1.8
3.22. alr [4-F03:36] For the model yi = Po + ixi + fi2xi2 P3Xi3 Ei, you are given:
(i) There are 15 observations.
113.66 -0.33 2.05 -6.31

0.03 0.11 0.00
(ii) pcxyi = -0.33
2.05 0.11 2.14 -2.52
-6.31 0.00 -2.52 4.32
(iii) Error SS = 282.82

Calculate the standard error of b2 - b1.
(A) 6.4 (B) 6.8 (C) 7.1 (D) 7.5 (E) 7.8
Exam SRM Study Manual Exercises continue on tire next page.

Copyright ©2022 ASIVI
(---N 3.23. ikir [4-F00:5] You are investigating the relationship between per capita consumption of natural gas and the
price of natural gas. You gathered data from 20 cities and constructed the following model:
3T =Po +Pix+
where
y is per capita consumption
x is the price, and
c is a normal random error term
bo = 138.561
= —1.104
E 4 = 90,048
116,058
E(x,_ R)2 -= 10,668

E(y,_ sr)2 = 20,838
E = Ecyi—D2= 7,832
Determine the shortest 95% confidence interval for pt.
(A) (-2.1,-0.1) (B) (-1.9, —0.3) (C) (-1.7, —0.5) (D) (-1.5, —0.7) (E) (-1.3, —0.9)
l
f
3.24. °SP [4-F01:5] You fit the following model to eight observations:
3T=Po +Pix + £
You are given:

b1 = —35.69
E(xi —51)2 = 1.62
E(yi — 902 = 2394
Determine the symmetric 90-percent confidence interval for pt.
(A) (-74.1,2.7) (B) (-66.2, —5.2) (C) (-63.2, —8.2) (D) (-61.5, —9.9) (E) (-61.0,-10.4)
3.25. el? [4-F03:51 For the model yi = Po + Pia:: + ei, where i = 1, 2, ... , 10, you are given:
1, if the ith individual belongs to a specific group
xi =
1 0, otherwise
40 percent of the individuals belong to the specified group
The least squares estimate offlu is b1 = 4
E(yi — bo — b1xi)2 92
Calculate the t statistic for testing Ho:flu= 0.

(A) 0.9 (B) 1.2 (C) 1.5 (D) 1.8 (E) 2.1
l
f

Copyright 02022 ASM
3.26. •a . [ST-515:22] You are given the following linear regression model fitted to 12 observations:
Y = 13o + pix + £
The results of the regression are as follows:
Parameter Estimate Standard Error

go 15.52 3.242
pl 0.40 0.181
Determine the results of the hypothesis test Ho: pi = 0 against the alternative Hi: pi # 0.
(A) Reject at a = 0.01
(B) Reject at a = 0.02, Do Not Reject at a = 0.01
(C) Reject at a = 0.05, Do Not Reject at a = 0.02
(D) Reject at a = 0.10, Do Not Reject at a = 0.05
(E) Do Not Reject at a = 0.10
3.27. '41 [ST-514:20] The model yi = Po + Aix; + ei was fit using 6 observations. The estimated parameters are as
follows:
•
1)0 = 2.31
• b1 = 1.15
• se(bo) = 0.057
• se(bi) = 0.043
The following hypothesis test is performed:

• Ho: pi =1
• Hi: /31 #1
Determine the minimum significance level at which the null hypothesis would be rejected.
(A) Less than 0.01
(E) At least 0.10
3.28. [4-F02:381 You fit a simple linear regression model to 20 pairs of observations.
You are given:
(i) The sample mean of the independent variable is 100.
(ii) The sum of squared deviations from the mean of the independent variable is 2266.
(iii) The ordinary least-squares estimate of the intercept parameter is 68.73.
(iv) The error sum of squares (Error SS) is 5348.
Determine the lower limit of the symmetric 95% confidence interval for the intercept parameter.
(A) —273 (B) —132 (C) —70 (D) —8 (E) —3
Exam SIWI Study Manual Exercises continue on the next page ..

Copyright 02022 ASM
n 3.29. [ST-F14:20] For the linear model yi = f30 + Ej, you are given:
• n = 6.
• b1 = 4.
• ri._1(xi - 2)2 = 50.
• Error SS = 25
Calculate the upper bound of the 95% confidence interval for Pi.
(A) Less than 5.1
(E) At least 5.7
3.30. r[120-81-98:3] You fit the model yi = Ei to 10 observed values (xi, yi).
You determine:
Determine the width of the shortest symmetric 95% confidence interval for 130.
(A) 1.1 (B) 1.2 (C) 1.3 (D) 1.4 (E) 1.5
3.31. ' [120-81-95:5] Performing a regression of y on xi and x2 with 12 observations, you determine that the
regression equation is
= 1.2360 + 0.8683xii + 0.8517x12
You are given:

(i) Regression SS = 7.5753
(ii) Error SS = 1.8739
7.70997 -0.82314 -1.05459
(i i) OC X)-1 = (-0.82314
-1.05459
0.10044
0.10480
0.10480
0.15284
Determine k such that a 95% confidence interval for 132 is given by b2± k.
(A) 0.16 (B) 0.18 (C) 0.33 (D) 0.37 (E) 0.40

Copyright 02022 ASM
3.32. qi [120-82-97:5] You perform a regression of y on x1 and x2 with 6 observations.

You determine:
(i) pi . 20.0- 1.5xi1 - 2.0xi2

(ii) Regression SS = 42
(iii) Error SS = 12
4 1
-7 -3
(X'X)-1 = (-31, 1)
i
(iv) 0
T 2
-3 0 3
Calculate the value of the t statistic for testing the null hypothesis Ho : p2 . 1.
(A) -0.9 (B) -1.2 (C) -1.8 (D) -3.0 (E) -5.0
3.33. "Nr You perform a multiple-regression analysis on Y's relationship to three explanatory variables
3
y =flo + E pixi + £
i=1
You have 15 data points.

The fitted value of pi is 1.372, with a standard error 0.258.
Construct a 95% confidence interval for pi.
3.34. •-: [SRM Sample Question #27] Trevor is modeling monthly incurred dental claims. Trevor has 48 monthly
claims observations and three potential predictors:
• Number of weekdays in the month
• Number of weekend days in the month
• Average number of insured members during the month
Trevor obtained the following results from a linear regression:
Coefficient Standard Error t Stat p-value

Intercept -45, 765, 767.76 20,441,816.55 -2.24 0.0303
Number of
513,280.76 233,143.23 2.20 0.0330
weekdays
Number of
280,148.46 483,001.55 0.58 0.5649
weekend days
Average
number of 38.64 6.42 6.01 0.0000
members
Determine which of the following variables should be dropped, using a 5% significance level.
I. Intercept
II. Number of weekdays
III. Number of weekend days
IV. Number of members.
(A) I only (B) II only (C) III only (D) IV only

(E) None should be dropped from the model

nUse the following information for questions 3.35 and 3.36:

You are analyzing the relationship between annual losses on auto collision insurance (Y) and the following
explanatory variables:
1. The cost of auto parts (xi)
2. The price of gasoline (x2)
You fit a regression model of the form
Y = Po + flux1 +142x2 + e
The results of the regression model, using 5 years of annual data, are
Variable Coefficient Standard Error
Constant 1527.6 259.2
Cost of auto parts 2.89 0.58
Price of gasoline -40.50 15.10
3.35. You wish to test the hypothesis Ho: 162 = 0 against Hi: P2 <0 using the t statistic.
Which of the following is true regarding Ho?
(A) Reject at 0.01 significance.
(B) Reject at 0.025 significance, do not reject at 0.01 significance.
(C) Reject at 0.05 significance, do not reject at 0.025 significance.
(D) Reject at 0.10 significance, do not reject at 0.05 significance.
(E) Do not reject at 0.10 significance.
3.36. 'I Calculate the partial correlation coefficient of the price of gasoline.
Use the following information for questions 3.37 and 3.38:
You are given:
(i) y is the annual number of discharges from a hospital.
(ii) x is the number of beds in the hospital.
(iii) Dummy variable d is 1 if the hospital is private and 0 if the hospital is public.
(iv) The classical three-variable linear regression model y = poi-pix+p2d+ e is fitted to n cases using ordinary
least squares.
(v) The matrix of estimated variances and covariances of bo, bi, and b2 is:
1.89952 -0.00364
( -0.00364
-0.82744
0.00001
-0.00041 -0.82744)
-0.00041
2.79655
3.37. 44° [VEE Applied Statistics-Summer 05:2] Determine the standard error of 1)0 + 600b1.
(A) 1.06 (B) 1.13 (C) 1.38 (D) 1.90 (E) 2.35
3.38. 61 b2 = -12 and n =88

Calculate the partial correlation coefficient of x2.
Exam SIZM Study Manual Exercises continue on the next page . . .

Copyright C2022 ASM
3.39. A regression model yi = flo Pi xii fl2xi2 g3Xi3+ Ei is fitted to 48 observations. The partial correlation
coefficient of b1 is 0.6540.
Calculate the t ratio for
Solutions
3.1. How can a sum of squares be negative? You can eliminate (A) and (B) with no calculation.
(2 _ 4)2 + (5 _ 3)2 + (6 _ 9)2 + (8 _ 3)2 +(4_6)2 (E)
3.2. Regression SS = 108 and Error SS = 12. R2 = 108/(108 + 12) = 0.9
302
3.3. Total SS = 690 — —
600
10
Error SS = Total 55(1 — R2) = 600(0.4) = 240

2 Error SS r---1
s =I I (B)
8
3.4. Regression SS = 1058.32(4) = 4233.28
Error SS = 84.58(20) = 1691.6
4233.28
R2 0.7145
4233.28 + 1691.6
3.5. The easier way to work this out is to use the fact that R2 for a simple linear regression is the square of
the correlation coefficient, so just calculate that. The Multiview calculator's data table calculates, in register F,
r = 0.6985354731, and r2 = 0.487952. (C)
Statistical calculators were not allowed at the time this exam was given. Even so, you can calculate the correlation
coefficient by calculating the variances of x and y and their covariance. You still won't have to calculate pi.
However, we'll go through the regression in the following solution.
Note that 17, -i = pi(xi- 1), so Regression SS = E(9i — p)2 = E(xi - Y)2.
E (xi -17)2 = X? r
(I Xi)2
5
= 252.25 3552
5
— 0.2
(35'5 (5'3)
E(xi — 2)(yi — g) xiyi = 37.81
5) 0'18
0.18
131 = = 0.9
2
(s i") 5.32
Total SS = (yi — = - —
= 5.95 — = 0.332
5 5
Regression SS = = (0.92)(0.2) = 0.162

, Regression SS 0.162
=0.488
= =
(C)
Total SS .332
3.6. Total SS = 949.388. Regression SS = 677.1142. The question is requesting R2, which is
R2 Regression SS _ 677.1142 _ —
0.7132 (C)
Total SS 949.388

3.7. Total SS = Error SS + Regression SS, so R2 = Regression SS/Total SS = 1 — Error SS/Total SS. Using this
equation
230
0.64 = 1
Total SS
230
Total SS = — =
638.89 (B)
0.36
3.8. I and II are true. III would be true for the square of the standard error, and based on n —2 degrees of freedom.
(A)
3.9. For the sin-pie regression model, R2 is the correlation squared, or

2
5
xy
0.9437 (E)
3.10. y=3
Total SS = E(yi !7,1 )2 =

(1 _ 3)2 + _ 3)2 + (3 _ 3)2 ±(4— 3)2 + (5 _ 3)2 10
Error SS = (n — 2)s2 = (5 — 2)(1) = 3

Error SS
R2 = 1 = 0.7 (E)
Total SS
3.11. Total SS is Z(yi—g)2, whereas the estimated variance is E(yi —g)2/(n —1), and 11-1 = 4, so Total SS = 4(1.5) = 6.
The sum of the residuals must equal 0, so the fifth residual must be —(0.4 — 0.3 + 0.0 — 0.7) = 0.6. Therefore:
Error SS = 0.42 + 0.32 + 0.72 + 0.62 = 1.1
Error SS 11
R2 = 1 =1— = 0.8167 (A)
Total SS 6
3.12. The coefficient of determination is
R2 Regression SS _ 1 Error
— Total SS
—
SS
Total SS
You are given that Total SS = 996. The estimate s 2 = Error SS/(n (k + 1)), and n = 50, k + 1 = 5 (The intercept Po is
counted as a parameter), so Error SS = 2.47(45) = 111.15. Therefore
R2 _ 111.15 _ 0.8884 (D)
—
996
3.13. R2 is 1 LiTt:12. We are given that Error SS = 1.772, and Total SS is the sum of the square difference
between observed Ys and their mean, or 4 times the unbiased sample variance. The unbiased sample variance of
the Ys, 2.441, 3.627, 5.126, 7.266, and 10.570, is 10.3402. So the total sum of squares is 4(10.3402) = 41.361. Then
R2 = 1 — 1.772/41.361 = 0.9572. (E)
3.14. The linear expression is the intercept, 22, plus 0 for age of driver, plus —3 for the income group, and
22 + — 3 = 19. Performing the Box-Cox transformation, y* = 2(y1/2 — 1) is a normal variable with variance equal to
the variance of E, 22 = 4. Then
y=2
E[y] = E[(y*/2 + 1)2] = ERV21 + E[ysi +1

E[(021 E[y12 Var(y*) = 192 + 22 = 365
365
E[y] = + 19 + 1 = 111.25

Copyright =022 ASM
3.15. We will use formula (3.10) for 5b1.

2 967
=193.4
—
5 =
n k 1 5
S2
0.311 (C)
3.16. The absolute value of the t statistic is 2.503/1.042 = 2.402. At 15 degrees of freedom, this is between 2.131
(0.05) and 2.602 (0.02), making the answer (C).
3.17. Divide the estimated coefficients by their standard errors to obtain the t statistic. The t statistic has n—(k +1) =
32— (3 +1) = 28 degrees of freedom. The critical value at 10% of the t statistic for a 2-sided test is 1.7011. The absolute
values of the quotients of coefficients over standard errors are 7.4161, 2.5, 1.3280, 7.25. Therefore the answer is (D):
accept the intercept and the first and third variables as significant.
3.18. It is not clear whether there is a po coefficient for the intercept in the regression, but let's assume there is,
so that there are n — (k + 1) = 20 — (5 + 1) = 14 degrees of freedom. The critical value at 10%, which is 10.050 for a
two-tailed test, is then 1.7613, so 13 )321 and /34 are not significant. (C)
3.19. The true variance can be calculated, since we have 02 (instead of the usual s2).
Var(bi — b2) = Var(bi) + Var(b2) — 2 Cov(bi, b2)
= 4(0.7 + 0.2— 2(0.4)) = 0.4
3.20. G-r(b1 — b2) = Var(bi) + CiFr(b2) — 2Cov(b1, b2)

= 280.1167[0.0087 + 0.0087 — 2(-0.0020)]
= 5.9945
Standard error of b1 — b2 = V5.9945= 2.4483 (C)

3.21. Error SS is
Error SS = Yry — V"-y = 7995 — 5.22(261.5) — 1.62(4041.5) — 0.21(6177.5) + 0.45(5707.0) = 1353.615

The standard error of the regression is V1353.615/(30 —4) = 7.2154. Thus the standard error of b2 is 7.2154 times the
square root of 0.0035, or 0.4269. The t critical value for 95% confidence, 26 degrees of freedom is 2.056. The answer
is 2(2.056)(0.4269) = 1.755. (E)
3.22. The variance of the difference is
Var(b2 — hi) = Var(b2) + Var(bi) — 2 Cov(b2, b1)

The variance of b2 is
,•2 = 62(2.14)
We need 52, and it is Error SS/(n k —1) = 282.82/11 = 25.7109.
s2 = (25.7109)(2.14) = 55.0213
s2 = (25.7109)(0.03) = 0.7713
Cov(b2, b1) = (25.7109)(0.11) = 2.8282
The standard error is the square root of the variance, The final answer is
V55.0213 + 0.7713 — 2(2.8282) = V50.136 = 7.081 (C)

Copyright C12022 ASM
3.23. We will use formula (3.10) for sb,. First compute s2.
s2 = Zq =
7,832
= 435.111
n — (k + 1) 20-2
1435.111
= 0.2020
_
sal —
10,668
The t coefficient for 95% at 18 degrees of freedom is 2.101. The confidence interval is
—1.104 2.101(0.2020) = (-1.52, —0.68) (D)
3.24. We will use formula (3.10) for shi.

2 2394
S = = ,5
n—k 1 8—2
399
sb, = VT6-i = 15.694
The I coefficient for 90% confidence with 6 degrees of freedom is 1.943.
—35.69 ± 1.943(15.694) = (-66.2, —5.2) (B)
3.25. We will use formula (3.10) for 5b1. Notice that (iv) is the error sum of squares.
2 92
S
=10-2= 11.5
Since 40 percent belong to the group, .54.' = 0.4 and xi equals 1 four times and 0 six times.
E(xi =
6((_0.4)2)+ 4(0.6)2 = 2.4
11.5
sbi = = 2.189
The t statistic is 4/2.189 = 1.827 . (D)

3.26. The I statistic is 0.40/0.181 = 2.210. There are n — (k +1) = 12-2 = 10 degrees of freedom. The I-distribution
has critical values of 1.8125 at 10% and 2.2281 at 5% significance, where the significance levels are twice the percentile
levels of the table since the test is two-sided. The correct answer choice is (D).
3.27. The I statistic is (b1 — 1)Ise(bi) = 0.15/0.043 = 3.488. The number of degrees of freedom is the number of
observations minus the number of parameters estimated, or 6 — 2 = 4. Based on the table, the critical values for a
two-sided test (the percentiles listed in the table are doubled) are 2.7764 at 5% and 3.7469 at 2%, making the answer
(C).
3.28. Use formula (3.9).
5348
S z= = 297.1111
20_2
E4 = E(xi —)2+ n7c2

= 2266 + 20(1002) = 202,266
202,266
sk, = .\1297.1111( 20(2266) ) = 36.4146
The t critical value at 95% and 18 degrees of freedom is 2.1009.
68.73 — 2.1009(36.4146) = —7.773 (D)

3.29. From the fourth bullet, the residual standard error is
2 —
Error SS 25
=
s
n—2 4
The third bullet provides the sum of the squared deviations of xis from their means. By formula (3.10),
s2 25/4 „ „
s2122 = = = U .1LD
VA' )2 50
The t coefficient for 4 degrees of freedom and 0.05 in both tails is 2.7764. The upper bound of the 95% confidence
interval for f3i is 4 + 2.7764 25 = 4.9816. (A)
2.79
3.30. s2= Error
n—2
SS = = 0.34875
E 4 = E(xi + n.k2 = 180 + 10(62) = 540

s
0.34875(540)
2
=
= 0.104625
bo
10(180)
The t coefficient at 95% for 8 degrees of freedom is 2.3060.
The width of the confidence interval is 2(2.3060)1/0.104625 = 1.4918 . (E)
3.31. The estimated variance of the regression isrrc;r SS = 1 87399= 0.2082, since there are 9 degrees of freedom.
The standard error of b2 is
= V0.2082(0.15284) = 0.1784
The t coefficient for 95% confidence, 9 degrees of freedom, is 2.2622, and 2.2622(0.1784) = 0.4036 . (E)
3.32. The estimated variance of the regression, which has 3 degrees of freedom, is y = 4. Then
sh, = 4(2/3) = 1.633
Since we're testing p2 = 1, the difference between the fitted value and the null hypothesis is —2.0 — 1 = —3 and the t
statistic is = —1.837 (C)
3.33. The error has n k —1 = 15 — 3 — 1 = 11 degrees of freedom. The t value for 0.05 significance with 11 degrees
of freedom is 2.2010. So the confidence interval is 1.372 ± 2.2010(0.258) = (0.804,1.940)
3.34. There are 44 degrees of freedom, so you would not be able to look up critical values in the tables you get at
the exam. But they give you the p-values, so there is no need to look up tables. If the p-value is 5% or less, the
variable is accepted. Only number of weekend days has a p-value that is too high. (C)
3.35. The t statistic for p2 is —40.50/15.10 = —2.6821. There are 5 data points and 3 variables, so there are 2 degrees
of freedom. The test is a one-sided test, so we use the percentiles in the table. 2.6821 is between 1.8856, the 10th
percentile, and 2.9200, the 5th percentile. (D)
3.36. We calculated the t statistic as —2.682. So the partial correlation coefficient is
—2.682
—0.8846
V2.6822 + 2
3.37. The standard error is the square root of the variance, and
Var(bo + 600b1) = s + 600241 + 2(600) Cov(bo, bi)

= 1.89952 + 6002(0.00001) + 1200(-0.00364) = 1.13152
The answer is =1.06373 . (A)

Copyright 02022 ASM
QUIZ SOLUTIONS FOR LESSON 3 55
3.38. Let t(b2) be the t statistic for b2 and r the partial correlation coefficient of b2.
t (b2) =
,v —12r5 = —7.17579
—7.17579
r— 0.61421
V7.175792 + 88 — 3
3.39. Let t be the t ratio of b1.
0.6540 =
t2 +48 — (3 + 1)
0.65402(t2 + 44) = t2
—0.57228t2 + 18.8195 = 0
t 2 = 32.8849
t = 5.7345
Since the partial correlation coefficient is positive, the t ratio must also be positive; we do not select the negative
square root.
Quiz Solutions
3-1. The Error SS has 20-2 = 18 degrees of freedom, so Error SS = 52(18) = 450. Total SS is 884, so the regression
sum of squares is 884 — 450 = 434
3-2. Z(yi — 902 is the error sum of squares and Z(9i - 9)2 is the regression sum of squares.
122
R2 = 0.2998
285 + 122
3-3. Use formula (3.4)

R2 _ (-120)2 _ 0.8929
— (48)(336) —

exams with helpful solutions and tips. included with GOAL

Copyright 02022 ASM
Lesson 4
Linear Regression: F
Reading: An Introduction to Statistical Learning 3.2.2
The first F statistic we discuss tests the significance of the entire regression. In other words, H0 is the model y = p0+E
and H1 is the model under consideration. The error sum of squares for H0 is the total sum of squares. So our
F statistic is the quotient of the mean square of the regression over the mean square error:
Regression SS/k
Fk,n—k—l= (4.1)
Error SS/(n — k — 1)
with k and n — k —1 degrees of freedom. Most textbooks, but not the textbooks on the syllabus, place F in a separate
column of the ANOVA table:
Source Sum of Squares df Mean Square F
Regression Regression SS k Regression SS/k Fk,rt—k-1

Error Error SS n—k—1 Error SS/(n — k —1)
Total Total SS n—1 Total SS/(n — 1)
i---'‘
It tests whether the model as a whole is significant. In other words, the null hypothesis for the F test is H0:
You may be wondering, why test the whole model? Can't we just check the t statistic of every variable? If at •-••
least one t statistic is significant, then the model is significant. The answer is that if there are a lot of variables,
insignificant variables may accidently have significant t statistics. If you have 20 insignificant variables and are
evaluating t statistics using 5% significance, on the average one of the variables will have a significant t statistic.
That's why we need an F test.
In a model of the form y = 0+ pix ± e, the F1,)1-2 statistic is the square of the t,,_2 statistic for 13i.
The F statistic is related to R2 by
n — k — 1 Regression SS/Total SS _ n—k-1 R2 (4.2) ••••

Fkm—k—l= k k 1 — R2
Error SS/Total SS
Don't memorize this; just remember the trick: to go from F to R2, divide numerator and denominator by Total SS.
EXAMPLE 4A •-: In a linear regression model yi = Po + pixi + Ei you are given:
• There are 15 observations.
• bi = 3.1
• The standard error of b1 is 10/6.
Determine the F statistic and R2. •
("-----'
SoLurioN: The standard error of b1 is 10/6, so the t statistic is 3.1/(10/6) = 1.86. The F statistic is 1.862 = 3.4596

58 4. LINEAR REGRESSION: F
To get R2 using the above formula:

R2
F1,n_2 = (ii 2)
_ R2
Fl,-2 — Fl,-2R2 = (n — 2)R2
R2 =
n — 2 + F1,11-2
3.4596
0.2102
13 + 3.4596
If additional independent variables are added to a regression model, the error sum of squares will go down,
since the added variables will improve the fit. However, if the decrease in Error SS is small, the additional variables
may not be justified. Suppose that the model with fewer variables fits k — q +1 ps, where the intercept /30 is counted
as one of the ps. If we add q more variables, we should test the hypothesis $3k-q+1 = Pk-q+2 -= • • • = = 0 using an F
statistic. To perform this test, estimate the model with and without the additional q variables. The model without
the additional variables is called the restricted or reduced model, since the coefficients of the omitted variables are
being forced to be 0. It will have a higher Error SS; call it Error SSR. The Error SS of the model with the additional
sir variables, the unrestricted or full model, is called Error SSuR. Then the F statistic to test the significance of the
variables is
(Error SSR — Error SSuR)/q
=
(4.3)
Error SSuR /(n — k 1)
where q is the number of restrictions, the number of coefficients forced to be 0. Note that this generalizes equa-
tion (4.1), where q = k and Error SSR = Total SS.
EXAMPLE 4B gki? You are considering the model y •• +Pixi +P2x2+P3x3+P4x4+p5x5+ E based on 60 observations.
You are testing the hypothesis P2 = P3 = g4 := P5 = 0 and y = po• + plxi + E'.
You have the following statistics for these models:
Model Residual standard error
Original model 4,506
P2 = P3 = g 4 .135 = 0 10,321
Determine the F ratio to test the hypothesis /32 = P3 = ,I34 ps 0.
SOLUTION: The standard error s — EnTIT sis , so Error SS = s2(n — k — 1). Here, n = 60, k = 5, and the number of
restrictions q =4.
Error SSuR = 4,5062(54) = 1,096,417,944
Error SSR = 10,3212(58) = 6,178,336,378
F—
(6,178,336,378 — 1,096,417,944)/4 62.57 0
1,096,417,944/54
Regression output may only give you R2 rather than Error SS. Use the method of equation (4.2) to express F in
terms of R2; the Total SS is the same for both the restricted and the unrestricted models:
(Error SSR — Error SSuR)iq
Fq,n-k-i
Error SSuR /(n — k — 1)
(Error SSR /Total SS — Error SSuR /Total SS)/q
(Error SSuR /Total SS)/(n — k —1)
((1 — Error SSuR /Total SS) — (1 — Error SSR /Total SS)) q
(Error SSuR /Total SS)/(n k —1)
— R2R)/g
(1—RR)/(n — k —1)

Copyright 02022 ASM
The tables you get on the exam do not include tables of the F distribution. Apparently they will not ask you
to determine the significance of variables using the F statistic. A possible exception would be when there is only
one degree of freedom in the numerator; in that case, the square root of the F statistic is the t statistic having the '41
same number of degrees of freedom as the denominator, and then you can use the t table to look up the critical
value.1 But keep in mind that the square root is a 2-sided variable (it can be positive or negative), so the probability
of Fid, > 1 — a corresponds to the probability of Ili >1—
Quiz 4-1 ssi? You are considering a model with 12 explanatory variables and an intercept. The model is based
on 35 observations. The regression sum of squares for the model is 6520 and the error sum of squares is 349.
You wish to test the hypothesis 1310 = pH = P12 = 0. When the model is run without the corresponding
three variables, the regression sum of squares is 6410.
Calculate the F ratio to test the hypothesis.
A special case of this F test is a categorical variable. Usually you would want to test all the dummy variables
arising from a single categorical variable as a group for significance, rather than testing each dummy variable
separately.
Table 4.1: Formula Summary for Regression Measures of Fit
Regression SS/k
(4.1)
Fk,n-k-1 =Error SS/(n — k — 1)
n k—1 R2
(4.2)
k 1—R2
To test Pk—q÷i = Pk—q+2 = • • • =Pk := 0, use the following F statistic:

(Error SSR — Error SSUR )//
Fq,n-k-1 — (4.3)
Error SSuR — k — 1)
(RJR — R2R)/q (4.4)
(1 — qR)/(n — k 1)
Exercises
4.1. •411 For a linear regression model of the form yi = 10+ f3txit + 02x12 P3Xi3ti you are given:
(i) There are 25 observations of the variables.
E7!1(Yi — 902 = 30
(iii) EP1(Y/ = 40
Calculate the F statistic for the model.
1But the fact that F(1, q) is the square of 1(q) is only mentioned in a footnote of An Introduction to Statistical Learning, so I don't know whether
they'll test on this.

Copyright 02022 ASM
4.2. al? [120-82-94:10] A sample of size 20 is fitted to a linear regression model of the form
Yi + P2x12 + P3xi3 + p4xi4 + p5xi5 + Ei
The resulting F ratio used to test the hypothesis Ho : f3i —P2 — /33 — p4 - P5 0 is equal to 21.
Determine R2.
(A) R. (B) (c) 17 (D) (EN126

/ 139
4.3. For a linear regression model yi = Po + pixi + j based on 10 observations, R2 = 0.95.

Determine the F statistic of the regression.
4.4. [4-S00:1] You fit the following model to 20 observations:
Y Po Pix + E
You determine that R2 = 0.64.

Calculate the value of the F statistic used to test for a linear relationship.
(A) Less than 30
(E) At least 39
4.5. 'Li? [MAS-I-S18:33] Consider a multiple regression model with an intercept, 3 independent variables, and 13
observations. The value of R2 = 0.838547.
Calculate the value of the F-statistic used to test the hypothesis Ho : =
= /33 = 0.
(A) Less than 5
(E) At least 20
4.6. [ST-S16:21] You are given the following linear regression model which is fitted to 11 observations:
Y = Po + Pix +
The coefficient of determination is R2 = 0.25.
Calculate the F-statistic used to test for a linear relationship.

(A) Less than 1.5
(E) At least 4.5

4.7. For the linear regression model yi = Po + Pixi + e you are given the following ANOVA table:
Source Sum of Squares df

Regression 12,235 1
Error 3,014 8
Determine the F statistic for determining the significance of the model.
4.8.
•41 A linear regression model with 22 observations is fitted as yi = —2 + 2.5xi + Ei. You are given R2 = 0.9.
Determine the width of the shortest symmetric 90% confidence interval for 13, the coefficient of X.
4.9. [3-F84:61 You are fitting the linear regression model yi = Po + f3ixi + Ei. You have determined:
io
E(xi — 1)2 = 400

io
E(yi — g)2 = 425

i=1
— 27)2 = 225
i=1
Calculate the t statistic used for testing the hypothesis Ho : pi = 0.

(A) 1.5 (B) 2.0 (C) 2.5 (D) 3.0 (E) 3.5
4.10. -11. [120-F89:41 You are studying the average return on sales as a function of the number of firms in an industry.
You have collected data for 1969-88 (20 years) and performed a two-variable regression of the form yi = /30 + x; +E.
You have obtained the following summary statistics from these data:
E xi = 4860 yi = 2.452 E xiyi 539.309
E 4 = 1,330,224 yi2 = 0.341804

The F ratio is 19.45.
Determine the upper bound of the shortest 95-percent confidence interval for the regression coefficient pi.
(A) —0.0010 (D) —0.0004 (E) —0.0002
4.11. 'lb [120-81-98:2] You fit a simple regression model to 47 observations and determine y`i = 1.0 + Lai. The
total sum of squares (Total SS), corrected for mean, is 54, and the regression sum of squares (Regression SS) is 7.
Determine the value of the t statistic for testing H0:131 = 0 against Hi: Pi 0.
(A) 0.4 (B) 1.2 (C) 2.2 (D) 2.6 (E) 6.7

Copyright 6)2(322 ASM
4.12. r[120-83-98:3] You fit a one-variable plus intercept regression model_ to seven observations.
You determine:
Error SS = 218.680
F = 2.088
Calculate R2.
(A) 0.3 (13) 0.4 (C) 0.5 (D) 0.6 (E) 0.7
4.13. %Ii[SRM Sample Question #441 Two actuaries are analyzing dental claims for a group of n = 100 participants.
The predictor variable is gender, with 0 and 1 as possible values. Actuary 1 uses the following regression model:
Y=p+e
Actuary 2 uses the following regression model:
Y = go f31 x Gender + £
The residual sum of squares for the regression of Actuary 2 is 250,000 and the total sum of squares is 490,000.
Calculate the F-statistic to test whether the model of Actuary 2 is a significant improvement over the model of
Actuary 1.
(A) 92 (B) 93 (C) 94 (D) 95 (E) 96
4.14. *-1? A company analyzes sales of its products by its agents. It considers the following explanatory variables:
• xi is the amount of time the agent has been with the company.
• x2 is the population of the agent's territory.
• x3 is the number of continuing education hours for the agent during the year.
Let y be total sales for an agent in one year. The company fits the following regression model:
The company uses data from 18 agents. Summary statistics from this model are:
18
— 902 = 1060
18
Dyi — 902
i=1
1820
Calculate the F statistic.
Exam SRM Study Manual Exercises continue on the next page .

4.15. *-411 You are given the following excerpt from an ANOVA table for a regression:
Source df Sum Sq Mean Sq F value
Regression 4 20.79
Error
Total 22 14,230
Determine the regression sum of squares.
4.16. 41 [MAS-I-F18:32] An actuary uses a multiple regression model to estimate money spent on kitchen equip-
ment using income, education, and savings. He uses 20 observations to perform the analysis and obtains the
following output:
Coefficient Estimate Standard Error t-value
Intercept 0.15085 0.73776 0.20447

Income 0.26528 0.10127 2.61953
Education 6.64357 2.01212 3.30178
Savings 7.31450 2.73977 2.66975
Sum of
Squares
Regression 2.65376
Total 7.62956
He wants to test the following hypothesis:

• Ho: pi = P2 = P3 = 0
• H1: At least one of {pi,132,p3}*o
Calculate the value of the F-statistics used in this test.
(A) Less than 1

(D) At least 5
(E) The answer cannot be computed from the information given.
Comparison of nested models
4.17. "41 A regression model has 5 variables: xi, x2, ,x5, and an intercept. An excerpt from the output from the
regression is
Residual standard error: 3.85 on 44 degrees of freedom.
Multiple R-squared: 0.82.
The variable x5 is removed from the model and the regression is run again. In the resulting regression, the
residual standard error is 4.02.
Determine the F statistic to determine the significance of x5.
Exam SRM Study Manual Exerrises continue on the next page...

4.18. slir You are fitting the following model to 20 observations:

yi = Po + p1x11 + p2x12 + P3X i3 134x4 Ei
The error sum of squares for this model is 8.

You then constrain f33 = 134 = 0 and perform a regression. The error sum of squares for the constrained model is
10.
Calculate the F statistic to test the significance of the variables x.3 and .x4.
4.19. •121 [ST-S15:21) The following two linear regression models were fit to 20 observations:
• Model 1: Y = 130 +131X1 +132X2 + E
• Model 2: Y = flo +131X1 132X2 )33X3 p4x4 + E
The results of the regression are as follows:
Model Error Sum Regression Sum
Number of Squares of Squares
1 13.47 22.75
2 10.53 25.70
The null hypothesis is H0: p3 = /34 = 0 with the alternative hypothesis that the two betas are not equal to zero.
Calculate the statistic used to test H0.
(A) Less than 1.70
(E) At least 2.00
4.20. sir [ST-S16:22) The following two models were fit to 18 observations:
• Model 1: Y = po + + p2x2 + E
• Model 2: Y = PO + + p2x2 + p3x1x2 + p4x? + 135x1 + e
The results of the regression are:
Model Error Sum Regression Sum
Number of Squares of Squares
1 102 23
2 78 39
Calculate the value of the F-statistic used to test the hypothesis that p3 = p5 = 0.
(A) Less than 130
(E) At least 1.60
Exam SRM Study Manual Exercises continue on the next page.

4.21. s: You are given the following data regarding two models based on 15 observations:
Model Error sum of squares
Y Po +Pixi +132x2 + Opq +)34x1+ £ 22.8
Y= yo + yjx + y2x2 + E 57.4
Determine the F ratio for testing the hypothesis (33 = )64 = 0.
4.22. "41' [S-F17:35] Consider the following 2 models, which were fit to the same 30 observations using ordinary
least squares:
Modell Model 2
SS Total 19851 SS Total 19851
SS Error 2781 SS Error 2104
Parameter if df Parameter p df
Intercept 130.2 1 Intercept 130.2 1

Xi 5.1 1 Xi 4.5 1
X2 —4.2 1 X2 —5.9 1
X3 35 1
X4 —2.3 1
You test Ho: f_33 = /34 = 0 against the alternative hypothesis that at least one of f33, f34 0.
Calculate the F statistic for this test.
4.23. ".41 [120-81-98:6] You wish to find a model to predict insurance sales, using 27 observations and 8 variables
labeled xi, x2,... , x8 and an intercept. The analysis of variance tables for two different models from these data follow.
Model A contains all 8 independent variables; Model B contains xi and x2 only. Both models include an intercept.
Model A
Source SS df MS
Regression 115,175 8 14,397
Error 76,893 18 4,272
Total 192,068 26
Model B
Source SS df MS
Regression 65,597 2 32,798
Error 126,471 24 5,270
Total 192,068 26
Calculate the F ratio for testing the hypothesis Ho: P3 —/34 — /35 — /36 — 137 — P8 — 0
(A) 5.8 (B) 4.5 (C) 2.6 (D) 1.9 (E) 1.6

Copyright 02022 ASM
4.24. 4kili [120-82-97:4] You apply all possible regression models to a set of five observations with three explanatory
variables and an intercept. You determine Error SS, the sum of squares due to error (or residual), for each of the
models:
Variables
Model in the Model Error SS
I xi 5.85
II X2 8.45
III X3 6.15
IV xi, x2 5.12
V xi, x3 4.35
VI X2i X3 1.72
VII xl, X2, X3 0.07
You also determined that the estimated variance of the dependent variable y is Var(y) = 2.2.
Calculate the value of the F statistic for testing the significance of adding the variable x3 to the model y =
13o + pixii + Ei.
(A) 0.3 (B) 0.7 (C) 1.0 (D) 1.4 (E) 1.7
4.25. IN? [120-81-98:5, Sample C4:35] You are determining the relationship of salary (y) to experience (xi) for both
men (x2 = 1) and women (x2 = 0). You fit the model yi =160 + Pixn + /32xi2 +133xiixi2 + ri to a set of observations for
a sample of employees.
You are given:
(ii) Regression SS is 330.0117 for this model and Error SS is 12.8156.
(iii) For the model yi = l ix + r, Regression SS is 315.0992 and Error SS is 27.7281.
f
Determine the F statistic to test whether the linear relationship between salary and experience is identical for
men and women.
(A) 0.6 (B) 2.0 (C) 3.5 (D) 4.1 (E) 6.2
4.26. '41 [4-S01:5] A professor ran an experiment in three sections of a psychology course to show that the more
digits in a number, the more difficult it is to remember. The following variables were used in a multiple regression:
xi = number of digits in the number
x2 = 1 if student was in section 1, 0 otherwise
X3 = 1 if student was in section 2, 0 otherwise
y = percentage of students correctly remembering the number
You are given:
(i) A total of 42 students participated in the study.
(ii) The regression equation y = /30 + p1x1 +1324 + /33x2 + 0(3 + e was fit to the data and resulted in R2 = 0.940.
(iii) A second regression equation y = 130 + Pixi + f39q. was fit to the data and resulted in R2 = 0.915.
Determine the value of the F statistic used to test whether class section is a significant variable.
(A) 5.4 (B) 7.3 (C) 7.7 (D) 7.9 (E) 8.3

4.27. Nsii A company is modeling auto collision losses. In addition to two other variables xi and x2, the company
is considering territory as a categorical variable. Three binary variables are used for territory: x3, x4, x5. There are
820 observations.
A regression with all 5 variables and an intercept resulted in an F statistic of 419.

A regression without the 3 territory variables resulted in an F statistic of 759.
Determine the F statistic to test the significance of territory.
4.28. 41 [SRM Sample Question #24] Sarah performs a regression of the return on a mutual fund (y) on four
predictors plus an intercept. She uses monthly returns over 105 months.
Her software calculates the F statistic for the regression as F = 20.0, but then it quits working before it calculates
the value of R2. While she waits on hold with the help desk, she tries to calculate R2 from the F-statistic.
Determine which of the following statements about the attempted calculation is true.
(A) There is insufficient information, but it could be calculated if she had the value of the residual sum of squares
(RSS).
(B) There is insufficient information, but it could be calculated if she had the value of the total sum of squares
(TSS) and RSS.
(C) R2 = 0.44
(D) R2 = 0.56
(E) R2 = 0.80
4.29. P [120-83-98:8] You fit the following model to 10 observations:

Y = PO ± P1 X1 + 132 X2 + £
You are given:
Regression SS = 61.3
Total SS = 128
You then fit the following new model, with an additional variable x3, to the same data:
y = Po +pixi + P2X2 P3X3 E
For this new model, you determine:
Total SS = 128
Calculate the value of the F statistic to test Ho : p3 = a.

(A) 0.01 (B) 0.41 (C) 1.76 (D) 4.30 (E) 10.40

Copyright 02022 ASM
4.30. •••• [4-F02:27] For the multiple regression model y = Po + Pixi + p 2X2 + 83X3 + /34x4 +135x5 + E, you are given:
(i) There are 3,120 observations.
(ii) The total sum of squares is 15,000.
(iii) 110: P3 = P4 = 35 = 0
(iv) R2 for the unrestricted model is 0.38.
(v) The regression sum of squares for the restricted model is 5,565.
Determine the value of the F statistic for testing Ho.
(A) Less than 10
(E) At least 16
Solutions
4.1. Total SS = 40 and Error SS = 30 so Regression SS = 40— 30 = 10. Also k = 3, n = 25.
Regression SS/k 10/3 7

F3,21— = —
= —
Error SS/(n — k — 1) 30/21 j3
4.2. 21 =
R2/5
(1 — R2)/(20 — 6)
15 _ R2
T — 1— R2
15
R2 = (C)
17
4.3. Regression
Total SS
SS , 0.95
Error SS
= 0.05
Total SS
Regression SS 0.95
F— (n 2) = —(8) = 152
Error SS 0.05
4.4.
Regression 88/k R2/1 0.64
F—
Error SS/(n — k — 1) (1— R2)/18
= —
0.36/18
— I .. 1 (B)
4.5. Notice that Ho eliminates all variables other than the intercept, so this is the F test for the entire regression.
By equation (4.1),
R2/k 0.838547/3
F3,9 — = 15.58 (D)
—
(1 — R2)/(n — (k +1)) (1 — 0.838547)/9

/ R2 \ /0.25\ 1 i
=
(C)
F = (12 — 2)1. — R2) = 9 0) l i
4.7.
12,235/1
F1'8 = = 32.48
3,014/8

4.8. We'll calculate F and then use the fact that t is the square root of F.
F
Regression SS 0.9
=
(n 2) = —(20) = 180
Error SS 0.1
= tp = 'VT8-1:1= 13.4164
so
2.5
s = = 01863
P 13.4164
The t coefficient with 20 degrees of freedom for 90% confidence is 1.725. The width of the confidence interval is
2(1.725)(0.1863) = 0.6429
4.9. The second sum is Total SS and the third sum is Regression SS, so R2 = E = *7.. Then
R2(8) (9/17)(8) _9
F1,8 = = -
1 - R2 8/17
t = =111 (D)
4.10. The t-statistic, which is pliss,, is the square root of the F ratio, or Vig-.LB' = 4.4102. The coefficient of the t
distribution with 18 degrees of freedom and 95% confidence is 2.101. Then
(4860)(2.452)
- 9) = 539.309 20
= 56.527
48602
1,330,224 =
149 244
'
20
-56.527
p" -0 .000379
1'1 = 149,244
=
iPli _ 0.000379 = 0 0000859
sPi 4 .4102
1'13,
The upper bound of the confidence interval is -0.000379 + 2.101(0.0000859) = —0.0002 . (E)
4.11. We'll use the square root of the F statistic, which is

Regression SS 7
F1,45 = — 6.7021
Error SS/45 - (54 - 7)/45
The t statistic is therefore V0672.1. = 2.5889 .(D)
4.12. We back out Regression SS from F.

Regression SS 5(Regression SS)
2.088 =
Error SS/(7 -2) 218.680
(2.088)(218.680) = 91.321
Regression SS = 5
91.321
R2 = Regression
Total SS
SS = 218.680 + 91.321
= 0.295 (A)
4.13. Actuary l's model is intercept only, so we're just calculating the significance of a non-null model. Use
formula (4.1). Total sum of squares is 490,000 and residual sum of squares (Error SS) is 250,000, so regression sum
of squares is 490,000 - 250,000 = 240,000. There are 100 observations and 1 variable, so n = 100, k = 1.
240,000/1 94.08 (C)
_

4.14. E181(9i- p1)2 is the regression sum of squares, while Er i(yi — gi)2 is the total sum of squares. The error sum
of squares is the difference, or 1820 — 1060 = 760. The regression sum of squares has 3 degrees of freedom, while
the error sum of squares has 18 —4 = 14 degrees of freedom. The F statistic is
1060/3
F3,14 —
6.51
760/14 —
4.15. Since the total sum of squares has 22 degrees of freedom and Regression SS has 4, Error SS must have 18
degrees of freedom.
Regression 55/4 Regression SS \

F4,18 =
Error SS/18
= 4.5
(Total SS — Regression SS)
Regression SS \
20.79 = 4.5
( 14,230 — Regression SS)
4.62(14,230 — Regression SS) = Regression SS
65,742.6
11,697.97
4.16. The error sum of squares is Total SS — Regression SS = 7.62956 — 2.65376 = 4.97580. The regression SS has
p —1 = 1 = 3 degrees of freedom and the error SS has n —p = 20-4 = 16 degrees of freedom. Here's the ANOVA
table:
Source Sum of Squares df Mean Square

Regression 2.65376 3 0.884587 2.8444
Error 4.97580 16 0.310988
Total 7.62956 20
(B)
4.17. Since s = .\/Error SS /(n — k — 1), it follows that Error SS = (n — k — 1)s2 . The Error SS of the unrestricted
model is 44(3.852) = 652.19. For the restricted model, Error SS = 45(4.022) = 727.22. Then the F statistic is
(Error SSR — Error SSuR)/q 727.22 — 652.19
5.062
Fi,44 =
Error SSuR /(n — k — 1) 652.19/44
4.18. The numerator has 2 degrees of freedom for q = 2 constraints; the denominator has 15 degrees of freedom
for n = 20 observations with k = 4 explanatory variables for the unrestricted model.
(Error SSR — Error SSuR)/q (10 — 8)/2 15
1.875
F2,15 — = =
Error SSuR/(n — k — 1) 8/15 8
4.19. The statistic used is the F statistic of formula (4.3):
(Error SSR — Error SSuR)/9

Fq,,-(k+i)
Error SSuR /(2 — (k + 1))
In our case, n = 20, there are q = 2 restrictions, and k = 4, so the statistic is
(13.47 — 10.53)/2
F2,15 — 2.094 (E)
10.53/15

Copyright @2022 ASM
4.20. Use equation (4.3) with n = 18, k = 5, q = 3.
(Error SSR — Error SSuR)/q (102— 78)/3 1.231 (A)

F3'12 =Error SSuR — (k + 1)) 78/12
4.21. Squaring x3 and x4 does not affect the solution.

The first model, the unrestricted one, has five ps. The second one has three ys, so the number of restrictions is
q =2. Then
F=
(57.4 — 22.8)/2 7.5877
22.8/(15 — 5)
4.22. Notice that there is an intercept in addition to four variables. Thus the number of degrees of freedom of the
=restricted model is 30 — 5 = 25. We are testing 2 restrictions. The F ratio is
(2781 — 2104)/2 4.022
F2,25
2104/25
4.23. To compare the two models, we compare the error sum of squares. Notice that n = 27 while k = 8, so
n k — 1 = 18; but the degrees of freedom for the Error SS is explicitly stated as 18 anyway. The number of
coefficients set equal to 0 is q = 6, so
(Error SSR — Error SSuR)/q (126,471 — 76,893)/6
F = 1.9 (D)
Error SSuR/(n — k — 1) 76,893/18
4.24. The unrestricted model with two variables x1 and x3 and a constant has n — k —1 = 5 — 2 — 1 = 2 degrees of
freedom and we add q = 1 restriction. The F statistic is
(Error SSR — Error SSuR)/q _ (5.85 — 4.35)/1 _

— —
0.690 (B)
Error SSuR /(n — k 4.35/2
The estimated variance of y was not needed, although it was used in the official solution.
4.25. There are n = 11 observations. The unrestricted model has k =3 variables and there are q = 2 restrictions.
(27.7281 — 12.8156)/2
F2,7 = 3.5(1.163621) = 4.0727 (D)
12.8156/7
4.26. There are q = 2 restrictions, n = 42 observations, and k = 4 variables in the unrestricted model. In terms of
R2, the F statistic is
(RtR —4)/q (0.940 — 0.915)/2 —
7.708 (C)
F2,37 =
(1 — 4R)/(n — k — 1) (1 — 0.940)/37
4.27. For the 5-variable regression,

F=
R2/5 =419
(1 — R2)/814
So
R2 _ 419(5) = 2.57371
1 — R2 814
2 57371
R2 = = 0.72018
3.57371

For the 2-variable regression,
F—
R2/2 = 759
(1 — R2)/817
R2 759(2) = 1.85802
1 — R2 = 817
R2 = 1'85802
2.85802
= 0.65011
There are q = 3 restrictions in the 2-variable model. The F statistic to test the significance of territory is
(RtR Ri)/q (0.72018 — 0.65011)/3 —
67.95
F3814—
(1 — qR)/(n k 1) = (1 — 0.72018)/814
4.28. You can use equation (4.2) to obtain the answer immediately, but since I suggested not memorizing that
formula, we'll work this out from first principles.
F =
Regression SS/k
(Total SS — Error SS)/4
Total SS
20 =
Error SS/100
= 25
( Total SS )
Error SS
1
La+1 = 1.8
= 25
Error SS
Error SS
R2 —1— = 1— =
0.4444 (C)
Total SS 1.8
4.29. There is q = 1 restriction, n = 10 observations, and k = 3 variables in the unrestricted model.
(Error SSR — Error SSR)/1

F1,6 —
Error SSuR /6
=6
(65.6 — 61.3)
128 — 65.6
= 6(0.06891) = 0.4135 (B)
4.30. Error SSR = Total SS — Regression SSR = 15,000 — 5,565 = 9,435. For the unrestricted model,
Error SSUR
Total SS = 1 — R?.JR = 0.62 Error SSuR = 15,000(0.62) = 9,300
The F statistic is
(9,435 — 9,300)/3
F3,3114 — 15.068 (D)
9,300/(3,120 — 6)
Quiz Solutions
4-1. The error sum of squares of the smaller model is the total sum of squares, 6520 + 349, minus 6410, or 459. The
F ratio is
(459 — 349)/3
F3,22 — 2.311
349/22

000
Do you have any fun ideas
1 STUDY for future Study Breaks?
Share your inspiration with
BREAK. #asmstudybreak
Exercise is a proven way to boost your verbal memory, thinkin and learninEli
V
0
V
0
'L
E
Pyramid with Arms Extended

Yoga - Sun Salutations
*Hold each pose for three long breaths, breathing in through
your nose and out through your mouth.
Note that we are mathematicians, not yogi: attempt poses at your own risk!
atsim
Lesson 5
Linear Regression: Validation
Reading: Regression Modeling with Actuarial and Financial Applications 2.5,5.3-5.7; An Introduction to Statistical Learning
3.3.3
5.1 Validating model assumptions

We will discuss validating model assumptions by checking the pattern of residuals. Before we do that, let's derive if
the variance of the residuals.
The formula for (3 is
p= pexylry (2.4)
The fitted value of y is 649
Xl3 = X(X'X)-1X'y
Let H = X(X/X)-1X'. Then y = By. The n x n matrix H is called the hat matrix by many authors, since it puts a hat on •41
y.' Since y = ê, it follows that 2 = (I - H)y. In the linear regression model, y = pp+ 131x + E with all terms other
than y and e non-random, so Var(e) = Var(y) = c2. But in the fitted model, Var(2) < Var(y); Var(21) = (1 - hii)cr2.
You can see that while the variance of the error E is c2, the variance of the residual 2 is less than c2. This is not
surprising; E is estimated from the data, which restricts its variation. With a little matrix algebra, shown in a sidebar
(You need not know any of this derivation for the exam.), we get
Var(21) = (1 - h11)o-2 if
where hii is the ith diagonal element of the H matrix. As usual, o2 is estimated by $2, the residual variance of the
regression. The variance of 2; is estimated by (1 - h11)s2, where hi; is the ith diagonal element of the H matrix and
s2 is the square of the standard error of the regression. The leverage of the ith observation is defined as hii. We'll
discuss the use of leverage later in this lesson.
To best use the residuals, we should standardize them. Their mean is already 0, so standardizing consists of
dividing by their standard deviations. There are three possible formulas to standardize them:
ei
1. —.
s
This is simple, but not precise, since s is the estimated standard deviation of the true error, not of the
residual.
P
2. ri = This is the standardized residual, the residual divided by its estimated standard deviation.
s
3. ei , where s(i) is the residual standard error of a regression that excludes observation i. If yi is
s(;) 11177Z
unusually high or low, it may artificially increase s and thus reduce the standardized residual. By excluding
observation i, we avoid this problem. This residual follows Student's t distribution with n - (k +1) degrees of
freedom, so it is called the studentized residual. Sr
'The syllabus textbook Regression Modeling with Actuarial and Financial Applications mentions that H puts the hat on y, but does not call it the
hat matrix, so you are not responsible for knowing this term. However, I will use it for convenience.

76 5. LINEAR REGRESSION: VALIDATION
Derivation of Variance of Residual
In the text, we derived 2 = (I - H)y. Let M = I - H. Then.
= My
Note that M' = M since H is symmetric, and

M2 = 12 - 2IH + 112 = I - 2H + H2
and
H2 = X(X1X)-1X1X(X/X)-1X' = X(VX)-1X` = H
so M2 = M. It follows that
Var(E) = M2 Var(e) = MG2
This is the covariance matrix for E. Since M = H, the diagonal elements of M are 1 minus the diagonal elements
of H, while the other elements are negative the corresponding elements of H. In particular, the variance of ti is
(1 -
EXAMPLE 5A '141 A regression with three parameters including the intercept is based on 6 observations. The hat
matrix is
0.458 0.376 0.232 0.132 0.009 0.189
0.376 0.411 0.255 0.037 0.164 0.085
0.232 0.255 0.200 0.112 0.036 0.165
H=
0.132 0.037 0.112 0.288 0.417 0.014
-0.009 0.164 0.036 0.417 0.719 0.001
\ -0.189 0.085 0.165 0.014 0.001 0.924
The standard error of the regression is 1.389.

The first 2 residuals are -0.893 and 0.883.
1. Calculate the estimated variance of E'l + 2.
2. Calculate the first two standardized residuals. •
SOLUTION: 1. The covariance matrix of 2 is s2 times I - H. The variances of Pi and E2 are (1 - 0.458)s2 = 0.542s2
and (1 - 0.411)s2 = 0.58952. The covariance is -0.376s2.
\Z-r(Pi + P2) = (1.3892)(0.542 + 0.589 + 2(-0.376)) = 0.731
2. We'll use ri to indicate the standardized residual of the ith observation.

-0.893
-0.873
1.389Nkr5U.
0.883
=
0.828
1.389V0 .589
Quiz 5-1 For a linear model,

(i) The standard error of the regression is 67.
(ii) The first residual is 3.2.
(iii) The leverage of the first residual is 0.04.
Calculate the first standardized residual.

Copyright 02022 ASM
5.2. OUTLIERS AND INFLUENTIAL POINTS 77
n- Let's now discuss how to use residuals to check whether regression assumptions are satisfied.
Linear regression makes the following assumptions:
1. The response is a linear function of the explanatory variables.
2. The response is distributed normally.
3. The response has constant variance. A variable is homoscedastic if its variance is constant and heteroscedastic
otherwise.
4. The observations are independent.

To check normality, construct a normal probability plot of the standardized residuals against a standard normal "sit
distribution. Closeness of the points in the plot to the 45° line indicates normality. Recall that a normal probability
plot is a special case of a ig q plot. See Section 1.3 for a discussion of q q plots.
To check that the variance is constant, plot the residuals or standardized residuals against the fitted values
A random pattern in the plot against fitted values indicates homoscedasticity. For example, the following graph
indicates heteroscedasticity:
ci
15
10
•
0
—5
—10
•
—15 Pi
0 5 10 15 20 25 30 35 40
The absolute values of the residuals grow as the fitted value grows, indicating that variance is not constant. To
correct this problem, if variances are known, use weighted least squares. The weights should be the reciprocals of
the standard deviations.
To check linearity of the model, plot the response or the residuals against each explanatory variable. A pattern
in the plot indicates that the relationship is not linear. For example, the graph in Figure 5.1 indicates a quadratic
relationship to x2 that has not been incorporated into the model:
To check that observations are independent, plot the ei in the order for which correlation is expected, such as
order of occurrence. Observe whether values of consecutive residuals are close to each other.
5.2 Outliers and influential points

Observations can be unusual in two ways:
1. They may be unusual vertically.
2. They may be unusual horizontally.

E
30
•
25
20
15
10
—5 •
•
—10
•
—15
X2i
—200 2 4 6 8 10 12 14
Figure 5.1: Quadratic scatter plot
An observation is unusual vertically if the value of the response is unusually high or low. Then the residual has an
unusually high absolute value. If the errors are normally distributed, very few standardized residuals should have
0
1 absolute value greater than 3. Observations with unusually high (in absolute value) residuals are called outliers.
An outlier can be handled in one of the following ways:
1. Include it in the model but comment on it.
2. Delete it from the data.
3. Create an indicator variable that is 1 only for outliers.

An observation is unusual horizontally if the point in k-space of the explanatory variables is far out of the range
of the other points. Such observations have a high weight in the formula for calculating the 9. Those observations
•10 are called influential points.
An indicator of an influential point is high leverage. We mentioned above that si Hy, and that the diagonal
• element hii is called the leverage of the ith observation. Leverages must each be between 1/n and 1, and sum up to
k +1. Thus on the average they are (k +1)/n. If leverage is more than 2 or 3 times (k +1)/n, the point is contributing
disproportionately to the result and the data should be reviewed.
For a simple regression, leverage is
=
(5.1)
We see that hii is high if and only if lxi — 11, the absolute distance of the explanatory variable from its mean, is
disproportionately high. This contrasts with outliers, which are observations for which the response variable is
disproportionately far from its fitted mean.
A high leverage point can be handled in one of the following ways:
1. Include it in the model and comment on it.
2. Delete it from the data.
3. Replace the variable causing the high leverage with another variable.

Copyright 02022 ASM
5.3. COLLINEARITY OF EXPLANATORY VARIABLES; VIF 79
4. Use a nonlinear transformation of an explanatory variable.

Cook's distance is a measure that combines outliers and leverage. Cook's distance is defined as follows. Let gi(i) •41'
be the fitted value of yi if the ith observation is removed from the data set. Let Di be Cook's distance for the ith
observation. Then
Di =
7= 1(91 — "i(i))2
= Ti (5.2)
(k +1)s2 (k + 1)(1 — hii)
where ri is the ith standardized residual. The first expression is a definition. We measure the impact of observation i
by squaring the difference between the prediction of the model with and without that observation. We sum up these
impacts and divide by a normalizing constant. The second expression is a product of the standardized residual, a
measure of how much of an outlier the observation is, and an expression depending only on leverage.
What value of Di indicates an unusual observation? The first factor, the standardized residual squared, should
average about 1 (the variance of a standard normal distribution) and average leverage is (k + 1)/n so the second
factor should be about 1/n if we ignore 1 — hii in the denominator. So the product should be about 1/n. Another
way to evaluate Di is to compare it to an F distribution with k +1,n — (k + 1) degrees of freedom. If Di is unusually
high, the observation should be checked.
EXAMPLE 5B c: (Continuation of Example 5A) In Example 5A, calculate Cook's distance for the first two residuals..
SOLUTION: The first factor in equation (5.2) is the standardized residual squared, and we calculated the standardized
residual in Example 5A. In that example there are k +1 = 3 parameters.
Di ( 0.873)2(k3(1 0.458 \ 0.215

— 0.458))
0.411
D2 = (0.828)2 (3(1 — 0.411)) 0.159
Quiz 5-2 •41 For a linear model with 4 variables and an intercept, there are 100 observations. The first
standardized residual is 1.24 and the leverage of the first observation is 0.48.
Calculate Cook's distance for the first observation.
VIFi = (1 — (5.3)

Copyright Cs2022 ASIA
Since R2 > 0, VIF • must be at least 1. The larger it is, the more collinear variable xi is. Values above 10 indicate
severe collinearity.
Quiz 5-3 %II Consider the linear regression model based on 21 observations:
Yi = Po + Plr1l+132xi2 +p3xi3 + El
It is suspected that x3 is collinear with the other variables. A regression of x3 on the other two variables is
performed. The resulting F statistic is 6.48.
Calculate the VIF for x3.
There is a relationship between the standard error of 1)1 and the VIF of xj:
VVIFi
.4* Ski = S (5.4)
Sx)
In this equation, sbj and s are based on the full regression of y on the k variables, not omitting xj. We see that all
other things being equal, a higher VIF corresponds to a higher standard error of the coefficient estimate. That means
it is more difficult to detect the significance of variables in the model.
High leverage may induce or mask collinearity.
••41 An explanatory variable may improve the model even if it is mildly collinear with others. A suppressor variable is
an explanatory variable that increases the significance of other variables when included in the model.
Two matrices X1 and X2 with equal numbers of rows are orthogonal if ViX2 = 0. If xi is the 1 x n vector of an
explanatory variable and X2 has the other explanatory variables, then xi is an orthogonal variable. In a sense, it
is the opposite of a collinear variable. Its VIF is 1. Adding it to the model does not affect the other (3 estimates.
We will later discuss principal components regression, a method of creating orthogonal variables. However, while
orthogonal variables remove the collinearity problem, they are difficult to interpret.

5.3. COLLINEARITY OF EXPLANATORY VARIABLES; VIP 81
Table 5.1: Summary of Formulas for Validating Linear Model Assumptions
The hat matrix
H = X(VX)-1X'
Estimated variance of residuals
'c (e1) = (1 - hii)s2

Standardized residuals
ri =
s-VT. 1TH
Studentized residuals
r. =
S(i)
Outliers Standardized residual is high.
Leverage hii.
Properties of leverage
Leverage for simple regression

(5.1)
Cook's distance
E7=1 (Pi — 2 hi;
D, = ' =r. (5.2)
(k +1)52 ' (k +1)(1— hii)
Variance inflation factor Let R2 be R2 for regression of xi on other explanatory variables.

(D
viF1=0_eu,y1 (5.3)
(5.4)

Exercises
Residuals
5.1. You are given the following hat matrix:

0.5932 0.4407 0.1356 —0.1695
(
0.4407 0.3559 0.1864 0.0169
0.1356 0.1864 0.2881 0.3898
—0.1695 0.0169 0.3898 0.7627
The residual standard deviation is 0.7235.
Calculate the estimated variance of the second residual ez,
5.2. °call For a linear regression with 11 observations and 2 variables plus an intercept, you are given:
•
— 902 = 32
• y3 — 93 = 1.2
• The leverage of the third observations is 0.6.
Calculate the standardized third residual.
5.3. •-.11 For a linear regression with 25 observations and 3 variables plus an intercept,
(i) The residual standard error is 22.4.
(ii) If the 21st observation is removed, the residual standard error is 17.2.
(iii) The standardized residual for the 215t observation is 0.83.
Calculate the studentized residual for the 21st observation.

Copyright 02022 ASM
5.4. [MAS-L-F18:351 You are fitting a linear regression model of the form:
y = Xf3 + e; e1 - N(0, 0-2)

1 0 1 9\ 19\
1 1 1 15 32 4 3 51\
361
(6
1 1 1 8 19 4 4 2
X=
y=117;; "
=
1 1 0 7 3 2 3 321
1
1
1
0
0
0
6
6
1 15/
13 51 36 32 491
1.75 -0.20 0.54 -0.20 2.335
oc,xyi .= ( 0.20 )
0.84 0.25 -0.06 0.297
(rX)-1X'Y =
,
0.54 0.25 1.38 -0.16 ' -0.196

-0.20 -0.06 = 0.16 0.04 1.968
0.684 0.070 0.247 -0.171 0.146 0.316

0.070 0.975 -0.044 0.108 -0.038 0.070
0.247 -0.044 0.797 0.063 0.184 -0.247
H = X(X'X)-1X' = -0.171 0.108 0.063 0.418 0.411 0.171
-0.146 0.038 0.184 0.411 0.443 0.146
0.316 -0.070 0.247 0.171 0.146 0.684
Calculate the residual for the fifth observation.
(A) Less than -1

(B) At least -1, but less than 0
(E) At least 2
5.5. A regression model has the form yi = flo + pixi + e1. It is based on 4 observations of xi: {3,7,11,15). You
are given
_ ( 1.2625 -0.1125)
-
-0.1125 0.0125
Calculate the hat matrix, the matrix H for which Si = Hy.
Outliers and influential points
Use the following infornultion for questions 5.6 and 5.7:

A regression model with 5 variables and an intercept is fitted to 60 observations. The first 5 diagonal elements
of the hat matrix are {0.1612, 0.0808, 0.2159, 0.0259, 0.1307}.
•
5.6. For which of the first 5 observations is leverage greater than 2 times average?

Copyright 02022 ASM
5.6-7. (Repeated for convenience) Use the following information for questions 5.6 and 5.7:
A regression model with 5 variables and an intercept is fitted to 60 observations. The first 5 diagonal elements
of the hat matrix are 10.1612, 0.0808, 0.2159, 0.0259, 0.13071.
5.7. You are given:

(i) The first five residuals are 1-1.515, 1.524,1.398, —0.633, 0.8251.
An observation is considered an outlier if the absolute value of its standardized residual is greater than 2.
Which of the first 5 observations are outliers?
5.8. A regression model with 4 variables and an intercept is fitted to 32 observations. For the sixth observation,
you are given:
(i) The standardized residual is 0.824.
(ii) The leverage is 0.471.
Calculate Cook's distance for the sixth observation, D6.
5.9. *-41° [MAS-I-F19:311 In order to predict individual candidates' test scores a regression was performed using
one independent variable, Hours of Study, plus an intercept. Below is a partial table of data and model results:
Hours of Standardized
Candidate Test Score Study Leverage Residuals
1 2,041 538 0.6205 —1.3477
2 2,502 548 0.2018 —0.4171
3 2,920 528 0.6486 —1.1121
4 2,284 608 0.2807 1.1472
Calculate the number of observations above that are influential using Cook's Distance with a unity threshold.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
5.10. 4' You are given the following information about 5 observations from a linear regression:
i Residual Leverage
1 0.153 0.305
2 0.274 0.223
3 —0.211 0.190
4 —0.414 0.101
5 0.352 0.176
Using Cook's distance as a measure, which of these five points is most influential?
5.11. 14. A regression model with 3 variables and an intercept is fitted to 48 observations. You are given
(i) El = 7.624
(ii) The standard error of the regression is 4.823.
(iii) Cook's distance for the first observation is 0.804.
Determine the leverage of the first observation.

5.12. s: [S-F17:361You are analyzing a database and have fit a multiple regression with 12 continuous explanatory
variables and one intercept. You are given the following:
(i) Sum of square errors = 618
(ii) The 10th diagonal entry of the hat matrix, hum = 0.35
(iii) The 10t11residual value = io = 9.5
(iv) Cook's Distance = D10 = 5.0
Calculate the number of data points this model was fit to.
(A) Less than 200
(E) At least 800
5.13. A regression model of the form yi = Po + xi + ei is fitted to 80 observations. You are given that 1 = 51
and Er 1(xi - 1)2 = 60,919.
An observation is considered a high leverage observation if its leverage is greater than 2 times average. In other
words, an observation xi is high leverage if xi <a or xi > b.
Determine a and b.
5.14. s: [MAS-I-S19:31] You are fitting a linear regression model of the form:
y= + E; ei N (0, a2)
/1 0 1 9 21
1 1 1 15 32 3 2 3 32 1.38 0.25 0.54 -0.16
1 1 1 8 19 2 4 4 36 -0.20 -0.06
..Y = x,,, = 00..5245 _0.84
,,,
x=
0 1 1 7
'
17
' A'
3 4 6 51
;
v`'A)-1 =. 0.20 1.75 -0.20
0 1 1 6 15 32 36 51 491 -0.16 -0.06 -0.20 0.04
k0 0 1 6j \15/
0.684 0.070 0.247 -0.171 0.146 0.316
0.070 0.975 -0.044 0.108 -0.038 0.070
0.247 -0.044 0.797 0.063 0.184 -0.247
H = X(VX)-1X1 = -0.171 0.108 0.063 0.418 0.411 0.171
-0.146 -0.038 0.184 0.411 0.443 0.146
\ 0.316 -0.070 0.247 0.171 0.146 0.684
/20.93
32.03
0.293720 0
( 1943 );
19.04
(x/X)-IX'Y X(X'X)-1X'y = 16.89 '
a- = 0.012657
1.854 15.04
15.07
Calculate how many observations are influential, using a unity threshold for Cook's distance.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4

5.15. s: [S-F17:371 You are given the following residual rig plot for a fitted linear regression model:
Normal Q-Q
-2 -1 0 2
Theoretical Quantiles
I. The distribution of residuals is skewed to the left.
II. The residuals are serially correlated.

III. Observations 33 and 101 are influential points
Determine which of the above statements can be concluded to be true from the above chart.
(A) I only (B) II only (C) III only (D) I, II, and III
VIF
5.16. "41 Some authors suggest that a VIFj of 5 or higher indicates high collinearity of xi to the other variables in
a regression model.
Suppose xj is regressed against the other explanatory variables in a model. Let be the coefficient of
determination for this regression. Which values of R2(i) indicate high collinearity according to those authors?
5.17.
•1? A linear regression model with an intercept is used to model y as a function of xi and x2. The correlation
coefficient of the two explanatory variables is 0.4.
Determine VIF2.
Exam SRM Study Manual Exercises continue on The next page .

5.18. "41 Auto collision losses are modeled using two categorical variables: USE (business or pleasure) and MILES
(less than 10,000 miles driven per year or more than 10,000 miles driven per year). You have 5 observations of the
two variables:
USE 1 0 1 1 0
MILES 1 0 1 0 0
You suspect that the two variables are almost collinear. To test this hypothesis, you calculate the VIF.
Determine the VIF of MILES.
5.19. "4' [S-F15:401 You are given the following information related to the linear model:
Y +P2x2 + p3x3 e
•
,c1 yo + Y1X2 + y2x3 + E
Degrees of Sum of
Source Freedom Squares
Regression 2 24.74
Error 503 226.34
• x2 = 60 + (52x3 + E
Degrees of Sum of
Regression 2 3.73
Error 503 3.04
E
• X3 = ÷ 171X1 172X2
Degrees of Sum of
Regression 2 214,2.58
Error 503 85,883
Calculate the variable inflation factor for the variable which exhibits the greatest collinearity in the original
model.
(A) Less than 1.0

(8) At least 1.0, but less than 2.0
(E) At least 4.0

Copyright 02022 ASM
5.20. `11. [MAS-I-S18:32] Two different data sets were used to construct the four regression models below. The
following output was produced from the models:
Residual
Dependent Independent Total Sum of
Data Set Model Sum of
variable variables Squares
Squares
1 YA X/11, X./12 35,9030 2,823
A
2 XA1 XA2 92,990 7,070
1 YB XBi, X22 27,5700 13,240
B
2 Xsi XB2 87,020 34,650
The threshold of the Variance Inflation Factor for variable j (VIFi) for determining excessive collinearity is
VIFi > 5.
Determine which one of the following statements best describes the data.
(A) Collinearity is present in both data sets A and B.
(B) Collinearity is present in neither data set A nor B.
(C) Collinearity is present in data set A only.
(D) Collinearity is present in data set B only.
(E) The degree of collinearity cannot be determined from the information given.
5.21. *41 [S-S17:32] You are given:

• A linear model includes three explanatory variables: X1, X2, and X3.
• R2 is the coefficient of determination obtained from regressing the Ph explanatory variable against all other
(D
explanatory variables.
•
= 0.05, R(22) = 0.83, and .1q3) = 0.47.
• The threshold of the Variance Inflation Factor for variable j (VIFJ) for determining excessive collinearity is
VIFJ > 5.
Determine which of the 3 variables under consideration will exceed the threshold established above.
(A) X1 only (B) X2 only (C) X3 only (D) X11X2,andX3

(E) The correct answer is not given by (A) , (B) , (C) , or (D)
For a regression of y on three variables xi, x2 and x3 and an intercept based on 50 observations, you are given
(i) Z(xi3 — R3)2 = 115.4
(ii) The error sum of squares when regressing y on the three variables is 142.4.
(iii) The error sum of squares when regressing x3 on the other two variables and an intercept is 89.7.
5.22. '41 Calculate the VIP for x3.

5.23. s: Calculate the standard error of b3.
5.24. Nei For a linear regression with 3 explanatory variables based on 24 observations, you are given:
(i) —11)2 = 64.4284.
(ii) zp1oi-y02. 196.0310.
(iii) The standard error of bi is 0.8955.
Calculate VIFi.

Copyright 02022 ASM
Miscellaneous
5.25. 6: [S-S16:361 You are given the following two graphs comparing the fitted values to the residuals of two
different linear models:
Graph 1 ,Graph 2
•
•
•
•
• a
• 0
41D
•
•
—
• a •• •
• • ..
•
S
•
in ••
•
t •
•
4
I .
• • •• •
•
•• 1.
•
• t• • • e
•
.I, • llio • •
7 i
• ...• •S •
• *
• o.
•-• ,
••• •• it
• - •
••
• •,,, . .0.
,,,, •
ea ••
_
'4
I •
00-r
00 0.2 0.4 08 08 1.0 18 20 22 24 26 28 30
Pitied fitted
Determine which of the following statements are true.

I. Graph 1 indicates the data is homoscedastic
II. Graph 1 indicates the data is heteroskedastic
III. Graph 2 indicates the data is non-normal
(A) I only (B) II only (C) III only (D) I and III (E) II and III
Solutions
5.1. Va-r(P2) = 62(1 - hn) = 0.72352(1 - 0.3559) = 0.3372
5.2. The residual standard error is
902 32
s=
\IE(Yi n + 1) y (2 + 1)
=2
The standardized third residual is
1.2
T3 = 0.948683
s A,/ —1 - - h3; 2 VT-75
(---\ 5.3. For the studentized residual, we divide the residual by 17.2 instead of 22.4 and get 0.83(22.4/17.2) = 1.0809

Copyright 02022 ASM
5.4. The hat matrix H puts the hat on y; Hy = Sr. So
St5 = -0.146(19) - 0.038(32) + 0.184(19) + 0.411(17) + 0.443(13) + 0.146(15) = 14.442
and the residual is 13 - 14.442 = -1.442 (A)

5.5.
1 3 0.7 0.4 0.1
0.2)
1 7 1 1 1 0.4 0.3 0.2 0.1
X(X'X)-1-X'
( 1.2625 -0.1125) (1
_
1 11 -0.1125 0.0125 3 7 11 15 0.1 0.2 0.3 0.4

1 15 -0.2 0.1 0.4 0.7
5.6. Average is (k + 1)/n = 6/60 = 0.1. Leverage is higher than 0.2 for the third observation only
5.7. Standardized residuals are
-1.515 1.524 1.398

- -2.068 = 1.987 - 1.973
0.8-‘11 612 0.8il1 - 0.0808 0.8A/1 - 0.2159
-0.633 0.825
- -0.802 - 1.106
0.8-\/1 - 0.0259 0.8V1 01307
Only the first observation is an outlier.

0.471
D6 = (0.8242)
(5(1 - 0.471)) - 0.1209
5.9. We use leverage and standardized residuals to calculate Cook's distance
2
r
1 P(1 - hii)
where p, the number of parameters, is 2 here: the intercept and the slope. Candidates 1 and 3 should be calculated
i rst, since they have high leverage and high standardized residuals. Then calculate Candidate 4. If Cook's distance
f
for candidate 4 is less than 1, it is certainly less than 1 for Candidate 2, whose leverage and absolute value of
standardized residual are both lower.
For Candidate 1,
1 0.6205
DI = 1.34772 = 1.485 > 1
2(1 - 0.6205)
For Candidate 3,
0.6486
D3 = 1.112122(1( - 0.6486) ) = 1.141 > 1
For Candidate 4,
1 0.2807
D4 = 1.14722
2(1 - 0.2807) ) = 0.257 < 1
and we conclude that since Candidate 4 is not influential, neither is Candidate 2. (But if you are curious, D2 = 0.022.)
(C)

Copyright ©2022ASM
5.10. Cook's distance is
D, = ?r
hii I E`‘ 11
(k + 1)(1 — hii) s2(1 h) (k + 1)(1 — hi1))
Since we are comparing the observations, we can ignore multiplicative factors such as s and k + 1, and simply
calculate
(1 — hii)2
For the five observations, this works out to 0.014781, 0.027731, 0.012893, 0.024636, 0.032118. The fifth observation
is the most influential.
5.11. Let Di be Cook's distance for the first observation.
\2 iln
= Air (k +1)(1 — hu)
7.6242h11
0.804 =
4.8232(4)(1 — h11)2
74.8084(1 — 1402 = 58.1254h11
74.8084141 — 207.7422h11 + 74.8084 = 0
hil = 0.4252 ,2.3518
2.3518 is rejected since leverage must be between 1/n and 1.

2 Error SS 618
=
5
n — (k + 1) n-13
2
E210 9.52(n — 13)
rio
=
— 0.22467(n — 13)
,S2(1 h10,10) 618(1 —035)
hi0:10
D10—r10(k + 1)(1 — hio,10)
0.35
5.0 = 0.22467(n — 13)
13(0.65)
5
=537
(0.22467)(0.04142012)
n = 550 (C)
5.13. The average leverage is (k + 1)/n = 2/n. From formula (5.1), leverage is greater than 2 times average if
1 (xi — I)2 2
+
80
n Ei =i(xi _ x)2 > 2 (— )
n
and n = 80
1 _ 3
(xi — 51)2 > 804 80 — aid
60,919
(xi —51)2>2284.46
ixi — 511 > 47.80
That means that a = 51 —47.80 = 3.20 and = 51+ 47.80 = 98.80

5.14. There's a lot to calculate here. The formula for Cook's distance, (5.2), has two factors. The first factor of
Cook's distance is the square of the standardized residual, ri. The fitted values are given by X(X'X)-1X/y, the hat
matrix applied to y, and the observed values are y. The differences between the two, observed minus fitted, are 0.07,
-0.03, -0.04, 0.11, -0.04, and -0.07. To standardize, we divide the squares of these by o-2(1 1111), where hii are the
diagonal elements of the hat matrix II, so
0.072
T2
1
= = 1.225119
0.012657(1 - 0.684)
and similarly the other qs are 2.844276, 0.622721, 1.642599, and 1.225119 respectively. The other factor of Cook's
distance is hid (p(1 - hii)) where p = 4 is the number of variables including the intercept. This factor will have to
be large to make the product greater than 1. For the second observation, the second factor is
0.975
= 9.75
4(1 - 0.975)
and multiplying this by 2.844276 results in a product greater than 1, making it an influential observation. For the
second highest hil, h33, we get 0.797/ (4(1 - 0.797)) = 0.981527, and multiplying this by 0.622721 results in a number
less than 1. For the first and sixth observations, hi; = 0.684, and the quotient is 0.684/(4(1 - 0.684)) = 0.541139. The
quotients for the fourth and fifth observations are even lower. Multiplying these quotients by rs that are no higher
than 1.225119 results in numbers less than 1. So only the second observation is influential.
The following table summarizes the calculations:
h-•
ei hii ri = 02(i-h) 4(1-hu)
Di
1 0.07 0.684 1.225119 0.541139 0.66296
2 -0.03 0.975 2.844276 9.75 27.73169
3 -0.04 0.797 0.622721 0.981527 0.61122
4 0.11 0.418 1.642599 0.179553 0.29493
5 -0.04 0.443 0.226952 0.198833 0.04513
6 -0.07 0.684 1.225119 0.541139 0.66296
(3)
5.15.
I. The standardized residual distribution has several very low observations, observations below -1. The q q plot
indicates that the number of low observations is greater than one would expect for a normal distribution, so
the distribution is skewed to the left. t/
II. Nothing can be deduced about serial correlation, since we aren't given the order of the residuals, nor even that
there is an order. X
III. Influential points have high leverage. We are not given the leverage, so we can't determine influential points. X
(A)
1
5.16. >5
1 - R2
(i)
1 > 5 - 5R2
(i)
R2 > 0.8
(i )
5.17. When regressing x2 on xi, since it is a simple regression, R2 is the square of the correlation coefficient. So
1
VIF2 = -
1.1905
1 - 0.42 =

Copyright 02022 ASM
5.18. The regression of MILES (y) on USE (x) gives
fc =O.6 y =0.4
E (xi - 702 = 3(0.42) + 2(0.62) = 1.2
E(xi - Fc)(yi = (0.4)(0.6) + (-0.6)(-0.4) + (0.4)(0.6) + (0.4)(-0.4) + (-0.6)(-0.4) = 0.8
E(yi - y)2 = 2(0.62) + 3(0.42) = 1.2
R2 is the square of the correlation coefficient of x and y, or
R2 . (0.8)2
k1.2
0.44444
The VIF is 1/(1 - 0.44444) -

5.19. Greatest collinearity is shown by the greatest R2 of the variable when regressed on the other variables of the
model. It is obvious that R2 is greatest in the x3 model, since the ratio of regression sum of squares over the error
sum of squares is obviously higher in this model than in the other two. So we'll calculate VIF for that model.
214,258
R2(3) -
- 0.713858
214,258 + 85,883
1
VIF, =
= -3.495 (D)
1 - 0.713858
5.20. We will compute the VIF, which is (1 - R2)-1, when regressing one predictor on another. And 1 - R2 is the
quotient of the error sum of squares over the total sum of squares. For Data Set A, 1- R2 for model 2 is 7,070/92,990 =
0.07603 making the VIF 0.07603-1 = 13.153. For Data Set B, 1- R2 for model 2 is 1 34,650/87,020 = 0.39818 making
the VIF 0.39818-1 = 2.511. The first VIF is greater than 5 but the second isn't, making the answer (C).
5.21. Use equation (5.3) to compute the VIF.
VIFi = (1 - 0.05)-1 = 1.0526

VIF2 = (1 - 0.83)-1 = 5.8824
VIF3 = (1 - 0.47)-1 = 1.8868
(B)
89.7
= 0.222704
5.22.
R2)(3 = 1 115.4
1
VIF3 = 1.2865
1 - 0.222704
VVIF3
Sb3 = S
Sx3
s = V142.4/(50 -4) = 1.7594

= V115.4 = 10.7424
V1.2865
sb3 =1.7594(107424)
1 0.7424
0.1858

sx, 1,/7-- 1 = V64.4284 = 8.0267

s =
V196.0310 = 3.1307
(sb'SX1"1
= (0.893551(380.072 67) )2 5.2713
5.25. Interesting how scedastic/skedastic changes spelling between the statements.

In Graph 1, since the distribution of the residuals depends on the fitted value, the data appears to be het-
eroskedastic.
In Graph 2, the residuals appear to be skewed towards higher values, which would make the data non-normal.
(E)
Quiz Solutions
5-1. The standardized residual is e1/(s — hii). We are given el = 3.2, s = 67, and h11 = 0.04, so the first
standardized residual is 3.2/(67V0T ) = 0.04875
5-2. Number of observations is extraneous.

Using formula (5.2),
0.48
= 1.242 _
0.28386
k5(1 — 0.48)) —
5-3. We derive R2 from F.
Regression SS/2 R2
9.5 Regression SS/Total SS
F218 =
6.48
Error SS/(21 —2 — 1)
R2
( Error SS/Total SS )
9
(1 — R2)
9 1 — R2
0.72 — 0.72R2 = R2
, 0.72
R.-
—
(3) 1.72
The VIP is
VIF3 — 1.72
1 — 0.72/1.72


Copyright C202.2 ASM
Lesson 6
Resampling Methods
Reading: Regression Modeling with Actuarial and Financial Applications 5.6.2-5.6.3; An Introduction to Statistical Learning
5.1
The previous lessons have discussed measuring the quality of a linear regression model using classical statistical
concepts. These methods make assumptions about the model and then test the assumptions. They are designed for
linear regression models.
More modern methods use the heavy computational power that has become available within the last few decades
to directly calculate how good the model predictions are, as measured by mean squared error. These methods may
be used for any model, not just for linear regression.
A repeated theme in An Introduction to Statistical Learning is the fact that mean squared error equals bias squared
plus variance. Many times an estimation method may decrease bias only at the cost of increasing variance. For
example, some methods have less flexibility. They may fit the model to a straight line. Other methods have more
l exibility. They may fit a curve that goes through every point of the data. Methods with more flexibility will have
f
less bias but more variance. Figure 6.1 shows how the MSE decreases to some extent as the flexibility is increased,
but reaches a minimum and then starts increasing due to higher variance. Our job is to pick the level of flexibility
that minimizes the MSE. (Figure 6.1 is similar to Figure 2.12 in An Introduction to Statistical Learning.)
The quality of a model is determined by how good its predictions are. We can create a model that fits the training
data perfectly by using a large number of variables. But such a model will probably perform poorly when used to
make predictions on other data. To test a model, we must measure differences between the values it predicts and the
actual values. We have actual data. But from where do we get test data so that we can see what the model predicts?
The way we get test data is we split our data into training data and test data. Training data is used to fit the
model, and then the model is applied to the test data. The rest of this lesson discusses different ways to perform
this split.
6.1 Validation set approach

In the validation set approach, also known as the out-of-sample validation approach, a subset of the available data 641
is removed from the data set. This subset is called the validation set or the hold-out set. The remainder of the data is .1
the training set. The model is fitted using only the training data. The validation set is used as the test data. Then the .1
bias squared
variance
MSE
••••
.................... •
l exibility
f
Figure 6.1: Bias/variance tradeoff

96 6. RESAMPLING METHODS
MSE is calculated using the validation set. If we start with n points, let's say that n1 points are training data and
n2 = n — n1 are test data. Then we may calculate the MSE:
no-n2
1
MSE = —
n2 E (yi —g•)2
i=n3 +1
6: or we may calculate the SSPE, the sum of squared predicted errors:

n3 -1-n2
SSPE = (yi — 9i)2 (6.1)

i=n1+1.
which is just the MSE times n2. Select the model with the lowest MSE or SSPE.
As a rule of thumb, use 25-35% of the sample for the validation set with 100 or fewer observations; with 500 or
more observations, use 50%.
el? Problems with this approach are:
1. The validation set is selected at random. Therefore the resulting MSE is highly variable.
2. The training set is significantly reduced in size. Fitting models to smaller data sets leads to higher variance
and poorer statistical results. Therefore the MSE of the test data is likely to be an overestimate of the true MSE
of the fit.
6.2 Cross-validation
Cross-validation methods select test data sets in a systematic non-random fashion, and then average the MSE results
from all runs. These methods usually require multiple runs, but fast computers have made these methods feasible.
We'll discuss two methods.
The first cross-validation method we'll discuss is Leave One Out Cross-Validation (LOOCV). In this method, in
each fit, one observation is held out as the test data set. The other observations are used as the training data set. For
each observation, one fit is performed removing that observation and using that observation as the test data set. Let
n be the number of observations. After n fits are made, the MSEs of the fits as measured on the test observation are
averaged. Let CV() be the LOOCV statistic. Then
1
CV(n) = —
n 1=1
L MSEi (6.2)
where MSE; is the mean square error of the test observation from the fit that removes that test observation) This
approach avoids the problems with the validation set approach that we mentioned above. The result is not random.
And since only one observation is omitted, the MSE is not significantly overestimated.
When the model is fitted by least squares, as in a standard linear regression model, it is unnecessary to perform
multiple fits. There is a simple formula for the LOOCV statistic:
(6.3)
1=1
where Ej = yj — 9; is the residual from the least squares fit on the entire n-observation data set. It is interesting
comparing this formula to the formula for the standard error of the regression, namely E 1(n — 1). The LOOCV
is calculated by not just summing up squares of residuals, but by weighting them with the complements of leverages:
the higher the leverage, the heavier the weight of the residual and the higher the estimate of the MSE is. The division
here is similar to the division for standardizing the residual, where we divide by s 1
If CV(„) is not divided by n, the resulting statistic is called PRESS, for predicted residual sum of squares. So the
'Notice that MSE, is just the square difference between the generated II; and the fitted yi; MSEi = (yi — 9j)2. There is no sum and no division
(division is by 1), since there is only one observation.

6.2. CROSS-VALIDATION 97
PRESS statistic is
11
PRESS = (yi — (6.4)

1=1
2
ei
L-1
1 _h11)
—
(6.5)
where 90) is the fitted value when observation i is omitted from the training data and used as test data.
The second cross-validation method we'll discuss is k-fold cross-validation.2 In this method, k fits are done. The 'Ir
data set is randomly partitioned into k subsets, each of approximately equal size. Then for each of the k subsets, one
fit is performed. In this fit, that subset is removed and used as test data. After the k fits are performed, the test data
MSEs are averaged:
CV(k) = E MSEi (6.6)
Usually k = 5 or k = 10. There is some variability in the result since the split into subsets is random, but the
variability is less than in the validation set approach. LOOCV is the special case of k =
Even if cross-validation does not calculate the MSE accurately, it still can identify the right amount of flexibility
to use, the level of flexibility that minimizes MSE.
Advantages of LOOCV over k-fold cross-validation

1. For a least squares regression, LOOCV is very efficient, since formula (6.3) can be used.
2. Using a small training data set leads to overestimating the test data error rate. LOOCV uses the largest possible
training set, the entire set minus one observation, so it minimizes the overestimation error. It minimizes the
bias of the error.
Advantages of k-fold cross-validation over LOOCV

1. k-fold cross validation requires less computer time (assuming we are not performing least squares regression)
2. Since almost all the training data is the same in each run of LOOCV, correlation is high, leading to higher
variance than k-fold cross-validation.
k-fold cross-validation may lead to lower MSE than LOOCV because of its lower variance. In general, the higher the
k, the higher the variance and the lower the bias. LOOCV is n-fold cross validation, and n is the highest possible k.
2This k has nothing to do with the number of variables. Fortunately the number of variables does not enter into our discussion of cross-
validation.

Copyright 02022 ASM
Table 6.1: Summary of formulas from this lesson
Sum of squared prediction errors

n1+112
SSPE = (yi 902 (6.1)

i=ni+i
LOOCV statistic
CV(H) = (6.2)
PRESS = (6.4)
LOOCV statistic for least-squares regression

n 2
Ei
CV („)= 1E 1— hi,)
(6.3)
71 n P=1
ei
PRESS = —
1 — hii )2 (6.5)
k-fold cross-validation statistic

1
CV (k) = MSEi (6.6)

Copyright 02022 ASM
n Exercises
6.1. "4/ A regression of the form yi = Po + flix+ e is performed on the following data:
xi 8 12 16 18 24
yi 10 20 30 50 55
To validate the model, an out-of-sample validation procedure is used. The validation set consists of the last two
points, (18,50) and (24,55).
Calculate the SSPE statistic.
6.2. A linear regression model of the form
Yi = O +flixii +p2x12 +p3xi3 +ei

is fitted to 84 observations. All 84 observations have the same leverage.
The residual sum of squares is 1090.
Calculate the LOOCV statistic.
6.3. 'kJ' A linear regression model is of the form yi = /30 + ix + el. There are 5 observations. You are given the
following residuals and leverages:
Residual Leverage
3 0.6
—1 0.3
—3 0.2
—3 0.3
4 0.6
Calculate the PRESS statistic.
6.4. '41 Using the 5-fold cross validation method, you obtain the following mean square errors: 842, 759, 805,
738, 824.
Calculate the CV(5) statistic.
6.5. Rank the bias and variance of the following cross-validation methods from lowest to highest:
1. LOOCV
2. 5-fold cross-validation.
3. 10-fold cross-validation.
6.6. 641° [MAS-I Sample:31 You are given the following statements about different resampling methods:
I. Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation
k-fold cross-validation has higher variance than LOOCV when k < ii
III. LOOCV tends to overestimate the test error rate in comparison to validation set approach
Determine which of the above statements are correct.
(A) I only (B) H only (C) HI only (D) I, II, and III
(E) The correct answer isn't given by (A), (B), (C), or (D)

Copyright C2022 ASM
6.7. I': [MAS-I-S18:301 You are considering using k-fold cross-validation (CV) in order to estimate the test error
of a regression model, and have two options for choice of k:
• 5-fold CV
• Leave-one-out CV (LOOCV)
Determine which of the following statements makes the best argument for choosing LOOCV over 5-fold CV.
(A) 1-fold CV is usually sufficient for estimating the test error in regression problems.
(B) LOOCV and 5-fold CV usually produce similar estimates of test error, so the simpler model is preferable.
(C) Running each cross-validation model is computationally expensive.
(D) Models fit on smaller subsets of the training data result in greater overestimates of the test error.
(E) Using nearly-identical training data sets results in highly-correlated test error estimates.
6.8. [MAS-I-F18:33] Two ordinary least square models were built to predict expected annual losses on
Homeowners policies. Information for the two models is provided below:
Model 1 Model 2
Parameter p-value Parameter p-value

Intercept 212 Intercept 315
Replacement Cost (000s) 0.03 <0.001 Replacement Cost (000s) 0.02 <0.001
Roof Size 0.15 <0.001 Roof Size 0.17 <0.001
Precipitation Index 120 0.02 Precipitation Index 210 0.03
Replacement Cost (000s) Replacement Cost (000s)

x Roof Size 0.0010 0.05 x Roof Size 0.0015 <0.001
Model Statistics Model Statistics
R2 0.91 R2 0.94
Adj R2 0.87 Adj R2 0.89
MSE 31,765 MSE 30,689
AIC 25,031 AIC 25,636
Cross Validation Set MSE Cross Validation Set MSE
1 33,415 1 26,666
2 38,741 2 38,554
3 32,112 3 39,662
4 37,210 4 36,756
5 29,501 5 30,303
You use 5-fold cross validation to select the best of the two models.
Calculate the predicted expected annual loss for a homeowners policy with a 500,000 replacement cost, a 2,000
roof size, a 0.89 precipitation index, and three bathrooms, using the selected model.
(A) Less than 1,000
(B) At least 1,000, but less than 1,500
(C) At least 1,500, but less than 2,000
(D) At least 2,000, but less than 2,500
(E) At least 2,500
Exam SRM Study Manual Exercises continue on the next page „ .

6.9.
NI' [MAS-I-S19:34] A statistician has a dataset with n = 50 observations and p = 22 independent predictors.
He is using 10-fold cross-validation to select from a variety of available models.
Calculate the number of times that the first observation will be included in the training dataset as part of this
procedure.
(A) 0

(E) At least 20
6.10. %II [MAS-I-F19:29] An actuary has a dataset with four observations and wants to use Leave-One-Out Cross
Validation (LOOCV) to determine which one of the two competing models fits the data better. The model preference
will be based on minimizing the mean squared error.
The values of the dependent variable are:
Y= Y2/ 1/3. y4) = (1.55, 1.55,1.60, 1.95)
Corresponding fitted values under each model and training data subset are:
Training Model I Model 2
Obs. Used pi 92 93 94 91 92 93 94
1,2,3 1.50 1.60 1.20 1.80 1.60 1.70 1.60 Z
1,2,4 2.00 1.50 1.10 1.90 1.80 1.40 1.30 1.70
1,3,4 1.75 1.55 1.70 2.10 1.40 1.30 1.50 1.95
2,3,4 1.70 1.65 1.60 2.00 1.60 1.70 1.20 2.00
Calculate the maximum value of Z for which the actuary will prefer Model 2.
(A) Less than 1.5
(E) At least 2.4
Solutions
6.1. Fitting the other three points, which are on a straight line with slope (20 - 10)/(12 - 8) = 2.5, gives yi =
-10 + 2.5xi. Applying this fit to the validation set, we get 94 = 2.5(18) - 10 = 35 and ij5 = 2.5(24) - 10 = 50. The
SSPE is
SSPE = (50 - 35)2 + (55 - 50)2 = 250
6.2. Leverage sums up to k + 1, the number of parameters, so the average leverage is (k + 1)/n. Here, every
observation has leverage 4/84 = 1/21. Then the LOOCV is
64
CV(54) = 84 E
i=1 )2 84(1 - 1/21)2
(1090) = 14.30625
2 2 2 2 2
3 -1 -3 -3 4
) (
l
f
6.3. PRESS =
(1 - 0.6)
-
+
- 0.3) + ‘1 + (1 0.3)
-
+
- 0.6)
190.72

Copyright g2022 ASM
102 6. RESAMPL1NG METHODS
6.4. The CV(5) statistic is the average of the five MSEs, or 793.6
6.5. The fewer elements left out of the training set, the less bias. Thus LOOCV has the lowest bias; 10-fold is
second; and 5-fold, which leaves out 1/5 of the elements, has the highest bias.
Variance works the other way around.
6.6.
I. LOOCV is n-fold cross-validation./

The smaller the training data set, the lower the variance. LOOCV's training data set has n — 1 observations,
while k-fold cross-validation's training data set has ((k — 1)/k)n observations, which is a smaller number of
observations.X
The larger the training data set, the lower the bias. Thus LOOCV, which has the larger training data set,
overestimates the test error rate less than the validation set approach.X
(A)
6.7. A doesn't talk about a difference between LOOCV and 5-fold CV (and there is no such thing as 1-fold CV),
and for B, 5-fold CV is simpler than LOOCV which is n-fold CV. C is another argument for 5-fold CV. (D) is an
argument for LOOCV, since LOOCV has the most training data, all the observations except for 1. E is an argument
against LOOCV, since all the training sets are identical except for one element.
6.8. The average MSE is 34,195.8 for Model 1 and 34,388.2 for Model 2, so Model 1 is preferred. The linear
expression is
212 + 0.03(500) + 0.15(2000) + 0.89(120) + 0.0010(500)(2000) = 1633.8 (C)
6.9. Each cross-validation will have 45 training observations and 5 test observations, so each observation will be
in the training set 9 times and in the test set 1 time. (C)
6.10. We compute sums of square differences between fitted values in the one-observation test data sets and actual
values of the dependent variable. We need not divide by 4 to average, since the division is the same for both models , ,,)
and we just want to equate the results.
For training observations 11,2,3} in Model 1, the test observation 94 is 1.80 versus the actual 1.95, for a difference
of —0.15. For training observations {1,2,4} we compare 1.10 to actual 1.60, a difference of —0.50. And the same for
the third and fourth rows:
(1.80— 1.95)2 + (1.10 — 1.60)2 + (1.55— 1.55)2 + (1.70 — 1.55)2 = 0.295

For Model 2,
(Z — 1.95)2 + (1.30— 1.60)2 + (1.30— 1.55)2 + (1.60— 1.55)2 = 0.155 + (1.95— Z)2
Equating the two sums,
(1.95 — Z)2 = 0.295 — 0.155 = 0.14
11.95 — Z1 = 0.3742
Thus Model 2 is preferred if Z <1.95 + 0.3742 = 2.3242 (D)


Copyright C2022 ASM
Lesson 7
Linear Regression: Subset Selection
Reading: Regression Modeling with Actuarial and Financial Applications 5.1-5.2; An Introduction to Statistical Learning
6.1
Modern models may have large numbers of predictors, sometimes more predictors than observations. Using all
predictors will result in lower standard error, even 0 standard error, on the training data but will lead to poor
predictions. It is necessary to select predictors that truly impact the response; mechanically fitting coefficients does
not guarantee a true relationship.
This lesson discusses methods for selecting the important predictors. The next lesson discusses other techniques
for reducing the number of variables.
7.1 Subset selection
If there are k possible predictors, one can fit 2k models containing every subset of the predictors. One can then select
the best model. The method for determining which model is best depends upon whether or not the models being
compared have the same number of predictors:
• When comparing two models with the same number of predictors, the one with the lower RSS is better.'
• When comparing two models with different numbers of predictors, RSS cannot be used directly, since adding
a predictor to a model, no matter how irrelevant, cannot increase the RSS and will almost surely decrease
it.2 Instead, cross-validation or one of these four statistics: Mallow's Cp, AIC, BIC, or adjusted R2, is used to 141
compare models.
This method is sometimes called "Best Subset Selection" to distinguish it from the heuristic methods we will s:
discuss next.
When k is high, it is impractical to fit 2k models. For example, if k = 20, more than 1,000,000 models are possible.
A heuristic approach is then needed. We will discuss forward stepwise selection and backward stepwise selection.
Forward stepwise selection consists of starting with the empty model, the one that sets y equal to the sample Nr
mean. Then k models are formed by adding one predictor from the available k predictors. The best model is selected
using RSS. This process is repeated: k — 1 models are formed by adding one predictor from the remaining k — 1
predictors, and the model with the lowest RSS is selected. This is continued until all k predictors are added to the
model. The k +1 resulting best models, one for each number of predictors, are then compared using cross-validation
or one of the four statistics. Thus a total of 1 + Erc(k — 1) = 1 + k(k + 1)/2 models are fitted.
This algorithm is greedy in the sense that it picks the immediately best model, not considering that picking an s:
inferior model at the current step may lead to a better model ultimately. There is no guarantee that the best model
is selected. In a situation like this:
Variables RSS Variables RSS Variables RSS

None 875 X1 472 X1, X2 420
X2 500 Xi, X3 432
X3 491 X2, X3 412
'Remember, RSS is the sum of squared residuals. It is exactly the same as what we called Error SS. We ll use the symbol RSS in most of this
lesson.
2It is possible for a model with d +1 predictors to have a larger RSS than a model with d predictors if the latter d predictors are not a subset
of the d +1 predictors. Still, the fact that RSS must decrease when the models are nested indicates that comparing RSS for models with different
numbers of parameters is not a good way to measure model quality.

Copyright 02022 ASM
104 7. LINEAR REGRESSION: SUBSET SELECTION
Xi would be the preferred model with 1 variable and then only the {X1, X2} and {X1, X3} models would be
considered. The {X2, X3} model would be bypassed even though it is better than the {X1, X2} and {X1, X31 models.
Forward stepwise selection can be used even when n <k, but only models with n —1 or fewer parameters would
be considered.
Backward stepwise selection consists of starting with the full model, the one having all predictors. Then for
each predictor a model is fitted by removing that predictor and fitting using the other k 1 predictors. From these
k models the one with lowest RSS is selected. This process is repeated: k 1 models are fitted by removing one
predictor at a time. This is continued until the model has no predictors. The k +3. resulting best models, one for each
number of predictors, are then compared using cross-validation or one of the four statistics. A total of 1 + k(k + 1)/2
models are fitted.
As with forward stepwise selection, there is no guarantee that the best model is selected. And backward stepwise
selection cannot be used if k > n, since a fit with more than n predictors generates meaningless statistics.
441 There are hybrid versions of stepwise selection, which are called mixed selection methods, in which variables
are added sequentially but variables may also be removed if they no longer improve the model fit. The steps for
mixed selection are:
1. Start with just the intercept.
2. For each unused variable, create a model by adding it. Select the model with the best RSS.
3. Look at each variable in the model. If there are variables with t ratios below a predetermined threshold (that
you set before starting selection), remove the variable with the lowest t ratio. That is the algorithm presented
in Regression Modeling with Actuarial and Financial Applications. In An Introduction to Statistical Learning, they
instead remove the variable with the highest p value if it is above the predetermined threshold. (The two
methods are similar but not equivalent since the t ratio is a function of the degrees of freedom.)
4. Repeat steps 2 and 3 until no more variables satisfying the t ratio or p value threshold can be added.
In all subset selection methods, one parameter at a time is added or removed. For a categorical variable with
more than two categories, only one category is added at a time, so it is possible that the final model will have only
some of the categories as variables. An individual not having the characteristics of the accepted categories would
effectively be placed in the base category.
For example, suppose the categorical variable "type of vehicle", with categories coupe, sedan, SUV, and van was
under consideration, and "sedan" was the base category. Forward subset selection is used. At the first iteration,
SUV is added to the model. Then this model effectively puts coupe and van in the base category.
Regression Modeling with Actuarial and Financial Applications also mentions a "best regressions" routine, which
finds the best model having a specific number of variables.
Regression Modeling with Actuarial and Financial Applications lists 7 problems with stepwise regression:
1146 1. Data snooping. This means fitting a large number of models to one set of data. If one fits large numbers of
models, one of them is likely to look good even though it is false. Statistics typically uses 95% confidence
intervals, so that 1 out of 20 times a model is accepted even though it is false. If we go through 100 models,
we're likely to find 5 that look good even though they are false.
2. It ignores the possibility that none of the models are right, either because the correct model is nonlinear or
because of outliers and high leverage points in the data.
3. Only some of the 2k models are considered; one of the models not considered may be better.
4. Rather than using I, another statistic should be used for determining which variables are added or removed.
5. The true significance of the model is greater than the significance level of the t statistic used as the addi-
i on/removal criterion, since separate additions are made. If one variable addition is good 95% of the time
t
and a second variable addition is good 95% of the time, the probability that both variables belong in the model
is less than 95%.

Copyright 02022 ASM
7.2. CHOOSING THE BEST MODEL 105
6. Since variables are added one by one, the joint effect of adding two variables is not considered. However,
backward stepwise regression does consider joint effects since only one variable is removed.
7. Automatic procedures don't use additional information that an investigator may have. For example, data from
year 2015 may be unusual because something special happened in that year.
7.2 Choosing the best model

In all subset selection methods discussed above, we choose one model for each number of parameters. But then we
must select among the chosen models.
Cross-validation is one method for selecting the best model. It is computer-intensive, but is the most accurate
method. The alternative to cross-validation is using statistics that estimate the MSE on test data by penalizing the
RSS based on the number of parameters in the model. The penalty estimates the degree to which the RSS on test
data would be greater than the RSS computed from the training data. However, if any of the penalized RSS statistics
is used, RSS is computed using all the data; there is no need to set aside test data if cross-validation is not being
performed.
Mallow's Cp Suppose the full model has k predictors, and we are considering a subset of that model with p < k
predictors. Let S2 be the residual variance of the k-predictor regression:
2 _
r1=1 r
s —
n—k 1
Let (RSS)p be the residual sum of squares of the p-predictor regression. Then Mallow's Cp is defined by
RSSp
fl Cp /1 + 2p
That is the definition given in Regression Modeling with Actuarial and Financial Applications. The definition of Cp
given in An Introduction to Statistical Learning is
Cp = 1!(RSSp + 2ps2) (7.2)
These two definitions are quite different, but they lead to the same results. Remove —n from equation (7.1); it is a
constant that does not depend on the model. Then bring 2p into the fraction:
RSSp + 2ps2
s2
Now change the denominator from s2 to n; both s 2 and n are independent of model, so this doesn't affect the
comparison. Now you have An Introduction to Statistical Learning's version of the formula.
If $2 is an unbiased estimate of the residual variance, then the C defined in An Introduction to Statistical Learning
is an unbiased estimate of the test MSE.
AIC and BIC Linear regression maximizes the likelihood of the data. If additional variables are added to a model,
the maximum likelihood cannot decrease, since setting the additional ps equal to 0 will result in the same likelihood
as the smaller model. The maximum likelihood will probably increase, regardless of whether the added variables
are significant or not. So merely comparing likelihoods of different models is not adequate.
Later on in the course we'll discuss likelihood ratio tests. Those tests set a threshold for adding variables to a
model. However, they are only available for comparing nested models.
The Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) are penalized loglikelihood
3and in a footnote in An Introduction to Statistical Learning

measures. They may be used to compare any models whether or not they are nested. Both statistics start out with
twice the negative loglikelihood. The AIC adds 2d for a d-parameter model. The BIC adds d ln n for a d-parameter
model with n observations. Thus the penalty per parameter of BIC is almost always higher than AIC. A model
selected using BIC will tend to have fewer parameters than a model selected by AIC.
In Regression Modeling with Actuarial and Financial Applications, a formula is developed for AIC of a linear regres-
sion. For linear regression with k explanatory variables, the likelihood L of the set of n observations with errors
following a normal distribution with mean 0 and variance a2 is
L=
n exp Hy; Pixii)2/2c2)
Let 1 = In L be the loglikelihood. Twice the negative loglikelihood, —2 In L = —2/, is
RSS
—21= + 2n ln •Nir
0.2 + 2n In
If the usual estimate of a2, namely s 2, is used, then we get
RSS
52 =n — p — 1
—2/ =n—p-1+2n ln s + 2n ln V = n ln s2 + n ln(2n) +n—p-1
For AIC, we add twice the number of parameters that are estimated. There are k + 1 13s, and a2 is considered a
parameter as well, so we add 2(p + 2) to —2/ and get
AIC = n ln s2 + n ln(2n) + n + p + 3 (7.3)
Since AIC is only used to compare models, we can ignore the constants n In(2n) and n + 3. We see AIC balances
improvements in s2 against number of parameters k.
Formula (7.3) uses S2 to estimate a2.
An Introduction to Statistical Learning, on the other hand, does not assume that we use S2 as an estimate of cr.
Instead, we use s 2 from the full model, as we did for Mallow's C. Then s is independent of model, and we can
therefore ignore n ln 62 as well as n ln(2n) and n + 3. For some reason, An Introduction to Statistical Learning divides
the resulting formulas for AIC and BIC by n. (Dividing by a constant does not affect comparisons of models.) The
resulting formulas for AIC and BIC in An Introduction to Statistical Learning are
1
AIC =---(RSSp + 2ps2)
ns
(7.4)
1 ,
BIC = ---(RSS + (ln n)ps2)
n52
(7.5)
where p is the number of predictors in the subset model. Thus AIC is a multiple of Mallow's Cp as defined in
An Introduction to Statistical Learning and therefore leads to identical results. BIC puts a higher penalty on adding
parameters.
Presumably any Exam SRM question asking you to calculate AIC or BIC will tell you which formula to use.
Adjusted R2 Adjusted R2 is defined by

RSSp p —1)
—
Adjusted R2 = 1 (7.6)
Sometimes the symbol .IZ, is used for adjusted R2. Increasing k decreases the subdenominator n — p — 1, which
increases the numerator of the fraction and decreases adjusted R2. Thus RSS must decrease sufficiently to justify the
increase in p. •,.....)
Exam SAM Study Manual
Copyright C24:122 ASM
7.2. CHOOSING THE BEST MODEL 107
r- Let's express adjusted R2 in terms of R2.

n - 1
Adjusted R2 = 1
(R5Sp) n - 1 )
TSS p-1
=1 (1 R2)
(n - p - 1) (7.7)
For Mallow's Cp, AIC, and BIC, the lower the statistic the better the model. Each of them has theoretical
justification based on asymptotic arguments. Adjusted R2, on the other hand, is an ad hoc measure without strong
theoretical justification, and higher values indicate better models.
EXAMPLE 7A al For a linear model with 30 observations and 5 parameters (including the intercept), the residual
sum of squares is 100. The unbiased sample variance of the response variable is 40. The residual variance of the
regression is 10.
Calculate Mallow's Cp, AIC, BIC, and adjusted R2, using the formulas of An Introduction to Statistical Learning.
SOLUTION:
Cp = -n1- (RSSp + 2ps2) = (100 + 2(4)(10)) = E 0
cp 6
AIC = 7.. (RSSp + 2ps2) = 72- = -
f o= 0.6
1 1
= (30)(10) (100 + (11130)(4)(10)) =

BIC = — (In 0.78683
ns2 (RSSp
RSSp/(n - p -1) 100/(30 - 5)
Adjusted R2 -= 1 TSS/(n - 1)
=1
40
0.9

Forward subset selection In the following, p is the number of predictors.

1. Start with model having intercept only. Mallow's Cp
According to An Introduction to Statistical Learning:
2. If we have a p predictor model, create p + 1 predic-
tor models by fitting a model with the current p Cp = (RSS,, 2ps2) (7.2)
predictors plus one of the k p unused predictors.
Do this for each of the unused predictors. According to Regression Modeling with Actuarial and Fi-
3. Select the best p 1 predictor model based on RSS nancial Applications:
or (equivalently) R2. RSS/1
C=--- n + 2p (7.1)
4. If p +1 <k, repeat steps 2-3 with the p + 1 param-
eter model.
AIC in general (Regression Modeling): —21 + 2d where d
5. Select the best model from the various models is the number of parameters
with p predictors, p = 0, 1, 2, ... , k based on cross- BIC in general (Regression Modeling): —21+ dln n where
validation or a statistic (Mallow's Cp, AIC, BIC, d is the number of parameters
adjusted R2). AIC for linear regression models
According to An Introduction to Statistical Learning:
Backward subset selection 1
AIC = —7 (RSSp + 2ps2)
ns
(7.4)
1. Start with full k parameter model.
2. If we have a p predictor model, create p —1 predic- According to Regression Modeling with Actuarial and Fi-
tor models by fitting a model removing one of the nancial Applications:
parameters from the current p predictors. Do this AIC = n In s2 + 1n(27r) + n p+3 (7.3)
for each of the p predictors.
3. Select the best p —1 predictor model based on RSS BIC for linear regression models (according to ISL)
or (equivalently) R2. 1
BIC =
ns
(RSS, + (In n)ps2) (7.5)
4. If p —1 > 1, repeat steps 2-3 with the p —1 predictor
model.
Adjusted R2
5. Select the best model from the various models
with p predictors, p = 0, 1, 2, . , k based on cross- RSSp /(n — p —1)
Adjusted R2 = 1 (7.6)
validation or a statistic (Mallow's Cp, AIC, BIC, TSS/(n — 1)
adjusted R2).
= 1 — (1 — R2)n( —p—i)
n—1I (7.7)

n Exercises
7.1. 4' A least squares model is fitted to 62 observations. 22 different predictors are considered. The best
predictors are selected using backward stepwise selection.
Determine the number of models that are fitted.
7.2. ktr [SRM Sample Question #541 For a regression model of executive compensation, you are given:
(i) The following statistics:
Executive Compensation
Coefficients Estimate Std. Error t-statisic p-value
(INTERCEPT) —28,595.5 220.5 —129.7 <0.001
AGEMINUS35 7,366.3 12.5 588.1 <0.001
TOPSCHOOL 50.0 119.7 0.4 0.676
LARGECITY 147.9 119.7 1.2 0.217
MBA 2,490.9 119.7 20.8 <0.001
YEARSEXP 15,286.6 7.2 2132.8 <0.001
(ii) The acceptable significance level is a = 0.10.

Determine which variable or variables should be removed first prior to rerunning the model.
(A) (INTERCEPT)
(B) AGEMINUS35, MBA, and YEARSEXP
(C) TOPSCHOOL
(D) TOPSCHOOL and LARGECITY
(E) YEARSEXP
7.3. A normal linear model is fitted to 30 observations. The model has the following predictors:
AGE Categorical variable with categories "Under 30", "30 to 39", "40 to 49", "50 to 59", "60 and over".
SEX Male or female
BLOOD PRESSURE Real-valued variable
NUMBER OF CHILDREN Integer-valued variable

The best predictors are selected using forward stepwise selection.
7.4. Nr A normal linear model is used to estimate the sales of oranges. The explanatory variables are various
characteristics of the oranges:
SIZE Small, medium, large
TYPE Navel, temple, juice
STATE Florida, Arizona, California
SEASON Fall, winter, spring, summer
The best explanatory variables are selected using backward stepwise selection.

Copyright 02022 ASM
7.5. •-: For a normal linear model based on 26 observations, 100 predictors are under consideration. The best
predictors are selected using forward stepwise selection.
7.6. "41 [MAS-I-F19:401 You have p = 10 independent variables and would like to select a linear model to fit the
data using the following two procedures:
• Best Subset Selection (BSS)
• Forward Stepwise Selection (FSS)
Let 1\11 be the maximum number of models fit by model selection procedure i.
NFSS
Calculate —
N1355
(A) Less than 0.005

(E) At least 0.0100
Use the following information for questions 7.7 through 7.9:

Least squares models are constructed from subsets of four variables {Xi, X2, X3, X4}. The residual sum of
squares (RSS) from each model is:
Variables RSS Variables RSS Variables RSS Variables RSS
Xi 152 {Xi, X2} 118 {X2, X4} 122 {Xl, X3/ X4} 101
X2 145 {Xi, X3} 144 {X3, X4} 135 {X21 X3, X4} 107
X3 160 {Xi, X4} 129 {Xi, X2, X3} 110 {Xi, X2, X3, X4} 85
X4 138 {X2, X3} 131 {Xi, X2, X4} 105 None 258
7.7. •': Which three variable model is selected by best subset selection?
7.8.
'4" Which three variable model is selected by forward stepwise selection?
7.9. *-: Which two variable model is selected by backward stepwise selection?

Copyright 02022 ASM
7.10. N. [MAS-I-F19:39] An actuary has a dataset with one independent variable, Y, and five independent
variables, X1, X2, X3, X4, X51. She is trying to determine which subset of the predictors best fits the data, and is
using a Forward Stepwise Selection procedure with no stopping rule. Below is a subset of the potential models:
Dependent Independent
Model variable RSS variable p-value
X1 0.0430
1 Y 9,823
X2 0.0096
Xi 0.0464
2 Y 7,070 X2 0.0183
X3 0.0456
Xi 0.0412
3 Y 6,678 X2 0.0138
X4 0.0254
Xi 0.0444
4 Y 4,800 X2 0.0548
X5 0.0254
X1 0.0333
X2 0.0214
5 Y 3,475 X3 0.0098
X4 0.0274
X5 0.0076
The procedure just selected Model 1 as the new candidate model.

Determine which of the following independent variable(s) will be added to the model in the next iteration of this
procedure.
(A) No variables will be added
(B) X3 Only
(C) X4 only
(D) X5 only
(E) X31 X4, ancl X5
Exam SRM Study Manual Exercises continue on the next page ...
7.11. "41. [MAS-I-F18:31] An actuary fits two GLMs, M1 and M2, to the same data in order to predict the probability
of a customer purchasing an automobile insurance product. You are given the following information about each \---)
model:
Degrees of Log
Model Explanatory Variables Included in Model Freedom Used Likelihood
• Offered Price
• Number of Vehicles
Mi 10 —11,565
• Age of Primary Insured
• Prior Insurance Carrier
• Offered Price
• Number of Vehicles
• Age of Primary Insured

M2 8 —11,562
• Gender of Primary Lnsured
• Credit Score of Primary In-
sured
The actuary wants to evaluate which of the two models is superior.

Determine which of the following is the best course of action for the actuary to take.
(A) Perform a likelihood ratio test
(B) Compute the F statistic and perform an F-test
(C) Compute and compare the deviances of the two models
(D) Compute and compare the AIC statistics of the two models
(E) Compute the Chi-squared statistic and perform a Chi-squared test
A normal linear model is fitted to 25 observations. The model has 4 explanatory variables and an intercept.
The RSS is 132 and the estimated residual variance of the regression is 8.
7.12.
•41 Calculate Mallow's Cr, using the formula in James et al.
7.13. NI. Calculate Mallow's Cp using the formula in Frees.
7.14.
11-: A linear regression is fitted to 20 observations. The model has 5 explanatory variables and an intercept.
The RSS is 84.
Calculate Mallow's Cp for the full model using the formula in James et al.

7.15. •41. Various normal linear models are fitted to 60 observations. The models with the lowest residual sum of
squares (RSS) for each fixed number of explanatory variables are:
Number of
Explanatory Lowest
Variables RSS
0 326
1 314
2 303
3 293
4 284
Determine the number of explanatory variables in the model selected by Mallow's C.
7.16. Various normal linear models are fitted to 29 observations. The models with the lowest residual sum of
squares (RSS) for each fixed number of explanatory variables are:
Number of
Explanatory Lowest
Variables RSS
0 162
1 145
2 140
3 136
4 132
Determine the number of explanatory variables in the model selected by Mallow's C.
7.17. "1r For a linear regression model with 100 observations, you are given:
• The model has 10 variables and an intercept.
• The residual sum of squares is 64.8.
• The estimated variance of the residual term is 5.5.
Calculate AIC and BIC using the formulas in James et al.
7.18. •••• A linear regression model has n observations and 8 predictors. You are testing a model having a subset
of the 8 predictors. The subset has 4 predictors.
You are given:
• Both models have intercepts.
• The training RSS of the original model is 82.8.
• The training RSS of the subset model is 116.2.
• The variance of the residuals is estimated using the full model.
• AIC and BIC of the subset model are calculated using the formulas in James et al.
• The AIC of the subset model is 1.271084.
Calculate the BIC of the subset model using the formula in James et al.

Copyright 02022 ASM
7.19. *NI° [MAS-I-F18:39] Two actuaries were given a dataset and asked to build a model to predict claim frequency
using any of 5 independent predictors 11,2,3,4,5} as well as an intercept III.
• Actuary A chooses their model using Best Subset Selection
• Actuary B chooses their model using Forward Stepwise Regression
• When evaluating the models they both used R-squared to compare models with the same number of parameters,
and AIC to compare models with different numbers of parameters.
Below are statistics for all candidate models:
# of Non # of Non
Intercept Log- Intercept Log-

Model Parameters Parameters R2 likelihood Model Parameters Parameters R2 likelihood
1 0 I 0 0.05 17 3 1,1,2,3 0.73 3.35
2 1 1,1 0.56 1.3 18 3 1,1,2,4 0.71 3.25
3 1 1,2 0.57 1.4 19 3 1,1,2,5 0.72 3.3
4 1 1,3 0.55 1.2 20 3 1,1,3,4 0.75 3.5
5 1 1,4 0.52 1.15 21 3 1,1,3,5 0.76 3.6
6 1 1,5 0.51 1.1 22 3 1,1,4,5 0.79 3.9
7 2 1,1,2 0.61 2.5 23 3 1,2,3,4 0.78 3.7
8 2 1,1,3 0.64 2.75 24 3 1,2,3,5 0.74 3.4
9 2 1,1,4 0.63 2.6 25 3 1,2,4,5 0.75 3.45
10 2 1,1,5 0.69 3 26 3 1,3,4,5 0.73 3.35
11 2 1,2,3 0.61 2.5 27 4 1,1,2,3,4 0.88 4.2
12 2 1,2,4 0.62 2.55 28 4 1,1,2,3,5 0.8 3.95
13 2 1,2,5 0.68 2.9 29 4 1,1,2,4,5 0.87 4.1
14 2 1,3,4 0.66 2.8 30 4 1,1,3,4,5 0.83 4
15 2 1,3,5 0.64 2.75 31 4 1,2,3,4,5 0.85 4.05
16 2 1,4,5 0.6 2.45 32 5 1,1,2,3,4,5 0.9 4.25
• AIC i is the AIC of the model chosen by Actuary j.

Calculate the absolute value of the difference between AICA and AICB.
(A) Less than 0.15
(E) At least 0.60
7.20. rYou have fit 5 models using linear regression. The loglikelihoods of the models are
Variables in Model Loglikelihood

xi, x2 -15.785
X1, X2, X4 -14.472
Xi, X3, X4 -13.015
X1, x3, x4, x5 -12.500
X2, X3, X4, X5 -12.021
Using the Akaike Information Criterion, which model is preferred?

7.21. 1": [MAS-I-S19:391 Two actuaries were given a dataset and asked to build a model to predict claim frequency
using any of 5 independent predictors 11,2,3,4,5) as well as an intercept {I).
• Actuary A chooses their model using Best Subset Selection
• Actuary B chooses their model using Forward Stepwise Regression
• Actuary C chooses their model using Backwards Stepwise Regression
• When evaluating the models they all used R-squared to compare models with the same number of parameters,
and AIC to compare models with different numbers of parameters.
Below are statistics for all possible models:
# of Non # of Non
Intercept Intercept
Model Parameters Parameters R2 AIC Model Parameters Parameters R2 AIC
1 0 I 0 1.9 17 3 1,1,2,3 0.73 1.3
2 1 1,1 0.56 1.4 18 3 1,1,2,4 0.71 1.5
3 1 1,2 0.57 1.2 19 3 1,1,2,5 0.72 1.4
4 1 1,3 0.55 1.6 20 3 1,1,3,4 0.75 1.0
5 1 1,4 0.52 1.7 21 3 1,1,3,5 0.76 0.8
6 1 1,5 0.51 1.8 22 3 1,1,4,5 0.79 0.2
7 2 1,1,2 0.61 1.0 23 3 1,2,3,4 0.78 0.6
8 2 1,1,3 0.64 0.5 24 3 1,2,3,5 0.74 1.2
9 2 1,1,4 0.63 0.8 25 3 1,2,4,5 0.75 1.1
10 2 1,1,5 0.69 0.0 26 3 1,3,4,5 0.73 1.3
11 2 1,2,3 0.61 1.0 27 4 1,1,2,3,4 0.88 1.6
12 2 1,2,4 0.62 0.9 28 4 1,1,2,3,5 0.80 2.1
13 2 1,2,5 0.68 0.2 29 4 1,1,2,4,5 0.87 1.8
14 2 1,3,4 0.66 0.4 30 4 1,1,3,4,5 0.83 2.0
15 2 1,3,5 0.64 0.5 31 4 1,2,3,4,5 0.85 1.9
16 2 1,4,5 0.60 1.1 32 5 1,1,2,3,4,5 0.90 3.5
• AlCi is the AIC of the model chosen by Actuary j.

Determine the correct ordering of the AIC values of the three selected models.
(A) AICA < AICB < AICc
(B) AICA = AICB < AICc
(C) AICA < AICc < AICB
(D) AICA = AICc < AICB
(E) The answer is not given by (A), (B), (C), or (D)
7.22. NI" You have fit 5 models based on 70 observations using linear regression. The loglikelihoods of the models
are
Variables in Model Loglikelihood

XI, X2 -120.1
Xi, X2, X4 -119.5
X1, X3, X4 -118.2
Xi, X3, X4, x5 -116.6
Xi, X2, X3, X4, X5 -113.8
Which model is preferred based on BIC?

Copyright 02022 ASM
7.23. For a regression model based on 50 observations, you are given:

(i) The model has 3 variables and an intercept.
Calculate the AIC for the model using the formula in Frees.
7.24. *-411 A regression is performed based on 28 observations. The form of the regression is
Yi = pix1i+p2xi2+ 83xi3+p4xi4+ ri
The AIC, using the formula in Frees, is 111.03.
Determine the BIC.
7.25. So [S-F15:39] You are given the following output from five candidate models:
Model # Included Parameters R2 AIC BIC

1 Gender 0.0162 7532 7899
2 Age 0.0230 7524 7865
3 Age + Gender 0.0390 7508 7885
4 Age+Gender+Income 0.0755 7469 7645
5 Age+Gender+Income+Uninsured 0.0762 7471 7659
Determine which statement is correct based on the selection criteria above.
(A) Model 1 is the best according to R2.

(B) Model 1 is the best according to AIC.
(C) Model 4 is the best according to R2.
(D) Model 4 is the best according to AIC.
(E) Model 5 is the best according to BIC.
7.26. You are given the following ANOVA table from a regression:

Regression 348 3
Error 21 12
Determine adjusted R2.

7.27. [120-81-98:9] You are given:
(i) The regression sum of squares is 1115.11, with 2 degrees of freedom.
(ii) The error sum of squares is 138.89, with 5 degrees of freedom.
Calculate the adjusted R2.
(A) 0.84 (B) 0.89 (C) 0.93 (D) 0.97 (E) 1.00
7.28.
•••• [120-83-98:4] You fit the regression model Y=i +18X; + ei to 11 observations.
You are given that R2 = 0.85.
(A) 0.77 (B) 0.79 (C) 0.80 (D) 0.81 (5) 0.83

7.29. 46 [MAS-I-F18:301 An actuary uses statistical software to run a regression of the median price of a house on
12 predictor variables plus an intercept. He obtains the following (partial) model output:
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared: 0.7406
F-statistic: 117.3 on 12 and 493 DF
p-value: <2.2e-16
Calculate the adjusted R2 for this model.

(A) Less than 0.70
(E) At least 0.76
7.30. 11 [4-S00:31] You fit the following model to 48 observations:

Yi = Po + Pixii t32Xi2 t33Xi3 Ei
You are given the following ANOVA table:

Regression 103,658 3
Error 69,204 44
Calculate the adjusted R2.

(A) 0.57 (B) 0.58 (C) 0.59 (D) 0.60 (E) 0.61
7.31. •-: For a linear regression model of the form yi = 13o + Pixit + f32Xi2 -I- ei you are given:
(i) F = 96
(ii) Adjusted R2 = 0.95
(iii) E(gi — 02 = 1000
3 8
(iv)
(XCX)-1 = ( 8 12)
12
2
0
0
1
Determine the width of the shortest 95% symmetric confidence interval for pi.
7.32. s: For a linear regression model yi = f30 + Pixii + 132 Xi2p2X13 /34Xi4 Ei with 9 observations you are given
that s2 = 20. The values of gi are 1, 2, 3, 4, 5, 6, 7, 8, 9.
l
f
Copyright 02022 ASM
s
7.33. You are given the following regression models:
(A) yi = + PiXil PkXik E.;
(B) y; = pi° + 161 xii + • - • + + Ei
where y; = 2yi.
Which of the following statements are true?
1. The standard error of the regression will be the same in both models.
2. The adjusted R2 will be the same in both models.
3. Both models will have the same F statistic.
Solutions
7.1. We fit the full model, then 22 models with 1 predictor removed, 21 models with 2 predictors removed, etc.,
until we fit the model with just an intercept. Total number of models fitted is
22
1+ =1+
22(23)
254
2
1=1
7.2. TOPSCHOOL and LARGECITY both have p-values greater than the significance level of 0.10. However, only
TOPSCHOOL, which has the greater p-value, is removed. Sometimes a variable that appears to be not significant
may become significant after another variable is removed. (C)
7.3. We need 4 variables for AGE, 1 for SEX, 1 for BLOOD PRESSURE, and 1 for CHILDREN, a total of 7 variables.
We start with 1 model with just the intercept, then consider 7 models for the first variable to add, 6 for the second,
and so on. Total number of models considered is
(7)(8)= 29
2
k=0
7.4. There are 2 variables for each of SIZE, TYPE, and STATE, and 3 variables for SEASON, for a total of 9 variables.
We start with the model having all variables, then 9 models with 1 variable removed, 8 models with 2 variables
removed, and so on, for a total of
8
(9)(10) 1 1
1 +E(9 —k)= 1 +
k=0
2
=1
7.5. The model has an intercept, and can contain at most 25 predictors; otherwise the variables will not be linearly
independent. We start with the model with an intercept only, then 100 models with 1 predictor, 99 with 2 predictors,
and so on. The total number of models considered is
24
1 + E (100 — k)
k=0
As usual, the sum of an arithmetic sequence is the average of first and last terms (100 and 76 here) times the number
of terms (25 here), so
24
(76 + 100)(25)
1 + E(100 — k) = 1 +
k=0
2
=
2201

Copyright 02022 ASM
7.6. BSS considers every possible model. Each variable may be included or excluded, so there are 210 = 1024
models to consider. FSS starts with the empty model, then considers 10 choices to add, followed by 9 choices, and so
on. The last step considers adding the 1 variable that hasn't entered the model yet. Thus 1 + 10 + 9 + 8 + • • • + 1 = 56
models are considered. The ratio is 56/1024 = 0.055. (D)
7.7. The three variable model with the lowest RSS is {Xi, X3, X4}
7.8. The best one variable model, the one with lowest RSS, is X4. Among the two variable models with X4, the
one with {X2, X4} has the lowest RSS. Among the three variable models with {X2, X4}, the one with pri, x2, x4)
has the lowest RSS.
7.9. Among the models with three variables, the one with (X1, X3, X4} has the lowest RSS. Among the models
with two of those three variables, {X1, X4} has the lowest RSS.
7.10. Forward Stepwise Selection adds one variable at a time, so answer choices (A) and (E) cannot be right. It
selects the variable that lowers RSS the most, and that is the one added for Model 4, namely (D)
7.11. The models are not nested; M2 does not have Prior Insurance Carrier of M1 and does have 2 explanatory
variables that M1 does not have. Thus (A) and (C) are not available, and (B) is only for linear models. (E) is not
relevant. That leaves (D).
7.12. Here p = 4.
cp =2.(Rss 2ps2) = 25
(132 + 2(4)(8)) = 7.84
7.13. s2 for the full model is 8, so

132
C1,= 8
— 25 + 2(4) = —0.5
7.14. s 2 for the full model is RSS/(n k — 1) = 84/(20 — 6) = 6.
7.2
7.15. The mean squared error of the full model is 284/(60 — 5) = 5.163636. Then, using the James et al version of
the Cp formula,
326
Cr (0) = 60
— = 5.4333
314 + 2(5.163636)
Cp(1) = — 5.4055
60
303+ 2(2)(5.163636) = 5.3942
Cp (2) = 60
293 + 2(3)(5.163636)
C,,(3) = 5.4000
60
284 +2(4)(5.163636) = 5.4218
C(4) = 60
The 2-variable model is selected.
7.16. The estimated value of the mean square error of the model with 4 explanatory variables is
2 RSS 132
S -
= 5.5
n—k—1 29-5

We calculate Mallow's Cp for each model using the An Introduction to Statistical Learning formula, but you will
get the same final answer with the Regression Modeling with Actuarial and Financial Applications formula. j
162
C (O) = = 5.586
145 + 2(1)(5.5) = 5.379

Cp (1) — 29
140 + 2(2)(5.5)
Cp (2) — — 5.586
29
136 + 2(3)(5.5)
Cp (3) — = 5.828
29
132 +2(4)(5.5)
C(4) = = 6.069
29
The model with1 explanatory variable has the lowest Cp and is therefore the best.
1
7.17. AIC — (64.8+ 2(10)(5.5)) = 0.3178
100(5.5)
1
BIC — (64.8 + (ln100)(10)(5.5)) = 0.5783
100(5.5)
82.8
7.18. s 2=
n-9
(n_—
AIC =
82.8n 9) (116.2 + ((2)(4)(82.8))1

n—9
1.271084 —
116.2(n — 9) + 8(82.8)
82.8n
105.2458n = 116.2n — 383.4
383.4
n= = 35
10.9542
26 (1n35)(4)(82.8)
BIC =
82.8(35) ) (116.2+ 26 ) 1.448838
7.19. With best subset selection, for each number of parameters we select the model with the highest R2. The
models Actuary A selects are Models 1, 3, 10, 22, 27, 32. In each case, the AIC is —2(/ — p) where p is the number of
parameters including the intercept, or 1.9, 1.2, 0, 0.2, 1.6, and 3.5 respectively; 0 is best.
With forward stepwise regression, we add one variable each time that maximizes R2. We start with I, then add
variable 2, then variable 5, then variable 4, then variable 1, then variable 3. In other words, the models are 1, 3, 13,
25, 29,32 with AICs of 1.9, 1.2, 0.2, 1.1, 1.8, and 3.5 respectively; 0.2 is best. The difference of AICs is [rill. (B)
7.20. We will use the formula in Frees, but you will get the same result using the formula in James et al. The
AIC is —2/ + 2p, where p is the number of parameters, which is the number of variables plus the intercept and the
variance, k 2. There is no need to consider the second and fourth models, since the third and fifth models have
higher loglikelihoods with the same numbers of parameters. The resulting AICs for the remaining models are
2(-15.785) + 2(4) = 39.570
— 2(-13.015) + 2(5) = 36.030
— 2(-12.021) + 2(6) = 36.042
The third model has the lowest AIC and is therefore preferred.
7.21. With best subset selection, A ends up choosing the model with the lowest AIC, model 10 (AIC = 0.0).
With forward stepwise regression, first we select model 1, then add in parameter 2 which generates the 1-
parameter model with the highest R2. That is model 3. Then add in variable 5, which among models 11, 12, and 13

(---' generates the highest R2. The AIC of that model, model 13, is 0.2. We see that the models with more parameters do
not have a lower AIC, so this is optimal.
With backward stepwise regression, first we select model 32, then model 27 which removes parameter 5 and
has the highest R2 among 4-parameter models. Then remove parameter 1, model 23, the model with the highest R2
among 3-parameter models without parameter 5. Then among the 2-parameter models without parameters 1 and 5,
models 11, 12, and 14, the best is model 14. We can't improve the AIC of 0.4 by removing additional parameters, so
that one is selected. (A)
7.22. Using the formula in Frees, the BIC is -2/ + p In n, where p = k + 21s the number of parameters, which is the
number of variables plus the intercept and variance parameters. In this case, In 70 = 4.2485. The second model can
be skipped since the third model has the same number of parameters and higher loglikelihood.
- 2(-120.1) + 4(4.2485) = 257.194
- 2(-118.2) + 5(4.2485) = 257,642
- 2(-116.6) + 6(4.2485) = 258.691
- 2(-113.8) + 7(4.2485) = 257.339
The first model has the lowest BIC and is therefore preferred.
AIC = 501n82.42 + 501n 2n +50+3+3 = -589.05
7.24. We subtract 2p and add p Inn, where p is the number of parameters, which is 6 including the constant and
az.
BIC = 111.03- 2(6) + 61n28 = 119.02
(----'
7.25. Higher R2 and lower AIC and BIC are better. So Model 5 is best according to R2 and Model 4 is best according
to AIC and BIC. (D)
7.26. Let p be the number of predictors. The regression has p degrees of freedom, and the error has n - p - 1
degrees of freedom. Here, p = 3 and n - p - 1. = 12. It follows that n - 1 = 15. We will use formula (7.7).
21
1 - R2 ,.. 21
= 348 + 21 - 369
21 15
Adjusted R2 = 1 - (1 - R2)
( "-1 ) - 1 0.9289
_
-
n -
p 1 369 T.2
138.89/5
7.27. Adjusted R2 =1 = 0.8449 (A)
(1115.11 + 138.89)/7
7.28. We will use formula (7.7).
Adjusted R2 = 1 - (1 - R2)(771_1 ; 1 1) = 1 _0.15(2) = 0.8333 (E)
7.29. We'll use formula (7.7) to calculate adjusted R2. Here, n - p - 1 is the number of degrees of freedom, 493,
and n is total of degrees of freedom, predictors, plus 1 for the intercept, or 506.
505
Adjusted R2 = 1 - (1 - 0.7406)
(493-) = 0.7343 (C)

Copyright 02022 ASM
7.30. We will use formula (7.7).

RSS n 1
Adjusted R2 = 1
(To )
tal SS n — p —1
69,204\ 147\
—1
(69,204 + 103,658) k44) 0.5724 (A)
7.31.
n—1
0.95 = Adjusted R2 = 1— (1 R2)n _ 3
0.05(n — 3) = 1 — R2
(1
n—1
The regression has 2 variables and therefore 2 degrees of freedom. So the F statistic has 2,n —3 degrees of freedom.
96= F —
Regression SS/2
RSS/(n —3)
Divide numerator and denominator by Total SS.
96=
R2/2 R2(n — 3)
(1— R2)/(n —3) — (1— R2)(2)
Now replace R2 and 1 — R2 using (*)
0.05(n — 3) n — 1 — 0.05n + 0.15
R2 =1
n—1 n—1
(n — 1 — 0.05n + 0.15)(n —3) 96
2(0.05(n — 3))
0.95n — 0.85 = 9.6
n 11
Once again, use (*) to obtain 1 — R2.

1 — R- —
, 0.05(8) _
—
0.04
10
R2 = 0.96 Regression
Total SS
SS
Regression SS = 1000
1000
Total SS =
0.96
1000
Error SS = 1000(0.04\)
1000
0.96 24
2
s = 5.20833
(24)(8)
Now we make the only usage of the (VX)-1 matrix. The variance of Pi is S2 times the (2,2) coefficient of that matrix.
s2n = (5.20833)(2) = 10.4167
The t critical value at 5% significance for 8 degrees of freedom is 2.306. The width of the confidence interval is
2(2.306) 67 = 14.885

n7.32. R2——
Error SS = 20(9 — 5) = 80. Regression SS = E(9i - g)2, and g = 5 since g = 9, so Regression SS = 60.
6°
— 2 There are p =4 predictors and n = 9 observations.
60+80 — 7
n—1 1
141
Adjusted R2 = 1 — (1 — R2)
(12 p — 1) — k9 1) — 7
Adjusted R2 may be negative.

7.33. Doubling y will double the p's. Therefore, Total SS, Regression SS, and Error SS will all be multiplied by 4.
Since the standard error is a multiple of the square root of Error SS, it will be doubled, making I false. Since adjusted
R2 and F involve quotients of Regression SS, Error SS, and Total SS, the factor of 4 will cancel out, making II and III
true.

Exam SIZM Study Manual

Lesson 8
Linear Regression: Shrinkage and

Dimension Reduction
Reading: An Introduction to Statistical Learning 6.2-6.4
8.1 Shrinkage methods

Shrinkage methods applied to a regression alter the function to be minimized. They add a penalty function to "41
the sum of square differences. The penalty function is greater if some function of the coefficients is greater. Thus
minimizing the modified function results in shrinking coefficients, possibly zeroing the less significant ones. We
discuss two shrinkage methods: ridge regression and the lasso.
8.1.1 Ridge regression

In ridge regression, the following function is minimized:
2
n k
i=1
(yi — fio — Efiixii
j=1
+ A Xi p?
j=1
The shrinkage penalty function is A >2 . This is A times the square of the 6 norm of the vector (pi,. , We
denote the 6 norm by 11p112. Notice that the sum starts at 1, not 0; there is no penalty for the intercept.
A is a tuning parameter. As A goes to infinity, the coefficients go to 0. The tuning parameter is selected using 'I
cross-validation.
An equivalent formulation of ridge regression is
Minimize
subject to the constraint

n (Yi — Po —EPixii
where s is some constant.
s is called the budget parameter.' For every A, there is an $ that makes this formulation equivalent to the one stated
earlier. Higher values of A correspond to lower values of s.
EXAMPLE 8A '141 For a set of 10 observations, two predictors (xi and x2), and one response (y), the residual sum of
squares has been calculated for several different estimates of a linear model and an intercept. Only integer values
from 1 to 3 were considered for the estimates of Po (the intercept), pi, and /32.
The following table shows the residual sum of squares for every combination of the parameter estimates:
'This s has nothing to do with standard error of a regression.

Copyright 02022 ASM
126 8. LINEAR REGRESSION: SHRINKAGE AND DIMENSION REDUCTION
/32 = 1 fi2 = 2 /32 = 3 Az =1 fi2 = 2 132 = 3 /32 = 1 /32 = 2 Az = 3

Ai = 1 772 630 525 627 605 498 611 609 712
Ai = 2 610 559 572 707 665 601 578 521 562
Ai = 3 489 495 512 722 705 651 549 498 503
1. The parameters pi are estimated using ridge regression with a tuning parameter of A = 20.
Determine the resulting estimates of phi = 1, 2, 3.
2. The parameters f3i are estimated using ridge regression with a budget parameter of s = 10.
Determine the resulting estimates of phi = 1, 2, 3. •
SourrioN: 1. We have to add 20(ft T. + AD (but do not include Tql in that sum) to each RSS. The resulting table (but
you don't have to calculate every value; some are clearly out because RSS is larger than for a lower value of 13i;
for example, (Ao, Ai, Az) = (1, 3,3) versus (Ao, Ai, A2) = (1, 3,1)) is
Ao = 1 Ao =2 4' = 3
if2 = 1 fi2 = 2 132 = 3 132 = 1 /32 = 2 /32 = 3 fi2 = 1 fi2 = 2 132 = 3
Ai = 1 812 730 725 667 705 698 651 709 912
pi = 2 710 719 832 807 825 861 678 681 822
Ai = 3 689 755 872 922 965 1011 749 758 863
The minimum value in the table is for (P09 Ail #2) = (3,1,1)
2. With s = 10, we must have pT f4 10, which disallows (Ai ,J62) = (3,2), (3. 3), (2,3). Among the remaining
RSS values, the minimum is 489 at (Po, #1, Az) = (1,3,1) 0
Ridge regression shrinks the coefficients but does not set them equal to O. Thus all variables are left in the regression,
but the less important ones have small coefficients.
In standard regression, the scale of the predictors does not affect the solution. But in ridge regression, it does,
since each p, and therefore the penalty function, depends on scale. It is therefore best to standardize the predictors
by dividing by their standard deviation:
xij
(8.2)
— i)2
(1/n instead of 1/(n —1) is used in the denominator, but as long as it is used consistently for all predictors they will
all be on the same scale.)
You may get a better feel for ridge regression if you try it out by hand on a simple (1-predictor) linear regression.
Suppose you want to minimize
E(Yi — Po /3ixi)2 + /1/3
Differentiate with respect to po and set the derivative equal to 0 and you get the usual normal equation,
-Eyi nflo + 131 Xi =

Copyright ig32022 ASM
8.1. SHRINKAGE METHODS 127
Differentiate with respect to /3i and set the derivative equal to 0 and you get
- xiyi + pc, Exi+ pi E + Api 0
(E xi) fi0 + (E 4 +A) pi =Exiyi

You can solve the two linear equations using Cramer's rule. For j3, you get
0,1_ n xiYi -xi E Yi

n (E 4 + A) - (E xi )2
For standard linear regression, A = 0. Increasing A makes the denominator larger and oil smaller. But it does not
make i = 0.
8.1.2 The lasso
For the lasso, the following function is minimized:

2
E
(yi -130 - j=1
+ A I/3i I
rfa
(8.3)
The shrinkage penalty function is A -1 1/311. This is A times the 4 norm of the vector (/31, , pk). We denote the el "41'
norm by 1116111. As with ridge regression, the sum starts at 1, not 0; there is no penalty for the intercept.
A is a tuning parameter. As A goes to infinity, the coefficients go to 0. It is selected using cross-validation. sr
As in ridge regression, the variables should be standardized.
An equivalent formulation of the lasso is
Minimize
2
ZIP 5
j=1
where s is some constant.
Once again, s is called the budget parameter. For every A, there is an s which makes this formulation equivalent '41
to the one stated earlier. Unlike ridge regression, the lasso forces coefficients to equal 0, dropping those variables
from the model. In other words, the lasso performs feature selection.
EXAMPLE 8B For a set of 10 observations, two predictors (xi and x2), and one response (y), the residual sum of
squares has been calculated for several different estimates of a linear model and an intercept. Only integer values
from -1 to 1 were considered for the estimates of /30 (the intercept), pi, and 132.
The following table shows the residual sum of squares for every combination of the parameter estimates:
P2 = -1 132 = 0 fiz = 1 P2 = -1 r32 = 0 P2=1 P2 = -1 P2 = 0 P2 = 1

f31 = -1 772 630 525 627 605 498 611 609 712
ifi = 0 610 559 572 707 665 601 578 521 562
=1 489 495 512 722 705 651 549 498 503

1. The parameters /3; are estimated using the lasso with a tuning parameter of A = 60.
Determine the resulting estimates of pi, i = 1,2,3.
2. The parameters f3j are estimated using the lasso with a budget parameter of s = 1.
Determine the resulting estimates of pi, i= 1,2,3.
SOLUTION: 1. We have to add 60(Iiii I + 021) (but do not include 001 in that sum) to each RSS. The resulting table
is
Po = —1 110 = 0
P2 = —1 132 = 0 132 = 1 P2 = —1 132 = 0 132 = 1 132 = —1 132 = 0 132 = 1
Psi = —1 892 690 645 747 665 618 731 669 832
Pi = 0 670 559 632 767 665 661 638 521 622

=1 609 555 632 842 765 771 669 548 623
The minimum value in the table is for (#0, ch, #2) = (1, 0, 0)
2. With s = 1, we must have oil ± 021 1, which disallows any solution in which 131 and Psz are both nonzero.
Among the remaining RSS values, the minimum is 495 at (X, #1,13z) (-1,1,0)
You may get a better feel for the lasso if you try it out by hand on a simple (1-predictor) linear regression. Suppose
you want to minimize
Z (yi — flo — Pi xi)2 + A11311
To make the differentiation easier, assume pi > 0. Then differentiate with respect to Po and set the derivative equal
to 0 and you get the usual normal equation,
— E yi 4- npo + pi Z xi =0
npo + (E xi) pi = E yi
Differentiate with respect to /31 and set the derivative equal to 0 and you get
-Exiyi +130y, xi+ pi E 4 + 0.5A = 0

(Z xi) PO + (E 4) flu= Exiyi_ 0.5A
You can solve the two linear equations using Cramer's rule. For fib you get
it = n () xiyi — 0.5A) — E x; E yi
— (Z x 02
For standard linear regression, A =0. Increasing A will eventually make the numerator 0. If /31 is negative for A = 0,
iPil = —131, causing the sign of A to switch in the solution forf3i. So PI in this case will increase to 0 as A is increased.
An Introduction to Statistical Learning gives a different simplified example for ridge regression and the lasso.
Consider a regression with no intercept and with n = k (so there are n variables and no intercept). Let the X matrix
be the identity matrix, with is on the diagonal and Os elsewhere. Then standard regression results in 13‘j = y/. Ridge
regression results in P;i = yi/(1 + A). The lasso results in
yi — A/2 yi > .A/2
Pil =
lyi + A/2 yi <—A/2
0 lyil - A/2

8.2. DIMENSION REDUCTION METHODS 129
0.5 0.5
0.4 - 0.4 -
0.3 - 0.3 -
0.2 - 0.2 -
0.1 - 0.1 -
0 i
0 1 1
—3 —2 —1 o 1 2 3 —3 —2 —1 0 1 2 3
Figure 8.1: Probability density functions for standard normal distribution (left) and standard double-exponential distribution
(right)
We see that ridge regression reduces all coefficients whereas the lasso selects features.
Both ridge regression and the lasso decrease the MSE of the estimate on the test data. When A = 0, there is no el
bias but variance is high. As A increases, the squared bias increases but the variance decreases. Initially the variance
decreases by more than the squared bias increases, until the MSE reaches its minimum at the optimal value of A. As
A increases above its optimal value, the decrease in variance does not offset the increase in squared bias. Remember
that s moves in the opposite direction of A, so higher s leads to less squared bias and more variance. Bias and
variance can also be measured against R2; the unadjusted model has the highest R2, so squared bias and variance
have the same relationship to R2 as s: higher R2 lowers bias and raises variance.
Both ridge regression and the lasso can be interpreted in a Bayesian manner. In Bayesian statistics, a prior must
be stated for /3.
• For ridge regression, the prior for each p is a normal distribution with mean 0 and standard deviation a svii
function of A. The ridge regression solution is the posterior mode for p . It is also the posterior mean.
• For the lasso, the prior for each p is a double-exponential distribution. The density function for a double-
exponential is
1
x1/9
f (x) =_e-l
26
— c <x <00
The standard double-exponential distribution has 6 = A5 in order to make the variance 1.

The lasso solution is the posterior mode for ,8, but not the posterior mean.
Figure 8.1 graphs the probability density functions of the standard normal distribution and the standard double-
exponential distribution.
8.2 Dimension reduction methods
We have been discussing methods for reducing the number of variables in a linear model. As an alternative to
selecting the most important variables, we can create new variables that are linear combinations of the original
variables. These new variables capture the most important information from the original variables, so that fewer
variables are needed. We will discuss two dimension reduction methods: principal components regression (PCR)
and partial least squares (PLS).

Copyright 6;2022 ASM
8.2.1 Principal components regression

•-", An Introduction to Statistical Learning discusses principal components analysis here because principal components
regression reduces dimension. However, a more complete discussion of principal components analysis appears
later in the textbook. We will follow the textbook, and give a brief discussion of principal components analysis here,
which will probably puzzle you. So read this subsection again after you've read the more complete discussion of
principal components analysis in Lesson 17. (In fact, you may skip this subsection and come back to it later.)
Principal components analysis (PCA) is an unsupervised method for identifying the variables for which the
data varies the most. Suppose we have variables X1, X2,. Xk and we would like to select a principal component
Z1 = Ej4iX1. Then Oil are selected so that E cio = 1 and the variance of
is maximized. The direction of the principal component is the one that minimizes the distance of the data from the
line.
The second principal component is selected to maximize the variance and to be uncorrelated to the first principal
component. It is perpendicular to the first principal component direction. This process can be repeated to generate
additional principal components.
With n observations, the Zi are vectors with n components just like the Xi. The components of Zi are
Zji = E(Pii(X —
j=1
and the same holds for Z21., • Zk, replacing subscripts 1 in the formula with the subscript of Z. The Op are called
loadings and zil are called principal component scores.2 The scores are the distances between the points and the
principal component.
EXAMPLE 8C "•41° For two variables X and Y:
(i) 5C =5
(ii) Y = 9
(iii) The principal component is Z = 0.6X + 0.8Y.
Calculate the principal component score of (X, Y) = (5.5, 8.5).
SourrioN: The score is 0.6(5.5 —5) + 0.8(8.5 — 9) = —0.1 0
In principal components regression (PCR), the regression is performed on principal components. Since the
components are weighted averages of all the variables, PCR does not do feature selection. It is analogous to ridge
regression in this way. However, since each principal component explains less variance than the previous one, only
the first few components are used, reducing the number of variables in the model. The more components used, the
lower the bias and the higher the variance, so test MSE has a U shape.
It is advisable to standardize the variables, using equation (8.1.1), so that the maximization of variance considers
all variables equally.
8.2.2 Partial least squares

Principal components analysis is an unsupervised method. It makes no use of the response variable. In contrast,
partial least squares (PLS) is a supervised method. We set up the variable Z1 = (NIX] by setting cilj1 equal to the
coefficient of the regression of Y on Xj. To select the next variable, we regress each variable on Z1 and set Op equal
to the coefficient of the regression of the residuals on X. This process is repeated to obtain the desired number of
variables. The Zjs are then used as the predictors in the regression for Y.
2Do not confuse these scores with the ones related to maximum likelihood defined on page 165.

8.3. THE CURSE OF DIMENSIONALITY 131
Since the response is taken into account, the directions of the predictors aren't fitted as well as they are by
principal components analysis. However, the predictors generated by PLS do a better job in explaining the response.
Since this approach is supervised, it reduces bias relative to PCA but increases variance, so overall PLS does not
perform better than PCA.
8.3 The curse of dimensionality

When the number of variables k is high, especially if it is n or higher, regression will lead to perfect fits and the
coefficients will not even be well-defined. The statistics on the training data will be excellent. R2, and even adjusted
R2, will be 1 or close to 1. The residuals will be 0. But the model will almost surely perform poorly on test data.
The curse of dimensionality refers to the following statement:
Adding variables to a model with many variables will cause the model to deteriorate, unless the variables
are truly related to the response.
Regularization and shrinkage, the techniques we've discussed in the last lesson and in this lesson, are very
important. Selection of the right tuning parameter is critical. Validation through validation sets or cross-validation
is the only feasible technique for measuring the quality of the model.
Table 8.1: Summary of concepts and formulas from this lesson, Part 1
RIDGE REGRESSION
Minimize:
2
(8.1)
i.1 j=1 j.=.1
Equivalently: Minimize
2
if
yi _ Efi;x41=1
where s is some constant, the budget parameter.

Bayesian interpretation: Prior for 13 is normal distribution. Ridge regression is posterior mode and mean of 13.
Formula to standardize predictors

x ii
(8.1.1)
.sJE 7=1 11)2

Copyright 02022 ASM
Table 8.2: Summary of concepts and formulas from this lesson, Part 2
THE LASSO
Minimize:
2
21 k
Equivalently: Minimize
yi —13o —
.1=1 ) +A
Z Iflii
j.1
(8.3)

)2
EIP S
j=1
where s is some constant, the budget parameter.

Bayesian interpretation: Prior for 13 is double exponential. The lasso is posterior mode but not posterior mean of
13,
Principal Components Analysis Partial Least Squares
Unsupervised method Supervised method
Variables are linear combinations of original Variables are linear combinations of original
variables variables
are selected to make variables -uncorre- Op are coefficients of regression of Y or Zi_i
lated, and maximize variance of Zi op(Xi — on Xi or residuals of regression of Z1_1 on Xi
R) under constraint =1
Higher bias, lower variance Lower bias, higher variance

Copyright 02022 ASM
l
f Exercises
8.1. "-ir For ridge regression, which of the following patterns does the test MSE follow as the tuning parameter
is increased?
(A) Flat.
(B) Decreasing.
(C) Increasing.
(D) First decreasing, then increasing.
(E) First increasing, then decreasing.
8.2. Consider the following fitted linear model:
= 11.5 + 2.5xi1 — 3.1xi2 + 0.8xi3 + Ei
Calculate the shrinkage penalty imposed by ridge regression when A = 3.
8.3. Consider the following two linear regression models:

1. yi = 5 +1.1xii + 0.8x12 + Eh. and E = 4.0.
2. y = 3 + 1.0x2i + 0.1x12 + ell and E =9.0.
If ridge regression is used, for which values of A is the second model preferred?
8.4. Consider the following fitted linear model:
yi = 9.2 + 3.4xi1 — 2.4xi2 + 0.8:43 + 2.1xi4 + et
Calculate the shrinkage penalty imposed by the lasso when A = 5.
8.5. ser When performing ridge regression, the predictors should be standardized.
The observations for one of the predictors, xi, are:
4 5 5 6 8 10 11 12
Calculate the standardized value of the first observation of xi.
8.6. s: [MAS-I-S19:281 Determine which one of the following statements about ridge regression is false.
(A) As the tuning parameter A —) 00, the coefficients tend to zero.
(B) The ridge regression coefficients can be calculated by determining the coefficients g, 1§1R, that
2
minimize Ell=1 (In — po — =ipixij) +A

(C) Unlike standard least square coefficients, ridge regression coefficients are not scale equivariant.
(D) The shrinkage penalty is applied to all coefficient estimates except for the intercept.
(E) Ridge regression shrinks the coefficient estimates, which has the benefit of reducing the bias.


You are given the coefficient vector (3, —1,4,1, —6, —9).
8.7. 649 Calculate the e2 norm of the vector.

8.8. avii Calculate the norm of the vector.

For a linear regression model with 8 explanatory variables and no intercept:
• There are 8 observations.
• xij = 0 for i j and xii = 1.

• The response vector is (3, —5,10,14, —12, 7, 3, —4).
8.9. '41 The lasso is used with A = 10.

Determine the fitted values of f3,, i = 1,..., 8.
8.10. "41 Ridge regression is used with A = 2.

Determine the fitted values of pi, = 1, ,8.
8.11. `141 WAS-1419:361 You are given the following three statements regarding shrinkage methods in linear
regression:
I. As tuning parameter A increases towards co, the penalty term has no effect and a ridge regression will result in
the unconstrained estimates.
For a given dataset, the number of variables in a lasso regression model will always be greater than or equal to
the number of variables in a ridge regression model.
The issue of selecting a tuning parameter for a ridge regression can be addressed with cross-validation.
Determine which of the above statements are true.
8.12. rYou are considering the following two linear models for a set of observations:
1.
yi = 4.2— 1.7xii + 2.5x12 + 1.2xj3 + E;, residual sum of squares is 26.
2. yi = 5.4 + 3.1x12 + el residual sum of squares is 50.
For which values of the tuning parameter A would the second model be preferred when using the lasso?

8.13. A ridge regression is performed based on two predictors and an intercept:
Yi = Po + Pixii + g2xi2
For 131 and /32, only integer values between 1 and 4 are considered.
The resulting lowest residual sum of squares for the optimal value of #0 and each combination of ;61 and #2 is:
/32 1 2 3 4
1 56 52 46 52
2 48 41 34 16
3 42 37 31 33
4 36 40 44 48
The budget parameter for the ridge regression is s = 12.

Determine the estimates of pi and p2.
8.14. "411 A ridge regression is performed based on two predictors and an intercept:
Yi = /3o + Pixii +132x12 + Ei
For pi_ and 132, only integer values between 1 and 4 are considered.
The resulting lowest residual sum of squares for the optimal value of )(30 and each combination of r31 and [32 is:
pi
162 1 2 3 4
1 56 52 46 52
2 48 41 34 30
3 42 37 31 33
4 36 40 44 48
The fitted values of the parameters are pi = 3, P2 = 2.

Determine the lowest value of the tuning parameter A that results in these fitted values.

Copyright g)2022 ASM
8.15. Ls [MAS-I-F19:371 For a set of data with 40 observations, 2 predictors (Xi and X2), and one response (Y),
the residual sum of squares has been calculated for several different estimates of a linear model with an intercept. \--1
Only integer values from 1 to 3 were considered for estimates of po (the intercept), pi, and P2.
The grid below shows the residual sum of squares for every combination of the parameter estimates, after
standardization:
o=1 130 = 2
S2 S2 Sz
1 2 3 1 2 3 1 2 3
1 3,924 1,977 1,250 3,949 1,822 1,174 3,784 1,671 1,107
2 1,858 1,141 711 1,907 1,187 717 1,827 1,128 668
3 1,386 822 369 1,363 711 349 1,294 700 344
Let be the estimate of pi using a ridge regression with budget parameter s = 5. Assume the intercept is not
subjet to the budget parameters.
Calculate the value of pg
(A) Less than 6 (B) 6 -
(C) 7 (D) 8 (E) Greater than 8

8.16. 'kir [MAS-I-S18:34] You are estimating the coefficients of a linear regression model by minimizing the sum:
2
Z (yi -130 — Z pixii subject to EI32. < sr —

i=1 j=1 j=1
From this model, you have produced the following plot of various statistics as a function of the budget parameter, s:
0 le+10
Determine which of the following statistics X and Y represent.

(A) X = Squared Bias, Y = Test MSE
(B) X = Test MSE, Y = Variance
(C) X = Training MSE, Y = Squared Bias
(D) X = Training MSE, Y = Variance MSE
(E) X = Variance, Y = Training MSE

Copyright Q2022 ASM
8.17. [MAS-I-F18:37] You are estimating the coefficients of a linear regression model by minimizing the sum:
2
n
( +AE
From this model you have produced the following plot of various statistics as a function of tuning parameter A:
c:!
0 le+10
Determine which of the following statistics X and Y represent.

(A) X = Squared Bias, Y = Training MSE
(B) X = Test MSE, Y = Training MSE
(C) X = Test MSE, Y = Variance
(D) X = Training MSE, Y = Variance
(E) X = Variance, Y = Test MSE

("--- Use the following information for questions 8.18 and 8.19:
You are given the following observations of two variables:
i xi yi
1 1 2
2 2 5
3 3 12
4 4 13
5 5 18
E xi= 15 y; = 50
E 4= 55 xiyi = 190 = 666
8.18. 'I? You are fitting the linear model yi = /30 + Aix; + Ei using ridge regression with A = 4.
Determine Pi.
8.19. •111 You are fitting the linear model y; = P0 + Plx; + Ei using the lasso with A = 4.
Determine /i.
8.20. s-i? A linear regression is performed based on two predictors and an intercept:
lf Yi = Po + Pixn +132xi2 + Ei
For /31 and P2, only integer values between 0 and 3 are considered.
The resulting lowest residual sum of squares for the optimal value of po and each combination of pi and 132 is:
P2 0 1 2 3
82 77 70 62
1 74 70 65 61
2 69 63 62 60
3 67 60 59 58
The lasso with budget parameter s = 3 is performed.

Determine the fitted values of /31 and P2.

8.21. A linear regression is performed based on two predictors and an intercept:
Yi = Po + pixii p2xi2 + £1
For /31 and p2, only integer values between 0 and 3 are considered.
The resulting lowest residual sum of squares for the optimal value of po and each combination of pi and p2 is:
132 1 2 3
0 82 77 70 67
1 74 70 65 62
2 69 64 62 57
3 67 63 59 56
The lasso with tuning parameter A = 4 is performed.

Determine the fitted values of 131 and p2.
8.22. [MAS-I-S18:361 For a set of data with 40 observations, 2 predictors (X1 and X2), and one response (Y),
the residual sum of squares has been calculated for several different estimates of a linear model with no intercept.
Only integer values from 1 to 5 were considered for estimates of /31 and (32.
standardization:
/32
1 2 3 4 5
1 2,855.0 870.3 464.4 357.2 548.6
2 1,059.1 488.4 216.3 242.8 567.9
f3i 3 657.0 220.0 81.6 241.9 700.8

4 368.4 65.1 60.5 354.5 947.1
5 193.2 23.7 152.8 580.6 1,307.0
Let:
= Estimate of 81 using a lasso with budget parameter s = 5
pi Estimate of (32 using a lasso with budget parameter s = 5
Calculate the ratio f3114.
(A) Less than 0.5
(E) At least 2.0
8.23. it: You are given the following statements regarding principal components regression.
I. The principal components are weighted averages of the explanatory variables.
A principal component Zi = E opiiXij where Zi ciqj = 1 and the coefficients 4ii are selected to minimize the
variance of Zi.
The second principal component is selected to be uncorrelated with the first principal component.
Which statements are true?
(A) None (B) I only (C) II only (D) III only (E) I, n, and III
Exam SEM Study Manual Exercises continue on the next page.. .

8.24. 'se You are given the following statements regarding partial least squares:
I. Partial least squares is a supervised alternative to principal components analysis.
II. The direction of a partial least squares variable does not fit the predictors as well as the direction of a principal
components regression.
The bias of partial least squares is higher than that of principal components regression.
Which statements are true?
(A) I only (B) II only (C) I and II (D) I and III (E) I, II, and III
8.25. •-•• You are given the following observations of two explanatory variables and a response:
Xi X2 Y
1 2 10
2 6 2
3 4 4
4 1 11
5 5 17
xi = 15 xi2 = 18 yi = 45
)1= 55 E 82 = 539 xilyi = 158 Exi2y,= 148

You fit a linear regression model yi = /30 + pixii + P2Xi2 + Ei using partial least squares with Z1 = 01x1 + 02X2.
Determine oi and 02 without standardizing the explanatory variables.
8.26. 64 [SRM Sample Question #8] Determine which of the following statements describe the advantages of
using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares.
I. Doing so will result in a simpler model
II. Doing so will improve prediction accuracy
III. The results are easier to interpret
8.27. NI° [MAS-I-S19:371 You want to perform a regression of Y onto predictors X1, X2, Xp, using a large
number of observations, and are considering the following modelling techniques:
• Lasso Regression
• Partial Least Squares
• Principal Component Analysis
• Ridge Regression
Determine how many of the above modelling procedures perform variable selection.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
Exam SRM Study Manual Exercises continue on tire next page . . .

8.28. [MAS-I-F19:28] Determine which of the following statements about Principal Component Regression
(PCR) is false.
(A) When performing PCR it is recommended that the modeler standardize each predictor prior to generating
the principal component.
(B) PCR is useful for performing feature selection.
(C) PCR assumes tha the directions in which the features show the most variation are the directions that are
associated with the target.
(D) PCR can reduce overfitting.
(E) The first principal component direction of the data is that along which the observations vary the most.
Solutions
8.1. At first the test MSE decreases, since variance decreases more rapidly than bias increases. The test MSE
attains a minimum and then increases, as variance decreases less rapidly than bias increases. (D)
8.2. 3(2.52 + 3.12 + 0.82) = 49.5
8.3. After applying the shrinkage penalty, the adjusted residual sum of squares is 4 + (1.12 + 0.82)A = 4 + 1.85A
for the first model, 9 + (12 + 0.12)A = 9 + 1.01A. We want
9 + 1.01A < 4 + 1.85A
0.84A > 5
A> 5.9524
8.4. 5(3.4 + 2.4 + 0.8 + 2.1) = 43.5

8.5. The values are divided by the square root of the biased sample variance.
= 7.625
E
= 66.375
8
z ()en — )702= 8.234375

8
4
= 1.393942
8.6. Ridge regression increases bias but decreases variance, making (E) false.
8.7. -\/32 + + 42 + 12 ÷ 62 + 92
8.8. 3 +1+4 +1 +6+9 = 24
8.9. Subtract 5 from In > 5 and add 5 to yi <-5; otherwise 0. b = (This only works for the
special situation mentioned in the box before the question.)
8.10. Multiply each value of yi by 1/(1 + A) = 1/3. b = (1, —5/3,10/3, 14/3, —4,7/3,1, —4/3) (This only works
for the special situation mentioned in the box before the question.)

Copyright 02022 ASM
8.11.
I. As the tuning parameter goes to co, the coefficients go to 0. X

Lasso regression selects parameters whereas ridge regression only shrinks parameters. X
IlL Cross-validation helps select the tuning parameter. i/
(C)
8.12. The adjusted RSS for the first model is 26 + (1.7 + 2.5 + 1.2)A = 26 + 5.4A. The adjusted RSS for the second
model is 50+ 3.1,1. We want the latter to be smaller, and solve for A.
50 + 3.1A < 26 + 5.4A
,
24
10.43478
A> —2.3
8.13. pl+fl < 12. That means neither variable may be 4, and if one is 3, the other must be 1. With these constraints,
the lowest RSS is 41, which occurs at (fi1,132) = (2,2)
8.14. The RSS at the fitted values is 34, and the penalty function is (32 + 22)A = 13A. At the three lower values of
RSS in the table, we have:
(131,g2) RSS Penalty function

(4,2) 30 (42 + 22)A = 20A
(3,3) 31 (32 + 32)A = 18A
(4,3) 33 (42 + 32)A = 25A
A must be at least high enough so that the difference in penalty functions is greater than or equal to the difference
in RSSs. In other words, A Difference in RSS/Difference in penalty. The quotients are 4/7, 3 /5, and 1/8, with the
biggest difference, 3/5 „ occurring at (3,3).
8.15. The budget is g + Since this is 5 or less, we are limited to the upper left triangle of each table, the entries
for (flu, 02) = (1,1), (1,2), and (2,1) only. Looking at those 9 entries, the smallest sum of square residuals is 1671 for
go = 3, gi = 1, f32 = 2, and 3 +1+2 =fl.(B)
8.16. Clearly only Test MSE has a trade-off pattern, a pattern that goes down to a minimum and then comes up,
so answer choice (A) is forced. Increasing the budget parameter s is equivalent to decreasing the tuning parameter
A; A = 0 is an unlimited budget. We know that squared bias increases and variance decreases with increasing A, so
squared bias decreases with increasing s.
8.17. Increasing A makes the fit on the training data worse since it increases the penalty for making the parameters
higher in absolute value, so Training MSE will increase with A. It looks like Y is Training MSE. As for X, squared bias
always increases as A increases, but Test MSE initially decreases due to high decreases in variance, but eventually
increases as the squared bias increase exceeds the variance decrease. We conclude that (B) is correct.
8.18. We minimize f(/3o,pi) = E (yi (Po ± 161 xf))2 + 4f3. Differentiating,
1df
2 (9130 = — Eyi 71130 + pi E xi 0
513o + 15131 = 50
1 df
2d13
1513o + 5913i = 190

Copyright e2022 ASM
Subtract three times the first equation from the second one.
14/31 = 4°
pi = 20 — 2677
8.19. We minimize f (Poi Pi) = Z(yi — (Po + fl i Xj))2 +41131 F. Differentiating, assuming /31 > 0, the partial with respect
to /30 is the same as in ridge regression, and
1 df
2 df3i
15/30 + 55/31 = 188
Subtracting three times the equation for the partial with respect to go,
iopi = 38
131 = En
8.20. We require ir3iI 1121 3. With that constraint, (r31, ft2) = (3,0) results in the lowest RSS.
8.21. We add 4(113ii + ifizi)to each RSS. The lowest value of that sum is 76, obtained at = (1,2)
8.22. The budget parameter s is the parameter such that E s. Here you want 11311+ 021 5, and the smallest
RSS with that property is 216.3, with i = 2 and /32 =3, and then fi1/1q2 = 2/3 (B)
8.23.
I. The principal components are linear combinations of the explanatory variables but not weighted averages
necessarily; the sum of the coefficients need not be 1. X
They are selected to maximize, not minimize, the variance. X
This is true./
(D)
8.24.
I. True, since it considers the response variable.

True, it does not fit the predictors as well as the direction of a principal components regression since partial
least squares takes the response into account. s/
False. Since the response is taken into account, bias is lower, but variance is higher. X
(C)
8.25. We regress Y on X1 and on X2 to get the two coefficients.
158 — (15)(45)/5 2.3

(Pi —
55 — 152/5
148 — (18)(45)/5 —0.81395
C/52
82 — 182/5
8.26. All three are advantages. Fewer variables result in a simpler model and an easier explanation. And the
variance of the prediction is lowered. (D)

Copyright 02022 ASM
8.27. Ridge regression reduces the values of the coefficients but does not make them 0, so it does not perform
variable selection. The lasso sets coefficients equal to 0, which eliminates the associated variables. Principal
component regression and partial least squares do not select variables; instead, they reduce the dimension of the
model by creating new variables that are functions of the original variables. (B)
8.28. Statement (B) is false. While PCR reduces dimension, the variables it creates are functions of all of the
predictors. Thus it does not select predictors.

ii. are topic-by-topic instructional videos! Head to


000
Self-reflection is a means to observe and anal ze oneself in order to grow as a person and actua
Normal is nothing more than

a cycle on a washing machine.
aisim
Lesson 9
Linear Regression: Predictions
Reading: Regression Modeling with Actuarial and Financial Applications 2.5.3, 6.1.2
One of the purposes of a model is to predict the response when given a set of predictors. However, note the following
cautions:
1. The model itself is only an approximation to reality.

2. The model predicts the mean, but there is some error in the fitted value as an estimate of the mean prediction.
3. The predicted value is the mean, but the actual value is normally distributed around the mean. This source of
error is irreducible; no matter how accurately the mean is forecasted, there is uncertainty in the actual value
given the mean.
Thus there are two sources of error in the prediction: the parameters of the model are estimates of the true underlying
parameters, and even if the parameters were known, the model itself contains irreducible error.
A confidence interval is an interval for the predicted value of the mean. A confidence interval takes into account
only the first source of error. A prediction interval is an interval for the predicted value. A prediction interval takes •,,,?
into account both sources of error. Thus a prediction interval is larger than a confidence interval.
Let's discuss intervals for simple linear regressions.
Suppose we want to predict the response corresponding to x*. The predicted value is y* = 130+ pie . The variance
of y* is estimated by
(9.1)
Notice the similarity between this formula and the formula for leverage, (5.1).
The variance of the realized value of y is the variance of y* plus the variance of the error term, which is estimated
by s2. Accordingly, a 1 — a confidence interval for the predicted value y" is "Ls.
.9* ± 11-42s \I i ± (x*— 2)2

and a 1 — a prediction interval for the realized value of y is
1 (x" — 2)2
g* ± ti-ans111 + n
There are ii — 2 degrees of freedom.

For multiple regression, the variance of the realized value of y given the value of the vector x* is calculated using
the following matrix formula:
= s2 (1 + (x)'(X/X)-lx") (9.4) ks•
A prediction interval can then be calculated by multiplying the square root of this formula by an appropriate t
coefficient and adding/subtracting the product to/from the prediction.
I doubt you will be expected to carry out a matrix calculation like this on the exam.

Copyright 02022 ASM
150 9. LINEAR REGRESSION: PREDICTIONS
Exercises
•
9.1. For the linear model yi = Po + Pixii + P2xi2 + e,, you are given that bo = 10.29, b1 = -0.29, and b2 = 0.91.
Calculate the forecasted value when x1 = 10 and x2 = 5.
9.2. Nr You have the following data:

x1 5 4 2 3 9
x2 4 2 5 3 11
y 10 9 15 8 12
You estimate the linear model yi = Po + Pixii + 132xi2 ei. You are given that
0.9254 -0.1503 0.0068
(X1X)-1 =
(-0.1503
-0.0068
0.1002
0.0621
-0.0621
0.0585
Determine the forecasted value when x1 = 6 and x2 =4.
9.3. For the linear model yi = Po + Pixi + ri based on 25 observations, you are given:
(i) 2 = 223
(ii) 9 = 1160
(iii) The fitted values of the parameters are 1;0 = 47.26, b1 = 5.00.
(iv)
Source Sum of Squares
Regression 23,958,920
Error 5,245
Calculate the upper bound of a 95% prediction interval for y when x = 100.
9.4. `-ir For the linear model yi = Po + Pixi + Ei based on 10 observations, you are given:
(i) 2 = 7.1
(iii) The variance of the prediction for y when x = 3 is 17.3568.
Calculate the unbiased sample variance of x.
9.5. a: [SRM Sample Question #131 Determine which of the following statements is/are true for a simple linear
relationship, y= Po + Plx + e.
I. If r = 0, the 95% confidence interval is equal to the 95% prediction interval.
The prediction interval is always at least as wide as the confidence interval.
The prediction interval quantifies the possible range for Efy

9.6.
f [SRM Sample Question #561 Determine which of the following statements about prediction is true.
(A) Each of several candidate regression models must produce the same prediction.
(B) When making predictions, it is assumed that the new observation follows the same model as the one used
in the sample.
(C) A point prediction is more reliable than an interval prediction.
(D) A wider prediction interval is more informative than a narrower prediction interval.
(E) A prediction interval should not contain the single point prediction.
9.7. [S-S17:361 You are given the following information for a model fitted using ordinary least squares (OLS):
Response variable Rating

MSE 6.993
Parameter df 14 standard error (i3)

Intercept 1 14.37632 6.61999
Complaints 1 0.75461 0.09753
Summary Statistic Rating Complaints

lf 30 30
Min 40 37
Max 85 90
Sum 1939 1998
Median 65.5 65.0
Sample Std. Dev. 12.1725619 13.3147572
Calculate the upper bound of the 95% prediction interval for Rating, for an observation with a Complaints value
of 50.
(A) Less than 50

(E) At least 80
9.8. "lb [SRM Sample Question #491 Trish runs a regression on a data set of n observations. She then calculates
a 95% confidence interval (t, u) on y for a given set of predictors. She also calculates a 95% prediction interval (v, w)
for the same set of predictors.
Determine which of the following must be true.
I. lirn„,(u — =0
II. lirn„,(w — v) = 0
III. w—v>u—t
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).

9.9. •-: For the linear model yi = Po + pixii+p2xi2+ El, you are given:
0.4465 -0.0426 0.0064
(i) (X/X)-1 =
( 0.0426
-0.0064
0.0169
0.0112
-0.0112
0.0125
(ii) The square of the residual standard error of the regression is 20.439
(iii) y* is a forecasted value based on x; = 18, x;' = 10.
Calculate the variance of the forecasted value.
9.10. A linear regression model yi = Po + 13.2X i2 P3Xi3 Ei is used to forecast values of the independent
variable. Let 2 be the forecasted value given that x, is the column vector of values of the explanatory variables. You
are given:
(i) The regression is based on 15 observations.
(ii) 34 (X1X)-1x2 = 5.662.
(iii) The standard error of the regression is 1.290.
(iv) 2 = 26.500.
Construct a 95% prediction interval for z.
9.11. asir [MAS-I-S19:36] An ordinary least squares regression model is fit with the following model form:
E[Y] = po + pixi
After fitting the model, the following plot with the original data (points) and three sets of 95% intervals are provided:
AO.
0
,..• • ' •
.•"
"
0.- •
8
... -
0
O•. -Ø.00000
- -- —
-
.... sL .0.e. &...15.."66 . 00-

0
•
0
00
0
>-
*
• *"
•••
=is
••••
- - - - Interval 1
Interval 2
—
Interval 3
Let "CI" be the 95% confidence interval for E[Yi], and let "PI" be the 95% prediction interval for Y.
Determine which of the following best describes the intervals shown above.
(A) Interval 1 = CI Interval 2= PI
(B) Interval 1 = PI Interval 2 = CI

(C) Interval 1 = CI Interval 3 = PI
(D) Interval 1 = PI Interval 3 = CI

(E) None of (A), (B), (C), or (D) are correct
Exam SIRM Study Manual

Copyright 02022 ASM
lf Solutions
9.1.
y" = 10.29- 0.29(10) + 0.91(5) = 11.94
9.2. 13 = (XIX)-1)Cy
10
0.9254 -0.1503 1 1 1 1 1 9 10.732
= ( -0.1503
-0.0068
0.1002
0.0621 -0.0068)
-0.0621
0.0585
5
4
4
2
2
5
3
3
9
11
15
8
12
=
-1.2135
1.1385
Then 10.732 - 1.2135(6) + 1.1385(4) = 8.005

9.3. The forecast is 47.26+5(100) = 547.26. Use formula (9.3) for the variance. The standard error of the regression
is
Error SS
s
-NI n-2
-
V5245 23
= 15.101
The regression sum of squares is Ill E (xi - 1)2, so

23,958,920
E(xi ,7)2 52
- 958,356.8
The forecast standard deviation is
1 (100 - 223)2
15.101V1 + 25
+
958,356.8
= 15.517
The t coefficient at 23 degrees of freedom is 2.069. The upper bound of the prediction interval is 547.26 +
2.069(15.517) = 579.4
9.4. Use formula (9.3) for the variance. The variance of the prediction as we see in that formula is
(1* 1)2
52 (1 + 1 + n
)
S2 (1 + 1 + n
1
)7)2 ) = 17.3568
3.79812 (1 + 1
10
1-
7.1)2 )
E(xi - 1)2
16.81
- 17.3568
1+ + = 1.2032
10 E(xi - 502
Eoc i
0161.08312 .= 162.89
The unbiased sample variance of x is 162.89/9 = 18.10

9.5.
I. The prediction interval takes into account the variance of the forecasted mean, which the confidence interval
also takes into account, and the random error given the mean, which the confidence interval ignores. If the
random error E = 0, then the two intervals are equal. /
II. The prediction interval takes into account the random error that the confidence interval ignores, so it is at least
as large as the confidence interval./
III. The confidence interval is for the expected value of y given x, The prediction interval is for the value of y give x,
y I x.X
(E)

Copyright 02022 ASM
9.6.
(A) Different models almost always produce different predictions.X

(B) A major purpose of models is to predict observations, so of course the predictions follow the same model as
the observations./
(C) It is unlikely that actual will equal a point prediction exactly, but there is a high probability that actual will be
within the prediction interval.X
(D) The narrower the prediction interval, the more informative it is, since it limits the forecast to a smaller range
of numbers.X
(E) Almost always a prediction interval is built to contain the point predictionX
(B)
9.7. The predicted value is 14.37632 + 0.75461(50) = 52.10682. The standard deviation of it is
s111. n
Z(xi — 2)2
where s is the residual standard error, \/6.99, and

E xi 1998
= = 66.6
30
The sample standard deviation of complaints is
(xi —
= 13.3147572
n—1
so E(xi — 1.)2 = 13.31475722(29) = 5141.2. The standard deviation of the prediction is
+ 1 + (50 — 66.6)2 = 2.7570

30 5141.2
There are 28 degrees of freedom. The t coefficient is 2.048. The answer is
52.10682 + 2.0484(2.7570) = 57.754 (B)
9.8.
I. The confidence interval measures the uncertainty of the predicted mean value. This goes to 0 as the sample
size goes to infinity.
We can see this clearly for a simple linear regression by looking at the confidence interval formula. Looking at
formula (9.2), we see that 1/n —> 0. The denominator of the other summand under the radical is E(x1 — 1)2,
which is n times the variance of x. As n —> oo, this goes to infinity, making the fraction go to 0. /
The prediction interval includes the intrinsic variance of the dependent variable. This variance never changes
regardless of n, so II is false. X
HI. The prediction interval includes the variance of y in addition to the variance of the forecast of its mean, so it
must be larger than the confidence interval, which only includes the latter variance.
(C)

Copyright 02022 ASM
(--"' 9.9. We use equation (9.4).

0.4465 —0.0426
x*PCX)-1(x*)? = (1 18
10) (—0.0426
—0.0064
0.0169
—0.0112 —0.0064) ( 1
—0.0112
0.0125
18
10
= 1.4785
S2 (I -I- X*P00-1(X*)) = 20.439(1 + 1.4785) = 50.658

9.10. We use equation (9.4). There are 3 variables and therefore 15 —4 = 11 degrees of freedom. The appropriate t
coefficient is 2.201. The prediction variance is 1.2902(1 + 5.662), and the square root of that is 3.330. The prediction
interval is
26.500 ± 2.201(3.330) = (19.172,33.828)
9.11. The prediction interval must be wider than the confidence interval since it includes variation of Y1 itself (not
just variation of its mean), and only (B) has that property.


Copyright 02022 ASM
U
Lesson 10
Interpreting Regression Results
Reading: Regression Modeling with Actuarial and Financial Applications 6.1-6.3
This lesson is a brief summary of the material in the reading, which is non-mathematical material. Refer to the
textbook for more details or for amusing examples.
Sb, S (5.4)
Sxj
We see that sbi is influenced by three factors:'

1. s, the residual standard deviation. This can be reduced by measuring y more precisely.
2. Vlpi, the variance inflation factor. By using less collinear variables we can increase significance.
3. -‘1 1. By increasing the size of the sample we can increase significance.
These items cannot be controlled in observational studies, and insurance studies are observational, but they can be
controlled in research studies in which the researcher selects the subjects.
Statistical significance does not mean practical significance. A factor may be statistically significant but the
coefficient p may be so small that the financial effect (assuming we're modeling insurance claims experience) is
negligible.
Statistical effect does not imply causal effect. Generally to demonstrate causal effect one must find a causal
mechanism, or at least verify that the time order is correct and eliminate alternative hypotheses.
10.2 Uses of regression models

In a benchmarking study, regression is used to check reasonableness of observations considering all factors. For
example, you may determine whether the price of a house is reasonable based on a regression of house prices on
various explanatory variables.
We've studied in Lesson 9 how regression is used for prediction. In such predictions, a prediction interval
should be provided. Prediction intervals are wider when the input data is further from the values of the explanatory
variables used to fit the model.
10.3 Variable selection
Overfitting a model by adding an extraneous variable may increase the residual standard error due to loss of one
degree of freedom. Disadvantages of overfitting models are:
'I'm not sure why the textbook doesn't count sxj as a fourth factor. By using explanatory variables that are more spread out we can increase
significance.
Exam SFtM Study Manual 137

Copyright 02022 ASM
158 W. INTERPRETING REGRESSION RESULTS
• Simpler models are easier to interpret.

• Simpler models may do better at fitting out-of-sample data.
• Extraneous variables may introduce collinearity.
However, the coefficient estimates remain unbiased. Underfitting the model by omitting a relevant variable may
introduce bias.
el? Models with fewer variables are called parsimonious models.
In the opinion of Frees, when in doubt, leave the variable in . However, we've studied various methods in
Lessons 7 and 8 for removing variables.
10.4 Data collection
Sampling frame error means obtaining data from the wrong group. An example is using mortality data from life
insurance policyholders to estimate annuity mortality. People who buy annuities expect to live longer, whereas life
insurance purchasers may adversely select against the company.
The sampling region may be limited, and thus forecasts outside the region may not be appropriate. A quadratic
curve may look like a straight line in a small region. The earth appears flat for someone traveling in a small area.
Dependent variables may be censored, meaning that observations outside a certain region are not known exactly.
For example, for insurance with a policy limit, the exact amount of claims higher than the policy limit may be
unknown. An observation of a claim at the policy limit, where all that is known is that the underlying loss was at
least as high as the policy limit, is a censored observation. Severe censoring results in biased estimates.
Dependent variables may be truncated, meaning that observations outside a certain region are not observed.
For example, if insurance has a deductible, losses below the deductible may not be reported. Every observation
is conditional on the loss being above the deductible. Generally truncation is a more serious source of bias than
censoring, since nothing is known about amounts that are truncated.
Omitting variables leads to bias. Sometimes variables must be omitted due to legal reasons; the law may prohibit
using certain factors for rating.
Omitting variables may lead to inclusion of endogenous variables in the model. An exogenous variable is a variable
‘49 whose values are specified outside the model. An endogenous variable is a variable that is a function of other variables
already in the model. Usually an endogenous variable is another variable in the model that has been lagged. Time
series usually include endogenous variables. For example, suppose we are given the time series {lit}, where t
represents a month. Then the model
Yt = Po + + et
has one endogenous variable which is the response variable lagged one month. A more complicated example
is
yt = +Pixt ±(32yt-1 +p3x1_1+ et

which has exogenous variable xt and two endogenous variables; which is the response lagged one period, and
3c1_1 which is the predictor xt lagged one period.
Missing data can be a problem. For some observations, data from some of the explanatory variables may be
missing. The data may be missing at random. Possible strategies for handling missing data are:
1. Delete the observations with missing data. If data is missing at random this should not affect the regression
much.
2. If data is mostly missing from one variable, delete the variable. This may lead to less data loss than deleting
observations with missing values.
3. Impute missing data. Fill it in based on some algorithm. However, the filled in data will have less variability
than true data.

Part II
r
Generalized Linear Model

Lesson 11
Generalized Linear Model: Basics
Reading: Regression Modeling with Actuarial and Financial Applications 13.1-13.3.2, 13.6
Linear regression models assume that the response variable is normally distributed with constant variance. They
are inadequate for many situations.
For example, we often deal with "Yes" or "No" response variables. The response variable may be "Will the
insured be hospitalized within a year?" "Will the person survive one year?" To turn the response into a number,
we could let "No" be 0 and "Yes" be 1. But then we're stuck. No matter how we transform the response, it will
have only two or less possible values. No way a normal distribution with mean p and variance (32 can have just two
values.
So we give up on modeling the variable directly, and instead we model its mean. Let the response be the
probability of "Yes", which we'll denote by it. We can then model it as a normal random variable whose mean is a
linear expression of predictors. Such a model is called a linear probabilihj model. But there are many shortcomings to a:
such a model:
1. The response does not have a normal distribution. So residual analysis is meaningless.
2. The linear function may assume values less than 0 or greater than 1, which are impossible for it.
3. The variance is not constant; in fact, it is n(1 — n), a function of the mean.
In a linear model, we have a linear expression qi LJf3 jx1, where x01 = 1. This expression is the systematic tf
component of the model. The mean of the response yi is the systematic component, and yi has a normal distribution
with fixed variance a2.
In a generalized linear model, the systematic component is the same linear expression. However, rather than
setting the mean of In equal to it, a function of the mean, g(E[yi j), is set equal to qi. And yi may have any distribution
in the linear exponential family, which we will soon define. The function g(x) is called the link function, or just the •,.?
link for short. A linear regression model is a special case of a generalized linear model with a normally distributed
response and an identity link.
To repeat, for a generalized linear model:
g(E[Y]) = Epixi
11.1 Linear exponential family

The linear exponential family has the following probability density function: "ki?
f (y; 0,0 = exp (Y° b(0) + S(y, 0)) (11.1)
Here, 0 is the parameter of interest and 0 is a scale parameter. S(y, 0) is a function of y and cp only, not of 0. y and
0 appear together only in the numerator of the first fraction. The fact that y is alone there, not raised to a power or
otherwise transformed, is what makes this family linear.
Distributions in the linear exponential family may be discrete as well as continuous. For a discrete distribution,
f(y; 0,4)) is the probability function.
Some distributions that are members of the linear exponential family are:

162 11. GENERALIZED LINEAR MODEL: BASICS
• binomial
• normal
• Poisson
• exponential
• gamma
• inverse Gaussian
• negative binomial
Regression Modeling with Actuarial and Financial Applications, in Table 13.8, shows how to parametrize these
distributions as linear exponential. Let's do two examples:
Poisson distribution The usual parametrization of a Poisson is

AY
f (y) = —
Y!
First step is to bring the entire function into an exponential, logging it as needed:
f (y) = exp (—A ÷ y ln A — ln y!)
Since we want y multiplied by the parameter 0, we set 0 = In A, so A = e°, and we get
f (y) = exp (ye ee In y!)

Comparing to equation (11.1), we see that 0 = 1, b(0) = et) , and S(y, = In y!
Normal distribution The usual parametrization of a normal distribution is
f (y) =
crA,ITrr
There are two parameters, and we must choose one to transform into the parameter of interest. Our choice will be
p. Let's bring the right side into an exponential.
2 n
(Y — 02 112
f (y) = exp 2a2
ln
0.51n 2n) = exp( Y 2a2 Inc — 0.5 In 2n)
We can cancel the 2s and get yp in the numerator, so let U = p and 0 = 02. Then
y — 0.502 — 0.5y2
f (y) = exP 0.51n
— 0.51n 2n)
b(6) = 0.562 and we can stick the rest into S(y,
u2
S(y, =— — 0.5 In — 0.51n 271
Here are formulas for the mean and variance of members of the linear exponential family:
E[Y] = b' (0) (11.2)
Var(Y) = 4)b" (0) (11.3)

Copyright *2022 ASM
11.2. LINK FUNCTION 163
For example, for a Poisson, E[Y]= (6) rdo

_
eo = A, and the variance is the second derivative of et) times
cp = 1, or ee = A. For a normal distribution, b'(0) = 0 = p and b"(0) = 1 while cp = 0-2 so Var(y) = a2.
The following table shows the variance as a function of the mean, ignoring multiplicative constants (anything
not involving p is a constant):
Table 11.1: Variance function v(p)
Distribution Variance as function of mean p

Binomial p(1 — p)
Poisson
Normal 1
Gamma p2
Inverse Gaussian 113
For example, the variance of a normal is cp = a2, which is a constant, so we list 1.

In an insurance context, aggregate claims form a compound distribution: the number of claims N is random and
the sizes of the claims Xi are random; the aggregate claims random variable S is
S=
E xi
The Tweedie distribution is a compound distribution with the number of claims having a Poisson distribution and
the claim sizes having gamma distributions; claim sizes are mutually independent and claim sizes are independent 0:
of claim counts. The Tweedie distribution is a member of the linear exponential family with Var(Y) = cp yP , and
1 <p <2. The Tweedie distribution is a mixed distribution: it has a point mass at 0 (the probability of 0 is Pr(N = 0))
and is otherwise continuous. For a Tweedie distribution, 1 <p <2. A Tweedie distribution can also be viewed as a
mixture distribution. Given that n claims occur, the Tweedie variable Y is a sum of n gamma distributions, which is
a gamma distribution. So Y is a weighted mixture of gamma distributions, with weights equal to the probabilities
of n claims, n = 0,1, 2,
11.2 Link function
The inverse of the link function is called the mean function.

The link function of a generalized linear model must be monotonic and differentiable. For each distribution,
there is a canonical link function. The canonical link function sets the systematic component equal to the parameter 441
of interest; ij = 8. This means that
g(p) = Ti by the definition of link

g(p) = 0 because for a canonical link, /7 = 6
g(b' (0)) = since for the linear exponential family, p = (0) by formula (11.2)
g-1(0) =
g(0) = W(0)) 1
g(0) is the inverse of the derivative of b(0). For example, for the Poisson distribution, we saw that b(0) = e8, so
b'(A) = e8, and the canonical link function is g(y) = In y. For the normal distribution, we saw that b(0) = 0.562, so
b' (0) = 0, the identity function, and the canonical link function is the identity function. The following table lists the
b'(0) and its inverse, the canonical link function, for several distributions:

Table 11.2: Canonical links and mean function
Distribution b'(0) Canonical link g(p)

Binomial ea/(1 + ea) In(1-1 /(1— p))
Poisson ee ln ,u
Normal (3 II
Gamma —110 —1/y
Inverse Gaussian (-20)-1/2 —1/(2y2)
Lir The property that makes the canonical link function special is that the GLM estimate is unbiased when it is used.
For binomial, Poisson, and normal distributions, the canonical link function is commonly used. But for gamma and
inverse Gaussian distributions, the canonical link function makes the model hard to explain, so the canonical link
function is rarely used.
Multiplicative constants in the canonical link have no effect, since the coefficients of the systematic component
may all be multiplied by the same constant, so canonical links are often stated without the multiplicative constants.
Thus the canonical link for gamma is stated as 1/p (without the negative sign) and the canonical link for an inverse
gamma is stated as 1/p2.
EXAMPLE 11A •-•• You are given the following generalized linear model for annual aggregate losses on an auto
policy.
Response distribution: Gamma

Link: Logarithm
Parameter P
Intercept 2.21
Miles driven 0.0004
Driver's age category

1 0
2 —0.4
3 —0.6
Vehicle body
Sedan a
Coupe 0.7
SUV 0.9
You are also given that the scale parameter cp = 1/2.

Calculate expected annual aggregate losses and variance of annual aggregate losses for a driver in age category 2
who drives a coupe 10,000 miles. N
SOLUTION: The systematic component is
ln(E[Y]) = 2.21 + 0.0004(10,000) — 0.4 + 0.7 = 6.51
Therefore, E[Y] . e6.51 = 671.826 The variance is cp E[Y]2 for a gamma, so the variance here is
E[1]2/2 = 671.8262/2 = 225,675 0

Copyright C2022 ASM
11.3. ESTIMATION 165
is."\ Quiz 11-1 •-.11 For a generalized linear model of claim sizes,
(i) The gamma distribution is selected.
(ii) The link g(p) = VII is selected.
(iii) The model output is:
Variable
Intercept 18.4
Gender—male 3.1
Area
1.0
3.5
(iv) The variance of claim sizes is 3 times the square of the mean.
Calculate the variance of claim sizes for a male in Area B.
1 1.3 Estimation
The best parameters b for a generalized linear model are estimated using maximum likelihood. For a standard linear
regression model, maximum likelihood leads to the same result as least squares.
To perform maximum likelihood estimation, we log the density function f (y) and take partial derivatives with
respect to each parameter /3i. These partial derivatives are called scores.' The scores are set equal to 0. This gives us
k +1 equations in k +1 unknowns. Although they usually cannot be solved in closed form, they can be approximately
solved using iterated weighted least squares. The partial second derivatives form a matrix. The negative expected value
of this matrix is called the information matrix. Maximum likelihood estimators are consistent and asymptotically
normal, with covariance matrix equal to the inverse of the information matrix. Thus the inverse of the information
matrix is used to test goodness of fit.
Let's illustrate the solution process for simple regression. In simple regression, we're given pairs (xi, yi) and yi is
normally distributed with mean po + f3 ix; and variance a2. We want to select pp and )61 to maximize the likelihood.
To solve this, let's determine the likelihood. The density function for each yi is
( (Yi
f (yi) =
AlTrt
exp
(Po + Pixi))2)
20-2
Since a is a constant, we can ignore 1/a1/ . The likelihood function is then •41'
(Yi — Pixi))2
L(po, pi) = exp (— i•=1
2o2
Logging this expression, the loglikelihood function is •41
(yi — (Po + flixi))2

20.2
To maximize this, we can multiply by —2a2 and minimize
Q(P0, pi) = Z(yi — (Po + Pixi))2

1=1
'These scores have nothing to do with the scores related to principal component analysis defined on page 130.

Copyright g2022 ASM
166 /I. GENERALIZED LINEAR MODEL: BASICS
In other words, we minimize the sum of the squared difference between In and the fitted value ho + pixi. We see
that the least squares solution is the maximum likelihood solution.
Let's maximize /(f30, hi) by differentiating with respect to ho and /31 and solving.
al E;'=i(Yi (ho Pi xi))
apo 0-2
Eli1.1 xi (yi — 030 + pixo)

apl 62
NIP The two expressions on the right are the scores. We will set them equal to 0. To do this, we just have to set the
numerators equal to O. Rearranging, we have
11
/I
n[30 + (E xi) 13i = E 12
Y;
( E Xi) 4)hi=Exiyi
1=1
PO +
Using Cramer's rule, we get

l
f
E yi
E xi E xiy•
hi =
n >xi
En Ex
n xiyi — xi E yi
Cov(x, y)
Var x
where the last line is derived by dividing numerator and denominator of the previous line by n2, and variance and
covariance are the empirical variance and covariance.
The estimate for f30 can be backed out from the first normal equation; since (yi — (Po + Pixi)) = 0, it follows that
Po g
Now let's look at the information matrix.
d21
(3132 02
d21 E xi
19/3041 02
d2i E4
4 21 — a2
There are no ys in these expressions, so the negative expected value of each expression is just negative the value. So
the information matrix is
3=
( E xi
a2

Copyright 02022 ASM
71.4. OVERDISPERSION 167
The information matrix is the covariance matrix for the scores. And its inverse is the asymptotic covariance matrix
of the maximum likelihood estimates. The inverse of the information matrix is
0.2 ( —Exi)
n — (E xi)2- Exi ) n
and we recognize the components of the matrix as the variances and covariances of g'o and pi, as in equations (3.9),
(3.10), (3.11).
11.4 Overdispersion
The distribution used for the response determines the variance. Sometimes we need more flexibility, because the
variance of the data is greater than indicated by the model. For example, if a Poisson distribution is used, the
variance should equal the mean, but the data may indicate that the variance is greater than the mean. This is called
overdispersion. GLM estimation only requires a mean function, not a fully specified distribution, so we can arbitrarily
set the variance to be a multiple of what it would be for the specified distribution:
Var(yi) = a26/9b"(ei)
where (pi is the scale parameter for N. The extra parameter a2 allows for overdispersion. It is unnecessary for
distributions such as the normal distribution, where .1); already equals the variance.
The overdispersion parameter is estimated using the Pearson chi-square statistic divided by the number of "I
degrees of freedom:
1 j1- ,(y — E[yi])2
(11.4)
a2 =N — p Var(yi)
where N is the number of independent cells, p is the number of parameters being estimated (usually p = k + 1 in
our models), and Var(yi) is the theoretical variance of the distribution.
Table 11.3: Summary of formulas in this lesson
Linear exponential family probability density function
f (y; 0, 0) = exp
ye —:(0) S(y , op)) (11.1)
Moments for linear exponential family
E[Y] = b'(0) (11.2)

Var(Y) = 4b"(0) (11.3)
Variance functions See Table 11.1
Canonical links See Table 11.2
Estimate of overdispersion parameter

1
.2
0' =
N—p t (y1 —Var(yi)

E[y iD2 (11.4)
Exam SFtM Study Manual

Copyright 02022 ASM
Exercises
11.1. eir [MAS-I-F19:341 You are given the following statements comparing k-fold cross validation (with k < n)
and Leave-One-Out Cross Validation (LOOCV), used on a GLM with log link and gamma error.
1. k-fold validation has a computational advantage over LOOCV
II. k-fold validation has an advantage over LOOCV in bias reduction
III. k-fold validation has an advantage over LOOCV in variance reduction
(A) None are true (B) I and II only (C) I and III only (D) II and III only
11.2. IN? A geometric distribution is defined by

f (y) = p)p.' 17 := 0, 1, 2, . . .
It is a member of the linear exponential family.

State a function of 0 that b(0) is a multiple of.
11.3. "41 Consider a random variable Y with the following probability density function:
y3e—YiY
f (Y) = y>0
24y4
The random variable is in the linear exponential family.

State a function of 0 that b(19) is a multiple of.
11.4. `41 [S-F16:311 Within the context of Generalized Linear Models, suppose that y has an exponential distribution
with probability density function expressed as:
Determine the variance of y in terms of p.

(A) 1/p (B) - (C) p (D) ,u2
11.5. Nr For a distribution in the linear exponential family, you are given
f (y; xi , = exp
± 750 + s (y 0))
Determine Var(Y), the variance of the distribution, when u = 10.
Exam SRM Study Manual Exercises continue on tire next page . . .

11.6. ": [Based on S-F16:331 You are given the following two probability density functions:
(i) f(y; 6) = 0y-e-1 for y>
(ii) f (y; 6) = Oe-Y8 for y > 0 and 6 > 0
Which of these distributions are in the linear exponential family?
11.7. mIr [MAS-I-F18:27) You are given the following three functions of a random variable, y, where —00 < y <o0.
I. g(y)= 2 + 3y + 3(y — 5)2
IL g(y)=4-4y
III. g(y) = lyl
Determine which of the above could be used as link functions in a GLM.
11.8. ¶P [MAS-I-F19:271 An actuary is asked to model a non-negative response variable and requires that the
model produce an unbiased estimate.
Determine which error structure and link function combination would be the best choice for the modelling
request.
(A) Poisson and Identity
(B) Compound Poisson-Gamma and Log
(C) Normal and Identity
fl (D) Gamma and Log
(E) Poisson and Log

11.9. ikir [MAS-I-F19:261 A number of candidate models were fit using the following variables:
• An intercept term
• Variable A—a Yes/No indicator
• Variable B—a Yes/No indicator
• An interaction of Variables A and B
There are four observations, which were arranged into the following design matrix:
0 0
(1 01
1 1 0 0
X=
1 0 1 0
1 1 1 1
This data was fit using three different link functions:

I. Identity
IL Inverse
III. Log
The predicted values, given below, were the same under all three models:
_ (C0.}.8500)
Y— 0.40
0.70
Determine for which of the above link functions the estimated interaction coefficient is non-zero.
(A) Identity, Inverse, and Log

(B) Identity and Inverse only
(C) Identity and Log only
(D) Inverse and Log only
11.10. al For a generalized linear model for claim sizes, you are given
Response variable: Claim sizes
Response distribution: Normal
Intercept 22
Age of driver
18-24 15
25-64 0
65 and up 13
Income group
Under 50000 12
50000-100000 0
Over 100000 —3
The link function is g(p) = p1/2.

The estimate of o- is 2.
Calculate expected claim sizes for a 20-year old driver earning 40,000.
11.11. al For a generalized linear model for claim counts,

(i) The Poisson distribution is selected.
(ii) The log link is selected.
(iii) The model output is:
Variable Estimated /3
Intercept —2.45
Gender—female —0.85
Income (000) —0.04
Age 0.10
Calculate expected number of claims for a 30-year old male with income of 100,000.

Copyright 02O22 ASM
11.12. q (S-F15:341 You are given the following information for a model of vehicle claim counts by policy:
(i) The response distribution is Poisson and the model has a log link function.
(ii) The model uses two caegorical explanatory variables: Number of Youthful Drivers and Number of Adult
Drivers.
(iii) The parameters of the model are given:
Degrees of
Parameter Freedom 13
Intercept 1 —2.663
Number of Youthful Drivers
0
1 1 0.132
Number of Adult Drivers
1
2 1 —0.031
Calculate the predicted claim count for a policy with one adult driver and one youthful driver.
(A) Less than 0.072
(E) At least 0.078
11.13. •41` [S-F16:32} You are given the following GLM output:
Response variable Pure Premium
Response distribution Gamma
Link log
Parameter df 13
Intercept 1 4.78
Risk Group 2
Group 1 0 0.00
Group 2 1 —0.20
Group 3 1 —0.35
Vehicle Symbol 1
Symbol 1 0 0.00
Symbol 2 1 0.42
Calculate the predicted pure premium for an insured in Risk Group 2 with Vehicle Symbol 2.
(A) Less than 135
(13) At least 135, but less than 140
(E) At least 150

Copyright e2022 ASM
n 11.14. [S-F17:31] You are given the following GLM output:

Link Log
Parameter df
Intercept 1 5.26
Risk Group 2
Group 1 0 0.00
Group 2 1 0.18
Group 3 1 0.37
Territory Code 2
Region 1 0 0.00
Region 2 1 0.12
Region 3 1 0.25
Calculate the predicted pure premium for an insured in Risk Group 3 from Region 2.
(A) Less than 250
(E) At least 325
l
f
11.15. L You have estimated the following generalized linear model:
gQii) = 2 + 3xii + 4xi2
You used a gamma distribution as the response distribution, and the link function you used is g(p) = 1/ p.
Determine the mean of the response when x1 = 5 and x2 = 6.

Copyright 02022 ASM
11.16. N. [MAS-I-S18:24] You are given the following output from a model constructed to predict the probability
that a Homeowner's policy will retain into the next policy term:
Response variable retention

Response distribution binomial
Link square root
Pseudo R2 0.6521
Parameter df
Intercept 1 0.6102
Tenure
<5 years 0 0.0000
5 years 1 0.1320
Prior Rate Change

<0% 1 0.0160
[0%, 10%] 0 0.0000
> 10% 1 —0.0920
Amount of Insurance (000's) 1 0.0015
Let it be the probability that a policy with 4 years of tenure that experienced a 12% prior rate increase and has
225,000 in amount of insurance will retain into the next policy term.
Calculate the value of 71.
(A) Less than 0.60

(E) At least 0.90

11.17. Nr [S-S16:31] You are given the following information for a fitted GLM:
Response variable Claim size
Link Log
Dispersion parameter 1
Parameter df
Intercept 1 2.100
Zone 4
1 1 7.678
2 1 4.227
3 1 1.336
4 0 0.000
5 1 1.734
Vehicle Class 6
Convertible 1 1.200
Coupe 1 1.300
Sedan 0 0.000
Truck 1 1.406
Minivan 1 1.875
Stationwagon 1 2.000
Utility 1 2.500
Driver Age 2
Youth 1 2.000
Middle age 0 0.000

Old 1 1.800
Calculate the predicted claim size for an observation from Zone 3, with Vehicle Class Truck and Driver Age Old.
(A) Less than 650
(E) At least 800

Copyright 02022 ASM
11.18. [SRM Sample Question #4.5] The actuarial student committee of a large firm has collected data on exam
scores. A generalized linear model where the target is the exam score on a 0-10 scale is constructed using a log link, N--)
resulting in the following estimated coefficients
Predictor Variables Coefficient
Intercept —0.1
Study Time (in units of 100 hours) 0.5
Gender (1 for Female, 0 for Male) 0.5

Master's degree (1 for Yes, 0 for No) —0.1
Interaction of Gender and Master's degree 0.2
The company is about to offer a job to Patricia, who is a female with a Master's degree. It would like to offer her
half of the study time that will result in an expected exam score of 6.0.
Calculate the amount of study time that the company should offer Patricia.
(A) 123 hours (B) 126 hours (C) 129 hours (D) 132 hours (E) 135 hours
11.19. *I? You have estimated the following generalized linear model:
= 0.8 + 0.6xii + 1.4x12
The distribution of Y satisfies

Var(Y) = 0.5 E{Y13
The link function is g(p) = Aru.
Determine the variance of the response when x1 = 2 and x2 = 1.
11.20. 'v. [S-S17:29] You are given the following GLM output:
Link log
Scale parameter (0) 1
Parameter df
Intercept 1 3.25
Risk Group 2
Group 1 0 0.00
Group 2 1 0.30
Group 3 1 0.40
Vehicle Symbol 1
Symbol 1 0 0.00
Symbol 2 1 0.45
Calculate the variance of the pure premium for an insured in Risk Group 3 with Vehicle Symbol Group 2.
(A) Less than 3,000
(13) At least 3,000, but less than 4,000
(E) At least 6,000

Copyright C2022 ASM
11.21. "-st. [S-F16:381 You are given the following probability density function for a single random variable, X:
e
11/2 e(x — 1)2 \
f (x I 0) =
(27/X3) exp ( 2x )
Consider the following statements:
I. f (x) is a member of the linear exponential family of distributions.
II. The score function Me) is
1
me). 20 2x
III. The information matrix 1(9) is:

1(0) = 202
(E) The correct answer is not given by (A) ,(B) , (C) , or (D) .

Copyright 02022 ASM
11.22. "1:1 [S-S16:321 You are given the following information for a fitted GLM:
Response variable Claim size
Link Log
Scale parameter (0) 1
Parameter df S
Intercept 1 2.100
Zone 4
1 1 7.678
2 1 4.227
3 1 1.336
4 0 0.000
5 1 1.734
Vehicle Class 6
Convertible 1 1.200
Coupe 1 1.300
Sedan 0 0.000
Truck 1 1.406
Minivan 1 1.875
Stationwagon 1 2.000
Utility 1 2.500
Driver Age 2
Youth 1 2.000
Middle age 0 0.000
Old 1 1.800
Calculate the variance of a claim size for an observation from Zone 4, with Vehicle Class Sedan and Driver Age
Middle age.
(A) Less than 55
(E) At least 70
11.23. s: A generalized linear model with an inverse Gaussian response uses the link g(p) = 1/112. The model
has one explanatory variable x and an intercept. The estimated parameters are b = (0.0135,0.0582)'.
Let p1 be the mean of the response when x = 10.
Let 1./ 2 be the mean of the response when x = 11.
Determine p2 — pi •

Copyright 02022 ASM
11.24. •-.6. [S-F15:31] Given the following information:

(i) Y is a random variable in the linear exponential family
(y) = exp
(yt9 — b(6) + S(y,,
0))
(ii) b(19) =
(iii) 0 = —0.3
(iv) = 1.6
Calculate E[Y].
(A) Less than —1
(E) At least 2
11.25. Nr [Version of S-F15321 A GLM is used to model claim size. You are given the following information about
the model:
(i) Claim size follows a Gamma distribution.

(ii) Log is the selected link function.
(iii) Variance parameter 4) is estimated to be 2.
(iv) Model Output:
Variable b
(Intercept) 2.32
Location—Urban 0.00
Location—Rural —0.64
Gender—Female 0.00
Gender—Male 0.76
Calculate the variance of the predicted claim size for a rural male.
(A) Less than 25
(E) At least 250
Exam SEM Study Manual Exercises continue on the next page . . .

Copyright 02022 ASM
11.26. Nr [MAS-I-S18:25] Three separate GLMs are fit using the following model form:
+ pixi+p2x2
The following error distributions were used for the three GLMs. Each model also used their canonical link functions:
Model 1: gamma
Model II: Poisson
Model III: binomial
When fit to the data, all three models resulted in the same parameter estimates:
Po 2.0
Pi 1.0
P2 —1.0
Determine the correct ordering of the models' predicted values at observed point (Xi, X2) = (2,1).
(A) I < II < III (B) I < III < II (C) II < I < III (D) II < III <I
11.27. `-ir [MAS-I-S18:39] A GLM was used to estimate the expected losses per customer across gender and
territory. The following information is provided:
(i) The link function selected is log.
(ii) Q is the base level for Territory.
(iii) Male is the base level for Gender.
(iv) Interaction terms are included in the model.
The GLM produced the following predicted values for expected loss per customer:
Territory
Q R
Male 148 545
d Female 446 4,024
Calculate the estimated beta for the interaction of Territory R and Female.
(A) Less than 0.85
(E) At least 1.15

EXERCISES FOR LESSON n 181
11.28. %. [S-F17:32] Given a family of distributions where the variance is related to the mean through a power
function:
Var(Y) = a E[Yr
One can characterize members of the exponential family of distributions using this formula.
You are given the following statements on the value of p for a given distribution:
I. Normal (Gaussian) distribution, p = 0
II. Compound Poisson-gamma distribution, I<p<2
III. Inverse Gaussian distribution, p = —1
Determine which of the above statements are correct.
(A) I only (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) or (D) .
11.29. %I' IMAS-I-S18:401 Art actuary fits a Poisson distribution to a sample of data,
-o
f (x) = x!
To assure convergence of the maximum likelihood fitting procedure, the actuary plots three quantities of interest
across different values of B.
Plot I: Score Function, U Plot II: Deviance Plot III: Information, J
0 0
Determine which of the three plots the actuary can use to visually approximate the maximum likelihood estimate
for O.
(A) None can be used (B) I and II only (C) I and III only (D) II and III only

Copyright C2022 ASM
182 H. GENERALIZED LINEAR MODEL: BASICS
Solutions
11.1. This question is based on Lesson 6, but was placed here because of its reference to GLMs.
LOOCV is n-fold validation, k-fold validation requires less computation since it is only creating k validation
sets, whereas LOOCV creates n validation sets. The only exception to this is for a standard linear regression,
where LOOCV has a simple formula, but the question specifies that this is a GLM, not a standard linear regression.
Increasing k increases variance but decreases bias; thus k-fold validation has less variance than LOOCV. (C)
11.2. As usual, write f (y) as the exponential of something.
f (y) = exp (In(1 — p) + y ln p)

The parameter of interest is 8 = In p, and then p = e6, so
f(y) = exp(ln(1 — e°) + ye)
We see that op = 1 and b(0) = — ln(1 — ) , or some multiple thereof.

11.3. This is a gamma distribution, but you don't have to know that to solve the question.
Express the density as the exponential of something.
(y) = exp (—y/y + 31n y — 41n y — ln 24)
The parameter of interest is 0 = —1/y, and then 41n y = —4 ln(-8), so b(0) = —41n(-0) , or a multiple thereof.
11.4. Whether in the context of Generalized Linear Models or not, the variance of an exponential random variable
is the square of its mean. Here the mean is p, as you can verify in the distribution tables, so the variance is p2 . (D)
But if you wanted to do it using the GLM formula for variance, you can set 0 = —1/p and note that cp = 1. Since
exponential is a special case of gamma, and for a gamma Var(Y) = 0E1112, the result follows.
11.5. To get it in the form of (11.1), let B = —25/u2, so u = 5/1r-7. Then
f (y; ,
exp (ye + 10/7) + S(y ,, 4)))
We see that oP = 1 and b(e) = , so
Var(Y) = 4)b”(6) = —10(1)( )(&)_3/2= 2.5(-8)-3/2

When u = 10, 0= —0.25, so Var(Y) = 2.5(0.25-3/2) = F2Tl i
11.6. The first one cannot be linear exponential since the domain depends on 8. If you try to express it in linear
exponential form, you get f (y; 0) = exp (In 0— (8+ 1) In y), and the term that combines y and 6 has In y rather than
y, so it is not linear.
The second one is an exponential distribution, which we know is in the linear exponential family. You can also
express it in linear exponential form: f (y; 0) = exp(Jn 0 — ye), and the term combining y and 19 has just y.
11.7. The link function must be monotonic and differentiable. I and III are not monotonic. (B)
11.8. The inverse of the log function takes real numbers to non-negative numbers, so it is appropriate here. And
to make the estimate unbiased, we should use the model having log as its canonical link, namely Poisson. (E)
11.9. For the identity model, based on the design matrix,

Po + 131 +132 + = 0.7
and we conclude that 131 = 0.3, P2 = -0.1, and /33 = 0.

For the inverse model, the values generated by the linear expression, the systematic component, are the inverses
of 2, 1.25, 2.5, and 1/0.7. Using these values in the right sides of the four equations above does not result in
= 0. Similarly, with the log model, Y is the exponential of the values generated by the systematic component, and
using In 0.5, In 0.8, In 0.4, and In 0.7 as the right sides of the four equations above does not result in f33 = 0. (D)
11.10. The systematic component is 22 + 15 + 12 = 49. Inverting the link, p = 492 = 2401
11.11. Gender is a categorical variable with male as the base level, so nothing is added to the linear expression for
gender.
g(y) = -2.45 - 0.04(100) + 0.10(30) = -3.45
Since g(p) = in p, we have p = e-3'45 = 0.031746 . We can then calculate the probability of n claims using the
Poisson distribution. For example, the probability of 1 claim is 0.031746e-0.031746= 0.030754.
11.12. For the log link,
In p = = -2.663 + 0.132 = -2.531
where the base class for adult drivers is 1 so the value of that variable is 0. Then p = e-2.531 = 0.07958 (E)
11.13. The systematic component is g(p) = 4.78 - 0.20 + 0.42 = 5. The link is g(p) = in p, so p = es = 148.41 (D)
11.14. The systematic component is 5.26 + 0.37 + 0.12 = 5.75. This is the logarithm of pure premium, so pure
premium is e5'75 = 314.19 . (D)
11.15. g(p) = 2 + 3(5) + 4(6) = 41, so p = 1/41
11.16. For a binomial response, the predicted value fi is the probability as well as the mean of the distribution.
Here, the systematic component g(n) = -Fr is
g(n) = VTE = 0.6102 + 0 - 0.0920 + 225(0.0015) = 0.8557
so fr = 0.85572 = 0.7322. (C)
11.17. The systematic component for 3-Truck-Old is 2.1 + 1.336 + 1.406 + 1.8 = 6.642. With the log link, the result
is e6.642 = 766.63 . (D) The dispersion parameter only affects the variance and is not relevant for this question.
11.18. The linear component, with x being study time in hundreds of hours, is -0.1+0.5x +0.5-0.1+0.2 = 0.5+0.5x.
We want e°31-°-5x = 6.
0.5 + 0.5x = ln 6 = 1.791759

1.291759
x =
0.5
Since we offer half the study time, we multiply x by 0.5, and also by 100 to get study time in hours. The result is
129.18 . (C)
11.19. g(p) = 0.8 + 0.6(2) + 1.4(1) = 3.4
Therefore, p = 3.42 = 11.56. The variance is 0.5p3 = (0.5)(11.50) = 772.4022
11.20. The variance for a gamma is 42. The mean is

exp(3.25 + 0.4 + 0.45) = 60.3403
The variance is 60.34032 = 3640.95 . (B)

Copyright Q2022 ASM
11.21.
I. Logging the density function:
0(x— 1)2
ln f (x Ff9)= 0.51n (9— 0.51n(2nx3) 2x
and we see that in the term that combines x and 0, the last term, there is not just x, so f (x) is not a member of
the linear exponential family.X
The score function is the derivative of the logarithm with respect to 0, or
1 (x
u(e).
20
1)2 t/
2x
The information matrix 1(0) is a 1 x 1 matrix (since there's only one variable). It is negative the expected value of
the second derivative. The second derivative is —1/(202). Since X drops out, the second derivative is constant
with expected value —1/(202). Negating we get 1(0) = 1/(2(92), not 202.X
(B)
11.22. The systematic component for 4-Sedan-Middle age is 2.100 +0 + 0 + 0 = 2.100, so the mean is p = e2.100 =
8.16617. For a gamma, the variance is cpp2, so the variance in our case is 8.166172 = 66.6863 . (0)
11.23. The model is
1
= 0.0135 + 0.0582x
11
Or
1
0.0135 + 0.0582x
For x = 10 it is 1.2959. For x = 11 it is 1.2368. The difference is —0.0590

11.24. E[Y] = b'(0). The derivative of b(6) is
1
b' (0) = —
2
(2)(-20)-112 = (-26)-v2
Plug in 8 = —0.3 and we get

E[Y] = 0.6-1/2 = 1.2910 (D)
11.25. The systematic component is

in p = 2.32 — 0.64 + 0.76 = 2.44
The mean is estimated as e2-44 = 11.473 = p. The variance, 49 p2 is
cpp2 = 11.4732/0.5 = 263.3 (E)
11.26. g(Y) = 2.0 + 1.0(2) — 1.0(1) = 3. Use the inverses of the canonical link functions, the mean functions. For
gamma, g(x) = —1/x so the g-1(3) = —1/3. For Poisson, g(x) = In x so g-1(3) = e3 = 20.086. For binomial,
g(x) = In (x /(1 — x)) so g-1(3) = = 0.9526. (B)

Copyright 02022 ASM
11.27. The logged data is

Territory
a)
-a
z
Male 4.9972 6.3008
Female 6.1003 8.3000
Let pi be the coefficient of territory, 132 the coefficient of gender, and p3 the coefficient of interaction. Then
Po= 4.9972
Po + = 6.3008
Po + 132 = 6.1003
Po + 131 +P2+133 = 8.3000
Then the estimate for Pi is 6.3008 — 4.9972 and the estimate for 132 is 6.1003 — 4.9972. Adding these to Po we get
Po + 01+132 = 6.3008 + 6.1003 — 4.9972 = 7.4039. Therefore the interaction beta,133, must be 8.3000— 7.4039 = 0.8961
(B)
11.28. As discussed in this manual p = 3 for inverse Gaussian. But I and II are true. (B)
11.29. The score function is 0 at the maximum likelihood estimate for 19 and the deviance is minimized. However,
information is just the reciprocal of the variance, and does not indicate the maximum likelihood estimate. (B)
Quiz Solutions
g(p) = = 18.4 + 3.1 + 1.0 = 22.5
p = 22.52 = 506.25
The variance is 3(506.252) = 768,867


Copyright 02022 ASM
Lesson 12
Generalized Linear Model: Categorical

Response
Reading: Regression Modeling with Actuarial and Financial Applications 11.1-11.2, 11.4-11.6
One of the most important uses of a generalized linear model is to model categorical responses.
Categorical responses are of three types:
1. Binomial: there are two categories. These are "Yes/No" variables: Is the drug effective? Will the policyholder
submit a claim? Will a student pass SRM?
2. Nominal: there are multiple categories with no particular order. For example: What type of vehicle will a
person buy? Which investment will perform best?
3. Ordinal: there are multiple categories that follow a logical order. For example: How severe was an accident?
(Property damage only, Bodily injury but no fatality, Fatality)
Sometimes the categories are called levels.
12.1 Binomial response
In the following discussion, let -r7 be the systematic component Vi=0 piXi.
If the response variable Y has only two possible values, then the two values are coded as 0 and 1, and the expected
value of Y is the probability of the value coded as 1. Let 77 be that probability. it must be between 0 and 1. However,
the systematic component ri can be any real number. Therefore, the link g(ni) =ij must be a function going from
the interval [0,11 to (-co, co). The inverse of the link, g-1(1ii) = rig, takes any real number to the interval [0,11.
The most popular link, which is also the canonical link, is the logit link:
(12.1)
The function g(n) is called the logit function, and we will write logit(n). When this link is used, we say that we've •
performed logistic regression.
Two other links that are commonly used are
• The probit link, which is the inverse of the standard normal distribution function: •41
g(n) = cl3-1(n) (12.2)

•
• The complementary log-log link,
g(n) = ln(-1n(1 - n)) (12.3)
The textbook only mentions the complementary log-log link in passing, so it may not be tested.
These three links are similar around it = 0.5 but differ at the values of it near 0 and 1. The logit and probit links are
symmetric with a sign change around it = 0.5 but the complementary log-log link is not.
For each link, it will be necessary to invert the link function to obtain an estimate for it, the mean. For the logit •11.
link, ri = ln
noi)= (12.4)

Copyright 02022 ASM
188 12. GENERALIZED LINEAR MODEL: CATEGORICAL RESPONSE
0.9 —
0.8 —
0.7 —
0.6 —
0.5 —
0.4 —
0.3 —
0.2
0.1 —
-5 -4 -2 -1 0 1
Figure 12.1: Logit mean function
The ratio o = n/(1 n) is the odds ratio. The odds ratio is defined as follows: the odds of an event are o if and
only if the probability of the event 77 is o/(1 o). For example, when we say the odds are 2 to 1, or in other words
o = 2/1 = 2, we are saying the probability is 2/3. In a gambling context, if odds of an event are 2 to 1, then in a fair
gamble the gambler wins 1 when the event occurs and loses 2 when it doesn't. In the logistic model, the systematic
component is the logarithm of the odds. The logit is the log of the odds ratio. Given odds of o, the probability is
o/(1 o).
For the probit link, the inverse of the link is the cumulative distribution function of the standard normal
distribution:
01) = VT1) (12.5)
For the complementary log-log link, the inverse is the extreme value distribution:
71 = 1 — (12.6)
Figures 12.1, 12.2, and 12.3 graph these inverses.

One way of interpreting these links is the threshold interpretation. We assume there is an unobservable
underlying response variable y* assuming all real values. The model is
Y* =
where E has a certain distribution. When y" is less than or equal to a threshold, which we set equal to 0, we observe
0; when y* is greater than 0, we observe 1. Thus ri(rii) is the probability that y; is positive. The higher qi is, the
more likely that y; is positive.
For example, if we assume Ei has a logistic distribution with distribution function:
1
F(x) = 1 + e_x
then the density function is
e-x
(1 + e-x)2
Then
1
Pr(yi = 1 I xi) = Pr(y; > 0 I xi) = Pr(ei > -rp) =1

12.1. BINOMIAL RESPONSE 189
0.9 —
0.8 —
0.7 —
0.6 —
0.5 —
0.4 —
0.3 —
0.2 —
0.1 —
o 11
—5 —4 —3 —2 —1 0 1 2 3 4 5
Figure 12.2: Probit mean function
0.9 —
0.8 —
0.7 —
0.6 —
0.5 —
0.4 —
0-3 —
0.2 —
0.1 —
0 I I 1 I 11
—5 _
—3 —2 —1 0 4 5
Figure 12.3: Complementary log-log mean function

which is the mean function for the logit link that we developed above.
For the density function of the logistic distribution, if you replace x with —x, you get
ex
=
(1 + ex)2
and dividing the numerator and denominator by e 2x, this equals -(ee-ixt )r, the same as f (x), so the density is symmetric
around 0, as we mentioned above.
If we assume E has a standard normal distribution, then the threshold calculation gives:
Pr(y* = 1Ix) = Pr(17 + E > 0) = Pr(E > = 1 — (1)(-0=(KO
which is the inverse of the probit link.

Let's now interpret the gi of a logistic regression. In a logistic regression, the systematic component ri is the
logged odds of the event. Compare the response ni when Xi = 1 versus the response n2 when xi = 0, with all
other x values held constant. Let III and 172 be the corresponding systematic components. Then rh. — 172 = f3i. Let
01 = n1/(1 ni) and 02 = n2/(1 — n2) be the odds in each case. Then
pi = in 01 — In 02 = in 6'1
02
e' = °1
02
So di is the ratio of the odds. For example, if = 1, then the odds of an event when xi = 1 are e times the odds
when xi = 0.
EXAMPLE 12A The following logistic model for the probability of passing Exam SRM is fitted:
Response variable: Probability of passing Exam SRM
Response distribution: Binomial
Link: Logit
Parameter
Intercept —1.3
Familiar with R
Yes 1.0
No 0.0
Hours of studying ASM manual 0.01
Score on Exam STAM
5 or less —2.5
6 0.0
7 1.2
8 or more 1.6
Calculate the odds of passing Exam SRM for someone who is not familiar with R, has studied the ASM manual
for 150 hours, and who passed Exam STAM with a 7. •
SoLuTioN: The systematic component is
= —1.3 + 0.01(150) + 1.2 = 1.4
The odds are elA = 4.0552 . The question didn't ask for the probability of passing, but in case you are interested, it
is 4.0552/5.0552 = 0.80218.

Copyright 02022 ASM
22.2. NOMINAL RESPONSE 191
EXAMPLE 12B Ler In the previous example, calculate the probability of passing if the probit link is used. •
SOLUTION: The systematic component was calculated in the previous example and was found to equal 1.4. With the
probit link,
it (1)(1.4) = 0.9192
If a model were fitted for this link, the coefficients would vary by link and the result for probability might not be so
different. 0
12.2 Nominal response
I expect few if any exam questions on nominal response and ordinal response models. An Introduction to Statistical
Learning says that these models are rarely used. The CAS has not asked any questions on these models since they
added GLM to their syllabus in 2015. And none of the SRM sample questions ask about these models. If you are
in a hurry, you may skip the rest of this lesson.
We've so far talked about response variables having one of two possible values. Now let's talk about response
variables having one of c > 2 values. In the generalized logit model, we select one category as the base category or al
reference category. Let's say it's category c. For each of the other categories, the model has one equation for the If
relative odds. The odds of category j relative to category i is the probability of category j divided by the probability k:
of category i. The nominal logistic model has an equation for the logarithm of the odds of each category j relative es:
to the reference category:
fl ln .jTr '
ne i=0
Epux, J.-, 1,2,3, ... , c — 1
Let rij = 011x1. Then ni = ncert j , and the probabilities must sum up to 1, so
1
nc
eqj
j — 1, 2,3,...,c — 1
—
1± ei
EXAMPLE 12C k? You use a logistic model to predict the color of the car covered by an insurance policy, as a function
of certain characteristics of its owner. The color may be white, silver, or red. White is the reference category. The
iftted model has the following coefficients:
Parameter Silver /3 Redi3
Sex
Male 0 0
Female —1.2 0.8
Income
Under 50,000 —0.5 —0.3

50,000-100,000 0 0
Over 100,000 1.5 0.7
Calculate the probability that a policy on a female with 40,000 income covers a white car. •

SourrioN: Let white be category 3. We have to calculate the relative odds of silver (category 1) and red (category
2) cars first.
ln = -1.2 - 0.5 = -1.7
n3.
ln = 0.8 - 0.3 = 0.5

13
7
1
0.353182
7r3 = 1 ± e-1.7 e0.5
Quiz 12-1 slir A student entering college may ultimately become a physician (category 1), an actuary (category
2), or a professor (category 3). To calculate the probabilities of these three careers, a generalized logit model is
constructed with intercepts only. Category 3 is the base category and pa = 1.3 for category 1,0.8 for category 2.
Calculate the relative odds of the student becoming an actuary versus becoming a physician.
We are now going to provide an interpretation for Al in a generalized logit model. Suppose there is a binary
•-.11 explanatory variable x. Then for that variable, the odds ratio of category j to the base category is the ratio of the
probability of category j when x = 1 (the explanatory variable is present) over the probability of category j when
x = 0 (the explanatory variable is absent), divided by the same ratio for the base category:
odds ratio] = TriP inia

ncitInca
where the subscript p means present and the subscript a means absent. For the nominal logistic model, assume that
x1 is the binary explanatory variable of interest and that all other variables are held constant. Then, since xi = 1
when the explanatory variable is present and xi = 0 when it is absent,
rtiP
ln = Poi + pi) Jefixi
i=2
The logarithm of the odds ratio is the difference of the logarithms, or

11,
In odds ratioj = in TE1P
Trip
- In 771a
The odds ratio is e1311. hi other words, pi; is the logarithm of the odds ratio. The odds ratio does not vary with the
values of the other explanatory variables.
The generalized logit model has a nested structure. If we condition on yi #1, for example, the conditional value
of yi follows a generalized logit. If we condition on yi equalling one of two values, the conditional model is the
logistic model we studied for binary variables:
Pr(y; =a IY = a or yi = b) =
Oa I (1+ eqi)
(e,747 eb) (1 j_ vc-1 z,71;)
eq.
+
eria-rm
1 + erin-qt,
Compare this to equation (12.4). If we treat a as 1 and b as 0, this probability is a logistic model with ps equal to the
excess of the ps for value a minus the ps for value b.

Copyright 02022 ASM
12.3. ORDINAL RESPONSE 193
Quiz 12-2 A student entering college may ultimately become a physician (category 1), an actuary (category
2), or a professor (category 3). To calculate the probabilities of these three careers, a generalized logit model is
constructed with intercepts only. Category 3 is the base category and f30 = 1.3 for category 1,0.8 for category 2.
Calculate the probability that the student becomes a professor, given that she didn't become a physician.
The multinomial logit model generalizes the generalized logit model. In the generalized logit model, f3 varies '11
by alternative for y but the explanatory variables don't vary. In the multinomial logit model, f3 does not vary by
alternative but the explanatory variables do.
In the logit models, the relative odds of alternative 2 versus alternative 1, Pr(yi = 2)/Pr(yi = 1), equals
Pr(yi = 2) erii2
Pr(yi = 1) —
This ratio does not vary by qij for j 1,2. This property is known as independence of irrelevant alternatives. In some 11-:
cases this property is not desirable.
12.3 Ordinal response

Ordinal response variables have several categories in a logical order. For example, the underwriting category of an •41
insured: preferred, standard, substandard, or the size of a car: compact, mid-sized, full-sized are ordinal variables.
Ordinal variables can be thought of as categories of a latent continuous random variable with cut points. For
example, the latent variable underlying underwriting category measures exactly how likely the insured is to submit
a claim. If the probability of submitting a claim is below a certain cut point, say 0.1, the category is preferred; if it is
above that cut point but below a higher one, say 0.4, the category is standard, and if it is above 0.4, the category is
substandard.
Cumulative logit and proportional odds models
One way to model an ordinal response variable is the cumulative logit model. This model estimates the cumulative s:
odds of each category. The cumulative odds of category in are
Pr(Y < m) = E17,1 ni TEi + • • • + TZIn
1 — Pr(Y in) 1— ni + • • • + nc
Thus
" rtm
In PimXi
7Im+1+ - - - + 7Ic
i=0
No equation is needed or possible for m c, since the cumulative probability of category c is 1 and thus the
cumulative odds are infinite. Therefore, such a model would have (k 1)(c —1) parameters. In the proportional odds
model, all parameters except for the intercept are assumed to be the same for all categories. In other words
771 + • • • + 17m
In —
PiXi
nm+1 + • • • + ne
1=1
does not vary by category in for i > 0; the categories differ only through km. The reason this model is called
the proportional odds model is that the cumulative odds of Y < m are
(Th E7='1 ni = eflo,u+V„ p

1 — V-11 ni
194 /2. GENERALIZED LINEAR MODEL: CATEGORICAL RESPONSE
If we fix the category but consider two sets of values of the variables Xi, say Xil. and X12, the relative odds of Y
given each of the two sets of values are
dow+V., ii
efiwn+VLI pixi2
This quotient is independent of the category in; the letter in cancels out in the quotient. The level of the odds
depends on the category, but the ratio of the cumulative odds for different inputs does not vary by category.
EXAMPLE 12D Type of accident (1=property damage, 2=bodily injury, 3=fatality) is modeled using a cumulative
proportional odds model. The explanatory variable is type of vehicle. You are given the following based on the
model:
(i) The probability of property damage with a car is 0.30.

(ii) The probability of property damage or bodily injury with a car is 0.40.
(iii) The probability of property damage with a truck is 0.45.
Determine the probability of property damage or bodily injury with a truck. •
SOLUTION: For a car, the odds of property damage is 0.3/0.7 = 3/7 and the odds of property damage or bodily
injury is 0.4/0.6 = 4/6. For a truck, the odds of property damage is 0.45/0.55 = 9/11. The relative odds of property
4/6 14
damage or bodily injury to property damage for a truck are the same as for a car, 7 = 7 . So the odds of property
damage or bodily injury is
9 (14) 14 _
Ti 9 11
Let's translate this into a probability it.

it 14
1 — it 11
77
14/11 0.56
1 -1- 14/11
Odds are not additive. Suppose events 1 and 2 are mutually exclusive. If the odds of event 1 are oi and the odds
of event 2 are 02, it does not follow that the odds of event 1 or 2 are 01 + 02; in fact, unless one of the odds is 0, you
can be sure the odds are not 01 + 02. So in a cumulative proportional odds model, if you have the cumulative odds of
event 1 and the cumulative odds of event 2 (which means the odds of events 1 or 2), you cannot calculate the odds
of event 2 by subtracting the former from the latter. You must convert them to probabilities. We saw how to convert
odds to probabilities in the previous example. In general, to convert odds to probabilities, since
71
o =
1—n
it follows that probabilities in terms of odds are:
Quiz 12-3 The odds of event 1 are 0.5 and the odds of events 1 or 2 are 2. The two events are mutually
exclusive.
Calculate the odds of event 2.
It is possible to use probit and complementary log-log links instead of logit for the cumulative ordinal model.

Table 12.1: Summary of Formulas in this Lesson
Links for binomial response variables
Link Inverse of link
rc ell
Logit i=11-1 rz -
7 1 + el
Probit 77 _ 0-1(n) TI = cD(11)
Complementary log-log r! = ln(-1n(1 - it)) it = 1 - exp(- exp(ri))
Logistic model for nominal response variable
In =
EpijXi j=1,2,...,c -1
i=o
1
nc =
1 + E=1 eqi
eqi
n = j 1,2,3,...,c -1
1
Cumulative logit model for ordinal response variable
In
Tri÷i + • • • + ne 1=0
If 13 does not vary by category except for intercept, then the model is the proportional odds model:
rti + • • • +
In Poi +
hTc
Ti'
=1.
eP°0-4113'xi
1 - EL ni
Exercises
Binary response
12.1. %. [SRM Sample Question #42] Determine which of the following statements is NOT true about the linear
probability, logistic, and probit regression models for binary dependent variables.
(A) The three major drawbacks of the linear probability model are poor fitted values, heteroscedasticity, and
meaningless residual analysis.
(B) The logistic and probit regression models aim to circumvent the drawbacks of linear probability models.
(C) The logit function is given by 71(z) = ez /(1 + ez).
(D) The probit function is given by m(z) = (1)(z), where (1) is the standard normal distribution function.
(E) The logit and probit functions are substantially different.
l
f
Exam SRM Study Manual Exercises continue on the next page.

Copyright 02022 ASM
12.2. 41-1° [SRM Sample Question #14] From an investigation of the residuals of fitting a linear regression by
ordinary least squares it is clear that the spread of the residuals increases as the predicted values increase. Observed
values of the dependent variable range from 0 to 100.
Determine which of the following statements is/are true with regard to transforming the dependent variable to
make the variance of the residuals more constant.
I. Taking the logarithm of one plus the value of the dependent variable may make the variance of the residuals
more constant.
II. A square root transformation may make the variance of the residuals more constant.
IlL A logit transformation may make the variance of the residuals more constant.
(A) None of I, II, or III is true
(B) I and II only
(C) I and III only
(D) II and III only
12.3. tr [SRM Sample Question #7] Determine which of the following pairs of distribution and link function is
the most appropriate to model if a person is hospitalized or not.
(A) Normal distribution, identity link function
(B) Normal distribution, logit link function
(C) Binomial distribution, linear link function
(D) Binomial distribution, logit link function
(E) It cannot be determined from the information given.
12.4. [SRM Sample Question #20] An analyst is modeling the probability of a certain phenomenon occurring.
The analyst has observed that the simple linear model currently in use results in predicted values less than zero and
greater than one.
Determine which of the following is the most appropriate way to address this issue.
(A) Limit the data to observations that are expected to result in predicted values between 0 and 1.
(B) Consider predicted values below 0 as 0 and values above 1 as 1.
(C) Use a logit function to transform the linear model into only predicting values between 0 and 1.
(ID) Use the canonical link function for the Poisson distribution to transform the linear model into only predicting
values between 0 and 1.
(E) None of the above.

Copyright 02022 ASM
12.5. 4: [S-S16:30] You are given the following information for a fitted GLM:
Response variable Occurrence of Accidents
Response distribution Binomial
Link Logit
Parameter df if se
Intercept 1 —2.358 0.048
Area 2
Suburban 0 0.000
Urban 1 0.905 0.062
Rural 1 —1.129 0.151
Calculate the modeled probability of an Urban driver having an accident.

(A) Less than 0.01
(E) At least 0.20
12.6. al' For a binary response, there is an underlying linear model. The variable in the underlying linear model
has the following density function:
e-lx1
f (x) = 2
-00 <X <00
Determine the link function g(it) based on the threshold interpretation of the model.


You are given the following output from a GLM to estimate the probability of a claim:
• Distribution selected is Binomial
Parameter
Intercept —1.485
Vehicle Body
Coupe —0.881
Roadster —1.047
Sedan —1.175
Station wagon —1.083
Truck —1.118
Utility —1.330
Driver's Gender
Male —0.025
Area
0.094
0.037
—0.101
12.7. *41 (S-F15:33] Link selected is Logit

Calculate the estimated probability of a claim for:
• Driver Gender: Female
• Vehicle Body: Sedan

• Area: D
(A) Less than 0.045

(E) At least 0.060
12.8. Nr Link selected is Probit.

Calculate the estimated probability of a claim for
• Driver Gender: Male
• Vehicle Body: Roadster

• Area: A

n12.9. •Ir [MAS-I-F18:261 You are given the following output from a model constructed to predict the probability
that a Homeowner's policy will retain into the next policy term:
Response variable retention

Response distribution binomial
Link probit
Pseudo R2 0.6521
Parameter df
Intercept 1 0.4270
Tenure
<5 years 0 0.0000
5 years 1 0.1320
Prior Rate Change

<0% 1 0.0160
[0%, 10%] 0 0.0000
> 10% 1 —0.0920
Amount of Insurance (000's) 1 0.0015
Let ft be the modeled probability that a policy with 4 years of tenure that experienced a +12% prior rate change
and has 225,000 in amount of insurance will be retained into the next policy term.
Calculate the value of ft.
(A) Less than 0.60

(E) At least 0.90

Copyright 02422 ASM
12.10. •-: [S-S16:33] You are given the following information for a GLM of customer retention:
Response variable Retention
Link Logit
Parameter df 13
Intercept 1 1.530
Number of Drivers 1
1 0 0.000
>1 1 0.735
Last Rate Change 2

<0% 0 0.000
0%-10% 1 —0.031
> 10% 1 —0.372
Calculate the probability of retention for a policy with 3 drivers and a prior rate change of 5%.
(A) Less than 0.850
(E) At least 0.910
12.11.
•kir [MAS-I-F18:28] In a study 100 subjects were asked to choose one of three election candidates (A, B, or N..._.,
C). The subjects were organized into four age categories: (18-30, 31-45, 45-61, 61+).
A logistic regression was fitted to the subject responses to predict their preferred candidate, with age group (18-
30) and Candidate A as the reference categories.
For age group (18-30), the log-odds for preference of Candidate B and Candidate C were —0.535 and —1.489
respectively.
Calculate the modeled probability of someone from age group (18-30) preferring Candidate B.
(A) Less than 20%
(B) At least 20%, but less than 40%
(C) At least 40%, but less than 60%
(D) At least 60%, but less than 80%
(E) At least 80%

Copyright 02022 ASM
12.12.
`41' [MAS-I-819:27] A statistician uses a logistic model to predict the probability of success, it, of a binomial
random variable.

• There is one predictor variable, X, and an intercept in the model.
• The estimates of it at x 4 and 6 are 0.88877 and 0.96562, respectively.
Calculate the estimated intercept coefficient, bo, in the logistic model that produced the above probability
estimates.
(A) Less than —1

(E) At least 2
12.13. '146 [MAS-1-F19:25] A bank uses a logistic model to estimate the probability of clients defaulting on a loan,
and it comes up with the following parameter estimates:
Variable Pi
0 Intercept —1.6790
1 Income(in 000's) —0.0294
2 Student [Yes] —0.3870
3 Number of credit cards 0.7710
The following four clients applied for loans from the bank:
Client Income Student # of credit cards
1 25,000 Y 1
2 10,000 Y 3
3 20,000 N 0
4 75,000 N 3
The bank will reject any loan if the probability of default is greater than 10%.
Calculate the number of clients whose loan requests are rejected.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
12.14. s: A binary response is modeled with a generalized linear model and a logit link. The fitted model is
g (n) 3.56 + 0.42xi
Determine the odds of the event occurring if x1 = 5.

Copyright 02022 ASM
12.15. 'Lir [MAS-I-S18:27] You are given the following information about an insurance policy:
(i) The probability of a policy renewal, p(X), follows a logistic model with an intercept and one explanatory
variable.
(ii) Po = 5
(iii) 131 = —0.65
Calculate the odds of renewal at x = 5.
(A) Less than 2

(E) At least 8
12.16. [S-S16:291 You are given the following information for a fitted GLM:
Response variable Occurrence of Accidents

Link Logit
Parameter df
Intercept 1
Driver's Age 2
1 1 0.288
2 1 0.064
3 0 0
Area 2
A 1 —0.036
1 0.053
0 0
Vehicle Body 2
Bus 1 1.136
Other 1 —0.371
Sedan 0 0
The probability of a driver in age group 2, from area C and with vehicle body type Other, having an accident is
0.22.
Calculate the odds ratio of the driver in age group 3, from area C and with vehicle body type Sedan having an
accident.
(A) Less than 0.200

(E) At least 0.350

Copyright 02022 ASM
12.17. NI' You are given the following information about an insurance policy:
(i) The probability of a policy renewal is modeled using a generalized linear model with a probit link.
(ii) The model has an intercept and one explanatory variable.
(iii) Po = —3
(iv) 131 = 0.8
Calculate the odds of renewal at x = 2.
12.18. •41 [MAS-I-S18:28] You are given the following information:

(1) A statistician uses two models to predict the probability of success, TE, of a binomial random variable.
(ii) One of the models uses a logistic link function and the other uses a probit link function.
(iii) There is one predictor variable, X, and an intercept in each model.
(iv) Both models happen to produce the same coefficient estimates,ijo = 0.02 and P./ = 0.3.
(v) You are interested in the predicted probabilities of success at x = 4.
Calculate the absolute difference in the predicted values from the two models at x = 4.
(A) Less than 0.1
(E) At least 0.4
12.19. ev? You are modeling the probability that a policy will be renewed. You use a generalized linear model and
a logit link. The output of the mode! is:
Intercept 0.22
Number of years policy is in force 0.15
Number of claims submitted —0.25
Age of policyholder
Under 30 —0.30
30-49 0
50 and up 0.18
Calculate the predicted probability that a 55-year old policyholder for whom the policy was in force for 5 years
and who has not submitted any claims will renew the policy.
12.20. •41 A binary response is modeled with a generalized linear model and a logit link. The fitted model is
g (T) = 0.24 + 0.57xi
Calculate the amount by which the probability of the event increases if X1 is increased from 2 to 3.

Copyright 02022 ASM
12.21. "41i A binary response is modeled with a generalized linear model and a probit link. The fitted model is
g(n) = 0.2 + 0.4x1 + 0.6x2
Determine the probability of an event when xi = 1 and x2 = 2.
12.22. A binary response is modeled with a generalized linear model and a complementary log-log link. The
fitted model is
g(n) = 0.02 + 0.04x1 + 0.06x2
Determine the odds of an event when xi = 1 and x2 = 2.
12.23. A binary response is modeled with a generalized linear model and a complementary log-log link:
g(n) =Pa+ jxj
The fitted value of Po is —1.5. When xi = 1, the fitted value of n is 0.375.

Determine pi..
Nominal response
12.24. '41 A nominal logit model is used to predict the political party of a person. The base category is "Demo-
cratic". The fitted model is
Party Republican Liberal Conservative

Intercept —0.33 —0.44 —0.25
Income Level
Under 50,000 0 0 0
50,000-99,999 0.51 0.35 0.45
100,000 or more 0.37 0.52 0.22
Highest School Completed

High school or less 0 0 0
College 0.65 0.60 0.55
Graduate 0.61 0.63 0.48
Calculate the odds, relative to the base category, that a person who completed college but not graduate school
and has an income of 100,000 or more is Republican.
(--, Use the following information for questions 12.25 and 12.26:
A nominal logistic model is used to predict the type of car one buys. The base category of car is "sedan".
The other categories are "van" and "SUV". The explanatory variables are "gender" and "age group". The fitted
coefficients are:
Type of Car Van SUV

Intercept 0.10 —0.02
Gender
Male 0 0
Female —0.18 —0.06
Age group
Under 25 —0.11 0.18
25-54 0 0
55 and up 0.06 0.04
12.25. 441 Calculate the odds ratio of females for vans.
12.26. ..11 Calculate the probability that a male age 20 buys an SUV.
12.27.
•-it For a nominal response variable with 3 categories, a logistic model with 1 explanatory variable and an
intercept is fitted:
g(n1) Poi +
l
f The estimated parameters for the two non-base categories are
Category (j) b0 b11

1 0.613 0.027
2 0.307 0.135
Calculate the probability of the base category (category 3) if xi = 2.
Exam SRM Study Manual Exercises continue on the next page. . .

Copyright 02022 ASM
12.28. Nr Vacations are categorized as follows:

1. Active
2. Beach
3. Visiting family
Vacation category is modeled as a function of two explanatory variables: number of family members and length
of vacation. A nominal logit model is constructed. Category 3 is the reference category. The following are the fitted
coefficients:
Category Active Beach
Intercept —0.15 —0.20
Number of family members

1 or 2 0 0
3 or 4 —0.05 —0.14
5 or more —0.22 —0.18
Vacation length
1 week or less —0.23 0.04
8 days-2 weeks 0.13 0.08
More than 2 weeks 0.32 —0.05
Calculate the predicted probability that a 2-week vacation for a family of 4 is active.
12.29. A nominal generalized logit model is used to model income level as a function of education. Income
level of 50,000-99,999 is the reference category. The fitted model is:
Under 50,000 100,000 and over
Intercept —0.10 0.04
Highest school completed

High school or less 0.83 —0.62
College 0 0
Graduate —0.37 0.41
Calculate the probability that a person who completed graduate school earns 100,000 or more.

n Ordinal response
12.30. al [MAS-II-F19:25] A book of 122 commerical policies are observed for one year. The observed claim
counts distribution is shown below:
Claim count Policies
0 45
1 32
2 27
3 10
4 7
5 1
Calculate the log-cumulative-odds of 3 claims.

(A) Less than —2.0
(B) At least —2.0, but less than 0.0
(E) At least 4.0
12.31. s'oll An ordinal variable classifies drivers as follows:
j Category
1 Low risk
2 Medium risk
3 High risk
A cumulative logit model results in the following coefficients for the explanatory variables:
Category Low Risk Medium Risk
Intercept 1.30 2.05
Gender
Male 0 0
Female 0.75 0.80
Age
Under 25 —1.80 —0.65
25-44 0 0
45-64 1.00 1.47
65 and over 0.23 —0.12
Calculate the probability that a male driver age 65 or over is a medium risk.

Copyright CO2022 ASM

A cumulative proportional odds model models an ordinal variable with 3 possible values:
j Degree of crash
1 Non-casualty
2 Injury
3 Fatal
The fitted model is
In
Ei<j ni
— bcv + 0.056x1 + 0.082x2
— 24<i 77i
The fitted values of the intercept are boi = 0.4, b02 = 0.6.
Xi is a binary variable.
12.32. '41 Calculate the relative odds of an injury or non-casualty for someone with x1 = 1 relative to someone
with xi = 0.
12.33. 4: Calculate the probability of an injury for someone with x1 ='0, x2 = 5.
12.34. NI° In a cumulative proportional odds model for an ordinal variable, the fitted model is
In = boi bixi
1 — Vi=1 iti
and brrii =0.05] for) 1,2,3,4.
For xi = 1, the odds that the response is 1 relative to xi = 0 are 1.5.
Calculate the odds that the response is 1 or 2 for xl= 1 relative to xi = 0.
A cumulative proportional odds model is used for a 4-category ordinal response variable. There is one
explanatory variable. Therefore, the form of the model is
g(n1) = Poj Pixi

When x1 = 1, the probabilities of categories 1 and 2 are 0.627148 and 0.126841 respectively.
12.35. Determine )602 — P01
12.36. 'kr When x1 = 2, the probability of category 1 is 0.719100.

Determine Poi.

12.37. evf A survey is done among drivers to determine the importance of rear window wipers on cars. The
response variable has the following values:
j Importance
1 Not important
2 Important
3 Very important
A cumulative proportional odds model is used to predict the response. The model has an intercept and one
explanatory variable: number of inches of rain per year in the driver's region. The fitted coefficients are
Intercept (Not important) 0.14

Intercept (Important) —0.07
Inches of rain —0.01
A driver lives in a region with 40 inches of rain per year.

Calculate the probability that this driver responds "Very important".
12.38. •1 An ordinal variable classifies drivers as follows:
I Category
1 Low risk
2 Medium risk
3 High risk
A cumulative proportional odds model estimates category using several explanatory variables. One of the
explanatory variables is city.
For two insureds, values of all explanatory variables except for city are identical. The values of city, along with
the probabilities of the three categories, are as follows:
Category 1 2 3
New York 0.724 0.220 0.056
Chicago 0.698
Determine the probability of medium risk for the Chicago insured.
12.39. A cumulative probit model of the form
is used as a model for severity of accident. The categories are

1. Property damage only
2. Non-fatal bodily injury
3. Fatality
The explanatory variables are size of car and speed of car at time of accident. The fitted parameters are
Category
Intercept (category 1) 0.45
Intercept (category 2) 0.62
Car size
Compact 0
Midsize 0.35
Full size 0.71
Speed of car
Under 30 mph
30-59 mph -0.12
60 mph and over -0.64
Calculate the probability of a fatality for a driver in a full size car driving at 40 mph who has an accident.
12.40. "IP A cumulative probit model for a categorical variable with 4 categories results in the following fit:
0-1(70 = boi + 0.1xi + 0.5x2
with bol = 0.060, b02 = 0.150, and b03 = 0.370.
Calculate the probability of category 3 for someone with x1 =2, x2 = 1.5.
12.41. rThe grade a student gets on an actuarial exam (0-10) is predicted using a cumulative complementary
log-log model. Explanatory variables are hours of study (xi, a continuous variable) and use of the ASM manual (x2,
a binary variable). The base category for x2 is that the ASM manual was not used. The form of the model is
g(n1) = 130j pixi + 162X2

The fitted values are poi = -2 + 0.5), pi = -0.01, 1e2 = -0.40, where j is the grade. As usual, a grade of at least 6 is
needed to pass.
Calculate the predicted probability that someone who studies for 200 hours and uses the ASM manual passes
the exam.
Solutions
12.1. The list of statement (A) can be found at the beginning of the previous lesson. (B), (C), and (D) are found in
this lesson, although (C) and (D) are not accurate; the logit and probit functions are actually the inverses of what is
stated. Comparing Figures 12.1 and 12.2 shows that (E) is false.

12.2.
I. A logarithmic transformation may reduce variance.X

II. Square root is a power transformation that may reduce variance.V
The logit transformation can only be applied to variables that take values between 0 and 1.X
(B)
12.3. The response variable is binomial; either a person is hospitalized or not. And for a binomial response, a logit
link function is more appropriate than a linear link function since the response, the mean of the binomial variable,
should be between 0 and 1. A logit link always produces a result that is between 0 and 1, while a linear link can
produce any real number. (D)
12.4. The logit function transforms real numbers to the range [0,1], so it is the appropriate transformation for this
situation. (C)
12.5. The systematic component for an Urban driver is -2.358 + 0.905 = -1.453, so the odds are e-1-453 = 0.233868
and the probability is 0.233868/1.233868 = 0.1895 . (D)
12.6. The underlying cumulative distribution function is
ix 0.5e" du = 0.5ex x< 0
CS+ lox 0.5e-il du = 1 - 0.5e-x x > 0

We want TE to be the probability that y. > 0, so
IT = Pr(E > -q) =

11 - 0.5e-q n > o
0.50 o
At i7 = 0, TE = 0.5, so that is the borderline between the two cases of the inverted function. Inverting this function,
for TE > 0.5:
n = 1 -
2(1 - it) = e-T1

17 = - ln 2(1 - n)
and for TE <0.5
TE = 0.50
q = in 2n
We conclude
n < 0.5
goi) ={1n(2n) ln(2(1 n)] it > 0.5
12.7. For the logit link,

ln =
1 -
Here female is the base class for gender so its value is 0, whereas the other two variables have nonzero values.
In = -1.485 - 1.175 - 0.101 = -2.761

1 -
e-2.761
= 0.05947 (D)
= 1 ±e2761

Copyright 02022 ASM
212 /2. GENERALIZED LINEAR MODEL: CATEGORICAL RESPONSE

= -1.485 - 1.047 - 0.025 = -2.557
and cl)(-2.557) = 0.0052

12.9. The systematic component is ri = 0.4270 - 0.0920 + 0.0015(225) = 0.6725. The probit link is 43(0.6725), and
using the rounding convention, 43(0.67) = 0.7486. (C)
12.10. The systematic component for 3 drivers and 5% is 1.53 + 0.735 - 0.031 = 2.234. Therefore, the odds are
e2234 = 9.3371 and the probability is 9.3371/10.3371 = 0.90326 (D)
12.11. The odds for preference of Candidate B are o = e-0-535 = 0.5857, and the probability of preference for
candidate B are 0(1 + o) = 0.5857/L5857 = 0.3694. (B)
12.12. Apply the logit function to the estimated probabilities:
0.88877
(1 )
0.88877
0.96562
= 2.078238 = /4) + 4b1
ln
(1 - 0.96562) = 3.335295 = bo + 6b1
Subtracting the first from the second, b1 = (3.335295 -.2.078238)/2 = 0.628529, and therefore 1)0 = 2.078238 -
4(0.628529) = -0.43588. (B)
12.13. logit(0.1) = In = -2.197. We will calculate the systematic component for each client and reject if it is
greater than -2.197.
#1 : -1.679 - 25(0.0294) - 0.3870 + 0.7710 = -2.03 > -2.197
#2 : -1.679 - 10(0.0294) - 0.3870 + 0.7710(3) = -0.047> -2.197
#3 : -1.679 - 20(0.0294) = -2.267 < -2.197
#4 : -1.679 - 75(0.0294) + 0.7710(3) = -1.571 > -2.197
Three clients are rejected. (D)
12.14. In this model, the odds are the exponential of the systematic component, which here is e356+°'42(5) = 287.15
12.15. In a logistic model, odds are the exponential of the systematic component, so here the odds are
1.75
e54)•65(5) = e = 5.7546 (C)
12.16. The odds for the 2-C-other driver are 0.22/(1 - 0.22) = 0.282051, with logarithm -1.2657. This is equal to
x + 0.064 + 0- 0.371, so x = -0.9587. Then the odds for the 3-C-sedan driver are e-119587 = 0.3834. (E)
12.17. Probit links give you probabilities, not odds, but it's easy to convert a probability into an odds ratio. We
have
17 = -3 + 0.8(2) = -1.4
it = 4'(-1.4) = 0.0808
The odds are 0.0808/(1 - 0.0808) = 0.0879
12.18. The systematic component is 0.02 + 0.3(4) = 1.22. For the logistic model
e 1.22
77 = = 0.7721
1 + e1.22
For the probit model

77 = 0(1-22) = 0.8888
The absolute difference is 0.8888 - 0.7721 = 0.1167 (B)

Copyright 02022 ASM

17 = 0.22 + 0.15(5) + 0.18 = 1.15
Using the logit link,
e1.15
71 = 0.7595
1 -I- el-15
12.20. For xi = 2, the probability is

e 0.24+0.57(2)
= 0.79899
1 4_ e0.24+0.57(2)
For x1 = 3, the probability is
e0 24+0 57(3)
• •
= 0.87545
1 4. 0.24+0.57(3)
The increase is 0.07646
12.21.
= 11)(0.2 + 0.4(1) + 0.6(2)) = 0(1.8) = 0.9641
12.22. Just because it isn't a proportional odds model doesn't mean you can't calculate the odds.
The probability is
e119 17
77 = 1 - exp (_e0.02+0.04(1)+0.06(2)) = 1 =0.6980
The odds are 0.6980/(1 - 0.6980) = 2.3109

12.23.
TE = 1 - exp (-e00411'.11)
0.375 = 1 - exp (-e-13+131
In 0.625 =
pi = In(- ln 0.625) + 1.5 = 0.7450

12.24. The systematic component for a person with a college education and an income of 100,000 or more is
-0.33 + 0.37 + 0.65 = 0.69. That is the logarithm of the odds, so the odds are 069 = 1.9937
12.25. In the nominal logistic model, the odds ratio for a binary variable, a variable that is either present or absent,
is di, where Pi is the coefficient for the binary variable. See page 192 for a discussion of this. In this question,
pi = -0.18, so the odds ratio is e-0'18 = 0.83527
12.26. The systematic component of male age 20 is 0.10 - 0.11 = -0.01 for van and -0.02 + 0.18 = 0.16 for SUV.
Thus the probability of buying an SUV is
e0.16
0.370946
TCSLIV = 1 + e-0.01 0.16
12.27. We have, with subscript 3 referring to the base level,

111
= e0 613+0.027(2) = 1.94838
•
713
1/2
e0.307+0.135(2) = 1.78069
113
So it3 is
1
It3 0.21146
1 + 1.94838 + 1.78069

Copyright 02022 ASM
12.28. The systematic components for each category are

Active: 0.15 -0.05 + 0.13 = -0.07
Beach: -0.20 - 0.14 + 0.08 = -0.26
The probability we seek is

e-0.07
712 = 0.3449
1 ± e-0•137 e-0.26
12.29. The systematic components are

Under 50,000: -0.10 - 0.37 = -0.47
100,000 and over: 0.04 + 0.41 = 0.45
The probability we seek is

C045
0.491124
1 ± e-047 e045 =
One could argue that the response is ordinal.
12.30. The probability of 3 claims or less is (45+32+27+10)/122 = 0.934426. The cumulative odds are 0.934426/(1 -
0.934424 The log-cumulative-odds are
0.934426
ln 2.6568 (D)
1 - 0.934426
12.31. The systematic components are

Low risk: 1.30 + 0.23 = 1.53
Medium risk: 2.05 - 0.12 = 1.93
The cumulative odds of the two categories are

01 = e1'53 = 4.61818
o2 el-93 = 6.88951
The cumulative probabilities are 4.61818/5.61818 = 0.82201 and 6.88951/7.88951 = 0.87325. The probability of
medium risk is 0.87325 - 0.82201 = 0.05124
12.32. As we discussed, the relative odds are the exponential of the corresponding (3 coefficient.
e0.056 1.0576
12.33. The model is cumulative, so we'll have to calculate the odds of non-casualty and the odds of non-casualty
or injury, translate odds into probabilities, and then take the difference. The odds of non-casualty are
eo.4+0.o82(5) =2.2479
The probability is 2.2479/3.2479 = 0.692110.

The odds of non-casualty or injury are
e0.6+0.082(5) =2.7456
The probability is 2.7456/3.7456 = 0.733020.

The difference is 0.04091
12.34. Since it is a cumulative proportional odds model, the relative odds of response 1 to response 0 for cumulative
category 2 (which is categories 1 and 2 combined) equals the relative odds of responsel to response 0 for cumulative
category 1 (which is category 1 alone), and we are given that the relative odds of response 1 to response 0 for
category 1 is 1.5

Copyright 02022 ASM
12.35. The cumulative probability of category 2 is 0.627148 + 0.126841 = 0.753989. The cumulative odds of
categories 1 and 2 are 0.627148/(1 - 0.627148) = 1.68203 and 0.753989/(1 - 0.753989) = 3.06486 respectively. The
logged odds of categories 1 and 2 are 0.52 and 1.12 respectively. From the form of the model,
Poi + pi = 0.52
po2 + p1 = 1.12
Thus pO2 I301 = 1.12 -0.52 = 0.6
12.36. The logged odds of category 1 are In (0.719100/(1 - 0.719100)) = 0.94. From the form of the model,
Poi +43i =0.94
From the previous exercise, we know that poi + pi = 0.52. Thus pi = 0.42 and j301 =171)71 1 .
12.37. We need the cumulative probability of categories 1 and 2, and then the probability of category 3 will be the
complement of that cumulative probability.
The linear component is -0.07- 0.01(40) = -0.47. The cumulative odds of category 2 are e-(147 = 0.625002. The
cumulative probability of category 2 is 0.625002/1.625002 = 0.384616. The probability of category 3 is 1 - 0.384616 =
0.615384
12.38. We use the fact that the relative cumulative odds are the same for categories 1 and 2. For category 1, the
odds for New York are 0.724/(1 - 0.724) = 2.62319. For Chicago, they are 0.698/(1 - 0.698) = 2.31126. The relative
odds of Chicago to New York are 2.31125/2.62319. Therefore the same proportion applies to the sum of the first two
categories. For New York, the odds of the first two categories are 0.944/(1 - 0.944) = 16.85714, so for Chicago the
odds of the first two categories are 16.85714(2.31125/2.62319) = 14.85262. The probability of the first two categories
for Chicago is 14.85262/15.85262 = 0.93692. The probability of medium risk for Chicago is 0.93692 - 0.698 = 0.239
12.39. For category 2, the systematic component is 0.62 + 0.71 - 0.12 = 1.21. Thus the cumulative probability of
category 2 is (1)(1.21) = 0.8869. The probability of a fatality, category 3, is 1 - 0.8869 = 0.1131
12.40. The cumulative probability of 2 or lower is
0(0.150 + 0.1(2) + 0.5(1.5)) = 0(1.10) = 0.8643
and the cumulative probability of 3 or lower is
0(0.370 + 0.1(2) + 0.5(1.5)) = (I)(1.32) = 0.9066
The probability of exact category 3 is the difference of the two cumulative probabilities, or 0.0423
12.41. We'll calculate the probability of 5 or less; the probability of passing is the complement. The linear component
is -2 -4- 0.5(5) - 0.01(200) - 0.4 = -1.9. Using the inverse of the complementary log-log link,
n = 1 - exp(-e-1.9) = 0.138921
The probability of passing is 1 - 0.138921 = 0.861079
Quiz Solutions
12-1. The logistic model gives the logarithm of the relative odds as 0.8 - 1.3 = -0.5, so the relative odds are
e-0.5 0.6065
12-2. The model for the conditional probability of becoming a professor is logistic. pp for this model is the excess
of 130 for professor over the po for actuary, or 0 - 0.8 = -0.8, since for the base category professor po = 0. Using
equation (12.4), the requested probability is
e-0.8
0.3100
+

12-3. The probability of event 1 is 0.5/1.5 = 1/3. The probabilit events 1 or 2 is 2/3. Therefore, the probability
of event 2 is 1/3 and the odds of event 2 are (1/3)/(1 — 1/3) = 0.51.


Copyright 02022 ASM
000
1• STUDY for future Study Breaks?
BREAK #asmstudybreak
Inspiration for when you need a little push and some fuel to help you keep • am
\
•••••=,
"If you want to succeed you

should strike out on new
paths, rather than travel the
worn paths of accepted
success."
John D. Rockefeller
atstm
,)
Lesson 13
Generalized Linear Model: Count

Response
Reading: Regression Modeling with Actuarial and Financial Applications 12
13.1 Poisson response

A count response variable, such as number of claims per year for a single policyholder, is sometimes assumed to
have a Poisson distribution. For a model with only an intercept, the maximum likelihood estimate of the mean is al
the sample mean, which we'll now derive. The likelihood of each observation is CAAni //M. The loglikelihood of all
the observations is
1(A) = -nA + (In A) E ni - In ni!

dl
ni =0
dA A
n
A
When Poisson regression is performed, usually the canonical link function, g(y) = In y, is used. Then if xi •41
assumes two values, x11 and .7c21 and all other predictors stay the same, then
E[yi]= efijx1J+•••
E[y2I = efijx21+...
E[y2] = epi(x2)-xii)
ELM
We see that ePj can be interpreted as the proportional change in E[iii] per unit change in xii.
We will sometimes use the notation y i for9,the fitted value in the GLM model.
Sometimes observations may need to be weighted for exposure. For example, each observation of y may be
total claims submitted by a group with n members, and you may want to assume that the number of claims per
member is Poisson with mean A. Or you may have number of claims per policyholder, but not every policyholder
had a policy for a full year; some may have had it for only a couple of months, just having exposure less than 1. To
incorporate exposure into the model, let Ei be the exposure of observation L The model form is then:
ln = ln .E; Epixii
i=0
The additional term In Ei, which is known in advance (unlikefi,which has to be estimated), is called the offset. •41
13.2 Overdispersion and negative binomial models

The drawback of a Poisson model is that the variance must equal the mean. Often the variance is greater than the
mean; this is called overdispersion. We mentioned this in Section 11.4. One way of getting around this is to arbitrarily *1

220 13. GENERALIZED LINEAR MODEL: COUNT RESPONSE
specify that the variance is a constant multiple of the mean:1

Var(yi) = Opi
Even though the distribution of yi is not a real probability distribution, the estimates of pi are still consistent.
Typically is estimated by
(yi pi)2
cl" = n — (k +1) fri.J jui
(13.1)
The rationale of this formula is that the sum in this formula is assumed to have a chi-square distribution with n—(k+1)
degrees of freedom, and the mean of that chi-square distribution is n — (k + 1). Compare this to equation (11.4);
here, the Poisson variance equals the mean, each observation is considered as an independent cell, and k +1 ps are
estimated.
The overdispersion parameter affects the standard error calculations of the Pis, which get divided by it.
The drawback of using cp is that since no probability distribution corresponds to the model, one cannot estimate
probabilities of having specific numbers of claims; only moments can be estimated.
An alternative to this method is to use a probability distribution whose variance is greater than its mean. The
'I negative binomial distribution, a member of the linear exponential family, fits the bill. It has the following probability
mass function:
j+r—1
Pr(y = j) = ( r—1 )pr(1 — p)i j = 0, 1,2, ...
with E[Y] r(1 — p)/p and Var(Y) = r(1 — p)/p2. We setp = r(1 — p)/p and Var(Y) = goy where 1/p. The Poisson
distribution is the limit of the negative binomial as p 1 and rp —* A. Typically the log link is used even though it
is not the canonical link for this distribution.
13.3 Other count models
Four other count distributions are popular. The first two provide flexibility for the probability of 0.
13.3.1 Zero-inflated models

'11 in these models, the distribution is a two-component mixture distribution of the constant 0 and the count distribution2
h(y). The weight on the constant 0 is n and the weight on the count distribution (the count distribution may have a
nonzero probability of 0) is 1 — 7/, so the total probability of 0 is greater than 77 unless h(0) = 0:
„ + (1 — n)h(0) j = 0
Pr(y= P = (13.2)
701(j) 1>0
For example, suppose the count distribution is Poisson with mean 0.2 and the 0 component has weight 0.1. Then
Pr(y = 0) = 0.1 + 0.9e-0-2
Pr(y = 1) = 0.9(0.2)e-12
0 22
Pr(y = 2) = 0.9
H 2
e-°.2
ni is a function of the predictors xi, and is estimated using a binary model such as logit.
II am following the textbook's notation of calling it (if) here even though it was called 472 in Section 11.4, but I don't understand why a2 is
necessary. For 2-parameter distributions, there is no need for 1
4
2 since the second parameter cp can already be adjusted to account for variance,
while for 1-parameter distributions there is no second parameter cp, so op can be used as the overdispersion parameter.
2The textbook uses g(y) for this, but this is confusing since we use g(y) for the link function.

Copyright 02022 ASM
13.3. OTHER COUNT MODELS 221
For a Poisson distribution with mean pi and probability of the zero component rz, the double expectation formula
gives:
E[y] = (1 — ni)pi (13.3)
and the variance of yi is computed by the conditional variance formula (with I indicating the component of the
mixture):
Var(yi) = E[Var(yi I I)] + Var(E[yi I I]) = (1— Tri)pi + pN(1 — ni) (13.4)
We see that the variance has an extra term that the mean doesn't, so the variance is greater than the mean.
13.3.2 Hurdle models
The rationale for hurdle models is that the response is a result of a two-step process. The first step is the hurdle; the Ikt
decision to make the count greater than 0 is the first decision. The second step is actually determining the non-zero
count. For example, a patient decides whether to go to the hospital, and in the second step the hospital decides how
many days the patient is there.
In these models, there is a specified probability of 0 and a specified count distribution h (y). Given that the count
is not 0, its distribution is the count distribution truncated at 0. In other words,3
(13.5)
where k =
.1-h(o) • Those who took STAM will recognize this distribution as a zero-modified (a, b, 1) distribution.
For example, suppose the count distribution is Poisson with mean 0.2, and n = 0.1. Then k = 0.9/(1 — e—(12) and
Pr(y = 0) = 0.1
Pr(y = 1) =
pr(y 2) =
0 22)
Probabilities of nonzero integers are multiplied by k, so first and second moments are multiplied by k. If h(y) is a
Poisson distribution with mean pi, it follows that
E[y] = kpi (13.6)

Var(yi) = E[14] E[y1]2
= (kpi + kp) — k214
= kp1 + k(1— k)14
= E[y1] + k(1 — k)1.4 (13.7)
k may be greater or less than 1. When k < 1, then 1 — k> 0 and the second term in positive, so variance is greater
than mean. When k> 1, then 1 — k <0 and variance is less than mean. So this model can handle underdispersion
as well as overdispersion.
13.3.3 Heterogeneity models

In a heterogeneity model, some model parameters may vary randomly. Assume that we use a log link, and there is Nr
one parameter ai that varies randomly. Then
In yi I = ai
3Even though we usually use k for number of variables, the textbook uses k here for the scaling constant. I don't think you will confuse this
non-integral k with the other k, so I followed the textbook.

Copyright 02022 ASM
or
yi I ai = exp(ai + Pixi)
The distribution of the random variable a• is allowed to vary by observation i. Adding a constant to ai leads to an
equivalent model by adjusting the intercept, so we specify E[e] = 1 to make the intercept unique. Let pi =
Assuming yiIai has a Poisson distribution, so that its variance equals its mean, then the moments of yi are
E[y1] = E[E[yi I ad]= pi Ejeall = (13.8)
Var(yi) = E[Var(E[yi I as])] + Var(E[yi I ail)
= E[e'l+q3]+ Var(eai*O)
=
+ 1.4 Var(e) (13.9)
The variance is greater than the mean. We model overdispersion by selecting a distribution for ai that results in the
desired ratio of variance to mean.
The textbook discusses two distributions that may be used for ai. The first distribution for a is a gamma
distribution. This means that eal follows a loggamma distribution. Multiplying a gamma random variable by a
constant results in a gamma distribution. So e'''''113 , the mean of the conditional Poisson distribution for yi, has
a gamma distribution as well. And as you learned in Exam STAM4, a gamma mixture of Poisson distributions is
negative binomial. So by using a gamma distribution for eai, yi will have a negative binomial distribution for all
observations having the same distribution for ai.
The second distribution for a is normal. Then e", follows a lognormal distribution. Normal distributions are
used very frequently when we're not sure what distribution is best. Claim count probability does not have a closed
form expression if a lognormal distribution is used, but can be approximated by computer algorithms.
13.3.4 Latent models
In a latent model, we assume that there is some unobservable discrete random variable that affects yi. For example,
this random variable may have two values: "low risk" and "high risk". Thus yi is modeled as a discrete mixture,
which for conditionally Poisson yi results in variance higher than mean. While this model is intuitively appealing,
it has computational issues; mixtures may have more than one maximum, complicating maximum likelihood
estimation, and convergence is slow.
Exercises
13.1. al? You are given the following information for a fitted GLM:
Response variable Number of cars
Response distribution Poisson
Link log
Parameter df #
Intercept 1 0.186
In(Income) 1 0.009
Family size 2
1 or 2 0 0.000
3 or 4 1 0.137
5 or more 1 0.355
Calculate the variance of the number of cars for a family of size 4 with 150,000 income.
4If you didn't learn this, either prove it yourself or take it on faith.

Copyright 02022 ASM
Overdispersion estimate
1 (yi
43 =
(13.1)
i=1
ri
Zero-inflated models
n + (1 — n)h(0) j=0
Pr(Y
;\
{(1 — 77 )11(j)
E[Yi] =(1— ni)pi
j>0
(13.2)
(13.3)
Var(p) = (1 — ni)pi + /47(1 — ni) (13.4)
Mean and variance formulas assume that h(j) is the probability mass function of a Poisson with mean pi.
Hurdle models
j 0
E[y] =
{Ich(j) j > 0 (13.5)
(13.6)
Var(yi) = k + k(1 — (13.7)
1 —
where k = Mean and variance formulas assume a base Poisson distribution before the hurdle.
1 —
h(0)
Heterogeneity models
E[N] = (13.8)
Var(yi) = +p Var(ei) (13.9)
These mean and variance formulas assume If;Iai follows a Poisson distribution.
13.2. sil° You are given the following information for a fitted GLM:
Response variable Claim count
Link log
Parameter df
Intercept 1 -1.512
Rating class 1
Standard 0 0.000
Preferred 1 —0.301
Miles driven (thousands) 1 0.020
Calculate the probability that a preferred driver who drives 10,000 miles submits at least 1 claim.


A Poisson regression model with log link is used to estimate the number of diabetes deaths. The parameter
estimates for the model are:
Response variable Number of Diabetes Deaths

Link Log
Parameter df p-value
Intercept 1 —15.000 <0.0001
Gender: Female 1 —1.200 <0.0001

Gender: Male 0 0.000
Age 1 0.150 <0.0001

Age2 1 0.004 <0.0001
Age x Gender: Female 1 0.012 <0.0001

Age x Gender: Male 0 0.000
13.3. [S-S16:411 Calculate the expected number of deaths for a population of 100,000 females age 25.
(A) Less than 3
(E) At least 9
13.4. "-ir For a 60 year old, calculate the ratio of expected number of diabetes deaths for males over expected
number of diabetes deaths for females.

13.5. %. [S-S17:371 Let Y1, , Y be independent Poisson random variables, each with respective mean pi for
= 1, 2, ... ,n, where:
a, for i = 1,2, ... ,rn
In pi =
{ p, for i = m + 1, m + 2, ... , n
The claims experience for a portfolio of insurance policies with in = 50 and n = 100 is:
50
yi = 563
100
E yi
1=51
1,261
Denote by a and p the maximum likelihood estimates of a and p respectively.

a
Calculate the ratio 7.
(A) Less than 0.40

(13) At least 0.40, but less than 0.60
(E) At least 1.00
13.6. •46 [S-S17:30] You are given the following information for a fitted GLM:
Response variable
Link log
AIC 221.254
Parameter s.e.(s)
Intercept 5.421 0.228
Gender
Male 0.000 0.000
Female —0.557 0.217
Age 0.107 0.002
Calculate the predicted value of Y for a Female with an Age value of 22.
(A) Less than 1,000
(E) At least 2,500

226 13, GENERALIZED LINEAR MODEL: COUNT RESPONSE
13.7. a: You are given the following data for 5 groups, each of which is covered by workers compensation policy.
Group Number of Claim Hazard
Name Employees Count Class
A 14 5
29 3 1
16 7 3
24 6 2
42 8 1
You are building a generalized linear model in which hazard class will be an explanatory variable and the
response will be claims per employee. You will use a log link.
Determine the offset for the first group.
13.8. 1 A Poisson regression with a log link has the form:

In = Po + pixi
There are 4 observations. Actual and fitted values are
yi 1 2 4 7
Pi 1.0699 2.3008 3.3739 7.2553
Calculate the estimated overdispersion parameter 4).
13.9. & You are given the following output from a generalized linear model:
Response variable Claim frequency

Link log
Parameter df se
Intercept 1 2.50300 0.07500

Miles driven 1 0.00005 0.00001
Gender: male 0 0.00000
Gender: female 1 —0.07500 0.04200
Based on this model, if a driver drives an additional 1000 miles, what is the percentage increase of claim
frequency?
13.10.
For a zero-inflated Poisson regression model with an intercept only:
(i) The log link is used
(ii) The intercept isJo= —0.3
(iii) The probability of 0 is 0.8.
Calculate the probability of 1.

EXERCISES FOR LESSON 1.3 227
n13.11. For a zero-inflated Poisson regression model:

Response variable
Link log
Parameter df
Gender
Male 0 0.000
Female 1 —0.200
A zero-inflated Poisson distribution is a mixture distribution. The component of the mixture that is the constant 0
has a weight of 0.4.
Calculate the variance of Y for a female.
13.12. 'I For a hurdle model, the base distribution of the response is Poisson and a log link is selected. The
model has two variables, xi and x2, and an intercept. The estimated parameters are bo = 1.5, bi = 0.4, b2 = 0.2. The
probability of 0 is 0.25.
Calculate the fitted probability of 2 when xi = 0.5 and x2 = 0.7.
13.13. You are given the following information for a fitted GLM:
Response variable Claim count

Link log
Parameter df
Gender
Female 0 0.000
Male 1 0.408
Rating class 1
Standard 0 0.000
Preferred 1 —0.677
The model is a hurdle model. For a preferred male driver, the probability of 0 claims is 0.8.
Calculate the variance of the claim count from a preferred male driver.
13.14. svr A hurdle model is used. The base distribution is Poisson and the log link is selected.
For observation 1, the mean is 0.250. You wish to set the variance equal to 1.1 times the mean.
Determine the probability of 0.
13.15. In a heterogeneity model for y, the distribution of yi I ai is Poisson and a log link is used. The fitted
value of an observation yi is pi = 1.35. The random component a; follows a gamma distribution with mean 0.4 and
variance 0.3.
Calculate the variance of yi.

Solutions
13.1. For a Poisson distribution, variance equals mean.

The systematic component is
0.186 + 0.009 In 150,000 + 0.137 = 0.430266
The mean and variance are e0430266 1.5377

-1.512 - 0.301 + 0.2 = -1.613
The mean of Y is e-"13 = 0.199289. The probability of 0 claims is C0199289 = 0.819313. The probability of at least 1
claim is 1 - 0.819313 = 0.1807
13.3. The systematic component for F25 is

-15.000 - 1.200+ 0.15(25) + 0.004(252) + 0.012(25) = -9.65
With the log link, this gives e-9'65 = 0.000064426. For 100,000 females, multiply this by 100,000 to obtain 6.4426
(C)
13.4. The fi for females is -1.200, and the for age times gender is 0.012, which is multiplied by 60 for 60-year olds.
Thus the systematic component for males is the systematic component for females minus -1.200 + 0.012(60) = -0.48,
and the ratio we want is e048 1.6161
563
13.5. The maximum likelihood estimate of a Poisson parameter is the sample mean. Thus pi = 50
= 11.26 for
1 < i < 50 and pi = 1261= 25.22 for 51 < i < 100. Then
ln 11.26
0.7502 (C)
p 1n25.22 =
13.6. The inverse of the link is exponentiation, so we exponentiate the systematic component.
exp (5.421 - 0.557 + 0.107(22)) = 1363.76 (13)
13.7. As discussed in the lesson, the offset is the logarithm of exposure, in 14 = 2.639
13.8. There are n = 4 observations and k = 1 parameter plus an intercept. The estimated overdispersion is
1 1(1 - 1.0699)2 (2 -2.3008
2.3008)2 + (4 -3.3739
3.3739)2 + (7 - 7.2553
7.2553)2
(t)
-4- (1 + 1) k 1.0699
+
) _
0.0845
which indicates underdispersion.

13.9. Claim frequency is multiplied by el000b1 = e0.05= 1.0513. That's a 5.13% increase.
13.10. We back out rri, the additional probability of 0.

= e-03 = 0.740818
Pr(yi = = ni + (1 -
0.8 = n(1_ e-0.740818) e -0.740818
0.8 e0740818
=
= 0.617793
_ e-0.740818
The probability of 1 is
Pr(yi = 1) = (1 -
= (1 - 0.617793)(0.740818)C11740818 = 0.134982

n 13.11. Using formula (13.4),

= exp(-0.3 - 0.2) = 0.606531
Var(Y) = (1 - 0.4)(0.606531) + 0.4(0.6065312)(1 - 0.4) = 0.452209

1.5 + 0.4(0.5) + 0.2(0.7) = 1.84
The Poisson mean is e1.84 = 6.296538. We multiply the Poisson probabilities by
1 - 0.25
= 0.751385
1 - e-6.296538
The probability of 2 is
1 -
e-Pt (Pt) - 0 961,1(6 2965382\) =

,
5
Pr(yi = 2) 751385e-6•2 0.027446
2 2
13.13. The systematic component is -1.135 + 0.408 - 0.677 = -1.404. Then pi = = 0.245613. The k ratio
(actual probability of greater than 0 divided by fitted probability of greater than 0 in Poisson) is
k =1 e-0.245613= 0.918380
Using formula (13.7), the variance is
Var(y1) = (0.918380)(0.245613) + (0.918380)(0.2456132)(1 - 0.918380) = 0.23009
13.14. Dividing formula (13.7) by formula (13.6) and setting the quotient equal to 1.1, we get
1.1=
kipi + kq4(1 - kt)
=1+pi- kipi
0.1 = pi-0.25
pi = 0.35 k1=
For p = 0.35, the Poisson probability of 0 is C(135 .= 0.704688. So

=
1- rii
1 - h(0)
5 1 - rti
7 0.295312
5
= 1 - -(0.295312) = 0.789063
7
13.15. Using formula (13.9),

Var(yi) = 1.35 + 1.352(0.3) = 1.89675


Copyright C2022 ASM
Lesson 14
Generalized Linear Model: Measures of Fit
Reading: Regression Modeling with Actuarial and Financial Applications 11.3.2, 11.4, 12.1.4, 12.2, 13.3.3, 13.4, 13.5
Once we have estimated the coefficients of a generalized linear model, we have to determine whether it is the best
model.
Many of the tests we discuss here are for nested models. That means that we compare one model that has several Nr
explanatory variables to another that has a subset of those variables. The second model may even have only one
variable, the intercept, in which case we're testing the significance overall of our first model. Typically we compare a
model to another model with one variable removed in order to determine the significance of that variable. However,
with categorical variables, we may want to remove all associated dummy variables at once.
14.1 Pearson chi-square

Goodness of fit may be tested using the Pearson chi-square statistic.
For a count model, we can compare, for each nonnegative integer i, the observed number of observations of i to
the fitted number. You are probably quite familiar with this chi-square test, whether from your statistics studies or
Exam STAM, but just to review: you calculate the expected (or fitted) number of observations for each claim count,
npi, and compare it to the actual number of observations for each claim count i,ni, and then the formula is
v-, (n n 1)02
Q L (14.1) •-:
Q has degrees of freedom equal to the number of cells (number of distinct claim counts), minus 1 because actual
and expected number of observations are constrained to be equal, and minus the number of parameters fitted from
the data.
An alternative formula is
n
Q= —n (14.2) 41
npi
Formula (14.2) only works when the following two conditions apply:
1. Sum of fitted values equals sum of actual values.
2. The denominators equal the fitted values.
EXAMPLE 14A You are given the following claim counts:

Number of Claims Number of Policies
0 2018
1 428
2 45
3 9
4 or more 0
Total 2500
Assume that claim counts follow a Poisson distribution, and the Poisson parameter is estimated using maximum
likelihood.
Calculate the chi-square statistic. •
Exam SR/v1 Study Manual 231

232 14. GENERALIZED LINEAR MODEL: MEASURES OF FIT
SOLUTION: The maximum likelihood estimate is the sample mean, 0.218. The expected number of policies with k
claims is 2500pk = 2500e-11218(0.218k RD. This works out to
Number Number Expected Number
of Claims of Policies of Policies
0 2018 2010.31
1 428 438.25
2 45 47.77
3 9 3.47
4 or more o 0.20
Total 2500
Notice that the expected number of claims in the "4 or more" cell is 2500 minus the sum of the expected numbers
of claims in the other cells. Then the chi-square statistic is
(2018 — 2010.31)2 1-
(428 — 438.25)2 +
(45 — 47.77)2 (9 — 3.47)2 (0 — 0.2)2 = 9.443
2010.31 438.25 47.77 3.47 0.2
with 3 degrees of freedom.1 0
Everything we've said so far applies regardless of what distribution is fitted. If you fitted a negative binomial
and used a log link, formula (14.1) would still apply.
But sometimes the fit is tested observation by observation. In that case, you would use the following formula to
calculate the Pearson chi-square statistic:
- PO'
(14.3)
1=1 Gpv(pi)
where cpv(f.ii) is the estimated variance of yi using the variance formulas in Table 11.1 on page 163. The number \—)
of degrees of freedom, if we assume all observations are independent (and we usually do), is n — p, where p is the
number of fitted parameters, usually k + 1.2
14.2 Likelihood ratio tests
Back in Lesson 4 we studied the F test for testing the whether a linear regression model or a set of its parameters is
significant. The corresponding test for a generalized linear model is the likelihood ratio test.
Let 7be the loglikelihomi of the unconstrained model and f the loglikelihood of the constrained model. Then the
likelihood ratio statistic
LRT = 2(1 — f) (14.4)
It has an approximate chi-square distribution with q degrees of freedom, where q is the number of constraints.
The likelihood ratio test compares nested models, cases where one model is included in the other model.
14.3 Deviance
Deviance is a measure of the quality of a model.

We define the saturated model as the model with one parameter for each observation. This is the best fitting model
al possible. The scaled deviance is defined as the likelihood ratio statistic comparing the saturated model to the model
'Those who have studied the chi-square test know that we'd merge the cell with 4 or more with the cell with 3, and perhaps merge both
of these into the cell with 2 claims, due to the low number of expected claims in these cells. But Regression Modeling with Actuarial and Financial
Applications, on page 344, does not merge cells despite having 0.40 and 0.01 expected claims in two cells.
2This is what the textbook seems to say, but the chi-square statistic assumes a normal distribution for each group, so responses should be
grouped to calculate the statistic; it shouldn't be done observation by observation.

Copyright 02022 ASM
14.3. DEVIANCE 233
under consideration.
D(ê) = 2 (Osaturated) — 0)) (14.5)
The deviance is the scaling factor rp times the scaled deviance: D(e) = cpD*(6). For Bernoulli and Poisson models,
where = 1, the deviance equals the scaled deviance.3 The lower deviance is, the better the model.
For nested models, the likelihood ratio statistic may be calculated as the difference of the scaled deviances. When
calculating the difference of two scaled deviances, the loglikelihood of the saturated models cancels.
EXAMPLE 14B •41 A Poisson regression with 5 variables has a deviance of 20.56. When 2 of the variables are removed
from the model, the deviance is 26.78.
Determine the p level of the hypothesis that the 2 variables that were removed from the model are significant..
SOLUTION: The loglikelihood ratio statistic is the difference of the (scaled) deviances, 26.78 — 20.56 = 6.22. The log-
likelihood ratio statistic is chi-square with 2 degrees of freedom, or exponential with mean 2, and for an exponential,
the distribution function is F(6.22) = 1 — C62212 = 0.955399, so the p level is 1 — 0.955399 = 0.044601. Even though
you learn in probability that a chi-square with v degrees of freedom is a gamma distribution with a = v/2 and f3 = 2,
or an exponential with mean 2 when v = 2,1 doubt an exam would expect you to know this. You'd probably just be
expected to look up 6.22 in the tables and note that it is between the 95th percentile (5.991) and the975thpercentile
(7.378).
We will now derive deviance formulas for the normal, Bernoulli, and Poisson models. We assume all observations
are unique, so that the fitted values in the saturated model equal the observations.
Normal Ignoring the constant 1/(o-VTn) in the normal density, which will cancel when we take the difference of
loglikelihoods, the loglikelihood is
—Oh —
2a2
For the saturated model, pi yi, so yi — = 0, making the loglikelihood 0. For the model under consideration,
pi =9.So the scaled deviance, twice the difference in loglikelihoods, is
- .902
D(0) = C2
(14.6)
The scale factor is 49 = 62. The deviance is D(B) = 013*(6) =

11
D(0) = (14.7) .1
1.1
Quiz 14-1 For a normal linear regression model, the deviance is 125. The model has 22 observations and
2 parameters, Po and Pi.
Calculate the residual standard error of the regression.
Bernoulli Assume that the parameters of the Bernoulli are 7Ii for each observation of the explanatory variables.
The mean of the distribution is ni. In the saturated model, yi = ni. The loglikelihood, ignoring binomial coefficients,
which don't vary by model and that will therefore cancel when taking differences, is
/(b) =
3This is the way it is defined in the textbook on the syllabus. But many other authors reverse the definitions, calling "deviance" what we call
"scaled deviance" and vice versa.

Copyright 02022 ASM
In the saturated model, ni = yi while in the model under consideration fri pi. Notice that
ln yi — lngi = ln —Yi
9i
and similarly
ln (1 — In) — ln (1 — 9i) lri
Then the deviance is
D = 2E((yilnni + (1 — yi) ln(1 — mi)) (yi In it1 + (1 — yi) In(1
=2 ((Van yi + (1 — yi)In(1 — yi))) (yi In gi + (1 — yi)ln(1

71
1 —yi
=2E(yilnL' +(1_Yi) in 1—
(14.8)
In this formula, the convention is that y In y = 0 whenever y = 0. That means that if all the yis are 0 or 1 (in other
words, the data are not grouped), the only terms in the sum are the ones that have yi = 1 or 1— yi = 1, and D reduces
to
D = —2 (E In + ln(1 — 9i))
yi=1 yi=0
(14.9)
EXAMPLE 14C I': Drivers are classified as low-risk (class 0) and high-risk (class 1). You use a generalized linear
model to predict the class. For 6 drivers, the results are
Actual class 0 0 1 1 1
Fitted class 0.25 0.35 0.12 0.47 0.84 0.52
Calculate the deviance statistic. •
SOLUTION: The class is a Bernoulli random variable. If you look at how the formula is derived, you'll see that we
can drop all terms involving logarithms of 0. In other words, we sum up yi ln t only for yi = 1 and (1 — yi) In
only for yi = 0.
1 1 1 1 1
D = 2 (ln 1 — 0.25
+ +
in 1 — 0.35 In 1 — 0.12 0.47 0.84 0.52
4.8592
Poisson If the means are A.i, the loglikelihood, ignoring ln yi! terms, which will cancel when taking differences, is
1(b) = yi ln Ai)
In the saturated model Ai = yi while in the model under consideration Ai = Pi, so

rt
D = 2 (E (yi ln (yi 9i)))

Yi
For terms with yi = 0, the summand yi In(yi/pi) is omitted.

Copyright 02022 ASM
14.4. PENALIZED LOGLIKELIHOOD TESTS 235
If the model has an intercept and the log link is used, E yi = Epi so the second term drops out, and we're left
with
D =2Eyiln Yi
1=1
(14.11)
To put it differently,for a Poisson regression with the log link and an intercept, the sum of the residuals 9.; - yi is 0.
EXAMPLE 140 "411 A Poisson regression with a log link is run. The results are
Actual 0 1 1 2 2
Fitted 0.35 0.85 1.22 1.74 1.84
Calculate the deviance. •
SourrxoN: Since a log link is used, we can use the simpler formula (14.11) and skip the first summand which
includes a logarithm of 0.
1 1 2 2
D = 2 (In — 0.85
—
1.22
2In —
1.74
21n
1.84
= 0.8179
14.4 Penalized loglikelihood tests

The tests discussed in this section may be used even for non-nested models. However, the models that are being
compared must be based on the same observations.
We can compare the loglikelihoods of the models and select the one with the highest. But it would not be right
to compare unadjusted loglikelihoods, because the loglikelihood cannot decrease, and usually increases, with the
addition of variables, no matter how irrelevant they are. We must charge a penalty for additional variables. Two
penalized loglikelihood formulas that we've discussed previously are4: .4*
1. Akaike Information Criterion. The penalty is 2 for each parameter. The formula is
AIC = —2/ + 2p (14.12)
where / is the loglikelihood and p is the number of parameters estimated, usually k + 1. The lower the AIC is,
the better the model is.
2. Bayesian Information Criterion. The penalty varies with the number of observations, and is In n for each 41/4:
parameter. The formula is
BIC = —2/ + p inn (14.13)
where n is the number of observations and p is the number of parameters estimated.

When comparing models, it is not necessary to know the number of parameters in each model. You just need to
know the difference in the number of parameters between the models. The improvement in the loglikelihood for a
model with q additional parameters must be at least 24) if AIC is used, q In n if BIC is used.
Quiz 14-2 sr For a generalized linear model, the negative loglikelihood is 158.06. You are considering adding
a 4-category categorical variable to the model.
Determine the highest value of the negative loglikelihood for which the additional variable is accepted if
the AIC is used to select models.
4Unlike in Section 7.2, there is no other formula in An Introduction to Statistical Learning

Copyright T:12022 ASM
14.5 Max-scaled R2 and pseudo-R2

Here are two measures that serve a role in generalized linear models similar to the role of R2 in a linear regression
model. They both use the maximized loglikelihood of the minimal model, the model with just an intercept We will
call this 10. Let 1(b) be the maximized loglikelihood of the model under consideration.
Max-scaled R2 starts out by defining R2 as follows:
R2 1 exp(/o/n) 12 (14.14)
exp(/(b)/n))
The problem is that while this R2 is never less than 0, it can never be as high as 1. Max-Scaled R2 divides R2 by its
maximum value so that the statistic is 1 for a perfect model:
R2
max-scaled R2 = 2 (14.15)
1 — (exp(/0/n))
Quiz 14-3 "Nr For a GLM based on 85 observations:

(i) The maximized loglikelihood of the model is —74.3.
(ii) The maximized loglikelihood of a model with just an intercept is —95.5.
Calculate the max-scaled R2 statistic.
For the pseudo-R2, let /max be the loglikelihood of the saturated model. The pseudo-R2 statistic is defined by
—
pseudo-R2 = /max — lo
For a linear regression model, this reduces to R2, as we will now show. For a normal distribution, the minimal
model sets fp = y. Then
and
E(yi - 9i)2 n 2no-2
1(b) =
az 2
ln 2no-2 = —Error SS/412 — 2
The saturated model sets gi = yi, so

'max = ln 2Trcr2
In equation (14.16), the 3- In 2no-2 terms cancel in the numerator and denominator, and we are left with
Total SS/c2 — Error SS/a-2 Regression SS
pseudo-R 2 = Total SS
Total SS/a2
Some authors (such as the author of the textbook on the CAS MAS-I syllabus) define pseudo-R2 differently.
14.6 Residuals
'If One way to improve a model is to look at residuals of the current model. Residuals are useful for
1. Identifying additional covariates or non-linear patterns

Copyright Cs2022 ASM
14.6. RESIDUALS 237
2. Identifying outliers
3. Displaying effects of individual observations on the model
4. Displaying heteroscedasticity and time trends
Raw residuals are meaningless in nonlinear models. To create meaningful residuals, three methods are briefly
discussed in the textbook:
1. Define a function Ej = R(yi; xi, 0), where U is a vector of model parameters f3i and scale parameters. The
function is defined so that Ej are independent and identically distributed. This is the Cox-Snell method.
2. Define a function Ei = R(yi; x,, 49) that is based on transforming yi.
3. Use deviance residuals.
Here are some examples of residuals that may be used:

Pearson residuals are defined by
Yi — Pi
qi = (14.17)
VVar(yi)
Thus they are an example of a Cox-Snell residual.
For example, for a Poisson regression
Yi 9i
cif = (14.18)
For a 13ernou1li regression

Yi 9i
qi (14.19)
Vgi(1 — gi)
The sum of the squares of these residuals is the Pearson chi-square statistic.
Deviance residuals are the square root of the term in the deviance formula corresponding to the observation,
multiplied by the sign of yi —
di = sign(yi — 9i)2 (in f (In; ei,sat) In f (y1;61)) (14.20) •111
where f(y) is the distribution of yi and "sat" stands for saturated. For example, for a Poisson regression, the 6:
deviance residual is
For a Bernoulli regression

di = sign(yi —
-\12 (yi In (41,) — (Yi — 90)
Yi
(14.21)
•41
di = sign(yi — 9;),\I2 (yi In i + (1 — y1) in 1 ) (14.22)
Anscombe residuals An Anscombe residual is an example of the second type of residual mentioned above, the *.:
type based on transforming yi. It is of the form
h(y1) — E [h (NA
r
n
Vai
A, (yi))
where 11 is a transformation that makes h(y) approximately normally distributed. (It is a function of both yi
and p.i.) The textbook does not state how to derive h, but just provides a short table with the transformations
for binomial, Poisson, and gamma distributions. Deviance residuals and Anscombe residuals are close in
many cases.
The textbook also describes another unnamed residual of the second type, where h is selected to stabilize
variance.

Table 14.1: Summary of GLM Measures of Fit
Pearson chi-square AIC

For data grouped by intervals, with total number of AIC = —2/ -i- 2p (14.12)
iftted equal to total actual observations, or for Poisson
models: BIC
(n — i)2 BIC = —21 p Inn (14.13)

Q= (14.1)
max-scaled R2
(ni)2 n (14.2)
npi
R2 = 1 texp(/0/n)12 (14.14)
For independent observations: exp(//n) )
R2
max-scaled R2 = (14.15)
(14.3)
•
ov(pi) 1 — (exp(/o/n))2
Likelihood ratio test
pseudo-R2
—
LRT = 2(/-7) (14.4) pseudo-R2= (14.16)

imax —
Deviance
Pearson residual
D =2ç5 (1(bsaturated)-1(b)) — X2(/ — (k +1)) (14.5)
yi —91
Normal qi - (14.17)
11
VVar(yi)
D = Dyi — 902 qi = Yi — 9i (Poisson) (14.18)
Bernoulli
1Fli
Yi— Pi
qi = (Bernoulli) (14.19)
D =2E (yi yi)ln (14.8) V:91(1 — 9)
Yi pi
Deviance residual
Poisson
-2( E h, 9, + z _9))
yi=1 yi=o
(14.9)
di = sign(Yi — 942 On f(yi; of,sai — f(y1; 61))
(14.20)
Poisson:
Yi
(14.10)
Likelihood ratio test di = sign(yi - iji).\12 (yi in — (yi y'i)) (14.21)
Bernoulli:
2(/ — i) = /5 — 15
where hat indicates the unconstrained model, tilde
indicates the constrained model, and there are q
constraints.
di = sign(yi — 91).\12 (yi lnYi+ (1 — y-)ln
Yi — gi
(1422)

Copyright (02022 ASM
n Exercises
Pearson chi-square
14.1. a: ISRM Sample Question #281 Dental claims experience was collected on 6480 policies. There were a total
of 9720 claims on these policies. The following table shows the number of dental policies having varying numbers
of claims.
Number of Claims Number of Policies

0 1282
1 2218
2 1856
3 801
4 235
5 81
6 or more 7
Total 6480
Calculate the chi-squared statistic to test if a Poisson model with no predictors provides an adequate fit to the
data.
(A) 80 (B) 83 (C) 86 (D) 89 (E) 92
14.2. uur The following are the results of a Poisson regression:
0 0.2
0 0.4
1 0.7
1 0.8
1 1.2
2 1.5
2 2.1
2 2.1
Calculate the Pearson chi-square statisic.
14.3. 41 A generalized linear model has a gamma response variable with scale parameter 119 = 1/3. There are 4
observations with actual and fitted values as follows:
Actual Fitted
2.1 2.5
1.8 1.4
3.9 3.8
3.4 3.1
Calculate the Pearson chi-square statistic.

Likelihood ratio test and deviance
14.4. '1 [S-F16:41] A modeler is considering revising a linear model for claim counts with age as an explanatory
variable. It is currently being included in the model as a continuous variable with no interactions.
Determine which of the following statements is false.
(A) Including age as a categorical variable with more than two levels would increase the model degrees of
freedom.
(B) Including a polynomial term age2 may decrease model deviance.

(C) A plot of age against the residuals may be used to assess if a transformation of age is necessary.
(D) One way of assessing -multicollinearity is by performing a regression of age against all other explanatory
variables.
(E) Including several interaction terms involving age may make the model more parsimonious.
14.5. al For a generalized linear model, the loglikelihood of the saturated model is —42.51 and the deviance of
the model under consideration is 21.20.
Calculate the loglikelihood of the model under consideration.
14.6. A generalized linear model with Bernoulli response is fitted to 15 observations. All the observations
except for the first, ninth, and twelfth are 0. The model sets the Bernoulli parameter 7I = 0.2 for all 15 observations.
Calculate the deviance.
14.7. "41 For a normal linear model, the deviance is 22.52. The model has 25 observations and 5 parameters (4
explanatory variables and an intercept).
Calculate the residual standard error of the regression.
14.8. .4' A generalized linear model with Poisson response is fitted to 5 observations. There is 1 explanatory
variable and an intercept. The results of the model are
Yi Pi
10 7
12 12
15 20
18 19
20 25
The deviance statistic is used to assess the fit.
Which of the following statements is correct?

(A) Reject the fit at 0.5% significance.
(B) Reject the fit at 1% significance but not at 0.5% significance.
(C) Reject the fit at 2.5% significance but not at 1% significance.
(D) Reject the fit at 5% significance but not at 2.5% significance.
(E) Do not reject the fit at 5% significance.

Copyright Q2022 ASM
fl 14.9. *It [S-F16:35] You are given ,y„, independent and Poisson distributed random variables with respec-
tive means pi for i 1, 2, ... , n.
A Poisson GLM was fitted to the data with a log-link function expresses as
E[y1] eflo+flixi
where xi refers to the predictor variables.

Analysis of a set of data provided the following output:
xi yi gi yi in(yilgi)
0 7 6.0 1.0791
0 9 6.0 3.6492
0 2 6.0 —2.1972
1 3 6.6 —2.3654
1 10 6.6 4.1552
1 8 6.6 1.5390
1 5 6.6 —1.3882
1 7 6.6 0.4119
Calculate the observed deviance for testing the adequacy of the model.
(A) Less than 4.0
(E) At least 10.0
l
f
14.10. [MAS-I-F19:331 You have a sample of five independent observations, xi, • • • , x5, each with exponential
distribution:
f (xi I t9i) = exp

)
You are fitting this data to a model with ei e for all i, using maximum likelihood estimation:
f (xi I 0) = 7(1.4 exp (-÷11;

The five observed values are: 100, 100, 500, 800, 1000
The deviance of the model, D, is equal to twice the difference between the log-likelihood of the saturated model
and the log-likelihood of the fitted model.
Calculate D.
(A) Less than 2

(E) At least 8
Exam 5RM Study Manual Exercises continue on the next page . .

Copyright 61022 ASM
14.11. [S-S16:34j Determine which of the following statements are true.

I. The deviance is useful for testing the significance of explanatory variables in nested models.
II. The deviance for normal distributions is proportional to the residual sum of squares.
III. The deviance is defined as a measure of distance between the saturated arid fitted model.
(A) I only (8) II only (C) III only (D) All but III (E) All
14.12. '41 [MAS-I-S18:291 You are given the following statements relating to deviance of GLMs:
I. Deviance can be used to assess the quality of fit for nested models.
II. A small deviance indicates a poor fit for a model.
III. A saturated model has a deviance of zero.
(A) None are true (B) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) ,or (D)
14.13. ": The loglikelihood of a fitted generalized linear model with 6 parameters 130,161, • • • s P5 is —130.52. The
null hypothesis is Ho: 134 = P5 = 0. Under the null hypothesis, the loglikelihood is —134.88.
Under the likelihood ratio test, which of the following statements is correct?
(A) Reject Ho at 0.5% significance.
(B) Reject Ho at 1% significance but not at 0.5% significance.
(C) Reject Ho at 2.5% significance but not at 1% significance.
(D) Reject Ho at 5% significance but not at 2.5% significance.
(E) Do not reject Ho at 5% significance.
14.14. NI. [S-S17:421 A study was commissioned on the effect of type of fertilizer and type of seed on crop yield.
There are four types of fertilizers and five types of seed included in the study.
Two separate Poisson GLMs with log link functions were fit to the dataset:
1) Using type of fertilizer and type of seed without an interaction term, the log-likelihood of this GLM is —283.
2) Using type of fertilizer and type of seed, and all interaction terms between those two main effect variables, the
log-likelihood of this GLM is —272.
Let:
• Ho: The effect of type of fertilizer is independent of type of seed on crop yield.
• Hi: The effect of type of fertilizer is not independent of type of seed on crop yield.
Calculate the smallest significance level at which you reject Ho.
(A) Less than 0.5%
(B) At least 0.5%, but less than 1.0%
(C) At least 1.0%, but less than 2.5%
(D) At least 2.5%, but less than 5.0%
(E) At least 5.0%

Copyright 02022 ASM
14.15. The scaled deviance for a fitted generalized linear model with 5 parameters po, 434 is 52.08. The
null hypothesis is Ho: P4 -= 0. If P4 is set equal to 0, the scaled deviance is 55.98.
Using the likelihood ratio test, which of the following statements is correct?
(A)Reject Ho at 0.5% significance.
(B)Reject Ho at 1% significance but not at 0.5% significance.
(D)Reject Ho at 5% significance but not at 2.5% significance.
14.16. [ SRM Sample Question #191The regression model Y = pol-pixi+,62x2+,53x1 X2+ E is being investigated.
The following maximized log-likelihoods are obtained:
• Using only the intercept term: —1126.91
• Using only the intercept term, Xi, and X2: —1122.41
• Using all four terms: —1121.91
The null hypothesis Pi = 132 = f33 = 0 is being tested at the 5% significance level using the likelihood ratio test.
Determine which of the following is true.
(A) The test statistic is equal to 1 and the hypothesis cannot be rejected.
(B) The test statistic is equal to 9 and the hypothesis cannot be rejected
(C) The test statistic is equal to 10 and the hypothesis cannot be rejected.
(D) The test statistic is equal to 9 and the hypothesis should be rejected.
(E) The test statistic is equal to 10 and the hypothesis should be rejected.
AIC and BIC
14.17. You are given the following results for two generalized linear models fit to the same data:
Model AIC
g(Y) = Po 89.2
g(P) = PO ± P1X1 P2X2 + P3x3 88.4
Calculate the likelihood ratio statistic to test Pi = = 133 = 0.

Copyright 6'2022 ASM
14.18. f [S-F15:361 You are given the following information for two potential logistic models used to predict the
occurrence of a claim:
Model 1: (AIC=262.68) Model 2: (AIC=263.39)

'.
Parameter Parameter
(Intercept) —3.264 (Intercept) —2.894
Vehicle Value ($000s) 0.212 Gender—Female 0.000
Gender—Female 0.000 Gender—Male 0.727
Gender—Male 0.727
AIC is used to select the most appropriate model.

Calculate the probability of a claim for a male policyholder with a vehicle valued $12,000 by using the selected
model.
(A) Less than 0.15

(E) At least 0.60
14.19. [S-F15:371 You are given the following table for model selection:
Scaled Number of
Model Deviance Parameters (k + 1) AIC
Intercept + Age A 5 435
Intercept + Vehicle Body 392 11 414
Intercept + Age + Vehicle Value 392 X 446

Intercept + Age + Vehicle Body + Vehicle Value B Y 501
Calculate Y
14.20. ": [S-F15:38) You are testing the addition of a new categorical variable into an existing GLM. You are given
the following information:
(i) The change in scaled deviance after adding the new variable is —53.
(ii) The change in AIC after adding the new variable is —47.
(iii) The change in BIC after adding the new variable is —32.
(iv) Prior to adding the new variable, the model had 15 parameters.
Calculate the number of observations in the model.
(A) Less than 1,000

(E) At least 1,300

14.21. kir [S-S16:35] You are given the following information about three candidates for a Poisson frequency GLM
on a group of condominium policies:
Model Variables in the Model df loglikelihood AIC BIC
1 Risk Class 5 —47,704 95,418 95,473.61182
2 Risk Class + Region —47,495
3 Risk Class + Region + Claim Indicator 10 —47,365 94,750
Insureds are from one of five risk classes: A, B, C, D, E.
Condominium policies are located in several regions.
Claim Indicator is either Yes or No.
All models are built on the same data.
Calculate the absolute difference between the AIC and the BIC for Model 2.
(A) Less than 85

(E) At least 115
14.22. [S-S16:37] Determine which of the following GLM selection considerations is true.
(A) The model with the largest AIC is always the best model in model selection process.
(B) The model with the largest BIC is always the best model in model selection process.
(C) The model with the largest deviance is always the best model in model selection process.
(D) Other things equal, when the number of observations>1000, AIC penalizes more for the number of param-
eters used in the model than BIC.
(E) Other things equal, when the number of observations>1000, BIC penalizes more for the number of param-
eters used in the model than AIC.
14.23. •41 IS-S16:38) You are testing the addition of a new categorical variable into an existing GLM, and are given
the following information:
(i) A is the change in AIC and B is the change in BIC after adding the new variable.
(ii) B > A + 25
(iii) There are 1500 observations in the model.
Calculate the minimum possible number of levels in the new categorical variable.
(A) Less than 3 (5) 3 (C) 4 (D) 5 (E) More than 5
Copyright 02022 ASM
14.24. •.+? [MAS-I-F19:321 You have three competing GLMs that each predict the number of claims under an
insurance policy, and are evaluating the models using AIC and BIC. All models are trained on the same dataset of \--)
300 observations. These models are summarized below:
Number of
Model Likelihood Parameters
0.0456 4
2 0.0567 5
3 0.0575 6
The following are three statements about the fit of these models:
I. Model #1 is best based on BIC
IL Model #2 is best based on AIC
III. Model #3 is best based on BIC
You are given a set of 65 observations. Two models are proposed for the underlying distribution. The models
only differ in that the first model includes a categorical variable with 4 categories and the second one doesn't.
Let 11 be the loglikelihood of the first model and let 12 be the loglikelihood of the second model.
14.25. P Calculate the most by which h may exceed /2, using the Akaike Information Criterion, if the second
model is preferred.
14.26. '''‘r Calculate the most by which /1 may exceed /2, using the Bayesian Information Criterion, if the second
model is preferred.
14.27. rCalculate the most by which /1 may exceed 12, using the likelihood ratio test at 5% significance, if the
second model is preferred.
14.28. •41 You are given two models based on the same n observations. The loglikelihood of the first model is
—110.52 and the loglikelihood of the second model is —105.34. The second model has 2 more parameters than the
i rst model.
f
The Bayesian Information Criterion is used to select the best model.

Determine the highest n for which the second model is selected.
14.29. N? A generalized linear model is used for claim frequency. One of the explanatory variables is age. Age is
a categorical variable. There are 5 age groups,
Model I includes age as an explanatory variable and Model II does not. The models are otherwise identical.
You are given:
Model Scaled Deviance AIC
Model I 20.7 139.8

Model II 23.9
Determine the AIC for Model II.
Exam SAM Study Manual Exercises continue on the next page . . .

Max-scaled R2 and pseudo-R2

For a Poisson regression with a log link, you have the following actual and fitted values:
Yi 9i
1 1.2
3 2.6
4 4.2
6 6.8
8 7.2
In the minimal model, the fitted value of yi is 4.4.
14.30. Calculate the pseudo-R2 statistic for this fit.

14.31. '11 Calculate the max-scaled R2 statistic for this fit.
A generalized linear model has an exponential response variable. There are 5 observations with actual and
fitted values as follows:
Actual Fitted
1 0.80
2 1.23
3 3.57
4 5.00
5 4.40
The minimal model sets the fitted value equal to the mean.
14.32. Calculate the pseudo-R2 statistic.

14.33. Calculate the max-scaled R2 statistic.
Residuals
14.34. 14 [SRM Sample Question #52] Determine which of the following statements is/are true about Pearson
residuals.
I. They can be used to calculate a goodness-of-fit statistic.

II. They can be used to detect if additional variables of interest can be used to improve the model specification.
They can be used to identify unusual observations.
(A) I only
(B) II only
(C) III only
(D) I, II, and III

Copyright 02022 ASM

For a generalized linear model, the response distribution is binomial. For a cell with observed value 0, the
iftted probability is 0.180.
14.35. Calculate the Pearson residual for that cell.
14.36. *-411 Calculate the deviance residual for the cell.
14.37. 'kJ° For a generalized linear model with 5 observations, the deviance residuals are 0.125, 0.342, -0.207,
0.408, -0.603.
Calculate the scaled deviance.
14.38. •46 For a generalized linear model, the response distribution is Poisson. You are given that y5 = 10 and
gs = 8.
Calculate the deviance residual d.
Solutions
14.1. We calculate the expected number of policies with each number of claims, using Poisson probabilities
p„ = e-A /1"/n!, where A is the mean. With no predictors, the sample mean is used as the mean, and 9720/6480 = 1.5.
Let EL be the expected number of policies with k claims.5 You may calculate Ek directly, but we will use (a, b, 0)
methods that you learn in Exam STAM.
E0 = 6480C1.5 1446
1.5E0 = 2169
1.5
E2 = —El = 1627
2
1.5
Es T E2 = 813
1.5
E4 = - E3 305
1.5
Es = —E4 = 91
5
5
E6 = 6480 - EA, = 29
k=0
The chi-square statistic, using equation (14.2), is

n2
Q z(Ok Ek
Ek )2 = Ek
n
12822 22182 18562 8012 2352 812 72

6480
= 1446 + 2169 ± 1627 + 813 + 305 + 91 ±
= 85.97 (C)
sDo not confuse this with exposure, which is not used in this question.

14.2. For Poisson, where variance equals mean and E yi = we may use the alternative formula (14.2). There
are a total of 9 observations; n = 9.
02 02 12 22
Q= + + + • • • + - - 9 = 0.9881
0.2 0.4 0.7 2.1
14.3. We sum up (yi - rii)21014.

(2.1 - 2.5)2 (1.8_1.4)2 (3.9_3.8)2 +(3.4 - 3.1)2
Q = + +
2.52/3 1.42/3 3.82/3 3.12/3
0.35187
14.4.
(A) There is one variable for each category in excess of 1, so including more than 2 categories will increase the
number of variables in the model and the number of degrees of freedom.
(B) Increasing the number of variables in the model may decrease deviance and cannot increase it.
(C) The plot will show whether the residuals are randomly distributed.
(D) The VIF is calculated by regressing a variable against the others. VIF is R2 for that regression. High VIF
indicates collinearity.
(E) A parsimonious model has fewer variables, not more variables.
(E)
14.5. D = 2(1(bmax) - 1(b)), so

21.20 = 2(-42.51 - /(b))
21.20
1(b) =
-2- 42.51 = -53.11
14.6. The observations that are not 0 are 1. In the saturated model, 77 = 1 for all observations of 1 and 77 = 0 for all
observations of 0, making the likelihood 1 and the loglikelihood 0. The likelihood of the model is (0.812)(0.23), so
the loglikelihood is 31n 0.2 + 121n 0.8 = -7.50604. The deviance is 2(7.50604) = 15.01207
Alternatively, use formula (14.8). For the 12 observations of 0, yi In yi is treated as 0 and
1 - yi 1
ln 0.8
(1- Yi) In = In
while for the 3 observations of 1, (1 - yi) ln(1 - yi) is treated as 0 and
Yi 1
-ln 0.2
Yi
= hi -0.2
Summing them up and doubling, 2(-121n 0.8 - 31n 0.2) = 15.01207
14.7. Deviance for a normal regression is E(yi -
1.06113

Copyright 02022 ASM
14.8. Let D be the deviance.
D = 2(101n(10/7) + 121n(1) + 151n(15/20) + 181n(18/19) + 201n(20/25)

— (10 — 7) — (12— 12) — (15 —20) — (18 — 19) — (20 —25)) = 3.631
There are n — (k + 1) = 5 —2 = 3 degrees of freedom. The 95th percentile of chi-square with 3 degrees of freedom is
7.815, so the fit is accepted at 5% significance. (E)
14.9. Since there is an intercept and the log link is used, E(yi — 0. But if you didn't know this, you could just
add up the yi and gi columns and verify that the sums are 51 in both columns. So just sum up the last column of the
table and double it. The result is 9.7672 . (D) If you want to use the incorrect formula in the textbook, then don't
double the sum of the last column.
14.10. The maximum likelihood estimate is the sample mean, or 6 = (100+100+500 +800 +1000)/5 = 500. (You may
have learned this in your statistics course, which is preliminary to this course, or while taking STAM. But perhaps this
question is not suitable for SRM.) The loglikelihood of each observation is — ln ei — xi/Oi. Thus for the fitted model,
with 0 = 500, the loglikelihood is —51n 500-2500/500 = —36.073. In the saturated model, each observation is fitted to
its own model, and the fitted value is 0; = xi. Then the loglikelihood is —21n 100—In 500—in 800—In 1000-5 = —34.017.
The difference between double the negative log likelihoods is 2(-34.017 + 36.073) = 4.111 . (C)
14.11. Statement I is true since differences in loglikelihoods can be deduced from differences in deviances, and
then the likelihood ratio test may be used.
The deviance for a normal distribution is E(yi — 9i)2, making II true.
Statement III is true. Refer to the definition of deviance. (E)
14.12. I is true. II is false, since the smaller deviance, the better the fit. III is true since the deviance is a multiple of
the difference from the loglikelihood from the saturated model's loglikelihood. (C)
14.13. Twice the difference in loglikelihoods is 2(134.88 — 130.52) = 8.72. There are 2 constraints, so this statistic
is chi-square with 2 degrees of freedom. For chi-square, 8.72 is between the critical value 7.378 at 2.5% significance
and 9.210 at 1% significance, making (C) the correct answer.
14.14. One category of fertilizers and one category of seed type is baseline, leaving 3 fertilizers and 4 seed types to
interact, a total of 12 interactions. Twice the loglikelihood increases by 2(-272 + 283) = 22 with interactions, and is
chi-square with 12 degrees of freedom. The critical values are 21.026 at 5% and 23.337 at 2.5%, making the answer
(D).
14.15. The difference in deviances is 55.98 — 52.08 = 3.9, and this is the likelihood ratio test statistic. One parameter
is being constrained, so there is 1 degree of freedom. For chi-square with 1 degree of freedom, the critical values are
3.841 at 5% significance and 5.024 at 2.5% significance. So (D) is correct.
14.16. The test statistic is twice the difference in log-likelihoods between using all 4 terms and using only the
intercept:
2(-1121.91 + 1126.91) = 10
There are 3 additional parameters in the bigger model, hence 3 degrees of freedom for the chi-square distribution.
At 5% significance, the critical value is 7.815. Thus the null is rejected. (E)
14.17. The AIC is —21+ 2p, where p is the number of parameters, k + 1. The first model hasp = 1; the second model
has p = 4. For the first model:
—2/ +2 = 89.2
—2/ = 87.2
2/ = —87.2
For the second model
—2/ + 8 = 88.4
—21 = 80.4

Copyright 02022 ASM
EXERCISE SOLUTIONS FOR LESSON /4 251
r-, 21 = —80.4
The likelihood ratio statistic is twice the difference in loglikelihoods, or (-80.4) — (-87.2) =
14.18. The lower the AK the better, so Model 1 is selected. From the given coefficients,
g(m) = —3.264 + 12(0.212) + 0.727 = 0.007

For a logistic model,
e0.007
7r= 0.50175 (D)
1 4.e0007
14.19. The AIC is twice the negative loglikelihood plus twice the number of parameters. The scaled deviance is
twice the excess of the loglikelihood of the saturated model over the loglikelihood of the model. Since the saturated
model is the same for all four models, the difference in AICs is the difference in scaled deviances phis twice the
difference in the number of parameters.
From the first model we see that Age has 4 parameters (although this is extraneous), and from the second model
we see that Vehicle Body has 10 parameters. Since the AIC of the third model is 32 more than the AIC of the second
model and the scaled deviance is the same, the third model must have 16 more parameters than the second model,
so X = 27 and Age + Vehicle Value has 26 parameters. The fourth model, Intercept + Age + Vehicle Value + Vehicle
Body must have 1 parameter for Intercept, 26 for Age + Vehicle Value, and 10 for Vehicle Body, for a total of 37
parameters. (C)
14.20. Statement (iv) is extraneous.
Twice the negative loglikelihood increases by the same amount the scaled deviance increases, so twice the
negative loglikelihood increased by —53. The AIC is —2/ + 2q where q is the number of parameters. Let p be the
number of parameters for the new variable. We see that the AIC increased by 6 more than —21, so 2p = 6 and p = 3.
The BIC, which is —21 + q Inn increased 21 more than —2/, so p In n = 21 and Inn = 7. The number of observations
is n = e7 = 1097. (B)
14.21. Since claim indicator has one degree of freedom, eliminating it lowers the degrees of freedom from 10 to 9.
From Model 1, we see that
BIC = 2(47,704) +5 Inn = 95,473.61182
Inn = 95'473.61182
5
— 95,408= 13.12236
It is unnecessary to calculate n, but if you're curious, it is 500,000.
The difference between BIC and AIC in Model 2 is
91n n — 9(2) = 9(11.12236) = 100.101 (C)
14.22. (E) is true, since the penalty function per parameter is hi n with BIC and 2 with MC, and Inn > 2 for
n > 1000. That makes (D) false. In the other 3 statements, the opposite of each statement is true: The model with
the smallest AIC/BIC/deviance is best.
14.23. The difference in penalties per parameter between BIC and AK is In n — 2 = In 1500 — 2 = 5.3132. If the
difference is greater than 25, then the number of parameters must be at least 25/5.3132, which rounds up to 5. Since
the number of parameters is 1 less than the number of levels, there must be at least 6 levels. (E)
14.24. Twice the negative loglikelihoods are
Model 1 —21n 0.0456 = 6.176
Model 2 —21n 0.0567 = 5.740
Model 3 —21n 0.0575 = 5.712
After adding the AIC penalty of 2 per parameter, Model 1 is best. The BIC penalty is in 300 = 5.704 per
n parameter, so once again Model #1 is preferred. (A)

14.25. A 4-category variable generates 3 parameters. Model II has 3 parameters less than Model I. The AIC is
obtained by doubling the loglikehood, negating, and adding 2 for each parameter. If Model II's loglikelihood is less
than Model I's by the number of parameters, El, then the two AICs will be the same.
14.26. The BIC is obtained by doubling the loglikelihood, negating, and adding In n for each parameter. The BICs
will be equal if the loglikelihood for Model II is less than Model Is by 3(ln 65)/2 = 6.262
14.27. Twice the difference in loglikelihoods is chi-square with 3 degrees of freedom. The critical value for x2(3) at
95% is 7.815, so we are indifferent between the two models if 2(l1 - /2) = 7.815, or I - 12 3.908
14.28. We want 2(110.52) - 21n n > 2(105.34), so
-21n n > -10.36

Inn <5.18
n <177.68
The highest n is 177

14.29. Twice the negative loglikelihood of Model II is 23.9 - 20.7 = 3.2 more than that of Model I, so its AIC is 3.2
higher for that reason. A 5-category variable adds 4 parameters to a model, so the AIC of Model II is 2(4) = 8 lower
for that reason. Therefore, the AIC of Model II is 139.8 + 3.2 - 8 = 135.0
14.30. We must calculate /(b), and 'max. The likelihood and loglikelihood for Poisson observations with means
Ai is
L= e ' ''n
/=-ZAi+EnilnA1-In n ni !
For all models, n, = yi. For the model under consideration, Ai = 9i. For the minimal model, Ai = 4.4. For the
saturated model, A1 yi. We can ignore In n!, which cancels. In fact, we can even ignore E Ai, since Eyi = z 91,
but we won't ignore it.
/0 = -5(4.4) + (1 + 3 + 4 + 6 + 8)ln 4.4 = 10.5953

/(b) = -(1.2 + 2.6 + 4.2 + 6.8 + 7.2) +ln 1.2 + 31n2.6 + 41n 4.2 + 61n6.8 + 81n7.2 = 14.0834
!max =-(1 + 3 +4+ 6 ± 8)+1n1 +31n3 + 41n4 + 61n6+ Bina =14.2271
2 14.0834 - 10.5953
pseudo-R = 14.2271 - 10.5953 = 0.9604,
14.31. We'll use the calculations from the previous exercise. But we'll have to respect signs and subtract inn ni!
from each one since it doesn't cancel in the R2 formula.
Infl ni! = In(1! • 3! • 4! • 6! • 81) = 22.1537

/0 = 10.5953 - 22.1537 = -11.5584
/(b) = 14.0834 - 22.1537 = -8.0703
R2 =
(exp(-11.5584/51 2 =0.752224
exp(-8.0703/5)
0.752224
max-scaled R2 - 0.7597
1 - exp(-11.5584/5)2

Copyright 02022 ASM
14.32. The density function for an exponential is f(x; p) = II' p. Setting pi = pi, the loglikelihood is
)
-E(ing,+
For the minimal model, y = = 3. For the saturated model, y = y. With the given fitted values,
/(b) = - ln(0.8 • 1.23 .3.57.5 • 4.4) - 0.0002

0.8 1.23 3.57 - 4.4 -
With pi = = 3,
15
/0 -7- -51n3- - = -10.4931
3
With 9; = yi,
/max = -1n(1 2 • 3 • 4 • 5) - 5 = -9.7875
The pseudo-R2 statistic is
10A931 - 10.0002 _ 0.6985
10.4931 - 9.7875 -
14.33. We'll use the calculations from the previous exercise.
R2 = 1
texp(-10.4931/5)) 2
exp(-10.0002/5)
0.17893
0.17893
0.17893
max-scaled R2 = 0.18166
1 - (exp(-10.4931/5))2 0.98496
14.34. The sum of the residuals is the Pearson chi-square goodness-of-fit statistic, making I true. II and III are listed
in the list of four items that residuals are useful for at the beginning of Section 14.6. (D)
14.35. residual =
yi - fEi
-0.46852
V(0.180)(0.820)
Yi 1-yi
14.36. residual = sign(yi -
) 2 (yi In + (1 - yi) In ist )
42 (In 0.8120 = -0.63000
14.37. D = 0.1252 + 0.3422 + (-0.207)2 + 0.4082 + (-0.603)2 = 0.705511

14.38. By formula (14.21),
d5=11 Yi
10
= -\12 (101n - - (10 - 8)) = 0.680346

8

Quiz Solutions
14-1. Comparing formulas (3.2) and (14.7), the residual standard error of the regression is ./D(e)/(n — (k + 1)).
s = V125/(22 — 2) = 2.5
14-2. A 4-category variable adds 3 parameters to the model. The AIC penalty adds twice the number of parameters,
or 6, to twice the negative loglikelihood. To compensate for this, the negative loglikelihood must decrease by at least
3, to 155.06
e955185'
14-3. R2 = 1
(e-743/85) 2 = 0.392755
0.392755 0.392755
max-scaled R2 = = = 0.439181
1 — (e-953/85)2 0.894290

exams with helpful solutions and tips. Inciuded with GOAL

Copyright 02022 ASM
Part III
Other Statistical Learning

Methods
Lesson 15
K-Nearest Neighbors
Reading: An Introduction to Statistical Learning 2.2.3, 3.5
As mentioned in the preface, none of the SOA sample questions are on the topic of this lesson. Possibly they never
test on it. So feel free to skip this lesson if you're short on time, unless your friends tell you that they got questions
on it.
We are now going to discuss alternatives to regression. We begin by discussing categorical responses, the
classification setting.
15.1 The Bayes classifier

Suppose our response variable is a categorical variable. We are trying to predict the class. Since classes are usually
descriptive (for example, "Property damage only", "Bodily injury", "Fatality") rather than numerical, we cannot use
a measure like MSE. Instead, our measure of quality is the proportion of cases for which we correctly predicted the
class. Let 1(yi # 9) be the indicator variable that is 1 if yi 0 otherwise. Then we want to minimize
This is the error rate. •Ir

The best classifier is the Bayes classifier. The Bayes classifier assigns each case to its most likely class, given the .41
explanatory variables. Thus if there are c classes, and we know xo and Pr(yi = j I xo) for j = 1, 2, ... c, we assign the
value j that maximizes Pr(yi = j I x0). For example, if there are 2 classes, the Bayes classifier assigns the class with
Pr(Y = j I xo) > 0.5.1
If the response is a function of two variables, one can draw a graph with the two predictors as the two axes.
Using the Bayes classifier, we can identify the regions mapping to each class. The boundaries between these regions
are called the Bayes decision boundaries. al?
The error rate using the Bayes classifier is
Error rate = 1 E [max Pr(Y =11X)] (15.1)
EXAMPLE 15A Z A variable Y may have the value 1 or 2. Given the value of two explanatory variables, Xi. and X2,
the probability that Y is 1 is
Ixi
Pr(Y = 1 I Xi = X1 . X2 = X2) =
IX1 I + IX21
•
iThe textbook is silent on what to do if there is more than one maximum probability; for example, if Pr(Y = j I xo) = 0.5 and there are two
classes. You'd have to pick one of the maxima randomly then.
Exam SEM Study Manual 257

Copyright Q2022 ASM
258 15. K-NEAREST NEIGHBORS
That is the boundary, with points having Ixil > Ix21 mapping to 1. Here is a graph:
xi
15.2 KNN classifier
The Bayes classifier is a theoretical method_ We do not know the conditional probabilities Pr(Y = j I xo). The
•••• K-nearest neighbors (KNN) classifier is one way to estimate those probabilities. To carry out KNN, select an integer
K. Then look at the values of Y at the K observations nearest to the point of interest. Set Pr(Y = j I xo) equal to
the proportion of those points with Y = j. Use the Bayes classifier to assign a value to Y; in other words, the value
assigned to Y is the most common value at the K nearest points.
EXAMPLE 15B Y is a classification variable with two possible values: 1 and 2. X is an explanatory variable. You
are given the following observations (X, Y):
(1,1) (2,2) (5,1) (7,1) (10,2) (12,2)
Using K-nearest neighbors with K = 3, determine the Xs that go to each value, 1 and 2. Then calculate the error
rate. •
SOLUTION: The majority of the points, at least 2 out of 3, will go to 1 as long as X <8.5. When X > 8.5, then X = 12
is closer than X = 5 and the three nearest points are 7, 10, and 12, so points with X > 8.5 go to 2.
The assignment is incorrect for X = 2 but is correct for all other points, so the error rate is 1/6 0
In the previous example, if K = 1, then each point would go to the Y at the nearest point; the assignment rule would
be
X < 1.5 1
1.5 < X < 3.5 2
3.5 < X < 8.5 1
X > 8.5 2
The error rate is 0.
If K = 5, then points closer to 1 would go to 1 and points closer to 12 would go to 2. The rule would be X <6.5
goes to 1 and X > 6.5 goes to 2, with error rate 1/3.
We see that the higher K is, the less flexible the method is, and the higher the training error is. Higher K
means higher bias. However, higher K also means lower variance. For test data, the error rate is minimized for an
intermediate value of K. Using a value of K too low is overfitting the model.
Sometimes we analyze the error rate as a function of 1/K, in which case low values of 1/K mean high bias and
low variance while high values of 1/K mean low bias and high variance.

15.3. KNN REGRESSION 259
r"--- 15.3 KNN regression

Now let's discuss the regression setting, continuous response variables.
KNN is an example of a non-parametric method, a method that does not specify a functional relationship between
the predictors and the response. In KNN regression, K is selected. The value of the response at any point is the
average of the values at the K nearest observations.2
EXAMPLE 15C A dependent variable Y is modeled as a function of an independent variable X.
You are given the following observations:
x 2 3 4 5 7
y 34 38 53 50 70
Perform KNN regression with K = 1, K = 3, and K = 5.
SoLuTioN: With K = 1, the value of Y at each point is the value at the closest observation. Thus
X <2.5 Y =34
2.5 < X <3.5 Y =38
3.5 <X <4.5 =53
4.5 < X <6 Y = 50
X>6 Y = 70
With K = 3, the three closest Xs are 2,3, and 4 as long as X <3.5. The closest Xs are 3,4, and 5 when 3.5 < X <5.
The closest Xs are 4,5, and 7 when X > 5. Thus
X <3.5 Y =41
3.5 <X <5 Y =47
X > 5 Y
l
f With K = 5, all five points are closest, and Y is always set equal to its mean, 49.
As we see from the example, the method produces a step function. It becomes less flexible as K increases. What we
said for KNN classification applies here as well; higher K increases the MSE for training data. For test data, higher gl
K increases bias but decreases variance; the value of K minimizing test MSE is in the middle.
How does KNN regression compare to linear regression? If the underlying variable follows the assumptions
of linear regression, then linear regression will give a better fit. If not, KNN may give a better fit, particularly if
the number of predictors is small. But if the number of predictors k is large, there may be no nearby neighbors in
k-dimensional space, and KNN may be based on faraway values of observed Ys. Parametric methods are superior
when there are only a small number of observations per predictor.
In addition, parametric methods such as linear regression are easier to interpret than KNN.
2The textbook is silent on what to do if there is a tie for nearest point. Perhaps you'd average the responses at all the tied points in that case.

Exercises
15.1. ": You are predicting whether a policyholder will renew his policy based on how many years the policyholder
has had the policy with the company. Based on your experience, the probability of renewal is:
Number of Years Number of Probability of

Policy with Company Policies Renewal
1 450 40%
2 240 60%
3 200 70%
4 or more 110 75%
Calculate the Bayes error rate of the renewal prediction based on number of years the policyholder had the policy
with the company.
15.2. Joe is a chess player. The probability that a player wins or draws a chess game against Joe, given the class
of that player, is as follows:
Class of Proportion of Players in Probability of Win Probability of Draw

Player this Class Against Joe Against Joe
Master 0.01 0.99 0.01
Expert 0.05 0.90 0.05

Class A 0.10 0.79 0.12
Class B 0.20 0.50 0.30
Class C 0.30 0.32 0.37
Class D 0.20 0.15 0.36
Class E 0.14 0.03 0.13
If a player does not win or draw against Joe, then the player loses.
Calculate the Bayes error rate of the predicted outcome of the game based on the player's class.
15.3.
%. The probability of passing Exam SRM is modeled as a function of the number of hours of study, h. The
probability of passing is h/(150 h). The Bayes classifier is used to predict whether a person passes.
The number of hours that people study is uniformly distributed between 100 and 900.
Calculate the Bayes error rate.
15.4. •-ir Y is a classification variable with two classes, I and II. A model for Y has two explanatory variables, X1
and X2. The probability that Y is class I is 4/(4 + 44).
Determine the Bayes decision boundary, and draw a graph with X1 and X2 as the axes, showing the regions in
which class I and class IT are predicted.
15.5. ',11` There are 3 candidates running for mayor, Susan, Jack, and Mae. The probability that a voter will vote
for a candidate is modeled as a function of the voter's income (in thousands). Income is always positive. Let x be
the voter's income. The probability that a candidate will vote for Susan is x/(100 + x). The probability that a voter
will vote for Jack is (100x + 3600)/(100 + x)2. Otherwise the voter will vote for Mae.
Assume x is positive.
Determine the Bayes decision boundaries between the candidates a voter votes for.

15.6. %. Which of these statements are true regarding K-nearest neighbors?

I. As K increases, the method becomes more flexible.
II. As K increases, the training error usually increases.

III. As K increases, the test error usually decreases.
15.7. •41 There are two real-valued predictor variables, X1 and X2, and a classification response variable Y with
two classes, A and B. You are given the following data:
)(2
1 2 A
1 6 A
2 2 A
3 4 B
4 1 A
4 5 B
5 3 B
KNN with K =3 is used.
Determine the value of Y assigned to (3,3).

A classification variable Y with classes 0 and 1 is modeled using predictor variable X. You are given the
following observations:
X 1 2 4 7 11 12 15
Y 0 0 1 0 1 0 1
15.8. N? Using KNN with K = 3, determine the regions for which 0 is predicted for Y.
15.9. Using KNN with K = 7, determine the classification error rate.

Copyright 02022 ASM
15.10. sir [MAS-II-F18:361 A training data set contains eight observations for two predictor variables, X1 and X2,
and a response variable, Y. The response Y has three possible classes: P. N, and U.
Distance from
i Xi X2 Y
(xii,xi2) to (3,2)
1 4.1 3.0 P 1.5
2 —2.6 —3.0 N 7.5
3 —1.1 1.3 U 4.2
4 0.0 1.2 U 3.1
5 —3.0 —5.0 N 9.2
6 2.0 2.0 U 1.0
7 —3.1 —2.0 N 7.3
8 3.2 3.1 P 1.1
Three models are constructed using K-Nearest Neighbors and the data set above to predict Y for the two-
dimensional space of predictors Xi and X2.
• Model 1: K-Nearest Neighbors with K = 1.
• Model I: K-Nearest Neighbors with K = 3.
• Model I: K-Nearest Neighbors with K = 7.
Each model is used to classify the point (3,2).
Determine the predicted response Y at this point using each of the three models.
(A) Model I: Y =I', Model II: Y =I; Model III: Y =P
(B) Model I: Y =P, Model II: Y =P, Model III: Y =U
(C) Model I: Y =U, Model II: Y =P, Model III: Y =P
(D) Modell: Y =U, Model H: Y =P, Model III: Y =U

Copyright 02022 ASM
15.11. [MAS-II Sample:11] You are given the following data to train a K-Nearest Neighbors classifier with
K = 5:
Distance to
X1 X2 Y Xi = 0, X2 = 5
4 4 Yes 4.1
1 6 No 1.4
7 5 No 7.0
5 5 Yes 5.0
2 7 Yes 2.8
7 2 Yes 7.6
8 4 Yes 8.1
8 6 Yes 8.1
2 3 Yes 2.8
2 5 No 2.0
2 2 Yes 3.6
6 6 No 6.1
1 8 No 3.2
0 5 Yes 0.0
Calculate Pr[Y = "Yes" IX1 = 0, X2 = 5] with the K-Nearest Neighbors classifier.

(A) Less than 0.3
(E) At least 0.9

Copyright 02022 ASM
264 /5. K-NEAREST NEIGHBORS
15.12. NIP [MAS-H-S19:381 You are provided with training and test data samples consisting of a single variable X,
and an observation Y consisting of two possible classes, T and F.
Training Data Test Data
xi y I xi yi
1 —1.60 T 1 —1.3 F
2 —1.50 F 2 0.9 F
3 —0.60 T 3 1.2 T
4 —0.30 T
5 0.40 F
6 0.40 F
7 0.70 T
8 1.20 T
9 1.30 T
10 2.10 F
The Bayes classifier Pr(Y = F I X = x) = e

Calculate the amount by which K-nearest neighbors test error rate with K = 3 exceeds the Bayes error rate on the
test data.
(A) Less than 0.35

(B) At least 0.35 but less than 0.40
(C) At least 0.40 but less than 0.45
(D) At least 0.45 but less than 0.50
(E) Greater than or equal to 0.50

15.13. N. [MAS-II-F18:37] You are given a data set consisting of 150 data points. There are three possible
classifications for each data point, as shown in the leftmost graph, with 50 data points falling into each classification.
Based on this data, you train a K-Nearest Neighbors model using 95 of the data points. You evaluate k ranging
from 1 to 95. For each k, you calculate the test error rate on the remaining 55 data points with results shown in the
rightmost graph.
+4
•o 00
o 0 4-
000 o
o 0 0 0 0 0 ++
o +
O00 a A di+ ++A+
O 0 0 + 44
re)
00 000 • AA 4+4 +A4+ ++ -H-
o AA AA A+A A
+11+ A+-1-3-6 a + +
a 4, a ++
AAA +
+A AA+ 4 +
a A a
4 A
1 1
4.5 50 5.5 60 65 7.0 7.5 80
Feature I
Determine the cause of the rapid increase in error rate between k equals 70 and k equals 95.
(A) As k increases the K-Nearest Neighbors algorithm performs better.
(B) As k increases the K-Nearest Neighbors algorithm performs worse.
(C) As k approaches 95, all data points are predicted to have the same classification.
(D) As k approaches 95, all data points are incorrectly classified.
(E) There is no clear relationship between the value of k and the error rate.
15.14. Nr A continuous variable Y is modeled as a function of X using KNN with K = 3. You are given following
data:
X 5 8 15 22 30
Y 4 1 10 16 30
Calculate the fitted value of Y at X = 12.
15.15. •11. A continuous variable Y is modeled as a function of X using KNN with K = 2. You are given the
following data:
X 4 7 12 14 15 21 22
Y 3 8 15 22 , 30 40 53
Cross-validation is used to test the fit. The training set consists of the observations with X = 4,12,14,15,22.
(Th Calculate the mean square error on the training data and on the test data.

Copyright 02022 ASM
Solutions
15.1. The Bayes rule predicts nonrenewal for policies with the company for one year and renewal otherwise. There
are 1000 policies total; 45% with the company for one year, 24% for two years, 20% for three years, and 11% for four
or more years. The error rate is 1 minus the probability of the right prediction, the probability of nonrenewal for
policies with the company for one year or renewal for policies with the company for two years or more. The Bayes
error rate is
0.45(0.4) + 0.24(0.4) + 0.20(0.3) + 0.11(0.25) = 0.3635
15.2. The Bayes prediction is the most likely outcome given the class: win for Class B and up, draw for Class C,
lose for Class D and Class E. The average error rate is the sum of the probabilities of error (not selecting the correct
outcome) times the proportion in the class:
0.01(0.01) + 0.05(0.10) + 0.10(0.21) + 0.20(0.50) + 0.30(0.63) + 0.20(0.51) + 0.14(0.16) = 0.4395
15.3. The Bayes decision boundary is h = 150, since the probability of passing is 50% when h = 150. Above h = 150
we predict the person passes; below h = 150 we predict the person fails. Notice that yevh- = 1 151g4°.1„ and that the
density function of the uniform distribution between 100 and 900 is 1/800. The Bayes error rate is
150
1 150
Error Rate =
800
1
(f (1
too ) 1.50900150 dh )
150 + h
dh +
150
150 + h
1
800 (50 — (1501n(150 + /2)1 ) + (1501n(150 + h)19°°))
100 150
= —
800 (50 — 150(ln 300— In 250) + 150(In 1050— In 300))
1
=
(50 — 150 In 1.2 + 1501n 3.5) = 0.2632
800
15.4. The decision boundary is the set of points for which the Pr(Y = I) = 0.5.
x21
=0.5
x2 + 44
I
X2 = 0.5X2 + 2411
0.54 = 24
x2 = ±0.5xi
The Bayes decision boundary is the pair of lines x2 = ±0.5x1. Here is the graph showing the Bayes decision boundary
and the predictions:
x2
X1

15.5. Notice that the probability that a voter will vote for Mae is
x 100x + 3600 100x + X2 + 100X + 3600 6400
1 = 1
100 + x (100 + x)2 x2 + 200x + 10,000 (100 + x)2
We have to determine when each of the probabilities is the largest.. The probability of Susan is greater than the
probability of Jack when
100x + 3600
100 + x (100 + x)2
100x + x2 > 100x + 3600
x2 > 3600
x > 60
The probability of Susan is greater than the probability of Mae when

6400
100 + x (100 + x)2
100x + x2> 6400
x2 + 100x — 6400 > 0
x > —100 +2 35,600 = 44.34

The probability of Jack is greater than the probability of Mae when
100x + 3600 6400
>
(100 + x)2 (100 + x)2
100x + 3600> 6400
x > 28
We conclude that one decision boundary is E1; below 28 Mae is most likely. The other decision boundary is
!
Ifil; above 60, Susan is most likely. Between 28 and 60, Jack is most likely.
15.6.
I. The method is most flexible for small KX

II. The training error is 0 for K = 1 and usually increases as K grows./
III. The test error initially decreases as K increases until it reaches its minimum value, then it increases.X
15.7. The squares of distances of (3,3) to the points are
(Xi, X2) Square Distance to (3,3)

(1,2) 5
(1,6) 13
(2,2) 2
(3,4) 1
(4,1) 5
(4,5) 5
(5,3) 4
h
T e three closest points are (2,2) (A), (3,4) (B), and (5,3) (B), so E is assigned.

Copyright 02022 ASM
15.8. For X <4, the three closest points are 1, 2, and 4, and the majority of those have Y = 0, so 0 is assigned. For
4 < X <6.5, the closest points are 2, 4, and 7, and once again 0 is assigned. For 6.5 < X <8, the closest points are
4, 7, and 11, and the majority of those have Y = 1, so 1 is assigned. For 8 <X <11, the closest points are 7, 11, and
12, and the majority of those have Y = 0, so 0 is assigned. For X > 11, the closest points are 11, 12, and 15, so 1 is
assigned.
15.9. All 7 points are used, and the majority of Y values is 0, so 0 is always assigned. This is correct at 4 of the
points and incorrect at the other 3, so the error rate is 3/7
15.10. The nearest point to (3,2) is i = 6, and Y is U there, so in Model I Y = U. The 3 nearest points are i = 1,6,8,
with two Ps and one U, so in Model II Y = P. The 7 nearest points are every point except i = 5, with two Ps, three
Us, and one N, so in Model III Y = U. (D)
15.11. The 5 closest points, based on the distances in the last column, are at distances 0.0, 1.4, 2.0, 2.8, 2.8, and the
Y column has Yes, No, No, Yes, Yes for those 5 lines respectively, so the probability is 3/5= 0.6 . (C)
15.12. The Bayes error rate is the average of 1 - max(Pr(Y = j I X)) over all values of X. Here, the Bayes decision
boundary is
e'2= 0.5
lx1=V =0.832555
All 3 values of X in the test data set are greater than 0.832555 in absolute value, so Pr(Y = 1 X) is maximized for
j = T. We therefore need to average the complements of those probabilities, or e_'2,over the three xis:
e-(-1.3)2 c0.92 e-1.22
Bayes error rate = 0.184520 + 0.444858 + 0.236928 = 0.288768
3 3
Notice that the Bayes error rate does not depend on the yi observations in the test data.
The three nearest neighbors of -1.3 in the training data are -1.60, -1.50, and -0.60, with two Ts and one F, or
a prediction of T. The three nearest neighbors of 0.9 are 0.70, 1.20, and 1.30, with three Ts, or a prediction of T. The
three nearest neighbors of 1.2 are 0.70, 1.20, and 1.30, with three Ts, or a prediction of T. Only the prediction for 1.2
is correct, so the error rate is 2/3. Then 2/3 - 0.288768 = 0.377898. (B)
15.13. Since the training data set is 95 points, all points are classified based on distance to some of those 95 points.
If k = 95, then the 95 nearest points are the entire training data set and all points are classified as belonging to the
class of the majority of those 95 points. (C)
15.14. The three nearest X values are 5, 8, and 15, and Y = 4,1,10 at those points, so the fitted value is the average
of those three numbers, or (4+ 1 + 10)/3 = E.
15.15. We calculate the fitted values at all seven Xs using only the five Xs in the training set.
Nearest two points

Xiin training set Fitted value Y1 ei
4 4,12 3+15 _ a 3 6
2 — '
7 4,12 3+15 _ a 8 1
2 — '
12 12,14 15.22 = 18.5 15 3.5
14 14,15 2213° = 26 22 4
15 14,15 22-r° = 26 30 -4
21 15,22 3°42.53 = 41.5 40 1.5
22 15,22 313+53
2
- 41.5 53 -11.5
The MSE on the training data is

62+ 3.52 + 42 + (-4)2 + (-11.5)2 42.5
5

We divide by 5 since no parameters are estimated.

The MSE on the test data is
12+1.52 = _1.625
2


Copyright E2022 ASM
000
Not all important questions are mathematical; creative thinkins is an important skill to practice!
In order to protect
themselves from poachers,
African Elephants have been
evolving without tusks, which
unfortunately also hurts their
species.
aisim
Lesson 16
Decision Trees
Reading: An Introduction to Statistical Learning 8. If using second edition, skip subsections 8.2.4, 8.25., and 8.3.5
16.1 Building decision trees

Decision trees are a non-parametric alternative to regression. They are reminiscent of KNN, in that they split the
predictors into regions and then assign the average value of the region in the regression setting and the most common
value in the classification setting. The regions are always multidimensional rectangles.
As a simple example, suppose we want to estimate the values of houses. The explanatory variables we'll use are
the style of the house (ranch, Cape Cod, or colonial), square footage, and number of bedrooms. We can build a tree
by a series of questions that splits the observations into regions, and then assign the average house value for each
region. The first question can be "Is the house a ranch?" This splits the tree into two branches, "Yes" and "No". At
the "Yes" node, we can ask "Is the square footage greater than 1500?" and split the tree into two branches based on
the response to that. If the house is not a ranch, the next question can be "Is the house a Cape Cod?" If the answer
to that question is "No", we can ask "Does the house have more than 2 bedrooms?" With this series of questions,
the following tree results, where left branches are "Yes" and right branches are "No".
Rapch
Footage > 1500 Cape Cod
Bedro s > 2
The tree is drawn upside down. The nodes on the bottom are called leaves or terminal nodes. The nodes in the tree
that are split into two branches are called intermediate nodes. Notice the following regarding the tree:
• The predictors can be categorical, count, or continuous variables. For continuous variables, a cut point must
be selected. Here we arbitrarily selected 1500 for the cut point of square footage. We will discuss how to select
the cut point.
• Every split is binary. To split the 3-category categorical variable House Style, we first split it into two categories,
then split one of those two into two categories.
• Questions may be different at each node. We can ask for square footage of ranches and number of bedrooms
of colonials, and not ask any question for Cape Cods. If we had asked about square footage for Cape Cods,
the cut point may be different than the cut point for ranches.

Copyright 02022 ASM
274 16. DECISION TREES
Drawing the regions of the above tree requires three dimensions, since there are three variables. If we only consider
House Style and footage, arid prune the bedroom node off the tree, a graph of the regions would look like this:
1500
8
Ranch Colonial Cape Cod

Style
As we said above, the fitted values at the terminal nodes are the averages of all values in the region for continuous
responses, and the most common value for categorical responses.
Let's now limit the discussion to continuous responses.
The optimal tree for a continuous response is the one that minimizes the mean squared error
(Yi — )2
j=1.
where Ri is the ith region and there are J regions. However, it would not be possible to calculate MSE for every
Lir possible split into regions due to the huge number of possible splits. Instead, trees are grown by recursive binary
e-r splitting. The algorithm selects a binary split that minimizes the MSE. This algorithm is greedy in that it only
optimizes the current split, and does not take into account that a split that is not optimal at the current iteration may
lead to a better overall tree at a later step.' At each step, we select a region Ri to split, a predictor Xk, and a cut point
s so that the split into Rh and Ri, minimizes
(Yi — 9R11 )2 + (yi- gRid2

x, E Rh
The algorithm continues until the number of observations in a region is below a prespecified number; for example,
until there are fewer than 5 observations.
The resulting tree is probably too big. More splits means more flexibility, lower bias, higher variance, and as
usual, there is an optimal number of splits that minimizes test MSE. We therefore prune the tree, using a method
similar to the one discussed in Section 8.1 for the lasso. This pruning method is called cost complexity pruning or
446 weakest link pruning. We specify a tuning parameter a. The cost of a tree is a per terminal node. For each value of a,
we prune the tree to minimize
IT'
(yi- fiR.)2+ aiTi (16.1)

m=1 i:xt€Rni
•••• where ITI is the number of terminal nodes. We then select the value of a that is optimal based on cross-validation.
The higher the a, the smaller the tree. The subtrees produced this way are a nested sequence, so it is easy to produce
the series of subtrees as a function of a.
Cross-validation is a little unusual here. We first select a validation set as the training set; in the textbook's
examples, the validation set consists of half the observations. We build a tree with the training data, ignoring the
test data. We then use N-fold cross-validation on pruning with a; we determine the optimal number of terminal
nodes for each a using cross-validation. The cross-validation does not use the test data. The cross-validation reports
'We previously discussed algorithms that are greedy when we discussed forward and backward stepwise subset selection in Section 7.1.

Copyright g2022 ASM
16.1. BUILDING DECISION TREES 275
0.9 —
0.8 —
0.7 — •M1
cross-entropy
0.6 —
0.5
0.4 —
0.3 —
Gini index
0.2—
0.1 —
0
0 0.1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 1
Figure 16.1: Gini index and cross-entropy as a function of pink when there are two classifications.
the MSE for each number of terminal nodes. After we determine the best a (or the optimal number of terminal
nodes), we calculate the MSE of the pruned tree on the test data.
We'll now discuss optimizing classification trees.
For classification trees, instead of mean squared error, one might consider using the classification error rate as
the number to minimize:
EE Pr(yjgizi)
j=1 iERj
For a single region Rj and classes 1, 2, ... , K, y'Rj is the most common class. Let pjk be the proportion of observations
in R./ for which yj = k. Then the classification error rate for region Rj is
Ej = 1 — max /3 jk (16.2)
But this measure is not sufficiently sensitive for tree growing. Instead, either the Gini index or the cross-entropy is
used. For a single region Rj, the Gini index is the variance of the observations: •41
(16.3)
The cross-entropy for region R1 is •41°
Dj = —Epikinpik
k=1
(16.4)
Since the logarithm of Pik, a number between 0 and 1, is negative, D3 is positive. Cross-entropy and Gini index are
close numerically. Both of them measure node purity; they are minimized when pik is close to 0 or 1, so they are
minimized when all the classes are right or when they are all wrong. Neither measure takes into account the fitted
classifica tion.2
Figure 16.1 shows how the Gini index and cross-entropy vary with pink when there are two classifications.3
These are the definitions of Gird index and cross-entropy for a single region. When splitting a tree at a node, you
need to compute these measures for both split regions overall. The measure for a set of regions equals the weighted
average of the measures for each of the regions, with weights being the proportions of observations in each region.
21n cross-entropy, Pik in pik is treated as 0 when pik = 0.
3This is part of An introduction to Statistical Learning exercise 8.3.
Exam SRM Study Manuat

Copyright 02022 ASM
EXAMPLE 16A A categorical response variable Y has values A, B, and C. There is one explanatory variable X.
X 0 0 1 2 3 5 6 9 10 12
Y A B A A B B C B C C
1. Calculate the Gini index and the cross-entropy.

2. The region is split into X <6 and X 6.
Calculate the overall Gini index and cross-entropy after the split.
SoLurioN: 1. The proportions of observations in each class are 0.3 in A, 0.4 in B, and 0.3 in C. Then
G = (0.3)(0.7) + (0.4)(0.6) + (0.3)(0.7) = 0.66
D = -0.31n 0.3 - 0.41n 0.4 - 0.31n 0.3 = 1.088900
2. The region X <6 has 6 observations and gets 0.6 weight the region X 6 has 4 observations and gets 0.4
weight. For X <6, the proportions of observations in each class are 0.5 in A, 0.5 in B. For X 6, the proportions
of observations in each class are 0.25 in 13, 0.75 in C.
G = 0.6((0.5)(0.5) + (0.5)(0.5)) + 0.4((0.25)(0.75) + (0.75)(0.25)) = 0.45

D = -0.6(0.5 In 0.5 + 0.51n0.5) - 0.4 (0.25 In 0.25 + 0.75 In 0.75) = 0.640822
Notice that for the Gini index, when there are only two classes, both products in the sum Zi pii(1 po are the
same. This is true for both split regions. 0
The Gini index or cross-entropy are used for splitting the tree. But for pruning the tree, while any of the three
measures may be used, it is best to use classification error rate as the criterion if predictive accuracy is desired. The
cross-validation method, which involves a tree built with training data and cross-validation for the pruning of this
tree, is the same as the method for regression trees.
The use of the Gini index or cross-entropy for splitting may result in split nodes with the same predicted class.
Even though the predicted class is the same, node purity may be increased. For example, suppose there are two
classes A and B, with 16 As and 4 Bs. Split it into one region with 10 As and another region with 6 As and 4 Bs. Both
splits will predict A. The Gini index before the split is
(0.8)(0.2) + (0.2)(0.8) = 0.32
After the split, each region has half the observations and gets a weight of 1/2. The Gini index is
0.5 ((1)(0) + (0)(1)) + 0.5((0.6)(0.4) + (0.4)(0.6)) = 0.24

which is less than before the split.
•Nt. R provides a residual mean deviance measure for trees. For regression trees, it is the RSS divided by n - IT I, where
as usual n is the number of observations and I TI is the number of terminal nodes. For classification trees, it is
Residual mean deviance - 2 /n1 k 11m k in Ps" (16.5)

n - IT I
where n,„k is the number of observations in class k in terminal node in and /3„, is n,„k/n (as it was when we
discussed the cross-entropy.)
el? How do decision trees compare with linear regression? If the assumptions of linear regression are satisfied,
linear regression will do a better job. In fact, linear regression usually does a better job than decision trees. But if
the underlying variable is complex and nonlinear, decision trees may be better. If the decision boundary is a set of
horizontal and vertical lines, decision trees will capture that relationship better than a linear model.
Decision trees have the following advantages over linear models:

16.2. BAGGING, RANDOM FORESTS, BOOSTING 277
1. Easier to explain.
2. Closer to the way human decisions are made.
3. Tree can be graphed, making it easier to interpret.
4. Easier to handle categorical predictors; linear regression requires dummy variables.
However, decision trees suffer from two shortcomings: they do not predict as well, and they are not robust. Small
changes to input data can have big effects on trees. The methods of the next section address these shortcomings.
16.2 Bagging, random forests, boosting

The method of decision trees that we have discussed suffers from high variance. If data is split into two sets and
each set is used as a training set, the results may be quite different. The methods we discuss in this section, bagging, '41
random forests, and boosting, modify the decision trees method to lower variance.
However, the methods discussed in this section require heavy computation. I'd expect exam questions on these
methods to be purely knowledge, rather than computation, questions.
16.2.1 Bagging
Bagging is a form of bootstrapping. In fact, bagging stands for bootstrap aggregation. Strangely, bootstrapping is not •Nr
on the syllabus, but you need to understand the bootstrapping concept to understand bagging, so here's a short
explanation.
The basic idea of bootstrapping is that we often want to know something about the underlying distribution,
but we don't know the underlying distribution. To get around that problem, use the empirical distribution, the
observations that we have, as the underlying distribution. To learn something about the underlying distribution,
simulate off the sample. (This sounds crazy at first, since the sample is already random, but get used to it!)
Simulating off the sample of size n means drawing n items from the sample with replacement. (If it weren't
with replacement, we know what those n items would be.) For example, if you're observing claim sizes from some
insurance, and you have 5 claim sizes: (1000, 2000, 4000, 6000, 9000), you may randomly draw bootstrap samples
like:
(4000, 1000,4000, 6000, 1000)

(9000, 2000, 1000, 4000, 2000)
(1000, 9000, 9000, 2000, 1000)
Of course in real situations, the sample has hundreds or thousands of items, and you draw hundreds or thousands
of samples.
And that's what bagging is about! Take your training set, with n observations, and select B bootstrap samples
from it. Construct trees, using the algorithms we discussed in the previous section. If you are interested in the
response for x = {x1, x2, , xk}, a vector of values for the k predictors, calculate the corresponding y for each of the
B trees and average them.
1 B
Aas(x) = f (x)
1)=1
where ](x) is the value of the response for tree b. By averaging the trees, variance is reduced. Remember what you
learned in probability: the variance of the sample mean for an independent sample is the variance of the distribution
divided by the size of the sample. Bagging divides the variance of a single tree by B, assuming that the trees are
independent. They probably are not independent, and we'll discuss how to fix that problem in the next subsection.
If the response variable is categorical, we can set the bagged value equal to the most commonly predicted value
in the B trees.
There is no danger of overfitting by making B too large.

Income
Age
CreditHist
Experience
0 20 40 60 80 100
Figure 16.2: Variable importance plot
One can show that for n sufficiently large, about 1/3 of the items in the sample are not used in a bootstrap sample.
This allows "out-of-bag (008) validation". For each tree, the test MSE or error rate may be computed using the
out-of-bag part of the sample, the items that were not used to build the tree. This is very convenient, eliminating
the need for cross-validation. It can be shown that for B sufficiently large, the 00B error is virtually equivalent to
the leave-one-out cross-validation error.
Unfortunately, bagging makes the model difficult to interpret. However, to measure the importance of predictors,
one can calculate the amount that the RSS (for regression trees) or Gini index (for classification trees) is decreased as
64/ a result of bringing in that predictor, and average that amount over all trees. A variable importance plot summarizes
this information. A sample of such a plot for a hypothetical tree based on 4 variables is shown in Figure 16.2. In
this graph, the predictor with the largest average decrease is shown as a bar going to 100, and the bars for the other
predictors are proportionate to that bar. For example, if the best predictor's average decrease in RSS was 300 and
another predictor's average RSS decrease was 99, that predictor's average would be shown with a bar going to 33.
As stated above, the average decrease in Gini index would be used for classification trees.4
16.2.2 Random forests
Bagged trees may be correlated. An important predictor may appear in all trees regardless of which sample values
s: were used to build the tree. To correct this problem, in a random forest, a positive integer in is specified. At
each split, In predictors are selected randomly, and those are the only predictors that are considered for splitting.
•
The trees are decorrelated. Typically, in -6., where k is the number of predictors. If in = k, the random forests
method reduces to bagging. There is no danger of overfitting by making B too large.
16.2.3 Boosting
We will only discuss boosting in a regression setting, that is, for continuous response variables.
Boosting starts off with a small tree, and adds in a portion of the response. It then recursively builds small trees
based on the residuals from the existing model, and adds a portion of those results to the response. So it learns
slowly.
air Boosting has three parameters:
1. B, a positive integer, the number of cycles. Unlike bagging and random forests, making B too large will overfit
the model; it is selected using cross-validation.
2. A, a positive real number no greater than 1, the shrinkage parameter. It is the proportion of the response that
is added in at each cycle.
3. d, a positive integer, the number of splits for each tree. It is the number of terminal nodes minus 1. If d = 1,
there will be only one split, and the resulting model will be an additive model with no interaction between
predictors. So d is the depth of interaction.
4In An Introduction to Statistical Learning Figure 8.9 and the accompanying discussion, it says that the mean decrease in Gini index is plotted.
On the other hand, on page 330, it says that the average of the decrease in deviance is plotted, and on page 325, the deviance for a classification
tree is defined as the cross-entropy.

26.2. BAGGING, RANDOM FORESTS, BOOSTING 279
Cost complexity pruning

ITt
(yi - 9R)2 arri (16.1)

m=l krieR„,
Classification error rate
Ej = 1 — max Pik (16.2)
Gini index for one region
k=1
Cross-entropy for one region
Dj = — Pjk In Pjk (16.4)

k=1
To compute overall Gini index or cross-entropy, use a weighted average of their values in each region. The
weights are the proportions of observations in each region.
Residual mean deviance for classification trees
Residual mean deviance =

2 (Zur .Ek mk in Pnik)
11 — IT]
(16.5)
Bagging Build B trees using B bootstrap samples from the data, then use their average.
Out-of-bag validation Use the observations outside the bootstrap sample to calculate error measures.
Random forests Use bagging, but at each split only allow in predictors, with in;LI IN typically.
Boosting Gradually obtain prediction by building trees on residuals with d splits, shrinking results by factor A,
adding results to current prediction, and subtracting results from residuals.

Copyright 02022 ASM
Table 16.2: Summary of bias-variance tradeoff for various methods
Method Bias Variance Remarks

Linear regression
Increase k (number of parameters) I T
KNN
Increase K T I
k-fold CV
Increase k I T LOOCV has k = n. Test error is
overestimated
Ridge regression St lasso
Increase A (tuning parameter) T I Ridge regression does not select
Increase $ (budget parameter) I T features. Lasso selects features.
Principal component regression
(PCR)
Increase number of components I T Unsupervised
Partial least squares (PLS)
Increase number of components I T Supervised
PLS versus PCR PLS has lower bias but higher
variance
Decision trees
Higher IT! I T
Cost complexity pruning
Higher a T I
Bagging Cannot overfit regardless of B
Random forests Cannot overfit regardless of B.
Lower in reduces correlation of
trees.
Boosting Can overfit if B is too high relative

to tuning parameter A.

r"-- Exercises
16.1. 'sir [SRM Sample Question #291 Determine which of the following considerations may make decision trees
preferable to other statistical methods.
I. Decision trees are easily interpretable.
II. Decision trees can be displayed graphically.
III. Decision trees are easier to explain than linear regression models.
(A) None (13) I and II only (C) I and III only (D) II and III only
16.2. '141 [SRM Sample Question #33] The regression tree shown below was produced from a dataset of auto claim
payments. Age Category (1, 2, 3, 4, 5, 6) and Vehicle Age (1, 2, 3, 4) are both predictor variables, and log of claim
amount (LCA) is the dependent variable.
agecal > 1.5 <1.5
> 2.5
agecat > 4.5

1 <4.5
veh_a >3.5
8.028 7.995 8.146

=
106., ,srt = 801, (n. = 246)
Consider three autos I, II, III:
I. An Auto in Age Category 1 and Vehicle Age 4

II. An Auto in Age Category 5 and Vehicle Age 5
III. An Auto in Age Category 5 and Vehicle Age 3
Rank the estimated LCA of Autos I, H, and III.
(A) LCA(I) < LCA(II) < LCA(III)
(B) LCA(I) < LCA(III) < LCA(II)
(C) LCA(II) < LCA(I) < LCA(III)
(D) LCA(II) < LCA(III) < LCA(I)
(E) LCA(III) < LCA(II) < LCA(I)
l
f

16.3. '',46 [SRM Sample Question #511 You are given the following regression tree predicting the weight of ducks
in kilograms (kg): \--2
Age > 7.5 Years

[n [yes]
Gender = 1 (Male) Gender = 1 (Male)

[no] [yes] [no] [yes]
Wing Span > 5.3 Crn

[no] [yes]
0.80 kg 0.90 kg 0.95 kg
1.10 kg 1.25 kg
You predict the weight of the following three ducks:

X: Wing Span = 5.5 cm, Male, Age = 7 years
Y: Wing Span = 5.8 cm, Female, Age =5 years
Z: Wing Span = 5.7 cm, Male, Age = 8 years
Determine the order of the predicted weights of the three ducks.
(A) X <Y < Z (B)X<Z<Y (C)Y<X<Z (D) Y<Z<X (E)Z<X<Y
16.4. f A regression tree is being constructed. There is one explanatory variable X and the response variable
is Y. Four observations of (X, Y) are:
(0,8) (1,5) (3,8) (6,6)
After one iteration of recursive binary splitting, the observations are split into two groups.
Determine the members of the two groups.
Exam SRM Study Manual Exercises continue on the next page.. .

Copyright 02022 ASM
16.5. alir [MAS-II Sample:12] A data set contains six observations for two predictor variables, X1 and X2, and a
response variable, Y.
Xi X2 Y
1 0 1.2
2 1 2.1
3 2 1.5
4 1 3.0
2 2 2.0
1 1 1.6
A regression tree is constructed using recursive binary splitting. A split is denoted
Ri(j,$) = {X I Xi < s} and R2(j,$)= {X I Xi s}

The following five splits are analyzed.
I. R1(1,1) = {X IXi < 1} and R2(1,1) = {X J X1 > 1}
II. Ri(1,4) = {X lxi <4} and R2(1,4) = {X 1 X1 4}
III. R1(2,0) = {X 1 X2 < 0} and R2(2,0) = {X [X2
IV. Ri(2,1) = {X 1 X2 < 1} and R2(2,1) = {X 1 X2 1.}
V. R1 (2, = {X I X2 < 2} and R2(2,2) = {X I X2 2}
Determine which split is chosen first.

(A) I (B) II (C) III (D) IV (E) V

Copyright 02022 ASM
16.6. a: [SRM Sample Question #48] The following tree was constructed using recursive binary splitting with
the left branch indicating that the inequality is true. j
21 <t1
X2 <t2
A
Si <t3
Determine which of the following plots represents the tree.
A A
t3
Xi C Xi
(A) ti B (B) t3
t2 t2
X2 X2
A A
t3
X2 D
(D) t1 B
t2
xl t2xi
t2
X2

Copyright 6)2022 ASM
(---- 16.7. 1141 [SRM Sample Question #571 You are given
(i) The following observed values of the response variable, R, and predictor variables X, Y, Z:
R 4.75 4.67 4.67 4.56 4.53 3.91 3.90 3.90 3.89
X M F M F M F F M M
Y A A D ID B C B D B
Z 2 4 1 3 2 2 _
5 5 1
(ii) The following plot of the corresponding regression tree:

Z<3
Y = A,B
X=F
Ti T2
T3 T4
Calculate the Mean Response (MR) for each of the end nodes.
(A) MR(T1) = 4.39, MR(T2) = 4.38, MR(T3) = 4.29, MR(T4) = 3.90
(B) MR(T1) = 4.26, MR(T2) = 4.38, MR(T3) = 4.62, MR(T4) = 3.90
n (C) MR(T1) = 4.26, MR(T2) = 4.39, MR(T3) = 3.90, MR(T4) = 4.29
(D) MR(T1) = 4.64, MR(T2) = 4.29, MR(T3) = 4.38, MR(T4) = 3.90
(E) MR(T1) = 4.64, MR(T2) = 4.38, MR(T3) = 4.39, MR(T4) = 3.90
Exam snit Study Manual Exercises continue on the next page . . .

Copyright t2O22 ASM
16.8. '45 [MAS-II-F18:38] You are given the following unpruned decision tree:
82
20 58
The values at each terminal node are the residual sums of squares (RSS) at that node. The table below gives the
RSS at nodes S, T, and X if the table was pruned at those nodes:
Node RSS
251
209
X 86
The RSS for the null model is 486. You use the cost complexity priming algorithm with the tuning parameter, a,
equal to 9 in order to evaluate the following pruning strategies:
No nodes pruned
Prune node S only
Prune node T only
Prune node X only
Prune both nodes S and X
Determine which pruning strategy is selected.

(A) I (B) II (C) III (D) IV (E) V
16.9.s: A classification tree is constructed to predict whether students will pass Exam SRM on their first try.
Two binary explanatory variables are used: X1 (Did the student take a statistics course in college?) and X2 (Did the
student pass STAM?) Available data is as follows:
= 0, X2 = 0 5 passed on their first try, 20 didn't
= 1,X2 = 0 10 passed on their first try, 10 didn't
Xi = 0,X2 = 1 7 passed on their first try, 9 didn't
Xi = 1,X2 = 1 6 passed on their first try, 4 didn't
Determine the first split made in the tree using the Gini index.
16.10. •41 For a regression tree, two nodes of the tree have the following response values:
R4: 2, 3, 3, 4, 5
R5: 2, 4, 6
The tree is pruned using cost complexity pruning. The split of R4 and R5 is the optimal one to prune.
Determine the smallest value of a for which pruning will occur.
16.11. •-: [SRM Sample Question #25] Determine which of the following statements concerning decision tree
pruning is/are true.
I. The recursive binary splitting method can lead to overfitting the data.
II. A tree with more splits tends to have lower variance.
III. When using the cost complexity pruning method, a = 0 results in a very large tree.
16.12. "4" [SRM Sample Question #411 For a random forest, let p be the total number of features and in be the
number of features selected at each split.
Determine which of the following statements is/are true.
I. When in = p, random forest and bagging are the same procedure.
IL is the probability a split will not consider the strongest predictor.
P
III. The typical choice of in is
(A) None of I, II, or III is true

(B) I and II only
(C) I and III only
(D) II and III only

A classification tree is used to predict classification based on region. You are given the following numbers of
observations in each class and region:
Region Class A Class B Class C

I 72 22 6
II 55 25 10
III 20 32 28
16.13. 1.46 Calculate the classification error rate for Region I.

16.14. *4? Calculate the Gini index for Region II.
16.15. ari. Calculate the cross-entropy of Region I.
Exam SRM Study Manual Exercises continue On the next page . . .

Copyright 02022 ASM
16.16. [MAS-II-F18:401 You are given the following classification decision tree and data set:
Xi <21 i Y
X1 X2
1 12 Y T
2 23 N'F
= Y 3 4 Y F
4 32 Y F
5 22 N T
6 30 Y T
7 18 N T
Determine the relationship between the classification error rate, the Gini index, and the cross-entropy, summed
across all nodes.
(A) cross-entropy > Gini index > classification error rate

(B) cross-entropy > Gini index = classification error rate
(C) classification error rate > Gini index > cross-entropy
(D) Gini index > cross-entropy > classification error rate
(E) The answer is not given by (A), (B), (C), or (D).
16.17. "I [SRM Sample Question #9] A classification tree is being constructed to predict if an insurance policy
will lapse. A random sample of 100 policies contains 30 that lapsed. You are considering two splits:
Split 1: One node has 20 observations with 12 lapses and one node has 80 observations with 18 lapses.
Split 2: One node has 10 observations with 8 lapses and one node has 90 observations with 22 lapses.
The total Gini index after a split is the weighted average of the Gini index at each node, with the weights
proportional to the number of observations in each node.
The total entropy after a split is the weighted average of the entropy at each node, with the weights proportional
to the number of observations in each node.

I. Split 1 is preferred based on the total Gini index.
II. Split 1 is preferred based on the total entropy.
III. Split 1 is preferred based on having fewer classification errors.

Copyright 02022 ASM
16.18. [SRM Sample Question #501 Determine which of the following statements regarding statistical learning
methods is/are true.
I. Methods that are highly interpretable are more likely to be highly flexible.
II. When inference is the goal, there are clear advantages to using a lasso method versus a bagging method.
III. Using a more flexible method will produce a more accurate prediction against unseen data.
(A) I only
(B) II only
(C) III only
(D) I, II, and III
16.19. [SRM Sample Question #10) Determine which of the following statements about random forests is/are
true.
I. If the number of predictors used at each split is equal to the total number of available predictors, the result is
the same as using bagging.
II. When building a specific tree, the same subset of predictor variables is used at each split.
III. Random forests are an improvement over bagging because the trees are decorrelated.

Copyright P2022 ASM
16.20. ": [SRM Sample Question #39] You are given a dataset with two variables, which is graphed below. You
want to predict y using x.
Determine which statement regarding using a generalized linear model (GLM) or a random forest is true.
o 0 o
U, o(93
o 0
oo o0 0 8)88 0 0 00 8 o
o2_ 0 o
CV 0 g°% ° o GEo-v
oo
0, co
ci2P'
>. O o 0o
o o co
-4 -2 0 2 4
(A) A random forest is appropriate because the dataset contains only quantitative variables.
(B) A random forest is appropriate because the data does not follow a straight line.
(C) A GLM is not appropriate because the variance of y given x is not constant.
(D) A random forest is appropriate because there is a clear relationship between y and x.
(E) A GLM is appropriate because it can accommodate polynomial relationships.

n'16.21.'141' EMAS-II-F18:391 An actuary creates three tree-based models using bagging, boosting, and random
forests. The error on a test data set, as a function of the number of trees in each model, is plotted on the graph below.
0.55
0.5
0.4 • •••••••••11"0••••••••••••41...P•Po••••.# m•••••~•=0.• •••04•••••••••••••••••••••••••" ".••
%
gimpapjaira, eta 0. • w VP fp& lib Oita sir elm es%
0.35
0.3
Determine the type of model most likely to have created each of the lines on the graph.
(A) I: Boosting, II: Bagging, III: Random forest
(B) I: Bagging, II: Boosting, III: Random forest
(C) I: Bagging, II: Random forest, III: Boosting
(D) I: Random forest, II: Bagging, III: Boosting

Copyright 02022 ASM
292 M. DECISION TREES
16.22. '',416 [MAS-II-S19:39] A boosted tree model is defined by:

• A = 0.2
• The following four trees:
X2 < 1 X4 <2.67
6.4 4.34
X3 = Y X3 = Y
1
4.83 4.75
Xi = Y X4 <3
10 3.14 6.67
X7 <8
31.5 8.81
X7 <6.5 I X&
0.34 5.89
=Y X3 = Y
2.81 5.9
X3 = Y X4 <2
8.1 6.3 7.41 14.1
You are given the following record:
X1 X2 X3 X4. X5 X6 X7
N 6 Y 4 0.5 0.25 6
Calculate the prediction of the boosted tree model for this record.
(A) Less than 2
(E) At least 11

Copyright Q2022 ASM
EXERCISES FOR LESSON /6 293
(---- 16.23. a•: [SRM Sample Question #261Each picture below represents a two dimensional space where observations
are classified into two categories. The categories are represented by light and dark shading. A classification tree is
to be constructed for each space.
Determine which space can be modeled with no error by a classification tree.
I. II.
(A) I only (B) II only (C) III only (D) 1, II and III
16.24. s: [SRM Sample Question #12] Determine which of the following statements is true.
(A) Linear regression is a flexible approach
(B) Lasso is more flexible than a linear regression approach
(C) Bagging is a low flexibility approach
(D) There are methods that have high flexibility and are also easy to interpret
(E) None of (A), (B), (C), or (D) are true

16.25. f [MAS-II-F19:38] You are given the following three statements about tree-based methods for regression
and classification:
I. The main difference between bagging and random forests is the number of predictors considered at each step
in building individual trees.
II. Single decision tree models generally have higher variance than random forest models.
III. Random forests provide an improvement over bagging because trees in a random forest are less correlated than
those in bagged trees.
Determine which of the statements I, II, and III are true.
Solutions
16.1. All three statements are true. Decision trees do not require appreciation of the effects of coefficients, link
functions, etc., making them easily intepretable and easier to understand than linear regression. And they are
displayed graphically. (E)
16.2. For I, you immediately go right and end up at 8.146.
For II, you go left, then right since vehicle age is at least 2.5, then left because age category is at least 4.5, then
right since vehicle age is at least 3.5, and end up at 8.028.
For III, you do the same as II, except at the last juncture you go left since vehicle age is less than 3.5, and end up
at 7.771. (E)
16.3. For X, we go left at the Age node and then right at the Gender node, getting 0.90 kg.
For Y, we go left at the Age node and left at the Gender node, getting 0.8 kg.
For Z, we go right at the Age node, right at the Gender node, and right at the Wing Span node, getting 1.25 kg.
(C)
16.4. We minimize the RSS from using the mean of each group.
We can split at x = 0.5, at x = 2, or at x = 4.5. (Other splits can be made, such as x = 2.5, but they would be
equivalent to one of these three in that they split the 4 points up into the same groups.)
If the x = 0.5 split is made, (0,8) is split from the other observations. The mean response for the other
observations is 19/3 and the RSS is 3 times the variance of the three responses:
2
(5 213 ) + (8 \2± (6
) k 3) 19)2= 42
3
If the x = 2 split is made, (0,8) and (1,5) are put in the first group. Then the prediction is .84 5-= 6.5 for the first
group and = 7 for the second group, and the RSS is
2
2
± 2(12) = 6.5
If the x = 4.5 split is made, (6,6) is split from the other observations. The mean of the other observations is 7
and the RSS is
2(8 — 7)2 + (5 — 7)2 =6
So the first option is taken: (0,8) is in the first group and the other three observations are in the second group.

16.5. Splits I and III don't split at all; all observations go into R2. Split II puts (4,1) into R2 and everything else
into Ri. There is no error for (4,1), whereas the error of the other 5 is the square difference from the mean, or the
population (division by 5) variance, 0.1096, times 5/6. Split IV puts (1,0) into Ri and everything else into R2. Once
again, we can compute the MSE as the variance in R2, or 0.2824, times 5/6. Split V puts two observations, (3,2) and
(2,2), into R2 and the others into Ri. The variance of the observations in Ri is 0.451875 so the sum of squares is
4(0.451875) = 1.975. Dividing by 6 gets a mean square error higher than for Split II, and this is even before adding
in the error for R2. (B)
16.6. If xi > ti then A is chosen. If ti <Xl <t3 then D is chosen. That is enough to select (B). You can then verify
that it does the correct selection for x2 as well.
16.7. There are nine observations, and the nodes they correspond to are:
R 4.75 4.67 4.67 4.56 4.53 3.91 3.90 3.90 3.89
X M F M F M F F M M
Y A A D D B C B D B
Z 2 4 1 3 2 2 5 5 1
Ti T3 T2 T2 Ti T2 13 T4 Ti
The average of the three Ti nodes is

4.75 + 4.53 + 3.89
MR(T1) — = 4.39
3
and this already forces (A). But to check the others:

4.67 + 4.56 + 4.53
MR(T2) = = 4.38
3
4.67 + 3. 90
MR(T3) = = 4.285
2
MR(14) = 3.90
Using X2:
When X2 .= 0, 15 pass and 30 don't.
When X2 = 1, 13 pass and 13 don't.
The Gini index is

Copyright C2O22 ASM
16.10. If the rectangles are kept separate, the mean of each rectangle (3.4 for R4, 4 for R5) will be the fitted value,
and the square differences between the values and their means add up to:
(2 — 3.4)2 + (3 — 3.4)2 + • • • + (5 — 3.4)2 + (2 — 4)2 + (4 — 4)2 + (6 — 4)2 = 13.2

If the rectangles are put together, the mean value is 3.625 and the sum of square differences is now
(2— 3.625)2 + (3 — 3.625)2 + • + (6 — 3.625)2 = 13.875
Thus if a > 13.875 — 13.2 = 0.675 , the tree will be pruned.

16.11.
I. Too many splits may overfit the data.

II. The more splits, the higher the variance. X
HI. The cost function is WI, where T is the number of terminal nodes. Setting a = 0 means there is no cost for
splitting, resulting in a very large tree. s/
(C)
16.12.
I. This is true. To make a random forest different from bagging, set in

in predictors are selected at random out of p, so the strongest predictor's probability of being selected is m /p.
The probability it is not selected is 1 — ?nip./
in = -0 is typical. The SOA's solution mentions that in = p13 is also typical, but the textbook doesn't mention
that. X
(B)
16.13. The most common class for Region I is Class A, so the classification error rate is 1 — 72/(72 + 22 + 6) = 0.28
16.14. 551i _ 55) + 2511 _ 251 + 10!1 _ 10) 0.537037

•§ k 90) 90 k 90) 90 90 ) =
16.15. —(0.72 In 0.72 + 0.221n 0.22 + 0.061n0.06) = 0.738436
16.16. In this question, they expected you to sum the Gini index and cross-entropy over the nodes, rather than
compute a weighted average. It's a bit confusing that Y is used both as a variable name and as a value of X2.
Observations i = 1, 3, and 7 are in the Xi <21 node, and the most common value of Y for those three values of
i is T, which occurs twice, so the classification error is 1 — 2/3 = 1/3. Observations i = 4 and 6 are at the Xi 21,
X2 = Y node, with Y equal to T once and to F once, making the classification error 1/2. Observations i = 2 and i = 6
are at the X1 21, X2 = N node with Y equal to T once and to F once, making the classification error 1/2. Total
classification error is 1/3 + 1/2 + 1/2 = 4/3.
Gini index at first node is (1 /3)(2 /3)÷(2 /3)(1/3) = 4/9. At the second and third nodes it is (1/2)(1/2)+(1/2)(1/2) =
1/2. Sum is 4/9 + 1/2 + 1/2 = 13/9.
Cross entropy at first node is —(1/3) ln(1 /3)— (2/3) In(2/3), and at the second and third nodes it is —(1/2) ln(1 /2)—
(1/2) ln(1/2). The sum over the three nodes is 2.022809. (A)
16.17. For the Gini index, we start with Split 1. For the first node, for 20 observations, the majority are lapses, so
this node will be predicted to lapse. Then 12 of the 20 observations observations, or 0.6, are properly classified and
8 observations, or 0.4, are not. Thus the Gini index is (0.6)(0.4) + (0.4)(0.6) = 0.48. For the second node, the majority
are non-lapses, so non-lapse will pe predicted. 62 observations are non-lapse and are properly classified and 18 are
lapses and are not properly classified, so the Gini index is (0.775)(0.225) + (0.225)(0.775) = 0.34875. Weighting these
by proportions of observations, 0.2 in the first node and 0.8 in the second node, we get
0.2(0.48) + 0.8(0.34875) = 0.375

Copyright 02022 ASM
For Split 2, the node with 8 lapses will be classified as lapse and 8 of the 10 observations will be properly classified,
while the node with 22 lapses will be classified as non-lapse and 68 of the 90 observations will be properly classified.
We get
0.1 ((0.8)(0.2) + (0.2)(0.8)) + 0.9 I22) I22)90) fL8))

90) 90) k90
0.3644
Split 2 is preferred.
Split l's cross-entropy is
—0.2(0.6 In 0.6 + 0.4 in 0.4) — 0.8(0.775 In 0.775 + 0.225 In 0.225) = 0.5611
Split 2's cross entropy is
68 68 22 22
— (0.1(0.81n 0.8 + 0.2 hi 0.2) + 0.9 (-

90
In — + — In
90 90 -)) = 0.5506
90
Once again Split 2 is preferred.

The first node is the lapse node and the second node is the non-lapse node. For Split 1, 8 non-lapses are
classified as lapses and 18 lapses are classified as non-lapses, a total of 26 misclassifications. For Split 2,2 non-lapses
are classified as lapses and 22 lapses are classified as non-lapses, a total of 24 misclassifications. Split 2 has fewer
classification errors. (E)
16.18.
L Methods that are highly interpretable tend to be simple and have few parameters. For example, linear regression
is easier to interpret than splines. Thus methods that are highly interpretable are less flexible.X
IT. Lasso is a simple method; it simply selects parameters. Bagging is hard to use since it uses an average of lots
of models./
III. Flexible methods may fit the training data better but may not fit test data so well.X
(B)
16.19.
I. If all available predictors are used, the forest is not random. /

II. A random set of predictor variables is used at each split. X
III. By using a random set of predictors, trees are decorrelated. /
(C)
16.20. The plot seems to show a quadratic relationship between y and x, and we know that GLM (or even linear
regression) can handle such relationships by using x2 as a predictor, so on an exam I would answer (E) and move
on. But let's look at the other four statements.
(A) is wrong because tree-based methods are better at qualitative variables and weaker with quantitative
variables.
(B) is wrong because tree-based methods do not have a special technique to handle linear versus non-linear
relationships.
(C) is wrong because the variance does look constant in the plot.
(D) is wrong because tree-based methods don't produce any formula relationship between variables. GLM
does that.
16.21. Bagging selects bootstrap samples from the nodes. As the number of those samples goes to infinity, the
method converges to its lowest error rate and cannot improve after that. Random forests are like bagging except
that they remove correlation between samples by forcing the selection to only consider some of the nodes; thus the
results will once again converge after a while, but the removal of correlation should improve the error rate. Boosting
builds a lot of small trees and sets the response equal to a function of the results; thus it can result in improvements
even as the number of trees grows large, although if the number of trees is too large there may be overfilling. Thus
the answer is (C), even though line III doesn't increase in the graph; perhaps it increases when the number of trees
is higher than 750.

16.22. For the first tree, X2 > 1, then X3 = Y, then X1 Y, so we end up at the 3.14 node.
For the second tree, X4 2.67, then X3 = Y, then X4 3, then X7 <8, so we end up at the 3.5 node.
For the third tree, X7 <6.5, then X1 # Y, so we end up at the 2.81 node.
For the fourth tree, X6 <1, so we end up at the 0.34 node.
The prediction is 0.2(3.14 + 3.5 + 2.81 + 0.34) = 1.958 . (A)
16.23. Only a function that is piecewise constant can be modeled with no error by a classification tree. For such a
function, the regions are rectangles parallel to the axes. Thus only I can be modeled with no error by a classification
tree. (A)
16.24. Linear regression forces a straight line, so it is not flexible. The lasso limits the variables further, so it is less
l exible than linear regression. Bagging uses an average of many models so it is more flexible than a single tree.
f
Methods with high flexibility require complicated relationships to the explanatory variables, making them difficult
to interpret. (E)
16.25. All three statements are true, as discussed in this lesson. (D)


Copyright 02022 ASM
Lesson 17
Principal Components Analysis
Reading: An Introduction to Statistical Learning 10.2 (first edition) or 12.2 (second edition)
17.1 Loadings and scores

Principal components analysis is an unsupervised method for visualizing data. It creates principal components that 'I
summarize correlated variables. One can then visualize large numbers of variables by looking at a smaller number
of principal components.
Principal components are linear combinations of variables. Let p be the number of variables, n the number of
observations. We are given variables Xi, , X1, and are going to create principal components Z,,,. We subtract the
mean from each xj so that the mean of each variable is 0. This does not affect the variance of each X. We will
therefore assume that the mean of each variable is 0:
= 0 for j = 1, 2,... ,p
The principal components are linear combinations of the Xi:
zm
j=1
In other words,
Zim = Eq5jmXij
j=1
Since xi/ = 0, if follows that
n P n
E
:=1
Zim .7.:-.
E (ping z xii = 0
j.i i.i.
For the first principal component Z1, Oil are selected to maximize the variance of Z1. In other words,
Maximize
2
TI E Zi — TIE DpfiXii
1=1 1=1 j=1

13
—1
CPA
—
j=1
Thus the most important direction of the Xj, the one on which they vary the most, is selected. Then
j=1

Copyright 02022 ASM
300 17. PRINCIPAL COMPONENTS ANALYSIS
For the other principal components Zr,,, goi,„ are selected to maximize the variance of Z„, with the constraint that
Zm is uncorrelated to any of the previous principal components. That is equivalent to it being orthogonal to all
previous components. Thus all the equations stated above for Zi hold generally for Z„1:
Elb? 171
=1
.1=1
At most, there are min(p, n 1) principal components.

The (pin, are called loadings. They are the weights on the variables X1 in the expression for Zt,„ except that we
don't use the term "weights" because they don't add up to 1 (their squares add up to 1). There are p of them for
each principal component. You can form loading vectors either by fixing nz and varying j, in which case the vector
will have p components, or by fixing j and varying in, in which case the vector will have M components, where M
is the number of principal components that are being used. M is presumably lower than p, since the whole point of
principal component analysis is to summarize p variables with a smaller set of variables. So when visualizing data,
M-component loading vectors are more useful.
NI° The z,„, are called scores. They are the coordinates of the observations in the Z coordinate system. Score m for
observation i is the distance of point k from 0, in the Z„, direction. There are m scores for each observation. An
m-component vector of scores for each observation can be created fixing i and letting m vary.
Calculating loadings is easy linear algebra done by eigen decomposition, but you are not expected to solve it by
hand.
To illustrate this, consider two variables that have already been centered at mean 0, X1 and X2, with values shown
in the following table:
i xii xi2 zil zi2 i Xil X12 Zil Zi2
1 -24 -21 -30.45 -9.48 11 1 0 0.91 -0.41

2 -23 -18 -28.32 -7.14 12 5 13 9.84 9.86
3 -20 -12 -23.15 -2.87 13 6 4 7.11 1.23
4 -19 -8 -20.61 0.38 14 10 5 11.17 0.52
5 -18 -6 -18.89 1.80 15 12 7 13.81 1.54
6 -15 1 -14.12 5.16 16 16 9 18.27 1.75
7 -12 0 -10.97 4.86 17 21 10 23.25 0.64
8 -9 2 -7.42 5.47 18 23 6 23.46 -3.83
9 -5 1 -4.17 2.94 19 25 2 23.67 -8.30
10 -2 4 -0.21 . 4.67 20 28 3 26.82 -8.60
The loadings of the first principal component are 0.9143 and 0.4050. The loadings of the second principal
component are -0.4050 and 0.9143. As you can see, the second loading component is perpendicular to the first
loading component. The table shows the scores of the two components for each observation. For example, for the
i rst observation
f
0.9143(-24) 0.4050(-21) = -30.45

-0.4050(-24) 0.9143(-21) = -9.48
Figure 17.1 shows the points and the first principal component line, the line defined by
021
Y-
7
To understand what the scores represent, look at the figure and project the observations perpendicularly to the
principal component line. They hit the line at a point. The score of the observation is the distance of that point

Copyright (02022 ASM
17.2. BIPLOTS 301
—30 —20 —10 0 10 20 30Ki

Figure 17.1: Principal component lines. First principal component line is solid. Second principal component line is dashed.
to (0,0). In other words, if you computed both principal components, and then drew a graph with axes Zi and Z2, the
coordinate of that point would be the two scores, (zii , z12). For example, Figure 17.2 projects observation 12 on the
two principal components. The scores are 9.83699 for the first component and —9.86072 for the second component.
After looking at this graph, you can understand why observation 13's score on the first principal component is lower
than observation 12's score even though its x coordinate is higher.
14-7"\.JQuiwords,
z 17-1theNIPlength
Consiofdera perpendicular
the 20 point exampl e we just discussed and il ustrated in Figure 17.1.
Calculate the Euclidean distance of the first observation (-24, —21) to the principal component line, in other
line dropped from the first observation to the principal component line.
The second principal component is the vector of loadings Op subject to the constraint Ei 4:12 = 1 that maximizes 111
the variance of zi2 = E cpprij such that Z2 is uncorrelated to Z1. And all other principal components are defined
the same way, to maximize variance and be uncorrelated to the previous principal components. Each principal
component is orthogonal (perpendicular) to the hyperplane of the previous principal components. Thus principal
component m is the set of cpims, j = 1,. . , p, that solves the following:
Maximize Z1 opi i subject to
Ei!j=1 4)2int —
—1
EPi=1 Oint(Pik = 0 for k < m
17.2 Biplots
l
f One way to visualize the principal components is through a biplot. A biplot plots two things, using labels on the left 61
Figure 17.2: Projection of observation 12 on the two principal components.
and bottom for one and labels on the top and right for the other. For principal components, the biplot shows which
variables are correlated with each other and how the observations vary with the two principal components shown.
Our biplot will plot the first two scores of our 20 observations, and the first two loadings of each variable. The
biplot is more interesting with more than 2 variables, so we'll add a third variable to the 20 point example given
in the previous section. Table 17.1 lists the three variables and the three scores, but we will use only the first two
scores.
The loadings for the three variables are:
1 2 3
Oil 0.16741 0.09167 0.98162
op 0.90061 0.39085 —0.19009
cbj3 —0.40109 0.91588 —0.01712
The biplot is Figure 17.3. The three variables are labeled X1, X2, and X3, and the 20 observations are labeled
pi., • • , P20 • The biplot shows the first two components of the loading vectors by variable]; it shows the vector of the
principal component 1 and principal component 2 loadings for each variable. The numbers on the bottom and left
axes are scores. The numbers on the top and right axes are loadings. For example, the line labeled "Xi" goes from
(0,0) to (0.16741,0.90061). We see that variables 1 and 2 each have loadings that are virtually in the same direction;
they are highly correlated. Variable 3, however goes in a quite distinct direction, and consists almost entirely of
component 1. The observations are all around the place. Some examples: Observation 20 (shown as p20) is high
in variables 1 and 2 but only average in variable 3. Observation 12 is average in variables 1 and 2 and very high in
variable 3. Observation 10 is low in variable 3 and a little above average in variables 1 and 2. .)

Copyright Q2022 ASM
17.2. BIPLOTS 303
lf Table 17.1: Variables and their scores on the three principal components
/ xi], Xi2 X13 Zil Zi2 Z i3
1 -24 -21 -20 -25.5751 -26.0207 -9.2649

2 -23 -18 0 -5.5003 -27.7494 -7.2608
3 -20 -12 20 15.1842 -26.5043 -3.3113
4 -19 -8 -30 -33.3626 -14.5357 0.8074
5 -18 -6 0 -3.5633 -18.5561 1.7243
6 -15 -1 30 26.8458 -19.6028 4.5867
7 -12 0 -40 -41.2736 -3.2037 5.4981
8 -9 2 0 -1.3233 -7.3238 5.4416
9 -5 1 40 38.5193 -11.7158 2.2363
10 -2 4 -50 -49.0490 9.2667 5.3219
11 1 0 0 0.1674 0.9006 -0.4011
12 5 13 50 51.1096 0.0796 9.0447
13 6 4 -20 -18.2613 10.7689 1.5995
14 10 5 0 2.1324 10.9604 0.5685
15 12 7 20 22.2829 9.7415 1.2556
16 16 9 -30 -25.9450 23.6302 2.3392
17 21 10 -10 -5.3840 24.7223 0.9072
18 23 6 30 33.8489 17.3565 -4.2435
19 25 2 10 14.1847 21.3961 -8.3667
20 28 3 0 4.9624 26.3897 -8.4828
-0.5 0 0.5
toD
Lt
C•1
Cg
-50 -40 -30 -20 -10 0 10 20 30 40 50
First principal component
Figure 17.3: Biplot

Copyright C2022 ASM
17.3 Approximation
Another interpretation of principal components is that they are the best linear approximation of the observations.
Nr They constitute the hyperplane with minimum Euclidean distance to the observations. If M principal components
are used, then the score vector of an observation times the loading vector for a variable approximates the observation's
value for that variable:
111? x
E zin, (pin,
m=1
(17.1)
assuming that the variables are centered at 0. The approximation is exact if M = min(7/ — 1, p), as the following
shows for M = p:
E t:P j111ZiM = DA. E ChniX1k

m=l m=1 k=1
= X ik (1) pug) krn

k=1 n1=1
But
SO
E ofinzi,„ = Xii
ni=1
EXAMPLE 17A 11 For the 3-variable example above, the scores of the first observation in the first two components,
to four decimal places, are —25.5751 and —26.0207.
Approximate the three components of the first observation. •
SOLUTION:
x1,1 =LI (-25.5751)(0.16741) + (-26.0207)(0.90061) =L-27.7160

X12 :L.-, (-25.5751)(0.09167) + (-26.0207)(0.39085) = 1-12.5145
X13 (-25.5751)(0.98162) + (-26.0207)(-0.19009) = —20.1587
Actual values are X1,i= —24, x/,2 = —21, and x13 = —20.
17.4 Scaling
In linear regression, the scale of variables does not matter; if a variable is multiplied by r, then the corresponding
'sir coefficient p is divided by r. But the scale does matter in principal components analysis. If a variable is multiplied by
a constant greater than 1, its variance increases, and PCA puts a higher loading on the variable in order to maximize
variance. To avoid giving some variables spurious importance, the variables are usually scaled so that their standard
deviations are 1. However, if the variables are expressed in the same units and the relative scale of the variables has
meaning, one may choose not to scale the variables.
The example we just did shows how variance affects the accuracy of the approximation. The variances of the
three variables are 280.5, 76, and 680; we see how x1,3 is estimated very well and x1,2 is estimated poorly due to its
low variance.
We will now scale the variables by dividing by their standard deviations; in other words, we'll standardize the
variables. The resulting loadings are

17.4. SCALING 305
Table 17.2: Standardized variables and their scores on the three principal components
i Xil Xi2 Xi3 Zil Zi2 Zi3
1 -L4330 -2.4089 -0.7670 -2.8248 0.0827 -0.6766
2 -1.3733 -2.0647 0.0000 -2.3534 -0.5948 -0.5065
3 -1.1942 -1.3765 0.7670 -L5644 -1.1962 -0.1752
4 -1.1345 -0.9177 -1.1504 -1.6947 0.7374 0.1918
5 -1.0747 -0.6882 0.0000 -1.2049 -0.3278 0.2638
6 -0.8956 -0.1147 1.1504 -0.3974 -1.3168 0.4970
7 -0.7165 0.0000 -1.5339 -0.8770 1.3320 0.5682
8 -0.5374 0.2294 0.0000 -0.2086 -0.0787 0.5401
9 -0.2985 0.1147 1.5339 0.2642 -1.5280 0.2251
10 -0.1194 0.4588 -1.9174 -0.2524 1.8960 0.4926
11 0.0597 0.0000 0.0000 0.0407 0.0125 -0.0419
12 0.2985 1.4912 1.9174 1.7134 -1.5689 0.7696
13 0.3582 0.4588 -0.7670 0.3647 0.8845 0.1083
14 0.5971 0.5735 0.0000 0.8007 0.2105 -0.0105
15 0.7165 0.8030 0.7670 1.2340 -0.4716 0.0361
16 0.9553 1.0324 -1.1504 1.0682 1.4656 0.1142
17 1.2539 1.1471 -0.3835 1.5448 0.8041 -0.0464
18 L3733 0.6882 1.1504 1.6999 -0.7214 -0.5225
19 1.4927 0.2294 0.3835 1.2718 -0.0233 -0.8999
20 1.6718 0.3441 0.0000 1.3754 0.4020 -0.9274
1 2 3
CPjl 0.68139 0.68662 0.25349

0.20987 0.14850 0.96639
-0.70120 0.71168 0.04291
We see that the loadings of the first principal vector are much higher for the first and second variables.
The standardized variables and their scores are shown in Table 17.2.
The biplot for standardized variables is shown in Figure 17.4. It is quite different from the unstandardized biplot.
X1 and X2 are still shown as highly correlated, but are about the same size now. They have much higher loadings
on the first principal component so they point rightward rather than upward, and as a result X3's loading on the
i rst principal component is lower.
f
We'll redo the Example 17A with standardized variables. The scores of the first observation in the first two
components, to four decimal places, are now -2.8248 and 0.0827.
X1,1 (-2.8248)(0.68139) + (0.0827)(0.20987) = -1.9074

x1,2 (-2.8248)(0.09167) + (0.0827)(0.14850) = -1.9273
(-2.8248)(0.25349) + (0.0827)(-0.96639) = -0.7960
Actual values are x1,1 = -1.4330, x1,2 = -2.4089, and x13 = -0.7670.
The vector of loadings is unique up to sign. In other words, one may obtain an equivalent solution by flipping •-dr
the sign on all of the loadings. The loadings indicate the direction of the principal component, and direction is not
affected by reversing the sign. Flipping the sign of the loading vector results in flipping the sign on all of the scores.
The flippings cancel in the approximation formulas.

Copyright 62022 ASM
306 /7. PRINCIPAL COMPONENTS ANALYSIS
—0.5 0 0.5
Lc?
—2 —1 0 1 2 3
Figure 17.4: Biplot
17.5 Proportion of variance explained

How many principal components should be used? There is no definitive answer to this question. With supervised
methods, one would answer the question using cross-validation. But that method is not available for unsupervised
methods.
`14I One may look at the proportion of variance explained (PVE) by the principal components. The total variance of
the variables (which we are still assuming are centered at 0) is
P
j=1
Var(Xj) =
j=1 I1xI (17.2)
If the variables are standardized (scaled so that their variances are 1), the total variance is p.
The variance of the mth principal component is
2
" n
( .IPJmXiJ
P
Var(Z,n) = — = lv (17.3)
'
n n
1=1 j=1
Divide (17.2) into (17.3) to obtain the PVE.

For example, in our original 2-variable 20-point example, Erix = 5610 and V°1 = 1520, for a total variance
of (5610 + 1520)/20 = 356.50. The variance of Zi is 330.31. Thus 330.31/356.50 = 92.65% of the variance is explained
by the first principal component.
One may calculate the cumulative proportion of variance explained by the first M components, and select M so
that a specified proportion of variance is explained. An alternative is to plot the proportion of variance explained
by each component, with a point for each (rn,PVE,„). Then connect the points with lines, and look for "elbows",

17.5. PROPORTION OF VARIANCE EXPLAINED 307
0.9
0.8
XL' 0.7
V, 0.3
z
0.2
c13
0.1
0
1 2 3 4 5
Number of principal components
Figure 17.5: Scree plot
points at which the plot bends so much that very little is explained by further principal components. Such a plot is
called a scree plot. Figure 17.5 is an example of a scree plot for a hypothetical principal components analysis. In this 'sr
plot, the elbow is at the point representing the variance of the second principal component, so we would use two
principal components. Obviously this method is ad hoc.
Here are the percentages of variance explained in our 3-variable example, both the original and the scaled
versions:
Percentage of Variance Explained

Number of
Principal Unsealed Scaled
Components Variables Variables
1 0.6694 0.6092
2 0.3055 0.3146
3 0.0251 0.0762
In the unsealed version the first two components explain 97.5% of the variance, which is probably good enough.
In the scaled version 92.4% of the variance is explained by the first two components, which may still be good enough.

Copyright 02022 ASM
Principal components Vectors Z,,, with
Znr =
çbjmXj
j=1
and cpj„, are loading of mth principal component on Xj.

Scores zi„„ 1 < i < n,1 < m < M
Loadings op, selected to satisfy
subject to
1.
2. Vii=i 4bjk0j,,, = 0 fork < in. (Vacuous condition if in = 1.)

Biplot Plot of scores of observations together with plot of loadings by variable.
Approximation with principal components
E Zinic
i=1
(17.1)
Proportion of variance explained (PVE) The following two are variance formulas:
p n
(17.2)
)2
1 2 1
Var(Z„,) = —
n
zim = — ripjrn 1.1 (17.3)
n
i=1 i=1 j=1
Then the proportion of variance explained by Z,„ is

Var(Z,„)
PVE =
Var(Xi)
Scree plot Plot of PVE against number of principal components

Copyright 02022 ASM
l
f Exercises
17.1. "4/ [SRM Sample Question #5] Consider the following statements:
I. Principal Components Analysis (PCA) provides low-dimensional linear surfaces that are closest to the obser-
vations.
II. The first principal component is the line in p-dimensional space that is closest to the observations.
PCA finds a low dimension representation of a dataset that contains as much variation as possible.
IV. PCA serves as a tool for data visualization.
Determine which of the statements are correct.
(A) Statements I, II, and III only

(B) Statements I, II, and IV only
(C) Statements I, III, and IV only
(D) Statements II, III, and IV only
(E) Statements I, II, III, and IV are all correct
17.2. el [MAS-II-S19:41] You are reviewing a dataset with 100 observations in four variables: X1, X2, X3, and X4.
You analyze this data using two principal comopnents:
Zi = 011X1 + 021X2 +47531X3 + 041 X4

Z2 = 412 + CP22X2 4532X3 + 042 X4
lf You are given the following statements:
2=
1. Eii2i° (zt=i i2xii
El=i ipoi2 = 0
)=1 cp2
j1 + Ei)=1 j2 —
iii -1
Determine which of the preceding statements are always true.
(A) I only (B) II only (C) III only (D) 1,11, and III
17.3. '11 You are given the following 5 observations of 2 variables:

X1 X2
1 5
2 7
4 4
6 3
9 1
The loadings of the first principal component are —0.8297 and 0.5583.
Calculate the first principal component's score for the first observation.

17.4. For a principal components analysis on two variables, the means of the two variables are 0. You are given
the following observation and its score on the first principal component:
Xi X2 Score
4 10 8
Calculate the first principal component's loading on the first variable, assuming that it is positive.
17.5. [MAS-I-S18:26 fixed] You are given the following daily temperature readings, which are modeled as a
function of two independent variables:
Independent variables
Observation Temperature Xi X2
1 5 4 6
2 10 8 2
• The data set consists only of these two observations.

• The first principal component loading for Xi, On, is 1/./i
• The first principal component loading for X2, 021, is negative.
Calculate the principal component score for Observation 1.
Exam SFtM Study Manual Exercises continue on the next page. .

Copyright 02022 ASM
•
17.6. •-•
[MAS-I-F18:381 You are given a series of plots of a single data set containing two variables:
Plot 1 Plot II
•
—
PC I
•
-- PC 2
_...----- • •
• ••
4
• •
•
0.2 0:4 06 0:8 0:2 0:4 0'6 0.8 1.0
Plot III Plot PI
—
PC 1
—
PC 1
•
--- PC 2 --- PC 2
•
• • • • •
•
S.
• .
•
0.2 0.'4 016 0:8 1:0 0.2 04 0.6 0.8 1.0
Plot V
mi
•
—
PC 1
---
PC 2
-
•
•
ci 4. a • •
el -
el.
0:2 0:4 0:6 0:8 1.0
Determine which of the above plots accurately represent the first and second principal components (PC1 and
PC2, respectively) of this dataset.
(A) I (B) IT (C) III (D) IV (E) V

17.7. sir A department store performs principal components analysis to visualize the purchases of customers in
four categories.
You are given the following biplot, with the scores for two customers.
—0.5 0 0.5
Food
3
Bradley
2 0.5
Clothes
1
Linens
0 0
—1
Abigael
—2 Applicances —0.5
—3
—3 —2 —1 0 1 2 3
Your colleague draws the following conclusions from this biplot:

I. Customers who purchase a lot of clothes tend to purchase a lot of linens.
Abigael purchases more food than the average customer.
Bradley purchases more food than the average customer.
Which of these statements are correct?
(A) None of I, H, or III is true

(B) I and II only
(C) I and III only
(D) II and III only

Copyright 02022 ASM
n 17.8. s: (MAS-1I-S19:40) You perform two separate principal component analyses on the same four variables in
a particular data set: Xl, X2, X3, and X4. The first analysis centers but does not scale the variables, and the second
analysis centers and scales the variables. The biplots of the first two principal components produced from these
analyses are shown below. The location of Observation 24 is labeled on the plots as well.
Biplot from Unsalted Analysis Biplot from Scaled Analysis
CNI
0
0—
PC1 PC1
l
f Given the following statements:
I. X1 is more highly correlated with X2 than with X3.
IL X3 has the highest variance of these four variables.
III. Observation 24 has a relatively large, positive value for X4.
Determine which of the preceding statements are demonstrated in the biplots shown above.
(A) None of I, II, or III are demonstrated in the biplots
(B) I and II only
(C) I and III only
(D) II and III only
17.9. 111 You are performing a principal components analysis of three variables, Xi, X2, and X3. Their loadings
on two principal components Zi and Z2 are
X1 X2 X3
Z1 —0.85956 —0.18960 0.47457
Z2 0.25890 0.63908 0.72426
The scores of the third observation on the two components are, in order, —22.5603 and —45.9338.
Calculate an approximation of the third observation of X/.

Copyright 02022 ASM
17.10. si? You are performing a principal components analysis of three variables, X1, X2, and X3. Their loadings
on two principal components Z1 and Z2 are
.X1 x2 x3
Z1 —0.97932 —0.18906 0.70210
Z2 0.03669 0.18452 0.98214
The scores of the fourth and fifth observations on the two components are
I zii zi2 zia
4 —37.1131 15.4973 14.6442
5 —46.5435 —14.2271 15.3719
The fourth observation's first component x41 = 34.

Determine the fifth observation's first component 7C51
17.11. 6.: [SRM Sample Question #371Analysts W, X, Y, and Z are each performing principal components analysis
on the same data set with three variables. They use different programs with their default settings and discover that
they have different factor loadings for the first principal component. Their loadings are:
Variable 1 Variable 2 Variable 3

W —0.549 —0.594 0.587
X —0.549 0.594 0.587
Y 0.549 —0.594 —0.587
Z 0.140 —0.570 —0.809
Determine which of the following is/are plausible explanations for the different loadings.
I. Loadings are unique up to a sign flip and hence X's and Y's programs could make different arbitrary sign
choices.
Z's program defaults to not scaling the variables while Y's program defaults to scaling them.
Loadings are unique up to a sign flip and hence W's and X's programs could make different arbitrary sign
choices.

Copyright 02022 ASM
EXERCISES FOR LESSON /7 315
17.12. •111 [MAS-II-F18:41] You are provided with the following normalized and scaled data set:
i Xi X2 X3
1 —0.577 1 —1
2 —0.577 1 1
3 —0.577 —1 1
4 1.732 —1 —1
The first principal component loading vector of the data set is (0.707, —0.500, —0.500).
Calculate the proportion of variance explained by the first principal component.
(A) Less than 53%
(B) At least 53%, but less than 58%
(C) At least 58%, but less than 63%
(D) At least 63%, but less than 68%
(E) At least 68%
17.13. [SRM Sample Question #61 Consider the following statements:

I. The proportion of variance explained by an additional principal component never decreases as more principal
components are added.
II. The cumulative proportion of variance explained never decreases as more principal components are added.
III. Using all possible principal components provides the best understanding of the data.
IV. A scree plot provides a method for determining the number of principal components to use.
Determine which of the statements are correct.
(A) Statements I and II only

(B) Statements I and III only
(C) Statements I and IV only
(D) Statements II and III only
(E) Statements II and IV only
1
17.14. [MAS-II-F19:41] You are given:
(i) A data set contains 500 observations for five predictor variables {Xi, X2, X3/ X4/ X5}.
(ii) Each predictor has been standardized to have mean 0 and standard deviation 1.
(iii) Each of the 500 observations takes the form (x11, xj2/ 743, X14/ X15) for i ranging from 1 to 500.
(iv) For each observation, a new predictor Z is calculated by projecting onto the first principal component.
(v) The projection for the ith observation is denoted by zi.
The total variance present in the data set is equal to 4 and Er? = 750.
Calculate the proportion of variance explained for the first principal component.
(A) Less than 0.20
(E) At least 0.50

17.15. i'l [MAS-H-F19:391 Dataset Z contains 4 variables and 100 records and has the following correlation matrix. j ,.....
1.00 0.93 0.08 —1.00
(
0.93 1.00 0.10 —0.93
0.08 0.10 1.00 —0.08
—1.00 —0.93 —0.08 1.00
Determine which of the following plots of cumulative proportional variance is produced by principal components
analysis on dataset Z.
(A) (3)
•
1 2 3 4 1 2 3 4
Principal Component Principal Component

(C) (D)
1 2 3 4 1 2 3 4
Principal Component Principal Component

(E)
D
/
0
1 2 3 4
Principal Component

(----‘ 17.16. '411 You are given the following four observations for two variables:
x1 x2
—2 I
-1 —2
-1 0
4 1
The first principal component has loadings of 0.973402 on X1 and 0.229106 on X2.
Calculate the proportion of variance explained by the first principal component.
17.17. I': [SRM Sample Question #301 Sarah is applying principal component analysis to a large data set with
four variables. Loadings for the first four principal components are estimated.
Determine which of the following statements is/are true with respect to the loadings.
I. The loadings are unique.
II. For a given principal component, the sum of the squares of the loadings across the four variables is one.
III. Together, the four principal components explain 100% of the variance.
(A) None (B) I and II only (C) I and III only (0)11 and III only
Exam SAM Study Manual Exercises continue on the next page...

Copyright 02022 ASM
17.18. '41' [SRM Sample Question #35] Using the following scree plot, determine the minimum number of principal
components that are needed to explain at least 80% of the variance of the original dataset. ,.._....)
1-1
Cq
C3
cl
o
•
•
1 2 3 4 5
Principal Component
(A) One (B) Two (C) Three (D) Four

(E) It cannot be determined from the information given.
Solutions
17.1. All the statements are correct. (E)

17.2.
I. The two expressions (after division by 100) represent the variance of the principal components 1 and 2
respectively, and there is no reason that they should be equal. On the contrary, the first principal component
is selected to have the highest variance. X
II. This formula says that the principal components are orthogonal, and indeed they are. i
III. The sum of the squares of the loading for each principal component is 1. Summing up two is results in 2,
not 1. X
(B)
17.3. The mean of the first variable is 4.4. The mean of the second variable is 4. So the score is
—0.8297(1 — 4.4) + 0.5583(5 —4) = 3.37928

17.4. Let the loadings be 491 and 02. Then og = 1 — 44. From the score, we have
44), + 1002 = 8
So
44)1 + 101/1 — =8
100(1 — qq) = (8 — 401)2

100 — 10044 = 64 — 6401 +164)
11644 —644)i —36 = 0
64+ 1/2T(8-(K)
=
0.89751
232
17.5. The squares of the loadings must add up to 1, so 021 = — (1/V)2 = —1hri, where the negative square
root was used since the loading for X2 is negative. The observations are centered around 0. The mean of X1 is 6
and the mean of X2 is 4, so we subtract the mean 6 from the first observation of X1, 4, and the mean 4 from the first
observation of X2, 6.
Then the score of observation 1 is (1/ )(4 —6) + (-1/V)(6 —4) = —2.828
17.6. Principal components are the lines closest to the data. Plot I seems like the plot that accomplishes this.
Plot II reverses the lines, but the solid line in Plot I looks closer to most of the data. Principal components must be
perpendicular to each other, eliminating Plot III. In Plot IV the first line is not so close to the data. In Plot V the
principal components are not even lines. (A)
17.7. Clothes and linens have similar loadings, so I is correct.
Abigael has a negative score from the first component. But we cannot tell whether that is because she purchases
l
f more food than the average customer or because she purchases less clothes, linen, ana appliances than the average
customer. We cannot conclude
Bradley has a positive score on both components. If Bradley purchases more food than the average customer,
this would probably lead to a negative first component, unless offset by greater than average purchases of the other
items. We cannot conclude III. (E)
17.8.
I. Since X1 and X2 go in virtually the same direction, the loadings for PC1 and PC2 on the variables must be
similar, which implies high correlation. /
IL When a variable is scaled, it is divided by its standard deviation to make the variance 1. Since the first principal
component has maximal variance, it will put lower loading on variables with lower variance. The higher the
variance of the original variable, the greater the reduction in loading.
Comparing the unscaled and scaled biplots, we see that X3's loading on the first principal component was
significantly decreased whereas the loadings of the other variables on the first principal component were
increased. We conclude that X3 has the highest variance. /
III. X4's loadings and the scores of observation 24 are in the same direction, but this does not necessarily imply
observation 24 has a high coefficient for X4. It may have high negative coefficients for X1, X2, and X3 and have
a high score in PC1 as a result. X
(13)
17.9. xi3 —0.85956(-22.5603) + 0.25890(-45.9338) = 7.4997
17.10. (-0.97932)(-37.1131) + (0.03669)(15.4973) + 013(14.6442) = 34
14.6442013 = 34 — 36.9142 = —2.9142
013 = —0.19900
= (-0.97932)(-46.5435) + (0.03669)(-14.2271) + (-0.19900)(15.3719) =

17.11. Loadings are unique up to sign, but all signs must be flipped. Thus I is correct but III is not, since there is
only one sign flip between W and X. II is another possible explanation of differences in loadings. (B)
17.12. Since the data are scaled, the variance of each Xi is 1, and the sum of the variances of the three Xis is 3.
The principal component is computed from the loading vector as zi = E,Pk=i OkXki, and we get zi = (0.707)(-0.577) —
0.5 + 0.5 = —0.40794, and similarly z2 = —1.40794, z3 = —0.40794, and z4 = 2.224524. The sum of the squares
of these 4 numbers is 7.263628, so the variance of the first principal component is 7.263628/4 = 1.815907. Then
1.815907/3 = 60.53% of the variance is explained. (C)
17.13.
I. The principal component explaining the greatest amount of variance is added first, and each additional one
explains less variance. X
II. Each added principal component explains additional variance. t/
III. Data is best understood with fewer principal components, making the model simpler. X
IV. A scree plot shows how much additional variance is explained by each principal component; one can limit the
number of principal components used to the ones that explain a significant amount of variance.
(E)
17.14. The total variance is 4. The variance of the first principal component, by formula (17.3), is
1 2 750
'
n 500
The proportion of variance explained by the first principal component is 1.5/4 = 0.375 . (C)
Incidentallyl, it is hard to see how the total variance present in standardized data could be 4. The variance
of a standardized variable is 1, and there are 5 predictors, so either the variance of the data set is 5, or if the
standardization used division by 499 instead of 500, it is 5(499/500).
17.15. From the matrix, variables 1 and 4 are perfectly correlated with correlation —1.00, so the fourth principal
component cannot explain any variance, while the other components explain some variance. Thus (C) is the only
possible plot.
17.16. The two variables are already centered at 0 (their means are 0). The sum of their squares is
(-2)2 + (-1)2 + (-1)2 + 42 + 12 -I- (-2)2 -I- 02 + 12 = 28
The scores are
0.973402(-2) + 0.229106(1) = —1.717698

0.973402(-1) + 0.229016(-2) = —1.431613
0.973402(-1) = —0_973402
0.973402(4) + 0.229106(1) = 4.122713
The sum of the squares of the scores is 22.944281. Both the total variance and the variance of the first principal
component equal their sum of squares divided by the number of observations, which is 4. So (22.944281/4)1(28/4) =
0.81944 of the variance is explained.
17.17.
I. The loadings are unique up to sign, but the signs may be reversed. X
II. The loadings are defined under the constraint that the sum of their squares must equal one. *7
The four principal components have the same dimension as the original variables so they must explain all of
their variance.
(D)
'This problem was mentioned to me by Mario Mendiola

(---- 17.18. Looking at the scree plot, the first principal component explains more than 60% of the variance, but less
than 80%. The second principal component explains a little more than 20%. So the first two principal components
explain more than 80% of the variance, while the first principal component alone does not explain 80% of the
variance. (B)
Quiz Solutions
17-1. The distance from (-24, —21) to (0,0) is V242 + 212. The score, whose absolute value is the distance from the
projection of (-24, —21) onto the principal component line to (0,0) is —30.45. By Pythagoras' Theorem, the length of
the perpendicular line, one of the legs of the right triangle, is the square root of the difference between the square
of the hypotenuse and the square of the other leg:
V(242 + 212) — 30.452 = 9.48

This is the absolute value of the second score, since that represents the distance to (0,0) along the second principal
component, so you could pick the answer off the score table.


Lesson 18
Cluster Analysis
Reading: An Introduction to Statistical Learning 10.3 (first edition) or 12.4 (second edition)
Cluster analysis is an unsupervised learning method. It groups the observations into a small number of homo-
geneous clusters, groups of observations that are similar to each other. Contrast this with principal components
analysis:
• Principal components analysis looks for a low-dimensional representation that explains most of the variance. •41/
• Cluster analysis tries to group the observations into a small number of groups of similar observations.
Examples of applications of cluster analysis are:
Marketing Customers are grouped based on what they buy. Advertising can be targeted to the groups that are
most interested in the advertised products.
Medical Patients with a certain disease may be grouped based on clinical measurements, to determine the best
therapy.
Actuarial modeling Actuarial models require grouping of policies into cells.
We will study two clustering methods: K-means and hierarchical.
18.1 K-means clustering

in K-means clustering, we decide in advance the number of clusters we will have. The clusters C1,... ,CK are
exhaustive, meaning that every observation belongs to one of the clusters, and mutually exclusive, meaning that no ••••
observation belongs to more than one cluster. The clusters are selected to minimize the total dissimilarities between
points within the clusters. Let W(Ck) be the total dissimilarities between points in Ck. Then our objective is
minimize
E w(co
k=1
(18.1)
As usual, let Xi be vectors of the variables or features, withj= 1, 2, ... , p. There are n observations, so Xi =
zii, x211 ,x,,1}. So a single observation is X12,. xip 1. The most common W(Ck) used involves squared
Euclidean distance:
1
W(Ck) = z z (xii — xi,j)2 (18.2) •-•°
iCk i,i,Eci j.1
where ICk I is the number of observations in Ck. Notice that the sum is over all pairs {i, PI in both orders. (This
doesn't matter for optimization, since doubling W(Ck) does not affect where the minimum occurs. But it makes
proving the minimizing algorithm easier.)
EXAMPLE 18A kir There are two variables. We are given the four-point cluster (1,7), (3,5), (6,2), and (2,10). Calculate
W(Ck) for this cluster. •

Copyright 02022 ASM
324 18. CLUSTER ANALYSIS
SOLUTIO14: As an example, the distance between (1,7) and (3,5) is (1 — 3)2 + (7 — 5)2 = 8. The distances between all
pairs of points is given in the following table:
(1,7) (3,5) (6,2) (2,10)
(1,7) 0 8 50 10
(3,5) 8 0 18 26
(6,2) 50 18 0 80
(2,10) 10 26 80 0
The sum of all the numbers in this table is 384. Then W(Ck) = 384/4 = 96
Putting (18.1) and (18.2) together, our objective is
K
1
minimize EI—
Ck IX ( — Xpi)2 (18.3)
k=1 i,PEck j=1
It would be computationally infeasible to exhaustively go through every partition of the points into K clusters and
determine which one has minimal distance. But there is a simple algorithm to find local minima. This algorithm
is based on centroids of clusters. The centroid of a cluster is the point whose coordinates are the means of the
coordinates of the cluster. For example, for the 4-point cluster in Example 18A, the centroid is
(1 +3+6+2 7 + 5 + 2 + 10)
4 4
= (3, 6)
The algorithm is
1. Split the observations arbitrarily into K clusters.
2. For each cluster, calculate the centroid.
3. Create new clusters by associating each point with the nearest centroid.
4. Repeat steps 2-3 until cluster assignments do not change.
This algorithm finds a local minimum, since the distance cannot increase at each iteration. The following identity
demonstrates this:
1
I Ck E Xl02 = 2 E (X4 — 44)2 (18.4)

j=1 iECk j=1
where kj is the mean coordinate in Ck:

(18.5)
" = T iE x"iEck
We see that the dissimilarity function is twice the squared Euclidean distance of points from the cluster centroid,
and assigning points to the closest cluster centroid can only reduce the total dissimilarity. The textbook leaves the
proof of the identity as an exercise, but if you're curious, see the sidebar.
Because of the use of centroids, the clustering method discussed in this section is called K-means clustering
(rather than K-clusters clustering).
The textbook illustrates this algorithm in its Figure 10.6; you can see there how a random assignment to 3 clusters
converges to something reasonable after 10 iterations. For simplicity, we'll do a 1-dimensional example.
EXAMPLE 18B •-•? You are given the points 0, 1, 5, 7, 12, 14. You will split them into 2 clusters using the K-means
algorithm.
1. Your initial assignment is {0,1,5} to Cluster 1 and {7,12,14) to Cluster 2.
Carry out the algorithm.

18.2. HIERARCHICAL CLUSTERING 325
l
f
Proof of equation (18.4)
Start with the expression in the inner sum, adding and subtracting
— Xit1)2 (Xij — ikj —
Keep in mind that the sum on the left side of equation (18.4) is over all (i, i'), including i = i'. So the sum has IC1,12
terms. Consider the first two summands of (1. Each one of them will be summed ICk times, once for each i (for
the first summand) or for each i' (for the second summand). So (xii — ici)2 will be summed 2ICkl times. The sum
on the left side is divided by ICk I, so we end up with twice the sum of those squares, which is what the right side
of equation (18.4) is. So we just have to show that the sum of the third summand, the cross products, is 0. That
sum is double the following:
E(x„ -
ieCk
z
PcCk
Now, by the definition of the mean gkj shown in equation (18.5), Eieck xii = ICkliki, so the sum of the differences
xfi — which is the difference of the sums, is 0. The proof is complete.
2. Your initial assignment is {0,1,5,7,12) to Cluster 1 and 114) to Cluster 2.

Carry out the algorithm.
SOLUTION: 1. The centroids are 2 and 11. Points less than 6.5 go to the first duster and points above 6.5 go to the
second cluster. No assignments changed, so we are done.
2. The centroids are 5 and 14. Anything below 9.5 goes to the first cluster, so the new clusters are {0, 1, 5, 7} and
{12,141. The new centroids are 31 and 13. Anything below 8i goes to the first cluster. Assignments don't
change, so we are done.
This solution is better than the one for the first part.
As the example shows, one must perform the algorithm multiple times with different starting assignments to have
a good chance of finding the best cluster assignments. Another issue is choosing K.
18.2 Hierarchical clustering

Hierarchical clustering is an alternative to K-means clustering that does not require specifying the number of clusters
in advance. As its name implies, it results in bigger clusters containing smaller clusters containing smaller clusters.
We will describe bottom-up or agglomerative clustering. This type of clustering starts with n clusters, one for each *yr
point. It then iteratively fuses the two most similar clusters together. It keeps repeating this process until all points
are in one cluster. You can then decide how many clusters should be used. To help with this decision, a graph called
a dendrogram is produced. This graph is an upside down tree showing all the fusions, with their heights determined el
by the dissimilarity measure. The number of clusters is determined by deciding at what height to cut the graph.
Let's begin by discussing the fusing algorithm. We specify a dissimilarity measure, a formula that measures
how different two points are. Usually the usual Euclidean distance, namely the square root of the sum of square a:
differences between coordinates — x12)2 is used. Unlike for K-means clustering, we do not square the
Euclidean distance. You may think that it makes no difference; if the distance between two points is minimal,
then the square of the distance will also be minimal. But it does make a difference. Unlike K-means clustering,
l
f hierarchical clustering produces a dendrogram in which the height of each fusion is the dissimilarity measure, and
we don't want to spoil the scale of the graph by squaring the distance. Also, not squaring the distance will make a

Copyright 02022 ASM
difference when we are dealing with distances between clusters having more than one point for one of the distance
measures we will discuss.'
We start with one cluster for each point. At each iteration of the algorithm, we compare every pair of clusters. If
there are k clusters, we make k(k — 1)/2 comparisons. We select the pair of clusters with the smallest dissimilarity
and fuse them. We keep repeating this algorithm until we're left with one cluster.
We know how the dissimilarity between two points is defined. But how is the dissimilarity between two clusters
'49 defined? Dissimilarity between clusters is called linkage. Four types of linkage are popular: complete, single,
•,? average, centroid.
Complete linkage Calculate the dissimilarity between every point of cluster A and every point of cluster B. If there
are a points in A and b points in B, do ab calculations. The dissimilarity is the maximum of these numbers.
Single linkage Similar to complete linkage, except the dissimilarity is the minimum of the ab calculations. This
linkage leads to trailing clusters, clusters in which one point at a time is fused to a single cluster.
Average linkage As in complete linkage, calculate ab dissimilarities. Then use the average.
Centroid linkage Calculate the centroid of each cluster and use the dissimilarity between the centroids. This
method has the disadvantage that it may result in inversions. An inversion occurs when a later fusion occurs
at a height lower than an earlier fusion— the dissimilarity of a later fusion is less than the dissimilarity of an
earlier fusion involving the same points.
The textbook has a 9-point example in 2 dimensions using complete linkage. Before we do a 2-dimensional
example, let's do an even simpler example with 6 points in 1 dimension.
EXAMPLE 18C 41 You are given the points 0, 1,5, 7, 10, 13.5.
Carry out hierarchical clustering using each of the four linkages. •
SoLurioN: The first two links are the same regardless of linkage. The points closest together are 0 and 1, so we link
them into 1 cluster.
Next (5,7) are closest and are linked, so we have (0,1), (5,7) , (10), (13.5).
Complete linkage With complete linkage, the distance from (0,1) to (5,7) is the largest difference, or 7, and the
distance from (5,7) to (10) is 5. So we link (10,13.5), since they are only 3.5 apart.
The distance from {0,11 to (5,7) is 7 and the distance from (5,7) to {10,13.51 is 8.5, so we link the first two and have
(0,1,5,7) and (10,13.5). Then we link these two into one cluster.
As we see, complete linkage prefers fusing smaller clusters to fusing larger clusters. More points in a cluster
means more numbers to maximize over. Here, it first fused all the points into three 2-point clusters before it
considered fusing those clusters.
Single linkage With single linkage, starting with (0,1), {5,7} , (10), (13.5), distance is the smallest difference, so the
distance from (OM to {5,71 is 4 and the distance from (5,7) to 10 is 3, which is the smallest, so we link them. We have
(0,1), (5,7,10), and {13.5}. The distance between the first two is 4 and the distance between the last two is 3.5, so we
link them and get {0,11 and (5,7,10,13.5). Then we link these into one cluster.
Average linkage With average linkage, starting with (OM, (5,7), f101, (13.5), average distance between the first two
clusters is 5.5. Average distance between (5,7) and (10) is 4. Both of these are higher than 3.5, so we link (10,13.5),
and have (0,1), (5,7), and {10,13.51. Average distance between first two is still 5.5 and average distance between the
last two is 5.75, so we link the first two. We end up with the same hierarchy as with complete linkage.
Namely, the average linkage.

35 -
010
30
09 08
07
25- 0
5
06
20- 0
15- 03
10 -
01 02
5 -
0 1 I I 1 I 1 I
0 5 10 15 20 25 30 35
Figure 18.1: 10 observation points to be clustered
Centroid linkage with centroid linkage, starting with {0,11, {5,7} , {10}, {13.5}, distance between the first two is
6 - 0.5 = 5.5. Distance between {5,7} and {10) is 10 - 6 = 4. Distance between (10) and {13.5) is 3.5. So we link
110,13.51, and have {0,1), (5,7), and (10,13.5). Distance between the first two is still 5.5. Distance between {5,7) and
(10,13.5) is 11.75 - 6 = 5.75, so we link the first two. We end up with the same hierarchy as with complete linkage
and average linkage.
As an example of an inversion using centroid linkage, consider the following three points: (0,0), (40,0), (20,38).
The first link is between (0,0) and (40,0), since the distance to (20,38) from either point is 11202 + 382 = 42.94 > 40.
The centroid of {(0,0),(40,0)) is (20,0). The distance from the centroid to (20,38) is 38, which is less than the distance of
the first link, which is 40. While there is nothing per se wrong with an inversion, it produces a strange dendrogram,
and cutting the dendrogram in the inverted area produces unusual clusters.
Let's now look at dendrograms. We will work with the following 10 observation points:
1. (10,10) 6. (15,24)
2. (18,10) 7. (17,27)
3. (18,16) 8. (13,30)
4. (20,20) 9. (8,30)
5. (8,25) 10. (12,34)
They are graphed in Figure 18.1.
The first link is the same regardless of linkage, as it must be. It links point 6 to point 7. The distance between
them is V(17 - 15)2 + (27 - 24)2 = 3.606.
It turns out that the second link is the same regardless of linkage. It links point 8 to point 10. The distance
between them is 1112 + 42 = 4.123.
The third link is also the same regardless of linkage. It links points 3 and 4, with distance V22 42 = 4.472.
After the third link, the links differ by linkage. Let's start with complete linkage. Complete linkage continues
to link single points: first 5 and 9, and then 1 and 2. Complete linkage tends to prefer linking smaller groups since

Copyright 02022 ASM
Sequence Link Distance
4
10
Figure 18.2: Dendrogram with complete linkage
the distance is the maximum of the point-to-point distance; the more points, the higher the maximum tends to be.
Figure 18.2 shows the dendrogram and table of distances for complete linkage.
Next let's look at average linkage. The fourth link is the same as for complete linkage, but then it links 15,9) with \J
18,10). Average linkage is not as biased towards linking single points as complete linkage is. Figure 18.3 shows the
dendrogram and table of distances for average linkage.
Single linkage is the most unusual, so let's discuss centroid linkage next. Fortunately there are no inversions.
But the fourth link is already different from complete and average linkage. The link of (9) with (8,10) takes priority
over the link of (9) with 151 indicated by the other two methods. 151 does get linked with (8,9,10) at the next step,
making it like average linkage. But the later links are in a different order. See Figure 18.4 for the dendrogram and
table of distances.
Single linkage suffers from having some ties for distance. The dendrogram here breaks the ties by following
average linkage through link 5. At the three tied heights, the dendrogram has been distorted a bit to show the
sequence of links. See Figure 18.5 for the dendrogram and the table of distances. Notice how it behaves the
opposite of complete linkage; it prefers to fuse groups with greater numbers of observations, so that there are more
observations to minimize over, and saves the single point 111 to the end before fusing it to the clusters.
One thing to be careful about is that distance is determined vertically, not horizontally. For example, in centroid
linkage, point 5 is no closer to point 9 than it is to points 8 and 10. It is doser to those points than to points 6 and 7
by a whisker, because the link of that group to the group with 5 is higher than the link of 5 with (8,9,10).
Clusters are formed by cutting the tree at a specific height. For example, if you wanted 3 clusters, you'd cut the
trees below the first two links. The clusters you'd get would then be
Complete (1,2), 13,41,(5,6,7,8,9,10)
Average (1,2), (3,4,6,71, (5,8,9,10)
Centroid (1,2), 0,41,(5,6,7,8,9,10)
Single (1), (2,3,4), (5,6,7,8,9,10)
Look back at Figure 18.1. Which linkage leads to the most plausible clusters? I think in this case single linkage
is quite logical! But if we split the data into two clusters, I think single linkage is the least plausible in splitting 111
from the other nine observations. Splitting it as {1,2,3,4} and (5,6,7,8,9,10), as the complete and centroid linkages do,
seems more logical. You may have a different opinion.

Copyright 02022 ASM
10
Figure 18.3: Dendrogram with average linkage
1 {61-17) 3.606
2 {81-110} 4.123
3 {31—(4) 4.472
4 19)—(8,10} 4.924
5 151—{8,9,10} 7.008
6 {6,7}—{5,8,9,10) 7.150
7 111—(21 8.000
8 {1,2}-13,4) 9.434
9 {1,2,3,4}-0,6,7,8,9,10) 14.974
10
Figure 18.4: Dendrogram with centroid linkage

1 (61-171 3.606
2 {8}—{10) 4.123
3 {3)—(41 4.472
4 (51—{9} 5.000
5 (5,9)—(8,10) 5.000
6 {6,7}-15,8,9,10) 5.000
7 {2}—(3.4) 6.000
8 {2,3,4)-15,6,7,8,9,101 6.403
9 (1)—(2,3,4,5,6,7,8,9,10) 8.000
5 9
3 4
8 10
6 7
Figure 18.5: Dendrogram with single linkage
Practical clustering links large numbers of observations and has thick dendrograms. Refer to the textbook for
such dendrograms.
18.3 Issues with clustering

N:1 Hierarchical clustering assumes that there is a hierarchy. If the variables are unrelated, this may not be a valid
assumption. For example, if the variables are sex and race, K-means clustering will be superior to hierarchical
clustering.
With hierarchical clustering, another dissimilarity measure that may be used instead of distance is correlation.
In marketing, one may be more interested in customers with similar shopping patterns than in customers who
spend similar amounts on a product. A customer who buys the same products as another customer in the same
proportions, but spends half as much, may still be targeted with the same advertisements. On the other hand, a
customer who spends the same as another on half the products they buy but doesn't buy the other half may not be
similar enough to the other customer. To use correlation as a dissimilarity measure, there must be at least 3 features;
any 2 numbers are perfectly correlated (have a linear relationship) with any other 2 numbers.
Scale of variables is important. If customers of a department store are clustered based on two variables, (1) number
of apples purchased and (2) number of refrigerators purchased, the former variable will vary much more greatly
than the latter and will influence the dissimilarity measure more heavily. If the variables are (1) dollars spent on
apples and (2) dollars spent on refrigerators, the opposite may occur. One may want to standardize all variables to
have standard deviation 1.
Thus clustering requires the following decisions:
1. Which clustering method is more appropriate?
2. Should variables be scaled?
3. Which dissimilarity measure should be used? (Hierarchical only)

4. How many clusters? (K-means only)

18.3. ISSUES WITH CLUSTERING 331
Table 18.1: Summary of formulas and concepts in this lesson
Objective for K-means clustering minimizing squared Euclidean distance
1
minimize vi ti aI (18.3)
ICkl LiteCk j=1
Identity for squared Euclidean distance
(18.4)
Linkages for hierarchical clustering
Complete Largest distance between points

Single Smallest distance between points
Average Average distance between points
Centroid Distance between centToids
5. Which linkage? (Hierarchical only)

6. Where should the dendrogram be cut? (Hierarchical only)
l
f
Validating clusters—establishing that they are meaningful and not just a result of an algorithm that forces them
to exist—is difficult. There is no consensus on the best technique.
In some cases, clusters may not be appropriate. There may be many similar points, plus some outliers that
shouldn't be placed in any cluster.
Clustering is not robust. A small perturbation of the data may result in different clusters.
One may perform clustering on principal component score vectors instead of on the original data. The results
may be different.
While we've discussed clustering of observations based on features, the opposite is also possible: clustering
features based on observations.

Copyright 02022 ASM
Exercises
18.1. 41 (MAS-I-518:351 You are given the following three statistical learning tools:
I. Cluster Analysis
II. Logistic Regression
III. Ridge Regression
Determine which of the above are examples of supervised learning.
(A) None are examples of supervised learning
(B) land II only
(C) 1 and III only
(D) II and III only
18.2. 41 [SRM Sample Question #32] Determine which of the following statements is/are true with respect to
clustering methods.
I. We can cluster the n observations on the basis of the p features in order to identify subgroups among the
observations.
II. We can cluster p features on the basis of the n observations in order to discover subgroups among the features.
III. Clustering is an unsupervised learning method and is often performed as part of an exploratory data analysis.
(A) None (13) I and II only (C) I and III only (D) II and III only
(E) The correct answer is not given by (A) , (B) , (C) , or (0) .
18.3. sur [MAS-II-F19:40] You are given three statements about the k-means clustering algorithm.
I. The k-means clustering algorithm requires that observations be standardized to have mean zero and standard
deviation one.
IL The k-means clustering algorithm seeks to find subgroups of homogeneous observations.

III. The k-means clustering algorithm looks for a low-dimensional representation of the observations that explains
a significant amount of the variance.
Determine which of the statements I, II, or III are true.
18.4. ": [SRM Sample Question #43] Determine which of the following statements is NOT true about clustering
methods.
(A) Clustering is used to discover structure within a data set.

(B) Clustering is used to find homogeneous subgroups among the observations within a data set.
(C) Clustering is an unsupervised learning method.
(D) Clustering is used to reduce the dimensionality of a dataset while retaining explanation for a good fraction
of the variance.
(E) In K-means clustering, it is necessary to pre-specify the number of clusters.

l
f 18.5. A K-means clustering process with K =2 produced the following clusters:
A: (0,0), (1,2), (2,0), (3,2)
B: (0,3), (1,5), (2,4)
Clustering is based on squared Euclidian distance.
Calculate the value of the objective function, the function that is minimized by the clustering algorithm.
18.6. [MAS-II-S19:42] You have decided to perform K-means clustering with K 2 on the following dataset
and have already randomly assigned clusters as follows:
Observation 11 x2 Initial Cluster
1 5 5 2
2 4 6 2
3 3 0 1
4 5 3 1
5 5 1 2
6 3 6 1
7 2 5 2
• The centroid of the initial cluster 1 is x1 = 3.667, x2 = 3.000

• The centroid of the initial cluster 2 is x1 = 4.0, x2 = 4.250
Calculate the Euclidean distance of Observation 5 from the final centroid of Cluster 2.
(A) Less than 1

l
f (C) At least 2, but less than 3
(E) At least 4
18.7. Nr A professor with 10 students gives a final, and the grades are
99 98 95 93 91 89 87 85 79 77
The professor would like to divide these grades into 3 clusters. Grades in the highest cluster will be A; grades in
the middle cluster will be B; and grades in the lowest cluster will be C.
The professor uses K-means clustering with Euclidean distance squared. Initially, all grades above 90 are put in
the first cluster; grades between 80 and 89 are put in the second cluster; and grades below 80 are put in the third
cluster.
Determine the three clusters ultimately resulting from the clustering algorithm.
18.8. rYou are performing K-means clustering on a set of data. The data has been initialized randomly with 3
clusters as follows:
Cluster Data Point Cluster Data Point Cluster Data Point

A (0,0) B (1,2) C (-1,0)
A (1,4) B (-2,4) C (5,3)
A (2,2) B (0,2) C (-1, —3)
A single iteration of the algorithm is performed using Euclidean distance between points.
Determine the three clusters resulting after the iteration.
Exam 5RM Study Manual Exercises continue on the next page . . .

18.9. q [SRM Sample Question #15] You are performing a K-means clustering algorithm on a set of data. The
data has been initialized randomly with 3 clusters as follows:
Cluster Data Point
A (2,—i)
A (-1,2)
A (-2,1)
A (1,2)
B (4,0)
B (4,—i)
B (0,-2)
B (0,-5)
C (-1,0)
C (3,8)
C (-2,0)
C (0,0)
A single iteration of the algorithm is performed using the Euclidian distance between points and the cluster
containing the fewest number of data points is identified.
Calculate the number of data points in this cluster.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
You are given the following four pairs of observations:
xi = (-1,0), x2 = (1,1), x3 = (2,—I), and x4 = (5,10)
18.10.
•-.11 [SRM Sample Question #11 A hierarchical clustering algorithm is used with complete linkage and
Euclidean distance.
Calculate the intercluster dissimilarity between {xi, x21 and {x4}.

(A) 2.2 (B) 3.2 (C) 9.9 (D) 10.8 (E) 11.7
18.11. q A hierarchical clustering algorithm is used with average linkage and Euclidean distance.
Calculate the intercluster dissimilarity between {xi, x21 and {x4}.
18.12.
`-‘11.[MAS-II-F18:421 You are provided the following data set with a single variable X:
i X
1 9
2 15
3 4
4 2
5 18
A dendrogram is built from this data set using agglomerative hierarchical clustering with complete linkage and
Euclidean distance as the dissimilarity measure.
Calculate the tree height at which observation i = 1 fuses.
Exam SRM Study Manua} Exercises continue on the next page . . .


29 36 41 44 45 49 55 64 69
18.13.
1 A hierarchical clustering algorithm is used with single linkage and Euclidean distance.
The observations are split into three clusters.
Determine the three clusters.
18.14. "All A hierarchical clustering algorithm is used with complete linkage and Euclidean distance.
The observations are split into three clusters.
Determine the three clusters.
18.15. 41 You are given the following four pairs of observations:

= (1,0), x2 = (2,1), x3 = (3, 1), and x4 = (3, —2)
A hierarchical clustering algorithm is used with average linkage and Euclidean distance.
Calculate the intercluster dissimilarity between {x1, x2} and {x3, x4}.
18.16.
[MAS-II-F19:42] An actuary is using hierarchical clustering to group the following observations:
13
40
60
71
The actuary recalculates the clustering using two linkage methods: complete and average.
(i) hump is the height of the final fuse using complete linkage.
(ii) hayg is the height of the final fuse using average linkage.
Calculate I//comp — havg •
(A) Less than 10
(E) At least 25
18.17.
r[SRM Sample Question #2] Determine which of the following statements is/are true when deciding on
the number of clusters.
I. The number of clusters must be pre-specified for both K-means and hierarchical clustering.
II. The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering
algorithm.
The K-means clustering algorighrn requires random assignments while the hierarchical clustering algorithm
does not.
(A) I only (13) II only (C) III only (D) I, II and III

18.18. [SRM Sample Question #341 Determine which of the following statements is/are true about clustering
methods:
I. If K is held constant, K-means clustering will always produce the same cluster assignments.
Given a linkage and a dissimilarity measure, hierarchical clustering will always produce the same cluster
assignments for a specific number of clusters.
IlL Given identical data sets, cutting a dendrogram to obtain five clusters produces the same cluster assignments
as K-means clustering with K = 5.
18.19. NI° [SRM Sample Question #16] Determine which of the following statements is applicable to K-means
clustering and is not applicable to hierarchical clustering.
(A) If two different people are given the same data and perform one iteration of the algorithm, their results at
that point will be the same.
(B) At each iteration of the algorithm, the number of clusters will be greater than the number of clusters in the
previous iteration of the algorithm.
(C) The algorithm needs to be run only once, regardless of how many clusters are ultimately decided to use.
(D) The algorithm must be initialized with an assignment of the data points to a cluster.
(E) None of (A), (B), (C), or (D) meet the meet the stated criterion.
18.20. [SRM Sample Question #40] Determine which of the following statements about clustering is/are true.
I. Cutting a dendrogram at a lower height will not decrease the number of clusters.
II. K-means clustering requires plotting the data before determining the number of clusters.
III. For a given number of clusters, hierarchical clustering can sometimes yield less accurate results than K-means
clustering.
18.21. [SRM Sample Question #361 Determine which of the following statements about hierarchical clustering
is/are true.
I. The method may not assign extreme outliers to any cluster.

II. The resulting dendrogram can be used to obtain different numbers of clusters.
III. The method is not robust to small changes in the data.
Solutions
18.1. Cluster analysis has no response variable, so it is unsupervised. The two regression techniques have a
response variable, so they are supervised. (D)
18.2. I and II are true; clustering can be done on observations or on features. And it is an unsupervised method.
So all three statements are true. (E)

Copyright 02022 ASM
n. 18.3. Statement I sounds like it came from principal component analysis, or perhaps ridge regression or the lasso,
although none of them absolutely require standardizing. And statement III sounds like it came from principal
component analysis. Only H is true. (B)
18.4. Clustering does not reduce the dimensionality of a data set. It breaks a data set into clusters. (D)
18.5. We can measure distances between pairs of points or use the centroid formula. We'll use the centroid formula.
The centroid of A is
(0+1 +2 +3, 0+2+0+2)

4 4
— (1.5, 1)
The centroid of B is (1,4). The objective function is the sum of squared differences from the centroid, doubled, or
2((1.52 + 12) +(0.52 + 12) +(0.52 + 12) + (1.52 + 12) + (12 + 12) + (02 1- 12) + (12 + 02)) = 26
18.6. Squared Euclidean distances from the two centroids are

Observation Distance from Centroid 1 Distance from Centroid 2
1 5.7769 1.5625
2 9.1109 3.0625
3 9.4449 19.0625
4 1.7769 2.5625
5 5.7769 11.5625
6 9.4419 4.0625
7 6.7789 4.5625
Thus observations 1, 2, 6, and 7 go to cluster 2 and observations 3,4, and 5 go to cluster 1. The new centroids
are (4.333,1.333) for cluster 1 and (3.5,5.5) for cluster 2. The new squared Euclidean distances are
Observation Distance from Centroid 1 Distance from Centroid 2
1 13.8889 2.5
2 21.8889 0.5
3 3.5556 30.5
4 3.2992 8.5
5 0.5556 22.5
6 23.5556 0.5
7 18.8889 2.5
No cluster assignments change, so this is the final clustering. The squared distance of Observation 5 from
cluster 1 is 22.5; -k,/2 = 4.7434 . (E)
18.7. The means of the clusters are 95.2, 87, and 78. Then 91 is moved to the second cluster since it is closer to 87
than to 95.2. The resulting clusters have means 96.25, 88, and 78, and no improvement is possible, so they are the
final clusters.
18.8. The centroids are (1,2), (-1/3,8/3), and (1,0).

(0,0) is closest to (1,0).
(-1,0) is closest to (1,0).
(1,4) is closest to (-1/3,8/3).
(-2,4) is closest to (-1 /3, 8/3).
(0,2) is closest to (-1/3,8/3).
(-1, —3) is closest to (1,0).
The clusters are 1(0,0), (-1,0), (-1, —3)1,1(1,4), (-2,4), (0,2)), and 1(1,2), (5,3), (2,2)).
l
f

Copyright &2022 ASM
18.9. The centroid of A has x coordinate (2 + (-1) + (-2) + 1)/4 = 0 and y coordinate (-1 + 2 + 1 + 2)/4 = 1. The
centroid of B is (2, —2). The centroid of C is (0,2). Looking at these centroids on a graph:
•
*C
A
*B
it's clear that the only points near (0,2) are the three upper ones on the graph; the other points are closer to the
centroids of A or B. But if you want to be sure, you have to calculate all the distances between points and their closest
centroids. (D)
18.10. For complete linkage, we calculate the maximum distance from the two pairs of observations in the group.
1x4— =J(5_(_1))2+(10_0)2= VT )
I X4 — X21 = 1/(5 1)2 + (10 — 1)2 = A/T7
The larger distance is Ji = 11.6619 . (E)
18.11. The distance is now based on the average of the two distances, 0.5(.0737; + -\17") = 10.7554
18.12. Complete linkage means we consider the maximum difference between clusters. The difference between 2
and 4 is the minimum of the differences of two numbers, so i = 3 and 4 fuse first. Then 15 and 18 are closest, so i = 2
and 5 fuse. Now the two clusters of two elements are 16 apart (18 — 2), so i = 1 fuses next, and it is closer to 12,41
(distance 7) than to 115,181 (distance 9), so it fuses at height E. (0
18.13.
44 and 45 are fused. With single linkage, the distance from 41 is 3, so 41 is fused into (44,45). The distance
of (41,44,45) from 49 is 4, s049 is fused into this group. Now 36 is fused into this group, and (64,69) is fused, since
they have distance 5. So far we have
29 {36,41,44,45,49) 55 {64,69}
Now 55 is fused into 136,41,44,45,491. The three clusters are (29), {36,41,44,45,49,551, and (64,69).
18.14.
44 and 45 are fused. Then 41 is fused with them since it is 4 away. The distance of (41,44,45) to 49 is 8, so
{64,69) are fused next. Then {49,55) are fused. Then (29,36) are fused. So far we have
{29,36} {41,44,45} {49,55} {64,69}
Distances between groups are 16 from first to second, 14 from second to third, and 20 from third to fourth, so second
and third are fused. The three clusters are (29,36), (41,44,45,49,55), and (64,69).

18.15. We must compute the four distances between the two points in each cluster.
1X3—X 21 = 1/(3 — 2)2 + 1)2 =

I X4 — I= (3_1)2 + (-2 — 0)2 =
ix4 — x21 = -\/(3 — 2)2 + (-2 — 1)2 = VTO
The average of these four numbers is 2.3067
18.16. With complete linkage, first fuse 60-71, which has the smallest distance, 11. Now 40 is 31 away from 60-71
since complete linkage uses the maximum distance between points in clusters, and here 71 is the furthest from 40.
40 is closer to 13; it is 27 away. So fuse 40-13. Finally fuse the two pairs (13,40) and {60,711. Their distance is the
distance between points furthest apart, hcomp = 71 — 13 = 58.
With average linkage, first fuse 60-71. The average of that cluster is 65.5, and 40 is closer to 65.5 than to 13, so
fuse 40-(60,71), which has an average of 57. Final fuse of 13 to {40,60,71) has distance havg = 57 — 13 = 44.
The answer is 58 — 44 = 14. (B)
18.17.
I. The number of clusters is specified only for K-means clustering. X

Both methods are sensitive to outliers. X
he
T K-means clustering algorithm begins by randomly splitting the data into K clusters. The hierarchical
clustering algorithm does not start out with any assignment.
(C)
18.18.
I. K-means clustering starts out with a random assignment to K clusters, and the final result may vary depending
on the initial assignment. X
II. True; there is nothing random in hierarchical clustering./
III. Since I is false, this is certainly false; moreover, different linkages may lead to different results from hierarchical
clustering. X
(B)
18.19.
(A) With K-means clustering, one must choose a random initial distribution of observations into clusters, so the
statement is not applicable to K means clustering.
(B) With K-means clustering, the number of clusters is fixed in advance, so the statement is not applicable to K
means clustering.
(C) With K-means clustering, the algorithm produces a local minimum and must be run several times to obtain a
global minimum.
(ID) This statement is true for K-means clustering but not for hierarchical clustering.
(D)
18.20.
I. The dendrogram shows fusions at their height, and lower heights are fused before higher heights (at least if
centroid linkage is not used), so this statement is true.
II. There is no requirement to plot data before deciding on K. X

Copyright =022 ASM
III. This is true, since K-means clustering starts off with a random assignment to clusters which may work out
well, whereas hierarchical clustering is force to fuse based on linkage and may miss better alternatives for a
specific number of clusters. s/
(C)
18.21.
I. Every point is assigned to its own cluster right from the start; then the clusters are fused together. No point
drops out. X
The dendrograrn may be cut at arty height to obtain different numbers of clusters. /
III. This is true. i/
(D)

ActuarialUniversity.corn and log into your account.

Copyright 02022 ASM
Part IV
Time Series
Lesson 19
Time Series: Basics
Reading: Regression Modeling with Actuarial and Financial Applications 7, 8.1
19.1 Introduction
A time series is a series of observations y, y2, , YT over consecutive time periods, such as months or years.
Examples of time series are:
• Daily stock prices
• Volume of stock trades, by day
• Monthly sales
• Population of a country, by year
Time series analysis attempts to find patterns in a time series that can help predict future values. The patterns may •41
relate the terms yi to previous terms in the series or to the time variable t. We'll now briefly discuss models that
relate yt to t.
Longitudinal data is data from a process that varies with time. Cross-sectional data is the opposite: data that is not •1?
organized by time. A regression model in which a dependent variable is a function of explanatory variables other
than time is a causal model, in the sense that it states that the dependent variable y is caused by explanatory variables ikit
xi. However, statistical models only find correlation, not causation. When several variables grow with time, a causal
model may find a spurious relationship between the variables. On the other hand, a regression of a time series
variable against time (this is not a causal model) may find a legitimate deterministic relationship between the time
series and time. Causal models have the additional drawback that to forecast the dependent variable, you must have
a forecast of the independent variables.
A time series can be decomposed into three parts: trend, seasonal factors, and random patterns. Trend is the
long-term pattern of the data, while seasonal factors are the cyclical pattern. Let 7; be trend and St the seasonal
pattern. Then an additive model would be "41
+ Sj + Et
while a multiplicative model would be
yi = T, x S1 ±
To help analyze a time series, one may draw a times series plot. A time series plot is a Scatter Plots of a time
series against time, with the consecutive points connected with lines.
A regression model for a time series may simply have a linear trend in time:
yi = Igo+ it 4- el
One may also use a polynomial. Seasonal patterns may be modeled by dummy variables for the seasons or by
trigonometric functions. Sometimes seasonal adjustment is done; a new time series is created with the seasonal '46
pattern removed. It is also possible to model a regime change, a change in the behavior of the time series starting at
a certain point of time, by using a dummy variable set equal to 0 before that time and 1 after that time.
Regression models are naive in the sense that they ignore information other than the time series being modeled.
Another shortcoming of regression techniques is that they give the highest weight to the earliest and latest observa-
tions, the ones with the t variable furthest away from the mean. Generally, when forecasting, one wants to give the
highest weight to the latest forecasts and the lowest weight to the earliest forecasts.

Copyright 02022 ASM
344 19. TIME SERIES: BASICS
19.2 Mean and variance
• If the mean of a time series does not vary by t, the series is said to be stationary in the mean. For such a time series,
we can estimate the mean as the sample mean of the observed values.
Let p(t) be the mean of y,, the tth term of a time series. The variance of a time series at time t is
0-2(t) = ERY, — P(0)2] (19.1)
"-.16 If the variance does riot vary with t, the series is said to be stationary in the variance. If the series is stationary in the
*1:1 variance, we can estimate the variance using the usual forrnula for the sample variance:
2 Z;1=1(Yt fi)2
s =
(19.2)
n —1
Time series terms tend to be correlated with each other (otherwise the time series wouldn't be interesting). The
sample variance will tend to underestimate the true variance for this reason. However, this bias reduces rapidly as
the size of the series increases.
The correlation of terms in a time series is of great interest. The correlation of a time series with itself is called
1141 autocorrelation or serial correlation. We will look at the correlation of terms in a time series at times t to terms at times
t k, the correlation of yi with yt,k. The distance between terms, k, is called the lag. If a series is stationary in mean
•
and variance and autocorrelation is a function only of the lag, not of the time, we say that the time series is (weakly)
stationary. The higher moments of such a series may still vary with t. If none of the moments vary with time, then
the series is strongly stationary. We will assume, in our discussion of autocorrelation, that the time series is weakly
stationary.
We calculate the sample autocorrelation at lag k with this formula:
(19.3)
Notice that the numerator and denominator do not have the same number of terms. The textbook refers to rk as the
autocorrelation, leaving out the word "sample". When the textbook deals with the underlying autocorrelation of
the time series, it calls it "the correlation between terms k apart" and uses the symbol pk. In this manual, we will
refer to the underlying autocorrelation as the true autocorrelation.
The (true or sample) autocorrelation at lag 0 is 1.
EXAMPLE 19A "Ili You are given the following time series:
5, 9, 3, 6, 7
Calculate the sample autocorrelation at lag 1. •
SOLUTION: We will calculate

Z7=2(xi--1 ,k)(xt —
= ELi(xt -1)2
The sample mean is 6.
=
(-1)(3) + (3)(-3) + (-3)(0) + (0)(1) —
(_1)2 4. 32 + ± 02 + 12
You can use your statistical calculator for this calculation. Use the E x2 statistic for the denominator and the E XY
statistic for the numerator, where X for the numerator calculation is the first 4 terms and Y is the last four terms. 0
Quiz 19-1 ki? In Example 19A, calculate the sample autocorrelation at lag 3.

Copyright es2022 ASM
19.3. WHITE NOISE 345
(---‘ 19.3 White noise

A white noise time series is a stationary time series in which each term is independent and has constant mean p and 111
constant variance (52. We usually assume that the terms are normally distributed. (Many authors also assume that
p = 0, but the author of our textbook does not assume that.) Since the terms are independent, autocorrelations at
all lags greater than 0 are 0.
Suppose that we are given T observations of a white noise process, yi,...,yT. Then the /-period look-ahead Ikir
forecast is 9 . The forecast interval is NI°
1
9± 1,1-exl2Sy T, (19.4)
where tT...1,1_02 is the f-distribution critical value with T — 1 degrees of freedom and confidence level a, and S2y is
the unbiased sample variance of the T observations. The width of the forecast interval is independent of 1. Often
we use the approximate 95% forecast interval 9 ± 2sy.
Quiz 19-2 N? You are given the white noise series{yi,...ys} = {1, —1, 2, 0,3}.
Calculate the upper bound of a 90% forecast interval for y7.
We try to reduce any time series to white noise by finding patterns, leaving the unexplained part as white noise.
The procedure for reducing a time series to white noise is called a filter. The uncertainty that cannot be explained
by patterns is called irreducible.
A white noise series may be identified if a series has a more or less constant mean and variance and doesn't move
around much.
19.4 Random walks

l
f
A random walk is a nonstationary time series which is the initial level yo plus the accumulation of white noise. Let
ct be white noise. Then the random walk yt can be expressed as
yt = ct t > 1
Let pc and ac2 be the mean and variance of the white noise series C. Then E[yll = yo + tpc and Var(y1) = tcf. Even
if pc = 0, a random walk is nonstationary because the variance increases with time. (Many other authors consider
= 0 to be part of the definition of random walk. If pc 0, they call the series a "random walk with drift", and pc
is called the "drift")
The forecast of a random walk is
YT+I = YT + re (19.5)
where ë is the sample mean of cf. The standard error of an /-period lookahead forecast is se Nil, where se is the
estimate of the standard deviation of c. Therefore, the forecast has approximate 95% confidence interval sir
yr + lE± 2scV1 (19.6)
The differences of a random walk, yl are white noise.

To identify a random walk:
1. Check the pattern of the values. For a random walk, the values increase at a constant rate and the variance is
a linear function of time.
2. Differencing the time series should result in white noise. The pattern of values should be constant with
constant variance over time.
3. Standard deviation of differences should be significantly lower than standard deviation of the original series.

•41 You should distinguish between a random walk and a linear trend in time. A linear trend in time is
= yo + kt et
whereas a random walk is
yr = yo+pct Ej
j=1
• Whereas they both have the same mean if k = pc, the variance is different. A linear trend in time has stationary
variance, whereas a random walk's variance increases with time. The linear trend in time's error term is white noise,
making it stationary, whereas the random walk with drift's error term u = Eit=i ei is a random walk, making it
nonstationary. If the drift of the random walk pc = 0, the comparison between it and a linear trend in time is not so
clear since the random walk's mean does not increase (or decrease) in time.
Differencing a random walk is an example of filtering. Another filtering method is to take the logarithm of a
time series, which may stabilize variance. If that doesn't work, one may take differences of logarithms. Difference
of logarithms correspond to approximate proportional changes;
y1-1 Yi YI-1
- in Yi-i = in
yt—i = In (1 +
19.5 Control charts
'41' A control chart for a time series is a chart upon which control limits are superimposed. The limits are typically
• UCL = 7+3sy and LCL = 17— 3sy, where UCL is the upper control limit and LCL is the lower control limit. Examples
of control charts are
Xbar charts calculate the averages of series of k observations. For example, if k = 5, compute the average of
observations 1-5, observations 6-10, observations 11-15, etc. The variance of an average is lower than the
variance of the series, so unusual patterns should stick out.
R charts calculate ranges of series of k observations. The range is the maximum observation minus the minimum
observation. The range is a simple measure of variability, and the chart helps evaluate patterns.
19.6 Evaluating forecasts

•
To evaluate a forecast, we use out-of-sample validation techniques, similar to the ones we learned in Lesson 5. We
may split the data, which goes through time T, at time T1 <T. The data up to time Ti is used as a training data set
and the subsequent data is a test data set. We fit the model using the training data set. The data from time Ti + 1
to time T is forecasted using the model, and the residuals (excess of actual over forecast) et = yt - 91 are computed.
The following statistics may be used to evaluate the differences; for all of them, the smaller the better.
1. Mean error statistic
1
ME =
T Eei
1=3'1+1
2. Mean percentage error
3. Mean square error

1
MSE = e2
T

19.6. EVALUATING FORECASTS 347
Sample variance of time series

2
2 E;1=1 (Yt 9)
—
n—1
(19.2)
Sample autocorrelation at lag k

ZT=k4-1(Yi-k g)(Yr g)
rk =
—
(19.3)
Forecast for white noise
YT+I = yr for any /
Forecast interval for white noise
9 ± tT-1,1-42 4 1+ (19.4)
/-period lookahead forecast for random walk

YT+I = YT + lc (19.5)
Approximate 95% 1-period l000kahead forecast interval for random walk
yr + 2sc1/7 (19.6)
Quality measures for validation of models ME, MPE, MSE, MAE, MAPE. See Section 19.6 for formulas.
4. Mean absolute error

1
MAE =
T E jed
t=Ti +1
5. Mean absolute percentage error

100 et
MAPE =
T —
=71+1
ME and MPE can reveal trend patterns, but they will not reveal problems when the residuals are positive and
negative but have a low average. The other measures will reveal such problems. MPE cannot be used if the series
has Os and may not be logical if the series has negative terms. The same applies to MAPE if differences are 0 or
negative.

Copyright 02022 ASM
Exercises
19.1. You are given the following time series
yr
1 8
2 5
3 12
4 18
5 7
6 10
Calculate the sample autocorrelation at lag 1.
19.2. r[S-F16:42) You are given the following ordered sample of size 6 from a time series:
1 1.5 1.6 1.4 1.5 1.7
Calculate the sample lag 2 autocorrelation.

(B) At least —0.6, but less than —0.3
(C) At least —0.3, but less than 0.0
(E) At least 0.3
19.3. [S-F17:431 You are given the following information from a time series:
xi
1 4.0
2 3.5
3 2.5
4 5.5
5 4.5
6 4.0
4.0

(B) At least —0.30, but less than —0.10
(C) At least —0.10, but less than 0.10
(E) At least 0.30

19.4. air [MAS-I-818:41] You are given the following information from a time series:
yi
1 3.0
2 2.5
3 2.0
4 3.5
5 4.0
6 3.0
3.0
Calculate the lag 3 autocorrelation.

(A) Less than -0.30
(B) At least -0.30, but less than -0.10
(C) At least -0.10, but less than 0.10
(E) At least 0.30
19.5. [120-F90:16] A mutual fund has provided investmentyield rates for five consecutive years as follows:
Year Yield
1 0.07
2 0.06
3 0.07
4 0.10
5 -0.05
Determine r2, the sample autocorrelation coefficient at lag 2.

(A) -0.5 (B) -0.4 (C) -0.3 (D) -0.2 (E) -0.1
19.6. L.: [120-83-98:13] You observed the following time series:

Yr
1 1.2
2 1.1
3 0.9
4 1.3
5 1.5
Determine the lag 2 sample autocorrelation coefficient 1'2.

(A) -0.50 (B) -0.02 (C) 0.15 (0) 0.22 (E) 0.52

Copyright 6)2022 ASM
19.7. •-: [MAS-1-F18:45] You are given the following quarterly rainfall totals over a two-year span:
Quarter Rainfall
2016q1 25
2016q2 19
2016q3 10
2016q4 32
2017q1 26
2017q2 38
2017q3 22
2017q4 20

(A) Less than 0.0
(E) At least 0.9
19.8. s: [MAS-I-F19:44] You are given the following annual sales totals for a department store.
Year Sales
2013 400
2014 375
2015 410
2016 420
2017 410
2018 525

(A) Less than 0.00
(E) At least 0.15
•
19.9. e'li
[MAS-I-S19:421 Consider the following time-series data for price of a stock on January 1 for the last 5
years:
Date Jan. 1, 2013 Jan. 1, 2014 Jan. 1, 2015 Jan. 1, 2016 Jan. 1, 2017
Price 63.18 81.89 103.43 123.90 133.53
Calculate the sample autocorrelation at lag 1 for this data.

(A) Less than 0.2
(E) At least 0.8

19.10. d'Ir [SRM Sample Question #31 You are given:

(i) The random walk model
YO el ± C2 ± • • • ± Ci
where ct, t = 0,1, 2, ... , T denote observations from a white noise process.
(ii) The following nine observed values of Ct
t 11 12 13 14 15 16 17 18 19
ci 2 3 5 3 4 2 4 1 2
(iii) The average value of Cl, C2, • • • ,c10 is 2.

(iv) The 9 step ahead forecast of yi9, .f/.19, is estimated based on the observed value of yio.
Calculate the forecast error, y19 — 1/19.
(A) 1 (3) 2 (C) 3 (D) 8 (E) 18
19.11. •••• [SRM Sample Question #311 Determine which of the following indicates that a nonstationary time
series can be represented as a random walk
I. A control chart of the series detects a linear trend in time and increasing variability.
II. The differenced series follows a white noise model.
III. The standard deviation of the original series is greater than the standard deviation of the differenced series.
19.12. kir [SRM Sample Question #38] You are given two models:
Model L: y, = ho + f31f ± el
where fell is a white noise process, for t = 0,1, 2, ...
Model M: y = yo + pct + ut
if, — Ys-i
\-1
ut
2,i=1 ei
where { et} is a white noise process, for t = 0,1, 2, ...
I. Model L is a linear trend in time model where the error component is not a random walk.
Model M is a random walk model where the error component of the model is also a random walk.
III. The comparison between Model L and Model M is not clear when the parameter pc = 0.
(A) I only (B) II only (C) HI only (D) I, II, and III

19.13. •Ir [SRM Sample Question #4] You are given:

(i) The random walk model
yt = yo + ci + c2 + • • • + ct
where ct, t = 0,1,2, ... ,T denote observations from a white noise process.
(ii) The following ten observed values of yt:
t 1 2 3 4 5 6 7 8 9 10
yi 2 5 10 13 18 20 24 25 27 30
(iii) yo = 0
Calculate the standard error of the 9 step-ahead forecast, 919.
(A) 4/3 (B) 4 (C) 9 (D) 12 (E) 16
19.14. "It [SRM Sample Question #21] A random walk is expressed as

yt = + cs for t = 1,2,
where
E[ci] = pc and Var(ci) = a, t = 1,2,...
Determine which statements is/are true with respect to a random walk model.
I. If 1.1, = 0, then the random walk is nonstationary in the mean.
II. if cr = 0, then the random walk is nonstationary in the variance.
III. If cr > 0, then the random walk is nonstationary in the variance.
19.15. s: You are given that ct is white noise with variance 16.
Determine the width of an approximate 95% confidence interval for a three step ahead forecast.
19.16. "411 You are given that xi is a random walk. The variance of .x-5 is 35.
Determine the variance of x6.
19.17. a: For a random walk { yi}, differences of consecutive terms are white noise with mean 2 and variance 16.
Assume that the terms of the white noise series are normally distributed.
You are given that yi = 3.
Calculate the probability that y2 is positive.
19.18. sfr You are given that yt is a random walk. Assume all terms of the series are normally distributed. o-,2 = 3.
Determine the approximate width of a 95% confidence interval for a three step ahead forecast.
19.19. 441 You are given that yi is a time series satisfying the following equation:
Yi =
+4+ ci
where ci is a normally distributed error term. The variance of the error term is estimated to be 5. You are also given
that VT 25.
Determine the lower bound of a 95% confidence interval for a three step ahead forecast of yt, 9T+3.

Copyright 02022 ASM
19.20. a'? [4-F00:40] You are given two random walk models. These models are identical in every respect, except
that for one of them pc = 0 and for the other one Uc > 0.
Which of the following statements about these random walk models is incorrect?
(A) For the random walk with pc = 0, all forecasted values from time T are equal.
(13) For the random walk with pc = 0, the standard error of the forecast from time T increases as the forecast
horizon increases.
(C) For the random walk with pc # 0, the forecasted values from time T will increase linearly as the forecast
horizon increases.
(D) For the random walk with pc 0, the standard error of a forecasted value from time T is equal to the
standard error of the corresponding forecasted value for the random walk with pc = 0.
(E) For the random walk with pc 0, the standard error of the forecast from time T increases or decreases,
depending on pc, as the forecast horizon increases.
You are given the following time series:
2,4,1,3,5,3,2,1,6,4
It is modeled as a white noise series. To test this model, out-of-sample validation is done, with the first 6 terms
used for the model development subsample.
19.21. %. Calculate the mean percentage error.

19.22. 41 Calculate the mean square error.
l
f
19.23. 'kr You are given the following time series:
2,5,10,11,15,17,20,22,25,27
It is modeled as a random walk. To test this model, out-of-sample validation is done, with the first 6 terms used for
the model development subsample.
Calculate the mean square error.
19.24. 0You are given the following time series:

2,5, 10,11, 17, 22, 25, 28, 35, 40
It is modeled as a random walk. To test this model, out-of-sample validation is done, with the first 6 terms used for
the model development subsample.
Calculate the mean absolute error.

Copyright 02022 ASM
19.25. [SRM Sample Question #55) You are given the following eight observations from a time series that
follows a random walk model:
Time (t) 0 1 2 3 4 5 6 7
Observation (y,) 3 5 7 8 12 15 21 22
You plan to fit this model to the first five observations and then evaluate it against the last three observations
using one-step forecast residuals. The estimated mean of the white noise process is 2.25.
Let F be the mean error (ME) of the three predicted observations.
Let G be the mean square error (MSE) of the three predicted observations.
Calculate the absolute difference between F and G, IF - G.
(A) 3.48 (B) 4.31 (C) 5.54 (D) 6.47 (E) 7.63
Solutions
19.1. We calculate E(y, - g)2 and E(yi - 9).

= 10
= 706
E(Yi - 9)2 = 706- 6(102) = 106

Z(Yi - ;17)(yi+i - (-2)(-5) + (-5)(2) + (2)(8) + (8)(-3) + (-3)(0) = -8
-8
-0.075472
1.1 = T1II6 =
19.2. The sample mean is 1.45. Subtracting the sample mean, the terms are -0.45, 0.05, 0.15, -0.05, 0.05, 0.25. The
sum of their squares is 0.295. The lagged sum (xi - 1)(xi+2 - R) is
(-0.45)(0.15) + (0.05)(-0.05) + (0.15)(0.05) + (-0.05)(0.25) = -0.075
The sample lag 2 autocorrelation is -0.075/0.295 = -0.25424 (C)
19.3. The numerator of the autocorrelation formula at lag 4 is (xi -5) (x5-2)+ (x2-1)(x6-1) = (0)(0.5)+ (-0.5)(0) = 0,
so the autocorrelation is E. (C)
19.4.
(3.5 - 3)(3 - 3) + (4 - 3)(2.5 - 3) + (3 - 3)(2 - 3) -0.5
r3 - = -0.2 (B)
-
02 ± (_.5)2 (_1)2 0.5)2 + + 02 2.5

19.5. The mean is0.07+0.06+0.07+0.1-0,05
5
— 0.05. We need
Y2 —
EL3(Yi - 2)(yr-2 - Y)
V=1 (Yt g)2
After subtracting 0.05 from each term, the series is 0.02, 0.01, 0.02 0.05, -0.10.
The denominator is 0.022 + 0.012+ 0.022 + 0.052 + (-0.10)2 = 0.0134. The numerator is (0.02)(0.02) + (0.05)(0.01) +
(-0.10)(0.02) = -0.0011. So r2 =
= -0.08209 . (E)

Copyright 02022 ASM
19.6. The mean is 1.2+1.1+0.9+1.3+1.5 =1.2. We need5
12 =
V.3(Yt - 0)(Yt-2 - 9)
ELi(Yi - 9)2
After subtracting 1.2 from each term, the series is 0, -0.1, -0.3, 0.1, 0.3.
The denominator is 2(0.12 + 0.32) = 0.2. The numerator is (0.1)(-0.1) + (0.3)(-0.3) = -0.1. So r2 = -0.5
(A)
19.7. You'd use your calculator to work this out. The average of the 8 numbers is 24; the sum of square differences
of the numbers from 241s 506. Then the lag 4 product is
= (26 - 24)(25 - 24) + (38 - 24)(19 -24)

+ (22 - 24)(10 - 24) + (20 - 24)(32 - 24) = -72
The sample lag 4 autocorrelation is -72/506 = -0.14229,. (A)

19.8. The mean is 423.33, and E(xf - 2)2 for sales xi equals 13583.33. That is the denominator of the autocorrelation.
The numerator is
(400 - 423.33)(410 - 423.33) + (375 - 423.33)(420 - 423.33)

+ (410 - 423.33)(410 -423.33) + (420 - 423.33)(525 -423.33) = 311.11
The lag 2 autocorrelation is 311.11/13583.33 = 0.0229 (B)

19.9. The mean of the five numbers is 101.186. Then the numerator of the autocorrelation before division by 5 is
fl 4
Z (yi — 9)(yi+i = (63.18- 101.186)(81.89 - 101.186) + - • • + (123.90- 101.186)(133.53 - 101.186) = 1475.695
while the denominator before division by 5 is the biased sample variance times 5 or the unbiased sample variance
times 4: 3383.887. The quotient is 1475.695/3383.887 = 0.4361. (C)
19.10. The forecast is yio plus 9 times the drift p(c), the mean of the underlying white noise, and the drift
is estimated as the average value of the past observations, or 2, so 919 = Y10 + 18. The actual value of y19 is
Yio + + 3 + 5 + • • • + 2 = yio + 26. The forecast error is 26 - 18 = n. (D)
19.11. See the enumerated list on page 345, taken from the Frees textbook (except that "control chart" is not
mentioned); it includes all three statements. (D)
19.12. See page 346, which summarizes an obscure passage in Frees. Note that statement III means that the
comparison is not clear because L has trend and M doesn't, so you wouldn't think of comparing them; it doesn't
mean that you can't compare and contrast them. (D)
19.13. The 10 observed changes, el = yi _1, are 2, 3, 5, 3, 5, 2, 4, 1,2, 3. Their mean is 3 and their unbiased sample
variance is 16/9. So the standard deviation is 4/3. The standard error of 919, is 1/§. times the standard deviation, or
(B)
19.14. The mean of a random walk is yo pct, which is not constant unless pc = 0, so I is true. The variance of a
random walk is ta., which is not constant unless cy = 0, so II is false and III is true. Of course if a = 0 the walk is
not really random. (C)
19.15. The variance is 16 at all future times, so the confidence interval is ±2V3.-6, and 2(2)A/ 17 =
19.16. The variance of each term is t a2. If52= 35, then 6a2 =
19.17. y2 is a normal random variable with mean yi + pc = 3 + 2 = 5 and variance 16. The probability that it is
greater than 0 is 1 - 1)((0 5)/-{1-6) = 1 - (13(-1.25) = 0.89441
Copyright 02022 ASM
19.18. Variance is 3(3) = 9. Confidence interval width is 2(2) =1111.

19.19. yi is a random walk with pc = 4. Variance is 3(5) = 15. Using the approximation of equation (19.6), the •—..)
lower bound of confidence interval is 25 + 3(4) — 2V1:5- = 29.25
19.20.
A. Since the mean change at each time unit is E[mr+ki = 0, the forecast is xT. for all future times. t/
B. Since a white noise term is added at each time, the standard error for k periods ahead is "sffc.
C. For the random walk with pc at 0, the forecasted value k periods ahead is XT + kp. 1/
D. The standard error is the square root of the variance of the sum of k white noise terms.
E. pc is non-stochastic so it has no effect on the standard error, which always increases. X
(E)
19.21. The average of the first 6 terms is 3, so the forecast for all future times is 3. The mean percentage error is
100(1 2 —3 —1
= 43.75
4 2 1 6 4
19.22.
(_1)2 (_2)2 + 32 ± 12 3.75
4
19.23.(yo — yi)/5 = (17— 2)/5 = 3, so the forecasted increase per period is 3. The forecasted series is 20, 23, 26, 29.
The mean square error is
02 + 12 + 12 + 22
1.5
4
19.24. (y6 — yi)/5 = (22— 2)/5 = 4, so the forecasted increase per period is 4. The forecasted series is26,30, 34,38.
The mean absolute error is (1 + 2 + 1 + 2)/4 = 1.5
19.25. For each forecast period, the forecast is pt =9 + 2.25. Thus 95 = 14.25, 96 = 17.25, and 97 = 23.25. The
forecast errors are 0.75, 3.75, and —1.25 respectively.
The mean error is
0.75 + 3.75 — 1.25
F= = 1.08333
3
The mean square error is

0.752 + 3.752 + 1.252
G = 5.39583
3
The absolute difference is 4.3125 . (B)
Quiz Solutions
19-1.
(-1)(0) + (3)(1)
r3 —
(_1)2 + 32 + (_3)2 4. 02 + 12
—
[0.15
19-2. The sample mean is 1 and the unbiased sample variance is (02 + 22 + 12 + 12 + 22)/4 = 2.5. A forecast interval
is two-sided, so we use the 95thpercentile of a t distribution with 4 degrees of freedom, or 2.1318. The upper bound
of a 90% forecast interval is
1
1 + 2.1318 -k./ 5-\/1 + — = 5
4.6924

000
Never stop testin your knowledge, whether it be relevant to our career, or ust trivia.
QUESTION:
Which car is often called the first
muscle car?
019 DeRuod aqi :Jamsuv
atstm
Lesson 20
Time Series: Autoregressive Models
Reading: Regression Modeling with Actuarial and Financial Applications 8.2-8.4
An autoregressive model of order 1, or AR(1) is a time series where each term may be expressed in terms of the '46
previous term plus white noise:
Examples of autoregressive processes are:
1. White noise, if f3i =0.

2. A random walk, if pi =
However, we generally insist that 0 < 'pi I<1. When 'Pi I<1, the process is stationary.
For a stationary AR(1) process, the true autocorrelation at lag k is
pk (20.2)
To test whether a series is white noise, in other words Pi = 0, one may check whether the sample autocorrelations
rk are significant. The standard error of rk is 1/Vf. So if Irti > 2/VT, the autocorrelation may be regarded as
significant, rejecting the white noise model.
The coefficients po and pi may be estimated using the method of conditional least squares, which is simply
regression of yi on The coefficients are approximately b1 = 1'1 and bo = 9(1 — b1). Let et be the residuals,
et=y' — (ho +1)12/1-1)
The variance of the error term et is estimated by
1
S2 —
T—3 E (et —
1=2
1)2 (20.3)
Here, there are T 1 terms in the sum, but 2 degrees of freedom are used up to estimate po and (31, leaving T —3
degrees of freedom.
EXAMPLE 20A a-.? You are given the time series {55, 35, 52, 40, 46, 42). You fit an AR(1) model to it using the
autocorrelation at lag 1 to approximate pi.
Calculate the estimated variance of the error term.
SOLUTION: First calculate ri. The mean of the six terms is 45. After subtracting 45 from each term, the remainders
of the time series terms are {10, —10,7, —5,1, —3}. The autocorrelation at lag 1 is
(10)(-10) + (-10)(7) + (7)(-5) + (-5)(1) + (1)(-3) 0.75

11 = =
102 + (-10)2 + 72 + (-5)2 + 12 + (-3)2
Then P'.0 = 45 + 0.75(45) = 78.75. The estimated values of the time series are gi = 78.75 — 0.75y,_, and the error is
Et =ys— 9, — (78.75 — 0.75y/-4)

360 20. TIME SERIES: AUTOREGRESSIVE MODELS
We cannot calculate9,so we can't calculate the first error. The other errors are —2.5, —0.5, 0.25, —2.75, and —2.25.
The mean error is —1.55. The variance of the error is
(( 2.5 — (-1.55))2 + (-0.5 — (-1.55))2 + (0.25 — (-1.55))2 + (-2.75 — (-1.55))2 + (-2.25 — (-1.55))2)
:= 2.391667 0
Notice that the variance of yt is greater than the variance of r. By taking the variance of the two sides of
equation (20.1), we get
Var(si)
Var(yt) = (20.4)
One may define autoregressive models of higher order, but they are not on the syllabus.
Forecasting with AR(1) series
To forecast values in an AR(1) model, start at time T and express them in terms of the previous values recursively,
omitting the El term:
9t+1 = PO + poi
'41 For an AR(1) process, the estimated variance of a k-period forecast is
Var(gT+k) = 62 E p321(1-1)
t=1
(20.5)
where s 2 is the estimated variance of the error term. Notice that for a 1-step ahead forecast, the sum has only one
term, 1, so Var(frT+1) = s2.
The approximate 95% forecast interval is
Quiz 20-1 s:
For an AR(1) model, Po = 5,131 = 0.6, and yy, = 32.
Calculate the forecast of YT+2.

Table 20.1: Summary of Formulas for AR(1) Models
Definition of process:
yt =p0 +piyi_i + (20.1)
The process is stationary if and only if VII <1
pk = (20.2)
Approximate fitted values of Po and Pi: 1)1 ri, bo 9(1 -
S2 = Var(el ) = T—3
(et — (20.3)
t =2
Var(ei)
Var(yl) = 2 (20.4)
1 — P1
k
Var(9T+k) = 62 E P21(1-1)
1=1
(20.5)
Exercises
20.1. si? You are given the AR(1) process yi = + Ei. Also, icq = 4.
r." Determine Var(yi ).
20.2. For an AR(1) process yi = 0.6y1_1 + el, you are given that Var(yt) = 5.
Determine
20.3. %. [VEE Applied Statistics-Summer 05:1] You are given the following information about an AR(1) time-series
model:
p2 = 0.8
Determine 'Pi I.
(A) 0.5 (B) 0.6 (C) 0.7 (D) 0.8 (5) 0.9
20.4. sir You are given the following AR(1) process:

— 50 = 0.75(y/-4 —50) + Ej
where El is white noise with
• E[ei] = 0
• Var(Ei) = a2
Y20 = 60. Forecasts are based on y20.
Determine the lowest I for which 920+t <53.

Copyright Q2022 ASM
362 20. TIME SERIES: ALITOREGRESSIVE MODELS
20.5. NIP [MAS-1-519:43] You are given an autoregressive time series of order 1:
= +
and the following time series graphs:
S1 $2
0 20 40 00 80 100 0 20 40 00 SO 100
Time Time
$3 Si4
0 20 40 ao 80 100 0 20 40 ao 80 100
Time Time
Determine the most likely coefficient P1 for each graph, sl—s4.

(A) s1=0.90, s2=0.50, s3=0.99, s4=0.99
(B) s1=0.99, s2=-0.50, s3=0.90, s4=-0.99
(C) s1=-0.90, s2=0.50, s3=-0.99, s4=0.99
(D) s1=-0.99, s2=0.50, s3=-0.90, s4=0.99
(E) s1=-0.90, s2=0.99, s3=-0.99, s4=0.50
20.6. [SRM Sample Question #22] A stationary autoregressive model of order one can be written as
Yt = Po + piyi_i + el, t = 1,2, .. .
Determine which of the following statements about this model is false.

(A) The parameter 00 must not equal 1.
(13) The absolute value of the parameter pi must be less than 1.
(C) If the parameter pi = 0, then the model reduces to a white noise process.
(D) If the parameter 01 = 1, then the model is a random walk.
(E) Only the immediate past value, yt-i, is used as a predictor for yt •
20.7. am? [S-F15:421 An AR(1) model is fitted to time series data through time t = 7. The resulting model is
yt = 14.3379 — 0.79yt_i + Ei
You are given that y7 = 8.50.

Calculate the forecast for the observation at t = 9, 99.
(A) Less than 8.25
(E) At least 8.55
(---- 20.8. r[MAS-I-S18:44] You are given the following fitted AR(1) model:
ye = 5 + 0.85y/...1 + Et
The estimated variance of the error is 13.645.
Calculate the two-step ahead forecast standard error.

(A) Less than 4.0
(E) At least 13.0
20.9. You are given the following fitted AR(1) model based on 20 observations:
ye = 4 — + El
You are given:

(i) 1/20 = 3-5.
(ii) The estimated variance of the error is 12.5.
Calculate an approximate 95% forecast interval for y23.

Copyright 02022 ASM
364 20. TIME SERIES: ALITOREGRESSIVE MODELS
20.10. You are given the following fitted AR(1) model based on 15 observations:
= 10 + 0.4y, + et
You are given:

(i) Y15 = 15
(ii) The sum of the square differences of the residuals from their means is 60.
Calculate the upper bound of a 99% forecast interval for y17 using the t distribution.
20.11. Nr For an AR(1) process:

(i) 7 (9T-q) = 10
(ii) Gr(Pr+2) = 15
Calculate Var(9T+3).
Solutions
20.1. By formula (20.4)
Var(yr) = 1 _40.92 = 21.0526
a2
20.2. By formula (20.4) for Var(y/), 5 = 1-0r6 ' Therefore, 4 = 3.2
20.3. By formula (20.2), p2 = fq, so 131 = 1/U with absolute value approximately 0,9. (E)
20.4. The mean of the series is 50 and the current value y20 = 60, which is 10 higher than the mean. In the forecast \---)
of an AR(1) series, each excess of a term over the mean is 131 times the excess of the previous term over the mean.
Here, pi = 0.75. We want the t such that yi = 53, an excess of 3 over the mean, so we want 10(0.751) <3. Solving
fort,
10(0.751) <3
0.75' <0.3
t In 0.75 < 1n0.3
In 0.3
t > = 4.185
ln0.75
The smallest t with 920+1 <53 is IT.

20.5. sl and s3 are alternating, so one would expect a negative coefficient, with greater magnitude for s3 because
it oscillates with higher amplitude. s2 and s4 appear to have positive coefficients, with s4's higher since it doesn't
damp down as fast. Thus (C) is the most likely.
20.6.
(A) is false, since the model is stationary regardless of the intercept. (B) is true since the model is not stationary
otherwise. The other statements are true by definition of white noise, random walk, and by the equation of the
model.
20.7.
98 = 14.3379 + (-0.79)(8.50) = 7.6229
99 := 14.3379 + (-0.79)(7.6229) = 8.3158 (B)
20.8. Use formula (20.5) with s2 = 13.645 and 131 = 0.85.
-NA7a-r(9T+2) = V13.645(1 + 0.852) = 4.848 (B)

20.9. )121 = 4 — 0.9(3.5) = 0.85 The standard error of the forecast is
:922 = 4 — 0.9(0.85) = 3.235

923 = 4 — 0.9(3.235) = 1.0885
V12.5(1 + 0.92 + 0.94) = 5.5521

The approximate 95% forecast interval is 1.0885 ± 2(5.5521) = (-10.0158,12.1928)
20.10. The mean square error is 60/(T — 3) = 60/12 = 5. The t coefficient to use for 99% confidence is to.005, since
we want a two-sided confidence interval, so only 0.5% should be excluded on the lower and upper sides. For 12
degrees of freedom, toms = 3.0545.
916 10 + 0.4(15) = 16
917 = 10 + 0.4(16) = 16.4
The upper bound of the 99% forecast interval is 16.4 + 3.0545-V5(1 + 0.42) = 23.7562
20.11. From (i), S2 = 10. From (ii), s2(1 + fiT) = 15, so g 0.5. Then pl = 0.25, and 10(1 + 0.5 + 0.25) = 173
Quiz Solutions
20-1.
19.52


Copyright 02022 ASIV1
Lesson 21
Time Series: Forecasting Models

Reading: Regression Modeling with Actuarial and Financial Applications 9
21.1 Moving average smoothing

To smooth a time series, one may average k consecutive terms and create a new time series: °or
Yt—k+1 Yt—k+2 " yi
st =
This formula can be rearranged into

— yi—x
If there are no trends in the data, then we can ignore the second term on the right and forecast 3r = g.T. for all I.
If a time series has trend, we can perform double moving average smoothing. In other words, we can smooth st: 6-r
..(2) =
gi—k+1 g(—k+2 " • gl
SI
(2)
The estimate of trend is bi,r 2(gr — sT )/(k — 1), and the forecast is 9T+1 =
Moving average smoothing is weighted least squares, with weights of 1 on the most recent k periods and 0 on
earlier periods.
21.2 Exponential smoothing

Moving average weights are constant for the previous k periods and then drop off to 0. A smoother pattern of
weights would be more desirable. Exponential smoothing puts weights proportional to 7,Vk on the kth period before
the current one. These weights add up to Ew wk = 1/(1 — w), so the formula for an exponentially smoothed series is 11-:
+ wYt-t + w2Y1-2 + • • •
Si = (21.1)
1/(1 — w)
In practice, values of y are not available before time 1, so we set
+ wyt—i + w2yt-2+ • • • +wiyo
st =
1 —
where yo can be set to 0, the series mean, or yi..1

An alternative way to express exponential smoothing is
gt = (1 — w)Yt + tvgt-i (21.2)
Setting w low leads to very little smoothing, whereas setting it high leads to a lot of smoothing. Using this equation,
the forecast using exponential smoothing is that future values will equal the last known value of s:
9T+k = gT (21.3)
'The weights do not add up 1. I would prefer to pot the balance of the weight on yo, but the textbook does not do that. It doesn't make a big
difference ill is large. It makes no difference if yo = 0.
Exam SAM Study Manual 367

Copyright 02022 ASM
368 21. TIME SERIES: FORECASTING MODELS
Quiz 21-1 I': You are given the following time series:
12, 18, 20, 21, 25
You are to exponentially smooth it with smoothing parameter w = 0.6 and starting value yo = 0.
Calculate 55.
One way to evaluate this model is to check what this model would have predicted in the past versus the data we
have. The one-step prediction error is et = y, — gi_l. Thus the sum of squared one-step prediction errors is
SS(w) = E(yt - gi-1)2

1=1
(21.4)
(where presumably go = yo, although the textbook doesn't say). You should select w to minimize SS(w).
If the time series has trend, one may perform double exponential smoothing; calculate gi as above, then expo-
nentially smooth gi to obtain g1(2). Then, let bo,T be the intercept and bix be the slope. They are
bo,T = 2gr g.?)
1-W .(2)
(21.5)
bi,T (gT sT
and the forecast is g F I = bO,Tbl,T1,
21.3 Seasonal models
Fixed effects
Fixed seasonal effects can be modeled using trigonometric functions or dummy variables for each season. The most
convenient form is a linear regression model with the variables being trigonometric functions. Let SB be the seasonal
base—the number of terms in the time series per year, or per whatever seasonal cycle is of interest. For example,
if a time series has annual seasonal cycles and the terms in the series are monthly observations, then SB = 12. Let
fi = 271f ISB, where j is a positive integer. Then the linear regression model for the seasonal time series is
j=1
If SB is even then nz is at most SB 12; any additional variables would be collinear with the ones already in the model.
In fact, sin fm t is identically 0 for m = SB /2 and would be omitted. The optimal value of 171 is selected by comparing
models, possibly using F tests.
Autoregressive seasonal models

We've studied AR(1) models, in which each term is a function of the previous one. In a seasonal autoregressive
model, each term is a function of a term that is a multiple of SB periods earlier, where SB is the seasonal base, as
defined above. In other words,
Yt = 130 131Yi—S13 P2Y1-2SB ' " PPYI-P•SB
When there are P lagged terms, this model is called an SAR(P) model.

21.4. UNIT ROOT TESTS 369
Seasonal exponential smoothing

Holt generalized double exponential smoothing by allowing different smoothing parameters for the intercept and '4°
the trend. As usual, let 170,t be the intercept and bi,t be the trend, so that we forecast yT+1 = bo,T -Fbi,TI. Let w1 be the
smoothing parameter for the intercept and w2 the smoothing parameter for the trend. Then b0,1 is 1 - wi times the
term (yi) plus w1 times the forecast from the previous period, b0,f-1 +
bo,/ = (1 - wi)yt + wi(bo,t-i +
b1,1 is 1 - w2 times the observed trend of the bos plus w2 times the smoothed trend from the previous period:
bu = (1 - w2)(bo,1 - bo,t-i) + w2b1,1-1
Winter added seasonality to this model. The resulting model is called the Holt-Winter seasonal additive model .2 Let 4/
St be the seasonal factor, with Zis_1..31 St = 0, where SB is the seasonal base. A third weight w3 is used to update the
seasonal factors. The recursive formulas are
= (1 - wi)(Yt - St-sB)+ +
= (1- w2)(bo,1 - (21.6)

gt = (1 - ws)(Yi NJ) W3g1—SB
The forecasting equation is
9T+1 = bO,T bl,T1 ST(/) (21.7)
where ST(/) is the last value of the seasonal factor for season I, ST(I)= ST+1-S13.
EXAMPLE 21A air The Holt-Winters seasonal model is used. Data is available for 90 monthly periods. You have
computed
b0,90 = 122
bi,90 = 5
S.(85) =-3
Calculate the projected value of y121. •
SOLUTION: Projection is 121 - 90 = 31 periods into the future. The cycle is annual and there are 12 months, so
ST(121) = ST(109) = ST(97) = ST(85).
9121 = 122 + 31(5) -3 274
21.4 Unit root tests
Sometimes a time series may show trend, but it is not clear whether the time series is a random walk, has a linear
trend, or is autoregressive. To test this, we set up the regression model
ill - y1-1= Po ( - 1)yr-i + pit + et
yt is a random walk if rp = 1. We use the t statistic to determine the significance of the hypothesis cp = 1 versus Gp <1,
but we cannot use the standard critical values for t under the null random walk hypothesis. Instead, we use critical
values compiled by Dickey and Fuller, which are higher, so it is harder to reject the random walk hypothesis.3
The Dickey-Fuller test assumes that El are not serially correlated. The augmented Dickey-Fuller test allows for
serial correlations of Ci. In the augmented Dickey-Fuller test, we add lagged variables to the regression:
yr - =
(q5 - 1)yr-1 + flit E i(yt_ -

j=1
+
It is not clear how many lagged variables should be added, so typically the test is performed for several values of p.
2There is also a multiplicative model, but we will not discuss it.
3To partially make up for this, it is traditional to use 10% significance instead of 5% significance.
Exam SkIvi Study Manual

Copyright 02022 ASM
370 21. TIME SERIES: FORECASTING MODELS
21.5 ARCH and GARCH models
Nr The autoregressive changing heteroscedasticity model, or the ARCH model, allows the conditional variance of a time
series to vary based on previous errors, yet leaving the unconditional variance constant, and therefore allowing the
1-1? time series to be weakly stationary. The formula for the variance at any time t is
2 2
at -w+Eyj Et-1 •
( 21.8)
j=1
where w > 0 is the long-term volatility parameter, yi 0, and VI < 1.

The generalized ARCH model, or the GARCH model, allows the variance to vary with previous variances as
well as with previous error terms:
4 =W + (21.9)
•411 with bi 0, yi 0, and yi +Ebi <1. A GARCH model is stationary with unconditional variance
Var(Ei) = w (1 - yj ai) (21.10)
Double moving average smoothing Holt-Winter forecasting equation

Estimate of trend
9T+i = bo,T big/ ST(/) (21.7)
Exponential smoothing Dickey-Fuller unit root test Test whether 4) = 1 in the

Formula for series following regression:
= (1 - w)yi 74_1 = (21.2) yt = Po + (0/9 - 1),y1-1 + Pit + El

11(1 - w)
Forecasting equation
9T+k = ST (21.3) ARCH conditional variance
Sum of squared one-step ahead prediction error

(21.8)
SS(w) = E(yi gt_i)2
1=1
(21.4)
Double exponential smoothing, estimate of trend

GARCH conditional variance
box = 2g1, - §1F)
1- "(2)\
(21.5)
bi,T = - ST ) ut
2
(21.9)
i=1 j=1
Holt-Winter model recursive formulas
bo = (1 - wi)(Yt - gt-ss) + b1,,-1)

b1,1 = (1 - zu2)(boi - bod -1) + GARCH unconditional variance
=
- zu3)(Y1 - bo,t)+ W3g i—SB
(21.6) Var(ei) = w
(1 - _ E oi) (21.10)

Copyright 02022 ASM
lf Exercises
21.1. 'Nit You are given the following time series:

t 0 1 2 3 4
Yt 50 50 60 44 56
You forecast this series using an exponentially weighted moving average model with w = 0.6.
Determine the forecast of y6.
21.2. '41 You are applying exponential smoothing to a time series. You are given that Y20 = 100, yn = 101,
= 110, w = 0.6.
Determine g 21 .
21.3. *se? You apply exponential smoothing to a time series. You are given
y5 = 90 y6 = 81 g4 = 70 55 = 82
Determine the lowest possible value for the smoothing constant w.
21.4. "Lir [Old exam] You are given the following information about a series of observations, y, t = 1, 2, 3, ... ,100:
td< io for all t
Y96 = 7.1
Y97 = 8.9
Y98 = 5.5
y99 = 8.2
Yioa = 7.3
You are to use exponential smoothing with w = 0.1 to forecast values beyond the end of the observation period.
You are given that the smoothed value g95 = 7.1.
Calculate the forecasted value of yioi.
(A) 7.1 (B) 7.2 (C) 7.3 (D) 7.4 (B) 7.5
21.3. NI° [120-81-95:13] You use exponential smoothing with w = 0.9 and go = 27 to forecast the following time
series:
Yt
27
1 20
2 30
3 10
4 60
5 15
Determine 55.
(C) 22 (D) 26 (B) 27

c (A) 16 (B) 19

Copyright 02022 ASM
372 2/. TIME SERIES: FORECASTING MODELS
21.6. '41 [120-83-98:121 You are given the following observations from a time series:
t 1 2 3 4 5
yiI 15.6 3.4 13.3 -4.6 -1.8
and go = 15.6.
Use exponential smoothing with smoothing constant w = 0.40 to calculate the smoothed value gs of the fifth
observation.
(A) -1.8 (B) -0.4 (C) 1.2 (D) 5.2 (E) 11.3
21.7. 82 [VEE-Applied Statistics-Summer 05:3] You use exponential smoothing with w = 0.9 to make one-step-
ahead forecasts, that is, 9t+1 = gl. t = 0,1,2, 3, for the following data:
Y! 91+1
81
1 80
2 81
3 89
4 74
Determine the mean forecast error Z41_1(yi - 91).

(A) -0.5 (B) -0.1 (C) 0.0 (D) 0.1 (E) 0.5
21.8. s: You are given the time series

t 1 2 3 4
:lit 1 4 8 11
.(2)
You perform double exponential smoothing with smoothing constant w = 0.8 and starting values so = so = 0.
Calculate the estimated trend.
21.9. [SRM Sample Question #461 A time series was observed at times 0, 1,..., 100. The last four observations
along with estimates based on exponential and double exponential smoothing with w = 0.8 are:
Time (t) 97 98 99 100
Observation (yi) 96.9 98.1 99.0 100.2
Estimates (41)) 93.1 94.1 95.1
Estimates (e) 88.9 89.9
All forecasts should be rounded to one decimal place and the trend should be rounded to three decimal places.
Let F be the predicted value of yin using exponential smoothing with w = 0.8.
Let G be the predicted value of y102 using double exponential smoothing with w = 0.8.
Calculate the absolute difference between F and G, IF - GI.
(A) 0.0 (B) 2.1 (C) 4.2 (D) 6.3 (E) 8.4
Exam SRM Study Manual Exercises continue on tire next page. ..

EXERCISES FOR LESSON 2/ 373
21.10. q You apply the Holt-Winters method to a time series.

You are given the following data:
t 2 3 4
XI 100 120 130
b0,1 =90 b1,1 = 10 w1 = 0.4 zy2 =0.2
There is no seasonality in the data; gi = 0 for all t.

Determine b0,4.
21.11. •-: You apply the Holt-Winters method to a time series.

t 2 3 4
xt 80 100 90
= 80 b1,1 = 10 w1 = 0.4 w2 = 0.6
There is no seasonality in the data; gi = 0 for all I.

Determine b14.
21.12. q You apply the Holt-Winters method to a time series.

YT = 55 bo,r-i = 50 bi,T-1 = 10 w1 = 0.5 w2 = 0.2
There is no seasonality in the data; 5ct = 0 for all I.

Determine the 5 step lookahead forecast, 9T+5.
21.13. "-.11' You apply the Holt-Winters method to a time series. You are given box-1
and you forecast bo,T+i = 536. The smoothing constant zy2 = 0.5.
Determine the smoothing constant w1.
You apply the Holt-Winters method to a time series. You are given that the data are monthly, y2 = 19,1/14 = 15,
yi5 = 20, b0,14 = 17, bi,14 = 2, ,. 3 = 3, zvi = 0.8, w2 = 0.7, w3 = 0.6.
21.14. 6.1? Calculate S15.
21.15. ••• g4 = -2.
Calculate the forecasted value 916.
21.16. q For an ARCH model, you are given

(i) Var(o I ei_l, ei_2) = w + 0.64k + 0.2e_2
(ii) w = 4.
(---
Calculate Var(E/).

Copyright Q2022 ASM
374 2/. TIME SERIES: FORECASTING MODELS
Solutions
21.1. The forecast of yo is the same as forecast of y5, namely $4. We use
S4
EL0 wk(1 - W)Y4-k
1/(1 - w)
„
56 + 0.6(44) + (0.62)(60) + (0.63)(50) + (0.64)(50) 48.512
Y6 -
1/0.4
21.2. gzo = 0.4(100) + 0.6(110) = 106

S21 = 0.4(101) + 0.6(106) = 104
21.3. g5 = 90(1 - w) + 70w = - 20w
g6 = y6(1 - w) g5w
82 = 81(1 - w) + (90- 20w)w = -20w2 + 9w + 81
20w2 - 9w + 1 = 0
9 ± A/81 :1
w= = 0.20,0.25
40
21.4. The forecasted value is 9ioi. = gioo• Note that gge, = 0.1(7.1) + 0.9(7.1) = 7.1.
giao = 0.9(7.3) + 0.1g

= 0.9(7.3) + 0.09(8.2) + 0.0195
= 0.9(7.3) + 0.09(8.2) + 0.009(5.5) + 0.00197
= 0.9(7.3) + 0.09(8.2) + 0.009(5.5) + 0.0009(8.9) + 0.0001(7.1) = 7.36622 (D)
21.5. This can be done recursively by calculating i,g2. etc., or in one calculation with weights of 0.1 on the current
observation y5, (0.9)(0.1) on the previous one y4, (0.92)(0.1) on y3, etc., with the leftover weight 0.95 placed on h:
g5 = 0.1(15) + 0.09(60) + 0.081(10) + 0.0729(30) + 0.06561(20) + 0.59049(27) 27.15243 (E)
21.6.
$5 = 0.6(-1.8) + 0.6(0.4)(-4.6) + 0.6(0.42)(13.3) + 0.6(0.43)(3.4) + (0.44)(15.6) = -0.3773 (B)
21.7. The following table summarizes the calculation.

t yi gt = 0.1y1 + 0.9gi-i 91 .1/1 -9
1 80 80.9 81 -1
2 81 80.91 80.9 0.1
3 89 81.719 80.91 8.09
4 74 81.719 -7.719
The mean forecast error as defined in the problem is
1(-1 + 0.1 + 8.09 - 7.719) =

-
4
-0.13225 (B)
You would've ended up with the same answer choice if by mistake you summed up yi - gi instead of yt -9,.

Copyright e2022 ASM
21.8. We calculate 01) and gr) recursively.

41) = 0.2(1) = 0.2 42) - 0.2(0.2) = 0.04
41) = 0.2(4) + 0.8(0.2) = 0.96 g2(2) - 0.2(0.96) + 0.8(0.04) = 0.224
„(2)
s^3(1) = 0.2(8) + 0.8(0.96) = 2.368 s3 - 0.2(2.368) + 0.8(0.224) =
_
0.6528
=
0.2(11) + 0.8(2.368) = 4.0944 042) = 0.2(4.0944) + 0.8(0.6528) = 1.34112
Estimated trend is
(
_ w)/w)(41) _42)54 ) 3:28 (4.0944 - 1.34112) =
0
0.68832
21.9. For exponential smoothing, using the question's rounding rules,

"(1)
6
Igo = 0-2(100 2) + 0 8(95 1) = 96.1
"
The forecast is F = 96.1.

For double exponential smoothing,
A(2)
S = 0 2(95 1) + 0 8(89 9) = 90.9
^(2)
s
10 0
=0 2(96 1) + 0 8(90 9) = 91.9
" "
Using formulas (21.5), the trend is

0.2
b1,100 =0.8-(96.1 -91.9) = 1.05
and the intercept is

b0,100 = 2(96.1) - 91.9 = 100.3
So the prediction is G = 100.3 + 2(1.05) = 102.4. The difference is 1102.4 - 96.11 =16.3 (D)
21.10. b0,2 = 0.6(100) + 0.4(90 + 10) = 100
b1,2 = 0.8(100 - 90) + 0.2(10) = 10
b0,3 = 0.6(120) + 0.4(100 + 10) = 116
b1,3 = 0.8(116 - 100) + 0.2(10) = 14.8
= 0.6(130) + 0.4(116 + 14.8) = 130.32
21.11. 1/0,2 = 0.6(80) + 0.4(90) = 84
b1,2 = 0.4(4) + 0.6(10) = 7.6
b0,3 = 0.6(100) + 0.4(91.6) = 96.64
b1,3 = 0.4(12.64) + 0.6(7.6) = 9.616
b0,4 = 0.6(90) + 0.4(106.256) = 96.5024
b1,4 = 0.4(-0.1376) + 0.6(9.616) = 5.71456
21.12. box = 0.5(55) + 0.5(60) = 57.5
= 0.8(7.5) + 0.2(10) =8
9T+5 = 57.5 + 5(8) 97.5
= 536 - 504 = 32 = + 0.5(504 - 470) b1,7'-1 30
504 = (1 - wi)(520) wi(470 + 30) w1 = 0.8

376 2L TIME SERIES: FORECASTING MODELS
21.14. We won't need bus, but we need bo,i.s.
b0,15 = 0.2(20 — 3) + 0.8(17 + 2) = 18.6

gi5 = 0.4(20 — 18.6) + 0.6(3) = 2.36
21.15. Now we need b1,15. Note that :.‘16 = 54

b1,15 = 0.3(18.6 — 17) + 0.7(2) = 1.88
916 = b0,15 b1,15 g16 = 18.6 + 1.88 — 2 = 18.48
4
21.16. Var(et) =
yi = 1 -0.2
Quiz Solutions
21-1. Using equation (21.1),
25 + (0.6)(21) + (0.62)(20) + (0.63)(18) + (0.64)(12) 20.0973

S5 —
1/0.4


Part V
Practice Exams
Practice Exam 1
1. g'ir A life insurance company is underwriting a potential insured as Preferred or Standard, for the purpose of
determining the premium. Insureds with lower expected mortality rates are Preferred. The company will use
factors such as credit rating, occupation, and blood pressure. The company constructs a decision tree, based on its
past experience, to determine whether the potential insured is Preferred or Standard.
Determine, from a statistical learning perspective, which of the following describes this underwriting method.
I. Classification setting
IL Parametric
III. Supervised
2. An insurance company is modeling the probability of a claim using logistic regression. The explanatory variable
is vehicle value. Vehicle value is banded, and the value of the variable is 1, 2, 3, 4, 5, or 6, depending on the band.
Band 1 is the reference level.
The fitted value of the /3 corresponding to band 4 is —0.695.

Let 01 be the odds of a claim for a policy in band 1, and 04 the odds of a claim for a policy in band 4.
l
f Determine 04/01.
(A) 0.30 (B) 0.35 (C) 0.40 (D) 0.45 (E) 0.50
3. kor Auto liability claim size is modeled using a generalized linear model. Based on an analysis of the data, it is
believed that the coefficient of variation of claim size is constant.
Which of the following response distributions would be most appropriate to use?

(A) Poisson (B) Normal (C) Gamma (D) Inverse Gamma (E) Inverse Gaussian
Exam SRM Study Manual 379 Exam questions continue on the next page . . .
Copyright 02022 ASM
380 PRACTICE EXAM 1
4. a: You are given the following output from a GLM to estimate loss size:
(i) Distribution selected is Inverse Gaussian.
(ii) The link is g(p) = 1I2.
Parameter
Intercept 0.00279
Vehicle Body
Coupe 0.002
Sedan —0.001
SUV 0.003
Vehicle Value (000) —0.00007
Area
—0.025
0.015
0.005
Calculate mean loss size for a sedan with value 25,000 from Area A.
(A) 80 (B) 160 (C) 320 (D) 640 (E) 1280
5. '1? For a generalized linear model,

(ii) There are 25 parameters.
(iii) The loglikelihood is —361.24
You are considering adding a cubic polynomial variable.
Determine the lowest loglikelihood for which this additional variable would not be rejected at 1% significance.
(A) —358 (B) —356 (C) —354 (D) —352 (E) —350
6. N. In a principal components analysis, there are 2 variables. The loading of the first principal component on the
first variable is —0.6 and the loading of the first principal component on the second variable is positive. The variables
have been centered at 0.
For the observation (0.4, x2), the first principal component score is 0.12.
Determine x2.
(A) 0.25 (B) 0.30 (C) 0.35 (D) 0.40 (E) 0.45
7. a: Determine which of the following statements is/are true.

I. The lasso is a more flexible approach than linear regression.
Flexible approaches lead to more accurate predictions.
III. Generally, more flexible approaches result in less bias.
Exam SRM Study Manual Exam questions continue on the next page . . .
Copyright Q2022 ASM
PRACTICE EXAM 1 381
A generalized linear model for automobile insurance with 40 observations has the following explanatory
variables:
SEX (male or female)

AGE (4 levels)
TYPE OF VEHICLE (sedan, coupe, SUV, van)
MILES DRIVEN (continuous variable)
USE (business, pleasure, farm)
Model I includes all of these variables and an intercept. Model II is the same as Model I except that it excludes
USE. You have the following statistics from these models:
Deviance AIC
Model I 23.12 58.81
Model II 62.61
Using the likelihood ratio test, which of the following statements is correct?
(A) Reject Model II at 0.5% significance.
(B) Reject Model II at 1.0% significance but not at 0.5% significance.
(C) Reject Model II at 2.5% significance but not at 1.0% significance.
(D) Reject Model II at 5.0% significance but not at 2.5% significance.
(E) Do not reject Model II at 5.0% significance.
9. You are given the following two clusters:
{(8,2), (9,7), (12,5)1 and {(10,3), (11,1))
Calculate the dissimilarity measure between the clusters using Euclidean distance and average linkage.
(A) 3.6 (B) 3.7 (C) 3.8 (D) 3.9 (E) 4.0
10.
A normal linear model with 2 variables and an intercept is based on 45 observations. gi is the fitted value of yi,
and 9i(i) is the fitted value of yi if observation i is removed. You are given:
(i) — 9)(02 = 4.1.
(ii) The leverage of the first observation is 0.15.
Determine IP' I, the absolute value of the first residual of the regression with no observation removed.
(A) 3.9 (B) 4.4 (C) 4.9 (D) 5.4 (E) 5.9
11.
11411 A least squares model with a large number of predictors is fitted to 90 observations. To reduce the number of
predictors, forward stepwise selection is performed.
For a model with k predictors, RSS = ck•
The estimated variance of the error of the fit is 82 = 40.
Determine the value of cd cd+1 for which you would be indifferent between the d + 1-predictor model and the
d-predictor model based on Mallow's C.
(A) 40 (B) 50 (C) 60 (D) 70 (E) 80
Exam SRM Study Manual Exam questions continue on the next page .
382 PRACTICE EXAM I
12. 11-: A classification response variable has three possible values: A, B, and C.
A split of a node with 100 observations in a classification tree resulted in the following two groups:
Group Number of A Number of B Number of C

I 40 10 10
II 5 25 10
Calculate the cross-entropy for this split.

(A) 0.72 (B) 0.76 (C) 0.80 (D) 0.84 (E) 0.88
13. q Determine which of the following statements are true regarding cost complexity pruning.
I. A higher a corresponds to higher MSE for the training data.
II. A higher a corresponds to higher bias for the test data.
III. A higher a corresponds to a higher IT i•
14. •-: Determine which of the following constitutes data snooping.

(A) Using personal data without authorization of the individuals.
(B) Using large amounts of low-quality data.
(C) Using an excessive number of variables to fit a model.
(D) Fitting an excessive number of models to one set of data.
(E) Validating a model with a large number of validation sets.
15. 11 Determine which of the following statements are true regarding K-nearest neighbors (KNN) regression.
I. KNN tends to perform better as the number of predictors increases.
II. KNN is easier to interpret than linear regression.
III. KNN becomes more flexible as 1/K increases.
16. *-: A department store is conducting a cluster analysis to help focus its marketing. The store sells many different
products, including food, clothing, furniture, and computers. Management would like the clusters to group together
customers with similar shopping patterns.
Determine which of the following statements regarding cluster analysis for this department store is/are true.
I. The clusters will depend on whether the input data is units sold or dollar amounts sold.
II. Hierarchical clustering would be preferable to K-means clustering.
III. If a correlation-based dissimilarity measure is used, frequent and infrequent shoppers will be grouped together.
Exam SRM Study Manual Exam questions continue on the next page ...
Copyright C2022 ASM
PRACTICE EXAM 1 383
'I Determine which of the following statements regarding principal components analysis is/are true.
I. Principal components analysis is a method to visualize data.
II. Principal components are in the direction in which the data is most variable.
III. Principal components are orthogonal.
18. '41. A random walk is the cumulative sum of a white noise process ct. You are given that ci is normally distributed
with mean 0 and variance a2.
Which of the following statements are true?
I. The mean of the random walk does not vary with time.
II. At time 50, the variance is 50a2.
III. Differences of the random walk form a stationary time series.
19. •41' You are given the following regression model, based on 22 observations.
y = )30 +isixi +02x2 + P3X3 134X4 ± /35 X5 ± e
The residual sum of squares for this model is 156.

If the variables x4 and x5 are removed, the error sum of squares is 310.
Calculate the F ratio to determine the significance of the variables x4 and x5.
(A) 3.9 (B) 4.9 (C) 5.9 (D) 6.9 (E) 7.9
20. 'It You are given the following time series:

20, 22, 21, 24, 23
The time series is fitted to an AR(1) process with yi = 20.325 + 0 .1y1-1,

Calculate the estimated variance of the residuals.
(A) 1.3 (B) 1.7 (C) 2.1 (D) 2.5 (E) 2.9
21. Determine which of the following algorithms is greedy.

I. Hierarchical clustering algorithm
II. Recursive binary splitting algorithm for decision trees
III. Forward subset selection algorithm
Exam SRM Study Manual Exam questions continue on the next page...
384 PRACTICE EXAM /
22. sis° Determine which of the following statements about boosting is/are true.
I. Selecting B too high can result in overfitting.
II. Selecting a low shrinkage parameter tends to lead to selecting a lower B.
III. If d =1, the model is an additive model.
(A) None (B) I and H only (C) I and HI only (D) II and III only
23. To validate a time series model based on 20 observations, the first 15 observations were used as a model
development subset and the remaining 5 observations were used as a validation subset. The actual and fitted values
for those 5 observations are
Yi Pt
16 7 10
17 9 12
18 12 14
19 18 16
20 22 18
Calculate the MSE.
(A) 7.4 (B) 8.4 (C) 9.5 (D) 10.5 (E) 11.5
24. In a hurdle model, the probability of overcoming the hurdle is 0.7. If the hurdle is overcome, the count
distribution is kg(j), where go]) is the probability function of a Poisson distribution with parameter A = 0.6.
Calculate the probability of 1.
(A) 0.23 (B) 0.31 (C) 0.39 (D) 0.45 (E) 0.51
25. 6-.1. For a generalized linear model, you are given

(i) The negative loglikelihood of the model is 74.88.
(ii) The deviance of the model is 8.70.
(iii) The maximized loglikelihood of the minimal model is —90.31.
Calculate the pseudo-R2 statistic.
(A) 0.64 (B) 0.68 (C) 0.71 (D) 0.74 (E) 0.78
26.
`16. The number of policies sold by an agent in a year, y, is modeled as a function of the number of years of
experience, x. The model is a Poisson regression with a log link. The fitted coefficient of x is pi = 0.06.
The expected number of policies sold after 2 years of experience is a and the expected number of policies sold
after 5 years of experience is b.
Calculate b /a.
(A) 1.18 (B) 1.19 (C) 1.20 (D) 1.21 (E) 1.22
Copyright 02022 ASM
PRACTICE EXAM 1 385
. .1 Which of the following statements are true?

I. Partial Least Squares is a supervised method of dimension reduction.
II. Partial Least Squares directions are linear combinations of the original variables.
III. Partial Least Squares can be used for feature selection.
28.
Disability income claims are modeled using linear regression. The model has two explanatory variables:
1.
Occupational class. This may be (1) professional with rare exposure to hazards, (2) professional with some
exposure to hazards, (3) light manual labor, (4) heavy manual labor.
2. Health. This may be (1) excellent, (2) good, (3) fair.
The model includes an intercept and all possible interactions.
Determine the number of interaction parameters pi in the model.
(A) 6 (B) 8 (C) 9 (D) 11 (E) 12
29. N? Consider the vector {5, —3,8, —2, 41.

Calculate the absolute difference between the 6 norm and norm of this vector.
(A) Less than 12
(13) At least 12, but less than 15
r- (C) At least 15, but less than 18
(E) At least 21
30. For a simple linear regression of y on x:

(ii) 2 = 32
(iii) The unbiased sample variance of x is 20.
(iv) x4 = 22
Calculate the leverage of x4.
(A) 0.21 (B) 0.23 (C) 0.25 (D) 0.27 (E) 0.29
31. You are given the time series
182, 138, 150, 192, 177
The series is smoothed using exponential smoothing with w = 0.8.

Calculate the sum of squared one-step prediction errors.
(A) 2042 (B) 2555 (C) 3038 (D) 3589 (E) 3966
Exam SRM Study Manual Exam questions continue on the next page .. .
Copyright Q2022 ASIs.4
386 PRACTICE EXAM I
32. Determine which of the following statements about classification trees is/are true.
I. Classification error is not sensitive enough for growing trees.
II. Classification error is not sensitive enough for pruning trees.
HI. The predicted values of two terminal nodes coming out of a split are different.
33. 0 Hierarchical clustering is performed on 7 observations, resulting in the following dendrogram:

I. Centroid linkage was used.
II. Observation 3 is closer to observation 4 than to observation 7.
III. Observations 3 and 4 are closer to each other than observations 1 and 2.
34. 'I? For a simple linear regression of the form y =Po +13pci Ei, you are given
(i) 9 =100
(ii) y = 81,004
(iii) Vi` 9= 80,525
Calculate R2.
(A) 0.46 (B) 0.48 (C) 0.50 (D) 0.52 (E) 0.54
Exam SRM Study Manual Exam questions continue on the next page
Copyright 02022 ASM
PRACTICE EXAM 1 387
(----.. 4.411 Determine which of the following are results of overfitting models.
I. The residual standard error may increase.
II. The model may be more difficult to interpret.
III. The variables may be collinear.
Solutions to the above questions begin on page 443.

Copyright 02022 ASM
Practice Exam 2
1.
Nr With regard to statistical learning, determine which of the following is/are parametric approaches.
I. Ridge regression
II. K-nearest neighbors regression
III. Principal components analysis
(A) I only (B) II only (C) III only (D) 1, II, and III
2. 141 For a linear regression model of the form
yi = Po +p2xi2+ p3xnxi2+ Ej
you are given

(ii) The total sum of squares is 8016.
(iii) R2 = 0.625.
Calculate the residual standard error.
l
f
(A) 6.8 (B) 6.9 (C) 7.0 (D) 7.1 (E) 7.2
3. •11? Determine which of the following factors are drawbacks to using causal models for time series.
I. Time series patterns may induce or mask relationships between variables.
II. Causal models cannot properly handle non-linear relationships.
III. Forecasting the variable of interest requires forecasting independent variables.
4. ay? For a simple linear regression

yi = go +13ixi Ei
you are given
(ii) The fitted value of f is 3.2465.
(iii) 4 = 23,720
(iv) xiyi = 72,559
(v) 1=31.5
Calculate the fitted value of Po.
(A) —7 (B) —3 (C) 0 (D) 3 (E) 7
Exam SRM Study Manual 389 Exam questions continue on the next page...
390 PRACTICE EXAM 2
5. 'Lf A medical research project is studying the probability of getting a certain type of cancer, based on genetic traits.
The probability is modeled using logistic regression. The number of traits is greater than the number of individuals
in the study, so it is necessary to select a subset of the traits.
Determine which of the following characteristics of statistical learning pertain to this study.
II. Supervised learning
III. Parametric
(A) I only (B) II only (C) III only • (D) I, H, and III
6. A survey is made of the importance of an automatic high beam system in a car. Importance levels are 1 (not
important), 2 (important), and 3 (very important).
A proportional cumulative odds model is used to model the responses. Explanatory variables are sex (male or
female) and age group (18-23, 24-40,> 40). The model is of the form
where i is the importance level and oi is the odds of i.

The estimated values of poi are 0.64 and 1.22 for i = 1 and 2 respectively.
The fitted probability that a male in age group 24-40 considers automatic high beam not important is 0.20.
Calculate the fitted probability that a male in age group 24-40 considers high beam very important.
(A) Less than 0.3
(E) At least 0.6
0.7
7. NI. You are given the following time series:

4 5 6 12 5 10 9 15
of 4.
Calculate the double-smoothed moving average at the last observation time, 432), using a running average length
(A) 8.5 (B) 8.6875 (C) 8.75 (D) 8.875 (E) 9.75
8.
For a linear regression based on 28 observations, there are 4 explanatory variables and an intercept. You are
given:
(i) The residual standard deviation is 12.4.
(ii) The leverage of the first observation is 0.04.
(iii) The first residual el = 6.2.
(A) 0.50 (B) 0.51 (C) 0.52 (D) 1.1 (E) 2.5
PRACTICE EXAM 2 391
NI° K-means clustering is performed on the following six observations:

(5,15) (6,12) (6,20) (9,12) (11,4) (11,16)
The observations are to be grouped into two clusters. Initially, the first three points are grouped into one cluster
and the last three points into the other cluster.
Calculate the initial value of the objective function that is minimized by the clustering algorithm.
(A) 27 (B) 111 (C) 221 (ID) 332 (E) 664
10. •-sf For a simple linear regression based on 18 observations,

(i) The fitted value of pi is 4.637.
(ii) The sample standard deviation of the independent variable is 6.280.
(iii) The residual standard deviation is 50.24.
The independent variable is tested for significance using the t ratio.
Determine which of the following statements is correct.
(A) Accept the variable at 1% significance.
(B) Reject the variable at 1% significance but not at 2% significance.
(C) Reject the variable at 2% significance but not at 5% significance.
(D) Reject the variable at 5% significance but not at 10% significance.
(E) Reject the variable at 10% significance.
64/ A logistic regression models the probability of a claim. The model includes the following explanatory variables:
Driver's age A categorical variable with 6 bands. Level 3 is the base level.
Area A categorical variable with 3 bands. Area A is the base level.
Vehicle value A continuous variable.
The following is the model fit:
Parameter Estimate(b)
Intercept —2.521
Driver's age
1 0.345
2 0.102
3 0.000
4 —0.050
5 —0.173
6 —0.124
Area
A 0.000
0.155
0.374
Vehicle value ($000's) 0.007
Calculate the odds of a claim by a driver in Area B, age band 4, driving a vehicle with value $30,000.
(A) 0.11 (B) 0.13 (C) 0.15 (D) 0.17 (E) 0.19
Exam SRM Study Manual Exam questions continue on the next page.. .
392 PRACTICE EXAM 2
12. 11/4: A classification tree is built based on 9 observations. There is a response and one predictor. The values of the
variables are:
X 2 5 6 8 12 16 19 25 30
Y No No Yes No Yes No Yes Yes Yes
Determine which of these splits is/are best using classification error as the criterion.
I. Between 6 and 8
II. Between 8 and 12
III. Between 12 and 16
(A) I only (B) H only (C) I and II only (D) I and III only (E) II and III only
13. 1141 You are using cost complexity pruning to prune a regression tree. One of the terminal nodes has the following
values for the response variable: {3,6, 8,9}. Another terminal node has the following values for the response
variable: {6,10, 12, 14}. These two terminal nodes are on branches from an intermediate node. We are considering
pruning these branches.
Determine the lowest value of a for which these branches are pruned.
(A) 12 (B) 18 (C) 22 (D) 28 (E) 32
14. u: Determine which of the following statements regarding subset selection is/are true:
I. At each iteration of forward subset selection, a variable is added to the model. The variable chosen is the one
that minimizes the test RSS based on cross-validation.
IL Forward subset selection may be used even if the number of variables is greater than the number of observations.
III. Adjusted R2 is not as well motivated in statistical theory as AlC and BIC are.
15. '1/4: For a set of 35 observations, the following two models are under consideration:
Model I
Yi = Po Pixii + 132X12+ 3x23+ P44 + p5Xi1Xi2 + Et
Model H
Yl = yo + y2x22 y3x5 +
R2 for Model I is 0.900. R2 for Model II is 0.860.

Calculate the F ratio to test the significance of the two additional parameters of Model I.
(A) 5.2 (B) 5.8 (C) 6.4 (D) 6.8 (E) 7.4
16. s: An AR(1) process is defined by

Yr = 10 + O.2y.i +
The variance of the error term is 4.
Calculate the variance of yi .

(A) 4.00 (B) 4.17 (C) 6.25 (D) 11.11 (E) 100.00
Exam SRM Shady Manual Exam questions continue on the next page...
PRACTICE EXAM 2 393
n a-dr The probability of a claim on a policy, it, is modeled with probit regression, using several predictors. The
probability of a claim given specific values of x/, x2, ,xk is 0.2.
The fitted value of gi is 0.25.
Calculate the probability of a claim if x1 is increased by 1 and the other variables are unchanged.
(A) 0.24 (B) 0.26 (C) 0.28 (D) 0.30 (E) 0.32
18. 'kir Determine which of the following statements is/are true.

I. For K-nearest neighbor regression, higher K leads to lower bias.
II. For K-means clustering, higher K leads to lower bias.
III. For regression trees, higher I TI leads to higher bias.
19.
Determine which of the following statements about boosting is/are true.
I. The number of terminal nodes in each tree is d, the number of splits parameter.
II. Each tree depends on the previous trees.
III. At each node, all available predictors are considered.
l
f
A time series follows the process
= 0.61.1_1 + Et
with Var(Et) = 6.
Determine the variance of terms in this series.
(A) 8.125 (B) 9.375 (C) 15.000
21.
"-.1. A Poisson regression model uses a log link. The systematic component is an intercept only; g(12)= Po.
Five observations are 0, 2, 0, 1, 0.
Calculate the fitted value of go using maximum likelihood.
(A) —0.5 (B) —0.1 (C) 0.1 (D) 0.6 (E) 1.8
22. You are given the following observations of a variable:
21, 30, 40, 51, 63, 76
Hierarchical clustering is performed using Euclidean distance.

Complete linkage is used.
The points are then split into two clusters.
Determine which of the following is one of the two clusters.
(A) (21) (B) (21,301 (C) (21,30,40) (D) (21,30,40,51) (E) {21,30,40,51,63}
Exam SAM Study Manual Exam questions continue on the next page. ..
Copyright Q2022 ASM
394 PRACTICE EXAM 2
23. '46 For a regression model of the form

Y = Po + Pixi + P2x2 + f33X3 + 134x4 + e
based on 15 observations:
(i) The error sum of squares is 123.

(ii) If variable x4 is removed, the error sum of squares is 177.
Which of the following statements is true?
(A) x4 is not significant at 10% significance.
(B) x4 is significant at 10% significance but not at 5% significance.
(C) x4 is significant at 5% significance but not at 2% significance.
(D) x4 is significant at 2% significance but not at 1% significance.
(E) x4 is significant at 1% significance.
24. 641 Linear regression models are fitted using subsets of 4 variables. The resulting values of RSS are:
Variables RSS Variables RSS Variables RSS Variables RSS
None 160 X4 72 X2, X3 63 Xi, X2, X4 55
Xi 66 Xi, X2 59 X2, X4 61 Xi, X3, X4 52
X2 67 Xi, X3 62 X3, X4 67 X21 X3, X4 50
X3 69 Xi, X4 60 Xi, X2, X3 57 Xi, X2, X3, X4 30
Determine the two-variable model selected by backward stepwise selection.

(A) Xi, X2 (B) Xi, X3 (C) X1 • X4 (D) X2, X3 (E) X2. X4
25. •••• Determine which of the following statements regarding hierarchical clustering is/are true.
I. Complete linkage is based on minimal intercluster dissimilarity.
II. Single linkage is based on maximal intercluster dissimilarity.
III. Single linkage may result in extended trailing clusters.
26. NI° Determine which of the following statements regarding principal components analysis is/are true.
I. Each principal component has n loadings, one for each data point.
II. Each principal component has p scores, one for each variable.
III. The sum of the squares of the loadings equals 1.
Exam SAM Study Manual Exam questions continue on the next page . . .
Copyright 02022 ASM
PRACTICE EXAM 2 395
A classification variable Y can have the value 0 or 1. It is modeled as a function of two variables, Xi and X2. You
are given the following observations:
X1 X2 Y
2 4 0
3 2 0
3 5 1
4 1 0
4 4 1
4 6 1
5 3 0
5 6 1
6 5 0
Use K-nearest neighbors with K = 3 to predict values.

Determine which of the following statements regarding the predicted value of Y is/are true.
I. = 4, X2 = 4: Y = 1
II. = 4, X2 = 2: Y = 1
III. X1 = 3, X2 = 6: Y = 0
(E) The correct answer is not given by (A) , (B) ,(C) , or (D) .
l •-•11 A distribution is a member of the exponential family with density

f
f (y) = exp
(y 61 — b (0) + S(y , cP))
with b(e) = e6 and op = 1.
A generalized linear model is created with this distribution as the response variable. Summary statistics of the
results are:
E y, = 1035 gi = 1030
ln yi = 92 >lngi=90
ln = 2047
E yi in = 2015
Calculate the deviance.
(A) 50 (B) 54 (C) 58 (D) 62 (E) 68
Copyright 02022 ASM
396 PRACTICE EXAM 2
29. A generalized linear model for amount of sales by agent, based on 65 observations, has the following explanatory
variables: j
REGION: North, South, East, West
AGE OF AGENT: Under 30, 30-39, 40-49, 50-59, 60 and over
YEARS OF EXPERIENCE OF AGENT: 1-5,6-10, over 10
All variables are categorical.
The model is run with and without the REGION variable. The BIC is 123.52 with the REGION variable and
121.08 without the REGION variable.
Determine which of the following statements is true based on the likelihood ratio statistic.
(A) Accept REGION at 1% significance
(B) Accept REGION at 25% significance but not at 1% significance
(C) Accept REGION at 5% significance but not at 2.5% significance
(D) Accept REGION at 10% significance but not at 5% significance
(E) Reject REGION at 10% significance
30. sk: Determine which of the following statements regarding collinearity are true.
I. Collinearity causes the residual standard error to increase.
II. Collinearity causes t statistics of the collinear variables to be low.
III. Collinearity is indicated by a low VIE
(A) I only (B) II only (C) HI only (D) I, II, and III
31. s: For a regression on 50 observations, you are given:

(i) The upper 3 x 3 submatrix of the hat matrix:
0.525 0.075 0.035
(0.293
0.075 0.425
0.425
0.293
0.091

(iii) yi = 80 and 91 = 81.
(A) —0.5 (B) —0.4 (C) —0.3 (D) —0.2 (E) —0.1
32. Determine which of the following purposes a scree plot can serve.
I. Visualizing the directions of the principal components.
II. Understanding how the data relates to the principal components.
III. Deciding how many principal components to use.
(A) I only (B) H only (C) III only (D) I, II, and III
Exam SRM Study Manual Exam questions continue on the next page.. .
PRACTICE EXAM 2 397
sir In an ARCH model

(i) The long term volatility parameter is 10.
(ii) yi = 0.4, y2 = 0.2, y3 = 0.1
(iii) Residuals are e5 = 1.5, e6 ••=. —0.5, e7 = 0.8.
Calculate the conditional variance of the error term at time 8 given the historical information.
(A) 10.37 (13) 10.43 (C) 10.47 (D) 10.53 (E) 10.57
34. 6-411 A generalized linear model for claim counts uses a Poisson distribution. You are given:
(i) The link function is log.
(ii) The output of the model is
Parameter
Intercept —0.73
Gender—Male 0.07
Marital Status—Single 0.03
Interaction of Gender and Marital Status 0.02
Area
0.12
0.05
—0.04
Calculate the probability of 2 or more claims for a married male in Area D.

(A) 0.09 (B) 0.11 (C) 0.13 (D) 0.15 (E) 0.17
35. Determine which of the following statements are true regarding residuals of non-linear models.
I. Anscombe residuals are based on transforming the response variable to make it approximately normal.
II. Pearson residuals are based on defining a residual function such that the residuals are approximately normal.
III. Deviance residuals are close to Anscombe residuals in many cases.
(A) I only (B) LI only (C) III only (D) I, II, and III

Practice Exam 3
1. NI° A regression tree is built based on two predictors using cost complexity pruning. Cross-validation is used to
select the best value of a.
For a = 5, the following tree is built based on the training data:
Xi<1001
X2 < 40 < 150
X. <60
20 55 72
10.4 12.0
The test data is
70 45 4
110 50 10
160 70 11
180 40 13
Calculate the mean squared prediction error on the test data.

(A) 4.5 (B) 5.5 (C) 6.5 (D) 7.5 (E) 8.5
2. *-41 Determine which of the following statements regarding principal components analysis is/are true.
I. The first two principal components span the plane that is closest to the data.
II. The coordinates of the projected values of the data are the principal component scores.
III. Principal components are independent of the scale of the variables.
(D) I, II, and III
Copyright CP2022 ASM
400 PRACTICE EXAM 3
3. 1.41 A Poisson regression model is used to model claim counts on auto insurance. Explanatory variables are the
following categorical variables:
USE (business, pleasure, farm)

AGE GROUP (under 25, 25 14, 45-64, 65 and over)
SEX (male, female)
MARITAL STATUS (single, married)
The deviances of two regressions are as follows:

Model Parameters Deviance
Model 1 USE + SEX + MARITAL STATUS 4865.0

Model 2 USE + AGE GROUP + SEX + MARITAL STATUS 4856.6
The likelihood ratio test determines which model is preferred.

Which of the following statements is true regarding Model 1?
(A) Reject at 0.5% significance.
(B) Reject at 1% significance but not at 0.5% significance.
(C) Reject at 2.5% significance but not at 1% significance.
(D) Reject at 5% significance but not at 2.5% significance.
(E) Do not reject at 5% significance.
4. Nr Which of the following statements are true regarding cross-validation?

I. 5-fold cross-validation has higher variance than 10-fold cross-validation.
II. 5-fold cross-validation is more computationally efficient than 10-fold cross-validation.
III. LOOCV when used to validate polynomial regression is computationally efficient.
(A) I only (B) H only (C) III only (D) I, II, and HI
5. For a linear regression model on 20 observations, you are given:

• The unbiased sample variance of the response variable is 88.
• The deviance is 44.
• There are 3 predictor variables and an intercept.

Calculate adjusted R2.
(A) 0.93 (B) 0.94 (C) 0.95 (D) 0.96 (E) 0.97
6. '"41/ Determine which of the following statements are true with regard to AR(1) processes.
I. A random walk is a special case of an AR(1) process.
II. White noise is a special case of an AR(1) process.
III. For stationary AR(1) processes, autocorrelation is a decreasing function of lag.
IV. For stationary AR(1) processes, the variance of the terms equals the variance of the error.
(A) 1,11, and III only (B) I, II, and IV only (C) I, III, and IV only (D) II, IH, and IV only
(E) I, II, III, and IV
Exam SRM Study Manual Exam questions continue on the next page ...
PRACTICE EXAM 3 401
For an inverse Gaussian regression with Var(Y) = E[Y]3/5, you are given that yi = 68.5 and 9, = 57.3.
Calculate the Pearson chi-square residual.
(A) Less than 0.05
(E) At least 0.08
0.09
8. Claim severity is modeled with a normal linear model. The current model does not have AGE as an explanatory
variable. You are considering adding AGE as an explanatory variable.
The model is run both with and without AGE You are given the following excerpt from an ANOVA table
compiled from the two runs:
Source of Degrees of Sum of

Variation Freedom Squares
Model without AGE 12 1111.06
Improvement from adding AGE 3 92.55
Residual 302.88
Total 38
Determine the F ratio for testing the significance of AGE.

(A) 2.343 (B) 2.382 (C) 2.475 (D) 2.524 (E) 2.568
9. You are given the following observations of a variable:

21, 30, 40, 51, 63, 76
Hierarchical clustering is performed using Euclidean distance.

Single linkage is used.
The points are then split into two clusters.
Determine which of the following is one of the two clusters.
(A) (21) (B) (21,301 (C) (21,30,401 (D) (21,30,40,511 (E) (21,30,40,51,63)
10. •41 With regard to dimension reduction methods, determine which of the following statements are applicable to
principal component regression but not to partial least squares.
I. The new predictors are linear combinations of the original predictors.
IT. The method of creating predictors is unsupervised.
III. The original predictors should be standardized.
Exam SRM Study Manual Exam questions continue on The next page. ..
Copyright 02022 ASM
402 PRACTICE EXAM 3
11. %ID Determine which of the following statements regarding hierarchical clustering is/are true.
I. The height of the cut serves the same purpose as K in K-means clustering.
II. The centroid linkage may lead to inversions.
There are 2" reorderings of a dendrogram, where n is the number of leaves.
12. •-• An AR(1) process is fitted to a set of 10 observations. To test this model, the first 6 observations are used as a
model development set and the last 4 observations are the validation set. The resulting model is yt = 6 — Ei.
The last 5 observations are 7, 6, 4, 5, 5.
Calculate the MAE statistic.
(A) 0.2 (B) 0.4 (C) 0.6 (D) 0.8 (E) 1.0
13.
6.-1? A classification response variable has three possible values: A, B, and C.
A split of a node with 100 observations in a classification tree resulted in the following two groups:
Group Number of A Number of B Number of C

40 10 10
II 5 25 10
Calculate the Gini index for this split.

(A) 0.4625 (B) 0.4875 (C) 0.5125 (D) 0.5375 (E) 0.5500
14.
*-4? A count variable is modeled as a function of two predictors using Poisson regression with a log link. The model
has an intercept. There are 5 observations. The results of the model are
Yi
1 2 1.2
2 2 2.2
3 1 1.8
4 3 4.0
5 2 0.8
It is believed that variance is greater than the mean, so an overdispersion parameter q5 is specified.
Calculate the estimate of cp.
(A) 1.0 (B) 1.2 (C) 1.3 (D) 1.5 (E) 1.6
15.
"-: Determine which of the following are advantages of regression trees over linear models.
I. Easier to interpret
II. More robust
III. Easier to handle qualitative predictors

IV. Predictions are more accurate
(A) I and II only (B) I and III only (C) I and IV only (D) II and III only (E) II and IV only
Copyright 02022 ASM
PRACTICE EXAM 3 403
n.. 41 For a linear regression with 2 explanatory variables and an intercept, you are given:
(ii) The sample standard deviation of x2 is 2.204.
(iii) If x2 is regressed on x1, the standard error of the regression is 1.284.
Determine the VIF of x2.
(A) 2.5 (B) 3.0 (C) 3.5 (D) 4.0 (E) 4.5
17. House sales is the continuous response variable of a normal model that is based on 11 observations. Explanatory
variables are:
• Interest rates
• Unemployment rate
Model I uses only interest rates as an explanatory variable, while Model II uses both interest rates and unem-
ployment rates. Both models have an intercept. The results of the models are:
Model I Model IT
Source of Variation Sum of Squares Source of Variation Sum of Squares
Regression 14,429 Regression 17,347
Error 12,204 Error 9,286
The I test is used to evaluate the significance of unemployment rate.

Determine the significance level at which unemployment rate is included in the model.
(A) Include at 1%
(B) Include at 2% but not at 1%
(C) Include at 5% but not at 2%

(D) Include at 10% but not at 5%
(E) Do not include at 10%
18. An insurance company is studying its field force, the agents that sell its products. There are many factors that
may characterize agents: age, sex, number of years at the company, number of years of experience as an agent,
annual production, region, etc. The insurance company would like to summarize all of these characteristics into a
small number of variables.
Determine which of the following characteristics of statistical learning that this study has.
II. Supervised learning
III. Parametric
Copyright Z2022 ASM
404 PRACTICE EXAM 3
19. A linear regression based on 52 observations has 5 explanatory variables plus an intercept You are given:
(i) The residual standard error is 8.25.
(ii) The studentized first residual is 0.895.
(iii) The standardized first residual is 0.823.
Calculate the residual standard error if the first observation is removed.
(A) 7.59 (B) 7.91 (C) 8.18 (D) 8.60 (E) 8.97
20. A classification tree is built based on 9 observations. There is a response and one predictor. The values of the
variables are:
X 2 5 6 8 12 16 19 25 30
Y No No Yes No Yes No Yes Yes Yes
Determine which of these splits is/are best using the Gini index as the criterion.
I. Between 6 and 8
II. Between 8 and 12
III. Between 12 and 16
(A) I only (B) II only (C) III only (D) I and II only (E) II and III only
21. rFor cross-validation, determine which of the following are advantages of k-fold cross-validation with k < n
over LOOCV.
I. k-fold cross-validation does not overestimate the test error rate as much as LOOCV.
II. k-fold cross-validation has lower bias than LOOCV.
III. Performing k-fold cross-validation multiple times produces the same results.
(A) I only (B) IT only (C) III only (D) I, II, and III
22. •-ir You are given the following 3 observations of 2 variables:

(1,0.4) (0,0.4) (-1, —0.8)
Principal component analysis results in a principal component with loadings 0.836 on the first variable and 0.549 on
the second variable.
Calculate the proportion of variance explained by this principal component.

(A) 0.88 (B) 0.90 (C) 0.92 (D) 0.94 (E) 0.96
23.
%/6 Determine which of the following are drawbacks of regression models for time series that fit trends in time.
I. Too much weight is placed on early observations.
II. Seasonal patterns cannot be incorporated.
III. Other sources of information are not considered.
Exam SRM Study Manual Exam questions continue on the next page. ..
Copyright 02022 ASM
PRACTICE EXAM 3 405
n.. For a linear regression model based on 4 observations, you are given:
(i) There is one explanatory variable and an intercept.
(ii) The residuals are -2.3, 0.1, -2.1, and 4.3.
(iii) The hat matrix is
0.3 0.4 0.1 0.2
(
0.4 0.7 -0.2 0.1
0.1 0.2 0.7 0.4
0.2 0.1 0.4 0.3
(iv) 2 = 14.1
Calculate Cook's distance for the third observation.
(A) 1.2 (13) 2.3 (C) 2.7 (D) 3.2 (E) 3.6
23. ikor A Poisson regression model is of the form

g0.0 =PO ± P1 Xil p2xi2
and uses the log link.

The fitted values of f3 are (0.15,0.04,0.12).
Determine the fitted mean when xi = 1 and x2 = 0.
(A) 1.0 (B) 1.1 (C) 1.2 (D) 1.3 (E) 1.4
Determine which of the following methods may be used in a classification setting.

I. Logistic regression
II. K-nearest neighbors
III. Boosting
27. •••• A classification variable assuming the values of 1 and 2 is modeled as a function of X using K-nearest neighbors
with K = 3. The training data is
X 10 19 30 39 43 45
Y 1 1 2 2 1 2
The test data is
X 15 25 37 50
Y 1 2 2 2
Calculate the error rate on the test data.
(A) 0 (B) 0.25 (C) 0.50 (D) 0.75 (E) 1
Copyright 1P2022 ASM
406 PRACTICE EXAM 3
28. 4411 A binomial generalized linear model for the probability of a claim has an intercept and the following variables:
(i) Deductible: can be 250, 500, or 1000.
(ii) Gender: can be male or female.
(iii) Age group: there are 5 age groups.
There are 85 observations.
The AIC for the best fit is 261.53.
Calculate the BIC.
(A) 277 (B) 279 (C) 281 (D) 283 (E) 285
29. You are given the time series
Y2 = 105, y3 = 109, y4 = 122, y5 = 104
Exponential smoothing with smoothing parameter w = 0.4 is performed on this series. The resulting value of g5
is 108.78.
Determine gi.
(A) 102 (B) 104 (C) 106 (D) 108 (E) 110
30. Determine which of the following statements regarding K-means clustering is/are true.
I. Clusters are selected to maximize the sum of the distances between points of different clusters.
II. A simple algorithm finds a local optimum.
III. The number of clusters must be specified in advance.
(E) The correct answer is not given by (A) , (B) , (C) , or (0).
31. P A generalized linear model uses a gamma distribution. The model is based on 8 observations. The results of
the model are
1 5 3
2 7 15
3 8 10
4 11 5
5 13 18
6 15 10
7 17 20
8 20 18
The dispersion parameter is cp = 1.4.

Calculate the Pearson chi-square statistic for this model.
(A) 1.84 (B) 3.60 (C) 12.69 (D) 17.76 (E) 14.87
Exam SRM Study Manual Exam questions continue on the next page
PRACTICE EXAM 3 407
, '141 K-means clustering is performed on the following six observations:

(5,15) (6,11) (7,10) (5,18) (6,14) (7,7)
and the last three points into the other cluster.
Determine the number of points that move between clusters in the first iteration of the algorithm.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
33. You are given:

(i) A response variable Y is in the exponential family,
—b(0)
1(y) = exp
(Y19 +
S(y, 0))
(ii) p = E[Y] =
(iii) .4) = 1
Determine Var(Y).
(A) 1/p (B) 1 (C) p (D) p2 (E) p3

34.
In a generalized linear model, Yi follows a Tweedie distribution. You are given
E[Yd = 2 Var(Yi) = 7.0711
E[Y2] = 3 Var(Y2) = 12.9904
E[Yfl = 2.5
Determine Var(Y3).
(A) 9.48 (B) 9.58 (C) 9.68 (D) 9.78 (E) 9.88
408 PRACTICE EXAM 3
35.
N? For a set of data with 40 observations, 2 predictors (Xi and X2), and one response (Y), the residual sum of
squares has been calculated for several different estimates of a linear model with no intercept. Only integer values
from 1 to 5 were considered for estimates of pi and P2.
standardization:
1 2 3 4 5
1 2,855.0 870.3 464.4 357.2 548.6
2 1,059.1 488.4 216.3 242.8 567.9
3 657.0 220.0 81.6 241.9 700.8
4 368.4 65.1 60.5 354.5 947.1
5 193.2 23.7 152.8 580.6 1,307.0
Let:
= Estimate of pi using ridge regression with budget parameter s =25

p2 = Estimate of p2 using a ridge regression with budget parameter s = 25
Calculate the ratio gig.
(A) 0.75 (B) 1.0 (C) 1.33 (D) 2.0 (E) 2.5

Practice Exam 4
1. N?I With regard to statistical learning, determine which of the following statements is true.
I. Increasing flexibility results in decreasing training MSE.
II. Increasing flexibility results in decreasing test bias.
III. Increasing flexibility results in decreasing test variance.
2. In a generalized linear model, the response distribution is Poisson. You are given
(i) y =2
(ii) 922 2.5
Calculate the deviance residual for observation 22.
(A) —0.39 (B) —0.36 (C) —0.33 (D) —0.30 (E) —0.27
3. •: The first ten observations of a white noise series, yi,. ,yio, are
21 24 22 18 23 22 16 18 21 22
Calculate the upper bound of a 95% forecast interval for y12.

(A) 25.6 (B) 25.8 (C) 26.4 (D) 26.6 (E) 26.7
4. 1`)? For a simple linear regression based on 18 observations of the form Y = Pa + /31X + e, the width of a 95%
prediction interval for Y when X = lc is 30.
Let s, be the unbiased standard deviation of X.
Calculate the width of a 95% prediction interval for Y when X = 2 + sx.
(A) 30.4 (B) 30.8 (C) 31.3 (D) 31.7 (E) 32.1
Copyright Q202.2 ASM
410 PRACTICE EXAM 4
5. el For a logistic regression model for a binary response, actual and fitted values are as follows:
yi ñi
0 0.32
0 0.47
0 0.02
0 0.09
0 0.15
1 0.58
1 0.82
1 0.73
1 0.64
1 0.98
Calculate the Pearson chi-square statistic.

(A) 1.7 (13) 2.2 (C) 2.6 (D) 3.1 (E) 3.6
6. .4' A linear regression model with 5 variables and an intercept is fitted to 105 observations. You are given
(i) The 21s standardized residual is 0.934.
(ii) The leverage of the 21st observation is 0.13.
Calculate Cook's distance for the 215t observation.
(A) Less than 0.03

(E) At least 0.06
7. '41/ For a set of 10 observations in a regression setting, a response variable's values are 8, 7,4, 6, 10, 14, 16, 13, 17, 15.
The response variable is modeled using a regression tree with boosting. The boosting parameters are d = 1,
= 0.1, and B = 100.
At the first iteration, the tree is split into two groups: the first 5 observations and the second 5 observations.
Calculate the revised value of the first observation that is used for the second tree.
(A) 6.9 (8) 7.0 (C) 7.1 (D) 7.2 (E) 7.3
8. 1141 Hierarchical cluster analysis with centroid linkage and Euclidean distance dissimilarity is performed. The
following four clusters result:
1(10,10), (18,10)) 1(18,16), (20,20)) 1(8,25), (8,30)) ((15,24))
Determine the two clusters that get fused at the next iteration of the algorithm.
(A) 1(10,10),(18,10)1 and {(18,16),(20,20)I
(B) 1(10,10),(18,10)} and 1(8,25),(8,30))
(C) R10,10),(18,10)) and (15,24)
(D) 1(18,16),(20,20)) and (15,24)
(E) 1(8,25),(8,30)) and 1(15,24))
Exam SAM Study Manual Exam questions continue on the next page . . .
Copyright 02022 ASM
PRACTICE EXAM 4 411
n. A set of 100 observations of 4 variables is analyzed using principal components analysis. The loadings of the
ifrst three variables on the first principal component are 0.68, 0.65, and 0.32. The fourth loading is negative.
Calculate the first principal component score of the observation (2, —1,3,5).
(A) 0.3 (B) 0.7 (C) 1.1 (D) 1.3 (E) 1.7
10. 'kir A linear regression of the form yi = Po + flux11 +i32xi2 + P3Xi3Ei is based on 35 observations.
You are given:
The residual standard deviation is 8.25.
(ii) se(bi) = 0.86.
(Hi) The sample standard deviation of x1 is 2.70.
Calculate the VIF of xi.
(A) 1.2 (B) 1.6 (C) 2.0 (D) 2.3 (E) 2.7
11. K-means clustering is performed on the following five observations:
(10,0) (8,3) (5,6) (12,4) (11,9)
and the last two points into the other cluster.
Calculate the initial value of the objective function that is minimized by the clustering algorithm.
(A) 87 (B) 113 (C) 142 (D) 175 (E) 236
l
f
Copyright (g2022 ASM
412 PRACTICE EXAM 4
12. 111 You are given the following biplot for a principal components analysis of sales of life insurance, health insurance,
dental insurance, and disability insurance by agents.
—0.5 0 0.5
—3 —2 —1 0 1 2 3
You are given the following possible inferences from the biplot:
I. Life, health, and dental insurance sales are correlated.
II. Bob did not sell a lot of life insurance.
III. Sue sold a lot of dental insurance.
Determine which of these inferences are correct.
(A) I only
(B) II only
(C) III only
(D) I, II, and III
PRACTICE EXAM 4 413
NI, In a multiple regression model of the form
y= + P2X2 P3X3 p4x4+ e

the value of 133 is estimated as 2.6, based on 10 data points. You are testing the hypothesis Ho: 133 = 2 against Hl;
P3 > 2. The standard error of b3 is 0.2.
Which of the following statements is true?
(A) Reject Ho at 0.5% significance.
(B) Reject Ho at 1% significance but not at 0.5% significance.
(D) Reject Ho at 5% significance but not at 2.5% significance.
14.
'41/1 The relationship between type of claim on an auto insurance policy and type of vehicle is modeled using a
cumulative proportional odds model.
Type of claim is an ordinal variable with the following categories:
1. Property damage only
2. Bodily injury, no fatality
3. Fatality
Type of car is a categorical variable with values coupe, sedan, SUV, and van.
Based on this model:
l
f (i) The probability of claim type 1 for a sedan is 0.21.
(ii) The probability of claim type 2 for a sedan is 0.06.
(iii) The probability of claim type 1 for an SUV is 0.28.
Determine the probability of claim type 2 for an SUV.
(A) 0.071 (B) 0.074 (C) 0.077 (D) 0.080 (E) 0.083
15. s: A generalized linear model uses a complementary log-log link. The form of the model is
g(n) = Po + Pixi
The estimated values are bo .= 0.6, bi = 0.4.
Calculate the estimated mean when xi = 0.8.
(A) 0.92 (B) 0.93 (C) 0.94 (D) 0.95 (E) 0.96
16. "-t? Y is a Tweedie distribution, and Var(Y) =

Determine the canonical link function for Y.
(A) =
(B) g ( p ) =
(C) g(p) =ph3 (D) = Al2/5 (E) g(p)= 111/2
17. stt° At a node in a classification tree, there are 30 observations of "Yes" and 10 observations of "No". This node is
split into two groups, one with 20 observations of "Yes" and 4 observations of "No" and the other group with the
remainder of the observations.
Calculate the reduction in the overall Gird index from this split.
(A) 0.02 (B) 0.04 (C) 0.06 (D) 0.08 (E) 0.10
Copyright 02022 ASM
414 PRACTICE EXAM 4
18. Ni° You are given the time series {2,4,7,9,6,4,8,16).

Calculate the lag 2 autocorrelation statistic.
(A) —0.42 (B) —0.36 (C) —0.31 (D) —0.26 (E) —0.20
19.
sli. A generalized linear model for a count variable is based on 8 observations. The response distribution is Poisson.
The observed values are
0 0 0 0 1 1 2 4
The identity link is selected.

Calculate /0, the maximized loglikelihood of the minimal model.
(A) —11.9 (B) —11.0 (C) —10.1 (D) —9.4 (E) —8.7
Exam SRM Study Manual Exam questions continue on the next page . ..
PRACTICE EXAM 4 415
i- 4. You are given the following 11 observations:

xi x2 y
10 4 15
12 2 14
12 10 19
12 16 24
15 6 8
15 17 30
17 4 27
17 13 15
19 15 30
20 3 18
20 10 24
A decision tree splits the region up as follows:

-X2
30
12
00 13 16 25x1
Calculate the predicted value of y for xi = 18, x2 = 11.

(A) 19 (B) 20 (C) 21 (D) 23 (E) 24
21. •111 Determine which of the following statements regarding dendrograms is/are true.
I. The number of clusters is determined by the height of the split.
II. The closeness of observations is determined by their horizontal distance.
III. Observations that fuse at the bottom of the tree are similar.
(A) None (B) 1 and II only (C) I and III only (D) II and III only
Copyright 02022 ASM
416 PRACTICE EXAM 4
L
22.
A real-valued variable Y is modeled as a function of two variables, X1 and X2. You are given the following
observations:
X1 X2
2 4 23
3 2 32
3 5 36
4 1 27
4 4 39
4 6 44
5 3 28
5 6 50
6 5 41
Use K-nearest neighbors regression with K 4 to predict values.

Calculate the predicted value of Y at (4,2).
(A) 29.5 (B) 31.5 (C) 33.5 (D) 35.5 (E) 37.5
23. 6411 The following two models were fit to 20 observations:

• Model 1: Y = 130 + + /32x, + E
• Model 2: Y = 130 + 132X2 + P3X3 + /34X4 + E
You are given:
(i) The F statistic for testing /33 = 134 = 0 is 1.875.
(ii) R2 for Model 1 is 0.589.
Calculate R2 for Model 2.
(A) 0.63 (B) 0.64 (C) 0.65 (D) 0.66 (E) 0.67
PRACTICE EXAM 4 417
. "I For a generalized linear model for the probability of a claim:

(i) The Binomial distribution is selected.
(ii) The complementary log-log link is selected.
(iii) The model output is
Parameter
Intercept —2.05
Gender—Male 0.40
Age group
Under 25 1.55
Over 65 0.25
Vehicle body
Coupe —0.07
SUV 0.32
Calculate the odds of a claim from a 34-year old male driving an SUV.
(A) 0.25 (B) 0.30 (C) 0.35 (D) 0.40 (E) 0.45
25. For a generalized linear model, which of the following would make it more likely that the model is accepted?
I. Higher AIC.
II. Higher BIC.
III. Higher deviance.
IV. Higher Pearson chi-square statistic.
(A) None (B) I only (C) II only (D) III only (E) IV only
26. .11 Determine which of the following statements regarding K-means clustering is/are true.
I. K-means clustering is robust to perturbations in the data.
II. At each iteration of the algorithm, Ki7 distances are calculated, where n is the number of observations.
III. The algorithm may be used for distance measures such as correlation.
Exam SAM Study Manual Exam questions continue on the next page...
Copyright 02022 ASM
418 PRACTICE EXAM 4
27. •Iir For a linear regression of the form

Yi = Pixi Li
you are given
(ii) The fitted value of pi is 1.674.
(iii) The residual sum of squares is 25.882.
(iv) E(xi — 1)2 = 30.12.
Calculate the upper bound of a 95% confidence interval for Pr.
(A) 2.04 (B) 2.06 (C) 2.08 (D) 2.10 (E) 2.12
28. A time series with 15 terms has been double exponentially smoothed. You are given
1. The smoothing parameter w = 0.3.
2. S'15 = 72.
(2)
3.
s15 = 56'
Calculate the forecast of yi8.
(A) 136 (B) 152 (C) 168 (D) 184 (E) 200
29. •-ir A zero-inflated model consists of a Poisson distribution with mean A and a 0 component. The Poisson component
has a weight of 0.8.
Determine which of the following statements is true for this model.
I. The probability of 0 is 0.2.
II. The mean of the distribution is 0.8A.
III. The probability of 2 is A/2 times the probability of 1.
(A) I and H only (B) I and III only (C) II and III only (D) I, II, and III
30. NI' Linear regression models are fitted based on 15 observations. Four predictors are considered. Subset selection
is used to select the most significant variables. The best models for each number of predictors are
Number of predictors Lowest RSS

0 82.4
1 41.3
2 37.5
3 33.8
4 30.3
Determine the number of predictors in the model selected using adjusted R2.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4
PRACTICE EXAM 4 419
Determine which of the following statements about random forests is/are true.
I. Overfitting does not result no matter how high B is.
II. A fixed number of predictors is considered at every split.
III. Random forests reduce variance.
32. *-11 The random variable X has four possible values: A, B, C, and D. The random variable Y has three possible
values: 1, 2, and 3. The probabilities of these values are as follows:
Pr(Y=yIX= x)
x Pr(X = x) 1 2 3
A 0.40 0.45 0.25 0.30
B 0.30 0.35 0.40 0.25
C 0.20 0.30 0.35 0.35
D 0.10 0.20 0.50 0.30
Calculate the Bayes error rate for a prediction of Y based on X.

(A) 0.50 (B) 0.55 (C) 0.56 (D) 0.57 (E) 0.58
33. 41 The time series yi is an AR(1) process of the form

= 2 - 0.7y +
yT = 4
Calculate the 4-step ahead forecast, 9T+4.
(A) 1.18 (B) 1.40 (C) 1.62 (D) 1.85 (E) 2.02
420 PRACTICE EXAM 4
34. •-ir For a generalized linear model for claim frequency, you are given:
Response variable: Claim frequency
Response distribution: Negative binomial, overdispersion parameter = 1.3
Link: log
Parameter df b
Intercept 1 0.205
Territory 2
A 0.000
0.132
0.198
Number of Claims,
Previous Year 2
0 0.000
1 0.405
2+ 1.101
Calculate the variance of the number of claims by an insured in Territory B who submitted one claim in the
previous year.
(A) 2.1 (B) 2.3 (C) 2.5 (D) 2.7 (E) 2.9
35. 41 Determine which of the following statements is/are true.

I. Underfitting a model results in biased estimates.
II. Overfitting a model results in biased estimates.
III. Severe censoring of data results in biased estimates.

Copyright 02022 ASM
Practice Exam 5
1. s: Determine which of the following increases the flexibility of a model.

I. Increasing the number of predictors in a linear regression model.
II. Increasing the tuning parameter of a lasso.
III. Increasing the number of bootstrap samples used for bagging.
2.
•41 An actuarial department is estimating the probability that students will end up in one of the following areas:
1. Pricing
2. Financial Reporting
3. Investments
A nominal logistic model is constructed. Explanatory variables are major in college and score on Exam SRM.
Pricing is the reference level. The estimated parameters are:
Variable Estimated b for level 2 Estimated b for level 3
Intercept -1.5 -1.1
College Major
Math 1.2 0.8
Economics 0.2 1.8
Other 0.7 0.4
Score on Exam SRM
Under 6 0.4 -0.4

6,7 0.9 0.2
Over 7 0.5 0.9
Calculate the probability that a math major who scored 8 on Exam SRM will end up in Investments.
(A) 0.40 (B) 0.45 (C) 0.50 (D) 0.55 (E) 0.60
3. '11 You are given the first ten observations of a random walk, • , Yio:
15 18 23 25 24 28 31 32 37 40
Calculate the width of an approximate 95% prediction interval for y13.

(A) 7.7 (B) 9.9 (C) 11.1 (D) 13.3 (E) 15.5
Exam SRM Study Manual 421 Exam questions continue on tire next page .. .
Copyright 02022 ASM
422 PRACTICE EXAM 5
4. 11: The grade on Exam SRM is modeled as an ordinal random variable with four categories: Fail, 6, 7, 8+. The
explanatory variable is hours of study (x) and the proportional cumulative odds model is used. The fitted parameters
are
Variable Estimated parameter

Intercept
Fail 1.2
6 1.6
7 1.8
Hours of study —0.01
Calculate the probability of 7 or higher with 150 hours of study.

(A) 0.28 (B) 0.33 (C) 0.38 (D) 0.43 (E) 0.48
5. The following model relates Y to X:

Yj = exp(Po + Ei)
For 10 observations, you are given:
(i) EX= 13.5
(ii) 1nYj =9.681
(iii) X = 19.57
(iv) Z(ln Yi)2 =9.690

(v) Xi In YI = 13.697
Calculate the prediction of Y when X = 2.

(A) 3.2 (B) 3.6 (C) 4.0 (D) 4.4 (E) 4.8
6. *-1? Determine which of the following is/are differences between K-means clustering and hierarchical clustering.
I. One method forces every observation into a cluster; the other method allows for outliers.
One method uses of within-cluster similarity; the other method uses between-cluster dissimilarity.
One method requires an initial assignment of clusters; the other method does not.
7.
%I° The variable Y is modeled as a function of X1 and X2 using K-nearest neighbors regression with K = 2. The
model is based on the following three pairs of observations:
(5,9) (6,15) (8,12)
At the point (5,9), Y = 102.

Based on this model:
(i) When Xi = 4 and X2 = 12, then Y = 91

(ii) When Xi = 5 and X2 = 12, then Y = 98
Determine the value of Y when Xi. = 7 and X2 = 12.
(A) 87.0 (B) 91.0 (C) 95.0 (D) 99.5 (E) 104.5
PRACTICE EXAM 5 423
n.. For a binomial generalized linear model based on 45 observations, the explanatory variables are
Time in system (continuous)
Time in system squared
Sex (male or female)
Department (4 levels)
Interaction of sex and department
he
T model has an intercept.
The Ioglikelihood of the minimal model is —182 and the max-scaled R2 is 0.361774.
Calculate the AIC.
(A) Less than 365

(E) At least 395
9.
qi Observations of 3 variables are studied using principal component analysis. The loading matrix of 4;19ii is
0.732 0.307 0.609
( 0.437
—0.523
0.475
0.825
—0.764
0.213 )
where Oji is the loading of xi on the ith principal component.
The scores of the three components on the first observation are, in order, 1.220, 0.002, —1.279.
Calculate an approximation of the first component of the first observation, xii.
(A) 0.11 (B) 0.22 (C) 0.25 (D) 0.58 (E) 1.56
10. A response variable has 2 possible values, A and B. A node of a classification tree has 100 observations with
65 As and 35 Bs. It is split into two groups. The first group has 60 As and 15 Bs and the second group has the
remaining observations.
Calculate the decrease in cross-entropy resulting from the split.
(A) 0.06 (B) 0.09 (C) 0.12 (D) 0.15 (E) 0.18
11. 4' A binomial generalized linear model uses a probit link. The form of the model is
g(n) = Po + Fix].
The estimated values are 1)0 = 1, b1 = —0.1.
Calculate the estimated mean when x1 = 4.
(A) 0.1 (B) 0.3 (C) 0.5 (D) 0.7 (E) 0.9
Exam SRM Study Manual Exam questions continue on the next page . .
Copyright 02022 ASM
424 PRACTICE EXAM 5
12. f To validate a time series model based on 20 observations, the first 16 observations were used as a model )
development subset and the remaining 4 observations were used as a validation subset. The actual and fitted values
for those 4 observations are
Yr Pi
17 8 9
18 12 15
19 14 18
20 22 20
Calculate the MAPE.
(A) 14.8 (B) 15.8 (C) 16.8 (D) 17.8 (E) 18.8
13.
sir The time series yi follows an AR(1) process of the form
yt =0.6y1-1+ Et
Var(si ) = 100
Calculate the variance of a 3-step ahead forecast.
(A) 136 (B) 149 (C) 153 (D) 160 (E) 196
14.
A generalized linear model for drivers is based on 5 observations. Drivers are in Class 0 or Class 1.The response
variable is Bernoulli. Actual and fitted values are
Actual class 0 0 0 1 1
Fitted probability of Class 1 0.16 0.24 x 0.56 0.68
The deviance statistic is 3.3255.

Determine x.
(A) 0.22 (B) 0.23 (C) 0.24 (D) 0.25 (E) 0.26
Exam SRM Study Manual Exam questions continue on the next page „ .
Copyright t2022 ASM
PRACTICE EXAM 5 425
You are performing a K-means cluster analysis on a set of data. The data has been initialized with 3 clusters as
follows:
Cluster Data Point

A (3,5)
A (6,2)
A (9,2)
A (10,7)
B (2,6)
B (2,8)
B (4,3)
B (8,1)
C (5,1)
C (6,6)
C (8,4)
C (9,3)
A single iteration of the algorithm is performed using squared Euclidean distance between points.
Calculate the number of data points that move from one cluster to another.
(A) 4 (B) 5 (C) 6 (D) 7 (E) 8
16. •••• Principal component analysis is applied to a set of observations of 2 variables. The score of the observation
(4,3) on the first principal component is 4.3077. The loading of the first principal component on the first variable is
greater than 0 and less than 0.5.
Calculate the loading of the first principal component on the first variable.
(A) 0.32 (B) 0.34 (C) 0.36 (D) 0.38 (5) 0.40
17.
•41 Determine which of the following statements about bagging is/are true.
I. In out-of-bag validation, approximately B/3 predictions are made for each observation.
II. The test MSE for out-of-bag validation is U shaped as a function of B.
III. For B sufficiently large, out-of-bag error is virtually equivalent to leave-one-out-cross-validation error.
18. •-: In a heterogeneity model

(i) The distribution of yi I cri is Poisson with mean 0.2.
(ii) The log link is used.
(iii) el follows a gamma distribution with mean 0.1 and variance 0.6.
Calculate the overdispersion factor.
(A) 1.06 (B) 1.12 (C) 1.24 (D) 1.30 (E) 1.36
EXAM SRM Study Manual Exam questions continue on the next page...
Copyright C2022 ASM
426 PRACTICE EXAM 5
19. rIn a linear model with an intercept, variables are selected using subset selection. The model is based on 100
observations. The total sum of squares is 4184. The RSS when all 15 predictors are included in the model is 252.
Mallow's Cr, is used to select the best model among models with different numbers of predictors.
The RSS for the best model with 5 predictors is 474.
The RSS for the best model with 6 predictors is less than x.
The model with 6 predictors is selected.
Determine the highest possible value for x.
(A) 422 (B) 430 (C) 442 (D) 454 (E) 468
20. 44/ For a linear regression with 2 variables and an intercept based on 5 observations, you are given
i Leverage Residual
1 0.8831 0.7737
2 0.3534 —1.8664
3 0.5246 0.9216
4 0.3670 0.5549
5 0.8738 —0.3838
Calculate the PRESS statistic.
(A) 11 (B) 13 (C) 15 (D) 55 (E) 66
21. i With respect to shrinkage methods, which of the following statements are true?
I. The Bayesian interpretation of the lasso is to assign the double-exponential distribution as a prior for the
coefficients pi. ._)
Ridge regression may be used for feature selection.
For all shrinkage methods, variance is a non-increasing function of the tuning parameter.
22.
"41 For a Poisson regression, you are given the following actual and fitted values:
k 1 2 3 4 5
ilk 1 2 3 3 2
0.7 1.6 2.5 3.5 2.5
Determine the observation with the highest deviance residual in absolute value.
(A) k=1 (B) k = 2 (C) k= 3 (D) k = 4 (E) k = 5
23.
•-11 For a regression model of the form y o + Pix + e based on 89 observations, you are given:
(i) I = 85.19
(ii) s, = 9.02
(iii) x35 = 92.03
Calculate the leverage of x35.
(A) 0.015 (B) 0.016 (C) 0.017 (D) 0.018 (E) 0.019
PRACTICE EXAM 5 427
4: A linear regression model based on 5 observations is of the form yi = po+ pixii+132x,2+ Ei. The values of the
variables are:
xi Xi2 Yi
1 3 24 60
2 5 29 64
3 8 34 70
4 9 38 79
5 10 50 87
=7 = 35 = 72
E(xii — Ri)2 = 34 E(xi2 — R2)2 = 392 (yi yr)2 486
Calculate the VIF of xi.
(A) 5.2 (B) 5.7 (C) 6.4 (D) 6.8 (E) 7.3
25. In a normal linear regession model, you are given

(ii) There are 3 explanatory variables and an intercept.
(iii) The unbiased sample variance of the response variable is 32.8.
(iv) The deviance is D = 302.
Calculate R2.
(A) 0.46 (8) 0.56 (C) 0.66 (D) 0.76 (E) 0.86
26. For a GARCH model, you are given

(i) The unconditional variance of the error term is 220.
(ii)
2
02 — 0.2262i-1— 0.0802i-2 W 'IC 0' 15 2EI-2
Determine w.
(A) 44 (B) 48 (C) 66 (D) 88 (E) 96
27. rA linear regression model has the form

Yi = Po + f31x + (32x i2 + Ei
You are given:

(ii) The fitted value of /32 is 12.456.
9.566 —0.322 —1.478
(iii) (X/X)-1 = —0.322 0.011 0.048
—1.478 0.048 0.241
(iv) The residual standard deviation is 8.0882.

Calculate the partial correlation of y and x2.
(A) 0.09 (B) 0.20 (C) 0.31 (D) 0.42 (E) 0.53
Copyright (D2022 ASM
428 PRACTICE EXAM 5
28. 1.41' A linear regression model has the form

yi = Po + ixj + 132X12ei
You are given:

(ii) The fitted value of pi is 4.986.
0.825 -0.029 0.065
(i i) (X/X)-1 = (-0.029
-0.065
0.010
0.014
-0.014 .
0.035
(iv) The residual standard deviation is 11.155.

The null hypothesis is Ho: pi = 3. The alternative hypothesis is Hi: Pi > 3.
Determine which of the following statements is true.
(A)Do not reject Ho at 10% significance.
(B)Reject Ho at 10% significance but not at 5% significance.
(C) Reject Ho at 5% significance but not at 2.5% significance.
(D)Reject Ho at 2.5% significance but not at 1% significance.
(E) Reject Ho at 1% significance.
29. •-i? Determine which of the following statements regarding bagging is/are true.
I. Each training data set used in bagging has n components.
II. Choosing B too high can result in overfitting.
III. Bagging is not useful for classification settings.
(E) The correct answer is not given by (A) , (B) , (C) ,or (D) .
30. a: A least squares model with a large number of predictors is fitted to 92 observations. To reduce the number of
predictors, forward stepwise selection is performed.
For a model with k predictors, RSS = ck.
The estimated variance of the error of the fit is 52 = 25.
Determine the value of cd - cd+i for which you would be indifferent between the d + 1-predictor model and the
d-predictor model based on BIC.
(A) 108 (B) 113 (C) 118 (D) 123 (E) 128
PRACTICE EXAM 5 429
Li•near regression is performed based on 5 observations. The regression is based on 2 predictors and an intercept.
To reduce the dimension, principal component regression is performed. Only the first principal component is used.
You are given
i xii xi2 yi
1 —1 0 —4
2 —2 4 —2
3 —1 1 0
4 1 —3 2
5 3 —2 4
The loadings of the principal component are —0.6 on xi, 0.8 on x2.
Calculate the fitted coefficient of zi, the first principal component, in the regression.
(A) —0.73 (B) —0.62 (C) —0.51 (D) —0.42 (E) —0.31
32. rA regression model for auto collision claim costs is of the form y =10+ 13ixi +132x2 + e, where xi is the CPI and
x2 is an index of the cost of gasoline. The model is based on the following 5 observations:
Claim costs CPI Gas cost index
100 2.651 3.045

102 2.782 2.974
105 2.805 3.146
109 2.873 3.357
116 2.904 3.104
To test whether the two explanatory variables are collinear, the VIP is calculated. You are given:
(i) S1 = 0.0096625
(ii) 42= 0.0209767
(iii) Z(x1i — R1)(xi2— )= 0.029492
Calculate the VIP of xi.
(A) 1.4 (B) 1.6 (C) 1.8 (D) 2.0 (E) 2.2
33. %I. You are given the following six observations of a single variable:
21, 30, 40, 51, 63, x
They are analyzed using hierarchical cluster analysis with average linkage and Euclidean distance. After two fusions,
the clusters are 121,301, 140,511, (631, and fx}.
Determine the highest integer value of x for which the next fusion would be {63, x}.
(A) 76 (B) 78 (C) 80 (D) 82 (E) 84
Copyright Q2022 ASM
430 PRACTICE EXAM 5
34. a: The Holt-Winters model is used to forecast a time series. The last available value of the series is for period 20.
You are given:
y19 = 185 and y20 = 199
b0,19 190 and b1,19 = 2
wi = 0.6 and zv2 = 0.7.
The series is not seasonal; zv3 = 0.
Determine the predicted value of the series for period 21.

(A) 196 (B) 198 (C) 200 (D) 202 (E) 204
35. Determine which of the following statements regarding hierarchical clustering is/are true.
I. If two different people are given the same data and perform one iteration of the algorithm, their results at that
point will be the same, regardless of the linkage they used.
At each iteration of the algorithm, the number of clusters will be greater than the number of clusters in the
previous iteration of the algorithm.
The algorithm needs to be run only once, regardless of how many clusters are ultimately decided to use.

Practice Exam 6
1. "Ir You are given the following results for the regression model y = po flix E:
Source of Variation Degrees of Freedom Sum of Squares
Regression 1 5,012
Error 9 4,296
You are also given that for the explanatory variable x, Z(x1 —5-02 = 120.
Determine the length of the symmetric 95% confidence interval for 131.
(A) 9.0 (B) 9.1 (C) 9.2 (D) 9.3 (E) 9.4
2. *41' A logistic regression model is used to model the color of a car as a function of the age of the owner (x2). The
color white is the base level. Other colors are black, gray, red, and blue. The fitted coefficients for the model are
Color b1 b2
Black 0.33 —0.009
Gray —1.05 0.010

Red 1.50 —0.100
Blue 0.05 0.005
Calculate the probability that a person aged 35 has a white car.

(A) 0.22 (B) 0.26 (C) 0.30 (D) 0.34 (E) 0.38
3. rY is a classification variable that may be 0 or 1. X is a random variable uniformly distributed on [0,1]. You are
given
Pr(Y = 1 I X = x)= x
Calculate the Bayes error rate.
(A) 0 (B) 0.125 (C) 0.250
4. A classification tree is used for a variable with two classes. The tree has 5 terminal nodes. The number of
observations in each class at each terminal node are as follows:
Node Class A Class B Total
1 45 10 55
2 32 8 40
3 10 30 40
4 5 45 50
5 3 52 55
Total 95 145 240
Calculate the residual mean deviance, as calculated by R.
(A) —0.82 (5) —0.80 (C) —0.78 (D) —0.72 (E) —0.70
Exam SRM Study Manual 431 Exam questions continue on the next page
Copyright =022 AS1VI
432 PRACTICE EXAM 6
5. *--:1.1 A regression model is of the form y = po + P1x1 + f32x2 + E. It is based on 6 observations. You are given
(i) The leverages are 0.4514, 05626, 0.3111, 0.5584, 0.3732, and 0.7433.
(ii) The residuals are —1.3009,0.0923, 0.9145,0.8649, —0.0974, and —0.4734.
Calculate PRESS.
(A) 10.3 (B) 11.4 (C) 12.5 (D) 13.6 (E) 14.7
6. iS The probability of a strike in an industry, n, is modeled with logistic regression, using an economic index xi as
a predictor. In the fitted model, f31 = 0.1. When x1 = 3, the probability of a strike is 0.1.
Calculate the probability of a strike when X1 = 7.
(A) 0.12 (B) 0.13 (C) 0.14 (D) 0.15 (E) 0.16
7. For a simple linear regression y=a+bx+E based on 12 observations, you are given:
(i) The fitted model is y = 2.1 + 1.20x + E.
=25
(iii) RSS = 100.

(iv) se(b) = 1.425.
Calculate the lower bound of a 95% prediction interval for y when x = 32.
(A) 15.9 (B) 17.1 (C) 18.3 (D) 19.5 (E) 20.7
8. •Nii. The following discrete distribution is in the linear exponential family:

f (y) (y + 1)(1 — p)2pY y = 0,1, 2, ...
(B) ln(1 + ce)2 (C) — ln(1 + ce)2 (D) ln(1 — C61)2 (E) — ln(1 e0)2
9. el' You are given the following statements regarding principal component analysis.
I. The sum of the loadings for each principal component must be 0.
II. The sum of the squared loadings for each principal component must be 1.
III. The sum of the scores of each principal component must be 0.
Determine which of these statements are true.
(A) I only
(B) II only
(C) III only
(D) I, II, and III
Copyright CO202.2 ASM
PRACTICE EXAM 6 433
House sales is the response variable of a normal model that is based on 11 observations. Explanatory variables
are:
• Interest rates
• Unemployment rate
Model I uses only interest rates as an explanatory variable, while Model II uses both interest rates and unem-
ployment rates. Both models have an intercept. The results of the models are:
Model I Model II
Source of Variation Sum of Squares Source of Variation Sum of Squares
Regression 14,429 Regression 17,347
Error 12,204 Error 9,286
Which of the following statements is correct?

(A) Model II is accepted at 1% significance.
(B) Model H is accepted at 2% significance but not at 1% significance.
(C) Model II is accepted at 5% significance but not at 2% significance.
(D) Model II is accepted at 10% significance but not at 5% significance.
(E) Model II is rejected at 10% significance.
11. "-st. Determine which of the following statements regarding bagging is/are true.
I. It is difficult to interpret the model arising from bagging.
II. At each split, all available predictors are considered.
HI. Cost complexity pruning is used to reduce the variance of the trees.
12. Y is a real-valued variable. It is modeled as a function of Xi and X2 using KNN regression, with K = 2. You are
given the following training data:
X1 10 12 12 15
X2 13 11 16 13
Y 2 5 10 6
Calculate the mean squared error on the training data.
(A) 5.2 (B) 5.4 (C) 5.6 (D) 5.8 (E) 6.0
13. al? Determine which of the following statements regarding hierarchical clustering of 11 observations is/are true.
I. At each iteration of the algorithm, the number of clusters is reduced by 1.
II. At iteration i of the algorithm, a comparison of (n i + 1)(n — i)/2 dissimilarities determines which clusters are
fused.
III. The dissimilarity of clusters increases at each iteration.
rTh- (E) The correct answer is not given by (A) , (B) , (C) , or (D) .
434 PRACTICE EXAM 6
14. I': Determine which of the following methods may be used to model stochastic seasonal effects in time series.
I. Trigonometric functions.
II. Seasonal autoregression
III. Holt-Winter
IV. Seasonal binary variables.
(A) I and II only (B) I and III only (C) I and IV only (D) II and III only (E) II and IV only
15. al In a hierarchical clustering analysis using Euclidean distance as the dissimilarity measure, the following 3
clusters have been formed:
1(40,30), (40,40)) ((60,22)) ((75,41))
Consider the following linkages:

I. Single
IL Average
III. Complete
IV. Centroid
Determine the linkages for which 1(60,22)) is fused with 1(40,30), (40,40)1 at the next iteration.
(A) I and II only
(B) I and III only
(C) land IV only
(D) I, II, and IV only
(E) II, III, and IV only
PRACTICE EXAM 6 435
. r Salaries of actuaries are modeled using a regression tree. Three variables are used:
1. X1 is the credentials of the actuary, which May be FSA, ASA, or student.
2. X2 is number of years of experience.
3. X3 is the region where the actuary works, which may be E, W, N, or S.
This is the regression tree:
= FSA
X2 < 10 (0 X3 = E W
CD X3 EW CD' X2 < 3 ®X1= ASA
X3 = In X3 =
The total sum of squares is 9,865. After each split, the RSS is:
C) 9,075
8,302
® 7,845
CD 7,411
0 7,026
® 6,798
CD 6,502
6,398
Using the deviance-related variable importance measure, rank the importance of the three variables from highest
to lowest.
(A) Xi, X2, X3 (B) X11X3,X2 (C) X2, Xi, X3 (D) X21 X3, X1 (E) X31 Xl, X2
Exam SRM Study Manual Exam questions continue on the next page ..
Copyright 02022 ASM
436 PRACTICE EXAM 6
17. •41 You have fit a multiple regression model with k explanatory variables and an intercept to 80 observations. You
are given:
(i) Residual standard deviation is 8.
(ii) Leverage of 10th observation is 0.2.
(iii) 10th residual is 6.8.
(iv) Cook's distance D10 = 0.0376.
Determine k.
(A) 4 (B) 5 (C) 6 (D) 7 (E) 8
18. ": For a linear model with 4 observations

0.35 0.45 0.05 0.15
114 5 0.65 -0.15 0.05
(i) The hat matrix is 0.05 -0.15 0.65 0.45
0.15 0.05 0.45 0.35
(ii) The response variable's values are 14, 12, 8, 3.

(iii) The standard error of the regression is 4.1623.
Calculate the standardized residual of the first observation.
(A) 0.65 (B) 0.75 (C) 0.85 (D) 0.95 (E) 1.05
19. Sales of insurance are modeled as a function of the number of agents in the field force. Let Y be sales and let X
be the number of agents. The fitted model is
Yi = Po + ei
For 10 observations, you are given

(i) The correlation between Y and X is 0.78.
(ii) The unbiased sample variance of X is 105.6.
(iii) The fitted value of pi is 87.36.
Calculate the total sum of squares.
(A) 10 million (B) 11 million (C) 12 million (D) 13 million (E) 14 million
20. •41 You are given the time series yt = 11, 0, 2, 0, 01.
Calculate the autocorrelation of yi at lag 1.
(A) -0.49 (B) -0.42 (C) -0.35 (D) -0.28 (E) -0.21
21. al A zero-inflated model consists of a mixture of a Poisson distribution with mean 0.8 and the constant 0. The
constant 0 has a weight of 0.3.
Calculate the overdispersion of this model relative to a Poisson model.
(A) 1.06 (B) 1.12 (C) 1.18 (D) 1.24 (E) 1.30
Exam SRM Study Manual EXCI711 questions continue on the next page . . .
PRACTICE EXAM 6 437
A generalized linear model is used to model claim size. You are given the following information about the
model:
(i) Claim size follows a Gamma distribution.

(ii) The link function is g(p)= p".
(iii) The scale parameter is cp = 1/5.
(iv) Model output:
Variable
Intercept 500
Vehicle value (000) 15
Number of violations 200
Number of accidents 1000
Gender—Male 600
Calculate the standard deviation of claim size for a vehicle with value 30,000 belonging to a male with 1 violation
and no accidents.
(A) 1800 (B) 2300 (C) 2800 (D) 3300 (E) 3800
23. ikir For a linear regression with k variables and an intercept:
Dyi - p)2 = 8,500
(iii) The residual sum of squares is 1,825.
(iv) The F ratio for the regression is 22.860.
Determine k.
(A) 2 (B) 3 (C) 4 (D) 5 (E) 6
24. '1•4r A hurdle model with a base Poisson count has 2 predictors. When x1 = 2 and x2 = 3, the Poisson parameter is
0.8 and overdispersion is 1.2.
Calculate the probability of 0 given that x1 = 2 and x2 = 3-
(A) 0.59 (B) 0.63 (C) 0.67 (D) 0.71 (E) 0.75
25. For a generalized linear model based on 100 observations:

(i) The model has 3 parameters: an intercept and 2 predictors.
(ii) The maximized loglikelihood of the minimal model is —45.88.
(iii) max-scaled R2 = 0.80.
Calculate the AIC of the model.
(A) 32.3 (B) 35.3 (C) 38.3 (D) 41.3 (E) 44.3
26. •41 With regard to shrinkage methods, determine which of the following statements is true.
I. The penalty function in ridge regression is a function of the £2 norm of 13.
IL It is best to standardize predictors when using ridge regression or the lasso.
IlL For both ridge regression and the lasso, the higher the budget parameter, the higher the variance.
438 PRACTICE EXAM 6
27.
Determine which of the following characteristics distinguish a linear trend in time from a random walk.
I. In a linear trend in time, consecutive terms are correlated but in a random walk they are not.
II. A linear trend in time has stationary variance but a random walk doesn't.
III. The differences of a linear trend in time are not stationary in the mean but the differences of a random walk are
stationary in the mean.
(E) The correct answer is not given by (A) , (B) (C) , or (U).
28.
N? A principal components analysis is performed on the following 4 observations of 3 variables:
(1,0.4, —0.6) (0,0.4, —0.1) (-0.5, —0.4,0) (-0.5, —0.4,0.7)
The scores of the four observations on the second principal component are 0.0016, 0.1074, —0.3432, and 0.2342.
Calculate the proportion of variance explained by the second principal component.
(A) 0.04 (B) 0.05 (C) 0.06 (D) 0.07 (E) 0.08
29.
ai? For a set of 6 observations, you perform the following two regressions:
I. yi = 130 +131.xii.
II. Xi = yo + E;
Residuals from these regressions are:
Observation Residual from Residual from
Number Regression I Regression II
1 2.022 —0.455
2 —2.587 —0.364
3 —2.391 —5.545
4 0.196 2.545
5 4.391 1.364
6 —1.630 2.455
The residual standard deviation is 3.102 for Regression I and 3.371 for Regression II.
Calculate the partial correlation coefficient of y and x2.
(A) 0.342 (13) 0.360 (C) 0.377 (D) 0.395 (E) 0.412
30.
Niel For a time series with 10 terms, you are given
yi =149
ZPIy2= 2287
V°2 yip-1 = 1994
= 20
Yio = 11
The series is fitted to an AR(1) model, yi = Po + pi yi_i
Calculate the fitted value of 13i.
(A) 0.26 (B) 0.28 (C) 0.30 (D) 0.32 (E) 0.34
Exam SRM Study Manual Exam questions continue on the next page . .
PRACTICE EXAM 6 439
NI° You are given 4 potential explanatory variables for a linear model, xi, x2, x3, and x4. The model is based on 32
observations. When x4 is regressed with an intercept against the other three variables, the F ratio for the model is
2.532.
Calculate the VIF for x4.
(A) 1.17 (B) 1.27 (C) 1.37 (D) 1.47 (E) 1.57
32. You are given the following 8 observations:
2, 4, 7, 11, 16, 22, 29, 36
K-means clustering groups these observations into 2 clusters.

Determine the minimum value of the objective function used for determining the clusters.
(A) 270.75 (B) 448 (C) 541.5 (D) 631.5 (E) 896
33.
The variable Y is modeled as function of X using a regression tree. You are given the following observations:
X 1 2 3 4 5 6
Y 10 12 17 21 22 24
Determine the first split of the feature space into two regions.
(A) Between 1 and 2
(B) Between 2 and 3
(C) Between 3 and 4
(D) Between 4 and 5
(E) Between 5 and 6
34. ••.? For a linear regression with 5 observations, you are given
y 9i
10 0.77 0.5403
9 10.78 0.3289
15 24.13 0.2036
30 44.16 0.3512
70 54.17 0.5760
Calculate the leave-one-out cross-validation statistic.
(A) Less than 300

(E) At least 450
Exam SRM Study Manual Exam questions continue on the next page „ .
440 PRACTICE EXAM 6
35.
%. Determine which of the following are reasons to use parametric approaches.
I. Parametric approaches are more flexible than non-parametric approaches.
Parametric approaches require fewer observations than non-parametric approaches.
Parametric approaches are easier to interpret than non-parametric approaches.

Appendices
Appendix A. Solutions to the Practice Exams
Answer Key for Practice Exam 1

1 C 11 E 21 E 31 C
2 E 12 E 22 C 32 A
3 C 13 B 23 B 33 A
4 B 14 D 24 E 34 D
5 B 15 E 25 E 35 D
6 E 16 D 26 C
7 C 17 D 27 B
8 C 18 D 28 A
9 C 19 E 29 A
10 B 20 D 30 C
Practice Exam 1
1. [Lesson 11 Classification setting—the company is choosing a class. Supervised—there is something being predicted.
But decision trees are not parametric. (C)
2. [Lesson 12] In logistic regression, g(p) is the logarithm of the odds, so we must exponentiate f3 to obtain odds ratio.
e-0.69.5 0.4991 (E)
3. [Section 11.1] The square of the coefficient of variation is the variance divided by the square of the mean. If it is
constant, then variance is proportional to mean squared. This is true for a gamma distribution. (C)
4. [Section 11.11 Area A is the base level, so nothing is added to g(p) for it.
g(p) = 0.00279 - 0.001 + 25(-0.00007) = 0.00004
1
7 = 0.00004
1
158.11 (B)
/-1- V0.00004
5. [Section 14.2] A cubic polynomial adds 3 parameters. The 99th percentile of chi-square at 3 degrees of freedom is
11.345. Twice the difference in loglikelihoods must exceed 11.345, so the loglikelihood must increase by 5.67. Then
-361.24 + 5.67 = -355.57 (B)
6. [Section 17.1] The loading of the first principal component on the second variable is 1/1 - 0.62 = 0.8. We are given
-0.6(0.4) + 0.8x2 = 0.12
It follows that 2c2 = 0.45 . (E)

444 PRACTICE EXAM 1, SOLUTIONS TO QUESTIONS 7-12
7. [Lesson 11
I. The lasso is more restrictive than linear regression.X
Flexible approaches may not lead to more accurate predictions due to overfitting.X
This sentence is lifted from An Introduction to Statistical Learning page 35./
(C)
8. [Lesson 14] USE has 3 levels, so Modern has 2 parameters fewer than Model I. Thus the AIC penalty on Model II is
4 less than for Model I. The AIC for Model I is 3.80 less than for Model II, but before the penalty, twice the negative
loglikelihood of Model I is 7.80 less than for Model IL The critical values for chi-square with 2 degrees of freedom
are 7.378 at 2.5% and 9.210 at 1%, making (C) the correct answer choice.
9. [Section 18.2] We have to calculate all 6 distances between points and average them.
Point 1 Point 2 Distance
(8,2) (10,3) V5-

(9,7) (10,3) \/0
(12,5) (10,3) -18-
(8,2) (11,1) /17
(9,7) (11,1) /LT3
(12,5) (11,1) Vi7
The average distance is 1/6 (ir5 + A/17 + + Ani5 +

+1/17) = 3.7996 . (C)
10. [Section 5.2] Use the second equality of formula (5.2). The standard error of the first residual is s1,11:
4.1 ei ) 2( 0.15
3s2 = 3(0.85))
= 0E.815) C0):1855)
E1—
= 4.1(0.852) 19.74 83
0.15
=
4.4439 (B)
11.
[Section 7.2] We will use the definition of Mallow's C1, from An Introduction to Statistical Learning, but you would
get the same result using the definition in Regression Modeling with Actuarial and Financial Applications.
C n (RSS + 2d62), and we can ignore 1/n. So we want
This implies
Cd Cd+1 = 2(40) = 80 (E)
12.
[Section 16.1] We weight the cross-entropies for the two groups with the proportions of observations in each group,
0.6 and 0.4
2 2 1 1 1 1 1 1 5 5 1 1
D
—0.6 kit-15 gln gln -g) — 0.4 kin •§ + gln •
0.88064 (E)

Copyright 02022 ASM
PRACTICE EXAM 1, SOLUTIONS TO QUESTIONS 13-21 445
[Section 16.11 Higher a means more tree pruning and fewer nodes. That will increase the MSE on the training data
and raise bias on the test data. I TI is the number of terminal nodes, which decreases. (B)
14. [Section 711 Data snooping refers to (D).

15. [Lesson 15]
I. KNN tends to perform worse as the number of predictors increases, since the points tend to be further apart.
X
Linear regression is easier to interpret than KNN. X

KNN is most flexible as K get smaller, or as 1/K increases. /
(E)
16. [Section 18.21

I. Furniture sales have low units but high dollar amounts, and food is the other way around, so the input data
would have quite different patterns, with significant effect on clusters./
A correlation-based dissimilarity method is desirable, and that is much easier to use with hierarchical cluster-
ing./
Correlation is scale-free, so frequent and infrequent shoppers with the same shopping patterns would be
grouped together./
(D)
[Section 17.1] All three statements are true. (D)

(11.. . [Lesson 19] All three statements are true. The mean of ct is 0 so the mean of sums of ci is also 0, a constant. The
variance at time t is ta2, here 5002. Differences of the series are white noise, which is stationary. (D)
19. [Section 41 There are n = 22 observations, k + 1 6 coefficients in the unrestricted model, and q = 2 restrictions.
(Error SSR — Error SSuR)/q (310 — 156)/2 7.897

F2,16 = (E)
Error SSuR — k — 1) 156/16
20. [Lesson 20] The residuals are
22— (20.325 + 0.1(20)) = —0.325

21 — (20.325 + 0.1(22)) = —1.525
24 — (20.325 + 0.1(21)) = 1.575
23 — (20.325 + 0.1(24)) = 0.275
The mean of the residuals is 0. The estimated variance of the residuals, by formula (20.3), is
2
5 = —
2
((-0.325)2 + (-1.525)2 + 1.5752 + 0.2752) = 2.49375 (D)
21. [Section 7.1, Lessons 16, and Section 18.2] II and III are greedy in that they select the best choice at each step and
don't consider later steps. While hierarchical clustering selects the least dissimilar cells at each iteration, there is no
("--"` particular measure that would indicate whether a better clustering is possible with a different choice, so it is not
considered greedy. (E)

Copyright g+2022 ASM
22. [Section 16.21 I and III are true. The opposite of II is true: a low shrinkage parameter leads to selecting a higher B
since less is learned at each iteration, so more time is needed to learn (C)
23.
[Section 19.6] MSE is the mean square error, with division by 5 rather than 4, since the fit is not a function of the
validation subset. The residuals are -3, -3, -2, 2, 4
32 + 32 +22 +22 + 42
MSE 8.4 (B)
5
24.
[Subsection 13.3.2] k is the quotient (1 - n)/(1 g(0)), where it is the probability of 0 (0.3 here) and g(0) is the
Poisson probability of 0, which is e-0.6 here. The probabilityof 1 is
1
- ( -C103.6)0.6e-°-6
1
0.510875 (E)
25. [Section 14.5] The deviance is twice the excess of the loglikelihood of the saturated model, 'max, over the loglikelihood
of the model under consideration, /(b), so
2(/max - /(b)) = 8.70

/max + 74.88 = 4.35
/max = -7033
The pseudo-R2 statistic is

/(b) - /0 _ -74.88 + 90.31
pseudo-R2 - imax ,
- /0 - -70.53 + 90.31
0.78 (E)
26.
[Section 13.11 In a Poisson regression with a log link, the ratio of expected values is the exponential of the difference
of the xs. Here, that is el:P.1360-2) = 1.1972 . (C)
27. [Section 8.21

1.
PLS is a supervised method since it takes the response into account when determining the coefficients. st
2.
In both dimension reduction methods we study, the selected directions are linear combinations of the original
variables. 1,/
3. PLS creates new variables that are functions of the original ones, so it does not select features. X
(B)
28.
[Lesson 21 For each explanatory variable there is a base level. There are 3 non-base occupational classes and 2
non-base health classes. Thus there are 3 x 2 = ]interaction parameters. (A)
29. [Section 8.1] Let v be the vector.
1lviii = 5 + 3 + 8 + 2 + 4 = 22
11v112 = ,52 +32 + 82 +22 + 42 ,10.8628
The absolute difference is 122 - 10.86281 = 11.1372 .(A)

Copyright 02022 ASM
. [Section 5.21 Use formula (5.1):

=
1
+
(xi — 5)2
n E7_1(x — 2)2
The denominator is n — 1 times the sample variance. We get
0.24833 (C)
31. [Section 21.2] The predictions are
x211 = 182
x312 = 0.2(138) + 0.8(182) = 173.2
x413 = 0.2(150) + 0.8(173.2) = 168.56
x514 = 0.2(192) + 0.8(168.56) = 173.248
The sum of squared errors is (-44)2 + (-23.2)2 + 23.442 + (-3.752)2 = 3037.751 (C)
32. [Section 16.11 I is true. But classification error is preferred for pruning tree since that is the measure of predictive
accuracy. And the predicted values of two terminal nodes coming out of a split may be the same, due to different
levels of node purity. (A)
[Section 18.2]
rag., I. There is an inversion; the split between {4) and (5,6,7) is at a lower level than the split between (5) and (6,7), and
of the four linkages we studied, only centroid has inversions./
II. All we know is that when the clusters were (1), (2), (3), (4), and (5,6,7), (4) was fused with (5,6,7). So (4) is closer
to the centroid of (5,6,7) than (3) is, and (4) is closer to the centroid of (5,6,7) then it is to (3). None of these
imply II. X
III. All we know is that (3) is closer to the centroid of (4,5,6,7) than (1) is to (2), since it was fused first. That doesn't
imply III. X
(A)
34. [Section 3.2] Total SS = 81,004 — 8(1002) = 1,004

Regression SS = 80,525 — 8(1002) = 525
, 525
= 0.52 (D)
1004
35. [Lesson 10] All three statements are true. (D)

Copyright 02022 ASM

1 A 11 A 21 A 31 D
2 C 12 B 22 D 32 C
3 E 13 E 23 B 33 D
4 A 14 0 24 E 34 A
5 D 15 B 25 C 35 D
6 E 16 B 26 C
7 A 17 C 27 A
8 B 18 A 28 B
9 C 19 D 29 B
10 C 20 B 30 B
Practice Exam 2
1. [Lesson 1] Ridge regression is parametric, but K-nearest neighbors regression depends on each point of data.
Principal components is not a supervised method so it is not parametric. (A)
Error SS
2. [Section 3.2] R2=1
Total SS
Error SS
0.625 = 1
8016
Error SS = 3006
There are 4 parameters, so there are 65 — 4 = 61 degrees of freedom. The RSE is V3006/61 = 7.0199 (C)
3. [Lesson 191 I and III are true. Causal models can include polynomials and other functions, so Ills false. (E)
4. [Lesson 2] We will use bo = g biX. First we calculate 17. From the formula for b1, we have
3.2465 =
72,559 — 20(9)(31.5)
23,720 — 20(31.52)
3875
Then
72,559 — 3.2465(3875)
= 95.2045
20(31.5)
So bo = 95.2045 — 3.2465(31.5) = —7.0603 (A)
5. [Lesson 1] The study is a classification setting: either someone gets cancer or doesn't. It is supervised; there is a
response variable. Logistic regression is a parametric approach. (D)
6. [Lesson 12] The odds of 1 are 0.20/(1 — 0.20) = 0.25. From the form of the logistic model,
p-1
In oi = 0.64 +
j=1

Copyright 02022 ASM
whereas the cumulative odds of 1 or 2 are

p-1
1n02 = 1.22 + 13
j=1
It follows that
02
e1.22-0.64
01
and therefore
02 e1,22-0.64,0.25‘= 0.446510
The probability of 1 or 2 is
0.446510
71 4- 7t2 = = 0.308681
1.446510
The probability of 3 is 1 - 0.308681 = 0.691319 . (E)
7.
[Section 21.11 First calculate moving averages at times 5, 6, 7, 8.
5 + 6 +12+ 5
s•5 = - 7
4
„
s6 = 6+12+5+10
4
= 8.25
A 12+5+10+9
S7 = = 9
4
5+10+9+15
s = 9.75
4
Then average the four averages.

42) 7 +8.25 +9+9.75
S8
= 8.54 (A)
4
8. [Section 5.1] Let ri be the ith standardized residual. Then

ei 6.2
0.51031 (B)
-
• 12.41 - 0.04
9. [Section 18.11 The algorithm minimizes the sum of within-cluster squared Euclidean distance divided by cluster
size. Let's calculate that. We'll calculate distances between earlier and later points, then double at the end to take
care of distances between later and earlier points.
(5,15)-(6,12): 12 + 32 = 10 (5,15)-(6,20): 12 + 52 = 26 (6,12)-(6,20): 82 = 64
(9,12)-(11,4): 22 + 82 = 68 (9,12)-(11,16): 22 + 42 = 20 (11,4)-(11,16): 122 = 144
The sum of these six numbers divided by 3 (the sizes of the two clusters) and doubled, is 2211. (C)
10. [Section 3.31 Use formula (3.12) for the standard error of b1.
50.24
se(bi) = - 1.9403
6.280VT7
The t ratio is 4.637/1.9403 = 2.3900. There are n - k - 1 = 16 degrees of freedom. 2.3900 is between the 975th
percentile of t, which is 2.1199, and the 99th percentile of t, which is 2.5835. For a two-sided test, this implies (C).
Exam Snit Study Manual

11. [Lesson 12] We add up the factors, including the intercept, and exponentiate them.
-2.521 - 0.050 + 0.155 + 30(0.007) = -2.206
e-2.206= 0.1101 (A)
12. [Section 16.1]

If we split between 6 and 8, the weighted average of the classification errors is (1/3)(1/3) + (2/3)(1/3) = 1/3.
(B)
It may be easier to just add up the number of classification errors for each split; 3 for I and III, 2 for II.
13. [Section 16.1] The sum of square differences of {3,6, 8,9} from their mean is 21 and the sum of square differences
of {6,10,12,14} from their mean is 35, for a total of 56. Putting the two sets of numbers together, the sum of square
differences from their mean is 88. Thus
56+ 2a > 88+ a
a 88 - 56 = 32
would cause these branches to be pruned. (E)

14. [Section 7.11
I. At each iteration, the variable chosen is the one that minimizes training RSS.X
II. True, but not true for backward subset selection./
III. This sentence is lifted from An Introduction to Statistical Learning page 213./
(D)
15. [Lesson 4] The F ratio with 35 observations, 6 parameters in the unrestricted model, and 2 constraints is
(Error SSII - Error SS)/2
F2,29 =
Error SS! /29
To express this in terms of R2, divide numerator and denominator by TSS.
(Er or SST/ Error SS/ )

m. Total SS Total SS
F2,29 = Error SS/
Total SS
= 14.5
(/q - Iq1)
1- R2
= 14.5
(0.900 - 0.860)
- 0.900
= 5.8 (B)
16. [Lesson 20] Use equation (20.4).
4.167 (B)
17.
[Section 12.11 In a probit model, 77 = C10(11). Initially, ij = 0-1(0.2) = -0.84. The systematic component 77 is increased
by 0.25(1) = 0.25, to become -0.59. The new probability is 0(-0.59) = 0.2776 (C)
Copyright 02022 ASM
[Lessons 15, 16, and 181 For K-nearest neighbor regression, lower K leads to not averaging in far away points,
lowering bias. Clustering is not a supervised method, so it doesn't make sense to talk about bias. For regression
trees, a greater number of terminal nodes allows a closer estimate, lowering bias. (A)
19.
[Section 16.21 The number of terminal nodes is 1 more than the number of splits, or d + 1. But the other two
statements are true. (D)
a2 6
20. [Lesson 20] Var(P) 9.375 (B)
p2 1 - 0.62 =
21. [Section 13.1] The observations follow a Poisson distribution, and their likelihood is maximized if the Poisson
parameter is the sample mean, or 0.6. Now, g(p) = In p= 130, and the fitted p is 0.6, so po = In 0.6 = -0.51083 (A)
22. [Section 18.2] 21 and 30 are closest and are linked first. Complete linkage looks at maximum distance, and that
means 40 is 19 away from 121,301, so it is linked to 51 instead. Similarly, the next link is {63,76}. At that point, {21,30)
and {40,51] are at distance 30 while 140,511 and {63,76} are at distance 36, so 121,30,40,511 is fused, making the answer
(D)
23. [Section 41 The F ratio for p4 = o is

177 - 123
F1,10 = - 4.390
123/10
The t statistic with 10 degrees freedom is the square root of the F ratio, or 2.095. This is between 1.8125, the critical
value at 10% significance for a 2-sided test, and 2.2622, the critical value at 5% significance. (B)
24. [Section 7.11 Among the 3-variable models, {X2, X3, X4} has the lowest RSS, so it is selected. Among models with
two of those three variables, {X2, X4} has the lowest RSS, so it is selected. (E)
[Section 18.21 I and II are reversed; complete linkage is based on maximal dissimilarity and single linkage is based
on intercluster dissimilarity. But III is true. (C)
26. [Section 17.1] The statements about loadings and scores should be switched; there are p loadings and n scores
for each principal component. Statement III is true since otherwise there would be no limit to the variance of the
components. (C)
27. [Lesson 15]
I. The points closest to (4,4) are (4,4) itself, (3,5), and (5,3), and 2 of those 3, (3,5) and (4,4), have Y = 1./
The points closest to (4,2) are (4,1), (3,2), and (5,3), and 2 of those 3, (4,1) and (3,2), have Y = 0.X
The points closest to (3,6) are (3,5), (4,6), and (4,4), all having Y = 1.X (A)
28. [Section 14.31 Based on b(0) and4), the response distribution is Poisson. But if you didn't recognize it, you could
work it out:
f(y) = exp(ye ee S(y,1))

and let A. = e8 and then f (y) = ke-A AY with k a constant, which is the form of a Poisson.
The deviance is
D=2
Dian yi - In fif) - (yi - 9i))
1=i
= 2 E(Yi in Yi Yi In
1=1
9i))
:= 2(2047 - 2015- 1035 + 1030) = f

l (B)

Copyright 02022 ASM
It's unusual that E yi Pi, but not impossible.

29. [Section 14.41 The REGION variable has 4 levels, hence 3 parameters. It suffices to subtract 31n 65 from the BIC of
the model with REGION to undo the extra penalty. However, we'll undo the entire penalty. There are 4 parameters
for AGE and 2 parameters for YEARS OF EXPERIENCE, plus an intercept, or 10 parameters for the full model and 7
for the model without REGION, so twice the negative loglikelihoods are 123.52 - 101n 65 = 81.776 for the full model
and 121.08 - 71n 65 = 91.859 for the model without REGION. The difference between these is 10.083, and there are
3 degrees of freedom since REGION has 3 parameters. For the chi-square distribution, 10.083 is between 9.348, the
critical value at 2.5%, and 11.345, the critical value at 1.0%. (B)
30. [Section 5.3] Collinearity does not affect the residual standard error, but causes higher standard errors for coefficients
which means lower t statistics. Collinearity is indicated by a high VIF. (B)
31. [Section 5.11 = 80 - 81 = -1
-1
-0.2379 (D)
s- = 6.1001/1 - 0.525
32.
[Section 17.5] A scree plot shows the percentage of variance explained by each principal component, so it is suited
for III only. (C)
33. [Section 21.51 Use formula (21.8).
2 2
W
10 + 0.4(0.8)2 + 0.2(-0.5)2 + 0.1(1.5)2 = 10.531 (D)
34. [Lesson 13] Interaction is the product of gender and marital status and thus only applies to male single.
g(p) = -0.73 + 0.07- 0.04 = -0.70
p = e-0•70 = 0.496585
For a Poisson distribution with mean 0.496585, po + pi = e—M96585(1 + 0.496585) = 0.910830. The probability of 2 or
more claims is 1 - 0.910830 = 0.0892 . (A)
35. [Section 14.6] All three statements are true. (D)


1 A 11 B 21 E 31 A
2 E 12 C 22 D 32 C
3 D 13 C 23 C 33 D
4 E 14 D 24 A 34 E
5 E 15 B 25 C 35 C
6 A 16 B 26 D
7 B 17 E 27 A
8 A 18 A 28 C
9 E 19 A 29 A
10 B 20 B 30 D
Practice Exam 3
1. [Section 16.11 The fitted values for the four points of test data are 5.5, 7.2, 12.0, and 10.4 respectively. The mean
squared error is
(4— 5.5)2 + (10 — 7.2)2 + (11 — 12.0)2 + (13 — 10.4)2 4.4625 (A)
4
[Section 17.1] The first two statements are true. But principal components are affected by the scale of the variables.
(E)
3. [Section 14.21 The likelihood ratio statistic is 4865.0 —4856.6 = 8.4. AGE GROUP has 4 categories, hence 3 indicator
variables, so there are 3 degrees of freedom. The critical values for chi-square at 3 degrees of freedom are 7.815 at
5% and 9.348 at 2.5%, so (D) is correct.
4. [Section 6.21
I. The more repeated observations, the more variance. 10-fold cross-validation uses 9/10 of the observations
in its test sets and therefore repeats observations more frequently, giving it higher variance than 5-fold cross-
validation. X
5-fold cross-validation requires 5 runs while 10-fold cross-validation requires 10 runs, so the former is more
efficient. s/
For polynomial or linear regression, only one expression needs to be evaluated to calculate the LOOCV
statistic. tt
(E)
5. [Section 7.21 The deviance for a normal model is the residual sum of squares, and the unbiased sample variance is
Total SS/(n — 1). Thus
RSS/(n — p) 44/(20 —4)
Adjusted R2 = 1 Total SS/(n — 1)
—1
88
0.96875 (E)
[Lesson 201 Statement IV is false; the variance of the terms is greater than the variance of the error. Letting 62, be
the variance of the error, the variance of the terms is aL2./(1 —132). (A)

Copyright 021:122 ASM
7. [Section 14.6] The variance of the observation is p3/5 = 57.33/5. So the chi-square residual is
Oi - Ei 68.5 - 57.3
0.0577 (B)
VT V57.33/5
8. [Lesson 4] Residual has 38 - 12 - 3 = 23 degrees of freedom. Mean square is 92.55/3 = 30.85 for AGE and
302.88/23 = 13.169 for residual. The F ratio is 30.85/13.169 = 2.343 , at (3,23) degrees of freedom. (A)
9. [Section 18.21 21 and 30 are closest and are linked first. Single linkage looks at minimum distance, and that means
40 is 10 away from (21,30), and since it is 11 away from 51, it is linked to {21,30}. We get a trailing cluster; 51 is linked
to 121,30,401, 63 is linked to 121,30,40,631, and we get (E).
10. [Section 8.21 I and IJI apply to both principal component regression and to partial least squares. Only II is unique
to principal component regression. (B)
11. [Section 18.21 I and II are true. In statement Ill, r should be 2-1`; there are n 1 fusions, and the two branches of
each fusion may be reversed without affecting the clusters. (B)
12. [Section 19.6 and Lesson 20] The MAE is the mean absolute error. We must compute the 4 residuals in the
validation set.
= 6 - 4.6 = 1.4
98 = 6 - 0.2(6) = 4.8 es = 4 4.8 = -0.8
e9 = 5 5.2 = -0.2
g10 = 6 - 0.2(5) = 5.0 elo = 5 - 5.0 = 0
1.4 + 0.8 + 0.2 + 0

The MAE is 0.6 (C)
4
13. [Section 16.11 We weight the Gini indices of each group 0.6 and 0.4, the proportions of the 100 observations in each
group.
0.5125 (C)
14. [Section 13.21 Use formula (13.1). There are n = 5 observations and k = 2 variables.
1 10.82 0.22 0.82 12 1.22

(i) + ± 1.479 (D)
5 - (2 + 1) k 1.2
15. [Section 16.11 Regression trees are less robust than linear models and predictions are less accurate. (B)
16. [Section 5.31 For the regression of x2 on xi, the total sum of squares (x2 - is the square of the standard deviation
of x2, multiplied by n - 1 = 50 - 1 = 49:
Total SS = 49(2.2042) = 238.0232

The error sum of squares is the square of the standard error of the regression, multiplied by n -2 = 48:
Error SS = 48(1.2842) = 79.1355

Copyright 02022 ASM
PRACTICE EXAM 3, SOLUTIONS TO QUESTIONS /7-22 455
The VIF is
R2 = 1
(2)
79'1355 = 0.667530
238.0232
1
VIF2 = 3.008 (B)
1 — 0.667530
17.
(12,204 — 9,286)/1
[Section 4] F1,8 7- = 2.514
9,286/8
The t statistic is the square root of the F ratio, or 1.5855. At 8 degrees of freedom, this is less than the 95th
percentile of a t distribution (1.8595), so for a 2-sided test, unemployment rate is not included in the model at 10%
significance. (E)
18. [Lesson 1] It sounds like the company is going to use principal components analysis. That is unsupervised learning,
so none of the characteristics apply. (A)
19.
[Section 5.1] The standardized residual r1 has s in the denominator and the studentized residual r has s(i), the
residual standard error with observation 1 removed, in the denominator. So
S
—ri
so.)
8.25
0.895 = (0.823)
s(1)
8.25(0.823) = 7.586 (A)
5(1) =. 0.895
[Section 16.1] As usual, when there are only two classifications, the Gini index is double the proportion of one of
the classes in the region. Since we're only comparing Gini indices, we won't bother doubling.
For the split between 6 and 8, half the Gini index of the first region is (2/3)(1/3) = 2/9. Half the Gird index of the
second region is also (2/3)(1/3) = 2/9. The weighted average is 2/9.
For the split between 8 and 12, we get
For the split between 12 and 16, we get
21. [Lesson 6] k-fold cross-validation has smaller training sets than LOOCV, so it will overestimate the test error rate
more than LOOCV and has higher bias. And the k folds are chosen randomly, so it may not yield the same results
if different folds are created. (E)
22. [Section 17.5] The points are centered at 0, so the variance is the just the sum of squares. The variance of the two
variables is (1/3)(12 + 02 1- (-1)2 ± 0.42 ± 0.42 + (-0.8)2) = 2.96/3. The scores of the three points are
0.836(1) + 0.549(0.4) = 1.0556
0.836(0) + 0.549(0.4) = 0.2196
0.836(-1) + 0.549(-0.8) = —1.2752
The variance of these scores is (1/3)(1.05562+0.21962+ (-1.2752)2) = 2.78865/3. The proportion of variance explained
is 2.78865/2.96 = 0.9421. (D)

Copyright Q2022 ASM
23. [Lesson 19] I is true since the influential points are the ones furthest from the mean time. II is not true; seasonal
patterns may be incorporated with dummy variables for the seasons or trigonometric functions. III is true. (C)
24. [Section 5.21 The diagonal of the hat matrix sums to k 1, so we see that k = 1. Using formula (5.2),
2
r2 _ e3 _
(-2.1)2 = 1.04255
-3- 82(1 - ha,) 14.1(1 -0,7)
0.7
D3 = 1.04255(2(1 ) 1.21631 (A)

_
-
0.7)
25. [Section 11.1] e0.154-0.04 = 1.2092 (C)
26. [Lesson 1] Any of these methods may be used in a classification setting. (D)
27. [Lesson 15] For X = 15, in the training data, 10, 19, and 30 are closest, so 1 is predicted, which is correct.
For X = 25, in the training data, 19, 30, and 39 are closest, so 2 is predicted, which is correct.
For X = 37, in the training data, 30, 39, and 43 are closest, so 2 is predicted, which is correct.
For X = 50, in the training data, 39, 43, and 45 are closest, so 2 is predicted, which is correct. (A)
28. [Section 14.4] We subtract the AIC penalty and add the BIC penalty.
There are 1 + 2 + 1 + 4 = 8 parameters, since the intercept is one parameter and there are k -1 parameters for each
k-way categorical variable. So the penalty function for AIC is 16, and the penalty function for BIC is 8 In 85 = 35.54.
The BIC is 261.53 - 16 + 35.54 = 281.07 (C)
29. [Section 21.3] Repeating gi = (1 - w)yi + wgi_1 four times,

gt = (1 - w)YI + (1 - w)wyt_i + (1 - w)w2y1_2 + (1 - zu)u)3y1_3 + w4§1-4
Plugging in the given numbers,
108.78 = 0.6(104) + 0.24(122) + 0.096(109) + 0.0384(105) + 0.025631
2.604
si = = 101.7188 (A)
0 0256
30. [Section 18.11 Clusters are selected to minimize the average squared distance of points within each cluster, not the
sum of distances, and maximizing the sum of distances between clusters is not equivalent, so I is false. But II and III
are true. (0)
31. [Section 14.1] For a gamma distribution, the variance function is p2, so the denominators of the chi-square sum are
42. The statistic is
(5 _ 3)2 + (7 _13) + (8- 10)2 (11 - 5)2

1.4(32) 1.4(152) 1.4(102) + 1.4(52) +
(13 - 18)2 ± (15 - 10)2 ± (17 - 20)2 + (20 - 18)2 _ 1.836 (A)
1.4(182) 1.4(102) 1.4(202) 1.4(182)
32.
[Section 18.1] The centroid of the first cluster is ((5 + 6 + 7)/3, (15 + 11 + 10)13) = (6, 12). The centroid of the second
cluster is ((5 + 6 + 7)/3, (18 + 14+ 7)/3) = (6,13). Thus observations with second coordinate less than 12.5 go to the
first cluster and those with second coordinate greater than 12.5 go to the second cluster. (5,15) moves to the second
cluster and (7,7) moves to the first cluster. (C)

Copyright 02022 ASM
[Section 11.1] By formulas (11.2) and (11.3), the variance is the derivative of the mean times 475, or 1/62. since
p = —1/0, it follows that 1/02 = p2 (D)
34. [Section 11.1] Since Yi is a Tweedie distribution, Var(Yi) = a E[YdP.
1n7.0711 =lna+pin2
In12.9904 = ln a + p In 3
0.608195 = 0.405465p
p = 1.5
ha = In 7.0711 — 1.51n 2
a = e0.916295 2.5
Var(Y3) 2.5(2.51.5) = 9.8821 (E)
35.
[Section 8.1] We are constrained by jq, +133 25. That means we may not consider (1, $2) = (4,4) nor = 5 nor
S2 = 5. Among the remaining choices, 60.5 is the lowest RSS, with g = 4 and = 3. The answer is 4/3 (C)

Copyright tg)21)22 ASM

1 B 11 A 21 C 31 D
2 C 12 A 22 B 32 E
3 E 13 C 23 E 33 D
4 B 14 A 24 B 34 D
5 E 15 A 25 A 35 C
6 A 16 B 26 B
7 E 17 A 27 E
8 D 18 C 28 E
9 C 19 A 29 C
10 E 20 D 30 E
Practice Exam 4
1. [Lesson 11 Increasing flexibility results in increasing test variance, but the other two statements are true. (B)
2. [Section 14.61 Using formula (14.21), with negative sign since y22 — 922 <0,
d22 = --\/2 (21n — (2 — 2.5)) = —0.3278

—
2.5
(C)
3. [Section 19.31 The sample mean is :17 20.7 and the sample standard deviation (with division by 9) is 2.5408. The
appropriate t coefficient is 41025 for a two-sided interval with 9 degrees of freedom, or 2.2622. The upper bound of
the forecast interval is
1
20.7 + (2.2622)(2.5408)V1 + —
10
= 26.73 (E)
4. [Lesson 9] Based on formula (9.3), the width of a prediction interval is
1 (x• .t)2
2ti_a/2sAl1 + +
When e = 1., the square of this is
(2t1_02s)2 (1 +
When x* =1 + sx, then (e 1)2 = s = ((x, -1)2) /(n — 1), so the squared width of a prediction interval is
(214_02s)2 + +
In our case, n = 18, so
302 = (2t1_42s)2 +

Copyrighi 02022 ASM
and we want to multiply the right side by

1 + 1/18 + 1/17
= 1.05573
1 + 1/18
so the width of a 95% prediction interval for xt = 2 + sx is
V900(1.05573) = 30.82 (B)
5. [Section 14.11 Use formula (14.3). The variance of a Bernoulli distribution is rri(1 -
0.322 0.472 (1 - 0.64)2 (1 - 0.98)2
+ + 3.550 (E)
(0.32)(0.68) + (0.47)(0.53) (0.64)(0.36) (0.98)(0.02)
6. [Section 5.2] Use formula (5.2).

hi;
D217 -
((k + 1)(1 - hii)) = (0.9342) (6(10=103.13)) 0.0217 (A)
7. [Section 16.2] The fitted value for the first 5 observations is the average of those observations, or (8+7+4+6+10)/5 = 7.
Subtracting 7A from 8, the residual entering the next tree is 8 - 0.1(7) = 7.3 (E)
[Section 18.2] The centroids for the four clusters are, in order, (14,10), (19,18), (8,27.5), and (15,24). We calculate the
distances between these four centroids.
(14,10)-(19,18): lig
(14,10)-(8,27.5); -N/34
(14,10)-(15,24): -‘/1
(19,18)-(8,27.5): V21.1 5
(19,18)-(15,24): '1,4
(8,27.5)-(15,24): —Nff6Y5
We could have skipped the fourth calculation, which was not one of the answer choices. The fifth calculation is
the lowest, making (D) the answer.
9. [Section 17.1] The fourth loading is --V1 - 0.682 - 0.652 - 0.322 = -0.1127. The score is
0.68(2) + 0.65(-1) + 0.32(3) - 0.1127(5) = 1.107 (C)
10. [Section 5.3] Use formula (5.4).
s
VVIFi
sh;
sxjAfli f.
0.86 = 8.25
2.70A/571
2
VIFJ
(0.86(2.70)VT4) 2.693 (E)
—
8.25

Copyright Q2022 ASM
11. [Section 18.11 The squared Euclidean distances in the first cluster are
(10,0)—(8,3): 13 (10,0)—(5,6): 61 (8,3)—(5,6): 18
In the second cluster, the points are 26 apart.
The objective function's value is
(2(13 + 61 + 18)) + (2(26)) = 8713 (A)
12. [Section 17.2]

I. I can be deduced./
Bob may have sold a lot of life insurance but very little health and dental.X
It is not clear whether Sue sold a lot of dental insurance or had a high first principal component score because
she sold a lot of life or health insurance.X
(A)
13. [Section 3.3] The t statistic is

2.6 — 2
T= =3
0.2
This is a 4-parameter regression, so there are 10 — 4 = 6 degrees of freedom. 3 is between 2.4469 and 3.1427, the
975thand 99thpercentiles of a t distribution. For this 1-sided test, we accept Ho at 1% but not at 2.5%. ( C)
14.
[Lesson 12] The cumulative odds of a claim of type 1 for a sedan are 0.21/(1 — 0.21) = 0.265823. The cumulative
odds of a claim of type 1 for an SUV are 0.28/(1 — 0.28) = 0388889. The relative cumulative odds of SUV to sedan is
0.388889/0.265823 = 1.46296. This ratio does not vary by type of claim.
The cumulative odds of a claim of type 2 for a sedan are 0.27/(1 — 0.27) = 0369863. Therefore, the cumulative \
odds of a claim of type 2 for an SUV are 1.46296(0.369863) = 0.541096. Let T2 be the cumulative probability of type 2
for an SUV. Then
T2
= 0.541096
1 — T2
0.541096
T2 = = 0.351111
1 + 0.541096
The probability of claim type 1 is 0.28, so the probability of claim type 2 is 0.351111 — 0.28 = 0.071111 . (A)
15.
[Lesson 12] Under the complementary log-log link, g(n) ln(-1n(1 — n)), so n = 1 — exp (—ebo+b'x'). For our
parameters, that is
— exp(_eo.6+0.40.8)) = 1 — e-25093 = 0.9187 (A)
16.
[Section 11.1] The canonical link function is the inverse of b'(0), the mean. The variance is Ob"(0), which here is
0(1;11(0)1.5. So the derivative of b' (0) is 13'(9)1-5. Let f(6) = b' (0) and solve a differential equation forf.
df 1.5
=f
df —
de
Integrating both sides and ignoring constants,

Copyright 02022 ASM
Thus ignoring the multiplicative constant, b'(19) = 1/02, with inverse g(p)= 1/1/1.i . (B)
The reduction is 0.375 - 0.354167 = 0.020833. (A)
18.
[Section 19.2] The sample mean is 7. Use formula (19.3). The denominator of the fraction in that formula is
_ 7)2 + (4 _/) + • • • ± (16 - 7)2 = 130
The numerator is
(-5)(0) + (-3)(2) + (0)(-1) + (2)(-3) + (-1)(1) + (-3)(9) = -40
The lag 2 autocorrelation statistic is -40/130 = -0.307692. (C)
19. [Section 14.5] The minimal model has only an intercept, and for a Poisson model the likelihood is maximized at
the mean of the observations, which is A = 1. The loglikelihood of each observation ni is -A + n in A - In ni!. The
loglikelihood of all 8 observations is
-8(1) + 81n 1 - 41n 1 - 21n 1 - In 2 - ln 24 = -11.8712 (A)
20. [Section 16.1] (18,11) is in the bottom right rectangle. The average of the training values in this region is (27 + 18+
24)/3 = 23 (D)
[Section 18.2] The closeness of observations is determined by how high they fuse vertically, so II is false. I and III
are true. (C)
22. [Lesson 15] The four nearest points are (4,1), (3,2), (5,3), and (4,4). The average of the values of Y at those points is
(27 + 32 + 28 + 39)/4 = 31.5. (B)
23. [Lesson 4] From the F statistic,

(Error SS' - Error 552)12
1.875 =
Error SS2/15
- 7.5
(Error SS1 - Error SS2)
Error SS2
Divide numerator and denominator by TSS.
1.875 = 7.5
(Error SSI/Total SS - Error SS2/Total SS) 7.5( - R2, )
Error SS2/T SS 1 -
0.25 - 0.253q = R - 0.589

1.25R2 - 0 8392 •
= 0.6712 (E)
24. [Lesson 12] First compute g(71), where n is the probability of a claim.
g(n) = -2.05 + 0.4 + 0.32 = -1.33
For this link, n = 1 - exp(-&(P)), so
n = 1 - exp(-e-133) = 0.232393
77 is the probability. The odds are n/(1 n) = 0.232393/(1 - 0.232393) = 0.30275. (B)

Copyright 02022 ASM
25. [Lesson 14] For all of these statistics, the lower the value, the better the model. (A)
26. [Section 18.1]

I. No clustering methods are robust to perturbations in the data.X
The distance of the n observations to each of the K centroids is computed, and each observation is assigned to
the closest centroid./
The K means clustering algorithm only works for squared Euclidean distanceX
(B)
27. [Section 3.3] The regression has n k — 1 = 21 — 1 — 1 = 19 degrees of freedom. The standard error of S'i is
V25.882/19
se(i) = = 0.2127
s, V t
The 97.5th percentile of t at 19 degrees of freedom is 2.0930. The upper bound of the confidence interval is
1.674 + 2.0930(0.2127) = 2.1192. (E)
28. [Section 21.2] Use formulas (21.5) for the coefficients of the linear expression.
bo,is 2g1s — g(125) = 2(72) — 56 = 88

bus = 1 s15)(2) — 1 —0.3(13 (72 —56) = 371
1718 = 88 + 3(371) = 200 (E)
29. [Subsection 13.3.1]

I. The probability of 0 is 0.2 + 0.8e .X
II. This is true, by the double expectation formula; the expected value of a mixture is the weighted average of the
expectations of the components./
This is true, since it is true for the Poisson distribution, and the Poisson component's probabilities are all
multiplied by 0.8. /(C)
30.
[Section 7.2] We have the Total SS: it is the RSS for the model with no predictors, 82.4. However, it suffices to
compute RSS/(n — d — 1) where d is the number of explanatory predictors, since that is the only term that varies in
the formula for adjusted R2.
41.3 37.5 33.8
15 — 2
— 3.177
15 — 3
= 3.125
15 — 4
= 3.073 30.3
10
_ 3 03
—
•
To make sure adjusted R2 is not negative (and therefore 0 predictors is preferred), we'll calculate adjusted R2 for
the 4-predictor model.
3.03
1 = 0.485194
82.4/14
The El predictor model is selected. (E)
31.
[Section 16.2] All three statements are true. III is true since the variance of the average of decorrelated trees is
lower than the variance of the average of correlated trees. (D)

(-----'. [Lesson 151 We calculate the error rate for each X = x and then weight them with the probabilities of X = x. For
A, we choose Y = 1 with error rate 0.55. For B, we choose Y = 2 with error rate 0.6. For C, we choose Y = 2 or Y = 3
with error rate 0.65. For D, we choose Y = 2 with error rate 0.5.
0.4(0.55) + 0.3(0.6) + 0.2(0.65) + 0.1(0.5) = 0.58 (E)
33. [Lesson 20] 91-4-1 = 2— 0.7(4) = —0.8

21T+2 = 2— 0.7(-0.8) = 2.56
9T+3 = 2 — 0.7(2.56) = 0.208
1.8544 (D)
34. [Section 11.1] In p = 0.205 + 0.132 + 0.405 = 0.742

ti = e0.742 = 2.100
The variance is 1.3 times the mean, so the variance is 1.3(2.100) = 2.73 . (D)
35. [Lesson 10] Overfitting a model does not result in biased estimates, but underfitting and severe censoring does
result in biased estimate. (C)
Exam SEM Study Manual


1 A 11 D 21 C 31 A
2 B 12 E 22 A 32 A
3 ID 13 B 23 D 33 C
4 E 14 A 24 C 34 B
5 B 15 C 25 A 35 C
6 E 16 ID 26 A
7 A 17 C 27 E
8 A 18 B 28 B
9 A 19 E 29 A
10 D 20 E 30 B
Practice Exam 5
1. [Lesson 1] Flexibility is the ability to fit the observations closely.

Increasing the number of predictors increases flexibility, but increasing the tuning parameter of a lasso forces
the model to have fewer predictors and decreases flexibility Increasing the number of bootstrap samples does not
increase flexibility. (A)
2. [Section 12.2] We'll have to calculate the linear expression for both categories 2 and 3 so that we can get the
probability of category 1, and then use the relative odds to get the probability that we want for category 3.
ln —Th2 = -1.5 + 1.2 + 0.5 = 0.2

hi
T
ri3
In — = -1.1 + 0.8 + 0.9 = 0.6
ill
1
hi
T = = 0.247309
1 + Ca2 ± e0.6
713 = 0.247309e" = 0.450627 (B)
3. [Section 19.3] We need the standard deviation of the differences of the series, which are 3, 5, 2, -1,4, 3,1, 5,3.
\IV-)2(c1— 02
8
=1.9221
The approximate width of a 95% prediction interval is 4(1.9221)V5 = 13.3167 where V3- is the square root of the
forecast period. (17)
4.
[Section 12.3] We will calculate the cumulative odds of 6 or less, and then the probability of 6 or less. The probability
of 7 or more is the complement of that probability.
The systematic component for 6 is 1.6 - 0.01(150) = 0.1. The odds are AI = 1.105171. The probability is
1.105171/2.105171 = 0.524979. Thus the probability of 7 or more is 1 - 0.524979 = 0.475021. (E)
J
Exam SEM Study Manual
n. [Lesson 21 The model expresses In Yi as a linear function of Xi. Use the usual formulas for bo and b1 in terms of Xi
and ln Yi.
lnYi nfan Y
b1=
— rig2
13.697 — (13.5)(9.681)/10
0.46665
19.57— 13.52/10
9.681 13.5
1)0 = ln Y — b1X=
10 0.46665( ) = 0.33812
10
The prediction of Y when X = 2 is exp(0.33812 + 0.46665(2)) = 3.566 (B)
6. [Lesson 181
I. Both methods force every observation into a cluster.X
II. K-means clustering looks at within-cluster similarities while hierarchical clustering looks at between-cluster
dissimilarity./
III. K-means clustering requires an initial assignment of clusters while hierarchical clustering does not./
(E)
7.
[Section 15.31 Let x, y, and z be the values of Y at the three points (5,9), (6,15), and (8,12). The point (4,12) is-%5 1
away from (5,9), /71 away from (6,15), and 4 away from (8,12); the average of the two closest points is (x + y)/2. The
point (5,12) is 3 away from (5,9),1,/ 71 away from (6,15), and 3 away from (8,15); the average of the two closest points
is (x z)/2. And the closest points to (7,12) are (6,15) and (8,12), so Y = (y + z)/2 at that point. Solving for x, y, and
z:
x+y
= 91
2
102 + y
= 91
2
= 80
x+z
=98
2
102 + z
=98
2
z = 94
y+z —
87 (A)
2
8. [Section 14.41 There are an intercept, 2 continuous variables, 1 variable for sex, 3 for department. There are
(2— 1)(4 — 1) = 3 interaction variables. That is a total of 10 parameters.
From the max-scaled R2 statistic and the loglikelihood of the minimal model, with 1(b) being the loglikelihood
of the model,
R2 = 0.361774 (1 — exp(-182/45)2) = 0.361663

0.361663 = 1 fexp(-182/45) \ 2
exp(/(b)/45) )
exp(-182/45)
exp (I(b/45)) = 0.021928
V1 — 0.361663

Copyright 02022 ASM
1(b) = -171.9
The AIC is 2(171.9 + 10) = 363.8 (A)
9.
[Section 17.31 We use the first column of the loading matrix, the loadings of the first variable on the three principal
components.
1.220(0.732) + 0.002(0.307) - 1.279(0.609) = 0.1147 (A)
10.
[Section 16.1] The cross-entropy at the node is -0.65 ln 0.65-0.351n 0.35 = 0.64745. After the split, the cross-entropy
is
-0.75(0.8 In 0.8 + 0.21n 0.2) - 0.25(0.81n 0.8 + 0.2 ln 0.2) = 0.50040

The decrease in cross-entropy is 0.1470. (D)
11.
[Lesson 121 With the probit link, g(n) = 43-1(71), so n =4:13(xT/3). With our parameters, that is
(I) (1 - 0.1(4)) = 4:1)(0.6) = 0.7257 (D)
12. [Section 19.61 MAPE is mean absolute percentage error. The residuals are -1, -3 -42.
1
MAPE =
100 (
4 8
+
12
+
14
+
2) =
-
22
18.79 (E)
13. [Lesson 20] Use formula (20.5).

3
Var(9T+3) = s2 ,E ,3(t-1) 100(1 + 0.62 ÷ 0.e) 148.96 (B)

i=i
14. [Section 14.31 Use formula (14.9). First, 3.3255/2 = 1.66275.

1.66275 =
(1n(1 -0.16) + 1n(1 -0.24) + ln(1 - x) + In 0.56 + In 0.68) = 1.41427 - ln(1 - x)
-1n(1 - x) = 0.24848
x = 1 - e-024 848 = 0.22 (A)
15. [Section 18.1] The centroids are

A:
(3 +6 +9 +10 5+ 2 +2 + 7)
B: (4,4.5)
4 4
=(7,4)
C: (7,3.5)
The closest points are (* indicates moved)
A: (10,7), (6,6)*, (8,4)*
B: (3,5)*, (2,6), (2,8), (4,3)
C: (6,2)*, (9,2)*, (8,1)*, (5,1), (9,3)
(C)

[Section 17.11 Let 0 be the loading we are solving for. The other loading is -\/i p2. We are given
4.3077 = 4c + 3J1 -
Let's solve for 0.
(4.3077 - 40)2 = 9(1 - 02)

18.5563 - 34.46160 + 1602 .= 9 - 902
2502 - 34.46160 + 9.5563 = 0
4.) = 0.9938, 0.3846 (D)
17. [Section 16.21 I and III are true, but not 11. As B grows larger, the test error settles down and becomes flat. (C)
18. [Subsection 13.3.31 By equations (13.8) and (13.9),
E[y] = 0.2
Var(N) = 0.2 + 0.22(0.6) = 0.224
Dividing the mean into the variance, we get an overdispersion factor of 0.224/0.2 = 1.12 (B)
19.
[Section 7.2] The answer does not depend on which formula you use for Cp, since they're equivalent. Using the
formula in Regression Modeling with Actuarial and Financial Applications:
adding 1 to p increases Cp by 2, so Error SS,/s2 must decrease by 2, or the RSS must decrease by 2s2. s 2 is based on
using all predictors. Since s 2 = RSS/(n k - 1), we have
s2 = 252/(100 - 15 - 1) = 3
So the RSS must decrease by 6. The highest possible value for x is 468. (E)
20. [Lesson 61 Use formula (6.5).

2 2 2 2
) (1 - 0.8738)\2
0.7737
\ I -1.8664 1 0.9216 0.5549 -0.3838
PRESS =
k1 - 0.8831) k1 - 0.3534) k1 - 0.5246) ( 1 - 0.3670
65.9115 (E)
21. [Section 8.1]

1. This statement is correct. The lasso selects the posterior mode. t/
2. Ridge regression shrinks coefficients; it does not set them equal to 0. X
3. This statement is correct. Bias increases and variance decreases as the tuning parameter is increased.
(C)

22.
[Section 14.6] For a Poisson, the deviance residual is sign(yk - 9k)12(yk in(Yk jIk) (Yk 9k))• We can ignore
the 2 if we wish and use the textbook's incorrect formula since we just want to determine the maximum, but we
won't. However, we will ignore the sign and the square root, since we just want the maximum absolute value. The
calculations work out to:
4. 2(1 ln(1 /0.7) - (1 -0.7)) = 0.1133

2(21n(2/1.6) - (2- 1.6)) = 0.0926
4. 2(3 ln(3 /2.5) - (3 - 2.5)) = 0.0939
4 = 2(31n(3/3.5) -(3 -3.5)) = 0.0751
4. 2 (21n(2/2.5) - (2- 2.5)) = 0.1074
k = 1 has the highest deviance residual in absolute value. (A)
23. [Section 5.2] Use formula (5.1):

h -
1 (x- ?)2
n E7=1.(xi -
The sum of squares in the second denominator of that formula is the sample variance multiplied by n - 1
1 (92.03 - 85.19)2
h35,35 0.01777 (D)
89 9.022(88)
24.
[Section 5.31 We need R). the coefficient of determination of regressing xi on x2. That is the square of the
correlation coefficient of x1 and x2. You can use your calculator to calculate the correlation, but we'll work it out. We
don't need any of the information provided for y, but we need Z(1:11 - FC1)(x12 - R2).
E x11x12 = (3)(24) + (5)(29) + (8)(34) + (9)(38) + (10)(50) = 1331

E(xii xi)(xi2 -7(2) = xiixi2 - ni1R2 = 1331 - 5(7)(35) = 106
1062
R2 0.843037
(34)(392)
1
VIF1 =
= 6.3709 (C)
1 - 0.843037
25.
[Section 3.11 The error sum of squares is 302. The samplevariance is the total sum of squares divided by n -1, so
the total sum of squares is 32.8(17) = 557.6.
302
R2 - 1 0.4584 (A)
557.6
26. [Section 21.51 Use formula (21.10).
220 -
1 -0.22-0.08-0.35- 0.15 = 7(
.2
w = 0.2(220) = 44 (A)

Copyright Q2022 ASM
[Section 3.41 We will calculate the t ratio of b2 and then use equation (3.14). The standard error of b2 is 8.0882 1=
3.9706. The t ratio is 12.456/3.9706 = 3.1371. The partial correlation coefficient is
3.1371
0.5315 (E)
V3.13712 + 28 - (2 + 1)
28. [Section 3.3] The standard error of bi, based on the second diagonal entry of the (X1X)-1 matrix, is 11.155 10 =
1.1155. The t ratio is
4.986 - 3
= 1.7804
1.1155
There are 15 - 3 = 12 degrees of freedom. 1.7804 is between 1.7823, the 95th percentile of t12, and 1.3562, the 90th
percentile of t12, making (B) the correct answer.
29. [Section 16.2] I is true, although the n components may (and usually include) duplicates. II and III are not true.
(A)
30. [Section 7.2] Using the definition of BIC in James et al (you'll come to the same conclusion even if you use the usual
definition of BIC), BIC = (RSS + (In n)d82), and we can ignore 1/(n 82). So we want
cd + d(ln 92)(25) = cd+1 + (d + 1)(1n 92)(25)
This implies
cd - cd-Fi = (1n92)(25) = 113.0 (B)
-0.72917
32. [Section 5.3] The VIF is calculated by regressing xi on x2 and evaluating R2. For a two-variable model, R2 is the
square of the correlation coefficient of the dependent and independent variables, so let's calculate the correlation
coefficient squared. Since there are 5 observations, the sum in the third bullet is divided by 5 - 1 = 4.
2 (0.029492/4)2 - 0.26820
P = (0.0096625)(0.0209767)
That is R2 of the regression of xi on x2. The VIF is VIF = 1/(1 - 0.26820) = 1.3665 (A)
33. [Section 18.2] The distance between 121,301 and 140,511 is probably greater than the distance between {40,51} and
{63), but just to make sure, we'll calculate it:
0.25(19 + 30 + 10 + 21) = 20
whereas the distance from {40,51) and {63} is 0.5(23 + 12) = 17.5. As long as x 63 + 17.5 = 80.5, it will be linked to
{631 at the next iteration. (C)
34. [Section 21.3] First we generate 1/420 and b1,20•

boo = 0.4(199) + 0.6(190 + 2) = 194.8
b1,20 = 0.3(194.8 - 190) + 0.7(2) = 2.84
The forecasted value is 194.8 + 2.84 = 197.64 . (B)

Copyright 02022 ASM
470 PRACTICE EXAM 5, SOLUTION TO QUESTION 35
35. [Section 18.21 Statement I is true since the distance between single points is the same regardless of linkage. IT is 1
false; the number of clusters goes down by 1 at each iteration of the algorithm. III is true. (C)


1 A 11 E 21 D 31 B
2 B 12 A 22 A 32 B
3 C 13 B 23 C 33 B
4 A 14 D 24 A 34 E
5 E 15 C 25 A 35 E
6 C 16 E 26 D
7 B 17 B 27 B
8 E 18 C 28 C
9 E 19 C 29 C
10 E 20 A 30 D
Practice Exam 6
1. [Lesson 31 The variance of b1 is

2 4,296
s = = 477.333
s2 477.333
s2 =
= 3.9778
bl 20
E(Xi - F02
With 9 degrees of freedom, the 2-sided 95% confidence interval has 5% of the t distribution in the tails, so the
coefficient is 2.262. The width of the confidence interval is 2(2.262)V3. 9778 = 9.023 .(A)
2. [Lesson 12] Calculate the relative odds of the 4 colors to white for x = 35.
black
Eu3-0.009(35) =1.0151
nwhite
Trgray
e -1.05+0.01(35) =0.4966
white
nred
=„1.50-0.10(35) =0.1353
nwhite
nblue
= eos5+omo505) =1.2523
nwhite
1
The probability of a white car is 1 + 1.0151 + 0.4966 + 0.1353 + 1.2523
0.2564
3.
[Lesson 15] The Bayes decision rule selects 0 when X <0.5 and 1 when X > 0.5. The Bayes error rate is then X for
X <0.5 and 1 - X when X > 0.5. The density function of X is f (x) = 1,0 x 1. The Bayes error rate is
0.5 1
fo X dx +
0.5
(1 - x)dx = 0.125 + 0.125 = 0.25 (C)
4. [Section 16.1] Use formula (16.5). We have n = 240,171 = 5.

451n S-+ 101n +321n0.8 +81n0.2 + 101n0.25 +301n0.75
-2
+51n0.1 +451n0.9 +3h4+521ng -0.8211 (A)
240-5

Copyright 02022 ASM
5. [Section 6.2] Use formula (6.5):
PRESS =
ei \z
2 2 2 2 2
-1.3009
( 0.0923
1 0.9145
1 0.8649
1 ( -0.0974 1 ( -0.4734 12
(1 - 0.4514) - 0.5626)
±
- 0.3111) ± 1 - 0.5584) ± ‘1 - 0.3732) ± 1 - 0.7433)
14.69 (E)
6. (Section 12.1) Since a logistic model is a logged odds model, the odds of a strike when xi = 7 is e(73)PI times the
odds of a strike when x1 = 3. The odds of a strike when xi = 3 is 0.1/(1 - 0.1) = 1/9. So the odds of a strike when
xi = 7 is e04 = 0.1658 and the probability of a strike is 0.1658/1.1658 = 0.1422. (C)
7.
[Lesson 9] Use formula (9.3). The predicted value of y is 2.1+1.2(32) = 40.5. Based on the RSS, s2 = 100/(12-2) = 10.
Based on se(b),
1.425 -
The t coefficient for a 2-sided 95% interval with 10 degrees of freedom is 2.2281. The lower bound of a 95% prediction
interval is
0
40.5 - 2.2281V10 + 12
+ 1.4252(32 - 25)2 = 17.0960 (B)
8. [Section 11.1] f(y) = exp (y In p ln(y + 1) + 21n(1 p))

Set 0 = in p so that the property of the linear exponential family, that y appears alone in the interaction between y
and 0, is satisfied. Then
b(0) = -2 In(1 - p) = In(1 e ° )2 (E)
9.
[Section 17.1] I is not true. II is true, since some constraint is needed when maximizing the variance and solving for
the loadings. III is true since the principal components are linear combinations of the variables and all the variables
are assumed to be centered at 0. (E)
10. [Section 41 There are n (k +1) = 11 -3 = 8 degrees of freedom in the unconstrained model, Model II. The F ratio
is
(12,204- 9,286)/1
F1,8= = 2.5139
9,286/8
The t statistic is .1,39 = 1.5855. The critical value for t at 8 degrees of freedom is 1.8595 at 10% significance for a
2-sided test, so Model II is not accepted at 10% significance. (E)
11.
[Section 16.2] I is true, since it's difficult to interpret an average of B trees. II is true. III is false; in bagging, trees
are not pruned. (E)

[Lesson 15] At (10,13), the closest two points are (10,13) and (12,11), with average Y of (2 + 5)/2 = 3.5.
At (12,11), the closest two points are (12,11) and (10,13), with average Y = 3.5.
At (12,16), the closest two points are (12,16) and (10,13), with average Y of (2 + 10)/2 = 6.
At (15,13), the closest two points are (15,13) and (12,11), with average Y of (6 + 5)/2 = 5.5.
The MSE is
(3.5 - 2)2 + (3.5 - 5)2 + (6 - 10)2 + (5.5 - 6)2 (A)
4
13. [Section 18.2] I and II are true. With centroid linkage, dissimilarity between clusters may decrease at an iteration.
(B)
14. [Section 21.3] 11 and III are for stochastic seasonal effects; J and IV are for fixed effects. (D)
15. [Section 18.2] The distance between 160,221 and 175,411 is V152 192 = -\/ 5 for any linkage. The distance from
{(40,30), (40,40)1 and {60,221 is distance to the closest point (40,30) or V202 + 82 = VTle.4 for single linkage, distance to
the furthest point (40,40) or 182 = -V:1 for complete linkage, average distance or 0.5(V + iifg) = 24.22395,
while VW = 24.20744, for average linkage, and distance to the midpoint (40,35) or 11202 + 132 = for centroid
linkage. We see that complete linkage will prefer to fuse (60,22) and (75,41), and average linkage also prefers that by
a narrow margin. Single and centroid linkages will fuse (60,22) with {(40,30), (40,40)1. (C)
16. [Section 16.2] The reduction in RSS for the variables is:
X1: (9,865 - 9,075) + (7,411 - 7,026) = 1,175
X2: (9,075 - 8,302) + (7,026 - 6,798) = 1,001
X3: (8,302 - 7,845) + (7,845 - 7,411) + (6,798 - 6,502) + (6,502 - 6,398) = 1,291
The highest reduction in RSS is from X3, followed by X1, followed by X2. (E)
l
f
J.i. [Section 5.2] The standardized 10th residual is 6.8/(8111 - 0.2), and the square of this is 0.903125. Cook's distance is
0.2
0.0376 = 0.9031251'
1
(k + 1)(1 - 0.2) )
0.1665 -
k+1
k=11 (B)
18. [Section 5.1] Standardized residuals are ti/(sAii ii). First we compute Pi. Since Sr = Hy, we have ei'Sr-y=
(1 - H)y. The residuals are
0.65 -0.45 0.05 2.85
( 0.15 ) 14
-0.45 0.35 0.15 -0.05 12 _ -1.05
-0.05 0.15 0.35 -0.45 8 2.55
-0.15 -0.05 0.45 0.65 3 -4.35
We divide the first residual by s-V1 - 0.35.

2.85
0.8493 (C)
4.1623VOTT5"
19. [Lesson 2] Use equation (2.3).
=
sY
SX

Copyright 02022 ASM
sx = /105.6
87.36 = 0.78
(1./sY
87.361/1757
sy = 1150.93
0.78
= 1,324,646
Total SS = q(n —1) = 11,921,818 (C)
20. [Section 19.2] The mean is 0.6. Then

5
2
Dyt
t=1
9)2 = 0.42 ± 1.42 ^I_
Ok 0.6) = 3.2
—9)(yi — = (0.4)(-0.6)+ (-0.6)(1.4) + (1.4)(-0.6) + (-0.6)(-0.6) = —1.56
—1.56
r1 = —0.4875 (A)
3.2
21. [Subsection 13.3.1] The overdispersion is the ratio of variance to mean. The weight on the Poisson distribution is
0.7, so the mean is (0.7)(0.8) = 0.56. If you memorized the formula for variance you can use it. Otherwise, calculate
the second moment (N is the response):
EIN2] = 0.7(0.8 + 0.82) = 1.008
and the variance is 1.008 — 0.562 = 0.6944. The overdispersion is 0.6944/0.56 = 1.24 .(D)
22. [Lesson 11]
g(p) = 500 + 15(30) + 200 + 600 = 1750

p = 17501" = 4012.136
The standard deviation is 4012.136/ =
1794.28 . (A)
23. [Lesson 4] TSS is 8,500. The F ratio is
(Total SS Error SS)/k

22.860 —
(8,500 — 1,825)/k
1,825/(29 — k)
29 — k (6,675 _
129 — k1
k 1,825) 3'6571 )
22.860k = 106.0685 — 3.6575k
k 106.0685 = [7.1
12-1 (C)
26.5175

Copyright 02022 ASM
n. [Subsection 13.3.2] Overdispersion is variance divided by mean. We'll use equations (13.6) and (13.7) to compute
mean and variance. The quotient of those two equations, variance divided by mean, is:
1.2 = 1 + (1 - k)p
and y = 0.8. So
1.2 = 1 + 0.8(1 - k)
0.2
0.8
k = 0.75
Now, k = (1 - T)/ (1 - h(0)) where TE is the probability of 0, and h(0) = e-0.8. so

1- it
U. =
1 -
ti = 1 - (0.75)(1 - C°.8) = 0.5870 (A)
25. [Section 14.5] First we have to back out the loglikelihood of the model. Using equation (14.15),
R2 = (max-scaled R2)(1 - exp(/0/n)2) = 0.80(1 - (e-m588)2) = 0.480419

Then using equation (14.14),
1 R2 exp(k/n) )2
kexp(/(b)/n))
e-0.4588
- 0.480419 =
er(b)/loo
0.632042
eI(b)/100= - 0.876837
0.720820
/(b) = 1001n 0.876837 = -13.1434
The AIC penalty is 2 times the number of parameters. The AIC is
AIC = -21 + 2p = -2(-13.1434) + 2(3) = 32.29 (A)
26. [Section 8.1] All three statements are true. In statement III, notice that increasing the budget parameter means
decreasing the tuning parameter, bringing them closer to unadjusted regression. (D)
27. [Section 19.3] II is true, but the other two statements are not. Consecutive terms are correlated in both types of
time series. Differences of a linear trend in time have mean pi and differences of a random walk have a mean equal
to the mean of the white noise series underlying the random walk. (B)
28. [Section 17.5] The sum of squared scores is 0.00162 + 0.10742 + (-0.3432)2 + 0.23422 = 0.1842. That number divided
by 4 is the variance of the second principal component. The total variance of the three variables is the sum of squares
of all coefficients divided by 4, and the sum of squares is 12 +02 + (-0.5)2 ± (-0.5)2 + 0.42 ± 0.42 -I- • • • -I- 0.72 = 3. The
proportion of variance explained is 0.1842/3 = 0.0614 . (C)

Copyright 02022 ASM
29. [Section 3.4] The partial correlation coefficient is the correlation between the residuals of the two regressions. Since
Ej = 0 for any regression, we only need to stun up squares and products. For the sum of squares, note that the
residual standard deviation is the square root of the sum of squares divided by the number of degrees of freedom,
4, so we multiply each one by 2 to get the square root of the sum of squares.
Ej< = 15.767
15.767
Partial correlation — 0.377 (C)
4(3.102)(3.371)
30. [Lessons 2 and 201 We use the usual simple linear regression formulas with yi as the response and yt_i as the
explanatory variable. Since the letter y is in use for the time series, we'll use z for the response. The regression is
based on 9 observations; n = 9.
E xi = 149 — 11 =138
E 4 := 2287 — 112 = 2166
zi = 149 — 20 = 129
Exfzi = 1994
1994 — (138)(129)/9
b1 = 0.32 (D)
2166 — 1382/9
31. [Section 5.3] For the regression, k = 3 and n = 32. The error sum of squares has 32 — 3 — 1 = 28 degrees of freedom.
Then
Regression SS/3 R2/3

2.532 = F4,28 =
Error 55/28 (1 — R2)/28
R2 3
=
(2.532) = 0.271286
1 — R2 28
7 0.271286
R- — = 0.213395
1.271286
The VIF is 1/(1 — 0.213395) = 1.2713 (B)
32. [Section 18.1] The clustering algorithm ends up with two candidates for the minimum: 12,4,7,111 for the first
cluster and (2,4,7,11,161 for the first cluster. The former has centroids at 6 and 25.75, so 16 stays in the second cluster,
whereas the latter has centroids at 8 and 29, so 16 stays in the first cluster.
It's probably easier to use formula (18.4) to calculate the objective function. In fact, you can use your statistical
calculator to calculate the required sums of squares, using the variance. For the split into 4 points and 4 points, the
objective function is
2((2 — 6)2 + (4 — 6)2 ± (7 — 6)2 + (11 — 6)2 + (16— 25.75)2 + (22— 25.75)2 + (29 — 25.75)2 + (36 — 25.75)2) = 541.5
For the split into 5 points and 3 points, the objective function is
2 ((2 — 8)2 + • • • + (16 — 8)2 + (22 — 29)2 + (29 — 29)2 + (36 — 29)2) = 448 (B)

Copyright 02022 ASM
n . [Section 16.11 You can try all 5 splits, but the splits between 2 and 3 and between 3 and 4 are the ones most likely
to minimize RSS. You can use your calculator to calculate the RSS, which is a multiple of the variance.
For a split between 2 and 3, the RSS of 1 and 2, for which 9 11, is (10— 11)2 (12— 11)2 = 2, and the RSS of
13,4,5,6), for which 9 = (17 + 21 + 22 +24)/4 = 21, is (17-21)2 + (22 — 21)2 + (24 —21)2 = 26. The sum of the RSSs is 28.
For a split between 3 and 4, the RSS of 11,2,3), for which9= 13, is 26, and the RSS of {4,5,6}, for which 9 = 221,
is 4Z'for a total RSS of 30Z'Thus the split between 2 and 3 yields a lower RSS. (B)33
2 2
34. [Section 6.21

1
0.77 — 10 1 + ( 10.78 — 9 1 + 124.13 — 15)2+/44.16_30\2± (54.17 — 7012 = 482.36 (E)
5 k1 — 0.5403) kl. — 0.3289) kl 0.2036)
-
k1 — 0.3512) k1 — 0.5760)
35.
[Lesson 1] Parametric approaches are less flexible than non-parametric approaches. But they require fewer
observations since they only estimate a small number of parameters, and they are easier to interpret since they relate
the response to the predictors in a simple way. (E)

Appendix B. Cross Reference Tables
Table B.1: Lessons corresponding to practice exam questions
Q Practice Exams
# 1 2 3 4 5 6
1 1 1 16 1 1 3
2 12 3 17 14 12 12
3 11 19 14 19 19 15
4 11 2 6 9 12 16
5 14 1 7 14 2 6
6 17 12 20 5 18 12
7 1 21 14 16 15 9
8 14 5 4 18 14 11
9 18 18 18 17 17 17
10 5 3 8 5 16 4
11 7 12 18 18 12 16
12 16 16 20 17 19 15
13 16 16 16 3 20 18
14 7 7 13 12 14 21
15 15 4 16 12 18 18
16 18 20 5 11 17 16
17 17 12 4 16 16 5
18 19 18 1 19 13 5
19 4 16 5 14 7 2
20 20 20 16 16 6 19
21 18 13 6 18 8 13
22 16 18 17 15 14 11
23 19 4 19 4 5 4
24 13 7 5 12 5 13
25 14 18 11 14 3 14
26 13 17 1 18 21 8
27 8 15 15 3 3 19
28 2 14 14 21 3 17
29 8 14 21 13 16 3
30 5 5 18 7 7 20
31 21 5 14 16 8 5
32 16 17 18 15 5 18
33 18 21 11 20 18 16
34 3 13 11 11 21 6
35 10 14 8 10 18 1

480 APPENDIX B. CROSS REFERENCE TABLES
Table B.2: Exercises taken from SOA sample questions
Question Exercise Page Question Exercise Page Question Exercise Page

Number Number Number Number Number Number Number Number Number
1 18.10 334 21 19.14 352 41 16.12 287
2 18.17 335 22 20.6 363 42 12.1 195
3 19.10 351 23 2.6 18 43 18.4 332
4 19.13 352 24 4.28 67 44 4.13 62
5 17.1 309 25 16.11 287 45 11.18 176
6 17.13 315 26 16.23 293 46 21.9 372
7 12.3 196 27 3.34 48 47 2.4 17
8 8.26 141 28 14.1 239 48 16.6 284
9 16.17 288 29 16.1 281 49 9.8 151
10 16.19 289 30 17.17 317 50 16.18 289
11 3.1 38 31 19.11 351 51 16.3 282
12 16.24 293 32 18.2 332 52 14.34 247
13 9.5 150 33 16.2 281 53 2.5 17
14 12.2 196 34 18.18 336 54 7.2 109
15 18.9 334 35 17.18 318 55 19.25 354
16 18.19 336 36 18.21 336 56 9.6 151
17 2.3 17 37 17.11 314 57 16.7 285
18 3.7 39 38 19.12 351
19 14.16 243 39 16.20 290
20 12.4 196 40 18.20 336
When you study the modules for Exam PA, they will reference the textbooks on Exam SRM. You may have to
refer to the textbooks. However, if you would rather refer to this manual, you can use the following cross-reference
lists to find the corresponding reading in this manual. This map is rough, since in many cases material is organized
differently.
Some material is not covered in this manual. This material may be
• Introductory material that doesn't say much
• Examples of topics covered in other sections provided by the textbooks
• Obscure topics I don't expect to be on the exam
• The R labs in An Introduction to Statistical Learning

An Introduction to Statistical Learning 4.1-4.3, which is included in the tables here, is not on the syllabus for
Exam SRM but is on the syllabus for Exam PA. It covers binary response GLMs, which are also covered by Regression
Modeling with Actuarial and Financial Applications.

APPENDIX B. CROSS REFERENCE TABLES 481
Table B.3: Correspondence between Regression Modeling and this manual
Chapter in Lesson in
RM this manual
2.1-2.2 2.1
2.3 3.1-3.2
2.4-2.5.2 3.3
2.5.3 9
2.6 5.2
2.7-2.8 not covered
3.1-3.2 2.2
3.3 3.1-3.2
3.4.1-3.4.2 3.1-3.3
3.4.3-3.4.4 3.4
3.5 2.3
5.1 not covered
5.2 7.1
5.3-5.4 5.1-5.2
5.5 5.3
5.6 6
5.7 2.3
6 10
7.1-7.2 19.1
7.3-7.5 19.3-19.5
7.6 19.6
8.1 19.2
8.2-8.4 20
9.1 21.1
9.2 21.2
9.3 21.3
9.4 21.4
9.5 21.5
11.1-11.2 12.1
11.3 14.2,14.5
11.4 not covered
11.5 12.2
11.6 12.3
12.1-12.2 13.1
12.3 13.2
12.4 13.3
13.1-13.2 11.1-11.2
13.3-13.4 11.3-11.4,14.1,14.3
13.5 14.6
13.6 11.1

482 APPENDIX B. CROSS REFERENCE TABLES
Table B.4: Correspondence between An Introduction to Statistical Learning and this manual
Chapter in Lesson in
ISL this manual
2.1-2.2.2 1.1
2.2.3 15.1-15.2
3.1 2.1,3.1-3.3
3.2,3.3.1 2.2,4,3.1,7.1
3.3.2 2.3
3.3.3 2.3,5
3.4 not covered
3.5 15.3
4.1-4.3 12.1
5.1 6
6.1.1-6.1.2 7.1
6.1.3 7.2
6.2.1 8.1.1
6.2.2 8.1.2
6.2.3 8.1
6.3.1 8.2.1
6.3.2 8.2.2
6.4 8.3
8.1 16.1
8.2 16.2
10.1 not covered
10.2 17
10.3 18

Index
added variable plots, 36 cross-sectional data, 343

adjacent points, 4 cross-validation, 96, 274, 306
adjusted R2, 106, 108 k-fold, 97
AIC, 105, 108, 235, 238 cumulative logit model, 193
Akaike Information Criterion, see AIC curse of dimensionality, 131
ANOVA, 33, 57
Anscombe residuals, 237 data snooping, 104
AR(p), 359 decision frees, 273
ARCH model, 370 dendrogram, 325, 327
autocorrelation, 344 design matrix, 14
autoregressive seasonal model, 368 deviance, 233
deviance residual, 237
backward stepwise selection, 104 Dickey-Fuller test, 369
backwards stepwise selection, 108 dimension reduction, 129
bagging, 277 drift, 345
base level, 15 dummy variable, 15
basic linear regression, 11
Bayes classifier, 257 endogenous variable, 158
Bayes decision boundary, 257 Error SS, 31
Bayesian Information Criterion, see BIC error sum of squares, 31
benchmarking, 157 exogenous variable, 158
best subset selection, 103 exponential smoothing, 367
bias-variance tradeoff, 2
F statistic, 57
summary, 280
BIC, 105, 108, 235, 238 fences (box plot), 4
Fixed effects seasonal model, 368
binary variable, 15
biplot, 301 forward stepwise selection, 103, 108
boosting, 278 GARCH model, 370
bootstrap, 277
generalized linear model, 161
box plot, 4
generalized logit, 191
potential outliers, 4 Gini index, 275
Box-Cox, 15
greedy algorithm, 103, 274
budget (ridge regression, lasso), 125, 127
hat matrix, 75, 81
canonical link function, 163
heterogeniety models, 221
centroid, 324 heteroscedastic, 77
cluster analysis, 323 hierarchical clustering, 325
coefficient of determination, 33 hold-out set, 95
collinear variables, 79 Holt-Winter, 369
complementary log-log, 187 homoscedasticity, 11, 77
confidence interval
hurdle models, 221
ilnear model, 149
control charts, 346 independence of irrelevant alternatives, 193
Cook's distance, 79, 81 indicator variable, 15
cost complexity pruning, 274 influential points, 77,78
cross validation
inversions (cluster analysis), 326
fl Leave One Out, 96
cross-entropy, 275 k-fold cross-validation, see cross-validation, k-fold

Copyright Q2022 ASM
484 INDEX
K-means clustering, 323 PRESS, 96

K-nearest neighbors, 257-259 principal components analysis, 130, 299
KNN classifier, 258 loading, 130
KNN regression, 259 scores, 130
principal components regression, 130
el norm, 127 probit, 187
t2 norm, 125 proportion of variance explained, 306
lasso, 127, 274 proportional odds model, 193
latent models, 222 pseudo-R2, 236
Leave One Out Cross Validation, see LOOCV PVE, 306
leaves (decision trees), 273
leverage, 75, 78,81 Lig plot, 4
likelihood ratio test, 232
linear exponential family, 161 R2, 33
linear probability model, 161 R charts, 346
link function, 161 random forests, 278
linkage (cluster analysis), 326 random walk, 345
loadings, 300 reference category, 191
logistic regression, 187 reference level, 15
logit, 187 regime change, 343
longitudinal data, 343 regression
LOOCV, 96 coefficient of determination, 33
F statistic, 57
Mallow's C1„ 105, 108 t statistic, 34
max-scaled R2, 236
Regression SS, 31
mean function, 163
regression sum of squares, 31
mean squared error, 2 relative odds, 191
regression, 32 resampling methods, 95
moving average smoothing, 367 residual, 14
MSE, 2
residual standard deviation, 32
regression, 32 residual standard error, 32
rnuItinornial logit, 193 residual sum of squares, 31
multiple correlation coefficient, 34
ridge regression, 125
RSE, 32
nominal variable, 3
RSS, 31
normal probability plot, 5, 77
odds ratio, 188, 192 SAR(P), 368
offset (GLM), 219 scaled deviance, 232
ordinal variable, 3 scatter plot, 3
orthogonal variables, 80 scores, 300
out-of-bag, 278 scores (maximum likelihood), 165
out-of-sample validation, 95 scores (principal components analysis), 130
outliers, 77, 81 seasonal adjustment, 343
overdispersion, 167, 219 seasonal base, 368
serial correlation, 344
parsimony, 158 simple linear regression, 11
partial correlation coefficient, 36 SSE, 31
partial least squares, 130 SSPE, 96
PCR, see principal components regression standard error of bi (linear regression), 35
Pearson chi-square, 167 standardized residual, 75
Pearson residual, 237 stationary (time series), 344
PLS, see partial least squares studentized residual, 75
prediction interval, 149 suppressor variable, 80

INDEX 485
systematic component, 161 validation set approach, 95

variable importance plot, 278
terminal nodes, 273 VIF, 79
test data, 2
time series plot, 343 weakest link pruning, 274
Total SS, 31 weighted least squares, 15
total sum of squares, 31 whiskers (box plot), 4
trailing clusters, 326 white noise, 345
training data, 2
Tweedie distribution, 163 Xbar charts, 346
validation set, 95 zero-inflated model, 220

We make studying easier.
Access all your ACTEX and ASM products in one place, with product
integration, video instruction, discussion forums, and more at
www.ActuarialUniversity.com

(2022 Full) ASM SRM (Ocr)

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

(2022 Full) ASM SRM (Ocr)

Uploaded by

Copyright:

Available Formats

An Integrated

Abraham Weishaus, Ph.D., FSA, CFA, MAAA

Learning Made Easier

Exam SRM Study Manual

Abraham Weishaus, Ph.D., FSA, CFA, MAAA

Please check A.S.M.'s web site at www.studymanuals.com for errata

©Copyright 2022, Actuarial Study Materials, a division of ArchiMedia Advantage Inc.

All rights reserved. Reproduction in whole or in part without express written

You can find integrated topics using this network icon.

*Jo When this icon appears, it will be next to an important

1. Login to: www.actuariatuniversity.com

2. Locate the Topic Search on a Pardo Distribution

enter the word or phrase into

If X is Type II Pareto with parameters tie,13, then

3. A topic "Hub" will display a Epri —1

that offer more ways to

4. Here is an example of the

ASM Manual for !FM a

Unlocked Products are the products that you own.

Probability for Risk Management, 3rd Edition ill

Many oF Actuarial University's Features are already unlocked with your

GOAL Practice Tool Instructional Videos*

GOAL: Guided Online Actuarial Learning

Practice. Quiz. Test. Pass!

Free with your ACTEX or

Actuarial University Quickly access the

D 6.9% 1:117.o% 1117.1% 1137.2% (

Rate this problem I6 Excellent c inadequate problem or give

Track your exam readiness

GOAL Score tracks your performance through GOAL Practice Sessions,

Your COAL Store IF your GOAL

See key areas

0S/24/2022 111340S 05/24/2022 14L57:A9 05/24/2022 14;57:40 PrtkoSession 'Addition:and DO Complete

05/21/202219:3250 05/23/702220/0200 5imu1ate-0 WM !11m2 30 StrAewing

Have you heard of GOAL: Guided

Connect with me on the

aisim Study Manuals Learning Made Easier

1 Basics of Statistical Learning 1

2 Linear Regression: Estimating Parameters 11

3 Linear Regression: Standard Error, R2, and t statistic 31

5 Linear Regression: Validation 75

7 Linear Regression: Subset Selection 103

Exam SRM Study Manual xi

8 Linear Regression: Shrinkage and Dimension Reduction 125

8.1.1 Ridge regression 125

8.2.1 Principal components regression 130

8.2.2 Partial least squares 130

9 Linear Regression: Predictions 149

10 Interpreting Regression Results 157

10.1 Statistical significance 157

II Generalized Linear Model 159

11 Generalized Linear Model: Basics 161

12 Generalized Linear Model: Categorical Response 187

12.1 Binomial response 187

12.2 Nominal response 191

13 Generalized Linear Model: Count Response 219

14 Generalized Linear Model: Measures of Fit 231

14.3 Deviance 232

Exam SRM Study Manual

14.4 Penalized loglikelihood tests 235

III Other Statistical Learning Methods 255

15.3 KNN regression 259

16 Decision Trees 273

16.2.3 Boosting 278