Predictive Modelling Process: A First Tour


Applied computational intelligence · Statistical learning

Michela Mulas

Today's goal
We are going to ...
- Introduce the predictive modeling process.
- Define the terms.
- Case study: predict fuel economy.
- Assign HW1.
Reading list
- M. Kuhn and K. Johnson, Applied Predictive Modeling, 2014 (Chapters 1 and 2).
- G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, 2014 (Chapter 2).
Predictive modelling process 1 / 41

Motivations

Information has become more readily available via the internet and the media, and our desire to use this information to help us make decisions has intensified. The human brain can consciously and subconsciously assemble a vast amount of data, but it cannot process the even greater amount of easily obtainable, relevant information for the problem at hand.

- Websites are used to filter billions of web pages to find the most appropriate information for our queries.
- These sites use tools that take our current information, sift through data looking for patterns that are relevant to our problem, and return answers.
- The process of developing these kinds of tools has evolved in a number of fields, such as chemistry, computer science, physics, and statistics, and has been called machine learning, artificial intelligence, pattern recognition, data mining, predictive analytics, and knowledge discovery.

Although each field approaches the problem using different perspectives and tool sets, the ultimate objective is the same: to make an accurate prediction. Geisser¹ defines predictive modeling as the process by which a model is created or chosen to try to best predict the probability of an outcome.

Predictive modeling: the process of developing a mathematical tool or model that generates an accurate prediction.

Examples of artificial intelligence can be found everywhere²:
- The Google global machine uses AI to interpret cryptic human queries.
- Credit card companies use it to track fraud.
- Netflix uses it to recommend movies to subscribers.
- Financial systems use AI to handle billions of trades.

¹ Geisser S. (1993). Predictive Inference: An Introduction. Chapman and Hall.
² Levy S. (2010). The AI Revolution is On. Wired.
A simple example

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. We get a set of data:
- Sales of that product in 200 different markets: sales is the output variable Y.
- Advertising budgets in TV, radio, and newspaper: the budgets are the input variables X1, X2, X3.

[Figure 2.1 of James et al.: the Advertising data set. Sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. Each panel shows the simple least squares fit of sales to that variable.]

- Objective: develop an accurate model that can be used to predict sales on the basis of the three media budgets,

    Y = f(X) + ε

  where f is some fixed but unknown function and ε is a random error term (independent of X, with zero mean).

Defining statistical learning

More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

    Y = f(X) + ε

Here f is some fixed but unknown function of X1, ..., Xp, representing the systematic information that X provides about Y, and ε is a random error term, which is independent of X and has mean zero. Statistical learning refers to a set of approaches for estimating f.

For prediction: in many situations a set of inputs X is readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

    Ŷ = f̂(X)

where Ŷ is the prediction for Y and f̂ is the estimate of f.

The accuracy of Ŷ as a prediction for Y depends on:
- The reducible error: the error in f̂; we can reduce it by estimating f better.
- The irreducible error: the error due to ε; no matter how well we estimate f, we cannot reduce the error introduced by ε.

Consider a given estimate f̂ and a set of predictors X, which yield the prediction Ŷ = f̂(X), and assume for a moment that both f̂ and X are fixed. For any estimate f̂(x) of f(x), we have

    E[(Y − Ŷ)² | X = x] = [f(x) − f̂(x)]² + Var(ε)

where the first term on the right is the reducible error and the second is the irreducible error.
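The decomposition above is easy to check numerically. A minimal sketch, in which the true function f, the noise level, and the deliberately imperfect estimate f̂ are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True (in practice unknown) systematic function."""
    return np.sin(x)

def f_hat(x):
    """A deliberately imperfect estimate of f."""
    return 0.9 * np.sin(x) + 0.1

sigma = 0.5                                   # std. dev. of the error term eps
x = rng.uniform(0.0, 6.0, 100_000)
y = f(x) + rng.normal(0.0, sigma, x.size)     # Y = f(X) + eps
y_hat = f_hat(x)

mse = np.mean((y - y_hat) ** 2)               # estimates E[(Y - Y_hat)^2]
reducible = np.mean((f(x) - f_hat(x)) ** 2)   # [f(x) - f_hat(x)]^2, averaged
irreducible = sigma ** 2                      # Var(eps)
```

However good f̂ becomes, `mse` can never fall below `irreducible`; here `mse` equals `reducible + irreducible` up to sampling noise.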


Defining statistical learning

For inference: we want to understand how Y changes as a function of X1, ..., Xp. Here f̂ cannot be treated as a black box: we need to know its exact form.
- Which predictors are associated with the response? Identify the few important predictors among a large set of possible variables.
- What is the relationship between the response and each predictor? The relationship between the response and a given predictor may also depend on the values of the other predictors.
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Identify whether the true relationship is more complicated than linear.

Terminology

Since many scientific domains have contributed to this field, there are many synonyms:
- Sample, data point, observation, or instance refer to a single, independent unit of data, such as a customer, patient, or compound. Sample can also refer to a subset of data points (e.g., the training set sample).
- The training set consists of the data used to develop models, while the test or validation sets are used solely for evaluating the performance of a final set of candidate models.
- Predictors, independent variables, attributes, or descriptors are the data used as input for the prediction equation.
- Outcome, dependent variable, target, class, or response refer to the outcome event or quantity that is being predicted.

Terminology (continued)
- Continuous data have natural, numeric scales: blood pressure, the cost of an item, or the number of bathrooms. In the last case the counts cannot be fractional, but the variable is still treated as continuous.
- Categorical data, also known as nominal, attribute, or discrete data, take on specific values that have no scale. Credit status ("good" or "bad") or color ("red", "blue", etc.) are examples of these data.
- Model building, model training, and parameter estimation all refer to the process of using data to determine the values in the model equations.

Key ingredients

For an effective predictive model we need:
- Intuition and deep knowledge of the problem context: vital for driving decisions about model development.
- Relevant data: the whole process begins with data.
- A versatile computational toolbox: techniques for data pre-processing and visualization, and a suite of modeling tools for handling a range of possible scenarios.


Example data sets and typical data scenarios

We can have very diverse problems, as well as very diverse characteristics of the collected data.

Music genre
- This data set was published as a contest data set: http://tunedit.org/challenge/music-retrieval/genres
- 12,495 music samples with 191 characteristics, to be classified into six genres.
- The response categories are not balanced: the smallest segment comes from the heavy metal category (7%) and the largest from the classical category (28%).
- All predictors are continuous, many are highly correlated, and the predictors span different scales of measurement.
- The collection was created using 60 performers, with 15-20 pieces of music selected per performer and 20 parameterized segments per piece: the samples are therefore not independent of each other.
- Objective: develop a predictive model for classifying music into six categories.

[Fig. 1.1 of Kuhn and Johnson: the frequency distribution of genres in the music data.]

Grant applications
- This data set was published as a contest data set: http://www.kaggle.com
- Historical database of 8,707 University of Melbourne grant applications from 2009 and 2010, with 249 predictors.
- Grant status: "unsuccessful" or "successful" (46% successful). Australian grant success rates are less than 25%, so the data set is not representative of the Australian rates.
- Predictors include sponsor ID, grant category, grant value range, research field, and department.
- Data are continuous, count, and categorical.
- Many predictors have missing values (83%).
- The samples are not independent: the same grant writers occur multiple times.
- Objective: develop a predictive model for the probability of a successful application.


Hepatic injury
- Data set from the pharmaceutical industry.
- 281 unique compounds, with 376 predictors measured or computed for each: measurements from 184 biological screens and 192 chemical feature predictors.
- Categorical response: "does not cause injury", "mild injury", "severe injury".
- Highly unbalanced, which is common in pharmaceutical data.
- Objective: develop a model for predicting a compound's probability of causing hepatic injury (i.e., liver damage).

[Fig. 1.2 of Kuhn and Johnson: distribution of hepatic injury type.]

Permeability
- Data set from the pharmaceutical industry.
- 165 unique compounds; for each, 1,107 molecular fingerprints (binary sequences of numbers that represent the presence or absence of specific molecular substructures).
- The response, permeability, is a measure of a molecule's ability to cross a membrane. Membranes such as the intestinal wall and the blood-brain barrier help the body guard critical regions from undesirable or detrimental substances, so permeability is critically important for an orally taken drug to reach its target.
- The response is highly skewed and the predictors are sparse (15.5% are present).
- A predictive model is an attempt to potentially reduce the need for the laboratory assay.
- Objective: develop a model for predicting compounds' permeability.

[Fig. 1.3 of Kuhn and Johnson: distribution of permeability values.]
Chemical manufacturing process
- Data set from a chemical process industry for producing pharmaceuticals.
- 177 samples of biological material with 57 measured characteristics: 12 of the biological starting material and 45 of the manufacturing process.
- Process variables include temperature, drying time, washing time, and concentrations of by-products at various steps.
- Some of the process measurements can be controlled, while others are only observed.
- Predictors are continuous, count, and categorical; some are correlated, and some contain missing values.
- Samples are not independent (sets of samples come from the same batch of biological starting material).
- Objective: develop a model to predict the percent yield of the manufacturing process.

Fraudulent financial statements
- Data set from public data sources, such as U.S. Securities and Exchange Commission documents.
- Fanning and Cogger¹ sampled non-fraudulent companies for important factors (e.g., company size and industry type).
- 150 data points were used to train models and 54 to evaluate them.
- The analysis started with an unidentified number of predictors derived from key areas, such as executive turnover rates, litigation, and debt structure; 20 predictors were used in the final models.
- Objective: predict management fraud for publicly traded companies.

¹ Fanning K, Cogger K (1998). Neural Network Detection of Management Fraud Using Published Financial Data. Int. J. of Intell. Sys. in Accounting, Finance & Management, 7(1), 21-41.

Comparing the data

Table 1.1 of Kuhn and Johnson compares several characteristics of the example data sets:

    Data characteristic   Music     Grant     Hepatic   Fraud      Permea-   Chemical
                          genre     applic.   injury    detection  bility    manufact.
    # Samples             12,495    8,707     281       204        165       177
    # Predictors          191       249       376       20         1,107     57
    Response type         categ.    categ.    categ.    categ.     contin.   contin.

The table also records, for each data set, whether the response is balanced/symmetric or unbalanced/skewed, whether the samples are independent, and the predictor characteristics: continuous, count, categorical, correlated/associated, different scales, missing values, and sparse.

Case Study: Predicting Fuel Economy

Consider a simple example that illustrates the broad concepts of model building. The data set of different estimates of fuel economy for passenger cars and trucks is provided by fueleconomy.gov, from the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy and the U.S. Environmental Protection Agency.
- Various characteristics are recorded for each vehicle, such as engine displacement or number of cylinders.
- Laboratory measurements are made for the city and highway miles per gallon (MPG) of the car.

Objective: build a model to predict the MPG for a new car line by using
- a single predictor: engine displacement (the volume inside the engine cylinders);
- a single response: unadjusted highway MPG for 2010-2011 model year cars.


Case Study: Predicting Fuel Economy

Understand the data. This can most easily be done through a graph.
- With just one predictor and one response, we use a scatter plot.
- Left panel: all the 2010 model year data. Right panel: data for the new 2011 vehicles only.
- As engine displacement increases, fuel efficiency drops, regardless of year.
- The relationship is somewhat linear but exhibits some curvature towards the extreme ends of the displacement axis.

[Scatter plots: fuel efficiency (MPG) versus engine displacement for the 2010 and 2011 model years.]

Understand the data. What if we had more than one predictor?
- We would need to further understand the characteristics of the predictors and their relationships.
- These characteristics may suggest important and necessary pre-processing steps that must be taken prior to building a model.

Build and evaluate a model on the data.
- Standard approach: take a random sample of the data for model building and use the rest to understand model performance.
- Build the models using the 2010 data (1,107 cars): the training set.
- Test the models using the new 2011 data (245 cars): the test or validation set.
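In code, splitting by population is a simple filter. The sketch below uses a hypothetical array of (model year, displacement, highway MPG) records; the values and column layout are invented for illustration and are not the actual fueleconomy.gov schema:

```python
import numpy as np

# Hypothetical records: (model_year, displacement_litres, highway_mpg)
cars = np.array([
    (2010, 2.0, 33.0), (2010, 3.5, 26.0), (2010, 5.0, 21.0),
    (2011, 1.8, 35.0), (2011, 4.2, 23.0),
])

# Split by population, as in the case study: 2010 -> training, 2011 -> test
train = cars[cars[:, 0] == 2010]
test = cars[cars[:, 0] == 2011]

X_train, y_train = train[:, 1], train[:, 2]
X_test, y_test = test[:, 1], test[:, 2]
```

A simple random sample (e.g. shuffling row indices before splitting) would be the alternative when training and test data come from the same population.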
Build and evaluate a model on the data.
- To evaluate performance, we cannot use the same data used for building the model.
- Re-predicting the training set data has the potential to produce overly optimistic estimates.
- An alternative approach is resampling: different subversions of the training data set are used to fit the model.

Measure the performance of the model.
- For regression problems, the residuals are important sources of information. They are computed as the observed value minus the predicted value (yᵢ − ŷᵢ).
- The root mean squared error (RMSE) is commonly used to evaluate models. It is interpreted as how far, on average, the residuals are from zero.
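RMSE is a few lines of NumPy (the example numbers below are invented):

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean squared error: how far, on average, the residuals are from zero."""
    residuals = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Residuals 2, -1, -3 give RMSE = sqrt((4 + 1 + 9) / 3) ~= 2.16
error = rmse([30.0, 25.0, 20.0], [28.0, 26.0, 23.0])
```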

Define the relationship between the predictor and the outcome.
- The modeler will try various techniques to mathematically define this relationship.
- The training set is used to estimate the various values needed by the model equations.
- The test set is used only when a few strong candidate models have been finalized.

First attempt: a linear regression model.
- The predicted MPG is given by a basic slope-and-intercept model.
- With the training data, we estimate the model using the least squares method.
- Left panel: the linear model fit, defined by the estimated slope and intercept. Right panel: observed versus predicted MPG.
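A minimal least squares fit of the slope-and-intercept model; the (displacement, MPG) pairs below are invented stand-ins for the training data:

```python
import numpy as np

# Hypothetical training data: engine displacement (L) and highway MPG
disp = np.array([1.6, 2.0, 2.5, 3.0, 3.5, 4.5, 5.7, 6.2])
mpg = np.array([38.0, 34.0, 31.0, 28.0, 26.0, 23.0, 20.0, 19.0])

# Least squares estimates of slope and intercept (degree-1 polynomial fit)
slope, intercept = np.polyfit(disp, mpg, deg=1)

predicted = intercept + slope * disp
train_rmse = np.sqrt(np.mean((mpg - predicted) ** 2))
```

On this toy data the slope is negative, matching the scatter plots: fuel efficiency drops as displacement increases.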
First attempt: a linear regression model.
- The model misses some of the patterns in the data: it under-predicts fuel economy when the displacement is less than 2 L or greater than 6 L.
- Resampling the data gives an estimated RMSE of 4.6 MPG.

Improve the model: introduce some nonlinearity.
- The most basic approach is to add complexity, for example by adding a squared term:

    economy = 63.2 − 11.9 × displacement + 0.94 × displacement²

- This is a quadratic model: it includes a squared term.
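Evaluating the fitted quadratic equation from the slide makes its behaviour at the extremes easy to inspect:

```python
def predicted_economy(displacement):
    """Quadratic model from the slide, fitted on the 2010 training data."""
    return 63.2 - 11.9 * displacement + 0.94 * displacement ** 2

mid = predicted_economy(2.0)    # a typical displacement
low = predicted_economy(6.3)    # near the vertex of the parabola
high = predicted_economy(8.0)   # a very large displacement
```

The parabola's minimum sits near 11.9 / (2 × 0.94) ≈ 6.3 L, so predictions rise again for larger engines: `high > low`, which is exactly the unrealistic upward bend discussed on the next slide.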

Improve the model: introduce some nonlinearity.
- The quadratic term improves the model fit (RMSE = 4.2 MPG).
- Drawback: quadratic models can perform poorly on the extremes of the predictor. Here the fit appears to bend upwards unrealistically, so predicting new vehicles with large displacement values may produce significantly inaccurate results.

Improve the model: multivariate adaptive regression splines (MARS)¹.
- With a single predictor, MARS can fit separate linear regression lines for different ranges of engine displacement.
- The slopes and intercepts are estimated for this model, as well as the number and size of the separate regions for the linear models.
- Unlike the linear regression models, MARS has a tuning parameter that cannot be directly estimated from the data: there is no analytical equation that can be used to determine how many segments should be used to model the data.
- We can try different values and use resampling to determine the appropriate one. Once the value is found, a final MARS model is fit using all the training set data and used for prediction.

¹ Friedman J. (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19(1), 1-141.
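MARS selects its hinge locations automatically, which this sketch does not attempt. It only illustrates the underlying idea of tuning by resampling: fit piecewise-linear hinge bases by least squares and pick the knot set with the lowest cross-validated RMSE. The synthetic data, the candidate knot sets, and the function names are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the fuel economy data
x = rng.uniform(1.0, 8.0, 200)
y = 70.0 / (1.0 + x) + rng.normal(0.0, 2.0, x.size)

def hinge_design(x, knots):
    """Design matrix with an intercept, x, and hinge terms max(0, x - k)."""
    cols = [np.ones_like(x), x] + [np.maximum(0.0, x - k) for k in knots]
    return np.column_stack(cols)

def cv_rmse(x, y, knots, folds=5):
    """Cross-validated RMSE of a piecewise-linear least squares fit."""
    idx = np.arange(x.size) % folds
    fold_mse = []
    for f in range(folds):
        tr, te = idx != f, idx == f
        beta, *_ = np.linalg.lstsq(hinge_design(x[tr], knots), y[tr], rcond=None)
        pred = hinge_design(x[te], knots) @ beta
        fold_mse.append(np.mean((y[te] - pred) ** 2))
    return float(np.sqrt(np.mean(fold_mse)))

# Tune the number of segments: candidate knot sets of increasing size
candidates = [[], [4.0], [3.0, 5.0], [2.5, 4.0, 6.0]]
scores = [cv_rmse(x, y, k) for k in candidates]
best = candidates[int(np.argmin(scores))]
```

As on the slides, once `best` is chosen by resampling, a final model with those knots would be refit on all of the training data.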
Improve the model: multivariate adaptive regression splines.
- For a single predictor, the MARS model is allowed up to five model terms (analogous to the previous slopes and intercepts).
- The lowest RMSE value is associated with four terms, although the scale of the change in the RMSE values (roughly 4.24 to 4.28 MPG) indicates some insensitivity to this tuning parameter.
- After fitting the final MARS model with four terms, the training set fit consists of several linear segments.

[Left: cross-validated RMSE profile versus the number of MARS terms. Right: the fitted MARS model and observed versus predicted MPG.]

Compare the models on the test set.
- Both models fit very similarly.
- On the test set, RMSE(quadratic) = 4.72 MPG and RMSE(MARS) = 4.69 MPG.
- Either model would be appropriate for the prediction of new car lines.

[Test set fits of the MARS and quadratic regression models: fuel economy versus engine displacement.]

Considerations on the model building process: data splitting.
How should we allocate data to model building and to evaluating performance?
- The primary interest was to predict the fuel economy of new vehicles, which is not the same population as the data used to build the model. This means that, to some degree, we are testing how well the model extrapolates to a different population.
- If we were interested in predicting from the same population of vehicles (i.e., interpolation), taking a simple random sample of the data would be more appropriate.
- How the training and test sets are determined should reflect how the model will be applied.


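The RMSE values quoted above are computed from observed and predicted values on the test set. A minimal sketch, using made-up numbers rather than the case-study data:

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: sqrt(mean of squared residuals)."""
    residuals = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(residuals) / len(residuals))

# Toy observed/predicted pairs (illustrative only):
obs  = [30.0, 25.0, 40.0, 22.0]
pred = [28.0, 26.0, 37.0, 24.0]
print(round(rmse(obs, pred), 3))  # -> 2.121
```

Lower RMSE means the predictions sit closer to the observations on average, in the same units as the outcome (here, MPG).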
Case Study: Predicting Fuel Economy
Considerations on the model building process: Data Splitting
How much data should be allocated to the training and test sets?
I Generally, it depends on the situation.
I If the pool of data is small, the data splitting decisions can be critical. A small test set would have limited utility as a judge of performance, and a sole reliance on resampling techniques might be more effective.
I Large data sets reduce the criticality of these decisions.

Case Study: Predicting Fuel Economy
Considerations on the model building process: Predictor data
The fuel economy example has revolved around one of many predictors: the engine displacement.
I The original data contain many other factors: the number of cylinders, the type of transmission, and the manufacturer.
I An earnest attempt to predict the fuel economy would examine as many predictors as possible to improve performance.
I Using more predictors, it is likely that the RMSE for the new model cars can be driven down further.
I Some investigation into the data can also help. For example, none of the models were effective at predicting fuel economy when the engine displacement was small; including predictors that target these types of vehicles would help improve performance.

Feature selection, the process of determining the minimum set of relevant predictors needed by the model, is a common way to approach the problem.
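For the interpolation case discussed above, where a simple random sample is appropriate, a split can be sketched as follows; the 80/20 fraction and the record count are arbitrary illustrative choices:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly partition `data` into a training set and a test set.

    A simple random sample like this suits prediction within the same
    population; for new car lines, a split by model year would better
    reflect how the model will be applied."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

cars = list(range(100))             # stand-in vehicle records
train, test = train_test_split(cars, test_fraction=0.2)
print(len(train), len(test))        # -> 80 20
```

Note that the split is made once, before any model fitting, so the test set plays no role in choosing model parameters.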
Case Study: Predicting Fuel Economy
Considerations on the model building process: Estimating performance
We used two techniques to determine the effectiveness of the model.
1. Quantitative assessments of statistics (i.e., the RMSE) using resampling help the user understand how each technique would perform on new data.
2. Simple visualizations (e.g., plotting the observed and predicted values) help discover areas of the data where the model does particularly well or badly.
This type of qualitative information is critical for improving models and is lost when the model is gauged only on summary statistics.

Case Study: Predicting Fuel Economy
Considerations on the model building process: Evaluating several models
Three different models were evaluated.
I The No Free Lunch Theorem1 argues that, without having substantive information about the modeling problem, there is no single model that will always do better than any other model.
I A strong case can be made to try a wide variety of techniques, then determine which model to focus on.
In the fuel economy example, a simple plot of the data shows that there is a nonlinear relationship between the outcome and the predictor. Given this knowledge, we might exclude linear models from consideration, but there is still a wide variety of techniques to evaluate. One might say that "model X is always the best performing model" but, for these data, a simple quadratic model is extremely competitive.

1 Wolpert D. (1996). "The Lack of a priori Distinctions Between Learning Algorithms". Neural Computation, 8(7), 1341-1390 (http://www.no-free-lunch.org)
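The resampling assessment in point 1 can be sketched as k-fold cross-validation. To stay self-contained this toy refits a mean-only model on each training split, where the case study refit the quadratic and MARS models; the data values are invented:

```python
import math

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def kfold_indices(n, k):
    """Assign indices 0..n-1 to k disjoint folds, round-robin."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cv_rmse(y, k=5):
    """Cross-validated RMSE of a trivial mean-only model: each fold is
    held out in turn, the model is refit on the rest, and the held-out
    RMSE values are averaged."""
    scores = []
    for fold in kfold_indices(len(y), k):
        held = set(fold)
        train = [y[i] for i in range(len(y)) if i not in held]
        mean = sum(train) / len(train)            # the "fitted model"
        scores.append(rmse([y[i] for i in fold], [mean] * len(fold)))
    return sum(scores) / k

print(cv_rmse([1.0, 2.0, 3.0, 4.0, 5.0], k=5))    # -> 1.5
```

Because every prediction is made on data withheld from the fit, the averaged score estimates performance on new data rather than on the training set.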
Case Study: Predicting Fuel Economy
Considerations on the model building process: Model selection
At some point in the process, a specific model must be chosen. This example demonstrated two types of model selection:
I Between models: the linear regression model did not fit well and was dropped.
I Within models: for MARS, the tuning parameter was chosen using cross-validation.
In either case, we relied on cross-validation and the test set to produce quantitative assessments of the models to help us make the choice. Because we focused on a single predictor, which will not often be the case, we also made visualizations of the model fit to help inform us.
At the end of the process, the MARS and quadratic models appear to give equivalent performance. However, knowing that the quadratic model might not do well for vehicles with very large displacements, our intuition might tell us to favor the MARS model.
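Between-model selection on quantitative grounds reduces to comparing held-out scores and keeping the best. A sketch with invented observed values and candidate predictions (not the case-study numbers):

```python
import math

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Hypothetical test-set observations and two candidate models' predictions:
observed = [30.0, 25.0, 40.0, 22.0, 35.0]
candidates = {
    "quadratic": [28.0, 26.0, 36.0, 25.0, 33.0],
    "mars":      [29.0, 24.0, 38.0, 23.0, 34.0],
}

# Between-model selection: keep the candidate with the lowest test RMSE.
scores = {name: rmse(observed, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # -> mars
```

When the scores are as close as the 4.69 versus 4.72 MPG above, qualitative information, such as how each model behaves at extreme displacements, is a reasonable tie-breaker.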