Predictive Modelling Process: A First Tour
Today's goal
Michela Mulas

Reading list:
1. Chapters 1 and 2
2. Chapter 2
Applied computational intelligence / Statistical learning
Topics: Goal; Key ingredients; Data sets from different scenarios; Case study

Motivations
Information has become more readily available via the internet and media, and our desire to use this information to help us make decisions has intensified.

The human brain can consciously and subconsciously assemble a vast amount of data, but it cannot process the even greater amount of easily obtainable, relevant information for the problem at hand.

- Websites are used to filter billions of web pages to find the most appropriate information for our queries.
- These sites use tools that take our current information, sift through data looking for patterns that are relevant to our problem, and return answers.
- The process of developing these kinds of tools has evolved throughout a number of fields such as chemistry, computer science, physics, and statistics, and has been called machine learning, artificial intelligence, pattern recognition, data mining, predictive analytics, and knowledge discovery.

Each field approaches the problem using different perspectives and tool sets, but the ultimate objective is the same: to make an accurate prediction. Geisser1 defines predictive modeling as the process by which a model is created or chosen to try to best predict the probability of an outcome.

Predictive modeling: The process of developing a mathematical tool or model that generates an accurate prediction.

Examples of artificial intelligence can be found everywhere2:
- The Google global machine uses AI to interpret cryptic human queries.
- Credit card companies use it to track fraud.
- Netflix uses it to recommend movies to subscribers.
- Financial systems use AI to handle billions of trades.

1 Geisser S. (1993). Predictive Inference: An Introduction. Chapman and Hall.
2 Levy S. (2010). The AI Revolution is On. Wired.
Objective: Develop an accurate model that can be used to predict sales on the basis of the three media budgets.

[Figure 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.]

We write the model as

    Y = f(X) + ε

- f is some fixed but unknown function.
- ε is a random error term (independent of X, with zero mean).
More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

    Y = f(X) + ε.    (2.1)
Consider a given estimate f̂ and a set of predictors X, which yields the prediction Ŷ = f̂(X). Assume for a moment that both f̂ and X are fixed. For any estimate f̂(X) of f(X), we have

    E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε)

where the first term is the reducible error and the second the irreducible error. The accuracy of Ŷ as a prediction for Y therefore depends on:

- Reducible error: the error in f̂; we can reduce it by improving our estimate of f.
- Irreducible error: due to ε; no matter how well we estimate f, we cannot reduce the error introduced by ε.
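This decomposition can be made concrete with a small simulation. Everything below (the true f, the noise level, and both estimates) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented ground truth: a fixed but (in practice) unknown f, and noise sd.
def f(x):
    return 2.0 + 3.0 * x

sigma = 0.5                                   # sd of the irreducible error eps

x = rng.uniform(0.0, 1.0, 10_000)
y = f(x) + rng.normal(0.0, sigma, x.size)     # Y = f(X) + eps

# A hypothetical biased estimate f_hat, versus the perfect estimate f_hat = f.
def f_hat_poor(x):
    return 2.5 + 2.0 * x

mse_poor = np.mean((y - f_hat_poor(x)) ** 2)  # reducible + irreducible
mse_perfect = np.mean((y - f(x)) ** 2)        # irreducible only

print(round(mse_poor, 3), round(mse_perfect, 3), sigma ** 2)
```

Even with f̂ = f, the mean squared error does not fall below Var(ε) = 0.25: only the reducible term can be driven to zero.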
The whole process begins with data, and relies on a versatile computational toolbox: techniques for data pre-processing and visualization, plus a suite of modeling tools for handling a range of possible scenarios.

- Categorical data, also known as nominal, attribute, or discrete data, take on specific values that have no scale. Credit status ("good" or "bad") or color ("red", "blue", etc.) are examples of these data.
- Model building, model training, and parameter estimation all refer to the process of using data to determine the values in model equations.
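Because categorical values have no scale, they are typically re-expressed as indicator columns before modeling. A minimal hand-rolled sketch using the credit-status example from the text:

```python
# One-hot (indicator) encoding of a categorical predictor, by hand.
credit_status = ["good", "bad", "good", "good", "bad"]

levels = sorted(set(credit_status))                 # ['bad', 'good']
one_hot = [[int(value == level) for level in levels]
           for value in credit_status]

print(levels)
print(one_hot)    # first row: [0, 1] -> "good"
```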
Example data sets and typical data scenarios

Both the problems and the characteristics of the collected data can be very diverse.

Music genre
- This data set was published as a contest data set on the TunedIT web site: http://tunedit.org/challenge/music-retrieval/genres. The objective of the competition was to develop a predictive model for classifying music into six categories.
- 12,495 music samples with 191 characteristics. The collection was created using 60 performers, from which 15-20 pieces of music were selected per performer; 20 segments of each piece were then parameterized to create the final data set. Hence, the samples are inherently not independent of each other.
- The response categories are not balanced (Fig. 1.1), with the smallest segment coming from the heavy metal category (7%) and the largest from the classical category (28%).
- All predictors are continuous; many are highly correlated, and the predictors span different scales of measurement.
- Objective: Develop a predictive model for classifying music into six categories.

[Fig. 1.1: The frequency distribution of genres in the music data (Blues, Classical, Jazz, Metal, Pop, Rock).]

Grant applications
- This data set was published as a contest data set: http://www.kaggle.com
- Historical database of 8,707 University of Melbourne grant applications from 2009 and 2010, with 249 predictors.
- Grant status: "unsuccessful" or "successful" (with 46% successful). Australian grant success rates are less than 25%, so the data set is not representative of the Australian rates.
- Predictors include Sponsor ID, Grant Category, Grant Value Range, Research Field, and Department; the data are continuous, count, and categorical.
- Many predictor values are missing (83%).
- The samples are not independent: the same grant writers occur multiple times.
- Objective: Develop a predictive model for the probability of a successful application.
Hepatic injury
- Categorical response: the presence or absence of a specific type of liver injury.

Permeability
- A compound must pass through the intestinal wall and then through the blood-brain barrier in order to be present at the desired neurological target. A compound's ability to permeate relevant biological membranes is therefore critically important.

Chemical manufacturing
- Objective: Develop a model to predict the percent yield of the manufacturing process.

Fraud detection
- Objective: Predict management fraud for publicly traded companies.1

1 Fanning K, Cogger K (1998). Neural Network Detection of Management Fraud Using Published Financial Data. Int. J. of Intelligent Systems in Accounting, Finance & Management, 7(1), 21-41.
A comparison of the example data sets:

Data characteristic   Music     Grant    Hepatic   Fraud       Permea-   Chemical
                      genre     apps.    injury    detection   bility    manuf.
# Samples             12,495    8,707    281       204         165       177
# Predictors          191       249      376       20          1,107     57
Response type         Categ.    Categ.   Categ.    Categ.      Contin.   Contin.

Among these, the music genre response is unbalanced (7-28% per class) while the grant status is roughly balanced (46% successful); neither the music nor the grant samples are independent; the music predictors are continuous, correlated, and on different scales, while the grant predictors mix continuous, count, and categorical data with many missing values.

Case Study: Predicting Fuel Economy

Consider a simple example that illustrates the broad concepts of model building.

Data set: different estimates of fuel economy for passenger cars and trucks, provided by fueleconomy.gov, from the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy and the U.S. Environmental Protection Agency.

- Various characteristics are recorded for each vehicle, such as engine displacement or number of cylinders.
- Laboratory measurements are made for the city and highway miles per gallon (MPG) of the car.

Objective: Build a model to predict the MPG for a new car line using
- A single predictor: engine displacement (the volume inside the engine cylinders).
- A single response: unadjusted highway MPG for 2010-2011 model year cars.
Case Study: Predicting Fuel Economy — Understand the data

Understanding the data can most easily be done through a graph.
- We have just one predictor and one response, so we use a scatter plot.
- Left panel: all the 2010 data. Right panel: data only for the new 2011 vehicles.
- As engine displacement increases, fuel efficiency drops regardless of model year.
- The relationship is somewhat linear but exhibits some curvature towards the extreme ends of the displacement axis.

[Figure: Fuel Efficiency (MPG) versus engine displacement, 2010 and 2011 model years.]
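The "look at a graph first" step is a single scatter-plot call (e.g. matplotlib's plt.scatter(displacement, mpg)). With a few invented points in the spirit of the fueleconomy.gov data, the visual impression — MPG falls as displacement grows — can also be summarized in one number:

```python
import numpy as np

# Invented mini-sample: engine displacement (litres), unadjusted highway MPG.
displacement = np.array([1.4, 1.8, 2.0, 2.5, 3.0, 3.5, 4.2, 5.0, 6.2])
mpg = np.array([42.0, 38.0, 36.0, 33.0, 30.0, 28.0, 26.0, 24.0, 21.0])

# Correlation captures the strong downward, roughly linear trend in the plot.
r = float(np.corrcoef(displacement, mpg)[0, 1])
print(round(r, 2))
```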
Case Study: Predicting Fuel Economy — Understand the data (continued)

What if we had more than one predictor?
- We would need to further understand the characteristics of the predictors and their relationships.
- These characteristics may suggest important and necessary pre-processing steps that must be taken prior to building a model.

Case Study: Predicting Fuel Economy — Build and evaluate a model on the data

- Standard approach: take a random sample of the data for model building and use the rest to understand model performance.
- Here, build the models using the 2010 data (1,107 cars): the training set.
- Test the models using the new 2011 data (245 cars): the test or validation set.
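The split described above is deliberate rather than random: 2010 cars train the model, 2011 cars test it. A sketch with invented records:

```python
# Each record is (model year, displacement in litres, highway MPG) -- invented.
cars = [
    {"year": 2010, "disp": 2.0, "mpg": 35},
    {"year": 2010, "disp": 3.5, "mpg": 27},
    {"year": 2011, "disp": 2.4, "mpg": 33},
    {"year": 2010, "disp": 5.0, "mpg": 23},
    {"year": 2011, "disp": 1.6, "mpg": 40},
]

train = [c for c in cars if c["year"] == 2010]   # model building only
test = [c for c in cars if c["year"] == 2011]    # held out for evaluation

print(len(train), len(test))
```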
Case Study: Predicting Fuel Economy — Build and evaluate a model on the data (continued)

- To evaluate performance, we cannot use the same data used for building.
- Re-predicting the training set data has the potential to produce overly optimistic estimates.
- Alternative approach, resampling: different subversions of the training data set are used to fit the model.

Case Study: Predicting Fuel Economy — Measure the performance of the model

- For regression problems, the residuals are important sources of information.
- They are computed as the observed value minus the predicted value (yi − ŷi).
- The root mean squared error (RMSE) is commonly used to evaluate models.
- RMSE is interpreted as how far, on average, the residuals are from zero.
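The residual and RMSE definitions above translate directly into code (the observed/predicted values are invented for illustration):

```python
import math

# residual = observed - predicted; RMSE = sqrt(mean of squared residuals).
observed = [30.0, 25.0, 40.0, 22.0]
predicted = [28.0, 27.0, 37.0, 21.0]   # hypothetical model output

residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

print(residuals)        # [2.0, -2.0, 3.0, 1.0]
print(round(rmse, 3))   # 2.121
```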
Case Study: Predicting Fuel Economy — Define the relationship between the predictor and the outcome

- The modeler will try various techniques to mathematically define this relationship.
- The training set is used to estimate the various values needed by the model equations.
- The test set is used only when a few strong candidate models have been finalized.

First attempt: a linear regression model.
- The predicted MPG is a basic slope-and-intercept model.
- With the training data, we estimate the model using the least squares method.
- Left: the linear fit defined by the estimated slope and intercept. Right: observed versus predicted MPG.

[Figure: linear fit of fuel economy versus engine displacement (left) and observed versus predicted values (right).]
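The least squares step amounts to one fitting call. A sketch on invented training points (any curve-fitting routine would do; numpy's polyfit is used here):

```python
import numpy as np

# Invented training points: displacement (litres) and highway MPG.
x = np.array([1.6, 2.0, 2.5, 3.0, 3.8, 4.6, 5.4, 6.0])
y = np.array([40.0, 36.0, 33.0, 30.0, 27.0, 25.0, 23.0, 22.0])

# Least squares slope-and-intercept fit: predicted = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x

print(round(slope, 2), round(intercept, 2))
```

As expected for these data, the fitted slope is negative: larger engines predict lower MPG.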
Case Study: Predicting Fuel Economy — First attempt: linear regression (continued)

- The model misses some of the patterns in the data.
- It under-predicts fuel economy when the displacement is less than 2 L or greater than 6 L.
- Resampling the data gives an estimated RMSE of 4.6 MPG.

Improve the model: introduce some nonlinearity.
- The most basic approach is to add complexity, for example by adding a squared term:
      economy = 63.2 − 11.9 × displacement + 0.94 × displacement²
- This is a quadratic model: it includes a squared term.
Case Study: Predicting Fuel Economy — Improve the model: introduce some nonlinearity (continued)

- The quadratic term improves the model fit (RMSE = 4.2 MPG).
- Drawback: it can perform poorly on the extremes of the predictor.
- The model appears to bend upwards unrealistically; predicting new vehicles with large displacement values may produce significantly inaccurate results.

Improve the model: Multivariate Adaptive Regression Splines (MARS)1.
- With a single predictor, MARS can fit separate linear regression lines for different ranges of engine displacement.
- The slopes and intercepts are estimated for this model, as well as the number and size of the separate regions for the linear models.
- Unlike the linear regression models, MARS has a tuning parameter that cannot be directly estimated from the data: there is no analytical equation that determines how many segments should be used to model the data.
- We can try different values and use resampling to determine the appropriate one.
- Once the value is found, a final MARS model is fit using all the training set data and used for prediction.

1 Friedman J. (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19(1), 1-141.
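The tune-by-resampling idea can be sketched without a MARS library: fit piecewise-linear models built from hinge terms max(0, x − t) — the same building blocks MARS uses — and pick the number of knots by cross-validated RMSE. The data and knot placement below are invented; a real MARS implementation also chooses knot locations adaptively.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic curved data standing in for MPG vs displacement (invented).
x = rng.uniform(1.0, 7.0, 120)
y = 55.0 - 18.0 * np.log(x) + rng.normal(0.0, 1.5, x.size)

def hinge_basis(x, knots):
    """Design matrix with intercept, x, and hinge terms max(0, x - t)."""
    cols = [np.ones_like(x), x] + [np.maximum(0.0, x - t) for t in knots]
    return np.column_stack(cols)

def cv_rmse(x, y, n_knots, n_folds=5):
    """Cross-validated RMSE for a piecewise-linear fit with n_knots knots."""
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]  # interior knots
    folds = np.array_split(rng.permutation(x.size), n_folds)
    mses = []
    for fold in folds:
        mask = np.ones(x.size, dtype=bool)
        mask[fold] = False
        beta, *_ = np.linalg.lstsq(hinge_basis(x[mask], knots), y[mask],
                                   rcond=None)
        pred = hinge_basis(x[fold], knots) @ beta
        mses.append(np.mean((y[fold] - pred) ** 2))
    return float(np.sqrt(np.mean(mses)))

scores = {k: cv_rmse(x, y, k) for k in range(1, 6)}
best_k = min(scores, key=scores.get)      # tuning value chosen by resampling
print(best_k)
```

After choosing best_k this way, the final model would be refit on the full training set, mirroring the MARS workflow described above.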
Case Study: Predicting Fuel Economy — Improve the model: MARS (continued)

- For a single predictor, MARS can allow for up to five model terms (similar to the previous slopes and intercepts).
- The lowest cross-validated RMSE is associated with four terms, although the scale of change in the RMSE values (roughly 4.24-4.28) indicates some insensitivity to this tuning parameter.
- After fitting the final MARS model with four terms, the training set fit shows several linear segments.

[Figure: cross-validated RMSE versus number of MARS terms (left); final four-term MARS fit and observed versus predicted MPG (right).]
Case Study: Predicting Fuel Economy — Compare the models on the test set

- Both models fit very similarly.
- For the test set: RMSE(quadratic) = 4.72 MPG and RMSE(MARS) = 4.69 MPG.
- Either model would be appropriate for the prediction of new car lines.

Considerations on the model building process: data splitting.
- How should we allocate data between model building and evaluating performance?
- The primary interest was to predict the fuel economy of new vehicles, which is not the same population as the data used to build the model. This means that, to some degree, we are testing how well the model extrapolates to a different population.
- If we were instead interested in predicting from the same population of vehicles (i.e., interpolation), taking a simple random sample of the data would be more appropriate.
- How the training and test sets are determined should reflect how the model will be applied.
Case Study: Predicting Fuel Economy — Considerations on the model building process: data splitting (continued)

How much data should be allocated to the training and test sets?
- Generally it depends on the situation.
- If the pool of data is small, the data splitting decisions can be critical: a small test set has limited utility as a judge of performance, and a sole reliance on resampling techniques might be more effective.
- Large data sets reduce the criticality of these decisions.

Considerations on the model building process: predictor data.
- The fuel economy example has revolved around one of many possible predictors: the engine displacement.
- The original data contain many other factors, such as the number of cylinders, the type of transmission, and the manufacturer.
- An earnest attempt to predict the fuel economy would examine as many predictors as possible to improve performance; using more predictors, it is likely that the RMSE for the new model cars can be driven down further.
- Some investigation into the data can also help: none of the models were effective at predicting fuel economy when the engine displacement was small, so including predictors that target these types of vehicles would help improve performance.
- Feature selection, the process of determining the minimum set of relevant predictors needed by the model, is a common way to approach the problem.
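One simple (if brute-force) way to realize feature selection is to score every predictor subset on held-out data and keep the smallest set that predicts well. The data below are invented: three candidate predictors, of which only the first two actually matter.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Invented data: 3 candidate predictors; only columns 0 and 1 drive y.
n = 200
X = rng.normal(size=(n, 3))
y = 30.0 - 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 1.0, n)

def holdout_rmse(cols):
    """Least squares fit on the first half; RMSE on the second half."""
    half = n // 2
    A_tr = np.column_stack([np.ones(half), X[:half, cols]])
    A_te = np.column_stack([np.ones(n - half), X[half:, cols]])
    beta, *_ = np.linalg.lstsq(A_tr, y[:half], rcond=None)
    return float(np.sqrt(np.mean((y[half:] - A_te @ beta) ** 2)))

subsets = [list(c) for r in range(1, 4)
           for c in itertools.combinations(range(3), r)]
scores = {tuple(s): holdout_rmse(s) for s in subsets}
best = min(scores, key=scores.get)
print(best)
```

The search recovers a subset containing the two informative predictors; practical feature selection methods (stepwise, regularization, etc.) avoid the exhaustive enumeration.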
Case Study: Predicting Fuel Economy — Considerations on the model building process: estimating performance

We used two techniques to determine the effectiveness of the model:
1. Quantitative assessments of statistics (e.g., the RMSE) using resampling help the user understand how each technique would perform on new data.
2. Simple visualizations (e.g., plotting the observed and predicted values) help discover areas of the data where the model does particularly well or poorly.
This type of qualitative information is critical for improving models and is lost when the model is gauged only on summary statistics.

Considerations on the model building process: evaluating several models.

Three different models were evaluated.
- The No Free Lunch Theorem1 argues that, without substantive information about the modeling problem, there is no single model that will always do better than any other model.
- A strong case can therefore be made to try a wide variety of techniques, then determine which model to focus on.
- In the fuel economy example, a simple plot of the data shows that there is a nonlinear relationship between the outcome and the predictor. Given this knowledge, we might exclude linear models from consideration, but there is still a wide variety of techniques to evaluate.
- One might claim that "model X is always the best performing model" but, for these data, a simple quadratic model is extremely competitive.

1 Wolpert D (1996). "The Lack of a priori Distinctions Between Learning Algorithms". Neural Computation, 8(7), 1341-1390 (http://www.no-free-lunch.org)