January 3, 2009
Prodigies
Math, music, chess
Gauss: the story, at age 3, of adding up 1 + ... + 100; his magnum opus, Disquisitiones Arithmeticae, by 21
Pascal; Mozart, Schubert, Mendelssohn; Bobby Fischer
Why these three areas? Each creates its own world with its own set of rules. No experience is required: once you know the rules, you are free to create anything.
Prodigies in Literature?
Thomas Dulack: "There are no child prodigies in literature." A list from Wikipedia:
William Cullen Bryant, Thomas Chatterton, H.P. Lovecraft, Mattie Stepanek (died at 13), Lope de Vega, Henriett Seth-F.
Others?
Mary Wollstonecraft Shelley
Why? Literature is about the world, not about rules. It deals with life's experience and the wisdom we develop over time.
MAA Short Course January 3, 2009
What is easy?
The math part? Well, not easy, but...
Math is axiomatic and logical, laid out beforehand. Given one example, we can change the numbers and it still makes sense.
PROBLEM 5: A sheet of cardboard 3 ft. by 4 ft. will be made into a box by cutting equal-sized squares from each corner and folding up the four edges. What will be the dimensions of the box with largest volume?
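The optimum can be checked numerically. A minimal sketch (the brute-force scan, grid step, and function name are my own choices, not part of the course):

```python
def box_volume(x):
    # Cut squares of side x from the corners of a 3 ft by 4 ft sheet;
    # folding gives a base of (3 - 2x) by (4 - 2x) and height x.
    return x * (3 - 2 * x) * (4 - 2 * x)

# Scan feasible cut sizes 0 < x < 1.5 on a fine grid.
best_x = max((i / 10000 for i in range(1, 15000)), key=box_volume)
# Calculus gives x = (7 - sqrt(13)) / 6, about 0.566 ft.
```

The grid answer agrees with the calculus answer to the grid's resolution.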
Based on this, is there a better strategy for who should get the next mailing?
T-Code
Max: 72000   Q3: 2   Median: 1   Q1: 0   Min: 0
Mean: 54.41   SD: 957.5
More T-Code
Transformation?
Categories?
Sensible Model?
Linear Regression
Predicted Weight = -1121 + 0.6733*Year
Forecast
Predicted Weight = -1121 + 0.6733*2005 = 228.21 lbs; actual weight of the 2005 team: 228.1 lbs
Williams College
Div III Powerhouse -- Sears Cup 10/11 years
Williams Forecast
Predicted Weight = -1977.22 + 1.094*Year; Predicted Weight (2005) = 216.97 lbs.
Teaching Calculus
Of course, it's not easy. But (1st-semester) calculus has fewer concepts to get across:
Functions and graphs (review); limits (review?); continuity (some review); the derivative, max and min; implicit differentiation; the antiderivative and area; the Fundamental Theorem
Emphasis is on Computation
Data Collection
Developing and implementing surveys
Interpreting surveys (errors)
Experimental design
Causation vs. correlation
Population vs. sample
Emphasis is on Interpretation
Probability????
2. Be Skeptical
Being skeptical is part of critical thinking
Be cautious about making claims based on data.
Example
A town has two hospitals
Large hospital: about 100 babies a day. Smaller hospital: about 15 babies a day. Over the course of a year, which hospital (if either) would probably have more days on which more than 60% of the babies born are male?
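The question answers itself by simulation. A sketch (the function name, seed, and 50/50 birth assumption are mine):

```python
import random

def days_over_60pct(births_per_day, n_days=365, seed=1):
    # Count days on which more than 60% of the births are male,
    # assuming boys and girls are equally likely.
    rng = random.Random(seed)
    days = 0
    for _ in range(n_days):
        males = sum(rng.random() < 0.5 for _ in range(births_per_day))
        if males > 0.6 * births_per_day:
            days += 1
    return days

small, large = days_over_60pct(15), days_over_60pct(100)
# Small samples vary more, so the smaller hospital has many more such days.
```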
Confidence Intervals
We don't say "The mean is 31.2."
We don't say "The mean is probably 31.2."
We don't say "The mean is close to 31.2."
All we can manage is:
"The mean is close to 31.2. Probably."
And we go on at great length to tell you how wrong we probably are.
It is easy to show that we don't naturally think clearly about conditional probabilities.
But we need to in order to make rational decisions in the world.
Random?
Random II
(Histogram: Count for each value 1-12.)
6. Solution as Process
In statistics there is often not a simple right answer, but a process of inquiry: induction, then deduction, then induction again.
Common Models
Probability models
Regression model
Common Models
Simulation by hand or computer
Technology frees us
Calculation is for calculators and statistics packages. Let the technology do the work, so students can think about statistical thinking. Let them do it so we can play Statistics.
Play Stats
Emphasizing Communication
Machines are better at computing than even my best math students. We take a sample of commuting times and find the mean to be 31.7 minutes, with a 95% confidence interval of (24.6, 38.8) minutes.
What does this mean? What doesn't this mean?
GAISE Report (2005), Members of the GAISE Group: Martha Aliaga, George Cobb, Carolyn Cuff, Joan Garfield (Chair), Rob Gould, Robin Lock, Tom Moore, Allan Rossman, Bob Stephenson, Jessica Utts, Paul Velleman, and Jeff Witmer
How many songs are on your iPod? What kind of food do you find most appealing? What kind of music do you most enjoy?
What next?
Relationships between quantitative variables
How to summarize?
The history of correlation and regression
Data collection
Surveys
Observational studies
Experiments
Probability
Historically probability was first
Equally likely events: what do we mean?
The Law of Large Numbers; the rules of probability
Inference
The black swan
Does seeing 1,000,000 white swans out of 1,000,000 prove that all swans are white? How do we prove that?
Simulation
Suppose we had a fair coin
Pennies from the 1960s: when do we decide the coin is biased?
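The decision can be framed by simulation: how often would a genuinely fair coin produce the result we saw? A sketch (the function name, thresholds, and seed are illustrative choices of mine):

```python
import random

def chance_of_extreme(n_heads=60, n_flips=100, trials=20000, seed=7):
    # How often does a FAIR coin show n_heads or more in n_flips?
    # If the answer is tiny, such an outcome is evidence of bias.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= n_heads:
            hits += 1
    return hits / trials

p = chance_of_extreme()   # roughly 0.028 for 60+ heads in 100 flips
```

This is exactly the P-value logic of the inference slides, computed without any formulas.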
Inference
Hypothesis test
Two types of errors; decisions under uncertainty
What is a P-value? What do I do if the P-value is small? Large?
Confidence interval
Why bother? What does it mean?
What's Different?
Students have had calculus. More important, they are comfortable with:
Formulas Abstraction Computers
Data = $$
Data Mining Is
"...the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." (Fayyad)
"...finding interesting structure (patterns, statistical models, relationships) in databases." (Fayyad, Chaudhuri, and Bradley)
"...a knowledge discovery process of extracting previously unknown, actionable information from very large databases." (Zornes)
"...a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." (Edelstein)
Visualization methods
Neural networks
Decision trees
K-nearest neighbor methods
K-means
UPS
16 TB (roughly the size of the U.S. Library of Congress), mostly tracking data
Google
1 PB every 72 minutes
Marketing Experiments
Often, new hypotheses are generated by data mining a planned experiment.
Segmentation
Financial Applications
Credit assessment
Is this loan application a good credit risk? Who is likely to declare bankruptcy?
Financial performance
What should the portfolio product mix be?
Manufacturing Applications
Product reliability and quality control
Process control
What can I do to improve batch yields?
Warranty analysis
Product problems
Service assessment
Adverse experiences linked to production
Medical Applications
Medical procedure effectiveness
Who are good candidates for surgery?
Physician effectiveness
Which tests are ineffective?
E-commerce
Automatic web page design
Recommendations for new purchases
Cross-selling
Pharmaceutical Applications
Combine clinical trial results with extensive medical/demographic information Non traditional uses of clinical trial data warehouse to explore:
Prediction of adverse experiences combining more than one trial
Who is likely to be non-compliant or drop out?
What are alternative (i.e., non-approved) uses supported by the data?
Pharmaceutical Applications
High throughput screening
Predict actions in assays Predict results in animals or humans
Which stock trades are based on insider information? Whose cell phone numbers have been stolen? Which credit card transactions are from stolen cards? Which documents are interesting? When are changes in networks signs of potential illegal activity?
To optimize profit, who should receive the current solicitation? What is the most cost effective strategy?
T-Code
More T-Code
Transformation?
Categories?
Shoppers
Person ID   Person name   ZIP code   Item bought
135366      Lyle          19103      T35691
135366      Lyle          19103      C56621
259835      Dick          01267      RS5292
Metadata
The data survey describes the data set contents and characteristics
Table name
Description
Primary key/foreign key relationships
Collection information: how, where, conditions
Timeframe: daily, weekly, monthly
Cosynchronous: every Monday or Tuesday
Data Preparation
Build data mining database Explore data Prepare data for modeling
60% to 95% of the time is spent preparing the data.
Data Challenges
Data definitions
Types of variables
Data consolidation
Combine data from different sources (e.g., the NASA Mars lander)
Data heterogeneity
Homonyms Synonyms
Data quality
Missing Values
Random missing values
Delete row?
Paralyzed Veterans
Substitute value
Imputation
Multiple imputation: JMP 8 (!)
Data Collection
Happenstance data: why bother? We have big, parallel computers.
Designed surveys and experiments
Sample?
You bet! We even get error estimates. $599 with coupon from Amstat News
Presentation Medium
Overhead foils are still the best. Dallas in August, Orlando in August, Philadelphia in August, D.C. in January...
Prediction is often most important
Computation matters
Results are actionable
Variable selection and overfitting are problems
Note: this process model borrows from CRISP-DM, the CRoss Industry Standard Process for Data Mining.
Success depends more on the way you mine the data than on the specific tool.
Associations?
The most associated variables in the census
The most associated variables in a supermarket
Association rules
Why Models?
Beer and Diapers
In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated. Conclusions? Actions?
Models
Models are:
Powerful summaries for understanding
Used for exploration and prediction
Data Processing
Five months to consolidate the process data. Three months to analyze and reduce the dimension of the water data. Eight months after starting the project, the statisticians received a flat file:
960 ingots (rows) 149 variables
Confusion Matrix
Doctors in the ER:
                         Predicted heart attack   Predicted no heart attack
Actual heart attack            0.89                      0.11
Actual no heart attack         0.75                      0.25

Tree algorithm (Goldman):
                         Predicted heart attack   Predicted no heart attack
Actual heart attack            0.92                      0.08
Actual no heart attack         0.04                      0.96
(Titanic data: 710 survivors and 1,491 non-survivors of 2,201 total, broken down by class.)
Mosaic Plot
Tree Model
(Tree model for Titanic survival: splits on sex, adult/child, and class; leaf survival rates include 14%, 23%, 33%, 46%, and 93%.)
(Figure: yes/no responses plotted against Household Income.)
Regression Tree
(Regression tree for car mileage: splits on Price, Weight, Disp., HP, and Reliability; leaf means range from 18.60 to 34.00.)
Tree Advantages
The model explains its reasoning (builds rules)
Builds the model quickly
Handles non-numeric data
No problems with missing data: missing data as a new value; surrogate splits
Can mix continuous and categorical predictors
Selects subsets of predictors easily
Can treat missing values as another category
First Tree
All rows: Count 912, Mean 8.156, Std Dev 18.003
Alloy in (6045, 7348, 8234, 2345, 3234): Count 697, Mean 4.658, Std Dev 12.312
Alloy in (5434, 5894, 2439): Count 215, Mean 19.496, Std Dev 26.791
We know that some alloys are hard to make. That's why we gave you the data in the first place.
Second Tree
All rows: Count 912, Mean 8.156, Std Dev 18.003
CR < 2.72: Count 680, Mean 4.988, Std Dev 13.444
CR >= 2.72: Count 232, Mean 17.443, Std Dev 25.115
TARGET_D >= 1: Count 4,792; Prob(level 0) = 0.0000, Prob(level 1) = 1.0000
TARGET_D < 1: Count 89,857; Prob(level 0) = 1.0000, Prob(level 1) = 0.0000
Types of Models
Descriptions
Classification (categorical or discrete values)
Regression (continuous values)
Time series (continuous values)
Clustering
Association
Model Building
Model building
Train Test
Evaluate
Overfitting in Regression
Classical overfitting:
Fit 6th order polynomial to 6 data points
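A sketch of the idea with my own toy data: six points are already interpolated exactly by a degree-5 polynomial, so the training error is zero while the fitted curve swings wildly away from the data.

```python
def interpolate(xs, ys):
    # Lagrange form of the unique degree-(n-1) polynomial through n points.
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 1.9, 3.2, 2.8, 4.1]   # noisy, roughly increasing data
f = interpolate(xs, ys)
train_error = max(abs(f(x) - y) for x, y in zip(xs, ys))
# Zero error on the training points, but f(6) shoots far above the data:
# the model has memorized the noise.
```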
Overfitting
Fitting non-explanatory variables to data. Overfitting is the result of:
Including too many predictor variables Lack of regularizing the model
Neural net run too long Decision tree too deep
Avoiding Overfitting
Avoiding overfitting is a balancing act: Occam's Razor.
Fit fewer variables rather than more
Have a reason for including a variable (other than that it is in the database)
Regularize (don't overtrain)
Know your field.
"All models should be as simple as possible, but no simpler than necessary." (attributed to Albert Einstein)
Significance
Costs (symmetric?)
Statistical reasonableness
Sensitivity
Compute the value of decisions
The "so what" test
Simple Validation
Method: split the data into a training set and a testing set (a third set may also be held out for validation).
Advantages: easy to use and understand; gives a good estimate of prediction error for reasonably large data sets.
Disadvantages: you lose 20% to 30% of the data from model building.
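The split itself is a few lines. A sketch (function name, fraction, and seed are my own choices):

```python
import random

def train_test_split(rows, test_frac=0.25, seed=42):
    # Randomly hold out a fraction of the rows for honest error estimation.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(1000)))
```

Fitting on `train` and scoring on `test` gives the honest error estimate the slide describes.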
(Figure: 10-fold cross-validation; the data are split into folds 1-10, and each fold in turn serves as the test set.)
External
Used for model comparison
Compute R2 on the test set
Just the squared correlation of the predicted and actual values
Regularization
A model can be built to closely fit the training set but not the real data. Symptom: the errors in the training set shrink, but they grow in the test or validation sets. Regularization minimizes the residual sum of squares adjusted for model complexity: use a smaller decision tree (or prune it); in neural nets, avoid over-training.
Toy Problem
(Scatterplots of the response train2$y against each predictor train2[, i].)
Tree Model
(Large tree model: dozens of splits on x1-x5, with one on x8; leaf predictions include 10.64, 16.83, 17.69, 19.47, and 25.32.)
67.2% Test
(Plot: actual y against the tree model's predictions.)
67.2% Test
Linear Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept     -0.900      0.482     -1.860     0.063
x1             4.658      0.292     15.950    <.0001
x2             4.685      0.294     15.920    <.0001
x3            -0.040      0.291     -0.140     0.892
x4             9.806      0.298     32.940    <.0001
x5             5.361      0.281     19.090    <.0001
x6             0.369      0.284      1.300     0.194
x7             0.001      0.291      0.000     0.998
x8            -0.110      0.295     -0.370     0.714
x9             0.467      0.301      1.550     0.122
x10           -0.200      0.289     -0.710     0.479
69.4% Test
Stepwise Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept     -0.625      0.309     -2.019     0.0439
x1             4.619      0.289     15.998    <.0001
x2             4.665      0.292     15.984    <.0001
x4             9.824      0.296     33.176    <.0001
x5             5.366      0.280     19.145    <.0001
69.8% Test
(Larger regression with additional terms: 19 coefficients; t ratios range from -7.68 to 53.79.)
88.8% Test
Next Steps
Higher-order terms? When to stop? Transformations?
Too simple: underfitting (bias). Too complex: inconsistent predictions, overfitting (high variance).
Selecting models is Occam's razor.
Keep goals of interpretation vs. prediction in mind
Logistic Regression
What happens if we use linear regression on 1-0 (yes/no) data?
(Plot: 0/1 responses against Income, 20,000-80,000, with a straight-line fit.)
Logistic Regression II
Points on the line can be interpreted as probabilities, but they don't stay within [0, 1]. Use a sigmoidal function instead of a linear function to fit the data:
f(I) = 1 / (1 + e^(-I))
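The sigmoid is a one-liner. A sketch (the income coefficients below are made up purely for illustration):

```python
from math import exp

def sigmoid(z):
    # The logistic function maps any real input into (0, 1), so fitted
    # values can be read as probabilities.
    return 1.0 / (1.0 + exp(-z))

# Illustrative only: hypothetical coefficients for a response vs. income.
p = sigmoid(-4.0 + 0.0001 * 30000)
```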
Regression - Summary
Often works well
Easy to use
Theory gives prediction and confidence intervals
The key is variable selection, with interactions and transformations
Use logistic regression for binary data
Scatterplot Smoother
Bivariate Fit of Euro/USD By Time
(Plot: Euro/USD vs. Time, 1999-2002, with a smoothing spline fit, lambda = 10.44403.)
Less Smoothing
Usually these smoothers have choices on how much smoothing
Bivariate Fit of Euro/USD By Time
(The same plot with less smoothing: smoothing spline fit, lambda = 0.001478.)
Today
Bivariate Fit of Euro/USD By Time
(Plot: Euro/USD vs. Time, 1999-2008; the rate rises from about 0.9 to nearly 1.6.)
More Dimensions
Why not smooth using 10 predictors?
The curse of dimensionality. With 10 predictors, if we use 10% of each axis as a neighborhood, how many points do we need to get 100 points in the cube? Conversely, to capture 10% of the points, what percentage of each predictor's range do we need to take? We need a new approach.
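The arithmetic behind both questions (variable names are mine; uniform data assumed):

```python
p = 10          # number of predictors
side = 0.10     # fraction of each axis in the "neighborhood"

volume = side ** p              # fraction of the unit cube covered: 10^-10
points_needed = 100 / volume    # ~10^12 uniform points for 100 neighbors

# Conversely, to capture 10% of the volume (and hence ~10% of the points):
side_needed = 0.10 ** (1 / p)   # about 0.79: 79% of each predictor's range
```

A "local" neighborhood in 10 dimensions is either almost empty or not local at all.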
Additive Model
Can't get
y = f(x1, ..., xp)
So, simplify to:
y = f1(x1) + f2(x2) + ... + fp(xp)
Each of the fi is easy to find:
scatterplot smoothers
zi = b1*x1 + ... + bp*xp
Principal components
Factor analysis
Multidimensional scaling
Examples
If the f's are sigmoidal (by definition), this is called a neural network.
Neural Nets
They don't resemble the brain.
They are a statistical model; the closest relative is projection pursuit regression.
A Single Neuron
(Diagram: inputs x1-x5 and a constant x0 feed a single node with weights 0.3, 0.7, -0.2, 0.4, -0.5 and bias 0.8; the node computes z1 = 0.8 + 0.3x1 + 0.7x2 - 0.2x3 + 0.4x4 - 0.5x5 and outputs h(z1).)
Single Node
Input to the outer layer from hidden node l: z_l = Σ_j w1_jl x_j + θ_l
Output: y_k = h(z_l)
Layered Architecture
(Diagram: input layer x1, x2; hidden layer z1, z2, z3; output layer.)
Neural Networks
Create lots of features: hidden nodes
z_l = Σ_j w1_jl x_j + θ_l
y_k = w2_1 h(z_1) + w2_2 h(z_2) + ... + θ_k
Put It Together
ŷ_k = h( Σ_l w2_kl h( Σ_j w1_jl x_j + θ_l ) + θ_k )
The resulting model is just a flexible nonlinear regression of the response on a set of predictor variables.
R²: 89.5% Train, 87.7% Test
Interpretation?
MARS Output
(MARS output, version 3.5: forward stepwise knot placement; the GCV criterion drops from 25.67 to a minimum of about 5.05 and then climbs as more basis functions are added.)
94.3% Test
(Plot: actual Y against the model ESTIMATE.)
K-Nearest Neighbors (KNN)
To predict y for an x:
Find the k most similar x's; average their y's
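That recipe can be sketched in a few lines (a 1-D toy; the function name and data are illustrative):

```python
def knn_predict(x, data, k=3):
    # data is a list of (x, y) pairs: find the k most similar x's
    # and average their y's.
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

data = [(1, 10.0), (2, 12.0), (3, 11.0), (8, 30.0), (9, 33.0)]
pred = knn_predict(2.2, data)   # averages the y's at x = 1, 2, 3
```

With real predictors, "similar" needs a distance over all variables, which is exactly where the curse of dimensionality bites.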
Collaborative Filtering
Goal: predict what movies people will like Data: list of movies each person has watched
Lyle: André, Star Wars
Ellen: André, Star Wars, Destin
Fred: Star Wars, Batman
Dean: Star Wars, Batman, Rambo
Jason: Destin d'Amélie Poulain, Caché
Data Base
Data can be represented as a sparse matrix
         Star Wars  Rambo  Batman  My Dinner w/André  Destin d'Amélie  Caché
Lyle         y                            y
Ellen        y                            y                  y
Fred         y                y
Dean         y         y      y
Jason                                                        y            y
Karen        ?         ?      ?           y                  ?            ?
Karen likes André. What else might she like? (CDNow doubled e-mail responses.)
Clustering
Turn the problem around: instead of predicting something about a variable, use the variables to group the observations.
K-means Hierarchical clustering
K-Means
Rather than find the K nearest neighbors, find K clusters. The problem is now to group observations into clusters rather than to predict. Not a predictive model, but a segmentation model.
Example
Final Grades
Homework, 3 midterms, final exam
Principal Components
The first is a weighted average; the second contrasts the 1st and 3rd midterms with the 2nd.
Scatterplot Matrix
Cluster Means
Biplot
Hierarchical Clustering
Define a distance between two observations
Find the closest observations and form a group
Add on to this to form hierarchy
French Cities
(Table of pairwise distances between 19 French cities: Bayonne, Besançon, Bordeaux, Brest, Caen, Calais, Dunkerque, Le Havre, Lyon, Marseille, Metz, Montpellier, Nancy, Nantes, Paris, Pau, Rennes, Toulon, Toulouse.)
Dendrogram
(Heatmap: gene-expression experiments, including ste mutants, shaded from fold repression to fold induction.)
Grade Example
Newer equipment is prohibitively expensive for the developing world. Early detection of breast cancer is crucial. The cumulative type I error over a decade is near 100%, leading to needless biopsies.
The Data
1618 mammograms showing clustered microcalcifications
Biostatistics Dept., Institut Curie
Variables
Response: malignant or not
Predictors: age; tissue type (light/dense); size (mm); number of microcalcifications; number of suspicious clusters; shape of microcalcifications (1-5); polyshape? (y/n); shape of cluster (1, 2, 3); retro (cluster near nipple?); deep? (y/n)
Tree model
Combining Models
In the 1950s, forecasters found that combining forecasting models worked better on average than any single forecast model.
Reduces variance by averaging Can reduce bias if collection is broader than single model
Boosting
Create repeated samples of weighted data
Weights based on misclassification
Combine by majority rule, or by a linear combination of predictions
Random Forests
Issues to decide
How large a tree?
How many trees?
How many predictors to select at random for each tree?
MART
Boosting, version 1:
Use logistic regression. Weight observations by misclassification: upweight your mistakes. Repeat on the reweighted data. Take a majority vote.
Boosting, version 2:
Use CART with 4-8 nodes. Fit each new tree to the residuals. Repeat many, many times. Take the prediction to be the sum of all these trees.
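Version 2 is, in effect, gradient boosting on residuals. A minimal sketch under my own simplifications (one-split "stumps" instead of 4-8 node CART trees; toy 1-D data; all names illustrative):

```python
def fit_stump(xs, ys):
    # One-split regression tree: pick the split minimizing squared error.
    best = None
    for s in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr

def boost(xs, ys, n_trees=50, rate=0.3):
    # Repeatedly fit a tiny tree to the current residuals; the final
    # model is the (damped) sum of all the trees.
    trees, resid = [], ys[:]
    for _ in range(n_trees):
        t = fit_stump(xs, resid)
        resid = [r - rate * t(x) for x, r in zip(xs, resid)]
        trees.append(t)
    return lambda x: sum(rate * t(x) for t in trees)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
model = boost(xs, ys)
```

Each weak tree contributes only a damped step, which is why many iterations are needed and why the procedure resists overfitting longer than a single deep tree.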
Upshot of MART
Robust, because of the loss function and because we use trees
Low interaction order, because we use small trees (adjustable)
Reuses all the data after each tree
MART in action
Training and test absolute error
(Plot: absolute error vs. iterations, 0-200.)
More MART
Training and test absolute error
(The same plot, continued to 5,000 iterations.)
MART summary
Relative Variable Importance
(Bar chart of relative variable importance: x4 and x5 dominate; x7 matters least.)
(Partial dependence plots for x1 through x6.)
Interaction order?
(Diagnostics comparing additive vs. multiplicative fits for each variable; importances roughly x4 = 98, x5 and x1 = 89, down to x7 = 14.)
Pairplots
(Pairwise partial dependence plot for x1 and x2.)
MART Results
(Plot: actual Y against the MART model's RESPONSE predictions.)
78.7% Test
Results
Split the data into train and test sets (62.5% / 37.5%); repeat the random split 1,000 times.
For each iteration, count false positives and false negatives on the 600 test set cases
                  False Positives   False Negatives
Simple tree           32.20%            33.70%
Neural network        25.50%            31.70%
Boosted trees         24.90%            32.50%
Bagged trees          19.30%            28.80%
Radiologists          22.40%            35.80%
Where to Start
Three rules of data analysis
Draw a picture Draw a picture Draw a picture
Build model
Compare model on small subset with full predictive model
More Realistic
250 predictors
200 continuous
50 categorical
(Tree model: splits on x1, x2, x4, and x5; leaf predictions include 2.000 and 4.570.)
Brushing
Fast Fail
Not every modeling effort is a success
A model search can save lots of queries
The data took 8 months to get ready. The analyst spent 2 months exploring it. Tree models, stepwise regression (and a neural network running for several hours) found no out-of-sample predictive ability.
Automatic Models
KXEN
KXEN on PVA
Lift Curve
(Lift curve for samples of up to 100,000 mailings.)
Exploratory Model
Tree Model
A tree model on the 40 key variables identified by KXEN:
Very similar performance to the KXEN model, but coarser. Based only on:
RFA_2, Lastdate, Nextdate, Lastgift, Cardprom
A neural network shows that ZIP code is the most important predictor.
Zip Code?
PVA data
Useful predictor increased sales 40%
Depression Study
Identified critical intervention point at 2 weeks
Ingots
Gave clues as to where to look Experimental design followed
Churches
When to quit
Printers
When to experiment, and with what factors
KDNuggets
http://www.kdnuggets.com
deveaux@williams.edu
M. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997
Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999
D.J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd edition, Springer