IE 527 Intelligent Engineering Systems: Basic Concepts Model/performance Evaluation Overfitting
Basic concepts
Model/performance evaluation
Overfitting
Descriptive modeling
Predictive modeling
The model should both fit the input data well and
correctly predict the class labels of unknown records
(generalization).
[Figure: general approach for building a classification model. A learning
algorithm is applied to the Training Set to learn a model (induction); the
model is then applied to the Test Set to predict class labels (deduction).]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Induced Model (Decision Tree):

Root/internal nodes hold the splitting attributes; leaf nodes hold the class
labels.

Refund = Yes -> NO
Refund = No  -> MarSt
    MarSt = Married          -> NO
    MarSt = Single, Divorced -> TaxInc
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES
An alternative model can be induced from the same training data, this time
with MarSt as the root split:

MarSt = Married          -> NO
MarSt = Single, Divorced -> Refund
    Refund = Yes -> NO
    Refund = No  -> TaxInc
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES
Applying the model to test data. Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Starting from the root: Refund = No -> MarSt = Married -> leaf NO.
Assign Cheat to No.
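The induced tree above can be sketched as a small Python function (an illustrative translation of the slide's tree, not code from the course):

```python
def classify(refund, marital_status, taxable_income):
    """Apply the induced decision tree to one record."""
    if refund == "Yes":
        return "No"                      # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                      # MarSt = Married -> leaf NO
    # MarSt = Single or Divorced -> split on Taxable Income
    return "No" if taxable_income < 80_000 else "Yes"

# Test record from the slide: Refund = No, Married, 80K
print(classify("No", "Married", 80_000))  # -> No
```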
Confusion Matrix

                          PREDICTED CLASS
                          Class = Yes (1)   Class = No (0)
ACTUAL   Class = Yes (1)  201 (TP)          85 (FN)
CLASS    Class = No (0)   25 (FP)           2689 (TN)

Performance metrics
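The usual metrics for this matrix can be computed directly (a minimal sketch; the metric definitions are the standard ones, not taken from the slide):

```python
TP, FN, FP, TN = 201, 85, 25, 2689
n = TP + FN + FP + TN

accuracy  = (TP + TN) / n
precision = TP / (TP + FP)
recall    = TP / (TP + FN)          # a.k.a. TPR, sensitivity
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```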
In general notation:

                  PREDICTED CLASS
                  +          -
ACTUAL   +        f++ (TP)   f+- (FN)
CLASS    -        f-+ (FP)   f-- (TN)
Cost matrix: weights each kind of outcome by its importance
(+: rare but important; -: less important).

Total cost of model M:  Ctotal(M) = sum over i, j of f(i,j) * C(i,j)

Cost Matrix C(i,j):

                  PREDICTED CLASS
                  +      -
ACTUAL   +        -1     100
CLASS    -        1      0

Model M1:                       Model M2 (or Attr. A2):

         PREDICTED                       PREDICTED
         +      -                        +      -
ACTUAL + 150    40              ACTUAL + 250    45
CLASS  - 60     250             CLASS  - 5      200
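Plugging these counts into Ctotal shows why accuracy and cost can disagree: M2 is the more accurate model, yet M1 has the lower total cost (a sketch assuming the cost matrix and count matrices above, in [[TP, FN], [FP, TN]] order):

```python
def total_cost(f, C):
    """Ctotal(M) = sum_ij f(i,j) * C(i,j); matrices ordered [[TP, FN], [FP, TN]]."""
    return sum(f[i][j] * C[i][j] for i in range(2) for j in range(2))

C  = [[-1, 100], [1, 0]]        # cost matrix from the slide
M1 = [[150, 40], [60, 250]]
M2 = [[250, 45], [5, 200]]

for name, f in [("M1", M1), ("M2", M2)]:
    n = sum(map(sum, f))
    acc = (f[0][0] + f[1][1]) / n
    print(name, "accuracy =", acc, "cost =", total_cost(f, C))
```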
Majority vote

Typical decision rule for binary classification (at a leaf in a DT): a leaf
node is labeled with the majority class (by probability). For example, to
assign +:

p(+) > p(-) = 1 - p(+)   <=>   p(+) > 0.5

Typically the cutoff value is set to 0.5, assuming symmetric costs, which
gives the lowest error rate.

Cost-sensitive

Assign to a leaf node the class label j that minimizes the labeling cost:

C(j) = sum over i of p(i) * C(i,j)

For example, to assign + (assuming C(+,+) = C(-,-) = 0):

C(-) > C(+):  p(+) * C(+,-) > p(-) * C(-,+) = (1 - p(+)) * C(-,+)
         <=>  p(+) > C(-,+) / [C(-,+) + C(+,-)]

If C(-,+) < C(+,-) (an FN is more expensive than an FP, i.e., + is more
important), the cutoff is < 0.5, allowing more records to be labeled +.
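The cost-sensitive cutoff derived above can be sketched as a one-line helper (illustrative, assuming C(+,+) = C(-,-) = 0 as in the slide):

```python
def cost_sensitive_cutoff(c_fp, c_fn):
    """Label a leaf + when p(+) > C(-,+) / (C(-,+) + C(+,-)).

    c_fp = C(-,+): cost of a false positive
    c_fn = C(+,-): cost of a false negative
    """
    return c_fp / (c_fp + c_fn)

print(cost_sensitive_cutoff(1, 1))    # symmetric costs -> cutoff 0.5
print(cost_sensitive_cutoff(1, 100))  # FN 100x costlier -> cutoff ~0.0099
```

With symmetric costs the rule reduces to the 0.5 majority vote; the costlier a false negative, the lower the cutoff drops, so more records get labeled +.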
Count

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

n = a + b + c + d
Accuracy = (a + d) / n

Cost (every correct prediction costs p, every error costs q):

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    p           q
CLASS    Class=No     q           p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(n - a - d)
     = q*n - (q - p)(a + d)
     = n * [q - (q - p) * Accuracy]

So under this cost structure, minimizing cost is equivalent to maximizing
accuracy.
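The algebra above is easy to verify numerically (the counts below are illustrative, not from the slide):

```python
# Verify Cost = n * [q - (q - p) * Accuracy] on sample counts.
a, b, c, d = 201, 85, 25, 2689      # TP, FN, FP, TN
p, q = 0, 1                          # cost of a correct / incorrect prediction
n = a + b + c + d
accuracy = (a + d) / n

direct = p * (a + d) + q * (b + c)           # cost computed from counts
via_accuracy = n * (q - (q - p) * accuracy)  # cost computed from accuracy
print(direct, via_accuracy)                  # the two agree
```

With p = 0 and q = 1 the cost simply counts the errors, so it equals n * (1 - Accuracy).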
Holdout

Random subsampling
- Repeated holdout; can't control the number of times each record is used for
  testing and training.

Cross validation
- Utilizes as much data as possible for training; test sets are mutually
  exclusive.
- Leave-one-out (k = n): computationally expensive; high variance (only one
  record in each test set).
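The mutually exclusive test sets of k-fold cross validation can be sketched with plain index arithmetic (an illustrative helper, not course code):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split record indices 0..n-1 into k mutually exclusive test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)     # shuffle once, then deal into folds
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 5)
for test in folds:
    # every record outside the current test fold is used for training
    train = [i for f in folds if f is not test for i in f]
    print("test:", sorted(test), "train size:", len(train))
```

Each record lands in exactly one test fold, so over the k runs every record is used for testing exactly once and for training k - 1 times.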
Stratified sampling

For imbalanced classes, e.g., consider 100 + and 1000 - records:
- Undersampling for -: take a random sample of 100 - records, or use focused
  undersampling.
- Oversampling for + (the underrepresented class): replicate + records until
  (no. of +) = (no. of -), or generate new + records by interpolation;
  overfitting is possible.
- Hybrid: combine undersampling and oversampling.
- Bootstrap: sample records with replacement.
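Generating new + records by interpolation can be sketched as follows (a SMOTE-like illustration on made-up 2-D points; names and data are hypothetical):

```python
import random

def interpolate_oversample(minority, n_new, seed=0):
    """Generate new minority-class points by interpolating between
    random pairs of existing minority records."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # pick two distinct minority records
        t = rng.random()                 # interpolation factor in [0, 1]
        new_points.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return new_points

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]
synthetic = interpolate_oversample(minority, 4)
print(synthetic)
```

Each synthetic point lies on the segment between two real minority records, which is also why overfitting is possible: the new points carry no information beyond the originals.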
ROC (Receiver Operating Characteristic) curve

The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis) as the
classification cutoff varies.
- Diagonal line: random guessing.
- Ideal model: AUC (area under the curve) = 1.
- Models (e.g., M1 vs. M2) can be compared by their ROC curves and AUC values.
Example: 10 instances ranked by predicted probability P(+)
(5 + and 5 - instances):

Instance  P(+)   True Class
1         0.95   +
2         0.93   +
3         0.87   -
4         0.85   -
5         0.85   -
6         0.85   +
7         0.76   -
8         0.53   +
9         0.43   -
10        0.25   +

Cutoff Table (predict + when P(+) >= cutoff):

Cutoff  TP  FP  TN  FN  TPR  FPR
0.25    5   5   0   0   1.0  1.0
0.43    4   5   0   1   0.8  1.0
0.53    4   4   1   1   0.8  0.8
0.76    3   4   1   2   0.6  0.8
0.85    3   3   2   2   0.6  0.6
0.87    2   1   4   3   0.4  0.2
0.93    2   0   5   3   0.4  0.0
0.95    1   0   5   4   0.2  0.0
1.00    0   0   5   5   0.0  0.0

ROC Curve: plot the (FPR, TPR) pair for each cutoff and connect the points.
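The cutoff table and the AUC can be reproduced programmatically (a sketch using the example's scores and labels; the trapezoidal AUC is a standard computation, not from the slide):

```python
def roc_points(scores, labels):
    """(FPR, TPR) at each cutoff; predict + when score >= cutoff."""
    pos = labels.count("+")
    neg = labels.count("-")
    pts = [(0.0, 0.0)]                       # cutoff above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
pts = roc_points(scores, labels)
print(auc(pts))  # 0.56 for this example
```

Only the per-cutoff counts matter, so which of the three tied 0.85 instances is the + one does not change the curve.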
Confidence interval for accuracy

For n test records, acc is approximately normally distributed with mean p and
variance p(1 - p)/n:

P( -Z_{a/2} <= (acc - p) / sqrt(p(1 - p)/n) <= Z_{1-a/2} ) = 1 - a

Solving for p gives the confidence interval:

p = [ 2n*acc + Z^2_{a/2} +/- Z_{a/2} * sqrt( Z^2_{a/2} + 4n*acc - 4n*acc^2 ) ]
    / [ 2(n + Z^2_{a/2}) ]
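The interval formula above can be evaluated directly (a sketch; the example values acc = 0.80 and n = 100 are illustrative):

```python
from math import sqrt

def accuracy_ci(acc, n, z=1.96):
    """Confidence interval for the true accuracy p (normal approximation).

    z = Z_{a/2}; 1.96 corresponds to a 95% confidence level.
    """
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = accuracy_ci(acc=0.80, n=100)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")  # roughly (0.711, 0.867)
```

Note the interval is not symmetric around acc; it narrows and re-centers as n grows.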
Occam's Razor

Given two models with similar generalization errors, one should prefer the
simpler model over the more complex one. For a complex model, there is a
greater chance that it was fitted by chance or by noise in the data, and/or
that it overfits the data. Therefore, model complexity should be taken into
account when evaluating a model.