Download as pdf or txt
Download as pdf or txt
You are on page 1of 33

PHUONG NGUYEN

PREDICTIVE PERFORMANCE
HOW TO EVALUATE CLASSIFICATION & ESTIMATION PERFORMANCE
CONTENT
1. INTRODUCTION

2. ESTIMATION ACCURACY MEASURES

3. CLASSIFICATION ACCURACY MEASURES


INTRODUCTION


INTRODUCTION

▪ →






INTRODUCTION


INTRODUCTION


INTRODUCTION


ESTIMATION ACCURACY MEASURES
ESTIMATION ACCURACY MEASURES
 Ŷ
Ŷ
1 𝑛
▪ σ ε
𝑛 𝑖=1 𝑖
1 𝑛
▪ σ ε𝑖
𝑛 𝑖=1
▪ σ𝑛𝑖=1 ε2𝑖
1 𝑛
▪ σ ε /𝑌
𝑛 𝑖=1 𝑖 𝑖
1 𝑛
▪ σ ε𝑖 /𝑌𝑖
𝑛 𝑖=1

1 𝑛
▪ σ𝑖=1 ε2𝑖
𝑛
TOYOTA COROLLA EXAMPLE
TOYOTA COROLLA EXAMPLE
TOYOTA COROLLA EXAMPLE
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, roc_curve, auc
import matplotlib.pylab as plt

from dmba import regressionSummary, classificationSummary


from dmba import liftChart, gainsChart

https://github.com/nnbphuong/datascience4biz/blob/
master/Predictive_Performance.ipynb
TOYOTA COROLLA EXAMPLE
# Reduce data frame to the top 1000 rows and select columns for regression analysis
car_df = pd.read_csv(’ToyotaCorolla.csv’)

# create a list of predictor variables by remvoing output variables and text columns
excludeColumns = (’Price’, ’Id’, ’Model’, ’Fuel_Type’, ’Color’)
predictors = [s for s in car_df.columns if s not in excludeColumns]
outcome = ’Price’

# partition data
X = car_df[predictors]
y = car_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y,
test_size=0.4, random_state=1)
# train linear regression model
reg = LinearRegression()
reg.fit(train_X, train_y)

# evaluate performance
# training
regressionSummary(train_y, reg.predict(train_X))
# validation
regressionSummary(valid_y, reg.predict(valid_X))
TOYOTA COROLLA EXAMPLE
# training
Regression statistics

Mean Error (ME) : 0.0000


Root Mean Squared Error (RMSE) : 1121.0606
Mean Absolute Error (MAE) : 811.6770
Mean Percentage Error (MPE) : -0.8630
Mean Absolute Percentage Error (MAPE) : 8.0054

# validation
Regression statistics

Mean Error (ME) : 97.1891


Root Mean Squared Error (RMSE) : 1382.0352
Mean Absolute Error (MAE) : 880.1396
Mean Percentage Error (MPE) : 0.0138
Mean Absolute Percentage Error (MAPE) : 8.8744
TOYOTA COROLLA EXAMPLE
CLASSIFICATION ACCURACY MEASURES
CLASSIFICATION ACCURACY MEASURES


CLASSIFICATION ACCURACY MEASURES



CLASSIFICATION ACCURACY MEASURES
ni, j is the number of records that are class Ci members and were classified as Cj
members
Important class
Confusion Matrix (Classification Matrix)
Predicted Actual Class
Class C1 C2
n1,1= No. of C1 cases classified n2,1= No. of C2 cases classified
C1
correctly as C1 incorrectly as C1
n1,2= No. of C1 cases classified n2,2= No. of C2 cases classified
C2
incorrectly as C2 correctly as C2

(Error rate)
𝑛 +𝑛 𝑛1,1
Misclassification rate = err = 1,2 𝑛 2,1 Sensitivity (Recall) = 𝑛
1,1 +𝑛1,2

𝑛1,1 +𝑛2,2 𝑛2,2


Accuracy = 1 - err = Specificity =
𝑛 𝑛2,1 +𝑛2,2

n is the total number of records in the validation data


CLASSIFICATION ACCURACY MEASURES
PROPENSITIES AND CUTOFF FOR CLASSIFICATION


RIDING MOWERS EXAMPLE
Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
▪ 51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
▪ 64.8 17.2 non-owner
43.2 20.4 non-owner
▪ 84.0 17.6 non-owner
▪ 49.2
59.4
17.6
16.0
non-owner
non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
RIDING MOWERS EXAMPLE

24 Records with Their Actual Class and the Probability (Propensity) of Them Being Class
“owner” Members, as Estimated by a Classifier (ownerExample.csv)
RIDING MOWERS EXAMPLE
owner_df = pd.read_csv('ownerExample.csv')

## cutoff = 0.5
predicted = ['owner' if p > 0.5 else 'nonowner' for p in
owner_df.Probability]
classificationSummary(owner_df.Class, predicted,
class_names=['nonowner', 'owner’])

## cutoff = 0.25
predicted = ['owner' if p > 0.25 else 'nonowner' for p in
owner_df.Probability]
classificationSummary(owner_df.Class, predicted,
class_names=['nonowner', 'owner'])

## cutoff = 0.75
predicted = ['owner' if p > 0.75 else 'nonowner' for p in
owner_df.Probability]
classificationSummary(owner_df.Class, predicted,
class_names=['nonowner', 'owner'])
RIDING MOWERS EXAMPLE
# cutoff = 0.5
Confusion Matrix (Accuracy 0.8750)

Prediction
Actual nonowner owner
nonowner 10 2
owner 1 11

# cutoff = 0.25 # cutoff = 0.75


Confusion Matrix (Accuracy 0.7917) Confusion Matrix (Accuracy 0.7500)

Prediction Prediction
Actual nonowner owner Actual nonowner owner
nonowner 8 4 nonowner 11 1
owner 1 11 owner 5 7
RIDING MOWERS EXAMPLE
df = pd.read_csv('liftExample.csv')

cutoffs = [i * 0.1 for i in range(0, 11)]


accT = []
for cutoff in cutoffs:
predicted = [1 if p > cutoff else 0 for p in df.prob]
accT.append(accuracy_score(df.actual, predicted))

line_accuracy = plt.plot(cutoffs, accT, '-', label='Accuracy')[0]


line_error = plt.plot(cutoffs, [1 - acc for acc in accT], '--’,
label='Overall error')[0]
plt.ylim([0,1])
plt.xlabel('Cutoff Value')
plt.legend(handles=[line_accuracy, line_error])
plt.show()
RIDING MOWERS EXAMPLE

Plotting accuracy and overall error as a function of the cutoff value


ROC CURVE


RIDING MOWERS EXAMPLE
from sklearn.metrics import roc_curve, auc

# compute ROC curve and AUC


fpr, tpr, _ = roc_curve(df.actual, df.prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=[5, 5])
plt.plot(fpr, tpr, color='darkorange',
lw=2, label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(loc="lower right")
RIDING MOWERS EXAMPLE

ROC (Receiver Operating Characteristic) curves


SUMMARY

You might also like