Professional Documents
Culture Documents
Predictive Performance
Predictive Performance
PREDICTIVE PERFORMANCE
HOW TO EVALUATE CLASSIFICATION & ESTIMATION PERFORMANCE
CONTENT
1. INTRODUCTION
▪
INTRODUCTION
▪ →
▪
▪
▪
→
▪
▪
▪
INTRODUCTION
▪
INTRODUCTION
▪
▪
INTRODUCTION
▪
▪
ESTIMATION ACCURACY MEASURES
ESTIMATION ACCURACY MEASURES
Ŷ
Ŷ
1 𝑛
▪ σ ε
𝑛 𝑖=1 𝑖
1 𝑛
▪ σ ε𝑖
𝑛 𝑖=1
▪ σ𝑛𝑖=1 ε2𝑖
1 𝑛
▪ σ ε /𝑌
𝑛 𝑖=1 𝑖 𝑖
1 𝑛
▪ σ ε𝑖 /𝑌𝑖
𝑛 𝑖=1
1 𝑛
▪ σ𝑖=1 ε2𝑖
𝑛
TOYOTA COROLLA EXAMPLE
TOYOTA COROLLA EXAMPLE
TOYOTA COROLLA EXAMPLE
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, roc_curve, auc
import matplotlib.pylab as plt
https://github.com/nnbphuong/datascience4biz/blob/
master/Predictive_Performance.ipynb
TOYOTA COROLLA EXAMPLE
# Reduce data frame to the top 1000 rows and select columns for regression analysis
car_df = pd.read_csv(’ToyotaCorolla.csv’)
# create a list of predictor variables by remvoing output variables and text columns
excludeColumns = (’Price’, ’Id’, ’Model’, ’Fuel_Type’, ’Color’)
predictors = [s for s in car_df.columns if s not in excludeColumns]
outcome = ’Price’
# partition data
X = car_df[predictors]
y = car_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y,
test_size=0.4, random_state=1)
# train linear regression model
reg = LinearRegression()
reg.fit(train_X, train_y)
# evaluate performance
# training
regressionSummary(train_y, reg.predict(train_X))
# validation
regressionSummary(valid_y, reg.predict(valid_X))
TOYOTA COROLLA EXAMPLE
# training
Regression statistics
# validation
Regression statistics
▪
CLASSIFICATION ACCURACY MEASURES
→
→
▪
→
CLASSIFICATION ACCURACY MEASURES
ni, j is the number of records that are class Ci members and were classified as Cj
members
Important class
Confusion Matrix (Classification Matrix)
Predicted Actual Class
Class C1 C2
n1,1= No. of C1 cases classified n2,1= No. of C2 cases classified
C1
correctly as C1 incorrectly as C1
n1,2= No. of C1 cases classified n2,2= No. of C2 cases classified
C2
incorrectly as C2 correctly as C2
(Error rate)
𝑛 +𝑛 𝑛1,1
Misclassification rate = err = 1,2 𝑛 2,1 Sensitivity (Recall) = 𝑛
1,1 +𝑛1,2
▪
RIDING MOWERS EXAMPLE
Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
▪ 51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
▪ 64.8 17.2 non-owner
43.2 20.4 non-owner
▪ 84.0 17.6 non-owner
▪ 49.2
59.4
17.6
16.0
non-owner
non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
RIDING MOWERS EXAMPLE
24 Records with Their Actual Class and the Probability (Propensity) of Them Being Class
“owner” Members, as Estimated by a Classifier (ownerExample.csv)
RIDING MOWERS EXAMPLE
owner_df = pd.read_csv('ownerExample.csv')
## cutoff = 0.5
predicted = ['owner' if p > 0.5 else 'nonowner' for p in
owner_df.Probability]
classificationSummary(owner_df.Class, predicted,
class_names=['nonowner', 'owner’])
## cutoff = 0.25
predicted = ['owner' if p > 0.25 else 'nonowner' for p in
owner_df.Probability]
classificationSummary(owner_df.Class, predicted,
class_names=['nonowner', 'owner'])
## cutoff = 0.75
predicted = ['owner' if p > 0.75 else 'nonowner' for p in
owner_df.Probability]
classificationSummary(owner_df.Class, predicted,
class_names=['nonowner', 'owner'])
RIDING MOWERS EXAMPLE
# cutoff = 0.5
Confusion Matrix (Accuracy 0.8750)
Prediction
Actual nonowner owner
nonowner 10 2
owner 1 11
Prediction Prediction
Actual nonowner owner Actual nonowner owner
nonowner 8 4 nonowner 11 1
owner 1 11 owner 5 7
RIDING MOWERS EXAMPLE
df = pd.read_csv('liftExample.csv')
▪
RIDING MOWERS EXAMPLE
from sklearn.metrics import roc_curve, auc
plt.figure(figsize=[5, 5])
plt.plot(fpr, tpr, color='darkorange',
lw=2, label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(loc="lower right")
RIDING MOWERS EXAMPLE