Evidence in Favor of Weight of Evidence and Binning Transformations for Predictive Modeling

Abstract
Weight of Evidence (WOE) transformation of categorical variables is a technique that credit scoring professionals have used for decades. This paper investigates whether using the transformation improves predictive performance. For models without interaction terms, the combined use of weight of evidence and binning (discretization) of numeric variables improves predictive accuracy, while adding weight of evidence transformations without binning yields only marginal gains. This is consistent with the excellent results achieved on the Paralyzed Veterans of America KDD 98 data, where the best performance was achieved using both WOE and binning.

For models with interaction terms, the WOE transform improves model performance for 2 out of 3 data sets, and performance is the same for the third data set with or without WOE. WOE tends to improve logistic regression and I*-tuned logistic regression performance while degrading random forest performance slightly. WOE and the I* algorithm together thus reach peak predictive models, achieving area under the curve competitive with winning KDD benchmarks.

The combination of WOE and binning reduces performance for models with interaction terms. This makes sense in retrospect, as binning variables results in a loss of information about interactions among continuous variables.

WOE and binning thus improve model performance when used together in models without interactions, as practitioners have long held. However, when interaction effects exist in the data, they are more predictive than WOE and binning, and WOE should be used alone, since binning can destroy the predictive power of interaction effects. Interactions exist in a data set when random forest outperforms logistic regression out of the box (Sharma, 2011b).

Keywords: weight of evidence, discretization, classification, binary, logistic, I*, interaction, credit scoring, predictive modeling




Introduction

The goal of this study is to assess the impact of weight of evidence transformations and binning on logistic regression, random forests, and the I* algorithm, which builds logistic regression models using random forest variable importance. WOE, or weight of evidence, is defined as the log of the Bayes factor: the log of the ratio of the distributions of the evidence under one outcome versus the other in a binary classification problem (Good, 1985).
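
As a minimal sketch of this definition (the function and the 'good'/'bad' level names are illustrative, not taken from the appendix code), the WOE of each level of a categorical predictor x against a binary target y can be computed in R as:

woe_table <- function(x, y, good="good", bad="bad")
{
  tab <- table(x, y)
  p_good <- tab[, good] / sum(tab[, good])   # distribution of x among goods
  p_bad  <- tab[, bad] / sum(tab[, bad])     # distribution of x among bads
  log(p_good / p_bad)                        # log Bayes factor per level of x
}

A level that is relatively more common among goods than bads receives a positive WOE, and vice versa; substituting the resulting values for the original categories yields a single numeric column a linear model can use.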

Traditionally, the benefits of Weight of Evidence transformations of categorical variables for predictive modeling and classification have been that the transformation allows information from categorical variables to be included in model building that might otherwise be too much for regression to handle, and that it speeds up model building. The credit scoring community has long advocated the use of WOE in coarse-classification scorecards used with logistic regression, with the justification that using WOE and binning (discretization) improves performance by allowing nonlinearities to be modeled by a linear model (Leung et al., 2008).

Recently, the combination of WOE and binning was shown to produce excellent performance on the KDD 98 data set, which contains many categorical variables (Sharma, 2011a).

Data Sets

The 3 data sets used for this study are the German Credit data set (Frank, 2010), the Credit Card data set from PAKDD 2010 (available from http://sede.neurotech.com.br/PAKDD2010), and the Home Equity (HMEQ) data set from SAS®, available from www.sasenterpriseminer.com/data/HMEQ.xls. All 3 are binary classification data sets suitable for logistic regression.

Models

Logistic regression without tuning, random forests, and the I* algorithm, which builds logistic regression models using random forest variable importance, are compared using a 70% training sample, a 15% out-of-sample test set, and a 15% validation set. The performance reported below is for the validation data set.
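
As a minimal sketch of this design (the data frame name dataset is illustrative), the 70/15/15 partition can be produced as:

set.seed(42)
idx <- sample(nrow(dataset))
n_train <- floor(0.70 * nrow(dataset))
n_test <- floor(0.15 * nrow(dataset))
train <- dataset[idx[1:n_train], ]
test <- dataset[idx[(n_train + 1):(n_train + n_test)], ]
validate <- dataset[idx[(n_train + n_test + 1):nrow(dataset)], ]

The appendix code arrives at the same three samples inside tuneGlmToRf by splitting off 70% for training and then halving the remainder.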

Results and Discussion
WOE and binning enhance model performance for logistic regression and I* when no interaction terms are used. That said, the performance improvement from using interaction terms is greater than that from WOE/binning in these data sets. In the KDD 98 data set, with a large number of categorical variables and little in the way of predictive interaction effects, WOE and binning performed much better. This indicates that in the absence of predictive interaction effects it is optimal to use WOE and binning, while WOE alone is advisable in data sets with predictive interactions. To determine whether a data set has predictive interactions, compare random forest performance against logistic regression: if random forest performance is better, it is because of the presence of significant interaction effects in the data and quite possibly a more complex decision boundary (Sharma, 2011b). WOE and binning reduce random forest performance slightly. WOE by itself, with interaction terms, improves the performance of logistic regression and I*. However, WOE and binning applied to models with interactions result in a loss of performance, because binning loses information on predictive interactions. Interactions exist in a data set when random forest outperforms logistic regression out of the box (Sharma, 2011b), as sketched below.
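
A hedged sketch of this diagnostic, using the same ROCR idiom as the appendix code (train, test, and the target column y are illustrative names, y is assumed to be a factor, and test is assumed to have no missing values):

library(randomForest)
library(ROCR)

auc <- function(score, truth)
{
  attributes(performance(prediction(score, truth), "auc"))$y.values[[1]][1]
}

rf_fit <- randomForest(y ~ ., data=train, na.action=na.roughfix)
glm_fit <- glm(y ~ ., data=train, family=binomial)

auc_rf <- auc(predict(rf_fit, test, type="prob")[, 2], test$y)
auc_glm <- auc(predict(glm_fit, test, type="response"), test$y)

# a clearly higher random forest AUC suggests interaction effects worth mining
print(c(rf=auc_rf, glm=auc_glm))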

Conclusions

Recommendations based on this study:

• Using interaction terms enhances performance more than WOE when there are predictive interactions in the data.
• WOE enhances I* model performance and logistic regression performance and should be used as a default option.
• WOE and binning should be used together when no interactions are suspected or included in model building. This is the default setting used by credit scoring professionals.
• To determine whether to explore interaction terms, compare the out-of-the-box performance of random forests and logistic regression. If random forest performance is much better than that of logistic regression, the performance gap can be attributed to powerful and statistically significant interaction effects in the data (Sharma, 2011b).

References

Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Good, I.J. (1985). Weight of Evidence: A Brief Survey. Bayesian Statistics, Vol. 2, pp. 249-270. Retrieved from http://www.swrcb.ca.gov/water_issues/programs/tmdl/docs/303d_policydocs/207.pdf

KDD CUP 1998 Data. Retrieved from http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

Leung, K., Cheong, F., & Cheong, C. (2008). Building a Scorecard in Practice. WSPC Proceedings. Retrieved from http://www.aiecon.org/conference/2008/CIEF/Building%20a%20Scorecard%20in%20Practice.pdf

Sharma, D. (2011a). Optimal Response Modeling: Comparison of Logistic Regression, Random Forest and I* Algorithm. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1918410

Sharma, D. (2011b). I*: Optimizing Logistic Regression to Match Ensemble Performance Using Random Forest Variable Importance. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1858378

Appendix of R code

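# return_RF_Formula: builds a "+"-separated model formula string from the top-n
# variables of a fitted random forest, keeping only variables whose mean decrease
# in accuracy exceeds 0.025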
return_RF_Formula <-function (arf1,n=7)
{
imp <- importance(arf1, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]   # keep the last two importance columns
#sort(e[,1], dec=TRUE)

#only keep variables with info length(names(e[,1][e[,1]>0]))

a<-""
for (i in 1: min(n,length(sort(e[,1][e[,1]>0.025], dec=TRUE))))
{
a<-paste(a,(names(sort(e[,1][e[,1]>0.025], dec=TRUE))[i]),sep="")

if (i< min(n,length(sort(e[,1][e[,1]>0.025], dec=TRUE))))


{
a<-paste(a,("+"),sep="")
}

}
return (a)
}

getRfTerms<-function(rf, n=10)
{
rfterms<-character(0)

imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]

for (i in 1: min(n,length(sort(e[,1], dec=TRUE))))
{
print(paste("forest: ", names(sort(e[,1], dec=TRUE))[i], sep=""))
rfterms[length(rfterms)+1]= names(sort(e[,1], dec=TRUE))[i]
}

return (rfterms)
}

# getGlmTerms: returns a data frame of glm term names, coefficient estimates, and p-values
getGlmTerms<- function (m)
{
x <- array(dim=c(length( names(summary(m)$coefficients[,1])),3))

for (i in 1:length( names(summary(m)$coefficients[,1])))


{
#print(names(summary(m)$coefficients[,1])[i])
#print(summary(m)$coefficients[i])
#print (summary(m)$coefficients[i,4])

x[i,1]<-names(summary(m)$coefficients[,1])[i]
x[i,2]<-summary(m)$coefficients[i]
x[i,3]<-summary(m)$coefficients[i,4]

}
y<-data.frame(x,stringsAsFactors = FALSE)
return (y)
}

# barplotwithtext: barplot with the rounded values printed above each bar
barplotwithtext<- function(datainplot=c(1,2,3), y=c(0,1))
{
w<-barplot(datainplot,ylim=y);
text(cex=.5, x=w, y=datainplot+par("cxy")[2]/2, round(datainplot,2), xpd=TRUE)
}

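# EqualFreq2: equal-frequency binning; replaces each value of x with the index
# (1..n) of the roughly equal-sized bin it falls into by rank, randomly choosing
# which bins absorb the remainder when length(x) is not divisible by n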
EqualFreq2 <- function(x,n){


nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
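# e.g. EqualFreq2(rnorm(100), 10) returns 100 bin labels in 1..10, each label
# appearing exactly 10 times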

rsamp<-function(c, p)
{
d = sort(sample(nrow(c), nrow(c)*p))
#select training sample
train<-c[d,]
test<-c[-d,]

return(train)
}

#WOE
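# woe: optionally bins numeric columns into 10 equal-frequency bins, then for each
# categorical (or low-cardinality integer) column builds a lookup table via sqldf --
# per level, the share of all training rows in that level with the first target
# class, a rate-based proxy rather than the textbook log Bayes factor -- and merges
# it onto train and test as a WOE<name> column; the original categorical columns
# are then dropped and any remaining factors coerced to numeric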

woe<-function(data, test, target,bin=TRUE)


{
library(randomForest)
data<-na.roughfix(data)
test<-na.roughfix(test)

print("WOE1")
#library(discretization)
library(randomForest)
library(sqldf)

#ensure target is factor


data[,target]<-as.factor(data[,target])
test[,target]<-as.factor(test[,target])

len<-length(data)
#loop through training categorical variables, build a WOE lookup table, and apply it to test

for (i in 1:length(data))
{

if ( bin==TRUE & (class(data[,i])=='numeric' | class(data[,i])=='integer') )
{
data[,i]<-EqualFreq2(data[,i],10)
test[,i]<-EqualFreq2(test[,i],10)

#alternative transformations tried:
#data[,i]<-cut(data[,i], as.integer(sqrt(nrow(data))))
#data[,i]<- log(data[,i]+1.0)
#test[,i]<- log(test[,i]+1.0)
#data[,i]<-cut(data[,i],10)
#test[,i]<-cut(data[,i],10)
}

index<- match(target, colnames(data))

if ( (class(data[,i])=="factor" | ( class(data[,i])=="integer" &
      length(levels(as.factor(data[,i])))<=50) )
     & i!=index )
{

#print("non numeric")
name<-names(data)[i]

#print(name)
print(class(data[,i]))

a=paste("'",levels(data[,target])[1],"'",sep="")
b=paste("'",levels(data[,target])[2],"'",sep="")

#an earlier variant computed the odds-ratio form of WOE (b is only used here):
#b1=sprintf('select (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data where %s)) / (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data where %s)) %s', paste(target,"=",a,sep=""), paste(target,"=",a,sep=""), paste(target,"=",b,sep=""), paste(target,"=",b,sep=""), paste('WOE',names(data)[i],' , [', names(data)[i], '] from data group by [',names(data)[i], ']',sep=""))

#per level: share of all training rows in that level with the first target class
b1=(sprintf('select (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data)) %s', paste(target,"=",a,sep=""), paste('WOE',names(data)[i],' , [', names(data)[i], '] from data group by [',names(data)[i], ']',sep="") ))

sql<- b1

print(sql)
t<-sqldf(sql)

print('ran sql')
gc()
#data <-sqldf(paste('select * from data a, t b where a.',names(data)[i],' = b.',names(data)[i],sep=""))
print('merge1')
data<-merge(data,t)
test<-merge(test,t)

gc()

}
}

#remove the original categorical columns now represented by WOE columns
l<-vector()
for (i in 1:len)
{
index<- match(target, colnames(data))

if ( (class(data[,i])=="factor" | ( class(data[,i])=="integer" &
      length(levels(as.factor(data[,i])))<=50) )
     & i!=index )
{
name<-names(data)[i]
print(name)
l[i]<-name
}
}

myvars <- names(data) %in% c(l)


data <- data[!myvars]

myvars <- names(test) %in% c(l)


test <- test[!myvars]

gc()
#ensure woes all numeric
for (i in 1:length(data))
{

index<- match(target, colnames(data))

if (class(data[,i])=='factor' & i!=index )
{

data[,i] <- as.numeric(data[,i])+1
test[,i] <- as.numeric(test[,i])+1

#if ( regexpr("WOE", names(data)[i])[1]>=1)
#{
#data[,i]<- log(data[,i])
#test[,i]<- log(test[,i])
#}

}
}

#for NAs, replace the WOE with the median WOE value via na.roughfix
data<-na.roughfix(data)
test<-na.roughfix(test)

print('end woe')
return(list(data=data,test=test))
}

# input data; target = name of target classification variable; n = number of top most
# predictive variables to use for interaction mining based on rf
# rfsample = per-class sample size for the forest on large data sets; usually 500 works well
# sample = boolean flag for large data sets to sample 52k obs for model building
# nway = order of interactions to mine; higher orders are slower and not much better
# formStart = string model formula of a base existing model with interactions you may
# want to use as a starting point for model improvement
# you can run the function once and feed the resulting model back in to see if more
# predictive power can be added with additional interaction terms, applying the
# algorithm recursively on itself
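# in outline, tuneGlmToRf: (1) splits the data 70/15/15, (2) optionally applies woe()
# with or without binning, (3) fits a random forest to rank variables, (4) proposes
# interaction terms built from the top-n ranked variables, and (5) greedily keeps each
# interaction that raises out-of-sample AUC, reporting glm, tuned glm, and rf AUCs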

tuneGlmToRf<-function(data, target, n=7, rfsample=0, sample=FALSE, nway=1,
                      formStart=NA, woe=TRUE, bin=TRUE)
{
library(randomForest)
#if (nway==0)
data<-na.roughfix(data)

#if a starting formula was supplied, keep only the variables it references
if (!is.na(formStart))
{
d<-gsub("\\+",",",formStart)
d<-gsub("\\~",",",d)
d<-gsub("WOE","",d)   #strip the WOE prefix from transformed column names
print(d)
q<-unlist(strsplit(d, "\\,"))
myvars <- names(data) %in% c(q)

data<- data[myvars]
}

# if not using woe, convert factors to numeric, as rf can't handle factors with >=32 levels

if (woe==FALSE)
{
for (i in 1:length(data))
{
# if (length(levels(data[,i]))>=32) { data[,i]<-NULL }
if (is.factor(data[,i]) & names(data)[i]!=target) { data[,i]<-as.numeric(data[,i]) }
}
}

if (nrow(data)>52000 & sample==TRUE)
{
d = sort(sample(nrow(data), 52000))
data<-data[d,]

if (rfsample==0) { rfsample=500 }
}

options(warn=-1)
#library(fSeries)

#if (nway!=0)
#{
#data<- data.frame(substituteNA(data, type ="zeros"))
#for (i in 1:(length(names(data))))

#{
#{
# if (names(data[i]) != target)
# {
# data[,i] <- as.numeric( data[,i])
# print (1)
# }

# else { data[,i] <- as.factor( data[,i]) }


#}
#}

#}

formula <- paste(target,"~. ",sep="")


formularf<- as.formula(formula)

d = sort(sample(nrow(data), nrow(data)*.7))
#select training sample
train<-data[d,]
test<-data[-d,]
rm(data)
gc()

if (woe==TRUE)
{

if (bin==TRUE) { prep<-woe(train,test, target) }


if (bin==FALSE) { prep<-woe(train,test, target,FALSE) }

train<-prep$data
test<-prep$test
print("a")
}

if (nrow(train)>=34000) train<-train[1:34000,]

gc()
w= sort(sample(nrow(test), nrow(test)*.5))
test<-test[w,]
validate<-test[-w,]

print(length(train))
u<-length(train)
u<-as.integer(log(u,10)*sqrt(u))

library(randomForest)

library(ROCR)
set.seed(42)

if (rfsample > 0)
{
print("rf")
print(rfsample)

print(summary(train))
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
sampsize=c(rfsample,rfsample),
na.action=na.roughfix,
replace=FALSE)
}

else
{
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
}

print("C")
#subset train test and validate to terms that have predictive value
imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]

print("D")

#data<-subset(data,select=c(names(e[,1][e[,1]>0]),target))

train<-subset(train,select=c(names(e[,1][e[,1]>0]),target))

print("E")

test<-subset(test,select=c(names(e[,1][e[,1]>0]),target))

print("F")
validate<-subset(validate,select=c(names(e[,1][e[,1]>0]),target))

print("done woe")

#made perf lousy for all


#if (nway==0 & n==1)
#{
#data<-subset(data,select=c(head(names(e[,1][e[,1]>0]),30),target))
#train<-subset(train,select=c(head(names(e[,1][e[,1]>0]),30),target))
#test<-subset(test,select=c(head(names(e[,1][e[,1]>0]),30),target))
#validate<-subset(validate,select=c(head(names(e[,1][e[,1]>0]),30),target))
#
#}

u<-length(train)
u<-as.integer(log(u,10)*sqrt(u))

#rebuild forest with reduced data set


if (rfsample > 0)
{

rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
sampsize=c(rfsample,rfsample),
na.action=na.roughfix,
replace=FALSE)
}

else
{
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
}

rfterms<-getRfTerms(rf,n)
list_of_terms<-character(0)

form<- paste(formula, " * ",sep="")

#if (nway==0 ) form<-formula

for (k in 1:length(rfterms))
{
print(paste(form,rfterms[k],sep=""))

if ( k<=nway | nway ==1 )


{
print("a")

m<-glm(paste(form,rfterms[k],sep=""),data=train,family=binomial)
print("b")

if (k>nway & nway>1)


{
f<- paste(form,rfterms[k],sep="")
for (p in 1:nway)
{
f<-paste(f, " * ", rfterms[k-p],sep="")
print(f)
}

#paste(form,rfterms[k]," * ", rfterms[k-1], " * ", rfterms[k-2]," * ",rfterms[k-3]," * ",rfterms[k-4]," * ",rfterms[k-5], sep="")
m<-glm(f,data=train,family=binomial)
print("c")

}
print("d")

if (nway==0) m<-glm(formula,data=train,family=binomial)

print(summary(m))
y <- getGlmTerms(m)
y<- na.omit(y)
rm(m)
gc()

y$X2<- as.numeric(y$X2)
y$X3<- as.numeric(y$X3)

for (j in 1: nrow(y))
{
print("Adding term")

if ((y[j,3]<=.25 | runif(1, 0, .61)< .16 | TRUE ) & (regexpr(":", y[j,1]) > 0) | ( nway==0 & !(regexpr("Intercept", y[j,1]) > 0) ))
#print(names(y[1,j]))
list_of_terms[length(list_of_terms)+1]= y[j,1]
}
}
}

############# do the same as above by updating the formula to include 1-way
############# interactions against the final model #################

maxperf<-0

morig<-glm(formula,data=train,family=binomial)

test$score1<-predict(morig,type='response',test)
pred1<-prediction(test$score1,test[target])
perf1 <- performance(pred1,"tpr","fpr")

validate$score1a<-predict(morig,type='response',validate)
pred1a<-prediction(validate$score1a,validate[target])
perf1a <- performance(pred1a,"tpr","fpr")

perforig<-attributes(performance(pred1a, "auc"))$y.values[[1]][1]

maxperf<-attributes(performance(pred1, "auc"))$y.values[[1]][1]

print(maxperf)

rm(morig)
gc()

if (nway==0) formula <- paste(target,"~ ",sep="")

list2<-character(0)
print( paste("about to try ",length(list_of_terms), " interactions.",sep=""))
for (i in 1: length(list_of_terms))
{

if (length(list2)==0)
{
print(paste(formula,list_of_terms[i],sep="+"))
mfin<-glm(paste(formula,list_of_terms[i],sep="+"),data=train,family=binomial)

}
else
{
finform<- formula
for (j in 1: length(list2))
{
if (j==1 & nway==0) finform<- paste(finform, list2[j],sep=" ")
finform<- paste(finform, list2[j],sep="+")

}
print(i)
print(finform)
print(paste(finform, list_of_terms[i],sep="+"))
mfin<-glm(paste(finform, list_of_terms[i],sep="+"),data=train,family=binomial)
}

test$score2<-predict(mfin,type='response',test)
pred2<-prediction(test$score2,test[target])
perf2 <- performance(pred2,"tpr","fpr")

if (nway==0 & i==1) maxperf<-attributes(performance(pred2, "auc"))$y.values[[1]][1]

print(maxperf)
print(attributes(performance(pred2, "auc"))$y.values[[1]][1])

if (attributes(performance(pred2, "auc"))$y.values[[1]][1] > maxperf | (nway==0 & i==1))
{
list2[length(list2)+1]= list_of_terms[i]
maxperf<- attributes(performance(pred2, "auc"))$y.values[[1]][1]
}

rm(mfin)
gc()
}

if (length(list2)>0)
{
mfin<-glm(finform,data=train,family=binomial)

print(paste("explored ", length(list_of_terms) ," interactions and ended up with following


optimal interactions : ", length(list2),sep="" ))
print("optimal glm model is ")
print(finform)
print(summary(mfin))

#compare on validation set


validate$score3<-predict(mfin,type='response',validate)
pred3<-prediction(validate$score3,validate[target])
perf3 <- performance(pred3,"tpr","fpr")
optglmperf<-attributes(performance(pred3, "auc"))$y.values[[1]][1]

print (paste(" perf of orig glm no interactions oos 15% : ",perforig,sep=""))


print (paste(" perf of logit tuned w interactions guided by rf oos 15% : ",
optglmperf,sep=""))

validate$p4<-predict(rf,validate,type='prob')[,2]
pred4<-prediction(validate$p4,validate[target])
perf4 <- performance(pred4,"tpr","fpr")
rfperf<-attributes(performance(pred4, "auc"))$y.values[[1]][1]

print (paste(" perf of orig rf oos 15% : ",rfperf ,sep=""))

windows();

#pdf(varImpPlot(rf,main="random forest variable importance (used to choose interaction terms for glm)"));
#windows();
par(mfrow=c(2,2))

plot(perf1a, col='darkgreen', main="OOS 15% validation ROC ")


plot(perf3, col='red',add=TRUE);
plot(perf4, col='blue',add=TRUE);
legend(0.6,0.6,c('orig glm','glm w interactions','orig rf'),col=c('darkgreen','red','blue'),lwd=2)

dotchart(c(rfperf,perforig,optglmperf),labels=c('oos rf','oos orig glm','oos opt glm'),cex=.7, main="oos area under curve for validation data")

datainplot<-c((optglmperf/perforig)-1,(optglmperf/rfperf)-1)
q<-barplot(datainplot,names.arg=c('% improve AUC over original glm','% improve AUC over rf'), main="OOS AUC improvement by glm w interactions based on rf",col=c("green","red"),beside=TRUE)
text(cex=.5, x=q, y=datainplot+par("cxy")[2]/2, round(datainplot,2), xpd=TRUE) ;

#barplot(c(perforig,optglmperf,rfperf),names.arg=c('original glm','glm with interactions tuned','orig rf'), main="OOS AUC by model",col=c("green","red","blue"),beside=TRUE)
#varImpPlot(rf, main="variable importance from rf")

imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]
dotchart(sort(e[,1]),main="random forest variable importance by mean decrease in accuracy")

}
# run regression of list2 terms + term
#print count and final model
#plot oos samples

# note: the element name 'optgkmperf' (sic) is what the reporting code below expects
return(list(finform=finform, perforig=perforig, optgkmperf=optglmperf, rfperf=rfperf, vi=importance(rf)))
}


# load data sets: kdd, german, and hmeq


#BAD
hmeq<-read.csv("C:/Documents and Settings/My Documents/HMEQ.csv")
hmeq$BAD <- as.factor(hmeq$BAD)

#good_bad
c<-read.csv("C:/Documents and Settings/My Documents/GermanCredit.csv")
c<-subset(c,select=-default)

#TARGET_LABEL_BAD
cc<-read.csv("C:/Documents and Settings/My Documents/cckdd2010.csv")
cc$TARGET_LABEL_BAD<-as.factor(cc$TARGET_LABEL_BAD)
cc$QUANT_DEPENDANTS<-ifelse(cc$QUANT_DEPENDANTS>=13,13,cc$QUANT_DEPENDANTS)
cc$MissingResidentialPhoneCode<-as.factor(ifelse(is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)==TRUE,'Y','N'))
cc$MissingProfPhoneCode<-as.factor(ifelse(is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)==TRUE,'Y','N'))
cc<-subset(cc,select=-ID_CLIENT )
cc<-subset(cc,select=-CLERK_TYPE )
cc<-subset(cc,select=-QUANT_ADDITIONAL_CARDS)
cc<-subset(cc,select=-EDUCATION_LEVEL.1)
cc<-subset(cc,select=-FLAG_MOBILE_PHONE)
cc<-subset(cc,select=-FLAG_HOME_ADDRESS_DOCUMENT)
cc<-subset(cc,select=-FLAG_RG )
cc<-subset(cc,select=-FLAG_CPF )
cc<-subset(cc,select=-FLAG_INCOME_PROOF)
cc<-subset(cc,select=-FLAG_ACSP_RECORD)
cc<-subset(cc,select=-TARGET_LABEL_BAD.1)
cc$RESIDENCIAL_PHONE_AREA_CODE<-as.factor(cc$RESIDENCIAL_PHONE_AREA_CODE)
cc$PROFESSIONAL_PHONE_AREA_CODE<-as.factor(cc$PROFESSIONAL_PHONE_AREA_CODE)

#cc<-subset(cc,select=-PROFESSIONAL_ZIP_3)
#cc<-subset(cc,select=-RESIDENCIAL_ZIP_3)
#cc<-subset(cc,select=-RESIDENCIAL_PHONE_AREA_CODE)
#cc<-subset(cc,select=-PROFESSIONAL_PHONE_AREA_CODE)
# generate interaction data

#no woe no interac


a1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0,NA,FALSE)
a2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100, FALSE,0,NA,FALSE)
a3<-tuneGlmToRf(hmeq, "BAD", n=1, 100, FALSE,0,NA,FALSE)

#woe no interace
b1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0,NA,TRUE,FALSE)
b2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100,
FALSE,0,NA,TRUE,FALSE)

b3<-tuneGlmToRf(hmeq, "BAD", n=1, 100, FALSE,0,NA,TRUE,FALSE)

#woe +bin no interac


c1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0)
c2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100, FALSE,0)
c3<-tuneGlmToRf(hmeq, "BAD", n=1, 100, FALSE,0)

#woe+bin interac
d1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1)
d2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100, FALSE,1)
d3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1)

#woe no bin interac


e1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1,NA,TRUE,FALSE)
e2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100,
FALSE,1,NA,TRUE,FALSE)
e3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1,NA,TRUE,FALSE)
e4<-tuneGlmToRf(hmeq, "BAD", n=11, 100, FALSE,5,NA,TRUE,FALSE)

e2a<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=7, 100, FALSE,1,NA,TRUE,FALSE)

#no woe, interac
f1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1,NA,FALSE)
f2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100, FALSE,1,NA,FALSE)
f3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1,NA,FALSE)

print(paste("no woe& no interac in i* ","german", a1$perforig,


a1$rfperf,a1$optgkmperf,sep=","))
print(paste("no woe& no interac in i* ","cc", a2$perforig,
a2$rfperf,a2$optgkmperf,sep=","))
print(paste("no woe& no interac in i* ","hmeq", a3$perforig,
a3$rfperf,a3$optgkmperf,sep=","))

print(paste("woe & no interac in i* ","german", b1$perforig,


b1$rfperf,b1$optgkmperf,sep=","))
print(paste("woe & no interac in i* ","cc", b2$perforig,
b2$rfperf,b2$optgkmperf,sep=","))
print(paste("woe & no interac in i* ","hmeq", b3$perforig,
b3$rfperf,b3$optgkmperf,sep=","))
print(paste("woe+bin & no interac in i*","german", c1$perforig, c1$rfperf, c1$optgkmperf, sep=","))
print(paste("woe+bin & no interac in i*","cc", c2$perforig, c2$rfperf, c2$optgkmperf, sep=","))
print(paste("woe+bin & no interac in i*","hmeq", c3$perforig, c3$rfperf, c3$optgkmperf, sep=","))

print(paste("woe+bin interac in i*","german", d1$perforig, d1$rfperf, d1$optgkmperf, sep=","))
print(paste("woe+bin interac in i*","cc", d2$perforig, d2$rfperf, d2$optgkmperf, sep=","))
print(paste("woe+bin interac in i*","hmeq", d3$perforig, d3$rfperf, d3$optgkmperf, sep=","))

print(paste("woe interac in i*","german", e1$perforig, e1$rfperf, e1$optgkmperf, sep=","))
print(paste("woe interac in i*","cc", e2$perforig, e2$rfperf, e2$optgkmperf, sep=","))
print(paste("woe interac in i*","hmeq", e3$perforig, e3$rfperf, e3$optgkmperf, sep=","))

print(paste("no woe interac in i*","german", f1$perforig, f1$rfperf, f1$optgkmperf, sep=","))
print(paste("no woe interac in i*","cc", f2$perforig, f2$rfperf, f2$optgkmperf, sep=","))
print(paste("no woe interac in i*","hmeq", f3$perforig, f3$rfperf, f3$optgkmperf, sep=","))
