Evidence in Favor of Weight of Evidence and Binning Transformations for Predictive Modeling

Abstract
Weight of Evidence (WOE) transformation of categorical variables is a technique that credit scoring professionals have used for decades. This paper investigates whether using the transformation improves predictive performance. For models without interaction terms, the combined use of weight of evidence and binning (discretization) of numeric variables improves predictive accuracy, while adding weight of evidence transformations without binning yields only marginal gains. This is consistent with the excellent results achieved on the Paralyzed Veterans of America KDD 98 data, where the best performance was achieved using both WOE and binning.

For models with interaction terms, the WOE transform improves model performance for 2 out of 3 data sets, and performance is the same for the third data set with or without WOE. WOE tends to improve logistic regression and I*-tuned logistic regression performance while degrading random forest performance slightly. WOE and the I* algorithm together thus reach peak predictive models, achieving area under the curve competitive with winning KDD benchmarks.

The combination of WOE and binning reduces performance for models with interaction terms. This makes sense in retrospect, as binning variables results in a loss of information about interactions among continuous variables.

WOE and binning thus improve model performance when used together in models without interactions, as practitioners have long held. However, when interaction effects exist in the data, they are more predictive than WOE and binning, and WOE should be used alone, since binning can destroy the predictive power of interaction effects. Interactions exist in a data set when random forest outperforms logistic regression out of the box (Sharma, 2011b).

Keywords: weight of evidence, discretization, classification, binary, logistic, I*, interaction, credit scoring, predictive modeling




Introduction

The goal of this study is to assess the impact of weight of evidence transformations and binning on logistic regression, random forests, and the I* algorithm, which builds logistic regression models using random forest variable importance. WOE, or weight of evidence, is defined as the log of the Bayes factor: the log of the ratio of the distributions of the evidence under one outcome versus the other in a binary classification problem (Good, 1985).
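
As a minimal sketch of this definition (the function and the 'good'/'bad' level names are illustrative, not taken from the appendix code), the WOE of each level of a categorical predictor x against a binary target y can be computed in R as:

woe_table <- function(x, y, good="good", bad="bad")
{
  tab <- table(x, y)
  p_good <- tab[, good] / sum(tab[, good])   # distribution of x among goods
  p_bad  <- tab[, bad] / sum(tab[, bad])     # distribution of x among bads
  log(p_good / p_bad)                        # log Bayes factor per level of x
}

A level that is relatively more common among goods than bads receives a positive WOE, and vice versa; substituting the resulting values for the original categories yields a single numeric column a linear model can use.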

Traditionally, the benefits of Weight of Evidence transformations of categorical variables for predictive modeling and classification have been that the transformation allows information from categorical variables to be included in model building that might otherwise be too much for regression to handle, and that it speeds up model building. The credit scoring community has long advocated the use of WOE in coarse-classification scorecards used with logistic regression, with the justification that using WOE and binning (discretization) improves performance by allowing nonlinearities to be modeled by a linear model (Leung et al., 2008).

Recently, the combination of WOE and binning was shown to produce excellent performance on the KDD 98 data set, which contains many categorical variables (Sharma, 2011a).

Data Sets

The 3 data sets used for this study are the German Credit data set (Frank, 2010), the Credit Card data set from PAKDD 2010 (available from http://sede.neurotech.com.br/PAKDD2010), and the Home Equity (HMEQ) data set from SAS®, available from www.sasenterpriseminer.com/data/HMEQ.xls. All 3 are binary classification data sets suitable for logistic regression.

Models

Logistic regression without tuning, random forests, and the I* algorithm, which builds logistic regression models using random forest variable importance, are compared using a 70% training sample, a 15% out-of-sample test set, and a 15% validation set. The performance reported below is for the validation data set.
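
As a minimal sketch of this design (the data frame name dataset is illustrative), the 70/15/15 partition can be produced as:

set.seed(42)
idx <- sample(nrow(dataset))
n_train <- floor(0.70 * nrow(dataset))
n_test <- floor(0.15 * nrow(dataset))
train <- dataset[idx[1:n_train], ]
test <- dataset[idx[(n_train + 1):(n_train + n_test)], ]
validate <- dataset[idx[(n_train + n_test + 1):nrow(dataset)], ]

The appendix code arrives at the same three samples inside tuneGlmToRf by splitting off 70% for training and then halving the remainder.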

Results and Discussion
WOE and binning enhance model performance for logistic regression and I* when no interaction terms are used. That said, the performance improvement from using interaction terms is greater than that from WOE/binning in these data sets. In the KDD 98 data set, with a large number of categorical variables and little in the way of predictive interaction effects, WOE and binning performed much better. This indicates that in the absence of predictive interaction effects it is optimal to use WOE and binning, while WOE alone is advisable in data sets with predictive interactions. To determine whether a data set has predictive interactions, compare random forest performance against logistic regression: if random forest performance is better, it is because of the presence of significant interaction effects in the data and quite possibly a more complex decision boundary (Sharma, 2011b). WOE and binning reduce random forest performance slightly. WOE by itself, with interaction terms, improves the performance of logistic regression and I*. However, WOE and binning applied to models with interactions result in a loss of performance, because binning loses information on predictive interactions. Interactions exist in a data set when random forest outperforms logistic regression out of the box (Sharma, 2011b), as sketched below.
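
A hedged sketch of this diagnostic, using the same ROCR idiom as the appendix code (train, test, and the target column y are illustrative names, y is assumed to be a factor, and test is assumed to have no missing values):

library(randomForest)
library(ROCR)

auc <- function(score, truth)
{
  attributes(performance(prediction(score, truth), "auc"))$y.values[[1]][1]
}

rf_fit <- randomForest(y ~ ., data=train, na.action=na.roughfix)
glm_fit <- glm(y ~ ., data=train, family=binomial)

auc_rf <- auc(predict(rf_fit, test, type="prob")[, 2], test$y)
auc_glm <- auc(predict(glm_fit, test, type="response"), test$y)

# a clearly higher random forest AUC suggests interaction effects worth mining
print(c(rf=auc_rf, glm=auc_glm))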

Conclusions

Recommendations based on this study:

• Using interaction terms enhances performance more than WOE when there are predictive interactions in the data.
• WOE enhances I* model performance and logistic regression performance and should be used as a default option.
• WOE and binning should be used together when no interactions are suspected or included in model building. This is the default setting used by credit scoring professionals.
• To determine whether to explore interaction terms, compare the out-of-the-box performance of random forests and logistic regression. If random forest performance is much better than that of logistic regression, the performance gap can be attributed to powerful and statistically significant interaction effects in the data (Sharma, 2011b).

References

Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Good, I.J. (1985). Weight of Evidence: A Brief Survey. Bayesian Statistics, Vol. 2, pp. 249-270. Retrieved from http://www.swrcb.ca.gov/water_issues/programs/tmdl/docs/303d_policydocs/207.pdf

KDD CUP 1998 Data. Retrieved from http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

Leung, K., Cheong, F., & Cheong, C. (2008). Building a Scorecard in Practice. WSPC Proceedings. Retrieved from http://www.aiecon.org/conference/2008/CIEF/Building%20a%20Scorecard%20in%20Practice.pdf

Sharma, D. (2011a). Optimal Response Modeling: Comparison of Logistic Regression, Random Forest and I* Algorithm. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1918410

Sharma, D. (2011b). I*: Optimizing Logistic Regression to Match Ensemble Performance Using Random Forest Variable Importance. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1858378

Appendix of R code

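# return_RF_Formula: builds a "+"-separated model formula string from the top-n
# variables of a fitted random forest, keeping only variables whose mean decrease
# in accuracy exceeds 0.025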
return_RF_Formula <-function (arf1,n=7)
{
imp <- importance(arf1, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]   # keep the last two importance columns
#sort(e[,1], dec=TRUE)

#only keep variables with info length(names(e[,1][e[,1]>0]))

a<-""
for (i in 1: min(n,length(sort(e[,1][e[,1]>0.025], dec=TRUE))))
{
a<-paste(a,(names(sort(e[,1][e[,1]>0.025], dec=TRUE))[i]),sep="")

if (i< min(n,length(sort(e[,1][e[,1]>0.025], dec=TRUE))))


{
a<-paste(a,("+"),sep="")
}

}
return (a)
}

getRfTerms<-function(rf, n=10)
{
rfterms<-character(0)

imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]

for (i in 1: min(n,length(sort(e[,1], dec=TRUE))))
{
print(paste("forest: ", names(sort(e[,1], dec=TRUE))[i], sep=""))
rfterms[length(rfterms)+1]= names(sort(e[,1], dec=TRUE))[i]
}

return (rfterms)
}

# getGlmTerms: returns a data frame of glm term names, coefficient estimates, and p-values
getGlmTerms<- function (m)
{
x <- array(dim=c(length( names(summary(m)$coefficients[,1])),3))

for (i in 1:length( names(summary(m)$coefficients[,1])))


{
#print(names(summary(m)$coefficients[,1])[i])
#print(summary(m)$coefficients[i])
#print (summary(m)$coefficients[i,4])

x[i,1]<-names(summary(m)$coefficients[,1])[i]
x[i,2]<-summary(m)$coefficients[i]
x[i,3]<-summary(m)$coefficients[i,4]

}
y<-data.frame(x,stringsAsFactors = FALSE)
return (y)
}

# barplotwithtext: barplot with the rounded values printed above each bar
barplotwithtext<- function(datainplot=c(1,2,3), y=c(0,1))
{
w<-barplot(datainplot,ylim=y);
text(cex=.5, x=w, y=datainplot+par("cxy")[2]/2, round(datainplot,2), xpd=TRUE)
}

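# EqualFreq2: equal-frequency binning; replaces each value of x with the index
# (1..n) of the roughly equal-sized bin it falls into by rank, randomly choosing
# which bins absorb the remainder when length(x) is not divisible by n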
EqualFreq2 <- function(x,n){


nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
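# e.g. EqualFreq2(rnorm(100), 10) returns 100 bin labels in 1..10, each label
# appearing exactly 10 times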

rsamp<-function(c, p)
{
d = sort(sample(nrow(c), nrow(c)*p))
#select training sample
train<-c[d,]
test<-c[-d,]

return(train)
}

#WOE
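# woe: optionally bins numeric columns into 10 equal-frequency bins, then for each
# categorical (or low-cardinality integer) column builds a lookup table via sqldf --
# per level, the share of all training rows in that level with the first target
# class, a rate-based proxy rather than the textbook log Bayes factor -- and merges
# it onto train and test as a WOE<name> column; the original categorical columns
# are then dropped and any remaining factors coerced to numeric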

woe<-function(data, test, target,bin=TRUE)


{
library(randomForest)
data<-na.roughfix(data)
test<-na.roughfix(test)

print("WOE1")
#library(discretization)
library(randomForest)
library(sqldf)

#ensure target is factor


data[,target]<-as.factor(data[,target])
test[,target]<-as.factor(test[,target])

len<-length(data)
#loop through training categorical variables, build a WOE lookup table, and apply it to test

for (i in 1:length(data))
{

if ( bin==TRUE & (class(data[,i])=='numeric' | class(data[,i])=='integer') )
{
data[,i]<-EqualFreq2(data[,i],10)
test[,i]<-EqualFreq2(test[,i],10)

#alternative transformations tried:
#data[,i]<-cut(data[,i], as.integer(sqrt(nrow(data))))
#data[,i]<- log(data[,i]+1.0)
#test[,i]<- log(test[,i]+1.0)
#data[,i]<-cut(data[,i],10)
#test[,i]<-cut(data[,i],10)
}

index<- match(target, colnames(data))

if ( (class(data[,i])=="factor" | ( class(data[,i])=="integer" &
      length(levels(as.factor(data[,i])))<=50) )
     & i!=index )
{

#print("non numeric")
name<-names(data)[i]

#print(name)
print(class(data[,i]))

a=paste("'",levels(data[,target])[1],"'",sep="")
b=paste("'",levels(data[,target])[2],"'",sep="")

#an earlier variant computed the odds-ratio form of WOE (b is only used here):
#b1=sprintf('select (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data where %s)) / (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data where %s)) %s', paste(target,"=",a,sep=""), paste(target,"=",a,sep=""), paste(target,"=",b,sep=""), paste(target,"=",b,sep=""), paste('WOE',names(data)[i],' , [', names(data)[i], '] from data group by [',names(data)[i], ']',sep=""))

#per level: share of all training rows in that level with the first target class
b1=(sprintf('select (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data)) %s', paste(target,"=",a,sep=""), paste('WOE',names(data)[i],' , [', names(data)[i], '] from data group by [',names(data)[i], ']',sep="") ))

sql<- b1

print(sql)
t<-sqldf(sql)

print('ran sql')
gc()
#data <-sqldf(paste('select * from data a, t b where a.',names(data)[i],' = b.',names(data)[i],sep=""))
print('merge1')
data<-merge(data,t)
test<-merge(test,t)

gc()

}
}

#remove the original categorical columns now represented by WOE columns
l<-vector()
for (i in 1:len)
{
index<- match(target, colnames(data))

if ( (class(data[,i])=="factor" | ( class(data[,i])=="integer" &
      length(levels(as.factor(data[,i])))<=50) )
     & i!=index )
{
name<-names(data)[i]
print(name)
l[i]<-name
}
}

myvars <- names(data) %in% c(l)


data <- data[!myvars]

myvars <- names(test) %in% c(l)


test <- test[!myvars]

gc()
#ensure woes all numeric
for (i in 1:length(data))
{

index<- match(target, colnames(data))

if (class(data[,i])=='factor' & i!=index )
{

data[,i] <- as.numeric(data[,i])+1
test[,i] <- as.numeric(test[,i])+1

#if ( regexpr("WOE", names(data)[i])[1]>=1)
#{
#data[,i]<- log(data[,i])
#test[,i]<- log(test[,i])
#}

}
}

#for NAs, replace the WOE with the median WOE value via na.roughfix
data<-na.roughfix(data)
test<-na.roughfix(test)

print('end woe')
return(list(data=data,test=test))
}

# input data; target = name of target classification variable; n = number of top most
# predictive variables to use for interaction mining based on rf
# rfsample = per-class sample size for the forest on large data sets; usually 500 works well
# sample = boolean flag for large data sets to sample 52k obs for model building
# nway = order of interactions to mine; higher orders are slower and not much better
# formStart = string model formula of a base existing model with interactions you may
# want to use as a starting point for model improvement
# you can run the function once and feed the resulting model back in to see if more
# predictive power can be added with additional interaction terms, applying the
# algorithm recursively on itself
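# in outline, tuneGlmToRf: (1) splits the data 70/15/15, (2) optionally applies woe()
# with or without binning, (3) fits a random forest to rank variables, (4) proposes
# interaction terms built from the top-n ranked variables, and (5) greedily keeps each
# interaction that raises out-of-sample AUC, reporting glm, tuned glm, and rf AUCs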

tuneGlmToRf<-function(data, target, n=7, rfsample=0, sample=FALSE, nway=1,
                      formStart=NA, woe=TRUE, bin=TRUE)
{
library(randomForest)
#if (nway==0)
data<-na.roughfix(data)

#if a starting formula was supplied, keep only the variables it references
if (!is.na(formStart))
{
d<-gsub("\\+",",",formStart)
d<-gsub("\\~",",",d)
d<-gsub("WOE","",d)   #strip the WOE prefix from transformed column names
print(d)
q<-unlist(strsplit(d, "\\,"))
myvars <- names(data) %in% c(q)

data<- data[myvars]
}

# if not using woe, convert factors to numeric, as rf can't handle factors with >=32 levels

if (woe==FALSE)
{
for (i in 1:length(data))
{
# if (length(levels(data[,i]))>=32) { data[,i]<-NULL }
if (is.factor(data[,i]) & names(data)[i]!=target) { data[,i]<-as.numeric(data[,i]) }
}
}

if (nrow(data)>52000 & sample==TRUE)
{
d = sort(sample(nrow(data), 52000))
data<-data[d,]

if (rfsample==0) { rfsample=500 }
}

options(warn=-1)
#library(fSeries)

#if (nway!=0)
#{
#data<- data.frame(substituteNA(data, type ="zeros"))
#for (i in 1:(length(names(data))))

#{
#{
# if (names(data[i]) != target)
# {
# data[,i] <- as.numeric( data[,i])
# print (1)
# }

# else { data[,i] <- as.factor( data[,i]) }


#}
#}

#}

formula <- paste(target,"~. ",sep="")


formularf<- as.formula(formula)

d = sort(sample(nrow(data), nrow(data)*.7))
#select training sample
train<-data[d,]
test<-data[-d,]
rm(data)
gc()

if (woe==TRUE)
{

if (bin==TRUE) { prep<-woe(train,test, target) }


if (bin==FALSE) { prep<-woe(train,test, target,FALSE) }

train<-prep$data
test<-prep$test
print("a")
}

if (nrow(train)>=34000) train<-train[1:34000,]

gc()
w= sort(sample(nrow(test), nrow(test)*.5))
test<-test[w,]
validate<-test[-w,]

print(length(train))
u<-length(train)
u<-as.integer(log(u,10)*sqrt(u))

library(randomForest)

library(ROCR)
set.seed(42)

if (rfsample > 0)
{
print("rf")
print(rfsample)

print(summary(train))
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
sampsize=c(rfsample,rfsample),
na.action=na.roughfix,
replace=FALSE)
}

else
{
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
}

print("C")
#subset train test and validate to terms that have predictive value
imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]

print("D")

#data<-subset(data,select=c(names(e[,1][e[,1]>0]),target))

train<-subset(train,select=c(names(e[,1][e[,1]>0]),target))

print("E")

test<-subset(test,select=c(names(e[,1][e[,1]>0]),target))

print("F")
validate<-subset(validate,select=c(names(e[,1][e[,1]>0]),target))

print("done woe")

#made perf lousy for all


#if (nway==0 & n==1)
#{
#data<-subset(data,select=c(head(names(e[,1][e[,1]>0]),30),target))
#train<-subset(train,select=c(head(names(e[,1][e[,1]>0]),30),target))
#test<-subset(test,select=c(head(names(e[,1][e[,1]>0]),30),target))
#validate<-subset(validate,select=c(head(names(e[,1][e[,1]>0]),30),target))
#
#}

u<-length(train)
u<-as.integer(log(u,10)*sqrt(u))

#rebuild forest with reduced data set


if (rfsample > 0)
{

rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
sampsize=c(rfsample,rfsample),
na.action=na.roughfix,
replace=FALSE)
}

else
{
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
}

rfterms<-getRfTerms(rf,n)
list_of_terms<-character(0)

form<- paste(formula, " * ",sep="")

#if (nway==0 ) form<-formula

for (k in 1:length(rfterms))
{
print(paste(form,rfterms[k],sep=""))

if ( k<=nway | nway ==1 )


{
print("a")

m<-glm(paste(form,rfterms[k],sep=""),data=train,family=binomial)
print("b")

if (k>nway & nway>1)


{
f<- paste(form,rfterms[k],sep="")
for (p in 1:nway)
{
f<-paste(f, " * ", rfterms[k-p],sep="")
print(f)
}

#paste(form,rfterms[k]," * ", rfterms[k-1], " * ", rfterms[k-2]," * ",rfterms[k-3]," * ",rfterms[k-4]," * ",rfterms[k-5], sep="")
m<-glm(f,data=train,family=binomial)
print("c")

}
print("d")

if (nway==0) m<-glm(formula,data=train,family=binomial)

print(summary(m))
y <- getGlmTerms(m)
y<- na.omit(y)
rm(m)
gc()

y$X2<- as.numeric(y$X2)
y$X3<- as.numeric(y$X3)

for (j in 1: nrow(y))
{
print("Adding term")

if ((y[j,3]<=.25 | runif(1, 0, .61)< .16 | TRUE ) & (regexpr(":", y[j,1]) > 0) | ( nway==0 & !(regexpr("Intercept", y[j,1]) > 0) ))
#print(names(y[1,j]))
list_of_terms[length(list_of_terms)+1]= y[j,1]
}
}
}

############# do the same as above by updating the formula to include 1-way
############# interactions against the final model #################

maxperf<-0

morig<-glm(formula,data=train,family=binomial)

test$score1<-predict(morig,type='response',test)
pred1<-prediction(test$score1,test[target])
perf1 <- performance(pred1,"tpr","fpr")

validate$score1a<-predict(morig,type='response',validate)
pred1a<-prediction(validate$score1a,validate[target])
perf1a <- performance(pred1a,"tpr","fpr")

perforig<-attributes(performance(pred1a, "auc"))$y.values[[1]][1]

maxperf<-attributes(performance(pred1, "auc"))$y.values[[1]][1]

print(maxperf)

rm(morig)
gc()

if (nway==0) formula <- paste(target,"~ ",sep="")

list2<-character(0)
print( paste("about to try ",length(list_of_terms), " interactions.",sep=""))
for (i in 1: length(list_of_terms))
{

if (length(list2)==0)
{
print(paste(formula,list_of_terms[i],sep="+"))
mfin<-glm(paste(formula,list_of_terms[i],sep="+"),data=train,family=binomial)

}
else
{
finform<- formula
for (j in 1: length(list2))
{
if (j==1 & nway==0) finform<- paste(finform, list2[j],sep=" ")
finform<- paste(finform, list2[j],sep="+")

}
print(i)
print(finform)
print(paste(finform, list_of_terms[i],sep="+"))
mfin<-glm(paste(finform, list_of_terms[i],sep="+"),data=train,family=binomial)
}

test$score2<-predict(mfin,type='response',test)
pred2<-prediction(test$score2,test[target])
perf2 <- performance(pred2,"tpr","fpr")

if (nway==0 & i==1) maxperf<-attributes(performance(pred2, "auc"))$y.values[[1]][1]

print(maxperf)
print(attributes(performance(pred2, "auc"))$y.values[[1]][1])

if (attributes(performance(pred2, "auc"))$y.values[[1]][1] > maxperf | (nway==0 & i==1))
{
list2[length(list2)+1]= list_of_terms[i]
maxperf<- attributes(performance(pred2, "auc"))$y.values[[1]][1]
}

rm(mfin)
gc()
}

if (length(list2)>0)
{
mfin<-glm(finform,data=train,family=binomial)

print(paste("explored ", length(list_of_terms) ," interactions and ended up with following


optimal interactions : ", length(list2),sep="" ))
print("optimal glm model is ")
print(finform)
print(summary(mfin))

#compare on validation set


validate$score3<-predict(mfin,type='response',validate)
pred3<-prediction(validate$score3,validate[target])
perf3 <- performance(pred3,"tpr","fpr")
optglmperf<-attributes(performance(pred3, "auc"))$y.values[[1]][1]

print (paste(" perf of orig glm no interactions oos 15% : ",perforig,sep=""))


print (paste(" perf of logit tuned w interactions guided by rf oos 15% : ",
optglmperf,sep=""))

validate$p4<-predict(rf,validate,type='prob')[,2]
pred4<-prediction(validate$p4,validate[target])
perf4 <- performance(pred4,"tpr","fpr")
rfperf<-attributes(performance(pred4, "auc"))$y.values[[1]][1]

print (paste(" perf of orig rf oos 15% : ",rfperf ,sep=""))

windows();

#pdf(varImpPlot(rf,main="random forest variable importance (used to choose interaction terms for glm)"));
#windows();
par(mfrow=c(2,2))

plot(perf1a, col='darkgreen', main="OOS 15% validation ROC ")


plot(perf3, col='red',add=TRUE);
plot(perf4, col='blue',add=TRUE);
legend(0.6,0.6,c('orig glm','glm w interactions','orig rf'),col=c('darkgreen','red','blue'),lwd=2)

dotchart(c(rfperf,perforig,optglmperf),labels=c('oos rf','oos orig glm','oos opt glm'),cex=.7, main="oos area under curve for validation data")

datainplot<-c((optglmperf/perforig)-1,(optglmperf/rfperf)-1)
q<-barplot(datainplot,names.arg=c('% improve AUC over original glm','% improve AUC over rf'), main="OOS AUC improvement by glm w interactions based on rf",col=c("green","red"),beside=TRUE)
text(cex=.5, x=q, y=datainplot+par("cxy")[2]/2, round(datainplot,2), xpd=TRUE) ;

#barplot(c(perforig,optglmperf,rfperf),names.arg=c('original glm','glm with interactions tuned','orig rf'), main="OOS AUC by model",col=c("green","red","blue"),beside=TRUE)
#varImpPlot(rf, main="variable importance from rf")

imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]
dotchart(sort(e[,1]),main="random forest variable importance by mean decrease in accuracy")

}
# run regression of list2 terms + term
#print count and final model
#plot oos samples

# note: the element name 'optgkmperf' (sic) is what the reporting code below expects
return(list(finform=finform, perforig=perforig, optgkmperf=optglmperf, rfperf=rfperf, vi=importance(rf)))
}


# load data sets: kdd, german, and hmeq


#BAD
hmeq<-read.csv("C:/Documents and Settings/My Documents/HMEQ.csv")
hmeq$BAD <- as.factor(hmeq$BAD)

#good_bad
c<-read.csv("C:/Documents and Settings/My Documents/GermanCredit.csv")
c<-subset(c,select=-default)

#TARGET_LABEL_BAD
cc<-read.csv("C:/Documents and Settings/My Documents/cckdd2010.csv")
cc$TARGET_LABEL_BAD<-as.factor(cc$TARGET_LABEL_BAD)
cc$QUANT_DEPENDANTS<-ifelse(cc$QUANT_DEPENDANTS>=13,13,cc$QUANT_DEPENDANTS)
cc$MissingResidentialPhoneCode<-as.factor(ifelse(is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)==TRUE,'Y','N'))
cc$MissingProfPhoneCode<-as.factor(ifelse(is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)==TRUE,'Y','N'))
cc<-subset(cc,select=-ID_CLIENT )
cc<-subset(cc,select=-CLERK_TYPE )
cc<-subset(cc,select=-QUANT_ADDITIONAL_CARDS)
cc<-subset(cc,select=-EDUCATION_LEVEL.1)
cc<-subset(cc,select=-FLAG_MOBILE_PHONE)
cc<-subset(cc,select=-FLAG_HOME_ADDRESS_DOCUMENT)
cc<-subset(cc,select=-FLAG_RG )
cc<-subset(cc,select=-FLAG_CPF )
cc<-subset(cc,select=-FLAG_INCOME_PROOF)
cc<-subset(cc,select=-FLAG_ACSP_RECORD)
cc<-subset(cc,select=-TARGET_LABEL_BAD.1)
cc$RESIDENCIAL_PHONE_AREA_CODE<-as.factor(cc$RESIDENCIAL_PHONE_AREA_CODE)
cc$PROFESSIONAL_PHONE_AREA_CODE<-as.factor(cc$PROFESSIONAL_PHONE_AREA_CODE)

#cc<-subset(cc,select=-PROFESSIONAL_ZIP_3)
#cc<-subset(cc,select=-RESIDENCIAL_ZIP_3)
#cc<-subset(cc,select=-RESIDENCIAL_PHONE_AREA_CODE)
#cc<-subset(cc,select=-PROFESSIONAL_PHONE_AREA_CODE)
# generate interaction data

#no woe no interac


a1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0,NA,FALSE)
a2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100, FALSE,0,NA,FALSE)
a3<-tuneGlmToRf(hmeq, "BAD", n=1, 100, FALSE,0,NA,FALSE)

#woe no interace
b1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0,NA,TRUE,FALSE)
b2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100,
FALSE,0,NA,TRUE,FALSE)

b3<-tuneGlmToRf(hmeq, "BAD", n=1, 100, FALSE,0,NA,TRUE,FALSE)

#woe +bin no interac


c1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0)
c2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100, FALSE,0)
c3<-tuneGlmToRf(hmeq, "BAD", n=1, 100, FALSE,0)

#woe+bin interac
d1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1)
d2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100, FALSE,1)
d3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1)

#woe no bin interac


e1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1,NA,TRUE,FALSE)
e2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100,
FALSE,1,NA,TRUE,FALSE)
e3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1,NA,TRUE,FALSE)
e4<-tuneGlmToRf(hmeq, "BAD", n=11, 100, FALSE,5,NA,TRUE,FALSE)

e2a<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=7, 100, FALSE,1,NA,TRUE,FALSE)

#no woe, interac
f1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1,NA,FALSE)
f2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100, FALSE,1,NA,FALSE)
f3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1,NA,FALSE)

print(paste("no woe& no interac in i* ","german", a1$perforig,


a1$rfperf,a1$optgkmperf,sep=","))
print(paste("no woe& no interac in i* ","cc", a2$perforig,
a2$rfperf,a2$optgkmperf,sep=","))
print(paste("no woe& no interac in i* ","hmeq", a3$perforig,
a3$rfperf,a3$optgkmperf,sep=","))

print(paste("woe & no interac in i* ","german", b1$perforig,


b1$rfperf,b1$optgkmperf,sep=","))
print(paste("woe & no interac in i* ","cc", b2$perforig,
b2$rfperf,b2$optgkmperf,sep=","))
print(paste("woe & no interac in i* ","hmeq", b3$perforig,
b3$rfperf,b3$optgkmperf,sep=","))
print(paste("woe+bin & no interac in i*","german", c1$perforig, c1$rfperf, c1$optgkmperf, sep=","))
print(paste("woe+bin & no interac in i*","cc", c2$perforig, c2$rfperf, c2$optgkmperf, sep=","))
print(paste("woe+bin & no interac in i*","hmeq", c3$perforig, c3$rfperf, c3$optgkmperf, sep=","))

print(paste("woe+bin interac in i*","german", d1$perforig, d1$rfperf, d1$optgkmperf, sep=","))
print(paste("woe+bin interac in i*","cc", d2$perforig, d2$rfperf, d2$optgkmperf, sep=","))
print(paste("woe+bin interac in i*","hmeq", d3$perforig, d3$rfperf, d3$optgkmperf, sep=","))

print(paste("woe interac in i*","german", e1$perforig, e1$rfperf, e1$optgkmperf, sep=","))
print(paste("woe interac in i*","cc", e2$perforig, e2$rfperf, e2$optgkmperf, sep=","))
print(paste("woe interac in i*","hmeq", e3$perforig, e3$rfperf, e3$optgkmperf, sep=","))

print(paste("no woe interac in i*","german", f1$perforig, f1$rfperf, f1$optgkmperf, sep=","))
print(paste("no woe interac in i*","cc", f2$perforig, f2$rfperf, f2$optgkmperf, sep=","))
print(paste("no woe interac in i*","hmeq", f3$perforig, f3$rfperf, f3$optgkmperf, sep=","))
