In Favor of WOE
Abstract
Weight of evidence (WOE) transformation of categorical variables is a technique that
credit scoring professionals have used for decades. This paper investigates whether the
transformation improves predictive performance. For models without interaction terms,
using WOE together with binning (discretization) of numeric variables improves
predictive accuracy, whereas the gain from WOE without binning is marginal. This is
consistent with the excellent results achieved on the Paralyzed Veterans of America
KDD 98 data, where the best performance was achieved using both WOE and binning.
For models with interaction terms, the WOE transformation improves model performance
on 2 of the 3 data sets and leaves it unchanged on the third. WOE tends to improve
logistic regression and I*-tuned logistic regression performance while slightly
degrading random forest performance. Combined with WOE, the I* algorithm thus reaches
its peak predictive performance, achieving an area under the curve competitive with
winning KDD benchmarks.
The combination of WOE and binning reduces performance for models with interaction
terms. This makes sense in retrospect, as binning variables loses information about
interactions among continuous variables.
WOE and binning used together thus improve model performance when modeling without
interactions, as practitioners have long believed. However, when interaction effects
exist in the data, they are more predictive than WOE and binning, and WOE should be
used alone, because binning can destroy the predictive power of the interaction
effects. Interactions exist in a data set when random forest outperforms logistic
regression out of the box (Sharma, 2011b).
Introduction
The goal of this study is to assess the impact of weight of evidence (WOE)
transformations and binning on logistic regression, random forests, and the I*
algorithm, which builds logistic regression models using random forest variable
importance. WOE is defined as the log of the Bayes factor, that is, the log of the ratio
of the distributions of the two outcomes in a binary classification problem (Good, 1985).
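To make the definition concrete, the following is a minimal base R sketch of a WOE table for one categorical predictor. The function name `woe_table`, the choice of which outcome counts as the event, and the 0.5 smoothing constant are illustrative assumptions and are not taken from the appendix code.

```r
# Weight of evidence per level of a categorical predictor (base R only).
# x: factor or character predictor; y: two-level outcome;
# event: the outcome value treated as the event (sign convention is an assumption).
# adj: smoothing constant (an assumption) that avoids log(0) for sparse levels.
woe_table <- function(x, y, event, adj = 0.5) {
  x <- factor(x)                          # drops unused levels
  is_event <- y == event
  n_event <- tapply(is_event, x, sum)     # events per level
  n_non   <- tapply(!is_event, x, sum)    # non-events per level
  # share of all events / non-events that fall in each level
  p_event <- (n_event + adj) / (sum(n_event) + adj * nlevels(x))
  p_non   <- (n_non + adj) / (sum(n_non) + adj * nlevels(x))
  data.frame(level = levels(x),
             woe = as.numeric(log(p_event / p_non)),
             stringsAsFactors = FALSE)
}

# Toy data: loan purpose and outcome
purpose <- c("car", "car", "car", "tv", "tv", "tv", "tv", "edu", "edu", "edu")
status  <- c("bad", "good", "good", "good", "good", "good", "bad", "bad", "bad", "good")

tab <- woe_table(purpose, status, event = "bad")
# Replace each category by its WOE value
woe_purpose <- tab$woe[match(purpose, tab$level)]
print(tab)
```

A positive value marks a level where the event is over-represented relative to the non-event, and a negative value marks the opposite. In practice the table would be estimated on the training sample only and then applied to the test and validation samples.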
Recently, the combination of WOE and binning was shown to produce excellent
performance on the KDD 98 data set, which contains many categorical variables
(Sharma, 2011a).
Data Sets
The 3 data sets used for this study are the German Credit data set (Frank, 2010), the
Credit Card KDD 2010 data set (available from
http://sede.neurotech.com.br/PAKDD2010), and the Home Equity data set from SAS®.
Models
Logistic regression without tuning, random forests, and the I* algorithm, which builds
logistic regression models using random forest variable importance, are compared using
a 70% training, 15% out-of-sample test, and 15% validation split. The performance
reported below is for the validation data set.
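The split described above can be sketched in base R as follows. The helper name `split_data`, its arguments, and the toy data frame are hypothetical. The sketch mirrors the idea of a 70/15/15 partition rather than reproducing the appendix code line by line.

```r
# Partition a data frame into 70% train, 15% test, and 15% validate.
# split_data and its arguments are illustrative names, not from the appendix.
split_data <- function(df, p_train = 0.70, seed = 42) {
  set.seed(seed)
  n <- nrow(df)
  idx <- sample(n)                          # random permutation of row indices
  n_train <- round(p_train * n)
  n_test  <- floor((n - n_train) / 2)       # half of the held-out rows
  list(train    = df[idx[seq_len(n_train)], , drop = FALSE],
       test     = df[idx[n_train + seq_len(n_test)], , drop = FALSE],
       validate = df[idx[-seq_len(n_train + n_test)], , drop = FALSE])
}

toy <- data.frame(id = 1:200, x = rnorm(200))
parts <- split_data(toy)
sizes <- sapply(parts, nrow)
print(sizes)
```

Drawing all three subsets from a single permutation guarantees that no row appears in more than one subset.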
Results
Discussion
WOE and binning enhance model performance for logistic regression and I* when no
interaction terms are used. That said, the performance improvement from using
interaction terms is greater than that from WOE and binning in these data sets. In the
KDD 98 data set, which has a large number of categorical variables and few predictive
interaction effects, WOE and binning performed much better. This indicates that in the
absence of predictive interaction effects it is optimal to use WOE and binning together,
while WOE alone is advisable in data sets with predictive interactions. To determine
whether a data set has predictive interactions, it is recommended to compare random
forest performance against logistic regression: if random forest performs better, it is
because of significant interaction effects in the data and quite possibly a more complex
decision boundary (Sharma,). WOE and binning reduce random forest performance
slightly. WOE alone, with interaction terms, improves the performance of logistic
regression and I*. However, WOE and binning applied together to models with
interactions result in a loss of performance, because binning loses information about
predictive interactions. Interactions exist in a data set when random forest
outperforms logistic regression out of the box (Sharma, 2011b).
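The diagnostic recommended above can be illustrated on synthetic data. The sketch below assumes the randomForest package is installed (the appendix also uses it). The simulated data, the helper `auc`, and all variable names are hypothetical. The outcome is constructed to depend only on an interaction, so a main-effects logistic regression should do little better than chance while the random forest should not.

```r
library(randomForest)

# AUC via the rank (Mann-Whitney) formula; label must be coded 0/1.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(42)
n <- 2000
d <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
# Synthetic outcome driven purely by an interaction (the sign of x1 * x2)
d$y <- factor(ifelse(d$x1 * d$x2 > 0, 1, 0))
train <- d[1:1400, ]
valid <- d[1401:n, ]

# Out-of-the-box models: main effects only for the logistic regression
lr <- glm(y ~ x1 + x2, data = train, family = binomial)
rf <- randomForest(y ~ x1 + x2, data = train, ntree = 200)

label  <- as.integer(as.character(valid$y))
auc_lr <- auc(predict(lr, valid, type = "response"), label)
auc_rf <- auc(predict(rf, valid, type = "prob")[, "1"], label)
cat("logistic regression AUC:", auc_lr, "\n")
cat("random forest AUC:      ", auc_rf, "\n")
```

A large gap in favor of the random forest on held-out data is the signal that interaction terms are worth mining before binning is considered.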
Conclusions
References
Good, I.J. (1985). Weight of Evidence: A Brief Survey. Bayesian Statistics, Vol. 2,
pp. 249-270. Retrieved from
http://www.swrcb.ca.gov/water_issues/programs/tmdl/docs/303d_policydocs/207.pdf
Leung, K., Cheong, F., Cheong, C. (2008). Building a Scorecard in Practice. WSPC
Proceedings. Retrieved from
http://www.aiecon.org/conference/2008/CIEF/Building%20a%20Scorecard%20in%20Practice.pdf
Appendix of R code
{
imp <- importance(arf1, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]
#sort(e[,1], dec=TRUE)
a<-""
for (i in 1: min(n,length(sort(e[,1][e[,1]>0.025], dec=TRUE))))
{
a<-paste(a,(names(sort(e[,1][e[,1]>0.025], dec=TRUE))[i]),sep="")
}
return (a)
}
getRfTerms<-function(rf, n=10)
{
rfterms<-character(0)
return (rfterms)
x[i,1]<-names(summary(m)$coefficients[,1])[i]
x[i,2]<-summary(m)$coefficients[i]
x[i,3]<-summary(m)$coefficients[i,4]
}
y<-data.frame(x,stringsAsFactors = FALSE)
return (y)
}
rsamp<-function(c, p)
{
d = sort(sample(nrow(c), nrow(c)*p))
#select training sample
train<-c[d,]
test<-c[-d,]
return(train)
}
#WOE
print("WOE1")
#library(discretization)
library(randomForest)
library(sqldf)
len<-length(data)
#loop through training categorical variables and create woe table and apply to test
for (i in 1:length(data))
{
data[,i]<-EqualFreq2(data[,i],10)
test[,i]<-EqualFreq2(test[,i],10)
#data[,i]<-cut(data[,i], as.integer(sqrt(nrow(data))))
#data[,i]<- log(data[,i]+1.0)
#test[,i]<- log(test[,i]+1.0)
#data[,i]<- log(data[,i]+1.0)
#test[,i]<- log(test[,i]+1.0)
#data[,i]<-cut(data[,i],10)
#test[,i]<-cut(data[,i],10)
)
& i!=index
)
{
#print("non numeric")
name<-names(data)[i]
#print(name)
print(class(data[,i]))
a=paste("'",levels(data[,target])[1],"'",sep="")
b=paste("'",levels(data[,target])[2],"'",sep="")
#b1=(sprintf('select (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data where %s))/(sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data where %s)) %s',paste(target,"=",a,sep=""),paste(target,"=",a,sep=""),paste(target,"=",b,sep=""),paste(target,"=",b,sep=""), paste('WOE',names(data)[i],' , [', names(data)[i], '] from data group by [',names(data)[i], ']',sep="") ))
b1=(sprintf('select (sum(case when %s then 1.00 else 0.00 end) / (select count(*) from data)) %s',paste(target,"=",a,sep=""), paste('WOE',names(data)[i],' , [', names(data)[i], '] from data group by [',names(data)[i], ']',sep="") ))
sql<- b1
print(sql)
t<-sqldf(sql)
print('ran sql')
gc()
#data <-sqldf(paste('select * from data a, t b where a.',names(data)[i],' = b.',names(data)[i],sep=""))
print('merge1')
data<-merge(data,t)
test<-merge(test,t)
gc()
}
}
#remove categ
l<-vector()
for (i in 1:len)
{
index<- match(target, colnames(data))
}
}
gc()
#ensure woes all numeric
for (i in 1:length(data))
{
#data[,i]<- log(data[,i])
#test[,i]<- log(test[,i])
#}
print('end woe')
return(list(data=data,test=test))
# input data, name of target classification variable, n=number of top most predictive
# variables to use for interaction mining based on rf
# rfsample size of classes to sample from for large dataset; usually 500 works well.
# sample a boolean flag for large datasets to sample 50k obs for model building etc.
# 2 way interacs flag; slower not much better;
# formStart is a string model formula of a base existing model with interactions you may
# want to start with as a starting point for model improvement
# you can run the function once and feed back the resulting model to see if more
# predictive power can be added using additional interaction terms using the
# algorithm recursively on itself.
if (!is.na(formStart))
{
d<-gsub("\\+",",",formStart)
d<-gsub("\\~",",",d)
d<-gsub("\\WOE","",d)
print(d)
q<-unlist(strsplit(d, "\\,"))
myvars <- names(data) %in% c(q)
data<- data[myvars]
# if not woe and levels >=32 then remove as rf cant handle them
if (woe==FALSE)
{
for (i in 1:length(data))
{
# if (length(levels(data[,i]))>=32) { data[,i]<-NULL }
if (is.factor(data[,i]) & names(data)[i]!=target) { data[,i]<-as.numeric(data[,i]) }
if (rfsample==0) { rfsample=500 }
}
options(warn=-1)
#library(fSeries)
#if (nway!=0)
#{
#data<- data.frame(substituteNA(data, type ="zeros"))
#for (i in 1:(length(names(data))))
#{
#{
# if (names(data[i]) != target)
# {
# data[,i] <- as.numeric( data[,i])
# print (1)
# }
# else { data[,i] <- as.factor( data[,i]) }
#}
#}
#}
d = sort(sample(nrow(data), nrow(data)*.7))
#select training sample
train<-data[d,]
test<-data[-d,]
rm(data)
gc()
if (woe==TRUE)
{
train<-prep$data
test<-prep$test
print("a")
}
if (nrow(train)>=34000) train<-train[1:34000,]
gc()
w= sort(sample(nrow(test), nrow(test)*.5))
#split the held-out rows in half; take validate before test is overwritten
validate<-test[-w,]
test<-test[w,]
print(length(train))
u<-length(train)
u<-as.integer(log(u,10)*sqrt(u))
library(randomForest)
library(ROCR)
set.seed(42)
if (rfsample > 0)
{
print("rf")
print(rfsample)
print(summary(train))
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
sampsize=c(rfsample,rfsample),
na.action=na.roughfix,
replace=FALSE)
}
else
{
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
}
print("C")
#subset train test and validate to terms that have predictive value
imp <- importance(rf, class = NULL, scale = TRUE, type = NULL)
e<-imp <- imp[, -(1:(ncol(imp) - 2))]
print("D")
#data<-subset(data,select=c(names(e[,1][e[,1]>0]),target))
train<-subset(train,select=c(names(e[,1][e[,1]>0]),target))
print("E")
test<-subset(test,select=c(names(e[,1][e[,1]>0]),target))
print("F")
validate<-subset(validate,select=c(names(e[,1][e[,1]>0]),target))
print("done woe")
u<-length(train)
u<-as.integer(log(u,10)*sqrt(u))
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
sampsize=c(rfsample,rfsample),
na.action=na.roughfix,
replace=FALSE)
}
else
{
rf <- randomForest(formularf,
data=train,
ntree=500,
mtry=u,
importance=TRUE,
na.action=na.roughfix,
replace=FALSE)
}
rfterms<-getRfTerms(rf,n)
list_of_terms<-character(0)
for (k in 1:length(rfterms))
{
print(paste(form,rfterms[k],sep=""))
m<-glm(paste(form,rfterms[k],sep=""),data=train,family=binomial)
print("b")
}
print("d")
if (nway==0) m<-glm(formula,data=train,family=binomial)
print(summary(m))
y <- getGlmTerms(m)
y<- na.omit(y)
rm(m)
gc()
y$X2<- as.numeric(y$X2)
y$X3<- as.numeric(y$X3)
for (j in 1: nrow(y))
{
print("Adding term")
if ((y[j,3]<=.25 | runif(1, 0, .61)< .16 | TRUE ) & (regexpr(":", y[j,1]) > 0) | ( nway==0 & !(regexpr("Intercept", y[j,1]) > 0) ))
#print(names(y[1,j]))
list_of_terms[length(list_of_terms)+1]= y[j,1]
}
#################
maxperf<-0
morig<-glm(formula,data=train,family=binomial)
test$score1<-predict(morig,type='response',test)
pred1<-prediction(test$score1,test[target])
perf1 <- performance(pred1,"tpr","fpr")
validate$score1a<-predict(morig,type='response',validate)
pred1a<-prediction(validate$score1a,validate[target])
perf1a <- performance(pred1a,"tpr","fpr")
perforig<-attributes(performance(pred1a, "auc"))$y.values[[1]][1]
maxperf<-attributes(performance(pred1, "auc"))$y.values[[1]][1]
print(maxperf)
rm(morig)
gc()
list2<-character(0)
print( paste("about to try ",length(list_of_terms), " interactions.",sep=""))
for (i in 1: length(list_of_terms))
{
if (length(list2)==0)
{
print(paste(formula,list_of_terms[i],sep="+"))
mfin<-glm(paste(formula,list_of_terms[i],sep="+"),data=train,family=binomial)
}
else
{
finform<- formula
for (j in 1: length(list2))
{
if (j==1 & nway==0) finform<- paste(finform, list2[j],sep=" ")
finform<- paste(finform, list2[j],sep="+")
}
print(i)
print(finform)
print(paste(finform, list_of_terms[i],sep="+"))
mfin<-glm(paste(finform, list_of_terms[i],sep="+"),data=train,family=binomial)
}
test$score2<-predict(mfin,type='response',test)
pred2<-prediction(test$score2,test[target])
perf2 <- performance(pred2,"tpr","fpr")
print(maxperf)
print(attributes(performance(pred2, "auc"))$y.values[[1]][1])
list2[length(list2)+1]= list_of_terms[i]
maxperf<- attributes(performance(pred2, "auc"))$y.values[[1]][1]
rm(mfin)
gc()
if (length(list2)>0)
{
mfin<-glm(finform,data=train,family=binomial)
validate$p4<-predict(rf,validate,type='prob')[,2]
pred4<-prediction(validate$p4,validate[target])
perf4 <- performance(pred4,"tpr","fpr")
rfperf<-attributes(performance(pred4, "auc"))$y.values[[1]][1]
windows();
#pdf(varImpPlot(rf,main="random forest variable importance (used to choose interaction terms for glm)"));
#windows();
par(mfrow=c(2,2))
plot(perf3, col='red',add=TRUE);
plot(perf4, col='blue',add=TRUE);
legend(0.6,0.6,c('orig glm','glm w interactions' ,'orig rf'),col=c('darkgreen','red','blue'),lwd=2)
datainplot<-c((optglmperf/perforig)-1,(optglmperf/rfperf)-1)
q<-barplot(datainplot,names.arg=c('% improve AUC over original glm','% improve AUC over rf'), main="OOS AUC improvement by glm w interactions based on rf",col=c("green","red"),beside=TRUE)
text(cex=.5, x=q, y=datainplot+par("cxy")[2]/2, round(datainplot,2), xpd=TRUE) ;
}
# run regression of list 2 terms + term
return(list(finform=finform,perforig=perforig,optgkmperf=optglmperf,rfperf=rfperf,vi=importance(rf)))
#good_bad
c<-read.csv("C:/Documents and Settings/My Documents/GermanCredit.csv")
c<-subset(c,select=-default)
#TARGET_LABEL_BAD
cc<-read.csv("C:/Documents and Settings/My Documents/cckdd2010.csv")
cc$TARGET_LABEL_BAD<-as.factor(cc$TARGET_LABEL_BAD)
cc$QUANT_DEPENDANTS<-
ifelse(cc$QUANT_DEPENDANTS>=13,13,cc$QUANT_DEPENDANTS)
cc$MissingResidentialPhoneCode<-
as.factor(ifelse(is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)==TRUE,'Y','N'))
cc$MissingProfPhoneCode<-
as.factor(ifelse(is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)==TRUE,'Y','N'))
cc<-subset(cc,select=-ID_CLIENT )
cc<-subset(cc,select=-CLERK_TYPE )
cc<-subset(cc,select=-QUANT_ADDITIONAL_CARDS)
cc<-subset(cc,select=-EDUCATION_LEVEL.1)
cc<-subset(cc,select=-FLAG_MOBILE_PHONE)
cc<-subset(cc,select=-FLAG_HOME_ADDRESS_DOCUMENT)
cc<-subset(cc,select=-FLAG_RG )
cc<-subset(cc,select=-FLAG_CPF )
cc<-subset(cc,select=-FLAG_INCOME_PROOF)
cc<-subset(cc,select=-FLAG_ACSP_RECORD)
cc<-subset(cc,select=-TARGET_LABEL_BAD.1)
cc$RESIDENCIAL_PHONE_AREA_CODE<-
as.factor(cc$RESIDENCIAL_PHONE_AREA_CODE)
cc$PROFESSIONAL_PHONE_AREA_CODE<-
as.factor(cc$PROFESSIONAL_PHONE_AREA_CODE)
#cc<-subset(cc,select=-PROFESSIONAL_ZIP_3)
#cc<-subset(cc,select=-RESIDENCIAL_ZIP_3)
#cc<-subset(cc,select=-RESIDENCIAL_PHONE_AREA_CODE)
#cc<-subset(cc,select=-PROFESSIONAL_PHONE_AREA_CODE)
# generate interaction data
#woe no interac
b1<-tuneGlmToRf(c, "good_bad", n=1,100, FALSE,0,NA,TRUE,FALSE)
b2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=1, 100,
FALSE,0,NA,TRUE,FALSE)
#woe+bin interac
d1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1)
d2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100, FALSE,1)
d3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1)
#no woe interac
f1<-tuneGlmToRf(c, "good_bad", n=10,100, FALSE,1,NA,FALSE)
f2<-tuneGlmToRf(cc, "TARGET_LABEL_BAD", n=10, 100, FALSE,1,NA,FALSE)
f3<-tuneGlmToRf(hmeq, "BAD", n=10, 100, FALSE,1,NA,FALSE)