GL - ML - 2017 (Reworked)
Cost function (logistic regression log loss):
When y = 1: Cost = -log(prediction)
When y = 0: Cost = -log(1 - prediction)
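The two cases above can be sketched as a single per-example cost function; this is a minimal Python illustration (the function name and the epsilon clipping are mine, not from the slide):

```python
import math

def log_loss(y, prediction, eps=1e-15):
    """Per-example cost for a binary classifier.

    y = 1 -> cost = -log(prediction)
    y = 0 -> cost = -log(1 - prediction)
    eps keeps log() away from 0 when the prediction saturates.
    """
    p = min(max(prediction, eps), 1 - eps)
    return -math.log(p) if y == 1 else -math.log(1 - p)

# A confident correct prediction costs almost nothing...
print(round(log_loss(1, 0.99), 4))  # ≈ 0.0101
# ...while a confident wrong one is punished heavily.
print(round(log_loss(1, 0.01), 4))  # ≈ 4.6052
```

Note how the cost grows without bound as the prediction moves toward the wrong extreme, which is exactly why the -log form is used.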
A “forest”, i.e. an ensemble of trees, is required to address over-fitting.
Impact of the number of trees:
- Larger number of trees = less chance of over-fitting
- But: a more complex solution and higher runtime

Impact of the number of randomly selected variables:
- More randomly selected variables = significant variables show up
- But: repetitive trees, and not all variables in the data are evaluated

Impact of the sampling ratio:
- Higher sampling ratio = enough data points to build each tree
- But: not enough data points left to test the stability of the trees

Impact of sampling without replacement:
- Trees covering different dimensions
- But: a limit on the maximum number of trees
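The knobs above map naturally onto a random-forest implementation; here is a minimal sketch using scikit-learn's RandomForestClassifier (the slide names no library, so this parameter mapping is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset standing in for the course data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,  # number of trees: more = less over-fitting, higher runtime
    max_features=3,    # randomly selected variables tried at each split
    max_samples=0.8,   # sampling ratio: fraction of rows drawn for each tree
    bootstrap=True,    # sample WITH replacement; without it, distinct trees run out
    random_state=0,
)
rf.fit(X, y)
print(len(rf.estimators_))  # 200 trees in the fitted forest
```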
The number of trees you build, the depth of each tree, and the learning rate together determine how good a model you get!
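Those three knobs can be sketched with scikit-learn's GradientBoostingRegressor (the library choice is an assumption; the slide covers GBM for both regression and classification):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Toy regression dataset standing in for the course data.
X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=300,    # number of boosting trees
    max_depth=3,         # depth of each weak-learner tree
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    random_state=0,
)
gbm.fit(X, y)
print(gbm.n_estimators_)  # number of trees actually built
```

Lowering the learning rate usually requires more trees to reach the same fit, which is the trade-off the sentence above is pointing at.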
Overview of GBM – Regression and Classification
Replacing with percentile values (capping INVST at the 1st and 99th percentiles):
training$INVST1 <- ifelse(training$INVST <= 10, 10, training$INVST)  # floor at the 1st percentile (10)
training$INVST1 <- ifelse(training$INVST1 >= 1222.485, 1222.485, training$INVST1)  # cap at the 99th percentile (1222.485)
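The same capping can be expressed in Python with numpy (a translation of the R snippet, not from the original; the cut-offs 10 and 1222.485 are the slide's 1st- and 99th-percentile values):

```python
import numpy as np

# Toy stand-in for training$INVST.
invst = np.array([2.0, 50.0, 300.0, 5000.0])

# Clip to the slide's percentile cut-offs; in practice they would
# come from np.percentile(invst, [1, 99]) on the training data.
invst_capped = np.clip(invst, 10.0, 1222.485)
print(invst_capped.tolist())  # [10.0, 50.0, 300.0, 1222.485]
```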