Submission PPT 110721 - Group 2

AIPM_GROUP 2

Submission: Module 2_3
11 July 2021

• Abhiram TSR
• Bhushan Kudale
• Malvika Chaturvedi
• Minali Shah
• Shivendra Srivastava
Proprietary and confidential — do not distribute
Models covered

1. Boston Housing Data
2. Customer Profit Data
3. E-sign
4. Lead scoring


Boston Housing Data

Boston Housing Data: Summary

Task: MEDV is the dependent variable. Using all variables, build a model to predict the median value of owner-occupied homes.

1. Data cleaning process
   • Imputed the data to fill missing values in PTRATIO
   • Identified the correlation of PTRATIO with the other variables
   • Constructed a regression equation for PTRATIO from its most highly correlated variables to fill in the missing values (see the sketch below)
2. Identified the hyperparameter to tune: mtry
3. Chose the Tanh activation function for the neural network
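
A minimal sketch of the regression-based imputation above. The file name and the predictor set (RAD, TAX, INDUS) are illustrative assumptions; use whichever variables correlate most strongly with PTRATIO in your data.

```r
# Regression imputation for PTRATIO (illustrative predictor choice)
housing <- read.csv("BostonHousing.csv")   # hypothetical file name

# Step 1: correlations of PTRATIO with the other variables
cor(housing, use = "pairwise.complete.obs")["PTRATIO", ]

# Step 2: regress PTRATIO on its strongest correlates using complete rows
fit <- lm(PTRATIO ~ RAD + TAX + INDUS, data = housing)

# Step 3: replace the missing PTRATIO values with model predictions
miss <- is.na(housing$PTRATIO)
housing$PTRATIO[miss] <- predict(fit, newdata = housing[miss, ])
```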



Key findings: Random Forest
TUNING VARIABLE: MTRY (via tuneRF)

Additional packages explored: Metrics

Key findings:

1. Running the tuneRF function gives the OOB error for different mtry values at a given ntree. We tried ntree values of 500, 400, 300 and 200 and found that the lowest OOB error occurs at ntree = 400 and mtry = 4 (see the sketch below).
2. The top 4 variables influencing MEDV in the Random Forest were RM, LSTAT, CHAS and PTRATIO.
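
A sketch of the tuneRF search in point 1, assuming the imputed 'housing' data frame from the cleaning step; step size and improvement threshold are illustrative.

```r
library(randomForest)
set.seed(42)

# OOB error across candidate mtry values for a given ntree
# (we repeated this with ntreeTry = 200, 300, 400 and 500)
tuneRF(x = housing[, setdiff(names(housing), "MEDV")],
       y = housing$MEDV,
       ntreeTry = 400, stepFactor = 1.5, improve = 0.01)

# Final forest at the chosen point (ntree = 400, mtry = 4)
rf <- randomForest(MEDV ~ ., data = housing, ntree = 400, mtry = 4,
                   importance = TRUE)
varImpPlot(rf)   # RM, LSTAT, CHAS and PTRATIO at the top in our run
```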



Key findings: Neural Network
ACTIVATION FUNCTION: TANH

Additional packages explored: neuralnet

Key findings:

1. The top 4 variables influencing MEDV in the Neural Network were AGE, NOX, LSTAT and RM.
2. The Tanh activation function was applied with two hidden layers (10 and 20 neurons) and 200 epochs on the training set (see the sketch below).
3. The original configuration (Rectifier, 1 layer, X neurons, 100 epochs) had an accuracy of xx%.
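
A sketch of the neuralnet fit in point 2, with min-max scaling added (neuralnet is sensitive to unscaled inputs). Note that the epoch counts quoted above do not map directly onto neuralnet's API, so this shows only the Tanh, two-hidden-layer part of the configuration.

```r
library(neuralnet)

# Min-max scale all columns to [0, 1]
maxs <- apply(housing, 2, max)
mins <- apply(housing, 2, min)
scaled <- as.data.frame(scale(housing, center = mins, scale = maxs - mins))

# Build the formula explicitly (older neuralnet versions reject '.')
preds <- setdiff(names(scaled), "MEDV")
f <- as.formula(paste("MEDV ~", paste(preds, collapse = " + ")))

set.seed(42)
nn <- neuralnet(f, data = scaled,
                hidden = c(10, 20),    # two hidden layers, as above
                act.fct = "tanh",
                linear.output = TRUE)  # linear output for regression

# Predict and rescale back to original MEDV units
pred <- compute(nn, scaled[, preds])$net.result *
        (maxs["MEDV"] - mins["MEDV"]) + mins["MEDV"]
```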



Deriving the best model

Regression summary: RMSE – 5.8
SVM summary: RMSE Linear – 5.9, Polynomial – 6.8, Radial – 6.04, Sigmoid – 12.4
Random Forest summary: RMSE – 4.9
Neural Network summary: RMSE – 2.9

Based on the lowest RMSE, the preferred method for model fitting is the Neural Network.



Shiny apps interface

Customer Profit

Customer Profit Data: Summary

Task: Profit is the dependent variable. Using all variables, build a model to predict profit.

1. Data cleaning process
   • Imputed missing values using the 'mice' and 'VIM' packages (see the sketch below)
2. Identified hyperparameters for all models: mtry, epsilon, gamma
3. Chose the Tanh activation function for the neural network
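
A minimal sketch of the imputation step, assuming the dataset is read into a hypothetical 'profit_df' data frame; the file name and imputation method are illustrative.

```r
library(mice)   # multiple imputation by chained equations
library(VIM)    # visualization of missing-data patterns

profit_df <- read.csv("CustomerProfit.csv")   # hypothetical file name

# Inspect the missingness pattern before imputing
aggr(profit_df, numbers = TRUE, sortVars = TRUE)

# Impute with predictive mean matching and take one completed dataset
imp <- mice(profit_df, m = 5, method = "pmm", seed = 42)
profit_complete <- complete(imp, 1)
```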



Key findings: SVM & Random Forest
TUNING VARIABLES: MTRY, EPSILON, GAMMA

Key findings:

1. For every ntree value tried, the lowest OOB error was achieved with mtry = 1. The OOB errors for ntree = 400 and 500 were very close, so in the interest of a simpler model and the best balance between speed and accuracy we chose ntree = 400.
2. The top 2 variables influencing Profit in the Random Forest were INCOME and TENURE.
3. The best parameters for the radial kernel in the SVM were epsilon = 0.6, gamma = 0.1428571 and cost = 0.25 (see the sketch below).
4. For the neural network, the lowest RMSE was seen with the Tanh activation function.
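
A sketch of the radial-kernel search in point 3 using e1071::tune; the grid bounds are illustrative assumptions. gamma is left at its default of 1/(number of predictors), which is where the reported 0.1428571 (= 1/7) comes from.

```r
library(e1071)
set.seed(42)

tuned <- tune(svm, Profit ~ ., data = profit_complete,
              kernel = "radial",
              ranges = list(epsilon = seq(0, 1, 0.1),
                            cost    = 2^(-2:2)))

tuned$best.parameters   # our run: epsilon = 0.6, cost = 0.25
best_svm <- tuned$best.model
```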



Key findings: Neural Network
ACTIVATION FUNCTION: TANH

Key findings:

1. The top variables influencing Profit in the Neural Network were ONLINE and AGE.
2. The lowest RMSE was seen with the Tanh activation function, applied with two hidden layers (10 and 5 neurons) on the training set.
3. The original configuration (Rectifier, 1 layer, X neurons, 100 epochs) had an accuracy of xx%.



Deriving the best model

Regression: RMSE – 217.4
SVM: RMSE Linear – 223.5, Polynomial – 223, Radial – 222.8, Sigmoid – 53052.4
Random Forest: RMSE – 216.6
Neural Network (H2o): RMSE – 214.3

Based on the lowest RMSE, the preferred method for model fitting is the Neural Network.



Shiny apps interface

E-sign

E-sign: Summary

Task: Sign-up (conversion) is the dependent variable; build a model to predict it.

1. Data cleaning process
   • All values were complete; no imputation needed
   • Data was scaled
2. Chosen model: Random Forest with mtry = 8 and ntree = 400

Key findings: Random Forest and SVM

TUNING VARIABLES: MTRY (via tuneRF), EPSILON, GAMMA

1. To reduce processing time we ran two separate grids, then selected the model whose C value gave the better accuracy. For the linear kernel, the best model has C = 1.
2. The best parameters for the radial kernel were epsilon = 0 and cost = 2.
3. The most influential variables for the Random Forest were AMOUNT REQUESTED, RISK SCORE and INCOME.
4. Since this is a classification problem, the number of variables tried at each split (mtry) defaults to the square root of the total number of variables, and ntree defaults to 500. Printing the model shows the number of trees, the variables tried at each split and the OOB error rate: 36.89% with ntree = 500 and mtry = 4 (see the sketch below).
5. Tuning mtry improved accuracy: with mtry = 8 and ntree = 300 the OOB error fell to 36.71%, and after tuneRF it dropped further to 36.37%. The AUC was 64%.
6. The radial-kernel SVM, with an AUC of 61%, came closest to the Random Forest.
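
A sketch of the Random Forest runs in points 4 and 5. The data frame 'esign' and the response column 'e_signed' are placeholder names for the E-sign dataset.

```r
library(randomForest)
set.seed(42)

# Defaults for classification: ntree = 500, mtry = sqrt(p)
rf <- randomForest(as.factor(e_signed) ~ ., data = esign)
print(rf)   # shows ntree, mtry and the OOB error rate (36.89% in our run)

# Retuned model: mtry = 8, ntree = 300 (OOB error 36.71% in our run)
rf2 <- randomForest(as.factor(e_signed) ~ ., data = esign,
                    mtry = 8, ntree = 300, importance = TRUE)
varImpPlot(rf2)   # Amount Requested, Risk Score and Income at the top
```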



Key findings: Neural Network

ACTIVATION FUNCTION: TANH

• We created a model for each combination of hidden layers, number of neurons, activation function and epochs, along with a couple of experiments with input_dropout_ratio.
• After multiple trials, an AUC of 0.60 was achieved with Tanh as the activation function, epochs = 50, hidden = c(50, 50) and input_dropout_ratio = 0; this configuration also proved fast and efficient (see the sketch below).
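
A sketch of that final H2O configuration; the frame and column names are placeholders.

```r
library(h2o)
h2o.init()

esign_h2o <- as.h2o(esign)
esign_h2o$e_signed <- as.factor(esign_h2o$e_signed)  # classification target
splits <- h2o.splitFrame(esign_h2o, ratios = 0.8, seed = 42)

dl <- h2o.deeplearning(y = "e_signed",
                       training_frame      = splits[[1]],
                       validation_frame    = splits[[2]],
                       activation          = "Tanh",
                       hidden              = c(50, 50),
                       epochs              = 50,
                       input_dropout_ratio = 0,
                       seed                = 42)

h2o.auc(h2o.performance(dl, valid = TRUE))   # ~0.60 in our run
```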



Deriving the best model

Logistic Regression: AUC – 0.57
SVM: AUC Linear – 0.58, Polynomial – 0.60, Radial – 0.61, Sigmoid – 0.50
Random Forest: AUC – 0.64
Neural Network (H2o): AUC – 0.60

Based on the highest AUC, the preferred method for model fitting is the Random Forest.



Shiny apps interface

Lead scoring – Ed tech

Lead Scoring: Summary

Task: Identify predictor variables and build a model that predicts the likelihood of customer conversion for online courses. The current conversion rate is 30%; with this model, the sales team can focus on the most likely prospects.

Data cleaning process (steps 1-5):

1. Removed the following 14 columns, which
   – would have no impact (Cheque, Update me on supply chain, Direct mailer)
   – were calculated or subjective (Tags, 4 Asymmetrique variables, Lead Quality)
   – were covered by other variables (Last notable activity, Total visits, Page views per visit)
   – left results unchanged (Magazines; update fields that were all "no")
2. Coded categorical variables where feasible.
3. Scaled the data / converted values into ranges (e.g., total time spent on website).
4. Imputed missing values using the 'mice' and 'VIM' packages.
5. Identified 11 important variables via Boruta and created a new dataset with these variables (see the sketch below).
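
A minimal sketch of the Boruta step in point 5, with 'leads' and the response 'Converted' as placeholder names for the cleaned lead-scoring dataset.

```r
library(Boruta)
set.seed(42)

bor <- Boruta(Converted ~ ., data = leads)
print(bor)

# Keep the confirmed attributes (11 in our run) plus the response
keep <- getSelectedAttributes(bor, withTentative = FALSE)
leads_model <- leads[, c(keep, "Converted")]
```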

Key findings on Neural Network
Additional packages: Boruta

Multiple NN models were run with H2o. Final model:

1. Activation: Tanh
2. Hidden layers: 3
3. Neurons per layer ("branches"): 10, 6, 4
4. Epochs: 100

Default parameters (Rectifier, 1 layer, 6 neurons, 100 epochs) gave an accuracy of 77%. A sketch of the final model follows.
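
The sketch reuses the Boruta-filtered 'leads_model' data frame from the cleaning step (a placeholder name).

```r
library(h2o)
h2o.init()

leads_h2o <- as.h2o(leads_model)
leads_h2o$Converted <- as.factor(leads_h2o$Converted)
parts <- h2o.splitFrame(leads_h2o, ratios = 0.8, seed = 42)

final_nn <- h2o.deeplearning(y = "Converted",
                             training_frame   = parts[[1]],
                             validation_frame = parts[[2]],
                             activation       = "Tanh",
                             hidden           = c(10, 6, 4),  # 3 layers, as above
                             epochs           = 100,
                             seed             = 42)

h2o.auc(h2o.performance(final_nn, valid = TRUE))
```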

Key findings on Random Forest
TUNING PARAMETERS

Abhi/Bhushan: inputs needed on the tuning parameters.

Deriving the best model

Logistic Regression: AUC – 0.815
SVM: AUC Linear – 0.81, Polynomial – 0.59, Radial – 0.81, Sigmoid – 0.80
Random Forest: AUC (after tuning) – 0.83
Neural Network (H2o): AUC – 0.78, accuracy – 80%

Based on the highest AUC, the preferred method for model fitting is the Random Forest.



Shiny apps interface
