Submission PPT 110721 - Group 2

AIPM_GROUP 2

Submission: Module 2_3
11 July 2021

• Abhiram TSR
• Bhushan Kudale
• Malvika Chaturvedi
• Minali Shah
• Shivendra Srivastava
Proprietary and confidential — do not distribute
Models covered

1. Boston Housing Data
2. Customer Profit Data
3. E-sign
4. Lead scoring


Boston Housing Data

Boston Housing Data: Summary

Task: MEDV is the dependent variable. Using all variables, build a model to predict the median value of owner-occupied homes.

1. Data cleaning process
   • Imputed the data to fill missing values in PTRATIO
   • Identified the correlation of PTRATIO with the other variables
   • Constructed a regression equation for PTRATIO from its most highly correlated variables to fill in the missing values (see the sketch below)
2. Identified the hyperparameter to tune: mtry
3. Chose the Tanh activation function for the neural network
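
A minimal sketch of the regression-based imputation above. The file name and the predictor set (RAD, TAX, INDUS) are illustrative assumptions; use whichever variables correlate most strongly with PTRATIO in your data.

```r
# Regression imputation for PTRATIO (illustrative predictor choice)
housing <- read.csv("BostonHousing.csv")   # hypothetical file name

# Step 1: correlations of PTRATIO with the other variables
cor(housing, use = "pairwise.complete.obs")["PTRATIO", ]

# Step 2: regress PTRATIO on its strongest correlates using complete rows
fit <- lm(PTRATIO ~ RAD + TAX + INDUS, data = housing)

# Step 3: replace the missing PTRATIO values with model predictions
miss <- is.na(housing$PTRATIO)
housing$PTRATIO[miss] <- predict(fit, newdata = housing[miss, ])
```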



Key findings: Random Forest
TUNING VARIABLE: MTRY (via tuneRF)

Additional packages explored: Metrics

Key findings:

1. Running the tuneRF function gives the OOB error for different mtry values at a given ntree. We tried ntree values of 500, 400, 300 and 200 and found that the lowest OOB error occurs at ntree = 400 and mtry = 4 (see the sketch below).
2. The top 4 variables influencing MEDV in the Random Forest were RM, LSTAT, CHAS and PTRATIO.
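
A sketch of the tuneRF search in point 1, assuming the imputed 'housing' data frame from the cleaning step; step size and improvement threshold are illustrative.

```r
library(randomForest)
set.seed(42)

# OOB error across candidate mtry values for a given ntree
# (we repeated this with ntreeTry = 200, 300, 400 and 500)
tuneRF(x = housing[, setdiff(names(housing), "MEDV")],
       y = housing$MEDV,
       ntreeTry = 400, stepFactor = 1.5, improve = 0.01)

# Final forest at the chosen point (ntree = 400, mtry = 4)
rf <- randomForest(MEDV ~ ., data = housing, ntree = 400, mtry = 4,
                   importance = TRUE)
varImpPlot(rf)   # RM, LSTAT, CHAS and PTRATIO at the top in our run
```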



Key findings: Neural Network
ACTIVATION FUNCTION: TANH

Additional packages explored: neuralnet

Key findings:

1. The top 4 variables influencing MEDV in the Neural Network were AGE, NOX, LSTAT and RM.
2. The Tanh activation function was applied with two hidden layers (10 and 20 neurons) and 200 epochs on the training set (see the sketch below).
3. The original configuration (Rectifier, 1 layer, X neurons, 100 epochs) had an accuracy of xx%.
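
A sketch of the neuralnet fit in point 2, with min-max scaling added (neuralnet is sensitive to unscaled inputs). Note that the epoch counts quoted above do not map directly onto neuralnet's API, so this shows only the Tanh, two-hidden-layer part of the configuration.

```r
library(neuralnet)

# Min-max scale all columns to [0, 1]
maxs <- apply(housing, 2, max)
mins <- apply(housing, 2, min)
scaled <- as.data.frame(scale(housing, center = mins, scale = maxs - mins))

# Build the formula explicitly (older neuralnet versions reject '.')
preds <- setdiff(names(scaled), "MEDV")
f <- as.formula(paste("MEDV ~", paste(preds, collapse = " + ")))

set.seed(42)
nn <- neuralnet(f, data = scaled,
                hidden = c(10, 20),    # two hidden layers, as above
                act.fct = "tanh",
                linear.output = TRUE)  # linear output for regression

# Predict and rescale back to original MEDV units
pred <- compute(nn, scaled[, preds])$net.result *
        (maxs["MEDV"] - mins["MEDV"]) + mins["MEDV"]
```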



Deriving the best model

Regression summary: RMSE – 5.8
SVM summary: RMSE Linear – 5.9, Polynomial – 6.8, Radial – 6.04, Sigmoid – 12.4
Random Forest summary: RMSE – 4.9
Neural Network summary: RMSE – 2.9

Based on the lowest RMSE, the preferred method for model fitting is the Neural Network.



Shiny apps interface

Customer Profit

Customer Profit Data: Summary

Task: Profit is the dependent variable. Using all variables, build a model to predict profit.

1. Data cleaning process
   • Imputed missing values using the 'mice' and 'VIM' packages (see the sketch below)
2. Identified hyperparameters for all models: mtry, epsilon, gamma
3. Chose the Tanh activation function for the neural network
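
A minimal sketch of the imputation step, assuming the dataset is read into a hypothetical 'profit_df' data frame; the file name and imputation method are illustrative.

```r
library(mice)   # multiple imputation by chained equations
library(VIM)    # visualization of missing-data patterns

profit_df <- read.csv("CustomerProfit.csv")   # hypothetical file name

# Inspect the missingness pattern before imputing
aggr(profit_df, numbers = TRUE, sortVars = TRUE)

# Impute with predictive mean matching and take one completed dataset
imp <- mice(profit_df, m = 5, method = "pmm", seed = 42)
profit_complete <- complete(imp, 1)
```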



Key findings: SVM & Random Forest
TUNING VARIABLES: MTRY, EPSILON, GAMMA

Key findings:

1. For every ntree value tried, the lowest OOB error was achieved with mtry = 1. The OOB errors for ntree = 400 and 500 were very close, so in the interest of a simpler model and the best balance between speed and accuracy we chose ntree = 400.
2. The top 2 variables influencing Profit in the Random Forest were INCOME and TENURE.
3. The best parameters for the radial kernel in the SVM were epsilon = 0.6, gamma = 0.1428571 and cost = 0.25 (see the sketch below).
4. For the neural network, the lowest RMSE was seen with the Tanh activation function.
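
A sketch of the radial-kernel search in point 3 using e1071::tune; the grid bounds are illustrative assumptions. gamma is left at its default of 1/(number of predictors), which is where the reported 0.1428571 (= 1/7) comes from.

```r
library(e1071)
set.seed(42)

tuned <- tune(svm, Profit ~ ., data = profit_complete,
              kernel = "radial",
              ranges = list(epsilon = seq(0, 1, 0.1),
                            cost    = 2^(-2:2)))

tuned$best.parameters   # our run: epsilon = 0.6, cost = 0.25
best_svm <- tuned$best.model
```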



Key findings: Neural Network
ACTIVATION FUNCTION: TANH

Key findings:

1. The top variables influencing Profit in the Neural Network were ONLINE and AGE.
2. The lowest RMSE was seen with the Tanh activation function, applied with two hidden layers (10 and 5 neurons) on the training set.
3. The original configuration (Rectifier, 1 layer, X neurons, 100 epochs) had an accuracy of xx%.



Deriving the best model

Regression: RMSE – 217.4
SVM: RMSE Linear – 223.5, Polynomial – 223, Radial – 222.8, Sigmoid – 53052.4
Random Forest: RMSE – 216.6
Neural Network (H2o): RMSE – 214.3

Based on the lowest RMSE, the preferred method for model fitting is the Neural Network.



Shiny apps interface

E-sign

E-sign: Summary

Task: Sign-up (conversion) is the dependent variable; build a model to predict it.

1. Data cleaning process
   • All values were complete; no imputation needed
   • Data was scaled
2. Chosen model: Random Forest with mtry = 8 and ntree = 400

Key findings: Random Forest and SVM

TUNING VARIABLES: MTRY (via tuneRF), EPSILON, GAMMA

1. To reduce processing time we ran two separate grids, then selected the model whose C value gave the better accuracy. For the linear kernel, the best model has C = 1.
2. The best parameters for the radial kernel were epsilon = 0 and cost = 2.
3. The most influential variables for the Random Forest were AMOUNT REQUESTED, RISK SCORE and INCOME.
4. Since this is a classification problem, the number of variables tried at each split (mtry) defaults to the square root of the total number of variables, and ntree defaults to 500. Printing the model shows the number of trees, the variables tried at each split and the OOB error rate: 36.89% with ntree = 500 and mtry = 4 (see the sketch below).
5. Tuning mtry improved accuracy: with mtry = 8 and ntree = 300 the OOB error fell to 36.71%, and after tuneRF it dropped further to 36.37%. The AUC was 64%.
6. The radial-kernel SVM, with an AUC of 61%, came closest to the Random Forest.
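
A sketch of the Random Forest runs in points 4 and 5. The data frame 'esign' and the response column 'e_signed' are placeholder names for the E-sign dataset.

```r
library(randomForest)
set.seed(42)

# Defaults for classification: ntree = 500, mtry = sqrt(p)
rf <- randomForest(as.factor(e_signed) ~ ., data = esign)
print(rf)   # shows ntree, mtry and the OOB error rate (36.89% in our run)

# Retuned model: mtry = 8, ntree = 300 (OOB error 36.71% in our run)
rf2 <- randomForest(as.factor(e_signed) ~ ., data = esign,
                    mtry = 8, ntree = 300, importance = TRUE)
varImpPlot(rf2)   # Amount Requested, Risk Score and Income at the top
```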



Key findings: Neural Network

ACTIVATION FUNCTION: TANH

• We created a model for each combination of hidden layers, number of neurons, activation function and epochs, along with a couple of experiments with input_dropout_ratio.
• After multiple trials, an AUC of 0.60 was achieved with Tanh as the activation function, epochs = 50, hidden = c(50, 50) and input_dropout_ratio = 0; this configuration also proved fast and efficient (see the sketch below).
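
A sketch of that final H2O configuration; the frame and column names are placeholders.

```r
library(h2o)
h2o.init()

esign_h2o <- as.h2o(esign)
esign_h2o$e_signed <- as.factor(esign_h2o$e_signed)  # classification target
splits <- h2o.splitFrame(esign_h2o, ratios = 0.8, seed = 42)

dl <- h2o.deeplearning(y = "e_signed",
                       training_frame      = splits[[1]],
                       validation_frame    = splits[[2]],
                       activation          = "Tanh",
                       hidden              = c(50, 50),
                       epochs              = 50,
                       input_dropout_ratio = 0,
                       seed                = 42)

h2o.auc(h2o.performance(dl, valid = TRUE))   # ~0.60 in our run
```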



Deriving the best model

Logistic Regression: AUC – 0.57
SVM: AUC Linear – 0.58, Polynomial – 0.60, Radial – 0.61, Sigmoid – 0.50
Random Forest: AUC – 0.64
Neural Network (H2o): AUC – 0.60

Based on the highest AUC, the preferred method for model fitting is the Random Forest.



Shiny apps interface

Lead scoring – Ed tech

Lead Scoring: Summary

Task: Identify predictor variables and build a model that predicts the likelihood of customer conversion for online courses. The current conversion rate is 30%; with this model, the sales team can focus on the most likely prospects.

Data cleaning process (steps 1-5):

1. Removed the following 14 columns, which
   – would have no impact (Cheque, Update me on supply chain, Direct mailer)
   – were calculated or subjective (Tags, 4 Asymmetrique variables, Lead Quality)
   – were covered by other variables (Last notable activity, Total visits, Page views per visit)
   – left results unchanged (Magazines; update fields that were all "no")
2. Coded categorical variables where feasible.
3. Scaled the data / converted values into ranges (e.g., total time spent on website).
4. Imputed missing values using the 'mice' and 'VIM' packages.
5. Identified 11 important variables via Boruta and created a new dataset with these variables (see the sketch below).
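
A minimal sketch of the Boruta step in point 5, with 'leads' and the response 'Converted' as placeholder names for the cleaned lead-scoring dataset.

```r
library(Boruta)
set.seed(42)

bor <- Boruta(Converted ~ ., data = leads)
print(bor)

# Keep the confirmed attributes (11 in our run) plus the response
keep <- getSelectedAttributes(bor, withTentative = FALSE)
leads_model <- leads[, c(keep, "Converted")]
```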

Key findings on Neural Network
Additional packages: Boruta

Multiple NN models were run with H2o. Final model:

1. Activation: Tanh
2. Hidden layers: 3
3. Neurons per layer ("branches"): 10, 6, 4
4. Epochs: 100

Default parameters (Rectifier, 1 layer, 6 neurons, 100 epochs) gave an accuracy of 77%. A sketch of the final model follows.
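
The sketch reuses the Boruta-filtered 'leads_model' data frame from the cleaning step (a placeholder name).

```r
library(h2o)
h2o.init()

leads_h2o <- as.h2o(leads_model)
leads_h2o$Converted <- as.factor(leads_h2o$Converted)
parts <- h2o.splitFrame(leads_h2o, ratios = 0.8, seed = 42)

final_nn <- h2o.deeplearning(y = "Converted",
                             training_frame   = parts[[1]],
                             validation_frame = parts[[2]],
                             activation       = "Tanh",
                             hidden           = c(10, 6, 4),  # 3 layers, as above
                             epochs           = 100,
                             seed             = 42)

h2o.auc(h2o.performance(final_nn, valid = TRUE))
```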

Key findings on Random Forest
TUNING PARAMETERS

Abhi/Bhushan: inputs needed on the tuning parameters.

Deriving the best model

Logistic Regression: AUC – 0.815
SVM: AUC Linear – 0.81, Polynomial – 0.59, Radial – 0.81, Sigmoid – 0.80
Random Forest: AUC (after tuning) – 0.83
Neural Network (H2o): AUC – 0.78, accuracy – 80%

Based on the highest AUC, the preferred method for model fitting is the Random Forest.



Shiny apps interface
