
SVKM’S NMIMS

SCHOOL OF BUSINESS MANAGEMENT

HYDERABAD

PGDM BATCH – 2018-20

AMTA ASSIGNMENT
To analyze the probability that a person will default if given a loan

Submitted to: Dr. Abhilash Poonam (Section B)

Prepared by: Saransh Kansal (80303180090)
Vamini Madan (80303180108)
Surbhi Jain (80303180083)

GLMNET- Classification and Regression Model

1. Introduction: -

The glmnet package fits generalized linear models via penalized maximum likelihood.
Regularization helps in solving the overfitting problem. glmnet performs
regularization by computing the regularization path for the lasso or elastic-net
penalty at a grid of values for the regularization parameter lambda. It is an
extremely fast algorithm and can exploit sparsity in the input matrix. It fits linear,
logistic, multinomial, Poisson and Cox regression models. A variety of predictions
can be made from the fitted models.
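As an illustration (not the report's own code), the sketch below fits a penalized logistic regression path with glmnet. It assumes the Default data set from the ISLR package, which matches the data description used later in this report.

# A minimal sketch: a lasso-penalized logistic regression path.
library(glmnet)
library(ISLR)

data(Default)
x <- model.matrix(default ~ student + balance + income, data = Default)[, -1]
y <- Default$default

# alpha = 1 gives the lasso penalty; alpha = 0 gives ridge;
# values in between give the elastic net.
fit <- glmnet(x, y, family = "binomial", alpha = 1)

coef(fit, s = 0.01)        # coefficients at a chosen lambda
plot(fit, xvar = "lambda") # coefficient paths across the lambda grid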

2. Working of Algorithm: -

glmnet uses a cyclical coordinate descent algorithm, in which the objective
function is optimized over each parameter in turn while the other parameters are
held constant, and this cycle repeats until convergence. The code can
handle sparse input-matrix formats, as well as range constraints on coefficients. The
core of glmnet is a set of Fortran subroutines, which make for very fast execution.
The package also includes methods for prediction and plotting, and a function that
performs K-fold cross-validation.
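The K-fold cross-validation function mentioned above is cv.glmnet. A minimal sketch, continuing from the x and y built in the previous snippet:

library(glmnet)

set.seed(123)
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10,
                   type.measure = "class")  # misclassification error

cvfit$lambda.min  # lambda with the lowest CV error
cvfit$lambda.1se  # largest lambda within one SE of the minimum
plot(cvfit)       # CV error curve across the lambda path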

Using caret package: -

glmnet (method = 'glmnet')

For classification and regression, using packages glmnet and Matrix, with tuning parameters:

- Mixing Percentage (alpha, numeric)
- Regularization Parameter (lambda, numeric)
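A sketch of the corresponding caret call; the tuning-grid values below are illustrative, not the ones used in the report.

library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
grid <- expand.grid(alpha  = c(0, 0.5, 1),
                    lambda = 10^seq(-4, -1, length.out = 10))

glmnet_fit <- train(default ~ ., data = Default,
                    method    = "glmnet",
                    trControl = ctrl,
                    tuneGrid  = grid)
glmnet_fit$bestTune  # chosen alpha and lambda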

3. Analysis using glmnet model: -

Description

A simulated data set containing information on ten thousand customers. The aim here is to predict which
customers will default on their credit card debt.

Default: - A factor with levels No and Yes indicating whether the customer defaulted on their debt

Student: - A factor with levels No and Yes indicating whether the customer is a student

Balance: - The average balance that the customer has remaining on their credit card after making their
monthly payment

Income: - Income of the customer

Step 1: - We ran the glmnet classification model using the caret package.

Output:-

Step 2: - We then predicted the probability of default using the above model and ran the
confusion matrix.
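A sketch of what this step might look like, assuming the glmnet_fit object from the previous snippet; the report's own data split and cutoff may differ.

library(caret)

# Predicted probability of the "Yes" (default) class.
probs <- predict(glmnet_fit, newdata = Default, type = "prob")[, "Yes"]

# Classify at the conventional 0.5 cutoff, then tabulate.
pred <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(pred, Default$default, positive = "Yes")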

Output :-

Although the accuracy of the model is 97.34%, its sensitivity is very low at 31.53%. This
means that out of the total actual defaulters, the model is able to identify only 31.53%
correctly. Hence the above model cannot be used for prediction.

Since correctly identifying defaulters is the main criterion for the best model fit, we ran the
model again using specificity as the selection metric (with caret's default factor-level
ordering, the 'Spec' metric here corresponds to the defaulter class that confusionMatrix
reports as sensitivity).

One more important point is that the data is biased towards non-defaulters, i.e., there are
many more non-defaulters than defaulters in the data.

So, in order to remove this bias, we have to oversample the defaulter class.

Considering these things, we ran the classification model again.
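A hedged sketch of this re-run: oversample the defaulter class with caret::upSample, then tune on specificity. The exact options used in the report may differ.

library(caret)

set.seed(123)
up <- upSample(x = Default[, c("student", "balance", "income")],
               y = Default$default, yname = "default")

ctrl2 <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                      summaryFunction = twoClassSummary)

refit <- train(default ~ ., data = up,
               method    = "glmnet",
               metric    = "Spec",  # select the best model on specificity
               trControl = ctrl2)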

Output :-

This time specificity is used to select the best model.

We then computed the predicted probabilities and ran the confusion matrix again.

Output :-

ROC CURVE: -

The ROC curve gives the threshold cutoff, known as Youden's Index, at which the sum of
sensitivity and specificity is maximized. In this case the threshold is 0.487709.
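The report does not name the package used for the ROC analysis; one common way to obtain this cutoff is the pROC package, as sketched below, assuming a vector probs of predicted default probabilities like the one computed earlier.

library(pROC)

roc_obj <- roc(response = Default$default, predictor = probs,
               levels = c("No", "Yes"))
plot(roc_obj)

# "youden" picks the cutoff maximizing sensitivity + specificity - 1.
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))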

This time we got an accuracy of 86.4%, but the sensitivity, the metric we care about, is
90.09%. This means that out of the total defaulters, our model is able to predict 90.09%
correctly. So, although not 100% accurate, this is a much better model and can be used for
prediction.

4. Recommendations: -
1. The model identifies 90.09% of defaulters, so its predictions should be treated
with an appropriate margin of error when taking decisions.
2. Sensitivity is a better criterion than accuracy for evaluating this model.
3. If the predicted probability of default is more than the threshold of 0.487709,
then the person is likely to default.
4. The data has been oversampled to develop a better model.

Boosted Tree (Regression and Classification)

1. Introduction
Gradient boosting is a machine learning technique for regression and classification problems, which
produces a prediction model in the form of an ensemble of weak prediction models,
typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do,
and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

2. Working of Algorithm: -
The algorithm for boosted trees evolved from the application of boosting methods to
regression trees. The general idea is to compute a sequence of (very) simple trees, where
each successive tree is built for the prediction residuals of the preceding tree. As described
in the General Classification and Regression Trees Introductory Overview, this method
builds binary trees, i.e., it partitions the data into two samples at each split node. Now
suppose we limit the complexity of each tree to three nodes: a root node and two child
nodes, i.e., a single split. Thus, at each step of the boosting algorithm, a simple (best)
partitioning of the data is determined, and the deviations of the observed values from the
respective means (the residuals for each partition) are computed. The next three-node tree
is then fitted to those residuals, to find another partition that will further reduce the
residual (error) variance of the data, given the preceding sequence of trees.
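To make the residual-fitting idea concrete, here is an illustrative sketch (not the report's code) that boosts depth-1 rpart trees on a toy regression problem:

library(rpart)

set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- sin(df$x) + rnorm(200, sd = 0.3)

n_trees <- 100
shrink  <- 0.1
pred <- rep(mean(df$y), nrow(df))  # start from the overall mean

for (i in seq_len(n_trees)) {
  df$res <- df$y - pred            # residuals of the current ensemble
  stump  <- rpart(res ~ x, data = df,
                  control = rpart.control(maxdepth = 1))
  pred   <- pred + shrink * predict(stump, df)  # shrunken update
}

mean((df$y - pred)^2)  # training error falls as trees accumulate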

Using caret package: -

Boosted Tree (method = 'blackboost')

For classification and regression, using packages party, mboost and plyr, with tuning parameters:

- Number of Trees (mstop, numeric)
- Max Tree Depth (maxdepth, numeric)
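A sketch of the corresponding caret call; the grid values are illustrative, and the same Default data as before is assumed.

library(caret)

grid_bt <- expand.grid(mstop = c(50, 100, 150),
                       maxdepth = c(1, 2, 3))

set.seed(123)
bt_fit <- train(default ~ ., data = Default,
                method    = "blackboost",
                trControl = trainControl(method = "cv", number = 5,
                                         classProbs = TRUE),
                tuneGrid  = grid_bt)
bt_fit$bestTune  # chosen mstop and maxdepth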

3. Analysis using boosted tree model: -

Description

A simulated data set containing information on ten thousand customers. The aim here is to predict which
customers will default on their credit card debt.

Default: - A factor with levels No and Yes indicating whether the customer defaulted on their debt

Student: - A factor with levels No and Yes indicating whether the customer is a student

Balance: - The average balance that the customer has remaining on their credit card after making their
monthly payment

Income: - Income of the customer

Step 1: - We ran the boosted tree classification model using the caret package.

Output:-

Step 2: - We then predicted the probability of default using the above model and ran the
confusion matrix.

Output :-

Although the accuracy of the model is 97.37%, its sensitivity is very low at 32.13%. This
means that out of the total actual defaulters, the model is able to identify only 32.13%
correctly. Hence the above model cannot be used for prediction.

Since sensitivity is the main criterion for the best model fit, we ran the model again using
specificity as the selection metric.

One more important point is that the data is biased towards non-defaulters, i.e., there are
many more non-defaulters than defaulters in the data.

So, in order to remove this bias, we have to oversample the defaulter class.

Considering these things, we ran the classification model again.

Output :-

This time specificity is used to select the best model.

We then computed the predicted probabilities and ran the confusion matrix again.

Output :-

ROC CURVE:-

Output: -

The ROC curve gives the threshold cutoff, known as Youden's Index, at which the sum of
sensitivity and specificity is maximized. In this case the threshold is 0.4784064.

This time we got an accuracy of 87.99%, but the sensitivity, the metric we care about, is
95.79%. This means that out of the total defaulters, our model is able to predict 95.79%
correctly. So, although not 100% accurate, this is a much better model and can be used for
prediction.

4. Recommendations: -

1. The model identifies 95.79% of defaulters, so its predictions should be treated
with an appropriate margin of error when taking decisions.
2. Sensitivity is a better criterion than accuracy for evaluating this model.
3. If the predicted probability of default is more than the threshold of 0.4784064,
then the person is likely to default.
4. The data has been oversampled to develop a better model.

Flexible Discriminant Analysis (Classification)

1. Introduction
FDA is a flexible extension of LDA that uses non-linear combinations of predictors, such as
splines. FDA is useful for modelling multivariate non-normality or non-linear relationships
among variables within each group, allowing for more accurate classification.

2. Working of Algorithm: -
Linear Discriminant Analysis is a dimensionality reduction technique used as a preprocessing step in
Machine Learning and pattern classification applications. The main goal of dimensionality reduction
techniques is to reduce the dimensions by removing the redundant and dependent features by
transforming the features from higher dimensional space to a space with lower dimensions.

Linear Discriminant Analysis is a simple and effective method for classification. Because it is
simple and so well understood, there are many extensions and variations of the method.
Some popular extensions include:

- Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).

- Flexible Discriminant Analysis (FDA): Non-linear combinations of inputs, such as splines,
are used.

- Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually the covariance), moderating the influence of different variables on LDA.
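For illustration, FDA can also be fitted directly with the mda package; the sketch below uses the iris data, not the report's data.

library(mda)

# FDA with a MARS basis, giving non-linear combinations of the inputs.
fit_fda <- fda(Species ~ ., data = iris, method = mars)

confusion(fit_fda, iris)                       # training confusion matrix
predict(fit_fda, iris[1:5, ], type = "class")  # predicted class memberships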

Using caret package: -


Flexible Discriminant Analysis (method = 'fda')

For classification, using packages earth and mda, with tuning parameters:

- Product Degree (degree, numeric)
- Number of Terms (nprune, numeric)

Note: Unlike other packages used by train, the earth package is fully loaded when this model
is used.
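A sketch of the corresponding caret call, with an illustrative tuning grid over the two parameters listed above, again assuming the same Default data as before:

library(caret)

grid_fda <- expand.grid(degree = 1:2, nprune = c(5, 10, 15))

set.seed(123)
fda_fit <- train(default ~ ., data = Default,
                 method    = "fda",
                 trControl = trainControl(method = "cv", number = 5,
                                          classProbs = TRUE),
                 tuneGrid  = grid_fda)
fda_fit$bestTune  # chosen degree and nprune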

3. Analysis using FDA model: -

Description
The fitted model is an object of class "fda". Use predict to extract discriminant variables,
posterior probabilities or predicted class memberships. Other extractor functions are coef,
confusion and plot.

Step 1: - We ran the FDA classification model using the caret package.

Output:-

Step 2: - We then predicted the probability of default using the above model and ran the
confusion matrix.

Output :-

Although the accuracy of the model is 96.74%, its sensitivity is very low at 51.95%. This
means that out of the total actual defaulters, the model is able to identify only 51.95%
correctly. Hence the above model cannot be used for prediction.

Since sensitivity is the main criterion for the best model fit, we ran the model again using
specificity as the selection metric.

One more important point is that the data is biased towards non-defaulters, i.e., there are
many more non-defaulters than defaulters in the data.

So, in order to remove this bias, we have to oversample the defaulter class.

Considering these things, we ran the classification model again.

Output :-

This time specificity is used to select the best model.

We then computed the predicted probabilities and ran the confusion matrix again.

Output :-

ROC CURVE: -

Output: -

The ROC curve gives the threshold cutoff, known as Youden's Index, at which the sum of
sensitivity and specificity is maximized. In this case the threshold is 0.4298001.

This time we got an accuracy of 87.57%, but the sensitivity, the metric we care about, is
90.09%. This means that out of the total defaulters, our model is able to predict 90.09%
correctly. So, although not 100% accurate, this is a much better model and can be used for
prediction.

4. Recommendations: -
1. The model identifies 90.09% of defaulters, so its predictions should be treated
with an appropriate margin of error when taking decisions.
2. Sensitivity is a better criterion than accuracy for evaluating this model.
3. If the predicted probability of default is more than the threshold of 0.4298001,
then the person is likely to default.
4. The data has been oversampled to develop a better model.

