
Agenda

• How are Credit Decisions Taken in the Financial Industry?
• How is Machine Learning Being Adopted in Taking These Decisions?



Why Credit Risk?
• Net Credit Loss: 25% of Revenues and Twice the Net Income in Q2 2011



Game One – Rank Them in Order
• Provided is the Financial Data on 10 Companies.
• You have to Rank-Order them in terms of the Companies' Future Prospects / Credit Quality
• Time Allotted is 30 Min
• Presentation
• Discussion



Example -Looking for
Key Attributes



What Did We Learn?
1. If we are a team of 4 – we have 5 Rank Orderings!
2. We can develop our own rating methodology based on our own inferences and logic
3. We can be Super-Successful if we Blend our Knowledge with Industry Wisdom



Agenda



PD Models – Statistical
Models in Credit Risk Measurement



How do you build a
Default Model for India?
• Defining Default – Some considerations
– It Should Work
– Easy to Define
– Easy to Apply and Measure
– Lends itself to Future Improvement



Pages in Indian ‘Default’ History
• Sick Industrial Companies Act (SICA, 1985)
– Registered for Five* Years
– Incurred Cash Loss for two consecutive years
– Net worth is negative
• Eligible Companies are Mandatorily referred to the BIFR
(Board for Industrial and Financial Reconstruction)



Pages in Indian
‘Default’ History (Continued)
• Sick Industry Report (1992)*
– Evaluated the Criteria Independently
– Recommended using 2 consecutive years of Cash Loss as the Criterion
– Rationale – Early Intervention gives Sick Companies a High Chance of Survival



Does this Default Definition Work?
• Year 2007: PAT of a company was -60.7 crores
• Year 2008: PAT of a company was -97.4 crores
• Shall we define this company as a Defaulter / Sick Company?
• The company's Net worth, though, was positive in these two years
• Name of this company: TATA Advanced Material Ltd
• TATA Advanced Material Ltd sprang back to action with positive profit in the later years



Proposed Definition
• Companies which have Negative Net worth for the First time in the Time window 1991–97
– Midway between the Two Definitions
– Is Measurable and can be Improved



PD Models – Linear Models
• Altman's Z Score Model (1968)
• Z (Public) = 1.2X(1) + 1.4X(2) + 3.3X(3) + 0.6X(4) + 1.0X(5)
– Where Z > 2.99 is healthy
– Z < 1.81 is unhealthy
– 1.81 < Z < 2.99 is indeterminate
• X(1) = Working Capital / Total Assets
• X(2) = Retained Earnings / Total Assets
• X(3) = Earnings Before Interest and Taxes (EBIT) / Total Assets
• X(4) = Market Value of Equity / Book Value of Total Liabilities
• X(5) = Net Sales / Total Assets
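A minimal sketch of this score as an R function, consistent with the R code used later in this deck (the example ratio values are hypothetical):

altman_z <- function(x1, x2, x3, x4, x5) {
  1.2*x1 + 1.4*x2 + 3.3*x3 + 0.6*x4 + 1.0*x5   # Altman (1968) weights
}
z <- altman_z(0.20, 0.10, 0.15, 0.80, 1.10)     # hypothetical company
if (z > 2.99) "healthy" else if (z < 1.81) "unhealthy" else "indeterminate"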



Steps to Build a Good Function
• Selection of the ‘Default’ Sample
• Creation of the Non-Default Sample
– Choice of Industry, Firm Size & Time Period
• Appropriate Treatment of the Data
– Choose the predictors carefully
• Choice of a Good Out-of-Sample Validation Dataset – Stress Test the Data



Deep Dive on the Variables
•Clear Separation in mean values of variables between
defaulters and non-defaulters



Discriminant Function
• India Z Score Model (2000)
• Z (Public) = 1.06 + 0.01 PBITINT – 3.12 TDTA + 0.48 QR + 4.61 NCATA
• PBITINT = Profits Before Interest & Taxes / Total Interest
• TDTA = Total Borrowings / Total Assets
• QR = (Current Assets – Inventories) / (Current Liabilities & Provisions)
• NCATA = (Profit After Tax + Depreciation) / Total Assets



Does it Work?



Challenges of Discriminant Models
• Challenges
– Zone of Indifference (Indeterminate)
– Sensitivity to Industry
– Dealing with New Companies
– Loss of Predictive Power across Time – use of Penalty Functions
– Multi-collinearity of Variables
• Some variables provide the same information and are highly correlated
– Assumption of Multivariate Normality of Variables



What if we use a
Logistic Framework?
• We can also Build a Logistic Model using 2010-11 Data
• Take all companies with negative Net worth in 2011
• Take the same number of companies with similar asset size and the same industry with positive Net worth in 2011
• QC the data carefully, removing all outliers
• Compute all the predictors as of 2010
• Run the logistic regression and get the equation



Logistic Model – The New Equation
• Logit(score) = -6.9965
  + 4.8879 × Total Borrowings / Total Assets
  - 0.455 × Net Working Capital / Current Liabilities
  - 6.8605 × Net Cash Accruals / Total Assets
  + 1.6317 × Current Liabilities / Current Assets
  + 0.1978 × Total Borrowings / Total Liabilities
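Since this is a logit, the implied probability of default comes from the inverse-logit transform. A minimal sketch in R (the ratio values plugged in are hypothetical):

logit <- -6.9965 + 4.8879*0.45 - 0.455*0.30 - 6.8605*0.05 + 1.6317*0.80 + 0.1978*0.50
pd <- 1 / (1 + exp(-logit))   # probability of default in (0, 1)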



Logistic Models –
Use in Credit Rating
•If we run the equation on N companies, we
can rank-order the companies based on the
score
•Create ratings based on the cutoffs chosen



How is Rating
Determined in CRAs?



Developing a CRA's
Rating Algorithm
• Developing Ordered-Logit Regressions with Market Information (incorporated market information into the logic)



Recap – What we Studied
• Discriminant Models, though Accurate, are ‘High Maintenance’
• Logit / Probit Regressions display greater stability than Discriminant Models
• The Strength of the Tool is a Function of the Quality of Data used for the Analysis
• Credit Rating tries to resolve Information Asymmetry in assigning the Right Price of Debt
• Either Tool can be used to develop an Independent Rating Framework



Total Recall – Market Risk
• Developed Two Methods of VaR: the Parametric and the Distribution Method
• Single Stock
• Portfolio
• Developed the Mean-Variance Portfolio Approach of Weight Optimization



Industrial Wisdom
• If you want to do something New – Know the Past – Spend 30% of the Allotted time
• Spend 40% of the Time getting the ‘Right’ Data
• Building the Analytical Solution is 10% of the Time
• Stress Test the Solution for the Remaining 20%
• Document the Key Learnings and Opportunities for Future Enhancements



Total Recall – Credit Risk
• Challenge to Define Default in the Context of India
• Choosing Logistic over Discriminant Models to predict Default
• Need to Find ‘independent’ attributes / features to predict Default
• Understanding and Cleaning of Data is Essential
• Dividing the Sample into Development, Validation and Out-of-Sample Validation is a Must



Total Recall – Credit Risk (Cont.)
•Machine Learning to be used on the
same Sample to Develop Alternate type
of Models.
•Having a good understanding of the
Problem to be solved and the underlying
data is of Paramount Importance.



Game 2: Building a
Smart Default Model
• Define Default
• Do a data quality check and remove outliers
• Divide the data into development and validation datasets
• Identify and define variables that have a high correlation with default
• Develop the Logistic function
• Develop the Classification table
• Test performance on out-of-sample data
• Rank-Order your 10 Companies – what do you Find?



Define Default
• Step 1: Define Default in your dataset
– All companies with negative Net worth should be defined as defaults
– If Net worth <= 0, then default = 1, else default = 0 (Add a new column in your dataset called Default which takes a value of 0/1; a sketch follows below)
– What is the default rate in your dataset?
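A minimal sketch in R, assuming the data frame is called data and has a Networth column:

data$Default <- ifelse(data$Networth <= 0, 1, 0)   # 1 = default, 0 = non-default
mean(data$Default)                                 # default rate of the dataset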



Quality Check
• Step 2 – Quality Check of the data
– Have a look at the data and remove the outliers
• These outliers will distort your equation if not removed
– After you are satisfied with the data, randomly divide it into a development dataset (70%) and a validation dataset (30%)
• Check the default rates of the development and validation datasets
– Since the split is random, the default rates should be similar
– You will now work on the development dataset to develop the model equation



Identification of Predictors
• Step 3 – Identify the predictors
– Look at the variables which give a good discrimination between the defaulters and non-defaulters in the dataset
• E.g. if you want to use variable X in your model, you should look at the mean value of variable X for non-defaulters, and the mean value for defaulters. If there is a good discrimination between the two mean values, then variable X should be used in your model (see the sketch below)
– Ratio variables MIGHT be more appropriate for the model equation!
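A sketch of this mean-separation check in R (the data frame and the column names X and Default are assumptions):

tapply(data$X, data$Default, mean, na.rm = TRUE)   # mean of X: non-defaulters vs defaulters
t.test(X ~ Default, data = data)                   # is the separation statistically significant?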



Workshop Recap
• Defined Default
• Divided the data into development and validation datasets
• Did a data quality check and removed outliers
• Identified and defined variables that had a high correlation with default



Develop Logistic Equation
• Step 4: Develop the Logistic Equation
– Transfer the data to the SPSS sheet
– Look for the Logistic Regression field in the SPSS sheet (it comes under the Analyze tab)
– The Dependent variable is the 0/1 Default Column
– The Independent variables are the predictors that you have chosen
– Remove the variables that have a high significance level (p-value)
– Fine-tune your final model by looking at the signs of the predictors
• The predictors' signs should make intuitive sense, e.g. a variable like CL/CA should have a positive sign
– Look at the efficiency rate of the model prediction



Out-of-Sample Validation
• Step 5: Your model is now ready
– Take the validation dataset
– Score each row with your equation – get the score for every company
– Count the number of defaulters in your dataset
• Let's assume that there are 30 defaulters in your dataset
• Based on your score, take the top 30 companies
• Look at the default rate of these top 30 companies – this gives you the efficiency of your model (see the sketch below)
• The higher the efficiency, the better the model
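A sketch of this top-N efficiency check in R, assuming a validation data frame valid with score and Default columns:

n_def <- sum(valid$Default)                        # e.g. 30 defaulters
top   <- valid[order(-valid$score), ][1:n_def, ]   # top-scored companies
mean(top$Default)                                  # efficiency of the model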



First Set of Lessons
1. Understanding Data is the First Step to a Successful Analytical Exercise
2. One Can Build a Great Model for Credit Rating using Any Technique provided you know its Limitations
3. We Can be Super-Successful if we Blend our Knowledge with Industry Wisdom



Machine Learning in Credit Risk



Learning from 2
MM Models – Kaggle
Great Lakes
•Data Type
–Structured Data
–Unstructured Data
•Step 1: Understand the Data Generation
Process
–Explore the Data



Learning from 2
MM Models – Kaggle
• Step 2: Feature Engineering
– Structured Data
– Rank Plot / Hypothesis Testing
– Synthetic Variables
• Feature Engineering
– Not Relevant for Unstructured Data



Learning from 2
MM Models – Kaggle
• Step 3: Structured Data – Fitting the Right Algorithm
– Random Forest
– Support Vector Machine
– Gradient Boosting Machine
• Unstructured Data – Deep Learning
– CNN or RNN (image vs sequence data)



Learning from 2
MM Models – Kaggle
• Caution
– Overfitting
– Use Cross-Validation to Test Model Performance
– Poor Performance in Out-of-Time Sample
• Participate in Kaggle
– Get the Real Experience
– Cloud-Based Kernels of Kaggle
LOGISTIC REGRESSION



Logistic Regression
Logistic Regression builds a non-linear equation to predict a dichotomous variable. In fact, despite its name, what it does is classification rather than regression!



Why not Linear?
• The Y variable is a binary variable – 1 or 0
•The relationship between the
dependent and independent
variables is non-linear
•The usual linear regression
generates values outside [0,1]
•A linear fit to a binary variable
becomes very sensitive to
extreme values
•Other statistical Complications!



Logistic function – a better fit!
So, we need a function that stays within the bounds of 0 & 1
and represents the data in a much better manner
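The standard logistic (sigmoid) function does exactly this, squashing any real-valued score into the interval (0, 1):

P(Y = 1 | X) = 1 / (1 + e^-(b0 + b1·X))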



How does logistic learn
the Coefficients – An Example
Can you predict whether a person will buy a house
with the given information?



How does logistic learn
the Coefficients – An Example



How does logistic learn
the Coefficients – An Example

Cost function:
When Y = 1: Cost = -log(Prediction)
When Y = 0: Cost = -log(1 - Prediction)



How does logistic learn
the Coefficients – An Example
Step 3: Adjust the coefficients and the predictions in an iterative fashion to move towards the global Cost minimum (a minimal sketch follows below)
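A minimal sketch of this iterative update (gradient descent on the log-loss cost) in R; the toy data and learning rate are assumptions:

sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(0.5, 1.2, 2.3, 3.1); y <- c(0, 0, 1, 1)   # toy data
b0 <- 0; b1 <- 0; lr <- 0.1                      # starting coefficients, learning rate
for (i in 1:1000) {
  p  <- sigmoid(b0 + b1 * x)                     # current predictions
  b0 <- b0 - lr * mean(p - y)                    # gradient of cost w.r.t. intercept
  b1 <- b1 - lr * mean((p - y) * x)              # gradient of cost w.r.t. slope
}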



RANDOM FOREST



Why is it called a Forest?
• Predictive model based on a branching series of Boolean tests
• Boolean tests are less complex than one-stage classifiers
• A “Forest”, or an ensemble of trees, is required to address over-fitting



But why Random?
• Bootstrap aggregating (or bagging) – the process through which samples are selected to build the trees; random samples are repetitively drawn (with or without replacement) from the training set (see the sketch below)
• Random feature selection – the split variable at every node of a tree is randomly selected from the full list of features
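A sketch of what one bagging draw looks like in R (the training data frame is an assumption):

n   <- nrow(training)
idx <- sample(n, size = n, replace = TRUE)   # bootstrap sample of row indices
bag <- training[idx, ]                       # sample used to grow one tree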



Let’s take an example



Building Decision Trees
Subsequent trees are generated in a similar manner on different samples



To summarize the model
building process…



How best to use Random Forest?
Parameter Tuning: Larger number of trees
Impact:
+ Less chance of over-fitting
– More complex solution
– Higher runtime



How best to use Random Forest?
Parameter Tuning: More randomly selected variables
Impact:
+ Significant variables show up
– Repetitive trees – all variables in the data not evaluated



How best to use Random Forest?
Parameter Tuning: Higher sampling ratio
Impact:
+ Enough data points to build trees
– Not enough data points to test the stability of trees



How best to use Random Forest?
Parameter Tuning: Sampling without replacement
Impact:
+ Trees covering different dimensions
– Limit on the maximum number of trees
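A hedged sketch of checking these trade-offs empirically with the randomForest package, via the out-of-bag (OOB) error (the formula and data frame are assumptions; DEF must be a factor for classification):

library(randomForest)
rf <- randomForest(DEF ~ ., data = training, ntree = 200, mtry = 3)
plot(rf)                        # OOB error as the number of trees grows
rf$err.rate[rf$ntree, "OOB"]    # final OOB error estimate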



GBM



Gradient Boosting!
Gradient boosting produces a strong prediction model by
ensembling many weak prediction models, typically decision
trees, built in a stage-wise fashion



RF Vs GBM



GBM Introduction
• GBM builds decision trees in a stage-wise manner
• The first set of predictions is initialized to a constant value and the first tree is built on the residual error from this constant value
• Successive trees use the residuals from the previous trees to reduce the prediction error
• The GBM score is a linear combination of the individual tree predictions



Let's Look at an Example
Can you predict the value of the home of any person with the given information?



Let's Look at an Example



Let's Look at an Example

Learning Rate: 10%



Let's Look at an Example
Step 6, 7, 8…: Repeat the process with the next trees until errors are minimized!

The number of trees you build, the depth of each tree and the learning rate will all decide how good a model you make! (A minimal sketch of this loop follows below.)
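A minimal sketch of this loop for a regression GBM in R, using rpart trees fitted on residuals (the data frame and column names y, x1, x2 are assumptions):

library(rpart)
pred <- rep(mean(training$y), nrow(training))     # initialize to a constant
lr   <- 0.10                                      # learning rate of 10%
for (i in 1:5) {                                  # n.trees = 5
  res  <- training$y - pred                       # residuals from current prediction
  tree <- rpart(res ~ x1 + x2, data = training)   # fit a tree to the residuals
  pred <- pred + lr * predict(tree, training)     # damped, stage-wise update
}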
Overview of GBM –
Regression and Classification



How best to use GBM?



Keep in Mind



k-NN algorithm



Finding Lookalikes
How do you find people similar to Sushil Kumar among a group of sportspersons?



Concept of Distance and Similarity
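The distance measures referenced later in this deck are the standard ones; for two points a and b in d dimensions:

Euclidean: d(a, b) = sqrt( Σ (a_i - b_i)² )
Manhattan: d(a, b) = Σ |a_i - b_i|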



k-NN
(k-Nearest Neighbors) Algorithm
An algorithm to find the k most similar people, i.e., the k nearest neighbors



Example using kNN
Can you predict whether Maitree has a car?



k-NN Algorithm:
Mathematical Formulation



k-NN Algorithm:
Mathematical Formulation



k-NN Algorithm:
Mathematical Formulation



k-NN Algorithm:
Mathematical Formulation



Parameters for kNN models

Distance Metric
– Should satisfy the triangle inequality. For example:
• Euclidean distance
• Chebyshev's distance
• Manhattan distance
• Mahalanobis distance

Dimensions
– Should be independent and identically distributed (IID). For example:
• Age
• Income

Value of k
– Typically selected through cross-validation.

Scoring Function
– Labels of the nearest neighbors weighed differently by:
• Distance to the point
• Rank of the neighbor



Steps to Build a ML Model – 1
•Step 1: Divide the Data in Development and
Validation
•Step 2: Appropriately Floor and Cap the
Variables
•Step 3: Execute the Different Models
•Step 4: Save the Results



Steps to Build a ML Model – 2
•Step 5: Score the Validation Data Sets
•Step 6: Keep the Relevant Variables
•Step 7: Compute the GINI of the Out of
Sample
•Step 8: Save the Data Sets with Predicted
Variables and MERGE Key
•Step 9: Compare the Results



Step 1a – Read the Data
• Read the Data in R
data <- read.csv("dev-data-1.csv")
• Check the number of observations
nrow(data)



Step 1b – Split the Data
into Development and Validation
• Do a 75-25 Split of the Data – Training (development) and Test (Validation)
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # reproducible split
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index) * 0.75))
  trainset <- dataframe[trainindex, ]
  testset  <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
splits <- splitdf(data, seed = nrow(data))
training_lg <- splits$trainset
testing_lg  <- splits$testset



Step 2 – Floor and Cap
the Variables
• Missing values have been Floored to “1”
training_lg$TA[is.na(training_lg$TA)] <- 1
training_lg$TI[is.na(training_lg$TI)] <- 1
training_lg$TE[is.na(training_lg$TE)] <- 1
training_lg$PAT[is.na(training_lg$PAT)] <- 1
• Alternate Flooring / Capping can also be used…



Step 2a – Alternate
Capping and Flooring of Variables
• Replacing Missing values with Means
training$TLIAB[is.na(training$TLIAB)] <- round(mean(training$TLIAB, na.rm = TRUE))
• Replacing with Percentile Values:
training$INVST1 <- ifelse(training$INVST <= 10, 10, training$INVST)                 # floor at the 1st percentile (10)
training$INVST1 <- ifelse(training$INVST1 >= 1222.485, 1222.485, training$INVST1)   # cap at the 99th percentile (1222.485)



Step 3 & 4 –Execute the
Different Models -Logistic
•Need to Load the GLM library –happens
automatically
library(glm2)
•Run the Relevant Equation
model<-glm( DEF ~
TA+TI+TE+PAT,data=training_lg,family="binomial")
summary(model)
•Play Around till you get the ‘Best’ Equation
Tips: P1 <-fitted(model)



Step 5 – Validate the
Model on the Validation Data
• Use the Model Equation to Come up with the Predicted Scores on the Validation Sample
predicted <- predict(model, newdata = testing_lg, type = "response")
• Transform the ‘predicted’ temp variable to a Variable in the File
d <- transform(predicted)
• Save the Temp d variable to the predlg variable in the testing_lg File
testing_lg$predlg <- d$X_data
• Check the results – head(testing_lg)



Step 6: Keep the
Relevant Variables
• Only Keep the Relevant Variables – 3 of them
testing1_lg <- subset(testing_lg, select = c(predlg, DEF, Num))
# predlg = predicted probability, DEF = default indicator, Num = merge key



Step 7 – Compute the GINI
• Load a new Library – library(Hmisc)
• Relevant Commands:
rcorr.cens(oot_lg$predlg, oot_lg$DEF)
rcorr.cens(testing_lg$predlg, testing_lg$DEF)
• Dxy = GINI
• C Index = concordance
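• For a binary default flag these two outputs are linked: Dxy = 2 × (C Index − 0.5), i.e. GINI = 2 × AUC − 1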



Step 8 –Save the Results in a File
•Save the Results in a CSV File
write.csv(oot1_lg,"oot_logit_pred.csv")
•Check the Data in EXCEL



Step 3 & 4: Algorithm –
Random Forest
•Load the Following Library: Random Forest
Library (randomForest)
•Run the Relevant Command
training$DEF<-factor(training$DEF)
library(randomForest)
set.seed(71)
RF<-randomForest(DEF ~ TA + TI + PAT + PBDITA + PBT + CPFT
+ PBTI + PATI + Sales + QR + CR + DE -Num, data= training ,
ntree= 50, mtry= 3, importance = TRUE, na.action= na.omit,
keep.forest= TRUE, do.trace= 10)



Step 3 & 4: Random
Forest –Model Diagnostics
•Result Summary –summary(RF)
•Variable Importance –importance (RF)
•Another Way of Variable Importance –
round(importance (RF), 2)
•Classification Metrics –print(RF)
•Printing one of the trees –getTree(RF,1) –
1stTree



Step 5 & Rest: Score
Validation Data -RF
•Scoring the Validation Dataset
predicted <-predict(RF, testing, type = "prob")
•Only Keep the ProbCorresponding to the
Second Column
prob_rf<-predicted[,2]
•Save it as variables
g <-transform(prob_rf)
testing$pred_rf<-g$X_data
Step 5 & Rest:
Random Forest
•Scoring the Validation Dataset
predicted <-predict(RF, testing, type = "prob")
•Only Keep the ProbCorresponding to the Second
Column
prob_rf<-predicted[,2]
•Save it as variables
g <-transform(prob_rf)
testing$pred_rf<-g$X_data
head(testing)

Jun 10, 2020 95


Step 5 & Rest:
Score Validation Data -RF
•Subset the Data for the Relevant Variables
testing1_rf <-subset(testing, select =
c(pred_rf,DEF,Num))
•Compute the GINI
library(Hmisc)
rcorr.cens(testing1_rf$pred,testing1_rf$DEF)



Step 5 & Rest:
Score Validation Data -RF
•Merge the Relevant Files
Scored_testing1 <-
merge(testing1_lg,testing1_rf ,by="Num")
•Save this as a Permanent Data
write.csv(Scored_testing1,"testing_lg_rf_pred.
csv")
•Repeat this Action as you Append Other
Probabilities using Different Algorithms



Step 3 & 4: Algorithm –
Gradient Boosting Machine
•Load the Following Library: GBM
Library (gbm)
•Run the Relevant Command
library(gbm)
gbm_model<-gbm(DEF ~
TA+TI+TE+PAT+PBDITA+CPFT+PBDITAI+Sales+SHF
+NWC+QR+CR+DE+EPS+TLIAB, training,
distribution = "bernoulli", n.trees= 5,shrinkage=
0.1, interaction.depth= 3)



Step 3 & 4: Random
Forest –Model Diagnostics
•Variable Importance
summary(gbm_model,
cBars=length(gbm_model$var.names),
n.trees=gbm_model$n.trees,
plotit=TRUE,
order=TRUE,
method=relative.influence,
normalize=TRUE)
Step 5 & Rest:
Score Validation Data -GBM
•Scoring the Validation Dataset
predict_gbm<-predict(gbm_model, testing,
n.trees=5, type="response")
summary(predict_gbm)
•Save it as variables
b <-transform(predict_gbm)
testing$predgbm<-b$X_data
head(testing)
Step 5 & Rest:
Score Validation Data -GBM
•Subset the Data for the Relevant Variables
testing1_gbm <-subset(testing, select =
c(predgbm,DEF,Num))
•Compute the GINI
library(Hmisc)
rcorr.cens(testing1_gbm$predgbm,testing1_g
bm$DEF)



Step 5 & Rest:
Score Validation Data -GBM
•Merge the Relevant Files
Scored_testing1 <-merge(testing1_lg,testing1_rf
,testing1_gbm by="Num")
•Save this as a Permanent Data
write.csv(Scored_testing1,"testing_lg_rf_gbm_pr
ed.csv")
•Repeat this Action as you Append Other
Probabilities using Different Algorithms



Step 2: Algorithm – KNN
• Load the Following Library: kknn
library(kknn)
• Standardize the Data (min-max scaling)
training$nTA <- (training$TA - min(training$TA)) / (max(training$TA) - min(training$TA))
training$nTI <- (training$TI - min(training$TI)) / (max(training$TI) - min(training$TI))
• Do the Same for the Validation Data as well (a sketch follows below)
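A sketch of that step, using the training min/max on the validation data to avoid information leakage (variable names follow the slides above):

testing$nTA <- (testing$TA - min(training$TA)) / (max(training$TA) - min(training$TA))
testing$nTI <- (testing$TI - min(training$TI)) / (max(training$TI) - min(training$TI))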



Step 3 & 4: Algorithm –KNN
•Run the Relevant Algorithm
library(kknn)
knn< -kknn(as.factor(DEF)~ nTA+ nTI+ nPAT+
nPBDITA+ nPBT+ nCPFT+ nPBTI+ nPATI+ nSales+
nQR+ nCR+ nDE,training, testing, k = 7, distance = 2)
K = number of points in the Neighbourhood
Distance = 2 (Minowski’sDistance = (|distance|)**(2)
Use Both the Training and Test in the same code



Step 5: Model Results &
Diagnostics: Algorithm – KNN
• Check the Results
summary(knn)
fit <- fitted(knn)
plot(fit)
• Save the Results as a Variable
b <- transform(fit)
testing$pred_knn <- b$X_data
table(testing$DEF, testing$pred_knn)   # confusion matrix
Step 6 and Beyond:
Algorithm – KNN
• Save the Relevant Variables and Merge with the Original Data
testing1_knn <- subset(testing, select = c(pred_knn, DEF, Num))
• Create the Final Dataset (merging two files at a time)
Scored_testing1 <- Reduce(function(x, y) merge(x, y, by = "Num"),
                          list(testing1_lg, testing1_gbm, testing1_rf, testing1_knn))
• Repeat this for other Algorithms



References
• Altman (1968), Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy, Journal of Finance, 23, No. 4, 589-609
• Altman (1993), Corporate Financial Distress and Bankruptcy: A Complete Guide to Predicting & Avoiding Distress and Profiting from Bankruptcy, John Wiley, Second Edition
• Anant T C, Gangopadhyay S and Goswami O (1992), Industrial Sickness in India: Characteristics, Determinants and History, 1970-90, Report 2, Government of India, Ministry of Industry, Office of Economic Advisors
• Raghunathan V and J Verma (1992), Crisil Ratings: When does AAA mean B?, Vikalpa, Vol 17, No 2, 35-42
• Emerging Market Score Model: http://pages.stern.nyu.edu/~ealtman/emerging_markets_review.pdf

