Capstone Project Taiwan-Customer Defaults Notes1 Indranil Banerjee
Project Notes 1
Introduction
This is the first set of project notes for the Capstone project “Taiwan-Customer defaults”. These notes aim at
understanding the case description and setting up the project objective and agenda by analyzing the
business opportunity. They also cover exploratory analysis of the data through univariate and bivariate
analysis, which will form the base for model building as we progress further in the project.
Case Description
Objective
This project addresses customers’ default payments in Taiwan, where we are expected to
calculate the probability of default for a customer using the customers’ credit data.
The data collected contains customer details and credit history. It also records
whether there was a default in payment in the next month. The agenda of this project is to predict
accurately the occurrence of a payment default and to estimate each customer’s probability of default so that the
necessary risk mitigation can be done.
Business Opportunity
The finance sector in Taiwan generates revenue through credit-based products and services. For a
Taiwanese financial institution, recovering the money lent is crucial to the overall
running of the system, and the interest earned drives operations and profits. If a customer does not pay
the credit and/or EMIs on time, the person is classified as a defaulter, and if the amount is not recovered,
the loan is characterized as a bad loan. Every year financial institutions incur losses because it is
hard to tell a good client from a bad one in advance. Thus, if a creditor can predict a bad loan in advance, or can
predict the risk of non-payment of the next month's EMI, it can take measures
that reduce the risk of loan default. This project serves that purpose by
predicting defaults on the next month's EMI and estimating the likelihood of a default based on the customer's credit
repayment history. This will help the credit system take decisions in line with business requirements to
mitigate risk and increase profitability.
Benefits:
The key focus of the project is not only to predict with accuracy the occurrence of a default but also to give a
business sense of the risk profile of a customer, with percentile-wise separation of high-risk and low-risk
clients. One of the key objectives will be to identify defaulters accurately, rather than to
maximize overall accuracy. This will ensure that the maximum possible number of defaults are classified
as risky customers.
Data Report
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and
his/her family (supplementary) credit.
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to
September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the
repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The
measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month;
2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay
for nine months and above.
- X12 - X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September,
2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April,
2005.
- X18 - X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19
= amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
Data Description
The dataset considered in this analysis is the “Taiwan-Customer defaults” dataset provided under
the Capstone Project. It contains payment data from April 2005 to September 2005 in Taiwan.
There are 30000 observations of 25 variables, where each observation corresponds to a
particular credit card client. Among the 30000 observations, 6636 are customers with a
default payment. The variables of interest are demographic variables such as gender, education level,
marital status and age, and financial variables such as amount of given credit, monthly repayment status,
monthly amount of bill statements and monthly amount of previous payments.
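The class counts quoted above show why overall accuracy is a weak yardstick for this problem. As a rough illustration using only those two counts, a hypothetical baseline that labels every client a non-defaulter would look accurate while catching no defaulters. The variable names are illustrative and not part of the project code.

```r
# Counts quoted in the Data Description section
n_total   <- 30000
n_default <- 6636

# Share of clients who default (the positive class)
default_rate <- n_default / n_total

# Hypothetical baseline: predict "no default" for every client
baseline_accuracy <- (n_total - n_default) / n_total  # every non-defaulter is classified correctly
baseline_recall   <- 0 / n_default                    # no defaulter is caught

round(c(default_rate = default_rate,
        baseline_accuracy = baseline_accuracy,
        baseline_recall = baseline_recall), 4)
# default_rate 0.2212, baseline_accuracy 0.7788, baseline_recall 0
```

A model therefore has to beat roughly 78% accuracy to be better than this trivial baseline, and the share of actual defaulters it identifies is the more informative measure.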
Exploratory Analysis
Data Size
library(readxl)  # provides read_excel()
df <- read_excel("E:/ repository/Taiwan-Customer defaults.xls", skip = 1)
The data has thirty thousand rows and twenty-five columns.
Data Structure
str(df)
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
$ SEX : num 2 2 2 2 1 1 1 2 2 1 ...
$ EDUCATION : num 2 2 2 2 2 1 1 2 3 3 ...
$ MARRIAGE : num 1 2 2 1 1 2 2 2 1 2 ...
$ AGE : num 24 26 34 37 57 37 29 23 28 35 ...
$ PAY_0 : num 2 -1 0 0 -1 0 0 0 0 -2 ...
$ PAY_2 : num 2 2 0 0 0 0 0 -1 0 -2 ...
$ PAY_3 : num -1 0 0 0 -1 0 0 -1 2 -2 ...
$ PAY_4 : num -1 0 0 0 0 0 0 0 0 -2 ...
$ PAY_5 : num -2 0 0 0 0 0 0 0 0 -1 ...
$ PAY_6 : num -2 2 0 0 0 0 0 -1 0 -1 ...
$ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
$ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
$ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
$ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
$ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
$ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
$ PAY_AMT1 : num 0 0 1518 2000 2000 ...
$ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
$ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
$ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
$ PAY_AMT5 : num 0 0 1000 1069 689 ...
$ PAY_AMT6 : num 0 2000 5000 1000 679 ...
$ default payment next month: num 1 1 0 0 0 0 0 0 0 0 ...
We will change the column name of our dependent variable to "payment_default" for the ease of coding
and convert it into a categorical variable.
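A minimal sketch of this step is given below. It runs on a toy data frame, `df_demo`, built from the first ten target values shown in the str() output, so that it is self-contained. The factor labels "No" and "Yes" are an assumption and are not specified in these notes.

```r
# Toy stand-in for df: first ten target values from the str() output
df_demo <- data.frame(
  ID = 1:10,
  `default payment next month` = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  check.names = FALSE  # keep the original column name with spaces
)

# Rename the dependent variable to a name that is easier to use in code
names(df_demo)[names(df_demo) == "default payment next month"] <- "payment_default"

# Convert it to a categorical variable (labels "No"/"Yes" are an assumption)
df_demo$payment_default <- factor(df_demo$payment_default,
                                  levels = c(0, 1),
                                  labels = c("No", "Yes"))

table(df_demo$payment_default)
```

The same two statements can be applied to the full `df` once it has been loaded.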
Descriptive Statistics
Observations:
- PAY_0 to PAY_6 are seen to lie between -2 and 8. However, as per the data attributes the values
should be -1 for pay duly and 1 - 9 for payment delays of 1 month, 2 months, ... 9 months and above.
Could it be that the values should be shifted by +1 (so -2 becomes -1, ..., 8 becomes 9)? That still
would not account for the value 0, though (-1 would become 0 under this transformation).
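One way to make this mismatch explicit is to compare the observed codes with the documented ones. The sketch below uses only the first ten PAY_0 values shown in the str() output, so that it runs on its own. The names `pay_0`, `documented` and `undocumented` are illustrative.

```r
# First ten PAY_0 values as shown in the str() output
pay_0 <- c(2, -1, 0, 0, -1, 0, 0, 0, 0, -2)

# Codes documented in the data attributes: -1 (pay duly) and 1 to 9 (months of delay)
documented <- c(-1, 1:9)

# Observed codes that the attribute description does not cover
undocumented <- sort(setdiff(unique(pay_0), documented))
undocumented
# -2  0
```

On the full data the same check could be applied to each repayment column, for example with `lapply()` over the six PAY columns.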
Univariate Analysis
Let us take a closer look at some of the variables like Sex, Education and Marriage. However, before
proceeding we will encode them as per the attribute information for better understanding and then plot
them. We would also need to convert them into the correct data type for categorical variables.
- EDUCATION: 1: Graduate School, 2: University, 3: High School, 4: "Others"; 5 and 6 have been
clubbed together as "Unknown"
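library(DataExplorer)  # plot_bar() is assumed to come from the DataExplorer package
# Encode EDUCATION; any code other than 1 to 4 falls into "Unknown"
df$EDUCATION <- ifelse(df$EDUCATION == 1, "Graduate School",
                ifelse(df$EDUCATION == 2, "University",
                ifelse(df$EDUCATION == 3, "High School",
                ifelse(df$EDUCATION == 4, "Others", "Unknown"))))
# Encode MARRIAGE; any code other than 1 or 2 falls into "Others"
df$MARRIAGE <- ifelse(df$MARRIAGE == 1, "Married",
               ifelse(df$MARRIAGE == 2, "Single", "Others"))
# Convert the categorical columns to factors (cat_vars avoids shadowing base::names)
cat_vars <- c("SEX", "EDUCATION", "MARRIAGE")
df[cat_vars] <- lapply(df[cat_vars], as.factor)
plot_bar(df[, cat_vars])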
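The text mentions encoding Sex as well, but the code above only converts it to a factor, leaving the levels as 1 and 2. A possible sketch follows, assuming the coding 1 = male and 2 = female. That coding comes from the dataset's published attribute information and is not stated in these notes. `sex_demo` is a toy vector holding the first ten SEX values from the str() output.

```r
# First ten SEX values as shown in the str() output
sex_demo <- c(2, 2, 2, 2, 1, 1, 1, 2, 2, 1)

# Assumption: 1 = Male, 2 = Female (not stated in these notes)
sex_factor <- factor(sex_demo, levels = c(1, 2), labels = c("Male", "Female"))

table(sex_factor)
```

Using `factor()` with explicit `levels` and `labels` does the encoding and the type conversion in one step, which may be easier to read than nested `ifelse()` calls.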
There are monthly variables for the repayment status which need to be coded properly for better
analysis. Further, they need to be converted into categorical variables for modeling. The variables
have been coded as follows:
-2 to 0: "Paid Duly"
1: "1 month delay"
2: "2 month delay"
3: "3 month delay"
4: "4 month delay"
5: "5 month delay"
6: "6 month delay"
7: "7 month delay"
8: "8 month delay"
Anything more than 8 has been put as "9 month or more delay".
But before proceeding we will quickly check the number of counts under each category.
PAY_VAR<-lapply(df[,c("PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6")],
function(x) table(x))
print(PAY_VAR)
$PAY_0
x
-2 -1 0 1 2 3 4 5 6 7 8
2759 5686 14737 3688 2667 322 76 26 11 9 19
$PAY_2
x
-2 -1 0 1 2 3 4 5 6 7 8
3782 6050 15730 28 3927 326 99 25 12 20 1
$PAY_3
x
-2 -1 0 1 2 3 4 5 6 7 8
4085 5938 15764 4 3819 240 76 21 23 27 3
$PAY_4
x
-2 -1 0 1 2 3 4 5 6 7 8
4348 5687 16455 2 3159 180 69 35 5 58 2
$PAY_5
x
-2 -1 0 2 3 4 5 6 7 8
4546 5539 16947 2626 178 84 17 4 58 1
$PAY_6
x
-2 -1 0 2 3 4 5 6 7 8
4895 5740 16286 2766 184 49 13 19 46 2
As we see from the results above, the number of customers who have delayed payments by 5 or more
months is very low, so we would group them under a single category.
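# Convert the repayment status columns to factors (pay_vars avoids shadowing base::names)
pay_vars <- c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
df[pay_vars] <- lapply(df[pay_vars], as.factor)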
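The conversion above keeps the raw numeric codes as factor levels. The recoding and grouping described in the text could be sketched as below. The helper `recode_pay()` and the label "5 or more month delay" are hypothetical and not taken from the project code. The function expects the numeric codes, so it would be applied before any conversion to factor.

```r
# Hypothetical helper: map numeric repayment codes to labelled categories
recode_pay <- function(x) {
  labels <- ifelse(x <= 0, "Paid Duly",
            ifelse(x >= 5, "5 or more month delay",  # sparse categories grouped together
                   paste(x, "month delay")))
  factor(labels,
         levels = c("Paid Duly", paste(1:4, "month delay"), "5 or more month delay"))
}

# Toy input covering each kind of code seen in the frequency tables
recode_pay(c(-2, -1, 0, 1, 2, 5, 8))
```

Fixing the `levels` explicitly keeps the category order stable across the six repayment columns, even where a code such as 1 does not occur in a given month.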
BAR PLOT OF PAYMENT DATA
Bi-Variate Analysis
library(ggplot2)
# EDUCATION vs MARRIAGE
ggplot(data = df, aes(EDUCATION)) + geom_bar() + facet_grid(rows = vars(MARRIAGE)) +
  ggtitle('Education Level vs Marriage')
# EDUCATION vs SEX
ggplot(data = df, aes(EDUCATION)) + geom_bar() + facet_grid(rows = vars(SEX)) +
  ggtitle('Education Level vs Sex')
Education appears to follow a similar distribution across the levels of both Sex and Marriage.
Observation
- Surprisingly, single people are less likely to default
- However, if a default has happened, then the defaulter is almost equally likely to be single or
married
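These two observations read the same cross-tabulation in two directions: the default rate within each marital status, and the marital status mix among defaulters. The sketch below shows the distinction on made-up toy data that is NOT taken from the project dataset. The column names mirror those used in these notes, and the numbers are purely illustrative.

```r
# Made-up toy data, NOT taken from the project dataset
toy <- data.frame(
  MARRIAGE = c(rep("Married", 4), rep("Single", 6)),
  payment_default = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0)
)
tab <- table(toy$MARRIAGE, toy$payment_default)

# Row-wise proportions: P(default status | marital status)
p_default_given_marriage <- prop.table(tab, margin = 1)

# Column-wise proportions: P(marital status | default status)
p_marriage_given_default <- prop.table(tab, margin = 2)

p_default_given_marriage[, "1"]  # default rate within each marital status
p_marriage_given_default[, "1"]  # marital status mix among defaulters
```

In this toy example single clients default less often than married ones (2 of 6 against 2 of 4), yet defaulters are split evenly between the two groups, because there are more single clients overall.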