Capstone Project Taiwan-Customer Defaults Notes1 Indranil Banerjee


Capstone Project Taiwan-Customer defaults

Project Notes 1

Introduction

This is the first set of project notes for the Capstone project “Taiwan-Customer defaults”. These notes aim
to understand the case description and to set the project objective and agenda by analyzing the
business opportunity. They also cover exploratory analysis of the data through univariate and bivariate
analysis, which will form the basis of model building as the project progresses.

Case Description

Objective

This project addresses customers’ default payments in Taiwan. We are expected to calculate the
probability of default for a customer using the customers’ credit data. The data contains customer
details and credit history, and also records whether there was a default in payment in the next month.
The agenda of this project is to predict accurately the occurrence of a payment default and to estimate
each customer’s probability of default, so that the necessary risk mitigation can be done.

Business Opportunity

The finance sector in Taiwan generates revenue through credit-based products and services. For a
Taiwanese financial institution, recovering the money lent is crucial to the overall running of the
system, and the interest earned drives operations and profits. If a customer does not pay the credit
and/or EMIs on time, the person is classified as a defaulter, and if the amount is not recovered, the
loan is characterized as a bad loan. Every year financial institutions incur losses because it is
uncertain whether a client will turn out to be good or bad. Thus, if a creditor can predict a bad loan in
advance, or can predict the risk of non-payment of the next month's EMI, it can take measures to
reduce the risk of loan default. This project serves that purpose: predicting defaults on the next
month's EMI and estimating the likelihood of a default on the basis of the customer's credit repayment
history. This will help the credit system take decisions as per the business requirement to mitigate
risk and increase profitability.

Benefits:

The key focus of the project is not only to predict with accuracy the occurrence of a default but also to
build a business view of a customer's risk profile, with percentile-wise separation of high-risk and
low-risk clients. One of the key objectives in this case is to predict defaulters accurately, as opposed
to maximizing overall accuracy. This will ensure that the maximum possible number of defaults are
classified as risky customers.
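
The difference between overall accuracy and the detection rate for defaulters can be illustrated with a small sketch. The labels below are made-up values for illustration only, not results from this dataset, and the names `actual` and `predicted` are hypothetical.

```r
# Hypothetical labels (1 = default, 0 = no default); illustrative only.
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Overall accuracy: share of all customers classified correctly.
accuracy <- mean(actual == predicted)

# Recall (sensitivity) for defaulters: share of true defaulters that were flagged.
recall <- sum(actual == 1 & predicted == 1) / sum(actual == 1)

print(c(accuracy = accuracy, recall = recall))
```

In this toy case the accuracy is high even though two of the three defaulters are missed, which is why the detection rate for defaulters is tracked separately.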

Data Report

There are 25 variables:


Attribute Information:

- Default Payment: (Yes = 1, No = 0)

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and
his/her family (supplementary) credit.

- X2: Gender (1 = male; 2 = female).

- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

- X4: Marital status (1 = married; 2 = single; 3 = others).

- X5: Age (year).

- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to
September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the
repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The
measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month;
2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay
for nine months and above.

- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September,
2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April,
2005.

- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19
= amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

Data Description

The dataset considered in this analysis is the “Taiwan-Customer defaults” dataset provided under the
Capstone Project. It contains payment data from April 2005 to September 2005 in Taiwan. It has
30000 observations of 25 variables, where each observation corresponds to a particular credit card
client. Among the 30000 observations, 6636 are customers with a default payment. The variables of
interest are demographic variables such as gender, education level, marital status and age, and
financial variables such as amount of given credit, monthly repayment status, monthly amount of bill
statements and monthly amount of previous payments.
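
The class balance implied by these counts can be worked out directly. The sketch below uses only the two figures quoted above; the variable names are ours.

```r
n_total   <- 30000  # observations reported above
n_default <- 6636   # customers with a default payment

# Share of defaulters in the data.
default_rate <- n_default / n_total

# Accuracy of a naive rule that always predicts "no default".
baseline_accuracy <- 1 - default_rate

print(round(100 * c(default_rate = default_rate,
                    baseline_accuracy = baseline_accuracy), 2))
```

Roughly 22% of clients default, so a rule that never predicts a default would already be right about 78% of the time. This is worth keeping in mind when judging overall accuracy.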

Exploratory Analysis

Data Size
library(readxl)
df <- read_excel("E:/repository/Taiwan-Customer defaults.xls", skip = 1)
The data has 30000 rows and 25 columns.

Data Structure

str(df)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30000 obs. of 25 variables:

$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000 500000 100000
140000 20000 ...
$ SEX : num 2 2 2 2 1 1 1 2 2 1 ...
$ EDUCATION : num 2 2 2 2 2 1 1 2 3 3 ...
$ MARRIAGE : num 1 2 2 1 1 2 2 2 1 2 ...
$ AGE : num 24 26 34 37 57 37 29 23 28 35 ...
$ PAY_0 : num 2 -1 0 0 -1 0 0 0 0 -2 ...
$ PAY_2 : num 2 2 0 0 0 0 0 -1 0 -2 ...
$ PAY_3 : num -1 0 0 0 -1 0 0 -1 2 -2 ...
$ PAY_4 : num -1 0 0 0 0 0 0 0 0 -2 ...
$ PAY_5 : num -2 0 0 0 0 0 0 0 0 -1 ...
$ PAY_6 : num -2 2 0 0 0 0 0 -1 0 -1 ...
$ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
$ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
$ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
$ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
$ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
$ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
$ PAY_AMT1 : num 0 0 1518 2000 2000 ...
$ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
$ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
$ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
$ PAY_AMT5 : num 0 0 1000 1069 689 ...
$ PAY_AMT6 : num 0 2000 5000 1000 679 ...
$ default payment next month: num 1 1 0 0 0 0 0 0 0 0 ...

We will rename the dependent variable to "payment_default" for ease of coding and convert it into a
categorical variable.

library(data.table)  # provides setnames()
setnames(df, "default payment next month", "payment_default")
df$payment_default <- as.factor(df$payment_default)

Descriptive Statistics

summary(df)

ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Min. :1.000 Min. :0.000
1st Qu.: 7501 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000
Median :15000 Median : 140000 Median :2.000 Median :2.000
Mean :15000 Mean : 167484 Mean :1.604 Mean :1.853
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :30000 Max. :1000000 Max. :2.000 Max. :6.000
MARRIAGE AGE PAY_0 PAY_2
Min. :0.000 Min. :21.00 Min. :-2.0000 Min. :-2.0000
1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :2.000 Median :34.00 Median : 0.0000 Median : 0.0000
Mean :1.552 Mean :35.49 Mean :-0.0167 Mean :-0.1338
3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :3.000 Max. :79.00 Max. : 8.0000 Max. : 8.0000
PAY_3 PAY_4 PAY_5 PAY_6
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1662 Mean :-0.2207 Mean :-0.2662 Mean :-0.2911
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20089 Median : 19052
Mean : 51223 Mean : 49179 Mean : 47013 Mean : 43263
3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506
Max. : 964511 Max. :983931 Max. :1664089 Max. : 891586
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18105 Median : 17071 Median : 2100 Median : 2009
Mean : 40311 Mean : 38872 Mean : 5664 Mean : 5921
3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000
Max. :927171 Max. : 961664 Max. :873552 Max. :1684259
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
Mean : 5226 Mean : 4826 Mean : 4799.4 Mean : 5215.5
3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
Max. :896040 Max. :621000 Max. :426529.0 Max. :528666.0
payment_default
0:23364
1: 6636

 Observations:

Some discrepancies are observed:

- EDUCATION is supposed to be an integer between 1 and 6 (both inclusive), although the attribute
information above documents only 1 to 4. However, we note that 0 is present as well. These should
be changed to value 6.

- MARRIAGE is supposed to be an integer between 1 and 3 (both inclusive). However, 0 is present
as well. These should be changed to value 3.

- PAY_0 to PAY_6 are seen to range from -2 to 8. However, as per the data attributes, the values
should be -1 for pay duly and 1 to 9 for a payment delay of 1 month, 2 months, ..., 9 months and above.
Could it be that the values should be shifted by +1 (so -2 becomes -1, ..., 8 becomes 9)? That still
does not account for the value 0 (-1 would become 0 under this transformation).

- Continuous features (BILL_AMT1-BILL_AMT6, PAY_AMT1-PAY_AMT6): there is a huge spread
in each of the features. Also, BILL_AMT1-BILL_AMT6 contain negative values (possibly
corresponding to people who overpay on their credit card bills).
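
A check of this kind can be sketched as follows. The data frame `toy` is a small made-up stand-in for the real data, and the documented codes are taken from the Attribute Information section; both names are hypothetical.

```r
# Toy stand-in for the real data; the values are illustrative only.
toy <- data.frame(
  EDUCATION = c(1, 2, 3, 4, 5, 6, 0, 2),
  MARRIAGE  = c(1, 2, 3, 0, 1, 2, 2, 1)
)

# Codes documented in the attribute information.
documented <- list(EDUCATION = 1:4, MARRIAGE = 1:3)

# For each column, list the codes that occur in the data but are not documented.
undocumented <- lapply(names(documented), function(v) {
  sort(setdiff(unique(toy[[v]]), documented[[v]]))
})
names(undocumented) <- names(documented)

print(undocumented)
```

On the real data, replacing `toy` with `df` (before any recoding) would list the unexpected codes in each column.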

Univariate Analysis

Let us take a closer look at some of the variables, namely Sex, Education and Marriage. Before
proceeding, we will encode them as per the attribute information for better understanding and then plot
them. We also need to convert them into the correct data type for categorical variables.

- SEX: 1: Male and 2: Female

- EDUCATION: 1: Graduate School, 2: University, 3: High School, 4: "Others"; 0, 5 and 6 have been
clubbed together as "Unknown"

- MARRIAGE: 1: Married, 2: Single; 3 and 0 have been clubbed together as "Others"


library(DataExplorer)  # plot_bar() is assumed to come from this package

df$SEX <- ifelse(df$SEX == 1, "Male", "Female")

# Codes 0, 5 and 6 fall through to "Unknown"
df$EDUCATION <- ifelse(df$EDUCATION == 1, "Graduate School",
                ifelse(df$EDUCATION == 2, "University",
                ifelse(df$EDUCATION == 3, "High School",
                ifelse(df$EDUCATION == 4, "Others", "Unknown"))))

# Codes 0 and 3 both map to "Others"
df$MARRIAGE <- ifelse(df$MARRIAGE == 1, "Married",
               ifelse(df$MARRIAGE == 2, "Single", "Others"))

cat_vars <- c("SEX", "EDUCATION", "MARRIAGE")
df[cat_vars] <- lapply(df[cat_vars], as.factor)

plot_bar(df[, cat_vars])
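
As an alternative to nested `ifelse()` calls, the same encoding can be sketched with `factor()` and a lookup vector. This is only an illustration on made-up codes, not the approach used in these notes; `sex_codes`, `edu_codes` and `edu_labels` are hypothetical names.

```r
# Illustrative codes only, not taken from the dataset.
sex_codes <- c(2, 2, 1, 1, 2)
edu_codes <- c(1, 2, 3, 4, 5, 6, 0)

# factor() maps each listed level to its label; unlisted codes become NA.
sex <- factor(sex_codes, levels = c(1, 2), labels = c("Male", "Female"))

# Named lookup vector for the documented education codes.
edu_labels <- c("1" = "Graduate School", "2" = "University",
                "3" = "High School", "4" = "Others")
edu <- unname(edu_labels[as.character(edu_codes)])
edu[is.na(edu)] <- "Unknown"   # codes 0, 5 and 6 have no entry in the lookup
edu <- factor(edu)

print(table(sex))
print(table(edu))
```

One difference in behaviour: `ifelse(x == 1, "Male", "Female")` labels every value other than 1 as "Female", whereas `factor()` turns unexpected codes into `NA`, which makes them easier to spot.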

BAR PLOT OF VARIOUS VARIABLES

There are monthly variables for the repayment status which need to be coded properly for better
analysis. They also need to be converted into categorical variables for the modeling. The variables
are to be coded as follows:
-2 to 0: "Paid Duly"
1: "1 month delay"
2: "2 month delay"
3: "3 month delay"
4: "4 month delay"
5: "5 month delay"
6: "6 month delay"
7: "7 month delay"
8: "8 month delay"
Anything more than 8 is put as "9 month or more delay".
But before proceeding, we will quickly check the number of counts under each category.

PAY_VAR<-lapply(df[,c("PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6")],
function(x) table(x))

print(PAY_VAR)

$PAY_0
x
-2 -1 0 1 2 3 4 5 6 7 8
2759 5686 14737 3688 2667 322 76 26 11 9 19

$PAY_2
x
-2 -1 0 1 2 3 4 5 6 7 8
3782 6050 15730 28 3927 326 99 25 12 20 1

$PAY_3
x
-2 -1 0 1 2 3 4 5 6 7 8
4085 5938 15764 4 3819 240 76 21 23 27 3

$PAY_4
x
-2 -1 0 1 2 3 4 5 6 7 8
4348 5687 16455 2 3159 180 69 35 5 58 2

$PAY_5
x
-2 -1 0 2 3 4 5 6 7 8
4546 5539 16947 2626 178 84 17 4 58 1

$PAY_6
x
-2 -1 0 2 3 4 5 6 7 8
4895 5740 16286 2766 184 49 13 19 46 2

As we see from the results above, the number of customers who have delayed payments by 5 or more
months is very low, so we group them under a single category.
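
How small this group is can be checked directly from the PAY_0 counts printed above. The sketch below simply re-enters those counts by hand; the vector name `pay0` is ours.

```r
# PAY_0 counts copied from the table printed above.
pay0 <- c("-2" = 2759, "-1" = 5686, "0" = 14737, "1" = 3688, "2" = 2667,
          "3" = 322, "4" = 76, "5" = 26, "6" = 11, "7" = 9, "8" = 19)

# Customers with a delay of five months or more.
late5 <- pay0[as.numeric(names(pay0)) >= 5]

print(sum(late5))                               # number of such customers
print(round(100 * sum(late5) / sum(pay0), 2))   # their share in percent
```

For PAY_0 this gives 65 customers out of 30000, well under 1% of the data.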

# Recode one repayment-status column: values below 1 become "Paid Duly",
# 1 to 4 keep their own level, and everything above 4 is pooled.
recode_pay <- function(x) {
  as.factor(ifelse(x < 1, "Paid Duly",
            ifelse(x <= 4, paste(x, "month delay"),
                   "5 month or more delay")))
}

pay_vars <- c("PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6")
df[pay_vars] <- lapply(df[pay_vars], recode_pay)
BAR PLOT OF PAYMENT DATA

Let us analyze the continuous variables and plot them.


 Observations:
- All of the variables look skewed and have some outliers, which require treatment. However, we
will not treat the variable “Age” and will instead break it down into buckets.
- Most people run up a bill quickly but pay it off in small amounts over many months (a bill
amount generated in September may get paid off over September, October and November).
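
The notes do not yet state which outlier treatment or which age buckets will be used. The sketch below shows one common option on made-up data: capping at the 1st and 99th percentiles, and bucketing age with `cut()`. The percentiles, cut points and variable names are all assumptions for illustration.

```r
set.seed(1)

# Toy right-skewed amounts with one extreme value, and a few ages; illustrative only.
bill <- c(rexp(99, rate = 1 / 50000), 900000)
age  <- c(21, 25, 34, 41, 57, 79)

# Cap values outside the 1st and 99th percentiles (an assumed treatment).
caps <- unname(quantile(bill, probs = c(0.01, 0.99)))
bill_capped <- pmin(pmax(bill, caps[1]), caps[2])

# Break age into buckets; the cut points are hypothetical.
age_bucket <- cut(age, breaks = c(20, 30, 40, 50, 60, 80),
                  labels = c("21-30", "31-40", "41-50", "51-60", "61-80"))

print(range(bill_capped))
print(table(age_bucket))
```

Capping keeps every row in the data while limiting the influence of extreme bills. The bucket boundaries cover the observed age range of 21 to 79.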

Bi-Variate Analysis

library(ggplot2)

# EDUCATION vs MARRIAGE
ggplot(data=df, aes(EDUCATION)) + geom_bar() + facet_grid(rows=vars(MARRIAGE)) +
ggtitle('Education Level vs Marriage')

# EDUCATION vs SEX
ggplot(data=df, aes(EDUCATION)) + geom_bar() + facet_grid(rows=vars(SEX)) +
ggtitle('Education Level vs Sex')

Observation:
- Education seems to follow a similar distribution across both Sex and Marriage.

# EDUCATION DEFAULT vs NON-DEFAULT
ggplot(data=df, aes(EDUCATION)) + geom_bar() + facet_grid(rows=vars(payment_default)) +
ggtitle('Education profile - Default vs Non-Default')
# MARRIAGE vs DEFAULT
ggplot(data=df, aes(MARRIAGE)) + geom_bar() + facet_grid(rows=vars(payment_default)) +
ggtitle('Marital status - Default vs Non-Default')

 Observation
- Surprisingly, single people are less likely to default.
- However, if a default has happened, then the defaulter is almost equally likely to be single or
married.
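
These two observations rest on different conditional proportions: the share of defaulters within each marital status, and the share of each marital status among defaulters. The sketch below shows how both can be read from one contingency table using `prop.table()`. The counts are made up for illustration and are not the project's figures.

```r
# Toy counts, illustrative only (not the project's actual figures).
tab <- matrix(c( 800, 200,
                1050, 200),
              nrow = 2, byrow = TRUE,
              dimnames = list(MARRIAGE = c("Married", "Single"),
                              payment_default = c("0", "1")))

# P(default | marital status): proportions within each row.
by_status <- prop.table(tab, margin = 1)

# P(marital status | default): proportions within each column.
by_outcome <- prop.table(tab, margin = 2)

print(round(by_status, 3))
print(round(by_outcome, 3))
```

In this toy table single clients default less often (16% against 20%), yet defaulters are split evenly between the two groups because there are more single clients overall. On the real data the table could be built with `table(df$MARRIAGE, df$payment_default)`.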

# Default vs Marriage and Education


ggplot(data=df, aes(payment_default)) + geom_bar() + facet_grid(rows=vars(EDUCATION),
cols=vars(MARRIAGE)) +
ggtitle('Default vs Marriage + Education')

# Default vs Marriage and Sex


ggplot(data=df, aes(payment_default)) + geom_bar() + facet_grid(rows=vars(MARRIAGE),
cols=vars(SEX)) +
ggtitle('Default vs Marriage + Sex')
 Observation
- Nothing too concrete; single males seem to be the most reliable.
