
Company

Default Data

Description:
The dataset contains information on default payments and company details of companies in India as of May 2019.

In [1]:

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
import sklearn.metrics as metrics

import warnings
warnings.filterwarnings("ignore")

Importing the dataset

In [2]:

Company = pd.read_csv('Company_Practice.csv')

#Glimpse of Data
Company.head()

Out[2]:

[Wide 76-column preview; the headers wrap badly in this export. The leading columns are Num, Networth Next Year, default, Total assets, Net worth, Total Income, Total Income/Total assets, Change in stock, Change in stock/Total Income, ..., and the trailing columns include Finished goods turnover, WIP turnover, Raw material turnover, Shares outstanding and Equity face value. The same five rows are shown cleanly after the column names are fixed below.]

5 rows × 76 columns

Fixing messy column names (containing spaces) for ease of use

In [3]:

Company.columns = (Company.columns.str.strip().str.replace(' ', '_')
                   .str.replace('(', '').str.replace(')', '')
                   .str.replace('%', 'perc').str.replace('/', '_by_')
                   .str.replace('&', 'and'))
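For reference, the same cleaning can be done in a single pass with a character map instead of chaining several `str.replace` calls (a sketch on hypothetical column names, not the actual dataset headers):

```python
import pandas as pd

# Single-pass alternative: translate each special character via one mapping
# instead of chained str.replace calls (toy column names for illustration).
cols = pd.Index(['Net worth', 'PBT/Total assets', 'PBDITA (%) of income'])
mapping = {' ': '_', '(': '', ')': '', '%': 'perc', '/': '_by_', '&': 'and'}
cleaned = cols.str.strip().map(
    lambda name: ''.join(mapping.get(ch, ch) for ch in name)
)
# cleaned -> ['Net_worth', 'PBT_by_Total_assets', 'PBDITA_perc_of_income']
```

A single mapping keeps the substitution rules in one place, which is easier to extend than a chain of replace calls.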

Checking top 5 rows again

In [4]:

Company.head()

Out[4]:

Num Networth_Next_Year default Total_assets Net_worth Total_Income Total_Income_by_Total_assets Change_in_stock Change_in_stock_by_Total_Income

0 327 -7590.7 1 20099.1 398.3 1239.4 0.061664 -6.7 -0.005406

1 3434 -3976.6 1 33592.2 1665.9 33935.2 1.010211 53.6 0.001579

2 3164 -1733.7 1 15318.7 1028.8 2387.2 0.155836 70.3 0.029449

3 3267 -1438.4 1 5924.6 356.9 4800.8 0.810316 21.4 0.004458

4 1750 -974.2 1 3411.8 163.2 3455.2 1.012721 7.5 0.002171

5 rows × 76 columns

Now, let us check the number of rows (observations) and the number of columns (variables)
In [5]:

print('The number of rows (observations) is', Company.shape[0],
      '\nThe number of columns (variables) is', Company.shape[1])

The number of rows (observations) is 2409


The number of columns (variables) is 76

Checking datatype of all columns

In [6]:
Company.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2409 entries, 0 to 2408
Data columns (total 76 columns):
Num 2409 non-null int64
Networth_Next_Year 2409 non-null float64
default 2409 non-null int64
Total_assets 2409 non-null float64
Net_worth 2409 non-null float64
Total_Income 2409 non-null float64
Total_Income_by_Total_assets 2409 non-null float64
Change_in_stock 2409 non-null float64
Change_in_stock_by_Total_Income 2409 non-null float64
Total_expenses 2400 non-null float64
Total_expenses_by_Total_Income 2409 non-null float64
Profit_after_tax 2409 non-null float64
Profit_after_tax_by_Total_assets 2409 non-null float64
PBDITA 2397 non-null float64
PBDITA_by_Total_assets 2409 non-null float64
PBT 2409 non-null float64
PBT_by_Total_assets 2399 non-null float64
Cash_profit 2409 non-null float64
Cash_profit_by_Total_assets 2409 non-null float64
PBDITA_as_perc_of_total_income 2401 non-null float64
PBT_as_perc_of_total_income 2409 non-null float64
PAT_as_perc_of_total_income 2409 non-null float64
Cash_profit_as_perc_of_total_income 2409 non-null float64
PAT_as_perc_of_net_worth 2409 non-null float64
Sales 2409 non-null float64
Sales_by_Total_assets 2409 non-null float64
Income_from_financial_services 2398 non-null float64
Income_from_financial_services_by_Total_Income 2409 non-null float64
Other_income 2409 non-null float64
Other_income_by_Total_Income 2409 non-null float64
Total_capital 2409 non-null float64
Total_capital_by_Total_Assets 2409 non-null float64
Reserves_and_funds 2409 non-null float64
Reserves_and_funds_by_Total_Assets 2399 non-null float64
Borrowings 2409 non-null float64
Borrowings_by_Total_Assets 2409 non-null float64
Current_liabilities_and_provisions 2409 non-null float64
Current_liabilities_and_provisions_by_Total_assets 2409 non-null float64
Deferred_tax_liability 2409 non-null float64
Deferred_tax_liability_by_Total_Assets 2409 non-null float64
Shareholders_funds 2409 non-null float64
Shareholders_funds_by_Total_assets 2409 non-null float64
Cumulative_retained_profits 2409 non-null float64
Cumulative_retained_profits_by_Total_Income 2409 non-null float64
Capital_employed 2409 non-null float64
Capital_employed_by_Total_assets 2409 non-null float64
TOL_by_TNW 2409 non-null float64
Total_term_liabilities__by__tangible_net_worth 2371 non-null float64
Contingent_liabilities__by__Net_worth_perc 2409 non-null float64
Contingent_liabilities 2409 non-null float64
Contingent_liabilities_by_Total_Assets 2409 non-null float64
Net_fixed_assets 2409 non-null float64
Net_fixed_assets_by_Total_Assets 2409 non-null float64
Investments 2384 non-null float64
Investments_by_Total_Income 2409 non-null float64
Current_assets 2409 non-null float64
Current_assets_by_Total_Assets 2409 non-null float64
Net_working_capital 2409 non-null float64
Net_working_capital_by_Total_Capital 2409 non-null float64
Quick_ratio_times 2409 non-null float64
Current_ratio_times 2409 non-null float64
Debt_to_equity_ratio_times 2409 non-null float64
Cash_to_current_liabilities_times 2409 non-null float64
Cash_to_average_cost_of_sales_per_day 2409 non-null float64
Creditors_turnover 2409 non-null float64
Debtors_turnover 2409 non-null float64
Finished_goods_turnover 2388 non-null float64
WIP_turnover 2409 non-null float64
Raw_material_turnover 2409 non-null float64
Shares_outstanding 2409 non-null float64
Equity_face_value 2409 non-null float64
EPS 2409 non-null float64
Adjusted_EPS 2409 non-null float64
Total_liabilities 2409 non-null float64
PE_on_BSE 2409 non-null float64
Dev_model 2409 non-null int64
dtypes: float64(73), int64(3)
memory usage: 1.4 MB
Now, let us check the basic measures of descriptive statistics for the continuous variables

In [7]:

Company.describe()

Out[7]:

Num Networth_Next_Year default Total_assets Net_worth Total_Income Total_Income_by_Total_assets Change_in_stock Change_in_sto

count 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000

mean 1772.206310 1468.828103 0.070154 2003.925598 725.237767 2176.605398 17.817319 24.591451

std 1024.890149 17993.532199 0.255459 6285.073259 2397.772300 5312.328009 109.648029 89.402757

min 2.000000 -7590.700000 0.000000 4.200000 0.500000 0.458000 0.000426 -237.194000

25% 879.000000 34.900000 0.000000 97.600000 32.900000 126.800000 0.728960 -0.700000

50% 1770.000000 118.000000 0.000000 319.300000 103.200000 505.000000 1.135046 4.300000

75% 2658.000000 468.300000 0.000000 1099.200000 378.200000 1866.900000 1.764075 41.894006

max 3545.000000 805773.400000 1.000000 54287.498000 21290.980000 42941.208000 1107.327045 607.530000

8 rows × 76 columns

What does the variable 'default' look like?

In [8]:

Company['default'].value_counts()

Out[8]:

0 2240
1 169
Name: default, dtype: int64

Checking proportion of default

In [9]:

169/(2240+169)

Out[9]:

0.0701535907015359

Checking summary statistics of default variable

In [10]:

Company['default'].describe()

Out[10]:

count 2409.000000
mean 0.070154
std 0.255459
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: default, dtype: float64

The mean of the 'default' variable (0.07) matches the overall default rate of about 7%
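The proportion can also be read off directly with `value_counts(normalize=True)` (a sketch on a series rebuilt from the counts above):

```python
import pandas as pd

# Class proportions in one call; the series mirrors the counts shown above.
default = pd.Series([0] * 2240 + [1] * 169, name='default')
proportions = default.value_counts(normalize=True)
default_rate = proportions[1]   # 169 / 2409, about 0.0702
```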

Let's check for missing values in the dataset


In [11]:

Company.isnull().sum()

Out[11]:

Num 0
Networth_Next_Year 0
default 0
Total_assets 0
Net_worth 0
Total_Income 0
Total_Income_by_Total_assets 0
Change_in_stock 0
Change_in_stock_by_Total_Income 0
Total_expenses 9
Total_expenses_by_Total_Income 0
Profit_after_tax 0
Profit_after_tax_by_Total_assets 0
PBDITA 12
PBDITA_by_Total_assets 0
PBT 0
PBT_by_Total_assets 10
Cash_profit 0
Cash_profit_by_Total_assets 0
PBDITA_as_perc_of_total_income 8
PBT_as_perc_of_total_income 0
PAT_as_perc_of_total_income 0
Cash_profit_as_perc_of_total_income 0
PAT_as_perc_of_net_worth 0
Sales 0
Sales_by_Total_assets 0
Income_from_financial_services 11
Income_from_financial_services_by_Total_Income 0
Other_income 0
Other_income_by_Total_Income 0
..
TOL_by_TNW 0
Total_term_liabilities__by__tangible_net_worth 38
Contingent_liabilities__by__Net_worth_perc 0
Contingent_liabilities 0
Contingent_liabilities_by_Total_Assets 0
Net_fixed_assets 0
Net_fixed_assets_by_Total_Assets 0
Investments 25
Investments_by_Total_Income 0
Current_assets 0
Current_assets_by_Total_Assets 0
Net_working_capital 0
Net_working_capital_by_Total_Capital 0
Quick_ratio_times 0
Current_ratio_times 0
Debt_to_equity_ratio_times 0
Cash_to_current_liabilities_times 0
Cash_to_average_cost_of_sales_per_day 0
Creditors_turnover 0
Debtors_turnover 0
Finished_goods_turnover 21
WIP_turnover 0
Raw_material_turnover 0
Shares_outstanding 0
Equity_face_value 0
EPS 0
Adjusted_EPS 0
Total_liabilities 0
PE_on_BSE 0
Dev_model 0
Length: 76, dtype: int64

In [12]:

#Columns with missing values


print(np.where(Company.isnull().sum()>0))

(array([ 9, 13, 16, 19, 26, 33, 47, 53, 66], dtype=int64),)

In [13]:

Company.iloc[:,66].isnull().sum()

Out[13]:

21
There are missing values in the dataset

Let's treat these missing values with the median (replacing with the median limits the influence of outliers on the treatment)

In [14]:

# Imputing missing values with the column median

col = list(Company.columns)

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='median')

Company = pd.DataFrame(imputer.fit_transform(Company))
Company.columns = col
Company.head()

Out[14]:

Num Networth_Next_Year default Total_assets Net_worth Total_Income Total_Income_by_Total_assets Change_in_stock Change_in_stock_by_Total_Incom

0 327.0 -7590.7 1.0 20099.1 398.3 1239.4 0.061664 -6.7 -0.00540

1 3434.0 -3976.6 1.0 33592.2 1665.9 33935.2 1.010211 53.6 0.00157

2 3164.0 -1733.7 1.0 15318.7 1028.8 2387.2 0.155836 70.3 0.02944

3 3267.0 -1438.4 1.0 5924.6 356.9 4800.8 0.810316 21.4 0.00445

4 1750.0 -974.2 1.0 3411.8 163.2 3455.2 1.012721 7.5 0.00217

5 rows × 76 columns
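The same imputation can be expressed in plain pandas; `fillna` with the column medians is equivalent to `SimpleImputer(strategy='median')` (a sketch on toy data, not the Company frame):

```python
import numpy as np
import pandas as pd

# Median imputation with fillna: each NaN is replaced by its column median.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})
filled = df.fillna(df.median())
# filled['a'] -> [1.0, 2.0, 3.0]; filled['b'] -> [3.0, 2.0, 4.0]
```

One difference worth knowing: `fillna` preserves the original dtypes and index, whereas `SimpleImputer.fit_transform` returns a plain array, which is why the notebook has to restore the column names afterwards.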

Outlier detection & Treatment

Creating outlier identification (Lower & Upper whiskers) function

In [15]:

# Checking outliers in the dataset with box plots

col_names = list(Company.columns)
col_names.remove('Num')
fig, ax = plt.subplots(len(col_names), figsize=(8, 100))

for i, col_val in enumerate(col_names):
    sns.boxplot(y=Company[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.show()
There are outliers in the dataset; let's use the capping method to treat them

In [16]:

def check_outlier(col):
    # quantile does not require pre-sorting the column
    Q1, Q3 = col.quantile([.25, .75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

Let's check the outliers (lower and upper whiskers) in these variables

In [17]:

check_outlier(Company['Networth_Next_Year'])

Out[17]:

(-615.2, 1118.4)

In [18]:

check_outlier(Company['Total_Income'])

Out[18]:

(-2483.35, 4477.05)

In [19]:

check_outlier(Company['PBT_as_perc_of_total_income'])

Out[19]:
(-11.75, 20.89)
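The fences can be sanity-checked on a toy series (an illustration, not part of the dataset): with quartiles 2 and 4 the whiskers land at -1 and 7, so only the extreme value is flagged.

```python
import pandas as pd

# IQR fences on a toy series: Q1 = 2, Q3 = 4, IQR = 2,
# so lower = 2 - 3 = -1 and upper = 4 + 3 = 7.
s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile([.25, .75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]   # only 100 falls outside
```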
Capping the outliers

In [20]:

def treat_outlier(x):
    # taking the 5th, 25th, 75th and 95th percentiles of the column
    q5 = np.percentile(x, 5)
    q25 = np.percentile(x, 25)
    q75 = np.percentile(x, 75)
    q95 = np.percentile(x, 95)
    # calculating the IQR range
    IQR = q75 - q25
    # calculating the minimum and maximum thresholds
    lower_bound = q25 - (1.5 * IQR)
    upper_bound = q75 + (1.5 * IQR)
    print(q5, q25, q75, q95, lower_bound, upper_bound)
    # capping outliers: high outliers to the 95th percentile, low ones to the 5th
    return x.apply(lambda y: q95 if y > upper_bound else y).apply(lambda y: q5 if y < lower_bound else y)

In [21]:

for i in Company:
Company[i]=treat_outlier(Company[i])

[One line of percentile values is printed per column, 76 lines in all; the lengthy output is truncated here.]
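The effect of the capping on a single toy column can be sketched as follows (illustrative values, not from the dataset): the value beyond the upper whisker is pulled back to the 95th percentile.

```python
import numpy as np
import pandas as pd

# Percentile capping: 100 exceeds the upper whisker (7.0),
# so it is replaced by the 95th percentile (80.8).
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
q5, q25, q75, q95 = np.percentile(x, [5, 25, 75, 95])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
capped = x.apply(lambda y: q95 if y > upper else (q5 if y < lower else y))
```

Note the capped value is the 95th percentile of the column, not the whisker itself, so some spread beyond the fences is deliberately retained.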

In [22]:
Company.shape

Out[22]:

(2409, 76)

Let us check the significance of the variable 'PBT_as_perc_of_total_income' in predicting default (based on Networth_Next_Year) before proceeding to model development

Checking descriptive statistics of the variable 'PBT_as_perc_of_total_income'

In [23]:

Company['PBT_as_perc_of_total_income'].describe()

Out[23]:

count 2409.000000
mean 3.581635
std 10.147879
min -21.140000
25% 0.490000
50% 3.280000
75% 8.650000
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64

Checking descriptive statistics of the variable 'PBT_as_perc_of_total_income' for non-defaulters


In [24]:

Company.loc[Company['default'] == 0,'PBT_as_perc_of_total_income'].describe()

Out[24]:

count 2240.000000
mean 4.611910
std 9.236063
min -21.140000
25% 0.920000
50% 3.690000
75% 9.182500
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64

For companies that have not defaulted, the median 'Profit before tax (as % of income)' is about 3.7

Checking Descriptive statistics of the variable 'PBT_as_perc_of_total_income' for defaulters

In [25]:

Company.loc[Company['default'] == 1,'PBT_as_perc_of_total_income'].describe()

Out[25]:

count 169.000000
mean -10.074083
std 11.722073
min -21.140000
25% -21.140000
50% -10.320000
75% 0.000000
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64

For companies that have defaulted, the median 'Profit before tax (as % of income)' is about -10.3

In other words, a typical good company makes a profit of about 3.7 units per 100 units of income,
while a typical defaulted company loses about 10.3 units per 100 units of income

Model Building using Logistic Regression for 'Probability at default'

The equation of logistic regression, by which we predict probabilities and then a discrete target variable, is

y = 1 / (1 + e^(-z))

Note: z = β0 + β1·x1 + β2·x2 + ... + βn·xn
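The link function above in code (a small sketch): the sigmoid maps any real-valued z to a probability in (0, 1), with z = 0 giving y = 0.5.

```python
import numpy as np

# Sigmoid: y = 1 / (1 + exp(-z)); strictly increasing, bounded in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid(np.array([-2.0, 0.0, 2.0]))
# probs[1] is exactly 0.5, and probs is strictly increasing in z
```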

Now, Importing statsmodels modules

In [26]:

import statsmodels.formula.api as SM

Creating the logistic regression equation and storing it in f_1. The general pattern is:

model = SM.logit(formula='dependent_variable ~ predictor_1 + predictor_2 + ...', data=data_frame).fit()

Splitting arrays or matrices into random train and test subsets. Model will be fitted on train set and predictions will be made on the test
set
In [27]:

X = Company.drop(['default','Networth_Next_Year','Num'], axis=1)
y = Company['default']

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42,
                                                    stratify=Company['default'])

Company_train = pd.concat([X_train,y_train], axis=1)


Company_test = pd.concat([X_test,y_test], axis=1)

Company_train.to_csv('Company_train.csv',index=False)
Company_test.to_csv('Company_test.csv',index=False)

In [28]:

Company_train.columns

Out[28]:

Index(['Total_assets', 'Net_worth', 'Total_Income',


'Total_Income_by_Total_assets', 'Change_in_stock',
'Change_in_stock_by_Total_Income', 'Total_expenses',
'Total_expenses_by_Total_Income', 'Profit_after_tax',
'Profit_after_tax_by_Total_assets', 'PBDITA', 'PBDITA_by_Total_assets',
'PBT', 'PBT_by_Total_assets', 'Cash_profit',
'Cash_profit_by_Total_assets', 'PBDITA_as_perc_of_total_income',
'PBT_as_perc_of_total_income', 'PAT_as_perc_of_total_income',
'Cash_profit_as_perc_of_total_income', 'PAT_as_perc_of_net_worth',
'Sales', 'Sales_by_Total_assets', 'Income_from_financial_services',
'Income_from_financial_services_by_Total_Income', 'Other_income',
'Other_income_by_Total_Income', 'Total_capital',
'Total_capital_by_Total_Assets', 'Reserves_and_funds',
'Reserves_and_funds_by_Total_Assets', 'Borrowings',
'Borrowings_by_Total_Assets', 'Current_liabilities_and_provisions',
'Current_liabilities_and_provisions_by_Total_assets',
'Deferred_tax_liability', 'Deferred_tax_liability_by_Total_Assets',
'Shareholders_funds', 'Shareholders_funds_by_Total_assets',
'Cumulative_retained_profits',
'Cumulative_retained_profits_by_Total_Income', 'Capital_employed',
'Capital_employed_by_Total_assets', 'TOL_by_TNW',
'Total_term_liabilities__by__tangible_net_worth',
'Contingent_liabilities__by__Net_worth_perc', 'Contingent_liabilities',
'Contingent_liabilities_by_Total_Assets', 'Net_fixed_assets',
'Net_fixed_assets_by_Total_Assets', 'Investments',
'Investments_by_Total_Income', 'Current_assets',
'Current_assets_by_Total_Assets', 'Net_working_capital',
'Net_working_capital_by_Total_Capital', 'Quick_ratio_times',
'Current_ratio_times', 'Debt_to_equity_ratio_times',
'Cash_to_current_liabilities_times',
'Cash_to_average_cost_of_sales_per_day', 'Creditors_turnover',
'Debtors_turnover', 'Finished_goods_turnover', 'WIP_turnover',
'Raw_material_turnover', 'Shares_outstanding', 'Equity_face_value',
'EPS', 'Adjusted_EPS', 'Total_liabilities', 'PE_on_BSE', 'Dev_model',
'default'],
dtype='object')

Model 1

Before starting model building, lets look at the problem of multicollinearity. Multicollinearity occurs when two or more independent variables are highly
correlated with one another in a regression model.

In [29]:

# Import library for VIF


from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

# Calculating VIF
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

return(vif)
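What the VIF measures can be reproduced without statsmodels (a sketch with synthetic data): regress one predictor on the others and take 1 / (1 - R²); a near-collinear column yields a very large VIF.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)   # almost a linear combination

def vif(target, others):
    # R^2 of regressing `target` on `others` (with intercept), then 1/(1-R^2)
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

# vif(x1, [x2]) stays near 1; vif(x3, [x1, x2]) explodes
```

This also explains the `inf` VIFs seen below for Total_assets and Total_liabilities: when a column is an exact linear combination of the others, R² = 1 and the ratio diverges.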

In [30]:

calc_vif(X_train).sort_values(by='VIF', ascending = True)

Out[30]:

variables VIF

45 Contingent_liabilities__by__Net_worth_perc 1.262117

71 PE_on_BSE 1.371907

65 Raw_material_turnover 1.457993

50 Investments 1.613993

4 Change_in_stock 1.706929

62 Debtors_turnover 1.711831

61 Creditors_turnover 1.768949

5 Change_in_stock_by_Total_Income 1.863721

46 Contingent_liabilities 1.979391

25 Other_income 1.989548

55 Net_working_capital_by_Total_Capital 2.233915

63 Finished_goods_turnover 2.340059

51 Investments_by_Total_Income 2.356976

64 WIP_turnover 2.418147

54 Net_working_capital 2.453126

49 Net_fixed_assets_by_Total_Assets 2.524236

67 Equity_face_value 2.658062

32 Borrowings_by_Total_Assets 2.683636

60 Cash_to_average_cost_of_sales_per_day 2.722499

47 Contingent_liabilities_by_Total_Assets 2.833447

35 Deferred_tax_liability 2.836829

26 Other_income_by_Total_Income 2.937239

23 Income_from_financial_services 3.103908

40 Cumulative_retained_profits_by_Total_Income 3.124709

20 PAT_as_perc_of_net_worth 3.168076

24 Income_from_financial_services_by_Total_Income 3.349914

36 Deferred_tax_liability_by_Total_Assets 3.403342

53 Current_assets_by_Total_Assets 3.601563

59 Cash_to_current_liabilities_times 3.690171

48 Net_fixed_assets 4.133867

... ... ...

33 Current_liabilities_and_provisions 5.922509

43 TOL_by_TNW 6.392183

22 Sales_by_Total_assets 6.879268

52 Current_assets 7.063778

30 Reserves_and_funds_by_Total_Assets 7.330419

38 Shareholders_funds_by_Total_assets 7.772884

29 Reserves_and_funds 8.569776

16 PBDITA_as_perc_of_total_income 8.770527

3 Total_Income_by_Total_assets 9.338925

58 Debt_to_equity_ratio_times 9.408160

19 Cash_profit_as_perc_of_total_income 13.826134

41 Capital_employed 17.126263

6 Total_expenses 18.125364

11 PBDITA_by_Total_assets 18.716639

14 Cash_profit 19.287246

21 Sales 20.697583

10 PBDITA 21.230536

15 Cash_profit_by_Total_assets 21.530913

8 Profit_after_tax 24.076989

12 PBT 25.865234

18 PAT_as_perc_of_total_income 30.186683

2 Total_Income 30.757437

37 Shareholders_funds 33.057718

9 Profit_after_tax_by_Total_assets 33.820119

1 Net_worth 34.278786

17 PBT_as_perc_of_total_income 34.280187

13 PBT_by_Total_assets 36.137749

72 Dev_model 587.879602

70 Total_liabilities inf

0 Total_assets inf

73 rows × 2 columns

We see that the VIF is high for many variables. We may drop the variables with VIF above 5 (very high correlation) and build our model with the rest

In [31]:

f_1 = ('default ~ Adjusted_EPS + Cumulative_retained_profits + Total_expenses_by_Total_Income + '
       'Total_capital + Net_fixed_assets + Cash_to_current_liabilities_times + '
       'Current_assets_by_Total_Assets + Deferred_tax_liability_by_Total_Assets + '
       'Income_from_financial_services_by_Total_Income + PAT_as_perc_of_net_worth + '
       'Cumulative_retained_profits_by_Total_Income + Income_from_financial_services + '
       'Other_income_by_Total_Income + Deferred_tax_liability + '
       'Contingent_liabilities_by_Total_Assets + Cash_to_average_cost_of_sales_per_day + '
       'Borrowings_by_Total_Assets + Equity_face_value + Net_fixed_assets_by_Total_Assets + '
       'Net_working_capital + WIP_turnover + Investments_by_Total_Income + '
       'Finished_goods_turnover + Net_working_capital_by_Total_Capital + Other_income + '
       'Contingent_liabilities + Change_in_stock_by_Total_Income + Creditors_turnover + '
       'Debtors_turnover + Change_in_stock + Investments + Raw_material_turnover + '
       'PE_on_BSE + Contingent_liabilities__by__Net_worth_perc')

Fitting the logistic regression model

In [32]:

model_1 = SM.logit(formula = f_1, data=Company).fit()

Optimization terminated successfully.


Current function value: 0.135954
Iterations 10

Studying whether this equation is significant or not

In [33]:

model_1.summary()
Out[33]:

Logit Regression Results

Dep. Variable: default No. Observations: 2409

Model: Logit Df Residuals: 2374

Method: MLE Df Model: 34

Date: Mon, 31 Aug 2020 Pseudo R-squ.: 0.4648

Time: 12:23:34 Log-Likelihood: -327.51

converged: True LL-Null: -611.97

Covariance Type: nonrobust LLR p-value: 2.689e-98

coef std err z P>|z| [0.025 0.975]

Intercept -3.9379 1.090 -3.613 0.000 -6.074 -1.802

Adjusted_EPS -0.0226 0.013 -1.684 0.092 -0.049 0.004

Cumulative_retained_profits 0.0004 0.000 1.714 0.087 -6.16e-05 0.001

Total_expenses_by_Total_Income -0.1733 0.922 -0.188 0.851 -1.980 1.633

Total_capital -0.0027 0.001 -2.824 0.005 -0.005 -0.001

Net_fixed_assets 0.0003 0.000 1.882 0.060 -1.26e-05 0.001

Cash_to_current_liabilities_times -0.0157 0.301 -0.052 0.958 -0.605 0.574

Current_assets_by_Total_Assets -0.4262 0.545 -0.782 0.434 -1.494 0.642

Deferred_tax_liability_by_Total_Assets -0.0021 0.033 -0.064 0.949 -0.067 0.063

Income_from_financial_services_by_Total_Income 0.1022 0.087 1.168 0.243 -0.069 0.274

PAT_as_perc_of_net_worth -0.0927 0.010 -9.530 0.000 -0.112 -0.074

Cumulative_retained_profits_by_Total_Income -2.0638 0.368 -5.614 0.000 -2.784 -1.343

Income_from_financial_services -0.0018 0.005 -0.404 0.686 -0.011 0.007

Other_income_by_Total_Income -0.0981 0.114 -0.858 0.391 -0.322 0.126

Deferred_tax_liability 0.0007 0.002 0.427 0.669 -0.002 0.004

Contingent_liabilities_by_Total_Assets 0.0217 0.008 2.819 0.005 0.007 0.037

Cash_to_average_cost_of_sales_per_day 0.0017 0.002 0.916 0.360 -0.002 0.005

Borrowings_by_Total_Assets 0.0111 0.022 0.514 0.607 -0.031 0.053

Equity_face_value -1.314e-05 0.000 -0.070 0.944 -0.000 0.000

Net_fixed_assets_by_Total_Assets 1.0835 0.558 1.940 0.052 -0.011 2.178

Net_working_capital -0.0012 0.001 -1.339 0.181 -0.003 0.001

WIP_turnover -0.0056 0.010 -0.534 0.594 -0.026 0.015

Investments_by_Total_Income -0.0285 0.009 -3.040 0.002 -0.047 -0.010

Finished_goods_turnover 0.0020 0.004 0.460 0.645 -0.006 0.010

Net_working_capital_by_Total_Capital 0.0217 0.043 0.499 0.618 -0.064 0.107

Other_income 0.0010 0.008 0.119 0.906 -0.015 0.017

Contingent_liabilities -0.0008 0.000 -2.273 0.023 -0.002 -0.000

Change_in_stock_by_Total_Income 0.1094 0.247 0.442 0.658 -0.376 0.595

Creditors_turnover -0.0151 0.016 -0.954 0.340 -0.046 0.016

Debtors_turnover 0.0129 0.012 1.086 0.277 -0.010 0.036

Change_in_stock 0.0015 0.004 0.351 0.726 -0.007 0.010

Investments 0.0012 0.000 2.936 0.003 0.000 0.002

Raw_material_turnover 0.0095 0.014 0.659 0.510 -0.019 0.038

PE_on_BSE 0.0086 0.004 2.303 0.021 0.001 0.016

Contingent_liabilities__by__Net_worth_perc 0.0099 0.002 4.876 0.000 0.006 0.014

We can see that several variables are insignificant and may not be useful to discriminate cases of default

Let us look at the adjusted pseudo R-square value

In [34]:

print('The adjusted pseudo R-square value is',1 - ((model_1.llf - model_1.df_model)/model_1.llnull))

The adjusted pseudo R-square value is 0.4092656890548343


The adjusted pseudo R-square is lower than the pseudo R-square, which suggests insignificant variables are present in the model. Let's
remove the variables whose p-value is greater than 0.05 and rebuild the model
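For reference, both values follow directly from the log-likelihoods reported in the model_1 summary (McFadden's pseudo R-square and its degrees-of-freedom-adjusted variant):

```python
# Values taken from the model_1 summary above:
# llf = model log-likelihood, llnull = intercept-only log-likelihood.
llf, llnull, df_model = -327.51, -611.97, 34

pseudo_r2 = 1 - llf / llnull                   # about 0.4648, as in the summary
adj_pseudo_r2 = 1 - (llf - df_model) / llnull  # about 0.4093, as printed above
```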

Model 2
In [35]:

f_2 = ('default ~ Total_capital + Net_fixed_assets + PAT_as_perc_of_net_worth + '
       'Cumulative_retained_profits_by_Total_Income + Contingent_liabilities_by_Total_Assets + '
       'Investments_by_Total_Income + Contingent_liabilities + Investments + PE_on_BSE + '
       'Contingent_liabilities__by__Net_worth_perc')

In [36]:

model_2 = SM.logit(formula = f_2, data=Company_train).fit()

Optimization terminated successfully.


Current function value: 0.139671
Iterations 9

In [37]:

model_2.summary()

Out[37]:

Logit Regression Results

Dep. Variable: default No. Observations: 1614

Model: Logit Df Residuals: 1603

Method: MLE Df Model: 10

Date: Mon, 31 Aug 2020 Pseudo R-squ.: 0.4494

Time: 12:23:40 Log-Likelihood: -225.43

converged: True LL-Null: -409.42

Covariance Type: nonrobust LLR p-value: 6.028e-73

coef std err z P>|z| [0.025 0.975]

Intercept -3.8271 0.364 -10.504 0.000 -4.541 -3.113

Total_capital -0.0037 0.001 -3.311 0.001 -0.006 -0.001

Net_fixed_assets 0.0007 0.000 4.222 0.000 0.000 0.001

PAT_as_perc_of_net_worth -0.0986 0.010 -9.921 0.000 -0.118 -0.079

Cumulative_retained_profits_by_Total_Income -2.3753 0.409 -5.807 0.000 -3.177 -1.574

Contingent_liabilities_by_Total_Assets 0.0292 0.009 3.255 0.001 0.012 0.047

Investments_by_Total_Income -0.0206 0.009 -2.348 0.019 -0.038 -0.003

Contingent_liabilities -0.0011 0.000 -2.374 0.018 -0.002 -0.000

Investments 0.0011 0.000 2.571 0.010 0.000 0.002

PE_on_BSE 0.0128 0.004 2.893 0.004 0.004 0.021

Contingent_liabilities__by__Net_worth_perc 0.0081 0.002 3.404 0.001 0.003 0.013

We can see that all variables are significant and may be useful for discriminating cases of default

Let us also check the multicollinearity of the model using Variance Inflation Factor (VIF) for the predictor variables
In [38]:

calc_vif(X_train[['Total_capital', 'Net_fixed_assets', 'PAT_as_perc_of_net_worth',
                  'Cumulative_retained_profits_by_Total_Income',
                  'Contingent_liabilities_by_Total_Assets', 'Investments_by_Total_Income',
                  'Contingent_liabilities', 'Investments', 'PE_on_BSE',
                  'Contingent_liabilities__by__Net_worth_perc']]).sort_values(by='VIF', ascending=True)

Out[38]:

variables VIF

3 Cumulative_retained_profits_by_Total_Income 1.353936

9 Contingent_liabilities__by__Net_worth_perc 1.406154

2 PAT_as_perc_of_net_worth 1.479623

5 Investments_by_Total_Income 1.975775

1 Net_fixed_assets 2.236608

0 Total_capital 2.261477

8 PE_on_BSE 2.447948

7 Investments 2.563008

4 Contingent_liabilities_by_Total_Assets 2.609765

6 Contingent_liabilities 3.285351

Some multicollinearity still exists, but since none of the VIFs is very high, let us retain these variables

In [39]:

print('The adjusted pseudo R-square value is',1 - ((model_2.llf - model_2.df_model)/model_2.llnull))

The adjusted pseudo R-square value is 0.4249760562990271

The adjusted pseudo R-square is now close to the pseudo R-square, suggesting fewer insignificant variables in the model.

We also notice that the current model has no insignificant variables and can be used for prediction purposes.

Let us test the predictions of this model on the train and test datasets

Prediction on the Data

Let us first check the distribution plot of the logit function values

In [40]:

sns.distplot(model_2.fittedvalues);

Now, let us see the predicted probability values:

Prediction on Train set


In [41]:

y_predict_train = model_2.predict(X_train)
y_predict_train

Out[41]:

1195 0.000238
630 0.079001
1978 0.005952
2312 0.000817
1000 0.012295
464 0.282067
198 0.034514
711 0.280323
1949 0.001276
2206 0.001816
1986 0.000249
265 0.146941
52 0.782141
1570 0.174541
1701 0.007760
556 0.048509
2001 0.003805
1551 0.016193
2156 0.000045
664 0.002812
1174 0.014924
1646 0.029623
158 0.867386
2086 0.009801
892 0.003239
1521 0.026492
1068 0.027455
528 0.022768
1108 0.001461
1062 0.000715
...
2290 0.006842
121 0.379384
2080 0.018987
1350 0.007507
1512 0.002323
1447 0.016040
993 0.018150
614 0.000523
1675 0.001548
3 0.284965
1724 0.021777
2028 0.010548
1738 0.000410
278 0.062157
1938 0.005347
239 0.179794
287 0.024797
1095 0.004992
1775 0.008386
2271 0.001393
791 0.017388
1735 0.013902
1759 0.011669
344 0.006982
1110 0.030625
1354 0.010515
2060 0.003080
19 0.292098
1816 0.004499
444 0.053537
Length: 1614, dtype: float64
In [42]:

sns.boxplot(x=y_train, y=y_predict_train)
plt.xlabel('Default');

From the above boxplot, we need to choose a cut-off value that gives the model reasonable discriminating power. Let us
take a cut-off of 0.07 and check.
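Rather than trying cut-offs one at a time, a small helper can tabulate accuracy and sensitivity across candidate cut-offs. A sketch on synthetic scores, which stand in for model_2's predicted probabilities:

```python
import numpy as np

def cutoff_table(y_true, y_prob, cutoffs):
    """For each cutoff, return (cutoff, accuracy, sensitivity)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rows = []
    for c in cutoffs:
        y_hat = (y_prob > c).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        tn = np.sum((y_hat == 0) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        rows.append((c, (tp + tn) / len(y_true), tp / (tp + fn)))
    return rows

# Illustrative labels and probabilities (not the actual model output)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.01, 0.03, 0.10, 0.05, 0.20, 0.60, 0.09, 0.40]
for c, acc, sens in cutoff_table(y_true, y_prob, [0.07, 0.15]):
    print(f"cutoff={c:.2f}  accuracy={acc:.3f}  sensitivity={sens:.3f}")
```

The same table over the real y_train and predicted probabilities would show the accuracy/sensitivity trade-off explored manually below.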

Let us now see the predicted classes

In [43]:
y_class_pred = []
for i in range(0, len(y_predict_train)):
    if np.array(y_predict_train)[i] > 0.07:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)
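The loop above can be collapsed into a single vectorized expression; a sketch with a small stand-in Series of predicted probabilities:

```python
import pandas as pd

# Stand-in for model_2's predicted probabilities on the train set
y_predict_train = pd.Series([0.000238, 0.079001, 0.005952, 0.282067])

# Vectorized equivalent of the element-wise cut-off loop
y_class_pred = (y_predict_train > 0.07).astype(int).tolist()
print(y_class_pred)  # [0, 1, 0, 1]
```

This is both faster and less error-prone than indexing the array inside a Python loop.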

Checking the accuracy of the model using confusion matrix for training set

In [44]:

sns.heatmap(metrics.confusion_matrix(y_train, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [45]:
tn, fp, fn, tp = metrics.confusion_matrix(y_train,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 1294 
False Positives: 207 
False Negatives: 18 
True Positives: 95
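These counts translate directly into the rate metrics reported in the classification report; a quick check using the training-set numbers above:

```python
# Confusion-matrix counts from the training set at cut-off 0.07
tn, fp, fn, tp = 1294, 207, 18, 95

sensitivity = tp / (tp + fn)                  # recall on defaulters
specificity = tn / (tn + fp)                  # recall on non-defaulters
accuracy = (tp + tn) / (tp + tn + fp + fn)    # overall correct predictions

print(f"sensitivity={sensitivity:.3f}")  # 0.841
print(f"specificity={specificity:.3f}")  # 0.862
print(f"accuracy={accuracy:.3f}")        # 0.861
```

These match the recall values of 0.841 (class 1) and 0.862 (class 0) and the 0.861 accuracy in the report below.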
In [46]:

print(metrics.classification_report(y_train,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.986     0.862     0.920      1501
         1.0      0.315     0.841     0.458       113

    accuracy                          0.861      1614
   macro avg      0.650     0.851     0.689      1614
weighted avg      0.939     0.861     0.888      1614

As observed above, the accuracy of the model, i.e. the percentage of overall correct predictions, is 86%.
The sensitivity of the model is 84%, i.e. 84% of the companies that defaulted were correctly identified as defaulters by the model.

Prediction on Test set

In [47]:
y_predict_test = model_2.predict(X_test)
y_predict_test
Out[47]:

1238 0.264364
1446 0.009470
840 0.007608
166 0.054838
658 0.007321
1691 0.018224
499 0.032348
366 0.001656
1129 0.017894
1517 0.009549
868 0.033910
927 0.039776
1451 0.008103
2245 0.000346
1651 0.016640
1165 0.010501
872 0.077171
1006 0.042804
1422 0.008625
1323 0.001854
2038 0.005669
1876 0.027107
2347 0.052555
1884 0.005696
2345 0.000829
2381 0.001835
2051 0.011987
930 0.026261
153 0.716588
969 0.000512
...
1739 0.018299
1317 0.003664
1104 0.003370
2364 0.004185
2308 0.008268
115 0.407130
267 0.009058
1185 0.016101
129 0.927627
690 0.020068
1902 0.003642
1627 0.003918
1891 0.009009
2296 0.000012
1088 0.000972
398 0.029559
1541 0.018018
1464 0.002371
866 0.000250
164 0.608210
1183 0.002765
1465 0.018457
2187 0.000468
983 0.002116
415 0.492578
1848 0.013528
1546 0.000522
2237 0.003755
1096 0.018530
1177 0.007781
Length: 795, dtype: float64

In [48]:

y_class_pred = []
for i in range(0, len(y_predict_test)):
    if np.array(y_predict_test)[i] > 0.07:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)

Checking the accuracy of the model using confusion matrix for test set
In [49]:

sns.heatmap(metrics.confusion_matrix(y_test, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [50]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 652 
False Positives: 87 
False Negatives: 9 
True Positives: 47

Let us now go ahead and print the classification report to check the various other parameters

In [51]:

print(metrics.classification_report(y_test,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.986     0.882     0.931       739
         1.0      0.351     0.839     0.495        56

    accuracy                          0.879       795
   macro avg      0.669     0.861     0.713       795
weighted avg      0.942     0.879     0.901       795

As observed above, the accuracy of the model, i.e. the percentage of overall correct predictions, is 88%.
The sensitivity of the model is 84%, i.e. 84% of the companies that defaulted were correctly identified as defaulters by the model.

Let us take a cut-off of 0.08 and check if our predictions have improved

In [52]:
y_class_pred = []
for i in range(0, len(y_predict_train)):
    if np.array(y_predict_train)[i] > 0.08:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)

Checking the accuracy of the model using confusion matrix for training set
In [53]:

sns.heatmap(metrics.confusion_matrix(y_train, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [54]:

tn, fp, fn, tp = metrics.confusion_matrix(y_train, y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 1321 
False Positives: 180 
False Negatives: 19 
True Positives: 94

In [55]:

print(metrics.classification_report(y_train,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.986     0.880     0.930      1501
         1.0      0.343     0.832     0.486       113

    accuracy                          0.877      1614
   macro avg      0.664     0.856     0.708      1614
weighted avg      0.941     0.877     0.899      1614

The accuracy of the model, i.e. the percentage of overall correct predictions, has increased from 86% to 88% on the training set, but the sensitivity of the model has dropped slightly from 84% to 83%

Prediction on Test set

In [56]:
y_predict_test = model_2.predict(X_test)

In [57]:

y_class_pred = []
for i in range(0, len(y_predict_test)):
    if np.array(y_predict_test)[i] > 0.08:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)

Checking the accuracy of the model using confusion matrix for test set
In [58]:

sns.heatmap(metrics.confusion_matrix(y_test, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [59]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 658 
False Positives: 81 
False Negatives: 9 
True Positives: 47

Let us now go ahead and print the classification report to check the various other parameters

In [60]:

print(metrics.classification_report(y_test,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.987     0.890     0.936       739
         1.0      0.367     0.839     0.511        56

    accuracy                          0.887       795
   macro avg      0.677     0.865     0.723       795
weighted avg      0.943     0.887     0.906       795

As observed above, the accuracy of the model, i.e. the percentage of overall correct predictions, is 89%, and the sensitivity stands at 84%.

We may choose the cut-off of 0.08, since it keeps the same sensitivity while improving the overall accuracy of the model on the test dataset.
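For a more systematic cut-off choice, the ROC curve can suggest a threshold via Youden's J statistic (sensitivity + specificity - 1). A hedged sketch on synthetic labels and scores, which stand in for y_train and model_2's fitted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic data: non-defaulters get mostly low scores, defaulters higher
rng = np.random.default_rng(42)
y_true = np.concatenate([np.zeros(300), np.ones(60)]).astype(int)
scores = np.concatenate([
    rng.beta(1, 20, size=300),  # non-defaulters
    rng.beta(4, 6, size=60),    # defaulters
])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)  # index maximising Youden's J = tpr - fpr
print(f"suggested cutoff: {thresholds[best]:.3f}")
```

Applying the same procedure to the real training probabilities would give a data-driven starting point instead of hand-picking 0.07 or 0.08.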

END
