
Company

Default Data

Description:
The dataset contains information on default payments and company details of companies in India as of May 2019.

In [1]:

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
import sklearn.metrics as metrics

import warnings
warnings.filterwarnings("ignore")

Importing the dataset

In [2]:

Company = pd.read_csv('Company_Practice.csv')

#Glimpse of Data
Company.head()

Out[2]:

[Wide 76-column preview; the headers wrap badly in this export. The leading columns are Num, Networth Next Year, default, Total assets, Net worth, Total Income, Total Income/Total assets, Change in stock, Change in stock/Total Income, ..., and the trailing columns include Finished goods turnover, WIP turnover, Raw material turnover, Shares outstanding and Equity face value. The same five rows are shown cleanly after the column names are fixed below.]

5 rows × 76 columns

Fixing messy column names (containing spaces) for ease of use

In [3]:

Company.columns = (Company.columns.str.strip().str.replace(' ', '_')
                   .str.replace('(', '').str.replace(')', '')
                   .str.replace('%', 'perc').str.replace('/', '_by_')
                   .str.replace('&', 'and'))
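For reference, the same cleaning can be done in a single pass with a character map instead of chaining several `str.replace` calls (a sketch on hypothetical column names, not the actual dataset headers):

```python
import pandas as pd

# Single-pass alternative: translate each special character via one mapping
# instead of chained str.replace calls (toy column names for illustration).
cols = pd.Index(['Net worth', 'PBT/Total assets', 'PBDITA (%) of income'])
mapping = {' ': '_', '(': '', ')': '', '%': 'perc', '/': '_by_', '&': 'and'}
cleaned = cols.str.strip().map(
    lambda name: ''.join(mapping.get(ch, ch) for ch in name)
)
# cleaned -> ['Net_worth', 'PBT_by_Total_assets', 'PBDITA_perc_of_income']
```

A single mapping keeps the substitution rules in one place, which is easier to extend than a chain of replace calls.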

Checking top 5 rows again

In [4]:

Company.head()

Out[4]:

Num Networth_Next_Year default Total_assets Net_worth Total_Income Total_Income_by_Total_assets Change_in_stock Change_in_stock_by_Total_Income

0 327 -7590.7 1 20099.1 398.3 1239.4 0.061664 -6.7 -0.005406

1 3434 -3976.6 1 33592.2 1665.9 33935.2 1.010211 53.6 0.001579

2 3164 -1733.7 1 15318.7 1028.8 2387.2 0.155836 70.3 0.029449

3 3267 -1438.4 1 5924.6 356.9 4800.8 0.810316 21.4 0.004458

4 1750 -974.2 1 3411.8 163.2 3455.2 1.012721 7.5 0.002171

5 rows × 76 columns

Now, let us check the number of rows (observations) and the number of columns (variables)
In [5]:

print('The number of rows (observations) is', Company.shape[0],
      '\nThe number of columns (variables) is', Company.shape[1])

The number of rows (observations) is 2409


The number of columns (variables) is 76

Checking datatype of all columns

In [6]:
Company.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2409 entries, 0 to 2408
Data columns (total 76 columns):
Num 2409 non-null int64
Networth_Next_Year 2409 non-null float64
default 2409 non-null int64
Total_assets 2409 non-null float64
Net_worth 2409 non-null float64
Total_Income 2409 non-null float64
Total_Income_by_Total_assets 2409 non-null float64
Change_in_stock 2409 non-null float64
Change_in_stock_by_Total_Income 2409 non-null float64
Total_expenses 2400 non-null float64
Total_expenses_by_Total_Income 2409 non-null float64
Profit_after_tax 2409 non-null float64
Profit_after_tax_by_Total_assets 2409 non-null float64
PBDITA 2397 non-null float64
PBDITA_by_Total_assets 2409 non-null float64
PBT 2409 non-null float64
PBT_by_Total_assets 2399 non-null float64
Cash_profit 2409 non-null float64
Cash_profit_by_Total_assets 2409 non-null float64
PBDITA_as_perc_of_total_income 2401 non-null float64
PBT_as_perc_of_total_income 2409 non-null float64
PAT_as_perc_of_total_income 2409 non-null float64
Cash_profit_as_perc_of_total_income 2409 non-null float64
PAT_as_perc_of_net_worth 2409 non-null float64
Sales 2409 non-null float64
Sales_by_Total_assets 2409 non-null float64
Income_from_financial_services 2398 non-null float64
Income_from_financial_services_by_Total_Income 2409 non-null float64
Other_income 2409 non-null float64
Other_income_by_Total_Income 2409 non-null float64
Total_capital 2409 non-null float64
Total_capital_by_Total_Assets 2409 non-null float64
Reserves_and_funds 2409 non-null float64
Reserves_and_funds_by_Total_Assets 2399 non-null float64
Borrowings 2409 non-null float64
Borrowings_by_Total_Assets 2409 non-null float64
Current_liabilities_and_provisions 2409 non-null float64
Current_liabilities_and_provisions_by_Total_assets 2409 non-null float64
Deferred_tax_liability 2409 non-null float64
Deferred_tax_liability_by_Total_Assets 2409 non-null float64
Shareholders_funds 2409 non-null float64
Shareholders_funds_by_Total_assets 2409 non-null float64
Cumulative_retained_profits 2409 non-null float64
Cumulative_retained_profits_by_Total_Income 2409 non-null float64
Capital_employed 2409 non-null float64
Capital_employed_by_Total_assets 2409 non-null float64
TOL_by_TNW 2409 non-null float64
Total_term_liabilities__by__tangible_net_worth 2371 non-null float64
Contingent_liabilities__by__Net_worth_perc 2409 non-null float64
Contingent_liabilities 2409 non-null float64
Contingent_liabilities_by_Total_Assets 2409 non-null float64
Net_fixed_assets 2409 non-null float64
Net_fixed_assets_by_Total_Assets 2409 non-null float64
Investments 2384 non-null float64
Investments_by_Total_Income 2409 non-null float64
Current_assets 2409 non-null float64
Current_assets_by_Total_Assets 2409 non-null float64
Net_working_capital 2409 non-null float64
Net_working_capital_by_Total_Capital 2409 non-null float64
Quick_ratio_times 2409 non-null float64
Current_ratio_times 2409 non-null float64
Debt_to_equity_ratio_times 2409 non-null float64
Cash_to_current_liabilities_times 2409 non-null float64
Cash_to_average_cost_of_sales_per_day 2409 non-null float64
Creditors_turnover 2409 non-null float64
Debtors_turnover 2409 non-null float64
Finished_goods_turnover 2388 non-null float64
WIP_turnover 2409 non-null float64
Raw_material_turnover 2409 non-null float64
Shares_outstanding 2409 non-null float64
Equity_face_value 2409 non-null float64
EPS 2409 non-null float64
Adjusted_EPS 2409 non-null float64
Total_liabilities 2409 non-null float64
PE_on_BSE 2409 non-null float64
Dev_model 2409 non-null int64
dtypes: float64(73), int64(3)
memory usage: 1.4 MB
Now, let us check the basic measures of descriptive statistics for the continuous variables

In [7]:

Company.describe()

Out[7]:

Num Networth_Next_Year default Total_assets Net_worth Total_Income Total_Income_by_Total_assets Change_in_stock Change_in_sto

count 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000 2409.000000

mean 1772.206310 1468.828103 0.070154 2003.925598 725.237767 2176.605398 17.817319 24.591451

std 1024.890149 17993.532199 0.255459 6285.073259 2397.772300 5312.328009 109.648029 89.402757

min 2.000000 -7590.700000 0.000000 4.200000 0.500000 0.458000 0.000426 -237.194000

25% 879.000000 34.900000 0.000000 97.600000 32.900000 126.800000 0.728960 -0.700000

50% 1770.000000 118.000000 0.000000 319.300000 103.200000 505.000000 1.135046 4.300000

75% 2658.000000 468.300000 0.000000 1099.200000 378.200000 1866.900000 1.764075 41.894006

max 3545.000000 805773.400000 1.000000 54287.498000 21290.980000 42941.208000 1107.327045 607.530000

8 rows × 76 columns

What does the variable 'default' look like?

In [8]:

Company['default'].value_counts()

Out[8]:

0 2240
1 169
Name: default, dtype: int64

Checking proportion of default

In [9]:

169/(2240+169)

Out[9]:

0.0701535907015359

Checking summary statistics of default variable

In [10]:

Company['default'].describe()

Out[10]:

count 2409.000000
mean 0.070154
std 0.255459
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: default, dtype: float64

The mean of the 'default' variable (0.07) matches the overall default rate of about 7%
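The proportion can also be read off directly with `value_counts(normalize=True)` (a sketch on a series rebuilt from the counts above):

```python
import pandas as pd

# Class proportions in one call; the series mirrors the counts shown above.
default = pd.Series([0] * 2240 + [1] * 169, name='default')
proportions = default.value_counts(normalize=True)
default_rate = proportions[1]   # 169 / 2409, about 0.0702
```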

Let's check for missing values in the dataset


In [11]:

Company.isnull().sum()

Out[11]:

Num 0
Networth_Next_Year 0
default 0
Total_assets 0
Net_worth 0
Total_Income 0
Total_Income_by_Total_assets 0
Change_in_stock 0
Change_in_stock_by_Total_Income 0
Total_expenses 9
Total_expenses_by_Total_Income 0
Profit_after_tax 0
Profit_after_tax_by_Total_assets 0
PBDITA 12
PBDITA_by_Total_assets 0
PBT 0
PBT_by_Total_assets 10
Cash_profit 0
Cash_profit_by_Total_assets 0
PBDITA_as_perc_of_total_income 8
PBT_as_perc_of_total_income 0
PAT_as_perc_of_total_income 0
Cash_profit_as_perc_of_total_income 0
PAT_as_perc_of_net_worth 0
Sales 0
Sales_by_Total_assets 0
Income_from_financial_services 11
Income_from_financial_services_by_Total_Income 0
Other_income 0
Other_income_by_Total_Income 0
..
TOL_by_TNW 0
Total_term_liabilities__by__tangible_net_worth 38
Contingent_liabilities__by__Net_worth_perc 0
Contingent_liabilities 0
Contingent_liabilities_by_Total_Assets 0
Net_fixed_assets 0
Net_fixed_assets_by_Total_Assets 0
Investments 25
Investments_by_Total_Income 0
Current_assets 0
Current_assets_by_Total_Assets 0
Net_working_capital 0
Net_working_capital_by_Total_Capital 0
Quick_ratio_times 0
Current_ratio_times 0
Debt_to_equity_ratio_times 0
Cash_to_current_liabilities_times 0
Cash_to_average_cost_of_sales_per_day 0
Creditors_turnover 0
Debtors_turnover 0
Finished_goods_turnover 21
WIP_turnover 0
Raw_material_turnover 0
Shares_outstanding 0
Equity_face_value 0
EPS 0
Adjusted_EPS 0
Total_liabilities 0
PE_on_BSE 0
Dev_model 0
Length: 76, dtype: int64

In [12]:

#Columns with missing values


print(np.where(Company.isnull().sum()>0))

(array([ 9, 13, 16, 19, 26, 33, 47, 53, 66], dtype=int64),)

In [13]:

Company.iloc[:,66].isnull().sum()

Out[13]:

21
There are missing values in the dataset

Let's treat these missing values with the median (replacing with the median limits the influence of outliers on the treatment)

In [14]:

# Imputing missing values with the column median

col = list(Company.columns)

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='median')

Company = pd.DataFrame(imputer.fit_transform(Company))
Company.columns = col
Company.head()

Out[14]:

Num Networth_Next_Year default Total_assets Net_worth Total_Income Total_Income_by_Total_assets Change_in_stock Change_in_stock_by_Total_Incom

0 327.0 -7590.7 1.0 20099.1 398.3 1239.4 0.061664 -6.7 -0.00540

1 3434.0 -3976.6 1.0 33592.2 1665.9 33935.2 1.010211 53.6 0.00157

2 3164.0 -1733.7 1.0 15318.7 1028.8 2387.2 0.155836 70.3 0.02944

3 3267.0 -1438.4 1.0 5924.6 356.9 4800.8 0.810316 21.4 0.00445

4 1750.0 -974.2 1.0 3411.8 163.2 3455.2 1.012721 7.5 0.00217

5 rows × 76 columns
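The same imputation can be expressed in plain pandas; `fillna` with the column medians is equivalent to `SimpleImputer(strategy='median')` (a sketch on toy data, not the Company frame):

```python
import numpy as np
import pandas as pd

# Median imputation with fillna: each NaN is replaced by its column median.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})
filled = df.fillna(df.median())
# filled['a'] -> [1.0, 2.0, 3.0]; filled['b'] -> [3.0, 2.0, 4.0]
```

One difference worth knowing: `fillna` preserves the original dtypes and index, whereas `SimpleImputer.fit_transform` returns a plain array, which is why the notebook has to restore the column names afterwards.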

Outlier detection & Treatment

Creating outlier identification (Lower & Upper whiskers) function

In [15]:

# Checking outliers in the dataset with box plots

col_names = list(Company.columns)
col_names.remove('Num')
fig, ax = plt.subplots(len(col_names), figsize=(8, 100))

for i, col_val in enumerate(col_names):
    sns.boxplot(y=Company[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.show()
There are outliers in the dataset; let's use the capping method to treat them

In [16]:

def check_outlier(col):
    # quantile does not require pre-sorting the column
    Q1, Q3 = col.quantile([.25, .75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

Let's check the outliers (lower and upper whiskers) in these variables

In [17]:

check_outlier(Company['Networth_Next_Year'])

Out[17]:

(-615.2, 1118.4)

In [18]:

check_outlier(Company['Total_Income'])

Out[18]:

(-2483.35, 4477.05)

In [19]:

check_outlier(Company['PBT_as_perc_of_total_income'])

Out[19]:
(-11.75, 20.89)
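The fences can be sanity-checked on a toy series (an illustration, not part of the dataset): with quartiles 2 and 4 the whiskers land at -1 and 7, so only the extreme value is flagged.

```python
import pandas as pd

# IQR fences on a toy series: Q1 = 2, Q3 = 4, IQR = 2,
# so lower = 2 - 3 = -1 and upper = 4 + 3 = 7.
s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile([.25, .75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]   # only 100 falls outside
```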
Capping the outliers

In [20]:

def treat_outlier(x):
    # taking the 5th, 25th, 75th and 95th percentiles of the column
    q5 = np.percentile(x, 5)
    q25 = np.percentile(x, 25)
    q75 = np.percentile(x, 75)
    q95 = np.percentile(x, 95)
    # calculating the IQR range
    IQR = q75 - q25
    # calculating the minimum and maximum thresholds
    lower_bound = q25 - (1.5 * IQR)
    upper_bound = q75 + (1.5 * IQR)
    print(q5, q25, q75, q95, lower_bound, upper_bound)
    # capping outliers: high outliers to the 95th percentile, low ones to the 5th
    return x.apply(lambda y: q95 if y > upper_bound else y).apply(lambda y: q5 if y < lower_bound else y)

In [21]:

for i in Company:
Company[i]=treat_outlier(Company[i])

[One line of percentile values is printed per column, 76 lines in all; the lengthy output is truncated here.]
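The effect of the capping on a single toy column can be sketched as follows (illustrative values, not from the dataset): the value beyond the upper whisker is pulled back to the 95th percentile.

```python
import numpy as np
import pandas as pd

# Percentile capping: 100 exceeds the upper whisker (7.0),
# so it is replaced by the 95th percentile (80.8).
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
q5, q25, q75, q95 = np.percentile(x, [5, 25, 75, 95])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
capped = x.apply(lambda y: q95 if y > upper else (q5 if y < lower else y))
```

Note the capped value is the 95th percentile of the column, not the whisker itself, so some spread beyond the fences is deliberately retained.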

In [22]:
Company.shape

Out[22]:

(2409, 76)

Let us check the significance of the variable 'PBT_as_perc_of_total_income' in predicting default (based on Networth_Next_Year) before proceeding to model development

Checking descriptive statistics of the variable 'PBT_as_perc_of_total_income'

In [23]:

Company['PBT_as_perc_of_total_income'].describe()

Out[23]:

count 2409.000000
mean 3.581635
std 10.147879
min -21.140000
25% 0.490000
50% 3.280000
75% 8.650000
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64

Checking descriptive statistics of the variable 'PBT_as_perc_of_total_income' for non-defaulters


In [24]:

Company.loc[Company['default'] == 0,'PBT_as_perc_of_total_income'].describe()

Out[24]:

count 2240.000000
mean 4.611910
std 9.236063
min -21.140000
25% 0.920000
50% 3.690000
75% 9.182500
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64

For companies that have not defaulted, the median 'Profit before tax (as % of income)' is about 3.7

Checking Descriptive statistics of the variable 'PBT_as_perc_of_total_income' for defaulters

In [25]:

Company.loc[Company['default'] == 1,'PBT_as_perc_of_total_income'].describe()

Out[25]:

count 169.000000
mean -10.074083
std 11.722073
min -21.140000
25% -21.140000
50% -10.320000
75% 0.000000
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64

For companies that have defaulted, the median 'Profit before tax (as % of income)' is about -10.3

In other words, a typical good company makes a profit of about 3.7 units per 100 units of income,
while a typical defaulted company loses about 10.3 units per 100 units of income

Model Building using Logistic Regression for 'Probability at default'

The equation of logistic regression, by which we predict probabilities and then a discrete target variable, is

y = 1 / (1 + e^(-z))

Note: z = β0 + β1·x1 + β2·x2 + ... + βn·xn
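The link function above in code (a small sketch): the sigmoid maps any real-valued z to a probability in (0, 1), with z = 0 giving y = 0.5.

```python
import numpy as np

# Sigmoid: y = 1 / (1 + exp(-z)); strictly increasing, bounded in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid(np.array([-2.0, 0.0, 2.0]))
# probs[1] is exactly 0.5, and probs is strictly increasing in z
```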

Now, Importing statsmodels modules

In [26]:

import statsmodels.formula.api as SM

Creating the logistic regression equation and storing it in f_1. The general pattern is:

model = SM.logit(formula='dependent_variable ~ predictor_1 + predictor_2 + ...', data=data_frame).fit()

Splitting arrays or matrices into random train and test subsets. Model will be fitted on train set and predictions will be made on the test
set
In [27]:

X = Company.drop(['default','Networth_Next_Year','Num'], axis=1)
y = Company['default']

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42,
                                                    stratify=Company['default'])

Company_train = pd.concat([X_train,y_train], axis=1)


Company_test = pd.concat([X_test,y_test], axis=1)

Company_train.to_csv('Company_train.csv',index=False)
Company_test.to_csv('Company_test.csv',index=False)

In [28]:

Company_train.columns

Out[28]:

Index(['Total_assets', 'Net_worth', 'Total_Income',


'Total_Income_by_Total_assets', 'Change_in_stock',
'Change_in_stock_by_Total_Income', 'Total_expenses',
'Total_expenses_by_Total_Income', 'Profit_after_tax',
'Profit_after_tax_by_Total_assets', 'PBDITA', 'PBDITA_by_Total_assets',
'PBT', 'PBT_by_Total_assets', 'Cash_profit',
'Cash_profit_by_Total_assets', 'PBDITA_as_perc_of_total_income',
'PBT_as_perc_of_total_income', 'PAT_as_perc_of_total_income',
'Cash_profit_as_perc_of_total_income', 'PAT_as_perc_of_net_worth',
'Sales', 'Sales_by_Total_assets', 'Income_from_financial_services',
'Income_from_financial_services_by_Total_Income', 'Other_income',
'Other_income_by_Total_Income', 'Total_capital',
'Total_capital_by_Total_Assets', 'Reserves_and_funds',
'Reserves_and_funds_by_Total_Assets', 'Borrowings',
'Borrowings_by_Total_Assets', 'Current_liabilities_and_provisions',
'Current_liabilities_and_provisions_by_Total_assets',
'Deferred_tax_liability', 'Deferred_tax_liability_by_Total_Assets',
'Shareholders_funds', 'Shareholders_funds_by_Total_assets',
'Cumulative_retained_profits',
'Cumulative_retained_profits_by_Total_Income', 'Capital_employed',
'Capital_employed_by_Total_assets', 'TOL_by_TNW',
'Total_term_liabilities__by__tangible_net_worth',
'Contingent_liabilities__by__Net_worth_perc', 'Contingent_liabilities',
'Contingent_liabilities_by_Total_Assets', 'Net_fixed_assets',
'Net_fixed_assets_by_Total_Assets', 'Investments',
'Investments_by_Total_Income', 'Current_assets',
'Current_assets_by_Total_Assets', 'Net_working_capital',
'Net_working_capital_by_Total_Capital', 'Quick_ratio_times',
'Current_ratio_times', 'Debt_to_equity_ratio_times',
'Cash_to_current_liabilities_times',
'Cash_to_average_cost_of_sales_per_day', 'Creditors_turnover',
'Debtors_turnover', 'Finished_goods_turnover', 'WIP_turnover',
'Raw_material_turnover', 'Shares_outstanding', 'Equity_face_value',
'EPS', 'Adjusted_EPS', 'Total_liabilities', 'PE_on_BSE', 'Dev_model',
'default'],
dtype='object')

Model 1

Before starting model building, lets look at the problem of multicollinearity. Multicollinearity occurs when two or more independent variables are highly
correlated with one another in a regression model.

In [29]:

# Import library for VIF


from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

# Calculating VIF
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

return(vif)
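What the VIF measures can be reproduced without statsmodels (a sketch with synthetic data): regress one predictor on the others and take 1 / (1 - R²); a near-collinear column yields a very large VIF.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)   # almost a linear combination

def vif(target, others):
    # R^2 of regressing `target` on `others` (with intercept), then 1/(1-R^2)
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

# vif(x1, [x2]) stays near 1; vif(x3, [x1, x2]) explodes
```

This also explains the `inf` VIFs seen below for Total_assets and Total_liabilities: when a column is an exact linear combination of the others, R² = 1 and the ratio diverges.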

In [30]:

calc_vif(X_train).sort_values(by='VIF', ascending = True)

Out[30]:

variables VIF

45 Contingent_liabilities__by__Net_worth_perc 1.262117

71 PE_on_BSE 1.371907

65 Raw_material_turnover 1.457993

50 Investments 1.613993

4 Change_in_stock 1.706929

62 Debtors_turnover 1.711831

61 Creditors_turnover 1.768949

5 Change_in_stock_by_Total_Income 1.863721

46 Contingent_liabilities 1.979391

25 Other_income 1.989548

55 Net_working_capital_by_Total_Capital 2.233915

63 Finished_goods_turnover 2.340059

51 Investments_by_Total_Income 2.356976

64 WIP_turnover 2.418147

54 Net_working_capital 2.453126

49 Net_fixed_assets_by_Total_Assets 2.524236

67 Equity_face_value 2.658062

32 Borrowings_by_Total_Assets 2.683636

60 Cash_to_average_cost_of_sales_per_day 2.722499

47 Contingent_liabilities_by_Total_Assets 2.833447

35 Deferred_tax_liability 2.836829

26 Other_income_by_Total_Income 2.937239

23 Income_from_financial_services 3.103908

40 Cumulative_retained_profits_by_Total_Income 3.124709

20 PAT_as_perc_of_net_worth 3.168076

24 Income_from_financial_services_by_Total_Income 3.349914

36 Deferred_tax_liability_by_Total_Assets 3.403342

53 Current_assets_by_Total_Assets 3.601563

59 Cash_to_current_liabilities_times 3.690171

48 Net_fixed_assets 4.133867

... ... ...

33 Current_liabilities_and_provisions 5.922509

43 TOL_by_TNW 6.392183

22 Sales_by_Total_assets 6.879268

52 Current_assets 7.063778

30 Reserves_and_funds_by_Total_Assets 7.330419

38 Shareholders_funds_by_Total_assets 7.772884

29 Reserves_and_funds 8.569776

16 PBDITA_as_perc_of_total_income 8.770527

3 Total_Income_by_Total_assets 9.338925

58 Debt_to_equity_ratio_times 9.408160

19 Cash_profit_as_perc_of_total_income 13.826134

41 Capital_employed 17.126263

6 Total_expenses 18.125364

11 PBDITA_by_Total_assets 18.716639

14 Cash_profit 19.287246

21 Sales 20.697583

10 PBDITA 21.230536

15 Cash_profit_by_Total_assets 21.530913

8 Profit_after_tax 24.076989

12 PBT 25.865234

18 PAT_as_perc_of_total_income 30.186683

2 Total_Income 30.757437

37 Shareholders_funds 33.057718

9 Profit_after_tax_by_Total_assets 33.820119

1 Net_worth 34.278786

17 PBT_as_perc_of_total_income 34.280187

13 PBT_by_Total_assets 36.137749

72 Dev_model 587.879602

70 Total_liabilities inf

0 Total_assets inf

73 rows × 2 columns

We see that the VIF is high for many variables. We may drop the variables with VIF above 5 (very high correlation) and build our model with the rest

In [31]:

f_1 = ('default ~ Adjusted_EPS + Cumulative_retained_profits + Total_expenses_by_Total_Income + '
       'Total_capital + Net_fixed_assets + Cash_to_current_liabilities_times + '
       'Current_assets_by_Total_Assets + Deferred_tax_liability_by_Total_Assets + '
       'Income_from_financial_services_by_Total_Income + PAT_as_perc_of_net_worth + '
       'Cumulative_retained_profits_by_Total_Income + Income_from_financial_services + '
       'Other_income_by_Total_Income + Deferred_tax_liability + '
       'Contingent_liabilities_by_Total_Assets + Cash_to_average_cost_of_sales_per_day + '
       'Borrowings_by_Total_Assets + Equity_face_value + Net_fixed_assets_by_Total_Assets + '
       'Net_working_capital + WIP_turnover + Investments_by_Total_Income + '
       'Finished_goods_turnover + Net_working_capital_by_Total_Capital + Other_income + '
       'Contingent_liabilities + Change_in_stock_by_Total_Income + Creditors_turnover + '
       'Debtors_turnover + Change_in_stock + Investments + Raw_material_turnover + '
       'PE_on_BSE + Contingent_liabilities__by__Net_worth_perc')

Fitting the logistic regression model

In [32]:

model_1 = SM.logit(formula = f_1, data=Company).fit()

Optimization terminated successfully.


Current function value: 0.135954
Iterations 10

Studying whether this equation is significant or not

In [33]:

model_1.summary()
Out[33]:

Logit Regression Results

Dep. Variable: default No. Observations: 2409

Model: Logit Df Residuals: 2374

Method: MLE Df Model: 34

Date: Mon, 31 Aug 2020 Pseudo R-squ.: 0.4648

Time: 12:23:34 Log-Likelihood: -327.51

converged: True LL-Null: -611.97

Covariance Type: nonrobust LLR p-value: 2.689e-98

coef std err z P>|z| [0.025 0.975]

Intercept -3.9379 1.090 -3.613 0.000 -6.074 -1.802

Adjusted_EPS -0.0226 0.013 -1.684 0.092 -0.049 0.004

Cumulative_retained_profits 0.0004 0.000 1.714 0.087 -6.16e-05 0.001

Total_expenses_by_Total_Income -0.1733 0.922 -0.188 0.851 -1.980 1.633

Total_capital -0.0027 0.001 -2.824 0.005 -0.005 -0.001

Net_fixed_assets 0.0003 0.000 1.882 0.060 -1.26e-05 0.001

Cash_to_current_liabilities_times -0.0157 0.301 -0.052 0.958 -0.605 0.574

Current_assets_by_Total_Assets -0.4262 0.545 -0.782 0.434 -1.494 0.642

Deferred_tax_liability_by_Total_Assets -0.0021 0.033 -0.064 0.949 -0.067 0.063

Income_from_financial_services_by_Total_Income 0.1022 0.087 1.168 0.243 -0.069 0.274

PAT_as_perc_of_net_worth -0.0927 0.010 -9.530 0.000 -0.112 -0.074

Cumulative_retained_profits_by_Total_Income -2.0638 0.368 -5.614 0.000 -2.784 -1.343

Income_from_financial_services -0.0018 0.005 -0.404 0.686 -0.011 0.007

Other_income_by_Total_Income -0.0981 0.114 -0.858 0.391 -0.322 0.126

Deferred_tax_liability 0.0007 0.002 0.427 0.669 -0.002 0.004

Contingent_liabilities_by_Total_Assets 0.0217 0.008 2.819 0.005 0.007 0.037

Cash_to_average_cost_of_sales_per_day 0.0017 0.002 0.916 0.360 -0.002 0.005

Borrowings_by_Total_Assets 0.0111 0.022 0.514 0.607 -0.031 0.053

Equity_face_value -1.314e-05 0.000 -0.070 0.944 -0.000 0.000

Net_fixed_assets_by_Total_Assets 1.0835 0.558 1.940 0.052 -0.011 2.178

Net_working_capital -0.0012 0.001 -1.339 0.181 -0.003 0.001

WIP_turnover -0.0056 0.010 -0.534 0.594 -0.026 0.015

Investments_by_Total_Income -0.0285 0.009 -3.040 0.002 -0.047 -0.010

Finished_goods_turnover 0.0020 0.004 0.460 0.645 -0.006 0.010

Net_working_capital_by_Total_Capital 0.0217 0.043 0.499 0.618 -0.064 0.107

Other_income 0.0010 0.008 0.119 0.906 -0.015 0.017

Contingent_liabilities -0.0008 0.000 -2.273 0.023 -0.002 -0.000

Change_in_stock_by_Total_Income 0.1094 0.247 0.442 0.658 -0.376 0.595

Creditors_turnover -0.0151 0.016 -0.954 0.340 -0.046 0.016

Debtors_turnover 0.0129 0.012 1.086 0.277 -0.010 0.036

Change_in_stock 0.0015 0.004 0.351 0.726 -0.007 0.010

Investments 0.0012 0.000 2.936 0.003 0.000 0.002

Raw_material_turnover 0.0095 0.014 0.659 0.510 -0.019 0.038

PE_on_BSE 0.0086 0.004 2.303 0.021 0.001 0.016

Contingent_liabilities__by__Net_worth_perc 0.0099 0.002 4.876 0.000 0.006 0.014

We can see that several variables are insignificant and may not be useful to discriminate cases of default

Let us look at the adjusted pseudo R-square value

In [34]:

print('The adjusted pseudo R-square value is',1 - ((model_1.llf - model_1.df_model)/model_1.llnull))

The adjusted pseudo R-square value is 0.4092656890548343


The adjusted pseudo R-square is lower than the pseudo R-square, which suggests insignificant variables are present in the model. Let's
remove the variables whose p-value is greater than 0.05 and rebuild the model
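For reference, both values follow directly from the log-likelihoods reported in the model_1 summary (McFadden's pseudo R-square and its degrees-of-freedom-adjusted variant):

```python
# Values taken from the model_1 summary above:
# llf = model log-likelihood, llnull = intercept-only log-likelihood.
llf, llnull, df_model = -327.51, -611.97, 34

pseudo_r2 = 1 - llf / llnull                   # about 0.4648, as in the summary
adj_pseudo_r2 = 1 - (llf - df_model) / llnull  # about 0.4093, as printed above
```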

Model 2
In [35]:

f_2 = ('default ~ Total_capital + Net_fixed_assets + PAT_as_perc_of_net_worth + '
       'Cumulative_retained_profits_by_Total_Income + Contingent_liabilities_by_Total_Assets + '
       'Investments_by_Total_Income + Contingent_liabilities + Investments + PE_on_BSE + '
       'Contingent_liabilities__by__Net_worth_perc')

In [36]:

model_2 = SM.logit(formula = f_2, data=Company_train).fit()

Optimization terminated successfully.


Current function value: 0.139671
Iterations 9

In [37]:

model_2.summary()

Out[37]:

Logit Regression Results

Dep. Variable: default No. Observations: 1614

Model: Logit Df Residuals: 1603

Method: MLE Df Model: 10

Date: Mon, 31 Aug 2020 Pseudo R-squ.: 0.4494

Time: 12:23:40 Log-Likelihood: -225.43

converged: True LL-Null: -409.42

Covariance Type: nonrobust LLR p-value: 6.028e-73

coef std err z P>|z| [0.025 0.975]

Intercept -3.8271 0.364 -10.504 0.000 -4.541 -3.113

Total_capital -0.0037 0.001 -3.311 0.001 -0.006 -0.001

Net_fixed_assets 0.0007 0.000 4.222 0.000 0.000 0.001

PAT_as_perc_of_net_worth -0.0986 0.010 -9.921 0.000 -0.118 -0.079

Cumulative_retained_profits_by_Total_Income -2.3753 0.409 -5.807 0.000 -3.177 -1.574

Contingent_liabilities_by_Total_Assets 0.0292 0.009 3.255 0.001 0.012 0.047

Investments_by_Total_Income -0.0206 0.009 -2.348 0.019 -0.038 -0.003

Contingent_liabilities -0.0011 0.000 -2.374 0.018 -0.002 -0.000

Investments 0.0011 0.000 2.571 0.010 0.000 0.002

PE_on_BSE 0.0128 0.004 2.893 0.004 0.004 0.021

Contingent_liabilities__by__Net_worth_perc 0.0081 0.002 3.404 0.001 0.003 0.013

We can see that all variables are significant and may be useful for discriminating cases of default

Let us also check the multicollinearity of the model using Variance Inflation Factor (VIF) for the predictor variables
In [38]:

calc_vif(X_train[['Total_capital', 'Net_fixed_assets', 'PAT_as_perc_of_net_worth',
                  'Cumulative_retained_profits_by_Total_Income',
                  'Contingent_liabilities_by_Total_Assets', 'Investments_by_Total_Income',
                  'Contingent_liabilities', 'Investments', 'PE_on_BSE',
                  'Contingent_liabilities__by__Net_worth_perc']]).sort_values(by='VIF', ascending=True)

Out[38]:

variables VIF

3 Cumulative_retained_profits_by_Total_Income 1.353936

9 Contingent_liabilities__by__Net_worth_perc 1.406154

2 PAT_as_perc_of_net_worth 1.479623

5 Investments_by_Total_Income 1.975775

1 Net_fixed_assets 2.236608

0 Total_capital 2.261477

8 PE_on_BSE 2.447948

7 Investments 2.563008

4 Contingent_liabilities_by_Total_Assets 2.609765

6 Contingent_liabilities 3.285351

Some multicollinearity still exists, but since none of the VIFs is very high, let us retain these variables

In [39]:

print('The adjusted pseudo R-square value is',1 - ((model_2.llf - model_2.df_model)/model_2.llnull))

The adjusted pseudo R-square value is 0.4249760562990271

The adjusted pseudo R-square is now close to the pseudo R-square, suggesting fewer insignificant variables in the model.

We also notice that the current model has no insignificant variables and can be used for prediction purposes.

Let us test the predictions of this model on the train and test datasets

Prediction on the Data

Let us first check the distribution plot of the logit function values

In [40]:

sns.distplot(model_2.fittedvalues);

Now, let us see the predicted probability values:

Prediction on Train set


In [41]:

y_predict_train = model_2.predict(X_train)
y_predict_train

Out[41]:

1195 0.000238
630 0.079001
1978 0.005952
2312 0.000817
1000 0.012295
464 0.282067
198 0.034514
711 0.280323
1949 0.001276
2206 0.001816
1986 0.000249
265 0.146941
52 0.782141
1570 0.174541
1701 0.007760
556 0.048509
2001 0.003805
1551 0.016193
2156 0.000045
664 0.002812
1174 0.014924
1646 0.029623
158 0.867386
2086 0.009801
892 0.003239
1521 0.026492
1068 0.027455
528 0.022768
1108 0.001461
1062 0.000715
...
2290 0.006842
121 0.379384
2080 0.018987
1350 0.007507
1512 0.002323
1447 0.016040
993 0.018150
614 0.000523
1675 0.001548
3 0.284965
1724 0.021777
2028 0.010548
1738 0.000410
278 0.062157
1938 0.005347
239 0.179794
287 0.024797
1095 0.004992
1775 0.008386
2271 0.001393
791 0.017388
1735 0.013902
1759 0.011669
344 0.006982
1110 0.030625
1354 0.010515
2060 0.003080
19 0.292098
1816 0.004499
444 0.053537
Length: 1614, dtype: float64
In [42]:

sns.boxplot(x=y_train, y=y_predict_train)
plt.xlabel('Default');

From the above boxplot, we need to choose a cut-off value that gives the model reasonable discriminating power. Let us
take a cut-off of 0.07 and check.
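Rather than trying cut-offs one at a time, a small helper can tabulate accuracy and sensitivity across candidate cut-offs. A sketch on synthetic scores, which stand in for model_2's predicted probabilities:

```python
import numpy as np

def cutoff_table(y_true, y_prob, cutoffs):
    """For each cutoff, return (cutoff, accuracy, sensitivity)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rows = []
    for c in cutoffs:
        y_hat = (y_prob > c).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        tn = np.sum((y_hat == 0) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        rows.append((c, (tp + tn) / len(y_true), tp / (tp + fn)))
    return rows

# Illustrative labels and probabilities (not the actual model output)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.01, 0.03, 0.10, 0.05, 0.20, 0.60, 0.09, 0.40]
for c, acc, sens in cutoff_table(y_true, y_prob, [0.07, 0.15]):
    print(f"cutoff={c:.2f}  accuracy={acc:.3f}  sensitivity={sens:.3f}")
```

The same table over the real y_train and predicted probabilities would show the accuracy/sensitivity trade-off explored manually below.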

Let us now see the predicted classes

In [43]:
y_class_pred = []
for i in range(0, len(y_predict_train)):
    if np.array(y_predict_train)[i] > 0.07:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)
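The loop above can be collapsed into a single vectorized expression; a sketch with a small stand-in Series of predicted probabilities:

```python
import pandas as pd

# Stand-in for model_2's predicted probabilities on the train set
y_predict_train = pd.Series([0.000238, 0.079001, 0.005952, 0.282067])

# Vectorized equivalent of the element-wise cut-off loop
y_class_pred = (y_predict_train > 0.07).astype(int).tolist()
print(y_class_pred)  # [0, 1, 0, 1]
```

This is both faster and less error-prone than indexing the array inside a Python loop.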

Checking the accuracy of the model using confusion matrix for training set

In [44]:

sns.heatmap(metrics.confusion_matrix(y_train, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [45]:
tn, fp, fn, tp = metrics.confusion_matrix(y_train,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 1294 
False Positives: 207 
False Negatives: 18 
True Positives: 95
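These counts translate directly into the rate metrics reported in the classification report; a quick check using the training-set numbers above:

```python
# Confusion-matrix counts from the training set at cut-off 0.07
tn, fp, fn, tp = 1294, 207, 18, 95

sensitivity = tp / (tp + fn)                  # recall on defaulters
specificity = tn / (tn + fp)                  # recall on non-defaulters
accuracy = (tp + tn) / (tp + tn + fp + fn)    # overall correct predictions

print(f"sensitivity={sensitivity:.3f}")  # 0.841
print(f"specificity={specificity:.3f}")  # 0.862
print(f"accuracy={accuracy:.3f}")        # 0.861
```

These match the recall values of 0.841 (class 1) and 0.862 (class 0) and the 0.861 accuracy in the report below.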
In [46]:

print(metrics.classification_report(y_train,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.986     0.862     0.920      1501
         1.0      0.315     0.841     0.458       113

    accuracy                          0.861      1614
   macro avg      0.650     0.851     0.689      1614
weighted avg      0.939     0.861     0.888      1614

As observed above, the accuracy of the model, i.e. the percentage of overall correct predictions, is 86%.
The sensitivity of the model is 84%, i.e. 84% of the companies that defaulted were correctly identified as defaulters by the model.

Prediction on Test set

In [47]:
y_predict_test = model_2.predict(X_test)
y_predict_test
Out[47]:

1238 0.264364
1446 0.009470
840 0.007608
166 0.054838
658 0.007321
1691 0.018224
499 0.032348
366 0.001656
1129 0.017894
1517 0.009549
868 0.033910
927 0.039776
1451 0.008103
2245 0.000346
1651 0.016640
1165 0.010501
872 0.077171
1006 0.042804
1422 0.008625
1323 0.001854
2038 0.005669
1876 0.027107
2347 0.052555
1884 0.005696
2345 0.000829
2381 0.001835
2051 0.011987
930 0.026261
153 0.716588
969 0.000512
...
1739 0.018299
1317 0.003664
1104 0.003370
2364 0.004185
2308 0.008268
115 0.407130
267 0.009058
1185 0.016101
129 0.927627
690 0.020068
1902 0.003642
1627 0.003918
1891 0.009009
2296 0.000012
1088 0.000972
398 0.029559
1541 0.018018
1464 0.002371
866 0.000250
164 0.608210
1183 0.002765
1465 0.018457
2187 0.000468
983 0.002116
415 0.492578
1848 0.013528
1546 0.000522
2237 0.003755
1096 0.018530
1177 0.007781
Length: 795, dtype: float64

In [48]:

y_class_pred = []
for i in range(0, len(y_predict_test)):
    if np.array(y_predict_test)[i] > 0.07:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)

Checking the accuracy of the model using confusion matrix for test set
In [49]:

sns.heatmap(metrics.confusion_matrix(y_test, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [50]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 652 
False Positives: 87 
False Negatives: 9 
True Positives: 47

Let us now go ahead and print the classification report to check the various other parameters

In [51]:

print(metrics.classification_report(y_test,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.986     0.882     0.931       739
         1.0      0.351     0.839     0.495        56

    accuracy                          0.879       795
   macro avg      0.669     0.861     0.713       795
weighted avg      0.942     0.879     0.901       795

As observed above, the accuracy of the model, i.e. the percentage of overall correct predictions, is 88%.
The sensitivity of the model is 84%, i.e. 84% of the companies that defaulted were correctly identified as defaulters by the model.

Let us take a cut-off of 0.08 and check if our predictions have improved

In [52]:
y_class_pred = []
for i in range(0, len(y_predict_train)):
    if np.array(y_predict_train)[i] > 0.08:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)

Checking the accuracy of the model using confusion matrix for training set
In [53]:

sns.heatmap(metrics.confusion_matrix(y_train, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [54]:

tn, fp, fn, tp = metrics.confusion_matrix(y_train, y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 1321 
False Positives: 180 
False Negatives: 19 
True Positives: 94

In [55]:

print(metrics.classification_report(y_train,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.986     0.880     0.930      1501
         1.0      0.343     0.832     0.486       113

    accuracy                          0.877      1614
   macro avg      0.664     0.856     0.708      1614
weighted avg      0.941     0.877     0.899      1614

The accuracy of the model, i.e. the percentage of overall correct predictions, has increased from 86% to 88% on the training set, but the sensitivity of the model has dropped slightly from 84% to 83%

Prediction on Test set

In [56]:
y_predict_test = model_2.predict(X_test)

In [57]:

y_class_pred = []
for i in range(0, len(y_predict_test)):
    if np.array(y_predict_test)[i] > 0.08:
        a = 1
    else:
        a = 0
    y_class_pred.append(a)

Checking the accuracy of the model using confusion matrix for test set
In [58]:

sns.heatmap(metrics.confusion_matrix(y_test, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals', rotation=0);

In [59]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

True Negative: 658 
False Positives: 81 
False Negatives: 9 
True Positives: 47

Let us now go ahead and print the classification report to check the various other parameters

In [60]:

print(metrics.classification_report(y_test,y_class_pred,digits=3))

              precision    recall  f1-score   support

         0.0      0.987     0.890     0.936       739
         1.0      0.367     0.839     0.511        56

    accuracy                          0.887       795
   macro avg      0.677     0.865     0.723       795
weighted avg      0.943     0.887     0.906       795

As observed above, the accuracy of the model, i.e. the percentage of overall correct predictions, is 89%, and the sensitivity stands at 84%.

We may choose the cut-off of 0.08, since it keeps the same sensitivity while improving the overall accuracy of the model on the test dataset.
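For a more systematic cut-off choice, the ROC curve can suggest a threshold via Youden's J statistic (sensitivity + specificity - 1). A hedged sketch on synthetic labels and scores, which stand in for y_train and model_2's fitted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic data: non-defaulters get mostly low scores, defaulters higher
rng = np.random.default_rng(42)
y_true = np.concatenate([np.zeros(300), np.ones(60)]).astype(int)
scores = np.concatenate([
    rng.beta(1, 20, size=300),  # non-defaulters
    rng.beta(4, 6, size=60),    # defaulters
])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)  # index maximising Youden's J = tpr - fpr
print(f"suggested cutoff: {thresholds[best]:.3f}")
```

Applying the same procedure to the real training probabilities would give a data-driven starting point instead of hand-picking 0.07 or 0.08.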

END
