
FRA PROJECT (MILESTONE - 1)

Predicting Credit Risk

Problem Statement

Businesses or companies can fall prey to default if they are unable to keep up with their debt obligations. Defaults
lead to a lower credit rating for the company, which in turn reduces its chances of getting credit in the future,
and the company may have to pay higher interest on existing debts as well as on any new obligations. From an
investor's point of view, a company is worth investing in if it is capable of handling its financial obligations, can
grow quickly, and is able to manage that growth.

A balance sheet is a financial statement of a company that provides a snapshot of what a company owns,
owes, and the amount invested by the shareholders. Thus, it is an important tool that helps evaluate the
performance of a business.

The available data includes information from the financial statements of the companies for the previous year
(2015). Information about the net worth of each company in the following year (2016) is also provided, which can
be used to derive the labeled field.

An explanation of the data fields is available in the data dictionary, 'Credit Default Data Dictionary.xlsx'

In [175]: 

# Importing the libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # for making plots with seaborn
import sklearn
color = sns.color_palette()
import scipy.stats as stats
import sklearn.metrics as metrics

import warnings
warnings.filterwarnings('ignore')

In [2]: 

import os
In [3]: 

os.getcwd()

Out[3]:

'C:\\Users\\Hakers Zone\\Desktop\\Data Science\\Great Lakes\\Core Course files\\11. Finance and Risk Analytics\\FRA Milestone -1\\Working files'

Importing the dataset

In [4]: 

df = pd.read_excel('Company_Data2015-1.xlsx')

In [5]: 

#Glimpse of Data
df.head()

Out[5]:

   Co_Code          Co_Name  Networth Next Year  Equity Paid Up  Networth  Capital Employed  Total Debt  Gross Block  Net Working Capital  ...
0    16974      Hind.Cables            -8021.60          419.36  -7027.48          -1007.24     5936.03       474.30             -1076.34  ...
1    21214  Tata Tele. Mah.            -3986.19         1954.93  -2968.08           4458.20     7410.18      9070.86             -1098.88  ...
2    14852     ABG Shipyard            -3192.58           53.84    506.86           7714.68     6944.54      1281.54              4496.25  ...
3     2439              GTL            -3054.51          157.30   -623.49           2353.88     2326.05      1033.69             -2612.42  ...
4    23505  Bharati Defence            -2967.36           50.30  -1070.83           4675.33     5740.90      1084.20              1836.23  ...

5 rows × 67 columns

Fixing messy column names (containing spaces) for ease of use

In [6]: 

df.columns = df.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')


In [7]: 

#### Checking top 5 rows again


df.head()

Out[7]:

   Co_Code          Co_Name  Networth_Next_Year  Equity_Paid_Up  Networth  Capital_Employed  ...
0    16974      Hind.Cables            -8021.60          419.36  -7027.48          -1007.24  ...
1    21214  Tata Tele. Mah.            -3986.19         1954.93  -2968.08           4458.20  ...
2    14852     ABG Shipyard            -3192.58           53.84    506.86           7714.68  ...
3     2439              GTL            -3054.51          157.30   -623.49           2353.88  ...
4    23505  Bharati Defence            -2967.36           50.30  -1070.83           4675.33  ...

5 rows × 67 columns

In [69]: 

df.dtypes.value_counts()

Out[69]:

float64 63

int64 3

object 1

dtype: int64

Now, let us check the number of rows (observations) and the number of columns (variables)

In [8]: 

print('The number of rows (observations) is =', df.shape[0], '\n' 'The number of columns (variables) is =', df.shape[1])

The number of rows (observations) is = 3586

The number of columns (variables) is = 67

Checking datatype of all columns


In [9]: 

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3586 entries, 0 to 3585

Data columns (total 67 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Co_Code 3586 non-null int64

1 Co_Name 3586 non-null object

2 Networth_Next_Year 3586 non-null float64

3 Equity_Paid_Up 3586 non-null float64

4 Networth 3586 non-null float64

5 Capital_Employed 3586 non-null float64

6 Total_Debt 3586 non-null float64

7 Gross_Block_ 3586 non-null float64

8 Net_Working_Capital_ 3586 non-null float64

9 Current_Assets_ 3586 non-null float64

10 Current_Liabilities_and_Provisions_ 3586 non-null float64

11 Total_Assets_to_Liabilities_ 3586 non-null float64

12 Gross_Sales 3586 non-null float64

13 Net_Sales 3586 non-null float64

14 Other_Income 3586 non-null float64

15 Value_Of_Output 3586 non-null float64

16 Cost_of_Production 3586 non-null float64

17 Selling_Cost 3586 non-null float64

18 PBIDT 3586 non-null float64

19 PBDT 3586 non-null float64

20 PBIT 3586 non-null float64

21 PBT 3586 non-null float64

22 PAT 3586 non-null float64

23 Adjusted_PAT 3586 non-null float64

24 CP 3586 non-null float64

25 Revenue_earnings_in_forex 3586 non-null float64

26 Revenue_expenses_in_forex 3586 non-null float64

27 Capital_expenses_in_forex 3586 non-null float64

28 Book_Value_Unit_Curr 3586 non-null float64

29 Book_Value_Adj._Unit_Curr 3582 non-null float64

30 Market_Capitalisation 3586 non-null float64

31 CEPS_annualised_Unit_Curr 3586 non-null float64

32 Cash_Flow_From_Operating_Activities 3586 non-null float64

33 Cash_Flow_From_Investing_Activities 3586 non-null float64

34 Cash_Flow_From_Financing_Activities 3586 non-null float64

35 ROG-Net_Worth_perc 3586 non-null float64

36 ROG-Capital_Employed_perc 3586 non-null float64

37 ROG-Gross_Block_perc 3586 non-null float64

38 ROG-Gross_Sales_perc 3586 non-null float64

39 ROG-Net_Sales_perc 3586 non-null float64

40 ROG-Cost_of_Production_perc 3586 non-null float64

41 ROG-Total_Assets_perc 3586 non-null float64

42 ROG-PBIDT_perc 3586 non-null float64

43 ROG-PBDT_perc 3586 non-null float64

44 ROG-PBIT_perc 3586 non-null float64

45 ROG-PBT_perc 3586 non-null float64

46 ROG-PAT_perc 3586 non-null float64

47 ROG-CP_perc 3586 non-null float64

48 ROG-Revenue_earnings_in_forex_perc 3586 non-null float64

49 ROG-Revenue_expenses_in_forex_perc 3586 non-null float64

50 ROG-Market_Capitalisation_perc 3586 non-null float64

51 Current_Ratio[Latest] 3585 non-null float64

52 Fixed_Assets_Ratio[Latest] 3585 non-null float64

53 Inventory_Ratio[Latest] 3585 non-null float64

54 Debtors_Ratio[Latest] 3585 non-null float64

55 Total_Asset_Turnover_Ratio[Latest] 3585 non-null float64

56 Interest_Cover_Ratio[Latest] 3585 non-null float64

57 PBIDTM_perc[Latest] 3585 non-null float64

58 PBITM_perc[Latest] 3585 non-null float64

59 PBDTM_perc[Latest] 3585 non-null float64

60 CPM_perc[Latest] 3585 non-null float64

61 APATM_perc[Latest] 3585 non-null float64

62 Debtors_Velocity_Days 3586 non-null int64

63 Creditors_Velocity_Days 3586 non-null int64

64 Inventory_Velocity_Days 3483 non-null float64

65 Value_of_Output_to_Total_Assets 3586 non-null float64

66 Value_of_Output_to_Gross_Block 3586 non-null float64

dtypes: float64(63), int64(3), object(1)

memory usage: 1.8+ MB

Now, let us check the basic measures of descriptive statistics for the continuous variables

In [10]: 

pd.options.display.float_format = '{:.2f}'.format

df.describe().T

Out[10]:

count mean std min 25% 50% 7

Co_Code 3586.00 16065.39 19776.82 4.00 3029.25 6077.50 24269

Networth_Next_Year 3586.00 725.05 4769.68 -8021.60 3.98 19.02 123

Equity_Paid_Up 3586.00 62.97 778.76 0.00 3.75 8.29 19

Networth 3586.00 649.75 4091.99 -7027.48 3.89 18.58 117

Capital_Employed 3586.00 2799.61 26975.14 -1824.75 7.60 39.09 226

... ... ... ... ... ... ...

Debtors_Velocity_Days 3586.00 603.89 10636.76 0.00 8.00 49.00 106

Creditors_Velocity_Days 3586.00 2057.85 54169.48 0.00 8.00 39.00 89

Inventory_Velocity_Days 3483.00 79.64 137.85 -199.00 0.00 35.00 96

Value_of_Output_to_Total_Assets 3586.00 0.82 1.20 -0.33 0.07 0.48 1

Value_of_Output_to_Gross_Block 3586.00 61.88 976.82 -61.00 0.27 1.53 4

66 rows × 8 columns
In [11]: 

# Check for duplicate data

dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

df[dups]

Number of duplicate rows = 0

Out[11]:

Co_Code Co_Name Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_

0 rows × 67 columns

Q 1.1 Outlier Treatment

In [12]: 

df1 = df.drop(['Co_Code', 'Co_Name'], axis = 1)

In [13]: 

df1.head()

Out[13]:

Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_

0 -8021.60 419.36 -7027.48 -1007.24 5936.03 474.30

1 -3986.19 1954.93 -2968.08 4458.20 7410.18 9070.86

2 -3192.58 53.84 506.86 7714.68 6944.54 1281.54

3 -3054.51 157.30 -623.49 2353.88 2326.05 1033.69

4 -2967.36 50.30 -1070.83 4675.33 5740.90 1084.20

5 rows × 65 columns

In [14]: 

df1.shape

Out[14]:

(3586, 65)
In [15]: 

# Checking Outliers in dataset


df1.boxplot(figsize=(16,8))
plt.xticks(rotation=90)
plt.show()

Outlier treatment:

Define a function that computes the upper and lower IQR limits for each feature and caps values outside those limits
In [16]: 

def mod_outlier(df1):
    df1 = df1._get_numeric_data()

    q1 = df1.quantile(0.25)
    q3 = df1.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    # Cap each value outside the IQR fences at the nearest bound
    for col in df1.columns:
        for i in range(len(df1[col])):
            if df1[col][i] < lower_bound[col]:
                df1[col][i] = lower_bound[col]
            if df1[col][i] > upper_bound[col]:
                df1[col][i] = upper_bound[col]

    return df1
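The element-wise loop above can also be written in vectorized form with `DataFrame.clip`; a sketch of an equivalent helper (same IQR fences, no per-cell Python loop — the function name is our own, not from the notebook):

```python
import pandas as pd

def cap_outliers_iqr(frame):
    """Cap every numeric column at Q1 - 1.5*IQR (lower) and Q3 + 1.5*IQR (upper)."""
    num = frame._get_numeric_data()
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds (Series indexed by column name) with the columns
    return num.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
```

On a frame with 3586 rows and 65 columns this avoids the O(rows × columns) Python loop and the chained-assignment warnings that element-wise writes trigger.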

In [17]: 

df_new = mod_outlier(df1)

In [18]: 

# Checking Outliers in dataset


df_new.boxplot(figsize=(16,8))
plt.xticks(rotation=90)
plt.show()
Q 1.2 Missing Value Treatment

Let's check for missing values in the dataset

In [19]: 

#check isnull present


df_new.isnull().sum().sort_values(ascending=False).head(13)

Out[19]:

Inventory_Velocity_Days 103

Book_Value_Adj._Unit_Curr 4

Current_Ratio[Latest] 1

PBITM_perc[Latest] 1

Fixed_Assets_Ratio[Latest] 1

Inventory_Ratio[Latest] 1

Debtors_Ratio[Latest] 1

Total_Asset_Turnover_Ratio[Latest] 1

PBIDTM_perc[Latest] 1

Interest_Cover_Ratio[Latest] 1

PBDTM_perc[Latest] 1

CPM_perc[Latest] 1

APATM_perc[Latest] 1

dtype: int64

In [21]: 

df_new.isnull().sum().sum()

Out[21]:

118

In [97]: 

df_new.size

Out[97]:

236676

About 0.05% of the values in the dataset are missing (118 of 236,676 cells)
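That figure is just the null count divided by the total cell count; a small helper sketching the computation (the function name is ours, not from the notebook):

```python
import numpy as np
import pandas as pd

def missing_pct(frame):
    """Percentage of cells in the DataFrame that are NaN."""
    return 100 * frame.isnull().sum().sum() / frame.size

# With the project data: 100 * 118 / 236676 ≈ 0.05
```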

In [20]: 

#Columns with missing values


print(np.where(df_new.isnull().sum()>0))

(array([27, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 62], dtype=int64),)

Treating missing values


In [22]: 

df_new.fillna(df.median())  # note: the result is not assigned back, so df_new still contains NaNs

Out[22]:

Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Bloc

0 -175.74 43.17 -166.22 -320.90 180.83 328

1 -175.74 43.17 -166.22 555.11 180.83 328

2 -175.74 43.17 287.41 555.11 180.83 328

3 -175.74 43.17 -166.22 555.11 180.83 328

4 -175.74 43.17 -166.22 555.11 180.83 328

... ... ... ... ... ...

3581 303.53 43.17 287.41 555.11 180.83 328

3582 303.53 43.17 287.41 555.11 180.83 328

3583 303.53 43.17 287.41 555.11 180.83 328

3584 303.53 43.17 287.41 555.11 180.83 328

3585 303.53 43.17 287.41 555.11 180.83 328

3586 rows × 65 columns

In [23]: 

df_new.isnull().any().any()

Out[23]:

True

There are still missing values: the `fillna` call above was not assigned back to `df_new`

Let's visually inspect the missing values in our data


In [24]: 

plt.figure(figsize = (12,8))
sns.heatmap(df_new.isnull(), cbar = False, cmap = 'coolwarm', yticklabels = False)
plt.show()
In [25]: 

cat=[]
num=[]
for i in df_new.columns:
if df_new[i].dtype=="object":
cat.append(i)
else:
num.append(i)
print(cat)
print(num)

[]

['Networth_Next_Year', 'Equity_Paid_Up', 'Networth', 'Capital_Employed',
 'Total_Debt', 'Gross_Block_', 'Net_Working_Capital_', 'Current_Assets_',
 'Current_Liabilities_and_Provisions_', 'Total_Assets_to_Liabilities_',
 'Gross_Sales', 'Net_Sales', 'Other_Income', 'Value_Of_Output',
 'Cost_of_Production', 'Selling_Cost', 'PBIDT', 'PBDT', 'PBIT', 'PBT', 'PAT',
 'Adjusted_PAT', 'CP', 'Revenue_earnings_in_forex', 'Revenue_expenses_in_forex',
 'Capital_expenses_in_forex', 'Book_Value_Unit_Curr', 'Book_Value_Adj._Unit_Curr',
 'Market_Capitalisation', 'CEPS_annualised_Unit_Curr',
 'Cash_Flow_From_Operating_Activities', 'Cash_Flow_From_Investing_Activities',
 'Cash_Flow_From_Financing_Activities', 'ROG-Net_Worth_perc',
 'ROG-Capital_Employed_perc', 'ROG-Gross_Block_perc', 'ROG-Gross_Sales_perc',
 'ROG-Net_Sales_perc', 'ROG-Cost_of_Production_perc', 'ROG-Total_Assets_perc',
 'ROG-PBIDT_perc', 'ROG-PBDT_perc', 'ROG-PBIT_perc', 'ROG-PBT_perc',
 'ROG-PAT_perc', 'ROG-CP_perc', 'ROG-Revenue_earnings_in_forex_perc',
 'ROG-Revenue_expenses_in_forex_perc', 'ROG-Market_Capitalisation_perc',
 'Current_Ratio[Latest]', 'Fixed_Assets_Ratio[Latest]', 'Inventory_Ratio[Latest]',
 'Debtors_Ratio[Latest]', 'Total_Asset_Turnover_Ratio[Latest]',
 'Interest_Cover_Ratio[Latest]', 'PBIDTM_perc[Latest]', 'PBITM_perc[Latest]',
 'PBDTM_perc[Latest]', 'CPM_perc[Latest]', 'APATM_perc[Latest]',
 'Debtors_Velocity_Days', 'Creditors_Velocity_Days', 'Inventory_Velocity_Days',
 'Value_of_Output_to_Total_Assets', 'Value_of_Output_to_Gross_Block']

In [26]: 

## Let's treat these missing values with the median (replacement with the median limits the impact of outliers)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

df_new = pd.DataFrame(imputer.fit_transform(df_new))
df_new.columns=num
#df_new =df_fra[cat]
df_new.head()

Out[26]:

Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_

0 -175.74 43.17 -166.22 -320.90 180.83 328.88

1 -175.74 43.17 -166.22 555.11 180.83 328.88

2 -175.74 43.17 287.41 555.11 180.83 328.88

3 -175.74 43.17 -166.22 555.11 180.83 328.88

4 -175.74 43.17 -166.22 555.11 180.83 328.88

5 rows × 65 columns

In [27]: 

df_new.isnull().any().any()

Out[27]:

False
In [42]: 

df_new.isnull().sum().sort_values(ascending=False).head(20)

Out[42]:

Networth_Next_Year 0

Current_Ratio[Latest] 0

ROG-Gross_Block_perc 0

ROG-Gross_Sales_perc 0

ROG-Net_Sales_perc 0

ROG-Cost_of_Production_perc 0

ROG-Total_Assets_perc 0

ROG-PBIDT_perc 0

ROG-PBDT_perc 0

ROG-PBIT_perc 0

ROG-PBT_perc 0

ROG-PAT_perc 0

ROG-CP_perc 0

ROG-Revenue_earnings_in_forex_perc 0

ROG-Revenue_expenses_in_forex_perc 0

ROG-Market_Capitalisation_perc 0

Fixed_Assets_Ratio[Latest] 0

Equity_Paid_Up 0

Inventory_Ratio[Latest] 0

Debtors_Ratio[Latest] 0

dtype: int64

Q 1.3 Transform Target variable into 0 and 1 ( Creating a binary target variable using
'Networth_Next_Year')

In [29]: 

df_new['default'] = np.where((df_new['Networth_Next_Year'] > 0), 0, 1)

Checking top 5 rows

In [30]: 

df_new[['default','Networth_Next_Year']].head(5)

Out[30]:

default Networth_Next_Year

0 1 -175.74

1 1 -175.74

2 1 -175.74

3 1 -175.74

4 1 -175.74

Checking tail 5 rows


In [31]: 

df_new[['default','Networth_Next_Year']].tail(5)

Out[31]:

default Networth_Next_Year

3581 0 303.53

3582 0 303.53

3583 0 303.53

3584 0 303.53

3585 0 303.53

What does variable 'default' look like

In [32]: 

df_new['default'].value_counts()

Out[32]:

0 3198

1 388

Name: default, dtype: int64

Checking proportion of default

In [34]: 

df_new['default'].value_counts(normalize = True)

Out[34]:

0 0.89

1 0.11

Name: default, dtype: float64

Q 1.4 Univariate & Bivariate analysis with proper interpretation. (You may choose to include only those
variables which were significant in the model building)

Univariate analysis :
In [53]: 

pd.options.display.float_format = '{:.2f}'.format

df_new.describe().T

Out[53]:

count mean std min 25% 50% 75% max

Networth_Next_Year 3586.00 77.40 120.48 -175.74 3.98 19.02 123.80 303.53

Equity_Paid_Up 3586.00 13.99 14.00 0.00 3.75 8.29 19.52 43.17

Networth 3586.00 73.69 112.94 -166.22 3.89 18.58 117.30 287.41

Capital_Employed 3586.00 152.49 207.87 -320.90 7.60 39.09 226.60 555.11

Total_Debt 3586.00 47.44 68.22 -0.72 0.03 7.49 72.35 180.83

... ... ... ... ... ... ... ... ...

Creditors_Velocity_Days 3586.00 62.39 68.03 0.00 8.00 39.00 89.00 210.00

Inventory_Velocity_Days 3586.00 61.22 73.20 -144.00 0.00 35.00 93.00 240.00

Value_of_Output_to_Total_Assets 3586.00 0.73 0.77 -0.33 0.07 0.48 1.16 2.79

Value_of_Output_to_Gross_Block 3586.00 3.36 4.10 -6.69 0.27 1.53 4.91 11.87

default 3586.00 0.11 0.31 0.00 0.00 0.00 0.00 1.00

66 rows × 8 columns

Some Important Feature Box Plot


In [201]: 

plt.figure(figsize=(25,10))
sns.boxplot(data=df_new)
plt.xlabel("Variables")
plt.xticks(rotation=90)
plt.ylabel("Density")
plt.title('Figure:Boxplot of few important features')

Out[201]:

Text(0.5, 1.0, 'Figure:Boxplot of few important features')

The variable 'Total_Assets_to_Liabilities_' still has some extreme values.

The variable 'Capital_Employed' still has some extreme high and low values.

Distribution of 15 significant scaled feature variables, each shown with a distplot and a box plot:
In [233]: 

fig,axes = plt.subplots(nrows= 5, ncols=2)


fig.set_size_inches(14,24)

sns.distplot(df_new['Selling_Cost'], ax= axes[0][0])


sns.boxplot(df_new['Selling_Cost'],orient = 'H', ax= axes[0][1])

sns.distplot(df_new['PBIDT'], ax= axes[1][0])


sns.boxplot(df_new['PBIDT'],orient = 'H', ax= axes[1][1])

sns.distplot(df_new['Cash_Flow_From_Operating_Activities'], ax= axes[2][0])


sns.boxplot(df_new['Cash_Flow_From_Operating_Activities'],orient = 'H', ax= axes[2][1])

sns.distplot(df_new['ROG-Net_Worth_perc'], ax= axes[3][0])


sns.boxplot(df_new['ROG-Net_Worth_perc'],orient = 'H', ax= axes[3][1])

sns.distplot(df_new['ROG-Capital_Employed_perc'], ax= axes[4][0])


sns.boxplot(df_new['ROG-Capital_Employed_perc'],orient = 'H', ax= axes[4][1])

Out[233]:

<AxesSubplot:xlabel='ROG-Capital_Employed_perc'>
In [238]: 

fig,axes = plt.subplots(nrows= 5, ncols=2)


fig.set_size_inches(14,24)

sns.distplot(df_new['ROG-Total_Assets_perc'], ax= axes[0][0])


sns.boxplot(df_new['ROG-Total_Assets_perc'],orient = 'H', ax= axes[0][1])

sns.distplot(df_new['ROG-PBT_perc'], ax= axes[1][0])


sns.boxplot(df_new['ROG-PBT_perc'],orient = 'H', ax= axes[1][1])

sns.distplot(df_new['ROG-PBIDT_perc'], ax= axes[2][0])


sns.boxplot(df_new['ROG-PBIDT_perc'],orient = 'H', ax= axes[2][1])

sns.distplot(df_new['Interest_Cover_Ratio[Latest]'], ax= axes[3][0])


sns.boxplot(df_new['Interest_Cover_Ratio[Latest]'],orient = 'H', ax= axes[3][1])

sns.distplot(df_new['Current_Ratio[Latest]'], ax= axes[4][0])


sns.boxplot(df_new['Current_Ratio[Latest]'],orient = 'H', ax= axes[4][1])

Out[238]:

<AxesSubplot:xlabel='Current_Ratio[Latest]'>
In [243]: 

fig,axes = plt.subplots(nrows= 5, ncols=2)


fig.set_size_inches(14,24)

sns.distplot(df_new['APATM_perc[Latest]'], ax= axes[0][0])


sns.boxplot(df_new['APATM_perc[Latest]'],orient = 'H', ax= axes[0][1])

sns.distplot(df_new['ROG-CP_perc'], ax= axes[1][0])


sns.boxplot(df_new['ROG-CP_perc'],orient = 'H', ax= axes[1][1])

sns.distplot(df_new['Value_of_Output_to_Total_Assets'], ax= axes[2][0])


sns.boxplot(df_new['Value_of_Output_to_Total_Assets'],orient = 'H', ax= axes[2][1])

sns.distplot(df_new['Net_Sales'], ax= axes[3][0])


sns.boxplot(df_new['Net_Sales'],orient = 'H', ax= axes[3][1])

sns.distplot(df_new['Book_Value_Adj._Unit_Curr'], ax= axes[4][0])


sns.boxplot(df_new['Book_Value_Adj._Unit_Curr'],orient = 'H', ax= axes[4][1])

Out[243]:

<AxesSubplot:xlabel='Book_Value_Adj._Unit_Curr'>
Please refer to the Business Report for inference
In [203]: 

sns.catplot('default', data=df_new, kind='count',aspect=1.5, palette='mako')


plt.title("Figure: Countplot of Target Variable Default")

Out[203]:

Text(0.5, 1.0, 'Figure: Countplot of Target Variable Default')

Bivariate Analysis

Gross Sales Vs Net Sales:


In [254]: 

sns.jointplot(x = df_new['Gross_Sales'],
y = df_new['Net_Sales'])

Out[254]:

<seaborn.axisgrid.JointGrid at 0x212bdc14790>

A linear relationship exists between these two important variables.

Networth Vs Capital Employed:


In [253]: 

sns.jointplot(x = df_new['Networth'],
              y = df_new['Capital_Employed'])

Out[253]:

<seaborn.axisgrid.JointGrid at 0x212be1238b0>

As capital employed increases, net worth also increases, but in some cases a high capital is employed
even at a lower net worth.

Networth Vs Cost of Production


In [255]: 

sns.jointplot(x = df_new['Networth'],
              y = df_new['Cost_of_Production'])

Out[255]:

<seaborn.axisgrid.JointGrid at 0x212a64b1fa0>

This plot is scattered; there is no clear relationship between these two variables.
In [208]: 

sns.boxplot(x = df_new['default'],
            y = df_new['ROG-Net_Worth_perc'])

Out[208]:

<AxesSubplot:xlabel='default', ylabel='ROG-Net_Worth_perc'>

In [228]: 

# From Recursive Feature Elimination (RFE) we got the top important variables plus the target variable
df_imp = pd.DataFrame(df_new, columns=['Net_Sales', 'Value_Of_Output', 'PBIDT', 'PBDT', 'PBIT',
    'ROG-Net_Worth_perc', 'ROG-Capital_Employed_perc', 'ROG-Total_Assets_perc', 'Current_Ratio[Latest]',
    'Inventory_Ratio[Latest]', 'Total_Asset_Turnover_Ratio[Latest]', 'Interest_Cover_Ratio[Latest]',
    'Value_of_Output_to_Gross_Block', 'default'])

Correlation Heatmap
In [230]: 

plt.figure(figsize = (12,8))
cor_matrix = df_imp.corr()
sns.heatmap(cor_matrix, cmap = 'plasma', vmin = -1, vmax= 1)

Out[230]:

<AxesSubplot:>

Please refer to the Business Report for inference

Q 1.5 Split the data into Train and Test dataset in a ratio of 67:33
Split the data into Train and Test dataset in a ratio of 67:33 and use random_state =42.
Model Building is to be done on Train Dataset and Model Validation is to be done on Test Dataset.

In [108]: 

X = df_new.drop(['default','Networth_Next_Year'], axis=1)
y = df_new['default']

In [119]: 

X.head()

Out[119]:

Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_ Net_Working_Capital

0 43.17 -166.22 -320.90 180.83 328.88 -89.4

1 43.17 -166.22 555.11 180.83 328.88 -89.4

2 43.17 287.41 555.11 180.83 328.88 151.5

3 43.17 -166.22 555.11 180.83 328.88 -89.4

4 43.17 -166.22 555.11 180.83 328.88 151.5

5 rows × 64 columns

In [120]: 

y.head()

Out[120]:

0 1

1 1

2 1

3 1

4 1

Name: default, dtype: int32

In [109]: 

#Split X and y into training and test set in 67:33 ratio


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42,stra

In [110]: 

from sklearn.preprocessing import StandardScaler


ss=StandardScaler()
# Scaling the data; without scaling, some models give very poor results and computation is slower
X_train_scaled=ss.fit_transform(X_train)
X_test_scaled=ss.transform(X_test)
In [200]: 

X_train_scaled.shape

Out[200]:

(2402, 64)

In [198]: 

X_train.shape

Out[198]:

(2402, 64)

In [112]: 

# Train data (before scaling), first 5 rows


X_train.head()

Out[112]:

Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_ Net_Working_Cap

842 3.25 3.49 3.54 0.05 1.53

1057 4.75 5.29 5.39 0.02 0.75

1595 13.06 13.50 13.50 0.00 0.00 1

100 43.17 -60.79 -58.30 2.46 18.66 -8

1191 4.00 5.95 20.62 14.33 7.76

5 rows × 64 columns

In [113]: 

# Test data (before scaling), first 5 rows


X_test.head()

Out[113]:

Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_ Net_Working_Cap

251 2.19 -5.58 -1.55 4.03 1.07

3493 43.17 287.41 555.11 180.83 328.88 15

3063 32.33 287.41 350.02 0.00 5.49

2384 3.01 63.35 90.88 25.74 31.22 6

1679 7.44 14.27 14.27 0.00 13.87

5 rows × 64 columns
In [114]: 

X_train.columns

Out[114]:

Index(['Equity_Paid_Up', 'Networth', 'Capital_Employed', 'Total_Debt',
       'Gross_Block_', 'Net_Working_Capital_', 'Current_Assets_',
       'Current_Liabilities_and_Provisions_', 'Total_Assets_to_Liabilities_',
       'Gross_Sales', 'Net_Sales', 'Other_Income', 'Value_Of_Output',
       'Cost_of_Production', 'Selling_Cost', 'PBIDT', 'PBDT', 'PBIT', 'PBT',
       'PAT', 'Adjusted_PAT', 'CP', 'Revenue_earnings_in_forex',
       'Revenue_expenses_in_forex', 'Capital_expenses_in_forex',
       'Book_Value_Unit_Curr', 'Book_Value_Adj._Unit_Curr',
       'Market_Capitalisation', 'CEPS_annualised_Unit_Curr',
       'Cash_Flow_From_Operating_Activities',
       'Cash_Flow_From_Investing_Activities',
       'Cash_Flow_From_Financing_Activities', 'ROG-Net_Worth_perc',
       'ROG-Capital_Employed_perc', 'ROG-Gross_Block_perc',
       'ROG-Gross_Sales_perc', 'ROG-Net_Sales_perc',
       'ROG-Cost_of_Production_perc', 'ROG-Total_Assets_perc',
       'ROG-PBIDT_perc', 'ROG-PBDT_perc', 'ROG-PBIT_perc', 'ROG-PBT_perc',
       'ROG-PAT_perc', 'ROG-CP_perc', 'ROG-Revenue_earnings_in_forex_perc',
       'ROG-Revenue_expenses_in_forex_perc', 'ROG-Market_Capitalisation_perc',
       'Current_Ratio[Latest]', 'Fixed_Assets_Ratio[Latest]',
       'Inventory_Ratio[Latest]', 'Debtors_Ratio[Latest]',
       'Total_Asset_Turnover_Ratio[Latest]', 'Interest_Cover_Ratio[Latest]',
       'PBIDTM_perc[Latest]', 'PBITM_perc[Latest]', 'PBDTM_perc[Latest]',
       'CPM_perc[Latest]', 'APATM_perc[Latest]', 'Debtors_Velocity_Days',
       'Creditors_Velocity_Days', 'Inventory_Velocity_Days',
       'Value_of_Output_to_Total_Assets', 'Value_of_Output_to_Gross_Block'],
      dtype='object')

In [152]: 

y_train.value_counts(1)

Out[152]:

0 0.89

1 0.11

Name: default, dtype: float64

In [153]: 

y_test.value_counts(1)

Out[153]:

0 0.89

1 0.11

Name: default, dtype: float64


In [121]: 

#Number of rows and columns of the training set for the independent variables:
X_train.shape[0]

Out[121]:

2402

In [122]: 

#Number of rows and columns of the test set for the independent variables:
X_test.shape[0]

Out[122]:

1184

Q 1.6 Build Logistic Regression Model (using statsmodel library) on most important variables on Train
Dataset and choose the optimum cutoff. Also showcase your model building approach

Model Building using Logistic Regression for 'Predicting Credit Risk'

The equation of logistic regression, by which we predict the corresponding probabilities and then
predict a discrete target variable, is

$y = \dfrac{1}{1 + e^{-z}}$

Note: $z = \beta_0 + \sum_{i=1}^{n} \beta_i X_i$
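A quick numeric sketch of this mapping in plain NumPy, independent of the model cells that follow:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# z = 0 gives probability 0.5; large positive z approaches 1, large negative z approaches 0
```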
Now, Importing statsmodels modules

In [124]: 

from sklearn.feature_selection import RFE


from sklearn.linear_model import LogisticRegression
import statsmodels.formula.api as SM

For modeling we will use Logistic Regression with recursive feature elimination

In [127]: 

LogR = LogisticRegression()
In [133]: 

# Using Recursive Feature Elimination: out of the 64 features we will select the 21 most important
selector = RFE(estimator = LogR, n_features_to_select=21, step=1)

In [129]: 

selector = selector.fit(X_train, y_train)

In [130]: 

selector.n_features_

Out[130]:

21

In [131]: 

selector.ranking_

Out[131]:

array([ 5, 33, 16, 7, 35, 10, 19, 29, 17, 14, 1, 4, 1, 8, 21, 1, 1,

1, 15, 25, 36, 20, 1, 1, 42, 1, 1, 11, 9, 34, 6, 30, 1, 1,

3, 13, 12, 18, 1, 31, 23, 32, 41, 40, 26, 44, 43, 28, 1, 1, 1,

37, 1, 1, 2, 1, 24, 27, 1, 22, 39, 38, 1, 1])


In [132]: 

rank_df = pd.DataFrame({'Feature': X_train.columns, 'Rank': selector.ranking_})


rank_df[rank_df['Rank'] == 1]

Out[132]:

Feature Rank

10 Net_Sales 1

12 Value_Of_Output 1

15 PBIDT 1

16 PBDT 1

17 PBIT 1

22 Revenue_earnings_in_forex 1

23 Revenue_expenses_in_forex 1

25 Book_Value_Unit_Curr 1

26 Book_Value_Adj._Unit_Curr 1

32 ROG-Net_Worth_perc 1

33 ROG-Capital_Employed_perc 1

38 ROG-Total_Assets_perc 1

48 Current_Ratio[Latest] 1

49 Fixed_Assets_Ratio[Latest] 1

50 Inventory_Ratio[Latest] 1

52 Total_Asset_Turnover_Ratio[Latest] 1

53 Interest_Cover_Ratio[Latest] 1

55 PBITM_perc[Latest] 1

58 APATM_perc[Latest] 1

62 Value_of_Output_to_Total_Assets 1

63 Value_of_Output_to_Gross_Block 1

In [154]: 

from sklearn.model_selection import GridSearchCV

In [155]: 

grid={'penalty':['l2','none'],
'solver':['sag','lbfgs'],
'tol':[0.0001,0.00001]}

In [156]: 

model = LogisticRegression(max_iter=10000,n_jobs=2)
In [157]: 

grid_search = GridSearchCV(estimator = model, param_grid = grid, cv = 3, n_jobs=-1, scoring='f1')

In [158]: 

grid_search.fit(X_train, y_train)

Out[158]:

GridSearchCV(cv=3, estimator=LogisticRegression(max_iter=10000, n_jobs=2),
             n_jobs=-1,
             param_grid={'penalty': ['l2', 'none'], 'solver': ['sag', 'lbfgs'],
                         'tol': [0.0001, 1e-05]},
             scoring='f1')

In [159]: 

print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)

{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')

In [160]: 

best_model = grid_search.best_estimator_

In [161]: 

# Prediction on the training set

ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)
In [162]: 

## Getting the probabilities on the test set

ytest_predict_prob=best_model.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()

Out[162]:

0 1

0 0.42 0.58

1 1.00 0.00

2 1.00 0.00

3 0.99 0.01

4 1.00 0.00
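Q 1.6 also asks for an optimum cutoff, and the cells above stop at the raw probabilities. A sketch of one common approach: scan candidate thresholds on the predicted probabilities and keep the one that maximizes F1 for the default class (the function name is ours; the commented usage assumes the `best_model`, `y_train` and `X_train` names from the cells above):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_cutoff(y_true, prob_class1, grid=None):
    """Return the probability threshold that maximizes F1 for the default class."""
    if grid is None:
        grid = np.arange(0.05, 0.95, 0.01)
    scores = [f1_score(y_true, (prob_class1 >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])

# Hypothetical usage with the fitted model above:
# cutoff = best_cutoff(y_train, best_model.predict_proba(X_train)[:, 1])
```

The cutoff should be chosen on the train (or a validation) set, never on the test set, to keep the test metrics honest.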

Q 1.7 Validate the Model on Test Dataset and state the performance matrices. Also state interpretation
from the model

In [164]: 

from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix, plot_confusion_matrix


In [165]: 

## Confusion matrix on the training data

plot_confusion_matrix(best_model,X_train,y_train)
print(classification_report(y_train, ytrain_predict),'\n');

precision recall f1-score support

0 0.97 0.99 0.98 2142

1 0.88 0.75 0.81 260

accuracy 0.96 2402

macro avg 0.93 0.87 0.89 2402

weighted avg 0.96 0.96 0.96 2402

In [166]: 

## Confusion matrix on the test data

plot_confusion_matrix(best_model,X_test,y_test)
print(classification_report(y_test, ytest_predict),'\n');

precision recall f1-score support

0 0.96 0.98 0.97 1056

1 0.81 0.70 0.75 128

accuracy 0.95 1184

macro avg 0.89 0.84 0.86 1184

weighted avg 0.95 0.95 0.95 1184

We can select other parameters for GridSearchCV and try to optimize the desired metric.

The recall score for the default class is not good on either train or test.
Since only 11% of the total data are defaults, we will now try to balance the data before fitting the model.
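SMOTE (used in the next cell) balances the classes by creating synthetic minority samples: each new sample lies on the line segment between a minority point and one of its nearest minority neighbours. A minimal sketch of that interpolation step (our own illustration, not the imblearn implementation):

```python
import numpy as np

def smote_interpolate(x, neighbour, rng=None):
    """One synthetic minority sample on the segment between x and a neighbour."""
    if rng is None:
        rng = np.random.default_rng(0)
    gap = rng.random()                     # uniform draw in [0, 1)
    return x + gap * (neighbour - x)

x = np.array([1.0, 2.0])
nb = np.array([3.0, 4.0])
synthetic = smote_interpolate(x, nb)       # component-wise between x and nb
```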

Imblearn over_sampling

In [176]: 

from imblearn.over_sampling import SMOTE


sm = SMOTE(random_state=33)
X_res, y_res = sm.fit_resample(X_train, y_train)
In [177]: 

selector_smote = selector.fit(X_res, y_res)

In [178]: 

selector_smote.n_features_

Out[178]:

21

In [179]: 

pred_train_smote = selector_smote.predict(X_res)
pred_test_smote = selector_smote.predict(X_test)

In [194]: 

plot_confusion_matrix(selector_smote,X_res,y_res)
print(classification_report(y_res, pred_train_smote),'\n');

precision recall f1-score support

0 0.95 0.92 0.93 2142

1 0.92 0.95 0.94 2142

accuracy 0.93 4284

macro avg 0.93 0.93 0.93 4284

weighted avg 0.93 0.93 0.93 4284

In [191]: 

plot_confusion_matrix(selector_smote,X_test,y_test)
print(classification_report(y_test, pred_test_smote),'\n');

precision recall f1-score support

0 0.99 0.92 0.95 1056

1 0.58 0.93 0.71 128

accuracy 0.92 1184

macro avg 0.78 0.92 0.83 1184

weighted avg 0.95 0.92 0.93 1184

Finally, we are able to achieve a decent recall value without overfitting. Considering the challenges, such as
outliers, missing values and correlated features, this is a fairly good model. It could be improved with better
quality data in which the features explaining default are not missing to this extent. Of course, we could also try
other techniques that are less sensitive to missing values and outliers.

Please refer to the Business Report for interpretation

THE END!