FRA Milestone 1 Jupyter Notebook PDF

FRA PROJECT (MILESTONE - 1)
Predicting Credit Risk ¶
Problem Statement
Businesses or companies can fall prey to default if they are not able to keep up their debt obligations. Defaults
will lead to a lower credit rating for the company which in turn reduces its chances of getting credit in the future
and may have to pay higher interests on existing debts as well as any new obligations. From an investor's point
of view, he would want to invest in a company if it is capable of handling its financial obligations, can grow
quickly, and is able to manage the growth scale.
A balance sheet is a financial statement of a company that provides a snapshot of what a company owns,
owes, and the amount invested by the shareholders. Thus, it is an important tool that helps evaluate the
performance of a business.
Data that is available includes information from the financial statement of the companies for the previous year
(2015). Also, information about the Networth of the company in the following year (2016) is provided which can
be used to drive the labeled field.
Explanation of data fields available in Data Dictionary, 'Credit Default Data Dictionary.xlsx'
In [175]: 
# Importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # for making plots with seaborn
import sklearn
color = sns.color_palette()
import sklearn.metrics as mertics
import scipy.stats as stats
import sklearn.metrics as metrics
import warnings
warnings.filterwarnings('ignore')
In [2]: 
import os
In [3]: 
os.getcwd()
Out[3]:
'C:\\Users\\Hakers Zone\\Desktop\\Data Science\\Great Lakes\\Core Course fil

es\\11. Finance and Risk Analytics\\FRA Milestone -1\\Working files'
Importing the dataset
In [4]: 
df = pd.read_excel('Company_Data2015-1.xlsx')
In [5]: 
#Glimpse of Data
df.head()
Out[5]:
Networth Equity Net

Capital Total Gross C
Co_Code Co_Name Next Paid Networth Working
Employed Debt Block
Year Up Capital
0 16974 Hind.Cables -8021.60 419.36 -7027.48 -1007.24 5936.03 474.30 -1076.34
Tata Tele.
1 21214 -3986.19 1954.93 -2968.08 4458.20 7410.18 9070.86 -1098.88
Mah.
ABG
2 14852 -3192.58 53.84 506.86 7714.68 6944.54 1281.54 4496.25 9
Shipyard
3 2439 GTL -3054.51 157.30 -623.49 2353.88 2326.05 1033.69 -2612.42 1
Bharati
4 23505 -2967.36 50.30 -1070.83 4675.33 5740.90 1084.20 1836.23 4
Defence
5 rows × 67 columns
Fixing messy column names (containing spaces) for ease of use
In [6]: 
df.columns = df.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '').str

In [7]: 
#### Checking top 5 rows again

df.head()
Out[7]:
Co_Code Co_Name Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Tot
0 16974 Hind.Cables -8021.60 419.36 -7027.48 -1007.24
Tata Tele.
1 21214 -3986.19 1954.93 -2968.08 4458.20
Mah.
ABG
2 14852 -3192.58 53.84 506.86 7714.68
Shipyard
3 2439 GTL -3054.51 157.30 -623.49 2353.88
Bharati
4 23505 -2967.36 50.30 -1070.83 4675.33
Defence
In [69]: 
df.dtypes.value_counts()
Out[69]:
float64 63
int64 3
object 1
dtype: int64
Now, let us check the number of rows (observations) and the number of columns (variables)
In [8]: 
print('The number of rows (observations) is =',df.shape[0],'\n''The number of columns (vari
The number of rows (observations) is = 3586
The number of columns (variables) is = 67
Checking datatype of all columns

In [9]: 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3586 entries, 0 to 3585
Data columns (total 67 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Co_Code 3586 non-null int64
1 Co_Name 3586 non-null object
2 Networth_Next_Year 3586 non-null float64
3 Equity_Paid_Up 3586 non-null float64
4 Networth 3586 non-null float64
5 Capital_Employed 3586 non-null float64
6 Total_Debt 3586 non-null float64
7 Gross_Block_ 3586 non-null float64
8 Net_Working_Capital_ 3586 non-null float64
9 Current_Assets_ 3586 non-null float64
10 Current_Liabilities_and_Provisions_ 3586 non-null float64
11 Total_Assets_to_Liabilities_ 3586 non-null float64
12 Gross_Sales 3586 non-null float64
13 Net_Sales 3586 non-null float64
14 Other_Income 3586 non-null float64
15 Value_Of_Output 3586 non-null float64
16 Cost_of_Production 3586 non-null float64
17 Selling_Cost 3586 non-null float64
18 PBIDT 3586 non-null float64
19 PBDT 3586 non-null float64
20 PBIT 3586 non-null float64
21 PBT 3586 non-null float64
22 PAT 3586 non-null float64
23 Adjusted_PAT 3586 non-null float64
24 CP 3586 non-null float64
25 Revenue_earnings_in_forex 3586 non-null float64
26 Revenue_expenses_in_forex 3586 non-null float64
27 Capital_expenses_in_forex 3586 non-null float64
28 Book_Value_Unit_Curr 3586 non-null float64
29 Book_Value_Adj._Unit_Curr 3582 non-null float64
30 Market_Capitalisation 3586 non-null float64
31 CEPS_annualised_Unit_Curr 3586 non-null float64
32 Cash_Flow_From_Operating_Activities 3586 non-null float64
33 Cash_Flow_From_Investing_Activities 3586 non-null float64
34 Cash_Flow_From_Financing_Activities 3586 non-null float64
35 ROG-Net_Worth_perc 3586 non-null float64
36 ROG-Capital_Employed_perc 3586 non-null float64
37 ROG-Gross_Block_perc 3586 non-null float64
38 ROG-Gross_Sales_perc 3586 non-null float64
39 ROG-Net_Sales_perc 3586 non-null float64
40 ROG-Cost_of_Production_perc 3586 non-null float64
41 ROG-Total_Assets_perc 3586 non-null float64
42 ROG-PBIDT_perc 3586 non-null float64
43 ROG-PBDT_perc 3586 non-null float64
44 ROG-PBIT_perc 3586 non-null float64
45 ROG-PBT_perc 3586 non-null float64
46 ROG-PAT_perc 3586 non-null float64
47 ROG-CP_perc 3586 non-null float64
48 ROG-Revenue_earnings_in_forex_perc 3586 non-null float64
49 ROG-Revenue_expenses_in_forex_perc 3586 non-null float64
50 ROG-Market_Capitalisation_perc 3586 non-null float64
51 Current_Ratio[Latest] 3585 non-null float64
52 Fixed_Assets_Ratio[Latest] 3585 non-null float64
53 Inventory_Ratio[Latest] 3585 non-null float64
54 Debtors_Ratio[Latest] 3585 non-null float64
55 Total_Asset_Turnover_Ratio[Latest] 3585 non-null float64
56 Interest_Cover_Ratio[Latest] 3585 non-null float64
57 PBIDTM_perc[Latest] 3585 non-null float64
58 PBITM_perc[Latest] 3585 non-null float64
59 PBDTM_perc[Latest] 3585 non-null float64
60 CPM_perc[Latest] 3585 non-null float64
61 APATM_perc[Latest] 3585 non-null float64
62 Debtors_Velocity_Days 3586 non-null int64
63 Creditors_Velocity_Days 3586 non-null int64
64 Inventory_Velocity_Days 3483 non-null float64
65 Value_of_Output_to_Total_Assets 3586 non-null float64
66 Value_of_Output_to_Gross_Block 3586 non-null float64
dtypes: float64(63), int64(3), object(1)
memory usage: 1.8+ MB
Now, let us check the basic measures of descriptive statistics for the continuous variables
In [10]: 
pd.options.display.float_format = '{:.2f}'.format
df.describe().T
Out[10]:
count mean std min 25% 50% 7
Co_Code 3586.00 16065.39 19776.82 4.00 3029.25 6077.50 24269
Networth_Next_Year 3586.00 725.05 4769.68 -8021.60 3.98 19.02 123
Equity_Paid_Up 3586.00 62.97 778.76 0.00 3.75 8.29 19
Networth 3586.00 649.75 4091.99 -7027.48 3.89 18.58 117
Capital_Employed 3586.00 2799.61 26975.14 -1824.75 7.60 39.09 226
... ... ... ... ... ... ...
Debtors_Velocity_Days 3586.00 603.89 10636.76 0.00 8.00 49.00 106
Creditors_Velocity_Days 3586.00 2057.85 54169.48 0.00 8.00 39.00 89
Inventory_Velocity_Days 3483.00 79.64 137.85 -199.00 0.00 35.00 96
Value_of_Output_to_Total_Assets 3586.00 0.82 1.20 -0.33 0.07 0.48 1
Value_of_Output_to_Gross_Block 3586.00 61.88 976.82 -61.00 0.27 1.53 4
In [11]: 
# Check for duplicate data
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]
Number of duplicate rows = 0
Out[11]:
Co_Code Co_Name Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_
Q 1.1 Outlier Treatment
In [12]: 
df1 = df.drop(['Co_Code', 'Co_Name'], axis = 1)
In [13]: 
df1.head()
Out[13]:
Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_
0 -8021.60 419.36 -7027.48 -1007.24 5936.03 474.30
1 -3986.19 1954.93 -2968.08 4458.20 7410.18 9070.86
2 -3192.58 53.84 506.86 7714.68 6944.54 1281.54
3 -3054.51 157.30 -623.49 2353.88 2326.05 1033.69
4 -2967.36 50.30 -1070.83 4675.33 5740.90 1084.20
In [14]: 
df1.shape
Out[14]:
(3586, 65)
In [15]: 
# Checking Outliers in dataset

df1.boxplot(figsize=(16,8))
plt.xticks(rotation=90)
plt.show()
Outlier treatment :
#Define a function which returns the Upper and Lower limit to detect outliers for each feature
In [16]: 
def mod_outlier(df1):
df1 = df1._get_numeric_data()
q1 = df1.quantile(0.25)
q3 = df1.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 -(1.5 * iqr)
upper_bound = q3 +(1.5 * iqr)
for col in df1.columns:

for i in range(0,len(df1[col])):
if df1[col][i] < lower_bound[col]:
df1[col][i] = lower_bound[col]
if df1[col][i] > upper_bound[col]:

df1[col][i] = upper_bound[col]
for col in df1.columns:
return(df1)
In [17]: 
df_new = mod_outlier(df1)
In [18]: 
# Checking Outliers in dataset

df_new.boxplot(figsize=(16,8))
plt.show()
Q 1.2 Missing Value Treatment
Lets check for missing values in the dataset
In [19]: 
#check isnull present

df_new.isnull().sum().sort_values(ascending=False).head(13)
Out[19]:
Inventory_Velocity_Days 103
Book_Value_Adj._Unit_Curr 4
Current_Ratio[Latest] 1
PBITM_perc[Latest] 1
Fixed_Assets_Ratio[Latest] 1
Inventory_Ratio[Latest] 1
Debtors_Ratio[Latest] 1
Total_Asset_Turnover_Ratio[Latest] 1
PBIDTM_perc[Latest] 1
Interest_Cover_Ratio[Latest] 1
PBDTM_perc[Latest] 1
CPM_perc[Latest] 1
APATM_perc[Latest] 1
dtype: int64
In [21]: 
df_new.isnull().sum().sum()
Out[21]:
118
In [97]: 
df_new.size
Out[97]:
236676
There are 0.05% missing values in the dataset
In [20]: 
#Columns with missing values

print(np.where(df_new.isnull().sum()>0))
(array([27, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 62], dtype=int64),)
Treating Missing value

In [22]: 
df_new.fillna(df.median())
Out[22]:
Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Bloc
0 -175.74 43.17 -166.22 -320.90 180.83 328
1 -175.74 43.17 -166.22 555.11 180.83 328
2 -175.74 43.17 287.41 555.11 180.83 328
3 -175.74 43.17 -166.22 555.11 180.83 328
4 -175.74 43.17 -166.22 555.11 180.83 328
... ... ... ... ... ...
3581 303.53 43.17 287.41 555.11 180.83 328
3582 303.53 43.17 287.41 555.11 180.83 328
3583 303.53 43.17 287.41 555.11 180.83 328
3584 303.53 43.17 287.41 555.11 180.83 328
3585 303.53 43.17 287.41 555.11 180.83 328
In [23]: 
df_new.isnull().any().any()
Out[23]:
True
Still there are missing value
Let's visually inspect the missing values in our data

In [24]: 
plt.figure(figsize = (12,8))
sns.heatmap(df_new.isnull(), cbar = False, cmap = 'coolwarm', yticklabels = False)
plt.show()
In [25]: 
cat=[]
num=[]
for i in df_new.columns:
if df_new[i].dtype=="object":
cat.append(i)
else:
num.append(i)
print(cat)
print(num)
[]
['Networth_Next_Year', 'Equity_Paid_Up', 'Networth', 'Capital_Employed', 'To

tal_Debt', 'Gross_Block_', 'Net_Working_Capital_', 'Current_Assets_', 'Curre
nt_Liabilities_and_Provisions_', 'Total_Assets_to_Liabilities_', 'Gross_Sale
s', 'Net_Sales', 'Other_Income', 'Value_Of_Output', 'Cost_of_Production', 'S
elling_Cost', 'PBIDT', 'PBDT', 'PBIT', 'PBT', 'PAT', 'Adjusted_PAT', 'CP',
'Revenue_earnings_in_forex', 'Revenue_expenses_in_forex', 'Capital_expenses_
in_forex', 'Book_Value_Unit_Curr', 'Book_Value_Adj._Unit_Curr', 'Market_Capi
talisation', 'CEPS_annualised_Unit_Curr', 'Cash_Flow_From_Operating_Activiti
es', 'Cash_Flow_From_Investing_Activities', 'Cash_Flow_From_Financing_Activi
ties', 'ROG-Net_Worth_perc', 'ROG-Capital_Employed_perc', 'ROG-Gross_Block_p
erc', 'ROG-Gross_Sales_perc', 'ROG-Net_Sales_perc', 'ROG-Cost_of_Production_
perc', 'ROG-Total_Assets_perc', 'ROG-PBIDT_perc', 'ROG-PBDT_perc', 'ROG-PBIT
_perc', 'ROG-PBT_perc', 'ROG-PAT_perc', 'ROG-CP_perc', 'ROG-Revenue_earnings
_in_forex_perc', 'ROG-Revenue_expenses_in_forex_perc', 'ROG-Market_Capitalis
ation_perc', 'Current_Ratio[Latest]', 'Fixed_Assets_Ratio[Latest]', 'Invento
ry_Ratio[Latest]', 'Debtors_Ratio[Latest]', 'Total_Asset_Turnover_Ratio[Late
st]', 'Interest_Cover_Ratio[Latest]', 'PBIDTM_perc[Latest]', 'PBITM_perc[Lat
est]', 'PBDTM_perc[Latest]', 'CPM_perc[Latest]', 'APATM_perc[Latest]', 'Debt
ors_Velocity_Days', 'Creditors_Velocity_Days', 'Inventory_Velocity_Days', 'V
alue_of_Output_to_Total_Assets', 'Value_of_Output_to_Gross_Block']
In [26]: 
##Lets treat these missing values with median (replacement with median eliminates impact of
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
df_new = pd.DataFrame(imputer.fit_transform(df_new))
df_new.columns=num
#df_new =df_fra[cat]
df_new.head()
Out[26]:
Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_
0 -175.74 43.17 -166.22 -320.90 180.83 328.88
1 -175.74 43.17 -166.22 555.11 180.83 328.88
2 -175.74 43.17 287.41 555.11 180.83 328.88
3 -175.74 43.17 -166.22 555.11 180.83 328.88
4 -175.74 43.17 -166.22 555.11 180.83 328.88
In [27]: 
df_new.isnull().any().any()
Out[27]:
False
In [42]: 
df_new.isnull().sum().sort_values(ascending=False).head(20)
Out[42]:
Networth_Next_Year 0
Current_Ratio[Latest] 0
ROG-Gross_Block_perc 0
ROG-Gross_Sales_perc 0
ROG-Net_Sales_perc 0
ROG-Cost_of_Production_perc 0
ROG-Total_Assets_perc 0
ROG-PBIDT_perc 0
ROG-PBDT_perc 0
ROG-PBIT_perc 0
ROG-PBT_perc 0
ROG-PAT_perc 0
ROG-CP_perc 0
ROG-Revenue_earnings_in_forex_perc 0
ROG-Revenue_expenses_in_forex_perc 0
ROG-Market_Capitalisation_perc 0
Fixed_Assets_Ratio[Latest] 0
Equity_Paid_Up 0
Inventory_Ratio[Latest] 0
Debtors_Ratio[Latest] 0
dtype: int64
Q 1.3 Transform Target variable into 0 and 1 ( Creating a binary target variable using
'Networth_Next_Year')
In [29]: 
df_new['default'] = np.where((df_new['Networth_Next_Year'] > 0), 0, 1)
Checking top 5 rows
In [30]: 
df_new[['default','Networth_Next_Year']].head(5)
Out[30]:
default Networth_Next_Year
0 1 -175.74
1 1 -175.74
2 1 -175.74
3 1 -175.74
4 1 -175.74
Checking tail 5 rows

In [31]: 
df_new[['default','Networth_Next_Year']].tail(5)
Out[31]:
default Networth_Next_Year
3581 0 303.53
3582 0 303.53
3583 0 303.53
3584 0 303.53
3585 0 303.53
What does variable 'default' look like
In [32]: 
df_new['default'].value_counts()
Out[32]:
0 3198
1 388
Name: default, dtype: int64
Checking proportion of default
In [34]: 
df_new['default'].value_counts(normalize = True)
Out[34]:
0 0.89
1 0.11
Name: default, dtype: float64
Q 1.4 Univariate & Bivariate analysis with proper interpretation. (You may choose to include only those
variables which were significant in the model building)
Univariate analysis :
In [53]: 
pd.options.display.float_format = '{:.2f}'.format
df_new.describe().T
Out[53]:
count mean std min 25% 50% 75% max
Networth_Next_Year 3586.00 77.40 120.48 -175.74 3.98 19.02 123.80 303.53
Equity_Paid_Up 3586.00 13.99 14.00 0.00 3.75 8.29 19.52 43.17
Networth 3586.00 73.69 112.94 -166.22 3.89 18.58 117.30 287.41
Capital_Employed 3586.00 152.49 207.87 -320.90 7.60 39.09 226.60 555.11
Total_Debt 3586.00 47.44 68.22 -0.72 0.03 7.49 72.35 180.83
... ... ... ... ... ... ... ... ...
Creditors_Velocity_Days 3586.00 62.39 68.03 0.00 8.00 39.00 89.00 210.00
Inventory_Velocity_Days 3586.00 61.22 73.20 -144.00 0.00 35.00 93.00 240.00
Value_of_Output_to_Total_Assets 3586.00 0.73 0.77 -0.33 0.07 0.48 1.16 2.79
Value_of_Output_to_Gross_Block 3586.00 3.36 4.10 -6.69 0.27 1.53 4.91 11.87
default 3586.00 0.11 0.31 0.00 0.00 0.00 0.00 1.00
Some Important Feature Box Plot

In [201]: 
plt.figure(figsize=(25,10))
sns.boxplot(data=df_new)
plt.xlabel("Variables")
plt.ylabel("Density")
plt.title('Figure:Boxplot of few important features')
Out[201]:
Text(0.5, 1.0, 'Figure:Boxplot of few important features')
Variable 'Total_Asset_To_Liabilities still have some extreme values.
Varible 'Capital_Employes'still have some Extreme and Lower values.
15 Significate Scaled Feature Variables - Distribution of column with Displot & Box plot:
In [233]: 
fig,axes = plt.subplots(nrows= 5, ncols=2)

fig.set_size_inches(14,24)
sns.distplot(df_new['Selling_Cost'], ax= axes[0][0])

sns.boxplot(df_new['Selling_Cost'],orient = 'H', ax= axes[0][1])
sns.distplot(df_new['PBIDT'], ax= axes[1][0])

sns.boxplot(df_new['PBIDT'],orient = 'H', ax= axes[1][1])
sns.distplot(df_new['Cash_Flow_From_Operating_Activities'], ax= axes[2][0])

sns.boxplot(df_new['Cash_Flow_From_Operating_Activities'],orient = 'H', ax= axes[2][1])
sns.distplot(df_new['ROG-Net_Worth_perc'], ax= axes[3][0])

sns.boxplot(df_new['ROG-Net_Worth_perc'],orient = 'H', ax= axes[3][1])
sns.distplot(df_new['ROG-Capital_Employed_perc'], ax= axes[4][0])

sns.boxplot(df_new['ROG-Capital_Employed_perc'],orient = 'H', ax= axes[4][1])
Out[233]:
<AxesSubplot:xlabel='ROG-Capital_Employed_perc'>
In [238]: 

sns.distplot(df_new['ROG-Total_Assets_perc'], ax= axes[0][0])

sns.boxplot(df_new['ROG-Total_Assets_perc'],orient = 'H', ax= axes[0][1])
sns.distplot(df_new['ROG-PBT_perc'], ax= axes[1][0])

sns.boxplot(df_new['ROG-PBT_perc'],orient = 'H', ax= axes[1][1])
sns.distplot(df_new['ROG-PBIDT_perc'], ax= axes[2][0])

sns.boxplot(df_new['ROG-PBIDT_perc'],orient = 'H', ax= axes[2][1])
sns.distplot(df_new['Interest_Cover_Ratio[Latest]'], ax= axes[3][0])

sns.boxplot(df_new['Interest_Cover_Ratio[Latest]'],orient = 'H', ax= axes[3][1])
sns.distplot(df_new['Current_Ratio[Latest]'], ax= axes[4][0])

sns.boxplot(df_new['Current_Ratio[Latest]'],orient = 'H', ax= axes[4][1])
Out[238]:
<AxesSubplot:xlabel='Current_Ratio[Latest]'>
In [243]: 

sns.distplot(df_new['APATM_perc[Latest]'], ax= axes[0][0])

sns.boxplot(df_new['APATM_perc[Latest]'],orient = 'H', ax= axes[0][1])
sns.distplot(df_new['ROG-CP_perc'], ax= axes[1][0])

sns.boxplot(df_new['ROG-CP_perc'],orient = 'H', ax= axes[1][1])
sns.distplot(df_new['Value_of_Output_to_Total_Assets'], ax= axes[2][0])

sns.boxplot(df_new['Value_of_Output_to_Total_Assets'],orient = 'H', ax= axes[2][1])
sns.distplot(df_new['Net_Sales'], ax= axes[3][0])

sns.boxplot(df_new['Net_Sales'],orient = 'H', ax= axes[3][1])
sns.distplot(df_new['Book_Value_Adj._Unit_Curr'], ax= axes[4][0])

sns.boxplot(df_new['Book_Value_Adj._Unit_Curr'],orient = 'H', ax= axes[4][1])
Out[243]:
<AxesSubplot:xlabel='Book_Value_Adj._Unit_Curr'>
- Please refer Business Report for Inference
In [203]: 
sns.catplot('default', data=df_new, kind='count',aspect=1.5, palette='mako')

plt.title("Figure: Countplot of Target Variable Default")
Out[203]:
Text(0.5, 1.0, 'Figure: Countplot of Target Variable Default')
Bivariate Analysis
Gross Sales Vs Net Sales:

In [254]: 
sns.jointplot(x = df_new['Gross_Sales'],
y = df_new['Net_Sales'])
Out[254]:
<seaborn.axisgrid.JointGrid at 0x212bdc14790>
There exists linear relationship between these two important variables.
Networth Vs Capital Employment:

In [253]: 
sns.jointplot(x = df_new['Networth'],
y = df1['Capital_Employed'])
Out[253]:
<seaborn.axisgrid.JointGrid at 0x212be1238b0>
As the capital increases, net worth also increases, but in some cases, capital seems to be disbursed even
for lesser networth.
Networth Vs Cost of Production

In [255]: 
sns.jointplot(x = df_new['Networth'],
y = df1['Cost_of_Production'])
Out[255]:
<seaborn.axisgrid.JointGrid at 0x212a64b1fa0>
This plot is scattered and there exists no such relationship between these two variables.
In [208]: 
sns.boxplot(x = df_new['default'],
y = df1['ROG-Net_Worth_perc'])
Out[208]:
<AxesSubplot:xlabel='default', ylabel='ROG-Net_Worth_perc'>
In [228]: 
# As per Regression Feature Elimination(RFE) we got Top important Variables and 1 Target Va
df_imp =pd.DataFrame(df_new,columns=['Net_Sales','Value_Of_Output', 'PBIDT', 'PBDT', 'PBIT'
, 'ROG-Net_Worth_perc', 'ROG-Capital_Employed_perc', 'ROG-Total_Assets_perc', 'Current_Rati
'Inventory_Ratio[Latest]', 'Total_Asset_Turnover_Ratio[Latest]', 'Interest_Cover_Ratio[Late
'Value_of_Output_to_Gross_Block', 'default'])
Correlation Heatmap
In [230]: 
plt.figure(figsize = (12,8))
cor_matrix = df_imp.corr()
sns.heatmap(cor_matrix, cmap = 'plasma', vmin = -1, vmax= 1)
Out[230]:
<AxesSubplot:>
- Please refer Business Report for Inference
Q 1.5 Split the data into Train and Test dataset in a ratio of 67:33
Split the data into Train and Test dataset in a ratio of 67:33 and use random_state =42.
Model Building is to be done on Train Dataset and Model Validation is to be done on Test Dataset.
In [108]: 
X = df_new.drop(['default','Networth_Next_Year'], axis=1)
y = df_new['default']
In [119]: 
X.head()
Out[119]:
Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_ Net_Working_Capital
0 43.17 -166.22 -320.90 180.83 328.88 -89.4
1 43.17 -166.22 555.11 180.83 328.88 -89.4
2 43.17 287.41 555.11 180.83 328.88 151.5
3 43.17 -166.22 555.11 180.83 328.88 -89.4
4 43.17 -166.22 555.11 180.83 328.88 151.5
In [120]: 
y.head()
Out[120]:
0 1
1 1
2 1
3 1
4 1
Name: default, dtype: int32
In [109]: 
#Split X and y into training and test set in 67:33 ratio

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42,stra
In [110]: 
from sklearn.preprocessing import StandardScaler

ss=StandardScaler()
# we are scaling the data for ANN. Without scaling it will give very poor results. Computat
X_train_scaled=ss.fit_transform(X_train)
X_test_scaled=ss.transform(X_test)
In [200]: 
X_train_scaled.shape
Out[200]:
(2402, 64)
In [198]: 
X_train.shape
Out[198]:
(2402, 64)
In [112]: 
# scaled Train data first 5 rows

X_train.head()
Out[112]:
Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_ Net_Working_Cap
842 3.25 3.49 3.54 0.05 1.53
1057 4.75 5.29 5.39 0.02 0.75
1595 13.06 13.50 13.50 0.00 0.00 1
100 43.17 -60.79 -58.30 2.46 18.66 -8
1191 4.00 5.95 20.62 14.33 7.76
In [113]: 
# scaled Test data first 5 rows

X_test.head()
Out[113]:
Equity_Paid_Up Networth Capital_Employed Total_Debt Gross_Block_ Net_Working_Cap
251 2.19 -5.58 -1.55 4.03 1.07
3493 43.17 287.41 555.11 180.83 328.88 15
3063 32.33 287.41 350.02 0.00 5.49
2384 3.01 63.35 90.88 25.74 31.22 6
1679 7.44 14.27 14.27 0.00 13.87
In [114]: 
X_train.columns
Out[114]:
Index(['Equity_Paid_Up', 'Networth', 'Capital_Employed', 'Total_Debt',
'Gross_Block_', 'Net_Working_Capital_', 'Current_Assets_',
'Current_Liabilities_and_Provisions_', 'Total_Assets_to_Liabilities
_',
'Gross_Sales', 'Net_Sales', 'Other_Income', 'Value_Of_Output',
'Cost_of_Production', 'Selling_Cost', 'PBIDT', 'PBDT', 'PBIT', 'PBT',
'PAT', 'Adjusted_PAT', 'CP', 'Revenue_earnings_in_forex',
'Revenue_expenses_in_forex', 'Capital_expenses_in_forex',
'Book_Value_Unit_Curr', 'Book_Value_Adj._Unit_Curr',
'Market_Capitalisation', 'CEPS_annualised_Unit_Curr',
'Cash_Flow_From_Operating_Activities',
'Cash_Flow_From_Investing_Activities',
'Cash_Flow_From_Financing_Activities', 'ROG-Net_Worth_perc',
'ROG-Capital_Employed_perc', 'ROG-Gross_Block_perc',
'ROG-Gross_Sales_perc', 'ROG-Net_Sales_perc',
'ROG-Cost_of_Production_perc', 'ROG-Total_Assets_perc',
'ROG-PBIDT_perc', 'ROG-PBDT_perc', 'ROG-PBIT_perc', 'ROG-PBT_perc',
'ROG-PAT_perc', 'ROG-CP_perc', 'ROG-Revenue_earnings_in_forex_perc',
'ROG-Revenue_expenses_in_forex_perc', 'ROG-Market_Capitalisation_per
c',
'Current_Ratio[Latest]', 'Fixed_Assets_Ratio[Latest]',
'Inventory_Ratio[Latest]', 'Debtors_Ratio[Latest]',
'Total_Asset_Turnover_Ratio[Latest]', 'Interest_Cover_Ratio[Latest]',
'PBIDTM_perc[Latest]', 'PBITM_perc[Latest]', 'PBDTM_perc[Latest]',
'CPM_perc[Latest]', 'APATM_perc[Latest]', 'Debtors_Velocity_Days',
'Creditors_Velocity_Days', 'Inventory_Velocity_Days',
'Value_of_Output_to_Total_Assets', 'Value_of_Output_to_Gross_Block'],
dtype='object')
In [152]: 
y_train.value_counts(1)
Out[152]:
0 0.89
1 0.11
In [153]: 
y_test.value_counts(1)
Out[153]:
0 0.89
1 0.11

In [121]: 
#Number of rows and columns of the training set for the independent variables:
X_train.shape[0]
Out[121]:
2402
In [122]: 
#Number of rows and columns of the test set for the independent variables:
X_test.shape[0]
Out[122]:
1184
Q 1.6 Build Logistic Regression Model (using statsmodel library) on most important variables on Train
Dataset and choose the optimum cutoff. Also showcase your model building approach
Model Building using Logistic Regression for 'Predicting Credit Risk'
The equation of the Logistic Regression by which we predict the corresponding probabilities and then
go on predict a discrete target variable is
1 −𝑧
y= 1+𝑒
Note: z = 𝛽0 +∑𝑛𝑖=1 (𝛽𝑖 𝑋1 )
Now, Importing statsmodels modules
In [124]: 
from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression
import statsmodels.formula.api as SM
For modeling we will use Logistic Regression with recursive feature elimination
In [127]: 
LogR = LogisticRegression()
In [133]: 
#We using Regression Feature Elimination Total number of features are 64 out of we will be
selector = RFE(estimator = LogR, n_features_to_select=21, step=1)
In [129]: 
selector = selector.fit(X_train, y_train)
In [130]: 
selector.n_features_
Out[130]:
21
In [131]: 
selector.ranking_
Out[131]:
array([ 5, 33, 16, 7, 35, 10, 19, 29, 17, 14, 1, 4, 1, 8, 21, 1, 1,
1, 15, 25, 36, 20, 1, 1, 42, 1, 1, 11, 9, 34, 6, 30, 1, 1,
3, 13, 12, 18, 1, 31, 23, 32, 41, 40, 26, 44, 43, 28, 1, 1, 1,
37, 1, 1, 2, 1, 24, 27, 1, 22, 39, 38, 1, 1])

In [132]: 
df = pd.DataFrame({'Feature': scaled_predictors.columns, 'Rank': selector.ranking_})

df[df['Rank'] == 1]
Out[132]:
Feature Rank
10 Net_Sales 1
12 Value_Of_Output 1
15 PBIDT 1
16 PBDT 1
17 PBIT 1
22 Revenue_earnings_in_forex 1
23 Revenue_expenses_in_forex 1
25 Book_Value_Unit_Curr 1
26 Book_Value_Adj._Unit_Curr 1
32 ROG-Net_Worth_perc 1
33 ROG-Capital_Employed_perc 1
38 ROG-Total_Assets_perc 1
48 Current_Ratio[Latest] 1
49 Fixed_Assets_Ratio[Latest] 1
50 Inventory_Ratio[Latest] 1
52 Total_Asset_Turnover_Ratio[Latest] 1
53 Interest_Cover_Ratio[Latest] 1
55 PBITM_perc[Latest] 1
58 APATM_perc[Latest] 1
62 Value_of_Output_to_Total_Assets 1
63 Value_of_Output_to_Gross_Block 1
In [154]: 
from sklearn.model_selection import GridSearchCV
In [155]: 
grid={'penalty':['l2','none'],
'solver':['sag','lbfgs'],
'tol':[0.0001,0.00001]}
In [156]: 
model = LogisticRegression(max_iter=10000,n_jobs=2)
In [157]: 
grid_search = GridSearchCV(estimator = model, param_grid = grid, cv = 3,n_jobs=-1,scoring='
In [158]: 
grid_search.fit(X_train, y_train)
Out[158]:
GridSearchCV(cv=3, estimator=LogisticRegression(max_iter=10000, n_jobs=2),
n_jobs=-1,
param_grid={'penalty': ['l2', 'none'], 'solver': ['sag', 'lbfg

s'],
'tol': [0.0001, 1e-05]},
scoring='f1')
In [159]: 
print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)
{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}
LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')
In [160]: 
best_model = grid_search.best_estimator_
In [161]: 
# Prediction on the training set
ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)
In [162]: 
## Getting the probabilities on the test set
ytest_predict_prob=best_model.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[162]:
0 1
0 0.42 0.58
1 1.00 0.00
2 1.00 0.00
3 0.99 0.01
4 1.00 0.00
Q 1.7 Validate the Model on Test Dataset and state the performance matrices. Also state interpretation
from the model
In [164]: 
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,

In [165]: 
## Confusion matrix on the training data
plot_confusion_matrix(best_model,X_train,y_train)
print(classification_report(y_train, ytrain_predict),'\n');
precision recall f1-score support
0 0.97 0.99 0.98 2142
1 0.88 0.75 0.81 260
accuracy 0.96 2402
macro avg 0.93 0.87 0.89 2402
weighted avg 0.96 0.96 0.96 2402
In [166]: 
## Confusion matrix on the test data
plot_confusion_matrix(best_model,X_test,y_test)
print(classification_report(y_test, ytest_predict),'\n');
0 0.96 0.98 0.97 1056
1 0.81 0.70 0.75 128
accuracy 0.95 1184
macro avg 0.89 0.84 0.86 1184
weighted avg 0.95 0.95 0.95 1184
We can select other parameters to perform GridSearchCV and try optimize the desired parameter.
We see not good recall score for both train and test
Since only 11% of the total data had defaults, we will now try to balance the data before fiting the model.
Imblearn over_sampling
In [176]: 
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=33)
X_res, y_res = sm.fit_resample(X_train, y_train)
In [177]: 
selector_smote = selector.fit(X_res, y_res)
In [178]: 
selector_smote.n_features_
Out[178]:
21
In [179]: 
pred_train_smote = selector_smote.predict(X_res)
pred_test_smote = selector_smote.predict(X_test)
In [194]: 
plot_confusion_matrix(selector_smote,X_train,y_train)
print(classification_report(y_res, pred_train_smote),'\n');
0 0.95 0.92 0.93 2142
1 0.92 0.95 0.94 2142
accuracy 0.93 4284
macro avg 0.93 0.93 0.93 4284
weighted avg 0.93 0.93 0.93 4284
In [191]: 
plot_confusion_matrix(selector_smote,X_test,y_test)
print(classification_report(y_test, pred_test_smote),'\n');
0 0.99 0.92 0.95 1056
1 0.58 0.93 0.71 128
accuracy 0.92 1184
macro avg 0.78 0.92 0.83 1184
weighted avg 0.95 0.92 0.93 1184
Finally, we are able to achieve a descent recall value without overfitting. Considering the opportunities such as
outliers, missing values and correlated features this is a fairly good model. It can be improved if we get better
quality data where the features explaining the default are not missing to this extent. Of course we can try other
techniques which are not sensitive towards missing values and outliers.
Please refer Business Report for Interpretation
THE END!
In [ ]: 

FRA Milestone 1 Jupyter Notebook PDF

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

FRA Milestone 1 Jupyter Notebook PDF

Uploaded by

Copyright:

Available Formats

FRA PROJECT (MILESTONE - 1)

Predicting Credit Risk ¶

# Importing the libraries

'C:\\Users\\Hakers Zone\\Desktop\\Data Science\\Great Lakes\\Core Course fil

Importing the dataset

Networth Equity Net

0 16974 Hind.Cables -8021.60 419.36 -7027.48 -1007.24 5936.03 474.30 -1076.34

3 2439 GTL -3054.51 157.30 -623.49 2353.88 2326.05 1033.69 -2612.42 1

Fixing messy column names (containing spaces) for ease of use

df.columns = df.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '').str

#### Checking top 5 rows again

Co_Code Co_Name Networth_Next_Year Equity_Paid_Up Networth Capital_Employed Tot

0 16974 Hind.Cables -8021.60 419.36 -7027.48 -1007.24

3 2439 GTL -3054.51 157.30 -623.49 2353.88

print('The number of rows (observations) is =',df.shape[0],'\n''The number of columns (vari

The number of rows (observations) is = 3586

The number of columns (variables) is = 67

Checking datatype of all columns

RangeIndex: 3586 entries, 0 to 3585

Data columns (total 67 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Co_Code 3586 non-null int64

1 Co_Name 3586 non-null object

2 Networth_Next_Year 3586 non-null float64

3 Equity_Paid_Up 3586 non-null float64

4 Networth 3586 non-null float64

5 Capital_Employed 3586 non-null float64

6 Total_Debt 3586 non-null float64

7 Gross_Block_ 3586 non-null float64

8 Net_Working_Capital_ 3586 non-null float64

9 Current_Assets_ 3586 non-null float64

10 Current_Liabilities_and_Provisions_ 3586 non-null float64

11 Total_Assets_to_Liabilities_ 3586 non-null float64

12 Gross_Sales 3586 non-null float64

13 Net_Sales 3586 non-null float64

14 Other_Income 3586 non-null float64

15 Value_Of_Output 3586 non-null float64

16 Cost_of_Production 3586 non-null float64

17 Selling_Cost 3586 non-null float64

18 PBIDT 3586 non-null float64

19 PBDT 3586 non-null float64

20 PBIT 3586 non-null float64

21 PBT 3586 non-null float64

22 PAT 3586 non-null float64

23 Adjusted_PAT 3586 non-null float64

24 CP 3586 non-null float64

25 Revenue_earnings_in_forex 3586 non-null float64

26 Revenue_expenses_in_forex 3586 non-null float64

27 Capital_expenses_in_forex 3586 non-null float64

28 Book_Value_Unit_Curr 3586 non-null float64

29 Book_Value_Adj._Unit_Curr 3582 non-null float64

30 Market_Capitalisation 3586 non-null float64

31 CEPS_annualised_Unit_Curr 3586 non-null float64

32 Cash_Flow_From_Operating_Activities 3586 non-null float64

33 Cash_Flow_From_Investing_Activities 3586 non-null float64

34 Cash_Flow_From_Financing_Activities 3586 non-null float64

35 ROG-Net_Worth_perc 3586 non-null float64

36 ROG-Capital_Employed_perc 3586 non-null float64

37 ROG-Gross_Block_perc 3586 non-null float64

38 ROG-Gross_Sales_perc 3586 non-null float64

39 ROG-Net_Sales_perc 3586 non-null float64