Professional Documents
Culture Documents
FRA Milestone 1 Jupyter Notebook PDF
FRA Milestone 1 Jupyter Notebook PDF
Problem Statement
Businesses or companies can fall prey to default if they are not able to keep up their debt obligations. Defaults
will lead to a lower credit rating for the company which in turn reduces its chances of getting credit in the future
and may have to pay higher interests on existing debts as well as any new obligations. From an investor's point
of view, he would want to invest in a company if it is capable of handling its financial obligations, can grow
quickly, and is able to manage the growth scale.
A balance sheet is a financial statement of a company that provides a snapshot of what a company owns,
owes, and the amount invested by the shareholders. Thus, it is an important tool that helps evaluate the
performance of a business.
Data that is available includes information from the financial statement of the companies for the previous year
(2015). Also, information about the Networth of the company in the following year (2016) is provided which can
be used to drive the labeled field.
Explanation of data fields available in Data Dictionary, 'Credit Default Data Dictionary.xlsx'
In [175]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import os
In [3]:
os.getcwd()
Out[3]:
In [4]:
df = pd.read_excel('Company_Data2015-1.xlsx')
In [5]:
#Glimpse of Data
df.head()
Out[5]:
Tata Tele.
1 21214 -3986.19 1954.93 -2968.08 4458.20 7410.18 9070.86 -1098.88
Mah.
ABG
2 14852 -3192.58 53.84 506.86 7714.68 6944.54 1281.54 4496.25 9
Shipyard
Bharati
4 23505 -2967.36 50.30 -1070.83 4675.33 5740.90 1084.20 1836.23 4
Defence
5 rows × 67 columns
In [6]:
Out[7]:
Tata Tele.
1 21214 -3986.19 1954.93 -2968.08 4458.20
Mah.
ABG
2 14852 -3192.58 53.84 506.86 7714.68
Shipyard
Bharati
4 23505 -2967.36 50.30 -1070.83 4675.33
Defence
5 rows × 67 columns
In [69]:
df.dtypes.value_counts()
Out[69]:
float64 63
int64 3
object 1
dtype: int64
Now, let us check the number of rows (observations) and the number of columns (variables)
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Now, let us check the basic measures of descriptive statistics for the continuous variables
In [10]:
pd.options.display.float_format = '{:.2f}'.format
df.describe().T
Out[10]:
66 rows × 8 columns
In [11]:
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]
Out[11]:
0 rows × 67 columns
In [12]:
In [13]:
df1.head()
Out[13]:
5 rows × 65 columns
In [14]:
df1.shape
Out[14]:
(3586, 65)
In [15]:
Outlier treatment :
#Define a function which returns the Upper and Lower limit to detect outliers for each feature
In [16]:
def mod_outlier(df1):
df1 = df1._get_numeric_data()
q1 = df1.quantile(0.25)
q3 = df1.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 -(1.5 * iqr)
upper_bound = q3 +(1.5 * iqr)
In [17]:
df_new = mod_outlier(df1)
In [18]:
In [19]:
Out[19]:
Inventory_Velocity_Days 103
Book_Value_Adj._Unit_Curr 4
Current_Ratio[Latest] 1
PBITM_perc[Latest] 1
Fixed_Assets_Ratio[Latest] 1
Inventory_Ratio[Latest] 1
Debtors_Ratio[Latest] 1
Total_Asset_Turnover_Ratio[Latest] 1
PBIDTM_perc[Latest] 1
Interest_Cover_Ratio[Latest] 1
PBDTM_perc[Latest] 1
CPM_perc[Latest] 1
APATM_perc[Latest] 1
dtype: int64
In [21]:
df_new.isnull().sum().sum()
Out[21]:
118
In [97]:
df_new.size
Out[97]:
236676
In [20]:
(array([27, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 62], dtype=int64),)
df_new.fillna(df.median())
Out[22]:
In [23]:
df_new.isnull().any().any()
Out[23]:
True
plt.figure(figsize = (12,8))
sns.heatmap(df_new.isnull(), cbar = False, cmap = 'coolwarm', yticklabels = False)
plt.show()
In [25]:
cat=[]
num=[]
for i in df_new.columns:
if df_new[i].dtype=="object":
cat.append(i)
else:
num.append(i)
print(cat)
print(num)
[]
In [26]:
##Lets treat these missing values with median (replacement with median eliminates impact of
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
df_new = pd.DataFrame(imputer.fit_transform(df_new))
df_new.columns=num
#df_new =df_fra[cat]
df_new.head()
Out[26]:
5 rows × 65 columns
In [27]:
df_new.isnull().any().any()
Out[27]:
False
In [42]:
df_new.isnull().sum().sort_values(ascending=False).head(20)
Out[42]:
Networth_Next_Year 0
Current_Ratio[Latest] 0
ROG-Gross_Block_perc 0
ROG-Gross_Sales_perc 0
ROG-Net_Sales_perc 0
ROG-Cost_of_Production_perc 0
ROG-Total_Assets_perc 0
ROG-PBIDT_perc 0
ROG-PBDT_perc 0
ROG-PBIT_perc 0
ROG-PBT_perc 0
ROG-PAT_perc 0
ROG-CP_perc 0
ROG-Revenue_earnings_in_forex_perc 0
ROG-Revenue_expenses_in_forex_perc 0
ROG-Market_Capitalisation_perc 0
Fixed_Assets_Ratio[Latest] 0
Equity_Paid_Up 0
Inventory_Ratio[Latest] 0
Debtors_Ratio[Latest] 0
dtype: int64
Q 1.3 Transform Target variable into 0 and 1 ( Creating a binary target variable using
'Networth_Next_Year')
In [29]:
In [30]:
df_new[['default','Networth_Next_Year']].head(5)
Out[30]:
default Networth_Next_Year
0 1 -175.74
1 1 -175.74
2 1 -175.74
3 1 -175.74
4 1 -175.74
df_new[['default','Networth_Next_Year']].tail(5)
Out[31]:
default Networth_Next_Year
3581 0 303.53
3582 0 303.53
3583 0 303.53
3584 0 303.53
3585 0 303.53
In [32]:
df_new['default'].value_counts()
Out[32]:
0 3198
1 388
In [34]:
df_new['default'].value_counts(normalize = True)
Out[34]:
0 0.89
1 0.11
Q 1.4 Univariate & Bivariate analysis with proper interpretation. (You may choose to include only those
variables which were significant in the model building)
Univariate analysis :
In [53]:
pd.options.display.float_format = '{:.2f}'.format
df_new.describe().T
Out[53]:
66 rows × 8 columns
plt.figure(figsize=(25,10))
sns.boxplot(data=df_new)
plt.xlabel("Variables")
plt.xticks(rotation=90)
plt.ylabel("Density")
plt.title('Figure:Boxplot of few important features')
Out[201]:
15 Significate Scaled Feature Variables - Distribution of column with Displot & Box plot:
In [233]:
Out[233]:
<AxesSubplot:xlabel='ROG-Capital_Employed_perc'>
In [238]:
Out[238]:
<AxesSubplot:xlabel='Current_Ratio[Latest]'>
In [243]:
Out[243]:
<AxesSubplot:xlabel='Book_Value_Adj._Unit_Curr'>
- Please refer Business Report for Inference
In [203]:
Out[203]:
Bivariate Analysis
sns.jointplot(x = df_new['Gross_Sales'],
y = df_new['Net_Sales'])
Out[254]:
<seaborn.axisgrid.JointGrid at 0x212bdc14790>
sns.jointplot(x = df_new['Networth'],
y = df1['Capital_Employed'])
Out[253]:
<seaborn.axisgrid.JointGrid at 0x212be1238b0>
As the capital increases, net worth also increases, but in some cases, capital seems to be disbursed even
for lesser networth.
sns.jointplot(x = df_new['Networth'],
y = df1['Cost_of_Production'])
Out[255]:
<seaborn.axisgrid.JointGrid at 0x212a64b1fa0>
This plot is scattered and there exists no such relationship between these two variables.
In [208]:
sns.boxplot(x = df_new['default'],
y = df1['ROG-Net_Worth_perc'])
Out[208]:
<AxesSubplot:xlabel='default', ylabel='ROG-Net_Worth_perc'>
In [228]:
# As per Regression Feature Elimination(RFE) we got Top important Variables and 1 Target Va
df_imp =pd.DataFrame(df_new,columns=['Net_Sales','Value_Of_Output', 'PBIDT', 'PBDT', 'PBIT'
, 'ROG-Net_Worth_perc', 'ROG-Capital_Employed_perc', 'ROG-Total_Assets_perc', 'Current_Rati
'Inventory_Ratio[Latest]', 'Total_Asset_Turnover_Ratio[Latest]', 'Interest_Cover_Ratio[Late
'Value_of_Output_to_Gross_Block', 'default'])
Correlation Heatmap
In [230]:
plt.figure(figsize = (12,8))
cor_matrix = df_imp.corr()
sns.heatmap(cor_matrix, cmap = 'plasma', vmin = -1, vmax= 1)
Out[230]:
<AxesSubplot:>
Q 1.5 Split the data into Train and Test dataset in a ratio of 67:33
Split the data into Train and Test dataset in a ratio of 67:33 and use random_state =42.
Model Building is to be done on Train Dataset and Model Validation is to be done on Test Dataset.
In [108]:
X = df_new.drop(['default','Networth_Next_Year'], axis=1)
y = df_new['default']
In [119]:
X.head()
Out[119]:
5 rows × 64 columns
In [120]:
y.head()
Out[120]:
0 1
1 1
2 1
3 1
4 1
In [109]:
In [110]:
X_train_scaled.shape
Out[200]:
(2402, 64)
In [198]:
X_train.shape
Out[198]:
(2402, 64)
In [112]:
Out[112]:
5 rows × 64 columns
In [113]:
Out[113]:
5 rows × 64 columns
In [114]:
X_train.columns
Out[114]:
'Current_Liabilities_and_Provisions_', 'Total_Assets_to_Liabilities
_',
'Revenue_expenses_in_forex', 'Capital_expenses_in_forex',
'Book_Value_Unit_Curr', 'Book_Value_Adj._Unit_Curr',
'Market_Capitalisation', 'CEPS_annualised_Unit_Curr',
'Cash_Flow_From_Operating_Activities',
'Cash_Flow_From_Investing_Activities',
'Cash_Flow_From_Financing_Activities', 'ROG-Net_Worth_perc',
'ROG-Capital_Employed_perc', 'ROG-Gross_Block_perc',
'ROG-Gross_Sales_perc', 'ROG-Net_Sales_perc',
'ROG-Cost_of_Production_perc', 'ROG-Total_Assets_perc',
'ROG-Revenue_expenses_in_forex_perc', 'ROG-Market_Capitalisation_per
c',
'Current_Ratio[Latest]', 'Fixed_Assets_Ratio[Latest]',
'Inventory_Ratio[Latest]', 'Debtors_Ratio[Latest]',
'Total_Asset_Turnover_Ratio[Latest]', 'Interest_Cover_Ratio[Latest]',
'Creditors_Velocity_Days', 'Inventory_Velocity_Days',
'Value_of_Output_to_Total_Assets', 'Value_of_Output_to_Gross_Block'],
dtype='object')
In [152]:
y_train.value_counts(1)
Out[152]:
0 0.89
1 0.11
In [153]:
y_test.value_counts(1)
Out[153]:
0 0.89
1 0.11
#Number of rows and columns of the training set for the independent variables:
X_train.shape[0]
Out[121]:
2402
In [122]:
#Number of rows and columns of the test set for the independent variables:
X_test.shape[0]
Out[122]:
1184
Q 1.6 Build Logistic Regression Model (using statsmodel library) on most important variables on Train
Dataset and choose the optimum cutoff. Also showcase your model building approach
The equation of the Logistic Regression by which we predict the corresponding probabilities and then
go on predict a discrete target variable is
1 −𝑧
y= 1+𝑒
Note: z = 𝛽0 +∑𝑛𝑖=1 (𝛽𝑖 𝑋1 )
Now, Importing statsmodels modules
In [124]:
For modeling we will use Logistic Regression with recursive feature elimination
In [127]:
LogR = LogisticRegression()
In [133]:
#We using Regression Feature Elimination Total number of features are 64 out of we will be
selector = RFE(estimator = LogR, n_features_to_select=21, step=1)
In [129]:
In [130]:
selector.n_features_
Out[130]:
21
In [131]:
selector.ranking_
Out[131]:
array([ 5, 33, 16, 7, 35, 10, 19, 29, 17, 14, 1, 4, 1, 8, 21, 1, 1,
3, 13, 12, 18, 1, 31, 23, 32, 41, 40, 26, 44, 43, 28, 1, 1, 1,
Out[132]:
Feature Rank
10 Net_Sales 1
12 Value_Of_Output 1
15 PBIDT 1
16 PBDT 1
17 PBIT 1
22 Revenue_earnings_in_forex 1
23 Revenue_expenses_in_forex 1
25 Book_Value_Unit_Curr 1
26 Book_Value_Adj._Unit_Curr 1
32 ROG-Net_Worth_perc 1
33 ROG-Capital_Employed_perc 1
38 ROG-Total_Assets_perc 1
48 Current_Ratio[Latest] 1
49 Fixed_Assets_Ratio[Latest] 1
50 Inventory_Ratio[Latest] 1
52 Total_Asset_Turnover_Ratio[Latest] 1
53 Interest_Cover_Ratio[Latest] 1
55 PBITM_perc[Latest] 1
58 APATM_perc[Latest] 1
62 Value_of_Output_to_Total_Assets 1
63 Value_of_Output_to_Gross_Block 1
In [154]:
In [155]:
grid={'penalty':['l2','none'],
'solver':['sag','lbfgs'],
'tol':[0.0001,0.00001]}
In [156]:
model = LogisticRegression(max_iter=10000,n_jobs=2)
In [157]:
In [158]:
grid_search.fit(X_train, y_train)
Out[158]:
n_jobs=-1,
scoring='f1')
In [159]:
print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)
In [160]:
best_model = grid_search.best_estimator_
In [161]:
ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)
In [162]:
ytest_predict_prob=best_model.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[162]:
0 1
0 0.42 0.58
1 1.00 0.00
2 1.00 0.00
3 0.99 0.01
4 1.00 0.00
Q 1.7 Validate the Model on Test Dataset and state the performance matrices. Also state interpretation
from the model
In [164]:
plot_confusion_matrix(best_model,X_train,y_train)
print(classification_report(y_train, ytrain_predict),'\n');
In [166]:
plot_confusion_matrix(best_model,X_test,y_test)
print(classification_report(y_test, ytest_predict),'\n');
We can select other parameters to perform GridSearchCV and try optimize the desired parameter.
We see not good recall score for both train and test
Since only 11% of the total data had defaults, we will now try to balance the data before fiting the model.
Imblearn over_sampling
In [176]:
In [178]:
selector_smote.n_features_
Out[178]:
21
In [179]:
pred_train_smote = selector_smote.predict(X_res)
pred_test_smote = selector_smote.predict(X_test)
In [194]:
plot_confusion_matrix(selector_smote,X_train,y_train)
print(classification_report(y_res, pred_train_smote),'\n');
In [191]:
plot_confusion_matrix(selector_smote,X_test,y_test)
print(classification_report(y_test, pred_test_smote),'\n');
Finally, we are able to achieve a descent recall value without overfitting. Considering the opportunities such as
outliers, missing values and correlated features this is a fairly good model. It can be improved if we get better
quality data where the features explaining the default are not missing to this extent. Of course we can try other
techniques which are not sensitive towards missing values and outliers.
THE END!
In [ ]: