Credit Default Practice PDF
Default Data
Description:
The dataset contains information on default payments and company details of companies in India as of May 2019.
In [1]:
import warnings
warnings.filterwarnings("ignore")
# Imports assumed by the rest of the notebook (their original import cell was not in the export)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
In [2]:
Company = pd.read_csv('Company_Practice.csv')
#Glimpse of Data
Company.head()
Out[2]:
[First five rows of the wide DataFrame; column headers and most columns truncated in the export]
5 rows × 76 columns
In [3]:
In [4]:
Company.head()
Out[4]:
5 rows × 76 columns
Now, let us check the number of rows (observations) and the number of columns (variables)
In [5]:
In [6]:
Company.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2409 entries, 0 to 2408
Data columns (total 76 columns):
Num 2409 non-null int64
Networth_Next_Year 2409 non-null float64
default 2409 non-null int64
Total_assets 2409 non-null float64
Net_worth 2409 non-null float64
Total_Income 2409 non-null float64
Total_Income_by_Total_assets 2409 non-null float64
Change_in_stock 2409 non-null float64
Change_in_stock_by_Total_Income 2409 non-null float64
Total_expenses 2400 non-null float64
Total_expenses_by_Total_Income 2409 non-null float64
Profit_after_tax 2409 non-null float64
Profit_after_tax_by_Total_assets 2409 non-null float64
PBDITA 2397 non-null float64
PBDITA_by_Total_assets 2409 non-null float64
PBT 2409 non-null float64
PBT_by_Total_assets 2399 non-null float64
Cash_profit 2409 non-null float64
Cash_profit_by_Total_assets 2409 non-null float64
PBDITA_as_perc_of_total_income 2401 non-null float64
PBT_as_perc_of_total_income 2409 non-null float64
PAT_as_perc_of_total_income 2409 non-null float64
Cash_profit_as_perc_of_total_income 2409 non-null float64
PAT_as_perc_of_net_worth 2409 non-null float64
Sales 2409 non-null float64
Sales_by_Total_assets 2409 non-null float64
Income_from_financial_services 2398 non-null float64
Income_from_financial_services_by_Total_Income 2409 non-null float64
Other_income 2409 non-null float64
Other_income_by_Total_Income 2409 non-null float64
Total_capital 2409 non-null float64
Total_capital_by_Total_Assets 2409 non-null float64
Reserves_and_funds 2409 non-null float64
Reserves_and_funds_by_Total_Assets 2399 non-null float64
Borrowings 2409 non-null float64
Borrowings_by_Total_Assets 2409 non-null float64
Current_liabilities_and_provisions 2409 non-null float64
Current_liabilities_and_provisions_by_Total_assets 2409 non-null float64
Deferred_tax_liability 2409 non-null float64
Deferred_tax_liability_by_Total_Assets 2409 non-null float64
Shareholders_funds 2409 non-null float64
Shareholders_funds_by_Total_assets 2409 non-null float64
Cumulative_retained_profits 2409 non-null float64
Cumulative_retained_profits_by_Total_Income 2409 non-null float64
Capital_employed 2409 non-null float64
Capital_employed_by_Total_assets 2409 non-null float64
TOL_by_TNW 2409 non-null float64
Total_term_liabilities__by__tangible_net_worth 2371 non-null float64
Contingent_liabilities__by__Net_worth_perc 2409 non-null float64
Contingent_liabilities 2409 non-null float64
Contingent_liabilities_by_Total_Assets 2409 non-null float64
Net_fixed_assets 2409 non-null float64
Net_fixed_assets_by_Total_Assets 2409 non-null float64
Investments 2384 non-null float64
Investments_by_Total_Income 2409 non-null float64
Current_assets 2409 non-null float64
Current_assets_by_Total_Assets 2409 non-null float64
Net_working_capital 2409 non-null float64
Net_working_capital_by_Total_Capital 2409 non-null float64
Quick_ratio_times 2409 non-null float64
Current_ratio_times 2409 non-null float64
Debt_to_equity_ratio_times 2409 non-null float64
Cash_to_current_liabilities_times 2409 non-null float64
Cash_to_average_cost_of_sales_per_day 2409 non-null float64
Creditors_turnover 2409 non-null float64
Debtors_turnover 2409 non-null float64
Finished_goods_turnover 2388 non-null float64
WIP_turnover 2409 non-null float64
Raw_material_turnover 2409 non-null float64
Shares_outstanding 2409 non-null float64
Equity_face_value 2409 non-null float64
EPS 2409 non-null float64
Adjusted_EPS 2409 non-null float64
Total_liabilities 2409 non-null float64
PE_on_BSE 2409 non-null float64
Dev_model 2409 non-null int64
dtypes: float64(73), int64(3)
memory usage: 1.4 MB
Now, let us check the basic measures of descriptive statistics for the continuous variables
In [7]:
Company.describe()
Out[7]:
8 rows × 76 columns
In [8]:
Company['default'].value_counts()
Out[8]:
0 2240
1 169
Name: default, dtype: int64
In [9]:
169/(2240+169)
Out[9]:
0.0701535907015359
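The same proportion can be read off directly with value_counts(normalize=True); a minimal sketch reconstructing the class counts seen above:

```python
import pandas as pd

# Rebuild the class counts from Out[8]: 2240 non-defaults, 169 defaults
default = pd.Series([0] * 2240 + [1] * 169, name="default")

# normalize=True returns class proportions instead of raw counts
rates = default.value_counts(normalize=True)
print(rates[1])  # ~0.0702, the default rate computed by hand above
```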
In [10]:
Company['default'].describe()
Out[10]:
count 2409.000000
mean 0.070154
std 0.255459
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: default, dtype: float64
In [11]:
Company.isnull().sum()
Out[11]:
Num 0
Networth_Next_Year 0
default 0
Total_assets 0
Net_worth 0
Total_Income 0
Total_Income_by_Total_assets 0
Change_in_stock 0
Change_in_stock_by_Total_Income 0
Total_expenses 9
Total_expenses_by_Total_Income 0
Profit_after_tax 0
Profit_after_tax_by_Total_assets 0
PBDITA 12
PBDITA_by_Total_assets 0
PBT 0
PBT_by_Total_assets 10
Cash_profit 0
Cash_profit_by_Total_assets 0
PBDITA_as_perc_of_total_income 8
PBT_as_perc_of_total_income 0
PAT_as_perc_of_total_income 0
Cash_profit_as_perc_of_total_income 0
PAT_as_perc_of_net_worth 0
Sales 0
Sales_by_Total_assets 0
Income_from_financial_services 11
Income_from_financial_services_by_Total_Income 0
Other_income 0
Other_income_by_Total_Income 0
..
TOL_by_TNW 0
Total_term_liabilities__by__tangible_net_worth 38
Contingent_liabilities__by__Net_worth_perc 0
Contingent_liabilities 0
Contingent_liabilities_by_Total_Assets 0
Net_fixed_assets 0
Net_fixed_assets_by_Total_Assets 0
Investments 25
Investments_by_Total_Income 0
Current_assets 0
Current_assets_by_Total_Assets 0
Net_working_capital 0
Net_working_capital_by_Total_Capital 0
Quick_ratio_times 0
Current_ratio_times 0
Debt_to_equity_ratio_times 0
Cash_to_current_liabilities_times 0
Cash_to_average_cost_of_sales_per_day 0
Creditors_turnover 0
Debtors_turnover 0
Finished_goods_turnover 21
WIP_turnover 0
Raw_material_turnover 0
Shares_outstanding 0
Equity_face_value 0
EPS 0
Adjusted_EPS 0
Total_liabilities 0
PE_on_BSE 0
Dev_model 0
Length: 76, dtype: int64
In [12]:
np.where(Company.isnull().sum() > 0)  # restored (original cell code lost): indices of columns with missing values
Out[12]:
(array([ 9, 13, 16, 19, 26, 33, 47, 53, 66], dtype=int64),)
In [13]:
Company.iloc[:,66].isnull().sum()
Out[13]:
21
There are missing values in the dataset.
Let us treat these missing values with the median (median imputation is robust to outliers, unlike mean imputation).
In [14]:
# SimpleImputer (median strategy) and the column list are restored here; their defining cell was lost in the export
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
col = Company.columns
Company = pd.DataFrame(imputer.fit_transform(Company))
Company.columns = col
Company.head()
Out[14]:
5 rows × 76 columns
In [15]:
col_names = list(Company.columns)
col_names.remove('Num')
fig, ax = plt.subplots(len(col_names), figsize=(8, 100))
for i, col_val in enumerate(col_names):   # loop restored; it was lost in the export
    sns.boxplot(y=Company[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
plt.show()
There are outliers in the dataset; let us use the capping method to treat them.
In [16]:
def check_outlier(col):
    # Tukey's IQR rule: values beyond 1.5*IQR from the quartiles are outliers
    Q1, Q3 = col.quantile([.25, .75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
In [17]:
check_outlier(Company['Networth_Next_Year'])
Out[17]:
(-615.2, 1118.4)
In [18]:
check_outlier(Company['Total_Income'])
Out[18]:
(-2483.35, 4477.05)
In [19]:
check_outlier(Company['PBT_as_perc_of_total_income'])
Out[19]:
(-11.75, 20.89)
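As a quick sanity check of the IQR rule, the same function applied to a small synthetic series (hypothetical values, not from the dataset; the function is redefined so the sketch is self-contained):

```python
import pandas as pd

def check_outlier(col):
    # Tukey's rule, as in In [16]: flag values beyond 1.5*IQR from the quartiles
    Q1, Q3 = col.quantile([.25, .75])
    IQR = Q3 - Q1
    return Q1 - (1.5 * IQR), Q3 + (1.5 * IQR)

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = check_outlier(s)
print(lower, upper)                            # -3.5 14.5 (pandas' default linear interpolation)
print(s[(s < lower) | (s > upper)].tolist())   # [100] is flagged as an outlier
```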
Capping the outliers
In [20]:
def treat_outlier(x):
    # 5th, 25th, 75th and 95th percentiles of the column
    q5 = np.percentile(x, 5)
    q25 = np.percentile(x, 25)
    q75 = np.percentile(x, 75)
    q95 = np.percentile(x, 95)
    # IQR and the capping thresholds
    IQR = q75 - q25
    lower_bound = q25 - (1.5 * IQR)
    upper_bound = q75 + (1.5 * IQR)
    print(q5, q25, q75, q95, lower_bound, upper_bound)
    # Cap outliers: above the upper bound -> 95th percentile, below the lower bound -> 5th percentile
    return x.apply(lambda y: q95 if y > upper_bound else y).apply(lambda y: q5 if y < lower_bound else y)
In [21]:
for i in Company:
    Company[i] = treat_outlier(Company[i])
[For each of the 76 columns, the function printed its 5th, 25th, 75th and 95th percentiles and the capping bounds; the line-wrapped, garbled output is omitted here]
In [22]:
Company.shape
Out[22]:
(2409, 76)
In [23]:
Company['PBT_as_perc_of_total_income'].describe()
Out[23]:
count 2409.000000
mean 3.581635
std 10.147879
min -21.140000
25% 0.490000
50% 3.280000
75% 8.650000
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64
In [24]:
Company.loc[Company['default'] == 0,'PBT_as_perc_of_total_income'].describe()
Out[24]:
count 2240.000000
mean 4.611910
std 9.236063
min -21.140000
25% 0.920000
50% 3.690000
75% 9.182500
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64
For companies that have not defaulted, the median profit before tax (as % of total income) is about 3.7.
In [25]:
Company.loc[Company['default'] == 1,'PBT_as_perc_of_total_income'].describe()
Out[25]:
count 169.000000
mean -10.074083
std 11.722073
min -21.140000
25% -21.140000
50% -10.320000
75% 0.000000
max 22.688000
Name: PBT_as_perc_of_total_income, dtype: float64
For companies that have defaulted, the median profit before tax (as % of total income) is about -10.3.
In conclusion, a typical non-defaulting company makes a profit of about 3.7 units per 100 units of income,
while a typical defaulted company loses about 10.3 units per 100 units of income.
y = 1 / (1 + e^(-z))
Note: z = β0 + Σ(i=1 to n) βi·Xi
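The logistic transform above maps any real-valued score z to a probability in (0, 1); a minimal numeric sketch:

```python
import numpy as np

def sigmoid(z):
    # y = 1 / (1 + e^(-z)): large positive z -> near 1, large negative z -> near 0
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5: a score of zero gives even odds
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # roughly [0.0067, 0.5, 0.9933]
```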
In [26]:
import statsmodels.formula.api as SM
# Template (pseudocode): fill in the formula string and the DataFrame
model = SM.logit(formula='Dependent_Variable ~ Predictor_1 + Predictor_2 + ...', data=data_frame).fit()
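As a concrete instance of this template, a sketch fitting a logit on synthetic data (the variable names and data here are hypothetical, not the Company dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as SM

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# True relationship: log-odds = -1 + 2*x
p = 1 / (1 + np.exp(-(-1 + 2 * x)))
df = pd.DataFrame({'x': x, 'y': rng.binomial(1, p)})

# Formula syntax: 'dependent ~ predictor1 + predictor2 + ...'
model = SM.logit(formula='y ~ x', data=df).fit(disp=0)
print(model.params)              # intercept and slope estimates, near -1 and 2
print(model.predict(df).head())  # fitted probabilities, each in (0, 1)
```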
Splitting the data into random train and test subsets: the model will be fitted on the train set and predictions will be made on the test set.
In [27]:
from sklearn.model_selection import train_test_split  # split restored; test_size inferred from the later output sizes (1614 train / 795 test); random_state is an assumption
X = Company.drop(['default','Networth_Next_Year','Num'], axis=1)
y = Company['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
Company_train, Company_test = Company.loc[X_train.index], Company.loc[X_test.index]
Company_train.to_csv('Company_train.csv',index=False)
Company_test.to_csv('Company_test.csv',index=False)
In [28]:
Company_train.columns
Out[28]:
Model 1
Before starting model building, let us look at the problem of multicollinearity. Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.
In [29]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Variance Inflation Factor for each predictor column
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif
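The VIF for predictor i equals 1 / (1 - R²), where R² comes from regressing that predictor on all the others; a self-contained numpy sketch on synthetic data (hypothetical columns, not the Company variables) showing why near-collinear columns blow up:

```python
import numpy as np

def vif_manual(X):
    # VIF_i = 1 / (1 - R^2_i), regressing column i on the remaining columns
    vifs = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ coef).var() / y.var()
        vifs.append(1 / (1 - r2))
    return vifs

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)   # nearly collinear with a -> high VIF
c = rng.normal(size=1000)             # independent -> VIF near 1
print(vif_manual(np.column_stack([a, b, c])))   # roughly [high, high, ~1]
```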
In [30]:
Out[30]:
variables VIF
45 Contingent_liabilities__by__Net_worth_perc 1.262117
71 PE_on_BSE 1.371907
65 Raw_material_turnover 1.457993
50 Investments 1.613993
4 Change_in_stock 1.706929
62 Debtors_turnover 1.711831
61 Creditors_turnover 1.768949
5 Change_in_stock_by_Total_Income 1.863721
46 Contingent_liabilities 1.979391
25 Other_income 1.989548
55 Net_working_capital_by_Total_Capital 2.233915
63 Finished_goods_turnover 2.340059
51 Investments_by_Total_Income 2.356976
64 WIP_turnover 2.418147
54 Net_working_capital 2.453126
49 Net_fixed_assets_by_Total_Assets 2.524236
67 Equity_face_value 2.658062
32 Borrowings_by_Total_Assets 2.683636
60 Cash_to_average_cost_of_sales_per_day 2.722499
47 Contingent_liabilities_by_Total_Assets 2.833447
35 Deferred_tax_liability 2.836829
26 Other_income_by_Total_Income 2.937239
23 Income_from_financial_services 3.103908
40 Cumulative_retained_profits_by_Total_Income 3.124709
20 PAT_as_perc_of_net_worth 3.168076
24 Income_from_financial_services_by_Total_Income 3.349914
36 Deferred_tax_liability_by_Total_Assets 3.403342
53 Current_assets_by_Total_Assets 3.601563
59 Cash_to_current_liabilities_times 3.690171
48 Net_fixed_assets 4.133867
33 Current_liabilities_and_provisions 5.922509
43 TOL_by_TNW 6.392183
22 Sales_by_Total_assets 6.879268
52 Current_assets 7.063778
30 Reserves_and_funds_by_Total_Assets 7.330419
38 Shareholders_funds_by_Total_assets 7.772884
29 Reserves_and_funds 8.569776
16 PBDITA_as_perc_of_total_income 8.770527
3 Total_Income_by_Total_assets 9.338925
58 Debt_to_equity_ratio_times 9.408160
19 Cash_profit_as_perc_of_total_income 13.826134
41 Capital_employed 17.126263
6 Total_expenses 18.125364
11 PBDITA_by_Total_assets 18.716639
14 Cash_profit 19.287246
21 Sales 20.697583
10 PBDITA 21.230536
15 Cash_profit_by_Total_assets 21.530913
8 Profit_after_tax 24.076989
12 PBT 25.865234
18 PAT_as_perc_of_total_income 30.186683
2 Total_Income 30.757437
37 Shareholders_funds 33.057718
9 Profit_after_tax_by_Total_assets 33.820119
1 Net_worth 34.278786
17 PBT_as_perc_of_total_income 34.280187
13 PBT_by_Total_assets 36.137749
72 Dev_model 587.879602
70 Total_liabilities inf
0 Total_assets inf
73 rows × 2 columns
Here, we see that the VIF is high for many variables. We may drop variables with a VIF greater than 5 (indicating very high collinearity) and build our model.
In [31]:
In [32]:
In [33]:
model_1.summary()
Out[33]:
We can see that a few variables are insignificant and may not be useful to discriminate cases of default.
In [34]:
Model 2
In [35]:
In [36]:
In [37]:
model_2.summary()
Out[37]:
We can see that all variables are significant and may be useful to discriminate cases of default.
Let us also check the multicollinearity of the model using Variance Inflation Factor (VIF) for the predictor variables
In [38]:
Out[38]:
variables VIF
3 Cumulative_retained_profits_by_Total_Income 1.353936
9 Contingent_liabilities__by__Net_worth_perc 1.406154
2 PAT_as_perc_of_net_worth 1.479623
5 Investments_by_Total_Income 1.975775
1 Net_fixed_assets 2.236608
0 Total_capital 2.261477
8 PE_on_BSE 2.447948
7 Investments 2.563008
4 Contingent_liabilities_by_Total_Assets 2.609765
6 Contingent_liabilities 3.285351
We can see that some multicollinearity still exists, but let us not drop these variables as the VIFs are not very high.
In [39]:
We see that the adjusted R-squared is now close to the R-squared, suggesting fewer insignificant variables in the model.
We also notice that the current model has no insignificant variables and can be used for prediction purposes.
Let us test the predictions of this model on the train and test datasets.
Let us first check the distribution plot of the logit function values
In [40]:
sns.distplot(model_2.fittedvalues);
In [41]:
y_predict_train = model_2.predict(X_train)
y_predict_train
Out[41]:
1195 0.000238
630 0.079001
1978 0.005952
2312 0.000817
1000 0.012295
464 0.282067
198 0.034514
711 0.280323
1949 0.001276
2206 0.001816
1986 0.000249
265 0.146941
52 0.782141
1570 0.174541
1701 0.007760
556 0.048509
2001 0.003805
1551 0.016193
2156 0.000045
664 0.002812
1174 0.014924
1646 0.029623
158 0.867386
2086 0.009801
892 0.003239
1521 0.026492
1068 0.027455
528 0.022768
1108 0.001461
1062 0.000715
...
2290 0.006842
121 0.379384
2080 0.018987
1350 0.007507
1512 0.002323
1447 0.016040
993 0.018150
614 0.000523
1675 0.001548
3 0.284965
1724 0.021777
2028 0.010548
1738 0.000410
278 0.062157
1938 0.005347
239 0.179794
287 0.024797
1095 0.004992
1775 0.008386
2271 0.001393
791 0.017388
1735 0.013902
1759 0.011669
344 0.006982
1110 0.030625
1354 0.010515
2060 0.003080
19 0.292098
1816 0.004499
444 0.053537
Length: 1614, dtype: float64
In [42]:
sns.boxplot(x=Company['default'],y=y_predict_train)
plt.xlabel('Default');
From the above boxplot, we need to choose a cut-off value that gives the model reasonable discriminatory power. Let us take a cut-off of 0.07 and check.
In [43]:
# 1 if the predicted probability exceeds the 0.07 cut-off, else 0
y_class_pred = [1 if p > 0.07 else 0 for p in y_predict_train]
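The same thresholding can be written as one vectorized step; an equivalent alternative to the loop above (the array values here are hypothetical):

```python
import numpy as np

y_prob = np.array([0.002, 0.05, 0.071, 0.30, 0.85])   # hypothetical predicted probabilities
cutoff = 0.07

# Compare, then cast booleans to 0/1 class labels
y_class = (y_prob > cutoff).astype(int)
print(y_class.tolist())   # [0, 0, 1, 1, 1]
```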
Checking the accuracy of the model using confusion matrix for training set
In [44]:
#print(metrics.confusion_matrix(y_test, y_predict))
sns.heatmap(metrics.confusion_matrix(y_train, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals',rotation=0);
In [45]:
tn, fp, fn, tp = metrics.confusion_matrix(y_train,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)
print(metrics.classification_report(y_train,y_class_pred,digits=3))
As observed above, the accuracy of the model (the percentage of overall correct predictions) is 86%.
The sensitivity of the model is 84%, i.e. 84% of those that defaulted were correctly identified as defaulters by the model.
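Accuracy and sensitivity follow directly from the confusion-matrix counts; a minimal sketch with hypothetical counts chosen to reproduce the 86% / 84% figures quoted above:

```python
# Hypothetical confusion-matrix counts (tn, fp, fn, tp), not the notebook's actual values
tn, fp, fn, tp = 65, 10, 4, 21

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
sensitivity = tp / (tp + fn)                 # share of actual defaulters caught (recall of class 1)
print(accuracy, sensitivity)                 # 0.86 0.84
```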
In [47]:
y_predict_test = model_2.predict(X_test)
y_predict_test
Out[47]:
1238 0.264364
1446 0.009470
840 0.007608
166 0.054838
658 0.007321
1691 0.018224
499 0.032348
366 0.001656
1129 0.017894
1517 0.009549
868 0.033910
927 0.039776
1451 0.008103
2245 0.000346
1651 0.016640
1165 0.010501
872 0.077171
1006 0.042804
1422 0.008625
1323 0.001854
2038 0.005669
1876 0.027107
2347 0.052555
1884 0.005696
2345 0.000829
2381 0.001835
2051 0.011987
930 0.026261
153 0.716588
969 0.000512
...
1739 0.018299
1317 0.003664
1104 0.003370
2364 0.004185
2308 0.008268
115 0.407130
267 0.009058
1185 0.016101
129 0.927627
690 0.020068
1902 0.003642
1627 0.003918
1891 0.009009
2296 0.000012
1088 0.000972
398 0.029559
1541 0.018018
1464 0.002371
866 0.000250
164 0.608210
1183 0.002765
1465 0.018457
2187 0.000468
983 0.002116
415 0.492578
1848 0.013528
1546 0.000522
2237 0.003755
1096 0.018530
1177 0.007781
Length: 795, dtype: float64
In [48]:
# 1 if the predicted probability exceeds the 0.07 cut-off, else 0
y_class_pred = [1 if p > 0.07 else 0 for p in y_predict_test]
Checking the accuracy of the model using confusion matrix for test set
In [49]:
#print(metrics.confusion_matrix(y_test, y_predict))
sns.heatmap(metrics.confusion_matrix(y_test, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals',rotation=0);
In [50]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)
Let us now go ahead and print the classification report to check the various other parameters
In [51]:
print(metrics.classification_report(y_test,y_class_pred,digits=3))
As observed above, the accuracy of the model (the percentage of overall correct predictions) is 88%.
The sensitivity of the model is 84%, i.e. 84% of those that defaulted were correctly identified as defaulters by the model.
Let us take a cut-off of 0.08 and check whether our predictions improve.
In [52]:
# 1 if the predicted probability exceeds the 0.08 cut-off, else 0
y_class_pred = [1 if p > 0.08 else 0 for p in y_predict_train]
Checking the accuracy of the model using confusion matrix for training set
In [53]:
#print(metrics.confusion_matrix(y_test, y_predict))
sns.heatmap(metrics.confusion_matrix(y_train, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals',rotation=0);
In [54]:
In [55]:
print(metrics.classification_report(y_train,y_class_pred,digits=3))
Accuracy of the model, i.e. the percentage of overall correct predictions, has increased from 86% to 88%, but the sensitivity has dropped slightly from 84% to 83%.
In [57]:
# 1 if the predicted probability exceeds the 0.08 cut-off, else 0
y_class_pred = [1 if p > 0.08 else 0 for p in y_predict_test]
Checking the accuracy of the model using confusion matrix for test set
In [58]:
#print(metrics.confusion_matrix(y_test, y_predict))
sns.heatmap(metrics.confusion_matrix(y_test, y_class_pred), annot=True, fmt='.5g', cmap='plasma');
plt.xlabel('Predicted');
plt.ylabel('Actuals',rotation=0);
In [59]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test,y_class_pred).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)
Let us now go ahead and print the classification report to check the various other parameters
In [60]:
print(metrics.classification_report(y_test,y_class_pred,digits=3))
As observed above, the accuracy of the model (percentage of overall correct predictions) is 89%, and the sensitivity stands at 84%.
We may choose the cut-off of 0.08, as it gives the same sensitivity with higher overall accuracy on the test dataset.
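The cut-off comparison above can be automated by sweeping candidate cut-offs and tabulating accuracy and sensitivity for each; a sketch on hypothetical labels and probabilities (uses scikit-learn's metrics module, as elsewhere in the notebook):

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(7)
# Hypothetical ground truth (7% default rate) and predicted probabilities
y_true = rng.binomial(1, 0.07, size=1000)
y_prob = np.clip(0.07 + 0.28 * y_true + 0.05 * rng.normal(size=1000), 0, 1)

for cutoff in (0.06, 0.07, 0.08, 0.10):
    y_pred = (y_prob > cutoff).astype(int)
    acc = metrics.accuracy_score(y_true, y_pred)
    sens = metrics.recall_score(y_true, y_pred)   # sensitivity = recall of the default class
    print(f'cutoff={cutoff:.2f}  accuracy={acc:.3f}  sensitivity={sens:.3f}')
```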
END