Business Objective:
This is a classification project, since the variable to predict is binary (bankruptcy or non-bankruptcy).
The goal is to model the probability that a business goes bankrupt, given a set of company features.
BANKRUPTCY PREVENTION PROJECT
DATASET DETAILS:
The data file contains 7 columns (6 features and the target class) for 250 companies.
Industrial_risk: 0 = low risk, 0.5 = medium risk, 1 = high risk.
management_risk: 0 = low risk, 0.5 = medium risk, 1 = high risk.
financial flexibility: 0 = low flexibility, 0.5 = medium flexibility, 1 = high flexibility.
credibility: 0 = low credibility, 0.5 = medium credibility, 1 = high credibility.
competitiveness: 0 = low competitiveness, 0.5 = medium competitiveness, 1 = high competitiveness.
operating_risk: 0 = low risk, 0.5 = medium risk, 1 = high risk.
class: bankruptcy, non-bankruptcy (target variable).
PROJECT WORK FLOW
Start Fetch Data
Data Cleaning
EDA
Data Splitting
Feature Selection
Train Model
Predict Data
Performance Measure
Exploratory Data Analysis (EDA)
Labels: 0 = low, 0.5 = medium, 1 = high
• Industrial_risk value counts: 1.0 = 89, 0.5 = 81, 0.0 = 80
• Management_risk value counts: 1.0 = 119, 0.5 = 69, 0.0 = 62
• Financial flexibility value counts: 1.0 = 57, 0.5 = 74, 0.0 = 119
• Credibility value counts: 1.0 = 79, 0.5 = 77, 0.0 = 94
• Competitiveness value counts: 1.0 = 91, 0.5 = 56, 0.0 = 103
• Operating_risk value counts: 1.0 = 114, 0.5 = 57, 0.0 = 79
• Class value counts: bankruptcy = 107, non-bankruptcy = 143
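To make the class balance concrete, the class counts reported above can be turned into proportions. This is only a small arithmetic sketch; the dictionary name and layout are illustrative and not taken from the project notebook.

```python
# Class counts as reported in the EDA summary above.
class_counts = {"bankruptcy": 107, "non-bankruptcy": 143}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")

# Ratio of the larger class to the smaller class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```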
Data describe
Data Size
Checking the Missing Values
data.isnull().sum()  # number of missing values per column
Visualizing Missing Values
sns.heatmap(data.isnull())  # assumes seaborn is imported as sns
Data Is Imbalanced
We balance the data by oversampling the minority class with the SMOTE technique.
SMOTE (Synthetic Minority Over-sampling Technique) addresses class imbalance by generating synthetic minority-class samples instead of duplicating existing ones.
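The core idea of SMOTE can be sketched in a few lines: pick a minority-class row, pick one of its nearest minority-class neighbours, and create a new row somewhere on the line between them. The function below is a simplified illustration under that assumption, not the project's actual code; the name `smote_like_oversample` and the toy rows are hypothetical. In practice a library implementation (for example the SMOTE class in imbalanced-learn) would normally be used.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority rows, sorted by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class rows on the 0 / 0.5 / 1 scale.
minority = [(1.0, 1.0, 0.0), (0.5, 1.0, 0.0), (1.0, 0.5, 0.0), (1.0, 1.0, 0.5)]
new_points = smote_like_oversample(minority, n_new=3)
print(new_points)
```

Because the new rows are interpolated, their values can fall between the original 0, 0.5 and 1 levels.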
Data is Balanced
Visualizing count plot and pie chart
Checking Outliers Independent features With Class Column
Visualizing Bar plot Independent features With Class Column
Visualizing Violin plot Independent features With Class Column
Visualizing Crosstab Independent features With Class Column
Visualizing Distribution plot for Non Bankruptcy
Visualizing Distribution plot for Bankruptcy
Correlation Analysis
Visualizing Correlation Using Pair Plot
Model Building
We use 80% of the data for training and 20% for testing.
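An 80/20 split can be sketched as a shuffle of the row indices followed by a cut. The helper name, the seed and the use of plain index lists are assumptions made for illustration; a library helper such as scikit-learn's `train_test_split` is the more usual choice.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# 250 rows, as in the dataset description.
train_idx, test_idx = train_test_split_indices(250)
print(len(train_idx), len(test_idx))  # 200 50
```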
1. Logistic Regression
• Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
• Logistic regression is commonly used for prediction and classification problems.
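The logistic function mentioned above maps a weighted sum of the features to a probability between 0 and 1. A minimal sketch follows; the weights and bias are hypothetical values chosen for illustration and are not fitted to the project data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one row of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for six features; NOT learned from the dataset.
weights = [0.5, 0.8, -1.5, -1.2, -2.0, 0.6]
bias = 0.0

# One illustrative row: high risks, low flexibility/credibility/competitiveness.
row = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
p = predict_proba(row, weights, bias)
print(f"predicted probability: {p:.3f}")
```

Fitting the model means choosing the weights and bias so that these probabilities match the training labels as closely as possible.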
2. KNN - K-Nearest Neighbors
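As a rough illustration of the idea, k-nearest neighbors classifies a new row by majority vote among the k training rows closest to it. The sketch below uses squared Euclidean distance and a hypothetical toy dataset with two features; none of the values come from the project data.

```python
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_rows, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy rows: (industrial_risk, competitiveness).
train_rows = [(1.0, 0.0), (0.5, 0.0), (1.0, 0.5),
              (0.0, 1.0), (0.5, 1.0), (0.0, 0.5)]
train_labels = ["bankruptcy"] * 3 + ["non-bankruptcy"] * 3

prediction = knn_predict(train_rows, train_labels, query=(1.0, 0.0))
print(prediction)  # bankruptcy
```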
Decision Tree
A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.
Split criteria: Entropy, Gini
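Entropy and Gini are the two impurity measures a decision tree can use to choose its splits: both are 0 for a node containing a single class and largest when the classes are evenly mixed. A minimal sketch of both measures, with illustrative function names:

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

# Class counts from the EDA summary: 107 bankruptcy, 143 non-bankruptcy.
labels = ["bankruptcy"] * 107 + ["non-bankruptcy"] * 143
print(f"gini = {gini(labels):.3f}, entropy = {entropy(labels):.3f}")
```

A split is considered good when the child nodes it produces have lower impurity than the parent node.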
5. SUPPORT VECTOR MACHINE
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples.
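Once a hyperplane has been found, classifying a new example only requires checking which side of the hyperplane it falls on. The sketch below shows that prediction step for a linear SVM. The weights and bias are hypothetical, and the training step that finds the optimal hyperplane is not shown.

```python
def svm_decision(features, weights, bias):
    """Signed score w.x + b; its sign tells which side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def svm_predict(features, weights, bias):
    """Map the side of the hyperplane to a class label."""
    score = svm_decision(features, weights, bias)
    return "non-bankruptcy" if score >= 0 else "bankruptcy"

# Hypothetical hyperplane over two features (competitiveness, credibility);
# these numbers are illustrative and NOT learned from the project data.
weights = [1.0, 1.0]
bias = -0.75

print(svm_predict([1.0, 1.0], weights, bias))  # non-bankruptcy
print(svm_predict([0.0, 0.5], weights, bias))  # bankruptcy
```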