
BANKRUPTCY PREVENTION PROJECT

Team P_205

Team Members:
1. Mr. Sanket Santosh Bait
2. Mr. Nikhil Ravindra Shinkar
3. Ms. Apurva Anil Ghodeswar
4. Ms. Gunjal Omkar Gulab
5. Ms. Pratiksha Bapusaheb Lagad
6. Mr. Avatar Singh
7. Mr. Vihang Hanmant Lawand

Mentors:
1. Ms. Pallavi
2. Mrs.

Date: 14/03/2023


CONTENT
• Business Objective
• Project Architecture
• Data Collection and Details
• Exploratory Data Analysis
• Visualization
• Modeling
• Evaluation
• Deployment
Business Problem:

• Business companies go bankrupt.

Business Objective:

• This is a classification project, since the variable to predict is binary (bankruptcy or non-bankruptcy).
• The goal is to model the probability that a business goes bankrupt from the available features.
DATASET DETAILS:
• The data file contains 7 columns (6 features plus the target) for 250 companies.
• industrial_risk: 0 = low risk, 0.5 = medium risk, 1 = high risk.
• management_risk: 0 = low risk, 0.5 = medium risk, 1 = high risk.
• financial_flexibility: 0 = low flexibility, 0.5 = medium flexibility, 1 = high flexibility.
• credibility: 0 = low credibility, 0.5 = medium credibility, 1 = high credibility.
• competitiveness: 0 = low competitiveness, 0.5 = medium competitiveness, 1 = high competitiveness.
• operating_risk: 0 = low risk, 0.5 = medium risk, 1 = high risk.
• class: bankruptcy / non-bankruptcy (target variable).
PROJECT WORKFLOW

Start → Fetch Data → Data Cleaning → EDA → Data Splitting (Training Data / Test Data) → Feature Selection → Train Model → Predict Data → Performance Measure
Exploratory Data Analysis (EDA)
Labels: 0 = low, 0.5 = medium, 1 = high
• industrial_risk value counts: 1.0 = 89, 0.5 = 81, 0.0 = 80
• management_risk value counts: 1.0 = 119, 0.5 = 69, 0.0 = 62
• financial_flexibility value counts: 1.0 = 57, 0.5 = 74, 0.0 = 119
• credibility value counts: 1.0 = 79, 0.5 = 77, 0.0 = 94
• competitiveness value counts: 1.0 = 91, 0.5 = 56, 0.0 = 103
• operating_risk value counts: 1.0 = 114, 0.5 = 57, 0.0 = 79
• class value counts: bankruptcy = 107, non-bankruptcy = 143

Dataset Information:
• No. of Columns: 7
• No. of Records: 250

Features of Interest:
1. Independent variables, X = 6 features
2. Dependent variable, y = class
About The Dataset
Data Information:
• Data description
• Data size
Checking the Missing Values / Visualizing Missing Values

[ ] data.isnull().sum()

[ ] sns.heatmap(data.isnull())

- There are no missing values in the dataset.

• Check for duplicated records in the dataset.

- [ ] data.duplicated().sum()
- There are 147 duplicate records present in the dataset.
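The duplicate check above can be sketched with pandas on a small hypothetical frame standing in for the project's data:

```python
import pandas as pd

# Small stand-in for the project's dataset (hypothetical values)
data = pd.DataFrame({
    "industrial_risk": [1.0, 1.0, 0.5, 0.0],
    "class": ["bankruptcy", "bankruptcy", "non-bankruptcy", "non-bankruptcy"],
})

n_dupes = data.duplicated().sum()   # rows identical to an earlier row
deduped = data.drop_duplicates()    # keeps the first copy of each row
print(n_dupes, len(deduped))        # 1 3
```

Note that with only 7 ternary-coded columns, distinct companies can legitimately share identical rows, so whether to drop them is a modeling decision.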
Count Plot for Bankruptcy and Non Bankruptcy

Data Is Imbalanced
We balance the data using oversampling with the SMOTE technique.
SMOTE (Synthetic Minority Over-sampling Technique) addresses the problems that arise when training on an imbalanced dataset by generating synthetic minority-class samples.
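SMOTE is typically applied via the imbalanced-learn library; a minimal NumPy sketch of its core idea (pick a minority sample, pick one of its nearest neighbors, interpolate a new point between them) might look like:

```python
import numpy as np

def smote_like(X_minority, n_new, k=2, seed=0):
    """Generate synthetic minority-class points the way SMOTE does:
    interpolate between a sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # k nearest, excluding i itself
        j = rng.choice(neighbors)
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(synthetic)

minority = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.5], [1.0, 1.0]])
new_points = smote_like(minority, n_new=4)
print(new_points.shape)  # (4, 2)
```

In practice the project would call `imblearn.over_sampling.SMOTE().fit_resample(X, y)` rather than hand-rolling this.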

Data is Balanced
Visualizing count plot and pie chart
Checking Outliers of Independent Features with the Class Column
Visualizing Bar Plots of Independent Features with the Class Column
Visualizing Violin Plots of Independent Features with the Class Column
Visualizing Crosstabs of Independent Features with the Class Column
Visualizing Distribution plot for Non Bankruptcy
Visualizing Distribution plot for Bankruptcy
Correlation Analysis
Visualizing Correlation Using Pair Plot
Model Building
We use 80% of the data for training and 20% for testing.
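The 80/20 split can be sketched with scikit-learn's `train_test_split` (synthetic stand-in data, since the real file is not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.choice([0.0, 0.5, 1.0], size=(250, 6))          # 6 ternary risk features
y = rng.choice(["bankruptcy", "non-bankruptcy"], 250)   # target class

# 80% train / 20% test; stratify keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # 200 50
```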
1. Logistic Regression
• Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
• Logistic regression is commonly used for prediction and classification problems.
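A minimal scikit-learn sketch of this model (synthetic data in place of the bankruptcy features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 6-feature binary problem standing in for the bankruptcy data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:1])[0]   # [P(class 0), P(class 1)] for one company
print(round(proba.sum(), 6))          # 1.0 (probabilities sum to one)
```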
2. KNN (K-Nearest Neighbors)
• The k-nearest neighbors algorithm,
also known as KNN or k-NN, is a
non-parametric, supervised learning
classifier, which uses proximity to
make classifications or predictions
about the grouping of an individual
data point. While it can be used for
either regression or classification
problems, it is typically used as a
classification algorithm, working off
the assumption that similar points can
be found near one another.
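The proximity-voting idea above, as a scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# k = 5: each test point is labeled by a majority vote of its
# 5 closest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(knn.predict(X_te[:3]))
```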
3. NAIVE BAYES CLASSIFIER
The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for classification tasks, like
text classification. It is also part of a family of generative learning algorithms, meaning that it seeks to model the
distribution of inputs of a given class or category.

Variants: GaussianNB, MultinomialNB, BernoulliNB
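The three scikit-learn variants named above share the same fit/predict API; a sketch on synthetic ternary-coded features (labels here are random, purely to show the interface):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
X = rng.choice([0.0, 0.5, 1.0], size=(100, 6))  # stand-in for the risk features
y = rng.integers(0, 2, size=100)

models = {
    "Gaussian": GaussianNB(),        # continuous features, normal likelihoods
    "Multinomial": MultinomialNB(),  # count-like, non-negative features
    "Bernoulli": BernoulliNB(),      # binary features (binarizes input)
}
for name, model in models.items():
    model.fit(X, y)
print(models["Gaussian"].predict(X[:2]))
```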


4. Decision Tree Classifier Model

A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression
tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.

Split criteria: Entropy, Gini
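Both split criteria are exposed through scikit-learn's `criterion` parameter; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=2)

# The two criteria named above: information gain (entropy) vs. Gini impurity
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=4,
                                  random_state=0).fit(X, y)
    print(criterion, round(tree.score(X, y), 3))
```

The two criteria usually produce very similar trees; Gini is slightly cheaper to compute.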
5. SUPPORT VECTOR MACHINE
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

Kernels: Linear, Polynomial, RBF
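The three kernels above map directly to scikit-learn's `SVC(kernel=...)` parameter ("poly" is the polynomial kernel); a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=3)

# Train one SVM per kernel and record its training accuracy
scores = {}
for kernel in ("linear", "poly", "rbf"):
    scores[kernel] = SVC(kernel=kernel).fit(X, y).score(X, y)
print(scores)
```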


6. RANDOM FOREST CLASSIFIER


• Random forest is a commonly-used
machine learning algorithm which
combines the output of multiple
decision trees to reach a single result.
Its ease of use and flexibility have
fueled its adoption, as it handles both
classification and regression problems.
• Since the random forest model is made
up of multiple decision trees, it would
be helpful to start by describing the
decision tree algorithm briefly.
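The tree-ensembling idea above, as a scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=4)

# 100 decision trees, each fit on a bootstrap sample with a random
# feature subset per split; predictions are combined by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(rf.estimators_))  # 100
```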
MODEL PERFORMANCE
MODEL DEPLOYMENT ON STREAMLIT
We tried multiple models; the SVM with a polynomial kernel gives the best accuracy, so we use the polynomial-kernel SVM for deployment.
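Before a Streamlit app can serve predictions, the chosen model has to be persisted; a common sketch uses pickle (the file name and app wiring below are hypothetical, not from the source):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=5)
model = SVC(kernel="poly", probability=True).fit(X, y)

# Serialize the trained model; the Streamlit app would load these bytes
# (e.g. from a model.pkl file) and call predict() on the user's inputs
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print((restored.predict(X) == model.predict(X)).all())  # True
```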
