Business Analytics


Business Analytics Life Cycle


Learning Objectives

By the end of this lesson, you will be able to:

• Explore the business analytics life cycle phases to define the roadmap and achieve business goals
• List the challenges faced in each phase
• Perform and outline the business analytics life cycle phases with the help of a loan default prediction case study
Business Analytics Life Cycle (BALC)
Introduction

BALC is a framework that describes the process of using data and analytics to drive business decisions.
The phases involved are:

• Business understanding
• Data collection
• Data exploration
• Data modeling
• Deployment
• Monitoring and maintenance
Business Understanding: Overview

This phase involves understanding and addressing the business problem or opportunity.

• Identifying the stakeholders
• Establishing the goals and objectives
• Defining the scope of the problem
Business Understanding: Example

A retail company wants to improve its customer retention. The phase would involve:

• Defining the problem
• Identifying stakeholders
• Establishing the scope
• Defining success metrics
• Identifying data sources
• Determining objectives


Business Understanding: Challenges

• Ambiguous problem definition
• Insufficient domain expertise
• Limited data availability
• Inadequate stakeholder involvement
• Frequent changes in business needs
Data Collection: Overview

Data is collected from various sources, including internal and external sources.

Steps for preparing data for analysis:

• Cleanse
• Integrate
• Transform
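As a hedged illustration of the cleanse, integrate, and transform steps, the sketch below uses small hypothetical pandas tables; every column name and value is invented for illustration:

```python
import pandas as pd

# Hypothetical internal and external sources (illustrative data only)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "amount": [120.0, 120.0, 85.5, None, 40.0],   # one duplicate, one missing value
})
demographics = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": ["34", "41", "29"],                    # stored as strings
})

# Cleanse: drop duplicate rows and rows with missing amounts
clean = transactions.drop_duplicates().dropna(subset=["amount"])

# Integrate: join the two sources on the shared key
merged = clean.merge(demographics, on="customer_id", how="left")

# Transform: cast age to a numeric type for analysis
merged["age"] = merged["age"].astype(int)
```

Real pipelines would add source-specific validation at each step; the sequence, however, stays the same.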
Data Collection: Example

After completing the business understanding phase, the retail company will collect data.

Data collected can be related to:

• Customer transactions
• Demographics
• Customer feedback
• Social media
• Competitors
• Website traffic


Data Collection: Challenges

• Data quality
• Data availability
• Data security and privacy
• Data integration
• Data volume

To overcome these challenges, it is essential to have a structured approach to data collection.


Data Exploration: Overview

The goal of data exploration is to gain insights and identify patterns, trends, and outliers that can
inform subsequent analysis.

Data exploration techniques:

• Descriptive statistics
• Data visualization
• Correlation analysis
• Outlier detection
• Data cleaning


Data Exploration: Example

After completing the data collection phase, the retail company will explore the collected data.

Examples of data exploration techniques are:

• Descriptive statistics: Calculate summary statistics
• Data visualization: Create visualizations
• Correlation analysis: Calculate correlation coefficients between variables
• Outlier detection: Identify and investigate outliers
• Data cleaning: Identify and address missing values
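These techniques can be sketched in a few lines of pandas. The data below is hypothetical (the real case study uses the retail/loan datasets), and the 2-standard-deviation cutoff is just one simple rule of thumb:

```python
import pandas as pd

# Hypothetical slice of customer data (illustrative values only)
df = pd.DataFrame({
    "interest_rate": [11.2, 12.8, 9.5, 14.1, 10.7, 55.0],  # 55.0 looks extreme
    "loan_amount":   [5000, 7500, 3000, 9000, 4000, 6000],
})

# Descriptive statistics: summary of each numeric column
summary = df.describe()

# Correlation analysis: pairwise correlation coefficients
corr = df.corr()

# Data cleaning: count missing values per column
missing = df.isna().sum()

# Outlier detection: flag values more than 2 standard deviations
# from the mean (a simple rule of thumb)
z = (df["interest_rate"] - df["interest_rate"].mean()) / df["interest_rate"].std()
outliers = df.loc[z.abs() > 2, "interest_rate"]
```

Visualizations (histograms, box plots) would typically follow the same `df` in a notebook.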


Data Exploration: Challenges

• Data quality issues
• Data complexity
• Bias and subjectivity
• Time constraints
• Data privacy and security
Data Modeling: Overview

Data modeling creates a mathematical representation of the data that captures the relationships between different variables.

Types of data models:

• Descriptive
• Predictive
• Prescriptive
Data Modeling: Example

After completing the data exploration phase, the retail company can follow these data modeling steps:

• Define the problem
• Clean and preprocess the data
• Select the modeling technique
• Train the model
• Validate the model
• Apply the model
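The steps above can be sketched compactly with scikit-learn. This is a minimal illustration on synthetic data, not the retail company's actual model; the dataset, technique, and split ratio are all assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for prepared retail data (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out data so the model can be validated after training
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Select a modeling technique and train the model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validate the model on held-out data
val_accuracy = accuracy_score(y_val, model.predict(X_val))

# Apply the model to new observations
predictions = model.predict(X_val[:5])
```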
Data Modeling: Challenges

• Data quality
• Data privacy and security
• Overfitting
• Underfitting
• Interpretability
• Model selection


Deployment: Overview

Model deployment is the process of integrating a data model into a production environment to
generate predictions or support decision-making. It involves:

• Preparing the model
• Selecting a deployment environment
• Testing and validation
• Integrating with other systems
• Monitoring and maintenance
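The "preparing the model" and "testing and validation" steps often begin with serialization: the trained artifact is saved, then reloaded in the target environment and checked against known outputs. A minimal stdlib-only sketch, using an invented parameter dictionary in place of a real trained model:

```python
import os
import pickle
import tempfile

# Hypothetical trained "model": here just fitted parameters (illustrative)
model = {"weights": [0.4, -1.2, 0.7], "intercept": 0.1}

def predict(m, features):
    """Score one observation with the stored parameters."""
    score = m["intercept"] + sum(w * x for w, x in zip(m["weights"], features))
    return 1 if score > 0 else 0

# Preparing the model: serialize it so the production environment can load it
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Testing and validation: reload and confirm behavior matches before integration
with open(path, "rb") as f:
    deployed = pickle.load(f)
```

In practice, frameworks add versioned model registries on top of this basic save/load/verify loop.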
Deployment: Challenges

• Model drift
• Security
• Scalability
• Integration with existing systems
• User adoption
• Regulatory compliance
• Data governance

A successful model deployment requires planning, testing, and maintenance to meet business needs.
Monitoring and Maintenance: Overview

Monitoring and maintenance are essential for ensuring the accuracy, reliability, and usefulness of data-driven insights.

Some key considerations are:

• Performance monitoring
• Data quality monitoring
• Model validation
• Data security
• Continuous improvement

It is an ongoing process that requires regular attention and adjustment.


Monitoring and Maintenance: Techniques

• Performance monitoring
• Error analysis
• Automated testing
• Regular retraining
• Feedback loops
• Versioning
• Security monitoring


Case Study: Loan Default Prediction
Problem Statement

When customers fail to pay their loans on time, banks suffer losses. These losses, which amount to
millions of dollars every year, have a significant impact on a country's economic growth.
In this case study, you will predict whether a person will default on a loan by examining various
factors such as location, loan balance, funded amount, and more.
A training and testing dataset of 67,463 rows by 35 columns and 28,913 rows by 34 columns,
respectively, is provided.

Source: www.Kaggle.com
Data Description

• ID (Int): Unique ID of a representative
• Loan amount (Int): Loan amount applied for
• Funded amount (Int): Loan amount funded
• Funded amount investor (Float): Loan amount approved by the investors
• Term (Int): Term of the loan (in months)
• Batch enrolled (Object): Batch numbers assigned to representatives
• Interest rate (Float): Interest rate (%) on the loan
• Grade (Object): Grade assigned by the bank
• Subgrade (Object): Subgrade assigned by the bank
• Employment duration (Object): Duration of employment
• Home ownership (Float): Ownership of home
• Verification status (Object): Income verification by the bank
• Payment plan (Object): Whether any payment plan has been started against the loan
• Loan title (Object): Loan title provided
Data Description

• Revolving balance (Int): Total credit revolving balance
• Revolving utilities (Float): Amount of credit a representative is using relative to the revolving balance
• Total accounts (Int): Total number of credit lines available in the representative's credit line
• Initial list status (Object): Unique listing status of the loan (W for waiting and F for forwarded)
• Total received interest (Float): Total interest received to date
• Total received late fee (Float): Total late fee received to date
• Debit to income (Float): Ratio of the representative's total monthly debt repayment divided by self-reported monthly income, excluding mortgage
• Delinquency two years (Int): Number of 30+ days delinquencies in the past two years
• Inquires in six months (Int): Total number of inquiries in the last six months
• Open account (Int): Number of open credit lines in the representative's credit line
• Public record (Int): Number of derogatory public records
Data Description

• Recoveries (Float): Post charge-off gross recovery
• Collection recovery fee (Float): Post charge-off collection fee
• Collection 12 months medical (Int): Total collections in the last 12 months, excluding medical collections
• Application type (Object): Indicates whether the representative is an individual or joint
• Last week’s pay (Int): Indicates how long (in weeks) a representative has paid EMI after the batch enrolled
• Accounts delinquent (Int): Number of accounts on which the representative is delinquent
• Total collection amount (Int): Total collection amount from all accounts
• Total current balance (Int): Total current balance from all accounts
• Total revolving credit limit (Int): Total revolving credit limit
• Loan status (Int): 1 = defaulter, 0 = non-defaulter (target feature)
Data Understanding

There are 67,463 observations and 35 features in the training dataset.

• Out of 35 features, there are:
  o 9 features of datatype float
  o 17 features of datatype int
  o 9 features of datatype object
• Feature ID is the identifier
• Loan Status is the target feature
Data Understanding: Target

The target variable indicates the presence of imbalanced data:

• Non-defaulters: 90.75% (61,222)
• Defaulters: 9.25% (6,241)
Problems with Imbalanced Data

Imbalanced data refers to a situation where the distribution of classes in the dataset is unequal.

Some of the common problems are:

• Difficult to detect rare events
• Biased model performance
• Inaccurate evaluation metrics


Techniques to Address Imbalanced Data

• Oversampling
• Undersampling
• Changing the performance metric
• Cost-sensitive learning
• Ensemble learning
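Random over- and undersampling, the two simplest techniques above, can be sketched with the standard library alone. The 90/10 label split below mirrors the imbalance in the case study but is otherwise invented; real code would resample whole rows, not bare labels:

```python
import random

random.seed(0)

# Hypothetical imbalanced labels: 90 non-defaulters (0), 10 defaulters (1)
labels = [0] * 90 + [1] * 10
majority = [y for y in labels if y == 0]
minority = [y for y in labels if y == 1]

# Random oversampling: duplicate minority examples (with replacement)
# until the classes are balanced
oversampled = majority + random.choices(minority, k=len(majority))

# Random undersampling: keep only as many majority examples as minority ones
undersampled = random.sample(majority, k=len(minority)) + minority
```

Libraries such as imbalanced-learn wrap these ideas (plus synthetic methods like SMOTE) behind a common interface.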
Data Exploration: Examples

• Univariate analysis (e.g., distributions of interest rate and debit to income)
• Bivariate analysis
Data Preparation

Missing values: There are no missing values in the data.

Duplicate values: There are no duplicate values in the data.

Low variance features:

1. Constant features (Variance = 0)
2. Quasi-constant features (Variance below a small threshold, e.g., 0.02)

• Feature Accounts delinquent has variance = 0.
• Collection 12 months medical and Accounts delinquent are quasi-constant features.
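A low-variance filter like the one described can be sketched as below. The column names echo the case study, but the values and the 0.02 threshold are illustrative assumptions:

```python
import pandas as pd

# Hypothetical features: one constant, one quasi-constant, one informative
df = pd.DataFrame({
    "accounts_delinquent": [0] * 100,                 # constant (variance = 0)
    "collection_12m_medical": [0] * 99 + [1],         # quasi-constant
    "interest_rate": [9.5 + (i % 7) for i in range(100)],
})

# Drop features whose variance falls below a small threshold
threshold = 0.02
low_variance = [col for col in df.columns if df[col].var() < threshold]
reduced = df.drop(columns=low_variance)
```

scikit-learn's `VarianceThreshold` transformer implements the same idea for whole feature matrices.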
Data Preparation

Per box plots, the following features have outliers:

• Funded amount investor
• Interest rate
• Home ownership
• Open account
• Revolving balance
• Total accounts
• Total received interest
• Total received late fee
• Recoveries
• Collection recovery fee
• Total collection amount
• Total current balance
• Total revolving credit limit

Methods to detect outliers and anomalies:

• Percentile method
• IQR method
• Box plot method
Hypothesis Generation

Check if the Target variable has a significant correlation with the Input features
Hypothesis Generation

Check if there is any kind of pattern between the Initial list status and the Loan status
Hypothesis Generation

Check if Subgrade is associated with the Loan status


Hypothesis Generation

On similar lines, you can check the effect of the following features on the target feature (Loan status):

• Application type
• Collection 12 months medical
• Term
• Employment duration
• Public record
• Inquiries in six months
• Grade
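One common way to test whether a categorical feature is associated with the loan status is a chi-square test on the cross-tabulated counts. The contingency table below is invented for illustration (the real counts come from the training data), and the statistic is computed by hand to show the formula:

```python
import numpy as np

# Hypothetical contingency table: Initial list status (rows: W, F)
# vs. Loan status (columns: non-defaulter, defaulter); illustrative counts
observed = np.array([[3000.0, 400.0],
                     [2800.0, 200.0]])

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells,
# where E is the expected count under independence
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()

# A statistic above the critical value (~3.84 for 1 degree of freedom
# at the 5% level) suggests the feature and the target are associated
associated = chi2 > 3.84
```

`scipy.stats.chi2_contingency` performs the same computation and also returns the p-value directly.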
Outlier Treatment

Once outliers are identified, you need to decide on the appropriate treatment.

• Removal
• Transformation
• Imputation

Using these options, outliers can be handled appropriately.
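The three treatment options can be sketched on a single invented sample (the IQR fences are recomputed here so the snippet is self-contained):

```python
import numpy as np

values = np.array([9.5, 10.2, 10.7, 11.2, 12.1, 12.8, 13.3, 14.1, 45.0])

q1, q3 = np.percentile(values, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

# Removal: drop points beyond the 1.5*IQR fences
removed = values[(values >= lower) & (values <= upper)]

# Transformation: compress the scale so extremes have less influence
logged = np.log1p(values)

# Imputation: replace flagged outliers with the median
median = np.median(values)
imputed = np.where((values < lower) | (values > upper), median, values)
```

Which option is appropriate depends on whether the outlier is an error (remove or impute) or a genuine extreme value (transform and keep).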


Feature Encoding

It is the process of converting categorical variables into numerical values that can be used for
analysis or modeling. Techniques for feature encoding are:

• One hot
• Label
• Ordinal
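The three techniques differ in what the resulting numbers mean. A minimal pandas sketch on an invented grade column:

```python
import pandas as pd

# Hypothetical categorical column (illustrative values)
df = pd.DataFrame({"grade": ["A", "B", "A", "C", "B"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["grade"], prefix="grade")

# Label encoding: an arbitrary integer per category
df["grade_label"] = df["grade"].astype("category").cat.codes

# Ordinal encoding: integers that respect a meaningful order
order = {"A": 1, "B": 2, "C": 3}
df["grade_ordinal"] = df["grade"].map(order)
```

One-hot avoids implying an order; ordinal is preferable when the categories genuinely rank (as loan grades do).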
Binary Encoding

This technique creates binary columns for a categorical variable by using binary numbers.
Techniques for binary encoding are:

1. Count
2. Target
3. Hashing


Categorical features (number of unique values):

• Batch enrolled: 41
• Grade: 7
• Subgrade: 35
• Employment duration: 3
• Verification status: 3
• Payment plan: 1
• Loan title: 109
• Initial list status: 2
• Application type: 2
Data Pre-processing

The feature-wise roadmap for data pre-processing is as follows:

• Batch enrolled: Remove the BAT prefix and typecast to int
• Grade: Ordinal encoding
• Subgrade: Ordinal encoding; too many unique values
• Employment duration: Manually typecast
• Verification status: Manually typecast
• Payment plan: Drop
• Loan title: Too many unique values
• Initial list status: Binary nominal
• Application type: Binary nominal
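Parts of this roadmap can be sketched in pandas. The rows below are invented, and the exact prefix format and grade ordering are assumptions about the raw data:

```python
import pandas as pd

# Hypothetical rows mimicking the raw columns (illustrative values)
df = pd.DataFrame({
    "batch_enrolled": ["BAT123", "BAT456", "BAT123"],
    "grade": ["A", "C", "B"],
    "payment_plan": ["n", "n", "n"],
    "initial_list_status": ["w", "f", "w"],
})

# Batch enrolled: strip the assumed "BAT" prefix and typecast to int
df["batch_enrolled"] = (
    df["batch_enrolled"].str.replace("BAT", "", regex=False).astype(int)
)

# Grade: ordinal encoding (assumed A-to-G ranking)
grade_order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
df["grade"] = df["grade"].map(grade_order)

# Payment plan: drop (single constant value)
df = df.drop(columns=["payment_plan"])

# Initial list status: binary nominal (w = 1, f = 0)
df["initial_list_status"] = (df["initial_list_status"] == "w").astype(int)
```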
Model Selection

A number of models are tried and tested before deciding which one gives the best result.
Loan Default Prediction

Performance is compared for the following models:

• Decision tree
• Bagging classifier
• Boosting algorithms
• Logistic regression


Final Model

• Use the XGBoost (XGB) model, as it gives the best results.
• The next step is to fine-tune the model for better precision and recall.
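Fine-tuning usually means a cross-validated search over hyperparameters, scored on a metric that reflects precision and recall (e.g., F1). As a hedged sketch, the same pattern is shown with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost, on synthetic imbalanced data; the grid values are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in for the loan data (illustrative only)
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Small grid over common boosting knobs; score on F1 to balance
# precision and recall for the rare defaulter class
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
best_params = grid.best_params_
```

With XGBoost itself, the same `GridSearchCV` wrapper works around `xgboost.XGBClassifier`, with knobs such as `learning_rate` and `max_depth`.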
Model Deployment

A production machine learning pipeline takes business inputs through the following stages:

• Data science and data engineering
• Pipeline hardening
• Packaging
• Model hardening
• Deploy
• Monitoring

Supporting components include model security, model governance, the model catalog, the feature catalog, and the data catalog.
Model Deployment: Approach

Considerations:

• Modularity
• Reproducibility
• Scalability
• Extensibility
• Testing
• Automation

ML architectures:

• Train by the batch; predict on the fly; serve via REST API
• Train by the batch; predict by the batch; serve through a shared database
• Train and predict by streaming
• Train by the batch; predict on the mobile (or by other clients)
Model Deployment: Comparison
Model Deployment: High Level Architecture

• Evaluation layer
• Scoring layer
• Feature layer
• Data layer


Monitoring and Maintenance
Monitoring

Production machine learning needs:

• A monitoring mechanism that is model agnostic
• Instrumentation of both the data flowing in and the model performance metrics coming out
• Collection of performance metrics
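One model-agnostic way to instrument the data flowing in is the population stability index (PSI), which compares the score distribution in production against the one seen at training time. The sketch below uses invented scores, and the 0.25 alarm threshold is a common rule of thumb rather than a fixed standard:

```python
import math

def psi(expected, actual, bins):
    """Population stability index between two score distributions."""
    def fractions(values):
        counts = [0] * (len(bins) - 1)
        for v in values:
            for i in range(len(bins) - 1):
                if bins[i] <= v < bins[i + 1]:
                    counts[i] += 1
                    break
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical model scores at training time vs. in production
train_scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
prod_scores  = [0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.95]

drift = psi(train_scores, prod_scores, bins=[0.0, 0.25, 0.5, 0.75, 1.0])
needs_retraining = drift > 0.25  # rule-of-thumb alarm threshold
```

Because PSI only looks at distributions, it works for any model, satisfying the model-agnostic requirement above.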
Key Takeaways

• The BALC is a framework that describes the process of using data and analytics to drive business decisions.
• The business understanding phase involves understanding the business problem or opportunity that needs to be addressed.
• The data collected from various sources is summarized and visualized to understand the key characteristics of a dataset.
• Successful model deployment requires planning, testing, and maintenance to meet business needs.
