Final Report - Group 4


Unlocking Credit Worthiness

MBA 739 Advanced Data Mining

Dr. J.P. Auffret

Group 4

Tanzilla Siddique, Suhaga Weerasinghe, Jonathan Sarker, Matthew Foreman

Contents
Overview
Source Data Analysis
Dataset Cleaning and Preprocessing
Analytical Approach
  Random Forest Model
  Logistic Regression Model
  Comparison
Recommendations
Limitations of Research

Overview
The financial industry is continuously seeking ways to enhance the accuracy and efficiency of
credit card approval processes. The ability to accurately predict credit card eligibility is critical
for financial institutions aiming to mitigate risk, increase revenue, and improve overall customer
satisfaction. This research highlights key focus areas for financial institutions and also gives
consumers insight into the factors that lead to credit eligibility. By leveraging advanced analytical
models, we aim to identify the most significant predictors of credit card eligibility, allowing for
more targeted marketing strategies and refined risk assessment criteria.

In this report, we analyze a comprehensive dataset from Kaggle.com containing various
attributes of individuals applying for credit cards. Our objective is to pinpoint the variables that
most strongly correlate with credit card eligibility. To achieve this, we employ two powerful
analytical methods: random forest and logistic regression. Each method offers unique insights
and strengths, providing a robust framework for our analysis.

In the following sections, we will outline our data cleaning and processing methods, present the
findings from our random forest and logistic regression models, compare the results, and offer
actionable recommendations based on our analysis. This comprehensive approach ensures that
our conclusions are well-supported and can be effectively implemented to achieve our goals.
Findings from this study can also contribute to broader discussions on financial inclusion,
illuminating how data-driven approaches can uphold equitable lending practices.

Source Data Analysis


We sourced our dataset from Kaggle.com, which includes a variety of attributes related to
individuals applying for credit cards. This rich dataset encompasses the following key attributes:
Demographic Information:
• Age: The age of the applicant.
• Gender: The gender of the applicant.
• Marital Status: The marital status of the applicant (e.g., Single, Married, Divorced).
Financial Information:
• Total Income: The total annual income of the applicant.
• Employment Status: The employment status of the applicant (e.g., Employed, Unemployed).
• Years Employed: The number of years the applicant has been employed.
Credit History:
• Account Length: The number of years the applicant has held an account with the bank.
• Number of Credit Cards: The total number of credit cards the applicant currently holds.
• Credit Card Limits: The combined credit limits of all the applicant’s credit cards.
Contact Information:
• Work Phone: Indicator of whether the applicant has provided a work phone number.
• Email: Indicator of whether the applicant has provided an email address.
• Phone: Indicator of whether the applicant has provided a personal phone number.
Behavioral Information:
• Owns Car: Indicator of whether the applicant owns a car.
• Owns Property: Indicator of whether the applicant owns property.
• Number of Family Members: The number of family members living with the applicant.
Target Variable:
• Credit Card Eligibility: A binary outcome indicating whether the applicant is eligible for a
  credit card (1 = Eligible, 0 = Not Eligible).
This dataset provides a comprehensive view of various factors that could influence an
individual's eligibility for a credit card. By analyzing these attributes, we aim to uncover patterns
and insights that can help improve the credit card approval process.

Dataset Cleaning and Preprocessing


To ensure the integrity and reliability of our analysis, we undertook several crucial steps to clean
and preprocess the dataset:
1. Identifying and Removing Missing Values:
   o We began by identifying any missing values within the dataset. Missing data can
     lead to biased estimates and reduced model accuracy, so rows with missing values
     were removed to maintain data quality.
2. Transforming the Target Variable:
   o The target variable, representing credit card eligibility, was transformed into a
     categorical factor. This transformation is essential for classification tasks: it
     tells the models to treat eligibility as a class label rather than as a numeric
     quantity.
3. Splitting the Dataset:
   o To build and evaluate our models effectively, we split the dataset into two parts:
     80% for training and 20% for testing. The training set was used to develop the
     models, while the testing set was reserved for evaluating their performance on
     unseen data. This split lets us estimate how well our models generalize to new data.
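
The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code used for this report: the record layout, the field name `eligible`, the toy records, and the random seed are all assumptions.

```python
import random

def clean_and_split(rows, target="eligible", train_frac=0.8, seed=42):
    """Drop incomplete rows, make the target categorical, and split 80/20."""
    # Step 1: remove rows that contain any missing value (None in this sketch).
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # Step 2: recode the 0/1 target as a categorical label.
    labels = {0: "not_eligible", 1: "eligible"}
    prepared = [{**r, target: labels[r[target]]} for r in complete]
    # Step 3: shuffle reproducibly, then hold out the final 20% for testing.
    rng = random.Random(seed)
    rng.shuffle(prepared)
    cut = int(len(prepared) * train_frac)
    return prepared[:cut], prepared[cut:]

# Hypothetical toy records; the real dataset has many more columns.
rows = [{"age": 20 + i, "income": 30000 + 1000 * i, "eligible": i % 2} for i in range(10)]
rows.append({"age": None, "income": 50000, "eligible": 1})  # removed in step 1

train, test = clean_and_split(rows)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible, so both models can be evaluated on the same held-out applicants.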

Analytical Approach
We selected the random forest and logistic regression models for our analysis due to their
complementary strengths in handling classification tasks.
The random forest algorithm constructs multiple decision trees and combines their outputs to
enhance accuracy and stability in predictions. It excels at managing extensive datasets with
intricate variable interactions. One of its key advantages is its ability to highlight the most
influential predictors of credit card eligibility through variable importance metrics. Furthermore,
random forests are resilient to overfitting and, in many implementations, offer ways to handle
missing data. We chose this approach for its high accuracy, its reduced risk of overfitting, and its
ability to measure the importance of each factor.
Logistic regression is a standard approach for binary classification problems. It offers clear
insights into the relationship between predictors and the target variable. By examining the model
coefficients, we can understand the impact of each predictor on credit card eligibility, both in
terms of direction and strength. It also reveals the statistical significance of each variable,
helping to confirm the key factors identified by the random forest model.
Using both models allows us to check the results of one against the other and gain a
comprehensive understanding of the factors influencing credit card eligibility.
Random Forest Model
To build the random forest model, we followed the preparation steps described in the previous
section: rows with missing values were removed, the target variable was transformed into a
categorical factor, and the dataset was divided into training (80%) and testing (20%) subsets.

We then built the random forest model using the training set. The model was configured with 500
trees, and four candidate variables were considered at each split. We also enabled the importance
parameter to assess the significance of each variable in predicting credit card eligibility.

The model demonstrated an out-of-bag (OOB) error rate of 13.21%, which corresponds to an OOB
prediction accuracy of approximately 86.79%.

The following variables were identified by the random forest model as the most important factors
for determining credit card eligibility:
1. Age: Age was the most influential factor, with the highest Mean Decrease Accuracy
(26.49) and Mean Decrease Gini (268.09). This means that age significantly improves the
model's accuracy and is very effective in splitting the data for decision-making.
2. Years Employed: This variable had a substantial impact, with a Mean Decrease Gini
score of 218.72. This indicates that the number of years someone has been employed is a
strong predictor of their credit card eligibility.
3. Account Length: Account length was another important factor, with a Mean Decrease
Gini score of 212.36. This means that the length of time someone has had an account is a
significant indicator of their eligibility.
4. Total Income: Total income contributed moderately to the model’s predictions, with a
Mean Decrease Gini score of 190.02. While not as influential as age, years employed, or
account length, income still plays a role in determining eligibility.
5. Occupation Type: This variable also had a moderate impact, with a Mean Decrease Gini
score of 112.44. Occupation type helps predict credit card eligibility, but to a
lesser extent than the top three variables.
This list omits ID, which also has a high Mean Decrease Gini (266.08; see Appendix A) but is an
identifier rather than an applicant attribute.
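
For readers unfamiliar with Mean Decrease Gini, the sketch below shows the quantity it accumulates: the drop in Gini impurity produced by a single split. A random forest sums these drops for each variable within a tree and averages them over all trees. The toy labels are made up, and the function names are illustrative rather than taken from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Toy node with three non-eligible (0) and three eligible (1) applicants.
parent = [0, 0, 0, 1, 1, 1]
informative = gini_decrease(parent, [0, 0, 0], [1, 1, 1])    # separates the classes
uninformative = gini_decrease(parent, [0, 1], [0, 0, 1, 1])  # leaves both children mixed
print(informative, uninformative)
```

A variable that often produces splits like the first one accumulates a large Mean Decrease Gini. A variable with many distinct values has more candidate split points, which can inflate this measure.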

The confusion matrix provided further insights into the model’s performance.

Figure 1: RFM Matrix (Enlarged Image Available in Appendix)

The random forest model has an overall accuracy of 86.81%, meaning it correctly predicts credit
card eligibility around 87% of the time. The 95% confidence interval for this accuracy is 85.22%
to 88.29%. However, this accuracy is equal to the no-information rate of 86.81% (Appendix A),
which is the accuracy obtained by always predicting the majority class, and the Kappa statistic is
only 0.0057.
With non-eligible (class 0) as the positive class, the model is highly sensitive (99.94%): it
almost always correctly identifies non-eligible individuals. However, it has very low specificity
(0.39%), meaning it rarely correctly identifies those who are eligible (class 1).
While the model is effective at identifying non-eligible applicants, it struggles significantly with
identifying eligible ones. This issue is likely due to an imbalance in the dataset, where there are
many more non-eligible individuals than eligible ones.
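
One way to see the effect of the imbalance is to compare raw accuracy with balanced accuracy, which averages sensitivity and specificity so that each class counts equally. The sketch below uses the values reported in Appendix A; the function name is illustrative.

```python
def balanced_accuracy(sensitivity, specificity):
    """Average of the two per-class recall rates."""
    return (sensitivity + specificity) / 2

# Values reported for the random forest model in Appendix A.
sensitivity = 0.999407          # non-eligible applicants correctly identified
specificity = 0.003906          # eligible applicants correctly identified
accuracy = 0.8681
no_information_rate = 0.8681    # accuracy of always predicting the majority class

print(round(balanced_accuracy(sensitivity, specificity), 4))  # 0.5017
print(accuracy - no_information_rate)                         # 0.0
```

A balanced accuracy close to 0.5 together with zero gain over the no-information rate indicates that the model is, in effect, predicting "not eligible" for almost every applicant.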
Logistic Regression Model
To better understand credit card eligibility, we developed a logistic regression model using
various predictors. This model helps us identify which factors significantly influence the
likelihood of being eligible for a credit card.
The logistic regression model identified the following variables as significant factors for
determining credit card eligibility:
1. Account Length: This was the most significant predictor (p < 0.001, Estimate = 0.016).
Longer account lengths increase the likelihood of being eligible for a credit card.

2. Age: Age also showed significance (p < 0.01, Estimate = -0.013), indicating that younger
individuals are more likely to be eligible. The negative coefficient suggests that eligibility
decreases with age.

The following variables were found to be marginally significant:
1. Number of Family Members: This variable showed marginal significance (p ≈ 0.09,
Estimate = 0.754), suggesting that individuals with more family members may have a
higher likelihood of eligibility.
2. Single Marital Status: Being single also had a marginally significant positive impact (p
≈ 0.10, Estimate = 0.754) on eligibility.
Most other variables, such as “Gender”, “Own car”, “Own property”, and various “Occupation
type” and “Income type” categories, were not statistically significant. This indicates that the
model found no evidence that these factors affect credit card eligibility.
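
To make the coefficient sizes concrete, an estimate can be converted to an odds ratio by exponentiating it. The sketch below assumes the reported estimates are log-odds coefficients from a standard logistic regression and that Account Length and Age are measured in years; the helper name `odds_ratio` is illustrative.

```python
import math

def odds_ratio(estimate, units=1.0):
    """Multiplicative change in the odds of eligibility for a `units` increase in a predictor."""
    return math.exp(estimate * units)

# Estimates reported above (assumed to be on the log-odds scale).
estimates = {"Account Length": 0.016, "Age": -0.013}

for name, beta in estimates.items():
    per_year = odds_ratio(beta)
    per_decade = odds_ratio(beta, units=10)
    print(f"{name}: x{per_year:.3f} per year, x{per_decade:.3f} per 10 years")
```

Under these assumptions, each additional year of account length raises the odds of eligibility by roughly 1.6%, and each additional year of age lowers them by roughly 1.3%. The effects are statistically significant but small per year.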
The confusion matrix evaluates our logistic regression model on the test dataset of 1,941
applicants. Treating non-eligible (class 0) as the positive class, which is the positive class used
in Appendix A, the matrix shows the following outcomes:
• True Positives (TP): The model correctly identified 1,685 instances as non-eligible (class 0).
• False Positives (FP): The model incorrectly identified 3 instances as non-eligible when they
  were actually eligible (class 1).
• False Negatives (FN): The model did not incorrectly predict any instances as eligible
  when they were actually non-eligible.
• True Negatives (TN): The model correctly identified 253 instances as eligible (class 1).
Figure 2: LRM Matrix (Enlarged Image Avail. in Appendix)

Overall accuracy is (1,685 + 253) / 1,941, or approximately 99.85%, indicating that the model
reliably predicts both eligible and non-eligible applicants. Sensitivity (recall), the share of
non-eligible applicants correctly identified, is 1,685 / 1,685 = 100%. Specificity, the share of
eligible applicants correctly identified, is 253 / 256, or about 98.83%.

The positive predictive value (precision) is 1,685 / 1,688, or about 99.82%, indicating that
nearly all applicants predicted as non-eligible are indeed non-eligible. The negative predictive
value is 253 / 253 = 100%, meaning that every applicant predicted as eligible is indeed eligible.
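
Because the names of these metrics depend on which class is treated as positive, it can help to compute them directly from the counts. The sketch below keys the four counts reported above by (actual class, predicted class) and derives the metrics for either choice of positive class; the function and variable names are illustrative.

```python
def binary_metrics(counts, positive):
    """Metrics from a 2x2 table; `counts` maps (actual, predicted) to a number of cases."""
    negative = 1 - positive
    tp = counts[(positive, positive)]   # positives predicted as positive
    fn = counts[(positive, negative)]   # positives predicted as negative
    fp = counts[(negative, positive)]   # negatives predicted as positive
    tn = counts[(negative, negative)]   # negatives predicted as negative

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Counts described above: 0 = non-eligible, 1 = eligible.
counts = {
    (0, 0): 1685,  # non-eligible, predicted non-eligible
    (1, 0): 3,     # eligible, predicted non-eligible
    (0, 1): 0,     # non-eligible, predicted eligible
    (1, 1): 253,   # eligible, predicted eligible
}

for positive in (0, 1):
    print(positive, {k: round(v, 4) for k, v in binary_metrics(counts, positive).items()})
```

Accuracy is the same for both choices of positive class, while sensitivity swaps with specificity and PPV swaps with NPV.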

These metrics collectively highlight the model’s strong performance, particularly in accurately
identifying non-eligible applicants. However, as with the random forest model, the results should
be read in light of the class imbalance in the data: the large majority of applicants in the test
set are non-eligible, which can skew the predictions and makes high overall accuracy easier to
achieve.

Comparison
The logistic regression model aligns with the random forest model in identifying “Account
length” and “Age” as significant predictors. However, it does not highlight “Years employed” or
“Total income” as significant, although both were notable in the random forest model.
Combining the insights from both the logistic regression and random forest models provides a
more comprehensive understanding of credit card eligibility. The consistency in identifying
“Account length” and “Age” as important factors across both models enhances the reliability of
these findings. While the random forest model emphasizes the importance of “Years employed”
and “Total income”, the logistic regression model sheds light on additional nuances, such as the
influence of being single. This multi-model approach allows the results of one model to be
checked against the other, increasing confidence in the identified key predictors.

Recommendations
Based on the analysis of credit card eligibility using both logistic regression and random forest
models, several key insights have emerged that may help financial institutions develop strategies
moving forward. First, our models have shown that account length is a significant predictor of
credit card eligibility. We recommend institutions prioritize applicants who have maintained
long-standing accounts. Marketing campaigns should target these loyal customers with
pre-approved credit card offers, highlighting their valued relationship with the institution.
Second, age was found to be a crucial factor, with younger individuals more likely to be eligible.
Financial institutions should design specific credit card products and marketing strategies aimed
at younger demographics, such as young professionals and students. These products can
emphasize benefits that align with their lifestyle and financial needs.
Additionally, the importance of years employed was highlighted, suggesting that stable,
long-term employment is a key factor in eligibility. Incorporating employment history into
eligibility criteria and offering tailored products to individuals with steady employment can
enhance approval rates and customer loyalty.
Finally, while not always statistically significant, total income and occupation type were
highlighted by the models. Financial institutions should refine their risk assessment process by
developing tailored credit products for different income levels and occupations, offering special
terms or benefits to high-income earners or individuals in stable, high-skill jobs.

Limitations of Research
The main limitation of this research is the class imbalance in our dataset, which limits how
accurately our models predict eligible applicants. We should implement measures to balance our
dataset, such as oversampling eligible applicants or undersampling non-eligible ones. This will
ensure fair representation and improve the models’ performance. Additionally, we could revisit
the models, experimenting with different training and testing split ratios. Altering the ratios
could result in higher performance and increased accuracy. By implementing these
recommendations, we can improve our credit card approval process, enhance customer
satisfaction, and increase the number of eligible applicants we correctly identify while
effectively managing risks.
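
As an illustration of the oversampling idea, the sketch below duplicates randomly chosen minority-class rows until both classes are the same size. It is a minimal example in plain Python, not a method used in this report: the field name `eligible`, the toy class counts, and the seed are assumptions. Rebalancing of this kind is normally applied to the training set only, so that the test set still reflects the real class mix.

```python
import random
from collections import Counter

def oversample_minority(rows, target="eligible", seed=0):
    """Randomly duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k is 0 for the largest class).
        balanced.extend(rng.choices(group, k=largest - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical training set with roughly the imbalance seen in this report.
train = [{"id": i, "eligible": 0} for i in range(87)]
train += [{"id": 87 + i, "eligible": 1} for i in range(13)]

balanced = oversample_minority(train)
print(Counter(row["eligible"] for row in balanced))
```

Undersampling works the other way, drawing a random subset of the majority class, at the cost of discarding data.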

Appendix A: Technical Exhibits


I. Random Forest Model Variable Importance Metrics
Variable          0          1           Mean Decrease Accuracy   Mean Decrease Gini
Age               29.12265   -12.29063   26.491112                268.0933
ID                1.014392   6.587294    3.282243                 266.0823
Years Employed    19.71356   -9.912052   19.308653                218.7194
Account Length    3.944472   7.358619    6.212769                 212.3571
Total Income      3.870738   3.435502    4.89657                  190.0246
Occupation Type   8.587951   -1.128961   8.505376                 112.4415

II. Random Forest Model: Confusion Matrix

III. Random Forest Model: Confusion Matrix


Metric                            Value
Accuracy                          0.8681
95% CI                            (0.8522, 0.8829)
No Information Rate               0.8681
P-Value [Acc > NIR]               0.5167
Kappa                             0.0057
Mcnemar's Test P-Value            <2e-16
Sensitivity                       0.999407
Specificity                       0.003906
Positive Predictive Value (PPV)   0.868489
Negative Predictive Value (NPV)   0.5
Prevalence                        0.868109
Detection Rate                    0.867594
Detection Prevalence              0.99897
Balanced Accuracy                 0.501656
Positive Class                    0

IV. Logistic Regression Model: Confusion Matrix
