Final Report - Group 4
Group 4
Contents
Overview
Source Data Analysis
Dataset Cleaning and Preprocessing
Analytical Approach
Random Forest Model
Logistic Regression Model
Comparison
Recommendations
Limitations of Research
Overview
The financial industry is continuously seeking ways to enhance the accuracy and efficiency of
credit card approval processes. The ability to accurately predict card eligibility criteria is critical
for financial institutions aiming to mitigate risk, increase revenue, and improve overall customer
satisfaction. This research not only highlights key focus areas for financial institutions but
also gives consumers insight into the relevant factors that lead to credit
eligibility. By leveraging advanced analytical models, we aim to identify the most significant
predictors of credit card eligibility, allowing for more targeted marketing strategies and refined
risk assessment criteria.
In the following sections, we will outline our data cleaning and processing methods, present the
findings from our random forest and logistic regression models, compare the results, and offer
actionable recommendations based on our analysis. This comprehensive approach ensures that
our conclusions are well-supported and can be effectively implemented to achieve our goals.
Findings from this study can also contribute to broader discussions on financial inclusion,
illuminating how data-driven approaches can uphold equitable lending practices.
Owns Car: Indicator of whether the applicant owns a car.
Owns Property: Indicator of whether the applicant owns property.
Number of Family Members: The number of family members living with the applicant.
Target Variable:
Credit Card Eligibility: A binary outcome indicating whether the applicant is eligible
for a credit card (1 = Eligible, 0 = Not Eligible).
This dataset provides a comprehensive view of various factors that could influence an
individual's eligibility for a credit card. By analyzing these attributes, we aim to uncover patterns
and insights that can help improve the credit card approval process.
Analytical Approach
We selected the random forest and logistic regression models for our analysis due to their
complementary strengths in handling classification tasks.
The random forest algorithm constructs multiple decision trees and combines their outputs to
enhance accuracy and stability in predictions. It excels at managing extensive datasets with
intricate variable interactions. One of its key advantages is its ability to highlight the most
influential predictors of credit card eligibility through variable importance metrics. Furthermore,
random forests are highly resilient to overfitting and can effectively deal with missing data. We
chose this approach for its high accuracy, reduced overfitting, and ability to measure the
importance of each factor.
Logistic regression is a standard approach for binary classification problems. It offers clear
insights into the relationship between predictors and the target variable. By examining the model
coefficients, we can understand the impact of each predictor on credit card eligibility, both in
terms of direction and strength. It also reveals the statistical significance of each variable,
helping to confirm the key factors identified by the random forest model.
Using both models allows us to cross-check results and gain a comprehensive understanding of
the factors influencing credit card eligibility.
Random Forest Model
To develop a robust random forest model for predicting credit card eligibility, we employed a
meticulous approach that ensured data integrity and model reliability.
First, all missing values were identified and removed, and the target variable, representing credit
card eligibility, was transformed into a categorical factor to facilitate the classification process.
Next, we divided the dataset into training and testing subsets, with 80% allocated for training and
20% for testing. This split was critical for evaluating the model’s performance on unseen data
and ensuring its predictive accuracy. With the data prepared, we built the random forest model
using the training set. The model was configured with 500 trees, and four candidate variables
were considered at each split. We also enabled the importance parameter to assess the
significance of each variable in predicting credit card eligibility. Together, these steps gave us
a consistently prepared dataset and a model whose variable importance scores could be used to
analyze the factors that influence credit card eligibility.
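The preparation steps described above (dropping records with missing values, then an 80/20 train/test split) can be sketched as follows. The report does not show its own code, so this is only an illustration using the Python standard library; the function names, the seed, and the toy records are assumptions, not the authors' implementation.

```python
import random


def drop_missing(rows):
    """Keep only records in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split them into train and test subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed (hypothetical) for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records; the real dataset has many more columns.
records = [{"age": 20 + i, "eligible": i % 2} for i in range(10)]
records.append({"age": None, "eligible": 1})  # one record with a missing value

clean = drop_missing(records)
train, test = train_test_split(clean)
print(len(clean), len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters when comparing two models on the same test subset.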
The random forest model, utilizing 500 trees and evaluating four variables at each split,
demonstrated an out-of-bag (OOB) error rate of 13.21%, indicating a prediction accuracy of
approximately 86.79%.
The following variables were identified by the random forest model as significant factors for
determining credit card eligibility:
1. Age: Age was the most influential factor, with the highest Mean Decrease Accuracy
(26.49) and Mean Decrease Gini (268.09). This means that age significantly improves the
model's accuracy and is very effective in splitting the data for decision-making.
2. Years Employed: This variable had a substantial impact, with a Mean Decrease Gini
score of 218.72. This indicates that the number of years someone has been employed is a
strong predictor of their credit card eligibility.
3. Account Length: Account length was another important factor, with a Mean Decrease
Gini score of 212.36. This means that the length of time someone has had an account is a
significant indicator of their eligibility.
4. Total Income: Total income contributed moderately to the model’s predictions, with a
Mean Decrease Gini score of 190.02. While not as influential as age, years employed, or
account length, income still plays a role in determining eligibility.
5. Occupation Type: This variable also had a moderate impact, with a Mean Decrease Gini
score of 112.44. The type of occupation helps predict credit card eligibility, but to a
lesser extent than the top three variables.
The confusion matrix provided further insights into the model’s performance.
2. Age: Age also showed significance (p < 0.01, Estimate = -0.013), indicating that younger
individuals are more likely to be eligible. The negative coefficient suggests that eligibility
decreases with age.
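To make the size of the age effect concrete, a logistic regression coefficient can be converted to an odds ratio by exponentiating it. The sketch below assumes the reported estimate is on the log-odds scale (the usual logit link) and that age is measured in years; neither is stated explicitly in the report, and the variable names are illustrative.

```python
import math

# Estimate reported for Age; assumed to be a log-odds coefficient per year of age.
age_coefficient = -0.013

# Multiplicative change in the odds of eligibility for one additional year of age.
odds_ratio_per_year = math.exp(age_coefficient)

# The same effect accumulated over ten years.
odds_ratio_per_decade = math.exp(10 * age_coefficient)

print(f"per year:   {odds_ratio_per_year:.4f}")
print(f"per decade: {odds_ratio_per_decade:.4f}")
```

Under these assumptions, each additional year lowers the odds of eligibility by a little over 1%, and ten years by roughly 12%, holding the other predictors fixed.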
The following variables were found to be marginally significant:
1. Number of Family Members: This variable showed marginal significance (p ≈ 0.09,
Estimate = 0.754), suggesting that individuals with more family members may have a
higher likelihood of eligibility.
2. Single Marital Status: Being single also had a marginally significant positive impact (p
≈ 0.10, Estimate = 0.754) on eligibility.
Most other variables, such as “Gender”, “Own car”, “Own property”, and various “Occupation
type” and “Income type” categories, were not statistically significant. This indicates that these
factors do not meaningfully impact credit card eligibility in this model.
The confusion matrix provides a detailed evaluation of our logistic regression model's performance
on the test dataset, offering insights into its predictive accuracy for credit card eligibility.
The matrix reveals the following key outcomes:
True Positives (TP): The model correctly identified 1685 instances as non-eligible (class 0).
False Negatives (FN): The model incorrectly identified 3 instances as non-eligible when they were
actually eligible (class 1).
False Positives (FP): The model did not incorrectly predict any instances as eligible
when they were actually non-eligible.
True Negatives (TN): The model correctly identified 253 instances as eligible (class 1).
Figure 2: LRM Matrix (Enlarged Image Avail. in Appendix)
The model demonstrates very high accuracy on the test set, with an overall correctness rate of
approximately 99.85%, indicating that it reliably predicts both eligible and non-eligible
applicants. The model’s sensitivity, or recall, is also high at around 99.82%, showing its strong
ability to correctly identify eligible applicants. Specificity, which measures the model’s
accuracy in identifying non-eligible applicants, is 100%, meaning it accurately classifies all
non-eligible individuals.
Moreover, the positive predictive value (precision) is also 100%, indicating that all
applicants predicted as non-eligible are indeed non-eligible. The negative predictive value, while
slightly lower at about 98.82%, still shows that the large majority of applicants predicted as
eligible are indeed eligible.
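The summary metrics quoted above follow directly from the four confusion matrix counts. The sketch below uses the counts with the labels exactly as the report assigns them (TP = 1685, FN = 3, FP = 0, TN = 253); the function name is illustrative, and which cell counts as a false positive or a false negative depends on which class is treated as positive.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall for the positive class
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Counts as labeled in the report.
metrics = confusion_metrics(tp=1685, fn=3, fp=0, tn=253)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the results agree with the reported percentages to within rounding.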
These metrics collectively highlight the model’s robust performance, particularly in accurately
identifying non-eligible applicants. However, the model also reveals a potential class imbalance
issue, as indicated by the extremely high specificity and precision, similar to the random forest
model. This imbalance suggests that the majority of applicants are non-eligible, which may skew
the predictions.
Comparison
The logistic regression model aligns with the random forest model in identifying “Account
length” and “Age” as significant predictors. However, it does not highlight “Years employed” or
“Total income” as significant, which were notable in the random forest model.
Combining the insights from both the logistic regression and random forest models provides a
more comprehensive understanding of credit card eligibility. The consistency in identifying
“Account length” and “Age” as significant factors across both models enhances the reliability of
these findings. While the random forest model emphasizes the importance of “Years employed”
and “Total income”, the logistic regression model sheds light on additional nuances, such as the
influence of being single. Comparing the two models lets us cross-check
results, increasing confidence in the identified key predictors.
Recommendations
Based on the analysis of credit card eligibility using both logistic regression and random forest
models, several key insights have emerged that may help financial institutions develop strategies
moving forward. First, our models have shown that account length is a significant predictor of
credit card eligibility. We recommend that institutions prioritize applicants who have maintained
long-standing accounts. Marketing campaigns should target these loyal customers with
pre-approved credit card offers, highlighting their valued relationship with the institution.
Second, age was found to be a crucial factor, with younger individuals more likely to be eligible.
Financial institutions should design specific credit card products and marketing strategies aimed
at younger demographics, such as young professionals and students. These products can emphasize
benefits that align with their lifestyle and financial needs. Additionally, the importance of
years employed was highlighted, suggesting that stable, long-term employment is a key factor in
eligibility. Incorporating employment history into eligibility criteria and offering tailored
products to individuals with steady employment can enhance approval rates and customer loyalty.
Finally, while not always statistically significant, total income and occupation type were
highlighted by the random forest model. Financial institutions should refine their risk
assessment process by developing tailored credit products for different income levels and
occupations, offering special terms or benefits to high-income earners or individuals in stable,
high-skill jobs.
Limitations of Research
It is important to note that class imbalance limits the predictive accuracy of our models. We
should balance the dataset, for example by oversampling eligible applicants or undersampling
non-eligible ones, to ensure fair representation and improve model performance. Additionally, we
could revisit the models and experiment with different training and testing split ratios, which
could yield higher performance and accuracy. By implementing these recommendations, we can
improve our credit card approval process, enhance customer satisfaction, and increase the
number of eligible applicants while effectively managing risks.
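One of the balancing measures mentioned above, random oversampling of the smaller class, can be sketched as follows. This is a minimal standard-library illustration rather than the method the authors used; the function name, the `eligible` label key, the seed, and the toy rows are all assumptions.

```python
import random


def oversample_minority(rows, label_key="eligible", seed=0):
    """Resample smaller classes with replacement until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra copies only for classes below the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced toy data: 8 non-eligible rows, 2 eligible rows.
rows = [{"eligible": 0}] * 8 + [{"eligible": 1}] * 2
balanced = oversample_minority(rows)
counts = {label: sum(r["eligible"] == label for r in balanced) for label in (0, 1)}
print(counts)
```

Oversampling is normally applied to the training subset only, so that the test subset still reflects the original class proportions.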
Total Income 3.870738 3.435502 4.89657 190.0246
Metric Value
Prevalence 0.868109
Detection Rate 0.867594
Detection Prevalence 0.99897
Balanced Accuracy 0.501656
Positive Class 0