Welcome to Scribd!

Home Credit Default Risk

Uploaded by

0% found this document useful (0 votes)

26 views21 pages

This document discusses a Kaggle competition hosted by Home Credit to predict credit default risk. The goal is to evaluate customers who lack credit histories to determine if they are worthy of obtaining a loan. The dataset contains application information for past loans, with the target variable indicating whether each loan was repaid or defaulted. Key steps in the analysis include exploring the target distribution, handling missing values, encoding categorical variables, identifying anomalies, and analyzing correlations between features to understand relationships with the target variable.

Original Description:

Copyright

Available Formats

PPTX, PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PPTX, PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pptx, pdf, or txt

0% found this document useful (0 votes)

26 views21 pages

Home Credit Default Risk

Uploaded by

Vinothkumar Radhakrishnan

Copyright:

Available Formats

Download as PPTX, PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pptx, pdf, or txt

Jump to Page

You are on page 1of 21

Search inside document

Home Credit Default Risk Mike Sun

孙穆
Kaggle Project: Analysis
Information adapted
from Will Koehrson
项目概况

• Kaggle 是一个举办机器学习比赛的网站
• The Home Credit Group is hosting this competition to find more
accurate models for predicting risk of credit default.
• Their goal is to allow worthy customers who lack credit histories to be
evaluated and be able to obtain a loan
• 一共十个数据集
• 主要是 application_train.csv( 训练集 ) 和
application_test.csv( 测试集 )
• 每一行代表一个贷款（ SK_ID_CURR ）
• 训练集有多一行： TARGET （目标）
• 0: 偿还贷款了 1 ：没有偿还贷款

了解数据
Evaluation Metrics
• ROC AUC – Receiver Operating Characteristic Area Under the Curve
• Graphs true positive rate vs the false positive rate
• Integral of the ROC directly proportional with quality of the model
• Measured as a probability, not as binary
• Accuracy – correctly identified vs total
• Sensitivity – true positive rate (% true positives identified correctly)
• Specificity – true negative rate (% true negatives identified correctly)
Relevant Tools
• Numpy – support for managing large data sets represented as
matrices
• Pandas – data manipulation and analysis
• Scikit-learn – simple tools for data analysis (regression, clustering
algorithms, etc)
• Matplotlib – plotting and graphical representation
• Seaborn – expands graphical tools in matplotlib
• Target Distribution
• Missing Values
Exploratory Data • Column Types
Analysis • Dealing with Categorical Values
• Dealing with Anomalies
Target Distribution
• Recall 0: repaid, 1: defaulted
• Note this is an “imbalanced class problem”
• In more complicated machine learning
models, we can “weight the classes by their
representation in the data to reflect this
imbalance”
Missing Values
• We should fill in missing values to
accommodate our ML model, know as
imputation.
• There are models such as XGBoost, however,
that can handle missing values.
Column Types
Encoding Categorical Values
• Machine learning models cannot directly work with categorical
values, as categorical values are not numbers.
• Label encoding – simpler but assigns arbitrary ordering
• One-hot encoding – solves the arbitrary ordering but creates many
new columns

Original Label Encoding One-hot Encoding

Anomalies
Example Anomaly – Days Employed
Correlations
• Pearson correlation coefficient
• .00-.19 “very weak”
• .20-.39 “weak”
• .40-.59 “moderate”
• .60-.79 “strong”
• .80-1.0 “very strong”
Correlations from Data
Example Correlation – Repayment vs Age
• In the data, age is actually expressed as the negative age in days.
Thus, the proper correlation of [Positive Age vs Repayment] would be
-0.07823930830982694.
• This means that age has a negative linear relationship with likelihood of
repayment. The older an individual, the more likely they will repay the loan.

KDE (kernel density estimation plot)

Average Failure to Repay Loans by Age
External Sources
• There are columns labeled only as EXT_SOURCE_[1,2,3]
• They had the greatest negative correlation with the Target data
• We know nothing about how these figures were obtained, but can
still check their correlation with and predict their effect on the target
Co-correlation Matrix
Co-correlation Matrix

• Shows the correlation of each feature

with each other feature
• Any feature will have a correlation of 1
with itself
• Correlation with TARGET is the most
important aspect here
Distribution vs Target of
EXT_SOURCE_[1,2,3]
Pairs Plots
• [Learn and add slides]
• https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-
f228cf529166

Chapter 3 Quantitative Research Methodology
Document54 pages
Chapter 3 Quantitative Research Methodology
Clarence Dave T. Tolentino
No ratings yet
Week 15
Document41 pages
Week 15
sirajquirish
No ratings yet
Dat Science: CLASS 11: Clustering and Dimensionality Reduction
Document30 pages
Dat Science: CLASS 11: Clustering and Dimensionality Reduction
ashishamitav123
No ratings yet
DM - Ch4 - Classification (Part1)
Document20 pages
DM - Ch4 - Classification (Part1)
C.RadhiyaDevi
No ratings yet
Fitting A Model To Data
Document41 pages
Fitting A Model To Data
Gökhan
No ratings yet
ANN Lecture 13
Document53 pages
ANN Lecture 13
Afras Ahmad
No ratings yet
DADM S15 K-NN Classification
Document13 pages
DADM S15 K-NN Classification
Anisha Sapra
No ratings yet
8 Classification
Document16 pages
8 Classification
sdog444514
No ratings yet
Data Mining Project Shivani Pandey
Document40 pages
Data Mining Project Shivani Pandey
Shivich10
100% (1)
02.data Preprocessing PDF
Document31 pages
02.data Preprocessing PDF
sunil
100% (1)
Intro To Machine Learning
Document25 pages
Intro To Machine Learning
PRAGYA SINGH BAGHEL STUDENTS JAIPURIA INDORE
No ratings yet
Types of Machine Learning Algorithms
Document14 pages
Types of Machine Learning Algorithms
Vipin Rajput
No ratings yet
BIDM Session 07-08
Document44 pages
BIDM Session 07-08
Ajit chowdary
No ratings yet
Credit EDA Case Study
Document42 pages
Credit EDA Case Study
varsha
No ratings yet
DWDM Notes Unit-4
Document89 pages
DWDM Notes Unit-4
Mudit Rajput
No ratings yet
MachineLearning Jan2nd
Document171 pages
MachineLearning Jan2nd
goodboy
100% (2)
7 Regression
Document15 pages
7 Regression
sdog444514
No ratings yet
Application of ANN in Predicting Credit Card Default
Document19 pages
Application of ANN in Predicting Credit Card Default
Indra kandala
No ratings yet
Lecture 2 Final
Document90 pages
Lecture 2 Final
somsonengda
No ratings yet
DuongToGiangSon 517H0162 HW2 Nov-26
Document17 pages
DuongToGiangSon 517H0162 HW2 Nov-26
Son Tran
No ratings yet
Unit 2
Document18 pages
Unit 2
rk73462002
No ratings yet
IS4242 W5 Predictive Modeling
Document81 pages
IS4242 W5 Predictive Modeling
wongdeshun4
No ratings yet
Chapter 2 Data Preprocessing
Document23 pages
Chapter 2 Data Preprocessing
liyu agye
No ratings yet
Coding Neural Networks-Classification & Regression
Document39 pages
Coding Neural Networks-Classification & Regression
Hasanur Rahman
No ratings yet
ML 1 2 3
Document54 pages
ML 1 2 3
Shoba Natesh
No ratings yet
Ihic-2022 PPT Paper - Id 100
Document11 pages
Ihic-2022 PPT Paper - Id 100
prashantrinku
No ratings yet
Data Splitting and Bias Variance Tradeoff
Document14 pages
Data Splitting and Bias Variance Tradeoff
Eileen Lovegood
No ratings yet
Churn Prediction Report
Document4 pages
Churn Prediction Report
Ause El
No ratings yet
Module 1 ML Mumbai University
Document47 pages
Module 1 ML Mumbai University
2021.shreya.pawaskar
No ratings yet
Unit 4 - Machine Learning PDF
Document49 pages
Unit 4 - Machine Learning PDF
test test
No ratings yet
SL Vs Ul
Document3 pages
SL Vs Ul
Aditi Jindal
No ratings yet
Chap2 SupervisedLearning
Document24 pages
Chap2 SupervisedLearning
Nguyễn Phương Nam
No ratings yet
Unsupervised Machine Learning
Document10 pages
Unsupervised Machine Learning
Ananya S
No ratings yet
Lecture 3: Handwriting Recognition and Classification
Document51 pages
Lecture 3: Handwriting Recognition and Classification
kunal13
No ratings yet
1 - KNN
Document19 pages
1 - KNN
abdala sabry
No ratings yet
Understanding The Inners of Clustering: DR Akashdeep, UIET, Panjab University Chandigarh, Maivriklab@pu - Ac.in
Document61 pages
Understanding The Inners of Clustering: DR Akashdeep, UIET, Panjab University Chandigarh, Maivriklab@pu - Ac.in
Dr. Dnyaneshwar Kirange
No ratings yet
Unit III 1
Document21 pages
Unit III 1
mananrawat537
No ratings yet
GDS14a New
Document8 pages
GDS14a New
Anonymous 8X5XEg
No ratings yet
Basics of ML and Evaluation
Document42 pages
Basics of ML and Evaluation
animeshrajak649
No ratings yet
Credit Risk Analysis
Document6 pages
Credit Risk Analysis
Subangkit Ramadiputra
No ratings yet
Week 03
Document28 pages
Week 03
Osii C
No ratings yet
Machine Learning Introduction
Document17 pages
Machine Learning Introduction
Er Himanshu Singhal
No ratings yet
Choosing Model and Tuning
Document20 pages
Choosing Model and Tuning
kar20201214
No ratings yet
CS550 Regression
Document62 pages
CS550 Regression
dipsresearch
No ratings yet
Machine Learning Presentation
Document18 pages
Machine Learning Presentation
Lallu Bhai YT
No ratings yet
Chapter 5: Supervised Learning - Classification: Track 2: Cloud Big Data - Big Data 3
Document48 pages
Chapter 5: Supervised Learning - Classification: Track 2: Cloud Big Data - Big Data 3
Alexis Harkov
No ratings yet
ML Unit 2
Document90 pages
ML Unit 2
Aanchal Padmavat
No ratings yet
Regression
Document109 pages
Regression
Pranati Bharadkar
100% (2)
Machine Learning Theory
Document12 pages
Machine Learning Theory
airplaneunderwater
No ratings yet
Supervised Algorithms-Regression: Linear & Logistic: Muhammad Bello Aliyu
Document32 pages
Supervised Algorithms-Regression: Linear & Logistic: Muhammad Bello Aliyu
Mohammed Danlami Yusuf
100% (1)
Classification: Unit-III
Document90 pages
Classification: Unit-III
KRISHMA
No ratings yet
Semi Supervised Learning
Document86 pages
Semi Supervised Learning
chaudharylalit025
No ratings yet
Expert Talk On Methods in Data Science and Machine Learning
Document18 pages
Expert Talk On Methods in Data Science and Machine Learning
priyalisakhare
No ratings yet
CHP 3
Document70 pages
CHP 3
its9918k
No ratings yet
20ECE633T Machine Learning in VLSI
Document81 pages
20ECE633T Machine Learning in VLSI
UMAMAHESWARI P (RA2113004011004)
No ratings yet
6 CNN
Document50 pages
6 CNN
SWAMYA RANJAN DAS
No ratings yet
Lecture8 Model Assessment Students
Document37 pages
Lecture8 Model Assessment Students
翁江靖
No ratings yet
ML - Machine Learning PDF
Document13 pages
ML - Machine Learning PDF
David Esteban Meneses Rendic
No ratings yet
Python Machine Learning for Beginners: Unsupervised Learning, Clustering, and Dimensionality Reduction. Part 1
From Everand
Python Machine Learning for Beginners: Unsupervised Learning, Clustering, and Dimensionality Reduction. Part 1
Tom Lesley
No ratings yet
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: SUPPORT VECTOR MACHINE, LOGISTIC REGRESSION, DISCRIMINANT ANALYSIS and DECISION TREES: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: SUPPORT VECTOR MACHINE, LOGISTIC REGRESSION, DISCRIMINANT ANALYSIS and DECISION TREES: Examples with MATLAB
César Pérez López
No ratings yet
DATA MINING and MACHINE LEARNING: CLUSTER ANALYSIS and kNN CLASSIFIERS. Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING: CLUSTER ANALYSIS and kNN CLASSIFIERS. Examples with MATLAB
César Pérez López
No ratings yet
Capstone-2 Market Basket Analysis Vinothkumar R
Document18 pages
Capstone-2 Market Basket Analysis Vinothkumar R
Vinothkumar Radhakrishnan
No ratings yet
Pandas Data Structures: Sections
Document13 pages
Pandas Data Structures: Sections
Vinothkumar Radhakrishnan
No ratings yet
Guidelines On PD Modelling: Fondi Besa
Document13 pages
Guidelines On PD Modelling: Fondi Besa
Vinothkumar Radhakrishnan
No ratings yet
EOM Checklist
Document4 pages
EOM Checklist
Vinothkumar Radhakrishnan
No ratings yet
Practical Research 2: What'S in
Document4 pages
Practical Research 2: What'S in
Claudine D. Salde
No ratings yet
Kursus NUS Singapore Quanti SPSS
Document2 pages
Kursus NUS Singapore Quanti SPSS
Arif Hassan
No ratings yet
Consumers Preference in Milk Tea Consumption in Jaen
Document11 pages
Consumers Preference in Milk Tea Consumption in Jaen
Jasmine Novelles Constantino
No ratings yet
Project Report
Document27 pages
Project Report
Panchami MP
No ratings yet
PGDBA 2021 PI Toolkit
Document2 pages
PGDBA 2021 PI Toolkit
thecoolguy96
No ratings yet
Composite Samples
Document4 pages
Composite Samples
Mas Bro
No ratings yet
EFM MCQs
Document34 pages
EFM MCQs
Stephen Celestine
No ratings yet
Block ANOVA Design Plant Science Example
Document28 pages
Block ANOVA Design Plant Science Example
merga abdeta
No ratings yet
Autoregressive Diffusion Model
Document23 pages
Autoregressive Diffusion Model
MohamedChiheb BenChaabane
No ratings yet
Impact of Using Technology in Auditing On Reducing
Document9 pages
Impact of Using Technology in Auditing On Reducing
Chiran Adhikari
No ratings yet
Hypothesis Testing
Document8 pages
Hypothesis Testing
Cindy Bartolay
No ratings yet
MAS-01 Cost Behavior Analysis
Document6 pages
MAS-01 Cost Behavior Analysis
Paupau
No ratings yet
Literature Survey
Document1 page
Literature Survey
Learn Samadesh
No ratings yet
Concepts and Methods in Migration Research: Conference Reader
Document153 pages
Concepts and Methods in Migration Research: Conference Reader
Nina Mariella
No ratings yet
Peer Pressure and Its Impact To Personality Development (Chapter 1-5)
Document51 pages
Peer Pressure and Its Impact To Personality Development (Chapter 1-5)
Jaze Ramos
No ratings yet
Statistics Sol
Document208 pages
Statistics Sol
khushi nagpal
No ratings yet
Unit - III Test No. 1
Document5 pages
Unit - III Test No. 1
madanvikas18
No ratings yet
Argumentative Content
Document10 pages
Argumentative Content
sharmine.cabusas
No ratings yet
Aparat PT Biodiesel METROHM
Document16 pages
Aparat PT Biodiesel METROHM
Corina Stanculescu
No ratings yet
Data Science MCQ Questions and Answer PDF
Document6 pages
Data Science MCQ Questions and Answer PDF
shrikant ilhe
100% (1)
ARIMA (P, D, Q) Model
Document4 pages
ARIMA (P, D, Q) Model
selamitsp
No ratings yet
File 5eac396f17871
Document6 pages
File 5eac396f17871
ajith
No ratings yet
Comparative Analysis of Time-Series Forecasting Algorithms For Stock Pice Prediction 2019
Document6 pages
Comparative Analysis of Time-Series Forecasting Algorithms For Stock Pice Prediction 2019
nurdi afrianto
No ratings yet
Research Final
Document59 pages
Research Final
Karen Joy Torres
100% (1)
Topolski J. - Methodology of History (1977) - P. 431-586, P. 605-624
Document186 pages
Topolski J. - Methodology of History (1977) - P. 431-586, P. 605-624
Sarbu Bianca
No ratings yet
BANBEIS
Document2 pages
BANBEIS
Anika Rahman
50% (2)
Econometrics Sample Paper
Document5 pages
Econometrics Sample Paper
Giri Prasad
No ratings yet
Introduction To Machine Learning: Workshop On Machine Learning For Intelligent Image Processing
Document44 pages
Introduction To Machine Learning: Workshop On Machine Learning For Intelligent Image Processing
Lekshmi
No ratings yet
Lecture On Sampling and Sample Size
Document16 pages
Lecture On Sampling and Sample Size
PRITI DAS
No ratings yet