
KYC Success Rate Analysis Using Supervised and Unsupervised Machine Learning and Data Visualization

Siddesh Sanna Gowdru

Dublin Business School

This dissertation is submitted for the degree of

Master of Science

January 2021
Declaration

I hereby declare that except where specific reference is made to the work of others, the
contents of this dissertation are original and have not been submitted in whole or in part
for consideration for any other degree or qualification in this, or any other university. This
dissertation is my own work and contains nothing which is the outcome of work done in
collaboration with others, except as specified in the text and Acknowledgements.

Siddesh Sanna Gowdru


January 2021
Acknowledgements

I want to thank my supervisor, Dr Charles Nwankire, for his guidance throughout the research
and for sharing his immense knowledge, which made the research more insightful. Finally, I would
like to thank my parents, college professors, life partner, and friends who stood behind me in
support of my Master's degree.
Abstract

In a fast-growing world, all financial institutions are struggling to provide satisfactory services
to their customers by improving their existing processes, and this improvement carries risk with it.
The KYC process is one such process, where financial institutions provide a digital platform
for customers to submit the identification documents required for KYC. N26 is a bank
that allows new applicants to upload their documents to such a digital platform. These
documents are then verified using document processing tools to confirm their authenticity. It was
observed that the KYC passing rate has declined, and the bank wants to analyse the reason. In
this research, we analyse the document verification reports dataset by employing the Random
Forest method to identify which factors influence the passing rate, and the K-Means method to
uncover similar patterns buried in the dataset. Tableau is used to analyse the clusters obtained
from the K-Means method for risk assessment and process improvement. The study results
indicate that the image's integrity plays a vital role in the KYC pass rate, and that documents
which failed to pass the KYC process show a lack of image integrity.
Keywords: KYC, K-Means Clustering, Random Forest, Data Visualization
Table of contents

List of figures vii

List of tables viii

1 Introduction 1
1.1 What is Know Your Customer in the banking system and why it is important? 2
1.2 Why is it important to analyze the KYC passing rate? . . . . . . . . . . . . 3
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Research Question: . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Research aims and objectives . . . . . . . . . . . . . . . . . . . . 4
1.4 Road map for the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Literature Review 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Anti Money Laundering - Know Your Customer . . . . . . . . . . . . . . . 7
2.3 Customer Segmentation and its importance in the success of the data-driven
business . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 K-Means Clustering and its applications . . . . . . . . . . . . . . . . . . . 12
2.5 One Hot-encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Importance of Random Forest in feature selection and its hyper-parameter
optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.7 Data Visualization and its necessity in revealing the hidden information in the
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Methodology 22
3.1 Business Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Data Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Implementation of the proposed models 31

5 Results Interpretation 34
5.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Cluster results Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6 Conclusion 44
6.1 Future Work: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

References 46
List of figures

2.1 Elbow plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 CRISP-Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


4.2 Elbow plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Country . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 CountryRisk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 ClusterAnalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.6 Cluster2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.7 Cluster2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.8 Cluster3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.9 Cluster3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
List of tables

4.1 Hyper Parameters Values for the Random Forest model . . . . . . . . . . . 32


Chapter 1

Introduction

Money laundering and terrorist financing pose a major threat to many financial
institutions across the world. This awareness arose when the 9/11 incident happened at the
World Trade Centre in 2001. After a detailed investigation of this incident (Koh, 2006), it was
observed that terrorist financing using laundered money supplied enough funding for the 9/11
attack. This terror attack opened the banking policymakers' eyes and led them to enforce strict measures
while onboarding customers to the banking system. From that moment, the Know Your Customer
policy came into action. If any financial institution fails to comply, it will be penalized by
the financial regulators, and its reputation will be ruined. Money launderers have
successfully tackled these barriers and caused reputational damage to many banking institutions
worldwide. Hence banking institutions apply strict KYC procedures by enhancing the due
diligence of individual customers. Improvements in technology and the move to
easier banking systems have brought more customers into the banking system in the past
few years. Unfortunately, this makes it harder for the banking sector to verify each customer and their
transactions once they are onboarded. It is no longer feasible to handle all the risk measures manually,
and the banking system needs advanced technologies to handle these risks. Some of the
advanced technologies on which the banking system relies heavily are machine learning and
data visualization.

Every customer comes to the bank hoping for a satisfying and healthy relationship with
the bank. On the other hand, banks also want more customers, and to achieve this, banks need
to work on customer satisfaction. Hence they should have an easy process for accessing
their services, such as opening a bank account within a minimal period of time and easy deposit
and withdrawal of money. Van Quyet et al. (2015) show that, to maintain one's
existence in the respective business line, customer satisfaction is important in a competitive
world. Hence, in our research, we analyse data from the KYC documentation process
of N26 bank. This financial institution is facing a problem with the KYC passing rate of
new applicants who wish to open an account with N26. Due to the high standards
of regulation imposed by the financial regulators, it is impossible to compromise these
standards while onboarding new customers. At the same time, it is important to gain more
customers and provide satisfactory services while onboarding. Hence, it is necessary to perform
customer segmentation to identify any similar behaviour in the applicants' characteristics
which may cause failure to pass KYC. The machine learning methodologies used in
this research help find a solution to these problems, and the analysis results are visualised to
find any potential threat in the applicants' demographics.

1.1 What is Know Your Customer in the banking system and why it is important?

Know Your Customer is a basic and significant part of the customer identification process.
The process came into enforcement to fight against money laundering and terrorist financing.
It is the first step in onboarding a new customer into the banking system. The
process involves the collection of relevant documents such as ID proof and personal verification
documents. Upon completion of the identification process, further due diligence is
conducted by the banking system. Rubin (2018) identified that economically strong

countries in Asia, Europe, and the USA from the past few years paid fine of about USD26
Billion dollars due to not adhering to the KYC policies and procedure. Nowadays, KYC is
transformed into digital KYC where the customer who wants to open a bank account can
directly apply through their smart devices by submitting their identification documents by
taking pictures using camera devices and uploading them to the respective banking website.
It is a convenient step for the customers to initiate a relationship with the bank from the
customers perspective. However, the same advanced technologies misused by the money
launders by uploading fake identification documents. If the bank fails to identify this kind of
fraudulent activities may lead to the failure in the AML-KYC compliance and results in the
huge penalties imposed by the regulators. Hence, it must have an enhanced documentation
process system that identifies fake documents and avoids potential threats in the initial
phase. It is also important to maintain customer satisfaction without creating a hustle to set a
relationship with the customers by rejecting their KYC process.

1.2 Why is it important to analyze the KYC passing rate?

Nowadays, every business institution is moving to digital platforms to provide ease of
service to its customers. Unfortunately, this brings an equal number of challenges in providing
the utmost quality of service, such as privacy issues, data security issues, and the storage
of digital information, according to Reis et al. (2018). In our research, N26
bank allows its customers to upload their identification documents through a digital platform.
The bank's agents then process these documents to verify the authenticity of the
documents submitted. An interesting fact is that the customers who wish to open an account
with N26 bank come from various parts of the world. The bank has recently noticed that the KYC
passing rate has decreased, and the management needs to know the reason behind
the decrease. Since customers from various parts of the world,
especially from high-risk-rated countries, are applying to open accounts, it is necessary

to assess the possible risk of money laundering. If the KYC process fails for customers
who have a genuine background, it will impact the bank's reputation, and the bank will lose
valuable customers. Hence, N26 bank needs to provide a satisfactory KYC service and
identify customer patterns to assess the future potential of money laundering risks.

1.3 Problem Statement

The dataset used for the analysis concerns the KYC passing rate of N26 bank customers. The
dataset was obtained from the Kaggle dataset repository. After considering the risk to the AML-KYC
passing rate, this research answers the following research question.

1.3.1 Research Question:

How can supervised and unsupervised machine learning algorithms, with the help of Tableau,
identify the factors affecting the KYC passing rate? Sub-question: How does cluster
analysis help identify similar behaviour patterns in the identification documents
that pose potential risk in the KYC process?

1.3.2 Research aims and objectives

Primary objectives of the research:

1. To identify potential features influencing the KYC passing rate.

2. To identify similar patterns in the dataset, which can be utilized for risk assessment
and customer satisfaction.

3. To categorize customers based on country risk rating so that the appropriate level of further
due diligence can be applied.

1.4 Road map for the Thesis

To accomplish the goal, this dissertation is divided into several chapters. They are listed
below as,

1. Chapter 1: This chapter gives an introduction to the business domain and the chal-
lenges it faces. It also discusses the author's motivation and the research question that
needs to be addressed in this dissertation report.

2. Chapter 2: This chapter deals with the literature review of the previous work done
related to our framed research question by the other researchers. We have discussed
AML-KYC challenges, the importance of customer segmentation, how to deal with the
K-Means clustering algorithm, the importance of random forest algorithm in knowing
the feature importance, and how data visualization techniques help solve machine
learning and business domain problems.

3. Chapter 3: This chapter elaborates the methodology, technologies and tools used to
address the research question. It consists of programming languages used to create
interactive visualizations, data preparation steps, and data modelling steps.

4. Chapter 4: This chapter explains the implementation of the proposed models. It also
includes a detailed discussion of the results obtained from the implementation and
the findings made while exploring the data.

5. Chapter 5: This chapter interprets the results, covering the exploratory data analysis
and the analysis of the cluster results.

6. Chapter 6: This chapter concludes the dissertation report. It gives an idea of
further research that can be conducted by other researchers, discusses the results
obtained by the research, and makes recommendations to N26 bank for
improvements in their process.
Chapter 2

Literature Review

2.1 Introduction

KYC - Know Your Customer is the process of collecting the necessary documents and verifying
their authenticity in order to open an account in any regulated financial institution.
The KYC process helps in identifying and fighting against money laundering and terrorist financing.
The FATF (Financial Action Task Force), an international financial regulatory body,
made KYC of the utmost importance and the first step in establishing a healthy relationship
with customers by identifying and verifying their basic profile details. The number of
customers coming to financial institutions has increased tremendously in recent years and,
at the same time, there has been an increase in fraudulent activities. Hence, banking systems
use the technologies available in the market to identify such fraudulent activities.
One such instance is document verification through electronic identity verification. Financial
institutions strive to improve customer satisfaction by introducing online submission of
the necessary documents to complete the KYC procedures. Unfortunately, fraudulent
activities occur through loopholes in the technology, for terrorist financing or money
laundering purposes. If financial institutions fail to identify such activities, they will be fined
heavily by the regulators, and their reputations will be ruined in the financial sector. Hence
financial institutions invest more money in identifying such threatening activities by using
the advanced technologies available. Machine learning algorithms, data visualization
techniques, and data analysis techniques do a great job of maintaining the reputation of
financial institutions and increasing customer satisfaction.

2.2 Anti Money Laundering - Know Your Customer

KYC is a due diligence process to identify the customers or clients who have a relationship with a
financial institution. Rajput (2013) states that the KYC process came
into action in 2001, under the US Patriot Act, after the terrorist attack on the World Trade Centre
in the United States of America, well known as the 9/11 incident. KYC procedures
and policies make risk management and assessment easier, safeguard against the threat of
heavy penalties from the regulators, and help maintain the financial
institution's reputation. KYC includes document collection and due diligence on each
document to verify its authenticity, and it varies from place to
place and from customer to customer.
Christie et al. (2018) studied the AML-KYC procedures and highlighted that financial
institutions are responsible for identifying any fraudulent activities arising from their
account holders and should submit all these details to their respective regulators on time. The
first step in the KYC process is to establish the customer's identity, and this is defined as
the customer identification process (CIP). In this step, financial institutions collect necessary
identity documents such as a passport, tax document, national identity card, driving licence,
work permit, or voter ID, and verify the originality of the documents provided by validating
the name, address, date of birth, issue date, expiry date, and so on. These details ensure that
customers are identified; the information is stored in the institution's database for future
reference, so that if any concerns are raised against a particular account, the financial
institution will know which account holder is accountable.

Over a period of time, financial services have evolved drastically under the influence of the sophis-
ticated technologies available in the market. Similarly, the identification of a customer's
identity is now also done through various digital devices and digital technologies. There was a
time when the only way of collecting documents was for customers to submit physical copies
to the bank branches, and it used to take more time to validate the identification. Banks
are striving to improve customer satisfaction so that they can retain their existing customers
and gain more customers to improve their business. Hence they have started taking the help of
digital technology, asking customers to submit their documents over the internet by
scanning the documents and uploading them to the respective portal. At the same time, this
places a huge risk and responsibility on the financial institutions, which must verify digital documents
using technology that can counter the technologies used to tamper with original documents or create
fake documents for opening accounts for fraudulent activities. If financial organizations
fail to detect these fake or fraudulent accounts, they are going to be penalized
heavily and their reputation in the financial sector will be ruined. Since identity documents
are not standardised, not all customers who want to open an account will hold the
same version of an identity document, which can lead to failure of the identification
process. At present, 1.5 billion people all over the world do not have legal documents, or hold
outdated documents, as per World Bank data (2017). Addressing the concern of customer
satisfaction with highly sophisticated technology to identify the legitimacy of digitally submitted
data can provide wider opportunities to gain more customers in the financial sector. The latest advanced
technologies such as machine learning algorithms, automation tools, and business intelligence
tools have made most processes easier in financial organizations and made their regulatory
processes smoother than before by bringing more efficiency and clarity in complying with
the compliance requirements set up by the regulatory bodies. Unfortunately, this kind of innovation
has not yet been fully adopted by the KYC process, which is a crucial step in regulatory policies and
procedures, as per Agathokleous (2009).

On the other hand, even once AML-KYC embraces technology, the task is not likely to be easy.
Once all the documents are collected and verified, more data will be generated.
Financial organisations should foresee possible threats such as data storage and maintenance issues
and cyber-attacks that try to steal customer data. CJason
(2020) conducted a study on Hong Kong financial institutions that emphasizes the importance of
technology in on-boarding new customers, of which KYC is a basic and important
step. The study also points out that the current technologies employed in the KYC process are not
sufficient and fail to clear the backlog quickly enough, and that the process still involves much manual
work which could be handled by the advanced technologies available in the market such as
Regtech. Aziz and Dowling (2019) emphasise that financial institutions should work on
data governance, which ensures the quality of the data collected during the KYC process.
This would help perform the document checks and ensure the quality of those checks.

2.3 Customer Segmentation and its importance in the success of the data-driven business

For any business, there will be customers who buy its products.
Customer behavior influences the ups and downs of any business. Each customer has
different requirements, and the grouping of customers who have similar requirements is
referred to as customer segmentation. It makes it easy for organizations to know customer
behavior, so they can offer products that match customer requirements and develop new
products to serve the needs of their customers. Research by Parsell et al. (2014) revealed
that customer segmentation also helps organizations run smoothly by allocating enough
market resources to a particular segment of customers. The first important task of customer
segmentation is to collect the important feature data required to obtain effective segmentation.
The next step is to integrate the data collected from various databases
and apply a suitable and effective data analysis method. The method of collecting data

and the method of data analysis vary from industry to industry and with customer demography,
sentiment, and lifestyle characteristics such as income, expenditure, and living standards.
Customer segmentation plays a vital role in financial institutions. One such instance is
credit card customer data. Wu and Lin (2005) conducted research on credit card data from the
year 2003 to identify similarities in credit cardholders' patterns of expenditure. This
research was conducted to identify opportunities and to improve market strategies so as to
improve customer satisfaction. In this customer segmentation research, they used two
methods to classify the customers into similar groups. The first method is the RFM method,
which takes the intervals between a customer's expenditures, the frequency of expenditure, and the amount
of expenditure into consideration. By considering these three factors, customers can be
classified into several groups, and banks can also find their top 20% most valued customers.
The other method used in this research was the customer value matrix method. In this method,
they consider the average time between purchases and the average amount of spending. Using these
factors, they classified customers into four groups. The first group consists of valuable
customers for the organization, the second group includes those who tend to spend more in
the future, the third group consists of customers who show uncertainty in using the credit card,
and the fourth group includes frequent-usage customers. This type of customer segmentation
helps financial institutions allocate their resources effectively to maximize
profitability and customer satisfaction.
Other research (Doganis, 2005) shows that customer segmentation is required to compete
with competitors in the respective business lines. Airlines have already
classified their customers based on their class of travel, but this is not sufficient to run the
business. If we look back at how many transitions have occurred in the airline industry, the
results show that it has changed drastically. In other words, the airline industry operates in a
highly competitive environment in which it must serve customer needs. A study was undertaken to improve the
efficiency of the airline business by classifying customers according to their needs. In
this study, they considered customer demographics, pricing structure, flight schedule,
punctuality of flights, and flexibility to increase the efficiency of operations. The
study reveals that there are three categories of customers who want to use airlines based
on their offerings. One category of customers gives preference to punctuality, so that their
travelling time is minimized by flights not being delayed. The second group of passengers
needs flights on a specific schedule, and the third group needs flexibility in booking
their tickets. It was also noticed that a major group of customers urged a high degree
of punctuality. The Teichert et al. (2008) study helps the airline industry to launch different
products for their customers and to know to which audience they have to direct
effective advertisement.
A study was conducted on the Indian banking sector regarding how historical data can be
used to prevent various threats to the banking system, by the researchers Srivastava and Gopalkr-
ishnan (2015). They successfully identified that customer segmentation
helps in avoiding threats to the banking system and in effective risk management. They
utilized customer data to perform sentiment analysis to understand customer behavior and
patterns of usage of the bank's products, and segmented customers into different groups so
they could sell their banking products to customers effectively. They also segmented
suspicious accounts for regular monitoring, to enable effective risk
management and avoid potential threats in the future.
Nevertheless, the customer and document segmentation process is also applied in the
AML-KYC process in order to increase the efficiency of the process, avoid money
laundering risk, and enable effective risk management. Research has been conducted in the USA
regarding document image segmentation (itede2012digital). This research topic arises from the
transformation of the manual KYC process into a digital process. When customers upload
their identity cards to the system, it is required to segment each document into a particular
category. There is no single standard device for uploading the documents, so customers
upload their documents through their cell phones with no standard procedure. Furthermore,
there is a variety of identity cards such as passports, driving licences, ID cards, and many more. These
things lead to difficulties in identifying the customers. Hence customer
segmentation using image segmentation is a crucial step for sub-tasks such as document
classification, text classification, and signature verification. In this research, convolutional
neural networks are used for image segmentation. The Batista das Neves Junior et al. (2020)
study concludes that there is a need for advanced technology to process the images, since these
documents are diverse in quality and document type.
The overall study of customer segmentation across different business domains shows the
importance of classifying customers based on their behavior and requirements.
The above studies highlight the importance of customer segmentation in avoiding unseen risks
and enabling effective risk management in each business sector. It also supports effective resource
management by allocating resources to the respective segments of customers.

2.4 K-Means Clustering and its applications

The K-Means algorithm is one of the most popular unsupervised machine learning techniques for data
analysis. The algorithm groups data values which exhibit similarity under
given circumstances. It is a well-known clustering algorithm that can be used in various
industries to solve problems such as image segmentation and information retrieval, as per
Bellot and El-Bèze (1999). Na et al. (2010) mention in their study that MacQueen
introduced the K-Means clustering method in 1967; it is a simple unsupervised machine
learning algorithm used to solve clustering problems. The K-Means algorithm includes two phases
to cluster the given data points. The first phase places the K centres
randomly, and the second phase calculates the distance from each data point to the nearest
centre, as per Bradley and Fayyad (1998).

Elbow Method: Kodinariya and Makwana (2013) explain that the elbow method is a
visual method of finding the number of clusters in a given data set. The logic behind the
method is to initialize K = 2 at the beginning and increase K by one at each step, computing
the clustering and its cost during training. The cost value decreases drastically up to some value of K;
when the cost is plotted against K, the curve starts to become flat at some value of K.
This point represents the required number of clusters.

Fig. 2.1 Example of identification of the number of clusters (Kodinariya and Makwana, 2013)

Once the number of clusters is identified, the distance from each data point to every cluster
centre is calculated in the first iteration, and the point is associated with the nearest
centre. In this way, all the data points are assigned to a cluster. In the next iteration, the cluster
centre positions are updated, and the distances from each centre are recalculated. This
iteration continues until the cluster centres no longer change position.
The K-Means clustering objective function is:

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2 \qquad (2.1)

where x_i is a data point, \mu_k is the centroid of cluster k, K is the number of clusters, and
w_{ik} equals 1 if x_i is assigned to cluster k and 0 otherwise.
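To make the procedure concrete, the following is a minimal sketch of the elbow method using scikit-learn; the data matrix X and the range of candidate K values are purely illustrative assumptions, not the settings used later in this work.

# Minimal elbow-method sketch (scikit-learn); X is a placeholder feature matrix.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 4)                      # illustrative data only

costs = []
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=42).fit(X)
    costs.append(model.inertia_)                # within-cluster sum of squares, i.e. J

# Plotting k against the costs shows where the curve flattens (the "elbow");
# that value of k is then supplied to the final K-Means model.
for k, cost in zip(range(2, 11), costs):
    print(k, round(cost, 2))

Here inertia_ corresponds to the objective J in Equation (2.1).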
Applications of K-Means Clustering:
The increase in data and the need for forecasting in the banking sector
demand highly sophisticated strategies to remain competitive in the market. Nowadays, financial
service organizations rely heavily on data mining techniques to overcome financial crime
risk and to run their business processes more smoothly. To support the above statement, research
was conducted by Çaliş et al. (2015) for a bank to classify its credit card customers in order
to assess repayment risk. It was found that three clusters existed. The first
cluster indicated the group of high-risk customers who were unable to repay their credit
card bills, the second cluster indicated customers who pay their bills irregularly and
do not own a house, and the third cluster denoted the group of credit cardholders
who are loyal customers. By doing this analysis, banks can classify their customers based on
the repayment risk factor, allocate enough resources to the potential customers to sell
their new products, and maintain an effective customer relationship with them for
long-term business.
A particular characteristic of K-Means clustering is that the number of clusters must be known
before the algorithm is applied to the data set. The number of clusters K is obtained through the
elbow plot method and supplied as a parameter to the K-Means algorithm. One of the
frequent applications of this clustering technique is image segmentation. One such study
was conducted by Ray and Turi (1999) on synthetic images, incorporating a validity measure
based on intra-cluster and inter-cluster distances so that the number of clusters in the data set
is determined automatically.
The validity measure works well with synthetic image data. They made some minor
modifications to the algorithm by using the median cluster centre representation instead
of the mean cluster centre representation.
Over the years, competition in every business sector has increased. The large
amount of unprocessed historical data has prompted the use of a wide variety of data
mining technologies to retrieve meaningful information and keep organizations on a successful
track. One such application of data mining is clustering using the K-Means
algorithm. One such attempt was made by

2.5 One Hot-encoding

In machine learning models, categorical features are treated as discrete variables, and
such features are common in machine learning problems. These features can result
in high cardinality. Machine learning models deal with numbers, not with the words of
the world, so it is necessary to convert those words into numbers. One such method is
the one-hot encoding method. Feature engineering is required to convert these categorical
variables into a suitable feature vector, and one popular method for this in the Data
Science world is one-hot encoding, as per Bojanowski et al. (2017). One-hot
encoding helps in various supervised and unsupervised machine learning tasks
by giving better meaning to categorical features. It is considered one of the important
data cleaning steps and helps improve prediction in supervised machine
learning tasks (Cerda et al., 2018).
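As a small illustration, one-hot encoding can be sketched with pandas as follows; the column names and values are hypothetical examples, not taken from the project dataset.

# Illustrative one-hot encoding of two categorical columns with pandas.
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "F"],
                   "document_type": ["passport", "driving_licence", "passport"]})

encoded = pd.get_dummies(df, columns=["gender", "document_type"])
print(encoded)
# Each category becomes its own binary column, e.g. gender_F, document_type_passport.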

2.6 Importance of Random Forest in feature selection and its hyper-parameter optimization

The main advantage of the Random Forest algorithm is that it can be employed in both
classification and regression analysis. It can also be used to find the contribution of each variable
to the labelled variable. Biau and Scornet (2016) mention in their study that Random Forest
is a popular algorithm introduced by L. Breiman in 2001. The algorithm is obtained
by combining many randomized decision trees and averaging their
predictions to provide the final result. In this research, we employ a Random Forest
to find the importance of the features in the given data set. But to determine the importance of all
the features, it is necessary to know the optimal values of the parameters
of the random forest. The random forest has several parameters which should be set manually
to obtain the best result. For instance, Probst et al. (2019) describe Random Forest
parameters such as the minimum number of trees, the number of nodes per tree, the sampling ratio,
the number of observations used to draw the trees, and the number of variables required to draw an
individual tree. Default values exist for all these parameters, but they will not
give optimal results. Hence it is important to optimize the parameters of the Random Forest
to determine the feature importance.
There are two important uses of the random forest algorithm. One is to
build classification or regression models to predict the labelled column with the highest
accuracy. The other is to assess the contribution of each variable in a given data set to
predicting the labelled variable. In other words, Random Forest helps explain the different
variables in the given data set and their importance in predicting the label column. The
hyper-parameters of the random forest are also called tuning parameters. These parameters
vary from dataset to dataset, since each dataset varies in purpose, size, number of
variables, and types of variables. Importantly, feature selection avoids over-fitting problems

in various high-dimensional classification problems such as text mining, as per Feng et al. (2017).
Hence it is vital to choose only the specific variables that contribute most to the
predicted variable. Furthermore, this also improves the processing speed of the machine in
providing optimal results.
In our study, we consider three important parameters when obtaining the feature
importance: the node size, the tree depth, and the number of decision trees to be
constructed. Here the node size indicates the minimum number of observations that
should be contained in each node. A lower node size increases the depth of
the splits, meaning that more nodes are created before a terminal
node is reached. The second important hyper-parameter restricts the depth of the decision trees,
and the third significant hyper-parameter decides how many trees need to be grown to build an
optimal and efficient random forest that predicts accurate results.
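A minimal sketch of these three hyper-parameters and of extracting feature importances is given below, assuming scikit-learn; the toy data and the parameter values are illustrative placeholders, not the tuned values used later in this work.

# Random forest feature importance with the three hyper-parameters discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)  # toy data

rf = RandomForestClassifier(
    n_estimators=128,       # number of trees to grow
    max_depth=10,           # maximum depth of each tree
    min_samples_leaf=5,     # minimum observations per terminal node (node size)
    random_state=0,
).fit(X, y)

# The higher the importance, the more the feature contributes to the prediction.
for i, importance in enumerate(rf.feature_importances_):
    print(f"feature_{i}: {importance:.3f}")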
Research conducted by Lin and Jeon (2006) on the performance of the random forest
shows that tuning the node-size hyper-parameter yields significantly better results than the default
node size. They noticed that tuning the node size in the decision trees of a random
forest improves accuracy when the dimensionality of the data is low and the
sample size is large. Another interesting preliminary experiment, conducted by Probst et al.
(2019), shows that computational time decreases exponentially when the node size of
the random forest algorithm is increased. Furthermore, they noticed that there is no
loss in accuracy, while the run time is affected substantially.
The maximum depth of the trees also significantly influences the accuracy of the results.
To evidence this statement, research has been conducted (techaroen2018sugarcane) to observe
how the hyper-parameters affect the final accuracy of a model predicting
the sugarcane yield grade. The model with hyper-parameter tuning achieved a
maximum accuracy of 71.88%, whereas the model with untuned parameters achieved an

accuracy result of just 51.2%. In this study, they have tuned the ’max_dep’ hyper-parameter
along with the other parameters to obtain the above result.
A random forest algorithm is essentially an aggregation of a number of decision trees. It is
unknown beforehand how many trees should be built to obtain the optimum
results; this parameter varies from dataset to dataset. Hence it is highly desirable to optimize
this parameter to obtain the highest accuracy of the predictive model. Many experiments
have been conducted to find the optimal number of trees to grow, and the results of
those studies conclude that the optimal number of trees varies based on the properties of
the dataset. A study has been conducted (Oshiro et al., 2012) regarding how to decide the
number of trees required to obtain optimal results. The results of their study
show that after a certain number of trees, there is no significant increase in the performance of
the random forest model; moreover, the computational cost increases. Using fewer
trees than the optimal value also yields lower performance. In their experiment, they
analyzed the model with the number of trees ranging from 64 to 4096. The best-performing
model was obtained when they used 128 trees as the hyper-parameter. They also
used 29 datasets with different properties, which yielded different optimal values. With regard to
the number of features, datasets with fewer features utilized all of them and required
only a few trees to build the necessary model, whereas for
datasets with more features, the random forest model ignored some of the variables
to provide the optimal value and required more trees to learn the data and provide
better model performance.

2.7 Data Visualization and its necessity in revealing the hidden information in the data

In a fast-growing world, data is more valuable than ever before. Over the past two decades,
the storage of information in warehouses and databases has gained popularity, but
retrieving valuable information from the data has become a difficult task. Most
businesses rely on data-driven solutions to make their business successful and free from risk.
One technology available to solve these data-related problems is data visualization.
Andrews et al. (2011) highlighted that a huge amount of data is generated
from different sources, in different formats, and stored in different storage systems. If
we consider a simple example, the admission of a student to a university generates lots
of information about each student, such as the student's name, gender, age, mobile number,
email ID, and many more. Data visualization is the art of presenting data to consumers
who do not know the data. Many difficulties exist in presenting data, because
information stored in the form of data is highly varied and the same method cannot be
applied to every data set, as per Ali et al. (2016).
Effective and sophisticated visualization methods are very important in visualizing and
extracting information from complex financial data sets. They play a key role
when the leaders of an organization take crucial business decisions, as per research
conducted by Ramsay and Wampler (2015). These visualizations in many cases improve
sales and demand forecasting, support effective risk management, help with resource allocation, and
help in knowing where exactly the business is going. All this information can be drawn in
the form of various graphs such as bar charts, line graphs, histograms, tree maps, and so on. These
visualization graphs help end users such as decision makers understand the
business better without digging into the raw data. As we all know, threats to financial
institutions such as money laundering, terrorist funding, and fraudulent activities are increasing

yearly. To fight against these threatening activities, financial institutions require advanced
technologies and tools. As regulation of financial institutions increases to avoid threats,
financial institutions are required to have advanced analytical tools and techniques.
One such threat is suspicious transactions from account holders, and if a financial
institution fails to identify such transactions, it faces huge fines or even
stricter measures from the regulators. The objective of Singh and Best (2019) is to
use data visualization techniques to identify such suspicious activities. An analyst
working on transaction monitoring to detect suspicious activity will face tons of
transactions each day, making it difficult to detect suspicious activity at the earliest stage.
Their experiment helps analysts by sub-setting the potentially suspicious activity, giving the
analyst a greater possibility of analysing such transactions.
Templ et al. (2012) mention in their study that exploratory data analysis is an initial
and vital step in building any machine learning model. In the real data world, there is
no data set that is directly ready for building a machine learning model; it is required to
clean the data first. There are many methods
and techniques for discovering the data and its structure before exploring further. One such
important and efficient technique is data visualization. Data visualization helps in identifying
incomplete information in the data, the distribution of the whole data, missing values
in each column and row, and so on. The use of data visualization does not stop
at data processing and data cleaning; its applicability extends to evaluating machine
learning models. Once a machine learning model is built, there should be a way to evaluate
it using some visualization. In this step, visualization helps in understanding the
operation of the model, diagnosing it, and suggesting any refinements that can be applied to
the machine learning model. Analysts working on machine learning models
communicate their results to end-users using data visualization techniques, as per Vellido
(2019).

Interactive visualization of classifier models plays a very important role in evaluating
each model. Developers in the machine learning area have developed various visualization
techniques to evaluate machine learning models. Such visualizations can be used in both
supervised and unsupervised machine learning; examples include the ROC curve and
cost curve for supervised learning and the elbow plot for unsupervised machine learning.
Research was conducted by Talbot et al. (2009) on how interactive data visualization
techniques help machine learning with multiple classifiers. They used EnsembleMatrix
to visualize the classification models. The confusion matrix helps, in their research, to
develop a multi-classifier model by evaluating the performance of each model, identifying the
modifications required in each model to improve efficiency, and comparing those models to
find the best-performing model.

2.8 Inference

A detailed study of the AML-KYC business process shows the necessity of analytics techniques
for analysing customers before and after on-boarding them into the banking world. The
given data set holds the document processing values, while the demographics
of the customers are hidden in one column. With respect to the threats to financial
institutions, it is necessary to do an initial screening analysis to find any group of customers
with similar behaviour that could be harmful to the financial institution. At the same time,
it is important to analyse the factors which influence the KYC documentation process by
identifying, using the random forest method, which stage of the document process causes the
rejection of applications.
Chapter 3

Methodology

Data mining is a highly valued, efficient task in the data world. It requires knowledge
of many skills and tools to accomplish a given task. Since data mining projects
involve many people, they require an efficient framework that gives a standard approach to
convert a business problem into a data task, convert the data into structured data, build the
necessary models, and evaluate the model results. One such methodology that gives a
standard approach to data mining projects is the CRISP-DM method, as per Wirth
and Hipp (2000). This method is used irrespective of the industry and the technology
being used.

3.1 Business Understanding

The N26 bank is a financial institution regulated by the Financial Conduct Authority (FCA),
which provides various banking services. To have an account with N26 bank, a person who
wishes to open an account must undergo the KYC documentation process by
submitting an identification document to the verification partner Veritas. The customer
will pass the KYC process only if the submitted document is verified as valid and authentic. If the
customer fails to pass this KYC process, they need to submit their identification document
Fig. 3.1 CRISP-Methodology model (Wirth and Hipp, 2000)

again and wait to open the account. The business team identified that there is something
faulty in their verification system. This is not a good sign for a bank trying to strengthen its
customer base. Hence the business needs to know precisely where the identity document process is
going wrong, in order to achieve more customer satisfaction. The applicants to N26 bank are mostly
local customers, but people from other countries also want to open accounts. Hence, there is
not a single document type, and document quality varies from country to
country. So the business is interested in knowing whether any document type or nationality of the
customers influences the KYC passing rate.

3.2 Data Understanding

The data used for this research were obtained from the open-source Kaggle dataset
website. This dataset does not contain any personal information belonging to individuals.
Hence, no restrictions or permits are required to use this dataset for the dissertation's
analysis. The dataset consists of the reports of document checks. It includes
18 variables and 176,404 observations. The "attempt_id" variable is the unique column
in the whole dataset. A basic overview of the dataset shows that it is not readily usable
for building a model, because many values are null and some of the columns
are not required. The label column in the dataset is "result", which has two categorical
values. An interesting aspect of this data set is that it contains only discrete values; no
continuous variable exists in this dataset. Each document verification result column has
only two categorical values. There is one column named "properties" which includes the
data that exists in the identification document: the applicant's gender, nationality,
document type, the year in which the document expires, and the issuing country.
This information needs to be extracted into separate individual columns to
explore the hidden information behind the KYC passing rate. Furthermore, the issuing
country field contains many country names, and their respective risk categories vary from low to

high risk. This information is not provided in the dataset; it needs to be imported from a
KYC country risk rating website and used to create another column called "risk_category".

3.3 Data Preparation

Data preparation is a crucial phase of the CRISP-DM methodology. This phase includes
all the necessary steps to clean the data used for modelling. To prepare the
data to feed the model, we sometimes need to clean the data multiple times based on the
level of analysis and the dataset's properties. There is no prescribed order for the cleaning
tasks. This step may involve multiple tools and programming languages to
accomplish the required task. Preparation of data may include creating tables and new
columns, removing null values and outliers, replacing null values with appropriate values,
scaling, normalization, etc. These cleaning steps vary with the objectives of data modelling,
as per Wirth and Hipp (2000). To clean the data and do the exploratory data analysis,
we use PySpark version 3.0.1, Python 3, and visualize various distributions
using the Tableau visualization tool.
The dataset in question has no continuous variables; it includes date and string
data type variables. One column named "properties" holds multiple pieces of information in the form
of a dictionary. The "properties" column contains the applicant's gender, the type
of document submitted for verification, the expiry date of that document, the applicant's
nationality, and the country that issued the document. For applicants from
countries such as the United States of America, Australia, and Russia, additional information
such as the state's name is given. We are not considering that part of the data; we consider only
information at the country level, because the KYC country risk rating applies at the country
level. A sample of the "properties" variable is shown below.
"{’document_type’: ’driving_licence’, ’issuing_state’: ’FL’, ’date_of_expiry’: ’2022-11-
06’, ’issuing_country’: ’RUS’}".

This information is collected during document verification. Each of the above pieces of information
needs to be extracted into an individual variable to analyse its impact on the KYC passing rate.
This task is accomplished by using a regular expression function. Once this operation is
done, we must verify that all the necessary data has been extracted and stored in the newly formed
variables. Hence we need to check whether there are any null values in the new variables. Since
applicants come from different parts of the world, the data values in their documents will not be
in the same format; hence, some values are stored in the wrong format or show up as null values.
It is therefore useful to export the data frame to a CSV file. Once
we have the CSV file, we can apply a filter and check for any discrepancies in the newly formed
features. As expected, some invalid values were generated in the gender variable, and these
values were translated to "Not Provided". Similarly, all null values were replaced with the
"Not Provided" value. The document verification results columns contain some
null values; these are filled with the "clear" value because this does not change the ultimate
label column "result". The next step is to remove unwanted columns from the data
frame for the further data cleaning process. It is also essential to check whether any
column contains null values, because null values reduce the quality of the model.
Research conducted by Saar-Tsechansky and Provost (2007) shows the effect of null values
when building classification models.
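The extraction and null-replacement steps can be sketched in PySpark roughly as follows; the input file name and the regular expressions are illustrative assumptions, not the exact ones used in the project.

# Sketch: extracting fields from the "properties" string and handling missing values.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col, when

spark = SparkSession.builder.appName("kyc-prep").getOrCreate()
df = spark.read.csv("doc_reports.csv", header=True)   # illustrative file name

df = (df
      .withColumn("document_type",
                  regexp_extract(col("properties"), r"'document_type':\s*'([^']*)'", 1))
      .withColumn("issuing_country",
                  regexp_extract(col("properties"), r"'issuing_country':\s*'([^']*)'", 1))
      .withColumn("date_of_expiry",
                  regexp_extract(col("properties"), r"'date_of_expiry':\s*'(\d{4})", 1)))

# regexp_extract returns an empty string when nothing matches, so empty extractions
# are mapped to "Not Provided", mirroring the rule described above.
for c in ["document_type", "issuing_country"]:
    df = df.withColumn(c, when(col(c) == "", "Not Provided").otherwise(col(c)))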
Once the data is cleaned of all null values, it is vital to convert it into numerical values
so that the model can understand it. Since our dataset has categorical features, they must be
converted into numbers. This is accomplished by string indexing and the one-hot
encoding method. To accomplish this step we use the VectorAssembler, StringIndexer, and
OneHotEncoder functions to convert the features to numerical form. One-hot encoding represents
categorical values as binary vectors and makes a variable more expressive. To convert
categorical values into binary vectors, they must first be converted into integer values;
hence we use the string indexing method. In PySpark, it is required to combine all
the necessary features into a single feature vector, after applying the string indexing
and one-hot encoding methods. Here we use the pipeline method
to streamline the transformation task: the pipeline gathers the string indexing and
one-hot encoding tasks into a single entity and performs the respective tasks in one step. This
saves time in organizing the code and finding errors.
While converting the "properties" variable into separate columns, we extracted "date_of_expiry"
values ranging from 0 to 9999. Hence it is required to scale this feature, since scaling gives
equal weightage to all the features when the ranges of the numerical features vary. A sketch
combining the indexing, encoding, assembling, and scaling steps is given below.
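The following is a minimal sketch of this preparation pipeline in PySpark; the column names are taken from the text above, while the exact stages and options used in the project may differ. "df" is assumed to be the cleaned DataFrame from the previous step.

# Sketch: StringIndexer + OneHotEncoder + VectorAssembler + MinMaxScaler pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinMaxScaler
from pyspark.sql.functions import col

categorical = ["gender", "document_type", "issuing_country", "nationality"]

df = df.withColumn("date_of_expiry", col("date_of_expiry").cast("double"))  # must be numeric

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in categorical]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical],
                        outputCols=[c + "_vec" for c in categorical])
assembler = VectorAssembler(inputCols=[c + "_vec" for c in categorical] + ["date_of_expiry"],
                            outputCol="features_raw")
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")  # rescales to [0, 1]

pipeline = Pipeline(stages=indexers + [encoder, assembler, scaler])
prepared = pipeline.fit(df).transform(df)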
Balancing the label column: Lin et al. (2017) emphasize that data imbalance is a major issue in data modelling problems. When one class of the target variable has far more observations than the other, the model struggles to produce significant results, so many data scientists use random over-sampling or under-sampling to balance the label column. This is one of the critical data processing steps when solving classification problems: if the imbalance is ignored, the algorithm learns mostly about the majority class. To simplify, if the dataset contains 99% of one class and 1% of the other and the model is built without addressing this, it will predict almost every record as the majority class, as noted in the research of Liu et al. (2008). The experiments conducted by those researchers show that an imbalanced label column tends to bias the model towards the majority class. Since the label column "result" in our dataset is imbalanced, it is necessary to balance it, and we chose to over-sample the minority class.
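A minimal sketch of this over-sampling step, assuming the label column "result" has already been indexed to a binary 0/1 value with 1 as the majority class, and continuing from the prepared data frame of the previous sketch:

from pyspark.sql import functions as F

major_df = prepared_df.filter(F.col("result") == 1)
minor_df = prepared_df.filter(F.col("result") == 0)

# Duplicate minority rows with replacement until the two classes are
# roughly the same size
ratio = major_df.count() / minor_df.count()
oversampled_minor = minor_df.sample(withReplacement=True, fraction=ratio, seed=42)
balanced_df = major_df.unionByName(oversampled_minor)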

3.4 Data Modeling

In this research we analyze the data by identifying the importance of the features that influence the label column "result" and by examining whether any clusters form among the document verification results and the newly constructed attributes "gender", "document_type", "date_of_expiry", "issued_country" and "nationality". We therefore use both supervised and unsupervised machine learning algorithms: Random forest as the supervised method to obtain the feature importance, and K-Means clustering as the unsupervised method to establish whether any clusters exist in the dataset.

3.4.1 Random Forest

Machine learning models are built by learning the patterns in the data and the behaviour of each feature in the dataset, and by predicting from the related features. To build a sound model it is necessary to follow the practice of analyzing the data, selecting the essential features, building the model using the selected features, and evaluating the built model; this procedure makes the model trustworthy, as argued in Tropsha (2010). The importance of a variable in a given dataset is obtained by measuring the loss of prediction accuracy when the variable is included in or removed from the modelling. A random forest is a collection of decision trees, and each tree is grown according to the following steps.

1. Bootstrap Phase: In this phase, a random subset of the training dataset is selected to build each decision tree, and the remaining sample data is used to measure the accuracy of the grown trees.

2. Building Phase: In this phase, each tree is grown by splitting the data at every node on a selected variable according to the classification criterion.

3. The trees are allowed to grow to the largest possible extent, and there is no pruning of the trees.

To build a useful Random forest model and obtain the feature importance of the variables, it is necessary to decide the number of trees to grow, the minimum number of instances per node, and the maximum depth to which the trees may grow. These values are decided by hyperparameter tuning. In our research we split the data in an 80:20 ratio. Through these steps we obtain the list of feature importances affecting the label column "result", which denotes the passing rate of the document verification process.
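A minimal sketch of this step is shown below, continuing from the balanced data frame of the earlier sketches; the hyperparameter values are placeholders until the tuning described in Chapter 4.

from pyspark.ml.classification import RandomForestClassifier

train_df, test_df = balanced_df.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(
    labelCol="result",
    featuresCol="features",
    numTrees=50,             # number of trees to grow
    maxDepth=10,             # maximum depth of each tree
    minInstancesPerNode=10,  # minimum number of instances per node
)
rf_model = rf.fit(train_df)

# Importance score for every entry of the assembled feature vector
print(rf_model.featureImportances)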

3.4.2 K-Means Clustering

As mentioned in the problem statement, we analyze the data to determine whether there is any similar pattern in the given and extracted data that influences document verification and, in turn, the KYC passing rate. We use the K-Means clustering model, part of unsupervised machine learning, to identify such patterns. Since there is no label column in this task, the features are selected from the identity document verification results and the attributes newly formed from the "properties" variable, and we examine whether any similar pattern exists in the data by "document_type", "gender", "issued_country" or the results obtained in the document verification phase. To decide the number of clusters, we draw an elbow plot of the cost function against the number of clusters K, and the chosen K is fed to the K-Means algorithm as a parameter. As K increases, the average distortion decreases; at a certain number of clusters it drops sharply and forms an elbow shape, and at that point we take the value of K. Once the model is obtained, we fit it to the cleaned dataset and obtain a prediction for each row, so that every row is assigned to its respective cluster. This alone is not sufficient to analyze the clusters; they must also be visualized to reveal any similar patterns in the selected feature set, which is done through data visualization methods.
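A minimal sketch of the elbow-plot step, continuing from the prepared data frame of the earlier sketches and assuming Spark 3.x, where the within-cluster cost of a fitted model is exposed as model.summary.trainingCost:

import matplotlib.pyplot as plt
from pyspark.ml.clustering import KMeans

costs = []
k_values = list(range(2, 11))
for k in k_values:
    model = KMeans(featuresCol="features", k=k, seed=42).fit(prepared_df)
    costs.append(model.summary.trainingCost)  # within-cluster sum of squared errors

plt.plot(k_values, costs, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Cost (WSSSE)")
plt.title("Elbow plot")
plt.show()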

3.4.3 Data Visualization

So far, we have discussed the data processing and data modelling for the given dataset. To communicate any of these results, it is essential to use a data visualization tool, since data visualization is a crucial component of data analysis. In this study we use Tableau because it offers many features for building effective visualizations and dashboards, which help analyze the clusters visually. We used bar plots of the top five features obtained from the Random forest feature importance list to show each cluster and its distribution, and later created a dashboard to communicate the cluster analysis and to relate the passing rate to the important features obtained from the Random forest model.
Chapter 4

Implementation of the proposed models

In this chapter we walk through the implementation of the Random Forest and K-Means clustering models.
Random Forest: In our research we use the Random Forest model to obtain the list of feature importances. Here "result" is the label column, the other variables are fed to the model as the feature vector, and we split the data in an 80:20 ratio. Since the label column is imbalanced, it must be balanced to obtain an accurate model, so we up-sampled the minority class: before balancing, label 1 had 119,972 values and label 0 had 43,117 values, and after up-sampling the label values were balanced. The default parameters of the Random forest model do not give optimal results, so we supply hyperparameter values found by tuning the model with a parameter grid search. The hyperparameters tuned are the maximum depth, the minimum number of instances per node, and the number of trees grown, evaluated with a "BinaryClassificationEvaluator". The values of the hyperparameters obtained by the grid search are as follows.

Hyperparameter              Value

No. of Trees                50
Max Depth                   10
Min Instances per Node      10

Table 4.1 Hyperparameter values for the Random Forest model
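A minimal sketch of this grid search, using ParamGridBuilder and CrossValidator with a BinaryClassificationEvaluator and continuing from the training split of the earlier sketches; the candidate grid values are assumptions illustrating how the tuned values in Table 4.1 could be obtained.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(labelCol="result", featuresCol="features")

param_grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50, 100])
    .addGrid(rf.maxDepth, [5, 10, 15])
    .addGrid(rf.minInstancesPerNode, [1, 10])
    .build()
)

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(labelCol="result"),
    numFolds=3,
)
cv_model = cv.fit(train_df)

# The best model found by the search exposes the feature importance scores
best_rf = cv_model.bestModel
print(best_rf.featureImportances)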

While building the model we feed these values into it as parameters, which yields the list of feature importances. The feature importance list, together with the amount of influence each feature has on the label column, is visualized below.

Fig. 4.1 List of Feature Importance

From Fig. 4.1 it can be seen that "image_integrity_result" contributes most to the prediction of the label column. The feature importance list is used for further analysis of the data: in the next step, the important features obtained from the list are used to build the K-Means clustering model and to examine whether any similar pattern exists within them.
K-Means Clustering: As part of our research, we carry out cluster analysis on the important features obtained from the Random forest. All the data cleaning and preparation steps remain the same for the K-Means model, except that only the features from the feature importance list are considered. To build a K-Means clustering model we need to pass the parameter value K; since this is unsupervised machine learning, the number of clusters in the selected features is not known beforehand, so we used the elbow plot to find it. The elbow plot is shown below.

Fig. 4.2 Elbow Plot

From Fig. 4.2 we can observe that the curve drops sharply and forms the elbow shape at six clusters, which indicates the number of clusters present in the selected feature set. Once we know the number of clusters, i.e. the value of K for the K-Means clustering algorithm, we build the clustering model, apply it to the selected features dataset, and obtain a prediction for each row, so that every row is allocated to one of the clusters formed. For further analysis we converted the resulting dataset to pandas and wrote it to a CSV file named "Clusters_Prediction.csv". This new dataset is used to create a dashboard in Tableau for further analysis of the clusters.
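A minimal sketch of this final step, fitting K-Means with the K = 6 found in the elbow plot and exporting the cluster assignments for Tableau, continuing from the prepared data frame of the earlier sketches:

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", predictionCol="prediction", k=6, seed=42)
clustered_df = kmeans.fit(prepared_df).transform(prepared_df)

# Drop the assembled vector columns (not representable in CSV; any other
# vector columns would be dropped the same way), convert to pandas, and
# write the file consumed by the Tableau dashboard
clustered_df.drop("features", "features_raw") \
    .toPandas() \
    .to_csv("Clusters_Prediction.csv", index=False)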
Chapter 5

Results Interpretation

Interpretation of the results is a crucial part of the research. The aim of the research is to analyze the data and extract the useful information hidden in the dataset. Initially, we cleaned the raw data in "doc_reports.csv". For a detailed exploration of the data and for the cluster analysis we use the Tableau visualization tool. As discussed previously, "image_integrity_result" plays a crucial part in deciding the KYC passing rate of the document verification phase, which is clearly visualized in Fig. 4.1, and the number of clusters among the selected important features used for the cluster analysis is shown in Fig. 4.2.

5.1 Exploratory Data Analysis

After cleaning the raw dataset we exported it to a CSV file and uploaded it to Tableau to draw useful inferences from the data. From a basic analysis of the data, we learned that 25% of the applications were rejected, as shown in Fig. 5.1. This is not a good development for N26 bank, since it is losing 25% of its potential customers and needs to work on it. From Fig. 5.2 it is observed that eight varieties of identification documents were submitted, among which the driving license, national identity card, and passport have a major share of approximately 78%; 15% of the customers did not provide documents at all, and all of those applications were rejected.

Fig. 5.1 Proportion of Document Verification Passing Rate

Fig. 5.2 Passing rate results classified based on the Document type

If we look at the customers who applied for an N26 bank account, most of the applicants belong to the Europe region; GBR (United Kingdom) residents applied the most, accounting for 23%, as shown in Fig. 5.3. Interestingly, 18% of the applicants did not provide their country information, and those applications were rejected by the bank. To analyze the country risk of the applicants further, we assigned each country a risk rating in an additional column, "Risk Rating", and classified the countries into Low, Med-Low, Medium, Med-High, and High-risk categories in another column named "Risk Category". Referring to Fig. 5.4, most of the applicants come from Med-Low risk countries, which together with the Low-risk applicants account for 85%. Around 15% of the applicants come from Medium- to High-risk countries, and from a risk point of view these applicants require enhanced due diligence before an account is opened at the bank.

Fig. 5.3 Top 10 Issuing country passing rate
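The risk classification was added as a calculated column; a minimal sketch of one way such a column could be prepared in pandas before loading the file into Tableau is shown below. The country-to-risk mapping here is a hypothetical illustration, not the rating actually used in the study.

import pandas as pd

clusters = pd.read_csv("Clusters_Prediction.csv")

# Hypothetical country-to-risk mapping, for illustration only
risk_rating = {"GBR": "Low", "FRA": "Low", "DEU": "Med-Low", "Not Provided": "High"}
clusters["Risk Rating"] = clusters["issued_country"].map(risk_rating).fillna("Medium")
clusters.to_csv("Clusters_Prediction_with_risk.csv", index=False)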

5.2 Cluster results Analysis

Cluster analysis of the KYC passing rate is a critical part of this research. The dataset is large, and a human analyst cannot identify by inspection the similar patterns in the passing rate or the factors affecting it, so we employed the K-Means clustering algorithm to tackle this problem. As discussed earlier, we obtained the number of clusters using the elbow plot method shown in Fig. 4.2 and assigned each row in the dataset to its respective cluster group.

Fig. 5.4 Top 10 Issuing country passing rate

This alone does not take us to the analysis: the clusters cannot be interpreted simply by looking at the "prediction" column. Hence we relied on the Tableau visualization tool to recognize the hidden similar behaviour within the clusters. To reveal this information we plotted bar charts of the "prediction" variable against each of the top five important features affecting the passing rate, obtained from the Random forest feature importance list shown in Fig. 4.1, and created a dashboard combining the cluster groups and the top five important features, as shown in Fig. 5.5.
If we observe "Group1" all the applicants who have not provided the document their
passing results are shown as "consider" which means they have not cleared the KYC docu-
mentation step. The applicants who belong to "Group1" have not provided any information
because they have not provided any document. Hence the K-Means algorithm done a very
good job in classifying applicants who have not provided any documents.

Fig. 5.5 Cluster Analysis With Top Five Important Features



When we have filtered the "Group2" clusters as shown in Fig 5.6 it is clearly observed
that 27% of applicants who have provided national identity cards have passed the KYC and
32% of national identity card documents submitted by France. These people have cleared
the image integrity tests and document the conclusive tests with 100%. From this, we can
conclude that the national identity card document quality is good. On the other side, those
who have not passed the KYC by providing national identity document their image quality
test showing 100% passing but failed in Image integrity test and conclusive document quality
test as shown in Fig 5.7. This shows that the applicants might have submitted fake national
identity documents. From this analysis, the bank needs to further due diligence for these
customers if they come back for the second time to open the account.

Fig. 5.6 Group 2 cluster Analysis



Fig. 5.7 Group 2 cluster Analysis



"Group3" cluster segmented by the customers who have submitted driving license. If
we filtered who have passed by submitting a driving license we have got interesting facts
as shown in Fig 5.8. 34% of applicants passed the KYC by providing a driving license and
the majority of people who submitted a driving license are by great Britain accounts for
59%. These many applicants passed all the crucial steps in the documentation process. If
w consider those who have not passed the test from Great Britain there is a problem with
the image integrity and conclusive document quality even though the image quality is good.
This indicated applicants might have to submit the fake or tampered document for the KYC
process as shown in Fig. 5.9.

Fig. 5.8 Group 3 cluster Analysis



Fig. 5.9 Group 3 cluster Analysis



Further analysis of the remaining clusters shows that, for passports and resident permit cards, the applicants who cleared the KYC process submitted valid documents that passed all the tests, while the applicants who could not pass cleared the image quality test but failed the image integrity test, the important test affecting the final result.
Chapter 6

Conclusion

The aim of this research is to find the important factors affecting the KYC passing rate and to find any similar behaviour among the applicants who applied for an account with N26 bank. With the help of supervised and unsupervised machine learning and data visualization we analyzed the data thoroughly and were able to answer the research question effectively. By applying the Random forest method we identified that "image_integrity_result" plays a vital role in deciding whether a document is valid or not. Along with this feature, we also identified other important features such as the type of document, the quality of the image, and the country in which the document was issued. We addressed the second part of the research question by applying the well-known unsupervised K-Means algorithm, with which we identified six clusters. Cluster analysis with the help of Tableau dashboards shows that the documents which failed to pass the KYC process lack integrity. Further data analysis shows that 15% of the applicants come from medium-risk, med-high-risk, and high-risk countries; even though they cleared the KYC, their nationality means that, under AML-KYC regulations, enhanced due diligence (EDD) is required for effective risk management. Furthermore, the cluster analysis shows that some applicants from low-risk countries submitted good-quality images but failed the image integrity test. This suggests that applicants might be using fake or tampered documents for illegal operations such as money laundering and terrorist financing, and to maintain effective compliance and risk assessment N26 bank should further improve its screening process. From the potential customer's point of view, N26 bank needs sophisticated technology and tools to verify digital documents so that genuine applicants can pass their KYC check on the first attempt.

6.1 Future Work:

Considering the results obtained in this research, future work could apply down-sampling to the label column and check whether the K-Means algorithm produces the same results. In this research we did not include the country risk category in either the Random Forest or the K-Means clustering methods; future work could include this additional variable, created from the nationality information, to assess the risk of money laundering. We used Tableau to visualize the clusters graphically; future work could analyze the clusters with an advanced cluster visualization technique such as t-SNE to obtain more effective cluster groups.
References

Agathokleous, A. (2009) From fintech to regtech: How have european countries


Ali, S.M., Gupta, N., Nayak, G.K. and Lenka, R.K. (2016) Big data visualization: Tools
and challenges in: 2016 2nd International Conference on Contemporary Computing and
Informatics (IC3I) pp. 656–660 IEEE
Andrews, C., Endert, A., Yost, B. and North, C. (2011) Information visualization on large,
high-resolution displays: Issues, challenges, and opportunities Information Visualization
10(4), pp. 341–355
Aziz, S. and Dowling, M. (2019) Machine learning and ai for risk management in: Disrupting
Finance pp. 33–50 Palgrave Pivot, Cham
Bellot, P. and El-Bèze, M. (1999) A clustering method for information retrieval Technical
Report IR-0199, Laboratoire d’Informatique d’Avignon, France
Biau, G. and Scornet, E. (2016) A random forest guided tour Test 25(2), pp. 197–227
Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2017) Enriching word vectors with
subword information Transactions of the Association for Computational Linguistics 5, pp.
135–146
Bradley, P.S. and Fayyad, U.M. (1998) Refining initial points for k-means clustering. in:
ICML vol. 98 pp. 91–99 Citeseer
Çaliş, A., Boyaci, A. and Baynal, K. (2015) Data mining application in banking sector with
clustering and classification methods in: 2015 International Conference on Industrial
Engineering and Operations Management (IEOM) pp. 1–8 IEEE
Cerda, P., Varoquaux, G. and Kégl, B. (2018) Similarity encoding for learning with dirty
categorical variables Machine Learning 107(8-10), pp. 1477–1494
Christie, R. et al. (2018) Setting a standard path forward for kyc Journal of Financial
Transformation 47, pp. 155–164
Doganis, R. (2005) Airline business in the 21st century Routledge
Feng, X., Liang, Y., Shi, X., Xu, D., Wang, X. and Guan, R. (2017) Overfitting reduction of
text classification based on adabelm Entropy 19(7), p. 330
Jason, C.D. (2020) New kyc/aml tools: Staying savvy and embracing technology

Kodinariya, T.M. and Makwana, P.R. (2013) Review on determining number of cluster in
k-means clustering International Journal 1(6), pp. 90–95
Koh, J.M. (2006) Suppressing terrorist financing and money laundering Springer Science &
Business Media
Lin, W.C., Tsai, C.F., Hu, Y.H. and Jhang, J.S. (2017) Clustering-based undersampling in
class-imbalanced data Information Sciences 409, pp. 17–26
Lin, Y. and Jeon, Y. (2006) Random forests and adaptive nearest neighbors Journal of the
American Statistical Association 101(474), pp. 578–590
Liu, X.Y., Wu, J. and Zhou, Z.H. (2008) Exploratory undersampling for class-imbalance
learning IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2),
pp. 539–550
Na, S., Xumin, L. and Yong, G. (2010) Research on k-means clustering algorithm: An
improved k-means clustering algorithm in: 2010 Third International Symposium on
intelligent information technology and security informatics pp. 63–67 IEEE
Batista das Neves Junior, R., Verçosa, L.F., Macêdo, D., Leite Dantas Bezerra, B. and
Zanchettin, C. (2020) A fast fully octave convolutional neural network for document
image segmentation arXiv pp. arXiv–2004
Oshiro, T.M., Perez, P.S. and Baranauskas, J.A. (2012) How many trees in a random forest?
in: International workshop on machine learning and data mining in pattern recognition
pp. 154–168 Springer
Parsell, R.D., Wang, J. and Kapoor, C. (2014) Customer segmentation US Patent App.
13/716,234
Probst, P., Wright, M.N. and Boulesteix, A.L. (2019) Hyperparameters and tuning strategies
for random forest Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
9(3), p. e1301
Rajput, V.U. (2013) Research on know your customer (kyc) International Journal of Scientific
and Research Publications 3(7), pp. 541–546
Ramsay, N. and Wampler, T. (2015) Visualization and interaction with financial data using
sunburst visualization US Patent 9,021,397
Ray, S. and Turi, R.H. (1999) Determination of number of clusters in k-means clustering
and application in colour image segmentation in: Proceedings of the 4th international
conference on advances in pattern recognition and digital techniques pp. 137–143 Calcutta,
India
Reis, J., Amorim, M., Melão, N. and Matos, P. (2018) Digital transformation: a literature
review and guidelines for future research in: World conference on information systems
and technologies pp. 411–421 Springer
Saar-Tsechansky, M. and Provost, F. (2007) Handling missing values when applying classifi-
cation models Journal of machine learning research 8(Jul), pp. 1623–1657

Singh, K. and Best, P. (2019) Anti-money laundering: Using data visualization to identify
suspicious activity International Journal of Accounting Information Systems 34, p. 100418
Srivastava, U. and Gopalkrishnan, S. (2015) Impact of big data analytics on banking sector:
Learning for indian banks Procedia Computer Science 50, pp. 643–652
Talbot, J., Lee, B., Kapoor, A. and Tan, D.S. (2009) Ensemblematrix: interactive visualization
to support machine learning with multiple classifiers in: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems pp. 1283–1292
Teichert, T., Shehu, E. and von Wartburg, I. (2008) Customer segmentation revisited: The
case of the airline industry Transportation Research Part A: Policy and Practice 42(1), pp.
227–242
Templ, M., Alfons, A. and Filzmoser, P. (2012) Exploring incomplete data using visualization
techniques Advances in Data Analysis and Classification 6(1), pp. 29–47
Tropsha, A. (2010) Best practices for qsar model development, validation, and exploitation
Molecular informatics 29(6-7), pp. 476–488
Van Quyet, T., Vinh, N.Q. and Chang, T. (2015) Service quality effects on customer satisfac-
tion in banking industry International Journal of u-and e-Service, Science and Technology
8(8), pp. 199–206
Vellido, A. (2019) The importance of interpretability and visualization in machine learning
for applications in medicine and health care Neural Computing and Applications pp. 1–15
Wirth, R. and Hipp, J. (2000) Crisp-dm: Towards a standard process model for data mining in:
Proceedings of the 4th international conference on the practical applications of knowledge
discovery and data mining pp. 29–39 Springer-Verlag London, UK
Wu, J. and Lin, Z. (2005) Research on customer segmentation model by clustering in:
Proceedings of the 7th international conference on Electronic commerce pp. 316–318
