10 - Sabau

110 Informatica Economică vol. 16, no.
1/2012
Survey of Clustering based Financial Fraud Detection Research
Andrei Sorin SABAU

Faculty of Mathematics and Computer Science
University of Pitesti, Pitesti, Romania
andrei@asabau.com
Given the current global economic context, increasing efforts are being made to both prevent
and detect fraud. This is a natural response to the ascendant trend in fraud activities
recorded in the last couple of years, with a 13% increase only in 2011. Due to ever increasing
volumes of data needed to be analyzed, data mining methods and techniques are being used
more and more often. One domain data mining can excel at, suspicious transaction
monitoring, has emerged for the first time as the most effective fraud detection method in
2011. Out of the available data mining techniques, clustering has proven itself a constant
applied solution for detecting fraud. This paper surveys clustering techniques used in fraud
detection over the last ten years, shortly reviewing each one.
Keywords: Fraud Detection, Data Mining, Clustering
1 Introduction
Given the current global economic
context, increasing efforts are being made to
Dictionary [3] defines fraud as "a knowing
misrepresentation of the truth or concealment
of a material fact to induce another to act to
both prevent and detect fraud. This is a his or her detriment." A definition for
natural response to the ascendant trend in financial fraud can be deducted from
fraud activities recorded in the last couple of financial fraud main categories: bank fraud,
years, with a 13% increase only in 2011 [1]. insurance fraud, securities and commodities
Due to ever increasing volumes of data fraud.
needed to be analyzed, data mining methods At a more in-depth level, we can detail credit
and techniques are being used more and more card and money laundering fraud as bank
often. One domain data mining can excel at, fraud while healthcare, automobile and crop
suspicious transaction monitoring, has related frauds are the most common
emerged for the first time as the most insurance frauds. Detection of all the above
effective fraud detection method in 2011. Out financial fraud types can be conducted with
of the available data mining techniques, all data mining techniques with the current
clustering has proven itself a constant applied study focusing on clustering, either
solution for detecting fraud. standalone clustering techniques or hybrid
The current study contains an introduction to ones combined with classification
financial fraud, reviews clustering techniques, mostly neural networks and
mathematical foundations and most decision trees. Standalone clustering
commonly used clustering techniques, techniques can be viewed as unsupervised
surveys research done in this area for the last data mining while hybrid ones can be viewed
ten years, concludes with some insights on as semi-supervised data mining.
clustering usage in fraud detection activities.
Although a universally understood term, 2 Financial fraud taxonomies
fraud can have multiple meanings and be The scientific literature presents several
interpreted in different ways depending on its definitions and taxonomies for the “fraud”
definition source. Fraud, in general, is concept. An understanding of these
defined in Oxford English Dictionary [2] as definitions and classification models is
“wrongful or criminal deception intended to fundamental to prevent and detect fraud.
result in ﬁnancial or personal gain.” Definition of fraud is important to be known
Following the same line, Black's Law for both accounting and auditing profession,
Informatica Economică vol. 16, no. 1/2012 111
and for economic entities, in order to develop of the financial condition of an enterprise
an anti-fraud program. Detailed knowledge accomplished through the intentional
and awareness of fraud can prevent or even misstatement or omission of amounts or
reverse the syndrome '' it-can't-happen-here'' disclosures in the financial statements in
[4]. order to deceive financial statement users.''
According to U.S. Association of Certified The scientific literature provides various
Fraud Examiners (ACFE), fraud is classified clustering and classification systems for
as fraud and abuse in the workplace, and categorizing fraud. Some are similar, while
financial statement fraud. Occupational fraud others are redundant and ask questions of
is defined as: ''The use of one’s occupation interpretation. Common factors found in the
for personal enrichment through the research field, determining fraud
deliberate misuse or misapplication of the classifications, are: type of responsibility to
employing organization’s resources or the organization's position, motivational
assets". ACFE defines fraud financial relationships to the organization, the criminal
statements as: ''deliberate misrepresentation group.
Table 1. Fraud taxonomies [4]

Bologna and Albrechet and Singleton and KPMG
Lindquist Albrecht Singleton
• Internal Fraud • Employee • Tort or criminal • Employee Fraud

against Misappropriation liability Fraud • Suppliers Fraud
organization • Management • Fraud for or • Clients Fraud
• External Fraud Fraud against the • Informatics Fraud
against • Investment Fraud organization • Misadministration
organization • Suppliers Fraud • Internal or • Medical and
• Fraud for • Clients Fraud external fraud insurance Fraud
organization • Other Fraud Types • Management or • Financial
non-management Statement Fraud
Fraud
All the above classifications present cross • Corruption.

cutting issues overlapping each other. [5] Fraudulent statements schemes are made
overviews how these different classifications usually by people in senior management and
interact with each other, mainly within are producing the biggest losses for the
internal and external fraud projections. affected organization.
ACFE has developed a fraud classification Assets misappropriation schemes are usually
model, known as the "fraud tree", which lists made by employees and can be also
approximately 49 different individual fraud classified into subcategories. They have the
schemes, grouped into categories and highest frequency of occurrence and are
subcategories. those that produce the lowest losses. The
The three main categories in which fraud is fraud tends to be insignificant at an
classified, are: individual level and it is very difficult to be
• Fraudulent Statements; recognized by both internal and external
• Assets Misappropriation; auditors during audits.
112 Informatica Economică vol. 16, no. 1/2012
Fig. 1. Relationship between taxonomies [5]
Fraud taxonomy conducted by ACFE and the are, the overwhelming majority of clustering
tool called "fraud tree" is now regarded as methods use various types of dissimilarities,
the most complete blueprint for fraud either distance and/or density based. They
schemes. satisfy the dkl >= 0, dkk = 0, dkl = dlk
properties but are not required to satisfy the
3 Clustering techniques triangle inequality, be actual distances.
Clustering, as unsupervised data mining c) Constraints. Select a clustering type
technique, deals with the problem of dividing (partitional / hierarchical / hybrid) and
a given set of entities into meaningful specify additional required initialization
subsets. Clusters resulted from this data parameters: k total number of clusters,
segmentation are required to be to be density threshold, graph connectivity
homogeneous and/or well separated, entities threshold, etc..
within the same group being similar while d) Validity index. Select one or more validity
entities within different groups being indices to express homogeneity and/or
dissimilar. Based on general steps found in a separation of the clusters in the clustering to
typical cluster analysis study [6], a more be found.
condensed clustering scheme contains the (e) Algorithm. Select an already existing
following elements: algorithm or design a new one for the
a) Dataset. Given N entities, measure same p problem deﬁned in (c), (d). Obtain or write
properties for each entity. This results in an N the corresponding software.
x p data matrix X. (f) Computation. Apply the selected
b) Dissimilarity measure. Compute from the algorithm to matrix D = (dkl) in order to
matrix X, a N x N matrix D = (dkl) of partition the initial N entities into meaningful
dissimilarities between entities. In order to clusters.
assess how closely related two given objects
(g) Interpretation. Apply formal tests based target dataset. Based on the target data
on validity indices selected in (d) on all data domain, choosing a pattern proximity
segmentations obtained in (f). Based on measure leads to the target dissimilarity
overall data understanding of the initial N measure. The above constraints, validity
entities, apply informal tests as well. indices, algorithm definition and computation
Describe clusters by their lists of entities and clustering elements can be viewed as the
descriptive statistics. Proceed to a substantive clustering grouping activity. Fig. 2 illustrates
interpretation of the results. these main clustering activities [7], including
The above elements can be projected in main a feedback loop incrementally improving
clustering activities. Pattern representation, clustering results.
including feature extraction, leads to the
Fig. 2. Main clustering activities [7]
Regardless of the clustering technique used polythetic – sequential or simultaneous use

and its position in the overall taxonomy – of data features, crisp or fuzzy – whether or
Fig. 3, cross cutting issues always appear and not a data point belongs to one or multiple
must be taken into consideration in order to clusters, deterministic or stochastic –
fully describe a given clustering algorithm clustering optimization achieved via either
[7]. Following this path, algorithms can be deterministic objective function or random
agglomerative or divisive – in the beginning search technique, incremental or non-
each point represents one cluster or all points incremental – whether or not the original
represent one cluster, monothetic or target dataset can be increased.
Fig. 3. Clustering taxonomies [7]
With many different, overlapping taxonomies clusters are formed dividing clustering
of clustering algorithms, the most common techniques in hierarchical and partitional
generic criteria is represented by the way clustering. Hierarchical clustering groups
entities with a sequence of partitions, either 4 Methodological research framework

starting with singleton clusters – In terms of research definition, this paper's
agglomerative hierarchical clustering, or research area is represented by academic
starting from a single cluster containing all research on financial fraud detection making
entities – divisive hierarchical clustering. use of clustering data mining techniques –
Partitional clustering methods can be divided Fig. 4. The research scope covers papers
in prototype based methods, density based published in the last twelve years, between
methods (grid based, graph based), mixture- 2000 and 2011. Considering the momentum
resolving methods, metaheuristic based. data mining techniques are building as tools
Prototype based methods have a prototype in fraud detection and prevention, this time
representing each cluster, either dynamically span contains the most relevant research to
generated as an average function of all date.
entities within the given cluster or As part of research methodology, multiple
represented by one representative entity criteria for searching and selecting articles
within the given cluster. Prototype based have been defined together with an article
methods objective is to minimize a cost classification framework. In an initial stage,
function defined by distances between all Thomson Reuters Web of Science, IEEE
entities within a given cluster and cluster Transactions, ScienceDirect Freedom
prototype. One of the most used cost Collection and Springer-Link Contemporary
functions is the squared error function have been searched against "cluster* fraud*"
present in k-means, k-medoid, k-modes regular expression contained in the articles'
algorithms and their variances. Density based topic field. Relevant articles had their
methods start from the assumption that the bibliography considered for inclusion as
entire dataset is partitioned in tightly well, up to two articles deep. In a second
grouped/ high density clusters separated by stage, clustering connected to each major
low density regions. A popular algorithm of form of financial fraud has been searched
this type is dbscan. Grid based algorithms against Google Scholar with the first 100
and graph based ones are also included in the entries being considered for inclusion,
density based category. By relying on the together with relevant bibliography entries.
assumption the entire dataset is drawn from a The search expressions contained the
given set of distributions (Gaussian is usually keywords "clustering" and "fraud" combined
used), mixture-resolving methods attempt to with one of the following "credit card",
resolve the given distributions parameters in "money laundering", "insurance",
order to clearly define the clusters. For "corporate". No direct searches against "bank
metaheuristic based methods, combinatorial fraud" were conducted as this fraud area was
search for optimizing a given clustering well covered by "credit card" and "money
solution is being conducted via tabu search, laundering" keywords. Using the generic
scatter search, simulated annealing, genetic "insurance" keyword meant no further search
and nature inspired algorithms. With relative queries were required for "healthcare
low impact changes to the above clustering insurance", "automobile insurance", "crop
algorithms, all clustering methods can insurance", etc. all being subcategories of
produce hard or fuzzy clusters. Hard generic insurance fraud.
clustering assigns one entity to only one Besides being relevant to the defined
cluster where soft clustering deals with research area, each article had to meet a
probabilities of one entity belonging to each series of additional criteria. The article's full
cluster. In this sense, hard clustering can be text had to be available, it had to contain a
viewed as a special case of fuzzy clustering. case study and that case study needed to be
performed against a real dataset. Exceptions
were made on using synthetic datasets
instead of real datasets on very small number
of articles the author fell they should be in this way, at least to a certain degree,
included nonetheless for their scientific results are quantifiable.
value. Real datasets were preferred because
Fig. 4. Financial Fraud Detection Review Framework [8]
5 Clustering based FFD survey domain, clustering technique and case study
As a result of the research methodology, 27 dataset. Papers are ordered on publishing
articles were selected for inclusion. They year and clustering technique in Table 2.
have been grouped based on application
Table 2. Surveyed articles

Author Year Application Domain Clustering Dataset
Technique
H. Issa et al. [9] 2011 Refund fraud/ financial K-means Refund
fraud transaction data
S. Thiprungsri et al. 2011 Healthcare Insurance fraud K-means Life claims
[10] payment data
R. Liu et al. [11] 2011 Money laundering fraud Birch, k-means Sintetic data
F. H. Glancy et al. 2011 Financial reporting fraud Hierarchical Annual financial
[12] clustering reports data

L. Torgo et al. [13] 2011 Transaction fraud Hierarchical Foreign trade
agglomerative transactions
clustering dataset
N. D. Jyotindra et 2011 Credit card fraud Density based Credit card
al. [14] clustering transaction data
R. Ghani et al. [15] 2011 Healthcare Insurance fraud Repeated Health claims
bisection payments data
clustering
J. Wu et al. [16] 2010 Credit card fraud K-means Credit card data
N.L. Khac et al. 2010 Money laundering fraud K-means Transaction data
[17]
W.H. Chang et al. 2010 Online action fraud X-means Online transaction
[18] Auction data
L. Torgo et al. [19] 2010 Transaction fraud Hierarchical Foreign trade
clustering transactions data
W Xiaoyun et al. 2010 Healthcare Insurance fraud Resolution based Policy holder
[20] clustering attributes data
M. Jans et al. [21] 2010 Procurement Latent class Procurement
process fraud clustering dataset
M. C. Hao et al. 2010 Credit card fraud Binned scatter Credit card data
[22] plot visualization
Q. Deng et al. [23] 2009 Financial statement fraud Hybrid k-means Financial staments
data
C. Holton [24] 2009 Occupational fraud Hierarchical, Discussion groups
k-means document data
S. Panigrahi et al. 2009 Credit card fraud Density based Credit card
[25] clustering (sintetic) data
A. Jurek et al. [26] 2008 Insurance fraud K-means Sintetic data set
M. Franke et al. 2008 Stock market trading fraud Spectral Political stock
[27] clustering market data
G. K. Palshikar et 2008 Stock market trading fraud Graph clustering Transaction data
al. [28]
B. Little et al. [29] 2008 Healthcare fraud Clustering(not Healthcare
mentioned) payments data
D. Tasoulis et al. 2008 Credit card fraud Stream Credit card data
[30] clustering
M. Jans et al. [31] 2007 Purchasing fraud K-means Purchasing data
S. Virdhagriswaran 2006 Accounting fraud K-means Quarterly and
et all [32] annual financial
reports data
S. Zhang et al. [33] 2006 Insurance fraud Hierarchical Policy holder
clustering attributes data
S. Donoho [34] 2004 Inside trading fraud K-means US stock and

option data, news
data
Z. M. Zhang et al. 2003 Money laundering Histogram Official
[35] segmentation documents data
based clustering
The above papers make use of clustering text documents to mine for transaction data
techniques ranging across a relative large transformed in monetary vectors. Computed
spectrum. On one end of the spectrum we monetary vectors are either clustered via k-
encounter single, standalone clustering means or projected to a histogram.
technique being used as the sole data mining Another case group standing out consists of
method [9], [10], [31], [34]. On the other end clustering techniques used for training
we encounter hybrid data mining techniques classifiers. Due to proliferation of enterprise
where clustering is just one tool, being used resource planning systems and an ever
in one or more stages, within complex data growing amount of available data to be
mining implementations [14], [17], [25]. analyzed, manually labeling training data for
Also present are clustering visualization various classifiers has become unfeasible in
techniques targeting financial fraud detection many cases. In these situations a clustering
[22]. technique is first used on the uncategorized
Most cases where single, standalone data in order to automatically split it into
clustering techniques are being used make meaningful categories. Each cluster/category
use of k-means and its variations for outlier is labeled (usually manually) and then
detection. In most cases, Euclidian distance classifiers are being trained on each
is being used as the dissimilarity metric. [9] cluster/category. The majority of papers
implements k-means with the intent of found in this study are mostly using
identifying fraudulent refunds within a classifiers based on decision trees, neural
telecommunication company with fraudulent networks and support vector machines. [16]
transactions being regarded as outliers. [10] is conducting rare class analysis on datasets
uses k-means to automate fraud filtering with imbalanced class distribution by
during an audit. Claims with similar manually splitting data into several large
characteristics are grouped together and classes and performing k-means with
small-population clusters are flagged for Euclidian distance as dissimilarity metric on
further investigation. Dominant each class. This local clustering process
characteristics of the flagged clusters include generates sub-classes with relatively
large beneficiary payment, large interest balanced sizes within each main class, sub-
payment amounts and long lag between classes used subsequently for training a
submission and payment. [29] splits support vector machine classifier.
healthcare data according to Benford's large Experimental results on various real-world
numbers law and analyses non-compliant data sets show this method producing
classes via k-means clustering in order to significantly higher prediction accuracies on
detect outliers. [31] identifies three rare classes than state-of-the-art methods.
purchasing related fraud schemes, double [17] generates new composite attributes from
payment of invoices, changing purchasing transaction data and uses them in k-means
order after release, deviations of purchasing clustering to divide transactions into
order and implements k-means on newly suspicious and unsuspicious, most being
added attributes based on ANOVA analysis. unsuspicious. The full set of attributes is then
[34] employs k-means for finding early being used to train two different sets of
symptoms of insider trading in option classifiers (neural networks and decision
markets before any news release. [35] uses trees) on the two identified clusters. [18]
distinguishes types of behavior changes from Hierarchical agglomerative clustering is

different fraudsters with the help of x-means being used with stable end points for all
clustering technique. Afterwards, C4.5 clustering trials resulting in two stable
decision trees are employed for inducing the clusters. Under the same main author, both
rules of the labeled clusters. [26] performs k- [13] and [19] make use of hierarchical
means on insurance data and trains a naïve agglomerative clustering as outlier ranking,
bayes classifier on each found cluster. [32] part of a larger data mining solution. The
attempts to detect frauds camouflaged to look main idea of the method is that outliers
like normal activities in domains with high should offer more resistance to being merged
number of known relationships like with large groups of "normal'' cases,
accounting fraud detection for rating and information taken into account within the
investment, insider attacks on corporate hierarchical agglomerative clustering
networks, health care insurance fraud. It uses merging process. In this way, ranking of
k-means for training various classifiers. fraud probability for a set of unlabeled
There are cases where clustering techniques observations are being generated. The end
are being used to group already flagged, result outlier ranking is able to handle
possible fraudulent entries by classifiers. The applications with both global and local
clustering goal in this situation is to define a outlier types. [24] mines text based official
taxonomy of the already identified fraud documents and applies clustering in two
entries in order to implement counter stages. Initially hierarchical agglomerative
measures for each found fraud category. In clustering is being performed to a certain
certain situations, some categories may be level. In order to speed up the cluster
even found to contain legitimate data, convergence process, hierarchical clustering
wrongly labeled by the classifier due to is being interrupted and k-means is being
insufficient training to such cases. [15] applied with initial cluster centers being the
detects payment errors in insurance claims by hierarchical centers. In this way, all the
applying hierarchical divisive clustering on remaining entries not covered by hierarchical
entities flagged as fraudulent via a support clustering are grouped to hierarchical cluster
vector machine classifier. [23] computes centers via k-means. Cosine similarity
financial ratios from companies' financial function was found to be the most successful.
statements. A self-organizing map neural [33] uses a variant of classical hierarchical
network is being used with financial ratios as clustering chameleon algorithm in order to
its input vector. Subsequently k-means is define outliers as bridging rules between
being performed on the self-organizing map different conceptual clusters. A bridging rule
node vector. can be viewed as the antecedent and action
Hierarchical clustering techniques form belonging to different conceptual clusters
another case group. [11] uses a combination leading to new insights on how entries are
of the classical agglomerative hierarchical connected, related.
clustering Birch algorithm with k-means. In Even though k-means and hierarchical
this way low points from both methods are clustering are the most popular techniques in
being minimized, Birch not handling this survey, other clustering techniques are
financial data very well by not being present as well. [14] uses density based
sensitive to noise, k-means being too dbscan algorithm to form clusters of
expensive to run on large databases. [12] transaction amounts spend by the customer.
uses official financial data to separate Whenever a new credit card transaction is
between fraudulent and non-fraudulent performed by the customer, the algorithm
companies. Documents are being processed finds the cluster coverage of this particular
via text mining and the corresponding term- amount. Clustering is just one part of the
document matrix has its density increased via overall proposed transaction risk generation
a singular value decomposition vector. model consisting of five major components:
dbscan algorithm, linear equation, rules, data timestamp. Visualization [22] is also present
warehouse and Bayes theorem. [25] also uses as a form of identifying credit card fraud.
dbscan within a fraud detection system Variable binned scatter plots allow the
consisting of 4 components: rule-based filter, visualization of large amounts of data
Dempster–Shafer adder, transaction history without overlapping. The basic idea is to use
database, Bayesian learner.. Within the rule- a non-uniform (variable) binning of the x and
based filter, outlier detection is conducted via y dimensions and plots all the data points that
dbscan. [20] detects fraudulent insurance fall within each bin into corresponding
clients' applications via resolution based squares.
clustering. The algorithm, which combines
the advantages of resolution based and 6 Conclusions
density based algorithms, can detect and rank Compared to other domains where clustering
top-n outliers from any kind of datasets is being applied to identify outliers, intrusion
without the need for input parameters taking detection, etc.., clustering based fraud
the size and density of clusters into detection techniques tend to use established
consideration. Resolution can be explained as clustering techniques. Relative novel
follows. Just like viewing a density plot with clustering techniques like clustering
a microscope or telescope at a certain ensembles, large scale clustering, multi-way
magnification, one can identify different clustering have very little presence in the
groups in the night sky as the magnification surveyed papers.
is adjusted. When the resolution changes on a Based on the surveyed papers, almost three
dataset, the clusters in the dataset quarters of the encountered clustering
redistribute. All the objects are in the same techniques are partitional – Figure 5. Some
cluster when the resolution is very low, and papers, [11], [24] have been counted as using
every object is a single cluster when the both partitional and hierarchical clustering
resolution is very high. [21] uses techniques as they combine both clustering
multivariate latent class fuzzy clustering in types. Among partitional clustering
order to detect internal fraud on procurement techniques, k-means clustering and its
data. [27] analyses irregular stock market variants with Euclidian distance as
behavior via traders trading behavior using dissimilarity metric are the most common
spectral clustering. [28] combines three used ones. Hierarchical clustering techniques
graph clustering algorithms making use of a come in second place being used in one
Dempster–Shafer adder for detecting quarter of the surveyed papers. Interactive,
circular trading and price manipulation. [30] visualization clustering techniques are also
uses stream clustering based on classical used but only in very small number of cases.
Dbscan algorithm and Wstream. Using Regarding the way clustering techniques are
containers (windows) in the form of hyper- combined or used in conjunction with other
rectangles that are adjusted though time to data mining techniques, the surveyed papers
discover and track the evolution of the have been classified as containing standalone
underlying clusters. Wstream achieves this clustering techniques with only one
using two procedures, “movement” and clustering algorithm being used, combined
“enlargement-contraction”. The “movement” clustering techniques with two or more
of windows incrementally recenters windows clustering algorithms being used, hybrid
every time a new streaming data point clustering techniques combining both
arrives. Windows are recentered to the mean clustering algorithms and other data mining
of the points they include at each time point algorithms, mostly classifiers based on
in a manner that also depends on each point’s decision trees, neural networks and support
timestamp. A fading function that decreases vector machines – Figure 6.
with time, associates a weight with each
4%
24%
Partitional clustering
72% Hierarchical clustering
Visualization techniques
Fig. 5. Clustering techniques based on algorithm type
41% 41%
Standalone
Combined algorithm
18% Hybrid algorithm
Fig. 6. Clustering techniques combinations
Both standalone and hybrid clustering [3] B. A. Garner, Black's Law Dictionary 9th
techniques are heavily used with ed.. New York: West Group Publishing
approximately 40% usage each. Increasing House, 2009.
clustering accuracy by combining multiple [4] Singleton T. W., Fraud Auditing and
clustering algorithms and staying within a Forensic Accounting 4th edition, Ed.
single data mining domain is not perceived as John Wiley and Sons, 2010.
having significant benefits as only 18% cases [5] M. Jans, N. Lybaert, and K. Vanhoof, “A
apply it. framework for Internal Fraud Risk
Reduction at IT Integrating Business
References Processes,” International Journal, 2009.
[1] PriceWaterhouseCoopers UK, Global [6] P. Hansen and B. Jaumard, “Cluster
Economic Crime Survey, analysis and mathematical
PriceWaterhouseCoopers, programming,” Mathematical
Nov. 2011. [Online]. Available: Programming, vol. 79, pp. 191-215, Oct.
http://www.pwc.com/gx/en/economic- 1997.
crime-survey/download-economic- [7] A. K. Jain, M. N. Murty, and P. J. Flynn,
crime-people-culture-controls.jhtml “Data clustering: a review,” ACM
[Accessed: 01 Jan. 2012]. computing surveys (CSUR), vol. 31, no.
[2] Oxford University Press, Concise Oxford 3, pp. 264–323, 1999.
English Dictionary, Oxford University [8] E. Ngai, Y. Hu, Y. Wong, Y. Chen, and
Press, Dec. 2009. [Online]. Available: X. Sun, “The application of data mining
http://oxforddictionaries.com/ techniques in financial fraud detection:
[Accessed: 10 Nov. 2011]. A classification framework and an
academic review of literature,” Decision Study,” in 2010 IEEE International

Support Systems, vol. 50, no. 3, pp. 559– Conference on Data Mining Workshops
569, 2011. (ICDMW), 2010, pp. 577-584.
[9] H. Issa and M. Vasarhelyi, “Application [18] Wen-Hsi Chang and Jau-Shien Chang,
of Anomaly Detection Techniques to “Using clustering techniques to analyze
Identify Fraudulent Refunds,” 2011. fraudulent behavior changes in online
[10] S. Thiprungsri and M. Vasarhelyi, auctions,” in 2010 International
“Cluster Analysis for Anomaly Conference on Networking and
Detection in Accounting Data: An Audit Information Technology (ICNIT), 2010,
Approach,” The International Journal of pp. 34-38.
Digital Accounting Research, vol. 11, [19] L. Torgo and C. Soares, “Resource-
2011. bounded Outlier Detection using
[11] Rui Liu, Xiao-long Qian, Shu Mao, and Clustering Methods,” in Proceedings of
Shuai-zheng Zhu, “Research on anti- the 2010 conference on Data Mining for
money laundering based on core Business Applications, Amsterdam, The
decision tree algorithm,”, Control and Netherlands, The Netherlands, 2010, pp.
Decision Conference (CCDC), 2011 84–98.
Chinese, 2011, pp. 4322-4325. [20] Wang Xiaoyun and Liu Danyue,
[12] F. H. Glancy and S. B. Yadav, “A “Hybrid outlier mining algorithm based
computational model for financial evaluation of client moral risk in
reporting fraud detection,” Decision insurance company,” in 2010 The 2nd
Support Systems, vol. 50, no. 3, pp. 595- IEEE International Conference on
601, Feb. 2011. Information Management and
[13] L. Torgo and E. Lopes, “Utility-Based Engineering (ICIME), 2010, pp. 585-
Fraud Detection,” in Twenty-Second 589.
International Joint Conference on [21] M. Jans, N. Lybaert, and K. Vanhoof,
Artificial Intelligence, 2011. “Internal fraud risk reduction: Results of
[14] N. D. Jyotindra and R. P. Ashok, “A a data mining case study,” International
Data Mining with Hybrid Approach Journal of Accounting Information
Based Transaction Risk Score Systems, vol. 11, no. 1, pp. 17–41, 2010.
Generation Model (TRSGM) for Fraud [22] M. C. Hao, U. Dayal, R. K. Sharma, D.
Detection of Online Financial A. Keim, and H. Janetzko, Visual
Transaction,” International Journal of Analytics of Large Multi-Dimensional
Computer Applications, vol. 16, no. 1, Data Using Variable Binned Scatter
pp. 18–25, 2011. Plots. Bibliothek der Universität
[15] R. Ghani and M. Kumar, “Interactive Konstanz, 2010.
learning for efficiently detecting errors [23] Q. Deng and G. Mei, “Combining self-
in insurance claims,” in Proceedings of organizing map and K-means clustering
the 17th ACM SIGKDD international for detecting fraudulent financial
conference on Knowledge discovery and statements,” in IEEE International
data mining, New York, NY, USA, Conference on Granular Computing,
2011, pp. 325–333. 2009, GRC ’09, 2009, pp. 126-131.
[16] J. Wu, H. Xiong, and J. Chen, “COG: [24] C. Holton, “Identifying disgruntled
local decomposition for rare class employee systems fraud risk through text
analysis,” Data Mining and Knowledge mining: a simple solution for a multi-
Discovery, vol. 20, no. 2, pp. 191-220, billion dollar problem,” Decision
Jan. 2010. Support Systems, vol. 46, no. 4, pp. 853–
[17] Nhien An Le Khac and M.-T. Kechadi, 864, 2009.
“Application of Data Mining for Anti- [25] S. Panigrahi, A. Kundu, S. Sural, and A.
money Laundering Detection: A Case Majumdar, “Credit card fraud detection:
A fusion approach using Dempster- 18th Symposium (COMPSTAT 2008,

Shafer theory and Bayesian learning,” 2008, vol. 2, pp. 315–322.
Information Fusion, vol. 10, no. 4, pp. [31] M. Jans, N. Lybaert, and K. Vanhoof,
354–363, 2009. “Data Mining for Fraud Detection:
[26] A. Jurek and D. Zakrzewska, Toward an Improvement on Internal
“Improving Naïve Bayes models of Control Systems,” 2007.
insurance risk by unsupervised [32] S. Virdhagriswaran and G. Dakin,
classification,” in Computer Science and “Camouflaged fraud detection in
Information Technology, 2008. IMCSIT domains with complex relationships,” in
2008. International Multiconference on, Proceedings of the 12th ACM SIGKDD
2008, pp. 137–144. international conference on Knowledge
[27] M. Franke, B. Hoser, and J. Schröder, discovery and data mining, 2006, pp.
“On the Analysis of Irregular Stock 941–947.
Market Trading Behavior,” in Data [33] S. Zhang, F. Chen, X. Wu, and C.
Analysis, Machine Learning and Zhang, “Identifying bridging rules
Applications, C. Preisach, H. Burkhardt, between conceptual clusters,” in
L. Schmidt-Thieme, and R. Decker, Eds. Proceedings of the 12th ACM SIGKDD
Berlin, Heidelberg: Springer Berlin international conference on Knowledge
Heidelberg, 2008, pp. 355-362. discovery and data mining, 2006, pp.
[28] G. K. Palshikar and M. M. Apte, 815–820.
“Collusion set detection using graph [34] S. Donoho, “Early detection of insider
clustering,” Data Mining and Knowledge trading in option markets,” in
Discovery, vol. 16, no. 2, pp. 135–164, Proceedings of the tenth ACM SIGKDD
2008. international conference on Knowledge
[29] B. Little, R. Rejesus, M. Schucking, and discovery and data mining, 2004, pp.
R. Harris, “Benford’s Law, data mining, 420–429.
and financial fraud: a case study in New [35] Z. M. Zhang, J. J. Salerno, and P. S. Yu,
York State Medicaid data,” 2008, vol. “Applying data mining in investigating
IX, pp. 195-204. money laundering crimes,” in
[30] D. Tasoulis, N. Adams, D. Weston, and Proceedings of the ninth ACM SIGKDD
D. Hand, “Mining Information from international conference on Knowledge
Plastic Card Transaction Streams,” in discovery and data mining, 2003, pp.
Proceedings in Computational Statistics: 747–752.
Andrei Sorin SABAU has graduated the Faculty of Commerce – Marketing

Research and Forecast in 2002 at the Bucharest Academy of Economic
Studies. Currently a PhD Student at the Pitesti University, Faculty of
Mathematics and Informatics, his primary research area is unsupervised data
mining techniques used in fraud detection and prevention. With multiple
certifications in SAP, JAVA and ORACLE SQL, he is also attending
SIMPRE professional master program, ASE Bucharest, Faculty of
Cybernetics, Statistics and Informatics. Working in the research-development department in a
multinational software development company, he is successfully embedding research
knowledge with practice.

10 - Sabau

Uploaded by

Copyright:

Available Formats

You might also like

10 - Sabau

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

10 - Sabau

Uploaded by

Copyright:

Available Formats

110 Informatica Economică vol. 16, no.

Survey of Clustering based Financial Fraud Detection Research

Andrei Sorin SABAU

Table 1. Fraud taxonomies [4]

• Internal Fraud • Employee • Tort or criminal • Employee Fraud

All the above classifications present cross • Corruption.

Fig. 1. Relationship between taxonomies [5]

Fig. 2. Main clustering activities [7]

Regardless of the clustering technique used polythetic – sequential or simultaneous use

Fig. 3. Clustering taxonomies [7]

entities with a sequence of partitions, either 4 Methodological research framework

Fig. 4. Financial Fraud Detection Review Framework [8]

Table 2. Surveyed articles

[12] clustering reports data

S. Donoho [34] 2004 Inside trading fraud K-means US stock and

distinguishes types of behavior changes from Hierarchical agglomerative clustering is

72% Hierarchical clustering

Fig. 5. Clustering techniques based on algorithm type

18% Hybrid algorithm

Fig. 6. Clustering techniques combinations

academic review of literature,” Decision Study,” in 2010 IEEE International

A fusion approach using Dempster- 18th Symposium (COMPSTAT 2008,

Andrei Sorin SABAU has graduated the Faculty of Commerce – Marketing

You might also like