Professional Documents
Culture Documents
10 - Sabau
10 - Sabau
10 - Sabau
1/2012
Given the current global economic context, increasing efforts are being made to both prevent
and detect fraud. This is a natural response to the ascendant trend in fraud activities
recorded in the last couple of years, with a 13% increase only in 2011. Due to ever increasing
volumes of data needed to be analyzed, data mining methods and techniques are being used
more and more often. One domain data mining can excel at, suspicious transaction
monitoring, has emerged for the first time as the most effective fraud detection method in
2011. Out of the available data mining techniques, clustering has proven itself a constant
applied solution for detecting fraud. This paper surveys clustering techniques used in fraud
detection over the last ten years, shortly reviewing each one.
Keywords: Fraud Detection, Data Mining, Clustering
1 Introduction
Given the current global economic
context, increasing efforts are being made to
Dictionary [3] defines fraud as "a knowing
misrepresentation of the truth or concealment
of a material fact to induce another to act to
both prevent and detect fraud. This is a his or her detriment." A definition for
natural response to the ascendant trend in financial fraud can be deducted from
fraud activities recorded in the last couple of financial fraud main categories: bank fraud,
years, with a 13% increase only in 2011 [1]. insurance fraud, securities and commodities
Due to ever increasing volumes of data fraud.
needed to be analyzed, data mining methods At a more in-depth level, we can detail credit
and techniques are being used more and more card and money laundering fraud as bank
often. One domain data mining can excel at, fraud while healthcare, automobile and crop
suspicious transaction monitoring, has related frauds are the most common
emerged for the first time as the most insurance frauds. Detection of all the above
effective fraud detection method in 2011. Out financial fraud types can be conducted with
of the available data mining techniques, all data mining techniques with the current
clustering has proven itself a constant applied study focusing on clustering, either
solution for detecting fraud. standalone clustering techniques or hybrid
The current study contains an introduction to ones combined with classification
financial fraud, reviews clustering techniques, mostly neural networks and
mathematical foundations and most decision trees. Standalone clustering
commonly used clustering techniques, techniques can be viewed as unsupervised
surveys research done in this area for the last data mining while hybrid ones can be viewed
ten years, concludes with some insights on as semi-supervised data mining.
clustering usage in fraud detection activities.
Although a universally understood term, 2 Financial fraud taxonomies
fraud can have multiple meanings and be The scientific literature presents several
interpreted in different ways depending on its definitions and taxonomies for the “fraud”
definition source. Fraud, in general, is concept. An understanding of these
defined in Oxford English Dictionary [2] as definitions and classification models is
“wrongful or criminal deception intended to fundamental to prevent and detect fraud.
result in financial or personal gain.” Definition of fraud is important to be known
Following the same line, Black's Law for both accounting and auditing profession,
Informatica Economică vol. 16, no. 1/2012 111
and for economic entities, in order to develop of the financial condition of an enterprise
an anti-fraud program. Detailed knowledge accomplished through the intentional
and awareness of fraud can prevent or even misstatement or omission of amounts or
reverse the syndrome '' it-can't-happen-here'' disclosures in the financial statements in
[4]. order to deceive financial statement users.''
According to U.S. Association of Certified The scientific literature provides various
Fraud Examiners (ACFE), fraud is classified clustering and classification systems for
as fraud and abuse in the workplace, and categorizing fraud. Some are similar, while
financial statement fraud. Occupational fraud others are redundant and ask questions of
is defined as: ''The use of one’s occupation interpretation. Common factors found in the
for personal enrichment through the research field, determining fraud
deliberate misuse or misapplication of the classifications, are: type of responsibility to
employing organization’s resources or the organization's position, motivational
assets". ACFE defines fraud financial relationships to the organization, the criminal
statements as: ''deliberate misrepresentation group.
Fraud taxonomy conducted by ACFE and the are, the overwhelming majority of clustering
tool called "fraud tree" is now regarded as methods use various types of dissimilarities,
the most complete blueprint for fraud either distance and/or density based. They
schemes. satisfy the dkl >= 0, dkk = 0, dkl = dlk
properties but are not required to satisfy the
3 Clustering techniques triangle inequality, be actual distances.
Clustering, as unsupervised data mining c) Constraints. Select a clustering type
technique, deals with the problem of dividing (partitional / hierarchical / hybrid) and
a given set of entities into meaningful specify additional required initialization
subsets. Clusters resulted from this data parameters: k total number of clusters,
segmentation are required to be to be density threshold, graph connectivity
homogeneous and/or well separated, entities threshold, etc..
within the same group being similar while d) Validity index. Select one or more validity
entities within different groups being indices to express homogeneity and/or
dissimilar. Based on general steps found in a separation of the clusters in the clustering to
typical cluster analysis study [6], a more be found.
condensed clustering scheme contains the (e) Algorithm. Select an already existing
following elements: algorithm or design a new one for the
a) Dataset. Given N entities, measure same p problem defined in (c), (d). Obtain or write
properties for each entity. This results in an N the corresponding software.
x p data matrix X. (f) Computation. Apply the selected
b) Dissimilarity measure. Compute from the algorithm to matrix D = (dkl) in order to
matrix X, a N x N matrix D = (dkl) of partition the initial N entities into meaningful
dissimilarities between entities. In order to clusters.
assess how closely related two given objects
Informatica Economică vol. 16, no. 1/2012 113
(g) Interpretation. Apply formal tests based target dataset. Based on the target data
on validity indices selected in (d) on all data domain, choosing a pattern proximity
segmentations obtained in (f). Based on measure leads to the target dissimilarity
overall data understanding of the initial N measure. The above constraints, validity
entities, apply informal tests as well. indices, algorithm definition and computation
Describe clusters by their lists of entities and clustering elements can be viewed as the
descriptive statistics. Proceed to a substantive clustering grouping activity. Fig. 2 illustrates
interpretation of the results. these main clustering activities [7], including
The above elements can be projected in main a feedback loop incrementally improving
clustering activities. Pattern representation, clustering results.
including feature extraction, leads to the
With many different, overlapping taxonomies clusters are formed dividing clustering
of clustering algorithms, the most common techniques in hierarchical and partitional
generic criteria is represented by the way clustering. Hierarchical clustering groups
114 Informatica Economică vol. 16, no. 1/2012
of articles the author fell they should be in this way, at least to a certain degree,
included nonetheless for their scientific results are quantifiable.
value. Real datasets were preferred because
5 Clustering based FFD survey domain, clustering technique and case study
As a result of the research methodology, 27 dataset. Papers are ordered on publishing
articles were selected for inclusion. They year and clustering technique in Table 2.
have been grouped based on application
The above papers make use of clustering text documents to mine for transaction data
techniques ranging across a relative large transformed in monetary vectors. Computed
spectrum. On one end of the spectrum we monetary vectors are either clustered via k-
encounter single, standalone clustering means or projected to a histogram.
technique being used as the sole data mining Another case group standing out consists of
method [9], [10], [31], [34]. On the other end clustering techniques used for training
we encounter hybrid data mining techniques classifiers. Due to proliferation of enterprise
where clustering is just one tool, being used resource planning systems and an ever
in one or more stages, within complex data growing amount of available data to be
mining implementations [14], [17], [25]. analyzed, manually labeling training data for
Also present are clustering visualization various classifiers has become unfeasible in
techniques targeting financial fraud detection many cases. In these situations a clustering
[22]. technique is first used on the uncategorized
Most cases where single, standalone data in order to automatically split it into
clustering techniques are being used make meaningful categories. Each cluster/category
use of k-means and its variations for outlier is labeled (usually manually) and then
detection. In most cases, Euclidian distance classifiers are being trained on each
is being used as the dissimilarity metric. [9] cluster/category. The majority of papers
implements k-means with the intent of found in this study are mostly using
identifying fraudulent refunds within a classifiers based on decision trees, neural
telecommunication company with fraudulent networks and support vector machines. [16]
transactions being regarded as outliers. [10] is conducting rare class analysis on datasets
uses k-means to automate fraud filtering with imbalanced class distribution by
during an audit. Claims with similar manually splitting data into several large
characteristics are grouped together and classes and performing k-means with
small-population clusters are flagged for Euclidian distance as dissimilarity metric on
further investigation. Dominant each class. This local clustering process
characteristics of the flagged clusters include generates sub-classes with relatively
large beneficiary payment, large interest balanced sizes within each main class, sub-
payment amounts and long lag between classes used subsequently for training a
submission and payment. [29] splits support vector machine classifier.
healthcare data according to Benford's large Experimental results on various real-world
numbers law and analyses non-compliant data sets show this method producing
classes via k-means clustering in order to significantly higher prediction accuracies on
detect outliers. [31] identifies three rare classes than state-of-the-art methods.
purchasing related fraud schemes, double [17] generates new composite attributes from
payment of invoices, changing purchasing transaction data and uses them in k-means
order after release, deviations of purchasing clustering to divide transactions into
order and implements k-means on newly suspicious and unsuspicious, most being
added attributes based on ANOVA analysis. unsuspicious. The full set of attributes is then
[34] employs k-means for finding early being used to train two different sets of
symptoms of insider trading in option classifiers (neural networks and decision
markets before any news release. [35] uses trees) on the two identified clusters. [18]
118 Informatica Economică vol. 16, no. 1/2012
dbscan algorithm, linear equation, rules, data timestamp. Visualization [22] is also present
warehouse and Bayes theorem. [25] also uses as a form of identifying credit card fraud.
dbscan within a fraud detection system Variable binned scatter plots allow the
consisting of 4 components: rule-based filter, visualization of large amounts of data
Dempster–Shafer adder, transaction history without overlapping. The basic idea is to use
database, Bayesian learner.. Within the rule- a non-uniform (variable) binning of the x and
based filter, outlier detection is conducted via y dimensions and plots all the data points that
dbscan. [20] detects fraudulent insurance fall within each bin into corresponding
clients' applications via resolution based squares.
clustering. The algorithm, which combines
the advantages of resolution based and 6 Conclusions
density based algorithms, can detect and rank Compared to other domains where clustering
top-n outliers from any kind of datasets is being applied to identify outliers, intrusion
without the need for input parameters taking detection, etc.., clustering based fraud
the size and density of clusters into detection techniques tend to use established
consideration. Resolution can be explained as clustering techniques. Relative novel
follows. Just like viewing a density plot with clustering techniques like clustering
a microscope or telescope at a certain ensembles, large scale clustering, multi-way
magnification, one can identify different clustering have very little presence in the
groups in the night sky as the magnification surveyed papers.
is adjusted. When the resolution changes on a Based on the surveyed papers, almost three
dataset, the clusters in the dataset quarters of the encountered clustering
redistribute. All the objects are in the same techniques are partitional – Figure 5. Some
cluster when the resolution is very low, and papers, [11], [24] have been counted as using
every object is a single cluster when the both partitional and hierarchical clustering
resolution is very high. [21] uses techniques as they combine both clustering
multivariate latent class fuzzy clustering in types. Among partitional clustering
order to detect internal fraud on procurement techniques, k-means clustering and its
data. [27] analyses irregular stock market variants with Euclidian distance as
behavior via traders trading behavior using dissimilarity metric are the most common
spectral clustering. [28] combines three used ones. Hierarchical clustering techniques
graph clustering algorithms making use of a come in second place being used in one
Dempster–Shafer adder for detecting quarter of the surveyed papers. Interactive,
circular trading and price manipulation. [30] visualization clustering techniques are also
uses stream clustering based on classical used but only in very small number of cases.
Dbscan algorithm and Wstream. Using Regarding the way clustering techniques are
containers (windows) in the form of hyper- combined or used in conjunction with other
rectangles that are adjusted though time to data mining techniques, the surveyed papers
discover and track the evolution of the have been classified as containing standalone
underlying clusters. Wstream achieves this clustering techniques with only one
using two procedures, “movement” and clustering algorithm being used, combined
“enlargement-contraction”. The “movement” clustering techniques with two or more
of windows incrementally recenters windows clustering algorithms being used, hybrid
every time a new streaming data point clustering techniques combining both
arrives. Windows are recentered to the mean clustering algorithms and other data mining
of the points they include at each time point algorithms, mostly classifiers based on
in a manner that also depends on each point’s decision trees, neural networks and support
timestamp. A fading function that decreases vector machines – Figure 6.
with time, associates a weight with each
120 Informatica Economică vol. 16, no. 1/2012
4%
24%
Partitional clustering
Visualization techniques
41% 41%
Standalone
Combined algorithm
Both standalone and hybrid clustering [3] B. A. Garner, Black's Law Dictionary 9th
techniques are heavily used with ed.. New York: West Group Publishing
approximately 40% usage each. Increasing House, 2009.
clustering accuracy by combining multiple [4] Singleton T. W., Fraud Auditing and
clustering algorithms and staying within a Forensic Accounting 4th edition, Ed.
single data mining domain is not perceived as John Wiley and Sons, 2010.
having significant benefits as only 18% cases [5] M. Jans, N. Lybaert, and K. Vanhoof, “A
apply it. framework for Internal Fraud Risk
Reduction at IT Integrating Business
References Processes,” International Journal, 2009.
[1] PriceWaterhouseCoopers UK, Global [6] P. Hansen and B. Jaumard, “Cluster
Economic Crime Survey, analysis and mathematical
PriceWaterhouseCoopers, programming,” Mathematical
Nov. 2011. [Online]. Available: Programming, vol. 79, pp. 191-215, Oct.
http://www.pwc.com/gx/en/economic- 1997.
crime-survey/download-economic- [7] A. K. Jain, M. N. Murty, and P. J. Flynn,
crime-people-culture-controls.jhtml “Data clustering: a review,” ACM
[Accessed: 01 Jan. 2012]. computing surveys (CSUR), vol. 31, no.
[2] Oxford University Press, Concise Oxford 3, pp. 264–323, 1999.
English Dictionary, Oxford University [8] E. Ngai, Y. Hu, Y. Wong, Y. Chen, and
Press, Dec. 2009. [Online]. Available: X. Sun, “The application of data mining
http://oxforddictionaries.com/ techniques in financial fraud detection:
[Accessed: 10 Nov. 2011]. A classification framework and an
Informatica Economică vol. 16, no. 1/2012 121