
ADVANCES in DATA NETWORKS, COMMUNICATIONS, COMPUTERS

Six Sigma Methodology with Fraud Detection


ANDREJ TRNKA
Department of Applied Informatics
University of SS. Cyril and Methodius in Trnava
Nam. J. Herdu 2, 917 01 Trnava
SLOVAKIA
andrej.trnka@ucm.sk http://fpv.ucm.sk

Abstract: - This paper describes a selected Data Mining task: the implementation of Data Mining algorithms for fraud detection. The first part briefly introduces applications of Data Mining; attention is then given to fraud detection. The case study of fraud detection focuses on applications for financial grants for farms. It uses fictitious data.

Key-Words: - Six Sigma, Data Mining, Fraud Detection

1 Applications of Data Mining

The aim of data mining is to make sense of large amounts of mostly unsupervised data in some domain. Systematic exploration through classical statistical methods is still the basis of data mining. Some of the tools developed by the field of statistical analysis are harnessed through automatic control (with some key human guidance) in dealing with data. [5], [6]
Fraud detection [1] is a relevant field of application of data mining. Fraud may affect different industries such as telephony, insurance (false claims) and banking (illegal use of credit cards and bank checks; illegal monetary transactions). In many situations it is the detection of outliers in the data that is most interesting. For example [2], the detection of fraudulent insurance claim applications can be based on the analysis of unusual activity. Hawkins [4] defines an outlier as an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.
There is a rapidly growing body of successful applications [3] in a wide range of areas as diverse as:
• analysis of organic compounds,
• automatic abstracting,
• credit card fraud detection,
• electric load prediction,
• financial forecasting,
• medical diagnosis,
• predicting share of television audiences,
• product design,
• real estate valuation,
• targeted marketing,
• thermal power plant optimization,
• toxic hazard analysis,
• weather forecasting…

1.2 Fraud Detection

Fraud detection is concerned with the detection of fraud cases from logged data of system and user behaviour. Fraud occurs in the following areas:
• credit card fraud,
• internet transaction fraud / E-Cash fraud,
• insurance fraud and health care fraud,
• money laundering,
• intrusion into computers or computer networks,
• telecommunications fraud,
• Voice over IP (VoIP) fraud,
• subscription fraud / identity theft.
The task is similar in all these areas: the fraud cases have to be detected on the basis of the available data, which is produced by the behaviour of the customers and fraudsters. [7]

2 Fraud Case Study
This demonstration example [8] shows the use of Data Mining algorithms and methods in detecting behavior that might indicate fraud. The domain concerns applications for agricultural development grants. Two grant types are considered:
• arable development,
• decommissioning of land.
The example uses fictitious data to demonstrate how analytical methods can be used to discover deviations from the norm, highlighting records that are abnormal and worthy of further investigation. The real data can be stored in data warehouses. [9]
The analyst is particularly interested in grant applications that appear to claim too much (or too little) money for the type and size of farm.
The data set contains nine fields:
• id. A unique identification number.

ISSN: 1792-6157 162 ISBN: 978-960-474-245-5



• name. Name of the claimant.
• region. Geographic location (midlands/north/southwest/southeast).
• landquality. Integer scale – farmer’s declaration of land quality.
• rainfall. Integer scale – annual rainfall over farm.
• farmincome. Real range – declared annual income of farm.
• maincrop. Primary crop (maize/wheat/potatoes/rapeseed).
• claimtype. Type of grant applied for (decommission_land/arable_dev).
• claimvalue. Real range – the value of the grant applied for.
For building the model we used IBM SPSS Modeler. To do a first screening for unusual records, we used anomaly detection. After identifying the input variables and executing, the anomaly detection model was generated. Figure 1 shows the results with potential anomalies. The overall anomaly index value is also listed for each record, along with the peer group and the three fields most responsible for causing that record to be anomalous.

Fig. 1 Potential anomalies

We used charts to gain a better picture of which records are being flagged. However, to understand relationships, it is worth taking a closer look at the data; such investigation helps to form hypotheses that can be useful in modeling. Initially, we considered the possible types of fraud in the data. One such possibility was multiple grant aid applications from a single farm. Figure 2 shows multiple claims.

Fig. 2 Multiple grant applications

Based on this, we discarded records for those farms that made multiple applications. The next step was to focus on the characteristics of a single farm applying for aid. We built a model for estimating what we would expect a farm’s income to be, based on its size, main crop type, soil type and so on. To prepare for modeling, we needed to derive new fields (e.g. farmsize * rainfall * landquality). To investigate those farmers who deviated from the estimate, we derived another field that compares the two values and returns a percentage difference:

((abs(farmincome - estincome) / farmincome) * 100) (1)

Figure 3 shows the histogram of this new field.

Fig. 3 Histogram of percentage difference
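The preparation steps above (discarding farms with multiple applications, deriving a size–rainfall–quality field, and computing the percentage difference of formula (1)) were carried out with Derive nodes in IBM SPSS Modeler; a minimal plain-Python sketch of the same steps, using made-up records and a hypothetical estincome field standing in for the income model's estimate, might look like this:

```python
from collections import Counter

# Fictitious applications: farm id 3 applies twice.
records = [
    {"id": 1, "farmsize": 130.0, "rainfall": 7, "landquality": 9,
     "farmincome": 80000.0, "estincome": 76000.0},
    {"id": 2, "farmsize": 50.0, "rainfall": 5, "landquality": 3,
     "farmincome": 21000.0, "estincome": 29000.0},
    {"id": 3, "farmsize": 90.0, "rainfall": 6, "landquality": 6,
     "farmincome": 54000.0, "estincome": 55000.0},
    {"id": 3, "farmsize": 90.0, "rainfall": 6, "landquality": 6,
     "farmincome": 54000.0, "estincome": 55000.0},
]

# Step 1: discard farms that filed multiple applications.
counts = Counter(r["id"] for r in records)
singles = [r for r in records if counts[r["id"]] == 1]

# Step 2: derive the combined size/rainfall/quality field and the
# percentage difference of formula (1).
for r in singles:
    r["potential"] = r["farmsize"] * r["rainfall"] * r["landquality"]
    r["pctdiff"] = abs(r["farmincome"] - r["estincome"]) / r["farmincome"] * 100

for r in singles:
    print(r["id"], round(r["pctdiff"], 1))
```

A histogram of pctdiff over such records is what Figure 3 summarizes; large values mark farms whose declared income deviates strongly from the estimate.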




Since all of the large deviations seem to occur for arable_dev grants, it made sense to select only arable_dev grant applications.
From the initial data exploration, it was useful to compare the actual value of claims with the value one might expect, given a variety of factors. Using the variables in the data set, we made predictions for the target, or dependent, variable with a neural net (other methods could also be used). Using these predictions, we explored records or groups of records that deviate. Figure 4 compares predicted (by the neural net) and actual claim values; the fit appears to be good for the majority of cases.

Fig. 4 Comparing predicted and actual claim values

By deriving ((abs(claimvalue - '$N-claimvalue') / 'claimvalue') * 100) (2), a new field was created, similar to the field derived earlier. In order to interpret the difference between actual and estimated claim values, we used a histogram of this new field. We were primarily interested in those who appear to be claiming more than we would expect. By adding a band to the histogram, we selected records with a relatively large value, such as greater than 50%. These claims warrant further investigation.

Fig. 5 Histogram with selected subset

3 Results
The result is a table of records with a large value of the new field. Figure 6 shows the potential fraudsters.

Fig. 6 Records with unusually large values
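Formula (2) and the histogram band that produced this table can be sketched the same way; '$N-claimvalue' is Modeler's name for the neural net's prediction, represented here by a hypothetical predicted value per claim:

```python
# Fictitious claims; "predicted" stands in for the neural net's
# '$N-claimvalue' output field.
claims = [
    {"id": 1, "claimvalue": 12000.0, "predicted": 11500.0},
    {"id": 2, "claimvalue": 30000.0, "predicted": 14000.0},
    {"id": 3, "claimvalue": 9000.0,  "predicted": 8800.0},
]

# Formula (2): percentage difference between actual and predicted claim value.
for c in claims:
    c["claimdiff"] = abs(c["claimvalue"] - c["predicted"]) / c["claimvalue"] * 100

# The histogram band: flag claims deviating from the prediction by more than 50%.
suspects = [c for c in claims if c["claimdiff"] > 50]
print([c["id"] for c in suspects])
```

Here the second claim is flagged, since its actual value is more than 50% away from what the model predicts for a comparable farm.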




This paper demonstrated two approaches for fraud detection – anomaly detection and a modeling approach based on a neural net. In our ongoing research we try to embed fraud detection in the Six Sigma methodology. Figure 7 shows the location of fraud detection in the Control phase of Six Sigma.

Fig. 7 Fraud detection and proposal of Control phase

References:
[1] Carlo Vercellis. Business Intelligence: Data Mining and Optimization for Decision Making. John Wiley & Sons, 2009, 417 p., ISBN 978-0-470-51138-1
[2] Glenn J. Myatt. Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining. John Wiley & Sons, 2007, 280 p., ISBN 978-0-470-07471-8
[3] Max Bramer. Principles of Data Mining. Springer, 2007, 343 p., ISBN 978-1-84628-765-7
[4] Douglas M. Hawkins. Identification of Outliers. Springer, 1980, 188 p., ISBN 978-0412219009
[5] Krzysztof J. Cios, et al. Data Mining: A Knowledge Discovery Approach. Springer, 2007, 606 p., ISBN 978-0-387-33333-5
[6] David L. Olson, Dursun Delen. Advanced Data Mining Techniques. Springer, 2008, 180 p., ISBN 978-3-540-76916-3
[7] Artificial Intelligence and Fraud Detection / Fraud Management. [on-line], available at <http://www.dinkla.net/fraud>, cited 28.4.2010
[8] PASW Modeler 13 Applications Guide. Integral Solutions Limited, 2009, 525 p.
[9] P. Tanuska, O. Vlkovic, A. Vorstermans, W. Verschelde. The proposal of ontology as a part of University data warehouse. Education Technology and Computer (ICETC), 2010 2nd International Conference on, vol. 3, pp. V3-21–V3-24, 22-24 June 2010. ISBN 978-1-4244-6367-1

4 Acknowledgment
Grateful acknowledgment for the English translation goes to Juraj Mistina.
This paper was supported by an institutional grant of the University of SS. Cyril and Methodius in Trnava.
