
An Empirical Method for Discovering Tax Fraudsters: A Real Case Study of Brazilian Fiscal Evasion

Tales Matos, José Antonio F. de Macedo, José Maria Monteiro
Federal University of Ceará, Campus do Pici, Fortaleza/Brazil
tales.matos@lia.ufc.br, jose.macedo@dc.ufc.br, monteiro@dc.ufc.br

ABSTRACT
This work encompasses the development of a new method for classifying tax fraudsters based on fraud indicators. The work was developed in conjunction with a Brazilian fiscal agency aiming at avoiding fiscal evasion. The main contribution of this paper is a method that allows classifying and ranking taxpayers by analyzing fraud indicators obtained from several fiscal applications. In particular, we developed a method for identifying frequent fraud patterns using association rules, and then we apply two dimensionality reduction methods (i.e., PCA and SVD) in order to create a fraud scale, which allows ranking taxpayers according to their potential to commit a fraud. Experiments were conducted using real taxpayer data, and tax auditors specialized in fraud detection validated our results. Preliminary results show that our method may indicate fraudsters with an F-measure of 80%, which is a very promising result.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining.

General Terms
Algorithms, Measurement, Experimentation, Verification.

Keywords
Tax fraud detection, fraud indicators, association rules, dimensionality reduction.

1. INTRODUCTION
Brazil is currently the seventh largest economy in the world according to the national wealth ranking [15]. Due to the size of the Brazilian economy, fiscal evasion has become a key problem for states and municipalities. In order to cope with this problem, the Brazilian government has implemented two systems: the Electronic Invoice and the Digital Tax Bookkeeping. These systems allow tracking financial information exchanged between contributors, states and municipalities. Although these systems sped up the process of gathering taxpayer information, the level of fraud is still high and 25 per cent of potential income tax is lost due to recurrent frauds. However, taxpayer information brings new opportunities for automatically detecting fraudulent activities in order to mitigate tax fraud. In this sense, we are interested in devising a method that allows forecasting potential fraudsters from taxpayers' data by analyzing their fraud indicator traces. This method will serve as a key tool for auditing companies, reducing the time and financial cost of such activities and increasing their accuracy.

In this direction, we resort to a real case scenario of the Treasury State of Ceará (SEFAZ), which is responsible for inspecting over 142,000 active contributors. Although SEFAZ has a large dataset about taxpayer frauds, its enforcement agent team struggles to perform a complete inspection of taxpayers' accountings, because each inspection process needs to evaluate countless fraud indicators, which is very time consuming and error prone. Motivated by this problem, we collected four analytical questions, which we intend to answer along this work: (1) What is the behavioral pattern of fraud indicators? (2) Is there a correlation among fraud indicators? (3) Which are the most relevant fraud indicators? and (4) How can we measure the risk of a taxpayer committing a fraud?

The main contribution of this paper is a method for classifying taxpayers using fraud indicators. This method is composed of four steps, which are implemented using data mining, statistical analysis and dimensionality reduction techniques. The steps are: (1) Analyzing Fraud Frequencies, (2) Discovering the Most Relevant Fraud Indicators, (3) Correlating Fraud Indicators and (4) Classifying Taxpayers from Relevant Fraud Indicators. Indeed, this is an empirical method oriented towards answering the four questions raised in the previous paragraph.

Experiments were conducted and showed that only a small subset of fraud indicators is representative and should be used during the analytical process. In addition, we succeeded in creating a scale to identify the propensity of taxpayers for fraud, which may help tax agents to focus their analysis on potential fraudsters. Preliminary results show that our method indicates fraudsters with an F-measure of 80%. The experimental evaluation was conducted by tax auditors with expertise in fraud detection.

This paper is structured as follows. Section 2 presents a description of our real case study. In Section 3, we present the proposed method, describing each step of the analytical process. Next, in Section 4, we describe the experimental setting and discuss the results. Section 5 presents the related work. Finally, we conclude this work in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IDEAS'15, July 13-15, 2015, Yokohama, Japan
Copyright ©2015 ACM 978-1-4503-3414-3/15/07 $15.00
http://dx.doi.org/10.1145/2790755.2790759
2. CASE STUDY
One of the objectives of this study is to analyze taxpayer fraud indicators in order to identify the key indicators that may characterize a potential tax fraud. Hereinafter, we call such indicators fraud indicators. Indeed, fraud indicators can guide tax auditors in the process of identifying irregular taxpayer behavior. As we mentioned before, tax auditors must perform a thorough analysis of taxpayer data in order to discover possible frauds that a taxpayer has committed. However, this process is complex, time consuming and error prone due to the excessive volume of data, which encompasses different kinds of information, such as accounting, goods stock, sales, legal data, etc.

2.1 Tax Fraud Indicator Datasets


In our case study, we resort to historical audit data provided by the Treasury State of Ceará (SEFAZ-CE, Brazil). These data were extracted from 8 applications, summing up 72 million records. We selected taxpayers' data from 2009 and 2010, which correspond to the current audit period. We used fourteen fraud indicators, identified by tax auditors. From a financial point of view, these fourteen indicators are of key importance since they correspond to the largest amount of money that can be recovered from fraudulent transactions. Due to confidentiality reasons, we anonymized these fraud indicators. Each fraud indicator is determined by a tax auditor after analyzing information issued by taxpayers, such as tax documents, records of the movement of goods at the State border, and taxpayer sales data through credit and debit cards.

Figures 1(a) and 1(b) present two histograms detailing the fraud indicators' frequencies for the 2009 and 2010 datasets, respectively. In 2009, 10,789 (95%) of 11,386 taxpayers have at least one type of evidence of fraud and 597 (5%) of taxpayers do not present any evidence of fraud. In 2010, 11,989 (96%) of 12,424 taxpayers have at least one type of fraud evidence and 435 (4%) did not present any evidence. We have also verified that the fraud indicators C, E, F, G and H appeared with the highest frequencies in both datasets. While the H, N and K fraud indicators had a significant increase in frequency in 2010, the J fraud indicator was greatly reduced in 2010. The majority of fraud indicator frequencies increased from 2009 to 2010, which helps us to conclude that we need some method to mitigate fraud evolution and thus tax evasion.

Figure 1 (a). Tax fraud indicators' frequencies 2009
Figure 1 (b). Tax fraud indicators' frequencies 2010

We have also analyzed the correlation among fraud indicators in order to understand which fraud indicators occur together and what their frequencies are. Table 1 presents the result of this analysis, showing in each line a set of fraud indicators that occur together, with the corresponding percentage for 2009 and 2010. From these observations we identified 14 fraud indicator sets with high frequency in 2009. We have also observed that the fraud indicator sets present in 2009 repeat in 2010, but with a higher frequency (colored in red). In 2010 we noted that 26 fraud indicator sets are frequent. Thus, we could perceive that frauds are not only increasing in number, but new combinations of frauds are also appearing.

Table 1. Frequency of fraud indicator sets
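The frequency analysis above can be reproduced directly from the taxpayer-by-indicator data. The sketch below is a minimal illustration, not the authors' code: it assumes the indicators have already been exported to a CSV file with one row per taxpayer and one column per anonymized indicator holding occurrence counts; the file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical layout: one row per taxpayer, one column per fraud indicator
# holding the number of occurrences detected for that taxpayer.
df = pd.read_csv("fraud_indicators_2009.csv", index_col="taxpayer_id")

# Number of taxpayers presenting each indicator (the counts behind the Figure 1 histograms).
indicator_frequency = (df > 0).sum().sort_values(ascending=False)
print(indicator_frequency)

# Share of taxpayers with at least one piece of fraud evidence
# (reported as 95% for 2009 and 96% for 2010 in the text).
with_evidence = (df.sum(axis=1) > 0).mean()
print(f"Taxpayers with at least one fraud indicator: {with_evidence:.1%}")
```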

2.2 Tax Fraud Matrix


Two matrices were constructed for the years 2009 and 2010, respectively. Each matrix relates the taxpayers (in rows) to their corresponding fraud indicators (in columns). The matrix for the 2009 data has 11,386 rows and 14 columns. As for the 2010 data, the matrix has 12,424 rows and 14 columns. Taxpayers who did not present any evidence of fraud were excluded because they were considered outliers. This representation is necessary to work with the algorithms used in Sections 3.1, 3.2, and 3.3.

After data selection, the total of each indicator for each taxpayer is summed up. Thus, the data were grouped into a matrix of taxpayers versus fraud indicators. For some techniques, the matrix is adapted to record the existence or absence of evidence of fraud for each taxpayer, where 1 (one) indicates the presence and 0 (zero) the absence of evidence of fraud. In order to anonymize taxpayers, a sequential ID was assigned to each one. Each of the fourteen fraud indicators was renamed A, B, C, D, E, F, G, H, I, J, K, L, M and N.

Figure 2. Proposed Method Overview. Process input: the fraud indicators from the Section 2.1 datasets; process outputs: a reduced set of unique determinants (Section 3.1), similar clusters in both analyzed years (Section 3.2), and a fraud risk scale evaluated at an F-measure of 80% (Section 3.3).
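To make the matrix construction concrete, the sketch below shows one possible implementation under stated assumptions: it presumes a record-level extraction with one line per detected (taxpayer, indicator) occurrence; the file name and column names are illustrative, not the agency's actual schema.

```python
import string
import pandas as pd

# Hypothetical record-level extraction: one line per detected occurrence.
records = pd.read_csv("audit_records_2010.csv")  # columns: taxpayer, indicator

# Total occurrences per taxpayer and indicator, pivoted to taxpayers x indicators.
matrix = records.pivot_table(index="taxpayer", columns="indicator",
                             aggfunc="size", fill_value=0)

# Taxpayers with no evidence of fraud are excluded (treated as outliers above).
matrix = matrix[matrix.sum(axis=1) > 0]

# Binary version used by some techniques: 1 = evidence present, 0 = absent.
binary = (matrix > 0).astype(int)

# Anonymization: sequential IDs for taxpayers, letters A..N for the 14 indicators.
binary.index = range(1, len(binary) + 1)
binary.columns = list(string.ascii_uppercase[:binary.shape[1]])
```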

3. BEHAVIOURAL ANALYSIS OF TAXPAYERS
In this section, we present a detailed description of the proposed method. Figure 2 presents an overview of this method, depicting each step with its corresponding input and output. The first step of the method is to compute the frequency of fraud indicators (this was explained in Section 2.1). The second step aims to find the most relevant fraud indicators and their correlation. In the third step, we correlate fraud indicators and compare their evolution in both years. Finally, in the fourth step we classify taxpayers using a fraud scale, which allows measuring the tendency of a taxpayer to commit a fraud. Each step is explained in detail in the following sections.
3.1 Discovering The Most Relevant Fraud Indicators
The second step of the method is to understand the frequency of fraud indicators. For this task we use the association rule technique, regarding each fraud indicator as an item set. The mining algorithm used in this study is the Apriori algorithm [2], considered to be the most widely used in the literature for this purpose.

We executed the Apriori algorithm using support = 60% and confidence = 60%. These values of support and confidence were defined as the minimum acceptable for the tax auditors. For this implementation, there is an additional parameter called "lift". Lift tries to come up with a way of measuring how "interesting" a rule is. A more interesting rule may be a more useful rule because it is more novel or unexpected. Without getting into the math, lift takes into account the support of a rule, but also favors situations where the left-hand side and right-hand side variables are not abundant but where their relatively few occurrences always happen together. The larger the value of lift, the more "interesting" the rule may be.
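The paper does not state which Apriori implementation was used; the sketch below uses the mlxtend library as one possible stand-in, applied to the binary matrix of Section 2.2 (here called `binary`). For reference, the lift of a rule X -> Y is support(X and Y) / (support(X) * support(Y)), that is, confidence(X -> Y) divided by support(Y).

```python
from mlxtend.frequent_patterns import apriori, association_rules

# Frequent itemsets and rules at the thresholds chosen by the tax auditors.
itemsets = apriori(binary.astype(bool), min_support=0.60, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.60)

# "Good rules": keep only the rules with the highest lift
# (lift > 1.05 was used for the 2009 data, lift > 1.10 for 2010).
good_rules = rules[rules["lift"] > 1.05].sort_values("lift", ascending=False)
print(good_rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```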
Figure 3 shows a plot that describes the correlation between support and confidence. Even though we see a two-dimensional plot, we actually have three variables represented here. Support is on the X-axis and confidence is on the Y-axis. Lift serves as a measure of interestingness, and we are also interested in the rules with the highest lift. On this plot, the lift is shown by the color of a dot: the darker the dot, the closer the lift of that rule is to 1.1, which appears to be the highest lift value among these rules.

Figure 3. Confidence, support and lift: (a) 2009 dataset and (b) 2010 dataset

Regarding this plot, it is worth noting that all of the rules with high lift seem to have support below 80%, in both 2009 and 2010. On the other hand, there are rules with both high lift and high confidence, which is quite positive.

Based on this evidence, we focus on a smaller set of rules, here called "good rules", that have the highest lift: for the 2009 data we used lift > 1.05 and for the 2010 data we used lift > 1.10. An example of these good rules can be seen in Figure 4.

Figure 4. A sample set of good rules: (a) 2009 dataset and (b) 2010 dataset

Analyzing the rules listed as "good rules", we identified that 72% of the rules existing in 2009 are repeated in 2010. This analysis shows that the pattern of behavior of taxpayers is being repeated in subsequent years, which requires urgent intervention to prevent tax evasion.

The main result achieved in this step is the reduced set of unique determinants on the left-hand side, both in 2009 and in 2010: C, E, F, H and L. After executing the APRIORI algorithm, we obtained 2,563 rules in 2009 and 1,527 rules in 2010. From those rules, we selected 176 good rules (lift > 1.05) in 2009 and 150 good rules (lift > 1.10) in 2010.

3.2 Correlating Fraud Indicators

In addition to the association rules, and also to answer question (2), Is there similarity between the fraud indicators used in the research?, an important tool used in this analysis of taxpayer behavior is obtaining the similarity between the fraud indicators. For this analysis, we used the Manhattan similarity function.

The Manhattan similarity [3] function was chosen for this study because it is the technique most suitable for the type of variables used: binary variables. The similarity was calculated for the fraud indicators of 2009 and 2010. In order to better visualize the correlations between the indicators, we constructed the dendrograms shown in Figures 5 (a) and (b). Indicators that are closer to zero are more similar to each other.

These dendrograms are useful for comparing the similarity of the 2009 and 2010 fraud indicators. We can see the formation of different groups when observing the 2009 and 2010 datasets. For example, the 2009 indicators (C, E, F, L) form a group; the closer to zero, the more similar. In 2010, this group changes to (H, C, E, F, L).

Figure 5 (a). Manhattan dendrograms 2009
Figure 5 (b). Manhattan dendrograms 2010
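A minimal sketch of how such dendrograms can be produced, assuming the binary taxpayer-by-indicator matrix from Section 2.2 (`binary`); the paper does not specify the linkage criterion, so average linkage is used here purely for illustration.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Manhattan (city-block) distance between indicator columns; for 0/1 columns it
# counts the taxpayers on which two indicators disagree, so smaller = more similar.
distances = pdist(binary.T.values, metric="cityblock")
tree = linkage(distances, method="average")

dendrogram(tree, labels=list(binary.columns))
plt.ylabel("Manhattan distance")
plt.show()
```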
3.3 Classifying Taxpayers from Relevant Fraud Indicators

So far, the analyses were used to understand the behavior of taxpayers in the past. However, we aim to rate the taxpayer on a risk scale. The idea is that this scale indicates the risk of the taxpayer committing a fraud. For this, we resort to dimensionality reduction techniques to reduce the 14 fraud indicators to a single dimension. The following are the results of the dimensionality reduction techniques.

In order to answer question (3), Can we reduce the number of existing indicators, without losing the quality of the analysis, in order to optimize the inspection process?, we use Principal Component Analysis (PCA) [4][6] and Singular Value Decomposition (SVD) [5][6].

3.3.1 Principal Component Analysis (PCA)
PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is commonly used as a tool for exploratory data analysis and for making predictive models. PCA can be performed by eigenvalue decomposition of a covariance (or correlation) matrix, or by singular value decomposition of a data matrix, usually after centering (and normalizing) the data matrix for each attribute [1].

PCA is the simplest of the true eigenvector-based multivariate analyses. It is defined mathematically [4] as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data lies along the first coordinate (called the first component), the second greatest variance along the second coordinate, and so on.
The idea is to treat the set of tuples as a matrix M and find the eigenvectors of MM^T or M^TM. The matrix of these eigenvectors can be thought of as a rigid rotation in a space of high dimension [4][6].
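As an illustration only (the paper does not name its PCA implementation), the variance figures discussed below can be reproduced with scikit-learn; standardizing each indicator first matches the "standardized data" the percentages refer to.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 14 indicator columns, then fit PCA on the taxpayer matrix.
X = StandardScaler().fit_transform(binary.values)
pca = PCA().fit(X)

print(pca.explained_variance_)                 # squared standard deviations (cf. Tables 2 and 4)
print(pca.explained_variance_ratio_)           # proportion of variance per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative proportion
```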
At first, this study used PCA to try to reduce the universe of 14 existing indicators. However, this strategy was not very effective for this dataset, as can be observed in the principal component summary in Table 2.
Table 2. Principal component summary for all fourteen indicators

                        PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9    PC10   PC11   PC12   PC13   PC14
Standard deviation      1.717  1.401  1.100  1.017  0.994  0.976  0.944  0.929  0.926  0.877  0.800  0.718  0.596  0.000
Proportion of variance  0.210  0.140  0.086  0.074  0.070  0.068  0.063  0.061  0.061  0.055  0.045  0.036  0.025  0.000
Cumulative proportion   0.210  0.351  0.437  0.511  0.582  0.650  0.713  0.775  0.836  0.891  0.937  0.974  1.000  1.000

Each principal component (PC1, PC2, PC3, ...) accounts for a portion of the total variance of the standardized data. The first principal component accounts for about 21% of the total variance of the standardized data, whereas if we take the first two components we reach about 35% of the total variance. To achieve at least 50% of the variance we would need the first four components (a reduction from 14 to four indicators). The first principal component variance is 2.95 (the standard deviation squared), much higher than the average of the variances (equal to 1). In addition, Figure 6 does not show a relevant difference between the main components. In this example, it is possible to retain the first two or three components, indicating a reduction to two or three dimensions.

Figure 6. Variance of the principal components

Thus, it would not be possible to keep much quality in the prediction by reducing to a single dimension, that is, a single indicator.

In order to achieve the reduction to a single dimension, we envisage an alternative: using the determinants found by the APRIORI algorithm (Section 3.1), namely C, E, F, H and L. That is, the Apriori algorithm found a subset of indicators as key determinants, and this subset is interpreted here as a dimensionality reduction: instead of 14, we now have five dimensions. Table 3 shows the principal component summary for the five indicators (C, E, F, H and L) found by the Apriori algorithm.

Table 3. Principal component summary for indicators C, E, F, H and L (2010 data)

                        PC1    PC2    PC3    PC4    PC5
Standard deviation      1.436  1.019  0.981  0.888  0.383
Proportion of variance  0.412  0.207  0.192  0.157  0.029
Cumulative proportion   0.412  0.620  0.812  0.970  1.000

In this case, the first principal component accounts for approximately 41% of the total variance of the standardized data, whereas if we take the first two components we reach about 62% of the total variance. This strategy gives a better result for reducing the matrix.

To define which components to use, we should calculate the squared standard deviation and choose the components whose values are greater than 1, as shown in Table 4.

Table 4. Square of the standard deviation of each principal component (2010 data)

                          PC1    PC2    PC3    PC4    PC5
Standard deviation ^ 2    2.063  1.038  0.962  0.788  0.147

From the values obtained, the components to be used are PC1 and PC2. Figure 7 reinforces this choice by showing these two values as the highest.

Figure 7. Calculating the standard deviation squared of each principal component

The goal is to reduce the matrix to a single dimension. In this case, each variable should relate to the main component chosen: PC1 or PC2. This relationship is made using the eigenvectors of each component (Figure 8).

Figure 8. Relationship of each variable with PC1 and PC2

By analyzing Figure 8, we can observe that the plot has the characteristic of an S-curve, not a straight line. However, the Singular-Value Decomposition (SVD) technique yielded better results than PCA for the data analyzed in our study.
3.3.2 Singular-Value Decomposition (SVD)
Following the same principle as PCA, the SVD [5][6] technique was also applied to the determinants found by the APRIORI algorithm: C, E, F, H and L. Figure 9 demonstrates how the matrix of this study was reduced to a single dimension. The figure shows the SVD data reduction for 2010, as follows: (a) the fifty taxpayers with the lowest values of fraud indicators; and (b) one thousand taxpayers with the lowest values of fraud indicators.

Figure 9. SVD reduction applied to the 2010 dataset: (a) fifty taxpayers with the lowest values of fraud indicators; and (b) one thousand taxpayers with the lowest values of fraud indicators

This technique proved to be more feasible for reducing the dimensionality of the data used in this study, because it is simpler and yields a line exactly as expected. However, it must be applied to all data again whenever a new row is inserted into the matrix, since the SVD calculation changes according to the values used in the sample.

The results point to the possibility of creating a fraud risk scale for taxpayers with the use of SVD. Thus, we answer the final question: (4) Can we set a scale to indicate the risk of a taxpayer committing fraud?
4. Results Evaluation
In order to evaluate our classification method, we resort to experienced tax auditors who manually verified the results of our method. We used the F-measure to analyze the precision and recall of our approach. In fact, the F-measure enforces a better balance between performance on the minority and majority classes, and is more suitable in the case of imbalanced data, which arises quite frequently in real-world applications.

In this evaluation we applied the following methodology. We selected the 120 contributors with the highest values in the ranking list computed by the SVD method, which represent a group of potential fraudsters. Then we selected the 50 contributors with the lowest SVD values, which in turn correspond to potential non-fraudsters. Subsequently, we submitted these two groups (i.e., 120 potential fraudsters and 50 potential non-fraudsters) to be analyzed by two experienced tax auditors. After their analysis, the auditors concluded that, from the group of 120 classified as fraudsters, 85 were correct, and from the group of 50 selected as non-fraudsters, 23 were actually not fraudsters.

We organized the auditors' evaluation into a confusion matrix (shown in Table 5), which presents the positive and negative classification of our results. From this table, we obtain the values of Accuracy (A), Precision (P) and Recall (R). True Positive (TP) occurs when a real fraudster was correctly selected by our method (a correct decision). False Positive (FP) occurs when a non-fraudster was wrongly selected as a fraudster (an incorrect decision). False Negative (FN) occurs when a real fraudster was wrongly not selected (an incorrect decision). True Negative (TN) occurs when a non-fraudster was correctly not selected (a correct decision).
Table 5. Confusion matrix with positive and negative classification

                          IS A REAL FRAUDSTER?
                    CORRECT                 NOT CORRECT
SELECTED            85                      35
                    True Positive (TP)      False Positive (FP)
NOT SELECTED        7                       23
                    False Negative (FN)     True Negative (TN)
Below we present the computed values of Accuracy (A), Precision (P) and Recall (R). These values demonstrate the effectiveness of our classification method. Since we achieved 80% in F-measure, we may conclude that this method correctly captured potential fraudsters.

Accuracy (A) = (TP + TN) / (TP + FP + FN + TN) = (85 + 23) / (85 + 35 + 7 + 23) = 72.00%

Precision (P) = TP / (TP + FP) = 85 / (85 + 35) = 70.83%

Recall (R) = TP / (TP + FN) = 85 / (85 + 7) = 92.39%

F-measure = 2PR / (P + R) = 80.19%
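These figures can be recomputed directly from the Table 5 counts; the snippet below is a plain restatement of the formulas above.

```python
tp, fp, fn, tn = 85, 35, 7, 23  # counts from Table 5

accuracy = (tp + tn) / (tp + fp + fn + tn)                  # 72.00%
precision = tp / (tp + fp)                                  # 70.83%
recall = tp / (tp + fn)                                      # 92.39%
f_measure = 2 * precision * recall / (precision + recall)    # 80.19%

print(f"A={accuracy:.2%}  P={precision:.2%}  R={recall:.2%}  F={f_measure:.2%}")
```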
5. Related Work
Glancy and Yadav [7] proposed a quantitative model for detecting fraudulent financial reporting. The model detects the attempt to conceal information and/or present incorrect information in annual filings with the US Securities and Exchange Commission (SEC). The model uses essentially all of the information contained in a text document for fraud detection.

Ngai et al. [8] presented a review of, and classification scheme for, the literature on the application of data mining techniques for the detection of financial fraud. The findings of this review clearly show that data mining techniques have been applied most extensively to the detection of insurance fraud, although corporate fraud and credit card fraud have also attracted a great deal of attention in recent years. The main data mining techniques used for financial fraud detection are logistic models, neural networks, the Bayesian belief network, and decision trees, all of which provide primary solutions to the problems inherent in the detection and classification of fraudulent data.

Bhattacharyya et al. [9] evaluated two advanced data mining approaches, support vector machines and random forests, together with the well-known logistic regression, as part of an attempt to better detect (and thus control and prosecute) credit card fraud. The study was based on real-life data of transactions from an international credit card operation.

Ravisankar et al. [10] used data mining techniques such as Multilayer Feed Forward Neural Network (MLFF), Support Vector Machines (SVM), Genetic Programming (GP), Group Method of Data Handling (GMDH), Logistic Regression (LR), and Probabilistic Neural Network (PNN) to identify companies that resort to financial statement fraud. Each of these techniques was tested on a dataset involving 202 Chinese companies and compared with and without feature selection. PNN outperformed all the techniques without feature selection, and GP and PNN outperformed the others with feature selection, with marginally equal accuracies.

Kirkos et al. [11] explored the effectiveness of Data Mining (DM) classification techniques in detecting firms that issue fraudulent financial statements (FFS) and dealt with the identification of factors associated with FFS. This study investigated the usefulness of Decision Trees, Neural Networks and Bayesian Belief Networks in the identification of fraudulent financial statements. These three models were compared in terms of their performance.

Sánchez et al. [12] proposed the use of association rules to extract knowledge so that normal behavior patterns may be distinguished from unlawful transactions in transactional credit card databases, in order to detect and prevent fraud. The proposed methodology has been applied to data about credit card fraud in some of the most important retail companies in Chile.

Li et al. [13] applied Bayesian Classification and Association Rules to identify the signs of fraudulent accounts and the patterns of fraudulent transactions. Detection rules were developed based on the identified signs and applied to the design of a fraudulent account detection system. Empirical verification supported that this fraudulent account detection system can successfully identify fraudulent accounts in early stages and is able to provide a reference for financial institutions.

Phua et al. [14] presented a survey that categorizes, compares, and summarizes almost all technical and review articles in automated fraud detection published between 2000 and 2010. This research discussed the main methods and techniques used to detect frauds in an automatic way, together with their problems.

6. CONCLUSION
This paper proposes a method for classifying taxpayers in order to help detect potential fraudsters. In our experiments, we discovered key patterns. Through statistical techniques we observed that: (1) the taxpayers analyzed show a high frequency of tax evasion indicators, (2) the indicators studied showed an increase in frequency from 2009 to 2010, (3) indicators C, E, F, G and H have the highest frequency in both periods, and (4) there are many fraud indicator sets that occur with great frequency in both periods. Our proposed method allows reducing the number of indicators that a tax auditor should evaluate, which will save time and increase accuracy during the auditing process.

We used the association rule method in order to verify the existence of some indicators that determine others. This perspective reveals that taxpayers tend to commit different types of frauds together. Besides, the analysis of the rules listed as "good rules" revealed that 72% of the 2009 rules are repeated in 2010, so current auditing methods are not efficient in discouraging fraudsters from continuing to commit irregularities.

Another technique used in this study was the classification of the fraud indicators by the similarity between them. With this technique, it was possible to identify groups of fraud according to the similarity between their indicators (Figure 5). This technique could be very useful to the Finance Department for redefining tax declaration procedures in order to check irregularities before they happen.

We have also applied dimensionality reduction techniques as a way to create a scale of fraud risk. Thus, we investigated two dimensionality reduction techniques to reduce the fourteen indicators of fraud to a single dimension. For this purpose, the Singular Value Decomposition (SVD) technique was more feasible than Principal Component Analysis (PCA), indicating that it is possible to create a scale to identify the propensity of taxpayers for fraud.

Last but not least, our method proved to be accurate in detecting fraudsters, achieving an F-measure of 80%. Indeed, this method has two important advantages: it is an automatic method and it needs to evaluate only a few fraud indicators to infer potential fraudsters. Clearly, this method is of great importance to Brazilian fiscal agencies, since it can allow immediate actions that may mitigate frauds.

As an opportunity for future work, there is the need to investigate other dimensionality reduction techniques, as well as to study Outlier Detection techniques to find new evidence of fraud that is not so trivially perceived, and thus improve fraud detection.
7. REFERENCES
[1] Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2: 433-459.
[2] Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. Intl. Conf. on Very Large Databases, pp. 487-499.
[3] Black, P. E., "Manhattan distance", in Dictionary of Algorithms and Data Structures [online], Vreda Pieterse and Paul E. Black, eds., 31 May 2006. Available from: http://www.nist.gov/dads/HTML/manhattanDistance.html
[4] Jolliffe, I.T., 2002. Principal Component Analysis, Springer Series in Statistics, 2nd ed., Springer, NY. ISBN 978-0-387-95442-4.
[5] Kalman, D., 1996. A singularly valuable decomposition: the SVD of a matrix. College Math. J. 27:2-23.
[6] Ullman, J. D., 2010. Mining of Massive Datasets.
[7] Glancy, F. H. and Yadav, S. B., "A computational model for financial reporting fraud detection," Decision Support Systems, vol. 50, no. 3, pp. 595-601, Feb. 2011.
[8] Ngai, E., Hu, Y., Wong, Y., Chen, Y., and Sun, X., "The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature," Decision Support Systems, vol. 50, no. 3, pp. 559-569, 2011.
[9] Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J. C., 2011. "Data mining for credit card fraud: A comparative study," Decision Support Systems, vol. 50, no. 3, pp. 602-613.
[10] Ravisankar, P., Ravi, V., Raghava Rao, G., Bose, I., "Detection of financial statement fraud and feature selection using data mining techniques," Decision Support Systems, vol. 50, no. 2, January 2011, pp. 491-500.
[11] Kirkos, E., Spathis, C., and Manolopoulos, Y., 2007. "Data mining techniques for the detection of fraudulent financial statements," Expert Systems with Applications 32 (4) (2007) 995-1003.
[12] Sánchez, D., Vila, M.A., Cerda, L., Serrano, J.M., "Association rules applied to credit card fraud detection," Expert Systems with Applications 36 (2) (2009) 3630-3640.
[13] Li, S.-H., Yen, D. C., Lu, W.-H., Wang, C. Identifying the signs of fraudulent accounts using data mining techniques.
[14] Phua, C., Lee, V., Smith, K., Gayler, R. A Comprehensive Survey of Data Mining-based Fraud Detection Research.
[15] http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)
