An Empirical Method For Discovering Tax Fraudsters: A Real Case Study of Brazilian Fiscal Evasion
adapted to display the existence or absence of evidence of fraud for each taxpayer, where 1 (one) indicates the presence and 0 (zero) indicates the absence of evidence of fraud. To anonymize taxpayers, a sequential ID was assigned to each one. Each of the fourteen fraud indicators was renamed A, B, C, D, E, F, G, H, I, J, K, L, M and N.

Figure 2. Proposed Method Overview (each step with its process input and output: fraud indicator occurrences; relevant fraud indicators; reduced set of unique determinants; similar clusters in both analyzed years; fraud risk scale, F-measure 80%)
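As a minimal illustration of the anonymized matrix described above (the values and the pandas representation are assumptions for the sketch, not the paper's actual data):

```python
import pandas as pd

# Hypothetical fragment of the anonymized taxpayer matrix: each row is a
# taxpayer identified only by a sequential ID, and each of the fourteen
# fraud indicators (renamed A..N) is 1 when evidence of fraud is present
# and 0 when it is absent.
indicators = list("ABCDEFGHIJKLMN")
data = pd.DataFrame(
    [[1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1]],
    columns=indicators,
)
data.index.name = "taxpayer_id"
print(data)
```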
3. BEHAVIOURAL ANALYSIS OF TAXPAYERS
In this section, we present a detailed description of the proposed method. Figure 2 presents an overview of the method, depicting each step with its corresponding input and output. The first step of the method is to compute the frequency of fraud indicators (this was explained in section 2.1). The second step aims to find the most relevant fraud indicators and their correlation. In the third step, we correlate fraud indicators and compare their evolution in both years. Finally, in the fourth step we classify taxpayers using a fraud scale, which allows measuring the tendency of a taxpayer to commit fraud. Each step is explained in detail in the following sections.
3.1 Discovering The Most Relevant Fraud Indicators
The second step of the method is to understand the frequency of fraud indicators. For this task we use the association rule technique, treating each fraud indicator as an item set. The mining algorithm used in this study is the Apriori algorithm [2], considered to be the most widely used in the literature for this purpose.

We executed the Apriori algorithm using support = 60% and confidence = 60%. These values were defined as the minimum acceptable by the tax auditors. The implementation also reports a parameter called "lift", which attempts to measure how "interesting" a rule is. A more interesting rule may be a more useful rule because it is more novel or unexpected. Without getting into the math, lift takes into account the support for a rule, but also favors situations where the left-hand side and right-hand side items are not abundant individually, yet their relatively few occurrences almost always happen together. The larger the value of lift, the more "interesting" the rule may be.

Figure 3 shows a plot relating support and confidence. Even though we see a two-dimensional plot, three variables are actually represented: support is on the X-axis, confidence is on the Y-axis, and lift, our measure of interestingness, is shown by the color of each dot. The darker the dot, the closer the lift of that rule is to 1.1, which appears to be the highest lift value among these rules.

Figure 3. Confidence, support and lift: (a) 2009 dataset and (b) 2010 dataset

Regarding this plot, it is worth noting that all of the rules with high lift have support below 80%, in both 2009 and 2010. On the other hand, there are rules with both high lift and high confidence, which is quite positive.

Based on this evidence, we focus on a smaller set of rules, here called "good rules", that have the highest lift: we used lift > 1.05 for the 2009 data and lift > 1.10 for the 2010 data. An example of these good rules can be seen in Figure 4.

Figure 4. A sample set of good rules: (a) 2009 dataset and (b) 2010 dataset

Analyzing the rules listed as "good rules", we identified that 72% of the rules existing in 2009 are repeated in 2010. This analysis shows that the behavior pattern of taxpayers repeats in subsequent years, which calls for urgent intervention to prevent tax evasion.
The main result achieved in this step is the reduced set of unique determinants on the left-hand side of the rules, in both 2009 and 2010: C, E, F, H and L. After executing the Apriori algorithm, we obtained 2,563 rules in 2009 and 1,527 rules in 2010. From those rules, we selected 176 good rules (lift > 1.05) in 2009 and 150 good rules (lift > 1.10) in 2010.
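The paper does not state which Apriori implementation was used; as a minimal sketch of this mining-and-filtering step, assuming Python's mlxtend library and a synthetic stand-in for the taxpayer matrix:

```python
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical stand-in for the anonymized taxpayer matrix: one boolean
# column per fraud indicator (A..N), one row per taxpayer.
rng = np.random.default_rng(1)
indicators = list("ABCDEFGHIJKLMN")
df = pd.DataFrame(rng.random((1000, 14)) < 0.7, columns=indicators)

# Frequent indicator sets with the minimum support used in the paper.
frequent = apriori(df, min_support=0.60, use_colnames=True)

# Rules with confidence >= 60%; mlxtend also reports lift for each rule.
rules = association_rules(frequent, metric="confidence", min_threshold=0.60)

# Keep only the "good rules", i.e. those with the highest lift
# (the paper used lift > 1.05 for 2009 and lift > 1.10 for 2010).
good_rules = rules[rules["lift"] > 1.05]
print(good_rules[["antecedents", "consequents",
                  "support", "confidence", "lift"]].head())
```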
3.3.1 Principal Component Analysis (PCA)
Each principal component (PC1, PC2, PC3, ...) accounts for a share of the total variance of the standardized data. The first principal component accounts for about 21% of the total variance, whereas the first two components together reach about 35%. To achieve at least 50% of the variance, the first four components would be needed (a reduction from fourteen indicators to four). The variance of the first principal component is 2.95 (the standard deviation squared), much higher than the average of the variances (equal to 1). In addition, Figure 6 does not show a relevant difference between the main components. In this example, it is possible to retain the first two or three components, indicating a reduction to two or three dimensions.

Applying PCA to the reduced matrix of determinants, the first two components reach about 62% of the total variance; this strategy gives a better result for reducing the matrix.

To define which components to use, one should calculate the squared standard deviation of each component and choose those with values greater than 1, as shown in Table 4.

Table 4. Square of the standard deviation of each principal component (2010)

                      PC1     PC2     PC3     PC4     PC5
Standard
Deviation ^ 2         2.063   1.038   0.962   0.788   0.147

From the values obtained, the components to be used are PC1 and PC2. Figure 7 reinforces this choice, showing these two values as the highest.
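To illustrate this component-selection rule (retain components whose squared standard deviation exceeds 1), here is a minimal sketch using scikit-learn; the library choice and the synthetic data are assumptions, not the authors' setup:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical binary matrix: rows are taxpayers, columns are the five
# determinants (C, E, F, H, L); the paper works on standardized data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5)).astype(float)
X_std = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_std)

# Variance of each component (the squared standard deviation, analogous
# to the values reported in Table 4).
print("SD^2 per component:", pca.explained_variance_)

# Selection rule from the paper: keep components whose variance exceeds 1
# (the average variance of standardized data).
retained = np.where(pca.explained_variance_ > 1)[0] + 1
print("Components to retain:", [f"PC{i}" for i in retained])

# Cumulative share of total variance explained by the first k components.
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_))
```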
Figure 8. Relationship of each variable with PC1 and PC2

By analyzing Figure 8, we can observe that the plot has the characteristic of an S-curve, not a straight line. However, the Singular-Value Decomposition (SVD) technique yielded better results than PCA for the data analyzed in our study.

3.3.2 Singular-Value Decomposition (SVD)
Following the same principle as PCA, the SVD [5][6] technique was also applied to the determinants found by the Apriori algorithm: C, E, F, H and L. Figure 9 demonstrates how the matrix of this study was reduced to a single dimension, showing the SVD data reduction for 2010 as follows: (a) fifty taxpayers with the lowest values of fraud indicators; and (b) one thousand taxpayers with the lowest values of fraud indicators.

This technique proved more feasible for reducing the dimensionality of the data used in this study, because it is simpler and yields a line exactly as expected. However, it must be applied to all data again whenever a new row is inserted into the matrix, since the SVD calculation changes according to the values in the sample.

The results point to the possibility of creating a fraud risk scale for taxpayers using SVD. Thus, we answer the final question: 4) Can we set a scale to indicate the risk of a taxpayer committing fraud?
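A minimal NumPy sketch of how a rank-1 SVD reduction can order taxpayers on a single axis; the synthetic matrix and the use of the first right-singular vector as the score are illustrative assumptions consistent with the description above, not the authors' exact procedure:

```python
import numpy as np

# Hypothetical binary matrix: rows are taxpayers, columns are the five
# determinants (C, E, F, H, L) selected by the Apriori step; 1 marks the
# presence of evidence for that fraud indicator.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(1000, 5)).astype(float)

# SVD of the taxpayer-by-indicator matrix: X = U @ diag(S) @ Vt.
# Note: as the text observes, this must be recomputed whenever a new
# row is added, since the factorization depends on the whole sample.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-1 reduction: project every taxpayer onto the first right-singular
# vector. The resulting scalar places taxpayers on a single axis, which
# can serve as a fraud risk scale.
risk_score = X @ Vt[0]          # equivalently U[:, 0] * S[0]

# Rank taxpayers from highest to lowest score, as done when picking the
# highest-ranked (potential fraudsters) and lowest-ranked (potential
# non-fraudsters) groups for auditor review.
ranking = np.argsort(-risk_score)
```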
4. RESULTS EVALUATION
In order to evaluate our classification method, we resorted to experienced tax auditors who manually verified the results of our method. We used the F-measure to analyze the precision and recall of our approach. The F-measure enforces a better balance between performance on the minority and majority classes, and is more suitable in the case of imbalanced data, which arises quite frequently in real-world applications.

In this evaluation we applied the following methodology. We selected the 120 contributors with the highest values in the ranking list computed by the SVD method, which represent a group of potential fraudsters. Then we selected the 50 contributors with the lowest SVD values, which in turn correspond to potential non-fraudsters. We then submitted these two groups (i.e., 120 potential fraudsters and 50 potential non-fraudsters) to be analyzed by two experienced tax auditors. After their analysis, the auditors concluded that, from the group of 120 classified as fraudsters, 85 were correct, and from the group selected as non-fraudsters, 23 actually were not fraudsters.

We organized the auditors' evaluation into a confusion matrix (shown in Table 5), which presents the positive and negative classification of our results. From this table, we obtain the values of Accuracy (A), Precision (P) and Recall (R). A True Positive (TP) occurs when a real fraudster was correctly selected by our method (a correct decision). A False Positive (FP) occurs when a non-fraudster was wrongly selected as a fraudster (an incorrect decision). A False Negative (FN) occurs when a real fraudster was wrongly left unselected (an incorrect decision). A True Negative (TN) occurs when a non-fraudster was correctly not selected (a correct decision).
Table 5. Confusion matrix with positive and negative classification

                     IS A REAL FRAUDSTER?
                     CORRECT                  NOT CORRECT
SELECTED             85                       35
                     True Positive (TP)       False Positive (FP)
NOT SELECTED         7                        23
                     False Negative (FN)      True Negative (TN)
Below we present the computed values of Accuracy (A), Precision (P) and Recall (R). These values demonstrate the effectiveness of our classification method. Since we achieved 80% in F-measure, we may conclude that the method correctly captured potential fraudsters.

Accuracy (A) = (TP + TN) / (TP + FP + FN + TN) = (85 + 23) / (85 + 35 + 7 + 23) = 72.00%

Precision (P) = TP / (TP + FP) = 85 / (85 + 35) = 70.83%

Recall (R) = TP / (TP + FN) = 85 / (85 + 7) = 92.39%

F-measure = 2PR / (P + R) = 80.19%
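These values can be checked directly from Table 5; a short Python verification:

```python
# Confusion-matrix counts taken from Table 5.
TP, FP, FN, TN = 85, 35, 7, 23

accuracy = (TP + TN) / (TP + FP + FN + TN)           # 0.7200
precision = TP / (TP + FP)                           # 0.7083
recall = TP / (TP + FN)                              # 0.9239
f_measure = 2 * precision * recall / (precision + recall)  # 0.8019

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F-measure: {f_measure:.2%}")
```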
5. RELATED WORK
Glancy and Yadav [7] proposed a quantitative model for detecting fraudulent financial reporting. The model detects the attempt to conceal information and/or present incorrect information in annual filings with the US Securities and Exchange Commission (SEC), using essentially all of the information contained in a text document for fraud detection.

Ngai et al. [8] presented a review of, and classification scheme for, the literature on the application of data mining techniques to the detection of financial fraud. The findings of this review clearly show that data mining techniques have been applied most extensively to the detection of insurance fraud, although corporate fraud and credit card fraud have also attracted a great deal of attention in recent years. The main data mining techniques used for financial fraud detection are logistic models, neural networks, the Bayesian belief network, and decision trees, all of which provide primary solutions to the problems inherent in the detection and classification of fraudulent data.

Bhattacharyya et al. [9] evaluated two advanced data mining approaches, support vector machines and random forests, together with the well-known logistic regression, as part of an attempt to better detect (and thus control and prosecute) credit card fraud. The study was based on real-life transaction data from an international credit card operation.

Ravisankar et al. [10] used data mining techniques such as Multilayer Feed Forward Neural Network (MLFF), Support Vector Machines (SVM), Genetic Programming (GP), Group Method of Data Handling (GMDH), Logistic Regression (LR), and Probabilistic Neural Network (PNN) to identify companies that resort to financial statement fraud. Each of these techniques was tested on a dataset involving 202 Chinese companies and compared with and without feature selection. PNN outperformed all the techniques without feature selection, while GP and PNN outperformed the others with feature selection, with marginally equal accuracies.

Kirkos et al. [11] explored the effectiveness of data mining classification techniques in detecting firms that issue fraudulent financial statements (FFS) and dealt with the identification of factors associated with FFS. This study investigated the usefulness of Decision Trees, Neural Networks and Bayesian Belief Networks in the identification of fraudulent financial statements, comparing the three models in terms of their performance.

Sánchez et al. [12] proposed the use of association rules to extract knowledge so that normal behavior patterns may be obtained from unlawful transactions in transactional credit card databases, in order to detect and prevent fraud. The proposed methodology was applied to data about credit card fraud in some of the most important retail companies in Chile.

Li et al. [13] applied Bayesian classification and association rules to identify the signs of fraudulent accounts and the patterns of fraudulent transactions. Detection rules were developed based on the identified signs and applied to the design of a fraudulent account detection system. Empirical verification showed that this system can successfully identify fraudulent accounts in their early stages and is able to provide a reference for financial institutions.

Phua et al. [14] presented a survey that categorizes, compares, and summarizes almost all published technical and review articles on automated fraud detection published between 2000 and 2010. This research discussed the main methods and techniques used to detect fraud automatically, together with their problems.
6. CONCLUSION
This paper proposes a method for classifying taxpayers in order to help detect potential fraudsters. In our experiments, we discovered key patterns. Through statistical techniques we observed that: (1) the taxpayers analyzed show a high frequency of tax evasion indicators, (2) the indicators studied showed an increase in frequency from 2009 to 2010, (3) indicators C, E, F, G and H have the highest frequency in both periods, and (4) there are many fraud indicator sets that occur with great frequency in both periods. Our proposed method reduces the number of indicators that a tax auditor has to evaluate, which saves time and increases accuracy during the auditing process.

We used the association rule method to verify the existence of indicators that determine others. This perspective reveals that taxpayers tend to commit different types of fraud together. Moreover, the analysis of the rules listed as "good rules" revealed that 72% of the rules existing in 2009 are repeated in 2010, so current auditing methods are not efficient in discouraging fraudsters from continuing to commit irregularities.

Another technique used in this study was the classification of fraud indicators by the similarity between them. With this technique, it was possible to identify groups of fraud according to the similarity between their indicators (Figure 5). This technique could be very useful to the Finance Department for redefining tax declaration procedures in order to check irregularities before they happen.

We also applied dimensionality reduction as a way to create a scale of fraud risk, investigating two techniques to reduce the fourteen fraud indicators to a single dimension. For this purpose, Singular Value Decomposition (SVD) proved more feasible than Principal Component Analysis (PCA), indicating that it is possible to create a scale identifying taxpayers' propensity for fraud.

Last but not least, our method proved to be very accurate in detecting fraudsters, achieving an F-measure of 80%. Indeed, this method has two important advantages: it is automatic, and it needs to evaluate only a few fraud indicators to infer potential fraudsters. Clearly, this method is of great importance to Brazilian fiscal agencies, since it allows immediate actions that may mitigate fraud.

As an opportunity for future work, we point out the need to investigate other dimensionality reduction techniques, as well as Outlier Detection techniques, to find new evidence of fraud that is not so trivially perceived and thus improve fraud detection.
7. REFERENCES
[1] Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2: 433-459.
[2] Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. Intl. Conf. on Very Large Databases, pp. 487-499.
[3] Black, P.E., 2006. Manhattan distance. In Dictionary of Algorithms and Data Structures [online], Vreda Pieterse and Paul E. Black, eds., 31 May 2006. Available from: http://www.nist.gov/dads/HTML/manhattanDistance.html
[4] Jolliffe, I.T., 2002. Principal Component Analysis, Springer Series in Statistics, 2nd ed., Springer, NY. ISBN 978-0-387-95442-4.
[5] Kalman, D., 1996. A singularly valuable decomposition: the SVD of a matrix. College Math. J., 27: 2-23.
[6] Ullman, J.D., 2010. Mining of Massive Datasets.
[7] Glancy, F.H., Yadav, S.B., 2011. A computational model for financial reporting fraud detection. Decision Support Systems, 50(3): 595-601.
[8] Ngai, E., Hu, Y., Wong, Y., Chen, Y., Sun, X., 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3): 559-569.
[9] Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C., 2011. Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3): 602-613.
[10] Ravisankar, P., Ravi, V., Raghava Rao, G., Bose, I., 2011. Detection of financial statement fraud and feature selection using data mining techniques. Decision Support Systems, 50(2): 491-500.
[11] Kirkos, E., Spathis, C., Manolopoulos, Y., 2007. Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4): 995-1003.
[12] Sánchez, D., Vila, M.A., Cerda, L., Serrano, J.M., 2009. Association rules applied to credit card fraud detection. Expert Systems with Applications, 36(2): 3630-3640.
[13] Li, S.-H., Yen, D.C., Lu, W.-H., Wang, C. Identifying the signs of fraudulent accounts using data mining techniques.
[14] Phua, C., Lee, V., Smith, K., Gayler, R. A comprehensive survey of data mining-based fraud detection research.
[15] http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)