Fraud Detection Based-On Data Mining On Indonesian E-Procurement System (SPSE)
Abstract - This paper focuses on the detection of potential fraud in the procurement process conducted through the Indonesian E-Procurement System (SPSE). Potential fraud in procurement takes very diverse forms, such as corruption, collusion and tender fixation, and, more importantly, it is found in various stages ranging from budgeting to utilization.

By analyzing the data, we show that several techniques may work effectively to detect these frauds. Furthermore, we find that SPSE data contain a large number of data-related issues, such as inconsistencies in the hierarchical structure among official agencies and in company names among institutions, as well as missing data points. In addition, the size of the data is very large (about 515,069 projects), which makes fraud detection a non-trivial problem.

In this paper, we implement a fraud detection mechanism using data mining techniques based on supervised learning. The use of supervised learning depends on the availability of labeled data (fraud and non-fraud), which are extracted using string matching and a manual extraction procedure from several data sources, including court rulings, Komisi Pemberantasan Korupsi (KPK) publications and public comments. Test results show that the Naive Bayes algorithm with 14 attributes obtained from dimension reduction produces the best performance, with promising results compared to the other fraud detection techniques available to date. In addition, the sensitivity analysis deployed in the dimension reduction process has not only significantly reduced the dimension of the problem but has also improved the performance of the fraud detection technique.

Keywords - fraud detection, e-procurement, data labeling, data mining, dimensionality reduction

I. INTRODUCTION

The Electronic Procurement System (SPSE) is a system that manages almost all procurement projects undertaken by the Government of the Republic of Indonesia. Up to the beginning of 2016, the total budget processed through SPSE reached IDR 870 trillion from 731 agencies and 34 provinces [1]. The procurement process in SPSE is divided into two major phases, i.e. the preparation stage and the implementation stage. In the preparation stage, a committing officer called PPK submits a procurement plan to Pokja ULP, a unit assigned to handle technical procurement processes. Pokja ULP then creates and determines the procurement document. During the implementation phase, this unit creates the procurement package, announces and opens the procurement process, processes the selection procedure and announces the winner. After the announcement, it deals with complaints and finally closes the procurement process with contract signing [2].

According to Journal Senarai LKPP, the potential for fraud in the procurement process ranges from the planning stage (such as budget markup and work packaging) to the utilization stage (the quantity and quality of procured goods or services do not correspond with the requirements, or they are not used at all) [3]. Potential frauds in the procurement system take very diverse forms, such as collusion, corruption and tender fixation [4]. Various studies have been conducted to detect potential frauds in the procurement system, such as: fraud detection techniques based on statistical analysis using bidding price [5] [6] and reserve price ratio [4] [7], fraud detection techniques based on proximity and similarity using spatial data and payment transaction documents [8] [5], and fraud detection techniques using a data mining approach [9] [10] [11]. Based on our research, two fraud detection techniques have been used to detect fraud in SPSE, i.e. fraud detection using the history of user activity [12] and the potential risk analysis method from Opentender.net [13].

One of the issues in conducting fraud detection in SPSE is the limited data types provided by SPSE itself. This substantially prevents many existing fraud detection techniques from being deployed to their full potential. Other problems encountered in SPSE data are: (1) missing values in SPSE data, (2) inconsistent subdivision names among government institutions, (3) inconsistent names for the same company across SPSE providers, (4) unavailability of labeled data and (5) skewed distribution of labeled data. Therefore, to detect potential fraud in SPSE based on data mining, extensive data preparation has to be performed.

We summarize our implementation of the fraud detection process in six steps: (1) selecting the appropriate supervised algorithms based on data mining, (2) determining suitable performance measurement techniques, (3) defining and searching for the baseline of fraud detection performance, (4) enriching attributes and building the model, (5) selecting the most influential attributes (using dimensional reduction techniques) and (6) analyzing the results.
1) Original SPSE Data Extraction
This step is carried out by downloading all data and attributes that are freely accessible through the SPSE official website. From this step, we have obtained five types of data

• Bidding schedules and stages may differ for each project.
B. Data Cleansing
Data cleansing is conducted to complete the missing values in our data.

1) Completing Missing Value
To deal with the missing values, we synchronize the data with other equivalent attributes. For example, for missing values in budgeting year, we use the year when the project started, while for missing values in institution and funding source, we use the name of the institution carrying out the project. With this process, we are able to complete the missing values, which in turn enables us to implement fraud detection techniques.

2) Hierarchy Extraction
Hierarchy extraction is conducted by matching the agency and unit names with all ministries and government agencies based on lexical similarity. Assuming that all corresponding agencies and units operate under one or more appropriate government units, all values sharing lexical similarity are then mapped under the same institution. As a result, all agencies and units found in the data are placed in a government-based hierarchy which suits our technique.

C. Data Labeling
This step labels the e-Procurement data as fraud or non-fraud using information extraction from court decision data, KPK publication data and public comment data. We scan all those data sources to find any information on whether a particular project contains fraud or not. To make sure the fraud projects are correctly labeled, the labeling process requires the following data: project name, agency to which the project is offered, company winning the project and fiscal year of the project; otherwise the project is not processed. Information extraction from the data sources above is performed using string matching based on several criteria and a manual extraction technique.

This information extraction process discovered 202 fraud projects and 19,653 non-fraud projects. Non-fraud labeling is performed by eliminating all projects having any indication of fraud. This guarantees that all successfully labeled data are not mislabeled and are clear of any information noise.

IV. DATA ANALYSIS
Data analysis is implemented to understand the data requirements of the existing fraud detection techniques and their compatibility with our available data. Table II summarizes the data compatibility of each existing fraud detection technique. From it we conclude that the existing techniques that suit the characteristics of the SPSE data are bid price analysis [5] [6], reserve price ratio analysis [4] [7] and potential risk analysis from Opentender.net [13]. Furthermore, the data mining technique that could potentially be used is supervised learning [10] [16] [17].

TABLE II. DATA ANALYSIS RESULTS FOR COMPATIBILITY BETWEEN THE EXISTING FRAUD DETECTION TECHNIQUES AND THE AVAILABILITY OF SPSE DATA

| No | Fraud Detection Technique | Data Requirement | Data Availability |
|---|---|---|---|
| 1 | Bid price analysis [5] [6] | The bid price of all participants of the auction | The bid price of all participants of the auction |
| 2 | Reserve price ratio analysis [4] [7] | Win bid price; Reserve price | Win bid price; Reserve price |
| 3 | Project Cost Analysis [14] | Company's address; Project location; The entire value companies in the procurement system; Company workers data; The workload of the project company | Company's address; Project location; The entire value companies in the procurement system |
| 4 | Pattern Rotation Analysis [18] | All bidders data with consistent naming | All bidders data but inconsistent naming |
| 5 | Fraud Detection by Similarity [8] | The address of company participants and auctions organizers; Home address of participants and auctions organizers; Payments document for participants; Document validation for auctions organizers | The address of company |
| 6 | Potential Risk Analysis from Opentender.net [13] | The value of the contract; The number of bidders; The reserve price ratio; Scheduling of project; The number of repeated winners | The value of the contract; The number of bidders; The reserve price ratio; Scheduling of project; The number of repeated winners |
| | Data mining techniques | | |
| 1 | Association Rule [9] | All bidders data with consistent naming | All bidders data but inconsistent naming |
| 2 | Cluster Analysis^a [4] [9] | Data labeling; Win price and reserve price ratio; Skewness distribution from bid prices; Median value from bid prices | Win price and reserve price ratio; Skewness distribution from bid prices; Median value from bid prices |
| 3 | Supervised Learning [10] [16] [17] | Data labeling | |

a. In this paper, cluster analysis was not implemented because the attributes that must be analyzed are very diverse; hence each attribute needs equalization before the process can be carried out. In addition to this diversity, the large number of attributes complicates the implementation of this analysis. Note that cluster analysis is only deployed on similar attributes. Moreover, the enormous effort needed to interpret cluster results as occurrences of fraudulent activity makes cluster analysis impractical to implement.
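
The two cleansing rules described under Data Cleansing (filling a missing budgeting year from the project start year, and mapping agency names onto known institutions by lexical similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record fields, the institution list, the use of Python's `difflib` as the similarity measure and the 0.8 cutoff are all hypothetical choices.

```python
from difflib import get_close_matches

# Hypothetical canonical list of government institutions (illustrative only).
institutions = [
    "Kementerian Pendidikan dan Kebudayaan",
    "Kementerian Pekerjaan Umum",
]


def fill_budget_year(project):
    """Use the project's start year when the budgeting year is missing.

    `project` is a dict with the assumed keys "budget_year" and "start_year".
    """
    if project["budget_year"] is None:
        project["budget_year"] = project["start_year"]
    return project


def map_to_institution(agency, candidates, cutoff=0.8):
    """Return the lexically closest institution name, or None if no
    candidate reaches the similarity cutoff (an assumed threshold)."""
    matches = get_close_matches(agency, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Example: a record with a missing budgeting year and a misspelled agency name.
record = {"budget_year": None, "start_year": 2014,
          "agency": "Kementrian Pendidikan dan Kebudayaan"}
record = fill_budget_year(record)
record["institution"] = map_to_institution(record["agency"], institutions)
print(record["budget_year"], record["institution"])
```

Returning `None` below the cutoff, rather than forcing a best guess, leaves ambiguous names available for the manual review that the paper also relies on.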
V. FRAUD DETECTION ON SPSE
We then implement fraud detection techniques using supervised learning from data mining. This discussion is arranged into several steps.

Step 1: The selection of supervised algorithms that suit the procurement data is analyzed in this step. From the results of the analysis, we note that several algorithms that are effective for detecting fraud using supervised learning [10] can be deployed to analyze SPSE data, i.e. Naive Bayes, Bayesian networks, decision tree and neural network.

Step 2: In this step we analyze the proper performance measurement techniques for supervised learning in fraud detection. Our research has concluded that there are two proper parameters to determine the performance of fraud detection techniques: SAR (Squared Error, Accuracy, ROC Curve) and a misclassification cost called the Cost Model. SAR is suitable for judging the overall condition of the supplied data mining model [19], while the cost model is suitable because it is capable of measuring the saving value and misclassification cost from the implementation of a fraud detection technique on a supervised model [17].

TABLE III. COST MATRIX FOR FRAUD DETECTION IN E-PROCUREMENT SYSTEM

| | NonFraud (True) | Fraud (True) |
|---|---|---|
| Prediction (NonFraud) | 0 | - (Saving Values) |
| Prediction (Fraud) | - (Investigation Cost) | Saving Values - (Investigation Cost) |

Cost models are formed from the cost matrix by measuring the misclassification of a detection process [17]. The cost matrix used in this paper is shown in Table III. Misclassification costs are calculated based on the total investigation cost and the salvage value of the bidding price that occurs in a project. The cost matrix in Table III consists of four conditions established based on the predicted results and the saving from each quadrant. The cost matrix is formulated under the assumption that a procurement is only investigated if there are fraud indications from the procurement data, rather than investigating all projects. This assumption eliminates the cost of unnecessary investigations, which would otherwise be enormous.

Step 3: In this step we group attributes based on the fraud detection techniques. From these data it is known that there are four groups of attributes that can be used. All of these groups are then merged to build the data mining model using the supervised algorithms from Step 1. The goal is to find a group of attributes or combination of groups that has the best performance in fraud detection using supervised algorithms. After building and validating the model based on all these groups of attributes, as well as their combinations, we found that group (a4), from the Potential Risk Analysis from Opentender.net [13], gives the best performance of 80.19% with a saving percentage of 78.70%.

Step 4: [...] modeling. All attributes are then used to build the model, which is then followed by performance testing. The result of performance testing is exhibited in Table IV. Table IV shows that the average performance using all these attributes is 82.88%, with a saving value of 75.73% or IDR 263 billion. The performance increases significantly compared to the performance of the test results using a combination of attribute groups in Step 3.

TABLE IV. TEST RESULTS USING ALL AVAILABLE ATTRIBUTES

| No | Attributes | Algorithm | SAR (%) | COST (%) | Avg. (%) |
|---|---|---|---|---|---|
| 1 | all^a | Naive Bayes | 90.0 | 75.73 | 82.88 |
| 2 | all^a | Bayes Network | 90.3 | 74.68 | 82.53 |
| 3 | all^a | Decision Tree | 70.1 | -23.66 | 23.25 |

a. Tests are performed using all available attributes.

Step 5: Dimensional reduction is implemented to select and reduce the attributes that can provide the best performance. From the available literature, we note that several analytical techniques for dimensional reduction can be carried out, such as Correlation Analysis (CA), Principal Component Analysis (PCA) and Weighted Principal Component Analysis (WPCA) [20]. We combine these three techniques with sensitivity analysis (SA) to analyze how much changing an attribute affects the misclassification cost. The baseline used for SA is the best performance generated in the previous step, which is 82.88%. SA is implemented by comparing the performance of the model generated with and without each attribute to the baseline value. If the result differs from the baseline by less than one percent, the attribute is removed because it is considered insignificant. These steps are repeated for all 52 attributes and yield 14 significant attributes for our fraud detection model. The test results of dimensional reduction are shown in Table V.

TABLE V. TEST RESULTS FROM DIMENSIONAL REDUCTION PROCESS

| No | RD technique | Number of attributes | Algorithm | SAR (%) | COST (%) | Avg. (%) |
|---|---|---|---|---|---|---|
| 1 | WPCA | 11 | Naive Bayes | 85.99 | 67.67 | 76.83 |
| 2 | WPCA | 11 | Bayes Network | 83.77 | 61.19 | 72.48 |
| 3 | WPCA | 11 | Decision Tree | 54.63 | -114.47 | -29.92 |
| 4 | PCA | 11 | Naive Bayes | 55.20 | -46.29 | 4.45 |
| 5 | PCA | 11 | Bayes Network | 53.17 | -54.97 | -0.90 |
| 6 | PCA | 11 | Decision Tree | 30.40 | -117.64 | -43.62 |
| 7 | CA | 29 | Naive Bayes | 65.88 | -109.30 | -21.71 |
| 8 | CA | 29 | Bayes Network | 58.18 | -26.85 | 15.67 |
| 9 | CA | 29 | Decision Tree | 27.80 | -96.74 | -34.47 |
| 10 | SA | 14 | Naive Bayes | 89.79 | 78.68 | 84.24 |
| 11 | SA | 14 | Bayes Network | 88.30 | 68.98 | 78.64 |
| 12 | SA | 14 | Decision Tree | 82.94 | 70.07 | 76.51 |
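
The sensitivity analysis (SA) described in Step 5 can be sketched as a leave-one-attribute-out loop. This is only an illustration under assumptions, not the authors' code: the `evaluate` callback (standing in for training and testing a model on an attribute subset), the toy contribution table and the reading of "one percent" as an absolute difference in percentage points are all hypothetical. The `average_score` helper reflects the observation that the Avg column of Table V equals the mean of the SAR and COST columns (for example, 85.99 and 67.67 give 76.83).

```python
def average_score(sar, cost):
    """Mean of SAR (%) and COST (%), consistent with the Avg column of Table V."""
    return (sar + cost) / 2.0


def select_attributes(attributes, evaluate, baseline, threshold=1.0):
    """Keep the attributes whose removal moves the score away from the
    baseline by at least `threshold` percentage points (assumed reading).

    `evaluate(subset)` is a hypothetical callback returning a score in percent.
    """
    kept = []
    for attr in attributes:
        without = [a for a in attributes if a != attr]
        change = abs(baseline - evaluate(without))
        if change >= threshold:
            kept.append(attr)
    return kept


# Toy stand-in for model training: each attribute adds a fixed contribution.
# Attribute names and numbers are invented for illustration.
CONTRIBUTION = {"reserve_price_ratio": 6.0, "num_bidders": 2.5, "noise_col": 0.2}


def toy_evaluate(subset):
    return 70.0 + sum(CONTRIBUTION[a] for a in subset)


attrs = list(CONTRIBUTION)
baseline = toy_evaluate(attrs)
print(select_attributes(attrs, toy_evaluate, baseline))
```

In this toy run `noise_col` is dropped because removing it changes the score by only about 0.2 points, while the other two attributes are kept. With a real model, `evaluate` would retrain and score the classifier for each subset, so the loop costs one training run per attribute.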
| No | Attribute | Description |
|---|---|---|
| 7 | nominal_nilai_proyek | Contains ranking / grouping of the range of winning price |
| 14 | total_perubahanjadwal | Contains the total change of each of the stages that exist in an auction process |

Step 6: In this step, we analyze the probability distribution of the model producing the best performance. The purpose of this analysis is to find out how the value of each attribute affects the process of fraud detection. Furthermore, we are interested in discovering new knowledge on how to find fraudulent data in SPSE.

The model to analyze is the one with the 14 attributes resulting from dimensional reduction. This model is built using the naive Bayes algorithm. The naive Bayes algorithm works with the assumption of conditional independence

• The analysis of bidders and lists of terms to bid may also be used to analyze fraud in SPSE.

Last but not least, we realize that key attributes such as project values, winning price statistics, bid price and reserve price still play a significant role in detecting fraudulent data, especially in SPSE. It is also interesting to note that procurements funded through the national budget have a bigger potential to be fraudulent than the ones using the state budget.

VII. FUTURE WORKS

The process of data labeling in SPSE conducted in this study is still dominated by manual labeling. Our limited resources are therefore not sufficient to carry out the manual labeling process for all data. We are aware that named entity recognition techniques can be developed to automate this process, and we encourage the implementation of such automatic data labeling. We believe this will significantly improve the data mining model and produce better performance.

REFERENCES

[1] LKPP. (2016, May) Smart Report LPSE V.2. [Online]. http://report-lpse.lkpp.go.id/v2/beranda
[2] LKPP, "Peraturan Kepala Lembaga Kebijakan Pengadaan Barang/Jasa Pemerintah," Lembaga Kebijakan Pengadaan Barang/Jasa Pemerintah, Perka No. 1 Tahun 2015, 2015.
[6] Jeannette Brosig and J. Philipp Reiß, "Entry decisions and bidding behavior in sequential first-price procurement auctions: An experimental study," Games and Economic Behavior, pp. 50-74, 2007.
[8] Stefan Ruping, Natalja Punko, Bjorn Gunter, and Henrik Grosskreutz, "Procurement Fraud Discovery using Similarity Measure Learning," Transactions on Case-Based Reasoning, 2008.
[9] Ghedini Celia Ralha and Vinicius Sarmento Carlos Silva, "A multi-agent data mining system for cartel detection in Brazilian," Expert Systems with Applications, pp. 11642-11656, 2012.
[11] Philip K. Chan, Wei Fan, Andreas Prodromidis, and Salvatore J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection," IEEE Intelligent Systems' Special Issue on Data Mining, pp. 67-74, 1999.
[12] Arumsari, Totok P., Iswahyudi, Mucharor, and Akib P. (2014) AUDIT ATAS PELAKSANAAN LELANG SECARA ELEKTRONIK DALAM PENGADAAN BARANG DAN JASA PEMERINTAH. [Online]. http://www.bpkp.go.id/investigasi/berita/read/13521/0/AUDIT-ATAS-PELAKSANAAN-LELANG-SECARA-ELEKTRONIK-DALAM-PENGADAAN-BARANG-DAN-JASA-PEMERINTAH.bpkp
[14] Patrick Bajari and Lixin Ye, "Deciding Between Competition and Collusion," The Review of Economics and Statistics, vol. 85, pp. 971-989, 2003.
[16] Sam Maes, Karl Tuyls, Bram Vanshoenwinkel, and Bernard Manderick, "Credit Card Fraud Detection Using Bayesian and Neural Networks," 1993.
[17] Clifton Phua, Damminda Alahakoon, and Vincent Lee, "Minority Report in Fraud Detection: Classification of Skewed Data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 50-59, 2004.
[18] Kevin Lang and Robert W. Rosenthal, "The Contractors' Game," The RAND Journal of Economics, pp. 329-338, 1991.
[19] Rich Caruana and Alexandru Niculescu-Mizil, "Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria," New York, 2004.
[20] Amruta D. Pawar, Prakash N. Kalavadekar, and Swapnali N. Tambe, "A Survey on Outlier Detection Techniques for Credit Card Fraud Detection," IOSR Journal of Computer Engineering, vol. 16, no. 2, pp. 44-48, 2014.