Importance of Big Data in financial fraud detection

Vikash Sharma*
Faculty of Computer Studies,
Symbiosis International University,
Pune, 412115 Maharashtra, India
Email: vikash0810@gmail.com
*Corresponding author

Bhavna Pandey
Symbiosis School of Banking & Finance,
Symbiosis International University,
Pune, 412115 Maharashtra, India
Email: bhavna.pandey@ssbf.edu.in

Vipin Kumar
Neurapses Technologies Pvt Ltd.,
Bangalore, 560100 Karnataka, India
Email: vipinkumar.work@gmail.com

Abstract: Big Data plays a significant role in moving any industry forward. In
the financial sector, automated fraud detection systems try to reduce financial
frauds by analysing and mining Big Data, especially structured data.
Significant attention in Big Data research has also shifted towards supply
chain management (SCM). Although many dimensions of Big Data have been
studied and researched in SCM, there is a gap in the understanding of
unstructured data for financial fraud detection. In this paper, we propose a
theoretical framework to study this dimension and analyse how individual
enablers of unstructured data impact SCM. The paper thus evaluates how
useful unstructured data can be in reducing financial frauds. Finally, we
outline the limitations and challenges of our study and further research
directions.

Keywords: automated fraud detection; Big Data; financial fraud detection;


unstructured data; supply chain management; SCM.

Reference to this paper should be made as follows: Sharma, V., Pandey, B. and
Kumar, V. (2016) ‘Importance of Big Data in financial fraud detection’,
Int. J. Automation and Logistics, Vol. 2, No. 4, pp.332–348.

Biographical notes: Vikash Sharma received his MSc in Computer
Application (2007) from Symbiosis International University, Pune, and another
MSc in Advanced Computational Methods (2010) from the University of
Leicester, UK. He has more than eight years of professional experience in
information technology. Currently, he is a Research Scholar at Symbiosis
International University, Pune, and also a Director of Neurapses Technologies,
Bangalore, India. His current research interests include image processing, soft
computing techniques, Big Data, machine learning, operations research and
supply chain management.

Bhavna Pandey is currently working as a Junior Research Fellow at Symbiosis
International University, Pune, and also as a Visiting Faculty at Symbiosis Law
School, Pune. She graduated in the field of management (first class with
distinction) from Birla Institute of Technology, Mesra, Ranchi. She received
her postgraduate degree (first class) from the National Institute of Technology,
Durgapur. She is actively working in the area of banking and finance.

Vipin Kumar received his BTech in Electronics and Communications
Engineering (2011) from Uttar Pradesh Technical University, UP. With nearly
five years of industry experience as a programmer and research engineer, he is
currently working in the area of financial fraud detection. His interests lie in
the fields of deep learning, image processing and pattern recognition.

1 Introduction

Trillions of data points are generated online every day on Twitter, Facebook, and other
social media websites, and we are coming to rely heavily on online social
communication. Thus, most of the data in today's world is presented in an unstructured
form characterised by large volume, velocity, variety, and variability. Despite the
significant value hidden in such data, it is difficult to extract relevant information from it.
In recent years, analysts have discovered valuable data hidden in unstructured form and
recognised the need to understand it for the future benefit of the business.
Data is generated at a velocity that is difficult for traditional analysis tools to handle.
Size is the first, and at times, the only dimension that leaps out at the mention of
Big Data (Gandomi and Haider, 2015). With the vast amount of data now available,
companies in almost every industry are focused on exploiting data for competitive
advantage. The volume and variety of data have far outstripped the capacity of manual
analysis, and in some cases have exceeded the capacity of conventional databases
(Provost and Fawcett, 2013). Therefore, the need of the hour is to analyse this huge
amount of data and solve many business problems. As data is produced at an ever faster
rate, it is becoming difficult for a data analyst to understand new problems and provide
real-time solutions. Although analysts have been working with data for a long time, this
vastness of information arriving in such a short time is the reason the term Big Data was
coined. Wang et al. (2016) also stated that "Big Data can provide unique insights into,
inter alia, market trends, customer buying patterns, and maintenance cycles, as well as
into ways of lowering costs and enabling more targeted business decisions".
With this advent, Big Data seems to be a 'blessing in disguise'. Although it is hard to
manage such massive data, it also opens doors for analytics to extract hidden information
from it. The introduction of electronic communication technologies and platforms such
as Skype and WhatsApp (commonly referred to as 'over the top' services) that deliver
audio, video and other media over the internet protocol has created extremely
competitive markets that have forced many companies across most industries to explore
and deliver innovative and smarter products and services to the customer (LaValle et al.,
2011). With this, there is a rise in the unstructured form of data available online on social
media, news, and product review websites. In 2013, Insurance et al. (2013) stated that
around 80–85% of data is in unstructured form; later, Gandomi and Haider (2015)
reported it to be 95%. Thus, there is a huge need to understand this unstructured data and
solve as many business problems as possible.
One of these business problems is the detection of 'financial frauds', which alone
bring billions in losses to many companies. Many efforts have been made to study and
automate fraud detection to reduce huge financial losses, but many challenges and
limitations still lie ahead (Ngai et al., 2011). The limitations in analysing frauds lie
mainly in data that is unstructured, i.e., for which no data model is properly defined. The
purpose of this paper is to make industry and researchers aware of a new dimension for
data extraction, i.e., unstructured data, especially in the context of financial fraud
detection and its impact on supply chain management (SCM), and to propose a
framework that will help to reduce this gap.
Further, social networking is also contributing to significant behavioural changes in
one's decision-making. Although data analysis has been studied to a good extent, the
impact of Big Data analysis in logistics needs to be identified further for companies to
stay competitive. A long-term customer may not take much time to change product
preference due to behavioural effects, peer influence, homophily, pricing, or cultural and
political effects. This can quickly impact logistics, warehousing and return on investment
in business. If one can correctly predict the best ways to retain customers, understand
future needs and the possibility of frauds, it will surely help in arranging competitive
pricing, increasing customer satisfaction, screening out fraudulent customers based on
fraud analysis and reducing unnecessary stock maintenance. As such, there is a real need
to study financial fraud detection and its impact from an SCM perspective to overcome
future business problems and help the business take necessary action beforehand. The
objective of the research is to develop a theoretical framework to detect financial fraud
using unstructured data, which constitutes around 95% of all data (Gandomi and Haider,
2015). With the help of this paper, we will try to study and reduce the gap between
unstructured data and the detection of financial frauds.
The current study on the 'importance of Big Data in financial fraud detection'
emphasises the need to give importance to unstructured data for detecting financial
frauds. The main contributions of this paper are as follows. In Section 2, we review the
literature and describe Big Data, unstructured data, financial fraud detection and how
unstructured data can help reduce frauds related to financial transactions. In Section 3,
we describe the research methodology adopted for this paper. In Section 4, we discuss
the theoretical research framework designed for this study. In Section 5, we share some
limitations and research directions one can take. Finally, in Section 6, we draw the
conclusions of this paper.

2 Literature review

In this section, we first review the literature related to Big Data, unstructured data and
financial frauds. We then examine the need for and relevance of unstructured data in
fraud detection.

2.1 Big Data


According to Zhong et al. (2015) and Jacobs (2009), Big Data refers to datasets that are
broad and complex and hard to process using traditional applications. Daily life now
produces Big Data due to the increasing usage of electronic devices. However, Big Data
is less about data that is big than it is about a capacity to search, aggregate, and
cross-reference large datasets (Boyd and Crawford, 2012). The amount of data being
generated is immense. For instance, a single jet engine generates 10 TB of data every
30 min, while smart meters and heavy industrial equipment like oil refineries and drilling
rigs generate similar volumes of data (Dijcks, 2012; Opresnik and Taisch, 2015). Twitter
creates more than 12 TB of data daily, and Facebook generates over 25 TB of log data
every day. The per-capita capacity to store such data has approximately doubled every
40 months since 1980 (Zhong et al., 2015; Manyika et al., 2011). According to Opresnik
and Taisch (2015) and Davenport et al. (2012), Google alone processes about
24 petabytes (or 24,000 terabytes) of data every day.
If we look closer, Big Data was always there; so the question is, what is new that
makes it 'big'? Business executives sometimes ask us, "Isn't 'Big Data' just another way
of saying 'analytics'?" It is true that they are related: the Big Data movement, like
analytics before it, seeks to glean intelligence from data and translate that into a business
advantage (McAfee and Brynjolfsson, 2012). Although some experts have divided the
characteristics of data into five V's (Beulke, 2011), we will only discuss Doug Laney's
(2001) three main V's that define Big Data, i.e., volume, velocity, and variety. The three
V's have emerged as a common framework to describe Big Data (Chen et al., 2012;
Kwon et al., 2014; Gandomi and Haider, 2015).
a Volume: the amount of data one needs to process to find meaningful information.
As of 2012, about 2.5 exabytes of data were created each day, and that number is
doubling every 40 months or so (McAfee and Brynjolfsson, 2012). Approximately
40 zettabytes of data are estimated to be created by 2020, i.e., a 300-fold increase
from 2005 (IBM, 2013).
b Velocity: with the advent of social media like Facebook, YouTube, and Twitter, and
the speed at which each user adds new information, data is created at an enormous
rate. According to Kern (2012), Facebook collects approximately 500 terabytes of
data a day. The current speed of data creation is almost unimaginable: every minute
we upload 100 hours of video on YouTube. Also, every minute over 200 million
e-mails are sent, around 20 million photos are viewed and 30,000 uploaded on
Flickr, almost 300,000 tweets are sent, and almost 2.5 million queries on Google are
performed (Rijmenam, 2015).
c Variety: data is generated in various formats. Earlier, almost all data created was
structured, but now it comes in structured, semi-structured, unstructured and even
complex structured formats (Rijmenam, 2015). Due to the different types of data,
such as PDFs, word documents, spreadsheets, audio, video, and service and
application logs, the format of the data and the way it should be stored also differ.
Understanding these many different types of data and coming up with a model that
can store them in a readily usable format is a challenge (Rijmenam, 2015).

Further, there are three essential attributes of Big Data, as defined below and
explained with the help of Figure 1. Let us first understand how data is presented to the
data analyst:
a Structured data: this data comes in a well-defined format and database schema; a
proper model is already defined for understanding it. Examples are databases and
spreadsheets. Querying the data and extracting relevant information is quite easy.
This is how most organisations keep their data, with a well-defined database
schema.
b Semi-structured data: this type of data also exists in a structured format, but it is
maintained in flat files rather than a database. It is mostly a dynamic combination of
data and metadata before being presented to the user. Examples are XML data,
JSON files, and the HTML source code of a website.
c Unstructured data: this constitutes the largest share of data generated in today's
world. Every company follows its own standards to represent this data based on the
user's requirements. The source is either internal, in the form of memos or meeting
notes, or external, such as reports and journals. Most of this data is descriptive in
nature, and no proper format is defined; it is also tough to express such data in a
common format. Further examples are given in Figure 1.
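To make the distinction concrete, the following is a minimal Python sketch (the JSON
record and memo text are invented examples): a semi-structured record can be queried
by field name directly, while information in unstructured text must be extracted, e.g.,
with pattern matching.

import json
import re

# Semi-structured: a JSON record carries its own field names (data plus metadata).
record = json.loads('{"user": "jdoe", "claims": [{"id": 101, "amount": 2500.0}]}')
print(record["claims"][0]["amount"])  # fields are addressable by name

# Unstructured: free text has no schema; information must be extracted,
# here with a simple regular expression for e-mail addresses.
memo = "Please forward the settlement details to j.doe@example.com by Friday."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", memo)
print(emails)  # ['j.doe@example.com']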

Figure 1 Classification of Big Data

Source: Figure from Insurance et al. (2013)


According to Figure 1, of the three attributes above, unstructured data comes in the most
varied forms. Holzinger et al. (2013) and Insurance et al. (2013) also cite Wikipedia's
description of it as "information that either does not have a pre-defined data model
and/or does not fit well into relational tables". Another problem lies in the way this data
is generated, i.e., statically or in real time: if we take too long to analyse the data, by that
time it might have taken some other form or may no longer be relevant. Certain
companies are explicitly focusing on analysing this real-time data by streaming it
directly into the analytics system (Sensmeier, 2013).

Even though it is in unstructured form, we must accept the fact that a substantial
amount of information is stored there which might look worthless at first glance but
contains much valuable information. Moreover, the job of an analyst is to give meaning
to the data. We will study further how much importance this data holds and how it can
help save billions of dollars in the financial sector.

2.2 Financial frauds


Fraud is criminal deception: the use of false representations to gain an unfair advantage
(Chintalapati and Jyotsna, 2013). There are at least as many types of fraud as there are
types of people who commit it; however, in each instance, fraud involves deception.
Gupta and Gupta (2015) note that the Internal Revenue Service, US Department of the
Treasury, defines corporate fraud as a violation of the internal revenue code and related
statutes committed by large, publicly traded corporations and/or by their senior
executives. Chintalapati and Jyotsna (2013) describe it as a deliberate misrepresentation
or concealment of information to deceive or mislead.

Figure 2 Hierarchy chart of white-collar crime perpetrators from both firm-level and
community-level perspectives

Source: Clifton et al. (2010)


Clifton et al. (2010) stated, as per Figure 2, that fraud can range from minor employee
theft and unproductive behaviour to misappropriation of assets and fraudulent financial
reporting. Frauds are thus classified into the main categories shown above. Traditionally,
each business is always susceptible to internal fraud or corruption from its management
(high-level) and non-management employees (low-level), while at the same time the
fraudster can be an external party or parties. Clifton et al. (2010) also state that average
offenders display random and/or occasional dishonest behaviour when there is an
opportunity, a sudden temptation, or when suffering from financial hardship.
Opportunistic fraud is mainly carried out by individuals and is characterised by
relatively low monetary value but high volume, while professional fraud is perpetrated
by organised groups acting as a network, possibly using multiple identities and targeting
specific sectors of the industry considered to be vulnerable (Insurance et al., 2013). The
Insurance Fraud Bureau estimates that the total cost of undetected insurance fraud in the
UK is £2.1 bn, which translates into an average £50 additional premium for every
policyholder. The ABI states that every week 2,670 fraudulent insurance claims, worth
£19 m, are exposed in the UK. Fraud is a big-ticket item both for the perpetrators and for
those dedicated to exposing it and bringing fraudsters to justice (Insurance et al., 2013).
Financial statement fraud is a serious threat to market participants' confidence in
published audited financial statements. Capital market participants expect vigilant and
active corporate governance to ensure the integrity, transparency and quality of financial
information (Rezaee, 2005; Gupta and Gupta, 2015). Li (2010) and Gupta and Gupta
(2015) also studied corporate financial fraud and its detection using an empirical
framework that models the strategic interdependence between fraud and detection and
accounts for the possibility that some fraud remains undetected.
Automated fraud detection tries to detect fraud based on certain anomalies. Fraud
detection systems have mostly worked on structured data internal to the company; even
the data they get from external sources is in structured form. Our objective is to
understand financial frauds and the importance of unstructured data in detecting them.
For this, we also need to give importance to real-time unstructured data that exists on
the web. We first study the need for and relevance of unstructured data in fraud
detection, and later discuss the theoretical framework developed to overcome
fraud-related issues.

2.3 Need and relevance of unstructured data for fraud detection


According to Zhu and Madnick (2009) and Opresnik and Taisch (2015), there are two
ways an enterprise can increase the value of its data: first, sell its 'private' data (currently
not publicly accessible) or, second, become a data re-user. These strategies are well
known in the software industry. One such case is the online auction site eBay, which
uses data in at least two manners: first, as data reuse, where the data about the behaviour
of millions of its customers drives analytics at every level of the organisation (Boucher
Ferguson, 2013; Opresnik and Taisch, 2015). Second, it has already begun selling
blinded transaction data to interested third parties (Ferguson et al., 2005; Opresnik and
Taisch, 2015), thus exploiting the data not only for internal use but also as a new
'product' generating an additional revenue stream, which is data selling. This case
indicates not only the strategy of data reuse but also of 'data re-purposing', which can be
part of both previously depicted strategies and simply indicates the possibility of reusing
data differently. We need a similar approach in the case of unstructured data, examining
how a company can share internal data for the benefit of fraud detection.
Web intelligence, web analytics, and the user-generated content collected through
Web 2.0-based social and crowd-sourcing systems (Doan et al., 2011; O'Reilly, 2005)
have ushered in a new and exciting era of BI&A 2.0 research in the 2000s, centred on
text and web analytics for unstructured web content (Chen et al., 2012). As such, an
immense amount of company, industry, product and customer information can be
gathered from the web, then organised and visualised by various text and web mining
techniques. By analysing customer clickstream data logs, web analytics tools such as
Google Analytics can provide a trail of the user's online activities and reveal the user's
browsing and purchasing patterns. Website design, product placement optimisation,
customer transaction analysis, market structure analysis, and product recommendations
can be accomplished through web analytics. The various Web 2.0 applications developed
after 2004 have also created an abundance of user-generated content from various online
social media such as forums, online groups, web blogs, social networking sites, social
multimedia sites (for photos and videos), and even virtual worlds and social games
(O'Reilly, 2005; Chen et al., 2012).
For social media analytics of customer opinions, text analysis and sentiment analysis
techniques are frequently adopted (Pang and Lee, 2008). Various analytical techniques
have also been developed for product recommender systems, such as association rule
mining, database segmentation and clustering, anomaly detection, and graph mining
(Adomavicius and Tuzhilin, 2005; Chen et al., 2012).
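As an illustration of the kind of sentiment analysis referred to above, the following is a
minimal Python sketch using the VADER analyser shipped with the NLTK library; the
two customer posts are invented examples, and the compound score (from -1, most
negative, to +1, most positive) is one common output of such tools.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
# Hypothetical customer posts; real input would come from social media feeds.
posts = [
    "Great service, my insurance claim was settled within a week!",
    "This company keeps rejecting genuine claims. Avoid them.",
]
for post in posts:
    scores = analyzer.polarity_scores(post)
    print(scores["compound"], post)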
Arias et al. (2014) tested the impact of Twitter data on two different prediction
tasks, the stock market and movie box office revenue, to make time-series predictions,
and showed that adding this dataset improves the overall prediction. The test was
conducted on three years of Twitter data between 2011 and 2014. Another paper (Gray
and Debreceny, 2014) examined financial fraud detection using data mining techniques.
The authors conducted a literature review of publications between 1997 and 2008 to
classify the articles into four categories of financial fraud (bank fraud, insurance fraud,
securities and commodities fraud, and other related financial fraud) and six classes of
data mining techniques (classification, regression, clustering, prediction, outlier
detection, and visualisation). The findings revealed that data mining techniques have
mostly been applied to the detection of insurance fraud, in comparison to corporate fraud
and credit card fraud, which have gained some attention in recent years. However, there
seems to be a lack of research on mortgage fraud, money laundering, and securities and
commodities fraud.
Aral et al. (2009) developed a matched sample estimation framework to distinguish
influence and homophily effects in dynamic networks, and applied that framework to a
unique dataset documenting product adoption in a vast network. They found that
previous methods significantly overestimate peer influence in this network, mistakenly
identifying homophilous diffusion as influence-driven contagion. With all its potential in
both the academic and commercial worlds, the effect of Big Data on the behavioural
sciences is already apparent in the ubiquity of online surveys and psychology
experiments that outsource projects to a distributed network of people (Bentley et al.,
2014; Rand, 2012; Sela and Berger, 2012; Twenge et al., 2012). As such, there is a need
to understand social behaviour from unstructured data for better forecasting and
detection of financial frauds.
Over the last decade, high-profile financial frauds committed by large companies in
both developed and developing countries have been discovered and reported, such as
Enron, Lucent, WorldCom, Parmalat, YGX, SK Global, Satyam, Harris Scarfe and HIH
(Wadhwa and Pal, 2012). With the massive increase in accounting fraud witnessed,
investors, academic researchers, the media, the financial community and regulators all
need efficient financial accounting fraud detection (Wong and Venkatraman, 2015).
However, the accuracies of existing fraud detection models are generally not high
(Li, 2015). This brings us to the conclusion that Big Data, and more specifically
unstructured data, needs to be studied further to detect fraudulent activities.

3 Research methods

This research adopts a conceptual framework-based approach. The prime aim was to
investigate Big Data, financial frauds, and how unstructured data can help improve
financial fraud detection. To determine the current state and future directions for
Big Data and the detection of financial frauds, we conducted an extensive literature
review.

To achieve our objective, we developed and proposed a framework based on
unstructured data analysis using online data such as Twitter and Facebook, and use this
data to illustrate the practical implementation of the framework and analyse the results
obtained. The first phase of the review was to determine the scope and relevant source
material. Since Big Data and the detection of financial frauds together form an
interdisciplinary topic, related articles were published in a wide variety of journals.
Furthermore, unstructured data and financial fraud detection are still an emerging
research area, so little substantial literature was found that discusses both in detail, apart
from some conference proceedings. Therefore, we included references on both topics
from conference proceedings and journals across various disciplines, drawing on
Science Direct, IEEE, Google Scholar, ACM, Scopus, and Springer. From the papers
identified, we backtracked through relevant cited articles by other authors. The search
was based on 'Big Data', 'data analytics', 'financial fraud detection' and 'SCM'. We
mostly considered those papers where there was a close link between the searched
keywords.
In the second stage, we classified the papers into different categories. The
classification proceeded as follows. Two researchers independently collected the articles:
one focused on Big Data-related articles, while the other focused on financial
fraud-related articles. Later, all the papers were classified into the corresponding factors
within the framework. The two classifications were subsequently compared and, in the
case of differing results, a third researcher repeated and evaluated the classification.
Further, two other researchers focused on the framework design, based on the
existing literature related to data mining and their personal industrial research
experience; with the aid of another researcher, the framework was finalised. Altogether,
we reviewed the title, abstract, discussion and conclusions sections of each paper and
identified its main topic, for example, the importance of unstructured information. Some
publications touched on several factors without covering any in detail, or combined both
topics; those publications were classified as 'multiple categories' papers.
Several other papers presented market overviews summarising the state of Big Data,
its challenges or potential; such papers were classified as 'overviews'. In parallel, we
analysed the methodologies used. The classification was first drawn into 'conceptual'
and 'empirical', and the 'empirical' category was then split further into 'qualitative',
'quantitative' and 'design research'. Most of the technical papers proposed conceptual
constructs, but some mainly described technologies; therefore, we divided the conceptual
studies into proposed constructions and descriptions. In the end, we produced a new
theoretical model.

4 Research framework

In this section, we define a new theoretical model to detect future frauds. With the help
of this framework, we can focus on data modelling, extraction and the mapping of
unstructured data to structured data, and then find the essential enablers to identify
frauds. In Figure 3, one can see that the overall unstructured data processing framework
consists of a collection of modules or, in other words, sub-processes.

Figure 3 Proposed fraud analysis framework

Each of the modules is explained as follows:


a Unstructured data: in this module, data is acquired from various online sources and
registered in our system. We save the downloaded data and use it for further
processing.
b Base system registration: we attach the structured customer data, which holds the
user information that needs to be connected to the unstructured data. Depending on
requirements, this process can be connected to any industry, for example banking,
finance, insurance or e-commerce. Experts can decide either to save this data or to
link it for real-time mapping. Although there are some limitations in saving data,
the decision can be based on other factors such as cost, availability, accessibility,
and privacy.
c Data management: in this module, the registered data is acquired, recorded, filtered,
cleaned and aggregated based on annotations and keyword customisation. Some
datasets also provide aggregated metrics.
d Analytics: this module defines the deep learning models as per industry
requirements. In deep learning, we divide the data into training and testing sets and
let the system learn at multiple levels of representation to arrive at a better
mathematical model; this model later helps define better relationships between the
different features of the data (a minimal sketch of the train/test workflow follows
this list).
e Text analytics: using natural language processing (NLP) libraries on the data, we
can learn how the user behaves: the user's tastes, likes and dislikes, what they buy
and whether they shop compulsively. Textual meaning is extracted with the help of
sentiment analysis.

f Feature extraction and segmentation: the extracted features can then be segmented
into different categories based on functional knowledge for ease of use. Different
combinations of features have to be checked against the models.
g Extraction linking and mapping with base system: the extracted information can be
represented in two ways (a linking sketch also follows this list):
• It can simply be added as a new feature to the base model (e.g., location,
organisation info, online activity percentage, married/single, number of friends,
etc.).
• It can be added in link format, where we find a user having multiple locations,
dates of birth or e-mail IDs, or knowing another user from our user set. Graph
databases can be used for such linking.
h Pattern detection and behavioural analysis: first, the new features are added to the
base features; then, by applying the models, we check whether we can find patterns
pointing towards fraudulent users. Later, through the links we have discovered, we
can check for unknown fraud cycles.
i Fraud analysis: based on the machine-generated output, we can build cases to check
the authenticity of suspected frauds. The information can then be fed back into the
framework model for further learning.
j Machine learning and knowledge base: the overall framework requires machine
learning at various stages, together with a sound knowledge base of the domain.
Some of this knowledge is industry- and need-specific, some is data-specific, and
some is standard across industries and data sources. As such, this process is used
across all other sub-processes.
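The following is a minimal Python sketch of the train/test workflow mentioned in
module (d), using a small scikit-learn neural network as a stand-in for a deep learning
model; the feature matrix and the 'fraud' labelling rule are synthetic placeholders, not
part of the proposed framework.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic feature matrix (e.g. per-customer transaction statistics) and
# synthetic fraud labels; real features would come from the data management step.
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 3] > 1.5).astype(int)

# Divide the data into training and testing sets, as described in module (d).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A small multi-layer network learns relationships between the features.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))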
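Similarly, the following is a minimal Python sketch of the link representation in
module (g), using the networkx library in place of a full graph database; the user IDs and
shared attribute values are invented examples. Users that share a chain of attributes
(e-mail, date of birth) end up in one connected component and can be flagged as possible
multiple identities.

import networkx as nx

# Hypothetical extracted observations: (user ID, shared attribute value).
observations = [
    ("user_1", "j.doe@example.com"),
    ("user_2", "j.doe@example.com"),  # same e-mail as user_1
    ("user_2", "1990-05-14"),
    ("user_3", "1990-05-14"),         # same date of birth as user_2
    ("user_4", "k.roe@example.com"),
]

G = nx.Graph()
for user, attribute in observations:
    G.add_edge(user, attribute)  # bipartite edge: user -- shared attribute

# Users in the same connected component share a chain of attributes and
# may be multiple online identities of the same person.
for component in nx.connected_components(G):
    users = sorted(n for n in component if n.startswith("user_"))
    if len(users) > 1:
        print("possibly linked identities:", users)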
The immediate challenge at the start of data collection is to chunk the source data into
something manageable and meaningful. Following the principle of grounded theory, data
analysis will begin immediately, because future data collection and sampling depend on
the information that emerges from the first set of data; delaying this part would also
create an unmanageable workload in the research. Consequently, grounded theory is
known as 'the constant comparative method of analysis' (Glaser and Strauss, 1967).
Further, the study will be carried out using mathematical foundations, graphs, and
charts, considering both qualitative and quantitative approaches. Factor analysis will be
done in two ways. The first is to use the ISM/TISM model to develop a behavioural
model for the enablers of the SCM industry; the structural model developed using this
methodology will help us understand the interaction between the various elements of the
system. After the model is developed, it is further subjected to assessment by a different
group of domain experts so as to enhance its validity (Jayalakshmi and Pramod, 2015).
The second way to do factor analysis is through text analytics, creating term
frequency-inverse document frequency (TF-IDF) weights, as sketched below. An expert
can validate the result based on experience, which can further help to train the model.
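A minimal Python sketch of the TF-IDF step, assuming scikit-learn and three invented
text snippets; the highest-weighted terms per document are the kind of output a domain
expert could validate:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text snippets gathered during data collection.
documents = [
    "claim rejected after duplicate invoice was submitted",
    "duplicate policy numbers registered from two addresses",
    "customer praised the quick settlement of the claim",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# The highest-weighted terms in each document; a domain expert can inspect
# these to judge which terms signal potentially fraudulent cases.
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = sorted(zip(row, terms), reverse=True)[:3]
    print(f"doc {i}:", [term for _, term in top])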
Finally, as discussed above, the same system can be applied directly to various
industries; we only have to replace the base system. In the context of SCM and logistics,
extra information needs to be incorporated in the base system apart from the customer
information, which will help to identify customer behaviour. As anomaly detection
models are much the same everywhere (a minimal sketch follows), the framework
provides extra features to improve the pattern signal; only the functional and domain
knowledge used will differ.
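As an illustration of such an anomaly detection model, the following is a minimal
Python sketch using scikit-learn's Isolation Forest on synthetic data; the two features
(transaction amount and number of linked online identities) and the injected anomalies
are assumptions for the example, not outputs of the framework.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic base-system features after mapping: transaction amount and
# the number of linked online identities per customer.
normal = rng.normal(loc=[100.0, 1.0], scale=[20.0, 0.5], size=(200, 2))
outliers = np.array([[900.0, 6.0], [750.0, 5.0]])  # injected anomalies
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 marks an anomaly, 1 marks normal

print("flagged rows:", np.where(labels == -1)[0])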

5 Research limitations and further research direction

Data mining is a threat to privacy and data security because, when data can be viewed
from many different angles at different abstraction levels, it threatens the goal of keeping
data secure and guarding against intrusions on privacy. For example, it is relatively easy
to compose a profile of an individual (e.g., personality, interests, spending habits) with
data from various sources (Jun Lee and Siau, 2014). Due to this, it is challenging to get
correct results and identify frauds.
Also, the biggest limitation in the area of unstructured data analysis is the collection
of real data, because of restricted access to many data points related to the search
criteria. Online web sources may not allow access to this data in large volumes. For
example, it is hard for a bank to filter users down to its targeted customers: a particular
bank is only interested in knowing how its own customers behave online, in order to
identify possible chances of future frauds. Due to this, instead of enhancing the state of
the art in automated fraud detection, most companies still rely on human intervention to
a great extent. Given the sensitive nature of fraud situations, and because of the rather
low frequency of detected fraud cases, most empirical studies in this area use
experimental settings or surveys focusing on auditors' perceptions, rather than actual
fraud cases, as a source of data (Hassink et al., 2010); comparing such models with
manual detection is therefore not a fair benchmark.
There are also geographical limitations and challenges for certain data: a user might
be located in one country but doing fraudulent business from another country. As for
kinds of fraudsters, an average fraudster might well be caught, but professional
fraudsters tend to stay away from online social media and smartphones to keep
themselves aloof.
The next research direction will be to collect sample data and filter only the data
related to fraud detection. This will help us develop a final model that combines
multi-dimensional unstructured data from various online sources, for example, Twitter
and Facebook. After collection, the data will be pruned to financial fraud-related words
and sentences using text analytics and other machine learning models.

We will also maintain other information such as the user's name, age, and location.
This will help to identify users who might be suspected of fraud, and to connect base
user details to multiple online identities. The culprit may not always be identified
correctly, but common behavioural patterns and cluster analysis will help determine
possible sets of users. Among other methods of classification and detection, the model
will also use graphs to highlight possible correlations between various anomalies, since,
as stated by Akoglu and Faloutsos (2013), the graph representation of a dataset
inherently imposes long-range correlations among the data objects.
Like any other system, the proposed theoretical framework also has some limitations.
When registering the base system for a specific domain or industry, the proposed system
will require deep knowledge of the domain, and human experts may not be available or
ready to share that knowledge. Integration might need a lot of human resources and thus
be a costly affair. Further, direct access to and linkage with the source system are needed
to get data out of it. If data is copied to a new system for analysis, data consistency and
redundancy become an issue: if the source data is updated frequently, it may no longer
match the data already copied for analysis. Combining all sources of data is a challenge
too, due to the complexity of matching different features from various sources. Also,
privacy considerations will not allow access to the sensitive data that is a must for
profile matching; thus, data may not be available for collaboration.
To add further, there is a downside to big-data research: without clear objectives and
a unifying framework, behavioural scientists may ask whether it is useful, for example,
to infer from millions of Facebook pages or Twitter feeds that "men are more influential
than women … [and] that influential person with influential friends help spread"
information (Aral and Walker, 2012) or that "people awaken later on weekends" (Golder
and Macy, 2011). Bentley et al. (2014) stated that, in the best case, big-data studies will
not compete with more traditional behavioural science but will instead allow us to see
better how known behavioural patterns apply in novel contexts. As long as people trust
their individual experiences, even in observing the behaviour of others, a collective
wisdom is possible. The previous studies have already laid the foundation for further
research design, and further evaluation will help us review the system better. Moreover,
many existing fraud detection systems operate with supervised approaches on labelled
data, hybrid approaches on labelled data, semi-supervised approaches with legal
(non-fraud) data, or unsupervised approaches with unlabelled data (Chintalapati and
Jyotsna, 2013). Our intention is to find the implications of unstructured data for fraud
detection.
Therefore, we believe that if all these factors are considered, slow and steady steps
are taken to finalise the system, and data is then poured into this newly proposed system,
identification and detection of fraud will be feasible to a very good extent, although
some limitations and challenges remain. This extent can only be evaluated by further
implementation of the theoretical framework. Our goal is to start at some point and
include as many possibilities as possible. This theoretical study will surely help us, and
others working in this area, to get closer to the common goal.

6 Concluding remarks

In this paper, we have conducted a literature review to understand Big Data, financial
frauds and the importance of unstructured data, and we have also touched upon the
benefits in the context of SCM. The literature mostly talks about the potential of
automated fraud detection from unstructured data. Most studies still need to understand
this unstructured format and focus more on getting relevant data out of it, checking
where necessary whether meaningful data can be extracted without affecting privacy.
We have also formalised a theoretical research framework to carry out our further study;
this framework was proposed to focus on unstructured data and the extraction of
relevant information from it.

In future, we plan to implement and execute a better decision-making model to test
this theory, using newer data analytics and machine learning tools on a realistic system.
We are currently in the process of formalising a near-generic model that can work with
variable types of data received from different sources. This framework will help future
researchers solve other similar business-related problems, and it will surely contribute to
creating a model that can be applied to multiple industries with minimal modifications.

Acknowledgements

The authors are grateful for the valuable comments of the referees on an earlier version of
this paper.

References
Adomavicius, G. and Tuzhilin, A. (2005) ‘Toward the next generation of recommender systems: a
survey of the state-of-the-art and possible extensions’, IEEE Transactions on Knowledge and
Data Engineering, Vol. 17, No. 6, pp.734–749 [online] http://ieeexplore.ieee.org/lpdocs/
epic03/wrapper.htm?arnumber=1423975 (accessed 14 December 2015).
Akoglu, L. and Faloutsos, C. (2013) ‘Anomaly, event, and fraud detection in large network
datasets’, Proceedings of the Sixth ACM International Conference on Web Search and Data
Mining – WSDM ‘13, p.773 [online] http://dl.acm.org/citation.cfm?id=2433396.2433496
(accessed 11 December 2015).
Aral, S. and Walker, D. (2012) ‘Identifying influential and susceptible members of social
networks’, Science, Vol. 337, No. 6092, pp.337–341, New York (NY) [online]
http://www.scopus.com/inward/record.url?eid=2-s2.0-84863993298&partnerID=tZOtx3y1
(accessed 25 January 2016).
Aral, S., Muchnik, L. and Sundararajan, A. (2009) ‘Distinguishing influence-based contagion from
homophily-driven diffusion in dynamic networks’, Proceedings of the National Academy of
Sciences of the United States of America, Vol. 106, No. 51, pp.21544–21549 [online]
http://www.scopus.com/inward/record.url?eid=2-s2.0-76049088642&partnerID=tZOtx3y1
(accessed 25 January 2016).
Arias, M., Arratia, A. and Xuriguera, R. (2014) ‘Forecasting with Twitter data’, ACM
Trans. Intell. Syst. Technol., Vol. 5, No. 1, pp.8:1–8:24 [online] http://doi.acm.org/10.1145/
2542182.2542190 (accessed 16 April 2016).
Bentley, R.A., O’Brien, M.J. and Brock, W.A. (2014) ‘Mapping collective behavior in the big-data
era’, Behavioral and Brain Sciences, Vol. 37, No. 1, pp.63–76 [online] http://www.journals.
cambridge.org/abstract_S0140525X13001659 (accessed 25 January 2016).
Beulke, D. (2011) Big Data Impacts Data Management: The 5 Vs of Big Data [Blog post], Dave
Beulke Blog [online] http://davebeulke.com/big-data-impacts-data-management-the-five-vs-
of-big-data/ (accessed 4 January 2016).
Boucher Ferguson, R. (2013) ‘How eBay uses data and analytics to get closer to its (massive)
customer base’, MIT Sloan Management Review [online] http://sloanreview.mit.edu/article/
how-ebay-uses-data-and-analytics-to-get-closer-to-its-massive-customer-base/ (accessed 17
December 2015).
Boyd, D. and Crawford, K. (2012) ‘Critical questions for Big Data’, Information, Communication
& Society, Vol. 15, No. 5, pp.662–679 [online] http://www.tandfonline.com/doi/abs/10.
1080/1369118X.2012.678878 (accessed 10 December 2015).
Chen, H., Chiang, R.H.L. and Storey, V.C. (2012) ‘Business intelligence and analytics: from Big
Data to big impact’, MIS Quarterly, Vol. 36, No. 4, pp.1165–1188 [online] http://dl.acm.org/
citation.cfm?id=2481683&CFID=614785593&CFTOKEN=52416716 (accessed 17 May
2016).
Chintalapati, S.S. and Jyotsna, G. (2013) ‘Application of data mining techniques for financial
accounting fraud detection scheme’, International Journal of Advanced Research in Computer
Science and Software Engineering, Vol. 3, No. 11, pp.717–724.
Clifton, P. et al. (2010) ‘A comprehensive survey of data mining-based fraud detection research’,
International Conference on Intelligent Computation Technology and Automation, Vol. 1,
pp.50–53.
Davenport, T.H., Barth, P. and Bean, R. (2012) ‘How ‘Big Data’ is different’, MIT Sloan
Management Review, Vol. 54, No. 1, pp.22–24.
Dijcks, J. (2012) Oracle: Big Data for the Enterprise, Oracle White Paper, June, p.16 [online]
http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Oracle+:+Big+Data+for
+the+Enterprise#0 (accessed 12 December 2015).
Doan, A., Ramakrishnan, R. and Halevy, A.Y. (2011) ‘Crowdsourcing systems on the World-Wide
Web’, Communications of the ACM, Vol. 54, No. 4, p.86.
Ferguson, G., Mathur, S. and Shah, B. (2005) ‘Evolving from information to insight information to
insight’, MIT Sloan Management Review, Vol. 46, No. 2, pp.51–57.
Gandomi, A. and Haider, M. (2015) ‘Beyond the hype: Big Data concepts, methods, and analytics’,
International Journal of Information Management, Vol. 35, No. 2, pp.137–144 [online]
http://dx.doi.org/10.1016/j.ijinfomgt.2014.10.007 (accessed 17 May 2016).
Glaser, B.G. and Strauss, A.L. (1967) The Discovery of Grounded Theory: Strategies for
Qualitative Research [online] http://www.amazon.com/dp/0202302601 (accessed 13 February
2016).
Golder, S.A. and Macy, M.W. (2011) ‘Diurnal and seasonal mood vary with work, sleep, and day
length across diverse cultures’, Science, Vol. 333, No. 6051, pp.1878–1881, New York (NY)
[online] http://www.scopus.com/inward/record.url?eid=2-s2.0-80053345545&partnerID=
tZOtx3y1 (accessed 17 January 2016).
Gray, G.L. and Debreceny, R.S. (2014) ‘A taxonomy to guide research on the application of data
mining to fraud detection in financial statement audits’, International Journal of Accounting
Information Systems, Vol. 15, No. 4, pp.357–380 [online] http://dx.doi.org/10.1016/
j.accinf.2014.05.006 (accessed 12 May 2016).
Gupta, P.K. and Gupta, S. (2015) ‘Corporate frauds in India – perceptions and emerging issues’,
Journal of Financial Crime, Vol. 22, No. 1, pp.79–103 [online] http://www.emeraldinsight
.com/doi/abs/10.1108/JFC-07-2013-0045 (accessed 11 December 2015).
Hassink, H., Meuwissen, R. and Bollen, L. (2010) ‘Fraud detection, redress and reporting by
auditors’, Managerial Auditing Journal, Vol. 25, No. 9, pp.861–881.
Holzinger, A. et al. (2013) ‘Human-computer interaction and knowledge discovery in complex,
unstructured’, in Holzinger, A. and Pasi, G. (Eds.): Big Data, Springer, Berlin Heidelberg
[online] http://link.springer.com/10.1007/978-3-642-39146-0 (accessed 10 December 2015).
IBM (2013) Infographic: The Four V’s of Big Data | The Big Data Hub [online] http://www.
ibmbigdatahub.com/infographic/four-vs-big-data (accessed 10 December 2015).
Insurance, T. et al. (2013) Fraud Detection – The Unstructured Data Goldmine, Blakehead Limited
[online] http://www.blakehead.co.uk/user_uploads/fraud detection – the unstructured data
goldmine.pdf (accessed 9 December 2015).
Jacobs, A. (2009) ‘The pathologies of Big Data’, Communications of the ACM, Vol. 52, No. 8, p.36
[online] http://portal.acm.org/citation.cfm?doid=1536616.1536632 (accessed 15 December
2015).
Jayalakshmi, B. and Pramod, V.R. (2015) ‘Total interpretive structural modeling (TISM) of the
enablers of a flexible control system for industry’, Global Journal of Flexible Systems
Management, Vol. 16, No. 1, pp.63–85 [online] http://dx.doi.org/10.1007/s40171-014-0080-y
(accessed 4 April 2016).
Jun Lee, S. and Siau, K. (2014) ‘A review of data mining techniques’, Industrial Management &
Data Systems, Vol. 101, No. 1, pp.41–46.
Kern, E. (2012) Facebook is Collecting your Data – 500 Terabytes a Day,
Gigaom [online] https://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-
terabytes-a-day/ (accessed 14 December 2015).
Kwon, O., Lee, N. and Shin, B. (2014) ‘Data quality management, data usage experience and
acquisition intention of Big Data analytics’, International Journal of Information
Management, Vol. 34, No. 3, pp.387–394 [online] http://dx.doi.org/10.1016/j.ijinfomgt.
2014.02.002 (accessed 17 May 2016).
Laney, D. (2001) ‘3D Data management: controlling data volume, velocity, and variety’,
Application Delivery Strategies, Vol. 949, p.4 [online] https://blogs.gartner.com/doug-
laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-
Variety.pdf (accessed 2 February 2016).
LaValle, S. et al. (2011) ‘Big Data, analytics and the path from insights to value’, MIT Sloan
Management Review [online] http://sloanreview.mit.edu/article/big-data-analytics-and-the-
path-from-insights-to-value/ (accessed 9 April 2016).
Li, R. (2015) ‘Detection of financial reporting fraud based on clustering algorithm of automatic
gained parameter K value’, International Journal of Database Theory and Application, Vol. 8,
No. 1, pp.157–168 [online] http://www.sersc.org/journals/IJDTA/vol8_no1/17.pdf (accessed
14 May 2016).
Li, S. (2010) ‘Corporate financial fraud: an application of detection controlled estimation’, SSRN
Electronic Journal [online] http://papers.ssrn.com/abstract=1698038 (accessed 12 December
2015).
Manyika, J. et al. (2011) Big Data: The Next Frontier for Innovation, Competition, and
Productivity, June, p.156, McKinsey Global Institute [online] http://www.mckinsey.com/~/
media/McKinsey/Business Functions/Business Technology/Our Insights/Big data The next
frontier for innovation/MGI_big_data_full_report.ashx (accessed 10 December 2015).
McAfee, A. and Brynjolfsson, E. (2012) ‘Big Data. The management revolution’, Harvard
Business Review, Vol. 90, No. 10, pp.61–68 [online] http://www.buyukverienstitusu.com/
s/1870/i/Big_Data_2.pdf (accessed 10 December 2015).
Ngai, E.W.T. et al. (2011) ‘The application of data mining techniques in financial fraud detection: a
classification framework and an academic review of literature’, Decision Support Systems,
Vol. 50, No. 3, pp.559–569 [online] http://dx.doi.org/10.1016/j.dss.2010.08.006 (accessed 12
May 2016).
O’Reilly, T. (2005) ‘What is web 2.0: design patterns and business models for the next generation
of software’, Social Science Research Network Working Paper Series, Vol. 2007, No. 65,
pp.17–37 [online] http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-
20.html (accessed 14 December 2015).
Opresnik, D. and Taisch, M. (2015) ‘The value of Big Data in servitization’, International Journal
of Production Economics, Vol. 165, pp.174–184 [online] http://linkinghub.elsevier.com/
retrieve/pii/S0925527314004307 (accessed 10 December 2015).
Pang, B. and Lee, L. (2008) ‘Opinion mining and sentiment analysis’, Foundations and Trends in
Information Retrieval, Vol. 2, Nos. 1–2, pp.1–135 [online] http://www.nowpublishers.com/
article/Details/INR-011 (accessed 14 May 2016).
Provost, F. and Fawcett, T. (2013) ‘Data science and its relationship to Big Data and data-driven
decision making’, Big Data, Vol. 1, No. 1, pp.51–59 [online] http://online.liebertpub.com/
doi/abs/10.1089/big.2013.1508 (accessed 10 December 2015).
Rand, D.G. (2012) ‘The promise of mechanical turk: how online labor markets can help theorists
run behavioral experiments’, Journal of Theoretical Biology, Vol. 299, pp.172–179 [online]
http://www.scopus.com/inward/record.url?eid=2-s2.0-84857918066&partnerID=tZOtx3y1
(accessed 21 January 2016).
Rezaee, Z. (2005) ‘Causes, consequences, and deterence of financial statement fraud’, Critical
Perspectives on Accounting, Vol. 16, No. 3, pp.277–298 [online] http://www.sciencedirect
.com/science/article/pii/S1045235403000728 (accessed 18 November 2015).
Rijmenam, M. van (2015) Datafloq – Why the 3V’s are Not Sufficient to Describe Big Data
[online] https://datafloq.com/read/3vs-sufficient-describe-big-data/166 (accessed 16 December
2015).
Sela, A. and Berger, J. (2012) ‘Decision quicksand: how trivial choices suck us in’, Journal of
Consumer Research, Vol. 39, No. 2, pp.360–370 [online] http://www.scopus.com/inward/
record.url?eid=2-s2.0-84864011467&partnerID=tZOtx3y1 (accessed 25 January 2016).
Sensmeier, L. (2013) How Big Data is Revolutionizing Fraud Detection in Financial Services –
Hortonworks [online] http://hortonworks.com/blog/how-big-data-is-revolutionizing-fraud-
detection-in-financial-services/ (accessed 10 December 2015).
Twenge, J.M., Campbell, W.K. and Gentile, B. (2012) ‘Increases in individualistic words and
phrases in American books, 1960–2008’, PloS One, Vol. 7, No. 7, p.e40181 [online]
http://www.scopus.com/inward/record.url?eid=2-s2.0-84863705253&partnerID=tZOtx3y1
(accessed 25 January 2016).
Wadhwa, L. and Pal, V. (2012) ‘Forensic accounting and fraud examination in India’, International
Journal of Applied Engineering Research, Vol. 7, No. 11 Suppl., pp.2006–2009.
Wang, G. et al. (2016) ‘Big Data analytics in logistics and supply chain management: certain
investigations for research and applications’, International Journal of Production Economics,
Vol. 176, pp.98–110 [online] http://dx.doi.org/10.1016/j.ijpe.2016.03.014 (accessed 12 May
2016).
Wong, S. and Venkatraman, S. (2015) ‘Financial accounting fraud detection using business
intelligence’, Asian Economic and Financial Review, Vol. 5, No. 11, pp.1187–1207 [online]
http://www.pakinsight.com/archive/3/11-2015/11 (accessed 14 May 2016).
Zhong, R.Y. et al. (2015) ‘A Big Data approach for logistics trajectory discovery from
RFID-enabled production data’, International Journal of Production Economics,
Vol. 165, pp.260–272 [online] http://linkinghub.elsevier.com/retrieve/pii/S0925527315000481
(accessed 10 December 2015).
Zhu, H. and Madnick, S.E. (2009) ‘Finding new uses for information’, MIT Sloan Management
Review, Vol. 50, No. 4, pp.17–21 [online] http://sloanreview.mit.edu/article/finding-new-uses-
for-information/ (accessed 12 December 2015).
