Data Mining For Fraud Detection
Table 2. Seed or Complex Variable: Kurtosis and Skewness (Kurtosis = 3, Skewness = 0)
Table 3 represents a comparison of the kurtosis and skewness of the weighted outlier and of the seed or complex variable. As the chart shows, the weighted outlier formula increases the kurtosis while simultaneously increasing the skewness, and in this process of squeezing and pulling out, the entity with a high potential for fraud is identified.
The weighted outlier variable maximizes the differences in variables between providers providing similar services to beneficiaries and minimizes the similarities in the potential fraud schemes.
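As a minimal sketch of the two statistics compared in Tables 2 and 3, the functions below compute skewness and kurtosis from central moments. The function names and the sample data are hypothetical. The population (divide by n) moment estimators and the Pearson convention, under which a normal distribution has kurtosis 3, are assumptions; the paper does not state which estimator was used.

```python
def central_moment(values, k):
    """k-th central moment of a sample (population form, divides by n)."""
    n = len(values)
    mu = sum(values) / n
    return sum((v - mu) ** k for v in values) / n


def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (Pearson convention, normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical data: a symmetric sample, then the same sample plus one extreme value.
symmetric = [1, 2, 3, 4, 5]
with_outlier = symmetric + [50]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(with_outlier), kurtosis(with_outlier))
```

In this toy sample, adding a single extreme value raises both the skewness and the kurtosis, which is the direction of change the text describes for the weighted outlier.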
An example of how the weighted outlier formula works in detecting fraud can be illustrated using healthcare data from Medicare. Medicare fraud comes in a variety of shapes and forms. Fraud schemes are difficult to detect given the number of variables involved: beneficiaries, providers, medical procedures, amounts paid to providers, days of service, number of services, and diagnoses are just some of the basic variables, with millions of data points. The issue is how to create variables that assist in data mining the vast databases of Medicare claims information by separating the outliers in the data to detect potential fraud among providers.
The potential provider-based Medicare fraud schemes include, but are not limited to:
1. Services not rendered (SNR) to beneficiaries;
2. Unnecessary services (UNS) to beneficiaries;
3. Impossible days (IMD), or providers billing for more hours in a day than is probable;
4. Illegal self-referrals by a provider for unnecessary services to beneficiaries;
5. An illegal financial relationship (IFA) between the referring provider and the rendering provider; or
6. Sharing of beneficiaries between the referring and rendering providers.
The aforementioned fraud schemes are difficult to detect in a vast database, and they are even more difficult to detect if a provider uses a combination of fraud schemes to perpetrate and conceal Medicare fraud (see Table 4). Some of these schemes overlap with one another, and the challenge is to create variables that maximize the differences and minimize the similarities in the potential fraud schemes.
Table 3. Comparison of Weighted Outlier, and Seed or Complex Variable: Kurtosis and Skewness. (Seed or Complex: Kurtosis = 3, Skewness = 0; Weighted Outlier: Kurtosis = 12, Skewness = 197)
The weighted outlier formula is expressed as:
X = [expression in Z, A, and B; a "2" also appears in the original formula, but its role is not recoverable from the extracted text]
X = The weighted outlier variable, which denotes the fraud
Z = The Z score of an independent variable which denotes the purposes of a fraud
A = Independent (seed) variable which denotes the means to perpetrate a fraud
B = Independent (seed) variable which denotes the means to perpetrate a fraud
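A minimal sketch of how a weighted outlier variable might be computed. It assumes the simplest form suggested by the variable names in Table 6 (for example Z/Bene), namely X = Z / A, where Z is the Z score of the provider paid amount and A is a seed variable such as the distinct count of beneficiaries. That form, the function names, the use of the population standard deviation, and the provider data are all assumptions made for illustration, not the paper's confirmed formula.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Z score of each value against the population mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def weighted_outlier(paid_amounts, seed_counts):
    # ASSUMPTION: X = Z / A, with Z the Z score of the paid amount
    # and A a seed variable (here, distinct beneficiaries).
    return [z / a for z, a in zip(z_scores(paid_amounts), seed_counts)]


# Hypothetical providers: amount paid and distinct beneficiaries.
paid = [1000.0, 1200.0, 900.0, 5000.0]
benes = [50, 60, 45, 10]

scores = weighted_outlier(paid, benes)
top = max(range(len(scores)), key=scores.__getitem__)
print(top, scores)
```

Under these assumptions the fourth provider, paid far more than the others while seeing the fewest beneficiaries, receives the largest score.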
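Table 4. Interaction of Potential Fraud Schemes (diagram of overlapping Medicare fraud schemes: services not rendered (SNR), unnecessary services (UNS), impossible days (IMD), illegal financial relationships (IFA), beneficiary sharing, and self-referrals)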
Healthcare Variables
A combination of seed variables, basic ratios, and weighted outlier variables is essential to detect potential healthcare fraud. The seed variables are extracted directly from the data. They are the distinct counts of the following: PIN; first day of service (FDOS); beneficiaries; diagnoses (Dx); medical procedures (CPT); place of service (POS); claims (ICN); and referring UPIN; as well as the provider paid amount and the total number of units. The distributions of seed variables, although useful in detecting healthcare fraud, are not by themselves necessarily indicative of fraud. One of the main reasons seed variables alone fail to predict fraudulent activity is that they can be affected by the size and diversity of a provider's practice. For example, a provider with a large practice in an urban area might have a greater patient volume, and thus greater income, than providers practicing in rural areas. Hence, the skewness and kurtosis of seed variable distributions are not necessarily accurate indicators of potential fraud.
Another method of identifying potential fraud schemes includes analyzing the relationship, or ratio, between different variables. These variables are called complex variables. The complex variables used to detect fraud are: ICN per beneficiary; ICN per FDOS; beneficiaries per FDOS; beneficiaries to referring UPIN; medical procedures to referring UPIN; and diagnoses to referring UPIN. The variables listed are derived from the seed variables and: (1) are correlated; (2) sometimes are not normally distributed; and (3) based on experience, assist in detecting potential fraud. The mean, standard deviation, and Z scores are also examples of complex variables.
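The step from seed variables to complex variables can be sketched as follows. The record layout, field order, provider labels, and function names are hypothetical; only the idea of taking distinct counts per provider and then dividing one count by another comes from the text.

```python
from collections import defaultdict

# Hypothetical claim records: (provider, claim ICN, beneficiary, first day of service).
claims = [
    ("P1", "C1", "B1", "2009-01-05"),
    ("P1", "C2", "B1", "2009-01-05"),
    ("P1", "C3", "B2", "2009-01-06"),
    ("P2", "C4", "B3", "2009-01-05"),
]


def seed_variables(records):
    """Distinct counts of claims, beneficiaries, and service days per provider."""
    sets = defaultdict(lambda: {"icn": set(), "bene": set(), "fdos": set()})
    for provider, icn, bene, fdos in records:
        sets[provider]["icn"].add(icn)
        sets[provider]["bene"].add(bene)
        sets[provider]["fdos"].add(fdos)
    return {p: {name: len(s) for name, s in cols.items()} for p, cols in sets.items()}


def complex_variables(seed):
    """Ratios between the seed variables of one provider."""
    return {
        "icn_per_bene": seed["icn"] / seed["bene"],
        "icn_per_fdos": seed["icn"] / seed["fdos"],
        "bene_per_fdos": seed["bene"] / seed["fdos"],
    }


seeds = seed_variables(claims)
ratios = {provider: complex_variables(s) for provider, s in seeds.items()}
print(ratios["P1"])
```

The ratios involving the referring UPIN would follow the same pattern with one more distinct-count column.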
Variable      ICN/Bene      ICN/FDOS      Bene/FDOS     Bene/RefUPIN  CPT/RefUPIN   Dx/RefUPIN
ICN/Bene      1             0.212130786   -0.15694424   -0.02267558   0.319601875   0.049493142
ICN/FDOS      0.212130786   1             0.822651598   0.158268848   -0.2099538    -0.23760167
Bene/FDOS     -0.15694424   0.822651598   1             0.230730507   -0.32004858   -0.24604582
Bene/RefUPIN  -0.02267558   0.158268848   0.230730507   1             0.093916984   0.116711104
CPT/RefUPIN   0.319601875   -0.2099538    -0.32004858   0.093916984   1             0.791509512
Dx/RefUPIN    0.049493142   -0.23760167   -0.24604582   0.116711104   0.791509512   1
Table 5. Complex Variables Correlation
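A correlation table of this shape can be produced with a short routine like the one below. It is a sketch that assumes Pearson correlation, which the paper does not name explicitly; the function names and the per-provider values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations between named columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical complex variables, one value per provider.
columns = {
    "ICN/FDOS": [2.0, 3.5, 1.0, 4.0],
    "Bene/FDOS": [1.5, 3.0, 1.0, 3.5],
    "CPT/RefUPIN": [4.0, 1.0, 3.0, 2.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

As in Table 5, the result is symmetric with ones on the diagonal.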
These complex variables are considered indications of fraud because they may signify the practice of one or more of the following potential fraud schemes:
a. Self-referrals by a provider for unnecessary services to beneficiaries; or
b. An illegal financial relationship between the referring provider and the rendering provider; or
c. Sharing of beneficiaries between the referring and rendering providers; or
d. Provider activity may qualify as an impossible days scenario; or
e. Provider may be billing for unnecessary services; or
f. Provider may be billing for services not rendered.
Table 6 shows the skewness and kurtosis of all the variables in our data set. As expected (in order to detect potential fraud), it shows an increase in the skewness and kurtosis between the seed and complex variables. For example, the squeezing and pulling out effect occurs when we measure claims per day (complex variable) vis-à-vis the number of claims (seed variable) or the number of days of service (seed variable). An important observation is that the increase in skewness and kurtosis is even more significant between the complex variables and the weighted outlier variables.
Table 6. Skewness and Kurtosis of Seed, Complex and Weighted Outlier Variables
Category             Variable       Skewness     Kurtosis
Seed                 Units          2.174538     7.3879808
Seed                 Benes          1.9793898    5.1381806
Seed                 Diagnosis      0.8959197    -0.213163
Seed                 FDOS           0.3028637    -1.52939
Seed                 CPT            0.7194892    -0.791148
Seed                 ICN            1.8485883    4.2930173
Seed                 POS            0.3078353    -0.648475
Seed                 RefUPIN        2.1780747    5.987456
Ratios               ICN/Bene       3.7234978    18.172842
Ratios               ICN/FDOS       2.3585795    10.22756
Ratios               Bene/FDOS      2.5732777    10.931942
Ratios               Bene/RefUPIN   19.558957    396.93533
Ratios               CPT/RefUPIN    2.5717062    8.506685
Ratios               Dx/RefUPIN     2.7660013    10.059753
Dependent Variables  ProvPaid       5.4782115    56.995661
Dependent Variables  Z Score [1]    5.4735533    56.867998
Wt Outlier           Z/Bene         13.520514    195.3972
Wt Outlier           Z/Dx           12.929682    178.89963
Wt Outlier           Z/CPT          12.746727    188.7118
Wt Outlier           Z/ICN          7.5788937    90.62649
Wt Outlier           Z/BeneF        14.378765    251.57888
Wt Outlier           Z/ICNB         15.519089    281.08893
[1] This is the Z score of the provider paid amount.
Table 7 shows a comparison of the mean skewness in the different categories of variables. This comparison shows that there was a skewness increase of over 292% between the seed variables and the complex variables (ratios), and an increase of over 285% between the seed variables and the Z score of the provider paid amount (complex variable). On the other hand, there was an increase in the skewness of over 667% between the seed variables and the weighted outlier variables; an increase of over 228% between the complex variables (ratios) and the weighted outlier variables; and an increase of over 233% between the Z score of the provider paid amount and the weighted outlier variables.
Table 8 shows a comparison of the mean kurtosis in the different categories of variables. This comparison shows a clear and substantial increase in kurtosis across the different categories of variables: an increase of over 890% between the seed variables and the complex variables (ratios), and an increase of over 691% between the seed variables and the Z score of the provider paid amount (complex variable). On the other hand, there was an increase in the kurtosis of over 2,322% between the seed variables and the weighted outlier variables; an increase of over 260% between the complex variables (ratios) and the weighted outlier variables; and an increase of over 335% between the Z score of the provider paid amount and the weighted outlier variables.
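The following sketch recomputes three of these comparisons from the Table 6 values. It assumes each category mean is a simple arithmetic mean and that each percentage is the weighted outlier mean expressed as a percentage of the comparison mean (a ratio of means, so 228% corresponds to roughly 2.28 times). Under that reading the 228%, 233%, and 260% figures are reproduced. The helper name `as_percent_of` is hypothetical, and the seed-variable comparisons are left out because the seed category mean used in Tables 7 and 8 is not given in the text.

```python
from statistics import mean

# Skewness and kurtosis values copied from Table 6.
ratios_skew = [3.7234978, 2.3585795, 2.5732777, 19.558957, 2.5717062, 2.7660013]
ratios_kurt = [18.172842, 10.22756, 10.931942, 396.93533, 8.506685, 10.059753]
wt_skew = [13.520514, 12.929682, 12.746727, 7.5788937, 14.378765, 15.519089]
wt_kurt = [195.3972, 178.89963, 188.7118, 90.62649, 251.57888, 281.08893]
z_score_skew = 5.4735533


def as_percent_of(new, old):
    """New mean expressed as a percentage of the old mean (ASSUMED reading)."""
    return 100.0 * new / old


skew_vs_ratios = as_percent_of(mean(wt_skew), mean(ratios_skew))
skew_vs_z = as_percent_of(mean(wt_skew), z_score_skew)
kurt_vs_ratios = as_percent_of(mean(wt_kurt), mean(ratios_kurt))

print(round(skew_vs_ratios, 1))   # between 228 and 229
print(round(skew_vs_z, 1))        # between 233 and 234
print(round(kurt_vs_ratios, 1))   # between 260 and 261
```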
Table 7. Skewness Comparison: Variables by Category
Table 8. Kurtosis Comparison by Mean
The increase in the skewness and kurtosis between the seed and complex variables may explain why complex variables incrementally augment the seed variables in the analysis for the detection of healthcare fraud. Hence, the weighted outlier variables also incrementally augment the seed and complex variables in the forecasting and detection of fraud.
Table 9 represents a comparison of the beneficiaries per day, the Z score of the provider paid amount, and the weighted outlier of the beneficiaries per day. It compares the mean of the three variables in the population with providers that score high or low in the different categories of variables.
Table 9. Variables Comparison Specialty X
A provider that has a low number of beneficiaries (or patients) and a low Z score also scores low in the weighted outlier variable. It is to be expected that a low number of beneficiaries correlates with a low Z score of the provider paid amount (i.e., lower number of beneficiaries = lesser amount paid to provider) as compared to the mean of all providers. Therefore, a relevant weighted outlier should minimize this similarity in the data.
A provider that has a high number of beneficiaries and a high Z score of the provider paid amount scores somewhat higher than the mean of all providers. This is to be expected, since a provider that has a higher number of beneficiaries than the mean of the population should be expected to be paid more than the population average (higher number of beneficiaries = higher amount paid to provider). Again, the weighted outlier rightly minimizes these similarities.
The usefulness of the weighted outlier is seen when a provider has a lower number of beneficiaries and a high Z score of the provider paid amount. In this scenario the suspected potential fraud comes to the top since, as compared to the mean of the population, a provider that has a low number of beneficiaries is not expected to have a higher provider paid amount (low number of beneficiaries, higher provider paid amount). The weighted outlier correctly maximizes this difference.
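The three scenarios can be put side by side in a small sketch. It assumes the weighted outlier is the Z score divided by the number of beneficiaries, a form suggested by the Z/Bene label in Table 6 but not confirmed by the text; the provider labels and numbers are hypothetical.

```python
# Hypothetical providers: (label, distinct beneficiaries, Z score of provider paid amount).
providers = [
    ("low benes, low Z", 20, -0.8),
    ("high benes, high Z", 400, 2.0),
    ("low benes, high Z", 20, 2.0),
]


def weighted_outlier(z_score, beneficiaries):
    # ASSUMPTION: Z score divided by the seed variable (beneficiaries).
    return z_score / beneficiaries


scored = {label: weighted_outlier(z, benes) for label, benes, z in providers}
flagged = max(scored, key=scored.get)

for label, value in scored.items():
    print(f"{label}: {value:.3f}")
print("highest weighted outlier:", flagged)
```

With these numbers, the provider with few beneficiaries and a high Z score stands well apart from the other two, whose scores stay close to zero.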
Data Modeling
The purpose of data modeling in fraud detection is to develop an accurate model, or graphical representation, which has the ability to predict the potential for fraud among the entities within a population. Different techniques are used to model data, which include, but are not limited to: (1) classification and regression analysis (predicting a response variable); (2) clustering (grouping the rows by similarities); and (3) association (showing that the variables are related). The weighted outlier increases the data model's predictive function for fraud detection.
The rank, RSquare, and adjusted RSquare are examples of how weighted outlier variables make for a better fitting model. Table 10 shows a comparison of the rank of the provider by number of beneficiaries (bene), distinct number of first day of service (FDOS), beneficiaries per first day of service (bene/FDOS), Z score of provider paid amount, and weighted outlier. The table illustrates the ability of the weighted outlier to increase the rank of a provider who has the potential for fraud vis-à-vis seed and complex variables.
Table 10. Rank Comparison
Provider   Benes Rank   FDOS Rank   Benes/FDOS Rank   Zscore ProvPd Rank   Wt Outlier Rank
Alpha1     215          94          385               3                    1
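A rank comparison like Table 10 can be sketched as below. The ranking rule (rank 1 for the largest value) is an assumption, since the paper does not state how ranks were assigned, and every provider other than the Alpha1 label, along with all the values, is hypothetical.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; tied values share the better rank."""
    ordered = sorted(values.values(), reverse=True)
    return {name: ordered.index(v) + 1 for name, v in values.items()}


# Hypothetical values per provider for one seed variable and one weighted outlier.
benes = {"Alpha1": 12, "Beta2": 300, "Gamma3": 150}
wt_outlier = {"Alpha1": 0.25, "Beta2": 0.004, "Gamma3": -0.002}

rank_by_benes = rank_descending(benes)
rank_by_wt_outlier = rank_descending(wt_outlier)

print(rank_by_benes["Alpha1"], rank_by_wt_outlier["Alpha1"])
```

In this toy data the provider ranks last on the seed variable and first on the weighted outlier, mirroring the movement shown for Alpha1 in the table.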
A regression analysis was performed on the seed variables; the seed and complex variables; and the seed, complex, and weighted outlier variables, respectively. Table 11 shows a comparative analysis of the three different regression models by RSquare and RSquare Adjusted. These results indicate that an increase occurs in the RSquare and Adjusted RSquare of the regression model when the weighted outliers are added to the model. The increase of over ten percent is significant to the predictive value of the model. Hence, it could be inferred that the addition of the weighted outlier variables (to the seed and complex variables) enhances the model and makes it more efficient and accurate in the detection of potential fraud.
Table 11. Comparative Analysis of the RSquare and RSquare Adjusted, by model
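For reference, the two fit statistics compared across the models can be computed as in this sketch. It uses the standard definitions of R squared and adjusted R squared; the function names, the actual and predicted values, and the predictor count are hypothetical and are not taken from the paper's regression output.

```python
def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_observations, n_predictors):
    """R squared penalized for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Hypothetical response values and model predictions.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 37.0, 50.0]

r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n_observations=len(actual), n_predictors=2)
print(round(r2, 3), round(adj, 3))
```

Because the adjusted statistic penalizes extra predictors, an added variable raises it only if the improvement in fit outweighs that penalty.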
Table 12. Regression Analysis of Seed Variables (response: N Sum of Provider Paid Amt; whole model, actual by predicted plot)
Table 13. Regression Analysis of Seed and Complex Variables (response: N Sum of Provider Paid Amt; whole model, actual by predicted plot)
Table 14. Regression Analysis of Seed, Complex, and Weighted Outlier Variables (response: N Sum of Provider Paid Amt; whole model, actual by predicted plot)
Conclusion
The weighted outlier variable formula achieves the squeezing and pulling out effect by minimizing the similarities and maximizing the differences in the data, simultaneously increasing the kurtosis and skewness vis-à-vis seed and complex variables. The weighted outlier is an independent variable which also has the potential for improving data modeling. The detection of fraud in private industry, as well as in government, can be improved through the utilization of weighted outlier variables.
World Headquarters
500 Frank W. Burr Boulevard,
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com
European Headquarters
Haymarket House
28-29 Haymarket
London SW1Y 4SP UK
Phone: +44 (0) 20 7321 4888
Fax: +44 (0) 20 7321 4890
Email: infouk@cognizant.com
India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com
Copyright 2009, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or
otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
DWBI&PM Practice at Cognizant
The Data Warehousing, Business Intelligence & Performance Management Practice is a single-point Center of
Excellence within Cognizant for designing and deploying full-fledged DWBI&PM solutions. With more than 5,610*
consultants across the globe, Cognizant's award-winning DWBI&PM practice is at the forefront of partnering leading
companies around the world in architecting pragmatic, business-focused, enterprise-wide BI solutions. The practice
has been recognized for its role in enabling BI excellence through prestigious industry awards, including three
Computerworld BI Perspectives Best Practices Awards, the DM Review Innovative Solution Award, the TDWI Award,
the Cognos Performance Leadership Award, the Cognos Excellence Award, and the Informatica Innovation Award.
For more information on Cognizant's DWBI&PM solutions, contact us at inquiry@cognizant.com or visit our website at http://www.cognizant.com
Note: * As of 30th Apr '09
About the Author
Alberto Roldan is a thought leader in Enterprise Analytics within DWBI&PM Practice. He has over 20 years' experience
designing analytics solutions for organizations with large and complex technology landscapes. He specializes in adapting
proven analytics techniques and methods to real world data intensive problems in neuroscience, medicine, physics and
chemistry. He has degrees from the University of Michigan and University of Puerto Rico.