
Weighted Outlier Variable

Data Mining for Fraud Detection


COGNIZANT ENTERPRISE ANALYTICS WHITEPAPER
Introduction
The data analysis of outliers is frequently used in the detection of potential fraud in both public and private industry. Private and public institutions attempt to detect fraudulent situations, while maintaining large data sets. Specifically, the analysis of outliers is used in the detection of healthcare insurance fraud. An outlier is an observation that lies outside the overall pattern of a distribution in the data. Usually, the presence of an outlier indicates some sort of problem. Nevertheless, an outlier, in and of itself, is not an indication of potential fraud. For the outlier to be considered representative of fraudulent activity, the variable must take into consideration the purpose and means used by those committing the fraud.

The purpose of fraud is self-serving enrichment through the means of unlawful, deceptive, or illegal practices. The means are overutilization of goods and services, or deceptive business practices. To identify fraud and abuse, variables must be created. To abstract valid outliers of fraud, a variable formula needs to be constructed combining the factors of finance and the means of fraud perpetration.

A significant outlier variable analyzes the relationship between the purpose and means used in fraud to create meaningful variables. The supposition of this formula is to create variables that maximize the differences in independent variables, while simultaneously minimizing the similarities in the data, to detect outliers that are indicators of potential fraud and abuse. This effect could be described as data mining for the purpose of squeezing and pulling out the potential fraud from the data set.

Fraud and Abuse
According to the National Insurance Crime Bureau (NICB), a nonprofit organization supported by 1,000 property and casualty carriers, insurance fraud is perennially the second-most common white collar crime behind tax evasion, and costs the U.S. public roughly $30 billion in property and casualty claims alone. Healthcare fraud is estimated at over $100 billion annually. Securities fraud is estimated at over $15 billion per year. The Association of Certified Fraud Examiners put the annual cost of occupational fraud and abuse in the United States at $600 billion in 2002, up from $400 billion in 1996.

In connection with cases involving securities, commodities, investment and advance fee fraud schemes, conduct central to many corporate fraud investigations, federal prosecutors were awarded over $2.5 billion in fines, forfeitures and restitution from July 1, 2002 through March 31, 2003.

Fraud Variables
Detecting fraud through data mining is the proverbial problem of finding the needle in a haystack. How can relevant aberrant behavior be detected in a never-ending sea of data? The answer lies in the creation of a formula that combines the purpose and means of fraud, while sifting through the similarities and the differences in the data. If financial gain is the purpose of fraud, the effects of the means used to perpetrate a fraud for such financial gain should give us variables that would detect fraud.

An analogy that can be used to contemplate the detection of potential fraud is a sand-filled balloon. The balloon represents the population under study. The sand represents the entities to be studied. The weighted outlier variable operates on the concept that if the balloon is squeezed by the correct variable, the entities coming out of the end of the balloon are going to represent those with a higher potential of fraud.

In statistics, the population data set can be analyzed to characterize the location and variability of a particular variable. The skewness is a measure of asymmetry in the data. Kurtosis is a measure of whether the distribution curve of the data is peaked or flat. For the purposes of creating an outlier variable, which is a superior detector of potential fraud, the kurtosis signifies the degree of similarities of the variables in the data, and the skewness represents the differences.

Table 1. Kurtosis and Skewness (Kurtosis = 3, Skewness = 0)
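
As an illustration of the two measures described above, the following is a minimal sketch that computes skewness and kurtosis from their moment definitions. It assumes the population (biased) form of both statistics, with kurtosis reported on the scale where a normal distribution equals 3, as in Table 1. The patient counts are hypothetical and are not taken from the data set used in this paper.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all observations."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment divided by the cube of the standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical distinct patient counts for a group of similar providers.
typical = [18, 19, 20, 20, 20, 21, 22]
# The same group with one provider far away from the rest.
with_outlier = typical + [95]

for label, data in (("typical", typical), ("with outlier", with_outlier)):
    print(f"{label:13s} skewness={skewness(data):.2f} kurtosis={kurtosis(data):.2f}")
```

In this sketch the symmetric group has a skewness of zero, while the single distant provider raises both statistics, which is the behavior the weighted outlier variable is meant to amplify.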
For example, if we analyze the distinct number of patients within a group of medical providers and we want to know how similar some of the providers are to other providers in the number of patients, we would examine the kurtosis. Hence, the kurtosis would measure the extent of similarities in the data. On the other hand, the skewness would represent how far away some providers are from the similarities in the data (i.e., distinct number of patients). Table 1 is a representation of the kurtosis and skewness in a normal distribution.

There are different types of variables that can be used to conduct potential fraud analysis on a population. The variables can be classified as seed, complex, and weighted outlier variables. The seed variables are those that come directly from the data. For example, in the detection of healthcare fraud, the distinct number of claims (ICN) or the distinct number of first dates of service (FDOS) can be considered seed variables. Hence, these variables can be expressed as follows:
I = ICN or distinct number of claims
F = FDOS or distinct number of first date of service

Complex variables refer to ratios or statistical formulas derived from the seed variables. For example, the number of distinct claims per the distinct number of first dates of service (i.e., claims per day) is a complex variable, and can be expressed as:
$$\frac{\text{ICN}}{\text{FDOS}}$$

A complex variable can be the mean of a population. The quantity commonly referred to as "the" mean of a set of values is the arithmetic mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
also called the average.

Another complex variable can be the standard deviation in a data set of a population. The standard deviation $\sigma$ of a probability distribution is defined as the square root of the variance $\sigma^2$. The square root of the sample variance of a set of N values is the sample standard deviation
$$s_N = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
The distribution of the sample standard deviation is a slightly complicated, though well-studied and well-understood, function.

However, consistent with widespread inconsistent and ambiguous terminology, the square root of the bias-corrected variance is sometimes also known as the standard deviation:
$$s_{N-1} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
The standard deviation arises naturally in mathematical statistics through its definition in terms of the second central moment. However, a more natural but much less frequently encountered measure of average deviation from the mean that is used in descriptive statistics is the so-called mean deviation.

Another complex variable can be the Z score of a variable in a population. The Z score measures the number of standard deviations an observation lies away from the mean. The z-score associated with the ith observation of a random variable x is given by
$$z_i = \frac{x_i - \bar{x}}{\sigma}$$
where $\bar{x}$ is the mean and $\sigma$ the standard deviation of all observations $x_1, \ldots, x_n$.

The Weighted Outlier
The weighted outlier variable takes into consideration the relationship between the purpose and means of potential fraud to create an independent variable. In its simplest terms, it can be expressed as how the means (M) affect the purposes (P) of fraud. It can be expressed as:
$$\frac{\text{Purposes}}{\text{Means}}$$

It combines seed and complex variables to create a new independent variable that maximizes the differences and minimizes the similarities in potential fraud schemes. This is done by simultaneously increasing the kurtosis and skewness in a distribution. Table 2 is a graphical representation of the kurtosis and skewness of a typical seed or complex variable.

Table 2. Seed or Complex Variable: Kurtosis and Skewness (Kurtosis = 3, Skewness = 0)
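
The mean, the two forms of the standard deviation, and the Z score defined above can be sketched in a few lines. This is a minimal illustration, not the implementation used for this paper: the provider figures are hypothetical, and the choice of the divide-by-N standard deviation inside the Z score is an assumption, since the text does not say which form was applied.

```python
import math


def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def std_n(xs):
    """Sample standard deviation s_N (divides by N)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def std_n_minus_1(xs):
    """Bias-corrected standard deviation s_{N-1} (divides by N - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def z_scores(xs):
    """z_i = (x_i - mean) / standard deviation, using s_N (an assumption)."""
    m, s = mean(xs), std_n(xs)
    return [(x - m) / s for x in xs]


# Hypothetical seed variables for five providers.
icn = [120, 90, 200, 150, 640]   # I: distinct number of claims
fdos = [40, 30, 50, 50, 40]      # F: distinct number of first dates of service

# Complex variable: claims per day (ICN / FDOS).
claims_per_day = [i / f for i, f in zip(icn, fdos)]
z = z_scores(claims_per_day)

for ratio, score in zip(claims_per_day, z):
    print(f"claims/day={ratio:5.1f}  z={score:6.2f}")
```

The fifth provider bills far more claims per day than the others and therefore receives the largest Z score, even though its number of service dates is unremarkable.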
Table 3 represents a comparison of the kurtosis and skewness of the weighted outlier, and the seed or complex variable. As the chart shows, the weighted outlier formula has the effect of increasing the kurtosis while simultaneously increasing the skewness, and in this process of squeezing and pulling out, the entity with a high potential for fraud is identified.

Table 3. Comparison of Weighted Outlier, and Seed or Complex Variable: Kurtosis and Skewness.
[Figure: distribution curves labeled "Seed or Complex" and "Weighted Outlier", with "Fraud" marked on the weighted outlier curve. Labels as extracted: Kurtosis = 12, Skewness = 0, Kurtosis = 3, Skewness = 197.]

The weighted outlier formula is expressed as:
X = (formula not legible in the source; its surviving terms are Z, A, B, and an exponent of 2)
Z = The Z score of an independent variable which denotes the purposes of a fraud
A = Independent (seed) variable which denotes the means to perpetrate a fraud
B = Independent (seed) variable which denotes the means to perpetrate a fraud

The weighted outlier variable maximizes the differences in variables between providers providing similar services to beneficiaries and minimizes the similarities in the potential fraud schemes.

Medicare Fraud
An example of how the weighted outlier formula works in detecting fraud can be illustrated using healthcare data from Medicare. Medicare fraud comes in a variety of shapes and forms. Fraud schemes are difficult to detect given the number of variables involved: beneficiaries, providers, medical procedures, amounts paid to providers, days of services, number of services, and diagnoses are just some of the basic variables with millions of data points. The issue is how to create variables that assist in data mining the vast databases of Medicare claims information by separating the outliers in the data to detect potential fraud among providers.

Potential Fraud Schemes
The potential provider-based Medicare fraud schemes are, but are not limited to:
1. Services not rendered (SNR) to beneficiaries;
2. Unnecessary services (UNS) to beneficiaries;
3. Impossible days (IMD), or providers billing for more hours in a day than is probable;
4. Illegal self-referrals by a provider for unnecessary services to beneficiaries;
5. An illegal financial relationship (IFA) between the referring provider and the rendering provider; or
6. Sharing of beneficiaries between the referring and rendering providers.

The aforementioned fraud schemes are difficult to detect in a vast database, and they are even more difficult to detect if a provider uses a combination of fraud schemes to perpetrate and conceal Medicare fraud (see Table 4). Some of these schemes overlap with one another, and the challenge is to create variables that maximize the differences and minimize the similarities in the potential fraud schemes.

Table 4. Interaction of Potential Fraud Schemes
[Diagram of overlapping fraud schemes. Labels: UNS, SNR, IMD, IFA, Bene Sharing, Self-Referrals, Services Rendered, Relationships, Beneficiaries.]
Healthcare Variables
A combination of seed variables, basic ratios, and weighted outlier variables is essential to detect potential healthcare fraud. The seed variables are extracted directly from the data. The seed variables are the distinct count of the following: PIN; first day of service (FDOS); beneficiaries; diagnosis (Dx); medical procedures (CPT); place of service (POS); claims (ICN); Referring UPIN; as well as the provider paid amount and the total number of units. The distributions of seed variables, although useful in detecting healthcare fraud, are not necessarily indicative of fraud when considered independently. One of the main reasons seed variables alone fail to predict fraudulent activity is that they can be affected by the size and diversity of a provider practice. For example, a provider with a large practice in an urban area might have a greater patient volume, and thus greater income, than providers practicing in rural areas. Hence, the skewness and kurtosis of seed variable distributions are not necessarily accurate indicators of potential fraud.

Another method of identifying potential fraud schemes includes analyzing the relationship, or ratio, between different variables. These variables are called complex variables. The complex variables used to detect fraud are ICN per beneficiary; ICN per FDOS; beneficiaries per FDOS; Beneficiaries to Referring UPIN; Medical Procedures to Referring UPIN; and Diagnoses to Referring UPIN. The variables listed, which are derived from the seed variables, (1) are correlated; (2) are sometimes not normally distributed; and (3), based on experience, assist in detecting potential fraud. The mean, standard deviation, and Z scores are also examples of complex variables.
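
The step from raw claims to seed variables, and from seed variables to the six ratios listed above, can be sketched as follows. This is a minimal illustration under stated assumptions: the claim layout, field order, and all identifiers in the sample rows are hypothetical, and only the distinct-count seed variables are covered (not the provider paid amount or units).

```python
from collections import defaultdict

# Hypothetical claim lines:
# (provider PIN, ICN, beneficiary, FDOS, CPT, Dx, referring UPIN)
claims = [
    ("P1", "c1", "b1", "2009-01-05", "99213", "250.00", "U1"),
    ("P1", "c2", "b2", "2009-01-05", "99213", "401.9", "U1"),
    ("P1", "c3", "b1", "2009-01-06", "99214", "250.00", "U2"),
    ("P2", "c4", "b3", "2009-01-05", "99213", "401.9", "U3"),
    ("P2", "c5", "b3", "2009-01-05", "99215", "401.9", "U3"),
    ("P2", "c6", "b3", "2009-01-05", "99215", "786.50", "U3"),
    ("P2", "c7", "b3", "2009-01-05", "99213", "250.00", "U3"),
]

FIELDS = ("ICN", "Bene", "FDOS", "CPT", "Dx", "RefUPIN")

# Complex variables named in the text, as (numerator, denominator) pairs.
RATIOS = [
    ("ICN", "Bene"), ("ICN", "FDOS"), ("Bene", "FDOS"),
    ("Bene", "RefUPIN"), ("CPT", "RefUPIN"), ("Dx", "RefUPIN"),
]


def seed_variables(rows):
    """Distinct count of each field per provider."""
    distinct = defaultdict(lambda: {field: set() for field in FIELDS})
    for provider, *values in rows:
        for field, value in zip(FIELDS, values):
            distinct[provider][field].add(value)
    return {p: {f: len(s) for f, s in fields.items()}
            for p, fields in distinct.items()}


def complex_variables(seed):
    """Ratios of seed variables for one provider."""
    return {f"{num}/{den}": seed[num] / seed[den] for num, den in RATIOS}


seeds = seed_variables(claims)
ratios = {provider: complex_variables(seed) for provider, seed in seeds.items()}

for provider in sorted(ratios):
    print(provider, seeds[provider], ratios[provider])
```

In the sample rows, provider P2 bills four claims for a single beneficiary on a single day, so its ICN/Bene and ICN/FDOS ratios stand out even though its raw claim count is close to that of P1.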

Table 5. Complex Variables Correlation
Variable       ICN/Bene      ICN/FDOS      Bene/FDOS     Bene/RefUPIN  CPT/RefUPIN   Dx/RefUPIN
ICN/Bene       1             0.212130786   -0.15694424   -0.02267558   0.319601875   0.049493142
ICN/FDOS       0.212130786   1             0.822651598   0.158268848   -0.2099538    -0.23760167
Bene/FDOS      -0.15694424   0.822651598   1             0.230730507   -0.32004858   -0.24604582
Bene/RefUPIN   -0.02267558   0.158268848   0.230730507   1             0.093916984   0.116711104
CPT/RefUPIN    0.319601875   -0.2099538    -0.32004858   0.093916984   1             0.791509512
Dx/RefUPIN     0.049493142   -0.23760167   -0.24604582   0.116711104   0.791509512   1
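
A correlation table of this kind can be produced with the Pearson correlation coefficient. The sketch below is an assumption about the method, since the paper does not state which coefficient was used for Table 5, and the ratio values are hypothetical. It also assumes every column has non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of {variable name: values}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical complex variables for five providers.
columns = {
    "ICN/FDOS": [3.0, 3.0, 4.0, 3.0, 16.0],
    "Bene/FDOS": [2.0, 2.5, 3.0, 2.0, 9.0],
    "CPT/RefUPIN": [5.0, 4.0, 4.5, 6.0, 1.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(f"{name:12s}", "  ".join(f"{value:7.3f}" for value in row.values()))
```

The resulting matrix is symmetric with ones on the diagonal, the same shape as Table 5.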
These complex variables are considered indications of fraud because they may signify the practice of one or more of the following potential fraud schemes:
a. Self-referrals by a provider for unnecessary services to beneficiaries;
b. An illegal financial relationship between the referring provider and the rendering provider;
c. Sharing of beneficiaries between the referring and rendering providers;
d. Provider activity may qualify as an impossible days scenario;
e. Provider may be billing for unnecessary services; or
f. Provider may be billing for services not rendered.

Table 6 shows the skewness and kurtosis of all the variables in our data set. It clearly shows, as expected (in order to detect potential fraud), an increase in the skewness and kurtosis between the seed and complex variables. For example, the squeezing and pulling out effect occurs when we measure claims per day (complex variable) vis-à-vis the number of claims (seed variable) or the number of days of service (seed variable). An important observation is that the increase in skewness and kurtosis is even more significant between the complex variables and the weighted outlier variables.
Table 6. Skewness and Kurtosis of Seed, Complex and Weighted Outlier Variables
Category              Variable       Skewness     Kurtosis
Seed                  Units          2.174538     7.3879808
Seed                  Benes          1.9793898    5.1381806
Seed                  Diagnosis      0.8959197    -0.213163
Seed                  FDOS           0.3028637    -1.52939
Seed                  CPT            0.7194892    -0.791148
Seed                  ICN            1.8485883    4.2930173
Seed                  POS            0.3078353    -0.648475
Seed                  RefUPIN        2.1780747    5.987456
Ratios                ICN/Bene       3.7234978    18.172842
Ratios                ICN/FDOS       2.3585795    10.22756
Ratios                Bene/FDOS      2.5732777    10.931942
Ratios                Bene/RefUPIN   19.558957    396.93533
Ratios                CPT/RefUPIN    2.5717062    8.506685
Ratios                Dx/RefUPIN     2.7660013    10.059753
Dependent Variables   ProvPaid       5.4782115    56.995661
Dependent Variables   Z Score [1]    5.4735533    56.867998
Wt Outlier            Z/Bene         13.520514    195.3972
Wt Outlier            Z/Dx           12.929682    178.89963
Wt Outlier            Z/CPT          12.746727    188.7118
Wt Outlier            Z/ICN          7.5788937    90.62649
Wt Outlier            Z/BeneF        14.378765    251.57888
Wt Outlier            Z/ICNB         15.519089    281.08893

[1] This is the z score of the provider paid amount.

Table 7 shows a comparison of the mean skewness in the different categories of variables. This comparison shows a skewness increase of over 292% between the seed variables and the complex variables (ratios), and an increase of over 285% between the seed variables and the Z score of the provider paid amount (complex variable). On the other hand, there was an increase in skewness of over 667% between the seed variables and the weighted outlier variables; an increase of over 228% between the complex variables (ratios) and the weighted outlier variables; and an increase of over 233% between the Z score of the provider paid amount and the weighted outlier variables.
Table 8 shows a comparison of the mean kurtosis in the different categories of variables. This comparison shows a clear and substantial increase in kurtosis across the categories: an increase of over 890% between the seed variables and the complex variables (ratios), and an increase of over 691% between the seed variables and the Z score of the provider paid amount (complex variable). On the other hand, there was an increase in kurtosis of over 2,322% between the seed variables and the weighted outlier variables; an increase of over 260% between the complex variables (ratios) and the weighted outlier variables; and an increase of over 335% between the Z score of the provider paid amount and the weighted outlier variables.

Table 7. Skewness Comparison: Variables by Category
Table 8. Kurtosis Comparison by Mean
The increase in the skewness and kurtosis between the seed and complex variables may explain why complex variables incrementally augment the seed variables in the analysis for the detection of healthcare fraud. Likewise, the weighted outlier variables incrementally augment the seed and complex variables in the forecasting and detection of fraud.

Table 9 represents a comparison of the beneficiaries per day, the Z score of the provider paid amount, and the weighted outlier of the beneficiaries per day. It compares the mean of the three variables in the population with providers that score high or low in the different categories of variables.

Table 9. Variables Comparison Specialty X

A provider that has a low number of beneficiaries (or patients) and a low Z score also scores low in the weighted outlier variable. It is to be expected that a low number of beneficiaries correlates to a low Z score of the provider paid amount (i.e., lower number of beneficiaries = lesser amount paid to provider) as it compares to the mean of all providers. Therefore, a relevant weighted outlier should minimize this similarity in the data.

A provider that has a high number of beneficiaries and a high Z score of the provider paid amount scores somewhat higher than the mean of all providers. This is to be expected, since a provider that has a higher number of beneficiaries than the mean of the population should be expected to be paid more than average in the population (higher number of beneficiaries = higher amount paid to provider). Again, the weighted outlier rightly minimizes these similarities.

The usefulness of the weighted outlier is seen when a provider has a lower number of beneficiaries and a high Z score of the provider paid amount. In this scenario the suspected potential fraud comes to the top since, as compared to the mean of the population, a provider that has a low number of beneficiaries is not expected to have a higher provider paid amount (low number of beneficiaries ≠ higher provider paid amount). The weighted outlier correctly maximizes this difference.
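
The three scenarios above can be illustrated with a minimal sketch. It assumes the weighted outlier takes the simple purpose-over-means form, here the Z score of the provider paid amount divided by the number of beneficiaries, as the variable name Z/Bene in Table 6 suggests; it does not reproduce the full weighted outlier formula. The provider names and figures are hypothetical and are not taken from Table 9.

```python
import math

# Hypothetical providers: (distinct beneficiaries, provider paid amount)
providers = {
    "low_low": (10, 5_000),      # few beneficiaries, low payments
    "mid_a": (40, 20_000),
    "mid_b": (50, 26_000),
    "mid_c": (60, 29_000),
    "high_high": (120, 62_000),  # many beneficiaries, high payments
    "low_high": (12, 60_000),    # few beneficiaries, high payments
}


def z_scores(values):
    """Z score of each value against the whole population (divide-by-N form)."""
    count = len(values)
    mean = sum(values) / count
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / count)
    return [(v - mean) / sd for v in values]


names = list(providers)
paid_z = dict(zip(names, z_scores([providers[name][1] for name in names])))

# Assumed simple form: purpose (Z of paid amount) over means (beneficiaries).
weighted = {name: paid_z[name] / providers[name][0] for name in names}

for name in sorted(names, key=weighted.get, reverse=True):
    benes, paid = providers[name]
    print(f"{name:10s} benes={benes:4d} paid={paid:6d} "
          f"z={paid_z[name]:6.2f} weighted={weighted[name]:8.4f}")
```

On the Z score alone, the large legitimate-looking practice (high_high) ranks first. Once the Z score is weighted by the number of beneficiaries, the provider with few beneficiaries and high payments (low_high) moves to the top, and the low_low provider stays at the bottom.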
Data Modeling
The purpose of data modeling in fraud detection is to develop an accurate model, or graphical representation, that can predict the potential for fraud among the entities within a population. Different techniques are used to model data, which include, but are not limited to: (1) classification and regression analysis, which are used to predict a response variable; (2) clustering (grouping the rows by similarities); and (3) association (showing that the variables are related). The weighted outlier increases the data model's predictive function for fraud detection.
The rank, RSquare, and adjusted RSquare are examples of how weighted outlier variables make for a better fitting model. Table 10 shows a comparison of the rank of the provider by number of beneficiaries (bene), distinct number of first day of service (FDOS), beneficiaries per first day of service (bene/FDOS), Z score of provider paid amount, and weighted outlier. The table illustrates the ability of the weighted outlier to increase the rank of a provider who has the potential for fraud vis-à-vis seed and complex variables.

Table 10. Rank Comparison
Provider   Benes Rank   FDOS Rank   Benes/FDOS Rank   Zscore ProvPd Rank   Wt Outlier Rank
Alpha1     215          94          385               3                    1
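
A rank comparison of this kind can be sketched as follows, with rank 1 assigned to the highest value of each variable. The provider names and values are hypothetical and do not come from Table 10, and the weighted outlier is again assumed to be the Z score of the paid amount divided by the number of beneficiaries.

```python
def rank_descending(scores):
    """Map each provider to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical values per provider.
benes = {"ProvA": 12, "ProvB": 120, "ProvC": 50, "ProvD": 40}
z_paid = {"ProvA": 1.27, "ProvB": 1.37, "ProvC": -0.37, "ProvD": -0.66}

# Assumed simple weighted outlier: Z score over beneficiaries.
weighted = {name: z_paid[name] / benes[name] for name in benes}

ranks = {
    "Benes": rank_descending(benes),
    "Zscore ProvPd": rank_descending(z_paid),
    "Wt Outlier": rank_descending(weighted),
}

for name in benes:
    print(name, {variable: ranks[variable][name] for variable in ranks})
```

In this sketch ProvA ranks last by number of beneficiaries and second by Z score, but first by the weighted outlier, which mirrors the movement shown for the provider in Table 10.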

A regression analysis was performed of the seed variables; seed and complex variables; and seed, complex and weighted outlier variables respectively. Table 11 shows a comparative analysis of the three different regression models by RSquare and RSquare Adjusted. These results indicate that an increase occurs in the RSquare and Adjusted RSquare of the regression model when the weighted outliers are added to the model. The increase of over ten percent is significant to the predictive value of the model. Hence, it could be inferred that the addition of the weighted outlier variables (to the seed and complex variables) enhances the model, and makes it more efficient and accurate in the detection of potential fraud.
Table 11. Comparative Analysis of the RSquare and RSquare Adjusted, by model
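
For reference, RSquare and RSquare Adjusted can be computed from a model's predictions as sketched below. These are the standard textbook definitions, which are assumed to match the statistics reported in Table 11; the actual and predicted values are hypothetical, and the predictor counts are illustrative only.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical provider paid amounts and predictions from two models.
actual = [10.0, 12.0, 15.0, 20.0, 30.0, 45.0, 50.0, 80.0]
pred_seed = [14.0, 15.0, 18.0, 22.0, 26.0, 38.0, 44.0, 60.0]
pred_with_weighted = [11.0, 12.5, 15.5, 19.0, 29.0, 44.0, 52.0, 76.0]

for label, predicted, predictors in (
    ("seed only", pred_seed, 2),
    ("seed + complex + weighted outlier", pred_with_weighted, 4),
):
    r2 = r_squared(actual, predicted)
    adj = adjusted_r_squared(r2, len(actual), predictors)
    print(f"{label:35s} RSquare={r2:.3f} RSquare Adj={adj:.3f}")
```

The adjusted statistic matters for the comparison in the text because adding variables never lowers RSquare on its own; an improvement that survives the adjustment is more likely to reflect real predictive value.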
Table 12. Regression Analysis of Seed Variables
[Figure: whole-model actual by predicted plot; response: N Sum of Provider Paid Amt]
Table 13. Regression Analysis of Seed and Complex Variables
[Figure: whole-model actual by predicted plot; response: N Sum of Provider Paid Amt]
Table 14. Regression Analysis of Seed, Complex, and Weighted Outlier Variables
[Figure: whole-model actual by predicted plot; response: N Sum of Provider Paid Amt]
Conclusion
The weighted outlier variable formula achieves the squeezing and pulling out effect by minimizing the similarities
and maximizing the differences in the data by simultaneously increasing the kurtosis and skewness vis--vis seed and
complex variables. The weighted outlier is an independent variable which also has the potential for improving data
modeling. The detection of fraud in private industry, as well as in government can be improved through the utilization
of weighted outlier variables.
World Headquarters
500 Frank W. Burr Boulevard,
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com
European Headquarters
Haymarket House
28-29 Haymarket
London SW1Y 4SP UK
Phone: +44 (0) 20 7321 4888
Fax: +44 (0) 20 7321 4890
Email: infouk@cognizant.com
India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com
Copyright 2009, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or
otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
DWBI&PM Practice at Cognizant
The Data Warehousing, Business Intelligence & Performance Management Practice is a single-point Center of
Excellence within Cognizant for designing and deploying full-fledged DWBI&PM solutions. With more than 5,610*
consultants across the globe, Cognizant's award-winning DWBI&PM practice is at the forefront of partnering leading
companies around the world in architecting pragmatic, business-focused, enterprise-wide BI solutions. The practice
has been recognized for its role in enabling BI excellence through prestigious industry awards, including three
Computerworld BI Perspectives Best Practices Awards, the DM Review Innovative Solution Award, the TDWI Award,
the Cognos Performance Leadership Award, the Cognos Excellence Award, and the Informatica Innovation Award.
For more information on Cognizant's DWBI&PM solutions, contact us at inquiry@cognizant.com or visit our website at http://www.cognizant.com
Note: * As of 30th Apr '09
About the Author
Alberto Roldan is a thought leader in Enterprise Analytics within the DWBI&PM Practice. He has over 20 years of experience
designing analytics solutions for organizations with large and complex technology landscapes. He specializes in adapting
proven analytics techniques and methods to real-world, data-intensive problems in neuroscience, medicine, physics and
chemistry. He has degrees from the University of Michigan and the University of Puerto Rico.
