Report On Present Investigation: 3.1 Introduction To Data Leakage
Data leakage happens every day: confidential business information such as customer or patient data, source code, design specifications, price lists, intellectual property, trade secrets, and spreadsheet forecasts and budgets is leaked out. This report considers a problem in which a data distributor has given sensitive data to a set of supposedly trusted agents, and some of the data are later leaked and found in an unauthorized place. The danger of data leakage is that once the data are no longer within the distributor's domain, the company is at serious risk. The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.
We propose data allocation strategies (across the agents) that improve the probability of
identifying leakages. These methods do not rely on alterations of the released data. In some
cases, we can also inject “realistic but fake” data records to further improve our chances of
detecting leakage and identifying the guilty party.
A further modification is applied to overcome the problems of the current algorithm by intelligently distributing data objects among the agents in such a way that identifying a guilty agent becomes simple.
3.1 Introduction to Data Leakage
Data Leakage, put simply, is the unauthorized transmission of data (or information) from within an organization to an external destination or recipient. Leakage may occur either intentionally or unintentionally, and may be caused by internal or external users. Internal users are authorized users of the system who can access the data under a valid access control policy, whereas an external intruder accesses data through an attack on the target machine, which may be either active or passive. The transmission may be electronic, or may be via a physical method. Data Leakage is synonymous with the term Information Leakage; it harms the image of the organization and makes clients hesitant to continue their relationship with the distributor, since the distributor has shown itself unable to protect sensitive information.
The reader is encouraged to be mindful that unauthorized does not automatically
mean intentional or malicious. Unintentional or inadvertent data leakage is also
unauthorized.
Data leakage by type of information:
Confidential information: 15
Intellectual property: 4
Customer data: 73
Health record: 8
The prevention mechanism deals with DLP: suites of technology that prevent leakage of data by classifying sensitive information, monitoring it, and blocking access through various access control policies. Education also prevents leakage, since most of the time leakage occurs unintentionally through internal users. The detection process detects leakage of information that was distributed to supposedly trustworthy third parties, called agents, and calculates each agent's involvement in the leakage.
Data Leakage Detection
Organizations used to think of data/information security only in terms of protecting their network from intruders (e.g., hackers). But with the growing amount of data, rapid growth in the size of organizations (e.g., due to globalization), the rise in the number of data points (machines and servers), and easier modes of communication, accidental or even deliberate leakage of data from within the organization has become a painful reality. This has led to growing awareness about information security in general, and about outbound content management in particular.
In the course of doing business, sometimes sensitive data must be handed over to supposedly
trusted third parties.
For example, a hospital may give patient records to researchers who will devise new
treatments. Similarly, a company may have partnerships with other companies that require
sharing customer data. Another enterprise may outsource its data processing, so data must be
given to various other companies. We call the owner of the data the distributor and the
supposedly trusted third parties the agents. Our goal is to detect when the distributor's sensitive data have been leaked by one of the agents and, if possible, to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique in which the data are modified and made "less sensitive" before being handed to agents: one can add random noise to certain attributes, or one can replace exact values by ranges [4]. However, in some cases it is important not to alter the original distributor's data, because perturbed data would be useless to the receiver. For example, an employee's contact number is sensitive yet cannot be altered before being sent to a third party for recruitment, and similarly a perturbed bank account number is of no use for any downstream process. To handle such situations, an effective method is required that distributes the data without modifying the valuable information. For instance, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers.
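For contrast, the two perturbation techniques just mentioned can be sketched as follows; the function names and parameters are illustrative assumptions, not from this report.

import random

def perturb_with_noise(value: float, scale: float = 0.05) -> float:
    # Add small random noise to a numeric attribute such as a salary.
    return value * (1 + random.uniform(-scale, scale))

def perturb_to_range(value: float, bucket: int = 10000) -> str:
    # Replace an exact value by the range (bucket) it falls into.
    low = int(value // bucket) * bucket
    return str(low) + "-" + str(low + bucket)

# A payroll outsourcer given such perturbed values could not pay employees
# correctly, which is why this report assumes data must be sent unaltered.
print(perturb_with_noise(52000.0))   # e.g. 53104.7
print(perturb_to_range(52000.0))     # "50000-60000"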
Fig 3.4: Data Leakage Detection Process
3.3.8 Guilt detection model
This model helps to determine whether an agent is responsible for the leakage of data, or whether the data set obtained by the target was acquired by some other means. It requires complete domain knowledge to calculate the probability p used to evaluate a guilty agent. From domain knowledge, proper analysis, and experiment, a probability factor is calculated, which acts as the threshold of evidence for declaring an agent guilty. That is, when the number of leaked records exceeds what the specified probability would explain, the agent is considered guilty; when the number of leaked records is lower, the agent is said to be not guilty, because in that situation it is possible that the leaked objects were obtained by the target by some other means.
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps as part of a legal discovery process the target turned over S to the distributor. Since the agents U1, …, Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent, and that the S data were obtained by the target through other means.
For example, say one of the objects in S represents a customer X. Perhaps X is also a customer of some other company, and that company provided the data to the target. Or perhaps X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did
not leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target
obtained them through other means. Not only do we want to estimate the likelihood the
agents leaked data, but we would also like to find out if one of them in particular was more
likely to be the leaker.
For instance, if one of the S objects was only given to agent U1, while the other objects were given to all agents, we may suspect U1 more. The model we present next captures this intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty for a given leaked set S by {Gi | S}. Our next step is to estimate Pr{Gi | S}, i.e., the probability that agent Ui is guilty given evidence S.
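For illustration, the following Python sketch computes this probability under the assumptions used in this chapter: every object has the same guessing probability p, agents leak independently, and a leaked object was either guessed by the target or leaked, with equal likelihood, by one of the agents holding it. The closed form and the helper names are assumptions of the sketch, not formulas stated in this report.

from typing import Dict, Set

def guilt_probability(i: str, S: Set[str], R: Dict[str, Set[str]], p: float) -> float:
    # Estimate Pr{Gi | S}: the probability that agent i is guilty
    # given the leaked set S. R maps each agent to the objects it received.
    prob_innocent = 1.0
    for t in S:
        holders = [a for a, objects in R.items() if t in objects]
        if i in holders:
            # Agent i leaked t with probability (1 - p) / |holders|.
            prob_innocent *= 1.0 - (1.0 - p) / len(holders)
    return 1.0 - prob_innocent

# Example: "t1" was given only to U1, "t2" to both agents.
R = {"U1": {"t1", "t2"}, "U2": {"t2"}}
S = {"t1", "t2"}
print(guilt_probability("U1", S, R, p=0.2))  # higher: U1 alone held t1
print(guilt_probability("U2", S, R, p=0.2))  # lower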
4.2 Guilt Agent Detection
We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the emails of, say, 100 individuals. If this person can find, say, 90 emails, then we can reasonably guess that the probability of finding one email is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target [2]. For simplicity we assume that all T objects have the same pt, which we call p.
Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent's decision to leak an object is not related to other objects.
R2. However, say that the distributor can create one fake object (B = 1) and both agents can receive one fake object (b1 = b2 = 1). In this case, the distributor can add one fake object to either R1 or R2 to increase the corresponding denominator of the summation term. Assume that the distributor creates a fake object f and gives it to agent U1. Agent U1 now has R1 = {t1, t2, f} and F1 = {f}, and the value of the sum-objective decreases to 1.33 < 1.5.
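The quoted values can be reproduced if the sum-objective is taken to be the sum over agents of the overlaps |Ri ∩ Rj| scaled by 1/|Ri| (an assumed form, consistent with the remark about "the corresponding denominator of the summation term"). With R1 = {t1, t2} and R2 = {t1} before the fake object is added:

\sum_{i=1}^{2} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|
  = \frac{|R_1 \cap R_2|}{|R_1|} + \frac{|R_2 \cap R_1|}{|R_2|}
  = \frac{1}{2} + \frac{1}{1} = 1.5

After f is added to R1, |R1| grows to 3 while the overlap is unchanged, so the value becomes 1/3 + 1 ≈ 1.33.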
If the distributor is able to create more fake objects, he could further improve the objective. Algorithm 1 is a general "driver" that will be used for allocation in the case of explicit requests with fake records. In the algorithm, a random agent is first selected from the list and its request is analyzed; after the computation, fake records are created by the function CREATEFAKEOBJECT(), added to the data set, and the data set is given back to the requesting agent. Fake records help in identifying the guilty agent from the leaked data set.
Algorithm 1: Allocation for Explicit Data Requests (EF)
Input: R1, …, Rn, cond1, …, condn, b1, …, bn, B
Output: R1, …, Rn, F1, …, Fn

R ← ∅
for i = 1, …, n do
    if bi > 0 then
        R ← R ∪ {i}
        Fi ← ∅
while B > 0 do
    i ← SELECTAGENT(R, R1, …, Rn)
    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
    Ri ← Ri ∪ {f}
    Fi ← Fi ∪ {f}
    bi ← bi - 1
    if bi = 0 then
        R ← R \ {i}
    B ← B - 1
Algorithm 2: Agent Selection for e-random

function SELECTAGENT(R, R1, …, Rn)
    i ← select an agent from R at random
    return i
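The following Python sketch gives one possible executable rendering of Algorithms 1 and 2; the representation of objects as strings and the placeholder fake-object generator are illustrative assumptions.

import random
from typing import Dict, List, Set

def create_fake_object(Ri: Set[str], Fi: Set[str], cond: str) -> str:
    # Placeholder for the black-box CREATEFAKEOBJECT() described later.
    return "fake_" + cond + "_" + str(len(Fi))

def select_agent(R: List[int]) -> int:
    # Algorithm 2 (e-random): pick an agent uniformly at random.
    return random.choice(R)

def allocate_explicit_ef(Rs: Dict[int, Set[str]], conds: Dict[int, str],
                         b: Dict[int, int], B: int) -> Dict[int, Set[str]]:
    # Algorithm 1 (EF): hand out up to B fake objects, at most b[i] per agent.
    Fs: Dict[int, Set[str]] = {i: set() for i in Rs}
    R = [i for i in Rs if b[i] > 0]   # agents still eligible for fake objects
    while B > 0 and R:
        i = select_agent(R)
        f = create_fake_object(Rs[i], Fs[i], conds[i])
        Rs[i].add(f)
        Fs[i].add(f)
        b[i] -= 1
        if b[i] == 0:
            R.remove(i)
        B -= 1
    return Fs

# Example: two agents, one fake object available in total (B = 1).
Rs = {1: {"t1", "t2"}, 2: {"t1"}}
print(allocate_explicit_ef(Rs, {1: "CA", 2: "CA"}, {1: 1, 2: 1}, B=1))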
Flow chart for implementation of the above algorithm:
(a) Allocation for Explicit Data Requests (EF) with fake objects (flow chart figure omitted).
Sample Data Requests
With sample data requests, each agent Ui may receive any subset of the entire distribution set T, so there are many different possible object allocations. In every allocation, the distributor can permute the T objects and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects, and not on the identity of the leaked objects. The distributor's problem is to pick one allocation that optimizes his objective. The distributor can increase the number of possible allocations by adding fake objects.
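This permutation invariance can be checked directly against the guilt_probability sketch given earlier (an illustrative helper, not code from this report): relabeling the objects while keeping the allocation pattern fixed leaves the guilt estimate unchanged.

# Swap labels t1 <-> t2 everywhere; the allocation pattern is identical.
R_before = {"U1": {"t1", "t2"}, "U2": {"t2"}}
R_after  = {"U1": {"t1", "t2"}, "U2": {"t1"}}
assert guilt_probability("U1", {"t1", "t2"}, R_before, p=0.2) == \
       guilt_probability("U1", {"t1", "t2"}, R_after, p=0.2)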
Algorithm 3:
Algorithm 4: Object Selection

function SELECTOBJECT(i, Ri)
    k ← select at random an element from the set {k′ | tk′ ∉ Ri}
    return k
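In Python, the same selection step could be sketched as follows (the names are illustrative):

import random
from typing import Sequence, Set

def select_object(Ri: Set[str], T: Sequence[str]) -> str:
    # Pick, uniformly at random, an object that agent i has not yet received.
    candidates = [t for t in T if t not in Ri]
    return random.choice(candidates)

print(select_object({"t1"}, ["t1", "t2", "t3"]))  # prints "t2" or "t3"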
Flow chart for implementation of the following algorithm:
(a) Allocation for Sample Data Requests without any fake objects (flow chart figure omitted).
"In both of the following cases, SelectMethod() returns the value of Σ |Ri ∩ Rj|."
Problem instances differ depending on the type of data requests made by agents (E for explicit and S for sample requests) and whether "fake objects" are allowed (F for the use of fake objects, and F̄ for the case where fake objects are not allowed). Fake objects are objects generated by the distributor that are not in set T.
The objects are designed to look like real objects, and are distributed to agents
together with the T objects, in order to increase the chances of detecting agents that leak data.
Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 = SAMPLE(T′, 1), where T′ = EXPLICIT(T, cond2). Further, say that cond1 is "state = CA" (objects have a state field). If agent U2 has the same condition, cond2 = cond1, we can create an equivalent problem with sample data requests on set T′. That is, our problem becomes how to distribute the CA objects to two agents, with R1 = SAMPLE(T′, |T′|) and R2 = SAMPLE(T′, 1). If instead U2 uses condition "state = NY," we can solve two different problems, for sets T′ and T − T′; each problem will involve only one agent. Finally, if the conditions partially overlap, R1 ∩ T′ ≠ ∅ but R1 ⊄ T′, we can solve three different problems, for sets R1 − T′, R1 ∩ T′, and T′ − R1.
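The decomposition in the overlapping case is plain set algebra; the following sketch (with invented data) shows the three resulting sub-problems:

# Split an overlapping instance into three independent sub-problems.
R1      = {"ca1", "ca2", "ny1"}        # agent U1's objects
T_prime = {"ca1", "ca2", "ca3"}        # objects satisfying cond2
subproblems = [R1 - T_prime, R1 & T_prime, T_prime - R1]
print(subproblems)   # [{'ny1'}, {'ca1', 'ca2'}, {'ca3'}]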
Fake Object
The distributor may be able to add fake objects to the distributed data in order to
improve his effectiveness in detecting guilty agents. However, fake objects may impact the
correctness of what agents do, so they may not always be allowable. The idea of perturbing data to detect leakage is not new. However, in most cases individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or by adding a watermark to an image. In our case, the set of distributor objects is perturbed by adding fake elements.
For example, say the distributed data objects are medical records and the agents are
hospitals. In this case, even small modifications to the records of actual patients may be
undesirable. However, the addition of some fake medical records may be acceptable, since no
patient matches these records, and hence no one will ever be treated based on fake records. A trace file is maintained to identify the guilty agent; trace files are a type of fake object that helps identify improper use of data. The creation of fake but real-looking objects is a nontrivial problem whose thorough investigation is beyond the scope of this report. Here, we model the creation of a fake object for agent Ui as a black-box function CREATEFAKEOBJECT(Ri, Fi, condi) that takes as input the set of all objects Ri, the subset of fake objects Fi that Ui has received so far, and condi, and returns a new fake object. This function needs condi to produce a valid object that satisfies Ui's condition. Set Ri is needed as input so that the created fake object is not only valid but also indistinguishable from other real objects.
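One possible black-box implementation is sketched below; the record schema and the "field = value" condition format are invented for illustration and are not specified in this report.

import random
from typing import Dict, List, Set

def create_fake_object(Ri: List[Dict], Fi: Set[str], cond: str) -> Dict:
    # Create a fake record that satisfies cond and blends in with Ri.
    # cond is assumed to have the form "field = value", e.g. "state = CA".
    field, value = [s.strip() for s in cond.split("=")]
    fake = dict(random.choice(Ri))        # copy a real record's shape
    fake[field] = value                   # ensure the agent's condition holds
    fake["id"] = "fake-" + str(len(Fi))   # unique marker usable as a trace
    return fake

Ri = [{"id": "r1", "state": "CA", "name": "A. Smith"}]
print(create_fake_object(Ri, set(), "state = CA"))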
Optimization Problem
The distributor’s data allocation to agents has one constraint and one objective. The
distributor’s constraint is to satisfy agents’ requests, by providing them with the number of
objects they request or with all available objects that satisfy their conditions. His objective is
to be able to detect an agent who leaks any of his data objects.
We consider the constraint as strict. The distributor may not deny serving an agent
request and may not provide agents with different perturbed versions of the same objects. We
consider fake object allocation as the only possible constraint relaxation. Our detection
objective is ideal and intractable. Detection would be assured only if the distributor gave no
data object to any agent. We use instead the following objective: maximize the chances of
detecting a guilty agent that leaks all his objects.
We now introduce some notation to state the distributor's objective formally. Recall that Pr{Gj | S = Ri}, or simply Pr{Gj | Ri}, is the probability that agent Uj is guilty if the distributor discovers a leaked table S that contains all of the Ri objects. We define the difference function Δ(i, j) as:

Δ(i, j) = Pr{Gi | S = Ri} - Pr{Gj | S = Ri},   i, j = 1, …, n   (6)

Note that the differences have non-negative values: given that set Ri contains all the leaked objects, agent Ui is at least as likely to be guilty as any other agent. Difference Δ(i, j) is positive for any agent Uj whose set Rj does not contain all the data of S. It is zero if Ri ⊆ Rj; in this case the distributor will consider both agents Ui and Uj equally guilty, since they have both received all the leaked objects. The larger a Δ(i, j) value is, the easier it is to identify Ui as the leaking agent. Thus, we want to distribute data so that Δ values are large.
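Continuing the illustrative guilt_probability sketch from earlier (again an assumed helper, not the report's own code), Δ(i, j) can be computed directly:

from typing import Dict, Set

def delta(i: str, j: str, R: Dict[str, Set[str]], p: float) -> float:
    # Difference function (6): how much more guilty Ui looks than Uj
    # when the leaked set S equals Ri.
    S = R[i]
    return guilt_probability(i, S, R, p) - guilt_probability(j, S, R, p)

R = {"U1": {"t1", "t2"}, "U2": {"t2"}}
print(delta("U1", "U2", R, p=0.2))  # positive: R2 does not contain t1
print(delta("U2", "U1", R, p=0.2))  # zero: R2 ⊆ R1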