
Data Leakage Detection

Under the guidance of Mr. P. Vijay Varma, Assistant Professor, Department of Computer Science

By K V Sai Phani, CSE 4th Year, 160709733055

ABSTRACT
1. A data distributor has given sensitive data to a set of supposedly trusted agents (third parties).
2. Some of the data are leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop).
3. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages.
4. These methods do not rely on alterations of the released data (e.g., watermarks).
5. In some cases, we can also inject realistic but fake data records to further improve our chances of detecting leakage and identifying the guilty party.

AGENDA

1. Introduction
2. Proposed System
3. Perturbation
4. Existing System
5. Allocation Strategies
6. Problem Setup and Notation
7. Data Allocation Problem
8. Fake Objects

Contd..

9. Agent Guilt Model
10. Guilt Model Analysis
11. Optimization
12. Conclusion
13. References

INTRODUCTION:
1. In the course of doing business, sensitive data must sometimes be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. We call the owner of the data the distributor and the supposedly trusted third parties the agents.
2. Our goal is to detect when the distributor's sensitive data has been leaked by agents and, if possible, to identify the agent that leaked the data.

PROPOSED SYSTEM:

1. Using the technique of perturbation, the data is made less sensitive before the agents handle it.
2. We develop a model for assessing the guilt of agents.
3. We also present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker.
4. Our goal is to detect when the distributor's sensitive data has been leaked by agents and, if possible, to identify the agent that leaked the data.

PERTURBATION:

1. Perturbation is a very useful technique in which the data is modified and made less sensitive before being handed over to the supposedly trusted third party.
2. For example, one can add random noise to a set of attributes or replace exact values with value ranges.
3. In some cases, however, it is important not to alter the distributor's data.
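To make the two perturbation styles above concrete, here is a minimal Python sketch (not part of the original slides; the record layout and field names are purely illustrative): additive random noise on a sensitive numeric attribute, and replacing an exact value with the range it falls into.

import random

def add_noise(records, attr, scale=1000.0, seed=0):
    # Style 1: add random noise to a sensitive numeric attribute.
    rng = random.Random(seed)
    return [{**r, attr: r[attr] + rng.gauss(0, scale)} for r in records]

def to_range(records, attr, width=10000):
    # Style 2: replace the exact value with the range it falls into.
    def bucket(v):
        lo = (v // width) * width
        return (lo, lo + width)
    return [{**r, attr: bucket(r[attr])} for r in records]

patients = [{"name": "A", "salary": 52000}, {"name": "B", "salary": 67000}]
print(add_noise(patients, "salary"))   # exact salaries replaced by noisy values
print(to_range(patients, "salary"))    # exact salaries replaced by ranges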

PERTURBATION GRAPH:

EXISTING SYSTEM:

1. Traditionally, leakage detection is handled by watermarking; e.g., a unique code is embedded in each distributed copy.
2. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified.
3. Watermarks can be very useful in some cases but, again, involve some modification of the original data.

RELATED WORK - CREATING A WATERMARK

RELATED WORK - VERIFYING A WATERMARK

RELATED WORK
The main idea is to generate a watermark W(x, y) using a secret key chosen by the sender, such that W(x, y) is indistinguishable from random noise for any entity that does not know the key (i.e., the recipients). The sender adds the watermark W(x, y) to the information object (image) I(x, y) before sharing it with the recipient(s). It is then hard for any recipient to guess the watermark W(x, y) (and subtract it from the watermarked image I′(x, y)); the sender, on the other hand, can easily extract and verify a watermark (because it knows the key).
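As an illustration of this idea (a minimal sketch, not the scheme of any specific watermarking system; it assumes a grayscale image stored as a NumPy array and a simple correlation-based detector), the watermark is generated from the secret key, added to the image, and later verified by the key holder:

import numpy as np

def make_watermark(shape, secret_key, strength=2.0):
    # W(x, y): pseudorandom noise seeded by the secret key; without the key
    # it is indistinguishable from random noise.
    rng = np.random.default_rng(secret_key)
    return strength * rng.standard_normal(shape)

def embed(image, secret_key):
    # I'(x, y) = I(x, y) + W(x, y)
    return image + make_watermark(image.shape, secret_key)

def verify(suspect, original, secret_key, threshold=0.5):
    # The sender regenerates W from the key and checks how strongly the
    # residual (suspect - original) correlates with it.
    w = make_watermark(original.shape, secret_key)
    residual = suspect - original
    corr = np.corrcoef(residual.ravel(), w.ravel())[0, 1]
    return corr > threshold

image = np.random.default_rng(0).uniform(0, 255, (64, 64))
print(verify(embed(image, 12345), image, 12345))                      # True
print(verify(image + np.random.default_rng(1).normal(size=(64, 64)),
             image, 12345))                                           # False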

ALLOCATION STRATEGIES

1.Explicit Data Requests


The goal of these experiments was, first, to see whether fake objects in the distributed data sets yield a significant improvement in our chances of detecting a guilty agent and, second, to evaluate our e-optimal algorithm relative to a random allocation.
The allocation of fake objects for explicit data requests proceeds as follows, where bi is the maximum number of fake objects agent Ui may receive and B is the distributor's total fake-object budget:

R ← ∅    (agents that can still receive fake objects)
for i = 1, . . . , n do
    if bi > 0 then
        R ← R ∪ {i}
    Fi ← ∅
while B > 0 do
    i ← SELECTAGENT(R, R1, . . . , Rn)
    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
    Ri ← Ri ∪ {f}
    Fi ← Fi ∪ {f}
    bi ← bi − 1
    if bi = 0 then
        R ← R \ {i}
    B ← B − 1
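A Python sketch of the loop above (illustrative only: SELECTAGENT and CREATEFAKEOBJECT are left open by the slides, so the random selection rule and helper names here are assumptions):

import random

def allocate_fake_objects(R, b, B, create_fake_object, select_agent=None):
    # R[i]: set of objects already allocated to agent i
    # b[i]: how many fake objects agent i may still receive
    # B   : total fake-object budget of the distributor
    n = len(R)
    F = [set() for _ in range(n)]                   # fake objects per agent
    eligible = {i for i in range(n) if b[i] > 0}    # "R" in the pseudocode
    if select_agent is None:
        # Placeholder strategy: pick a random eligible agent; an optimized
        # variant would pick the agent whose guilt probability gains most.
        select_agent = lambda elig, allocs: random.choice(sorted(elig))
    while B > 0 and eligible:
        i = select_agent(eligible, R)
        f = create_fake_object(R[i], F[i])          # fake object for agent i
        R[i].add(f)
        F[i].add(f)
        b[i] -= 1
        if b[i] == 0:
            eligible.discard(i)
        B -= 1
    return R, F

# Example: two agents, each allowed one fake object, total budget B = 2
ids = iter(range(1, 100))
make_fake = lambda Ri, Fi: f"fake{next(ids)}"
print(allocate_fake_objects([{"t1", "t2"}, {"t2", "t3"}], b=[1, 1], B=2,
                            create_fake_object=make_fake))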

CONTD.....
2.Sample Data Requests
With sample data requests, agents are not interested in particular objects. Hence, object sharing is not explicitly defined by their requests. The distributor is forced to allocate certain objects to multiple agents only if the number of requested objects exceeds the number of objects in set T. The more data objects the agents request in total, the more recipients on average an object has; and the more objects are shared among different agents, the more difficult it is to detect a guilty agent. The allocation proceeds as follows:

a ← 0|T|    (vector of |T| zeros; a[k]: number of agents who have received object tk)
R1 ← ∅, . . . , Rn ← ∅
remaining ← Σ(i=1..n) mi
while remaining > 0 do
    for all i = 1, . . . , n : |Ri| < mi do
        k ← SELECTOBJECT(i, Ri)    (may also use additional parameters)
        Ri ← Ri ∪ {tk}
        a[k] ← a[k] + 1
        remaining ← remaining − 1
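A Python sketch of this sample allocation (the "pick the least-shared object" rule used for SELECTOBJECT below is our own illustrative choice; the slides leave the selection rule open):

def allocate_samples(T, m):
    # T: list of data objects; m[i]: number of objects agent i requested (m[i] <= len(T))
    n = len(m)
    a = {t: 0 for t in T}            # a[t]: number of agents currently holding object t
    R = [set() for _ in range(n)]
    remaining = sum(m)
    while remaining > 0:
        for i in range(n):
            if len(R[i]) < m[i]:
                # SELECTOBJECT: least-shared object that agent i does not already hold
                k = min((t for t in T if t not in R[i]), key=lambda t: a[t])
                R[i].add(k)
                a[k] += 1
                remaining -= 1
    return R

# Example: four objects, two agents requesting 3 and 2 samples respectively
print(allocate_samples(["t1", "t2", "t3", "t4"], m=[3, 2]))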

PROBLEM SETUP AND NOTATION:


Entities and Agents:
1. A distributor owns a set T = {t1, . . . , tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, . . . , Un, but does not wish the objects to be leaked to other third parties.
2. An agent Ui receives a subset of objects Ri ⊆ T, determined either by a sample request or an explicit request.

PROBLEM SETUP AND NOTATION:


Entities and Agents:
3. Sample request Ri = SAMPLE(T, mi): any subset of mi records from T can be given to Ui.
4. Explicit request Ri = EXPLICIT(T, condi): agent Ui receives all T objects that satisfy condi.
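A toy illustration of the two request types (the record structure and the condition are hypothetical, chosen only to show the difference between SAMPLE and EXPLICIT):

import random

def sample_request(T, mi, seed=None):
    # SAMPLE(T, mi): any subset of mi objects from T may be given to the agent
    return random.Random(seed).sample(list(T), mi)

def explicit_request(T, condi):
    # EXPLICIT(T, condi): the agent receives every object of T satisfying condi
    return [t for t in T if condi(t)]

T = [{"id": k, "state": s} for k, s in enumerate(["CA", "NY", "CA", "TX"])]
print(sample_request(T, 2, seed=1))
print(explicit_request(T, condi=lambda t: t["state"] == "CA"))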

PROBLEM SETUP AND NOTATION:


Guilty Agents: Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps as part of a legal discovery process the target turned over S to the distributor.

RELATED WORK
1. The guilt detection approach we present is related to the data provenance problem.
2. As far as the data allocation strategies are concerned, our work is mostly relevant to watermarking, which is used as a means of establishing original ownership of distributed objects.

DATA ALLOCATION PROBLEM

1. The main focus of this paper is the data allocation problem: how can the distributor intelligently give data to agents in order to improve the chances of detecting a guilty agent?
2. The two types of requests we handle, as defined above, are sample and explicit. Fake objects are objects generated by the distributor that are not in set T. These objects are designed to look like real objects and are distributed to agents together with T objects, in order to increase the chances of detecting agents that leak data.

DATA ALLOCATION PROBLEM


3. As in Fig. 2, we represent our four problem instances with the names EF̄, EF, SF̄, and SF, where E stands for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake objects are not allowed. Note that, for simplicity, we are assuming that in the E problem instances all agents make explicit requests, while in the S instances all agents make sample requests. Our results can be extended to handle mixed cases, with some explicit and some sample requests. We provide here a small example to illustrate how mixed requests can be handled, but then do not elaborate further.

FAKE OBJECTS
1. The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable. The idea of perturbing data to detect leakage is not new.

Contd....
2. However, in most cases individual objects are perturbed, e.g., by adding random noise to sensitive salaries or adding a watermark to an image. In our case, we are perturbing the set of distributor objects by adding fake elements. In some applications, fake objects may cause fewer problems than perturbing real objects.
3. The distributor creates and adds fake objects to the data that he distributes to agents. We let Fi ⊆ Ri be the subset of fake objects that agent Ui receives.

Fig. 2: The four problem instances (EF̄, EF, SF̄, SF).

Contd..
The distributor can also use the function CREATEFAKEOBJECT() when it wants to send the same fake object to a set of agents. In this case, the function arguments are the union of the Ri and Fi tables, respectively, and the intersection of the conditions condi. We do not deal with the implementation of CREATEFAKEOBJECT() here.
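A minimal sketch of what a single-agent CREATEFAKEOBJECT() could look like (purely illustrative; the slides deliberately leave its implementation open, and the record format below is invented):

import itertools

_fake_ids = itertools.count(1)

def create_fake_object(Ri, Fi, condi=None):
    # Produce a record-like object that is not already in the agent's real set Ri
    # or fake set Fi; a fuller implementation would also make the values satisfy
    # the agent's condition condi and look statistically plausible next to real data.
    while True:
        f = ("record", f"P{next(_fake_ids):04d}")
        if f not in Ri and f not in Fi:
            return f

For the shared-fake-object case described above, the same function would simply be called with the union of the agents' Ri and Fi sets and the intersection of their conditions.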

LIMITATIONS
Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

AGENT GUILT MODEL


1. To compute the probability Pr{Gi|S} that agent Ui is guilty given the leaked set S, we need an estimate for the probability that values in S can be guessed by the target.
2. Assumption 1: For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the provenance of t′.
3. Assumption 2: An object t ∈ S can only be obtained by the target in one of two ways: a single agent Ui leaked t from its own Ri set, or the target guessed t (or obtained it through other means) without the help of any of the n agents.
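A sketch of how Pr{Gi|S} can be computed under Assumptions 1 and 2, assuming each leaked object was either guessed by the target with probability p or leaked by one of the agents holding it, each such agent being equally likely (this follows the spirit of the model; the precise estimator is developed in the underlying paper):

def guilt_probability(Ri, S, holders, p):
    # Ri      : set of objects given to agent Ui
    # S       : set of leaked objects found at the target
    # holders : dict mapping each object t to the set of agents that received t
    # p       : probability that the target guessed an object on its own
    #
    # Agent Ui is guilty unless, for every leaked object it holds, that object
    # was either guessed or leaked by some other agent holding it.
    prob_not_guilty = 1.0
    for t in S & Ri:
        prob_not_guilty *= 1.0 - (1.0 - p) / len(holders[t])
    return 1.0 - prob_not_guilty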

GUILT MODEL ANALYSIS


In order to see how our model parameters interact and to check whether the interactions match our intuition, we study two simple scenarios. In each scenario, we have a target that has obtained all of the distributor's objects, i.e., T = S.

IMPACT OF PROBABILITY P

In our first scenario, T contains 16 objects: all of them are given to agent U1 and only eight are given to a second agent U2. We calculate the probabilities Pr{G1|S} and Pr{G2|S} for p in the range [0, 1] and we present the results in Fig. 1a. The dashed line shows Pr{G1|S} and the solid line shows Pr{G2|S}.
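Plugging this scenario into the guilt sketch above reproduces the qualitative behavior described here (illustrative values, not the exact curves of Fig. 1a): both guilt probabilities fall as p grows, and U1 always looks at least as guilty as U2.

T = set(range(16))
R1, R2 = set(T), set(range(8))
holders = {t: ({"U1", "U2"} if t in R2 else {"U1"}) for t in T}
S = T   # the target has obtained every object
for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(p,
          round(guilt_probability(R1, S, holders, p), 3),   # Pr{G1|S}
          round(guilt_probability(R2, S, holders, p), 3))   # Pr{G2|S}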

IMPACT OF OVERLAP BETWEEN RI AND S

We again study two agents, one receiving all of the T = S data and the second receiving a varying fraction of the data. Fig. 1b shows the probability of guilt for both agents as a function of the fraction of the objects owned by U2, i.e., as a function of |R2 ∩ S|/|S|. In this case, p has a low value of 0.2, and U1 continues to have all 16 objects of S. Note that in our previous scenario, U2 had 50 percent of the S objects.

OPTIMIZATION :

In the Optimization Module, the distributor's data allocation to agents has one constraint and one objective. The distributor's constraint is to satisfy agents' requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions. His objective is to be able to detect an agent who leaks any portion of his data.
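One way to make this objective concrete (our own illustration, built on the guilt sketch above rather than taken from the slides) is to measure, for each agent Ui, how much more guilty Ui would look than any other agent if Ui's entire allocation Ri were leaked; an allocation giving every agent a large margin makes the leaker easier to single out.

def detection_margin(i, R, holders, p):
    # Difference between agent Ui's guilt and that of the most suspicious other
    # agent, assuming Ui leaked its whole set (S = R[i]); larger is better for
    # the distributor. R is the list of per-agent allocations.
    S = R[i]
    gi = guilt_probability(R[i], S, holders, p)
    others = [guilt_probability(R[j], S, holders, p)
              for j in range(len(R)) if j != i]
    return gi - max(others)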

CONCLUSION:

1. In a perfect world, there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world, we could watermark each object so that we could trace its origins with absolute certainty. 2. However, in many cases, we must indeed work with agents that may not be 100 percent trusted, and we may not be certain if a leaked object came from an agent or from some other source, since certain data cannot admit watermarks. In spite of these difficulties, we have shown that it is possible to assess the likelihood that an agent is responsible for a leak, based on the overlap of his data with the leaked data and the data of other agents, and based on the probability that objects can be guessed by other means. Our model is relatively simple, but we believe that it captures the essential trade-offs.

REFERENCES
[1] R. Agrawal and J. Kiernan, "Watermarking Relational Databases," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB '02), VLDB Endowment, pp. 155-166, 2002.
[2] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, "An Algebra for Composing Access Control Policies," ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35, 2002.
[3] P. Buneman, S. Khanna, and W.C. Tan, "Why and Where: A Characterization of Data Provenance," Proc. Eighth Int'l Conf. Database Theory (ICDT '01), J.V. den Bussche and V. Vianu, eds., pp. 316-330, Jan. 2001.
[4] P. Buneman and W.-C. Tan, "Provenance in Databases," Proc. ACM SIGMOD, pp. 1171-1173, 2007.
[5] Y. Cui and J. Widom, "Lineage Tracing for General Data Warehouse Transformations," The VLDB J., vol. 12, pp. 41-58, 2003.

CONTD...
[6] S. Czerwinski, R. Fromm, and T. Hodes, "Digital Music Distribution and Audio Watermarking," http://www.scientificcommons.org/43025658, 2007.
[7] F. Guo, J. Wang, Z. Zhang, X. Ye, and D. Li, "An Improved Algorithm to Watermark Numeric Relational Data," Information Security Applications, pp. 138-149, Springer, 2006.
[8] F. Hartung and B. Girod, "Watermarking of Uncompressed and Compressed Video," Signal Processing, vol. 66, no. 3, pp. 283-301, 1998.

THANK YOU
