
Optimizing Information Leakage in Multicloud Services

ABSTRACT

Many schemes have recently been advanced for storing data on multiple clouds.
Distributing data over different cloud storage providers (CSPs) automatically gives users
a certain degree of control over information leakage, since no single point of attack can leak
all the information. However, unplanned distribution of data chunks can lead to high
information disclosure even when multiple clouds are used. In this paper, we study an
important information leakage problem caused by unplanned data distribution in multicloud
storage services. We then present StoreSim, an information-leakage-aware multicloud storage
system. StoreSim aims to store syntactically similar data on the same cloud, thus minimizing
the user's information leakage across multiple clouds. We design an approximate algorithm
to efficiently generate similarity-preserving signatures for data chunks based on MinHash
and Bloom filters, and also design a function to compute the information leakage based on
these signatures. Next, we present an effective storage plan generation algorithm, based on
clustering, for distributing data chunks with minimal information leakage across multiple
clouds.
1. INTRODUCTION

1.1 INTRODUCTION:

With the increasingly rapid uptake of devices such as laptops, cellphones and tablets,
users require ubiquitous and massive network storage to handle their ever-growing digital
lives. To meet these demands, many cloud-based storage and file sharing services, such as
Dropbox, Google Drive and Amazon S3, have gained popularity due to their easy-to-use
interfaces and low storage cost. However, these centralized cloud storage services are
criticized for seizing control of users' data, which allows storage providers to run
analytics for marketing and advertising. Moreover, the information in users' data can be
leaked, e.g., by means of malicious insiders, backdoors, bribery and coercion. One possible
solution to reduce the risk of information leakage is to employ multicloud storage systems in
which no single point of attack can leak all the information. A malicious entity, such as the
one revealed in recent attacks on privacy, would be required to coerce all the different CSPs
on which a user might place her data in order to get a complete picture of it. Put simply,
as the saying goes, do not put all your eggs in one basket.

Yet, the situation is not so simple. CSPs such as Dropbox, among many others,
employ rsync-like protocols to synchronize local files to remote files in their centralized
clouds. Every local file is partitioned into small chunks, and these chunks are hashed with
fingerprinting algorithms such as SHA-1 or MD5. Thus, a file's contents can be uniquely
identified by its list of hashes. For each update of a local file, only chunks with changed
hashes are uploaded to the cloud. This hash-based synchronization differs from
diff-like protocols, which compare two versions of the same file line by line,
detect the exact updates, and upload only those updates in a patch style.
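The chunk-and-fingerprint scheme described above can be sketched in a few lines. This is an illustrative toy, not the actual prototype (which is written in Java): the 4-byte chunk size and file contents are invented for the example, whereas real services use much larger chunks.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny, for illustration only (services use e.g. 512 KB)

def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split a file into fixed-size chunks and fingerprint each with SHA-1."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

old = chunk_hashes(b"aaaabbbbcccc")
new = chunk_hashes(b"aaaaBBBBcccc")  # only the middle chunk was edited

# Only chunks whose hashes changed need to be re-uploaded.
changed = [i for i, (h1, h2) in enumerate(zip(old, new)) if h1 != h2]
print(changed)  # [1]
```

Note that even a one-byte edit causes the whole chunk containing it to be re-uploaded; the protocol never sees which bytes inside the chunk changed.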

Instead, the hash-based synchronization model needs to upload the entire chunks with
changed hashes to the cloud. Thus, in the multicloud environment, two chunks differing only
very slightly can be distributed to two different clouds. The following motivating example
shows that if chunks of a user's data are assigned to different CSPs in an unplanned
manner, the information leaked to each CSP can be higher than expected. Suppose that we
have a storage service with three CSPs S1, S2, S3 and a user's dataset D. All the user's data
will first be chunked and then uploaded to different clouds. The dataset D is represented as
the set of hashes generated from its data chunks. This scenario is shown in the accompanying
figure. In addition, we assume that the data chunks are distributed to the different clouds
in a round-robin (RR) fashion.

Apparently, RR is good for balancing the storage load, and each cloud thus obtains the
same amount of data. However, the same amount of data does not necessarily mean the same
amount of information. For example, if we find that the chunks {C3, C6, C9} are almost the
same, then S3 actually obtains information equivalent to that in only one chunk. If all
other chunks are different, S1 and S2 each obtain three times as much information. This
problem does not exist in a single-cloud storage service such as Dropbox, since users have
no choice but to give all their information to the one cloud.
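The round-robin assignment in this example can be reproduced directly; the chunk and cloud names below follow the text, and the snippet is a sketch in Python rather than part of the actual system:

```python
# Distribute nine chunks C1..C9 over three clouds S1, S2, S3 round-robin.
clouds = {"S1": [], "S2": [], "S3": []}
names = list(clouds)
for i in range(1, 10):
    clouds[names[(i - 1) % 3]].append(f"C{i}")

print(clouds["S3"])  # ['C3', 'C6', 'C9']

# Each cloud holds the same amount of *data*, but if C3, C6 and C9 are
# near-duplicates, S3 effectively learns only one chunk's worth of
# *information*, while S1 and S2 each learn three.
```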

When the storage is in the multicloud, we have the opportunity to minimize the total
information that is leaked to each CSP. The optimal case is that each CSP obtains the same
amount of information. In our example, data distribution based on RR can achieve this
optimal result only if all the chunks are different. However, this is not the case in cloud
storage services, for two reasons: 1) frequent modifications of files by users result in a
large number of similar chunks; and 2) similar chunks occur across files, which is precisely
why existing CSPs use the data deduplication technique.


1.2 FEATURES OF CLOUD COMPUTING

Data outsourcing to cloud storage servers is a rising trend among many firms and users
owing to its economic advantages. This essentially means that the owner (client) of the data
moves its data to a third-party cloud storage server which is supposed to, presumably for a
fee, faithfully store the data and provide it back to the owner whenever required.

As data generation is far outpacing data storage, it proves costly for small firms to
frequently update their hardware whenever additional data is created. Maintaining the
storage can also be a difficult task. Outsourcing data to cloud storage helps such firms
by reducing the costs of storage, maintenance and personnel. It can also assure reliable
storage of important data by keeping multiple copies of the data, thereby reducing the chance
of losing data to hardware failures.

Storing user data in the cloud, despite its advantages, raises many interesting security
concerns which need to be extensively investigated before it becomes a reliable solution to the
problem of avoiding local storage of data. In this paper we deal with the problem of
implementing a protocol for obtaining a proof of data possession in the cloud, sometimes
referred to as proof of retrievability (POR). This problem tries to obtain and verify a proof
that the data stored by a user at a remote data storage in the cloud (called cloud storage
archives, or simply archives) is not modified by the archive, thereby assuring the integrity
of the data.

Such verification systems prevent the cloud storage archives from misrepresenting or
modifying the data stored at them without the consent of the data owner, by using frequent
checks on the storage archives. Such checks must allow the data owner to efficiently,
frequently, quickly and securely verify that the cloud archive is not cheating the owner.
Cheating, in this context, means that the storage archive might delete or modify some of
the data.
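As a minimal illustration of such a check, the owner can send a fresh random challenge and compare the archive's response against one computed locally. This naive sketch assumes the owner keeps a full local copy of the data, which real POR schemes are specifically designed to avoid; all names here are hypothetical:

```python
import hashlib
import os

def respond(stored_data: bytes, challenge: bytes) -> bytes:
    """Computed by the archive over whatever data it actually stores."""
    return hashlib.sha256(challenge + stored_data).digest()

def verify(original: bytes, challenge: bytes, response: bytes) -> bool:
    """Run by the owner: does the archive's answer match the true data?"""
    return respond(original, challenge) == response

data = b"patient records, version 1"
challenge = os.urandom(16)  # a fresh nonce prevents the archive replaying old answers
print(verify(data, challenge, respond(data, challenge)))          # True: honest archive
print(verify(data, challenge, respond(b"tampered!", challenge)))  # False: cheating archive
```

Because the challenge is unpredictable, the archive cannot precompute answers; it must actually possess the unmodified data at verification time.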
Report on Present Investigation
Data leakage happens every day when confidential business information, such as customer
or patient data, source code or design specifications, price lists, intellectual property and
trade secrets, or forecasts and budgets in spreadsheets, is leaked out. In this report we
consider a problem in which a data distributor has given sensitive data to a set of supposedly
trusted agents, and some of the data is leaked and found in an unauthorized place. The danger
of data leakage is that once data is no longer within the distributor's domain, the company is
at serious risk. The distributor must assess the likelihood that the leaked data came from one
or more agents, as opposed to having been independently gathered by other means.
We propose data allocation strategies (across the agents) that improve the probability of
identifying leakages. These methods do not rely on alterations of the released data. In some
cases, we can also inject "realistic but fake" data records to further improve our chances of
detecting leakage and identifying the guilty party.
A further modification is applied in order to overcome the problems of the current algorithm,
by intelligently distributing data objects among the various agents in such a way that
identifying the guilty agent becomes simple.
Introduction to Data Leakage
Data leakage, put simply, is the unauthorized transmission of data (or information)
from within an organization to an external destination or recipient. Leakage may occur either
intentionally or unintentionally, by an internal or external user. Internal users are
authorized users of the system who can access the data through a valid access control policy,
whereas an external intruder accesses data through some attack, either active or passive, on
the target machine. The leak may be electronic or via a physical method. Data leakage is
synonymous with the term information leakage; it harms the image of the organization and
makes clients wary of continuing their relationship with the distributor, as it is unable to
protect their sensitive information.
The reader is encouraged to be mindful that unauthorized does not automatically
mean intentional or malicious. Unintentional or inadvertent data leakage is also
unauthorized.
Data Leakage

According to data compiled from EPIC.org and PerkinsCoie.com, which surveyed data
leakage across various organizations, 52% of data security breaches come from internal
sources, compared with the remaining 48% caused by external hackers; hence protection is
needed from internal users as well.

Type of data leaked          Percentage
Confidential information         15
Intellectual property             4
Customer data                    73
Health record                     8

Types of data leaked


The noteworthy aspect of these figures is that, when the internal breaches are
examined, the percentage due to malicious intent is remarkably low, at less than 1%, while
the level of inadvertent data breach is significant (96%). This is further deconstructed into
46% due to employee oversight and 50% due to poor business process.

The Leaking Faucet


Data protection programs at most organizations are concerned with protecting
sensitive data from external malicious attacks, relying on technical controls that include
perimeter security, network/wireless surveillance and monitoring, application and endpoint
security management, and user awareness and education. But what about inadvertent data
leaks that aren't so sensational,
for example, unencrypted information on a lost or stolen laptop, USB drive or other device?
Like the steady drip from a leaking faucet, everyday data leaks are making headlines more
often than the nefarious attack scenarios around which organizations plan most, if not all, of
their data leakage prevention methods. However, to truly protect their critical data,
organizations also need to adopt a more data-centric approach in their security programs to
protect sensitive data against such leaks.
Organizations focus on protecting sensitive data from external malicious attacks and
from internal staff who can access those data, relying on technical controls that include
perimeter security, network/wireless surveillance and monitoring, application and endpoint
security management, user awareness and education, and DLP solutions. But what about data
leaks by trusted third parties, called agents, who are not present inside the network and
whose activity is not easily traceable? For this situation, some care must be taken so that
the data is not misused by them.
Various kinds of sensitive information, such as financial data, private data, credit card
information, health record information, confidential information and personal information,
are held by different organizations, and their leakage can be prevented in various ways,
as shown in the figure. The leaking faucet's countermeasures include education, prevention
and detection to control the leakage of sensitive information. Education remains the most
important of all the protection measures; it includes training and awareness programs on
handling sensitive information and its importance for the organization.
The figure represents a faucet for data leakage: in the center, different kinds of sensitive
information are placed, surrounded by the protection mechanisms that prevent leakage of
valuable information from the organization.
The Leaking Faucet

The prevention mechanism deals with DLP, a suite of technologies that prevents leakage
of data by classifying sensitive information, monitoring it, and restricting user access
through various access control policies. Education prevents leakage because, most of the
time, leakage occurs unintentionally through internal users. The detection process detects
the leakage of information distributed to trusted third parties, called agents, and determines
their involvement in the leakage.
CHAPTER 2

2. LITERATURE REVIEW

Depsky: dependable and secure storage in a cloud-of-clouds

The increasing popularity of cloud storage services has led companies that handle
critical data to consider using these services for their storage needs. Medical record
databases, large biomedical datasets, historical information about power systems and
financial data are some examples of critical data that could be moved to the cloud. However,
the reliability and security of data stored in the cloud remain major concerns. In this
work we present DepSky, a system that improves the availability, integrity, and
confidentiality of information stored in the cloud through the encryption, encoding, and
replication of the data on diverse clouds that form a cloud-of-clouds. We deployed our
system using four commercial clouds and used PlanetLab to run clients accessing the service
from different countries. We observed that our protocols improved the perceived availability
and, in most cases, the access latency when compared with cloud providers individually.
Moreover, the monetary cost of using DepSky in this scenario is at most twice the cost of
using a single cloud, which is optimal and seems to be a reasonable cost, given the benefits.

Nccloud: A network-coding-based storage system in a cloud-of-clouds

To provide fault tolerance for cloud storage, recent studies propose to stripe data
across multiple cloud vendors. However, if a cloud suffers from a permanent failure and loses
all its data, we need to repair the lost data with the help of the other surviving clouds to
preserve data redundancy. We present a proxy-based storage system for fault-tolerant
multiple-cloud storage called NCCloud, which achieves cost-effective repair for a permanent
single-cloud failure. NCCloud is built on top of a network-coding-based storage scheme
called the functional minimum-storage regenerating (FMSR) codes, which maintain the same
fault tolerance and data redundancy as in traditional erasure codes (e.g., RAID-6), but use
less repair traffic and, hence, incur less monetary cost due to data transfer. One key design
feature of our FMSR codes is that we relax the encoding requirement of storage nodes during
repair, while preserving the benefits of network coding in repair. We implement a proof-of-
concept prototype of NCCloud and deploy it atop both local and commercial clouds. We
validate that FMSR codes provide significant monetary cost savings in repair over RAID-6
codes, while having comparable response time performance in normal cloud storage
operations such as upload/download.

Scalia: an adaptive scheme for efficient multi-cloud storage

A growing amount of data is produced daily resulting in a growing demand for


storage solutions. While cloud storage providers offer a virtually infinite storage capacity,
data owners seek geographical and provider diversity in data placement, in order to avoid
vendor lock-in and to increase availability and durability. Moreover, depending on the
customer data access pattern, a certain cloud provider may be cheaper than another. In this
paper, we introduce Scalia, a cloud storage brokerage solution that continuously adapts the
placement of data based on its access pattern and subject to optimization objectives, such as
storage costs. Scalia efficiently considers repositioning of only selected objects that may
significantly lower the storage cost. By extensive simulation experiments, we prove the cost-
effectiveness of Scalia against static placements and its proximity to the ideal data placement
in various scenarios of data access patterns, of available cloud storage solutions and of
failures.

Algorithms for delta compression and remote file synchronization

Delta compression and remote file synchronization techniques are concerned with
efficient file transfer over a slow communication link in the case where the receiving party
already has a similar file (or files). This problem arises naturally, e.g., when distributing
updated versions of software over a network or synchronizing personal files between
different accounts and devices. More generally, the problem is becoming increasingly
common in many network-based applications where files and content are widely replicated,
frequently modified, and cut and reassembled in different contexts and packagings.
CHAPTER 3

3. METHODOLOGY

In our daily tasks, we store our critical data with different cloud storage
providers such as Google Drive, Dropbox, and iCloud, but unplanned distribution of data
chunks over multiple cloud storage providers can still leave the data exposed, effectively
recreating a single point of attack.

3.1 SYSTEM ANALYSIS

The Systems Development Life Cycle (SDLC), or Software Development Life Cycle
in systems engineering, information systems and software engineering, is the process of
creating or altering systems, and the models and methodologies that people use to develop
these systems. In software engineering the SDLC concept underpins many kinds of software
development methodologies.

3.2 EXISTING SYSTEM:

In fact, the data deduplication technique, which is widely adopted by current cloud
storage services in existing clouds, is one example of exploiting the similarities among
different data chunks to save disk space and avoid data retransmission . It identifies the
same data chunks by their fingerprints which are generated by fingerprinting algorithms
such as SHA-1, MD5. Any change to the data will produce a very different fingerprint
with high probability . However, these fingerprints can only detect whether or not the data
nodes are duplicate, which is only good for exact equality testing. Determining identical
chunks is relatively straightforward but efficiently determining similarity between chunks
is an intricate task due to the lack of similarity preserving fingerprints (or signatures).
DISADVANTAGES OF EXISTING SYSTEM:

 Unplanned distribution of data chunks can lead to high information disclosure even
while using multiple clouds.
 Frequent modifications of files by users result in a large number of similar chunks.
 Similar chunks occur across files, which is why existing CSPs use the data
deduplication technique.
3.3 PROPOSED SYSTEM:

 We present StoreSim, an information leakage aware multicloud storage system which
incorporates three important distributed entities, and we formulate the information
leakage optimization problem in the multicloud.
 We propose an approximate algorithm, BFSMinHash, based on MinHash, to generate
similarity-preserving signatures for data chunks.
 Based on the information match measured by BFSMinHash, we develop an efficient
storage plan generation algorithm, Clustering, for distributing users' data to different
clouds.
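The details of BFSMinHash are not spelled out here, but the underlying MinHash idea can be sketched as follows: sample the minimum of k salted hashes over a chunk's shingle set, so that the fraction of matching positions between two sketches estimates the Jaccard similarity of the chunks. The shingle width, k, the salting scheme, and the test strings below are illustrative choices, not the authors' parameters:

```python
import hashlib

def shingles(data: bytes, w: int = 4) -> set:
    """All w-byte substrings of a chunk."""
    return {data[i:i + w] for i in range(len(data) - w + 1)}

def minhash(data: bytes, k: int = 16) -> list:
    """k-sample MinHash sketch: one salted-SHA-1 minimum per 'permutation'."""
    sh = shingles(data)
    return [min(int.from_bytes(hashlib.sha1(bytes([seed]) + s).digest()[:8], "big")
                for s in sh)
            for seed in range(k)]

def similarity(a: list, b: list) -> float:
    """Fraction of agreeing positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = minhash(b"the quick brown fox jumps over the lazy dog")
s2 = minhash(b"the quick brown fox jumps over the lazy cat")
s3 = minhash(b"completely unrelated chunk of user data....")
assert similarity(s1, s2) > similarity(s1, s3)  # near-duplicates score higher
```

In BFSMinHash, as Section 3.6 describes, the k sampled values are additionally folded into a Bloom-filter bitmap, giving a fixed-size bit signature that can be compared and indexed cheaply.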
ADVANTAGES OF PROPOSED SYSTEM:
 Previous works employed only a single cloud having both compute and storage
capacity. Our work is different, since we consider a multicloud in which each storage
cloud serves purely as storage, without the ability to compute.
 Our work is not alone in storing data across multiple CSPs; however, those works
focused on different issues, such as cost optimization, data consistency and
availability.

3.4. SYSTEM STUDY

FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase and business proposal is put
forth with a very general plan for the project and some cost estimates. During system analysis
the feasibility study of the proposed system is to be carried out. This is to ensure that the
proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are

 ECONOMICAL FEASIBILITY
 TECHNICAL FEASIBILITY
 SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on
the organization. The amount of funds that the company can pour into the research and
development of the system is limited, so the expenditures must be justified. The developed
system was well within the budget, and this was achieved because most of the technologies
used are freely available; only the customized products had to be purchased.

TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the
available technical resources, as this would in turn place high demands on the client.
The developed system must have modest requirements, so that only minimal or no changes
are required for implementing it.

SOCIAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not
feel threatened by the system, but must instead accept it as a necessity. The level of
acceptance by the users depends solely on the methods that are employed to educate users
about the system and to make them familiar with it. Their level of confidence must be raised
so that they are also able to offer constructive criticism, which is welcomed, as they are
the final users of the system.
3.5 SYSTEM SPECIFICATION
System Requirements:

Hardware Requirements:

• System : Intel Core 2 Duo
• Hard Disk : 1 TB
• Monitor : 15" VGA Colour
• Mouse : Optical
• RAM : 2 GB

Software Requirements:

• Operating system : Windows 7
• Coding Language : ASP.Net with C#
• Database : SQL Server 2005
3.6 SYSTEM ARCHITECTURE

The purpose of the design phase is to devise a solution to the problem specified by the
requirements document. This phase is the first step in moving from the problem domain to the
solution domain. The design phase satisfies the requirements of the system. The design of a
system is probably the most critical factor affecting the quality of the software; it has a
major impact on the later phases, particularly testing and maintenance.

The output of this phase is the design document. This document is analogous to a
blueprint of the solution and is used later during implementation, testing and maintenance.
The design activity is commonly divided into two separate phases: System Design and Detailed
Design.

System Design, also referred to as top-level design, aims to identify the modules that
should be in the system, the specifications of those modules, and the way they interact with
one another to produce the desired results.

At the end of system design, all the major data structures, file formats, output
formats, and the major modules in the system and their specifications are decided. System
design is the process or art of defining the architecture, components, modules, interfaces,
and data for a system to satisfy specified requirements. One can view it as the application
of systems theory to product development.

In Detailed Design, the internal logic of each of the modules specified in system
design is determined. During this phase, the details of the data of a module are usually
specified in a high-level design description language that is independent of the target
language in which the software will eventually be implemented.

In system design the focus is on identifying the modules, whereas during detailed
design the focus is on designing the logic for each of the modules.

In this section, we first describe the architecture of StoreSim. Then we introduce
StoreSim in terms of its metadata and CSP models. Finally, we formulate the information
leakage optimization problem in the multicloud.
ARCHITECTURE

The architecture of StoreSim is shown in Figure 3.1. It can be observed that there is a
trust boundary between the metadata and storage servers. We assume that clients and
metadata servers, which are situated inside the trust boundary, can be trusted by users,
while remote servers outside the boundary are untrustworthy. For example, the metadata can
be stored in private database servers, while storage servers can be located in public CSPs
such as Amazon S3, Dropbox and Google Drive. Storage servers can be accessed through
standard APIs (Application Programming Interfaces). As shown in Figure 3.1, all control
flows stay inside the trust boundary, while data flows can cross it. In order to optimize
the information leakage, we design two components in StoreSim. The first is the Leakage
Measure layer (LMLayer), which is used to evaluate the information leakage and further to
generate the storage plan that maps data chunks to different clouds. The other is the Cloud
Manager layer (CMLayer), which provides cloud interoperability in a syntactic way.

Figure 3.1: Architecture diagram

Models

MetaData model. The data model we discuss in this section is for the metadata that
represents the file system of StoreSim. We model users' data as a labeled graph
G = <V, E, Ω, π> where V is a set of vertices, E is a set of edges, Ω is a set of labels, and
π : V ∪ E → Ω is a function that assigns labels to vertices and edges. Within the data graph,
the vertices V represent different objects in a file system, such as users, folders, files
and data chunks. The edges E indicate a variety of relationships among different objects,
which can be distinguished by the set of labels Ω. The labels also facilitate path-oriented
search, e.g., finding all data chunks of one file, or finding all the files of one user.
Furthermore, we define N ⊆ V as the set of data nodes which store the raw data in G. We aim
to distribute the data nodes N to different CSPs according to the storage protocol defined
in Section II-C.

CSP model. A cloud storage provider (CSP) s ∈ S is parameterized by two factors
<u, v>, where u is a storage load factor while v indicates the prior knowledge of the CSP.
The storage load, i.e., the ratio of the total size of data stored on a cloud to the size of
the user's entire data, can be assigned either by StoreSim (the default) or by users
according to their preferences. The prior knowledge of a CSP is modeled as the set of data
nodes which have been stored on it. Thus, the amount of prior knowledge of a CSP increases
with the number of data nodes stored on it. We assume that the knowledge is unforgettable,
i.e., the knowledge of a data node will not be removed even when the data node is removed
from the cloud.
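A toy instance of the metadata graph G = &lt;V, E, Ω, π&gt; can make the model concrete. The object and label names below are invented for illustration; a sketch in Python, not the prototype's actual data structures:

```python
# Vertices: file-system objects; edges: labeled relationships between them.
V = {"alice", "docs", "report.txt", "chunk1", "chunk2"}
E = {("alice", "docs"), ("docs", "report.txt"),
     ("report.txt", "chunk1"), ("report.txt", "chunk2")}
pi = {  # the labeling function π over V ∪ E
    "alice": "user", "docs": "folder", "report.txt": "file",
    "chunk1": "data", "chunk2": "data",
    ("alice", "docs"): "owns", ("docs", "report.txt"): "contains",
    ("report.txt", "chunk1"): "hasChunk", ("report.txt", "chunk2"): "hasChunk",
}

# N ⊆ V: the data nodes, i.e. what actually gets distributed to the CSPs.
N = {v for v in V if pi[v] == "data"}

def chunks_of(f):
    """Path-oriented search: all data chunks of one file."""
    return {v for (u, v) in E if u == f and pi[(u, v)] == "hasChunk"}

print(sorted(N))                        # ['chunk1', 'chunk2']
print(sorted(chunks_of("report.txt")))  # ['chunk1', 'chunk2']
```

The edge labels are what make queries such as "all chunks of one file" or "all files of one user" simple label-filtered traversals.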

Storage Protocol

In essence, the storage protocol is a set of constraints or cost functions to reduce the
information leakage when data are distributed across multiple clouds. The protocol in
StoreSim is to store similar chunks on the same cloud, thereby reducing the information
leaked to each individual CSP. In the following, we first define the information leakage
for a pair of data nodes.

Inspired by clustering problems, we propose a storage plan generation algorithm,


SPClustering, to group similar data nodes. We define a data node as the centroid when no
existing data node has low pairwise information leakage with it. In practice, we define a
leakage threshold, according to which a data node becomes a centroid if all its pairwise
information leakage with other nodes are greater than this threshold. In other words, a
centroid represents all data nodes which are similar to it. Given any new data node, we only
compute its pairwise similarities with a set of centroids, which largely reduces the number of
pairs. Moreover, we build the ClusterIndex among the centroids to further prune the search
space. A single index entry in ClusterIndex points to a set of similar centroids, which is
similar to the Bitmap index in traditional databases . Specifically, suppose the size of
signature generated by BFSMinHash algorithm is s bits, we divide the signature into b
segments with the length of each segment as s/b. We will use each segment as the key in hash
function and therefore, all the signatures with the same key will be hashed together. For
example, as is shown in Figure 3, when the key is the value of first segment, c2 and c4 are
hashed to the same index entry for they share the same value of first segment. Those
signatures are more likely to be similar to each other since they already share one same
segment. Recall from Section III-B, the number of elements sampled by BFSMinHash is k,
which means its signature based on Bloom filter is at most with k bits set to one. If we cannot
search any similar node from the ClusterIndex with b segments for a given node, that means
there are at least b bits different from the given node with all the centroids. Based on
Equation , it implies that there is no centroid that has Jaccard similarity with the given node
larger than (k − b)/(k + b). For example, if k is 64 and we divide the signature into 8
segments, the ClusterIndex can efficiently search all the similar centroids with similarity
higher than 77.8%. Thus, in order to find centroids with lower or higher similarity, we
respectively increase or decrease the value of b (the number of segments). To generate a
storage plan, we first build the ClusterIndex for a set of centroids on the fly; we do not
persist the ClusterIndex, to reduce the storage overhead. The cost of building the
ClusterIndex is acceptable: about 400 milliseconds for 100 thousand centroids. Then, for each
new data node, we find the cloud with the minimal information leakage based on the
candidate set queried from the ClusterIndex. Finally, if the minimal information leakage of a
new node is still larger than the leakage threshold, we assign this node based only on the
storage loads of the CSPs. Meanwhile, this node is labeled as a centroid and indexed on the
fly.
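The banding scheme above can be sketched in Python. This is a minimal illustrative sketch, not StoreSim's actual Java code: signatures are modeled as 64-bit integers split into 8 segments, and each (segment position, segment value) pair serves as a hash key, so centroids sharing any segment value land in the same index entry.

```python
from collections import defaultdict

SIG_BITS = 64   # assumed signature size s, in bits
SEGMENTS = 8    # b segments => candidate similarity threshold (k - b)/(k + b)

def segment_keys(sig, b=SEGMENTS):
    """Split an integer bit-signature into b equal segments; each
    (segment_index, segment_value) pair is one hash key."""
    seg_len = SIG_BITS // b
    mask = (1 << seg_len) - 1
    return [(i, (sig >> (i * seg_len)) & mask) for i in range(b)]

class ClusterIndex:
    """In-memory index built on the fly: centroids that share a
    segment value hash to the same entry, Bitmap-index style."""
    def __init__(self):
        self.entries = defaultdict(set)
        self.centroids = {}

    def add(self, cid, sig):
        self.centroids[cid] = sig
        for key in segment_keys(sig):
            self.entries[key].add(cid)

    def candidates(self, sig):
        """All centroids sharing at least one segment with sig."""
        out = set()
        for key in segment_keys(sig):
            out |= self.entries[key]
        return out
```

A query touches only b hash entries, which is what keeps the candidate search cheap compared to comparing against every centroid.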

3.7 UML DIAGRAMS

The Unified Modeling Language allows the software engineer to express an analysis
model using a modeling notation that is governed by a set of syntactic, semantic, and
pragmatic rules.

A UML system is represented using five different views that describe the system from
distinctly different perspectives. Each view is defined by a set of diagrams, as follows.

User Model View


This view represents the system from the user’s perspective. The analysis
representation describes a usage scenario from the end-user’s perspective.

Structural Model view

In this model view, the data and functionality are viewed from inside the system; it
models the static structures.

Behavioral Model View

This view represents the dynamic (behavioral) aspects of the system, depicting the
interactions among the various structural elements described in the user model and structural
model views.

Implementation Model View

In this view, the structural and behavioral aspects of the system are represented as they are to
be built.

Environmental Model View

In this view, the structural and behavioral aspects of the environment in which the system is
to be implemented are represented.

We have implemented the StoreSim prototype using Java, and it includes both basic
components (such as chunking, data deduplication, bundling and encryption/decryption), and
featured components including LMLayer and CMLayer. In the LMLayer, we implement the
algorithms described in the previous sections, while the CMLayer enables StoreSim to
communicate with multiple CSPs. StoreSim employs the common fixed-size chunking with a
maximum chunk size of 512 KB. The chunk is identified by SHA-1 signature, which is also
used for data deduplication. The small chunks can be bundled as a ZIP file to minimize the
network transmission overhead. Succinctly, before the chunk is synchronized, it can be
measured for leakage optimization, encrypted, and bundled for better network transmissions.
The synchronization of StoreSim is based on the delta encoding [8], which only synchronizes
changed chunks (identified by SHA-1 signatures) between two copies. All the metadata,
which is organized as a data graph, is stored in a MySQL database. We have implemented
connectors for three public storage clouds: Dropbox, Google Drive, and Amazon S3. All
communication between StoreSim and the public CSPs occurs through the APIs supplied by
those CSPs. We also support synchronization of files to local FTP servers. The metadata
server is deployed on our local server machine and the evaluation is conducted on a personal
client machine.
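The fixed-size chunking, SHA-1 identification, and delta synchronization described above can be sketched as follows. This is an illustrative Python sketch under our own naming, not StoreSim's actual Java implementation:

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # StoreSim's maximum chunk size (512 KB)

def chunk_and_fingerprint(data: bytes) -> dict:
    """Fixed-size chunking: split the data into 512 KB chunks and
    identify each chunk by its SHA-1 digest (also used for dedup)."""
    chunks = {}
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        chunks[hashlib.sha1(chunk).hexdigest()] = chunk
    return chunks

def changed_chunks(old: dict, new: dict) -> dict:
    """Delta encoding: only chunks whose SHA-1 is absent from the
    old copy need to be synchronized."""
    return {h: c for h, c in new.items() if h not in old}
```

A one-chunk edit to a large file then costs one chunk of network transfer rather than a full re-upload.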
Dataset For the evaluation, we aim to find such data which has undergone several
modifications, and thus results in many similar chunks. This can serve as a model for the
modifications that users make in the cloud storage services. Wikipedia and Github are two
such data sources that contain web pages and files which are reviewed and modified multiple
times. Thus, we crawled two datasets from Wikipedia and Github, respectively. The
Wikipedia dataset contains a total of 2197 web pages and each web page has a maximum 49
revisions. For each web page, the crawler only stores the text that is extracted from HTML
files. The total size of the dataset is 1.2 GB. The size of each webpage is relatively small,
which ranges from 29 Bytes to 118 KB with an average size of 11KB. The Github dataset
contains the United States Code, spanning 56 files. The files in this dataset are much larger
than those in the Wikipedia dataset, ranging from 47.7 KB to 50 MB with an average size of
5.3 MB. The files in this dataset have a maximum of 8 modifications and the total dataset size
is 2.1 GB. Thus, we observe that the data chunks generated from the Wikipedia dataset are
small (maximum chunk size 118 KB) but great in number (91,929), while those generated
from the Github dataset are bigger (maximum size 512 KB) but fewer in number (4,274).

Data Leakage Detection


Organizations used to think of data/information security only in terms of protecting
their network from intruders (e.g., hackers). But with the growing amount of data, the rapid
growth in the size of organizations (e.g., due to globalization), the rise in the number of data
points (machines and servers), and easier modes of communication, accidental or even
deliberate leakage of data from within the organization has become a painful reality. This has
led to growing awareness about information security in general, and about outbound content
management in particular.
Data Leakage, put simply, is the unauthorized transmission of data (or information)
from within an organization to an external destination or recipient. This may be electronic, or
may be via a physical method. Data Leakage is synonymous with the term Information
Leakage. The reader is encouraged to be mindful that unauthorized does not automatically
mean intentional or malicious. Unintentional or inadvertent data leakage is also unauthorized.
In the course of doing business, sometimes sensitive data must be handed over to supposedly
trusted third parties.
For example, a hospital may give patient records to researchers who will devise new
treatments. Similarly, a company may have partnerships with other companies that require
sharing customer data. Another enterprise may outsource its data processing, so data must be
given to various other companies. We call the owner of the data the distributor and the
supposedly trusted third parties the agents. Our goal is to detect the guilty agent among all
trustworthy agents when the distributor’s sensitive data have been leaked by any one agent,
and if possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed.
Perturbation is a very useful technique in which the data are modified and made “less
sensitive” before being handed to agents: for instance, one can add random noise to certain
attributes, or replace exact values by ranges [4]. However, in some cases it is important not to
alter the original distributor’s data. An employee’s contact number or bank account number,
for example, cannot be perturbed before being handed to a third party, since the modified
values would be useless to the receiver; similarly, if an outsourcer is doing our payroll, he
must have the exact salary and customer bank account numbers. In such situations, an
effective method is required that allows the data to be distributed without modifying the
valuable information.

Fig 3.4: Data Leakage Detection Process


If medical researchers are treating patients (as opposed to simply computing
statistics), they may need accurate data about the patients. Traditionally, leakage detection is
handled by watermarking: a unique code is embedded in each distributed copy.
If that copy is later discovered in the hands of an unauthorized party, the leaker can be
identified. Watermarks can be very useful in some cases but, again, involve some
modification of the original data by adding redundancy. Furthermore, watermarks can
sometimes be destroyed if the data recipient is malicious and aware of various techniques to
tamper with the watermark.
The Data Leakage Detection system is mainly divided into two modules:

Data allocation strategy


This module helps in the intelligent distribution of the data set so that, if the data is leaked,
the guilty agent can be identified.

Data Distribution Scenario


Guilt detection model
This module helps to determine whether an agent is responsible for the leakage of the data, or
whether the data set obtained by the target came from some other means.
It requires complete domain knowledge to calculate the probability p used to evaluate
a guilty agent. From domain knowledge and proper analysis and experiment, a probability
factor is calculated which acts as a threshold of evidence to prove the guilt of an agent. That
is, when the number of leaked records exceeds this threshold, the agent is considered guilty;
otherwise the agent is considered not guilty, because in that situation it is possible that the
leaked objects were obtained by the target by some other means.
Guilt Detection Model
An unobtrusive technique for detecting leakage of a set of objects or records is
proposed in this report. After giving a set of objects to agents, the distributor discovers some
of those same objects in an unauthorized place. (For example, the data may be found on a
website, or may be obtained through a legal discovery process.) At this point, the distributor
can assess the likelihood that the leaked data came from one or more agents, as opposed to
having been independently gathered by other means. Using an analogy with cookies stolen
from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave
him the cookie. But if we catch Freddie with five cookies, it will be much harder for him to
argue that his hands were not in the cookie jar. If the distributor sees “enough evidence” that
an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.
In this paper, we develop a model for assessing the “guilt” of agents. We also present
algorithms for distributing objects to agents, in a way that improves our chances of
identifying a leaker. Finally, we also consider the option of adding “fake” objects to the
distributed set. Such objects do not correspond to real entities but appear realistic to the
agents. In a sense, the fake objects act as a type of watermark for the entire set, without
modifying any individual members. If it turns out that an agent was given one or more fake
objects that were leaked, then the distributor can be more confident that agent was guilty.
Symbols and Terminology
A distributor owns a set T = {t1, t2, t3, …} of valuable and sensitive data objects. The
distributor wants to share some of the objects with a set of agents U1, U2, …, Un, but does not
wish the objects to be leaked to other third parties. The objects in T could be of any type and
size; e.g., they could be tuples in a relation, or relations in a database. An agent Ui receives a
subset of objects Ri ⊆ T, determined either by a sample request or an explicit request:

• Distributor: A distributor owns a set T of valuable and sensitive data objects.


Owner of data set T = {t1, t2, …, tn}
• Agent (U): The distributor shares some of the objects with a set of agents U1, U2, …, Un,
but does not wish the objects to be leaked to other third parties.

Receives set R ⊆ T from the distributor.


• Target: Unauthorized third party caught with leaked data set S ⊆ T.
Example: Say T contains customer records for a given company A. Company A hires a
marketing agency U1 to do an on-line survey of customers. Since any customers will do for
the survey, U1 requests a sample of 1,000 customer records. At the same time, company A
subcontracts with agent U2 to handle billing for all California customers. Thus, U2 receives all
T records that satisfy the condition “state is California.” Suppose that after giving objects to
the agents, the distributor discovers that a set S ⊆ T has leaked. This means that some
third party, called the target, has been caught in possession of S. For example, this target may
be displaying S on its web site, or perhaps, as part of a legal discovery process, the target
turned over S to the distributor.
Since agents U1, U2, …, Un have some of the data, it is reasonable to suspect them of leaking
it. However, the agents can argue that they are innocent, and that the S data was obtained
by the target through other means. For example, say one of the objects in S represents a
customer X. Perhaps X is also a customer of some other company, and that company
provided the data to the target. Or perhaps X can be reconstructed from various publicly
available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as
opposed to other sources. Intuitively, the more data in S, the harder it is for the agents to
argue they did not leak anything. Similarly, the “rarer” the objects, the harder it is to argue
that the target obtained them through other means.
Not only do we want to estimate the likelihood that the agents leaked data, but we would also
like to find out whether one of them in particular was more likely to be the leaker. For
instance, if one of the S objects was only given to agent U1, while the other objects were
given to all agents, we may suspect U1 more. The model we present next captures this
intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. While
performing the implementation and research work for calculating the guilt of an agent, we
follow various assumptions in order to reduce complexity and computation.
Agent Guilt Model
This model helps to determine whether an agent is responsible for the leakage of data, or
whether the data set obtained by the target came from some other means. The distributor can
assess the likelihood that the
leaked data came from one or more agents, as opposed to having been independently
gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch
Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch
Freddie with five cookies, it will be much harder for him to argue that his hands were not in
the cookie jar.
If the distributor sees “enough evidence” that an agent leaked data, he may stop doing
business with him, or may initiate legal proceedings.
Guilty Agent
To compute the probability of guilty agent, we need an estimate for the probability that
values in S can be “guessed” by the target. For instance, say some of the objects in T are
emails of individuals. We can conduct an experiment and ask a person with approximately
the expertise and resources of the target to find the email of say 100 individuals. If this person
can find say 90 emails, then we can reasonably guess that the probability of finding one email
is 0.9. On the other hand, if the objects in question are bank account numbers, the person may
only discover say 20, leading to an estimate of 0.2. We call this estimate pt, the probability
that object t can be guessed by the target.
To simplify the formulas, we assume that all T objects have the same probability, which
we call p. Next, we make two assumptions regarding the relationship among the various
leakage events. The first assumption simply states that an agent’s decision to leak an object is
not related to other objects.

Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has
leaked. This means that some third party called the target has been caught in possession of S.
For example, this target may be displaying S on its web site, or perhaps as part of a legal
discovery process, the target turned over S to the distributor. Since the agents U1, …, Un
have some of the data, it is reasonable to suspect them of leaking the data. However, the
agents can argue that they are innocent, and that the S data was obtained by the target through
other means.
For example, say one of the objects in S represents a customer X. Perhaps X is also a
customer of some other company, and that company provided the data to the target. Or
perhaps X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did
not leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target
obtained them through other means. Not only do we want to estimate the likelihood the
agents leaked data, but we would also like to find out if one of them in particular was more
likely to be the leaker.
For instance, if one of the S objects was only given to agent U1, while the other
objects were given to all agents, we may suspect U1 more. The model we present next
captures this intuition. We say an agent Ui is guilty if it contributes one or more objects to the
target. We denote the event that agent Ui is guilty for a given leaked set S by {Gi | S}. Our
next step is to estimate Pr {Gi | S}, i.e., the probability that agent Ui is guilty given evidence
S.
Guilt Agent Detection
We can conduct an experiment and ask a person with approximately the expertise and
resources of the target to find the email of say 100 individuals.
If this person can find say 90 emails, then we can reasonably guess that the
probability of finding one email is 0.9. On the other hand, if the objects in question are bank
account numbers, the person may only discover say 20, leading to an estimate of 0.2. We call
this estimate pt, the probability that object t can be guessed by the target [2]. For simplicity
we assume that all T objects have the same pt, which we call p.
Next, we make two assumptions regarding the relationship among the various leakage
events. The first assumption simply states that an agent’s decision to leak an object is not
related to other objects.

Assumption 1. For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the
provenance of t′ [1].

The term “provenance” in this assumption refers to the source of a value t that appears
in the leaked set. The source can be any of the agents who have t in their sets, or the target
itself (guessing). The following assumption states that joint events have a negligible
probability.
Assumption 2. An object t ∈ S can only be obtained by the target in one of the following two
ways: either a single agent Ui leaked t from its own Ri set, or the target guessed t (or obtained
it through other means) without the help of any of the n agents. In other words, for all t ∈ S,
the event that the target guesses t and the events that agent Ui (i = 1, …, n) leaks object t are
disjoint.

Assume that the distributor set T, the agent sets Rs, and the target set S are:
T = { t1, t2, t3 }, R1 = { t1, t2 }, R2 = { t1, t3 }, S = { t1, t2, t3 }.
In this case, all three of the distributor’s objects have been leaked and appear in S. Let us
first consider how the target may have obtained object t1, which was given to both agents.
From Assumption 2, the target either guessed t1 or one of U1 or U2 leaked it.
We know that the probability of the former event is p, so assuming that the probability that
each of the two agents leaked t1 is the same, we have the following cases:

 The target guessed t1 with probability p,

 Agent U1 leaked t1 to S with probability (1-p)/2,

 Agent U2 leaked t1 to S with probability (1-p)/2.


Similarly, we find that agent U1 leaked t2 to S with probability (1-p) since he is the only
agent that has t2.
Given these values, the probability that agent U1 is not guilty, namely that U1 did not leak
either object, is

Pr{Ḡ1 | S} = ( 1 − (1 − p)/2 ) × ( 1 − (1 − p) )


and the probability that U1 is guilty is:

Pr{G1 | S} = 1 − Pr{Ḡ1 | S}

If Assumption 2 did not hold, our analysis would be more complex because we would need to
consider joint events, e.g., the target guesses t1, and at the same time, one or two agents leak
the value. In our simplified analysis, we say that an agent is not guilty when the object can be
guessed, regardless of whether the agent leaked the value.
Since we are “not counting” instances when an agent leaks information, the simplified
analysis yields conservative (smaller) guilt probabilities.
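The arithmetic of the worked example above can be checked with a few lines of Python (the function name is ours, used only for illustration):

```python
def guilt_probability_u1(p):
    """Guilt probability of agent U1 in the worked example:
    T = {t1, t2, t3}, R1 = {t1, t2}, R2 = {t1, t3}, S = T.
    t1 was given to both agents, so U1 leaked it with probability
    (1 - p)/2; t2 was given only to U1, so U1 leaked it with
    probability (1 - p)."""
    pr_not_guilty = (1 - (1 - p) / 2) * (1 - (1 - p))
    return 1 - pr_not_guilty
```

For instance, with a guessing probability p = 0.5, U1 is not guilty with probability 0.75 × 0.5 = 0.375, so its guilt probability is 0.625.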

Allocation Strategy
Allocation strategies applicable to the different problem instances of data requests are
discussed here. We deal with problems with explicit data requests, and with problems with
sample data requests.
Explicit Data Requests
In problems where the distributor is not allowed to add fake objects to the distributed data,
the data allocation is fully defined by the agents’ data requests. In EF problems, objective
values are initialized by the agents’ data requests. Say, for example, that T = {t1, t2} and there
are two agents with explicit data requests such that R1 = {t1, t2} and R2 = {t1}. The value of the
sum objective in this case is 1.5. The distributor cannot remove or alter the R1 or R2 data to
decrease the overlap R1 ∩ R2. However, say that the distributor can create one fake object
(B = 1) and both agents can receive one fake object (b1 = b2 = 1). In this case, the distributor
can add one fake object to either R1 or R2 to increase the corresponding denominator of the
summation term. Assume that the distributor creates a fake object f and gives it to agent U1.
Agent U1 now has R1 = { t1, t2, f } and F1 = {f}, and the value of the sum objective decreases
to 1.33 < 1.5.
If the distributor is able to create more fake objects, he could further improve the
objective. Algorithm 1 is a general “driver” that will be used for the allocation in the case of
explicit requests with fake records. In the algorithm, a random agent is first selected from the
list and its request is analyzed; fake records are then created by the function
CREATEFAKEOBJECT(), added to the data set, and given back to the agent that requested
the data set. Fake records help in identifying the agent from a leaked data set.
Algorithm 1:
Allocation for Explicit Data Requests (EF)
Input: R1, …, Rn, cond1, …, condn, b1, …, bn, B
Output: R1, …, Rn, F1, …, Fn
R ← ∅
for i = 1, …, n do
    if bi > 0 then
        R ← R ∪ {i}
    Fi ← ∅
while B > 0 do
    i ← SELECTAGENT(R, R1, …, Rn)
    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
    Ri ← Ri ∪ {f}
    Fi ← Fi ∪ {f}
    bi ← bi − 1
    if bi = 0 then
        R ← R \ {i}
    B ← B − 1
Algorithm 2:
Agent selection for e-random
function SELECTAGENT(R, R1, …, Rn)
    i ← select an agent from R at random
    return i
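Algorithms 1 and 2 can be sketched together in Python. This is an illustrative sketch under our own naming; since fake-object creation is treated as a black box, CREATEFAKEOBJECT is replaced by a trivial stand-in:

```python
import random

def create_fake_object(Ri, Fi, cond):
    """Hypothetical stand-in for CREATEFAKEOBJECT: returns a fresh
    fake object satisfying the agent's condition."""
    return f"fake-{cond}-{len(Fi)}"

def allocate_explicit_ef(R, conds, b, B, rng=random):
    """Algorithm 1 with e-random agent selection (Algorithm 2):
    distribute B fake objects among agents, at most b[i] to agent i,
    choosing the receiving agent uniformly at random each round."""
    R = [list(r) for r in R]          # don't mutate the caller's sets
    b = list(b)
    F = [[] for _ in R]
    eligible = [i for i in range(len(R)) if b[i] > 0]
    while B > 0 and eligible:
        i = rng.choice(eligible)      # Algorithm 2: e-random selection
        f = create_fake_object(R[i], F[i], conds[i])
        R[i].append(f)
        F[i].append(f)
        b[i] -= 1
        if b[i] == 0:
            eligible.remove(i)
        B -= 1
    return R, F
```

With B = 2 and b1 = b2 = 1, each agent necessarily receives exactly one fake object, whatever the random choices.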
Flow chart for the implementation of the following algorithms:
(a) Allocation for Explicit Data Requests (EF) with fake objects
(b) Agent selection for e-random and e-optimal
In both cases, the CREATEFAKEOBJECT() method generates a fake object. The flow is: the
user's explicit request R is accepted and its condition checked; while B > 0, an agent is
selected and CREATEFAKEOBJECT() is invoked to add a fake object to the loop's
evaluation; the user then receives the output.
Sample Data Requests
With sample data requests, each agent Ui may receive any subset of T, so many
different object allocations are possible. In every allocation, the distributor can permute the T
objects and keep the same chances of guilty agent detection. The reason is that the guilt
probability depends only on which agents have received the leaked objects, and not on the
identity of the leaked objects. The distributor’s problem is to pick one allocation that
optimizes his objective. The distributor can increase the number of possible allocations by
adding fake objects.
Algorithm 3:

Allocation for Sample Data Requests (SF)

Input: m1, …, mn, |T|
Output: R1, …, Rn
a ← 0|T|        ▷ a[k] counts the agents that receive object tk
R1 ← ∅, …, Rn ← ∅
rem ← Σi=1..n mi
while rem > 0 do
    for all i = 1, …, n : |Ri| < mi do
        k ← SELECTOBJECT(i, Ri)
        Ri ← Ri ∪ {tk}
        a[k] ← a[k] + 1
        rem ← rem − 1

Algorithm 4:
Object selection function SELECTOBJECT(i, Ri)
    k ← select at random an element from {k′ | tk′ ∉ Ri}
    return k
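Algorithms 3 and 4 can be sketched together in Python. This is an illustrative sketch (names ours), assuming each request size mi is at most |T| so the loop always terminates:

```python
import random

def allocate_sample_sf(T, m, rng=random):
    """Algorithm 3 with s-random object selection (Algorithm 4):
    give agent i a random sample of m[i] distinct objects from T.
    a[k] counts how many agents hold object T[k]."""
    n = len(m)
    a = [0] * len(T)
    picked = [set() for _ in range(n)]   # indices of objects given to each agent
    rem = sum(m)
    while rem > 0:
        for i in range(n):
            if len(picked[i]) < m[i]:
                # SELECTOBJECT: a random object not yet in R[i]
                k = rng.choice([k for k in range(len(T)) if k not in picked[i]])
                picked[i].add(k)
                a[k] += 1
                rem -= 1
    R = [{T[k] for k in p} for p in picked]
    return R, a

R, a = allocate_sample_sf(["t1", "t2", "t3"], [2, 1])
```

Each pass of the outer loop hands at most one object to every unfinished agent, which is what makes the round-robin fair across agents.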
Flow chart for the implementation of the following algorithms:

(a) Allocation for Sample Data Requests (SF) without any fake objects
(b) Agent selection for e-random and e-optimal
In both cases, the select method returns the value of Σ |Ri ∩ Rj|. The flow is: the user's
request is accepted and its condition checked; the loop iterates for the n requests, invoking
the SELECTOBJECT() method; the user then receives the output.

Data Allocation Problem


The main focus of our work is the data allocation problem: how can the distributor
“intelligently” give data to agents so as to improve the chances of detecting a guilty agent?
As illustrated in the figure, there are four instances of this problem we address,
depending on the type of data requests made by agents (E for explicit and S for sample
requests) and whether fake objects are allowed (F for the use of fake objects, and F̄ for the
case where fake objects are not allowed). Fake objects are objects generated by the
distributor that are not in set T.
The objects are designed to look like real objects, and are distributed to agents
together with the T objects, in order to increase the chances of detecting agents that leak data.

Leakage problem instances.

Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 =
SAMPLE(T′, 1), where T′ = EXPLICIT(T, cond2).
Further, say that cond1 is “state = CA” (objects have a state field). If agent U2 has the
same condition cond2 = cond1, we can create an equivalent problem with sample data requests
on set T′. That is, our problem will be how to distribute the CA objects to two agents, with
R1 = SAMPLE(T′, |T′|) and R2 = SAMPLE(T′, 1). If instead U2 uses condition “state =
NY,” we can solve two different problems for sets T′ and T − T′. In each problem, we will
have only one agent. Finally, if the conditions partially overlap, R1 ∩ T′ ≠ ∅ but R1 ≠ T′, we
can solve three different problems for sets R1 − T′, R1 ∩ T′, and T′ − R1.

Fake Object
The distributor may be able to add fake objects to the distributed data in order to
improve his effectiveness in detecting guilty agents. However, fake objects may impact the
correctness of what agents do, so they may not always be allowable. The idea of perturbing
data to detect leakage is not new. However, in most cases, individual objects are perturbed,
e.g., by adding random noise to sensitive salaries, or adding a watermark to an image. In our
case, the set of distributor objects is perturbed by adding fake elements.
For example, say the distributed data objects are medical records and the agents are
hospitals. In this case, even small modifications to the records of actual patients may be
undesirable. However, the addition of some fake medical records may be acceptable, since no
patient matches these records, and hence no one will ever be treated based on fake records. A
trace file is maintained to identify the guilty agent. Trace files are a type of fake object that
help to identify improper use of data. The creation of fake but real-looking objects is a
nontrivial problem whose thorough investigation is beyond the scope of this paper. Here, we
model the creation of a fake object for agent Ui as a black box function
CREATEFAKEOBJECT ( Ri, Fi, condi ) that takes as input the set of all objects R i, the
subset of fake objects Fi that Ui has received so far, and condi, and returns a new fake object.
This function needs condi to produce a valid object that satisfies U i’s condition. Set Ri is
needed as input so that the created fake object is not only valid but also indistinguishable
from other real objects.
Optimization Problem
The distributor’s data allocation to agents has one constraint and one objective. The
distributor’s constraint is to satisfy agents’ requests, by providing them with the number of
objects they request or with all available objects that satisfy their conditions. His objective is
to be able to detect an agent who leaks any of his data objects.
We consider the constraint as strict. The distributor may not deny serving an agent
request and may not provide agents with different perturbed versions of the same objects. We
consider fake object allocation as the only possible constraint relaxation. Our detection
objective is ideal and intractable. Detection would be assured only if the distributor gave no
data object to any agent. We use instead the following objective: maximize the chances of
detecting a guilty agent that leaks all his objects.
We now introduce some notation to state the distributor’s objective formally. Recall
that Pr{Gj | S = Ri}, or simply Pr{Gj | Ri}, is the probability that agent Uj is guilty if the
distributor discovers a leaked table S that contains all Ri objects. We define the difference
function Δ(i, j) as:

Δ(i, j) = Pr{Gi | S = Ri} − Pr{Gj | S = Ri},  i, j = 1, ……, n (6)

Note that the differences Δ have non-negative values: given that set Ri contains all the leaked
objects, agent Ui is at least as likely to be guilty as any other agent. Difference Δ(i, j) is
positive for any agent Uj whose set Rj does not contain all the data of S.

It is zero if Ri ⊆ Rj. In this case, the distributor will consider both agents Ui and Uj equally
guilty since they have both received all the leaked objects. The larger a Δ(i, j) value is, the
easier it is to identify Ui as the leaking agent. Thus, we want to distribute data so that Δ
values are large.
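Under Assumptions 1 and 2, the guilt probabilities and the difference function of Equation (6) can be sketched in Python (function names ours, for illustration). Each leaked object t that agent Uj holds contributes a factor 1 − (1 − p)/|agents holding t| to the probability that Uj is not guilty:

```python
def guilt_prob(j, S, R, p):
    """Pr{Gj | S}: probability that agent j is guilty given leaked
    set S, agent sets R, and guessing probability p, under
    Assumptions 1 and 2."""
    pr_not_guilty = 1.0
    for t in S:
        holders = [i for i, Ri in enumerate(R) if t in Ri]
        if j in holders:
            # agent j leaked t with probability (1 - p)/|holders|
            pr_not_guilty *= 1 - (1 - p) / len(holders)
    return 1 - pr_not_guilty

def delta(i, j, R, p):
    """Difference Δ(i, j) = Pr{Gi | S = Ri} − Pr{Gj | S = Ri}, Eq. (6)."""
    return guilt_prob(i, R[i], R, p) - guilt_prob(j, R[i], R, p)
```

On the earlier example R1 = {t1, t2}, R2 = {t1} with p = 0.5, Δ(1, 2) is positive (U1 alone holds t2), while Δ(2, 1) is zero because R2 ⊆ R1.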

Intelligent Data Distribution


Hashed Distribution Algorithm
Input: Agent ID (UID), number of data items requested in the dataset (N), fake records (F)
Output: Distribution set (dataset + fake records)
1. Start
2. Accept the data request from the agent and analyze:
   a. the type of request { Sample, Explicit },
   b. the probability of getting records by means other than the distributor,
      Pr{guessing},
   c. the number of records in the dataset (N), used to calculate the number of fake
      records added in order to determine the guilty agent,
   d. the ID of the agent requesting the data (UID).
3. Generate the list of data to be sent to the agent (dataset), assigning each record a
   unique distribution ID.
4. For i = 1 to F:   ▷ for each fake record
   Mapping_Function (UID, FID)
   {
       Hash (UID)
       DID → FID
       Store → DistributionDetails { FID, DID, UID }
   }
5. For i = 1 to F:
   AddFakeRecord (DistributionDetails)
   Output: Distribution set
6. Stop.
Detection Process
The detection process starts when the set of distributed sensitive records is found in
some unauthorized place. The detection process completes in two phases: in phase one, the
agent is identified by the presence of fake records in the obtained set; if no matching fake
record is identified, phase two begins, which searches the set for missing records at the
positions where fake records were substituted. The advantage of the second phase is that it
works even when the agent identifies and deletes the fake records before leaking to the target.
Inverse mapping function (leaked data set):
1. Attach a DID to every record
2. Sort the records in order of DID
3. Search for and map fake records
4. For every record:
       If fake record = yes:
           MapAgent (FID)
       Else if fake record = no:
           Map (UID), which gives the hash location
           Identify the absence of a substituted record
           MapAgent (DID)
       Else:
           The objects were obtained by some other means.
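A minimal Python sketch of the hashed distribution and its two-phase detection follows. All names and the exact hashing scheme here are our illustrative assumptions, not a prescribed implementation:

```python
import hashlib

def distribute(uid, dataset, n_fakes):
    """Give agent `uid` every requested record plus n_fakes fake
    records. Real records get per-distribution DIDs; fake-record IDs
    are derived from a hash of the agent ID (the Mapping_Function),
    so the distributor can recompute them during detection."""
    dids = {f"DID-{uid}-{i}": rec for i, rec in enumerate(dataset)}
    fake_dids = set()
    for i in range(n_fakes):
        fid = "FID-" + hashlib.sha1(f"{uid}:{i}".encode()).hexdigest()[:8]
        dids[fid] = f"fake-record-{fid}"      # realistic-looking filler
        fake_dids.add(fid)
    details = {"uid": uid, "all_dids": set(dids), "fake_dids": fake_dids}
    return dids, details

def detect(leaked_dids, all_details):
    """Phase 1: a fake DID surviving in the leak identifies the agent.
    Phase 2: a leak equal to an agent's set minus its fakes shows the
    agent stripped the fakes before leaking."""
    for det in all_details:
        if leaked_dids & det["fake_dids"]:
            return det["uid"]
    for det in all_details:
        if leaked_dids == det["all_dids"] - det["fake_dids"]:
            return det["uid"]
    return None
```

Because the fake DIDs are recomputable from the agent ID, deleting the fake records only moves detection from phase 1 to phase 2.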
6.3 Benefits of Hashed Distribution:
Once the data is distributed, fake records are used to identify the guilty agent; here,
instead of the record itself, we use its location to determine the guilty agent. So even when
the presence of a fake record is identified by the agent and the agent deletes that record, the
location can still be determined by the distributor, and thus even the absence of the fake
record will reveal the identity of the agent, by tracking the absence of the original record.
This solves the data distribution and optimization problem to some extent through its
distribution technique.
Data Flow Diagram:
Use Case Diagram:
Class Diagram:
Sequence Diagram:
Activity Diagram:

SYSTEM IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is turned into a
working system. Thus it can be considered the most critical stage in achieving a successful
new system and in giving the user confidence that the new system will work and be effective.

The implementation stage involves careful planning, investigation of the existing
system and its constraints on implementation, design of methods to achieve the changeover,
and evaluation of changeover methods.

Meta-Data Generation:
Let the verifier V wish to store the file F with the archive, and let this file F consist of n file
blocks. We initially preprocess the file and create metadata to be appended to the file. Let
each of the n data blocks have m bits in them; this is a typical data file F which the client
wishes to store in the cloud.

Each piece of metadata m_i derived from the data blocks is encrypted using a suitable
algorithm to give a new, modified metadata block M_i. Without loss of generality, we show
this process using a simple XOR operation; the encryption method can be improved to
provide still stronger protection for the verifier's data. All the metadata bit blocks generated
by the above procedure are concatenated together, and this concatenated metadata is appended
to the file F before storing it at the cloud server. The file F along with the appended metadata,
F̃, is archived with the cloud.
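The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name, the byte-level block layout and a key of `meta_size` bytes are choices made here, and XOR stands in for a stronger cipher exactly as the text allows.

```python
def generate_metadata_file(file_bytes, block_size, meta_size, key):
    """Split F into blocks, take meta_size bytes of each block as metadata
    m_i, XOR-encrypt it into the modified metadata M_i, concatenate all
    M_i, and append them to F to form the archived file."""
    metadata = bytearray()
    for start in range(0, len(file_bytes), block_size):
        # m_i: leading meta_size bytes of the block, zero-padded if short.
        m_i = file_bytes[start:start + block_size][:meta_size].ljust(meta_size, b"\0")
        # M_i: simple XOR 'encryption' with a meta_size-byte key.
        M_i = bytes(b ^ k for b, k in zip(m_i, key))
        metadata.extend(M_i)
    return file_bytes + bytes(metadata)
```

The verifier keeps the key; the archive stores F together with the appended (encrypted) metadata.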

Modules:
Cloud Storage:

Data outsourcing to cloud storage servers is a rising trend among many firms and users owing
to its economic advantages. This essentially means that the owner (client) of the data moves
its data to a third-party cloud storage server, which, presumably for a fee, is supposed to
faithfully store the data and provide it back to the owner whenever required.

Simply Archives:
This problem tries to obtain and verify a proof that the data stored by a user at remote
data storage in the cloud (called cloud storage archives, or simply archives) is not modified by
the archive, thereby assuring the integrity of the data. It must be verified that the cloud archive
is not cheating the owner, where cheating, in this context, means that the storage archive might
delete or modify some of the data. While developing proofs for data possession at
untrusted cloud storage servers, we are often limited by the resources at the cloud server as
well as at the client.
Sentinels:
In this scheme, unlike in the key-hash approach, only a single key can be used
irrespective of the size of the file or the number of files whose retrievability it wants to verify.
Also, the archive needs to access only a small portion of the file F, unlike the key-hash
scheme, which required the archive to process the entire file F for each protocol verification.
If the prover has modified or deleted a substantial portion of F, then with high probability it
will also have suppressed a number of sentinels.
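A sketch of the sentinel idea, with assumed helper names and a SHA-256-based PRF standing in for whatever PRF a real scheme would use; sentinels are planted in the already-encrypted file so they are indistinguishable from ordinary ciphertext bytes.

```python
import hashlib
import random

def sentinel_byte(key, i):
    # PRF giving the i-th sentinel byte from the single verification key.
    return hashlib.sha256(key + i.to_bytes(8, "big")).digest()[0]

def plant_sentinels(ciphertext, key, num_sentinels):
    """Overwrite key-derived secret positions of the (already encrypted)
    file with PRF-generated sentinel bytes. One key suffices regardless
    of file size or number of files."""
    data = bytearray(ciphertext)
    positions = random.Random(key).sample(range(len(data)), num_sentinels)
    for i, p in enumerate(positions):
        data[p] = sentinel_byte(key, i)
    return bytes(data), positions

def check_sentinels(stored, key, positions):
    """Spot-check only the sentinel positions: a prover that suppressed a
    substantial portion of F will, with high probability, also have
    suppressed some sentinels."""
    return all(stored[p] == sentinel_byte(key, i)
               for i, p in enumerate(positions))
```

Verification touches only `len(positions)` bytes, which is the efficiency advantage over the key-hash scheme noted above.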

Verification Phase:

The verifier, before storing the file at the archive, preprocesses the file, appends
some metadata to it and stores it at the archive. At the time of verification the verifier
uses this metadata to verify the integrity of the data. It is important to note that our proof of
data integrity protocol just checks the integrity of the data, i.e., whether the data has been
illegally modified or deleted. It does not prevent the archive from modifying the data.
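The verification step can be sketched as the mirror image of the preprocessing: recompute each encrypted metadata block from the returned file and compare it with the appended copy. The function name, block layout and XOR "cipher" are illustrative assumptions consistent with the metadata scheme described earlier.

```python
def verify_integrity(archived, file_len, block_size, meta_size, key):
    """Split the archived blob into the file F and its appended metadata,
    recompute each encrypted metadata block M_i from F, and compare.
    This detects illegal modification or deletion; it does not prevent
    the archive from modifying the data."""
    file_part, meta_part = archived[:file_len], archived[file_len:]
    for i, start in enumerate(range(0, file_len, block_size)):
        m_i = file_part[start:start + block_size][:meta_size].ljust(meta_size, b"\0")
        M_i = bytes(b ^ k for b, k in zip(m_i, key))
        if meta_part[i * meta_size:(i + 1) * meta_size] != M_i:
            return False  # block i fails the metadata check
    return True
```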

SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, subassemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of test; each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. It is the testing of individual software
units of the application, done after the completion of an individual unit and before
integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at component level and test a specific business
process, application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
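As a concrete illustration, the following pytest-style unit test exercises a hypothetical `jaccard_similarity` helper (the similarity measure this system computes between clouds); both the helper and the expected values are assumptions for illustration, not the project's actual code.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two chunk sets, as used
    when comparing file versions across clouds (illustrative helper)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def test_jaccard_similarity():
    # One documented path per assertion, with clearly defined inputs
    # and expected results, as described above.
    assert jaccard_similarity({1, 2, 3}, {2, 3, 4}) == 0.5   # partial overlap
    assert jaccard_similarity(set(), set()) == 1.0           # empty-input edge case
    assert jaccard_similarity({"x"}, {"y"}) == 0.0           # disjoint sets
```

Run with `pytest`, such tests are discovered automatically by the `test_` prefix.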

Integration testing
Integration tests are designed to test integrated software components to determine if
they actually run as one program. Testing is event driven and is more concerned with the
basic outcome of screens or fields. Integration tests demonstrate that although the
components were individually satisfactory, as shown by successful unit testing, the
combination of components is correct and consistent. Integration testing is specifically aimed
at exposing the problems that arise from the combination of components.

Functional test
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures : interfacing systems or procedures must be invoked.


Organization and preparation of functional tests is focused on requirements, key functions,
or special test cases. In addition, systematic coverage pertaining to identify Business process
flows; data fields, predefined processes, and successive processes must be considered for
testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.

System Test
System testing ensures that the entire integrated software system meets requirements. It
tests a configuration to ensure known and predictable results. An example of system testing is
the configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.

White Box Testing


White Box Testing is testing in which the software tester has knowledge of
the inner workings, structure and language of the software, or at least its purpose. It is
used to test areas that cannot be reached from a black box level.

Black Box Testing


Black Box Testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. Black box tests, as most other kinds of tests,
must be written from a definitive source document, such as a specification or requirements
document. It is testing in which the software under test is treated as a black box: you cannot
"see" into it. The test provides inputs and responds to outputs without considering how the
software works.

Unit Testing:

Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.

Test strategy and approach


Field testing will be performed manually and functional tests will be written in detail.

Test objectives

 All field entries must work properly.


 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.
Features to be tested

 Verify that the entries are of the correct format


 No duplicate entries should be allowed
 All links should take the user to the correct page
Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects.

The task of the integration test is to check that components or software applications,
e.g. components in a software system or – one step up – software applications at the company
level – interact without error.

Test Results: All the test cases mentioned above passed successfully. No defects
encountered.

Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.

Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
4. RESULTS AND DISCUSSION

4.1 SCREEN SHOTS

Fig: Home Page

Fig: Client Registration


Fig: Client Login

Fig: Upload Files


Fig: View Files and split

Fig: Encrypt Data


Fig: File Upload to Cloud

Fig: Modify Files


Fig: Modify data

Fig: Calculate Jaccard Similarity


Fig: Modify Cloud2 Files

Fig: Modify data


Fig: Calculate Jaccard Similarity

Fig: Request file


Fig: Download Files

Fig: Verify Keys


Fig: Download File

Fig: Login Meta Data Server


Fig: Meta Data Server Home

Fig: View Cloud1 Files


Fig: View Cloud2 Files

Fig: Login Storage Server


Fig: Storage Server1 Home

Fig: View Cloud1 Files


Fig: View Client Requests and Send Cloud1 Key

Fig: Storage Server2 Home


Fig: View Cloud2 Files

Fig: View Client Requests and Send Cloud2 Key


5. CONCLUSION AND FUTURE ENHANCEMENTS

5.1 CONCLUSION

Distributing data on multiple clouds provides users with a certain degree of
information leakage control in that no single cloud provider is privy to the user's entire data.
However, unplanned distribution of data chunks can lead to avoidable information leakage. We
show that distributing data chunks in a round-robin way can leak as much as 80% of the user's
total information as the number of data synchronizations increases. To optimize the
information leakage, we presented StoreSim, an information leakage aware storage system
for the multicloud. StoreSim achieves this goal by using novel algorithms, BFSMinHash and
SPClustering, which place data with minimal information leakage (based on similarity) on
the same cloud. Through an extensive evaluation based on two real datasets, we demonstrate
that StoreSim is both effective and efficient (in terms of time and storage space) in minimizing
information leakage during the process of synchronization in the multicloud. We show that
StoreSim can achieve near-optimal performance and reduce information leakage by up to 60%
compared to unplanned placement. Finally, through our attackability analysis, we further
demonstrate that StoreSim not only reduces the risk of wholesale information leakage but also
makes attacks on retail information much more complex.

5.2 FUTURE ENHANCEMENTS:

It is not possible to develop a system that meets all the requirements of the user, and
user requirements keep changing as the system is being used. Some of the future
enhancements that can be made to this system are:

 As technology emerges, it is possible to upgrade the system and adapt it to the desired
environment.

 Based on future security issues, security can be improved using emerging
technologies such as single sign-on.
REFERENCES


1. J. Crowcroft, “On the duality of resilience and privacy,” in Proceedings of the Royal
Society of London A: Mathematical, Physical and Engineering Sciences, vol. 471, no.
2175. The Royal Society, 2015, p. 20140862.

2. A. Bessani, M. Correia, B. Quaresma, F. André, and P. Sousa, "DepSky: dependable and
secure storage in a cloud-of-clouds," ACM Transactions on Storage (TOS), vol. 9, no.
4, p. 12, 2013.

3. H. Chen, Y. Hu, P. Lee, and Y. Tang, "NCCloud: A network-coding-based storage
system in a cloud-of-clouds," 2013.

4. T. G. Papaioannou, N. Bonvin, and K. Aberer, "Scalia: an adaptive scheme for
efficient multi-cloud storage," in Proceedings of the International Conference on High
Performance Computing, Networking, Storage and Analysis. IEEE Computer Society
Press, 2012, p. 20.

5. Z. Wu, M. Butkiewicz, D. Perkins, E. Katz-Bassett, and H. V. Madhyastha,
"SPANStore: Cost-effective geo-replicated storage spanning multiple cloud services,"
in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems
Principles. ACM, 2013, pp. 292–308.

6. G. Greenwald and E. MacAskill, "NSA Prism program taps in to user data of Apple,
Google and others," The Guardian, vol. 7, no. 6, pp. 1–43, 2013.

7. T. Suel and N. Memon, “Algorithms for delta compression and remote file
synchronization,” 2002.

8. P. Papadimitriou and H. Garcia-Molina, "Data Leakage Detection," IEEE
Transactions on Knowledge and Data Engineering, vol. 23, no. 1, January 2011.

9. S. Umamaheswari and H. Arthi Geetha, "Detection of Guilty Agents," Coimbatore
Institute of Engineering and Technology.
10. P. Papadimitriou and H. Garcia-Molina, "Data leakage detection,"
Stanford University.
11. L. Sweeney, "Achieving K-Anonymity Privacy Protection Using Generalization and
Suppression," http://en.scientificcommons.org/43196131, 2002.
12. Peter Gordon “Data Leakage – Threats and Mitigation” SANS Institute Reading
Room October 15, 2007
13. S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, “Flexible Support for
Multiple Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp.
214-260, 2001.
14. F. Sebe, J. Domingo-Ferrer, A. Martinez-Balleste, Y. Deswarte, and J.-J. Quisquater,
“Efficient remote data possession checking in critical information infrastructures,”
IEEE Trans. on Knowledge and Data Engineering, vol. 20, pp. 1034 –1038, aug.
2008.
15. R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing
and emerging IT platforms: Vision, hype, and reality for delivering computing as the
5th utility,” Future Generation Computer Systems, vol. 25, no. 6, pp. 599 – 616, 2009.
16. G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song,
“Provable data possession at untrusted stores,” in CCS’07, (New York, NY, USA),
pp. 598–609, ACM, 2007.
17. R. Curtmola, O. Khan, R. Burns, and G. Ateniese, “MR-PDP: Multiple-Replica
Provable Data Possession,” in ICDCS’08, IEEE.
18. G. Ateniese, R. Di Pietro, L. V. Mancini, and G. Tsudik, “Scalable and efficient
provable data possession,” in SecureComm’08, ACM.
19. C. Erway, A. Küpçü, C. Papamanthou, and R. Tamassia, "Dynamic provable
data possession," in CCS'09, pp. 213–222, ACM, 2009.
20. C. Wang, Q. Wang, K. Ren, and W. Lou, “Ensuring data storage security in cloud
computing,” in IWQoS’09, pp. 1 –9, july 2009.
21. Q. Wang, C. Wang, J. Li, K. Ren, and W. Lou, “Enabling public verifiability
and data dynamics for storage security in cloud computing,” in 14th ESORICS,
Springer, September 2009.
22. C. Wang, Q. Wang, K. Ren, and W. Lou, “Privacy-preserving public auditing for
data storage security in cloud computing,” in InfoCom2010, IEEE, March 2010.
23. Y. Deswarte and J.-J. Quisquater, “Remote Integrity Checking,” in IICIS’04, pp. 1–
11, Kluwer Academic Publishers, 1 2004.
24. D. L. G. Filho and P. S. L. M. Barreto, “Demonstrating data possession and
uncheatable data transfer.” Cryptology ePrint Archive, Report 2006/150, 2006.
http://eprint.iacr.org/.
25. M. A. Shah, M. Baker, J. C. Mogul, and R. Swaminathan, “Auditing to keep online
storage services honest,” in HotOS XI., Usenix, 2007.
26. C. Wang, S. S.-M. Chow, Q. Wang, K. Ren, and W. Lou, "Privacy-preserving
public auditing for secure cloud storage," Cryptology ePrint Archive, Report
2009/579, 2009. http://eprint.iacr.org/.

27. Y. Zhu, H. Wang, Z. Hu, G.-J. Ahn, H. Hu, and S. S. Yau, “Cooperative provable
data possession.” Cryptology ePrint Archive, Report 2010/234, 2010.
http://eprint.iacr.org/.
28. Z. Hao and N. Yu, “A multiple-replica remote data possession checking protocol
with public verifiability,” in ISDPE2010, IEEE.
29. O. Goldreich, Foundations of Cryptography. Cambridge University Press, 2004.
30. I. Damgård, "Towards practical public key systems secure against chosen ciphertext
attacks," in CRYPTO'91, Springer-Verlag, 1992.
31. M. Bellare and A. Palacio, “The knowledge-of-exponent assumptions and 3-round
zero-knowledge protocols,” in CRYPTO’04, pp. 273– 289, Springer, 2004.
32. G. L. Miller, “Riemann’s hypothesis and tests for primality,” in STOC’75, (New
York, NY, USA), pp. 234–239, ACM, 1975.
33. Z. Hao, S. Zhong, and N. Yu, “A privacy-preserving remote data integrity checking
protocol with data dynamics and public verifiability,” SUNY Buffalo CSE department
technical report 2010-11, 2010. http://www.cse.buffalo.edu/tech-reports/2010-11.pdf.
34. Multiprecision Integer and Rational Arithmetic C/C++ Library.
http://www.shamus.ie/.
