Optimizing Information Leakage in Multicloud Services
ABSTRACT
Many schemes have been recently advanced for storing data on multiple clouds.
Distributing data over different cloud storage providers (CSPs) automatically provides users
with a certain degree of information leakage control, for no single point of attack can leak all
the information. However, unplanned distribution of data chunks can lead to high information
disclosure even while using multiple clouds. In this paper, we study an important information
leakage problem caused by unplanned data distribution in multicloud storage services. Then,
we present StoreSim, an information-leakage-aware storage system for the multicloud. StoreSim
aims to store syntactically similar data on the same cloud, thus minimizing the user's
information leakage across multiple clouds. We design an approximate algorithm to
efficiently generate similarity-preserving signatures for data chunks based on MinHash and
Bloom filter, and also design a function to compute the information leakage based on these
signatures. Next, we present an effective storage plan generation algorithm based on
clustering for distributing data chunks with minimal information leakage across multiple
clouds.
1. INTRODUCTION
With the increasingly rapid uptake of devices such as laptops, cellphones and tablets,
users require ubiquitous and massive network storage to handle their ever-growing digital
lives. To meet these demands, many cloud-based storage and file sharing services such as
Dropbox, Google Drive and Amazon S3, have gained popularity due to the easy-to-use
interface and low storage cost. However, these centralized cloud storage services are
criticized for taking control of users' data, which allows storage providers to run
analytics for marketing and advertising. Also, the information in users' data can be leaked,
e.g., by means of malicious insiders, backdoors, bribery and coercion. One possible solution to
reduce the risk of information leakage is to employ multicloud storage systems in which no
single point of attack can leak all the information. A malicious entity, such as the one
revealed in recent attacks on privacy, would be required to coerce all the different CSPs on
which a user might place her data, in order to get a complete picture of her data. Put simply,
as the saying goes: do not put all your eggs in one basket.
Yet, the situation is not so simple. CSPs such as Dropbox, among many others,
employ rsync-like protocols to synchronize local files with their remote copies in their
centralized clouds. Every local file is partitioned into small chunks, and these chunks are
hashed with fingerprinting algorithms such as SHA-1 or MD5. Thus, a file's contents can be
uniquely identified by this list of hashes. On each update of a local file, only chunks with
changed hashes are uploaded to the cloud. This hash-based synchronization differs from
diff-like protocols, which compare two versions of the same file line by line, detect the
exact updates, and upload only these updates in a patch style.
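As a concrete illustration, the following minimal Java sketch mirrors this hash-based model: it fingerprints fixed-size chunks with SHA-1 and reports which chunks must be re-uploaded. The 512 KB chunk size matches the prototype described later; the class and method names are illustrative, not the actual API of Dropbox or StoreSim.

import java.security.MessageDigest;
import java.util.*;

// A minimal sketch of hash-based synchronization, assuming fixed-size
// chunking and SHA-1 fingerprints. Names are illustrative only.
public class HashSync {
    static final int CHUNK_SIZE = 512 * 1024; // 512 KB, as in the prototype

    // One SHA-1 fingerprint per fixed-size chunk of the file.
    static List<String> fingerprints(byte[] file) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        List<String> hashes = new ArrayList<>();
        for (int off = 0; off < file.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, file.length - off);
            sha1.reset();
            sha1.update(file, off, len);
            hashes.add(HexFormat.of().formatHex(sha1.digest()));
        }
        return hashes;
    }

    // Only chunks whose fingerprints changed need to be uploaded, and they
    // are uploaded whole, however small the underlying edit was.
    static List<Integer> changedChunks(List<String> oldH, List<String> newH) {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < newH.size(); i++)
            if (i >= oldH.size() || !oldH.get(i).equals(newH.get(i)))
                changed.add(i);
        return changed;
    }
}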
Instead, the hash-based synchronization model must upload every chunk whose hash has
changed to the cloud in its entirety. Thus, in the multicloud environment, two chunks differing
only very slightly can be distributed to two different clouds. The following motivating example
shows that if chunks of a user's data are assigned to different CSPs in an unplanned
manner, the information leaked to each CSP can be higher than expected. Suppose that we
have a storage service with three CSPs S1, S2, S3 and a user's dataset D. All the user's data
will first be chunked and then uploaded to different clouds. The dataset D is represented as
a set of hashes, one generated from each data chunk. In addition, we consider that the data
chunks are distributed to the different clouds in a round-robin (RR) way.
Apparently, RR is good for balancing the storage load, and each cloud thus obtains the
same amount of data. However, the same amount of data does not necessarily mean the same
amount of information. For example, if we find that the chunks in the set {C3, C6, C9} are
almost the same, then S3 actually obtains information equivalent to that in only one chunk. If
all other chunks are different, S1 and S2 each obtain three times as much information. This
problem does not exist in a single storage cloud such as Dropbox, since users have no
choice but to give all their information to one cloud.
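The imbalance in the example can be reproduced with a toy computation. In the sketch below, the content labels are illustrative; counting distinct contents per cloud stands in for the information measure developed later in the paper.

import java.util.*;

// A toy version of the motivating example: nine chunks C1..C9 assigned
// round-robin to clouds S1..S3, where C3, C6 and C9 carry essentially
// the same content ("x").
public class RoundRobinLeakage {
    public static void main(String[] args) {
        // content id per chunk: C3, C6, C9 share content "x"
        String[] content = {"a", "b", "x", "c", "d", "x", "e", "f", "x"};
        Map<Integer, Set<String>> seen = new HashMap<>();
        for (int i = 0; i < content.length; i++) {
            int cloud = i % 3; // round-robin assignment
            seen.computeIfAbsent(cloud, k -> new HashSet<>()).add(content[i]);
        }
        // S1 and S2 each see 3 distinct contents; S3 sees only 1.
        seen.forEach((s, c) -> System.out.println("S" + (s + 1) + ": " + c.size()));
    }
}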
When the storage is in the multicloud, we have the opportunity to minimize the total
information that is leaked to each CSP. The optimal case is that each CSP obtains the same
amount of information. In our example, data distribution based on RR can achieve the
optimal result only if all the chunks are different. However, this is not the case in cloud
storage services, for two reasons: 1) frequent modifications of files by users result in a large
number of similar chunks; and 2) similar chunks exist across files, which is why existing
CSPs use the data deduplication technique.
1.2 FEATURES OF CLOUD COMPUTING
Data outsourcing to cloud storage servers is a rising trend among many firms and users
owing to its economic advantages. This essentially means that the owner (client) of the data
moves its data to a third-party cloud storage server, which is supposed to, presumably for a
fee, faithfully store the data and provide it back to the owner whenever required.
As data generation is far outpacing data storage, it proves costly for small firms to
frequently update their hardware whenever additional data is created. Maintaining the
storage can also be a difficult task. Outsourcing data to cloud storage helps such firms
by reducing the costs of storage, maintenance and personnel. It can also assure reliable
storage of important data by keeping multiple copies of the data, thereby reducing the chance
of losing data to hardware failures.
Storing user data in the cloud, despite its advantages, raises many interesting security
concerns which need to be extensively investigated before it can become a reliable
alternative to local storage of data. In this paper we deal with the problem of
implementing a protocol for obtaining a proof of data possession in the cloud, sometimes
referred to as proof of retrievability (POR). This problem is to obtain and verify a proof
that the data stored by a user at a remote data storage in the cloud (called cloud storage
archives, or simply archives) is not modified by the archive, thereby assuring the integrity
of the data.
Such verification systems prevent the cloud storage archives from misrepresenting or
modifying the data stored at it without the consent of the data owner by using frequent checks
on the storage archives. Such checks must allow the data owner to efficiently, frequently,
quickly and securely verify that the cloud archive is not cheating the owner. Cheating, in this
context, means that the storage archive might delete some of the data or may modify some of
the data.
Report on Present Investigation
Data leakage happens every day when confidential business information such as customer
or patient data, source code or design specifications, price lists, intellectual property and trade
secrets, forecasts and budgets in spreadsheets are leaked out. In this report a problem is
considered in which a data distributor has given sensitive data to a set of supposedly trusted
agents, and some of the data are leaked and found in an unauthorized place. The
problem with data leakage is that once the data is no longer within the distributor's
domain, the company is at serious risk. The distributor must assess the likelihood that
the leaked data came from one or more agents, as opposed to having been independently
gathered by other means.
We propose data allocation strategies (across the agents) that improve the probability of
identifying leakages. These methods do not rely on alterations of the released data. In some
cases, we can also inject “realistic but fake” data records to further improve our chances of
detecting leakage and identifying the guilty party.
A further modification is applied in order to overcome the problems of the current algorithm,
by intelligently distributing data objects among the various agents in such a way that
identifying the guilty agent becomes simpler.
Introduction to Data Leakage
Data leakage, put simply, is the unauthorized transmission of data (or information)
from within an organization to an external destination or recipient. Leakage may be
intentional or unintentional, and may be caused by an internal or an external user: internal
users are authorized users of the system who can access the data through a valid access
control policy, whereas an external intruder accesses data through some attack, either active
or passive, on the target machine. The leakage may be electronic, or may be via a physical
method. Data leakage is synonymous with the term information leakage; it harms the
organization's image and makes others reluctant to continue their relationship with the
distributor, as it is unable to protect sensitive information.
The reader is encouraged to be mindful that unauthorized does not automatically
mean intentional or malicious. Unintentional or inadvertent data leakage is also
unauthorized.
Data leakage by information type:
Confidential information: 15
Intellectual property: 4
Customer data: 73
Health record: 8
Prevention relies on data leakage prevention (DLP) mechanisms, suites of technologies
that prevent leakage by classifying sensitive information, monitoring it, and restricting
access through access control policies. Education also prevents leakage, as leakage most
often occurs unintentionally through internal users. Detection identifies leakage of
information distributed to supposedly trustworthy third parties, called agents, and estimates
their involvement in the leakage.
CHAPTER 2
2. LITERATURE REVIEW
The increasing popularity of cloud storage services has led companies that handle
critical data to think about using these services for their storage needs. Medical record
databases, large biomedical datasets, historical information about power systems and
financial data are some examples of critical data that could be moved to the cloud. However,
the reliability and security of data stored in the cloud still remain major concerns. In this
work we present DepSky, a system that improves the availability, integrity, and
confidentiality of information stored in the cloud through the encryption, encoding, and
replication of the data on diverse clouds that form a cloud-of-clouds. We deployed our
system using four commercial clouds and used PlanetLab to run clients accessing the service
from different countries. We observed that our protocols improved the perceived availability,
and in most cases, the access latency, when compared with cloud providers individually.
Moreover, the monetary costs of using DepSky in this scenario are at most twice the cost of
using a single cloud, which is optimal and seems to be a reasonable cost, given the benefits.
To provide fault tolerance for cloud storage, recent studies propose to stripe data
across multiple cloud vendors. However, if a cloud suffers from a permanent failure and loses
all its data, we need to repair the lost data with the help of the other surviving clouds to
preserve data redundancy. We present a proxy-based storage system for fault-tolerant
multiple-cloud storage called NCCloud, which achieves cost-effective repair for a permanent
single-cloud failure. NCCloud is built on top of a network-coding-based storage scheme
called the functional minimum-storage regenerating (FMSR) codes, which maintain the same
fault tolerance and data redundancy as in traditional erasure codes (e.g., RAID-6), but use
less repair traffic and, hence, incur less monetary cost due to data transfer. One key design
feature of our FMSR codes is that we relax the encoding requirement of storage nodes during
repair, while preserving the benefits of network coding in repair. We implement a proof-of-
concept prototype of NCCloud and deploy it atop both local and commercial clouds. We
validate that FMSR codes provide significant monetary cost savings in repair over RAID-6
codes, while having comparable response time performance in normal cloud storage
operations such as upload/download.
Delta compression and remote file synchronization techniques are concerned with
efficient file transfer over a slow communication link in the case where the receiving party
already has a similar file (or files). This problem arises naturally, e.g., when distributing
updated versions of software over a network or synchronizing personal files between
different accounts and devices. More generally, the problem is becoming increasingly
common in many network-based applications where files and content are widely replicated,
frequently modified, and cut and reassembled in different contexts and packagings.
CHAPTER 3
3. METHODOLOGY
In our daily tasks, we store critical data with different cloud storage providers
such as Google Drive, Dropbox and iCloud, but unplanned distribution of data chunks over
multiple cloud storage providers can still disclose a large amount of information to
individual providers.
The Systems Development Life Cycle (SDLC), or Software Development Life Cycle
in systems engineering, information systems and software engineering, is the process of
creating or altering systems, and the models and methodologies that people use to develop
these systems. In software engineering the SDLC concept underpins many kinds of software
development methodologies.
In fact, the data deduplication technique, which is widely adopted by current cloud
storage services, is one example of exploiting the similarities among different data
chunks to save disk space and avoid data retransmission. It identifies identical data
chunks by their fingerprints, which are generated by fingerprinting algorithms such as
SHA-1 or MD5. Any change to the data will, with high probability, produce a very different
fingerprint. However, these fingerprints can only detect whether or not data chunks are
duplicates, which is only good for exact equality testing. Determining identical chunks is
relatively straightforward, but efficiently determining similarity between chunks is an
intricate task due to the lack of similarity-preserving fingerprints (or signatures).
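The abstract proposes building such signatures from MinHash and a Bloom filter. As a rough sketch of the MinHash half only, the following Java code derives a similarity-preserving signature from a chunk's byte shingles; the shingle length, the number of hash functions and the mixing function are assumed parameters, not StoreSim's actual scheme.

import java.util.*;

// A minimal MinHash-based similarity-preserving signature. The 8-byte
// shingles, 64 hash functions and SplitMix64-style mixer are illustrative.
public class MinHashSignature {
    static final int NUM_HASHES = 64;

    // For each seeded hash function, keep the minimum value over all
    // 8-byte shingles of the chunk.
    static long[] signature(byte[] chunk) {
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int i = 0; i + 8 <= chunk.length; i++) {
            long shingle = 0;
            for (int j = 0; j < 8; j++) shingle = (shingle << 8) | (chunk[i + j] & 0xff);
            for (int h = 0; h < NUM_HASHES; h++) {
                long v = mix(shingle ^ (0x9e3779b97f4a7c15L * (h + 1)));
                if (v < sig[h]) sig[h] = v;
            }
        }
        return sig;
    }

    static long mix(long z) { // SplitMix64 finalizer
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Fraction of matching minima estimates the Jaccard similarity of the
    // two chunks' shingle sets; identical chunks score 1.0.
    static double similarity(long[] a, long[] b) {
        int match = 0;
        for (int h = 0; h < NUM_HASHES; h++) if (a[h] == b[h]) match++;
        return (double) match / NUM_HASHES;
    }
}

Unlike an SHA-1 fingerprint, two chunks that differ only slightly obtain almost identical signatures here, which is exactly the property the exact-equality fingerprints above lack.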
DISADVANTAGES OF EXISTING SYSTEM:
Unplanned distribution of data chunks can lead to high information disclosure even
while using multiple clouds.
Frequent modifications of files by users result in a large number of similar chunks.
Similar chunks exist across files, which is why existing CSPs use the data deduplication
technique.
3.3 PROPOSED SYSTEM:
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put
forth with a very general plan for the project and some cost estimates. During system
analysis, the feasibility study of the proposed system is carried out, to ensure that the
proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
ECONOMIC FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMIC FEASIBILITY
This study is carried out to check the economic impact that the system will have on
the organization. The amount of funds that the company can pour into the research and
development of the system is limited, so the expenditures must be justified. The developed
system is well within the budget, which was achieved because most of the technologies
used are freely available; only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the
available technical resources, as this would translate into high demands being placed on
the client. The developed system must have modest requirements, so that only minimal or
no changes are required to implement it.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must instead accept it as a necessity. The level of acceptance by
the users depends on the methods employed to educate users about the system and to make
them familiar with it. Their level of confidence must be raised so that they can also offer
constructive criticism, which is welcomed, since they are the final users of the system.
3.5 SYSTEM SPECIFICATION
System Requirements:
Hardware Requirements:
Software Requirements:
The purpose of the design phase is to plan a solution to the problem specified by the
requirements document. This phase is the first step in moving from the problem domain to the
solution domain. The design phase satisfies the requirements of the system. The design of a
system is perhaps the most critical factor affecting the quality of the software; it has a
major impact on the later phases, particularly testing and maintenance.
The output of this phase is the design document. This document is analogous to a
blueprint of the solution and is used later during implementation, testing and maintenance.
The design activity is often divided into two separate phases: System Design and Detailed
Design.
System Design, also called top-level design, aims to identify the modules that should
be in the system, the specifications of these modules, and the way they interact with one
another to produce the desired results.
At the end of system design, all the major data structures, file formats, output
formats, and the major modules in the system and their specifications are decided. System
design is the process or art of defining the architecture, components, modules, interfaces and
data for a system to satisfy specified requirements. One can view it as the application of
systems theory to development.
In Detailed Design, the internal logic of each of the modules specified in system design
is decided. During this phase, the details of the data of a module are usually specified in a
high-level design description language that is independent of the target language in which
the software will eventually be implemented.
In system design the focus is on identifying the modules, whereas during detailed design
the focus is on designing the logic for each of the modules.
Models
Metadata model. The data model we discuss in this section is for the metadata that
represents the file system of StoreSim. We model users' data as a labeled graph G = <V, E,
Ω, π>, where V is a set of vertices, E is a set of edges, Ω is a set of labels, and π : V ∪ E → Ω
is a function that assigns labels to vertices and edges. Within the data graph, the vertices V
represent different objects in a file system such as users, folders, files and data chunks. The
edges E indicate a variety of relationships among different objects which can be distinguished
by a set of labels Ω. The labels also facilitate the process of path-oriented search, e.g., to find
all data chunks of one file, or to find all the files of one user. Furthermore, we define N ⊆ V
as the set of data nodes which store the raw data in G. We aim to distribute data nodes N to
different CSPs in terms of the storage protocol defined in Section II-C. CSP model. A cloud
storage provider (CSP) s ∈ S is parameterized by two factors < u, v > where u is a storage
load factor while v indicates the prior knowledge of a CSP. The storage load, i.e., the ratio of
the total size of data stored on a cloud to the size of entire data of the user, can be assigned
either by StoreSim (the default) or by users in terms of their preferences. The prior
knowledge of a CSP is modeled as the set of data nodes which have been stored on it. Thus,
the amount of prior knowledge of a CSP increases with the number of data nodes stored on it.
We assume that the knowledge is unforgettable, i.e., the knowledge of a data node is not
removed even when the data node itself is removed from the cloud.
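To make the model concrete, the following Java sketch encodes the labeled graph and the CSP parameters <u, v> as plain data structures. The class and label names are assumptions for illustration; StoreSim's actual metadata schema (stored in MySQL, as described later) may differ.

import java.util.*;

// An illustrative encoding of the metadata graph G = <V, E, Omega, pi>
// and the CSP model <u, v>; not StoreSim's actual schema.
public class MetadataModel {
    enum Label { USER, FOLDER, FILE, CHUNK, CONTAINS, OWNS }

    static class Vertex {
        final String id;
        final Label label;           // pi assigns a label to each vertex
        Vertex(String id, Label label) { this.id = id; this.label = label; }
    }

    static class Edge {
        final Vertex from, to;
        final Label label;           // pi also labels edges (e.g., CONTAINS)
        Edge(Vertex from, Vertex to, Label label) {
            this.from = from; this.to = to; this.label = label;
        }
    }

    // A CSP is parameterized by <u, v>: storage load factor u and prior
    // knowledge v, modeled as the set of data nodes it has ever stored.
    static class CSP {
        final String name;
        double loadFactor;                                   // u
        final Set<Vertex> priorKnowledge = new HashSet<>();  // v

        CSP(String name, double loadFactor) { this.name = name; this.loadFactor = loadFactor; }

        void store(Vertex chunk) {
            // knowledge is "unforgettable": removing the chunk later
            // does not remove it from priorKnowledge
            priorKnowledge.add(chunk);
        }
    }
}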
Storage Protocol
In essence, the storage protocol is a set of constraints or cost functions to reduce the
information leakage on data distribution across multiple clouds. The protocol in StoreSim is
to store similar chunks on the same cloud, thereby reducing information leakage to each
individual CSP. In the following, we first define information leakage for a pair of data
nodes.
The Unified Modeling Language allows the software engineer to express an analysis
model using a modeling notation that is governed by a set of syntactic, semantic and
pragmatic rules.
A UML system is represented using five different views that describe the system from
distinctly different perspectives. Each view is defined by a set of diagrams, as follows.
In one view, the data and functionality are viewed from inside the system; this view
models the static structures.
Another view represents the dynamic or behavioral parts of the system, depicting the
interactions among the various structural elements described in the user model and
structural model views.
In a further view, the structural and behavioral parts of the system are represented as
they are to be built.
We have implemented the StoreSim prototype using Java, and it includes both basic
components (such as chunking, data deduplication, bundling and encryption/decryption), and
featured components including LMLayer and CMLayer. In the LMLayer, we implement the
algorithms described in the previous sections, while the CMLayer enables StoreSim to
communicate with multiple CSPs. StoreSim employs the common fixed-size chunking with a
maximum chunk size of 512 KB. The chunk is identified by SHA-1 signature, which is also
used for data deduplication. The small chunks can be bundled as a ZIP file to minimize the
network transmission overhead. Succinctly, before the chunk is synchronized, it can be
measured for leakage optimization, encrypted, and bundled for better network transmissions.
The synchronization of StoreSim is based on the delta encoding [8], which only synchronizes
changed chunks (identified by SHA-1 signatures) between two copies. All the metadata,
organized as a data graph, is stored in a MySQL database. We support three public storage
clouds: Dropbox, Google Drive, and Amazon S3. All communication between StoreSim and
the public CSPs occurs through the APIs supplied by those CSPs. We also support the
synchronization of files to local FTP servers. The metadata server is deployed on our local
server machine, and the evaluation is conducted on a personal client machine.
Dataset For the evaluation, we aim to find such data which has undergone several
modifications, and thus results in many similar chunks. This can serve as a model for the
modifications that users make in the cloud storage services. Wikipedia and Github are two
such data sources that contain web pages and files which are reviewed and modified multiple
times. Thus, we crawled two datasets from Wikipedia and Github, respectively. The
Wikipedia dataset contains a total of 2197 web pages and each web page has a maximum 49
revisions. For each web page, the crawler only stores the text that is extracted from HTML
files. The total size of the dataset is 1.2 GB. The size of each webpage is relatively small,
which ranges from 29 bytes to 118 KB with an average size of 11 KB. The Github dataset
contains the United States Code, spanning 56 files. The files in this dataset are much larger
than those in the Wikipedia dataset, ranging from 47.7 KB to 50 MB with an average size of
5.3 MB. The files in this dataset have a maximum of 8 modifications, and the total dataset size
is 2.1 GB. Thus, we observe that the data chunks generated by the Wikipedia dataset are small
in size, with a maximum chunk size of 118 KB, but great in number (91,929), while those
generated by the Github dataset are bigger in size, with a maximum size of 512 KB, but fewer
in number (4,274).
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has
leaked. This means that some third party called the target has been caught in possession of S.
For example, this target may be displaying S on its web site, or perhaps as part of a legal
discovery process, the target turned S over to the distributor. Since the agents U1, …, Un
have some of the data, it is reasonable to suspect them of leaking the data. However, the
agents can argue that they are innocent, and that the S data was obtained by the target
through other means.
For example, say one of the objects in S represents a customer X. Perhaps X is also a
customer of some other company, and that company provided the data to the target. Or
perhaps X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did
not leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target
obtained them through other means. Not only do we want to estimate the likelihood the
agents leaked data, but we would also like to find out if one of them in particular was more
likely to be the leaker.
For instance, if one of the S objects was only given to agent U1, while the other
objects were given to all agents, we may suspect U1 more. The model we present next
captures this intuition. We say an agent Ui is guilty if it contributes one or more objects to the
target. We denote the event that agent Ui is guilty for a given leaked set S by {Gi | S}. Our
next step is to estimate Pr{Gi | S}, i.e., the probability that agent Ui is guilty given evidence
S.
Guilt Agent Detection
We can conduct an experiment and ask a person with approximately the expertise and
resources of the target to find the email of say 100 individuals.
If this person can find say 90 emails, then we can reasonably guess that the
probability of finding one email is 0.9. On the other hand, if the objects in question are bank
account numbers, the person may only discover say 20, leading to an estimate of 0.2. We call
this estimate pt, the probability that object t can be guessed by the target [2]. For simplicity
we assume that all T objects have the same pt, which we call p.
Next, we make two assumptions regarding the relationships among the various leakage
events. The first assumption simply states that an agent's decision to leak an object is not
related to other objects.
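The following sketch turns this estimation into code. It assumes the standard guilt model from the data leakage detection literature: each leaked object was either guessed by the target with probability p, or else leaked by one of the agents that received it, each with equal probability. The document states how p is estimated but not this exact formula, so treat the product below as an illustrative assumption.

import java.util.*;

// A hedged sketch of guilt estimation under the assumed model above.
public class GuiltEstimator {
    // leakedAndHeld = S ∩ Ri; vt.get(t) = number of agents that received t.
    static double guiltProbability(Set<String> leakedAndHeld,
                                   Map<String, Integer> vt, double p) {
        double notGuilty = 1.0;
        for (String t : leakedAndHeld) {
            // probability this particular object did NOT come from agent Ui
            notGuilty *= 1.0 - (1.0 - p) / vt.get(t);
        }
        return 1.0 - notGuilty; // Pr{Gi | S}
    }

    public static void main(String[] args) {
        // One leaked object held by Ui and one other agent, p = 0.2:
        // Pr{Gi | S} = 1 - (1 - 0.8/2) = 0.4
        System.out.println(guiltProbability(Set.of("t1"), Map.of("t1", 2), 0.2));
    }
}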
Allocation Strategy
Allocation strategies that are applicable to problem instances with explicit and sample
data requests are discussed below. We deal with problems with explicit data requests, and
problems with sample data requests.
Explicit Data Requests
In problems without fake objects, the distributor is not allowed to add fake objects to the
distributed data, so the data allocation is fully defined by the agents' data requests. In EF
problems, objective values are initialized by the agents' data requests. Say, for example, that
T = {t1, t2} and there are two agents with explicit data requests such that R1 = {t1, t2} and
R2 = {t1}. The value of the sum objective is in this case 1.5. The distributor cannot remove
or alter the R1 or R2 data to decrease the overlap R1 ∩ R2. However, say that the distributor
can create one fake object (B = 1) and both agents can receive one fake object (b1 = b2 = 1).
In this case, the distributor can add one fake object to either R1 or R2 to increase the
corresponding denominator of the summation term. Assume that the distributor creates a fake
object f and gives it to agent U1. Agent U1 now has R1 = {t1, t2, f} and F1 = {f}, and the
value of the sum objective decreases to 1.33 < 1.5.
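The arithmetic of this example can be checked with a short program. The sum-objective form used below, summing |Ri ∩ Rj| / |Ri| over ordered agent pairs, is inferred from the numbers in the text (1.5 before the fake object, about 1.33 after) and should be read as an assumption.

import java.util.*;

// Worked check of the sum-objective example above.
public class SumObjective {
    static double sumObjective(List<Set<String>> requests) {
        double total = 0;
        for (int i = 0; i < requests.size(); i++)
            for (int j = 0; j < requests.size(); j++) {
                if (i == j) continue;
                Set<String> overlap = new HashSet<>(requests.get(i));
                overlap.retainAll(requests.get(j));
                total += (double) overlap.size() / requests.get(i).size();
            }
        return total;
    }

    public static void main(String[] args) {
        Set<String> r1 = new HashSet<>(Set.of("t1", "t2"));
        Set<String> r2 = Set.of("t1");
        System.out.println(sumObjective(List.of(r1, r2))); // 0.5 + 1.0 = 1.5
        r1.add("f"); // distributor adds a fake object to R1
        System.out.println(sumObjective(List.of(r1, r2))); // 1/3 + 1.0 ≈ 1.33
    }
}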
If the distributor is able to create more fake objects, he can further improve the
objective. Algorithm 1 is a general "driver" that is used for allocation in the case of
explicit requests with fake records. In the algorithm, a random agent is first selected from
the list and its request is analyzed; fake records are then created by the function
CREATEFAKEOBJECT(), added to the data set, and the data set is given back to the agent
that requested it. Fake records help in identifying the agent from a leaked data set.
Algorithm 1:
Allocation for Explicit Data Requests (EF)
Input: R1, …, Rn, cond1, …, condn, b1, …, bn, B
Output: R1, …, Rn, F1, …, Fn
R ← ∅ (agents that can still receive fake objects)
for i = 1, …, n do
  if bi > 0 then
    R ← R ∪ {i}
  Fi ← ∅
while B > 0 do
  i ← SELECTAGENT(R, R1, …, Rn)
  f ← CREATEFAKEOBJECT(Ri, Fi, condi)
  Ri ← Ri ∪ {f}
  Fi ← Fi ∪ {f}
  bi ← bi − 1
  if bi = 0 then
    R ← R \ {i}
  B ← B − 1
Algorithm 2:
Agent Selection for e-random
function SELECTAGENT(R, R1, …, Rn)
  i ← select a random agent from R
  return i
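A runnable Java rendering of Algorithms 1 and 2 follows. CREATEFAKEOBJECT is treated as a black-box supplier, as in the text; the guard against an empty eligible set and the placeholder fake-object factory are additions for safety and illustration.

import java.util.*;
import java.util.function.*;

// Sketch of Algorithms 1 (driver) and 2 (e-random agent selection).
public class ExplicitAllocation {
    static final Random RNG = new Random();

    static void allocate(List<Set<String>> r,      // R1..Rn (mutated in place)
                         List<Set<String>> f,      // F1..Fn (mutated in place)
                         int[] b,                  // per-agent fake-object budgets
                         int totalBudget,          // B
                         Supplier<String> createFakeObject) {
        List<Integer> eligible = new ArrayList<>(); // R: agents with bi > 0
        for (int i = 0; i < r.size(); i++) {
            if (b[i] > 0) eligible.add(i);
            f.get(i).clear();                       // Fi ← ∅
        }
        // Extra guard (not in the pseudocode): stop if no agent is eligible.
        while (totalBudget > 0 && !eligible.isEmpty()) {
            int i = eligible.get(RNG.nextInt(eligible.size())); // Algorithm 2
            String fake = createFakeObject.get();   // black-box CREATEFAKEOBJECT
            r.get(i).add(fake);
            f.get(i).add(fake);
            if (--b[i] == 0) eligible.remove(Integer.valueOf(i));
            totalBudget--;
        }
    }
}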
Flow chart (a): Allocation for Explicit Data Requests (EF) with fake objects.
Sample Data Requests
With sample data requests, each agent Ui may receive any subset of the distribution
set T, so there are many possible object allocations. In every allocation, the distributor can
permute the T objects and keep the same chances of guilty agent detection, because the guilt
probability depends only on which agents have received the leaked objects and not on the
identity of the leaked objects. The distributor's problem is to pick one allocation that
optimizes his objective. The distributor can increase the number of possible allocations by
adding fake objects.
Algorithm 3:
Algorithm 4:
Object Selection
function SELECTOBJECT(i, Ri)
  k ← select at random an element from the set {k′ | tk′ ∉ Ri}
  return k
Flow chart (a): Allocation for Sample Data Requests without fake objects. In both of the
following cases, the selection method returns the value of Σ |Ri ∩ Rj|.
Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 =
SAMPLE(T′, 1), where T′ = EXPLICIT(T, cond2).
Further, say that cond1 is "state = CA" (objects have a state field). If agent U2 has the
same condition cond2 = cond1, we can create an equivalent problem with sample data requests
on set T′. That is, our problem will be how to distribute the CA objects to two agents, with
R1 = SAMPLE(T′, |T′|) and R2 = SAMPLE(T′, 1). If instead U2 uses condition "state =
NY," we can solve two different problems for sets T′ and T − T′. In each problem, there will
be only one agent. Finally, if the conditions partially overlap, R1 ∩ T′ ≠ ∅ but R1 ⊄ T′, we
can solve three different problems for sets R1 − T′, R1 ∩ T′, and T′ − R1.
Fake Object
The distributor may be able to add fake objects to the distributed data in order to
improve his effectiveness in detecting guilty agents. However, fake objects may impact the
correctness of what agents do, so they may not always be allowable. The idea of perturbing
data to detect leakage is not new. However, in most cases, individual objects are perturbed,
e.g., by adding random noise to sensitive salaries or adding a watermark to an image. In our
case, we perturb the set of distributor objects by adding fake elements.
For example, say the distributed data objects are medical records and the agents are
hospitals. In this case, even small modifications to the records of actual patients may be
undesirable. However, the addition of some fake medical records may be acceptable, since no
patient matches these records, and hence no one will ever be treated based on fake records. A
trace file is maintained to identify the guilty agent. Trace files are a type of fake object that
helps to identify improper use of data. The creation of fake but real-looking objects is a
nontrivial problem whose thorough investigation is beyond the scope of this paper. Here, we
model the creation of a fake object for agent Ui as a black-box function
CREATEFAKEOBJECT(Ri, Fi, condi) that takes as input the set of all objects Ri, the
subset of fake objects Fi that Ui has received so far, and condi, and returns a new fake object.
This function needs condi to produce a valid object that satisfies Ui's condition. Set Ri is
needed as input so that the created fake object is not only valid but also indistinguishable
from other real objects.
Optimization Problem
The distributor’s data allocation to agents has one constraint and one objective. The
distributor’s constraint is to satisfy agents’ requests, by providing them with the number of
objects they request or with all available objects that satisfy their conditions. His objective is
to be able to detect an agent who leaks any of his data objects.
We consider the constraint as strict. The distributor may not deny serving an agent
request and may not provide agents with different perturbed versions of the same objects. We
consider fake object allocation as the only possible constraint relaxation. Our detection
objective is ideal and intractable. Detection would be assured only if the distributor gave no
data object to any agent. We use instead the following objective: maximize the chances of
detecting a guilty agent that leaks all his objects.
We now introduce some notation to state the distributor's objective formally. Recall
that Pr{Gj | S = Ri}, or simply Pr{Gj | Ri}, is the probability that agent Uj is guilty if the
distributor discovers a leaked table S that contains all Ri objects. We define the difference
function Δ(i, j) as:

Δ(i, j) = Pr{Gi | S = Ri} − Pr{Gj | S = Ri},  i, j = 1, …, n  (6)

Note that the differences have non-negative values: given that set Ri contains all the leaked
objects, agent Ui is at least as likely to be guilty as any other agent. Difference Δ(i, j) is
positive for any agent Uj whose set Rj does not contain all data of S.
It is zero if Ri ⊆ Rj; in this case the distributor will consider both agents Ui and Uj equally
guilty, since they have both received all the leaked objects. The larger a Δ(i, j) value is, the
easier it is to identify Ui as the leaking agent. Thus, we want to distribute data so that the Δ
values are large.
SYSTEM IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is turned into a
working system. It can thus be considered the most critical stage in achieving a successful
new system and in giving the user confidence that the new system will work and be
effective.
Meta-Data Generation:
Let the verifier V wish to store the file F with the archive, and let this file F consist of n file
blocks. We initially preprocess the file and create metadata to be appended to the file. Let
each of the n data blocks have m bits in them. Consider a typical data file F which the client
wishes to store in the cloud.
The metadata from each data block mi is encrypted by using a suitable algorithm to give
new modified metadata Mi. Without loss of generality, we show this process using a simple
XOR operation. The encryption method can be improved to provide still stronger protection
for the verifier's data. All the metadata bit blocks that are generated using the above
procedure are concatenated together. This concatenated metadata is appended to the file F
before storing it at the cloud server. The file F, along with the appended metadata F̃, is
archived with the cloud.
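A minimal sketch of this step, assuming the simple XOR operation mentioned above: each metadata block mi is XORed with a key to give the modified metadata Mi. The way metadata is derived from a block is left open in the text, so the derivation below is a placeholder.

import java.security.SecureRandom;
import java.util.Arrays;

// Sketch of metadata generation with XOR encryption, as in the text.
public class MetadataGen {
    // Encrypt one metadata block mi with key ki by XOR, yielding Mi.
    static byte[] xorEncrypt(byte[] mi, byte[] ki) {
        byte[] Mi = new byte[mi.length];
        for (int b = 0; b < mi.length; b++)
            Mi[b] = (byte) (mi[b] ^ ki[b % ki.length]);
        return Mi;
    }

    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();
        byte[] key = new byte[16];
        rng.nextBytes(key);

        byte[] block = "example data block".getBytes();
        byte[] meta = Arrays.copyOf(block, 8);   // placeholder metadata derivation
        byte[] encMeta = xorEncrypt(meta, key);  // Mi = mi XOR ki

        // XOR is its own inverse, so the verifier can recover mi with the key.
        System.out.println(Arrays.equals(meta, xorEncrypt(encMeta, key))); // true
    }
}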
Modules:
Cloud Storage:
Data outsourcing to cloud storage servers is a rising trend among many firms and users owing
to its economic advantages. This essentially means that the owner (client) of the data moves
its data to a third-party cloud storage server, which is supposed to, presumably for a fee,
faithfully store the data and provide it back to the owner whenever required.
Simply Archives:
This problem tries to obtain and verify a proof that the data stored by a user at a remote
data storage in the cloud (called cloud storage archives, or simply archives) is not modified by
the archive, thereby assuring the integrity of the data. Cheating, in this context, means that
the storage archive might delete some of the data or may modify some of the data. While
developing proofs for data possession at untrusted cloud storage servers, we are often
limited by the resources at the cloud server as well as at the client.
Sentinels:
In this scheme, unlike in the key-hash approach, only a single key can be used
irrespective of the size of the file or the number of files whose retrievability it wants to verify.
Also, the archive needs to access only a small portion of the file F, unlike in the key-hash
scheme, which required the archive to process the entire file F for each protocol verification.
If the prover has modified or deleted a substantial portion of F, then with high probability it
will also have suppressed a number of sentinels.
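A hedged sketch of the sentinel idea follows: the verifier plants random sentinel blocks at secret positions before upload and later spot-checks them. The positions, sizes and in-memory "archive" are illustrative; a real scheme would also encrypt the file so that sentinels are indistinguishable from data.

import java.security.SecureRandom;
import java.util.*;

// Toy demonstration of sentinel embedding and spot-checking.
public class SentinelCheck {
    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();

        // The stored file as seen by the verifier: position -> block content.
        Map<Integer, byte[]> archive = new HashMap<>();
        int fileBlocks = 1000, numSentinels = 10;
        for (int i = 0; i < fileBlocks; i++) archive.put(i, new byte[] {(byte) i});

        // Verifier secretly remembers sentinel positions and values.
        Map<Integer, byte[]> sentinels = new HashMap<>();
        while (sentinels.size() < numSentinels) {
            int pos = rng.nextInt(fileBlocks);
            byte[] s = new byte[4];
            rng.nextBytes(s);
            sentinels.put(pos, s);
            archive.put(pos, s); // embedded into the stored file
        }

        // Later: if the prover deleted or modified a substantial portion of
        // the file, with high probability some sentinel is damaged.
        boolean intact = sentinels.entrySet().stream()
                .allMatch(e -> Arrays.equals(archive.get(e.getKey()), e.getValue()));
        System.out.println("sentinels intact: " + intact);
    }
}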
Verification Phase:
Before storing the file at the archive, the verifier preprocesses it, appends some
metadata, and stores it at the archive. At the time of verification, the verifier uses this
metadata to verify the integrity of the data. It is important to note that our proof of data
integrity protocol only checks the integrity of the data, i.e., whether the data has been
illegally modified or deleted; it does not prevent the archive from modifying the data.
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, subassemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of test, and each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. It is the testing of individual software
units of the application; it is done after the completion of an individual unit, before
integration. This is structural testing that relies on knowledge of the unit's construction and
is invasive. Unit tests perform basic tests at component level and test a specific business
process, application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to determine
whether they actually run as one program. Testing is event driven and is more concerned with
the basic outcome of screens or fields. Integration tests demonstrate that although the
components were individually satisfactory, as shown by successful unit testing, the
combination of components is correct and consistent. Integration testing is specifically aimed
at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.
System Test
System testing ensures that the entire integrated software system meets requirements. It
tests a configuration to ensure known and predictable results. An example of system testing is
the configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.
Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
Test objectives
The task of the integration test is to check that components or software applications,
e.g., components in a software system or, one step up, software applications at the company
level, interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
4. RESULTS AND DISCUSSION
5.1 CONCLUSION
It is not possible to develop a system that meets all of the user's requirements, since
user requirements keep changing as the system is used. Some of the future enhancements
that can be made to this system are:
As technology emerges, the system can be upgraded and adapted to the desired
environment.
Based on future security issues, security can be improved using emerging technologies
such as single sign-on.
REFERENCES
1. J. Crowcroft, “On the duality of resilience and privacy,” in Proceedings of the Royal
Society of London A: Mathematical, Physical and Engineering Sciences, vol. 471, no.
2175. The Royal Society, 2015, p. 20140862.
6. G. Greenwald and E. MacAskill, “NSA Prism program taps in to user data of Apple,
Google and others,” The Guardian, vol. 7, no. 6, pp. 1–43, 2013.
7. T. Suel and N. Memon, “Algorithms for delta compression and remote file
synchronization,” 2002.
27. Y. Zhu, H. Wang, Z. Hu, G.-J. Ahn, H. Hu, and S. S. Yau, “Cooperative provable
data possession.” Cryptology ePrint Archive, Report 2010/234, 2010.
http://eprint.iacr.org/.
28. Z. Hao and N. Yu, “A multiple-replica remote data possession checking protocol
with public verifiability,” in ISDPE2010, IEEE.
29. O. Goldreich, Foundations of Cryptography. Cambridge University Press, 2004.
30. I. Damgård, “Towards practical public key systems secure against chosen ciphertext
attacks,” in CRYPTO’91, Springer-Verlag, 1992.
31. M. Bellare and A. Palacio, “The knowledge-of-exponent assumptions and 3-round
zero-knowledge protocols,” in CRYPTO’04, pp. 273– 289, Springer, 2004.
32. G. L. Miller, “Riemann’s hypothesis and tests for primality,” in STOC’75, (New
York, NY, USA), pp. 234–239, ACM, 1975.
33. Z. Hao, S. Zhong, and N. Yu, “A privacy-preserving remote data integrity checking
protocol with data dynamics and public verifiability,” SUNY Buffalo CSE department
technical report 2010-11, 2010.http://www.cse.buffalo.edu/tech-reports/2010-11.pdf.
34. Multiprecision Integer and Rational Arithmetic C/C++ Library.
http://www.shamus.ie/.