IEEE TRANSACTIONS ON BIG DATA, VOL. 5, NO. 3, JULY-SEPTEMBER 2019 393

Heterogeneous Data Storage Management with Deduplication in Cloud Computing

Zheng Yan, Senior Member, IEEE, Lifang Zhang, Wenxiu Ding, and Qinghua Zheng, Member, IEEE

Abstract—Cloud storage as one of the most important services of cloud computing helps cloud users break the bottleneck of restricted
resources and expand their storage without upgrading their devices. In order to guarantee the security and privacy of cloud users, data
are always outsourced in an encrypted form. However, encrypted data could incur much waste of cloud storage and complicate data
sharing among authorized users. We are still facing challenges on encrypted data storage and management with deduplication.
Traditional deduplication schemes always focus on specific application scenarios, in which the deduplication is completely controlled by
either data owners or cloud servers. They cannot flexibly satisfy various demands of data owners according to the level of data
sensitivity. In this paper, we propose a heterogeneous data storage management scheme, which flexibly offers both deduplication
management and access control at the same time across multiple Cloud Service Providers (CSPs). We evaluate its performance with
security analysis, comparison and implementation. The results show its security, effectiveness and efficiency towards potential
practical usage.

Index Terms—Data deduplication, cloud computing, access control, storage management

• Z. Yan is with the State Key Laboratory of Integrated Services Networks, School of Cyber Engineering, Xidian University, POX 91, No. 2 South Taibai Road, Xi’an 710071, China, and also with the Department of Communications and Networking, Aalto University, Otakaari 5, Espoo 02150, Finland. E-mail: zyan@xidian.edu.cn.
• L. Zhang is with the Department of Communications and Networking, Aalto University, Otakaari 5, Espoo 02150, Finland. E-mail: lifang.zhang@aalto.fi.
• W. Ding is with the School of Cyber Engineering, Xidian University, No. 2 South Taibai Road, Xi’an 710071, China. E-mail: wenxiuding_1989@126.com.
• Q. Zheng is with the Xi’an Jiaotong University, Xi’an 710071, China. E-mail: qhzheng@xjtu.edu.cn.

Manuscript received 15 Oct. 2016; revised 28 Mar. 2017; accepted 2 May 2017. Date of publication 3 May 2017; date of current version 9 Sept. 2019.
(Corresponding author: Zheng Yan)
Recommended for acceptance by K.-K. R. Choo, M. Conti, and A. Dehghantanha.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TBDATA.2017.2701352
2332-7790 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

CLOUD computing allows centralized data storage and online access to computer services or resources. It offers a new way of Information Technology (IT) services by re-arranging various resources and providing them to users based on their demands. Cloud computing has greatly enriched pervasive services and become a promising service platform due to a number of desirable properties [40], [41], such as scalability, elasticity, fault-tolerance, and pay-per-use. Data storage service is one of the most widely consumed cloud services. Cloud users have greatly benefited from cloud storage since they can store huge volume of data without upgrading their devices and access them at any time and in any place. However, cloud data storage offered by Cloud Service Providers (CSPs) still incurs some problems.

First of all, various data stored at the cloud may request different ways of protection due to different data sensitivity. The data stored at the cloud include sensitive personal information, publicly shared data, data shared within a group, and so on. Obviously, crucial data should be protected at the cloud to prevent from any access of unauthorized parties. Some unimportant data, however, have no such a requirement. As outsourced data could disclose personal or even sensitive information, data owners sometimes would like to control their data by themselves, while on some occasion, they prefer to delegate their control to a third party since they cannot be always online or have no idea how to perform such a control. How to make cloud data access control adapt to various scenarios and satisfy different user demands becomes a practically important issue. Access control on encrypted data has been widely studied in the literature [10], [11], [12], [13], [14], [15], [16], [17], [33]. However, few of them can flexibly support various requirements on cloud data protection in a uniform way, especially with economic deduplication management.

Second, flexible cloud data deduplication with data access control is still an open issue. Duplicated data could be stored at the cloud [39] in an encrypted form by the same or different users, in the same or different CSPs. From the standpoint of compatibility, it is highly expected that data deduplication can cooperate well with data access control. That is the same data (either encrypted or not) are only stored once at the cloud, but can be accessed by different users based on the policies of data owners or data holders (i.e., the eligible data users who hold original data). Although cloud storage space is huge, duplicated data storage could greatly waste networking resources, consume plenty of power energy, increase operation costs, and make data management complicated. Economic storage will greatly benefit CSPs by decreasing their operation costs and reversely benefit cloud users with reduced service fees. Obviously, cloud data deduplication is particularly significant for big data storage and
management. However, the literature still lacks studies on flexible cloud data deduplication across multiple CSPs. Existing work cannot offer a generic solution to support both deduplication and access control in a flexible and uniform way over the cloud [18], [22], [23], [24], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38].

In this paper, we propose a holistic and heterogeneous data storage management scheme in order to solve the above problems. The proposed scheme is compatible with the access control scheme proposed in [33]. It further realizes flexible cloud storage management with both data deduplication and access control that can be operated by either the data owner or a trusted third party or both or none of them. Moreover, the proposed scheme can satisfy miscellaneous data security demands and at the same time save storage spaces with deduplication across multiple CSPs. Thus it can fit into various data storage scenarios. Our scheme is original and different from the existing work. It is a generic scheme to realize encrypted cloud data deduplication with access control, which supports the cooperation between multiple CSPs. Specifically, the contributions of this paper are:

• We motivate to save cloud storage across multiple CSPs and preserve data security and privacy by managing encrypted data storage with deduplication in various situations.
• We propose a heterogeneous data management scheme to support both deduplication and access control according to the demands of data owners, which can adapt to different application scenarios. Our scheme can support data sharing among eligible users in a flexible way, which can be controlled by either the data owners or other trusted parties or both of them.
• We justify the performance of the proposed scheme through security analysis, comparison with existing work and implementation based performance evaluation. The results show its security, advantages, efficiency and potential applicability.

The rest of the paper is organized as below. We give a brief review on related work in Section 2. In Section 3, we present a system and security model, and introduce notations and preliminaries that are used in our scheme. We present the detailed design of the proposed scheme in Section 4, followed by security analysis, comparison with existing work and performance evaluation in Section 5. Finally, the last section concludes the paper.

2 RELATED WORK

2.1 Access Control on Encrypted Data

Existing researches [1], [2], [3] proposed to encrypt data before outsourcing it to the cloud in order to prevent data privacy from being invaded at CSP. Access control on encrypted data requests that only authorized entities can decrypt the encrypted data. An ideal approach is to encrypt each data once and issue relevant keys to authorized entities only once. However, due to the changeability of trust relationships, key management becomes complicated due to frequent key update.

Access Control Lists (ACLs) were applied to ensure data security in a distrusted or semi-trusted party (e.g., CSP). Before uploading data to CSP, the data owner first classifies the data into different groups, and then encrypts each group with a symmetric key, which is only distributed to the users in the ACL of the group. In this way, this group of data is only accessible by the users in the ACL [4]. The shortcoming of this scheme mainly comes from the fact that the number of symmetric keys increases linearly with the number of groups. Moreover, the trust relationship change between one individual user and the data owner could cause essential update of relevant symmetric keys, which impacts other users in the same ACL. Thereby, this approach is impractical to be applied in many real applications where the trust relationship between different users changes frequently. Combining a traditional symmetric cryptosystem and an asymmetric cryptographic system was proposed for cloud data access control [5]. However, the computation cost of key encryption increases linearly with the number of users in the ACL.

Attribute-Based Encryption (ABE) [6], [7], [8], [9] was proposed to achieve access control on encrypted cloud data. It specifies a set of attributes to identify users and encrypts data based on an access structure specified by attributes. Thus, encrypted data can only be decrypted by the users that hold such attributes that can satisfy the access structure. ABE is classified into two divisions: key-policy ABE (KP-ABE) [7] and ciphertext-policy ABE (CP-ABE) [6], [8] according to how the attributes link to ciphertexts and decryption keys. ABE has such advantages as scalability and high flexibility in terms of attributes based access policies and fine-grained access control. It has been widely applied to secure cloud data storage in recent years [10], [11], [12], [13], [14], [15], [16], [17]. However, all above existing solutions about access control on encrypted data did not consider how to solve the issue of duplicated data storage in cloud computing in a holistic and comprehensive manner, especially for encrypted data in various data storage scenarios. This issue is practically significant for big data secure storage over the cloud.

2.2 Encrypted Data Deduplication

It is a hot research topic to reconcile deduplication and client-side encryption [18]. The need for data deduplication can also help efforts in recovering evidence from cloud services for forensic analysis, as noted by forensic researchers [42], [43], [44], [45]. Existing industrial solutions fail to perform deduplication on encrypted data, e.g., Dropbox [19], Google Drive [20], and Mozy [21]. Message-Locked Encryption (MLE) was proposed to resolve this tension [22]. Convergent Encryption (CE), the most prominent manifestation of MLE, was introduced [23], [24]. In CE, a user computes the key of data $M$ based on its hash code, $K = H(M)$, and encrypts $M$ with $K$. Another user holding the same data can produce the same encrypted data, thus realizing deduplication. The CE suffers from offline brute-force dictionary attacks. As a result, CE can ensure high security only when the underlying data is drawn from a large space that is too big to exhaust. In addition, CE cannot support data access controlled by data owners, as well as other authorized parties. It is hard to support data revocation because generating a same new encryption key is hard to achieve for both the data owners and the data holders to re-encrypt the data.
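
The convergent-encryption idea described above can be made concrete with a minimal sketch. It assumes SHA-256 as the hash function $H$ and, purely for illustration, a toy XOR keystream derived from SHA-256 in place of a real symmetric cipher. All function names (`convergent_key`, `ce_encrypt`, `ce_decrypt`, `dictionary_attack`) are hypothetical and are not taken from [22], [23], or [24]. The sketch shows why identical data yields identical ciphertext, and why data drawn from a small candidate set can be recovered by an offline dictionary attack.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def convergent_key(data: bytes) -> bytes:
    """K = H(M): the key is derived from the data itself (SHA-256 assumed)."""
    return hashlib.sha256(data).digest()


def _keystream(key: bytes, length: int) -> bytes:
    # Toy keystream (SHA-256 over key || counter). Illustration only,
    # not a vetted cipher; a real system would use a standard cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def ce_encrypt(data: bytes) -> Tuple[bytes, bytes]:
    """Return (K, ciphertext); deterministic, so equal data gives equal output."""
    key = convergent_key(data)
    cipher = bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))
    return key, cipher


def ce_decrypt(key: bytes, cipher: bytes) -> bytes:
    """XOR with the same keystream recovers the data."""
    return bytes(a ^ b for a, b in zip(cipher, _keystream(key, len(cipher))))


def dictionary_attack(cipher: bytes, candidates: Iterable[bytes]) -> Optional[bytes]:
    """Offline guess-and-check: encrypt each candidate and compare ciphertexts."""
    for guess in candidates:
        if ce_encrypt(guess)[1] == cipher:
            return guess
    return None
```

Two users calling `ce_encrypt` on the same bytes obtain the same ciphertext, so a server can deduplicate by comparing ciphertexts. The same determinism is what `dictionary_attack` exploits when the data comes from a guessable set.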
A number of schemes were proposed to overcome the weakness of CE. Bellare et al. proposed DupLESS to resist the above-mentioned brute-force attacks [18]. In DupLESS, users encrypt their data using the keys obtained from a Key Server (KS). They are generated based on the data with an oblivious Pseudo Random Function (PRF) protocol. The KS is separated from a Storage Service (SS). Users authenticate themselves to the KS without leaking any information about their data. Thus, high security can be assured if the KS is not accessible to attackers. Even though both KS and SS are compromised, DupLESS can still preserve the security of stored data based on the guarantee of MLE. But some data owners do not like to authorize a third party like KS to control their data, since in some specific situations they prefer to manage the storage and access of their data by themselves and keep track of data storage and usage status. However, DupLESS cannot support this desirable feature. Li et al. presented an efficient and reliable convergent key management scheme by splitting a convergent key and distributing its shares among multiple servers [35]. However, it still cannot avoid the innate drawbacks of CE. Wen et al. constructed a session-key-based convergent key management scheme and a convergent key sharing scheme to solve the issue that encrypted data blocks and data ownership are frequently changed [36]. But this work requests all data owners communicate with each other to manage their session key. The problem of CE still exists. Liu et al. proposed a secure cross-user deduplication scheme that supports client-side encryption without requiring any additional independent servers by applying a password authenticated key exchange protocol [38]. But this scheme requests that the data owner is always online for data ownership check and deduplication. Thus this approach cannot handle the situation that the data owner is not available, which is very common in practice. Cross-CSP was not discussed in this work. The above schemes cannot flexibly manage data deduplication in various situations and across multiple CSPs. They cannot solve the issues as described in the introduction. Neither can they support the management of digital rights.

Existing schemes realized deduplication in either server-side or owner-side. Seldom, a hybrid solution was proposed to gain advantages of both approaches. In [29], the authors proposed a method to solve deduplication controlled by data owner only. The access control of other data holders is based on predefined metadata that describes eligible users and is shared with CSP. Applying public key encryption in this method results in high computation complexity, which is linearly increased with the number of users and lacks flexibility to support various data storage scenarios. Hur et al. proposed a novel server-side deduplication scheme for encrypted data [34]. It allows the cloud server to control access to outsourced data even when the ownership changes dynamically by exploiting randomized convergent encryption and secure ownership group key distribution. This scheme prevents data leakage not only to revoked users but also to an honest-but-curious cloud storage server.

Yan et al. [30], [31] proposed a deduplication scheme based on PRE, but it completely relied on an authorized party to control data deduplication. It cannot flexibly adapt to different scenarios, especially the data access controlled by the data holders. In another line of our previous work [32], we applied ABE to realize deduplicated data access control managed by data owners. Similarly, this scheme cannot solve the issue about flexible data access control with deduplication across multiple CSPs, which can be managed by any trusted parties based on real application demands.

2.3 Other Related Work

Yang et al. proposed a scheme called Provable Ownership of the File (POF) [25], which allows a user to prove to a server that it really possesses a file without the need to upload the entire file. Data ownership proof is an essential process of data deduplication, especially for encrypted data. But this scheme does not consider flexible deduplication control across multiple CSPs. Yuan and Yu proposed a scheme to achieve data deduplication and secure data integrity auditing at the same time [28]. It supports both public and batch auditing. This work applied different technologies (i.e., polynomial-based authentication tags and homomorphic linear authenticators) from ours and focused on solving a different research issue. Wu et al. developed Index Name Servers (INS) to reduce the workload caused by duplicated data. But this work cannot support the deduplication on encrypted data. A hybrid data deduplication mechanism was proposed by Fan et al. [27]. It can deduplicate both plaintext and ciphertext. However, this mechanism has such a drawback that CSP knows the key that is used for data encryption. Therefore, it cannot be applied into such a situation that the CSP cannot be fully trusted by data owners. Li et al. formally addressed the problem of authorized data deduplication [37]. Different from traditional deduplication systems, the differential privileges of users are further considered in duplicate check besides the data itself in a hybrid cloud architecture. All above work focused on solving different research issues from ours.

3 PROBLEM STATEMENT

3.1 System and Security Model

Fig. 1. System model.

Fig. 1 shows the system that the proposed scheme can be applied into. It contains four types of entities:

1) Key Generation Center (KGC) that is fully trusted and responsible for system parameter generation and certification issuance.
2) The Cloud Service Provider (CSP) that offers a data storage service. Multiple CSPs could exist in the system. Thus, a cloud user can choose one of them to
manage its uploaded data and seek advanced usage experiences. In addition, CSPs can cooperate with each other under a business agreement to save storage spaces through deduplication;
3) The Data Owner or the Data Holder that uploads and stores data at CSPs. Different CSPs may serve the data holders. Multiple eligible data holders or a single cloud user could store the same encrypted or plain data at one CSP or across CSPs;
4) The Authorized Party (AP) that is responsible for controlling data access as a delegate of data owners as they expect to support deduplication.

In this system, AP is trusted by all entities. The CSPs cannot be fully trusted. That is, they are curious about the raw data of cloud users but follow system design and protocols strictly. We hold such an assumption that the AP would never collude with the CSPs due to different business incentive and interests. Any collusion would worsen the reputation of the CSPs, which would lead to the final loss of their business.

We additionally hold the following assumptions. The data holder provides the correct hash code set of its data for data ownership verification. The first eligible data holder that uploads the data is regarded as the data owner. Multiple APs could exist in the system and can be supported by the underlying scheme. For simplification, we assume one AP in the system for easy presentation. CSP, AP and data owners/holders use secure channels to communicate with each other. Backup of stored data is generally performed by CSP, and this kind of data duplication, which aims to eliminate storage risk, is outside the scope of this paper. Delegation agreement could be negotiated and signed among data holders during ownership check for data access management. If a data holder does not want any delegation, the procedure of our scheme will turn to this data holder regarding data access, which means the access will be jointly controlled online by this data holder and the data owner. For simplifying the system process, we assume delegation can be agreed among all data holders.

3.2 Notations and Preliminaries

3.2.1 Notations

Table 1 summarizes the notations used in this paper.

TABLE 1
Notations

| Key | Description | Usage |
|---|---|---|
| $PK_u$ | The public key of $u$ about ABE | The unique ID of user $u$ and the key for user attribute verification; it is used to generate personalized secret attribute key for $u$. |
| $SK_u$ | The secret key of $u$ about ABE | For decryption in ABE. |
| $PK'_u$ | The public key of $u$ about Public-Key Cryptosystem (PKC) | For PKC encryption and signature verification. |
| $SK'_u$ | The secret key of $u$ about PKC | For PKC decryption and signature generation. |
| $DEK_u$ | The symmetric key of $u$ | User data encryption. |
| $DEK_{1,u}$ | The partial key 1 of $DEK_u$ | |
| $DEK_{2,u}$ | The partial key 2 of $DEK_u$ | |
| $pk_{ID,u}$ | The public key of $u$ regarding attribute $ID$ | For encrypting $DEK_{2,u}$. |
| $sk_{ID,u,u'}$ | The secret key of $u'$ regarding attribute $ID$ issued by $u$ | For decryption to get $DEK_{2,u}$. |
| $DEK'_u$ | The renewed symmetric key of $u$ | |
| $H(\cdot)$ | The hash function | |
| $CT_u$ | The ciphertext of $u$ | |
| $CK_u$ | The cipherkey of $u$ | |
| $pk_u$ | The public key of $u$ about PRE | Generation of re-encryption key for $u$. |
| $sk_u$ | The secret key of $u$ about PRE | Decryption in PRE. |
| $M$ | The duplicated data | |
| $HC(M)$ | The hash code set of data $M$ | |

3.2.2 Proxy Re-Encryption (PRE)

PRE transforms the ciphertext of $m$ encrypted with the public key of entity $A$ into one that can be decrypted with the private key of entity $B$ at a proxy.

$E(pk_A, m)$ outputs ciphertext $Cipher_A = E(pk_A, m)$ by taking input $pk_A$ and data $m$.
$RG(sk_A, pk_B)$, the re-encryption key generation algorithm, outputs re-encryption key $rk_{A \to B}$ for a proxy (e.g., CSP) by taking input $(sk_A, pk_B)$.
$R(rk_{A \to B}, Cipher_A)$, the re-encryption algorithm, outputs $R(rk_{A \to B}, Cipher_A) = E(pk_B, m) = Cipher_B$ by taking input $rk_{A \to B}$ and $Cipher_A$. $Cipher_B$ can be decrypted with $sk_B$.
$D(sk_B, Cipher_B)$ outputs plain data $m$ by taking input $sk_B$ and $Cipher_B$.

Each user has a key pair for PRE, which is applied when AP is involved in taking charge of data deduplication and access control. PRE allows AP to grant data access right to an eligible user through re-encryption at CSP, while the plain data cannot be gained by the CSP.

3.2.3 Attribute-Based Encryption

We also resort to control data access during deduplication based on user identity by applying ABE [6], [7], [8], [9]. The advantage of adopting ABE is that a data owner encrypts a data encryption key only once when it issues access rights to a number of eligible data holders. Because the issued decryption keys are personalized for the data holders, they cannot collude with each other. We can also easily realize fine-grained access control with ABE, which further enhances the flexibility of our scheme. We can use either CP-ABE to simplify key management or KP-ABE to gain the efficiency of data encryption. In our implementation as described in Section 5.3, we applied CP-ABE to demonstrate the scheme.

In the proposed scheme, each user maintains a secret key $SK_u$ about ABE. $SK_u$ and the identities of other users are used to generate the ABE decryption key of the users based on attribute $ID$, named as secret attribute key. $ID$ denotes the identity attribute, which can be an anonymous identifier of the user. $pk_{ID,u}$ is the public key used to encrypt a partial key of $DEK_u$. Data owner $u$ issues a personalized secret
attribute key $sk_{ID,u,u'}$ to eligible data holder $u'$ through a secure channel for decrypting the part of cipher-key encrypted with $pk_{ID,u}$.

Our scheme is heterogeneous and flexible. In some scenarios, data owners would like to directly control data deduplication, e.g., in the case that they know the data holders, which is also the scenario that was referred to by the work in [36]. The scheme proposed in our paper is more advanced than existing work because it can adapt to various application scenarios. For example, the data owner can manage deduplication directly, or it does not know how to manage it and thus delegates this task to a third party, or it would like to perform dual control or no control. All above scenarios can be supported by our scheme.

Notably, our scheme is a framework that can adapt to various user policies on data deduplication. The data owner adopts the ABE algorithm to directly manage its data deduplication and sharing. In different scenarios, the access policy would differ from each other, which is based on the data sensitivity and the willingness of the data owner. For simplicity, we directly regard user identity as a basic attribute in ABE and data owners are responsible for the ABE setup and key management. If higher security is required and fine-grained access control is expected, more complicated access policy can be designed and this can be realized based on the properties of ABE.

4 SYSTEM DESIGN

4.1 Overview

We propose a scheme for heterogeneous data storage management with deduplication. It can be flexibly applied into such scenarios that cloud data deduplication is handled 1) only by the data owner; 2) by any trusted third party; 3) by both the data owner and the trusted third party; 4) by nobody (i.e., plain data is stored at the cloud); 5) by either the data owner or the trusted third party.

Concretely, we use the hash code of data $M$ to check data duplication during data storage at the cloud. The data holder signs the hash code of the data for passing the originality verification of CSP. Meanwhile, a number of hash codes of randomly selected specific parts of the data are calculated with their indexes (e.g., the hash code of the first 15.1 percent of $M$, the hash code of 21-25 percent of $M$). We call these hash codes as the hash code set $HC(M)$ of data $M$.

When the data owner/holder stores $M$ at CSP, it sends the signed hash code of $M$ to CSP for duplication check. If there is no duplicated data stored at CSP, the data owner encrypts $M$ with a randomly generated symmetric key $DEK$ to get encrypted data $CT$. It separates $DEK$ into two parts $DEK_1$ and $DEK_2$. It encrypts $DEK_1$ with $pk_{AP}$ by applying PRE to get $CK_1$ and encrypts $DEK_2$ with ABE by using $pk_{ID}$ to get $CK_2$. The encrypted two parts of $DEK$ are passed to CSP together with $CT$.

If the above duplication check is positive, CSP further verifies the ownership of the data holder by challenging the hash code set of $M$, concretely some specific hash codes. If the ownership verification is positive, CSP contacts the data owner and/or AP for deduplication.

During deduplication, the data owner issues a personalized secret key through a secure communication channel (e.g., public key cryptosystem) to a data holder for decrypting $CK_2$ if eligibility verification is positive (i.e., the data holder is allowed by the data owner to store data $M$ at CSP). Meanwhile, AP issues CSP a re-encryption key that is used to re-encrypt $CK_1$ to make it decryptable by the duplicated data holder in order to get $DEK_1$. By getting both $DEK_1$ and $DEK_2$, the duplicated data holder can gain $DEK$ and access $CT$ at CSP. Data duplication check and data deduplication can be performed among CSPs based on their agreement. One CSP can store data for other CSPs. Duplicated data access from the eligible users of other CSPs can be supported among the CSPs.

Depending on the data management policy set by the data owner, $DEK$ can be randomly divided into multiple parts, which are taken care of by different authorized parties (e.g., multiple APs). For simplifying presentation, we illustrate our scheme by dividing $DEK$ into two parts: $DEK_1$ and $DEK_2$. The following use cases can be flexibly supported:

1) when $DEK_1$ is null and $DEK_2 = DEK$, the data owner solely controls data deduplication;
2) when $DEK_1 = DEK$ and $DEK_2 = \text{null}$, data deduplication is only controlled by AP;
3) when $DEK_1 \neq \text{null}$, $DEK_2 \neq \text{null}$ and $DEK_1 \| DEK_2 = DEK$, data deduplication is controlled by both AP and the data owner;
4) when $DEK_1 = DEK_2 = DEK$, data deduplication is managed by either AP or the data owner;
5) when $DEK_1 = DEK_2 = DEK = \text{null}$, plaintext is stored at CSP that handles deduplication without any specific control indicated by the data owner.

4.2 Fundamental Algorithms

In this section, we introduce a number of fundamental algorithms of the proposed scheme.

4.2.1 System Setup

InitiateSystem. This algorithm is conducted at the KGC. It generates basic system parameters related to ABE and PRE, such as generators and universal attributes, etc.

InitiateNode($u$). Based on the system parameters, cloud user $u$ generates its own key pairs including ABE master key pair $PK_u$ and $SK_u$ used for ABE encryption and user decryption key issuance, PKC key pair $PK'_u$ and $SK'_u$ for signing, as well as $pk_u$ and $sk_u$ regarding PRE.

SetupNode($u$): With node identity $u$ and public keys as input, this algorithm conducted at KGC outputs a number of user credentials, $Cert(PK_u)$, $Cert(PK'_u)$ and $Cert(pk_u)$, which can be verified by CSPs and their users.

InitiateAP. AP initiates itself by generating $pk_{AP}$ and $sk_{AP}$. $pk_{AP}$ is broadcast to the users of CSPs.

4.2.2 ABE Key Generation

CreateIDPK($ID$, $SK_u$): This algorithm checks the policies about $ID$ and outputs $pk_{ID,u}$ for user $u$ to allow $u$ to control its data deduplication and access.

IssueIDSK($ID$, $SK_u$, $PK_{u'}$). This algorithm is run by $u$ to issue $sk_{ID,u,u'}$ to $u'$ if the eligibility check of $u'$ is positive. Otherwise, it outputs NULL. Specifically, user $u$ checks the attributes of $u'$. If they satisfy the policy defined by $u$, $u$ issues a secret key to $u'$ for sharing the duplicated data storage and allowing its future access. Otherwise, it rejects the request. For simplifying our presentation, we set user identity as an example attribute rather than complex attributes herein. The
about $DEK_1$. The algorithms related to PRE are represented as below:

$E(pk_{AP}, DEK_{1,u})$ outputs $CK_1 = E(pk_{AP}, DEK_{1,u})$ by taking $pk_{AP}$ and $DEK_{1,u}$ as input.
$RG(pk_{AP}, sk_{AP}, pk_{u'})$ outputs re-encryption key $rk_{AP \to u'}$ for the proxy CSP by taking $pk_{AP}$, $sk_{AP}$, and $pk_{u'}$ as input.
$R(rk_{AP \to u'}, CK_1)$ takes input $rk_{AP \to u'}$ and $CK_1$, and outputs $R(rk_{AP \to u'}, CK_1) = E(pk_{u'}, DEK_{1,u}) = CK'_1$, which can be decrypted with $sk_{u'}$.
$D(sk_{u'}, CK'_1)$ outputs $DEK_{1,u}$ by taking $sk_{u'}$ and $CK'_1$ as input.
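
The four PRE algorithms listed above can be illustrated with a toy sketch. It is not the PRE construction used by the scheme: as a stand-in, it uses the ElGamal-based bidirectional scheme of Blaze, Bleumer and Strauss, in which re-encryption key generation needs both parties' secret keys, whereas $RG$ above takes $pk_{AP}$, $sk_{AP}$ and $pk_{u'}$. The group parameters are hypothetical and far too small for real use, the function names are illustrative, and the partial key is assumed to be encoded as a small integer.

```python
import secrets

# Toy group parameters (hypothetical, insecure): P = 2*Q + 1 is a safe prime
# and G = 4 generates the subgroup of prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4


def keygen():
    """Return (pk, sk) with sk in [1, Q-1] and pk = G^sk mod P."""
    sk = secrets.randbelow(Q - 1) + 1
    return pow(G, sk, P), sk


def encrypt(pk, m):
    """E(pk_A, m): ElGamal-style ciphertext (m * G^r, pk_A^r)."""
    r = secrets.randbelow(Q - 1) + 1
    return (m * pow(G, r, P)) % P, pow(pk, r, P)


def rekey(sk_a, sk_b):
    """RG stand-in: rk_{A->B} = sk_b / sk_a mod Q (needs both secret keys)."""
    return (sk_b * pow(sk_a, -1, Q)) % Q


def reencrypt(rk, cipher):
    """R(rk_{A->B}, Cipher_A): turn pk_A^r into pk_B^r without learning m."""
    c1, c2 = cipher
    return c1, pow(c2, rk, P)


def decrypt(sk, cipher):
    """D(sk, Cipher): recover G^r from pk^r, then divide it out of c1."""
    c1, c2 = cipher
    g_r = pow(c2, pow(sk, -1, Q), P)
    return (c1 * pow(g_r, -1, P)) % P
```

In terms of the notation above, AP would hold one key pair, the data holder $u'$ another, `encrypt` would produce $CK_1$, and the proxy CSP would call `reencrypt` to obtain $CK'_1$. The proxy only exponentiates the second ciphertext component, so it never handles the plaintext partial key.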

access control based on user identity is also consistent with practice since most of data access over the cloud is based on user identity. Data owner $u$ allows other data holders with $ID = PK_{u'_j}$ $(j = 1, 2, 3)$ to share its data storage. It encrypts $DEK_2$ with policy: $ID = PK_{u'_1} \lor PK_{u'_2} \lor PK_{u'_3}$. The encryption key algorithm EncryptKey as described below iterates over all $j = 1, 2, 3$, generates a random value for each conjunction and constructs $CK_{2j}$. The cipher-key $CK_2$ is obtained as tuple $CK_2 = \langle CK_{21}, CK_{22}, CK_{23} \rangle$.

4.2.3 Data Encryption and Decryption

Encrypt($DEK_u$, $M$) encrypts $M$ with $DEK_u$ and outputs ciphertext $CT_u$ to protect $M$ stored at CSP.

Decrypt($DEK_u$, $CT_u$) decrypts $CT_u$ with $DEK_u$ and outputs $M$. It is executed at the data holders to obtain the plain content of $CT_u$ stored at CSP.

4.2.4 Symmetric Key Management

SeparateKey($DEK_u$). On input $DEK_u$, this algorithm outputs a number of partial keys, e.g., $DEK_{1,u}$ and $DEK_{2,u}$ based on random separation. Separating $DEK_u$ into multiple parts can also be performed if needed.

CombineKey($DEK_{1,u}$, $DEK_{2,u}$): On input partial keys of $DEK_u$, e.g., $DEK_{1,u}$ and $DEK_{2,u}$, this algorithm outputs

4.3 Flexible Deduplication Scheme

4.3.1 Data Deduplication

Fig. 2. Data deduplication with heterogeneous control.

Fig. 2 shows the procedure of data deduplication with heterogeneous control handled by both the data owner and AP. User $u_1$ is the data owner that stores data $M$ at CSP by encrypting it with $DEK_{u_1}$, while user $u_2$ tries to store the same data at CSP. We assume that both the data owner and AP are indicated for deduplication control based on the encryption behavior of $u_1$. Both $u_1$ and $u_2$ are the users of the same CSP.

Step 1 - System Setup: After system parameter generation, each node $u_i$ calls InitiateNode to generate three key pairs $PK_{u_i}$ and $SK_{u_i}$; $PK'_{u_i}$ and $SK'_{u_i}$; $pk_{u_i}$ and $sk_{u_i}$ $(i = 1, 2, \ldots)$. Meanwhile, $u_i$ gets the certificates of public keys $Cert(PK_{u_i})$, $Cert(PK'_{u_i})$ and $Cert(pk_{u_i})$ from KGC. AP calls InitiateAP to generate its key pair $pk_{AP}$ and $sk_{AP}$.

Step 2 - Duplication Check: User $u_1$ stores data $M$ at CSP. It calculates $H(M)$, signs $H(M)$ with $SK'_{u_1}$ and sends package $P_1 = \{H(M), Sign(H(M), SK'_{u_1}), Cert(PK_{u_1}), Cert(PK'_{u_1}), Cert(pk_{u_1})\}$ to CSP. CSP checks if the same data has been stored already by verifying the signature and checking if $H(M)$ has existed. The duplication check across multiple CSPs can be supported; refer to the next sub-section for details. If the check is positive, go to Step 5. Otherwise, go to Step 3 to request data package.

Step 3 - Data Storage: When CSP requests the data package, user $u_1$ encrypts $M$ with a random symmetric key $DEK_{u_1}$ to get $CT_{u_1} = Encrypt(DEK_{u_1}, M)$. If $DEK_{u_1} = \text{null}$, $CT_{u_1} = Encrypt(\text{null}, M) = M$. It then calls SeparateKey($DEK_{u_1}$) to get two random parts of $DEK_{u_1}$: $DEK_{1,u_1}$ and $DEK_{2,u_1}$. User $u_1$ encrypts $DEK_{1,u_1}$ with $pk_{AP}$ to get $CK_{1,u_1}$ by calling $E(pk_{AP}, DEK_{1,u_1})$ and encrypts
the full key DEKu through combination. DEK2;u1 with pkID; u1 by calling EncryptKeyðDEK2;u1 ; ;
pkID;u1 Þ to get CK2;u1 , where pkID; u1 is generated according
4.2.5 Partial Key Control based on ABE Operated by to data policy  of u1 . In addition, it randomly selects a
Data Owner number of indexes: IN ¼ fIn1 ; In2 ; . . . ; Ink g that indicate
the special parts of M (e.g., In1 indicates first 1 percent of
EncryptKey
EncryptKeyðDE DEK K 2;uu ; ; p kID u Þ encrypts DEK2;u with pol-
ID;u
icy  and outputs cipher-key CK2;u by taking DEK2;u ,  and data; In2 indicates first 3 percent of data), where k is the
pkID;u as input. This algorithm is conducted at u. total number of indexes. Furthermore, u1 calculates the
hash codes of partial M based on the indexes as
DecryptKey
DecryptKeyðC C K 2;uu ; ; S K u0 ; skID
ID;u u0 Þ decrypts cipher-
u;u
key CK2;u and outputs DEK2;u if the policy  under which HC ðMÞ ¼ fHðM1 Þ; HðM2 Þ; . . . ; HðMk Þg : Then u1 sends
DEK2;u was encrypted can be satisfied; otherwise it outputs the data package to CSP for storage:
NULL. This algorithm is conducted at u0 .
DP1
n  o
4.2.6 Partial Key Control based on PRE Operated by AP ¼ CTu1 ; CK1;u1 ; CK2;u1 ; IN; HC ðM Þ; Sign HC ðM Þ; SK 0u1 :
We employ PRE to enable AP to perform the re-encryption
of CK1 . During ciphertext re-encryption, CSP learns nothing
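The SeparateKey and CombineKey operations can be illustrated with a minimal sketch. It assumes that the partial keys are contiguous byte slices of the DEK, in line with the relation $DEK_1 \| DEK_2 = DEK$ used for heterogeneous control. The random cut point, the names `separate_key` and `combine_key`, and the use of an empty byte string for a null part are illustrative assumptions, not the authors' C++ implementation.

```python
import secrets
from typing import Optional, Tuple


def separate_key(dek: bytes, cut: Optional[int] = None) -> Tuple[bytes, bytes]:
    """Split DEK into (DEK1, DEK2) such that DEK1 || DEK2 == DEK.

    cut=None      -> random interior cut, both parts non-empty
                     (assumed model of heterogeneous control)
    cut=len(dek)  -> DEK1 = DEK, DEK2 empty (AP-only control)
    cut=0         -> DEK2 = DEK, DEK1 empty (data-owner-only control)
    """
    if cut is None:
        if len(dek) < 2:
            raise ValueError("need at least 2 bytes for a two-part split")
        # secrets.randbelow(n) returns a value in [0, n), so cut is in [1, len-1]
        cut = 1 + secrets.randbelow(len(dek) - 1)
    if not 0 <= cut <= len(dek):
        raise ValueError("cut point out of range")
    return dek[:cut], dek[cut:]


def combine_key(dek1: bytes, dek2: bytes) -> bytes:
    """Recover the full DEK by concatenating the two partial keys."""
    return dek1 + dek2


if __name__ == "__main__":
    dek = secrets.token_bytes(16)          # e.g., a 128-bit AES key
    dek1, dek2 = separate_key(dek)
    print(len(dek1), len(dek2), combine_key(dek1, dek2) == dek)
```

In this sketch a holder needs both parts to rebuild the key, so each part would still have to be protected separately (by PRE and ABE in the scheme) before being stored at CSP.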

Fig. 3. Data deduplication at CSP controlled by AP.


Fig. 4. Data deduplication at CSP controlled by the data owner.
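The index set $IN$ and the hash code set $HC(M)$ that the data owner prepares, and the way a verifier holding $HC(M)$ could check a response for one randomly chosen index, can be sketched as follows. This is an illustration under stated assumptions: each index is taken to be a percentage $p$ meaning "the first $p$ percent of $M$" (following the $In_1$, $In_2$ examples), SHA-256 from Python's `hashlib` is swapped in for the SHA-1 used in the implementation, and all function names are hypothetical.

```python
import hashlib
import secrets
from typing import Dict, List


def part(data: bytes, percent: int) -> bytes:
    """Return the first `percent` percent of the data (at least one byte)."""
    return data[: max(1, len(data) * percent // 100)]


def choose_indexes(k: int) -> List[int]:
    """Randomly pick k distinct percentages in 1..100 as the index set IN."""
    rng = secrets.SystemRandom()
    return sorted(rng.sample(range(1, 101), k))


def hash_code_set(data: bytes, indexes: List[int]) -> Dict[int, str]:
    """HC(M): one hash per index, computed over the indicated part of M."""
    return {p: hashlib.sha256(part(data, p)).hexdigest() for p in indexes}


def challenge(indexes: List[int]) -> int:
    """Verifier side: pick a random index x from IN."""
    return secrets.choice(indexes)


def respond(data: bytes, x: int) -> str:
    """Prover side: hash the part of the locally held data indicated by x."""
    return hashlib.sha256(part(data, x)).hexdigest()


def verify(hc: Dict[int, str], x: int, response: str) -> bool:
    """Verifier side: compare the response with the stored hash code."""
    return secrets.compare_digest(hc[x], response)


if __name__ == "__main__":
    m = b"example file content " * 100
    indexes = choose_indexes(5)
    hc = hash_code_set(m, indexes)
    x = challenge(indexes)
    print(verify(hc, x, respond(m, x)))
```

A party that only knows $H(M)$ cannot produce the hash of a randomly selected part, which is the intuition behind the additional ownership check.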
Step 4 - Duplicated Data Upload: Later on, user $u_2$ wants to store the same data $M$ at CSP by sending CSP the data package $P_2 = \{H(M), \mathrm{Sign}(H(M), SK'_{u_2}), \mathrm{Cert}(PK_{u_2}), \mathrm{Cert}(PK'_{u_2}), \mathrm{Cert}(pk_{u_2})\}$.

Step 5 - Deduplication: CSP performs the duplication check as in Step 2. It further checks the correctness of $HC(M)$ by randomly selecting an index $x$ in $IN$ and challenging $u_2$. The purpose of this additional check is to ensure data ownership in case $H(M)$ is eavesdropped or otherwise obtained by a malicious party. Several indexes in $IN$ can be selected for the hash code set challenge in order to enhance the security of the ownership check. If the verification of the challenge response is positive, CSP performs data storage with deduplication.

If AP is involved in the control of deduplication, CSP contacts AP with $\mathrm{Cert}(pk_{u_2})$ (which contains $pk_{u_2}$). If the verification of the data storage policy regarding $u_2$ is positive, AP generates $rk_{AP \to u_2}$ (if not generated before) by calling $RG(pk_{AP}, sk_{AP}, pk_{u_2})$ and issues it to CSP, which allows CSP to re-encrypt $CK_{1,u_1}$ by calling $R(rk_{AP \to u_2}, CK_{1,u_1})$. CSP sends $E(pk_{u_2}, DEK_{1,u_1})$ to $u_2$ for decryption with $sk_{u_2}$ to get $DEK_{1,u_1}$.

If the control of the data owner is applied to the stored data, CSP contacts $u_1$ by sending $H(M)$ and $\mathrm{Cert}(PK_{u_2})$ (which contains $PK_{u_2}$) for deduplication. If the verification of $u_2$'s eligibility for data storage at CSP is positive, $u_1$ generates $sk_{ID,u_1,u_2}$ by calling $\mathrm{IssueIDSK}(ID, SK_{u_1}, PK_{u_2})$, and issues it and $CK_{2,u_1}$ to $u_2$. Then $u_1$ informs CSP of the success of data deduplication. After getting this notification, CSP updates the corresponding deduplication records.

Through data deduplication, both $u_1$ and $u_2$ can access the same data $M$ that is stored only once at CSP. User $u_1$ uses $DEK_{u_1}$ directly, while $u_2$ gets $DEK_{2,u_1}$ and $DEK_{1,u_1}$ by calling $\mathrm{DecryptKey}(CK_{2,u_1}, \text{policy}, SK_{u_2}, sk_{ID,u_1,u_2})$ and $D(sk_{u_2}, CK_{1,u_1})$, respectively. It then combines $DEK_{1,u_1}$ and $DEK_{2,u_1}$ to get $DEK_{u_1}$ by calling $\mathrm{CombineKey}(DEK_{1,u_1}, DEK_{2,u_1})$.

Fig. 3 describes the procedure of data deduplication at CSP with the control of AP. Fig. 4 shows the procedure of data deduplication at CSP with the control of the data owner. The main differences of these two procedures from the one described in Fig. 2 are:
1) The separation of DEK is different: in Fig. 2, $DEK_1 \| DEK_2 = DEK$, where $DEK_1$ and $DEK_2$ are not null. In Fig. 3, $DEK_1 = DEK$ and $DEK_2$ is null. In Fig. 4, $DEK_2 = DEK$ and $DEK_1$ is null.
2) CSP requests both the data owner and AP for deduplication in Fig. 2, while CSP only requests AP for deduplication in Fig. 3 and CSP only requests the data owner for deduplication in Fig. 4.

Fig. 5 shows the procedure of data deduplication at CSP without any control provided by AP and the data owner. In this case DEK is null and plaintext is stored at CSP. If some data holder would like to store encrypted data at CSP later on, the system process is similar to the DEK update described in Section 4.3.5. Note that the re-encryption key $rk_{AP \to u_2}$ and $sk_{ID,u_1,u_2}$ should be issued if they are not yet available at CSP and at the eligible user $u_2$ in the case of DEK update.

Fig. 5. Data deduplication at CSP without any control of AP or the data owner.

Though deduplication can help save storage cost, the data owners or holders may prefer to store replicated data in the CSP. A flag is signed by the user to indicate its preference with regard to deduplication, which is checked by CSP before performing any duplication check. On the other

hand, CSP can also issue different privileges to its users. Some users could hold a specific privilege to store replicated data in the CSP, but they could be charged more than normal users by the CSP. For this type of user, the data duplication check should be waived.

4.3.2 Deduplication Across CSPs
Fig. 6 shows the process of deduplication across multiple CSPs.

Fig. 6. Data deduplication across multiple CSPs.

Step 1. The user requests its local CSP for data storage.
Step 2. The local CSP checks data duplication. If the check is positive, the local CSP performs deduplication by contacting the data owner and/or AP, depending on the way the data was encrypted for deduplication. Corresponding keys are generated by the data owner and/or AP and issued to the user if it is an eligible data holder.
Step 3. If the local duplication check is negative, CSP checks with other CSPs whether the same data is stored by broadcasting the data storage request of the user. If there is no positive reply from other CSPs, the local CSP performs data storage by requesting the data package from the user.
Step 4. If a remote CSP′ replies that the same data has been stored therein, the local CSP forwards the data storage request to CSP′ and records the user's data deduplication information locally. The remote CSP′ performs deduplication by contacting the data owner and/or AP. Corresponding keys are generated by the data owner and/or AP and issued to the user through the cooperation of CSP and CSP′. Meanwhile, CSP′ records the deduplication information of the user.

4.3.3 Data Deletion
Fig. 7 shows the procedure of data deletion by a data holder in the context of data deduplication.

Fig. 7. A procedure of data deletion.

Step 1. User $u$ sends a data deletion request to its local CSP by providing $H(M)$ and $\mathrm{Sign}(H(M), SK'_u)$.
Step 2. The CSP verifies the ownership of $u$ by randomly selecting an index $x$ in $IN$ and challenging the expected hash code set. It deletes the storage record of $u$ and blocks its future access to data $M$ if the verification is positive.
Step 3. The CSP further checks whether the data is locally stored. If not, go to Step 4. If yes, it deletes the data in case the data deduplication record is empty (i.e., no user stores such data in CSP any more). It could contact the data owner about DEK update in case the deduplication record is not empty. Further deduplication control is also required if the underlying user $u$ is the data owner and the deduplication record is not empty.
Step 4. The local CSP contacts the remote CSP′ that actually stores the data. CSP′ deletes the storage record of $u$ and blocks its future data access. It also checks the data deduplication record. If it is empty (i.e., no user stores such data in CSP′ any more), CSP′ deletes the data. Otherwise, it contacts the data owner about DEK update (refer to Continuous Deduplication Control as described below).

4.3.4 Continuous Deduplication Control
Fig. 8 illustrates the procedure when CSP inquires a data owner for continuous deduplication control in the case that the data owner deletes its data at CSP while other eligible data users still store the same data at CSP.

Fig. 8. A procedure of continuous deduplication control.

Step 1. CSP inquires a data owner about continuous deduplication control.
Step 2. If the data owner's decision is positive, the data owner continues deduplication control by issuing access keys to eligible users. Else, go to Step 3.
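The CSP-side bookkeeping behind Steps 2 to 4 of the data deletion procedure (Section 4.3.3), including the hand-over to continuous deduplication control, could be sketched as follows. The `Record` structure, the function name `handle_deletion` and the returned action strings are hypothetical and only model the decisions described in the text; the ownership challenge itself is abstracted into the boolean `ownership_ok`.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Record:
    """Assumed per-file deduplication record kept by a CSP, keyed by H(M)."""
    owner: str
    holders: Set[str] = field(default_factory=set)   # includes the owner
    stored_locally: bool = True                       # False: data is at a remote CSP'


def handle_deletion(records: Dict[str, Record], h: str, user: str,
                    ownership_ok: bool) -> str:
    """Return the next action the local CSP takes for a deletion request."""
    rec = records.get(h)
    if rec is None or not ownership_ok or user not in rec.holders:
        return "rejected"
    # Step 2: delete the storage record of the user (blocks its future access).
    rec.holders.discard(user)
    # Step 3 / Step 4: the data may be stored at a remote CSP'.
    if not rec.stored_locally:
        return "forward_to_remote_csp"
    # Step 3: an empty deduplication record means nobody stores the data any more.
    if not rec.holders:
        del records[h]
        return "data_deleted"
    # Other holders remain: the owner's deletion triggers continuous control,
    # a holder's deletion may trigger a DEK update by the data owner.
    if user == rec.owner:
        return "inquire_continuous_control"
    return "request_dek_update"


if __name__ == "__main__":
    recs = {"h1": Record(owner="u1", holders={"u1", "u2"})}
    print(handle_deletion(recs, "h1", "u2", ownership_ok=True))
    print(handle_deletion(recs, "h1", "u1", ownership_ok=True))
```

Keeping the decision logic separate from the cryptographic checks makes it easy to see which branch leads to physical deletion and which one only updates records.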

Step 3. The data owner generates a new key $DEK' = DEK'_1$, encrypts it with $pk_{AP}$, and sends $DP' = \{CT', CK'_1\}$ to CSP. CSP performs re-encryption on $CK'_1$ using the re-encryption keys of all eligible users and updates the deduplication record of the underlying data. When any eligible data user accesses the data, CSP provides $CT'$ and the re-encrypted $CK'_1$.

Herein, we only illustrate one solution of continuous deduplication control. Other data holders can also take over the control. In this case, CSP will request a new delegate from the existing data holders and select one of them (e.g., based on the duration of data storage and user willingness). The new delegate will perform the storage update by applying a newly generated key $DEK'$. The process is similar to the DEK update described below.

4.3.5 DEK and CT Update
Fig. 9 illustrates the procedure of DEK and CT update, which is essential for enhancing system security. The data owner (or an eligible data holder) $u_1$ generates a new key $DEK'_{u_1}$ and encrypts data $M$ with $DEK'_{u_1}$. It separates $DEK'_{u_1}$ into $DEK'_{1,u_1}$ and $DEK'_{2,u_1}$, and encrypts $DEK'_{1,u_1}$ with $pk_{AP}$ and $DEK'_{2,u_1}$ with $pk_{ID,u_1}$. Then the data package $DP'_1 = \{CT'_{u_1}, CK'_{1,u_1}, CK'_{2,u_1}, H(M), \mathrm{Sign}(H(M), SK'_{u_1})\}$ is sent to CSP. The CSP validates the eligibility of $u_1$ and stores $DP'_1$. CSP requests AP to provide the re-encryption keys of the current eligible data holders (e.g., $u_2$) if these keys are not available. CSP performs re-encryption on $CK'_{1,u_1}$ with, e.g., $rk_{AP \to u_2}$ to get $E(pk_{u_2}, DEK'_{1,u_1})$.

Fig. 9. A procedure of DEK and CT update.

Meanwhile, $u_1$ also needs to issue $sk_{ID,u_1,u_2}$ through a secure channel if it has not yet been sent to the eligible users. Any eligible user, e.g., $u_2$, can get $DEK'_{1,u_1}$ with $sk_{u_2}$ and obtain $DEK'_{2,u_1}$ with $sk_{ID,u_1,u_2}$ in order to reconstruct $DEK'_{u_1}$ for accessing the newly encrypted data $CT'_{u_1}$.

5 PERFORMANCE EVALUATION

5.1 Security Analysis
The security of our scheme relies on ABE theory, PRE theory, symmetric key encryption and PKC. The security of PRE and ABE was proved in our previous work [33]. Symmetric key encryption and PKC theory serve as a security foundation in many security schemes. We assume that the applied key sizes of these two cryptosystems are long enough to satisfy the security requirements of our system. In what follows, we analyze the security of our scheme regarding data ownership verification and data deduplication.

Proposition 1. To pass data ownership verification, a cloud user must really hold data $M$.

Proof. A cloud user can generate the correct $H(M)$ from the real data $M$, and thus can pass the duplication check. For the ownership challenge, a stolen $H(M)$ is useless: an ineligible data holder can hardly provide the correct $H(M_x)$, because $x$ is randomly selected and $H(\cdot)$ is non-invertible. Eavesdropping a previously transmitted $H(M_{x'})$ does not help to pass the current challenge. The user holding the real $M$ can get $M_x$ and generate the correct $H(M_x)$, and thus pass the challenge. The security level of data ownership verification depends on the maximum number $k$ in $IN = \{In_1, In_2, \ldots, In_k\}$ and on the number of indexes used to challenge the ownership. In practice, these parameters can be set according to the sensitivity of the stored data and the security requirement of the data owner. □

Proposition 2. Data $M$ can be deduplicated in a secure way and only eligible users can access it if data owner $u$, CSP and AP cooperate without collusion.

Proof. During data deduplication, data confidentiality is ensured by ABE, PRE and symmetric key encryption (e.g., AES). Data $M$ could be disclosed in two ways: obtaining it from $H(M)$ and breaking $CT = \mathrm{Encrypt}(DEK, M)$. First, the hash function is assumed to resist collision and preimage attacks, so it is infeasible to obtain $M$ from its hash code. Second, if we select a long enough key size for the symmetric key encryption, breaking $CT$ is hard. Thus, DEK becomes the attack point of data security. In our scheme, DEK is divided into two parts, $DEK_1$ and $DEK_2$, which are encrypted with $pk_{AP}$ through PRE and $pk_{ID,u}$ through ABE, respectively. Because AP does not collude with CSP, CSP cannot obtain $DEK_1$: it knows nothing about $sk_{AP}$ although it stores $CK_1$. Through the re-encryption with $rk_{AP \to u'}$, $CK_1$ under $pk_{AP}$ is transformed into the cipher-key under $pk_{u'}$, during which CSP cannot learn $DEK_1$ since CSP knows nothing about $sk_{u'}$ and $DEK_1$ always remains in encrypted form. AP has no way to access $M$ because CSP blocks its access. Even if AP obtains $DEK_1$ by colluding with CSP, it is still impossible for AP to get $M$ because the other part of DEK (i.e., $DEK_2$) is controlled by the data owner. The data owner and holders have no incentive to collude with CSP considering their personal data profits, so they would not disclose $sk_{ID,u,u'}$ and DEK to CSP. CSP therefore has no way to obtain $DEK_2$ and DEK. Based on the above analysis, the CSP that stores $CT$ cannot obtain $M$ through DEK. The scheme can guarantee that $M$ is securely stored at CSP during deduplication and can only be accessed by eligible data holders. □

5.2 Comparison with Existing Work
We compare our scheme with the previous work [31], [32]. One of them [31] realizes deduplication managed by AP and the other [32] manages deduplication by the online data owner. As shown in Table 2 and further tested in Section 5.4, the proposed scheme can flexibly support various scenarios with computation complexity similar to the existing work [31], [32]. We further compare their main properties in Table 3. We can see that the proposed scheme

TABLE 2
Comparison of Computation Complexity with [31], [32]

Party          [this paper]   [31]    [32]
Data Owner     O(n)           O(1)    O(n)
CSP            O(n)           O(n)    O(n)
Data Holder    O(1)           O(1)    O(1)
AP             O(n)           O(n)    -

n: the number of data holders.

TABLE 3
Comparison of Features with [31], [32]

Properties                     [this paper]   [31]       [32]
Basic Algorithm Applied        PRE, ABE       PRE, ECC   ABE
Fine-grained Access Control    √              ×          √
Possession Proof               √              √          ×
Offline Access Control         √              √          ×
Deduplication Across CSPs      √              ×          ×


is a heterogeneous solution. It can realize both fine-grained and offline access control, thus it has better flexibility than previous work. In addition, the random hash code challenge is applied to verify data ownership, which can guarantee that the data holders really have the original data rather than only its hash code. Though possession proof has been achieved in [31] by applying Elliptic Curve Cryptography (ECC) (with an ownership verification time of about 1.2 milliseconds), the hash code set employed in this paper is also very efficient if the challenged part of the data is kept small. If the challenged part of the data is very small, e.g., within 1 kilobyte, we can achieve much better performance than [31] considering the fast operation time of the hash function. Moreover, our scheme can cope with deduplication across multiple CSPs, which was not considered at all in previous work. In general, our scheme has distinct advantages compared with existing work in terms of high flexibility and advanced properties.

5.3 Scheme Implementation
We implemented the scheme based on MIRACL Crypto Library (http://info.certivox.com/docs/miracl), Pairing Based Cryptography (PBC) Library [27], OpenSSL Cryptography, JHU-MIT Proxy Re-cryptography Library (https://isi.jhu.edu/mgreen/prl/index.html), and SSL/TLS Toolkit (https://www.openssl.org/). In our implementation, we applied AES for symmetric encryption, RSA for PKC and SHA-1 as a hash function to generate hash codes and hash sets. Our implementation was in C++ and adopted MySQL 5.5.46 to build a database. The experiments were conducted in a virtual machine running a 64-bit Ubuntu operating system on Amazon EC2 cloud service with an Intel Xeon CPU E5-2670, 2.50 GHz processor and 1 GB of RAM. We tested the correctness of our implementation in terms of each procedure described in Section 4.3. Herein, we only take the case of data deduplication with heterogeneous control as an example to illustrate our implementation due to the paper size limitation.

Use Case: Data deduplication with heterogeneous control (as illustrated in Fig. 2).

Step 1: U1 wants to upload file TestHetro to CSP. CSP checks that there is no duplicated file stored, and thus requests DP1 from U1. U1 generates DEK randomly, encrypts file TestHetro using AES with DEK, and divides DEK into two parts, denoted as DEK1 and DEK2. U1 prepares DP1 and uploads it to CSP. Upon receiving DP1, CSP stores DP1 in its database and U1 also stores the uploaded file information in its own database. This uploading process is shown in Fig. 10a and the records in the CSP database and the U1 database are shown in Figs. 11 and 12a.

Step 2: U2 wants to upload file U2File whose content is exactly the same as that of file TestHetro. CSP identifies the duplication and challenges the ownership of U2. After passing the ownership challenge, U2 gets CK2, the re-encrypted CK1, and the related attribute secret key of U2. U2 then decrypts CK1 and CK2 to get DEK1 and DEK2, as shown in Fig. 10b. In addition, we can observe that U2 can get the correct DEK generated by U1. Then, U2 stores the received

Fig. 10. File uploading process with heterogeneous control.

Fig. 11. Record of CSP database.



Fig. 12. Records of U1 and U2 databases.

Fig. 13. The detailed data content in CSP.

Fig. 14. Duplicated file download with heterogeneous control.

Fig. 15. The content of file U2File, CT and the decryption of CT.

DEK1 and DEK2 for file U2File, as shown in Fig. 12b. Meanwhile, CSP updates the record of file TestHetro to mark U2 as a user that holds file TestHetro, as shown in Fig. 11b.

Fig. 13 shows the detailed data content in CSP. We can see that the file is secure from CSP since only the ciphertexts of DEK1, DEK2 and the file content are stored in CSP.

Step 3: U2 wants to download file U2File. After checking the eligibility of U2, CSP sends the CT of file TestHetro to U2. Upon receiving CT, U2 decrypts it with the DEK that is combined from DEK1 and DEK2, as shown in Fig. 14. Fig. 15 shows the content of the file U2File before it is uploaded, its CT and the decryption of CT. We can see from Fig. 15 that U2 can decrypt the file correctly.

5.4 Efficiency Evaluation
Based on the implementation, we performed a number of tests to evaluate the efficiency of our proposed scheme.

Test 1: Efficiency of file encryption and decryption
We tested the time spent to encrypt and decrypt files of different sizes by applying AES with three different key sizes, namely 128 bits, 192 bits and 256 bits. We observe from Fig. 16a that encrypting or decrypting a file of 500 megabytes (MB) with 256-bit AES takes about 100 seconds. It is a reasonable and practical choice to apply symmetric encryption for data protection.

Test 2: Efficiency of calculating hash code set of a file
Fig. 16b shows the time needed to calculate $H(M)$ ($k = 1$) and $HC(M)$ ($k > 1$) for files of different sizes using SHA-1. We can see from Fig. 16b that the time increases as the file size increases and that the bigger $k$ is, the more time it takes to calculate $HC(M)$. Calculating $H(M)$ is very efficient: it takes less than 10 seconds for a file as big as 500 MB. When $k$ is small (e.g., $k = 50$), calculating $HC(M)$ with a data size of 500 kilobytes (KB) is also very efficient, within 50 milliseconds.

Test 3: Efficiency of RSA sign and verification
In our proposed scheme, the RSA signature is used during the duplication check and is performed on the hash code or the hash code set of the plaintext data. Signature verification is used at CSP to ensure data ownership during the duplication check. We tested the execution time needed to sign a given SHA-1 hash code and to verify a given signature using the RSA cryptosystem. We observed from Fig. 16c that both RSA signing and RSA verification are very efficient. Signing with 4096-bit RSA takes only about 10 milliseconds.

Test 4: Efficiency of PRE operations
We tested the operation time of different PRE operations. PRE schemes require that all users in a PRE deployment

Fig. 16. Efficiency evaluation on basic algorithms.

share a common set of public parameters. Since these parameters are fixed, they need to be generated only once during system setup. Our tests show that generating these parameters takes about 34.79 milliseconds. Each user in a PRE deployment needs to generate a public/secret key pair. As shown in Fig. 16d, generating a PRE key pair takes only 6.5 milliseconds. We can observe that PRE operations (including re-encryption key generation, encryption, re-encryption and decryption) are quite efficient. Thus, applying PRE to protect data encryption keys is reasonable and practical, especially when it is handled at a server with sufficient resources and processing capability.

Test 5: Efficiency of CP-ABE operations
Fig. 16e shows the execution times of all CP-ABE operations (UKGen: user key pair generation; IDPKGen: ID public key generation (ID numbers = 10); IDSKGen: ID secret key generation; Enc: ABE encryption (ID numbers in encryption policy = 5); Dec: ABE decryption (ID numbers in encryption policy = 5)). The setup process, which is needed only once, generates the CP-ABE global public key and secret master key and takes about 12 milliseconds. User key pair generation takes about 14 milliseconds and is needed when a new user is registered into the system. The ID public key generation time varies with the number of IDs. Fig. 16f shows the ID public key ($pk_{ID,u}$) generation time with different numbers of IDs, namely eligible users.

Fig. 17a shows the CP-ABE encryption and decryption time. The encryption time increases with the number of IDs in the encryption policy, since the encryption algorithm iterates over all IDs and constructs a ciphertext for each ID. The decryption time is constant at around 7.8 milliseconds.

The following tests were carried out by applying 128-bit AES and 2048-bit RSA. $HC(M)$ was calculated with $k = 10$ and the number of IDs in the CP-ABE encryption policy was 5.

Test 6: Efficiency of file uploading
We tested the efficiency of the file uploading process under different control policies. The process includes encrypting the data file with AES, calculating $H(M)$ and $HC(M)$, and signing and verifying the signature. The process may also include encrypting $DEK_{1,u}$ with PRE and/or encrypting $DEK_{2,u}$ with ABE according to the access control policy. As shown in Fig. 17b, there is not much difference between uploading a file under the three control policies, namely, data owner and AP control, data owner control, and AP control, especially for big files. For big files, the time is dominated by AES encryption, which increases with the file size. In contrast, CP-ABE and PRE are quite efficient (less than 1 second) and their time stays constant for files of different sizes, since the size of DEK stays constant for different files. We can also see from Fig. 17b that encrypting a file does not introduce much computation overhead. The result shown in this figure also indicates that the proposed scheme has similar performance to the existing work [31], [32] with regard to file uploading.

Fig. 17c shows the duplicated file uploading time under different control policies. In this process, CSP requests the re-encryption key from AP and uses it to re-encrypt $CK_1$ if needed. CSP also contacts the data owner about issuing the user attribute secret key if the data owner controls data access. We observe that this process is very efficient, taking less than 0.3 seconds if the data is smaller than 100 MB. The operation time varies slightly with the file size, which results from the $HC(M)$ calculation and challenge. By comparing Figs. 17b with 17c, we can see that the proposed deduplication scheme

Fig. 17. Efficiency evaluation on main operations.

can greatly save data uploading time for duplicated data storage at the cloud.

Test 7: Efficiency of file downloading
We also tested the efficiency of the file downloading process, which combines DEK1 and DEK2 and decrypts the downloaded CT with AES. Because the decryption of CK1 and CK2 is very fast (only several milliseconds), there is not much difference between the file downloading times under different control policies, as shown in Fig. 17d. However, if no party controls the data access, the downloading process is much faster than that under data owner and/or AP control, since AES decryption is not needed in this case. This result indicates that the proposed scheme has similar performance to the existing work [31], [32] with regard to file downloading.

Test 8: Efficiency of file deletion
Fig. 17e shows the data holder's file deletion time under different control policies. The deletion process involves DEK and CT update if the deleted file is controlled by the data owner and/or AP. The DEK and CT update process is similar to the new file uploading process mentioned above, except that it does not need to calculate $HC(M)$. Thus, the time for deleting a file under data owner and/or AP control varies only slightly between policies, especially for big files, as shown in Fig. 17e. However, deleting a file without any control by the data owner or AP only needs to update the related file records in CSP, thus it is very efficient and takes less than 0.1 seconds for a file of 100 MB.

Fig. 17f shows the data owner's file deletion time under different control policies. The data owner deletion process involves generating a new DEK and encrypting it with $pk_{AP}$. Therefore, there is not much difference under different access control policies. For a file without any access control, the deletion just needs to update the related CSP file records and is thus very efficient.

As can be seen from Fig. 17, the proposed scheme achieves similar performance to the existing work [31], [32]. Considering its advanced properties as shown in Table 3 and its high flexibility, we conclude that our scheme outperforms the existing work.

6 CONCLUSION
Data deduplication is important and significant in the practice of cloud data storage, especially for big data storage management. In this paper, we proposed a heterogeneous data storage management scheme, which offers flexible cloud data deduplication and access control. Our scheme can adapt to various application scenarios and demands and offer economic big data storage management across multiple CSPs. It can achieve data deduplication and access control with different security requirements. Security analysis, comparison with existing work and implementation-based performance evaluation showed that our scheme is secure, advanced and efficient.

Our scheme supports data privacy of cloud users since the data stored at the cloud is in an encrypted form. One way to support identity privacy is to apply pseudonyms in the Key Generation Center (KGC), where a real identity is linked to a number of pseudonyms, which are verified and certified by the KGC. In our future work, we will further enhance user privacy and improve the performance of our scheme towards practical deployment. In addition, we will conduct game theoretical analysis to further prove the rationality and security of the proposed scheme.

ACKNOWLEDGMENTS

This work is sponsored by the National Key Research and Development Program of China (grant 2016YFB0800704), the NSFC (grants 61672410 and U1536202), the Project Supported by Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2016ZDJC-06), the 111 project (grants B16037 and B08038), the PhD grant of the Ministry of Education, China (grant JY0300130104), and Academy of Finland Research Fellow Grant (No. 308087).

Zheng Yan (M’06, SM’14) received the BEng degree in electrical engineering and the MEng degree in computer science and engineering from the Xi’an Jiaotong University, Xi’an, China, in 1994 and 1997, respectively, the second MEng degree in information security from the National University of Singapore, Singapore, in 2000, and the licentiate of science and the doctor of science in technology in electrical engineering from Helsinki University of Technology, Helsinki, Finland, in 2005 and 2007. She is currently a professor at the Xidian University, Xi’an, China and a visiting professor with the Aalto University, Espoo, Finland. She authored more than 150 peer-reviewed publications and solely authored two books. She is the inventor and co-inventor of about 60 patents and PCT patent applications. Her research interests are in trust, security and privacy, social networking, cloud computing, networking systems, and data mining. She serves as an associate editor of the Information Sciences, Information Fusion, the IEEE Internet of Things Journal, the IEEE Access Journal, the Journal of Network and Computer Applications, the Security and Communication Networks, etc. She is a leading guest editor of many reputable journals including ACM TOMM, FGCS, IEEE Systems Journal, MONET, etc. She served as a steering, organization and program committee member for more than 70 international conferences. She is a senior member of the IEEE.

Lifang Zhang received the BSc degree in electrical engineering from Beijing Forestry University, Beijing, China. She received the MSc degree from the Department of Communications and Networking, Aalto University, Espoo, Finland. Her research interests are in network security and data privacy.

Wenxiu Ding received the BEng degree in information security from the Xidian University, Xi’an, China, in 2012. She is currently working toward the PhD degree in information security, the School of Cyber Engineering in the Xidian University, and also a visiting research student at the School of Information Systems, Singapore Management University. Her research interests are in RFID authentication, privacy preservation, cloud security, data mining and trust management.

Qinghua Zheng received the BSc degree in computer software, in 1990, the MSc degree in computer organization and architecture, in 1993, and the PhD degree in system engineering, in 1997 from Xi’an Jiaotong University, China. He did postdoctoral research at Harvard University from February 2002 to October 2002 and visiting professor research with Hong Kong University from November 2004 to January 2005. Since 1995, he has been with the Department of Computer Science and Technology at Xi’an Jiaotong University. He is currently a professor and serves as the vice president of Xi’an Jiaotong University. His research interests include intelligent e-learning, network security, and trusted software. He is a member of the IEEE.
