Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 1

Catch You if You Misbehave: Ranked Keyword


Search Results Verification in Cloud Computing
Wei Zhang, Student Member, IEEE, and Yaping Lin, Member, IEEE

Abstract—With the advent of cloud computing, more and more people tend to outsource their data to the cloud. As a fundamental data
utilization, secure keyword search over encrypted cloud data has attracted the interest of many researchers recently. However, most of
existing researches are based on an ideal assumption that the cloud server is “curious but honest”, where the search results are not
verified. In this paper, we consider a more challenging model, where the cloud server would probably behave dishonestly. Based on
this model, we explore the problem of result verification for the secure ranked keyword search. Different from previous data verification
schemes, we propose a novel deterrent-based scheme. With our carefully devised verification data, the cloud server cannot know
which data owners, or how many data owners exchange anchor data which will be used for verifying the cloud server’s misbehavior.
With our systematically designed verification construction, the cloud server cannot know which data owners’ data are embedded in the
verification data buffer, or how many data owners’ verification data are actually used for verification. All the cloud server knows is that,
once he behaves dishonestly, he would be discovered with a high probability, and punished seriously once discovered. Furthermore,
we propose to optimize the value of parameters used in the construction of the secret verification data buffer. Finally, with thorough
analysis and extensive experiments, we confirm the efficacy and efficiency of our proposed schemes.

Index Terms—Cloud computing, dishonest cloud server, data verification, deterrent

1 I NTRODUCTION However, all these schemes are based on the ideal as-
sumption that the cloud server is “curious but honest”.
With the advent of cloud computing, more and more
Unfortunately, in practical applications, the cloud server
people tend to outsource their data to the cloud. Cloud
may be compromised and behave dishonestly [19], [20].
computing provides tremendous benefits including easy
A compromised cloud server would return false search
access, decreased costs, quick deployment, and flexible
results to data users for various reasons:
resource management [1], [2]. Enterprises of all sizes can
1) The cloud server may return forged search results.
leverage the cloud to increase innovation and collabora-
For example, the cloud may rank an advertisement
tion.
higher than others, since the cloud can profit from it, or
Although cloud computing brings a lot of benefits, the cloud would return random large files to earn money,
for privacy concerns, individuals and enterprise users since the cloud adopts the ‘pay as you consume’ model.
are reluctant to outsource their sensitive data, including 2) The cloud server may return incomplete search re-
private photos, personal health records, and commer- sults in peak hours to avoid suffering from performance
cial confidential documents, to the cloud. Because once bottlenecks.
sensitive data are outsourced to a remote cloud, the There are some researches that focus on search results
corresponding data owner directly loses control of these verification [21], [22], [23], [24], [25], [26], [27], [28], [29].
data. The Apple’s iCloud leakage of celebrity photo in However, these methods cannot be applied to verify the
2014 [3] has furthered our concern regarding the cloud’s top-k ranked search results in the cloud computing en-
data security. Encryption on sensitive data before out- vironment, where numerous data owners are involved.
sourcing is an alternative way to preserve data privacy We illustrate the reason from two aspects.
against adversaries. However, data encryption becomes 1) Existing schemes share a common assumption, i.e.,
an obstacle to the utilization of traditional applications, data owners foresee the order of search results. How-
e.g., plaintext based keyword search. ever, in practical applications, numerous data owners
To achieve efficient data retrieval from encrypted data, are involved; each data owner only knows its own
many researchers have recently put efforts on secure partial order. Without knowing the total order, these data
keyword search over encrypted data [4], [5], [6], [7], owners cannot use the conventional schemes to verify
[8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. the search results.
2) For a top-k ranked keyword search (e.g., k = 10),
• W. Zhang and Y. p. Lin are with College of Computer Science and only a few data owners will have satisfied files in the
Electronic Engineering, Hunan University, Changsha 410082, China,
and with Hunan Provincial Key Laboratory of Dependable Systems and
search results. Traditional methods need to return a lot
Networks, Changsha, 410082, China. Yaping Lin is the Corresponding of data to verify whether the huge amount of absen-
Author. E-mail: {zhangweidoc, yplin}@hnu.edu.cn. t data owners have satisfied search results. However,
the top-k ranked keyword search is, to some extent,

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 2

proposed to save communication cost; returning too 2 R ELATED W ORK


much verification data would make the top-k ranked 2.1 Secure Keyword Search in Cloud Computing
search meaningless. Additionally, in the ‘pay as you
Recently, there have been a lot of research works con-
consume’ cloud computing environment, returning too
cerned with secure keyword search in cloud computing.
much data would cause considerable expenses for data
The first securely ranked keyword search over encrypted
users, which would make the cloud computing lose its
data was proposed by Wang et al. [4]. Cao et al. [5]
attractiveness.
and Wen et al. [6], further strengthening the ranked
In this paper, we consider a more challenging model, keyword search and constructing schemes for privacy-
where multiple data owners are involved, and the cloud preserving multi-keyword ranked search. In [7], Xu et
server would probably behave dishonestly. Based on this al. proposed a multi-keyword ranked query scheme
model, we explore the problem of result verification on encrypted data, which enables a dynamic keyword
for the secure ranked keyword search. Different from dictionary and avoids the problem in which the rank
previous data verification schemes, we propose a nov- order is perturbed by several high frequency keywords.
el deterrent-based scheme. With our carefully devised Based on information retrieval systems and cryptogra-
verification data, the cloud server cannot know which phy approaches, Ibrahim et al. [8] proposed a ranked
data owners, or how many data owners exchange anchor searchable encryption scheme of multi-keyword search
data which will be used for verifying the cloud server’s over a cloud server. Hore et al. [9] further proposed
misbehavior. With our systematically designed verifica- using a set of colors to encode the presence of the
tion construction, the cloud server cannot know which keywords and creating an index to accelerate the search
data owners’ data are embedded in the verification data process. To enrich search functionality, Li et al. [10],
buffer, or how many data owners’ verification data are Chuah et al. [11], Xu et al. [12] and Wang et al. [13]
actually used for verification. All the cloud server knows proposed fuzzy keyword search over encrypted cloud
is that, once he behaves dishonestly, he would be dis- data, respectively. Meanwhile, Wang et al. [14] proposed
covered with a high probability, and punished seriously a privacy preserving similarity search mechanism over
once discovered. Additionally, when any suspicious ac- cloud data. To support secure searches in the system
tion is detected, data owners can dynamically update the where multiple data owners are involved, Sun et al. [15]
verification data stored on the cloud server. Furthermore, and Zheng et al. [16] proposed secure attribute-based
we propose to optimize the value of parameters used in keyword search schemes. In our previous work [17],
the construction of the secret verification data buffer. Fi- [30], we proposed a secure ranked multi-keyword search
nally, with thorough analysis and extensive experiments, scheme to support multiple data owners. In [18], we
we confirm the efficacy and efficiency of our proposed proposed a secure and efficient keyword search protocol
schemes. in the geo-distributed cloud environment.
The main contributions of this paper are: All these schemes assume that the cloud server is
“curious but honest”. However, in practical application-
• We formalize the ranked keyword search result s, the cloud server may be compromised and behave
verification problem where multiple data owners dishonestly. Different from these works, we consider
are involved and the cloud server would probably that the cloud server would probably be compromised
behave dishonestly. and behave dishonestly. Based on this consideration,
• We propose a novel secure and efficient deterrent- we propose a deterrent based scheme which will make
based verification scheme for secure ranked key- the cloud server dare not to behave dishonestly. Addi-
word search. tionally, once the cloud server behaves dishonestly, our
• We propose to optimize the value of parameters proposed scheme will detect it with a high probability.
used in the construction of verification data buffer.
• We give a thorough analysis and conduct extensive 2.2 Verification for Search Results
performance experiments to show the efficacy and There are a lot of researches concerned on the search
efficiency of our proposed scheme. results verification. Methods used in these works can
be classified into two categories, i.e., linked signature
The rest of this paper is organized as follows. Section chaining and Merkle hash tree.
2 reviews the related works. Section 3 formulates the The linked signature chaining schemes [21], [22], as-
problem and introduces notations used in later discus- sume all original data are ordered, then the data owner
sions. Section 4 describes the preliminary techniques that signs for consecutive data items. These signatures will
will be used in this paper. In Section 5, we illustrate finally form a linked chaining, and any data tampering
how to efficiently and securely verify the ranked search or data deletion will be easily detected since it will lead
results. In Section 6, we show how to optimize the value to an incomplete of the signature chaining. However,
of parameters. In Sections 7 and 8, we present analysis the linked signature chaining will lead to very high
and performance evaluation for our proposed schemes, computational costs, storage overhead, and user-side
respectively. Finally, we conclude the paper in Section 9. verification time [23].

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 3

and relevance scores as anchor data with other θ −1 data


owners uniformly at random. After getting the sampled
Trapdoors
Encrypted Files data and anchor data, each data owner concatenates
& indexes
&Verification Data
Ranked Results &
Verification Data
these data into a string. The encryption of the string
Buffer is used as each owner’s verification data. At last, each
Dishonest
Cloud Servers
data owner outsources his encrypted files, indexes, and
Data
Owners
Data verification data to the cloud server. Once an authorized
File decryption keys Users
data user wants to perform a ranked (top-k) secure key-
Fig. 1: Architecture of verifying secure ranked keyword word search over these encrypted files, he first generates
search result in cloud computing his trapdoor (encrypted keyword) and submits it with
variable k to the cloud server. Upon receiving the search
request, the cloud server searches locally and returns
The Merkle hash tree [24], [25], [26], [27], [28], [29] is the top-k relevant data files. The authorized data user
proposed to verify the integrity of a very large data set. decrypts his search results. If the data user finds any
Data owners first rank all data items, then place the or- suspicious data, he will construct and submit a secret
dered data items in the leaf node; further, they construct verification request. The cloud server further returns a
a Merkle hash tree from the leaf node recursively until verification data buffer without knowing which data
they get a root node. Finally, data owners sign on the owners’ verification data are returned. Finally, the data
root node. For each query, the server has to return all user decrypts and recovers the verification data, and
necessary items to reconstruct the root node. For data verifies the search results. If the search results do not
users, they first reconstruct the root node, then they pass the verification, the search results are considered
decrypt the signed root. Finally, they compare whether as contaminated and abandoned. Note that, we do not
two root nodes match. Any data tampering or deletion consider how to securely obtain the ranked search results
will lead to the inconsistency of the comparison. here; interested readers can refer to [7], [14], [17]. Instead,
However, these methods cannot be applied to verify we concern on the secure verification for the secure
the top-k ranked search results in cloud computing ranked keyword search results, which are also very
where multiple data owners are involved. First, each da- important. Tab. 1 lists the notations used in this paper.
ta owner only knows its own partial order; they cannot
foresee the total order. Second, these methods need to TABLE 1: Notations used in the paper
return a lot of data to verify whether the absent data
owners have satisfied search results. However, returning Notations Descriptions
O Data owners
too much verification data would make the top-k ranked F The plaintext file.
search meaningless. Additionally, in the ‘pay as you w Keyword.
V Verification data.
consume’ cloud computing environment, returning too m Number of data owners.
much data would cause considerable expenses for data d Number of files corresponding to each keyword.
ψ The number of sampled data.
users, which would cause the cloud computing to lose its θ Each data owner exchanges anchor data with θ − 1
attractiveness. Different from these methods, we propose data owners.
a novel deterrent based verification scheme. Instead of κ The number of hash functions used for mapping
λ The number of entries in the verification data buffer
returning each data owner’s data for verification, we α The actual number of verification data that is requested
only need to return a verification data buffer, where β The enlarged size of ID set submitted by the data user
some other data owners’ anchor values are secretly
embedded, which will force the cloud server to dare not
behave dishonestly. On the other hand, once the cloud
behaves dishonestly, our scheme can detect it with a high 3.2 Threat Model
probability. In our threat model, both data owners and authorized
data users are trusted. However, different from previous
3 P ROBLEM F ORMULATION works [4], [5], [7], [14], the cloud server is not trusted and
would probably behave dishonestly, i.e., the cloud server
3.1 System Model not only aims at revealing the contents of encrypted
There are three entities involved in our system model, as files, keywords and verification data, but also tends to
illustrated in Fig. 1, they are data owners, cloud server, return false search results and contaminate the secret
and data users. First of all, each data owner extracts verification data. We assume that the data owners and
keywords from his file collection and constructs indexes authorized data users share a secret hash function, e.g.,
(i.e., computing relevance scores between keywords and the keyed-Hash Message Authentication Code (HMAC)
files, ranking files based on relevance scores, and getting [31]. This assumption is applicable, e.g., for a large scale
the partial order). For each keyword, he samples ψ files company, the network administrator would distribute
from the corresponding file set, and obtains the file IDs this secret hash function when a new employer is en-
and relevance scores. Then he exchanges these file IDs rolled.

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 4

3. Mapping verification
3.3 Design Goals data to a data buffer
1. Preparing 2. Constructing
To enable a secure and efficient verification for the verification data verification request
ranked keyword search, our system design should si- 4. Recovering
multaneously satisfy the following goals. and verifying

• Efficiency: The proposed scheme should allow data Fig. 2: The process of verification
owners to construct the verification data efficiently.
The cloud server should also return the verification
data without introducing heavy costs. Additionally, 5 V ERIFYING R ANKED TOP -k S EARCH R E -
data users can verify the search result efficiently. SULTS
• Security: The proposed scheme should prevent the
cloud server from knowing the actual value of the The basic idea of our deterrent based verification scheme
secret verification data, and which data owners’ data is elaborated as follows: We can consider the dishonest
are returned as verification data. cloud server as a suspect, the data user as a police chief,
• Detectability: The proposed scheme should deter
and each verification data as a policeman, who masters
the cloud server from behaving dishonestly. Once part of the suspect’s actions. Intuitively, the police chief
the cloud server behaves dishonestly, the scheme can gather all the policemen to verify whether the sus-
should detect it with a high probability. pect commits a crime. However, this will cause a lot of
manpower, financial and time waste. To overcome this
problem, each time the suspect takes an action, the police
4 P RELIMINARIES chief only inquires a few policemen to verify whether the
Before we introduce our detailed construction, we first suspect commits a crime. During the process, the police
briefly introduce some techniques that will be used in chief ensures that the suspect does not know which
this paper. policemen know his action, and which policemen are
inquired by the police chief. What the suspect knows is
that, once he behaves dishonestly, he will be discovered
4.1 Paillier Cryptosystem
with high probability, and punished seriously once dis-
Paillier cryptosystem [32] is a public key cryptosystem covered. By doing this, we can deter the suspect not to
with additive homomorphic properties. Let E(a) denote behave dishonestly.
the paillier encryption on a, and D(E(a)) denotes paillier Different from discovering misbehaviors of the cloud
decryption on E(a); we have the following properties: after the misbehaviors occur, we propose a comple-
∀a, b ∈ Zn , mentary and preventive scheme to deter the cloud not
to behave dishonestly. The deterrent in our scheme is
D(E(a) · E(b) mod n2 ) = a + b mod n
derived from a series of constructions, which include
D(E(a)b mod n2 ) = a · b mod n embedding secret sampling data and anchor data in the
verification data buffer, forcing the cloud conduct blind
computations on ciphertext, updating the verification
4.2 Privacy Preserving Ranked Keyword Search A-
data dynamically, and so on. The final goal of our
mong Multiple Data Owners
deterrent based scheme is to deter the cloud not to
In our previous work [17], we introduce how to achieve behave dishonestly, and once it misbehaves, it would be
ranked and privacy-preserving keyword search among detected with high probability.
multiple data owners. First of all, we systematically In what follows, we first give an overview of the
construct protocols on how to encrypt keywords for data verification construction. Then we introduce the detailed
owners, how to generate trapdoors for data users, and construction step by step. Note that, we first introduce
how to perform blind searching for the cloud server. how to achieve the single dimensional verification, while
As a result, different data owners use their own secret we leave the multi-dimensional verification in the exten-
keys to encrypt their files and keywords. Authorized sion subsection.
data users can issue queries without knowing secret
keys of these data owners. Then an Additive Order
Preserving Function family is proposed, which enables 5.1 Verification Construction Overview
different data owners to encode their relevance scores The verification construction is composed of four steps,
with different secret keys, and helps cloud server return which is illustrated in Fig. 2.
the top-k relevant search results to data users without First, each data owner prepares the verification data.
revealing any sensitive information. Specifically, each data owner samples ψ files from the
In this paper, we adopt this ranked and privacy p- corresponding file set, and obtains the file ID and rele-
reserving keyword search scheme to return the top-k vance score. With the effective data sampling, a data user
search results. Our goal is to systematically construct can verify the correctness of search results belonging to
schemes that can verify whether the returned top-k a specific data owner with a high probability. Then each
search results are correct. data owner exchanges these file IDs and relevance scores

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 5

as anchor data with other θ − 1 data owners uniformly Now we give an example to illustrate the feasibility of
at random, which will be used to verify the correctness sampling a subset of data as verification data. Assume
of search results among data owners. After getting the professor A has 100 students who have different number
sampled data and anchor data, each data owner con- of publication papers. Professor A is very sure that his
catenates these data into a string. The encryption of the student SB has the most papers, and SC has the third
string is used as each owner’s verification data. most papers, but he is not clear of the publications
Second, data users construct a secret verification re- of other students. When the professor asks the corre-
quest, and indicate the size of verification data buffer. sponding students who have the top-5 most papers,
Third, cloud server operates on the encrypted data and professor will detect a false answer with a probability
returns the verification data buffer. of more than 0.999. This probability is computed as
Forth, data users decrypt the returned search results follows, to compute the probability that the professor
and verify whether misbehavior occurs. A can detect a misbehavior, we first compute the prob-
ability that A cannot detect the misbehavior. When the
search results guarantee that SB ranks the first and SC
5.2 Preparing the verification data
ranks the third, then A cannot detect the misbehavior,
In this subsection, we introduce how to prepare the there are P (98, 3) such conditions, where P (n, k) de-
verification data step by step. notes the number of permutation. Additionally, there
are P (100, 5) conditions of the probable search results.
5.2.1 Sampling from original data Therefore, the probability that the A cannot detect the
Our sampling method is conducted in three steps. First, misbehavior is P (98, 3)/P (100, 5). Correspondingly, pro-
the data owner samples files from its original data set. fessor A can detect the misbehavior with the probability
Second, he extracts the corresponding file IDs, relevance 1 − P (98, 3)/P (100, 5) > 0.9998.
scores. Third, he attaches the file ID and relevance score
to the owner’s ID. Assume data owner Oi has d files 5.2.2 Exchanging data among data owners
belonging to keyword wt , he samples ψ files’ data from In our system, multiple data owners are involved. For
these d files for keyword wt , the corresponding process a given keyword, each data owner only knows its own
is shown in Algorithm 1. First of all, Oi initializes the partial order, i.e., each data owner cannot obtain a total
head of sampled data string as wt ||i, where wt denotes order for the keyword. This brings a great challenge
the keyword and i denotes Oi ’s ID. Second, Oi ranks for data users to verify whether the returned results
all the files corresponding to wt in descending order of are top-k relevant to the search request. A trivial way
relevance scores. Third, Oi concatenates F ID[0]||RS0,t to is to ask the cloud to return all encoded relevance
wt ||i, where F ID[0] denotes the file ID of F0 , and RS0,t scores belonging to different data owners to the data
denotes the relevance score between F0 and wt . Fourth, user, and the data user recomputes the top-k results
Oi samples ψ − 1 data items from the remaining d − 1 for verification, which requires gigantic computation and
files. The sampled files are composed of two kinds, i.e., communication costs from the data user.
the first file in wt ’s sorted file list, and the other ψ − 1 In our scheme, we propose to let data owners ex-
uniformly and randomly sampled files. Then, all these change a very small amount of data. Specifically, given
ψ sampled data and the head data are concatenated. Oi ’s keyword wt , each data owner uses the ψ aforemen-
Finally, the algorithm outputs the sampled data set SDi . tioned sampled files to generate interactive data. Since
these operations are conducted among data owners, the
Algorithm 1 Constructing Sampled data cloud server would not know whether a data owner
has the interactive data of another data owner. For easy
Input:
description, we assume each data owner exchanges the ψ
Oi ’s ID: i, number of sampled data: ψ, and wt ’s file
data items with θ − 1 data owners uniformly at random,
list: F ID[d]
i.e., after exchanging, each data owner will have θ − 1
Output:
data owners’ data, which will be used as secret data to
Sampled data: SDi
detect false search results even if a data owner’s data is
1: Initialize sampled data SDi to wt ||i
not involved in the search results.
2: Rank wt ’s file list F ID[d] in descending order of
relevance scores 5.2.3 Assembling the verification data
3: Concatenate F ID[0]||RS0,t to SDi
Assume data owner Oi receives other θ − 1 data owners’
4: Uniformly and randomly generate ψ − 1 number set
interactive data, Oi will further assemble his verification
R where R[i] ∈ [1, d]
data as follows: first, Oi extracts all the file IDs and rele-
5: Rank R incrementally
vance scores from the interactive data of other θ − 1 data
6: for ind =1 to ψ − 1 do
owners. Second, Oi ranks on his ψ sampled data and
7: concatenate F ID[R[ind]]||RSR[ind],t to SDi
(θ−1)·ψ received interactive data. Third, Oi concatenates
8: end for
all the θ · ψ data entries in descending order, where each
9: return SDi
entry is composed of a file ID and its corresponding

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 6

order. Finally, Oi uses the symmetric encryption (e.g., Algorithm 2 Securely returning verification data
AES [33]) to encrypt the concatenation result with his Input:
secret key Hs (i), where Hs () is the aforementioned se- Verification request set [< j, E(P K, rj ) >], j ∈ [1, β],
cretly shared hash function, i is the Oi ’s ID. We denote the size of verification data buffer λ
the result of the encryption as Vi , therefore, Vi is used Output:
as Oi ’s verification data, which will be outsourced to the Verification data buffer V B
cloud server along with Oi ’s encrypted indexes and files. 1: The cloud initializes V B with λ entries, each entry
with initial value 1
5.3 Submitting verification request 2: for j ∈ [1, β] do
When an authorized data user wants to verify the search 3: Locates Oj ’s verification data Vj
V
results, he specifies a set of data owners whose veri- 4: Compute vd = E(P K, rj ) j
fication data need to be returned to help verification. 5: for i in range (0,κ) do
The data user can achieve this goal by simply setting 6: V B[hi (j)] = V B[hi (j)] · vd
an ID set of his desired data owners. However, the 7: end for
ID set should not be exposed to the cloud server. The 8: end for
fundamental reason is illustrated as follows: if the cloud 9: return V B
server knows which data owners’ data are frequently
verified, he can deduce that these data owners’ data are
very useful or sensitive, therefore, these data owners’ verification data. Specifically, the cloud server first ini-
data would easily become attackers’ targets. On the other tializes a verification data buffer with λ entries, where
hand, if the cloud server knows which data owners’ data λ is specified by the data user. Then, the cloud server
are rarely verified, the cloud server will maliciously filter finds the data owner’s verification data indicated by
out or delete these data owners’ data as search results. the requested ID set (line 3), conducts calculations on
To prevent the cloud server from knowing which the encrypted data (line 4), and maps the results to the
data owners’ data are actually returned, we propose to verification data buffer with κ hash functions (line 5-7),
construct a secret verification request which is illustrated where the output of each hash function belongs to [0, λ].
as follows: First, the data user enlarges the ID set of Note that, since the size of the verification buffer, i.e., λ,
verification by inserting random IDs. Assume a data user is specified by the data user, different users will submit
wants to get Oi ’s verification data, he can add other n−1 different λ to the cloud. To ensure that the output of each
data owners’ ID in the set (we can adopt encryption or function belongs to [0, λ], instead of changing the κ hash
obfuscation to hide the true ID, for easy description, we functions all the time, we only need to specify the cloud
simply demonstrate with ID hereafter). Second, the data to do a module-λ operation for the output of each hash
user attaches a data 0 or 1 to each ID. Here, if the data function. When the cloud server finishes proceeding all
user wants to return a data owner’s verification data, the IDs in the requested ID set, the result verification
then he attaches 1 to the corresponding ID, otherwise, data buffer is returned to the data user. Note that, during
0 is attached. Third, the data user encrypts the attached the whole process, the cloud server only sees the large ID
0 or 1 with the Paillier encryption. Here, we assume all sets, and conducts computation on the encrypted data.
the data owners and authorized data users share a key Therefore, the cloud server know nothing about which
pair, i.e., the public key P K and the private key SK, data owner’s verification data are actually returned and
for the Paillier encryption. Therefore, 0 is encrypted to used for verification.
E(P K, 0) and 1 is encrypted to E(P K, 1). With the well
In the above description, we propose to use multiple
designed Paillier encryption, the cipher-text of the same
hash functions, here we elaborate on the reason of in-
data would be different each time. Finally, the data user
troducing these hash functions. To obfuscate the cloud
submits the ID set and the attached encrypted data set
to know which verification data are actually returned.
to the cloud server.
There are three alternative ways to prevent the cloud
Now we give an example here. Assume a data us-
from knowing which verification data are actually re-
er Uj needs to download O1 ’s and O2 ’s verification
trieved by the data user. First, we can request the cloud
data, first of all, he formulates a large ID set, say,
to return all the verification data each time, so that the
{O1 , O2 , O3 , · · · , On }, then he attaches E(P K, 1) to O1
cloud knows nothing about which specific verification
and O2 , and E(P K, 0) to other IDs. Finally, the data
data are actually used. However, this method will lead to
user submits {< O1 , E(P K, 1) >, < O2 , E(P K, 1) >, <
a heavy communication cost between the cloud and the
O3 , E(P K, 0) >, · · · , < On , E(P K, 0) >} as the verifica-
data user. Second, the data user can prepare the enlarged
tion request to the cloud server.
data set where some fake IDs are also involved, and
specify the cloud to put the verification data in specific
5.4 Returning verification data positions in the verification data buffer. However, the
Upon receiving data user’s verification request, the cloud data user has to control the cloud to return data correctly,
server follows Algorithm 2 to prepare and return the which is not user friendly. In addition, during the process

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 7

Verification E(PK,
V1 V2 V3 V4 Verification
data E(PK,V1) E(PK,V2) E(PK,0) E(PK,0) E(PK,0)
V1+V2) data buffer
E(PK,1) E(PK,1) E(PK,0) E(PK,0) Request
data
Decrypted
V1 V2 V1+V2 0 0 0
data
V1 V2 V3 V4
E(PK,1) E(PK,1) E(PK,0) E(PK,0)
Fig. 4: Example of decrypting the verification data buffer
Verification
data buffer

E(PK,1)V1 E(PK,0)V3
E(PK,1)V1 E(PK,1)V2
· E(PK,1)V2
E(PK,0)V3
· E(PK,0)V4
E(PK,0)V4 entries of V B, respectively. Note that, since the data
E(PK,
users can pre-compute the entries where no collision
E(PK,V1) E(PK,V2)
V1+V2)
E(PK,0) E(PK,0) E(PK,0) occurs, instead of decrypting the whole verification data
buffer, the authorized data user only needs to decrypt
Fig. 3: Example of verification data buffer construction the entries where no collision occurs, which helps im-
prove the decryption efficiency.
Note that, since the cloud server knows that, if data
of specification, the data user would reveal his sensitive collision happens in an entry, the data in that entry
data. Therefore, we propose the third way, i.e., hash cannot be recovered. To prevent the data user from
verification data into verification data buffer directly, recovering the verification data and detecting a misbe-
to let the cloud return the verification data without havior, the cloud server would contaminate the entries
knowing which and how many verification data are in the verification buffer set, and deceive that collision
actually returned. happens in these entries. However, this attack cannot
Now we give an example (shown in Fig. 3) to illus- be achieved. The fundamental reason is that, the data
trate how to map the verification data into the veri- user specifies the IDs of data owners whose verification
fication data buffer, based on the homomorphic prop- data will be returned, and knows the κ hash functions,
erty of Paillier encryption. First of all, the cloud serv- therefore, the data user can foresee whether collision
er finds the encrypted verification data {V1 , V2 , V3 , V4 }. happens in an entry in the verification data buffer. When
Then he conducts calculation on the cipher-text, i.e., the cloud server contaminates the data in these entries,
V V V V
{E(P K, 1) 1 , E(P K, 1) 2 , E(P K, 0) 3 , E(P K, 0) 4 }. Due the misbehavior can be easily detected.
to the homomorphism of Paillier encryption, we have:
D(E(P K, 1)V1 ) = D(E(P K, 1 · V1 )) = D(E(P K, V1 )) 5.5.2 Verifying the ranked search results
D(E(P K, 1)V2 ) = D(E(P K, 1 · V2 )) = D(E(P K, V2 )) When the data user gets some data owners’ verification
D(E(P K, 0)V3 ) = D(E(P K, 1 · V3 )) = D(E(P K, V3 )) data, he can further recover all the sampled data and an-
D(E(P K, 0)V4 ) = D(E(P K, 1 · V4 )) = D(E(P K, V4 )) chor data. The data user will use them to verify whether
the returned results are correct. The verification is done
Further, the cloud uses two hash function h1 (i) and in two steps: first, the data user verifies whether the data
h2 (i) to map the encrypted data to the verification data from a specific data owner is correct. If the search results
buffer. Again, with the homomorphic property, we have pass the first verification, the verification process turns
interesting outcomes, e.g., D(E(P K, 0) · E(P K, V1 )) = to the second step, i.e., with the help of anchor data,
D(E(P K, 0+V1 )) = D(E(P K, V1 )). In this way, the veri- the data user verifies whether the search results from
fication data V1 is returned without known by the cloud. different data owners are correct. After verification, the
Finally, the result buffer is returned to the data user. As data user can detect the cloud server’s misbehavior with
we can see, from the viewpoint of the cloud server, the a high probability. In Section 7, we will give an analysis
verification data of four data owners (i.e., O1 , O2 , O3 , of the detection probability.
and O4 ) are processed. As a matter of fact, only O1 ’s and
O2 ’s verification data are returned, which is not known
by the cloud. 5.6 Extension: Multi-dimensional Verification
As we know, a file is often accompanied with several
5.5 Verifying Search Results keywords, i.e., the index of the files are often multi-
dimensional. If we return verification data for each
5.5.1 Recovering verification data dimension (keyword), the communication cost will in-
Upon receiving the verification data buffer V B, the data crease linearly with the amount of dimension increasing.
user decrypts it with the corresponding private key SK. As a matter of fact, there will be some relationship
After decryption, a data user can recover verification among different dimensions. For example, assume the
data from each entry where no collision happens (only height of a man is 2.1 meters, then the weight of this
one owner’s data is mapped in the entry, and no other man would very likely be more than 60 kilograms.
data are mapped). We use the pearson correlation coefficient to e-
Fig. 4 shows the decryption result of V B in Fig. 3, the valuate the correlation of the order list between t-
data user can recover V1 , V2 from the first and second wo keywords (dimensions). Given w1 ’s order list

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 8

4
discuss how to combine different dimensions based on
3.5 the characteristics of the data set, and to what extent can
3 we combine data, so that there would be an excellent
2.5 tradeoff.
2
r2

1.5 5.7 Updating verification data stored on the cloud


1
server
0.5
Since the data users in the system would also involve
data owners, if they find that some dishonest behavior
0
0 1 2 3 4 5 6 7 8 9 10 is not detected by the existed verification data, they will
r1
update the verification data stored on the cloud server.
Fig. 5: Example of binding dimensions The update can be finished in three steps: first, they
download the verification data from the cloud server.
Second, they decrypt the verification data, update it with
r1 =< r11 , r12 , · · · , r1n > and w2 ’s order list r2 =< new sampled data and anchor data, and encrypt the new
r21 , r22 , · · · , r2n >, we first compute the covariance verification data. Third, they outsource the ciphertext of
cov(r1 , r2 ) of the two lists, where cov(r1 , r2 ) = E((r1i − the updated verification data to the cloud.
E(r1 ))(r2i − E(r2 ))). Then we compute the pearson cor-
relation coefficient r = cov(r1 , r2 )/(σ(r1 ) · σ(r2 )). Finally, 5.8 Auxiliary Deterrent Method
we use the relationship rule of thumb [34] √ to evaluate To strengthen the deterrent on the cloud server, we
the correlation of r1 and r2 . If |r| ≥ 2/ n, then there
illustrate some auxiliary deterrent methods here. If the
exists a strong correlation between r1 and r2 . Otherwise,
data user detects any problems during the process of
the correlation is ignored.
verification, he will announce the errors to the public.
Fig. 5 shows an example of binding two dimensions
Otherwise, the data user publishes all the returned data
together. Dimension r1 has value <0.5, 1, 1.5, 2, 2.5, 3, 3.5,
from the cloud, his trapdoor, and secret verification
4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9> and r2 has value <0.5,
request to the public for supervision. We assume the
1, 1.2, 1.3, 1.5, 1.8, 1.9, 2, 2.1, 2.15, 2.4, 2.45, 2.5, 2.7, 2.75,
data owners will periodically check this data published
2.78, 2.8, 2.88>. We first compute the standard deviation
by the data users. Data owners who have more relevant
of r1 and r2 , and we get σ(r1 ) = 2.67 and σ(r2 ) = 0.7
files, but are not returned as results, will soon detect the
respectively. Then we compute the covariance of r1 and
dishonest behavior of the cloud. This scheme will suffer
r2 and get cov(r1 , r2 ) = 1.82. Now we can get the cor-
some delay, but it will strengthen the deterrent on the
relation coefficient √ 1 ) · σ(r2 )) = 0.97.
√ r = cov(r1 , r2 )/(σ(r cloud server. On the other hand, once the data users
Since 0.97 > 2/ 18, i.e., r > 2/ n, according to the
discover the dishonest behavior, the cloud server should
relationship rule of thumb, we are sure that there exists
be seriously penalized. This will force the cloud server
a strong correlation between r1 and r2 . Therefore, we
to not dare behave dishonestly.
further deduce the relationship between r1 and r2 , that

is r2 = r1 , and the sliding interval is ±0.2.
By finding the relationships among dimensions, we 6 S ETTING THE OPTIMAL PARAMETERS
can bind these dimensions together, and only return very Since the data users are often resource-limited, to control
few dimensions as verification data. As a result, even the communication cost, it would be significant to enable
if some dimensions are not returned, its value can also data users to specify the length of verification data
be estimated and verified. In our scheme, data owners buffer. To increase the detection probability of potentially
first dig into the relationship among different dimen- dishonest behaviors of the cloud server, it would be
sions, then they model these relationships with some crucial to recover as many verification data from the
functions, and further set some bound value for these data buffer as possible. An intuitive way of improving
relationships. Finally, these functions and bound values the detection probability is to let the cloud server map
are used to bind the correlated dimensions. Therefore, as many data items in the verification data buffer as
once the value of one dimension is returned, the value possible. However, the more data items are mapped,
of its correlated dimensions can also be estimated. the higher probability that a collision will occur. As a
result, when too many data items are mapped into the
5.6.1 Discussion verification data buffer, the amount of data that can be
Binding data among dimensions would increase the recovered from the data buffer would be very small.
detection efficiency. However, when we bind too many Assume the data user specifies the length of the ver-
dimensions, the cloud would be much easier to cheat. ification data buffer as λ, the number of hash function
Therefore, the problem becomes how to improve the de- used for mapping as κ, and the size of enlarged ID set
tection efficiency as much as possible without sacrificing (i.e., the number of verification data that we map into
too much security goals. In the future work, we plan to the data buffer) as β. The coming question is that, given

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 9

λ and κ, to maximize the number of verification data are already mapped, as E(x), therefore,
that can be recovered from the data buffer, how to set
the optimal β. ∑
β
E(x) = x · px
Next, we introduce how to obtain the optimal β step x=1
by step. First of all, we compute the probability of re- ∑β
covering x data items from the data buffer where β data x β−x
= x · C(β, x) · (ps ) · (1 − ps ) (4)
items are already mapped. Obviously, the probability is x=1
the same with recovering x colors from the following ∑
β−1
x β−x−1
color survival game [35] [36]. = β · ps · C(β − 1, x) · (ps ) · (1 − ps )
Color Survival Game: Assume there are β colors, x=0

each color has the same κ balls, we throw these balls = β · ps


into λ buckets uniformly at random. Once only one ball
Obviously, for any κ ≥ 4, the expected number that
falls into a bucket, we say the ball survives. Otherwise,
we can recover from the data buffer is E(x) = ⌊λ/κ⌋,
we say the ball fails. Once any one of κ balls of a
i.e., ⌊β · ps ⌋ = ⌊λ/κ⌋. With Eq. 1, Eq. 2,⌊ and Eq. 4, we can

color survives, we say the color survives. Obviously, ln(κ)
the probability of successfully recovering x data from compute β, which is very close to 2 − ln(λ−κ)−ln(λ) .
the data buffer is equal to the probability of x colors Therefore, to recover the max number of data items from
surviving. the data buffer, the optimal number of data items that
we map into the data buffer is:
Since the probability that color i survives depends on
how many buckets are covered by other β − 1 colors, ⌊ ⌋
ln(κ)
we first compute the number of buckets covered by the β = 2− (5)
ln(λ − κ) − ln(λ)
other β − 1 colors. Obviously, each color covers κ/λ
percent of the buckets, and the coverage of these colors is Now, we can conclude that, suppose the data user
relatively independent; therefore, β − 1 colors will cover specify the length of the verification data buffer as λ,
1 − (1 − κ/λ)β−1 percent of the total buckets. We denote and the number of hash function used for mapping as
the number of buckets covered by the other β − 1 colors κ, to maximize the number of verification data that will
as T , therefore, be recovered from the data buffer, we⌊ need to specify ⌋
ln(κ)
the size of the enlarged ID set to be 2 − ln(λ−κ)−ln(λ) .
⌊ ( )⌋
β−1
T = λ · 1 − (1 − κ/λ) (1)
7 A NALYSIS
The probability of color i covered by other colors is In this Section, we will give a thorough analysis of the
C(T, κ)/C(λ,(κ),) where C(T, κ) denotes the combinato- security and performance of our proposed schemes. First
rial number Tκ . We denote the survival probability of of all, we will analyze the security. Then we illustrate the
color i as ps , therefore deterrent proposed by our scheme. Further, we describe
the detailed computation and communication cost for
data owners, cloud server and data users. Finally, we an-
ps = 1 − C(T, κ)/C(λ, κ) (2) alyze the detection probability of our proposed schemes
with the verification data.

Obviously, each color shares the same survival prob-


7.1 Security Analysis
ability. Assume the probability that the exact x colors
survive from the buckets is denoted as px , then px = Recall that, for ranked and privacy preserving keyword
x β−x
C(β, x) · (ps ) · (1 − ps ) . As we can see, during this search, data owners encrypt the keywords, relevance
computation, these colors are not strictly independent, scores, and files before they outsource these data items
but for the values that make sense, they are essentially to the cloud. Data users also generate secure trapdoors
independent. Therefore, the probability that exactly x before submitting them to the cloud. The security is
data items can be recovered from the verification data proven in [17]. For search result verification, each data
buffer is: owner constructs a secret verification data which is
encrypted with the AES encryption [33]. Therefore, the
verification data is secure as long as the AES encryption
x β−x
px = C(β, x) · (ps ) · (1 − ps ) (3) is secure. The verification request is encrypted with
Paillier encryption [32], the cloud server only conducts
computation on the cipher-texts. Therefore, the verifica-
We denote the expected number of data items that can tion process is secure as long as the Paillier encryption
be recovered from the data buffer, where β data items is secure.

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 10

0.9995 1 0.9998
γ=4 m=10
m=10
γ=6 m=20
0.9994 0.9998 m=20
0.9996

Detection probability

Detection probability
Detection probability

0.9993 0.9996
0.9994
0.9992 0.9994

0.9992
0.9991 0.9992

0.999 0.999 0.999


10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
k ki γ

(a) m = 10, d = 100, ki = 1. (b) d = 100, γ = 10, k = 50. (c) d = 100, ki = 1, k = 50.
Fig. 6: Misbehavior Detection Probability

7.2 Deterrent Analysis The communication cost mainly comes from transmit-
In this paper, we propose the deterrent-based verification ting the data buffer and receiving verification data. The
scheme. During the whole process of verification, the cloud server needs to return data buffer with λ entries,
cloud server only conducts computation on cipher-texts. where the communication cost of each entry is O(ψ · θ).
Therefore, he does not know how many data owners’ For verification data receiving, assume there are m data
verification data are actually used for verification, or owners in the system, the cost is O(m · θ · ψ). Therefore,
which data owners’ data are embedded in the verifi- the communication cost for the cloud server is O(λ·ψ ·θ).
cation data buffer. Furthermore, the cloud server is not O(max {λ · ψ · θ, m · θ · ψ}).
clear of which data owners, or how many data owners
exchange anchor data. We keep all these information 7.3.3 Costs for Data User
secret to the cloud server. All the cloud server knows is The computational cost for data users spent on verifica-
that, once he behaves dishonestly, he will be discovered tion mainly comes from three aspects: first, constructing
with a high probability, and punished seriously once the verification request. That’s O(β). Second, decrypting
discovered. the entries where the α requested ID corresponds and no
collision happen, that’s O(κ·α). Third, detecting whether
7.3 Performance Analysis misbehavior happens, the cost is O(α · θ · ψ). So the
computational cost is O(max {β, κ · α, α · θ · ψ}).
7.3.1 Costs for Data Owners
The communication cost for the data users spent on
The computational cost for the data owners spent on verification comes from receiving the verification data
verification mainly comes from constructing the verifi- buffers, so the communication cost is O(λ · ψ · θ).
cation data. For data sampling, the running time mainly
comes from ordering files for each keyword; therefore
the computational complexity is O(d · log(d)). For data 7.4 Misbehavior Detection probability
assembling, data owners need to rank the ψ ·θ data items Our proposed scheme should not only ensure a strong
and then encrypt the assembled data, which is O(ψ · θ). deterrent for potential attacks, but also achieve high
Therefore, the computational complexity for each data detection probability once the compromised cloud server
owner is O(max {d · log(d), ψ · θ}). misbehaves. Now we analyze the detection probability.
The communication cost mainly comes from two as- For the k returned search results, suppose ki out of
pects, i.e., anchor data exchanging, and the verification the k results are also contained in the verification data.
data buffer transmission. For anchor data exchanging, Assume there are m data owners in our system, and
each data owner needs to transmit ψ anchor data to θ −1 the data user recovers γ distinct verification data from
data owners, the cost is O(θ · ψ). For verification data the data buffer. The data user can detect an error with
buffer transmission, the communication cost is O(θ · ψ). probability Pe :
So the total communication cost is O(θ · ψ).
P (m · d − γ, k − ki )
Pe = 1 − (6)
7.3.2 Costs for Cloud Server P (m · d, k)
The computational cost for the cloud server spent on The figure shown in Fig. 6 describes the relationship
verification mainly comes from mapping the verification between the detection probability and the corresponding
data into the data buffer. Since the data user provides an parameters. From Fig. 6(a), we observe that, when we
enlarged size of ID set, i.e., β, the cloud server needs to set m = 10 (number of data owners involved in the
map the corresponding β data owners’ verification data system), d = 100 (average number of files corresponding
to the data buffer, where each data item is mapped κ to a keyword), even if ki = 1, i.e., there are only one
times. Therefore, the computational complexity for the out of k search results has its corresponding verification
cloud server is O(β · κ). data, the detection probability is still more than 0.999.

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 11

We can also see that, with the number of returned results α symmetric encryption operations is relatively low.
(k) increases, the detection probability increases linearly. Fig. 9(b) demonstrates the time cost of decrypting and
Additionally, the larger γ (number of distinct returned recovering the verification data. Since the time is mainly
verification data) is, the higher detection probability is spent on decrypting the verification data buffer, with the
achieved. The fundamental reason can be easily inferred size of the data buffer increasing, the corresponding time
from the Equation 6. In Fig. 6(b), we set d = 100, γ = 10, cost of data users increases linearly.
k = 50, with ki increases, the detection probability also Fig. 10 shows the ratio of data recovered from the
increases. When ki is larger than 2, the detection proba- verification data buffer with different parameters. Fig.
bility is very close to 1. From Fig. 6 (c), we see that, when 10(a) shows that, when the size of the verification data
we set k = 50, ki = 1, with the value of γ grows up, the buffer is set to be 500, i.e., λ = 500, the ratio of data
detection probability also grows linearly. Additionally, recovered from the verification data buffer will decrease
the larger m is, the higher probability would be achieved. with the number of mapped data increasing. Meantime,
the more hash functions we use, the lower ratio that
8 P ERFORMANCE E VALUATION data can be recovered. The fundamental reason is that,
the more data we map into the data buffer, the higher
In this Section, we demonstrate a thorough evaluation probability that data collision will occur, which renders
on our proposed scheme. First of all, we evaluate the some data cannot be recovered. Therefore, the ratio of
computational cost of the data owners, the cloud server, recovered data will decrease accordingly.
and the data users. Then we evaluate the functionality Fig. 10(b) demonstrates that, when we set the number
of the verification data buffer. of hash functions to be 30, i.e., κ=30, with the size of the
data buffer increasing, the ratio of recovered data will
8.1 Experiment Settings increase. Meanwhile, the less data we map, the faster
The experiment programs are coded using the Python the data recovery ratio will increase. The reason is that
programming language on a Laptop with 2.2GHZ Intel larger data buffer and less data mapped into the buffer
Core CPU and 2GB memory. We use the Pailier Encryp- will reduce the probability of data collision; therefore,
tion [32] for data encryption; the secret key is set to be the ratio of recovered data will increase.
512 bits. Fig. 10(c) demonstrates that, when the number of
entries is set to 500, the ratio of data recovery will
decrease with the number of hash functions increasing;
8.2 Experiment Results the more data items we map into the verification data
Recall that the time cost for each data owner mainly buffer, the faster the data recover ratio will decrease.
comes from ranking, string concatenating, and sym- Fig. 11 shows the number of data recovered from the
metric encryption. Fig. 7(a) and Fig. 7(b) show that, verification data buffer with different parameters. Fig.
with the number of sampled data (ψ) and exchanged 11(a) shows that, when λ = 500, with the amount of
data (θ) increasing, the time cost of the data owners mapped data increasing, the number of data recovered
increases linearly. The fundamental reason is that, with ψ from the verification data buffer will first increase and
and θ increasing, more data items are concatenated and then fall down. We illustrate this phenomenon as fol-
encrypted, which results in more time consumption. lows: when we map a few data items into the verification
The time cost of the cloud server spent on verification data buffer, few data collisions will occur, and almost
mainly comes from two aspects, i.e, conducting compu- all the data can be recovered from the verification data
tation on the user submitted cipher-text and mapping buffer. Therefore, the amount of recovered data will in-
the computation result to the verification data buffer. crease. However, when the amount of data we map into
Fig. 8(a) shows that, with the size of enlarged ID set the verification data buffer increases to a threshold, data
(β) increasing, the corresponding time cost will increase collision will also increase with the increasing number of
linearly. The reason is that more IDs will lead to more distributed data. Obviously, data collision will cause data
computation on the cipher-text, and more time will be to be unrecoverable. Therefore, the amount of recovered
needed. Fig. 8(b) demonstrates that, the time cost of the data will decrease when we map too many data items
cloud server has little connection with the size of the into the verification data buffer.
verification data buffer. Fig. 11(b) demonstrates that, when we set κ = 30, with
Fig. 9 shows the time cost of data users. Fig. 9(a) illus- the size of the verification data buffer increasing, the
trates the time cost of generating verification request. As number of recovered data will increase. An interesting
we can see, when the size of the enlarged ID set increases feature shown in Fig. 11(b) is that, the fewer items we
from 10 to 100, the corresponding time cost increases map into the verification data buffer, the prior data
from 0.024s to 0.24s. The reason is that, the larger the can be recovered from the verification data buffer. The
β is, the more Paillier encryption will be conducted. We fundamental reason is that, with fewer data items, we
also observe that the time cost of the data user has little need fewer entries to accommodate data collision.
connection with α. The reason is that, compared with Fig. 11(c) demonstrates that, when λ = 500, the
the time cost spent on Paillier encryption, conducting amount of recovered data will decrease with the number

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 12

8 8
θ=10 ψ=10
7 7

Time Cost of Operation (ms)

Time Cost of Operation (ms)


θ=20 ψ=20
6 6

5 5

4 4

3 3

2 2

1 1

0 0
10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100
Number of Sampled Data Number of Exchanged data

(a) (b)
Fig. 7: Time cost of the data owner

10 5
9 λ=100 β=10
λ=200 β=30

Time Cost of Operation (s)


4
Time Cost of Operation (s)

8
7
6 3
5
4 2
3
2 1
1
0 0
10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100
Size of Enlarged ID Data Size of Verification Data Buffer

(a) (b)
Fig. 8: Time cost of the cloud server

25 5
Time Cost of Operation (× 10−2s)

α=4
Time Cost of Operation (× 10−2s)

α=4
α=10 4 α=10
20

15 3

10 2

5 1

0 0
10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 10
Size of Enlarged Data Set Size of Verification Data Buffer ( × 102)

(a) (b)
Fig. 9: Time cost of the data user

of hash functions increases; additionally, the more data exchange anchor data used for verification, he also does
items we map into the verification data buffer, the faster not know which data owners’ data are embedded in
the number of recovered data will decrease. The reason the verification data buffer or how many data owners’
is that, the more mapping operations and data items verification data are actually used for verification. All
we map into the verification data buffer, the higher the cloud server knows is that, once he behaves dishon-
probability that a data collision will occur, which leads estly, he would be discovered with a high probability,
to the amount of recovered data decreasing. and punished seriously once discovered. Additionally,
when any suspicious action is detected, data owners
can dynamically update the verification data stored on
9 C ONCLUSION the cloud server. Furthermore, our proposed scheme
In this paper, we explore the problem of verification allows the data users to control the communication
for the secure ranked keyword search, under the model cost for the verification according to their preferences,
where cloud servers would probably behave dishonestly. which is especially important for the resource limited
Different from previous data verification schemes, we data users. Finally, with thorough analysis and extensive
propose a novel deterrent-based scheme. During the experiments, we confirm the efficacy and efficiency of
whole process of verification, the cloud server is not our proposed schemes.
clear of which data owners, or how many data owners

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 13

1 1 1
0.9 0.9 θ=50 0.9
θ=70
Ratio of Recovered Data 0.8 0.8 0.8

Ratio of Recovered Data

Ratio of Recovered Data


θ=90
0.7 0.7 0.7
0.6 0.6 0.6
0.5 0.5 0.5
0.4 0.4 0.4
0.3 0.3 0.3
κ=20 θ=30
0.2 κ=30 0.2 0.2
θ=40
0.1 κ=40 0.1 0.1
θ=50
0 0 0
10 20 30 40 50 60 70 80 90 100 100 200 300 400 500 600 700 800 900 1000 10 20 30 40 50 60 70 80 90 100
Number of Mapped Data Number of Entries Number of hash function

(a) λ = 500. (b) κ=30. (c) λ =500.


Fig. 10: Ratio of data recovered from the verification data buffer

50 80 50
κ=20 θ=50
45 70 45
κ=30 θ=70
Number of Recovered Data

Number of Recovered Data

Number of Recovered Data


40 40
κ=40 60 θ=90
35 35
30 50
30
25 40 25
20 30 20
15 15
20 κ=30
10 10
κ=40
5 10 5 κ=50
0 0 0
10 20 30 40 50 60 70 80 90 100 100 200 300 400 500 600 700 800 900 1000 10 20 30 40 50 60 70 80 90 100
Number of Mapped Data Number of Entries Number of hash function

(a) λ = 500. (b) κ=30. (c) λ =500.


Fig. 11: Amount of data recovered from the verification data buffer

ACKNOWLEDGMENTS [9] B. Hore, E. C. Chang, M. H. Diallo, and S. Mehrotra, “Indexing


encrypted documents for supporting efficient keyword search,”
This work is supported in part by the National Natu- in Proc. Secure Data Management (SDM’12), Istanbul, Turkey, Aug.
ral Science Foundation of China (Project No. 61173038, 2012, pp. 93–110.
61472125). Thanks to the China Scholarship Council for [10] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, “Fuzzy
keyword search over encrypted data in cloud computing,” in Proc.
providing the foundation. IEEE INFOCOM’10, San Diego, CA, Mar. 2010, pp. 1–5.
[11] M. Chuah and W. Hu, “Privacy-aware bedtree based solution
R EFERENCES for fuzzy multi-keyword search over encrypted data,” in Proc.
IEEE 31th International Conference on Distributed Computing Systems
[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Kon- (ICDCS’11), Minneapolis, MN, Jun. 2011, pp. 383–392.
winski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, [12] P. Xu, H. Jin, Q. Wu, and W. Wang, “Public-key encryption with
“A view of cloud computing,” Communication of the ACM, vol. 53, fuzzy keyword search: A provably secure scheme under keyword
no. 4, pp. 50–58, 2010. guessing attack,” IEEE Transactions on Computers, vol. 62, no. 11,
[2] C. Zhu, V. Leung, X. Hu, L. Shu, and L. T. Yang, “A review of key pp. 2266–2277, 2013.
issues that concern the feasibility of mobile cloud computing,” in [13] B. Wang, S. Yu, W. Lou, and Y. T. Hou, “Privacy-preserving multi-
Green Computing and Communications (GreenCom), 2013 IEEE and keyword fuzzy search over encrypted data in the cloud,” in IEEE
Internet of Things (iThings/CPSCom), IEEE International Conference INFOCOM, Toronto, Canada, May 2014, pp. 2112–2120.
on and IEEE Cyber, Physical and Social Computing. IEEE, 2013, pp. [14] C. Wang, K. Ren, S. Yu, and K. M. R. Urs, “Achieving usable and
769–776. privacy-assured similarity search over outsourced cloud data,” in
[3] Ritz, “Vulnerable icloud may be the reason to celebrity photo Proc. IEEE INFOCOM’12, Orlando, FL, Mar. 2012, pp. 451–459.
leak.” [Online]. Available: http://marcritz.com/icloud-flaw-leak/
[15] W. Sun, S. Yu, W. Lou, Y. T. Hou, and H. Li, “Protecting your
[4] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, “Secure ranked key-
right: Attribute-based keyword search with fine-grained owner-
word search over encrypted cloud data,” in Proc. IEEE Distributed
enforced search authorization in the cloud,” in Proc. IEEE INFO-
Computing Systems (ICDCS’10), Genoa, Italy, Jun. 2010, pp. 253–
COM’14, Toronto, Canada, May 2014, pp. 226–234.
262.
[5] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving [16] Q. Zheng, S. Xu, and G. Ateniese, “Vabks: Verifiable attribute-
multi-keyword ranked search over encrypted cloud data,” in Proc. based keyword search over outsourced encrypted data,” in Proc.
IEEE INFOCOM’11, Shanghai, China, Apr. 2011, pp. 829–837. IEEE INFOCOM’14, Toronto, Canada, May 2014, pp. 522–530.
[6] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and [17] W. Zhang, S. Xiao, Y. Lin, T. Zhou, and S. Zhou, “Secure ranked
H. Li, “Privacy-preserving multi-keyword text search in the cloud multi-keyword search for multiple data owners in cloud com-
supporting similarity-based ranking,” in Proc. IEEE ASIACCS’13, puting,” in Proc. 44th Annual IEEE/IFIP International Conference on
Hangzhou, China, May 2013, pp. 71–81. Dependable Systems and Networks (DSN2014). Atlanta, USA: IEEE,
[7] Z. Xu, W. Kang, R. Li, K. Yow, and C. Xu, “Efficient multi-keyword jun 2014, pp. 276–286.
ranked query on encrypted data in the cloud,” in Proc. IEEE [18] W. Zhang, Y. Lin, S. Xiao, Q. Liu, and T. Zhou, “Secure dis-
Parallel and Distributed Systems (ICPADS’12), Singapore, Dec. 2012, tributed keyword search in multiple clouds,” in Proc. IEEE/ACM
pp. 244–251. IWQOS’14, Hongkong, May 2014, pp. 370–379.
[8] A. Ibrahim, H. Jin, A. A. Yassin, and D. Zou, “Secure rank- [19] B. Wang, B. Li, and H. Li, “Oruta: Privacy-preserving public
ordered search of multi-keyword trapdoor over encrypted cloud auditing for shared data in the cloud,” IEEE Transactions on Cloud
data,” in Proc. IEEE Asia-Pacific Conference on Services Computing Computing, vol. 2, no. 1, pp. 43–56, 2014.
(APSCC’12), Guilin, China, Dec. 2012, pp. 263–270. [20] J. Li, X. Tan, X. Chen, D. Wong, and F. Xhafa, “Opor: En-

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 14

abling proof of retrievability in cloud computing with resource- residuosity classes,” in Advances in cryptologyłEUROCRYPT99.
constrained devices,” IEEE Transactions on Cloud Computing, 2014. Springer, 1999, pp. 223–238.
[21] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan, “Verifying [33] J. Daemen and V. Rijmen, The design of Rijndael: AES-the advanced
completeness of relational query results in data publishing,” in encryption standard. Springer, 2002.
Proceedings of the 2005 ACM SIGMOD international conference on [34] S. B. Green, “How many subjects does it take to do a regression
Management of data. ACM, 2005, pp. 407–418. analysis,” Multivariate behavioral research, vol. 26, no. 3, pp. 499–
[22] M. Narasimha and G. Tsudik, “Dsac: integrity for outsourced 510, 1991.
databases with signature aggregation and chaining,” in Proceed- [35] “Importance of being urnest.” [Online]. Available:
ings of the 14th ACM international conference on Information and http://www.mathpages.com/home/kmath321.htm
knowledge management. ACM, 2005, pp. 235–236. [36] R. Ostrovsky and W. E. Skeith III, “Private searching on streaming
[23] H. Pang and K. Mouratidis, “Authenticating the query results of data,” in Advances in Cryptology–CRYPTO 2005. Springer, 2005,
text search engines,” Proceedings of the VLDB Endowment, vol. 1, pp. 223–240.
no. 1, pp. 126–137, 2008.
[24] R. C. Merkle, “A certified digital signature,” in Proc. Advances in
Cryptology (CRYPTO’89), California USA, Aug. 1989, pp. 218–238.
[25] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, “Dynam-
ic authenticated index structures for outsourced databases,” in Wei Zhang was born in 1990, received his B.S.
Proceedings of the 2006 ACM SIGMOD international conference on degree in Computer Science from Hunan Uni-
Management of data. ACM, 2006, pp. 121–132. versity, China, in 2011. Since 2011, he has been
[26] Y. Yang, S. Papadopoulos, D. Papadias, and G. Kollios, “Au- a Ph.D. candidate in College of Computer Sci-
thenticated indexing for outsourced spatial databases,” The VLDB ence and Electronic Engineering, Hunan Univer-
JournalłThe International Journal on Very Large Data Bases, vol. 18, sity. Since 2014, he has been a visiting student
no. 3, pp. 631–648, 2009. in Department of Computer and Information Sci-
[27] Q. Chen, H. Hu, and J. Xu, “Authenticating top-k queries in ences, Temple University. His research interests
location-based services with confidentiality,” Proceedings of the include cloud computing, network security and
VLDB Endowment, vol. 7, no. 1, 2013. data mining.
[28] H. Hu, J. Xu, Q. Chen, and Z. Yang, “Authenticating location-
based services without compromising location privacy,” in Pro- Yaping Lin received the B.S. degree in Comput-
ceedings of the 2012 ACM SIGMOD International Conference on er Application from Hunan University, China, in
Management of Data. ACM, 2012, pp. 301–312. 1982, and the M.S. degree in Computer Applica-
[29] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. Hou, and H. Li, tion from National University of Defense Tech-
“Verifiable privacy-preserving multi-keyword text search in the nology, China in 1985. He received the Ph.D.
cloud supporting similarity-based ranking,” TPDS, 2013. degree in Control Theory and Application from
[30] W. Zhang, S. Xiao, Y. Lin, J. Wu, and S. Zhou, “Privacy preserving Hunan University in 2000. He has been a pro-
ranked multi-keyword search for multiple data owners in cloud fessor and Ph.D supervisor in Hunan University
computing,” IEEE Transactions on Computers, 2015. since 1996. During 2004-2005, he worked as
[31] D. Eastlake and P. Jones, “Us secure hash algorithm 1 (sha1),” a visiting researcher at the University of Texas
2001. at Arlington. His research interests include ma-
[32] P. Paillier, “Public-key cryptosystems based on composite degree chine learning, network security and wireless sensor networks.

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

You might also like