This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 1
Abstract—With the advent of cloud computing, more and more people tend to outsource their data to the cloud. As a fundamental form of data utilization, secure keyword search over encrypted cloud data has recently attracted the interest of many researchers. However, most existing research is based on an ideal assumption that the cloud server is "curious but honest", and the search results are not verified. In this paper, we consider a more challenging model, in which the cloud server may behave dishonestly. Based on this model, we explore the problem of result verification for secure ranked keyword search. Different from previous data verification schemes, we propose a novel deterrent-based scheme. With our carefully devised verification data, the cloud server cannot know which data owners, or how many data owners, exchange anchor data that will be used for verifying the cloud server's misbehavior. With our systematically designed verification construction, the cloud server cannot know which data owners' data are embedded in the verification data buffer, or how many data owners' verification data are actually used for verification. All the cloud server knows is that, once he behaves dishonestly, he will be discovered with a high probability, and punished seriously once discovered. Furthermore, we propose to optimize the values of the parameters used in the construction of the secret verification data buffer. Finally, with thorough analysis and extensive experiments, we confirm the efficacy and efficiency of our proposed schemes.
1 INTRODUCTION

With the advent of cloud computing, more and more people tend to outsource their data to the cloud. Cloud computing provides tremendous benefits including easy access, decreased costs, quick deployment, and flexible resource management [1], [2]. Enterprises of all sizes can leverage the cloud to increase innovation and collaboration.

Although cloud computing brings a lot of benefits, for privacy concerns, individuals and enterprise users are reluctant to outsource their sensitive data, including private photos, personal health records, and commercial confidential documents, to the cloud, because once sensitive data are outsourced to a remote cloud, the corresponding data owner directly loses control of these data. Apple's iCloud leakage of celebrity photos in 2014 [3] has furthered our concern regarding the cloud's data security. Encrypting sensitive data before outsourcing is an alternative way to preserve data privacy against adversaries. However, data encryption becomes an obstacle to the utilization of traditional applications, e.g., plaintext-based keyword search.

To achieve efficient data retrieval from encrypted data, many researchers have recently put effort into secure keyword search over encrypted data [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. However, all these schemes are based on the ideal assumption that the cloud server is "curious but honest". Unfortunately, in practical applications, the cloud server may be compromised and behave dishonestly [19], [20]. A compromised cloud server would return false search results to data users for various reasons:

1) The cloud server may return forged search results. For example, the cloud may rank an advertisement higher than other results, since the cloud can profit from it; or the cloud may return random large files to earn money, since the cloud adopts the "pay as you consume" model.

2) The cloud server may return incomplete search results in peak hours to avoid suffering from performance bottlenecks.

There are some researches that focus on search result verification [21], [22], [23], [24], [25], [26], [27], [28], [29]. However, these methods cannot be applied to verify the top-k ranked search results in the cloud computing environment, where numerous data owners are involved. We illustrate the reason from two aspects.

1) Existing schemes share a common assumption, i.e., that data owners foresee the order of search results. However, in practical applications, numerous data owners are involved; each data owner only knows its own partial order. Without knowing the total order, these data owners cannot use the conventional schemes to verify the search results.

2) For a top-k ranked keyword search (e.g., k = 10), only a few data owners will have satisfying files in the search results. Traditional methods need to return a lot of data to verify whether the huge number of absent data owners have satisfying search results. However, the top-k ranked keyword search is, to some extent,

• W. Zhang and Y. Lin are with the College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China, and with the Hunan Provincial Key Laboratory of Dependable Systems and Networks, Changsha 410082, China. Yaping Lin is the corresponding author. E-mail: {zhangweidoc, yplin}@hnu.edu.cn.
2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 2: The process of verification (1. Preparing verification data; 2. Constructing verification request; 3. Mapping verification data to a data buffer; 4. Recovering and verifying)

3.3 Design Goals

To enable a secure and efficient verification for the ranked keyword search, our system design should simultaneously satisfy the following goals.

• Efficiency: The proposed scheme should allow data owners to construct the verification data efficiently. The cloud server should also return the verification data without introducing heavy costs. Additionally, data users should be able to verify the search results efficiently.

• Security: The proposed scheme should prevent the cloud server from knowing the actual value of the secret verification data, and which data owners' data are returned as verification data.

• Detectability: The proposed scheme should deter the cloud server from behaving dishonestly. Once the cloud server behaves dishonestly, the scheme should detect it with a high probability.

4 PRELIMINARIES

Before we introduce our detailed construction, we first briefly introduce some techniques that will be used in this paper.

4.1 Paillier Cryptosystem

The Paillier cryptosystem [32] is a public key cryptosystem with additive homomorphic properties. Let E(a) denote the Paillier encryption of a, and D(E(a)) denote the Paillier decryption of E(a); we have the following properties: for all a, b ∈ Z_n,

D(E(a) · E(b) mod n²) = a + b mod n
D(E(a)^b mod n²) = a · b mod n

4.2 Privacy-Preserving Ranked Keyword Search Among Multiple Data Owners

In our previous work [17], we introduced how to achieve ranked and privacy-preserving keyword search among multiple data owners. First of all, we systematically construct protocols on how to encrypt keywords for data owners, how to generate trapdoors for data users, and how to perform blind searching for the cloud server. As a result, different data owners use their own secret keys to encrypt their files and keywords, and authorized data users can issue queries without knowing the secret keys of these data owners. Then an Additive Order Preserving Function family is proposed, which enables different data owners to encode their relevance scores with different secret keys, and helps the cloud server return the top-k relevant search results to data users without revealing any sensitive information.

In this paper, we adopt this ranked and privacy-preserving keyword search scheme to return the top-k search results. Our goal is to systematically construct schemes that can verify whether the returned top-k search results are correct.

5 VERIFYING RANKED TOP-k SEARCH RESULTS

The basic idea of our deterrent-based verification scheme is elaborated as follows. We can consider the dishonest cloud server as a suspect, the data user as a police chief, and each piece of verification data as a policeman who masters part of the suspect's actions. Intuitively, the police chief can gather all the policemen to verify whether the suspect has committed a crime. However, this wastes a lot of manpower, money, and time. To overcome this problem, each time the suspect takes an action, the police chief only inquires of a few policemen to verify whether the suspect has committed a crime. During the process, the police chief ensures that the suspect does not know which policemen know his action, and which policemen are inquired of by the police chief. What the suspect knows is that, once he behaves dishonestly, he will be discovered with high probability, and punished seriously once discovered. By doing this, we can deter the suspect from behaving dishonestly.

Different from discovering misbehaviors of the cloud after the misbehaviors occur, we propose a complementary and preventive scheme to deter the cloud from behaving dishonestly. The deterrent in our scheme is derived from a series of constructions, which include embedding secret sampling data and anchor data in the verification data buffer, forcing the cloud to conduct blind computations on ciphertext, updating the verification data dynamically, and so on. The final goal of our deterrent-based scheme is to deter the cloud from behaving dishonestly; once it misbehaves, it will be detected with high probability.

In what follows, we first give an overview of the verification construction. Then we introduce the detailed construction step by step. Note that we first introduce how to achieve the single-dimensional verification, while we leave the multi-dimensional verification to the extension subsection.

5.1 Verification Construction Overview

The verification construction is composed of four steps, which are illustrated in Fig. 2.

First, each data owner prepares the verification data. Specifically, each data owner samples ψ files from the corresponding file set, and obtains the file IDs and relevance scores. With the effective data sampling, a data user can verify the correctness of search results belonging to a specific data owner with a high probability. Then each data owner exchanges these file IDs and relevance scores
as anchor data with other θ − 1 data owners uniformly at random, which will be used to verify the correctness of search results among data owners. After getting the sampled data and anchor data, each data owner concatenates these data into a string. The encryption of the string is used as the owner's verification data.

Second, data users construct a secret verification request, and indicate the size of the verification data buffer.

Third, the cloud server operates on the encrypted data and returns the verification data buffer.

Fourth, data users decrypt the returned search results and verify whether misbehavior occurs.

5.2 Preparing the verification data

In this subsection, we introduce how to prepare the verification data step by step.

5.2.1 Sampling from original data

Our sampling method is conducted in three steps. First, the data owner samples files from its original data set. Second, he extracts the corresponding file IDs and relevance scores. Third, he attaches the file ID and relevance score to the owner's ID. Assume data owner Oi has d files belonging to keyword wt; he samples ψ files' data from these d files for keyword wt. The corresponding process is shown in Algorithm 1. First of all, Oi initializes the head of the sampled data string as wt||i, where wt denotes the keyword and i denotes Oi's ID. Second, Oi ranks all the files corresponding to wt in descending order of relevance scores. Third, Oi concatenates FID[0]||RS0,t to wt||i, where FID[0] denotes the file ID of F0, and RS0,t denotes the relevance score between F0 and wt. Fourth, Oi samples ψ − 1 data items from the remaining d − 1 files. The sampled files are thus composed of two kinds, i.e., the first file in wt's sorted file list, and the other ψ − 1 uniformly and randomly sampled files. Then, all these ψ sampled data and the head data are concatenated. Finally, the algorithm outputs the sampled data set SDi.

Algorithm 1 Constructing Sampled Data
Input: Oi's ID: i, number of sampled data: ψ, and wt's file list: FID[d]
Output: Sampled data: SDi
1: Initialize sampled data SDi to wt||i
2: Rank wt's file list FID[d] in descending order of relevance scores
3: Concatenate FID[0]||RS0,t to SDi
4: Uniformly and randomly generate a ψ − 1 number set R where R[i] ∈ [1, d]
5: Rank R incrementally
6: for ind = 1 to ψ − 1 do
7:   concatenate FID[R[ind]]||RSR[ind],t to SDi
8: end for
9: return SDi

Now we give an example to illustrate the feasibility of sampling a subset of data as verification data. Assume professor A has 100 students who have different numbers of published papers. Professor A is very sure that his student SB has the most papers, and SC has the third most papers, but he is not clear about the publications of the other students. When the professor asks which students have the top-5 most papers, the professor will detect a false answer with a probability of more than 0.999. This probability is computed as follows. To compute the probability that professor A can detect a misbehavior, we first compute the probability that A cannot detect the misbehavior. When the search results guarantee that SB ranks first and SC ranks third, then A cannot detect the misbehavior; there are P(98, 3) such conditions, where P(n, k) denotes the number of permutations. Additionally, there are P(100, 5) conditions for the probable search results. Therefore, the probability that A cannot detect the misbehavior is P(98, 3)/P(100, 5). Correspondingly, professor A can detect the misbehavior with probability 1 − P(98, 3)/P(100, 5) > 0.9998.

5.2.2 Exchanging data among data owners

In our system, multiple data owners are involved. For a given keyword, each data owner only knows its own partial order, i.e., each data owner cannot obtain a total order for the keyword. This brings a great challenge for data users to verify whether the returned results are top-k relevant to the search request. A trivial way is to ask the cloud to return all encoded relevance scores belonging to different data owners to the data user, and let the data user recompute the top-k results for verification, which requires gigantic computation and communication costs from the data user.

In our scheme, we propose to let data owners exchange a very small amount of data. Specifically, given Oi's keyword wt, each data owner uses the ψ aforementioned sampled files to generate interactive data. Since these operations are conducted among data owners, the cloud server would not know whether a data owner has the interactive data of another data owner. For easy description, we assume each data owner exchanges the ψ data items with θ − 1 data owners uniformly at random, i.e., after exchanging, each data owner will have θ − 1 data owners' data, which will be used as secret data to detect false search results even if a data owner's data is not involved in the search results.

5.2.3 Assembling the verification data

Assume data owner Oi receives the other θ − 1 data owners' interactive data. Oi will further assemble his verification data as follows: first, Oi extracts all the file IDs and relevance scores from the interactive data of the other θ − 1 data owners. Second, Oi ranks his ψ sampled data and the (θ − 1)·ψ received interactive data. Third, Oi concatenates all the θ·ψ data entries in descending order, where each entry is composed of a file ID and its corresponding
order. Finally, Oi uses a symmetric encryption (e.g., AES [33]) to encrypt the concatenation result with his secret key Hs(i), where Hs() is the aforementioned secretly shared hash function and i is Oi's ID. We denote the result of the encryption as Vi; therefore, Vi is used as Oi's verification data, which will be outsourced to the cloud server along with Oi's encrypted indexes and files.

5.3 Submitting verification request

When an authorized data user wants to verify the search results, he specifies a set of data owners whose verification data need to be returned to help verification. The data user can achieve this goal by simply setting an ID set of his desired data owners. However, the ID set should not be exposed to the cloud server. The fundamental reason is illustrated as follows: if the cloud server knows which data owners' data are frequently verified, he can deduce that these data owners' data are very useful or sensitive; therefore, these data owners' data would easily become attackers' targets. On the other hand, if the cloud server knows which data owners' data are rarely verified, the cloud server can maliciously filter out or delete these data owners' data from the search results.

To prevent the cloud server from knowing which data owners' data are actually returned, we propose to construct a secret verification request, which is illustrated as follows. First, the data user enlarges the ID set for verification by inserting random IDs. Assume a data user wants to get Oi's verification data; he can add the other n − 1 data owners' IDs to the set (we can adopt encryption or obfuscation to hide the true IDs; for easy description, we simply demonstrate with IDs hereafter). Second, the data user attaches a datum 0 or 1 to each ID. Here, if the data user wants a data owner's verification data to be returned, he attaches 1 to the corresponding ID; otherwise, 0 is attached. Third, the data user encrypts the attached 0 or 1 with the Paillier encryption. Here, we assume all the data owners and authorized data users share a key pair, i.e., the public key PK and the private key SK, for the Paillier encryption. Therefore, 0 is encrypted to E(PK, 0) and 1 is encrypted to E(PK, 1). With the well-designed Paillier encryption, the ciphertext of the same data would be different each time. Finally, the data user submits the ID set and the attached encrypted data set to the cloud server.

Now we give an example. Assume a data user Uj needs to download O1's and O2's verification data. First of all, he formulates a large ID set, say {O1, O2, O3, · · · , On}; then he attaches E(PK, 1) to O1 and O2, and E(PK, 0) to the other IDs. Finally, the data user submits {< O1, E(PK, 1) >, < O2, E(PK, 1) >, < O3, E(PK, 0) >, · · · , < On, E(PK, 0) >} as the verification request to the cloud server.

5.4 Returning verification data

Upon receiving the data user's verification request, the cloud server follows Algorithm 2 to prepare and return the verification data. Specifically, the cloud server first initializes a verification data buffer with λ entries, where λ is specified by the data user. Then, the cloud server finds the data owners' verification data indicated by the requested ID set (line 3), conducts calculations on the encrypted data (line 4), and maps the results to the verification data buffer with κ hash functions (lines 5–7), where the output of each hash function belongs to [0, λ]. Note that, since the size of the verification buffer, i.e., λ, is specified by the data user, different users will submit different λ to the cloud. To ensure that the output of each function belongs to [0, λ], instead of changing the κ hash functions all the time, we only need to specify that the cloud do a modulo-λ operation on the output of each hash function. When the cloud server finishes processing all the IDs in the requested ID set, the resulting verification data buffer is returned to the data user. Note that, during the whole process, the cloud server only sees the large ID set, and conducts computation on the encrypted data. Therefore, the cloud server knows nothing about which data owners' verification data are actually returned and used for verification.

Algorithm 2 Securely Returning Verification Data
Input: Verification request set [< j, E(PK, rj) >], j ∈ [1, β]; the size of the verification data buffer λ
Output: Verification data buffer VB
1: The cloud initializes VB with λ entries, each entry with initial value 1
2: for j ∈ [1, β] do
3:   Locate Oj's verification data Vj
4:   Compute vd = E(PK, rj)^Vj
5:   for i in range (0, κ) do
6:     VB[hi(j)] = VB[hi(j)] · vd
7:   end for
8: end for
9: return VB

In the above description, we propose to use multiple hash functions; here we elaborate on the reason for introducing them: to prevent the cloud from knowing which verification data are actually returned. There are three alternative ways to prevent the cloud from knowing which verification data are actually retrieved by the data user. First, we can request the cloud to return all the verification data each time, so that the cloud knows nothing about which specific verification data are actually used. However, this method will lead to a heavy communication cost between the cloud and the data user. Second, the data user can prepare the enlarged data set where some fake IDs are also involved, and specify that the cloud put the verification data in specific positions in the verification data buffer. However, the data user has to control the cloud to return data correctly, which is not user friendly. In addition, during the process
of specification, the data user would reveal his sensitive data. Therefore, we propose the third way, i.e., hashing verification data into the verification data buffer directly, to let the cloud return the verification data without knowing which and how many verification data are actually returned.

Now we give an example (shown in Fig. 3) to illustrate how to map the verification data into the verification data buffer, based on the homomorphic property of Paillier encryption. First of all, the cloud server finds the encrypted verification data {V1, V2, V3, V4}. Then he conducts calculations on the ciphertext, i.e., {E(PK, 1)^V1, E(PK, 1)^V2, E(PK, 0)^V3, E(PK, 0)^V4}. Due to the homomorphism of Paillier encryption, we have:

D(E(PK, 1)^V1) = D(E(PK, 1 · V1)) = D(E(PK, V1))
D(E(PK, 1)^V2) = D(E(PK, 1 · V2)) = D(E(PK, V2))
D(E(PK, 0)^V3) = D(E(PK, 0 · V3)) = D(E(PK, 0))
D(E(PK, 0)^V4) = D(E(PK, 0 · V4)) = D(E(PK, 0))

Fig. 3: Example of verification data buffer construction

Further, the cloud uses two hash functions h1(i) and h2(i) to map the encrypted data to the verification data buffer. Again, with the homomorphic property, we have interesting outcomes, e.g., D(E(PK, 0) · E(PK, V1)) = D(E(PK, 0 + V1)) = D(E(PK, V1)). In this way, the verification data V1 is returned without being known by the cloud. Finally, the result buffer is returned to the data user. As we can see, from the viewpoint of the cloud server, the verification data of four data owners (i.e., O1, O2, O3, and O4) are processed. As a matter of fact, only O1's and O2's verification data are returned, which is not known by the cloud.

5.5 Verifying Search Results

5.5.1 Recovering verification data

Upon receiving the verification data buffer VB, the data user decrypts it with the corresponding private key SK. After decryption, a data user can recover verification data from each entry where no collision happens (only one owner's data is mapped to the entry, and no other data are mapped).

Fig. 4: Example of decrypting the verification data buffer

Fig. 4 shows the decryption result of the VB in Fig. 3; the data user can recover V1 and V2 from the first and second entries of VB, respectively. Note that, since the data user can pre-compute the entries where no collision occurs, instead of decrypting the whole verification data buffer, the authorized data user only needs to decrypt the entries where no collision occurs, which helps improve the decryption efficiency.

Note that the cloud server knows that, if a data collision happens in an entry, the data in that entry cannot be recovered. To prevent the data user from recovering the verification data and detecting a misbehavior, the cloud server could contaminate entries in the verification buffer, and claim that collisions happened in these entries. However, this attack cannot succeed. The fundamental reason is that the data user specifies the IDs of the data owners whose verification data will be returned, and knows the κ hash functions; therefore, the data user can foresee whether a collision happens in an entry of the verification data buffer. When the cloud server contaminates the data in these entries, the misbehavior can be easily detected.

5.5.2 Verifying the ranked search results

When the data user gets some data owners' verification data, he can further recover all the sampled data and anchor data. The data user will use them to verify whether the returned results are correct. The verification is done in two steps: first, the data user verifies whether the data from a specific data owner are correct. If the search results pass the first verification, the verification process turns to the second step, i.e., with the help of the anchor data, the data user verifies whether the search results from different data owners are correct. After verification, the data user can detect the cloud server's misbehavior with a high probability. In Section 7, we will give an analysis of the detection probability.

5.6 Extension: Multi-dimensional Verification

As we know, a file is often accompanied by several keywords, i.e., the indexes of files are often multi-dimensional. If we return verification data for each dimension (keyword), the communication cost will increase linearly as the number of dimensions increases. As a matter of fact, there will be some relationships among different dimensions. For example, if the height of a man is 2.1 meters, then the weight of this man would very likely be more than 60 kilograms.

We use the Pearson correlation coefficient to evaluate the correlation of the order lists between two keywords (dimensions). Given w1's order list
discuss how to combine different dimensions based on the characteristics of the data set, and to what extent we can combine data, so that there would be an excellent tradeoff.
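Section 5.6 evaluates the correlation between two keyword dimensions with the Pearson coefficient before deciding whether to combine them. A minimal sketch of that measurement (the relevance-score values below are invented purely for illustration):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Relevance scores of the same five files under two keywords (made-up data):
w1_scores = [0.90, 0.75, 0.60, 0.40, 0.20]
w2_scores = [0.85, 0.70, 0.65, 0.35, 0.25]
r = pearson(w1_scores, w2_scores)
# A coefficient near 1 suggests the two dimensions order files similarly,
# so one set of verification data could plausibly cover both of them.
print(f"{r:.2f}")  # prints 0.98
```

A high coefficient is what would justify merging dimensions to cut the per-keyword communication cost mentioned above.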
λ and κ, to maximize the number of verification data that can be recovered from the data buffer, how to set the optimal β.

Next, we introduce how to obtain the optimal β step by step. First of all, we compute the probability of recovering x data items from the data buffer where β data items are already mapped. Obviously, the probability is the same as that of recovering x colors in the following color survival game [35], [36].

Color Survival Game: Assume there are β colors,

Denote the expected number of data items that can be recovered from the data buffer, where β data items are already mapped, as E(x); therefore,

E(x) = Σ_{x=1}^{β} x · p_x
     = Σ_{x=1}^{β} x · C(β, x) · (p_s)^x · (1 − p_s)^{β−x}        (4)
     = β · p_s · Σ_{x=0}^{β−1} C(β−1, x) · (p_s)^x · (1 − p_s)^{β−x−1}
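The last sum in Eq. (4) is a full binomial distribution and equals 1, so the expression collapses to the binomial mean E(x) = β · p_s. A quick numerical check of that identity (the values of β and p_s below are arbitrary illustrative choices):

```python
from math import comb

def expected_recovered(beta, ps):
    # Direct evaluation of Eq. (4):
    # E(x) = sum_{x=1}^{beta} x * C(beta, x) * ps^x * (1 - ps)^(beta - x)
    return sum(x * comb(beta, x) * ps**x * (1 - ps)**(beta - x)
               for x in range(1, beta + 1))

beta, ps = 40, 0.35            # arbitrary demo parameters
direct = expected_recovered(beta, ps)
closed = beta * ps             # binomial mean; the inner sum in Eq. (4) is 1
print(abs(direct - closed) < 1e-9)  # prints True
```

This is why optimizing β against λ and κ can work with the closed form β · p_s instead of the full summation.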
Fig. 6: Misbehavior Detection Probability (y-axis: detection probability). (a) m = 10, d = 100, ki = 1, with curves for γ = 4 and γ = 6. (b) d = 100, γ = 10, k = 50, with curves for m = 10 and m = 20. (c) d = 100, ki = 1, k = 50, with curves for m = 10 and m = 20.
7.2 Deterrent Analysis

In this paper, we propose a deterrent-based verification scheme. During the whole process of verification, the cloud server only conducts computation on cipher-texts. Therefore, he does not know how many data owners' verification data are actually used for verification, or which data owners' data are embedded in the verification data buffer. Furthermore, the cloud server does not know which data owners, or how many data owners, exchange anchor data. We keep all this information secret from the cloud server. All the cloud server knows is that, once he behaves dishonestly, he will be discovered with a high probability, and punished seriously once discovered.

7.3 Performance Analysis

7.3.1 Costs for Data Owners

The computational cost for the data owners spent on verification mainly comes from constructing the verification data. For data sampling, the running time mainly comes from ordering the files for each keyword; therefore, the computational complexity is O(d · log(d)). For data assembling, data owners need to rank the ψ · θ data items and then encrypt the assembled data, which is O(ψ · θ). Therefore, the computational complexity for each data owner is O(max{d · log(d), ψ · θ}).

The communication cost mainly comes from two aspects, i.e., anchor data exchanging and verification data buffer transmission. For anchor data exchanging, each data owner needs to transmit ψ anchor data to θ − 1 data owners; the cost is O(θ · ψ). For verification data buffer transmission, the communication cost is O(θ · ψ). So the total communication cost is O(θ · ψ).

7.3.2 Costs for Cloud Server

The computational cost for the cloud server spent on verification mainly comes from mapping the verification data into the data buffer. Since the data user provides an enlarged ID set of size β, the cloud server needs to map the corresponding β data owners' verification data to the data buffer, where each data item is mapped κ times. Therefore, the computational complexity for the cloud server is O(β · κ).

The communication cost mainly comes from transmitting the data buffer and receiving verification data. The cloud server needs to return a data buffer with λ entries, where the communication cost of each entry is O(ψ · θ). For receiving verification data, assume there are m data owners in the system; the cost is O(m · θ · ψ). Therefore, the communication cost for the cloud server is O(max{λ · ψ · θ, m · θ · ψ}).

7.3.3 Costs for Data User

The computational cost for data users spent on verification mainly comes from three aspects: first, constructing the verification request, which is O(β); second, decrypting the entries to which the α requested IDs correspond and where no collision happens, which is O(κ · α); third, detecting whether misbehavior happens, whose cost is O(α · θ · ψ). So the computational cost is O(max{β, κ · α, α · θ · ψ}).

The communication cost for the data users spent on verification comes from receiving the verification data buffers, so the communication cost is O(λ · ψ · θ).

7.4 Misbehavior Detection Probability

Our proposed scheme should not only ensure a strong deterrent against potential attacks, but also achieve a high detection probability once the compromised cloud server misbehaves. Now we analyze the detection probability. For the k returned search results, suppose ki out of the k results are also contained in the verification data. Assume there are m data owners in our system, and the data user recovers γ distinct verification data from the data buffer. The data user can detect an error with probability Pe:

$$P_e = 1 - \frac{P(m \cdot d - \gamma,\ k - k_i)}{P(m \cdot d,\ k)} \qquad (6)$$

Fig. 6 describes the relationship between the detection probability and the corresponding parameters. From Fig. 6(a), we observe that, when we set m = 10 (the number of data owners involved in the system) and d = 100 (the average number of files corresponding to a keyword), even if ki = 1, i.e., only one out of the k search results has its corresponding verification data, the detection probability is still more than 0.999.
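The values reported in Fig. 6(a) can be cross-checked by evaluating Equation (6) directly. A small sketch, assuming P(n, r) in Eq. (6) denotes the number of r-permutations of n elements (this interpretation reproduces the plotted magnitudes):

```python
from math import perm

def detection_probability(m: int, d: int, gamma: int, k: int, k_i: int) -> float:
    """P_e = 1 - P(m*d - gamma, k - k_i) / P(m*d, k), per Eq. (6),
    with P(n, r) taken as the number of r-permutations of n elements."""
    return 1 - perm(m * d - gamma, k - k_i) / perm(m * d, k)

# Parameters from Fig. 6(a): m = 10, d = 100, gamma = 4, k = 50, k_i = 1.
p_e = detection_probability(10, 100, 4, 50, 1)
assert 0.999 < p_e < 1
```

Under these parameters the sketch yields a detection probability above 0.999, consistent with the curve for γ = 4 in Fig. 6(a).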
We can also see that, as the number of returned results (k) increases, the detection probability increases linearly. Additionally, the larger γ (the number of distinct returned verification data) is, the higher the detection probability that is achieved. The fundamental reason can be easily inferred from Equation (6). In Fig. 6(b), we set d = 100, γ = 10, and k = 50; as ki increases, the detection probability also increases. When ki is larger than 2, the detection probability is very close to 1. From Fig. 6(c), we see that, when we set k = 50 and ki = 1, as the value of γ grows, the detection probability also grows linearly. Additionally, the larger m is, the higher the probability that is achieved.

8 PERFORMANCE EVALUATION

In this section, we demonstrate a thorough evaluation of our proposed scheme. First of all, we evaluate the computational cost of the data owners, the cloud server, and the data users. Then we evaluate the functionality of the verification data buffer.

8.1 Experiment Settings

The experiment programs are coded in the Python programming language on a laptop with a 2.2 GHz Intel Core CPU and 2 GB of memory. We use Paillier encryption [32] for data encryption; the secret key is set to 512 bits.

8.2 Experiment Results

Recall that the time cost for each data owner mainly comes from ranking, string concatenation, and symmetric encryption. Fig. 7(a) and Fig. 7(b) show that, as the number of sampled data (ψ) and exchanged data (θ) increases, the time cost of the data owners increases linearly. The fundamental reason is that, with ψ and θ increasing, more data items are concatenated and encrypted, which results in more time consumption.

The time cost of the cloud server spent on verification mainly comes from two aspects, i.e., conducting computation on the user-submitted cipher-text and mapping the computation result to the verification data buffer. Fig. 8(a) shows that, as the size of the enlarged ID set (β) increases, the corresponding time cost increases linearly. The reason is that more IDs lead to more computation on the cipher-text, and more time is needed. Fig. 8(b) demonstrates that the time cost of the cloud server has little connection with the size of the verification data buffer.

Fig. 9 shows the time cost of data users. Fig. 9(a) illustrates the time cost of generating the verification request. As we can see, when the size of the enlarged ID set increases from 10 to 100, the corresponding time cost increases from 0.024 s to 0.24 s. The reason is that, the larger β is, the more Paillier encryptions are conducted. We also observe that the time cost of the data user has little connection with α. The reason is that, compared with the time spent on Paillier encryption, the time for conducting α symmetric encryption operations is relatively low. Fig. 9(b) demonstrates the time cost of decrypting and recovering the verification data. Since the time is mainly spent on decrypting the verification data buffer, as the size of the data buffer increases, the corresponding time cost of the data users increases linearly.

Fig. 10 shows the ratio of data recovered from the verification data buffer with different parameters. Fig. 10(a) shows that, when the size of the verification data buffer is set to 500, i.e., λ = 500, the ratio of data recovered from the verification data buffer decreases as the number of mapped data increases. Meanwhile, the more hash functions we use, the lower the ratio of data that can be recovered. The fundamental reason is that, the more data we map into the data buffer, the higher the probability that data collisions occur, which renders some data unrecoverable. Therefore, the ratio of recovered data decreases accordingly.

Fig. 10(b) demonstrates that, when we set the number of hash functions to 30, i.e., κ = 30, the ratio of recovered data increases with the size of the data buffer. Meanwhile, the less data we map, the faster the data recovery ratio increases. The reason is that a larger data buffer and less data mapped into the buffer reduce the probability of data collision; therefore, the ratio of recovered data increases.

Fig. 10(c) demonstrates that, when the number of entries is set to 500, the ratio of data recovery decreases as the number of hash functions increases; the more data items we map into the verification data buffer, the faster the data recovery ratio decreases.

Fig. 11 shows the number of data recovered from the verification data buffer with different parameters. Fig. 11(a) shows that, when λ = 500, as the amount of mapped data increases, the number of data recovered from the verification data buffer first increases and then falls. We explain this phenomenon as follows: when we map a few data items into the verification data buffer, few data collisions occur, and almost all the data can be recovered from the verification data buffer. Therefore, the amount of recovered data increases. However, when the amount of data we map into the verification data buffer increases past a threshold, data collisions also increase with the increasing number of distributed data. Obviously, data collisions cause data to be unrecoverable. Therefore, the amount of recovered data decreases when we map too many data items into the verification data buffer.

Fig. 11(b) demonstrates that, when we set κ = 30, the number of recovered data increases with the size of the verification data buffer. An interesting feature shown in Fig. 11(b) is that, the fewer items we map into the verification data buffer, the sooner the data can be recovered from the verification data buffer. The fundamental reason is that, with fewer data items, we need fewer entries to accommodate data collisions.

Fig. 11(c) demonstrates that, when λ = 500, the amount of recovered data decreases as the number
Fig. 7: Time cost of the data owner. (a) vs. number of sampled data (θ = 10). (b) vs. number of exchanged data (ψ = 10).

Fig. 8: Time cost of the cloud server. (a) vs. size of enlarged ID set (curves λ = 100, λ = 200). (b) vs. size of verification data buffer (curves β = 10, β = 30).

Fig. 9: Time cost of the data user (time cost of operation, × 10⁻² s; curves α = 4, α = 10). (a) vs. size of enlarged data set. (b) vs. size of verification data buffer (× 10²).
of hash functions increases; additionally, the more data items we map into the verification data buffer, the faster the number of recovered data decreases. The reason is that, the more mapping operations and data items we map into the verification data buffer, the higher the probability that a data collision occurs, which leads to a decrease in the amount of recovered data.

9 CONCLUSION

In this paper, we explore the problem of result verification for secure ranked keyword search, under the model where the cloud server may behave dishonestly. Different from previous data verification schemes, we propose a novel deterrent-based scheme. During the whole process of verification, the cloud server does not know which data owners, or how many data owners, exchange anchor data used for verification; he also does not know which data owners' data are embedded in the verification data buffer, or how many data owners' verification data are actually used for verification. All the cloud server knows is that, once he behaves dishonestly, he will be discovered with a high probability, and punished seriously once discovered. Additionally, when any suspicious action is detected, data owners can dynamically update the verification data stored on the cloud server. Furthermore, our proposed scheme allows the data users to control the communication cost of verification according to their preferences, which is especially important for resource-limited data users. Finally, with thorough analysis and extensive experiments, we confirm the efficacy and efficiency of our proposed schemes.
Fig. 10: Ratio of data recovered from the verification data buffer under different parameters (curves for θ = 50, θ = 70 and κ = 20, κ = 30).

Fig. 11: Number of data recovered from the verification data buffer under different parameters (curves for θ = 50, θ = 70 and κ = 20, κ = 30).
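The rise-and-fall of the number of recovered data in Fig. 11(a) can be reproduced with a toy simulation. The model below (each item hashed into κ random buffer slots, an item deemed recoverable only if at least one of its slots is collision-free) is a simplifying assumption for illustration, not the paper's exact buffer construction:

```python
import random

def simulate_recovery(n_items: int, n_slots: int = 500, kappa: int = 30,
                      seed: int = 42) -> int:
    """Toy model: each item is mapped into kappa random slots of an
    n_slots-entry buffer; an item counts as recoverable if at least one
    of its slots holds no other item (a simplifying assumption)."""
    rng = random.Random(seed)
    slots = [[] for _ in range(n_slots)]   # occupants of each buffer slot
    placements = []                        # slots chosen by each item
    for item in range(n_items):
        chosen = [rng.randrange(n_slots) for _ in range(kappa)]
        placements.append(chosen)
        for s in chosen:
            slots[s].append(item)
    # An item is recoverable if some slot contains it alone.
    return sum(any(slots[s] == [item] for s in chosen)
               for item, chosen in enumerate(placements))

# With lambda = 500 and kappa = 30, the recovered count climbs while
# collisions are rare, then collapses once the buffer saturates.
r10, r50, r150 = (simulate_recovery(n) for n in (10, 50, 150))
assert r10 < r50 and r150 < r50
```

Under these assumed parameters the recovered count first grows with the number of mapped items and then falls sharply, matching the trend described for Fig. 11(a).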
abling proof of retrievability in cloud computing with resource-constrained devices," IEEE Transactions on Cloud Computing, 2014.
[21] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan, "Verifying completeness of relational query results in data publishing," in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 2005, pp. 407–418.
[22] M. Narasimha and G. Tsudik, "DSAC: integrity for outsourced databases with signature aggregation and chaining," in Proceedings of the 14th ACM International Conference on Information and Knowledge Management. ACM, 2005, pp. 235–236.
[23] H. Pang and K. Mouratidis, "Authenticating the query results of text search engines," Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 126–137, 2008.
[24] R. C. Merkle, "A certified digital signature," in Proc. Advances in Cryptology (CRYPTO '89), California, USA, Aug. 1989, pp. 218–238.
[25] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, "Dynamic authenticated index structures for outsourced databases," in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006, pp. 121–132.
[26] Y. Yang, S. Papadopoulos, D. Papadias, and G. Kollios, "Authenticated indexing for outsourced spatial databases," The VLDB Journal, vol. 18, no. 3, pp. 631–648, 2009.
[27] Q. Chen, H. Hu, and J. Xu, "Authenticating top-k queries in location-based services with confidentiality," Proceedings of the VLDB Endowment, vol. 7, no. 1, 2013.
[28] H. Hu, J. Xu, Q. Chen, and Z. Yang, "Authenticating location-based services without compromising location privacy," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 301–312.
[29] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. Hou, and H. Li, "Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking," IEEE Transactions on Parallel and Distributed Systems, 2013.
[30] W. Zhang, S. Xiao, Y. Lin, J. Wu, and S. Zhou, "Privacy preserving ranked multi-keyword search for multiple data owners in cloud computing," IEEE Transactions on Computers, 2015.
[31] D. Eastlake and P. Jones, "US secure hash algorithm 1 (SHA1)," 2001.
[32] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in Advances in Cryptology - EUROCRYPT '99. Springer, 1999, pp. 223–238.
[33] J. Daemen and V. Rijmen, The Design of Rijndael: AES - The Advanced Encryption Standard. Springer, 2002.
[34] S. B. Green, "How many subjects does it take to do a regression analysis," Multivariate Behavioral Research, vol. 26, no. 3, pp. 499–510, 1991.
[35] "Importance of being urnest." [Online]. Available: http://www.mathpages.com/home/kmath321.htm
[36] R. Ostrovsky and W. E. Skeith III, "Private searching on streaming data," in Advances in Cryptology - CRYPTO 2005. Springer, 2005, pp. 223–240.

Wei Zhang was born in 1990 and received his B.S. degree in Computer Science from Hunan University, China, in 2011. Since 2011, he has been a Ph.D. candidate in the College of Computer Science and Electronic Engineering, Hunan University. Since 2014, he has been a visiting student in the Department of Computer and Information Sciences, Temple University. His research interests include cloud computing, network security, and data mining.

Yaping Lin received the B.S. degree in Computer Application from Hunan University, China, in 1982, and the M.S. degree in Computer Application from the National University of Defense Technology, China, in 1985. He received the Ph.D. degree in Control Theory and Application from Hunan University in 2000. He has been a professor and Ph.D. supervisor at Hunan University since 1996. During 2004-2005, he worked as a visiting researcher at the University of Texas at Arlington. His research interests include machine learning, network security, and wireless sensor networks.