MSTDB: A Hybrid Storage-Empowered Scalable Semantic Blockchain Database
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 8, AUGUST 2023
Abstract—Blockchain has been regarded as a trusted carrier for distributed data storage. With large volumes of valuable data stored on
blockchain, data query has become a major requirement. However, the existing blockchains do not provide efficient query functionality
because of their deep-rooted chain structure. Blockchain database is a new direction that constructs an index on top of the blockchain to provide
rich query functionalities. The existing works are either insecure, because the query process is separated from the blockchain consensus, or
unscalable, because all the data needs to be stored in the block. In this paper, we propose a novel semantic blockchain database called
MSTDB. We design a hybrid on/off chain blockchain storage architecture in which the majority of blockchain storage is offloaded to the
off-chain storage and a novel index structure named Merkle Semantic Trie (MST) is designed to be a secure and semantic bridge
between on- and off-chain. Based on MST, MSTDB provides a variety of semantic query functions including multi-keyword query,
range query, Top-K query, and cross-chain query. To improve the performance further, we design some index compression and query
preprocessing techniques for MSTDB. Extensive experiments demonstrate the effectiveness and efficiency of our blockchain database.
implement but is separated from the blockchain system, so they cannot guarantee the security of the clients' queries. On the other hand, some other studies [12], [13] try to build the query model completely on-chain. However, they put all the data and index information in the block. The excessive coupling makes their systems unscalable.
Motivated by an emerging data storage mode for blockchain, i.e., a combination of on-chain and off-chain storage [14], our basic idea is to build a semantic query layer on top of a hybrid blockchain storage. We plan to put most of the data off the chain, and place the semantic information and index information of the data on the chain. There are two challenges in supporting convenient queries in such a hybrid storage architecture, and we summarize them as follows.
Inefficient semantic query. The underlying key-value database of the existing blockchain systems lets clients query a block or transaction by its number or hash, but not by the semantic information of the targeted data, which is the more common need in real-life application scenarios. In addition, the append-only structure forces clients to traverse the entire blockchain to ensure the completeness of the query result, which gives clients a poor query experience.
No semantic link. The existing blockchain systems adopt hash values for mapping between the on-chain and off-chain data. The hash values can only guarantee that the off-chain data will not be tampered with. However, they do not establish an index for the semantic information of the off-chain data. Thus, the clients need to frequently access the whole off-chain data for each query, which is time-consuming and inefficient.
Therefore, in this paper, we propose a hybrid semantic blockchain database named MSTDB. It adds semantic features to the blockchain system and provides rich and efficient query methods for the hybrid on-chain and off-chain data. We first use a hybrid on/off chain storage and employ an improved TF-IDF model to extract semantic information of the off-chain data, and upload this information to the chain by way of transactions. Next, we build an on-chain index structure named Merkle semantic trie (MST) in the block, which contains index components including the inverted index, B+ tree and Merkle Patricia Trie. We then introduce the query processing via MST, including multi-keyword query, range query, Top-K query, and authenticated query. These functions are motivated by traditional databases, but we have specially designed these technologies for the decentralized and chain structure of the blockchain. Finally, we have optimized the specific design of our scheme. We compress the index to reduce the storage cost and design a special index traversal scheme to support Top-K query for blockchain, which improves the system's performance. Besides, we design an approach for cross-chain query, which solves the problem of information interaction between heterogeneous chains. To overcome the scalability challenges of the MST-B+ tree, we propose a cryptographic vector commitment (VC)-based scheme refinement.
The contributions of our paper are summarized below.
We propose a hybrid on/off chain storage architecture to reduce the blockchain space cost, and add semantic features to the blockchain system.
We create a novel index structure on the blockchain which is composed of meta-data in transactions and maintained by miners through the consensus mechanism. Based on this structure, our system can provide rich query functions including authenticated multi-keyword query, authenticated range query, authenticated Top-K query, and authenticated cross-chain query.
We make some improvements to the program to enhance its availability. First, we use an F-B point to reduce the number of nodes in the index. Second, the node point is an effective way to prevent space reuse. Third, we add a Bloom filter to the block to preprocess the query.
We implement the prototype of MSTDB on Ethereum and conduct an extensive evaluation of it. Experiments on three datasets show that our scheme has good response time, lower query result verification cost, and superior effectiveness and completeness.
The rest of this paper is organized as follows. Section 2 introduces related work. In Section 3, we define the system and query model. Section 4 presents our architecture, including the hybrid storage and indexing mechanism, and the query processing. Some design refinements are shown in Section 5. In Section 6, we perform the experiments and verify the availability and effectiveness of our system by analyzing the experimental results. Finally, Section 7 gives the conclusion and our future work.
2 RELATED WORK
2.1 Blockchain Data Storage
Blockchain is a decentralized ledger which preserves all historical transaction records. Underlying most blockchain systems are key-value databases like LevelDB and CouchDB, which find the one-to-one corresponding value through a certain key. The data structure of blockchain is a chain which is linked from one block to another, and as the block data increases, the amount of data stored by a blockchain node has become very large. The Bitcoin blockchain size reached 341.89 GB in April 2021 [15], and at the same time, the size of Ethereum also exceeded 300 GB [16]. To reduce the on-chain storage pressure, some hybrid on-chain and off-chain storage schemes [14], [17] have been proposed for better scalability of the blockchain system. Their main approach is to store most of the data off the chain in a distributed peer-to-peer network (e.g., the BitTorrent network [18] and the IPFS network [19]), with hash values of the entire data or partitioned data blocks on the blockchain. The on-chain hash value can ensure the consistency of off-chain data and support some simple queries like checking the balance and querying blocks. For example, Zhang et al. [20] proposed a blockchain-based data marketplace with quality-aware incentives, where transaction records are preserved off-chain, and transaction hash values are stored on-chain for checks. However, as we say in the introduction,
Authorized licensed use limited to: Chaitanya Bharathi Institute of Tech - HYDERABAD. Downloaded on August 03,2023 at 06:04:56 UTC from IEEE Xplore. Restrictions apply.
8230 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 8, AUGUST 2023
they lack semantic queries such as keyword query, range query or Top-K query, which are fundamental properties for scenarios such as health services [21], [22]. In addition to these problems, in practical applications, some information may be distributed on different blockchains, and these blockchains are heterogeneous [23], which also leads to difficulties in information exchange between them. Some works focus on transaction [24] or data migration [25], but do not consider the query requirement between different chains. Sharding and payment channel networks [26], [27] are other approaches for blockchain scalability. Sharding divides the transactions into different groups to reduce the storage cost for blockchain nodes, but the query still needs to traverse all the shards, which increases the communication overhead.
Zhang et al. [33] propose a hybrid-storage architecture for blockchain and improve the data management scalability by a novel Chameleon index-based ADS.
Comparison: First, our MSTDB system has richer query functionality than traditional blockchain systems such as Bitcoin and Ethereum. Second, compared with some fully off-chain approaches including EtherQL, BigchainDB and BlockchainDB, MSTDB has a stronger level of decentralization, which ensures higher data security and tamper resistance. Third, in terms of scalability, our hybrid architecture has greater scalability and lower storage cost than some fully on-chain schemes like ECBC. Finally, compared with vChain and SEBDB, whose key research is the security of authenticated query, our MSTDB mainly focuses on the structure of the index and query performance.
Hybrid Storage. Our storage pattern is a hybrid on-chain and off-chain storage. Most of the data are stored in off-chain spaces such as distributed databases (HBase, Redis, etc.) and IPFS. Off-chain data could be unstructured, semi-structured, or relational. Section 4.2.1 describes some data structures and storage media of blockchain.
Information Extraction. To establish a semantic mapping between the on-chain and off-chain data, we propose an off-chain information extraction algorithm. Through the algorithm, the most valuable meta-data can be extracted from the huge off-chain data and be sent to the blockchain in the form of transactions. This meta-data is the basic component of the on-chain index structure. The detailed information extraction algorithm is described in Section 4.2.2.
On-Chain Index. An on-chain index structure (IS) is the core of MSTDB, which is established in the block body. Once the extracted meta-data is sent to the blockchain in the form of transactions, the full nodes jointly process them and update the IS together. Since the index structure is updated as transactions arrive, it is always up to date in the latest block. Furthermore, our index structure is compatible with various blockchain systems because it does not rely on any specific underlying database. We will elaborate on the specific design and the generation process of our index structure in Section 4.2.3.
4.2.1 Hybrid Storage
The storage model of MSTDB is a hybrid on-chain and off-chain architecture. Off-chain data are managed by private data owners and stored in distributed devices like distributed databases or IPFS. Since the off-chain data are stored in private spaces and their copies do not need to be synchronized across the whole network, the storage cost of off-chain data is not very high. We focus on the storage cost of the on-chain part. A blockchain is a data structure which is linked from one block to another. Each block contains two parts: the block header and the block body. The block header contains a core piece of data called the hash point, which determines the context of a block and ensures the block data cannot be tampered with.
The transaction is the basic data content in a blockchain system, and a series of them are stored in the block body. All transactions have a standard format, and every transaction has a transaction hash through which we can locate the transaction quickly. Inside the block body, a number of transactions are arranged to form a Merkle tree. A Merkle tree is a hash-based multi-way tree, which uses layers of hashing to ensure that the order of transactions in a block is not tampered with. Bitcoin uses a Merkle tree to store transactions, and puts the root hash of the Merkle tree into the block header. Unlike Bitcoin, Ethereum is an account-based model, known as a transaction-driven state machine. It uses the Merkle Patricia Trie (MPT) [2] to store states such as the account state or smart contract state. Each MPT leaf node is an account state, while the branch stores the compressed prefix of the account address. The miner updates the states of the accounts in the MPT according to the transactions in the block. Therefore, the latest states of all accounts are stored in the latest block, with the root hash of the MPT in the block header.
All on-chain data, including block headers and block bodies, are stored in the database of the blockchain system, such as LevelDB or RocksDB. What is stored in the database is not the original on-chain data, since all data need to be serialized before being stored. For example, Ethereum uses Recursive Length Prefix (RLP) [35] encoding for network transmission and persistent storage. After the on-chain data are encoded, they are stored physically on devices such as disks. The write efficiency of the underlying database of the blockchain is extremely high, mainly due to the sequential writing scheme of the LSM-Tree [36]. However, because of the low transaction throughput, this advantage of fast writing is not reflected in the existing blockchain systems.
The data owners need to upload the off-chain data, such as documents, to the blockchain by sending transactions. The transaction form is <Tid, SendId, docID, metadata>, where Tid is the unique identity of the transaction, SendId is the address of the transaction initiator, docID is the unique identity of the document, and metadata contains the keyword list and weights of the document.
4.2.2 Information Extraction
The goal of information extraction is to establish the semantic mapping and tamper-proof ability between the on-chain and off-chain data, which includes two steps, i.e., meta-data extraction and hash binding. Note that the basic meta-data (e.g., data location, data size) will be directly provided by the data owner, thus we focus on the extraction of the semantic keywords in this section. For backward compatibility, the meta-data is placed in the “data” field of transactions, which exists in most blockchain systems.
In the search engine field, TF-IDF [37] is a classic keyword extraction algorithm which can collect words that appear frequently. To apply this method to the blockchain system, we propose an improved TF-IDF model called Decentralization Semantic Extraction (DSE). DSE improves the precision of semantic feature extraction by introducing decentralized word frequency, part-of-speech and position factors to address the lack of semantic understanding in traditional keyword extraction algorithms. Through the DSE model, we build an inverted index [38] of semantic keywords.
The main purpose of the TF-IDF model is to find the words which appear frequently in one document and rarely appear in the entire document collection. Given a word $i$, a document collection $N$, and an individual off-chain data $j \in N$, the TF-IDF can be calculated by
$$\text{TF-IDF}_{i,j} = f_{i,j} \times \log \frac{N}{df_i + 1}, \tag{1}$$
where $f_{i,j}$ is the frequency of keyword $i$ in the data $j$, and $df_i$ is the number of data items in $N$ that contain the keyword $i$.
A problem of the TF-IDF model is that the IDF model only counts the number of documents that contain a word but does not consider the word frequency difference between them. In some cases, if a word has a lower frequency in a document but does not appear in other documents, its importance for document classification is higher than that of another word that has a higher frequency in this document
ZHOU ET AL.: MSTDB: A HYBRID STORAGE-EMPOWERED SCALABLE SEMANTIC BLOCKCHAIN DATABASE 8233
but also appears frequently in other documents. For example, if there are 10000 documents in the system, word 1 appears 10 times in document 1 and does not appear in other documents, and word 2 appears 1000 times in all documents. The TF-IDF value of words 1 and 2 in document 1 is 40 and 1000, but in fact word 1 has a greater discriminating ability for document 1. To solve this problem, we propose a calculation method called Document-wise Term Frequency Factor (DTFF) to decentralize the frequency of keywords by
$$\text{DTFF}_{i,j} = 2^{N_{i,j} - \bar{N}_i}, \tag{2}$$
where $N_{i,j}$ is the number of occurrences of keyword $i$ in data $j$, and $\bar{N}_i$ is the average number of occurrences of keyword $i$ in all data. The weight of exclusive words can be increased through the decentralization of DTFF. Following the above example, through the DTFF model, the weight of word 1 is increased to 40960 and the weight of the highly distinctive word 1 is highlighted.
In addition to term frequency, the position of a word in a document is also very important. For example, a word at the beginning or end of an article often has greater significance than one in the middle of the article. We arrange the positions where the keywords first appear in the data into a sequence with the number of words as the total length and 1 as the unit scale, and we choose the middle point as the initial coordinate. We assign the same weight to the words in a small range on both sides of the center point (the initial coordinate). Outside the range, we measure the distance from the initial coordinate to the word and calculate a weight factor affected by distance. The philosophy we follow is that the longer the distance, the greater the weight. The Position Factor is defined as
$$PF_{i,j} = \begin{cases} 1, & d < \bar{d} \\ \varepsilon^{d - \bar{d}}, & d \geq \bar{d} \end{cases}, \tag{3}$$
where $\varepsilon$ is an increased weight multiple, $d$ is the distance of the word $i$ from the initial coordinate, $\bar{d} \in [0, L/2)$ bounds the equal-weight range around the initial coordinate, and $L$ is the length of the sequence.
Furthermore, we draw lessons from the scheme of speech classification priority based on statistical methods [39], which calculates the weight of a part of speech by using a fuzzy hidden Markov model. Using this method, we define the weights for keywords of different parts of speech as follows
$$Pos_i = \begin{cases} 0.9 & \text{n.} \\ 0.8 & \text{v.} \\ 0.3 & \text{pron.} \\ 0.5 & \text{adj.} \\ 0.5 & \text{adv.} \\ 0.4 & \text{num.} \\ 1 & \text{idiom.} \end{cases} \tag{4}$$
Then, the weight of the feature word is
$$W_{i,j} = \text{TF-IDF}_{i,j} \times \text{DTFF}_{i,j} \times PF_{i,j} \times Pos_i. \tag{5}$$
Hash binding is the process of storing the hash value calculated by the one-way hash function of the off-chain data on the blockchain. It guarantees that if the data owner modifies the raw off-chain data, the mapping relationship between on-chain and off-chain data will be broken. To expand the coverage of the hash function, we add the meta-data to its input parameters, so the hashing process becomes $hash = H(\text{raw data} \mid \text{meta-data})$. Besides, if some off-chain data are deposited in IPFS or another decentralized storage network, our system can be compatible with their hash results such as the Merkle DAG root.
After the information extraction process of off-chain data, we create a table called the inverted index to store the index information of every keyword. An inverted index often consists of a series of inverted lists whose format is like <key, docId1, docId2, ...>, where docId is the unique identifier of a document. Different from the traditional inverted index, we use the block number instead of docId to represent the appearance position of the keyword. A problem with this modification is that if two or more transactions containing the word appear in the same block, we need to traverse the entire block to ensure the integrity of the query. The benefits of this design will be explained in Section 4.3.1. For any inverted list, all block numbers are sorted in descending order of average weight. Moreover, although the historical record cannot be changed due to the immutability of the blockchain, the weight of each word in MSTDB can be updated by the blockchain nodes' consensus. The data owners can submit a transaction with updated metadata to the blockchain network to ask the blockchain nodes to update the weight of a word.
4.2.3 MST Tree
There are some search methods in blockchain systems, like searching a block by hash, searching a transaction by hash, searching a block by block number, etc. However, there are no semantic meanings included in the input parameters of these methods. It is inefficient to process a query request if a user wants to query some information by keywords, because the node must scan all blocks one by one. In our scheme, an IS is established in the block body. This IS is jointly maintained by the miners and is always up to date in the latest block. The index structure is not a special component in the blockchain system; it is kept in the database as a structure and loaded into memory when it is used. Since it does not depend on any specific underlying database, our index structure is compatible with various blockchain systems.
The inverted index allows fast real-time query on a single keyword, but it does not provide efficient multi-keyword query. To overcome this challenge, we construct a novel index structure called Merkle semantic trie (MST) that integrates the inverted index, Merkle tree, prefix compression and hash point techniques.
MST is a tree-shaped data structure, and there are three types of nodes in the MST: branch nodes, extension nodes, and leaf nodes. Different nodes have different features in the MST structure, and a node can only be of one type. Branch nodes of MST have a duty analogous to that in a trie: they only store index-related information and provide pointers to the next-level nodes. Leaf nodes always represent the end of a search path and store the query results. The result is not the specific information but a hash pointer which directs to the
block where the corresponding transaction is located. The extension node is a multi-functional node of MST. It contains not only the index-related information, but also the query results and pointers. This design means that a search reaching an extension node already has a definite result at that node, while a search for more detailed and accurate results can follow the pointer to the next level and explore more deeply. The root node is the entrance to MST. If there is only one node in MST, the root node serves as a leaf node; otherwise it is a branch node which has child nodes.
Algorithm 1. MST Node Insert
Input: info as indexInfo (blockNumber and a list of keywords <key_m>), node as an MST node
ln ← len(node.key)
li ← len(info)
while ∃i : i ≤ ln and i ≤ li do
    foreach key_i ∈ info.key, nodekey_i ∈ node.key do
        if key_i ≠ nodekey_i then
            go to split
if li < ln then
    node ← extendNode
    go to split
if li = ln then
    Switch: node type
    Case: leaf ‖ extend
        node.value.append(blockNumber)
        break
    Case: branch
        node.value.append(blockNumber)
        node ← extendNode
        break
if li > ln then
    if node.type = leaf then
        node ← extendNode
    node.child.append(info[i : li])
split: node.child.append(info[0 : i]) and node.child.append(info[i : li])
To construct the global on-chain index structure, miners extract the keyword information from transactions. The format of the information we need in the transaction is <blockNumber, key1, key2, ...> pairs, and the keyword list has been sorted according to the weight in the inverted index. The MST is created when the genesis block is built, and it moves to another state after all of the transactions in a block have been handled. The initial MST has only a root node, and every index pair in a transaction needs to be inserted starting from the root node.
After transactions are inserted into MST, blockchain nodes need to update its keyword set. In an MST, each leaf node or extension node can be expressed as a Boolean expression. For example, in Fig. 2, the Boolean expression of leaf node (a) is $\{key_1 \wedge (key_3 \wedge key_2)\}$. Moreover, Boolean expressions corresponding to multiple leaf nodes or extension nodes can be combined. For example, in Fig. 2, the combination of the Boolean expressions of leaf nodes (a) and (b) is $\{key_1 \wedge ((key_3 \wedge key_2) \vee key_5)\}$.
Besides, the keyword list of queries can also be represented in Boolean form like $q = \{key_1 \wedge key_3 \wedge key_2\}$. There may be some ambiguity in the order of keywords. Assume that a combination of leaf nodes or extension nodes denoted by $\{\text{“blockchain”} \wedge (\text{“bitcoin”} \vee \text{“ethereum”})\}$ is stored in the MST. Then, if a query is $\{\text{“ethereum”} \wedge \text{“blockchain”}\}$, the MST cannot return any result. However, our DSE model can eliminate the ambiguity, since we can sort the input keyword list by weight with larger weights ranked ahead, preventing recursive insertion of the keyword combinations.
The design of MST paths borrows the common-prefix concept of the trie. Upon an insertion of a keyword list, the data with the same prefix keyword combination is stored on the same path. To further reduce the index size, we compress paths with only one child node, as in the Patricia trie. Fig. 2 shows the specific structure of MST. Because the MST has Merkle tree characteristics, a hash value must be computed for each node, and different types of nodes have different inputs to the hash function. For a leaf node, the input parameters of the hash function are the keywords and the block pointers which direct to the block and are stored in the leaf node. The hash value of a branch node covers the keyword and its child nodes. For an extension node, the hash value is the hash result of the keywords, block pointers, and child nodes. The role of the node hash is not only to prevent the content of the node from being tampered with, but also to establish a logical connection between the parent node and the child node. The
parent node and the child node are not physically connected in the database. Instead, the child node is stored under its hash value as the key in the key-value database, and the parent node finds the child node through the hash value of the child node. The root hash value saved by the root node is stored in the MSTrootHash field of the block header.
In Algorithm 1, we perform an insert operation of a node in MST, including the node type conversion operation and the split operation. There are three situations to consider here when a new keyword list appears. We describe the three different operations separately and how the node splits.
4.2.4 MST-B+ Tree
Besides the keyword query, we also focus on range query on blockchain. The Merkle-B+ tree is a common ADS that is usually used to support authenticated range queries in outsourced databases. The MST-B+ tree is a combination of the MST keyword trie and the Merkle-B+ tree, which can support both the authenticated keyword query and the authenticated range query. The combining process consists of two steps. The first step is the sorting process. As shown in Fig. 3, the size information of the off-chain documents is added in the leaf nodes and extension nodes in MST, but they are unordered. To put them in order, the blockchain nodes sort them by their size values and add small-to-large hash pointers between them. Through the ordered hash pointers, all leaf nodes and extension nodes form an ordered linked list. In the second step, after the ordered linked list is constructed, the blockchain nodes add the index information into the branch nodes and leaf nodes to let them point to the documents whose sizes are in an ordered range.
The MST part and the B+ tree part of the MST-B+ tree are maintained and updated independently. In an MST, each extension node or leaf node has two pointer areas (blue area in Fig. 3) that store pointers to the document corresponding to the keyword query and the document corresponding to the range query, respectively. For example, in Fig. 3, the two pointers (red dotted lines) of leaf node (a) point to two different documents, one with keyword set $\{key_1, key_2, key_5\}$ and one with document size 3 KB.
4.3 Query Processing
After adding the index structure on the blockchain, our MSTDB system can provide multiple authenticated queries. In this section, we will introduce the query processes of four methods.
Fig. 4. A Structure of Multiple keywords Aho-Corasick automaton.
4.3.1 Authenticated Multi-Keyword Query
According to the definition given in Section 3.2, a multi-keyword query needs to match the keyword combination entered by the user with the index path in MST, and return all results that meet the conditions. We refer to string-matching algorithms such as KMP (Knuth-Morris-Pratt), and propose a multi-keyword matching structure named Multiple keywords Aho-Corasick automaton (MKACA). In our design, MKACA is another form of MST and can always be kept in the memory of the blockchain node. It supports incremental updates and is updated together with the MST in each consensus round of the blockchain.
As shown in Fig. 4, the initial state of MKACA corresponds to the root node of MST, and the receiving states are represented as double circles, which correspond to the extension nodes or leaf nodes. All the state transition functions in MKACA correspond to index keywords on different branch paths of MST. For example, q0 in Fig. 4 represents the root node in Fig. 2, q1 represents the extension node with key1, and the two states q2 and q3 together represent the leaf node with key2 and key3, because <key1, key2, key3> and <key1, key3, key2> are two different matching paths.
MKACA can provide the authenticated multi-keyword query for MSTDB. The inputs of an authenticated multi-keyword query include a keyword list and the root node of the MST. First, the MST is converted to a corresponding MKACA, and the keyword list is transformed to a CNF, which starts its matching process in the initial state of MKACA. Then, the blockchain nodes match the keywords of the CNF in order with the different paths of MKACA. If the operator is conjunctive, the matching process searches the MKACA in depth, while if the operator is disjunctive, the matching process searches the MKACA in breadth. If the CNF has been traversed and the matching process has reached the receiving states, blockchain nodes return the value of the corresponding node of this state. Otherwise, the blockchain nodes return an empty result. For example, in the figure, if a query input is $key_1 \wedge (key_3 \vee key_2)$, the state goes from q0 to q1, then from q1 to q2 and q3, but neither q2 nor q3 is a receiving state, so the matching process fails. However, if the input is $key_1 \wedge key_3 \wedge key_2$, the state goes from q0 to q1, q1 to q3, and finally to q9. Since q9 is a receiving state of MKACA, the matching process succeeds.
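The depth-versus-breadth matching rule described for MKACA can be illustrated with a small sketch. This is not the authors' implementation: it is a plain keyword trie walked clause by clause, without the failure links of a full Aho-Corasick automaton, and the names `State`, `insert_path`, and `match`, as well as the block numbers 7 and 9, are hypothetical choices made for illustration only. A query is assumed to be given as a list of clauses, each clause being a list of alternative keywords.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """One automaton state; `blocks` is non-empty only for receiving states."""
    edges: dict = field(default_factory=dict)    # keyword -> next State
    blocks: list = field(default_factory=list)   # hypothetical payload: block numbers


def insert_path(root, keywords, block_number):
    """Register an index path such as <key1, key3, key2> ending in a receiving state."""
    state = root
    for kw in keywords:
        state = state.edges.setdefault(kw, State())
    state.blocks.append(block_number)


def match(root, clauses):
    """Match a conjunction of clauses against the stored paths.

    Each conjunction step moves one level deeper; the alternatives inside a
    clause (a disjunction) fan out across sibling transitions of that level.
    """
    frontier = [root]
    for alternatives in clauses:
        frontier = [s.edges[kw] for s in frontier for kw in alternatives if kw in s.edges]
        if not frontier:
            return []
    # Only receiving states carry a result; otherwise the answer is empty.
    return sorted(b for s in frontier for b in s.blocks)


root = State()
insert_path(root, ["key1", "key2", "key3"], 7)
insert_path(root, ["key1", "key3", "key2"], 9)

# key1 AND (key3 OR key2): ends on two non-receiving states, so no result.
print(match(root, [["key1"], ["key3", "key2"]]))
# key1 AND key3 AND key2: ends on a receiving state.
print(match(root, [["key1"], ["key3"], ["key2"]]))
```

The two calls mirror the two query inputs discussed in the text: the first stops on intermediate states and yields an empty result, while the second follows one path to a receiving state.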
The outputs of it are the multi-keyword query result R and its verification object (VO). The correctness and completeness of the results can be guaranteed by the Merkle proof in the VO. Clients can verify the correctness of the results by recomputing the hash value of the Merkle root and comparing it with the record in their local block header. Completeness is guaranteed because one query returns only one related node in the MST, so a result cannot be maliciously omitted and still pass the validation.

4.3.2 Authenticated Range Query
By combining the B+ tree and MST, our system can provide verifiable range queries in the hybrid storage architecture. Range query is a common database operation that finds records whose value lies between a lower and an upper boundary. Using the properties of the B+ tree, the values in the input range can be located stably. For a verifiable range query, the full node searches the MST-B+ tree to find the first record in the range. Then it scans the leaf nodes from that first node until the right boundary of the query range. The Merkle proof is packed into the VO and returned to the light node together with the query result.

For authenticated range queries, the query results are a range of MST-B+ tree nodes, so the blockchain nodes need to return the query results and the Merkle proofs of the two points at the front and back ends of the range. The correctness and completeness of the results can be guaranteed by the Merkle proof. Clients can verify the correctness of the results by recomputing the hash value of the Merkle root and comparing it with the record in their local block header. Completeness is guaranteed because the Merkle proof of the B+ tree can prove that all in-scope results are returned.

4.3.3 Authenticated Top-K Query
Users constantly seek documents related to their input. To meet this need, the basic solution is to return all query results through the index and let the searcher find the best document themselves. However, most of the query results do not closely match the user's needs, so users must spend time filtering the results, which is inefficient and has become a bottleneck of the query process. Top-K query is a classic optimization technique that reduces query time. It improves query efficiency because it has a key step called early termination. Early termination first arranges the inverted list so that the most promising documents appear early, and then stops the query process as soon as the top K results appear. Besides early termination, there are other effective techniques, such as skipping within lists, omitting lists, or scoring only partially. Top-K query calculates the relevance of the user input to candidate documents and returns the top K results ranked from high to low score. For score calculation, the index must be traversed, and there are two traditional index traversal techniques:

Term-At-A-Time (TAAT) [40]: Query service providers first access one term (or keyword) of the query request and calculate the product of the weight of all documents (or transactions) in the list of the word in the inverted index and the weight of the word in the query request. Then, for each other keyword in the query request, they use the same similarity calculation method and add the product to a sum. Miners use temporary memory to keep track of the currently active Top-K candidates.

Document-At-A-Time (DAAT) [41]: Each document related to the query input can be scored independently; miners use the simhash algorithm or cosine distance to calculate the similarity between two documents. DAAT stores the current Top-K result in a simple heap structure.

Compared to TAAT, DAAT is a more space-saving solution and is faster due to multi-thread parallel operations. However, DAAT rarely terminates early, because no preliminary overall judgment can be made about the next document. Many optimized early-termination approaches use TAAT, but in a blockchain system the cost of temporary storage space cannot be ignored, so we improve DAAT's early termination conditions.

Algorithm 2. Top-K Query Based on BAAT
Input: Q as a query request; B as the blockchain
Output: R as Top-K query result
1: Create a min-k heap h[k] and h[0] is the minimal value
2: foreach block b_i ∈ B do
3:     if b_i.avgWeight × b_i.transNumber < h[0].value then
4:         continue
5:     else
6:         foreach doc ∈ b_i do
7:             if cal[doc, Q] > h[0].weight then
8:                 h[0] = doc
9:                 h.rebalance()
10: return h;

In our scheme, we propose a more efficient index traversal pattern called Block-At-A-Time (BAAT), developed from DAAT. Like DAAT, BAAT traverses the blockchain by calculating weights for documents one by one and uses a min-k heap to store the temporary top k results. Unlike DAAT, BAAT adopts a block-based early termination scheme that skips all documents in a block without calculating their weights, which allows it to get query results faster than DAAT. In BAAT, each block has a pre-calculated average score table containing the average weight of all keywords in the block. For each keyword k_i in block b_i, the average weight of k_i is the average DSE weight of k_i. When a query request is received, the blockchain node that receives the request traverses the chain from the latest block to the genesis block and keeps a min-k heap to store the terminated results. All documents in a block are compared with the query to calculate their similarity, and the min-k heap always keeps the top k relevant documents. The early termination scheme of BAAT works as follows: assume there is a document in the block that has the highest relevance to all the keywords in the query; if the relevance of this hypothetical document is still less than that of the document at the head of the heap, the whole block can be skipped.
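The block-skipping idea of BAAT can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a document's score is assumed to be the sum of its non-negative per-keyword weights, and the block layout (docs, avg_weight) and the helper names are hypothetical. Under these assumptions, average weight times document count equals the total weight of a keyword in the block, which no single document can exceed, so it serves as the block's upper bound.

```python
import heapq


def make_block(docs):
    """Build a block with a pre-computed average weight table (hypothetical layout)."""
    keywords = {kw for d in docs for kw in d["weights"]}
    n = len(docs)
    avg = {kw: sum(d["weights"].get(kw, 0.0) for d in docs) / n for kw in keywords}
    return {"docs": docs, "avg_weight": avg}


def block_upper_bound(block, query):
    # avg weight * doc count = total keyword weight in the block (assumed bound).
    n = len(block["docs"])
    return sum(block["avg_weight"].get(kw, 0.0) * n for kw in query)


def score(doc, query):
    # Assumed relevance: sum of the document's weights for the query keywords.
    return sum(doc["weights"].get(kw, 0.0) for kw in query)


def baat_top_k(chain, query, k):
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest candidate
    skipped = 0
    for block in reversed(chain):  # latest block first, genesis block last
        # Early termination per block: only possible once k candidates exist.
        if len(heap) == k and block_upper_bound(block, query) < heap[0][0]:
            skipped += 1
            continue
        for doc in block["docs"]:
            s = score(doc, query)
            if len(heap) < k:
                heapq.heappush(heap, (s, doc["id"]))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc["id"]))
    return sorted(heap, reverse=True), skipped


genesis = make_block([
    {"id": "c", "weights": {"key1": 0.125}},
    {"id": "d", "weights": {"key2": 0.25}},
])
latest = make_block([
    {"id": "a", "weights": {"key1": 0.75, "key2": 0.5}},
    {"id": "b", "weights": {"key1": 0.5}},
])
top, skipped = baat_top_k([genesis, latest], ["key1", "key2"], k=2)
print(top, skipped)
```

In this toy chain the genesis block has an upper bound of 0.375, below the weakest heap entry (0.5), so none of its documents are scored.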
Fig. 5. Structure of VC-based MST.

The detailed algorithm of BAAT is shown in Algorithm 2. In Algorithm 2, the inputs contain a query request Q, which includes a keyword list <key1, key2, ..., keyt>, and a blockchain B. First, the blockchain node creates a min-k heap h[k] whose head is h[0] (line 1). Then, for each block b_i in B, the blockchain node calculates the average weights of all keywords in the keyword list and the possible maximum value of the block. If the maximum value is lower than the value in h[0], the block is skipped (lines 2-4). If not, the blockchain node calculates the similarity of all documents in the block with the query and updates the heap if there are more relevant documents. Finally, the blockchain node returns the heap.

For authenticated Top-K queries, since the blockchain nodes need to traverse the entire chain to iterate over all documents, it is too expensive to send the data of the entire chain to the client for verification. Therefore, we adopt a consensus-based verification method. The client sends a query transaction to get the top k results, and the query results are returned with a transaction receipt. Since the query transaction is executed by all blockchain nodes, correctness and completeness are guaranteed by the consensus mechanism.

MST like the MST-B+ tree, and divide the ordered list into multiple groups of the same size to reduce the generation time for the security parameters, each of which is an ordered vector. Second, the blockchain nodes generate the commitments of all vectors and store these commitments in the root node's child nodes. For example, in Fig. 5, c1 to c3 are the commitments of the three vectors V1 to V3 and are stored in the three child nodes of the root node. Third, when items need to be inserted into or updated in the MST, the blockchain nodes create new vectors or update the existing vectors and their commitments.

The VC scheme supports the same range query function as the Merkle B+ tree. In authenticated range query processing, blockchain nodes first search the root node's child list to find the group containing the target range and find the result in the corresponding vector; then they return the result with a VO to the light node. The blockchain nodes generate the vector commitment proofs of the query results and add the proofs, the relevant commitments of these proofs, and the Merkle proofs of these commitments to the VO. After receiving the result and the VO, the client verifies the result with the proof and the commitment locally. By adding the VC scheme, MSTDB only needs to store the additional commitments of the grouped linked list, and each update of the digest value leads to only a single update of one commitment. Compared with the Merkle B+ tree, the time complexity of a VC update is reduced to O(1).
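The grouping step and the group lookup for a range query can be sketched as below. This only illustrates how an ordered list is split into equal-size vectors and how the groups overlapping a range are found; the SHA-256 digest is a placeholder and is not a vector commitment, so no position proofs are produced. The function names and the group layout are hypothetical.

```python
import bisect
import hashlib


def build_groups(sorted_values, group_size):
    """Split an ordered list into fixed-size vectors, one digest per vector."""
    groups = []
    for i in range(0, len(sorted_values), group_size):
        vector = sorted_values[i:i + group_size]
        # Placeholder digest only; a real system would store a vector commitment.
        digest = hashlib.sha256(repr(vector).encode()).hexdigest()
        groups.append({
            "low": vector[0],
            "high": vector[-1],
            "vector": vector,
            "commitment": digest,
        })
    return groups


def range_query(groups, lo, hi):
    """Return values in [lo, hi] and the indices of the groups that were read."""
    highs = [g["high"] for g in groups]
    start = bisect.bisect_left(highs, lo)  # first group whose maximum is >= lo
    results, touched = [], []
    for idx in range(start, len(groups)):
        group = groups[idx]
        if group["low"] > hi:
            break  # groups are ordered, so later groups cannot match
        touched.append(idx)
        results.extend(v for v in group["vector"] if lo <= v <= hi)
    return results, touched


groups = build_groups([1, 2, 3, 4, 5, 6, 7, 8, 9], group_size=3)
wide_results, wide_groups = range_query(groups, 3, 7)
narrow_results, narrow_groups = range_query(groups, 4, 5)
print(wide_results, wide_groups, narrow_results, narrow_groups)
```

Only the groups listed in the second return value would need their commitments and proofs included in the VO, which is why equal-size grouping keeps the proof work local to the queried range.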
Fig. 7. Blockchain with added node pointers. The dotted lines represent
the parent node pointers to the unchanged subtrees after the MST state
changes.
2, 5, 6, and (3,4) are connected by the red arrows, and they correspond to the leftmost subtree in the left figure. Moreover, the path from any node to the root node is connected by F-B pointers of the same color. For example, the three red arrows in the right figure with numbers 1, 2, 3 are three F-B pointers, <root, 1>, <1, 2>, and <2, 3>, each of which contains the two nodes before and after the pointer. Through the three F-B pointers, the node with keyword 6 can establish a link with the root node of the keyword path <1, 2, 6>.

Node Pointer. Another effective way to reduce the storage space of the index structure is the node pointer. Since MST contains all the index information on the blockchain and always keeps the newest state in the last block, there are many different versions of MST as the number of blocks increases. Streamlining the index structures that are not in the latest block is an obvious and effective means to save storage space. But to ensure the returnability and traceability of the blockchain during a fork, it is necessary to keep the complete evolution process of the MST structure. The core idea of our design is incremental modification of the index structure. Similar to Ethereum's state tree, a newer block only stores the changed part of MST and uses node pointers to point to the unchanged subtrees of MST, which are located in previous blocks. One possible problem in this design is that if a subtree has not been updated for a long time, there may be too many node pointers pointing to it. Therefore, we allow the next block to reuse the same node pointer of the previous block. A blockchain with added node pointers is shown in Fig. 7.

5.2 Bloom Filter
The structure of MST results in a large change in the MST when a new keyword that hasn't appeared before is

$$\mathrm{verify}(key_i) = \bigwedge_{i} \bigl( hash_i(key_i) \cap 1 \bigr) \qquad (6)$$

If the result is 1, the keyword is likely to be an existing keyword. If the result is 0, it can be determined that this keyword has never been stored in the system, so there is no need to access the index. The structure of the bloom filter is shown in Fig. 8. A bloom filter is not always completely accurate: it may produce false positives because two or more keywords may be mapped to the same bit of the bloom filter. Despite the presence of false positives, the efficiency of our system is still greatly improved after adding the bloom filter.

5.3 Cross-Chain Query
MST can effectively meet the needs of single-chain queries, but as discussed before, the data required by some users may sometimes be on different blockchains. In a cross-chain application, different chains must be able to connect and communicate with each other [43], which is also called interoperability. To solve the cross-chain query problem, we assume that there is a series of different chains that all have an MST in their blocks, and that they need to query some data that may be stored in different chains. To establish communication between multiple chains, we propose a cross-chain query committee scheme based on Merkle proof verification. The nodes in the committee are drawn from each blockchain in a certain proportion and replaced after a period of time.

DEFINITION 1: Given a blockchain set B = {b_1, b_2, ..., b_n}, where b_i is the number of nodes in the ith blockchain. Let C = Max(B), and c_i is the number of committee nodes of the ith blockchain. The number of nodes in each blockchain system is allocated as
TABLE 1
Precisions, Recalls, and F-Values at Different Numbers of
Keywords
TABLE 2
Response Times for Three Query Types
TABLE 3
The Result of Index Compression
TABLE 4
The Impact of Index Compression on Query Efficiency
Fig. 10. Cross-chain query.
Fig. 11. Time cost for generating blocks with and without the MST.
ACKNOWLEDGMENTS
We thank the editors and anonymous reviewers of TKDE, and acknowledge the powerful computation of the remote server supported by the Alibaba Cloud EFLOPS AI Platform. Our previous work related to this paper was presented at the International Symposium on Reliable Distributed Systems (2020) and was the runner-up for the best paper award at the conference. The conference paper is available at: https://ieeexplore.ieee.org/document/9252029/
Fig. 14. The results of authenticated queries on two datasets.
[18] S. J. Arulmozhi, K. Praveenkumar, and G. Vinayagamoorthi, “Bitcoin in India: A deep down summary,” Adv. Innov. Res., vol. 6, no. 2, 2019, Art. no. 28.
[19] J. Benet, “IPFS-content addressed, versioned, P2P file system,” 2014, arXiv:1407.3561.
[20] C. Zhang, M. Zhao, L. Zhu, W. Zhang, T. Wu, and J. Ni, “FRUIT: A blockchain-based efficient and privacy-preserving quality-aware incentive scheme,” IEEE J. Sel. Areas Commun., early access, Oct. 10, 2022, doi: 10.1109/JSAC.2022.3213341.
[21] J. Liang, Z. Qin, S. Xiao, L. Ou, and X. Lin, “Efficient and secure decision tree classification for cloud-assisted online diagnosis services,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 4, pp. 1632–1644, Jul./Aug. 2019.
[22] J. Liang, Z. Qin, L. Xue, X. Lin, and X. Shen, “Verifiable and secure SVM classification for cloud-based health monitoring services,” IEEE Internet Things J., vol. 8, no. 23, pp. 17029–17042, Dec. 2021.
[23] Z. Wu, Y. Xiao, E. Zhou, Q. Pei, and Q. Wang, “A solution to data accessibility across heterogeneous blockchains,” in Proc. IEEE 26th Int. Conf. Parallel Distrib. Syst., 2020, pp. 414–421.
[24] M. Herlihy, “Atomic cross-chain swaps,” in Proc. ACM Symp. Princ. Distrib. Comput., 2018, pp. 245–254.
[25] Z. Gao, H. Li, K. Xiao, and Q. Wang, “Cross-chain oracle based data migration mechanism in heterogeneous blockchains,” in Proc. IEEE 40th Int. Conf. Distrib. Comput. Syst., 2020, pp. 1263–1268.
[26] Z. Hong, S. Guo, P. Li, and W. Chen, “Pyramid: A layered sharding blockchain system,” in Proc. IEEE Conf. Comput. Commun., 2021, pp. 1–10.
[27] Z. Hong, S. Guo, R. Zhang, P. Li, Y. Zhan, and W. Chen, “Cycle: Sustainable off-chain payment channel network with asynchronous rebalancing,” in Proc. IEEE/IFIP 52nd Annu. Int. Conf. Dependable Syst. Netw., 2022, pp. 41–53.
[28] Y. Peng, M. Du, F. Li, R. Cheng, and D. Song, “FalconDB: Blockchain-based collaborative database,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 637–652.
[29] M. Zhang, Z. Xie, C. Yue, and Z. Zhong, “Spitz: A verifiable database system,” Proc. VLDB Endowment, vol. 13, no. 12, pp. 3449–3460, Aug. 2020.
[30] C. Zhang, C. Xu, J. Xu, Y. Tang, and B. Choi, “GEM2-tree: A gas-efficient structure for authenticated range queries in blockchain,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 842–853.
[31] Y. Zhu, Z. Zhang, C. Jin, A. Zhou, and Y. Yan, “SEBDB: Semantics empowered blockchain database,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 1820–1831.
[32] C. Xu, C. Zhang, and J. Xu, “vChain: Enabling verifiable boolean range queries over blockchain databases,” in Proc. Int. Conf. Manage. Data, 2019, pp. 141–158.
[33] C. Zhang, C. Xu, H. Wang, J. Xu, and B. Choi, “Authenticated keyword search in scalable hybrid-storage blockchains,” in Proc. IEEE 37th Int. Conf. Data Eng., 2021, pp. 996–1007.
[34] Q. Pei, E. Zhou, Y. Xiao, D. Zhang, and D. Zhao, “An efficient query scheme for hybrid storage blockchains based on merkle semantic trie,” in Proc. IEEE Int. Symp. Reliable Distrib. Syst., 2020.
[35] A. Coglio, “Ethereum’s recursive length prefix in ACL2,” in Proc. 16th Int. Workshop ACL2 Theorem Prover Appl., 2020, pp. 108–124.
[36] C. Luo and M. J. Carey, “LSM-based storage techniques: A survey,” VLDB J., vol. 29, no. 1, pp. 393–418, 2020.
[37] J. H. Paik, “A novel TF-IDF weighting scheme for effective ranking,” in Proc. 36th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2013, pp. 343–352.
[38] H. Bast and B. Buchhold, “An index for efficient semantic full-text search,” in Proc. 22nd ACM Int. Conf. Inf. Knowl. Manage., 2013, pp. 369–378.
[39] I. N. Yulita, H. L. The, and Adiwijaya, “Fuzzy hidden markov models for indonesian speech classification,” J. Adv. Comput. Intell. Intell. Informat., vol. 16, no. 3, pp. 381–387.
[40] C. Buckley and A. F. Lewit, “Optimization of inverted vector searches,” Proc. 8th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 1985, pp. 97–110.
[41] H. Turtle and J. Flood, “Query evaluation: Strategies and optimizations,” Inf. Process. Manage., vol. 31, no. 6, pp. 831–850, 1995.
[42] D. Leung, Y. Gilad, S. Gorbunov, L. Reyzin, and N. Zeldovich, “Aardvark: An asynchronous authenticated dictionary with applications to account-based cryptocurrencies,” in Proc. 31st USENIX Secur. Symp., 2022, pp. 4237–4254.
[43] B. Pillai, K. Biswas, and V. Muthukkumarasamy, “Cross-chain interoperability among blockchain-based systems using transactions,” Knowl. Eng. Rev., vol. 35, 2020, Art. no. e23.
[44] “IET inspec,” 2003. [Online]. Available: https://www.theiet.org/publishing/inspec
[45] S. D. Gollapalli and C. Caragea, “Extracting keyphrases from research papers using citation networks,” in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 1629–1635.
[46] O. Medelyan, I. Witten, and D. Milne, “Topic indexing with wikipedia,” in Proc. AAAI Conf. Artif. Intell. Workshop, 2008.
[47] V. Anh and A. Moffat, “Improved word-aligned binary compression for text indexing,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 6, pp. 857–861, 2006.

Enyuan Zhou (Student Member, IEEE) received the BE degree in information security from Northeastern University, and the MSc degree in cyberspace security from Xidian University, supervised by Prof. Qingqi Pei in ISN. He is currently working toward the PhD degree with the Department of Computing, Hong Kong Polytechnic University. His current research interests include Blockchain, database, and IR.

Zicong Hong (Graduate Student Member, IEEE) received the BEng degree in software engineering from the School of Data and Computer Science, Sun Yat-sen University. He is currently working toward the PhD degree with the Department of Computing, Hong Kong Polytechnic University. His current research interests include blockchain, game theory, Internet of Things, and edge/cloud computing.

Yang Xiao (Member, IEEE) received the BS and PhD degrees from Xidian University, in 2013, 2020, respectively. His research interests focus on adversarial attacks and defence on graph structured data, distributed trust management and data privacy in AI.

Dongxiao Zhao received the bachelor’s and master’s degrees in information and communication engineering from Xidian University. He is currently working toward the PhD degree with the School of Telecommunications Engineering in Xidian University. He used to work in software development with ZTE. His current research interests include Blockchain, database, distributed system.

Qingqi Pei (Senior Member, IEEE) received the BS, MS, and PhD degrees in computer science and cryptography from Xidian University in 1998, 2005, and 2008, respectively. He is currently a professor and a member of the State Key Laboratory of Integrated Services Networks, Xidian University. His research interests focus on cognitive network, data security, and physical layer security. He is a professional member of ACM and a senior member of the Chinese Institute of Electronics and China Computer Federation.
Song Guo (Fellow, IEEE) received the PhD degree in computer science from the University of Ottawa. He is currently a full professor with the Department of Computing with The Hong Kong Polytechnic University. Prior to joining PolyU, he was a full professor with the University of Aizu, Japan. His research interests are mainly in the areas of cloud and green computing, Big Data, wireless networks, and cyber-physical systems. He has published more than 300 conference and journal papers in these areas and received multiple best paper awards from IEEE/ACM conferences. His research has been sponsored by JSPS, JST, MIC, NSF, NSFC, and industrial companies. He has served as an editor for several journals, including IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Emerging Topics in Computing, IEEE Transactions on Green Communications and Networking, IEEE Communications Magazine, and Wireless Networks. He has been actively participating in international conferences as general chair and TPC chair. He is a senior member of ACM, and an IEEE Communications Society Distinguished lecturer.

Rajendra Akerkar is professor and head of Big Data Research Group, Western Norway Research Institute (Vestlandsforsking), where his primary domain of activities is Big Data and semantic technologies with aim to combine strong theoretical results with high impact practical results. His recent research focuses on application of Big Data methods to real-world challenges in mobility, transport, energy and emergency management. He is serving as an associate editor of International Journal of Metadata, Semantics and Ontologies (IJMSO), associate editor of IEEE Open Journal of the Computer Society (OJ-CS) and Knowledge Management Track editor of Web Intelligence, an international journal. He has authored 16 books, 139 research papers and edited 19 volumes of international conferences and workshops. He is also actively involved in several international ICT initiatives, and research & innovation projects, including H2020 projects, for more than 20 years.