
8228 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 8, AUGUST 2023

MSTDB: A Hybrid Storage-Empowered Scalable Semantic Blockchain Database
Enyuan Zhou, Student Member, IEEE, Zicong Hong, Graduate Student Member, IEEE,
Yang Xiao, Member, IEEE, Dongxiao Zhao, Qingqi Pei, Senior Member, IEEE,
Song Guo, Fellow, IEEE, and Rajendra Akerkar

Abstract—Blockchain has been regarded as a trusted carrier for distributed data storage. With large volumes of valuable data stored on
blockchain, data query has become a major requirement. However, the existing blockchains do not provide efficient query functionality
because of their deep-rooted chain structure. Blockchain database is a new direction that constructs an index on top of the blockchain to provide
rich query functionalities. The existing works are either insecure because the query process is separated from the blockchain consensus, or
unscalable because all the data needs to be stored in the block. In this paper, we propose a novel semantic blockchain database called
MSTDB. We design a hybrid on/off chain blockchain storage architecture in which the majority of blockchain storage is offloaded to the
off-chain storage and a novel index structure named Merkle Semantic Trie (MST) is designed to be a secure and semantic bridge
between on- and off-chain. Based on MST, MSTDB provides a variety of semantic query functions including multi-keyword query,
range query, Top-K query, and cross-chain query. To improve the performance further, we design some index compression and query
preprocessing techniques for MSTDB. Extensive experiments demonstrate the effectiveness and efficiency of our blockchain database.

Index Terms—Blockchain database, data sharing, distributed query, index

• Enyuan Zhou is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, and also with the State Key Laboratory of ISN, School of Cyber Engineering, Xidian University, Xi’an, Shaanxi 710071, China. E-mail: en-yuan.zhou@connect.polyu.hk.
• Zicong Hong and Song Guo are with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: zicong.hong@connect.polyu.hk, song.guo@polyu.edu.hk.
• Yang Xiao is with the State Key Laboratory of ISN, School of Cyber Engineering, Xidian University, Xi’an, Shaanxi 710071, China. E-mail: yxiao@xidian.edu.cn.
• Dongxiao Zhao and Qingqi Pei are with the State Key Laboratory of ISN, Shaanxi Key Laboratory of Blockchain and Secure Computing, Xidian University, Xi’an, Shaanxi 710071, China. E-mail: bigctime@163.com, qqpei@mail.xidian.edu.cn.
• Rajendra Akerkar is with Big Data Research Group, Western Norway Research Institute, 6856 Sogndal, Norway. E-mail: rak@vestforsk.no.

Manuscript received 15 June 2021; revised 28 September 2022; accepted 30 October 2022. Date of publication 8 November 2022; date of current version 21 June 2023.
The work of Enyuan Zhou, Yang Xiao, Qingqi Pei, and Dongxiao Zhao was supported in part by the National Key Research and Development Program of China under Grant 2020YFB1807500, in part by the National Natural Science Foundation of China under Grants 62132013, 62102295, 62276198, 62202358 and 61902292, in part by the Key Research and Development Programs of Shaanxi under Grant 2021ZDLGY06-03. The work of Enyuan Zhou, Zicong Hong, and Song Guo was supported in part by fundings from the Key-Area Research and Development Program of Guangdong Province under Grant 2021B0101400003, in part by Hong Kong RGC Research Impact Fund (RIF) under Grant R5060-19, in part by General Research Fund (GRF) under Grants 152221/19E, 152203/20E and 152244/21E, in part by the National Natural Science Foundation of China under Grant 61872310, and in part by Shenzhen Science and Technology Innovation Commission under Grant JCYJ20200109142008673.
(Corresponding authors: Yang Xiao and Song Guo.)
Recommended for acceptance by N. Zhang.
Digital Object Identifier no. 10.1109/TKDE.2022.3220522

1 INTRODUCTION

In recent years, blockchain systems represented by Bitcoin [1] and Ethereum [2] have attracted growing attention. As a distributed ledger technology, blockchain can guarantee high security and reliability of data, and has been adopted in different types of applications which involve multiple untrusted participants [3], such as supply chains [4], asset digitization [5], decentralized identifiers [6], and commodity traceability [7]. For example, in a supply chain scenario, multiple companies who do not trust each other need a complete historical record of their transactions (such as making quotes and placing orders) for traceability. The characteristics of full-ledger sharing, credible consensus and tamper-proof storing make blockchain suitable for use in the supply chain.

However, in comparison with traditional databases, most existing blockchains do not provide rich and efficient query functionalities such as SQL or keyword query, which significantly degrades the user experience and restricts the application of blockchains. For example, Bitcoin only provides some simple query APIs based on the hash or number of blocks (such as getBlockByHash and getBlockByNum [8]). Thus, in the above blockchain-based supply chain, users cannot query transactions with some keywords, like “Logistics providers ∈ [“DHL,” “UPS,” “FedEx”],” or search for some history records in a certain period of time, such as “3.99$ < Unit price of pork < 5.99$”.

The main reason is that most blockchain designers take only the tamper-proof characteristic of the data into account, but not its semantic value. To solve the problem, there have been some works on adding semantic features to the blockchain and providing query functions for it, but they have certain limitations as follows. On the one hand, a few studies [9], [10], [11] are off-chain schemes. They extract semantic data from the blockchain and put it in a local database to provide query services. It is easy to
1041-4347 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

implement but is separated from the blockchain system, so they cannot guarantee the security of the clients’ queries. On the other hand, some other studies [12], [13] try to build the query model completely on-chain. However, they put all the data and index information in the block. The excessive coupling makes their systems unscalable.

Motivated by an emerging data storage mode for blockchain, i.e., a combination of on- and off-chain storage [14], our basic idea is to build a semantic query layer on top of a hybrid blockchain storage. We plan to put most of the data off the chain, and place the semantic information and index information of the data on the chain. There are two challenges in supporting convenient queries in such a hybrid storage architecture, and we summarize them as follows.

• Inefficient semantic query. The underlying key-value database of the existing blockchain systems lets clients query a block or transaction based on its number or hash, but not based on the semantic information of the targeted data, which is more common in real-life application scenarios. In addition, the append-only structure forces clients to traverse the entire blockchain to ensure the completeness of the query result, which gives clients a terrible query experience.
• No semantic link. The existing blockchain systems adopt hash values for mapping between the on-chain and off-chain data. The hash values can only guarantee that the off-chain data will not be tampered with. However, they do not establish an index for the semantic information of the off-chain data. Thus, the clients need to frequently access the whole off-chain data for each query, which is time-consuming and inefficient.

Therefore, in this paper, we propose a hybrid semantic blockchain database named MSTDB. It adds the semantic feature to the blockchain system and provides rich and efficient query methods for the hybrid on-chain and off-chain data. We first use a hybrid on/off chain storage and employ an improved TF-IDF model to extract semantic information of the off-chain data, and upload this information to the chain by way of transactions. Next, we build an on-chain index structure named Merkle Semantic Trie (MST) in the block, which contains index components including an inverted index, a B+ tree and a Merkle Patricia Trie. We then introduce the query processing via MST, including multi-keyword query, range query, Top-K query, and authenticated query. These functions are motivated by traditional databases, but we have specially designed these technologies for the decentralized and chain structure of the blockchain. Finally, we have optimized the specific design of our scheme. We compress the index to reduce the storage cost and design a special index traversal scheme to support Top-K query for blockchain, which improves the system’s performance. Besides, we design an approach for cross-chain query, which solves the problem of information interaction between heterogeneous chains. To overcome the scalability challenges of the MST-B+ tree, we propose a cryptographic vector commitment (VC)-based scheme refinement.

The contributions of our paper are summarized below.

• We propose a hybrid on/off chain storage architecture to reduce the blockchain space cost, and add semantic features to the blockchain system.
• We create a novel index structure on blockchain which is composed of meta-data in transactions and maintained by miners through the consensus mechanism, and based on this structure, our system can provide rich query functions including authenticated multi-keyword query, authenticated range query, authenticated Top-K query, and authenticated cross-chain query.
• We make some improvements to the program to enhance its availability. First, we use an F-B point to reduce the number of nodes in the index. Second, node point is an effective way to prevent space reuse. Third, we add a Bloom filter to the block to preprocess the query.
• We implement the prototype of MSTDB on Ethereum, and we conduct an extensive evaluation of it. Experiments on three datasets show that our scheme has good response time, lower query result verification cost and superior effectiveness and completeness.

The rest of this paper is organized as follows. Section 2 introduces some related works. In Section 3, we define the system and query model. Section 4 presents our architecture, including the hybrid storage and indexing mechanism, and the query processing. Some design refinements are shown in Section 5. In Section 6, we perform the experiments and verify the availability and effectiveness of our system by analyzing the experimental results. Finally, Section 7 gives the conclusion and our future work.

2 RELATED WORK

2.1 Blockchain Data Storage

Blockchain is a decentralized ledger which preserves all historical transaction records. Underlying most blockchain systems are key-value databases such as LevelDB and CouchDB, which find the one-to-one corresponding value through a certain key. The data structure of blockchain is a chain which is linked from one block to another, and as the block data increases, the amount of data stored by a blockchain node has become very large. The Bitcoin blockchain size reached 341.89 GB in April 2021 [15], and at the same time, the size of Ethereum also exceeded 300 GB [16]. To reduce the on-chain storage pressure, some hybrid on-chain and off-chain storage schemes [14], [17] have been proposed for better scalability of blockchain systems. Their main approach is to store most of the data off the chain in a distributed peer-to-peer network (e.g., BitTorrent network [18] and IPFS network [19]), with hash values of the entire data or partitioned data blocks on the blockchain. The on-chain hash value can ensure the consistency of off-chain data and support some simple queries like checking the balance and querying blocks. For example, Zhang et al. [20] proposed a blockchain-based data marketplace with quality-aware incentives, where transaction records are preserved off-chain, and transaction hash values are stored on-chain for checks. However, as noted in the introduction,

they lack semantic queries such as keyword query, range query or Top-K query, which are fundamental for scenarios such as health services [21], [22]. In addition to these problems, in practical applications, some information may be distributed on different blockchains, and these blockchains are heterogeneous [23], which also leads to difficulties in information exchange between them. Some works focus on transaction [24] or data migration [25], but do not consider the query requirement between different chains. Sharding and payment channel networks [26], [27] are other approaches for blockchain scalability. Sharding divides the transactions into different groups to reduce the storage cost for blockchain nodes, but the query still needs to traverse all the shards, which increases the communication overhead.

2.2 Blockchain Query

Blockchain is an append-only data structure that uses hash pointers to establish a tamper-proof binding relationship between consecutive blocks. However, its append-only attribute also means that a query over all the blockchain data requires traversing the entire chain backward to ensure that the search results are complete. Since most blockchain systems are based on key-value databases, they can only provide some simple operations such as key update, search, and write. Taking Ethereum as an example, it supports several query methods: (i) query block by block height; (ii) query block by block hash; (iii) query transaction by transaction hash.

To optimize the query experience, some works focus on modifying the blockchain structure to improve search capabilities. These works are divided into two categories: adding outside databases and built-in indexes. Some recent works including EtherQL [10], BigchainDB [11] and BlockchainDB [9] focus on building a query layer on top of blockchain to provide richer query functionality. EtherQL develops a middleware which can automatically keep data in sync between the blockchain and external databases, and provides some RESTful APIs for the client to call the blockchain handler and query interface. BigchainDB is a distributed database with blockchain characteristics. It uses MongoDB and provides a rich query language. But BigchainDB is not essentially a traditional blockchain system because it does not have the full replication feature. BlockchainDB is a shared database system based on blockchain, and its main applicable scenario is a distributed database with multiple participants who do not trust each other and frequently read and write data. FalconDB [28] is a blockchain-based collaborative database that enhances the client-side verifiability and system availability.

Adding an index inside the blockchain is an intrusive design [29], and it maintains the highly decentralized nature of blockchain. ECBC [12] is a high-performance educational certificate blockchain with efficient query. It builds a tree index structure in the block and can provide users with a better query experience. GEM²-tree [30] designs an optimized Merkle-B+ tree to implement gas-efficient authenticated range query. SEBDB [31] adds relational data semantics into the blockchain database, and provides SQL-like languages as its general interface. Besides, vChain [32] and SEBDB use an authenticated data structure (ADS) to implement authenticated query, which guarantees the integrity of query results. Zhang et al. [33] propose a hybrid-storage architecture for blockchain and improve the data management scalability by a novel Chameleon index-based ADS.

Comparison: First, our MSTDB system has richer query functionality than traditional blockchain systems such as Bitcoin and Ethereum. Second, compared with some fully off-chain approaches including EtherQL, BigchainDB and BlockchainDB, MSTDB has a stronger level of decentralization, which ensures higher data security and tamper resistance. Third, in terms of scalability, our hybrid architecture has greater scalability and lower storage cost than some completely on-chain schemes like ECBC. Finally, compared with vChain and SEBDB, whose key research is the security of authenticated query, our MSTDB mainly focuses on the structure of the index and query performance.

2.3 New Achievements

Our previous work [34] proposes a query scheme for a hybrid storage blockchain. Based on the previous work, this paper designs several techniques to optimize the blockchain database in terms of performance and storage.

First, the query method in the previous work traverses and returns the results without a relevance ranking. However, in practice, returning all results is inefficient and costly because the user may only want the results that are most relevant or relatively relevant. Thus, this paper proposes a Top-K query based on an efficient index traversal scheme designed for blockchain database (see Section 4.3.3).

Second, in our previous work, the space cost of the index becomes very large when the data is massive, which causes high on-chain storage, so index access becomes the performance bottleneck of our system. To solve this problem, we make some design refinements for performance, including index compression and a Bloom filter (see Sections 5.1 and 5.2).

Third, our previous work only focuses on a single-chain scenario. Since different blockchains have different characteristics and applicable scenarios, the queried data may exist on different chains, and users may want to query multiple blockchains at the same time. Therefore, considering the data silos between different blockchains, we extend our solution to a multi-chain architecture based on a Simplified Retrieval Verification approach (see Section 5.3).

Finally, in comparison to the previous work, we adopt three larger datasets and conduct a large-scale experiment with more nodes to evaluate MSTDB (see Section 6).

3 PROBLEM DEFINITION

3.1 System Model

The following types of nodes participate in MSTDB:

• Data owner: A data owner is a user that holds the off-chain data like text documents and can upload it to the blockchain by sending transactions.
• Full nodes: A full node is a node that fully validates and stores transactions and blocks. It exists in the P2P network and exchanges information with other full nodes. All full nodes participate in the consensus process. For example, in Bitcoin or Ethereum, each miner is a full node.
• Light nodes: A light node is a client that stores the header information of the blockchain and requests everything else from the full nodes. It can verify the validity of the data against the Merkle tree roots in the block headers. For example, in Bitcoin or Ethereum, a wallet application is a light node.
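
The light-node check described above (verifying returned data against a Merkle root taken from a block header) can be illustrated with a minimal sketch. It is not MSTDB's implementation: the function names are hypothetical, a single SHA-256 is assumed as the hash function, and an odd level is padded by duplicating its last hash, which is one common convention.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash helper (assumption: one round of SHA-256)."""
    return hashlib.sha256(data).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    """Pair adjacent hashes and hash each pair to form the parent level."""
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Root computed by a full node over a non-empty list of transactions."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad an odd level with its last hash
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag says the sibling is on the left."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1  # the other member of the pair
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Light-node side: recompute the root from the leaf and the proof."""
    acc = h(leaf)
    for sibling_hash, sibling_is_left in proof:
        acc = h(sibling_hash + acc) if sibling_is_left else h(acc + sibling_hash)
    return acc == root


if __name__ == "__main__":
    txs = [b"tx0", b"tx1", b"tx2"]
    root = merkle_root(txs)            # stored in the block header
    proof = merkle_proof(txs, 2)       # supplied by a full node
    print(verify(b"tx2", proof, root))       # genuine transaction
    print(verify(b"tampered", proof, root))  # modified transaction
```

The light node only needs the header root and a proof whose length is logarithmic in the number of transactions, which is why it does not have to store block bodies.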

3.2 Query Models

A document is the basic unit of our hybrid storage system; it is a text file composed of words and sentences. A document has some attributes that can be queried: keyword, size, etc. The full nodes in MSTDB provide the light nodes with the following query functions.

• A multi-keyword query is in the form of a keyword tuple <key1, key2, key3> connected with connectives such as AND and OR. It is sometimes referred to as a Boolean query. For example, “Fruit” ∧ (“Apple” ∨ “Strawberry”) means the user wants to obtain results that contain fruit together with both apple and strawberry or at least one of them.
• A range query finds documents whose size is in a range [S_i, S_j]. For instance, the user may want to search for a document whose size is between 5 and 10 KB, and his/her input is like [5, 10].
• In a Top-K query, the user inputs a document and wants to find the first k documents most similar to that document. Continuing the supply chain example, the user may input a fruit list containing information about his ideal fruits, and the system returns the k search results most similar to his input.
• Cross-chain retrieval is needed when the data being searched exists in multiple chains, and the user can choose on which chains to perform the cross-chain query. For example, if a user wants to search for some fruit and vegetable providers and these two companies are on two different chains, the user can call the cross-chain query API.

3.3 System Goals

Due to the decentralized nature of the blockchain system, there is no trusted node responsible for the unified administration of the entire system. The completeness and correctness of the query result are the major concerns of our scheme. A malicious full node may tamper with the index on the chain, and the full node performing the query may also return incomplete or misleading results to the light node. The superiority of our system lies mainly in two security-related indicators, i.e.,

• Correctness: whether the expected query results are returned;
• Completeness: whether all the query results are returned.

Besides, a query system must ensure that users can quickly find search results without adding too much additional space overhead. Therefore, the performance of our system is reflected in the following two aspects, i.e.,

• Response time: the time consumption for various queries;
• Index Storage: the storage overhead of the index structure.

4 ARCHITECTURE

4.1 Architecture Overview

Our system architecture consists of three parts: off-chain storage, on-chain index processing and query APIs. The architecture of our system is illustrated in Fig. 1.

Fig. 1. System architecture.

• Off-chain storage: This part represents the data storage and devices off the blockchain, and their storage medium can be a traditional database, cloud storage, IPFS, etc. Most of the original data are kept by a single organization, and they can establish a relationship with the data on the chain through a mapping method. The details of hybrid storage will be covered in Section 4.2.1.
• On-chain index processing: Miners in the system maintain a global index structure by processing transactions, and update it as blocks are added. See Section 4.2.2 for details.
• Query APIs: This part is responsible for data query, including a variety of analytical query functions. All full nodes can call these functions locally, and provide authenticated query functions for light nodes. In Section 4.3, we introduce the specific query processes.

The life cycle of off-chain data such as documents has three steps: 1) Each data owner stores a number of documents. The details are shown in Section 4.2.1. 2) To upload one of its documents to the blockchain, a data owner uploads the metadata of the document by submitting a transaction to the blockchain. The details are shown in Section 4.2.1. 3) Data owners can update the contents of their documents, and they submit a transaction with updated metadata to the blockchain network.

4.2 Storage and Index

In this section, we first give an overview of the storage and index of MSTDB. This part consists of three main components: hybrid storage, information extraction, and on-chain index. We give a brief introduction to these three components.
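
Before the storage details, the query forms defined in Section 3.2 can be made concrete with a small sketch. It is only an illustration over a toy in-memory inverted index: the `Doc` type and function names are hypothetical, no authentication or on-chain index is modelled, and Jaccard similarity over keyword sets is an assumed stand-in for the similarity measure used for Top-K ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    size_kb: float
    keywords: frozenset


def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = {}
    for d in docs:
        for kw in d.keywords:
            index.setdefault(kw, set()).add(d.doc_id)
    return index


def boolean_query(index, must, any_of):
    """Evaluate must AND (any_of[0] OR any_of[1] OR ...)."""
    either = set()
    for kw in any_of:
        either |= index.get(kw, set())
    return index.get(must, set()) & either


def range_query(docs, low_kb, high_kb):
    """Documents whose size lies in the closed range [low_kb, high_kb]."""
    return {d.doc_id for d in docs if low_kb <= d.size_kb <= high_kb}


def top_k(docs, query_keywords, k):
    """The k documents most similar to the query (assumed Jaccard similarity)."""
    q = frozenset(query_keywords)

    def jaccard(d):
        union = q | d.keywords
        return len(q & d.keywords) / len(union) if union else 0.0

    # Sort by descending similarity; ties are broken by document id.
    ranked = sorted(docs, key=lambda d: (-jaccard(d), d.doc_id))
    return [d.doc_id for d in ranked[:k]]


if __name__ == "__main__":
    docs = [
        Doc("d1", 4, frozenset({"fruit", "apple"})),
        Doc("d2", 7, frozenset({"fruit", "strawberry"})),
        Doc("d3", 9, frozenset({"fruit", "pear"})),
        Doc("d4", 12, frozenset({"apple"})),
    ]
    index = build_inverted_index(docs)
    print(boolean_query(index, "fruit", ["apple", "strawberry"]))
    print(range_query(docs, 5, 10))
    print(top_k(docs, ["fruit", "apple"], 2))
```

In this toy setting the Boolean query mirrors the “Fruit” ∧ (“Apple” ∨ “Strawberry”) example and the range query mirrors the [5, 10] KB example; a real deployment would answer them from the on-chain index and attach proofs for the light node.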

Hybrid Storage. Our storage pattern is a hybrid on-chain and off-chain storage. Most of the data are stored in off-chain spaces such as distributed databases (HBase, Redis, etc.) and IPFS. Off-chain data could be unstructured, semi-structured, or relational. Section 4.2.1 describes some data structures and storage media of blockchain.

Information Extraction. To establish a semantic mapping between the on-chain and off-chain data, we propose an off-chain information extraction algorithm. Through the algorithm, the most valuable meta-data can be extracted from the huge off-chain data and be sent to the blockchain in the form of transactions. This meta-data is the basic component of the on-chain index structure. The detailed information extraction algorithm is described in Section 4.2.2.

On-Chain Index. An on-chain index structure (IS) is the core of MSTDB, and it is established in the block body. Once the extracted meta-data is sent to the blockchain in the form of transactions, the full nodes jointly process them and update the IS together. Since the index structure is updated as transactions are added, it is always up to date in the latest block. Furthermore, our index structure is compatible with various blockchain systems because it does not rely on any specific underlying database. We will elaborate the specific design and the generation process of our index structure in Section 4.2.3.

4.2.1 Hybrid Storage

The storage model of MSTDB is a hybrid on-chain and off-chain architecture. Off-chain data are managed by private data owners, and stored in distributed devices like distributed databases or IPFS. Since the off-chain data are stored in private spaces and their copies do not need to be synchronized across the whole network, the storage cost of off-chain data is not very high. We focus on the storage cost of the on-chain part. A blockchain is a data structure which is linked from one block to another. Each block contains two parts: the block header and the block body. In the block header, a core datum called the hash pointer determines the context of a block and ensures that the block data cannot be tampered with.

The transaction is the basic data content in a blockchain system, and a series of transactions is stored in the block body. All transactions have a standard format, and every transaction has a transaction hash through which we can locate the transaction quickly. Inside the block body, a number of transactions are arranged to form a Merkle tree. A Merkle tree is a hash-based tree, which uses layers of hashing to ensure that the order of transactions in a block is not tampered with. Bitcoin uses a Merkle tree to store transactions, and puts the root hash of the Merkle tree into the block header. Unlike Bitcoin, Ethereum is an account-based model, known as a transaction-driven state machine. It uses the Merkle Patricia Trie (MPT) [2] to store states such as account state or smart contract state. Each MPT leaf node is an account state, while the branch stores the compressed prefix of the account address. The miner updates the states of the accounts in the MPT according to the transactions in the block. Therefore, the latest states of all accounts are stored in the latest block, with the root hash of the MPT in the block header.

All on-chain data including block headers and block bodies are stored in the database of the blockchain system, such as LevelDB or RocksDB. But what is stored in the database is not the original on-chain data, since all data need to be serialized before being stored in the database. For example, Ethereum uses the Recursive Length Prefix (RLP) [35] encoding for network transmission and persistent storage. After the encoding process, the on-chain data are stored physically in devices such as disks. The write efficiency of the underlying database of the blockchain is extremely high, which is mainly due to the sequential writing scheme of the LSM-Tree [36]. However, because of the low transaction throughput, this advantage of fast writing is not reflected in the existing blockchain systems.

The data owners need to upload the off-chain data like documents to the blockchain by sending transactions, and the transaction form is <Tid, SendId, docID, metadata>, where Tid is the unique identity of the transaction, SendId is the address of the transaction initiator, docID is the unique identity of the document, and metadata contains the keyword list and weights of the document.

4.2.2 Information Extraction

The goal of information extraction is to establish the semantic mapping and tamper-proof ability between the on-chain and off-chain data, which includes two steps, i.e., meta-data extraction and hash binding. Note that the basic meta-data (e.g., data location, data size) will be directly provided by the data owner, thus we focus on the extraction of the semantic keywords in this section. For backward compatibility, the meta-data is placed in the “data” field of transactions, which exists in most blockchain systems.

In the search engine field, TF-IDF [37] is a classic keyword extraction algorithm which can collect words that appear frequently. To apply this method to blockchain systems, we propose an improved TF-IDF model called Decentralization Semantic Extraction (DSE). DSE improves the precision of semantic feature extraction by introducing decentralized word frequency, part-of-speech and position factors to address the lack of semantic understanding in traditional keyword extraction algorithms. Through the DSE model, we build an inverted index [38] of semantic keywords.

The main purpose of the TF-IDF model is to find the words which appear frequently in one document and rarely appear in the entire document collection. Given a word i, a document collection N, and an individual off-chain data j ∈ N, the TF-IDF can be calculated by

$$\mathrm{TF\text{-}IDF}_{i,j} = f_{i,j} \times \log\left(\frac{N}{df_i + 1}\right), \tag{1}$$

where f_{i,j} is the frequency of keyword i in the data j, and df_i is the number of data items in N that contain the keyword i.

A problem of the TF-IDF model is that the IDF part only counts the number of documents that contain a word but does not consider the word frequency difference between them. In some cases, if a word has a lower frequency in a document but does not appear in other documents, its importance for document classification is higher than that of another word that has a higher frequency in this document

but also appears frequently in other documents. For example, if there are 10000 documents in the system, word 1 appears 10 times in document 1 and does not appear in other documents, and word 2 appears 1000 times in all documents. The TF-IDF values of words 1 and 2 in document 1 are 40 and 1000, but in fact word 1 has a greater discriminating ability for document 1. To solve this problem, we propose a calculation method called Document-wise Term Frequency Factor (DTFF) to decentralize the frequency of keywords by

$$\mathrm{DTFF}_{i,j} = 2^{N_{i,j} - \bar{N}_i}, \qquad (2)$$

where $N_{i,j}$ is the number of occurrences of keyword i in data j, and $\bar{N}_i$ is the average number of occurrences of keyword i in all data. The weight of exclusive words can be increased through the decentralization of DTFF. Following the above example, through the DTFF model, the weight of word 1 is increased to 40960 (its TF-IDF value of 40 multiplied by $2^{10}$, since $\bar{N}_1 = 10/10000 \approx 0$), and the weight of the highly distinctive word 1 is highlighted.

In addition to term frequency, the position of a word in a document is also very important. For example, a word at the beginning or end of an article often has greater significance than one in the middle of the article. We arrange the positions where the keywords first appear in the data into a sequence with the number of words as the total length and 1 as the unit scale, and we choose the middle point as the initial coordinate. We assign the same weight to the words in a small range on both sides of the center point (initial coordinate). Outside the range, we measure the distance from the initial coordinate to the word and calculate a weight factor affected by distance. The philosophy we follow is that the longer the distance, the greater the weight. The Position Factor is defined as

$$\mathrm{PF}_{i,j} = \begin{cases} 1, & d < \delta \\ \varepsilon \cdot \dfrac{d}{\delta}, & d \ge \delta \end{cases} \qquad (3)$$

where $\varepsilon$ is an increased weight multiple, $\delta$ is the half-width of the equally weighted range around the center point, d is the distance of the word i from the initial coordinate, $d \in [0, L/2)$, and L is the length of the sequence.

Furthermore, we draw lessons from the scheme of speech classification priority based on statistical methods [39], which calculates the weight of part of speech by using a fuzzy hidden Markov model. Using this method, we define the weights for keywords of different parts of speech as follows

$$\mathrm{Pos}_i = \begin{cases} 0.9, & \text{n.} \\ 0.8, & \text{v.} \\ 0.3, & \text{pron.} \\ 0.5, & \text{adj.} \\ 0.5, & \text{adv.} \\ 0.4, & \text{num.} \\ 1, & \text{idiom} \end{cases} \qquad (4)$$

Then, the weight of the feature word is

$$W_{i,j} = \mathrm{TF\text{-}IDF}_{i,j} \times \mathrm{DTFF}_{i,j} \times \mathrm{PF}_{i,j} \times \mathrm{Pos}_i. \qquad (5)$$

Hash binding is the process of storing the hash value calculated by the one-way hash function of the off-chain data on the blockchain. It guarantees that if the data owner modifies the raw off-chain data, the mapping relationship between on-chain and off-chain data will be broken. To expand the coverage of the hash function, we add the meta-data to its input parameters, so the hashing process takes the form $\mathit{hash} = H(\text{raw data} \mid \text{meta-data})$. Besides, if some off-chain data are deposited in IPFS or another decentralized storage network, our system can be compatible with their hash results such as the Merkle DAG root.

After the information extraction process of off-chain data, we create a table called inverted index to store the index information of every keyword. An inverted index often consists of a series of inverted lists whose format is like <key, docId1, docId2, ...>, where docId is the unique identifier of a document. Different from the traditional inverted index, we use the block number instead of docId to represent the appearance position of the keyword. A problem with this modification is that if two or more transactions containing the word appear in the same block, we need to traverse the entire block to ensure the integrity of the query. The benefits of this design will be explained in Section 4.3.1. For any inverted list, all block numbers are sorted in descending order of average weight. Moreover, although the historical record cannot be changed due to the immutability of the blockchain, the weight of each word in MSTDB can be updated by the blockchain nodes' consensus. The data owners can submit a transaction with updated metadata to the blockchain network to ask the blockchain nodes to update the weight of a word.

4.2.3 MST Tree
There are some search methods in blockchain systems, like searching a block by hash, searching a transaction by hash, searching a block by block number, etc. However, there are no semantic meanings included in the input parameters of these methods. It is inefficient to process a query request if a user wants to query some information by keywords, because the node must scan all blocks one by one. In our scheme, an IS is established in the block body. This IS is jointly maintained by the miners and is always up to date in the latest block. The index structure is not a special component in the blockchain system; it is kept in the database as a structure and loaded into the memory when it is used. Since it does not depend on any specific underlying database, our index structure is compatible with various blockchain systems.

An inverted index allows fast real-time queries on a single keyword, but it does not provide efficient multiple-keyword queries. To overcome this challenge, we construct a novel index structure called Merkle semantic trie (MST) that integrates the inverted index, Merkle tree, prefix compression, and hash pointer techniques.

MST is a tree-shaped data structure, and there are three types of nodes in the MST: branch nodes, extension nodes, and leaf nodes. Different nodes have different features in the MST structure, and a node can only be of one type. Branch nodes of MST have a duty analogous to that in a trie: they only store index-related information and provide pointers to the next-level nodes. Leaf nodes always represent the end of a search path and store the query results. The result is not the specific information but a hash pointer which directs to the
Fig. 2. MST structure.

block where the corresponding transaction is located. The extension node is a multi-functional node of MST. It contains not only the index-related information, but also the query results and pointers. This design means that a search that reaches this node has a definite search result in this node. To continue searching for more detailed and accurate results, the search can enter the next level through the pointer and explore more deeply. The root node is the entrance to MST. If there is only one node in MST, the root node serves as a leaf node; otherwise it is a branch node which has child nodes.

Algorithm 1. MST Node Insert
Input: info as indexInfo (blockNumber and a list of keywords <key_m>), node as an MST node
 1: ln ← len(node.key)
 2: li ← len(info)
 3: while ∃i : i ≤ ln and i ≤ li do
 4:   foreach key_i ∈ info.key, nodekey_i ∈ node.key do
 5:     if key_i ≠ nodekey_i then
 6:       go to split
 7: if li < ln then
 8:   node ← extendNode
 9:   go to split
10: if li = ln then
11:   Switch: node type
12:     Case: leaf ‖ extend
13:       node.value.append(blockNumber)
14:       break
15:     Case: branch
16:       node.value.append(blockNumber)
17:       node ← extendNode
18:       break
19: if li > ln then
20:   if node.type = leaf then
21:     node ← extendNode
22:   node.child.append(info[i : li])
23: split: node.child.append(info[0 : i]) and node.child.append(info[i : li])

To construct the global on-chain index structure, miners extract the keyword information from transactions. The format of the information we need in the transaction is <blockNumber, key1, key2, ...> pairs, and the keyword list has been sorted according to the weight in the inverted index. MST is created when the genesis block is built, and it moves to another state after all of the transactions in a block have been handled. The initial MST has only a root node, and every index pair in a transaction needs to be inserted from the root node.

After transactions are inserted into MST, blockchain nodes need to update its keyword set. In an MST, each leaf node or extension node can be expressed as a boolean expression. For example, in Fig. 2, the boolean expression of leaf node (a) is {key1 ∧ (key3 ∧ key2)}. Moreover, boolean expressions corresponding to multiple leaf nodes or extension nodes can be combined. For example, in Fig. 2, the combination of boolean expressions of leaf nodes (a) and (b) is {key1 ∧ ((key3 ∧ key2) ∨ key5)}.

Besides, the keyword list of a query can also be represented in Boolean form like q = {key1 ∧ key3 ∧ key2}. There may be some ambiguity in the order of keywords. We assume that a combination of leaf nodes or extension nodes denoted by {"blockchain" ∧ ("bitcoin" ∨ "ethereum")} is stored in the MST. Then, if a query is {"ethereum" ∧ "blockchain"}, the MST cannot return any result. However, our DSE model can eliminate the ambiguity since we can sort the input keyword list by weight, with keywords of larger weights ranked ahead, preventing recursive insertion of the keyword combinations.

The main consideration of the MST path borrows the common prefix concept of the trie. Upon an insertion of a keyword list, the data with the same prefix keyword combination is stored on the same path. To further reduce the index size, we compress paths with only one child node, which is like the Patricia trie. Fig. 2 shows the specific structure of MST. Because MST has Merkle tree characteristics, a hash value needs to be calculated for each node of MST. Different types of nodes have different inputs to the hash function: for a leaf node, the input parameters of the hash function are the keywords and the block pointers which direct to the block and are stored in the leaf node; the hash value of a branch node includes the hash result of the keyword and its child nodes; as for an extension node, its hash value is the hash result of the keywords, block pointers, and child nodes. The role of the node hash is not only to prevent the content of the node from being tampered with, but also to establish a logical connection between the parent node and the child node. The
Fig. 3. After adding the B+ tree feature to MST.

parent node and the child node are not physically connected in the database, but the child node stores the hash value as the key in the key-value database, and the parent node finds the child node through the hash value of the child node. The root hash value saved by the root node is stored in the MSTrootHash field of the block header.

In Algorithm 1, we perform an insert operation of a node in MST, including the node type conversion operation and the split operation. There are three situations to consider here when a new keyword list appears. We describe the three different operations separately and how the node splits.

4.2.4 MST-B+ Tree
Besides the keyword query, we also focus on range queries on blockchain. The Merkle-B+ tree is a common ADS that is usually used to support authenticated range queries in outsourced databases. The MST-B+ tree is a combination of the MST keyword trie and the Merkle-B+ tree, which can support both the authenticated keyword query and the authenticated range query. The combining process consists of two steps. The first step is the sorting process. As shown in Fig. 3, the size information of the off-chain documents is added in the leaf nodes and extension nodes in MST, but they are unordered. To put them in order, the blockchain nodes sort them by their size values and add small-to-large hash pointers between them. Through the ordered hash pointers, all leaf nodes and extension nodes form an ordered linked list. In the second step, after the ordered linked list is constructed, the blockchain nodes add the index information into the branch nodes and leaf nodes to let them point to the documents whose sizes are in an ordered range.

The MST part and the B+ tree part of the MST-B+ tree are maintained and updated independently. In an MST, each extension node or leaf node has two pointer areas (blue area in Fig. 3) that store pointers to the document corresponding to the keyword query and the document corresponding to the range query, respectively. For example, in Fig. 3, the two pointers (red dotted lines) of leaf node (a) point to two different documents with keyword set {key1, key2, key5} and document size 3 KB, respectively.

4.3 Query Processing
After adding the index structure on the blockchain, our MSTDB system can provide multiple authenticated queries. In this section, we introduce the query processes of four methods.

Fig. 4. A structure of Multiple keywords Aho-Corasick automaton.

4.3.1 Authenticated Multi-Keyword Query
According to the definition given in Section 3.2, a multi-keyword query needs to match the keyword combination entered by the user with the index path in MST, and return all results that meet the conditions. We refer to many string-matching algorithms such as KMP (Knuth-Morris-Pratt), and propose a multi-keyword matching structure named Multiple keywords Aho-Corasick automaton (MKACA). In our design, MKACA is another form of MST and can always be kept in the memory of the blockchain node. It supports incremental updates and is updated with the MST update in each consensus round of the blockchain.

As shown in Fig. 4, the initial state of MKACA corresponds to the root node of MST, and the receiving states are represented as double circles, which correspond to the extension nodes or leaf nodes. All the state transition functions in MKACA correspond to index keywords on different branch paths of MST. For example, q0 in Fig. 4 represents the root node in Fig. 2, q1 represents the extension node with key1, and the two states q2 and q3 together represent the leaf node with key2 and key3, because <key1, key2, key3> and <key1, key3, key2> are two different matching paths.

MKACA can provide the authenticated multi-keyword query for MSTDB. The inputs of authenticated multi-keyword queries include a keyword list and the root node of the MST. First, the MST is converted to a corresponding MKACA, and the keyword list is transformed to a CNF and starts its matching process in the initial state of MKACA. Then, the blockchain nodes match the keywords of the CNF in order with the different paths of MKACA. If the operator is conjunctive, the matching process searches the MKACA in depth, while if the operator is disjunctive, the matching process searches the MKACA in breadth. If the CNF has been traversed and the matching process has reached a receiving state, the blockchain nodes return the value of the node corresponding to this state. Otherwise, the blockchain nodes return an empty result. For example, in Fig. 4, if a query input is key1 ∧ (key3 ∨ key2), the state goes from q0 to q1, and from q1 to q2 and q3, but neither q2 nor q3 is a receiving state, so the matching process fails. However, if the input is key1 ∧ key3 ∧ key2, the state goes from q0 to q1, q1 to q3, and finally to q9. Since q9 is a receiving state of MKACA, the matching process succeeds.
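The matching rule above can be illustrated with a minimal sketch. It is not the paper's MKACA implementation: it assumes a plain, uncompressed keyword trie in which a node is a receiving state when it stores at least one block number, and the names `Node`, `insert`, and `match_cnf` are hypothetical. Each clause of the CNF is a list of alternative keywords; moving from one clause to the next descends one level (conjunction), while the alternatives inside a clause are followed side by side (disjunction).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # keyword -> child node (a state transition in the automaton view)
    children: dict = field(default_factory=dict)
    # block numbers stored here; a non-empty list marks a receiving state
    blocks: list = field(default_factory=list)


def insert(root, keywords, block_number):
    """Insert a weight-sorted keyword list and record the block number at its end."""
    node = root
    for kw in keywords:
        node = node.children.setdefault(kw, Node())
    node.blocks.append(block_number)


def match_cnf(root, clauses):
    """Match a query given as a list of clauses; each clause is a list of
    alternative keywords. Returns the block numbers of the receiving states
    reached after the last clause, or an empty list if the match fails."""
    frontier = [root]
    for clause in clauses:
        next_frontier = []
        for state in frontier:
            for kw in clause:  # disjunction: try every alternative in breadth
                child = state.children.get(kw)
                if child is not None:
                    next_frontier.append(child)
        if not next_frontier:
            return []
        frontier = next_frontier  # conjunction: go one level deeper
    results = []
    for state in frontier:
        results.extend(state.blocks)
    return results


if __name__ == "__main__":
    root = Node()
    insert(root, ["key1", "key2", "key3"], 7)  # block numbers are made up
    insert(root, ["key1", "key3", "key2"], 9)
    print(match_cnf(root, [["key1"], ["key3"], ["key2"]]))
    print(match_cnf(root, [["key1"], ["key3", "key2"]]))
```

With the two paths inserted in the usage part, the query key1 ∧ key3 ∧ key2 ends in a receiving state, while key1 ∧ (key3 ∨ key2) stops at two intermediate states that store no result, which mirrors the two cases described for Fig. 4.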
The outputs of an authenticated multi-keyword query are the multi-keyword query result R and its verification object (VO). The correctness and completeness of the results can be guaranteed by the Merkle proof in VO. Clients can verify the correctness of the results by recomputing the hash value of the Merkle root and comparing it with the record in their local block header. The completeness can be guaranteed because one query only returns one related node in MST, and the result cannot be maliciously omitted to pass the validation.

4.3.2 Authenticated Range Query
By combining the B+ tree and MST, our system can provide verifiable range queries in the hybrid storage architecture. Range query is a common query operation in the database field, which finds information whose value is between an upper and a lower boundary. Using the B+ tree feature, it can find the values in the input range very stably. For a verifiable range query, the full node searches the MST-B+ tree to find the first record in the range. Then it scans the leaf nodes from the first node until the right boundary of the query range. The Merkle proof is packed into the VO and returned to the light node together with the query result.

For authenticated range queries, the query results are a range of MST-B+ tree nodes, thus the blockchain nodes need to return the query results and the Merkle proofs of the two points at the front and back ends of the range. The correctness and completeness of the results can be guaranteed by the Merkle proof. Clients can verify the correctness of the results by recomputing the hash value of the Merkle root and comparing it with the record in their local block header. The completeness can be guaranteed because the Merkle proof of the B+ tree can prove that all in-scope results are returned.

4.3.3 Authenticated Top-K Query
Users often want to find the documents most related to their input. To meet the search needs of users, the basic solution is to return all query results through the index and let the searchers find the best document by themselves. However, most of the query results are not highly matched to user needs, so users also spend time filtering the results, which is inefficient and has become a bottleneck of the query process. Top-K query is a classic optimization technique that reduces the query time cost. The reason why Top-K query can improve query efficiency is that it has a key step called early termination. Early termination first arranges the inverted list such that the most promising documents appear early, and stops the query process as soon as the top K results appear. Besides early termination, there are other effective techniques such as skipping within lists, omitting lists, or scoring only partially. Top-K query calculates the relevance of the user input to candidate documents, and returns the top K results ranked from high to low score. For score calculation, the index should be traversed, and there are two traditional index traversal techniques as follows

• Term-At-A-Time (TAAT) [40]: Query service providers first access one term (or keyword) of the query request, and calculate the product of the weight of all documents (or transactions) in the list of the word in the inverted index and the weight of the word in the query request. Then, for each other keyword in the query request, they use the same similarity calculation method and add the product to a sum. Miners use a temporary memory to keep track of currently active Top-K candidates.
• Document-At-A-Time (DAAT) [41]: Each document related to the query input can be scored independently. Miners use the simhash algorithm or cosine distance to calculate the similarity between two documents. DAAT stores the current Top-K results in a simple heap structure.

Compared to TAAT, DAAT is a more space-saving solution and is faster due to multi-thread parallel operations. However, DAAT rarely terminates early, because no preliminary overall judgment can be made about the next document. Many optimized approaches for early termination use TAAT, but in a blockchain system, the cost of temporary storage space cannot be ignored, so we improve DAAT's early termination conditions.

Algorithm 2. Top-K Query Based on BAAT
Input: Q as a query request; B as the blockchain
Output: R as Top-K query result
 1: Create a min-k heap h[k] and h[0] is the minimal value
 2: foreach block b_i ∈ B do
 3:   if b_i.avgWeight × b_i.transNumber ≤ h[0].value then
 4:     continue
 5:   else
 6:     foreach doc ∈ b_i do
 7:       if cal[doc, Q] > h[0].weight then
 8:         h[0] = doc
 9:         h.rebalance()
10: return h;

In our scheme, we propose a more efficient index traversal pattern called Block-At-A-Time (BAAT), developed from DAAT. Same as DAAT, BAAT traverses the blockchain by calculating weights for all documents one by one and uses a min-k heap to store the temporary top k results. In addition, BAAT adopts a block-based early termination scheme to skip all documents in a block without calculating the weights of the documents, which allows it to get query results faster than DAAT. In BAAT, each block has a pre-calculated average score table containing the average weight of all keywords in the block. For each keyword k_i in block b_i, the average weight of k_i is the average DSE weight of k_i. When a query request is received, the blockchain node that receives the request traverses the chain from the latest block to the genesis block and keeps a min-k heap to store the temporary results. All documents in a block are compared with the query to calculate a similarity between them, and the min-k heap always keeps the top k relevant documents. The early termination scheme of BAAT is as follows: assume that there is a document in the block that has the highest relevance to all the keywords in the query; if the relevance of this document is still less than that of the document at the head of the heap, the whole block can be skipped.
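The block-level early termination can be sketched as follows. This is an illustration under stated assumptions rather than the MSTDB implementation: documents are plain dictionaries of non-negative keyword weights, a document's score is assumed to be the sum of its weights for the query keywords, and the bound for a block is the sum of the average weights of the query keywords multiplied by the number of documents in the block. The helper names (`make_block`, `doc_score`, `block_bound`, `baat_top_k`) are hypothetical.

```python
import heapq


def make_block(number, docs):
    """Build a block with a precomputed table of average keyword weights."""
    totals = {}
    for doc in docs:
        for kw, w in doc["weights"].items():
            totals[kw] = totals.get(kw, 0.0) + w
    n = max(len(docs), 1)  # avoid division by zero for an empty block
    avg = {kw: t / n for kw, t in totals.items()}
    return {"number": number, "docs": docs, "avg_weight": avg}


def doc_score(doc, query):
    """Assumed relevance: sum of the document's weights for the query keywords."""
    return sum(doc["weights"].get(kw, 0.0) for kw in query)


def block_bound(block, query):
    """Upper bound on any document score in the block (weights must be >= 0):
    average weight times document count is the total weight per keyword."""
    return sum(block["avg_weight"].get(kw, 0.0) for kw in query) * len(block["docs"])


def baat_top_k(chain, query, k):
    """Traverse from the latest block to the genesis block, skipping blocks
    whose bound cannot beat the current k-th best score."""
    if k <= 0:
        return [], 0
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest candidate
    skipped = 0
    for block in reversed(chain):
        if len(heap) == k and block_bound(block, query) <= heap[0][0]:
            skipped += 1
            continue
        for doc in block["docs"]:
            item = (doc_score(doc, query), doc["id"])
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item[0] > heap[0][0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True), skipped


if __name__ == "__main__":
    chain = [
        make_block(0, [{"id": "a", "weights": {"chain": 1.0}},
                       {"id": "b", "weights": {"chain": 2.0}}]),
        make_block(1, [{"id": "c", "weights": {"chain": 8.0, "trie": 2.0}},
                       {"id": "d", "weights": {"trie": 4.0}}]),
        make_block(2, [{"id": "e", "weights": {"chain": 16.0}},
                       {"id": "f", "weights": {"chain": 4.0, "trie": 8.0}}]),
    ]
    print(baat_top_k(chain, ["chain", "trie"], 2))
```

The bound is only safe when all weights are non-negative, because only then does the per-keyword total in a block dominate the weight of any single document. Blocks are skipped only once the heap already holds k candidates, so the first k documents are always scored.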
Fig. 6. MST compression process.

Fig. 5. Structure of VC-based MST.

The detailed algorithm of BAAT is shown in Algorithm 2. In Algorithm 2, the inputs contain a query request Q, which includes a keyword list <key_1, key_2, ..., key_t>, and a blockchain B. First, the blockchain node creates a min-k heap h[k] whose head is h[0] (line 1). Then, for each block b_i in B, the blockchain node calculates the average weights of all keywords in the keyword list and the possible maximum value of the block. If the maximum value is lower than the value in h[0], the block is skipped (lines 2-4). If not, the blockchain node calculates the similarity of all documents in the block with the query and updates the heap if there are more relevant documents. Finally, the blockchain node returns the heap.

For authenticated Top-K queries, since the blockchain nodes need to traverse the entire chain to iterate over all documents, it is too expensive to send the data of the entire chain to the client for verification. Therefore, we adopt a consensus-based verification method. The client needs to send a query transaction to get the top k results, and the query results are returned with a transaction receipt. Since the query transaction is executed by all blockchain nodes, the correctness and completeness can be guaranteed by the consensus mechanism.

4.4 Vector Commitment-Based MST-B+ Tree
In update-intensive cases, the poor scalability of our MST-B+ tree is a problem. To overcome the scalability challenge, motivated by [33] and [42], we propose a cryptographic vector commitment (VC)-based scheme refinement. We explain the refinement and the new experiment in detail as follows.

VC is a cryptographic structure that can commit an entire array V of data with a short commitment c. A user who has c can verify V or any item V[i] in V through a proof provided by a prover who maintains V. To generate the commitment for a vector of size B, the prover needs to generate the security parameters first, and the parameters can be reused by other vectors of the same size B. However, if the prover wants to generate a commitment for a vector of another size, it needs to generate new security parameters. The process of generating security parameters is time-consuming.

The motivation of the new design is to update fewer nodes when inserting new nodes in the MST-B+ tree. Adding the VC scheme to MST involves the following three parts. First, as shown in Fig. 5, the blockchain nodes sort the leaf nodes and extension nodes in MST as in the MST-B+ tree, and divide the ordered list into multiple groups of the same size, each of which is an ordered vector, to reduce the generation time for the security parameters. Second, the blockchain nodes generate all commitments of the vectors, and store these commitments in the root node's child nodes. For example, in Fig. 5, c1 to c3 are three commitments of three vectors V1 to V3, and are stored in the three child nodes of the root node. Third, when some items need to be inserted into MST or updated, the blockchain nodes create new vectors or update the existing vectors and their commitments.

The VC scheme can support the same range query function as the Merkle B+ tree. In authenticated range query processing, blockchain nodes first search the root node's child list to find the group with the target range and find the result in the corresponding vector, then they return the result with a VO to the light node. The blockchain nodes generate the vector commitment proofs of the query results, and add the proofs, the relevant commitments of these proofs, and the Merkle proofs of these commitments to the VO. After receiving the result and VO, the client verifies the result with the proof and the commitment locally. By adding the VC scheme, MSTDB only needs to store the additional commitments of the grouped linked list, and each update of the digest value only leads to a one-time update of one commitment. Compared with the Merkle B+ tree, the time complexity of VC updating is reduced to O(1).

5 DESIGN REFINEMENTS FOR PERFORMANCE
5.1 Index Compression
Although our hybrid storage architecture can alleviate the storage overhead on the chain, with the data continuing to grow, the size of the index structure can still place an unacceptable burden on the node. To allow lower storage cost and faster access to the index on disk or other devices, and to limit the memory space needed, we present an index compression approach for MST. The main problem with the expansion of MST is that too many nodes store the same keyword, because there are many intersections between keyword combinations. To reduce the repetition rate of nodes, we propose a method based on F-B pointers to merge those nodes that contain the same key.

F-B Pointer. We use a two-tuple structure called F-B pointer to achieve the compression, where F-B means front and back. The contents of an F-B pointer are two hash values which direct to the parent node and the child node on the search path. After adding the F-B pointers, the space complexity of MST can be reduced to O(n), and this allows MST to ensure availability in scenarios with massive data.
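The repetition that motivates this compression can be measured with a short sketch. It does not implement F-B pointers; it only compares the number of nodes in an uncompressed keyword trie with the number of distinct keywords, under the assumption that a merged structure keeps one node per distinct keyword. The function names and the sample paths are hypothetical.

```python
def count_trie_nodes(paths):
    """Number of nodes (excluding the root) in an uncompressed keyword trie:
    one node per distinct non-empty prefix of the inserted keyword paths."""
    prefixes = set()
    for path in paths:
        for depth in range(1, len(path) + 1):
            prefixes.add(tuple(path[:depth]))
    return len(prefixes)


def count_distinct_keywords(paths):
    """Number of nodes left if all nodes holding the same keyword are merged
    (assumption: one merged node per distinct keyword)."""
    return len({kw for path in paths for kw in path})


if __name__ == "__main__":
    # Hypothetical weight-sorted keyword paths with overlapping keyword sets.
    paths = [[1, 2, 3], [1, 2, 6], [1, 5, 6], [2, 3], [2, 6], [5, 6]]
    print(count_trie_nodes(paths), count_distinct_keywords(paths))
```

In the sample, keywords such as 3 and 6 sit in several subtrees, so the trie stores them repeatedly. The merged count only shows how many nodes remain, not how the original paths are recovered; that is the job of the F-B pointers.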
Fig. 7. Blockchain with added node pointers. The dotted lines represent
the parent node pointers to the unchanged subtrees after the MST state
changes.

Fig. 8. Keyword Bloom Filter.


Fig. 6 is a case of MST before and after compression. In Fig. 6, the left part is an original MST, and 1 to 6 are different keywords stored in different nodes. In the original MST, there is only one path between any node and the root node. The right figure is the compressed MST. In the right figure, all the nodes with the same keywords are merged, and the double arrows with different colors represent the different F-B pointers on different subtrees in the compressed MST. For example, the red arrow with symbol 1 is <root, 1>, where the root node is its predecessor and the node with keyword 1 is its successor. In the compressed MST, the nodes connected by all F-B pointers of the same color can be mapped to a subtree of the root node of the original MST. For example, in the right figure, the nodes with keywords 1, 2, 5, 6, and (3,4) are connected by the red arrows, and they correspond to the leftmost subtree in the left figure. Moreover, the path from any node to the root node is connected by F-B pointers of the same color between them. For example, the three red arrows in the right figure with numbers 1, 2, 3 are three F-B pointers including <root, 1>, <1, 2>, and <2, 3>, each of which contains the two nodes before and after the pointer. Through the three F-B pointers, the node with keyword 6 can establish a link with the root node of the keyword path <1, 2, 6>.

Node Pointer. Another effective way to reduce the storage space of the index structure is the node pointer. Since MST contains all the index information on the blockchain and always keeps the newest state in the last block, there are many different versions of MST as the number of blocks increases. Streamlining the index structure that is not in the latest block is an obvious and effective means to save storage space. But to ensure the returnability and traceability of the blockchain during a fork, it is necessary to keep the complete evolution process of the MST structure. The core idea of our design is incremental modification of the index structure. Similar to Ethereum's state tree, the newer block only stores the changed part of MST and uses node pointers to direct to the subtrees that have not changed in MST, which are located in the previous block. One possible problem in this design is that if a subtree has not been updated for a long time, there may be too many node pointers which direct to it. Therefore, we allow the next block to reuse the same node pointer of the previous block. A blockchain with added node pointers is shown in Fig. 7.

5.2 Bloom Filter
The structure of MST results in a large change in the MST when a new keyword that has not appeared before is inserted. A bloom filter is employed to quickly determine whether a keyword to be inserted is a new keyword. Compared with the Bloom filter in the traditional blockchain, the main purpose of our structure is to improve the efficiency of keyword queries. We use bloom filters to pre-match user input to minimize the use of index structures. We map the keyword list in the inverted index into a bit vector called bitmap. For each keyword, we perform k different hash operations, and then set to 1 the bit in the bitmap whose position is the hash value modulo 128. When a new keyword key_i comes, we perform multiple hashing

$$\mathrm{verify}(key_i) = \prod_{i}^{k} \big(\mathrm{hash}_i(key_i) \cap 1\big). \qquad (6)$$

If the result is 1, the keyword is likely to be an existing keyword. If the result is 0, it can be determined that this keyword has never been stored in the system, so there is no need to access the index. The structure of the bloom filter is shown in Fig. 8. The bloom filter is not always completely accurate. It may produce false positives because two or more keywords may be mapped to the same bit of the bloom filter. Despite the presence of false positives, the efficiency of our system is still greatly improved after adding the bloom filter.

5.3 Cross-Chain Query
MST can effectively meet the needs of single-chain queries, but as we discussed before, sometimes the data required by some users may be on different blockchains. In a cross-chain application, different chains must be able to connect and communicate with each other [43], which is also called interoperability. To solve the cross-chain query problem, we assume that there is a series of different chains which all have MST in their blocks, and they need to query some data which are probably stored in different chains. To establish communication between multiple chains, we propose a cross-chain query committee scheme based on Merkle proof verification. The nodes in the committee are drawn from each blockchain in a certain proportion and replaced after a period of time.

DEFINITION 1: Given a blockchain set B = {b_1, b_2, ..., b_n}, where b_i is the number of nodes in the ith blockchain. Let C = Max(B), and c_i is the number of committee nodes of the ith blockchain. The number of nodes in each blockchain system is allocated as
$$
\begin{cases}
c_i = \dfrac{B_i}{\sum_{j=1}^{n} B_j}\, c \\[2ex]
c_m = c - \sum_{i \neq m} c_i
\end{cases}
\tag{7}
$$

where $c_m$ is the number of committee nodes of the blockchain that has the maximum number of nodes.

value, the harmonic average of precision and recall, which is calculated by

$$
F = \frac{1}{a \frac{1}{P} + (1 - a) \frac{1}{R}}, \tag{8}
$$

where P represents the precision, R represents the
The cross-chain query process includes six steps: (i) recall, and a is the proportion of P and R. In our
Each single blockchain extracts some nodes and the nodes experiment, we set a to 0.5 because we deem the pre-
of multiple chains together form a cross-chain query com- cision and the recall equally important. The higher
mittee. (ii) A node belonging to any blockchain system sends a the F-value, the stronger the overall retrieval ability
cross-chain query request to the node from the cross-chain of the system. (See Section 6.2.1)
query committee of its chain. (iii) The committee node that receives • Query response time: Query response time is the
the query request searches locally and broadcasts the query amount of time from when the request is sent until all results
to the entire cross-chain committee. (iv) After receiving the are received. Note that different types of queries and
request, nodes belonging to other chains in the cross-chain nodes have different response time calculation meth-
committee will query locally, and return the query result ods. For keyword query and range query, full nodes
and VO to the committee node of the request chain. (v) can query through the on-chain index locally, but
When the node that broadcast the query receives the query the light nodes need to connect to a full node and make
result, it downloads the block header information from the other an authenticated query through JSON-RPC or other
nodes that returned results. Then it verifies the query interfaces. Besides, cross-chain query needs to
query results through the Merkle proof. (vi) The verified specify which chains to retrieve, and its response
query result is sent to the node that originally submitted the time is the total time for queries on all blockchains.
query request. (See Section 6.2.2)
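The proportional committee allocation of Eq. (7) can be sketched as follows. This is a minimal sketch under stated assumptions: `c` is read as the total committee size, the per-chain shares are rounded down, and the chain with the most nodes receives the remainder so that the seats always sum to `c`. The rounding rule and the function name `allocate_committee` are assumptions, since the text does not specify them.

```python
def allocate_committee(chain_sizes, committee_size):
    """Split committee_size seats across chains in proportion to their node counts.

    chain_sizes[i] is the number of nodes in the i-th blockchain.
    Shares are floored (an assumption); the largest chain takes the remainder,
    mirroring c_m = c - sum_{i != m} c_i in Eq. (7).
    """
    total = sum(chain_sizes)
    largest = max(range(len(chain_sizes)), key=chain_sizes.__getitem__)
    seats = [size * committee_size // total for size in chain_sizes]
    seats[largest] = committee_size - sum(
        s for i, s in enumerate(seats) if i != largest
    )
    return seats


# Three chains with 100, 50, and 30 nodes sharing a 10-seat committee.
seats = allocate_committee([100, 50, 30], 10)
print(seats)
```

Assigning the remainder to the largest chain keeps the total fixed even when the proportional shares are not integers.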
• Indexing cost and authenticated query cost: Indexing
cost is composed of the time and space cost of con-
6 EXPERIMENT
structing the index structure. In MSTDB, the time
6.1 Experimental Setup cost is generated in the process of generating and
Dataset Description. Our experiments are done on three col- updating the index. The space cost is the extra stor-
lections: (i) Inspec: The Inspec collection contains the age space after adding the semantic information to
abstracts of 12684 scientific documents [44]. (ii) WWW: The block including the extra fields in the transaction
WWW dataset includes the abstracts and authors informa- and on-chain index structure. Besides, since light
tion of the papers collected from the World Wide Web Con- nodes can only connect to full nodes for authenti-
ference (WWW) published during the period 2004-2014, cated queries, they need to consume additional time
with 1330 documents [45]. (iii) Wiki20: Wiki20 consists of and space costs to verify the query results. (See
some English research reports covering different aspects of Section 6.2.3)
computer science [46]. Since the Wiki20 dataset involves dif- Benchmark. We adopt Ethereum1 as a benchmark in our
ferent aspects which have different data format, it is suitable experiment. Although Ethereum does not have the same
for cross-chain query evaluation. query functions (e.g., multi-keyword and range query func-
Environment. Our experiments are performed using tion) as MSTDB, we implement them in Ethereum by tra-
Ethereum’s golang client geth on four 64-bit Linux servers versing all transactions in all blocks on the blockchain.
(Ubuntu 18.04) with Intel core i9-10th CPU and 32 GB of
memory. 6.2 Experimental Results
Evaluation Metrics. For the system goals mentioned in 6.2.1 Precision and Recall
Section 3.3, we measure the following metrics
As discussed in Section 4.2.3, due to the prefix compression
• Precision and recall: To test the correctness and com- feature, the extension nodes may cause multi-keyword
pleteness of our scheme, we use precision to repre- query errors. We evaluate the precision, recall and F-value
sent correctness and use recall to represent of the multi-keyword query in MSTDB with different num-
completeness. Precision indicates the proportion of ber of keywords, and the results are given in Table 1. From
correct results in the query results, and it can be the result, we can find when there are more keywords pro-
calculated by the number of correct results divided vided to the multi-keyword query, the precision will be
by the number of all returned results. The higher higher, which denotes the correctness of the query result
the precision, the higher the proportion of correct will be improved. It is because when the user provides
results returned by the system. Recall is the frac- more keywords, the query accesses deeper layer and avoids
tion of the relevant documents that are success- querying the results from the extension nodes in MST thus
fully retrieved, and it can be calculated by the the results will be more related to the user’s input. Further-
number of correct results divided by the number more, more keywords can reduce the false positives in
of results that should be returned. The higher the Bloom filters, thus improving the precision. The recall and
recall, the higher the proportion of all correct F-value also increase when there are more keywords and
results returned by the system. Furthermore, to
trade off precision and recall, we also adopt F- 1. https://github.com/ethereum/go-ethereum

TABLE 1
Precisions, Recalls, and F-Values at Different Numbers of Keywords

Keywords            Precision   Recall   F-value
1                   0.733       0.549    0.627
2                   0.809       0.588    0.681
3                   0.863       0.665    0.751
4                   0.889       0.773    0.826
5                   0.909       0.802    0.852
non-inverted list   0.615       0.261    0.366

Fig. 9. Top-K query results with three methods.
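As a worked check of the F-value defined in Eq. (8), the following sketch recomputes the last column of Table 1 from its precision and recall columns with a = 0.5. The function name `f_value` and the dictionary `TABLE_1` are illustrative; the numbers are copied from the table, and small differences in the third decimal place are expected because the published precision and recall are already rounded.

```python
def f_value(precision, recall, a=0.5):
    """Weighted harmonic mean of precision and recall, as in Eq. (8)."""
    return 1.0 / (a / precision + (1.0 - a) / recall)


# (precision, recall, reported F-value) per row of Table 1.
TABLE_1 = {
    "1": (0.733, 0.549, 0.627),
    "2": (0.809, 0.588, 0.681),
    "3": (0.863, 0.665, 0.751),
    "4": (0.889, 0.773, 0.826),
    "5": (0.909, 0.802, 0.852),
    "non-inverted list": (0.615, 0.261, 0.366),
}

for row, (p, r, reported) in TABLE_1.items():
    print(f"{row:>18}: computed {f_value(p, r):.3f}, reported {reported:.3f}")
```

With a = 0.5 the expression reduces to the familiar 2PR / (P + R).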

authenticated query does not introduce a significant perfor-


the reason is similar to that for the precision. And the result mance degradation.
of non-inverted list shows that the inverted list in MST is an To evaluate the performance of Top-K query in MSTDB,
important component for the multi-keyword query. we test the average response time of 8000 Top-K queries
In addition, since the range query only depends on the B with three different parameters (Top-1, Top-10, Top-100)
+ tree which only stores data on its leaf nodes and does not under three inverted index traversal methods (TAAT,
end the search at the extension nodes, its result will be defi- DAAT, BAAT), and the results can be found in Fig. 9.
nitely accurate. Thus, we only test the precision and recall From the results of the Top-K query, we can see that our
of multi-keyword query here. BAAT scheme has a 65% and 30% performance improve-
ment over the traditional TAAT and DAAT schemes.
Therefore, the early termination method does play a role
6.2.2 Query Response Time
in our Top-K query because it can skip some irrelevant
To evaluate the query response time, we test the average blocks directly.
query time and worst-case query time for multi-keyword For the cross-chain query, considering the response time
query and range query in MSTDB under two datasets. Due is related to the number of chains in the system, we divide
to the different contents in the two datasets, we choose dif- the Wiki20 collection into multiple parts and test the query
ferent values as the range. In the Inspec dataset we use the response time from 1 chain to 16 chains respectively, and
file size as the range value, and in the WWW dataset we use the results are given in Fig. 10. Because each additional
the collection year of the paper as the range value. Besides, chain needs an additional cross-chain query and verification
we test the response time of full nodes and light nodes for process, the response time increases nearly linearly accord-
these two queries separately. As discussed in Section 4.3.4, ing to the increase in the number of chains. Each additional
although a light node does not have the whole chain data, it chain increases the cross-chain query time by approxi-
can connect to any full nodes and make the authenticated mately 5 ms.
query request.
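The authenticated query of a light node rests on one idea: the result is returned together with enough hashes to recompute a root that the light node already trusts from a block header. The sketch below shows this with a generic binary Merkle tree over SHA-256. It is not the MST verification object (VO) format, which this section does not spell out; the helper names and the rule of duplicating the last node on odd levels are assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from leaf hashes up to the single root hash."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # odd level: pair the last node with itself
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Full node side: collect the sibling hashes on the path to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # node was paired with itself
        proof.append((level[sibling], index % 2 == 0))  # (hash, sibling_is_right)
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Light node side: recompute the root from the result and the proof."""
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root


records = [b"doc-a", b"doc-b", b"doc-c"]
levels = build_levels(records)
root = levels[-1][0]            # assumed to be known from the block header
proof = make_proof(levels, 2)   # full node answers a query for records[2]
print(verify(b"doc-c", proof, root), verify(b"doc-x", proof, root))
```

The proof grows with the height of the tree rather than with the amount of stored data, which is why a light node can verify results without holding the whole chain.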
The experimental results are given in Table 2, where FN
stands for full nodes in MSTDB, LN stands for light nodes 6.2.3 Index Cost and Authenticated Query Cost
in MSTDB, and ETH stands for an Ethereum’s node. To evaluate the index cost of our MSTDB, we conduct a set
Besides, we assemble some query statements at the end of of comparative experiments to test the time and space costs
the index that will not be filtered by the Bloom filter to before and after adding the index structure.
determine the response time of our system in the worst Time Cost. Fig. 11 shows the time for generating a block
case. Note that the query in Ethereum’s node does not have with and without the MSTDB under different numbers of
the value in the worst case because each query needs to tra- blocks. The block confirmation time in MSTDB is generally
verse the whole chain. From Table 2, we can see that the longer than that in the Ethereum system. The average con-
average response time of MSTDB is within 10 ms for two firmation time of MSTDB is about 1894 ms and approxi-
query types, which is roughly three orders of magnitude lower than that of Ethereum. This mately 1451 ms in Ethereum. This gap means that the
huge performance improvement means that our index addition of the MST leads to a slight increase of about 0.44
design is very effective and greatly enhances the usability of seconds in the average block generation time. The main rea-
the blockchain database. For different types of nodes, com- son is that the full nodes in MSTDB need to process the
pared to the full nodes, the light nodes need an extra aver- meta-data in the transaction and update the index structure
age query time of no more than 1 ms, which means that the before generating blocks.

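TABLE 2
Response Times for Three Query Types

Query approach   Dataset   Node type   Mean (ms)   Worst case (ms)
Multi-keyword    Inspec    FN          5           23
Multi-keyword    Inspec    LN          6           25
Multi-keyword    Inspec    ETH         27804       -
Multi-keyword    WWW       FN          5           20
Multi-keyword    WWW       LN          5           21
Multi-keyword    WWW       ETH         23320       -
Range            Inspec    FN          9           38
Range            Inspec    LN          10          40
Range            Inspec    ETH         11790       -
Range            WWW       FN          4           17
Range            WWW       LN          4           18
Range            WWW       ETH         3320        -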

TABLE 3
The Result of Index Compression

Dataset            Original MST   Compact MST   Compression rate
Inspec (46.3 MB)   3558 bytes     1971 bytes    55.39%
WWW (11.5 MB)      2442 bytes     1455 bytes    59.58%

Fig. 10. Cross-chain query.

TABLE 4
The Impact of Index Compression on Query Efficiency

Dataset   Original MST   Compact MST   Rate of prolongation
Inspec    5135 ms        5371 ms       4.23%
WWW       4398 ms        4447 ms       1.11%

Fig. 11. Time cost for generating blocks with and without the MST.

Fig. 13. Scalability for MST-B+ tree and MST-VC.
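The compression rates reported in Table 3 can be reproduced with a few lines of arithmetic. The sketch assumes that the compression rate is the compressed size divided by the original size, which matches the statement that a smaller rate means better compression; the names `compression_rate` and `TABLE_3` are illustrative, and the byte counts are copied from the table.

```python
def compression_rate(original_bytes, compressed_bytes):
    """Compressed size as a fraction of the original size (smaller is better)."""
    return compressed_bytes / original_bytes


# (original MST size, compact MST size) in bytes, from Table 3.
TABLE_3 = {"Inspec": (3558, 1971), "WWW": (2442, 1455)}

rates = {name: compression_rate(o, c) for name, (o, c) in TABLE_3.items()}
savings = {name: 1.0 - rate for name, rate in rates.items()}

for name in TABLE_3:
    print(f"{name}: rate {rates[name]:.2%}, space saved {savings[name]:.2%}")
```

Under this definition, the space saved is one minus the compression rate, which is a little over 40% for both datasets.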

Furthermore, since we use an index compression method


to shrink MST, we evaluate the storage space occupied by
MST before and after compression and the result is given in
Table 3. We store the two datasets in the MSTDB in the
same environment, and test the size of the MST before and
after compression. Compression rate [47] is an important
indicator to test the compression capability of some com-
pression algorithms. It is calculated as the ratio of the
Fig. 12. Space cost for generating blocks with and without the MST. size of the data after compression to that before compression, and the
smaller the compression rate, the better the compression
Space Cost. In space cost, MSTDB’s hybrid storage archi- effect. Through the result, it can be seen that the compres-
tecture needs to store three parts of data: off-chain data, on- sion rate of these two datasets can reach below 60%, which
chain data, and index data. To compare the space cost of means about 40% storage overhead of the MST index can be
MSTDB and traditional blockchain system, we use MSTDB saved by our compression technology. Moreover, to test the
and Ethereum to store the same amount of raw data, and impact of index compression on query efficiency, we sepa-
test the space they occupy respectively. For Ethereum, we rately counted the time for 1000 multi-keyword queries on
store all data on the chain in the form of transactions. Since the two datasets with the two structures before and after
the transaction volume may differ from block to block, we compression, and the results are shown in Table 4. In
test the space occupied by the blocks under different trans- Table 4, we can see that the query efficiency of the original
action data sizes. The result is shown in Fig. 12. In the MST and compressed MST is close, with the compressed
main part of the figure, we can see that most of MST being slightly lower. The query time of the compressed
the data stored in our MSTDB (over 98%) are off-chain, MST on the two datasets is 4% and 1% longer than that of
while Ethereum stores all data on-chain. To better illustrate the original MST, respectively. These slight efficiency drops
the MSTDB on-chain data, we provide a subfigure in are acceptable compared to the space cost savings of
Fig. 12. It can be calculated that the on-chain storage space over 40%.
required for each transaction in MSTDB is less than 1 KB, We design a vector commitment (VC)-based scheme to
which reduces the huge storage space compared with the improve the scalability of MST-B+ tree in Section 5.4. The
totally on-chain storage scheme (each transaction is about VC scheme is implemented in Python 3.7 and based on
50 KB). Therefore, our MSTDB system reduces a large the RSA assumption. To evaluate the scalability of MST-VC,
amount of on-chain storage costs through a hybrid storage we test the on-chain storage cost and update time on MST-B
architecture without increasing the total storage data, and + tree and VC, and the results are shown in Figs. 13a and
enhances the scalability of the system. 13b. From Fig. 13a, we can see that as the number of

TABLE 5
The Comparison of MST-B+ Tree and MST-VC

Schemes             B+ tree    VC
Query time          3744 ms    3971 ms
VO size             2.6 MB     2.0 MB
Verification time   583 ms     997 ms

chain through a hybrid storage architecture (ii) building a
semantic link between on and off chain data (iii) designing
an on-chain index structure to support rich query functions
(iv) providing a safe authenticated query scheme for light
nodes.
In future, we will continue to improve the performance
of our system, and plan to explore deeper data sharing sce-
narios in the blockchain, such as cross-chain knowledge
graph and sharing-based incentive mechanism.

ACKNOWLEDGMENTS
We thank the editors and anonymous reviewers of TKDE
and the powerful computation of remote server supported
by Alibaba Cloud EFLOPS AI Platform. Our previous work
related to this paper was presented at the International
Symposium on Reliable Distributed Systems (2020) and
won the best paper award runner up in the conference. The
conference paper is available at: https://ieeexplore.ieee.
Fig. 14. The results of authenticated queries on two datasets. org/document/9252029/

transactions increases, the VC-based scheme requires less REFERENCES


on-chain storage space than the B+ tree-based scheme. [1] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,”
Fig. 13b shows that as the number of blocks increases, the 2008. [Online]. Available: http://bitcoin.org/bitcoin.pdf
MST-B+ tree takes longer to update the ADS, while the VC [2] G. Wood et al., “Ethereum: A secure decentralised generalised
update time remains the same. Table 5 shows that for 1000 transaction ledger,” Ethereum Project Yellow Paper, vol. 151,
no. 2014, pp. 1–32, 2014.
range queries, these two schemes have similar query effi- [3] H. Huang, W. Kong, S. Zhou, Z. Zheng, and S. Guo, “A survey of
ciency, and the VC-based scheme has a smaller VO size and state-of-the-art on blockchains: Theories, modelings, and tools,”
longer verification time. Due to the full range of perfor- ACM Comput. Surveys, vol. 54, no. 2, pp. 1–42, 2021.
mance advantages, we choose the VC-based MST-B+ tree as [4] K. Francisco and D. Swanson, “The supply chain has no clothes:
Technology adoption of blockchain for supply chain trans-
the default solution in MSTDB. parency,” Logistics, vol. 2, no. 1, 2018, Art. no. 2.
Authenticated Query Cost. To evaluate the cost of authenti- [5] P. Kuhle, D. Arroyo, and E. Schuster, “Building a blockchain-
cated query, we test the VO size and verification time of based decentralized digital asset management system for com-
mercial aircraft leasing,” Comput. Ind., vol. 126, 2021, Art. no.
some authenticated queries on two datasets and the result is 103393.
given in Fig. 14. From Fig. 14a, we can see that the VO size [6] J. Yin et al., “SmartDID: A novel privacy-preserving identity
increases as the number of queries increases. On average, based on blockchain for IoT,” IEEE Internet Things J., early access,
each query will generate about 3 KB of VO data, and the VO 2022, doi: 10.1109/JIOT.2022.3145089.
[7] Q. Ding, S. Gao, J. Zhu, and C. Yuan, “Permissioned blockchain-
size generated by storing the Inspec dataset is indeed larger based double-layer framework for product traceability system,”
than that of the WWW dataset. The result of Fig. 14b shows IEEE Access, vol. 8, pp. 6209–6225, 2020.
that the verification time of authenticated queries is propor- [8] B. developer, Blockchain RPCS, 2019. [Online]. Available: https://
developer.bitcoin.org/reference/rpc/index.html
tional to the query result size. The verification throughput [9] M. El-Hindi, C. Binnig, A. Arasu, D. Kossmann, and R. Ramamur-
can reach between 1500-1900 per second and the WWW thy, “BlockchainDB: A shared database on blockchains,” Proc.
dataset has a faster verification time than the Inspec dataset. VLDB Endowment, vol. 12, no. 11, pp. 1597–1609, 2019.
Since the Inspec dataset has the documents in different [10] Y. Li, K. Zheng, Y. Yan, Q. Liu, and X. Zhou, “EtherQL: A query
layer for blockchain system,” in Proc. Int. Conf. Database Syst. Adv.
fields while WWW only has the documents of a single con- Appl., 2017, pp. 556–567.
ference, the keyword lists extracted from different docu- [11] T. McConaghy et al., “Bigchaindb: A Scalable Blockchain Data-
ments are rarely duplicated for Inspec dataset, but are base,” white paper, BigChainDB, 2016.
similar for WWW dataset. In MST, the same keyword is on [12] Y. Xu, S. Zhao, L. Kong, Y. Zheng, S. Zhang, and Q. Li, “ECBC:
A high performance educational certificate blockchain with effi-
the same path, and different keywords will produce differ- cient query,” in Proc. Int. Colloq. Theor. Aspects Comput., 2017,
ent paths. Therefore, a dataset with more types of content pp. 288–304.
such as Inspec dataset will make a larger index structure [13] C. Riegger, T. Vinçon, and I. Petrov, “Efficient data and indexing
structure for blockchains in enterprise systems,” in Proc. 20th Int.
and result in a larger VO and longer verification time in Conf. Inf. Integration Web-based Appl. Serv., 2018, pp. 173–182.
MSTDB. [14] J. Eberhardt and J. Heiss, “Off-chaining models and approaches to
off-chain computations,” in Proc. 2nd Workshop Scalable Resilient
Infrastructures Distrib. Ledgers, 2018, pp. 7–12.
7 CONCLUSION [15] B. node, Bitcoin chain size, [Online]. Available: https://
bitcoinvisuals.com/chain-size
In this paper, we propose and implement a novel hybrid [16] T. E. B. Explorer, Ethereum chain size, [Online]. Available:
semantic blockchain system called MSTDB, to add the https://cn.etherscan.com/
[17] K. Miyachi and T. Mackey, “hOCBS: A privacy-preserving block-
semantic feature to blockchain and solve the query chal- chain framework for healthcare data leveraging an on-chain and
lenge. Compared with other works, our MSTDB system has off-chain system design,” Inf. Process. Manage., vol. 58, no. 3, 2021,
the following innovations: (i) reducing the storage cost on Art. no. 102535.

[18] S. J. Arulmozhi, K. Praveenkumar, and G. Vinayagamoorthi, [44] “IET inspec,” 2003. [Online]. Available: https://www.theiet.org/
“Bitcoin in India: A deep down summary,” Adv. Innov. Res., vol. 6, publishing/inspec
no. 2, 2019, Art. no. 28. [45] S. D. Gollapalli and C. Caragea, “Extracting keyphrases from
[19] J. Benet, “IPFS-content addressed, versioned, P2P file system,” research papers using citation networks,” in Proc. 28th AAAI Conf.
2014, arXiv:1407.3561. Artif. Intell., 2014, pp. 1629–1635.
[20] C. Zhang, M. Zhao, L. Zhu, W. Zhang, T. Wu, and J. Ni, “FRUIT: A [46] O. Medelyan, I. Witten, and D. Milne, “Topic indexing with
blockchain-based efficient and privacy-preserving quality-aware wikipedia,” in Proc. AAAI Conf. Artif. Intell. Workshop, 2008.
incentive scheme,” IEEE J. Sel. Areas Commun., early access, Oct. [47] V. Anh and A. Moffat, “Improved word-aligned binary compres-
10, 2022, doi: 10.1109/JSAC.2022.3213341. sion for text indexing,” IEEE Trans. Knowl. Data Eng., vol. 18,
[21] J. Liang, Z. Qin, S. Xiao, L. Ou, and X. Lin, “Efficient and secure no. 6, pp. 857–861, 2006.
decision tree classification for cloud-assisted online diagnosis
services,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 4, Enyuan Zhou (Student Member, IEEE) received
pp. 1632–1644, Jul./Aug. 2019. the BE degree in information security from North-
[22] J. Liang, Z. Qin, L. Xue, X. Lin, and X. Shen, “Verifiable and secure eastern University, and the MSc degree in cyber-
SVM classification for cloud-based health monitoring services,” space security from Xidian University, supervised
IEEE Internet Things J., vol. 8, no. 23, pp. 17029–17042, Dec. 2021. by prof. Qingqi Pei in ISN. He is currently working
[23] Z. Wu, Y. Xiao, E. Zhou, Q. Pei, and Q. Wang, “A solution to data toward the PhD degree with the Department of
accessibility across heterogeneous blockchains,” in Proc. IEEE 26th Computing, Hong Kong Polytechnic University.
Int. Conf. Parallel Distrib. Syst., 2020, pp. 414–421. His current research interests include Blockchain,
[24] M. Herlihy, “Atomic cross-chain swaps,” in Proc. ACM Symp. database, and IR.
Princ. Distrib. Comput., 2018, pp. 245–254.
[25] Z. Gao, H. Li, K. Xiao, and Q. Wang, “Cross-chain oracle based data
migration mechanism in heterogeneous blockchains,” in Proc. IEEE
40th Int. Conf. Distrib. Comput. Syst., 2020, pp. 1263–1268.
[26] Z. Hong, S. Guo, P. Li, and W. Chen, “Pyramid: A layered shard- Zicong Hong (Graduate Student Member, IEEE)
ing blockchain system,” in Proc. IEEE Conf. Comput. Commun., received the BEng degree in software engineer-
2021, pp. 1–10. ing from the School of Data and Computer
[27] Z. Hong, S. Guo, R. Zhang, P. Li, Y. Zhan, and W. Chen, “Cycle: Science, Sun Yat-sen University. He is currently
Sustainable off-chain payment channel network with asynchro- working toward the PhD degree with the Depart-
nous rebalancing,” in Proc. IEEE/IFIP 52nd Annu. Int. Conf. Depend- ment of Computing, Hong Kong Polytechnic
able Syst. Netw., 2022, pp. 41–53. University. His current research interests include
[28] Y. Peng, M. Du, F. Li, R. Cheng, and D. Song, “FalconDB: Block- blockchain, game theory, Internet of Things, and
chain-based collaborative database,” in Proc. ACM SIGMOD Int. edge/cloud computing.
Conf. Manage. Data, 2020, pp. 637–652.
[29] M. Zhang, Z. Xie, C. Yue, and Z. Zhong, “Spitz: A verifiable database
system,” Proc. VLDB Endowment, vol. 13, no. 12, pp. 3449–3460,
Aug. 2020. Yang Xiao (Member, IEEE) received the BS and
[30] C. Zhang, C. Xu, J. Xu, Y. Tang, and B. Choi, “GEM2-tree: A gas- PhD degrees from Xidian University, in 2013,
efficient structure for authenticated range queries in blockchain,” 2020, respectively. His research interests focus
in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 842–853. on adversarial attacks and defence on graph
[31] Y. Zhu, Z. Zhang, C. Jin, A. Zhou, and Y. Yan, “SEBDB: Semantics structured data, distributed trust management
empowered blockchain database,” in Proc. IEEE 35th Int. Conf. and data privacy in AI.
Data Eng., 2019, pp. 1820–1831.
[32] C. Xu, C. Zhang, and J. Xu, “vChain: Enabling verifiable boolean
range queries over blockchain databases,” in Proc. Int. Conf. Man-
age. Data, 2019, pp. 141–158.
[33] C. Zhang, C. Xu, H. Wang, J. Xu, and B. Choi, “Authenticated key-
word search in scalable hybrid-storage blockchains,” in Proc. IEEE
37th Int. Conf. Data Eng., 2021, pp. 996–1007.
[34] Q. Pei, E. Zhou, Y. Xiao, D. Zhang, and D. Zhao, “An efficient Dongxiao Zhao He received the bachelor’s and
query scheme for hybrid storage blockchains based on merkle master’s degrees in information and communi-
semantic trie,” in Proc. IEEE Int. Symp. Reliable Distrib. Syst., 2020. cation engineering from Xidian University. He is
[35] A. Coglio, “Ethereum’s recursive length prefix in ACL2,” in Proc. currently working toward the PhD degree with
16th Int. Workshop ACL2 Theorem Prover Appl., 2020, pp. 108–124. the School of Telecommunications Engineering in
[36] C. Luo and M. J. Carey, “LSM-based storage techniques: A Xidian University. He used to work in software
survey,” VLDB J., vol. 29, no. 1, pp. 393–418, 2020. development with ZTE. His current research
[37] J. H. Paik, “A novel TF-IDF weighting scheme for effective interests include Blockchain, database, distrib-
ranking,” in Proc. 36th Int. ACM SIGIR Conf. Res. Develop. Inf. uted system.
Retrieval, 2013, pp. 343–352.
[38] H. Bast and B. Buchhold, “An index for efficient semantic full-text
search,” in Proc. 22nd ACM Int. Conf. Inf. Knowl. Manage., 2013,
pp. 369–378.
[39] I. N. Yulita, H. L. The, and adiwijaya, “Fuzzy hidden markov
models for indonesian speech classification,” J. Adv. Comput. Intell. Qingqi Pei (Senior Member, IEEE) received the
Intell. Informat., vol. 16, no. 3, pp. 381–387. BS, MS, and PhD degrees in computer science
[40] C. Buckley and A. F. Lewit, “Optimization of inverted vector and cryptography from Xidian University in 1998,
searches,” Proc. 8th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. 2005, and 2008, respectively. He is currently a
Retrieval, 1985, pp. 97–110. professor and a member of the State Key Labora-
[41] H. Turtle and J. Flood, “Query evaluation: Strategies and opti- tory of Integrated Services Networks, Xidian Uni-
mizations,” Inf. Process. Manage., vol. 31, no. 6, pp. 831–850, 1995. versity. His research interests focus on cognitive
[42] D. Leung, Y. Gilad, S. Gorbunov, L. Reyzin, and N. Zeldovich, network, data security, and physical layer secu-
“Aardvark: An asynchronous authenticated dictionary with appli- rity. He is a professional member of ACM and a
cations to account-based cryptocurrencies,” in Proc. 31st USENIX senior member of the Chinese Institute of Elec-
Secur. Symp., 2022, pp. 4237–4254. tronics and China Computer Federation.
[43] B. Pillai, K. Biswas, and V. Muthukkumarasamy, “Cross-chain
interoperability among blockchain-based systems using trans-
actions,” Knowl. Eng. Rev., vol. 35, 2020, Art. no. e23.


Song Guo (Fellow, IEEE) received the PhD Rajendra Akerkar is professor and head of Big
degree in computer science from the University Data Research Group, Western Norway Research
of Ottawa. He is currently a full professor with the Institute (Vestlandsforsking), where his primary
Department of Computing with The Hong Kong domain of activities is Big Data and semantic tech-
Polytechnic University. Prior to joining PolyU, he nologies with aim to combine strong theoretical
was a full professor with the University of Aizu, results with high impact practical results. His
Japan. His research interests are mainly in the recent research focuses on application of Big
areas of cloud and green computing, Big Data, Data methods to real-world challenges in mobility,
wireless networks, and cyber-physical systems. transport, energy and emergency management.
He has published more than 300 conference and He is serving as an associate editor of Interna-
journal papers in these areas and received multi- tional Journal of Metadata, Semantics and Ontolo-
ple best paper awards from IEEE/ACM conferences. His research has gies (IJMSO), associate editor of IEEE Open Journal of the Computer
been sponsored by JSPS, JST, MIC, NSF, NSFC, and industrial compa- Society (OJ-CS) and Knowledge Management Track editor of Web Intelli-
nies. He has served as an editor for several journals, including IEEE gence, an international journal. He has authored 16 books, 139 research
Transactions on Parallel and Distributed Systems, IEEE Transactions papers and edited 19 volumes of international conferences and work-
on Emerging Topics in Computing, IEEE Transactions on Green Com- shops.He is also actively involved in several international ICT initiatives,
munications and Networking, IEEE Communications Magazine, and and research & innovation projects, including H2020 projects, for more
Wireless Networks. He has been actively participating in international than 20 years.
conferences as general chair and TPC chair. He is a senior member of
ACM, and an IEEE Communications Society Distinguished lecturer.