

Page Proof Instructions and Queries


Journal Title: JIS
Article Number: 1018401

Thank you for choosing to publish with us. This is your final opportunity to ensure your article will be accurate at publication. Please review your proof
carefully and respond to the queries using the circled tools in the image below, which are available in Adobe Reader DC* by clicking Tools from the
top menu, then clicking Comment.

Please use only the tools circled in the image, as edits via other tools/methods can be lost during file conversion. For comments, questions, or formatting requests, please use the annotation tools indicated. Please do not use comment bubbles/sticky notes.

*If you do not see these tools, please ensure you have opened this file with Adobe Reader DC, available for free at get.adobe.com/reader or by going
to Help > Check for Updates within other versions of Reader. For more detailed instructions, please see us.sagepub.com/ReaderXProofs.

Sl. No. Query

Please note that we cannot add/amend ORCID iDs for any article at the proof stage. Following
ORCID’s guidelines, the publisher can include only ORCID iDs that the authors have specifically
validated for each manuscript prior to official acceptance for publication.
Please confirm that all author information, including names, affiliations, sequence, and contact
details, is correct.
Please review the entire document for typographical errors, mathematical errors, and any other neces-
sary corrections; check headings, tables, and figures.
Please ensure that you have obtained and enclosed all necessary permissions for the reproduction of
artworks (e.g. illustrations, photographs, charts, maps, other visual material, etc.) not owned by
yourself. Please refer to your publishing agreement for further information.
Please note that this proof represents your final opportunity to review your article prior to publication,
so please do send all of your changes now.
Please confirm that the funding and conflict of interest statements are accurate.
1 Please check whether the author names are correct as inserted.
2 Please check whether the hierarchy of the section head levels is correct throughout the article.
3 Please note that ‘vector space model’ are repeated in the sentence ‘In the document system, the docu-
ments and ...’. Please check.
4 Please check whether the sentence ‘Initially, the input documents are pre-processed to ...’ is correct as
set.
5 Please check whether all variables/terms/functions/Greeks are accurately and consistently used
throughout the article.
6 Please note that some of the references have been renumbered to make their citations sequential in
text. Please check.
Research Paper

Journal of Information Science, 1–14
© The Author(s) 2021
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/01655515211018401
journals.sagepub.com/home/jis

An approach for document retrieval using cluster-based inverted indexing

Gunjan Chandwani [AQ: 1]


Manav Rachna University, India; Dr. A.P.J. Abdul Kalam Technical University (AKTU), India

Anil Ahlawat
Academics, KIET Group of Institutions, India

Gaurav Dubey
ABES Engineering College, India

Abstract
Document retrieval plays an important role in knowledge management as it facilitates us to discover the relevant information from the
existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, the pre-processing is done
to remove the unnecessary and redundant words from the documents. Then, the indexing of documents is done by the cluster-based
inverted indexing algorithm, which is developed by integrating the piecewise fuzzy C-means (piFCM) clustering algorithm and inverted
indexing. After providing the index to the documents, the query matching is performed for the user queries using the Bhattacharyya
distance. Finally, the query optimisation is done by the Pearson correlation coefficient, and the relevant documents are retrieved. The
performance of the proposed algorithm is analysed using the WebKB data set and the Twenty Newsgroups data set. The analysis shows that
the proposed algorithm offers high performance with a precision of 1, recall of 0.70 and F-measure of 0.8235. The proposed document
retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information.

Keywords
Bhattacharyya distance; document retrieval; inverted indexing; piecewise fuzzy C-means; query optimisation

1. Introduction
Data are stored as a collection of documents in databases, where each document contains key-value pairs. Sub-documents are nested inside the main documents using the key values. Document orientation is highly flexible in designing the schema, as heterogeneous and complex structured data can be stored and collected together. The document-oriented database is considered a schemaless database, as it does not require any schema to convey the data model [1,2]. Owing to technological growth, most organisations collect data of different file formats, speeds and types. When the data are used properly, analysing them creates massive potential and enables the system to attain better results. In this growing digital world, transforming data into usable information requires not only data analysis approaches but also computing environments and generation systems capable of dealing with the enormous volume of data held in databases [3–5]. The database in the documentation system offers a wrapper using
which the user can access the information based on the given query. Depending on the characteristic features of the data, fields such as image and text require different approaches for creating indexes, accessing records and formulating queries. In the traditional database, the data are ordered in a specific way so that the contents are directly accessed using a specific field. The search queries are formulated based on the index fields, and the records are retrieved using the index relation. Hence, the index information is directly accessed from the data [6][AQ: 2].

Corresponding author:
Gunjan Chandwani, Manav Rachna University, Faridabad 121004, Haryana, India.
Email: gunjanchandwani80@gmail.com
Chandwani et al. 2

Due to the enormous growth in digital data, understanding the query plays a key role in the documentation system to
obtain the relevant information based on the user needs. Classic information retrieval (IR) system relies on the keyword
matching scheme to index the documents in the corpus. In the document system, the documents and the queries
are represented using the vector space model [7]. Reformulating the
query is considered the most popular track system in the database, which is explored to refine the user information effec-
tively. Different techniques are introduced in the survey to reformulate the user queries automatically, where relevance
feedback is one of the important techniques. The user is requested to judge the documents based on the relevance of the
data; in the relevance feedback scheme, usually, the top most documents are retrieved based on the user’s query. The
information, like indexing the terms and the non-relevant documents, is extracted from the original documents and is
combined using the query to expand and re-weight the query automatically using the query terms. One of the essential
features of the relevance feedback method is relevance judgements [8]. The system provides the user interface, which is
employed by the user to specify the needs of the information in the form of query. Based on the query operation, the
queries are processed to remove the stop word and are converted into the representation form, which is operated by the
system, usually in the form of an index. The searching process identifies the documents using the index based on the rel-
evant query. While searching the documents, the system provides the matching score to every document [9]. In the medi-
cal field [10], clustering has been proven to be a powerful tool for discovering patterns and structure in labelled and
unlabelled data sets.
Based on the query of the user, the documents are efficiently retrieved using indexing. Indexing in the document is a
process to assign the terms to the documents for easier retrieval purposes [11]. An indexing technique was introduced
based on the document assumption, which is assigned to the index terms for document retrieval based on the queries
[12]. In the knowledge base, the concept of semantic and indexing is used to find the matches in the document. The sys-
tem is offered to infer the information by connecting the queries and the concepts [13]. Various techniques are adopted
to enhance the query evaluation performance, like combinatorial or heuristic algorithms, rapid implementation of basic
operations, semantic transformations and logic-based approach to generate the access plans and select the information
among them. The above methods are described in the query framework evaluation procedure to represent the queries
using the relational calculus [14]. Documents available under the same class use the same keywords, which is called the
surface-based match method for IR [15]. The keyword-based category association is learned from the documents using
the Linear Least Squares Fit (LLSF) method for estimating the keyword [16]. The query optimisation issues – like data-
base machine usage, distributed database optimisation and query evaluation – are addressed. Query optimisation inte-
grates a variety of techniques to solve the above-mentioned problem, which ranges from the logical transformation to
the optimisation of the storage data at the system level [14].

1.1. Motivation
Document retrieval is the problem of how to find the stored documents that contain useful information. Various methods
have been developed for retrieving the relevant documents. However, some problems remain unsolved. Some of the chal-
lenges are listed here. Pattern matching is inefficient with respect to computational cost and is mainly intended to retrieve exact matches; hence, it does not cater for real-world lexical variations and results in poor modelling capability for semi-structured data, which is a major challenge in document retrieval [17]. The IR system often yields incomplete and inaccurate results due to the challenges of synonymy and polysemy.
The word and vocabulary mismatch problem arises in the conceptual document space for the query transformation [7].
The time latencies involved in query pre- and post-processing degrade the performance of the retrieval system. Query relaxation reduces the quality of the query results and, along with the required user involvement, is a major challenge of the query refinement process [18]. Retrieving relevant documents with short queries is a challenging task in document retrieval: irrelevant documents are frequently retrieved when the query keyword is short, and the distribution of negative (non-relevant) terms affects the retrieval of relevant documents [19]. Data – like video, text and images – are characterised as unstructured
data, as they contain valuable information for the business. Extracting and searching these kinds of information is a major challenge because such information is self-describing and has no predefined model [20]. These chal-
lenges are considered as the motivation, and a new method for document retrieval is proposed in this work. The proposed
document retrieval system not only retrieves the most relevant documents but also speeds up the storing and retrieval of
information. Also, the proposed cluster-based indexing algorithm effectively performs the clustering and the indexing
process based on the relevant information of the documents.
The contribution of this article is the development of the cluster-based inverted indexing algorithm by integrating the
piecewise fuzzy C-means (piFCM) clustering algorithm and the inverted indexing to generate the document indexing
based on the keyword. The rest of the article is organised as follows: the literature survey of the existing techniques is

Journal of Information Science, 2021, pp. 1–14 © The Author(s), DOI: 10.1177/01655515211018401

elaborated in section 2. The proposed algorithm is described in section 3, and the results along with the analysis are ela-
borated in section 4. Finally, section 5 concludes this article.

2. Literature survey
The review of the existing methods is listed in this section: Lopez-Otero et al. [17] developed a Query-by-example
Spoken Document Retrieval (QbESDR) approach to identify the documents based on the spoken query. This approach
recorded the documents in indices, which allowed performing an efficient and fast search. The searching time of the
query transcription was reduced, but it failed to use the clustering strategy for matching the document pairs. Rad et al.
[21] modelled a lexical scoring system to determine the semantic relationship between the words. It utilised the lexical
chain model to retrieve the relevant documents based on the judgement of relevance. Even though the ambiguity was
resolved, it retrieved the unrelated documents. Gupta et al. [22] developed a hash-based indexing approach for document
retrieval. This approach provided the privacy to retrieve the document based on the term features. However, the data were
not retrieved effectively in less time. Hao et al. [7] introduced a coupling relationship model to organise and rank the doc-
uments based on the learned concept. It represented the documents and queries in the concept space to retrieve the
semantic information. The threshold selection was effectively balanced, but retrieving the documents using the linguistic
model was not achieved. Biswas et al. [23] developed a linear space index model to retrieve the relevant documents based
on the query time. The monotonic function was determined to compute the relevance using the linear space. It provided
efficient document retrieval for most of the relevant documents, but it required too much space for limited search func-
tions. Tekli et al. [18] developed a semantic-aware indexing framework by integrating the domain knowledge with the
textual information to process the semantic-aware query. It generated more semantic relevant results and handled the var-
iation while collecting the multi-attribute data. However, it was not accurate in performing the incremental result fetching
to enhance the performance. Hao et al. [19] developed an expectation–maximisation (EM) algorithm to identify the top-
rank documents. The negative and the positive feedback were integrated to enhance the retrieval performance. Even
though it was robust and provided better precision, the complexity of this algorithm was increased, as it failed to evaluate
the negative documents. Madaan et al. [24] introduced a Question Answering (QA) system to retrieve the documents.
QA provided the exact answers to the language questions and attained high accuracy level. QA effectively indexed the
semantic web and retrieved the relevant documents quickly, but it offered low-quality answers for reasoning questions.

3. Proposed cluster-based inverted indexing algorithm


Indexing and retrieving the documents based on the query plays a key role in document processing and retrieval.
Figure 1 shows the block diagram of the proposed cluster-based inverted indexing. The proposed cluster-based index-
ing algorithm involves four stages: pre-processing, document indexing, complex query matching and query
optimisation. The database contains a variety of documents, and these input documents are subjected to the pre-
processing stage, where the redundant and the unnecessary words are removed using the stop word removal and the
stemming techniques. The pre-processed documents are further passed into the document indexing stage, which uti-
lises the clustering-based indexing algorithm termed cluster-based inverted indexing through combining the inverted
indexing with the piFCM clustering algorithm for indexing the documents. The clustered documents are fed into the
complex query matching stage. The query matching is performed for the user queries, like semantic queries or multi-
gram queries, based on the Bhattacharyya distance to produce a better query matching result. The Bhattacharyya dis-
tance is used to find the relevant documents based on the minimum distance measure. Finally, the query optimisation
is performed using the Pearson correlation coefficient based on the interactive query optimisation, which determines
an effective way to retrieve the documents.

3.1. Pre-processing
The database contains several documents, and each of these input documents is pre-processed using the stop word
removal and stemming techniques for removing the redundant and unnecessary words. The stop word removal and the
stemming process are applied to reduce the seeking time of the user. Initially, the input documents are pre-processed to eliminate the duplicate and redundant words; then, indexing and clustering are applied to the pre-processed documents to retrieve the optimal information. Every document sent to the indexing process must be
pre-processed before it is forwarded to the next stage. However, documents with redundant words are not utilised by the
query matching criteria to generate the retrieval results. Moreover, documents with unrelated information may also exist
in the database. Thus, it is required to pre-process the documents to perform effective retrieval of documents. The


Figure 1. Schematic diagram of proposed cluster-based inverted indexing for document indexing and retrieval.

retrieval mechanism without proper pre-processing affects the retrieval performance. The pre-processed documents are
used by the indexing operations, which uniquely identifies the matched documents based on the keyword of the query.
The pre-processing stage uses the stop word and the stemming techniques for effectively reducing the unwanted words
in the documents.
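As an illustration, the stop word removal and stemming steps described above can be sketched in Python. The stop-word list and suffix-stripping rules below are simplified placeholders, not the ones used by the authors; a full system would use a complete stop-word list and a proper stemmer such as Porter's algorithm.

```python
import re

# Illustrative stop-word list and suffix rules (assumptions, not the paper's).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}
SUFFIXES = ("ing", "ies", "es", "ed", "s")

def stem(word):
    """Crude suffix-stripping stemmer, a stand-in for Porter stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Tokenise, remove stop words, stem, and drop duplicate terms."""
    tokens = re.findall(r"[a-z]+", document.lower())
    seen, terms = set(), []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        root = stem(token)
        if root not in seen:  # eliminate duplicate/redundant words
            seen.add(root)
            terms.append(root)
    return terms

print(preprocess("The students are searching the faculty documents"))
# → ['student', 'search', 'faculty', 'document']
```

The resulting term lists, rather than raw text, are what the indexing stage operates on.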

3.2. Document indexing for effective retrieval of documents


The resulted pre-processed documents are passed into the document indexing stage. The document indexing is performed
using the proposed cluster-based inverted indexing algorithm, which is the combination of the clustering algorithm and
the inverted indexing. The proposed cluster-based indexing algorithm effectively performs the clustering and the index-
ing process based on the relevant information of the documents. The documents are indexed for the purpose of easy
retrieval of data based on the keyword of the query. The documents are clustered using the piFCM [25] clustering algo-
rithm based on the information present in the documents. Clustering groups the similar documents so that the documents
with similar features fall under the same cluster. There exist a number of different cluster groups; in other words, each cluster group contains various documents, but the information in the documents must be similar for them to be grouped under the same cluster group.

3.2.1. piFCM clustering algorithm for indexing the documents. The documents with relevant contents are grouped into clus-
ters. The documents may exist in various cluster groups according to their content. Different cluster groups contain different documents, which are retrieved based on the matching keywords. As the documents are clustered into groups, retrieving the information for a user query becomes easier. Every cluster group contains any num-
ber of data objects with the relevant information. The proposed cluster-based inverted indexing uses the piFCM cluster-
ing algorithm, which is based on the fuzzy C-means (FCM) clustering approach to cluster the data objects together
effectively. piFCM is a clustering algorithm, which enhances the performance of indexing using the objective function
of the membership data.
The piFCM clustering is performed using the utility function of the fuzzy consensus clustering (FCC). The contin-
gency matrix is used in the consensus clustering to make the clustering process more effective. The consensus clustering


Algorithm 1. Pseudocode of piFCM.

piFCM

1  b ← 0
2  Initialise σ[b], and let d[b] = C(σ[b]); refer to equation (7)
3  repeat
4      b ← b + 1
5      Let σ[b] = K(d[b−1]); refer to equation (6)
6      Let d[b] = C(σ[b]); refer to equation (7)
7  until some stopping criterion is met

uses the structure of the soft cluster based on the utility function. FCM uses the iterative process of piFCM to increase
the efficiency of the consensus clustering. FCC uses the horizontal and the vertical segmentation approach to obtain the
big data with a feasible framework. In the spark platform, the FCC framework is accelerated using the parallelisation
scheme. The data objects with different membership degrees may exist in any cluster group. The difference between the
basic partition and the consensus clustering is measured using the utility function, and the clustering results are obtained
by maximising the utility value. The objective function based on the mutual information uses the k-means clustering for
identifying the optimal solution in the consensus clustering. The utility function of the k-means consensus clustering
(KCC) is applied into the objective function for transferring the consensus cluster groups into k-means clustering. The
basic partitioning information is collected and summarised in the co-association matrix, which identifies the number of
times and the instances present in the similar cluster group.
FCM uses the objective function rather than using the data centroid to identify similar objects based on the distance
measure. The concept of multi-membership data is introduced in the piFCM clustering algorithm. The membership data set is denoted as A, which is expressed as

A = \{ a_1, a_2, \ldots, a_i \}          (1)

where a_1 supports the membership degree to all the cluster groups. The basic partition of A is defined as P = \{ \sigma^{(1)}, \ldots, \sigma^{(c)} \}, which determines the membership data set. The function a_b corresponds to the bth data object

a_b = \big( \sigma_b^{(1)}, \ldots, \sigma_b^{(n)}, \ldots, \sigma_b^{(c)} \big) \quad \forall b          (2)

In the piFCM clustering, the piecewise centroid is defined as a dimensional vector of c pieces and is represented as

d = \{ d_1, \ldots, d_D \}          (3)

d_m = \big( d_m^{(1)}, \ldots, d_m^{(n)}, \ldots, d_m^{(c)} \big)^T          (4)

where d_m denotes the dimensional vector constituted of c pieces. Moreover, d_m is called the piecewise centroid of A, and d_m^{(n)} is called the discrete distribution function of the centroid. The pseudocode of the piFCM is provided
in Algorithm 1.
The fuzzy clustering defines the membership data with the mth centroid, while the piecewise centroid is represented as d_m \forall m; hence, the distance from a_b to d_m is given by

x(a_b, d_m) = \sum_{n=1}^{c} \rho_n \, g\big(\sigma_b^{(n)}, d_m^{(n)}\big) \quad \forall b, m          (5)

where d_m denotes the piecewise centroid, a_b represents the membership data, \rho_n defines the user-specified weight, g denotes the fuzzy function, and \sigma_b is the membership degree. The non-degeneration case is considered for \sigma without loss of generality and is expressed as

\sigma_{bm} = \frac{x(a_b, d_m)^{-1/(h-1)}}{\sum_{m=1}^{D} x(a_b, d_m)^{-1/(h-1)}} \quad \forall b, m          (6)


d_m^{(n)} = \sum_{b} \frac{(\sigma_{bm})^h}{\|\sigma_m^h\|_1} \, \sigma_b^{(n)} \quad \forall m, n          (7)

where \sigma_{bm} and \sigma_b^{(n)} denote the membership degrees, and h denotes the fuzzifier. Now, we have

\sigma = [\sigma_{bm}]_{i \times D} = K(d)          (8)

d = \{ d_m \}_{m=1}^{D} = C(\sigma)          (9)

where K and C are the mapping functions explained by equations (6) and (7), respectively. Based on the mapping func-
tions of K and C, the FCM algorithm is defined as piFCM, which is similar to FCM in optimising the σ and d values iteratively using equations (6) and (7). Moreover, equal weights are set for ρ_n, and g is taken as the squared norm. Hence, the distance between a_b and d_m reduces to the Euclidean distance between σ and d. When x(a_b, d_m) = 0, the value of b and m in piFCM is exactly the same as that of FCM.
Hence, the output of the cluster is represented as

Z = \{ z_1, z_2, z_3, \ldots \}          (10)

where z_1, z_2 and z_3 are the cluster members.
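The alternating updates of Algorithm 1, namely the membership mapping K of equation (6) and the centroid mapping C of equation (7), can be sketched as follows. This is a simplified, illustrative implementation, not the authors' code: the toy membership data, the random initialisation, the equal weights ρ_n and the squared norm for g are all assumptions.

```python
import numpy as np

def pifcm(A, D, h=2.0, iters=50, eps=1e-9):
    """Sketch of piFCM: alternate the mappings K (eq. 6) and C (eq. 7).

    A : (i, c) array of membership data, one row a_b per object
    D : number of piecewise centroids
    h : fuzzifier
    """
    rng = np.random.default_rng(0)
    i, c = A.shape
    # Random initial memberships sigma[b, m], normalised over the D centroids
    sigma = rng.random((i, D))
    sigma /= sigma.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # C: piecewise centroids as fuzzily weighted averages of the a_b (eq. 7)
        w = sigma ** h
        d = (w.T @ A) / w.sum(axis=0)[:, None]
        # Distance x(a_b, d_m) with equal weights rho_n and squared norm g
        x = ((A[:, None, :] - d[None, :, :]) ** 2).sum(axis=2) + eps
        # K: membership update (eq. 6)
        inv = x ** (-1.0 / (h - 1.0))
        sigma = inv / inv.sum(axis=1, keepdims=True)
    return sigma, d

# Two well-separated groups of membership vectors
A = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
sigma, d = pifcm(A, D=2)
print(sigma.argmax(axis=1))  # first two objects share one cluster, last two the other
```

Setting h = 2 recovers the familiar FCM membership update, which is the sense in which piFCM reduces to FCM.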

3.2.2. Cluster group retrieval using inverted indexing. The proposed cluster-based inverted indexing algorithm uses the
inverted indexing process to perform the indexing mechanism. Here, the input query keyword is sent to the clustered
groups. Based on the keyword query, the documents are retrieved from different cluster groups. When the matched key-
word is present in more than one document in different cluster groups, then all the documents are retrieved. Accordingly,
different documents from various cluster groups are retrieved through inverted indexing. The inverted indexing is used
along with the clustering algorithm to perform the document indexing efficiently. The documents are indexed based on
the related matched keyword. The query keyword is forwarded to each cluster group to search for the matching docu-
ments, where each cluster group contains many documents. Thus, for each query keyword, the entire documents are
searched from various cluster groups, and only the matched documents are retrieved using the indexing process. Hence,
any document is indexed using the inverted indexing process, and the resulted retrieved documents are further used to
process with the query matching mechanism.
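A minimal sketch of this cluster-aware inverted index is given below. The cluster contents and document identifiers are hypothetical (they mirror the worked example in section 3.3); a real system would index the pre-processed terms of whole documents.

```python
from collections import defaultdict

# Hypothetical clustered corpus: cluster id -> {doc id -> pre-processed terms}
clusters = {
    "cluster1": {"s2": ["student", "faculty"], "s6": ["faculty", "project"]},
    "cluster2": {"y4": ["faculty", "course"], "y5": ["sports", "news"]},
}

# Build the inverted index: term -> postings list of (cluster, doc) pairs
index = defaultdict(list)
for cid, docs in clusters.items():
    for doc_id, terms in docs.items():
        for term in set(terms):
            index[term].append((cid, doc_id))

def retrieve(keyword):
    """Return every matching document across all cluster groups."""
    return sorted(index.get(keyword, []))

print(retrieve("faculty"))
# → [('cluster1', 's2'), ('cluster1', 's6'), ('cluster2', 'y4')]
```

Because the postings list spans cluster groups, a keyword present in several clusters retrieves all of its documents in one lookup, which is the point of combining clustering with inverted indexing.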

3.3. Complex query matching using the Bhattacharyya distance


The result of the inverted indexing stage is further subjected to the query matching stage. Complex query matching is
performed for the user queries, like multigram queries or semantic queries, to generate the results. Semantic queries are
contextual and associative. Semantic query focuses on retrieving the documents implicitly and explicitly depending on
the structural, semantic and syntactic information present in the information. It is designed to deliver the results through
the query matching criteria. Semantic query processes the relationship between the documents based on the semantics of
the unstructured data. The query matching process utilises the Bhattacharyya distance to produce better query matching
results. The Bhattacharyya distance measure finds similar documents based on the minimum distance measure. Based
on the Bhattacharyya coefficient, the Bhattacharyya distance measures the document differences in the retrieval process.
The Bhattacharyya distance is represented as
X(k, l) = \frac{1}{4} \ln \left[ \frac{1}{4} \left( \frac{\lambda_k^2}{\lambda_l^2} + \frac{\lambda_l^2}{\lambda_k^2} + 2 \right) \right] + \frac{1}{4} \, \frac{(\sigma_k - \sigma_l)^2}{\lambda_k^2 + \lambda_l^2}          (11)

where \lambda_k^2 denotes the variance of the kth document, \sigma_k represents the mean of the kth document, X(k, l) denotes the Bhattacharyya distance between the two documents, and k and l denote the two different documents. The com-
plex query matching process is performed based on the Bhattacharyya distance, which uses the keyword query to search
the relevant documents in the cluster groups. Once the relevant documents are retrieved, the query matching again
searches the documents, which is retrieved by the cluster-based inverted indexing algorithm. Among all the selected
documents, the query matching retrieves the exact documents based on the preference of the user. Thus, the query
matching using the Bhattacharyya distance produces better matching results. Figure 2 shows the schematic diagram of
query matching.
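Equation (11) can be computed directly once each document is summarised by the mean and variance of its term weights. The sketch below assumes that summary representation; the parameters mu and var stand for the σ and λ² of equation (11), and the sample values are hypothetical.

```python
import math

def bhattacharyya(mu_k, var_k, mu_l, var_l):
    """Bhattacharyya distance (equation (11)) between two documents
    summarised by the mean and variance of their term weights."""
    term1 = 0.25 * math.log(0.25 * (var_k / var_l + var_l / var_k + 2.0))
    term2 = 0.25 * (mu_k - mu_l) ** 2 / (var_k + var_l)
    return term1 + term2

# Identical statistics give distance 0; larger gaps give larger distances
print(bhattacharyya(0.5, 1.0, 0.5, 1.0))  # 0.0
print(bhattacharyya(0.5, 1.0, 2.0, 1.5) > bhattacharyya(0.5, 1.0, 0.6, 1.1))  # True
```

Documents whose distance to the query is smallest are kept, matching the minimum-distance criterion described above.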


[Figure 2 shows two cluster groups: cluster group-1 containing documents s1–s6 and cluster group-2 containing documents y1–y6. The keyword query is matched against both groups, and the matching documents s2, s6 and y4 are retrieved.]
Figure 2. Schematic diagram of complex query matching.

In the complex query matching stage, the input keyword is sent to the cluster groups. In this context, two cluster groups are considered, referred to as cluster group-1 and cluster group-2. Cluster group-1 contains the documents s1, s2, s3, s4, s5 and s6, and cluster group-2 contains the documents y1, y2, y3, y4, y5 and y6. The
keyword is matched with every document in both the cluster groups to retrieve the relevant documents. The content of
the documents, which is matched with the input keyword, is selected from both clusters. Moreover, if the searching key-
word is matched with the cluster groups, the documents s2 and s6 in the cluster group-1 and the document y4 in the clus-
ter group-2 are retrieved. Hence, the keyword searches the similar documents among both the cluster groups, and the
documents s2 , s6 and y4 are retrieved based on the best matching results.

3.4. Query optimisation for document retrieval


The matching result obtained from the complex query matching is further processed by the query optimisation stage.
Finally, query optimisation is carried out using interactive query optimisation for determining an efficient way to execute
a query with different possible query plans to retrieve the relevant documents. The query optimisation uses the similarity
measure based on the Pearson correlation coefficient for retrieving the documents. The Pearson correlation coefficient is
defined as the measure of the correlation between the documents and is expressed as

\nu(p, q) = \frac{\mathrm{cov}(p, q)}{\sigma_p \, \sigma_q}          (12)

where p and q are the random variables, cov denotes the covariance measure, σ p represents the standard deviation of p,
and σ q denotes the standard deviation of q, respectively. The query optimisation uses the similarity measure to the docu-
ments, which is retrieved by the query matching process to identify the relevant documents effectively. Therefore, the
related documents are retrieved based on the similarity measure of the Pearson correlation coefficient.
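A minimal sketch of scoring retrieved documents with equation (12) is shown below. The term-weight vectors are hypothetical; a real system would derive them from the indexed documents and the query.

```python
import math

def pearson(p, q):
    """Pearson correlation coefficient nu(p, q) = cov(p, q) / (sigma_p * sigma_q),
    used here to score a retrieved document's term weights against the query's."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q)) / n
    sd_p = math.sqrt(sum((a - mean_p) ** 2 for a in p) / n)
    sd_q = math.sqrt(sum((b - mean_q) ** 2 for b in q) / n)
    return cov / (sd_p * sd_q)

query_vec = [3.0, 1.0, 0.0, 2.0]
doc_a     = [2.9, 1.1, 0.1, 2.2]   # closely correlated with the query
doc_b     = [0.0, 2.0, 3.0, 1.0]   # negatively correlated with the query
print(pearson(query_vec, doc_a) > pearson(query_vec, doc_b))  # True
```

Documents are then ranked by this coefficient, so that the most strongly correlated documents are returned first.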


4. Results and discussion


This section describes the results and discussion of the proposed cluster-based inverted indexing algorithm using the
piFCM clustering and inverted indexing.

4.1. Experimental setup


The proposed cluster-based indexing algorithm is implemented in Java, and the experimentation is carried out using the Twenty Newsgroups data set [26] and the Reuters data set [27]. The Twenty Newsgroups data set contains a collection of newsgroup documents and is mainly used in text applications. The Reuters data set is a benchmark data set used for document classification. It has 3019 testing documents, 7769 training documents and 90 classes.

4.1.1. Comparative methods used for the analysis. The performance is evaluated by comparing the proposed algorithm with
the existing methods, like the Semantic indexing query framework (SemIndex) [18] and EM algorithm [19].

4.1.2. Evaluation metrics. The performance of the proposed cluster-based indexing approach is analysed and evaluated
based on the metrics, namely, precision, recall and F-measure.

4.1.2.1. Precision.
Precision is defined as the ratio of the true positives to the total number of positively predicted observations and is expressed as

U = \frac{P}{P + Q}          (13)

where P denotes the true positives, R represents the false negatives, and Q the false positives.

4.1.2.2. Recall.
Recall is defined as the ratio of the true positives to the actual number of relevant observations and is represented as

V = \frac{P}{P + R}          (14)

4.1.2.3. F-measure.
F-measure is the harmonic mean of precision and recall, so it balances both metrics in a single value. Thus, F-measure is represented as

Z = 2 × (U × V) / (U + V)    (15)
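Under the definitions above (P true positives, Q false positives, R false negatives), the three metrics can be sketched in code as follows; the counts in the usage example are hypothetical, chosen to reproduce the proposed method's cluster-size-17 values from Figure 3:

```python
def precision(p, q):
    """Precision U = P / (P + Q): fraction of retrieved documents that are relevant."""
    return p / (p + q)

def recall(p, r):
    """Recall V = P / (P + R): fraction of relevant documents that are retrieved."""
    return p / (p + r)

def f_measure(u, v):
    """F-measure Z = 2UV / (U + V): harmonic mean of precision and recall."""
    return 2 * (u * v) / (u + v)

# Hypothetical counts: 5 relevant documents retrieved, 0 false positives,
# 3 relevant documents missed.
u = precision(5, 0)   # 1.0
v = recall(5, 3)      # 0.625
z = f_measure(u, v)   # approximately 0.7692
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot score a high F-measure by inflating precision at the expense of recall, or vice versa.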

4.2. Comparative analysis


This section describes the experimental results of the proposed cluster-based inverted indexing algorithm in terms of precision, recall and F-measure, obtained by varying the cluster size.

4.2.1. Comparative analysis using the Twenty Newsgroups data set. The analysis made using the Twenty Newsgroups data set by varying the cluster size is elaborated in this section.

4.2.1.1. Analysis using multigram query.


Figure 3. Experimental analysis using multigram query with data set-1: (a) precision, (b) recall and (c) F-measure.

The analysis made using the multigram query ‘student faculty’ for precision, recall and F-measure with respect to the cluster size is explained in this section. Figure 3(a) shows the precision as a function of cluster size. When the cluster size is 17, SemIndex and EM attain precisions of 0.12 and 0.333, while the proposed cluster-based inverted indexing reaches 1. When the cluster size is 18, SemIndex and EM attain 0.0888 and 0.36, against 1 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.24 and 0.36, against 1 for the proposed method.
Figure 3(b) shows the recall as a function of cluster size. When the cluster size is 17, SemIndex and EM attain recalls of 0.25 and 0.5, while the proposed cluster-based inverted indexing reaches 0.625. When the cluster size is 18, SemIndex and EM attain 0.1667 and 0.5, against 0.75 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.458 and 0.5, against 0.75 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.5 and 0.583, against 0.7083 for the proposed method.
Figure 3(c) shows the F-measure as a function of cluster size. When the cluster size is 17, SemIndex and EM attain F-measures of 0.1621 and 0.4, while the proposed cluster-based inverted indexing reaches 0.7692. When the cluster size is 18, SemIndex and EM attain 0.1159 and 0.4186, against 0.8571 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.3188 and 0.4186, against 0.8571 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.3835 and 0.4351, against 0.829 for the proposed method.


Figure 4. Experimental analysis using semantic query with data set-1: (a) precision, (b) recall and (c) F-measure.

4.2.1.2. Analysis using the semantic query.


The analysis made using the semantic query ‘Electronics’ for precision, recall and F-measure with respect to the cluster size is explained in this section. Figure 4(a) shows the precision as a function of cluster size. When the cluster size is 17, SemIndex and EM attain precisions of 0.3111 and 0.4, while the proposed cluster-based inverted indexing reaches 1. When the cluster size is 18, SemIndex and EM attain 0.0888 and 0.4, against 1 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.1020 and 0.333, against 1 for the proposed method.
Figure 4(b) shows the recall as a function of cluster size. When the cluster size is 17, SemIndex and EM attain recalls of 0.5 and 0.583, while the proposed cluster-based inverted indexing reaches 0.833. When the cluster size is 18, SemIndex and EM attain 0.1667 and 0.5, against 0.8333 for the proposed method. When the cluster size is 19, SemIndex and EM both attain 0.2083, against 0.5 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.4166 and 0.5, against 0.625 for the proposed method.
Figure 4(c) shows the F-measure as a function of cluster size. When the cluster size is 17, SemIndex and EM attain F-measures of 0.3835 and 0.4745, while the proposed cluster-based inverted indexing reaches 0.9090. When the cluster size is 18, SemIndex and EM attain 0.1159 and 0.444, against 0.9090 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.3169 and 0.144, against 0.666 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.270 and 0.4, against 0.7693 for the proposed method.

Figure 5. Experimental analysis using multigram query with data set-2: (a) precision, (b) recall and (c) F-measure.

4.2.2. Comparative analysis using the Reuters data set. The analysis made using the Reuters data set by varying the cluster size is elaborated in this section.

4.2.2.1. Analysis using multigram query.


The analysis made using the multigram query ‘schedule limit’ for precision, recall and F-measure with respect to the cluster size is explained in this section. Figure 5(a) shows the precision as a function of cluster size. When the cluster size is 17, SemIndex and EM attain precisions of 0.3636 and 0.4909, while the proposed cluster-based inverted indexing reaches 0.7. When the cluster size is 18, SemIndex and EM attain 0.2545 and 0.26, against 1 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.22 and 0.236, against 0.6181 for the proposed method.
Figure 5(b) shows the recall as a function of cluster size. When the cluster size is 17, SemIndex and EM attain recalls of 0.45 and 0.666, while the proposed cluster-based inverted indexing reaches 1. When the cluster size is 18, SemIndex and EM attain 0.3714 and 0.4667, against 0.9166 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.3142 and 0.4333, against 0.5666 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.3428 and 0.4666, against 0.9166 for the proposed method.

Figure 6. Experimental analysis using semantic query with data set-2: (a) precision, (b) recall and (c) F-measure.
Figure 5(c) shows the F-measure as a function of cluster size. When the cluster size is 17, SemIndex and EM attain F-measures of 0.40223 and 0.5654, while the proposed cluster-based inverted indexing reaches 0.8235. When the cluster size is 18, SemIndex and EM attain 0.3020 and 0.3339, against 0.9565 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.2588 and 0.3058, against 0.5913 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.2823 and 0.3294, against 0.9565 for the proposed method.

4.2.2.2. Analysis using semantic query.


The analysis made using the semantic query ‘freedom’ for precision, recall and F-measure with respect to the cluster size is explained in this section. Figure 6(a) shows the precision as a function of cluster size. When the cluster size is 17, SemIndex and EM attain precisions of 0.2 and 0.24, while the proposed cluster-based inverted indexing reaches 0.545. When the cluster size is 18, SemIndex and EM attain 0.3 and 0.327, against 0.7454 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.24 and 0.4, against 0.545 for the proposed method.
Figure 6(b) shows the recall as a function of cluster size. When the cluster size is 17, SemIndex and EM attain recalls of 0.3428 and 0.366, while the proposed cluster-based inverted indexing reaches 0.5. When the cluster size is 18, SemIndex and EM attain 0.4285 and 0.6, against 0.683 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.342 and 0.5, against 0.7333 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.371 and 0.4, against 0.5166 for the proposed method.

Table 1. Comparative discussion.

Metrics      SemIndex   EM       Proposed cluster-based inverted indexing
Precision    0.0888     0.4909   1
Recall       0.4583     0.5      0.70
F-measure    0.4022     0.5654   0.8235

EM: expectation–maximisation.
Figure 6(c) shows the F-measure as a function of cluster size. When the cluster size is 17, SemIndex and EM attain F-measures of 0.2526 and 0.290, while the proposed cluster-based inverted indexing reaches 0.5217. When the cluster size is 18, SemIndex and EM attain 0.3529 and 0.4235, against 0.713 for the proposed method. When the cluster size is 19, SemIndex and EM attain 0.2823 and 0.444, against 0.6255 for the proposed method. When the cluster size is 20, SemIndex and EM attain 0.2748 and 0.315, against 0.539 for the proposed method.

4.3. Comparative discussion


This section presents a comparative discussion of the proposed cluster-based inverted indexing by considering the maximal values attained for precision, recall and F-measure. SemIndex attained maximum precision, recall and F-measure values of 0.0888, 0.4583 and 0.4022, respectively. Similarly, EM attained maximum values of 0.4909, 0.5 and 0.5654, whereas the proposed cluster-based inverted indexing attained 1, 0.70 and 0.8235, respectively. Table 1 summarises this comparison.

5. Conclusion
A clustering algorithm named cluster-based inverted indexing is proposed in this research to retrieve relevant documents. The proposed algorithm retrieves relevant documents effectively by combining inverted indexing with the piecewise fuzzy C-means clustering algorithm. The documents are pre-processed using stop word removal and stemming to eliminate redundant and unnecessary words. Document indexing is then performed by the cluster-based inverted indexing algorithm, which takes the pre-processed documents and generates the index based on the keywords of the clustered documents. The resulting documents are further processed by the complex query matching step, where user queries, whether semantic or multigram, are matched using the Bhattacharyya distance; the best matches are those with the minimal Bhattacharyya distance. Query optimisation uses the Pearson correlation coefficient within an interactive query optimisation loop and retrieves the relevant documents efficiently. The proposed cluster-based inverted indexing algorithm performs better, with precision, recall and F-measure values of 1, 0.70 and 0.8235, respectively. The results show that the proposed document retrieval system retrieves the most relevant documents and speeds up the storage and retrieval of information. In future work, other optimal clustering algorithms may be explored for document retrieval, and the proposed system will be extended to image and video retrieval.
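The query matching step summarised above selects the document whose distribution minimises the Bhattacharyya distance to the query. A minimal sketch of that criterion (not the authors' implementation; the term distributions below are hypothetical) is:

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two normalised term distributions;
    a smaller distance means a closer query-document match."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

# Hypothetical normalised term distributions for a query and two documents
query = [0.5, 0.3, 0.2]
d1 = [0.4, 0.4, 0.2]
d2 = [0.1, 0.1, 0.8]

# Pick the document with the minimal distance, mirroring the matching rule
best = min(("d1", d1), ("d2", d2), key=lambda t: bhattacharyya_distance(query, t[1]))[0]
```

Identical distributions give a coefficient of 1 and hence a distance of 0, so ranking by ascending distance returns the closest matches first.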


Declaration of conflicting interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iD
Gunjan Chandwani https://orcid.org/0000-0001-8929-3927

References
[1] Chevalier M, El Malki M, Kopliku A et al. Implementation of multidimensional databases with document-oriented NoSQL. In: Madria S and Hara T (eds) Big data analytics and knowledge discovery. Cham: Springer, 2016, pp. 379–390.
[2] Raiber F and Kurland O. On identifying representative relevant documents. In: Proceedings of the 19th ACM international con-
ference on information and knowledge management, October 2010, pp. 99–108, https://ie.technion.ac.il/~kurland/relfb.pdf
[3] Martinho B and Santos MY. An architecture for data warehousing in big data environments. In: Tjoa A, Xu L, Raffai M et al.
(eds) Research and practical issues of enterprise information systems, vol. 268. Cham: Springer, 2016, pp. 237–250.
[4] Aye KN and Thein NL. A comparison of big data analytics approaches based on Hadoop MapReduce. In: Proceedings of the
11th international conference on computer applications, Yangon, Myanmar, 2013, https://www.academia.edu/3502325/
A_Comparison_of_Big_Data_Analytics_Approaches_Based_on_Hadoop_MapReduce
[5] Krishnan K. Data warehousing in the age of big data. 1st ed. Burlington, MA: Morgan Kaufmann Publishers, 2013.
[6] Doermann D. The indexing and retrieval of document images: a survey. Comput Vis Image Understand 1998; 70(3): 287–298.
[7] Hao S, Shi C, Niu Z et al. Concept coupling learning for improving concept lattice-based document retrieval. Eng Appl Artif
Intell 2018; 69: 65–75.
[8] Mothe J, Chrisment C, Dousset B et al. DocCube: multi-dimensional visualisation and exploration of large document sets. J Am
Soc Inform Sci Technol 2003; 54(7): 650–659.
[9] Thammasut D and Sornil O. A graph-based information retrieval system. In: Proceedings of the international symposium on
communications and information technologies, Bangkok, Thailand, 18–20 October 2006, pp. 743–748. New York: IEEE.
[10] Tozzi R, Ferrari F, Nieuwstad J et al. Tozzi classification of diaphragmatic surgery in patients with stage IIIC–IV ovarian cancer
based on surgical findings and complexity. J Gynecol Oncol 2020; 31(2): e14.
[11] Maron ME and Kuhns JL. On relevance, probabilistic indexing and information retrieval. ACM 1960; 7: 216–244.
[12] Fuhr N and Buckley C. A probabilistic learning approach for document indexing. ACM Trans Inform Syst 1991; 9(3): 223–224.
[13] Chahine CA, Chaignaud N, Kotowicz JP et al. Document indexing and retrieval using Wikipedia. In: Proceedings of the 7th
international conference on signal image technology & internet-based systems, Dijon, 28 November–1 December 2011.
[14] Jarke M and Koch J. Query optimization in database systems. ACM Comput Surv 1984; 16(2): 111–152.
[15] Yang Y and Chute CG. An example-based mapping method for text categorization and retrieval. ACM Trans Inform Syst 1994;
12(3): 252–277.
[16] Horng JT and Yeh CC. Applying genetic algorithms to query optimization in document retrieval. Inform Process Manage 2000;
36(5): 737–759.
[17] Lopez-Otero P, Parapar J and Barreiro A. Efficient query-by-example spoken document retrieval combining phone multigram
representation and dynamic time warping. Inform Process Manage 2019; 56(1): 43–60.
[18] Tekli J, Chbeir R, Traina AJM et al. SemIndex+: a semantic indexing scheme for structured, unstructured, and partly structured data. Knowl Based Syst 2019; 164: 378–403.
[19] Hao S, Shi C, Niu Z et al. Modeling positive and negative feedback for improving document retrieval. Exp Syst Appl 2019; 120:
253–261.
[20] González SM and Berbel TDR. Considering unstructured data for OLAP: a feasibility study using a systematic review. Rev Sist
Inform 2014; 14: 26–35.
[21] Rad HZ, Tiun S and Saad S. Lexical scoring system of lexical chain for Quranic document retrieval. J Lang Stud 2018; 18(2): 59–79.
[22] Gupta RK, Patel D and Bramhe A. A hash-based approach for document retrieval by utilizing term features. In: Behera HS,
Nayak J, Naik B et al. (eds) Computational intelligence in data mining. Singapore: Springer, 2019, pp. 617–627.
[23] Biswas S, Ganguly A, Shah R et al. Ranked document retrieval for multiple patterns. Theor Comput Sci 2018; 746: 98–111.
[24] Madaan R, Sharma AK, Dixit A et al. Indexing of semantic web for efficient question answering system. In: Hoda M,
Chauhan N, Quadri S et al. (eds) Software engineering. Singapore: Springer, 2019, pp. 51–61.
[25] Wu J, Wu Z, Cao J et al. Fuzzy consensus clustering with applications on big data. IEEE Trans Fuzzy Syst 2017; 25(6): 1430–1445.
[26] Twenty Newsgroups Data Set, https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups (accessed May 2019).
[27] The Reuters Dataset, https://martin-thoma.com/nlp-reuters/ (accessed May 2019).

