Information Retrieval On Cranfield Dataset

Information Retrieval on Cranfield Dataset
Vanya BK1
Indian Institute Of Technology, Madras

ae17b045@smail.iitm.ac.in
Abstract. The aim of this work is to quantitatively measure the in-

formation retrieval on Cranfield dataset using Vector Space Model as
the baseline and compare its results against Latent Semantic Indexing
Model, Latent Dirichlet Allocation and Word2Vec embeddings.
Keywords: Information Retrieval · Vector Space Model · Latent Se-

mantic Indexing
1 Introduction
Information retrieval is the is the science of searching for information in docu-
ments, searching for documents themselves, searching for meta-data which de-
scribe documents or searching within databases, whether relational stand-alone
databases or hyper textually-networked databases such as World Wide Web.
An Information Retrieval (IR) model selects and ranks the document that is
required by the user or the user has asked for in the form of a query. The doc-
uments and the queries are represented in a similar manner, so that document
selection and ranking can be formalized by a matching function that returns a
retrieval status value (RSV) for each document in the collection. Many of the
Information Retrieval systems represent document contents by a set of descrip-
tors, called terms, belonging to a vocabulary V. An IR model determines the
query-document matching function according to four main approaches:
– Acquisition: In this step, the selection of documents and other objects from
various web resources that consist of text-based documents takes place. The
required data is collected by web crawlers and stored in the database.
– Representation: It consists of indexing that contains free-text terms, con-
trolled vocabulary, manual automatic techniques as well. example: Ab-
stracting contains summarizing and Bibliographic description that contains
author, title, sources, data, and metadata.
– File Organization: There are two types of file organization methods. i.e.
Sequential: It contains documents by document data. Inverted: It contains
term by term, list of records under each term. Combination of both.
– Query: An IR process starts when a user enters a query into the system.
Queries are formal statements of information needs, for example, search
strings in web search engines. In information retrieval, a query does not
uniquely identify a single object in the collection. Instead, several objects
may match the query, perhaps with different degrees of relevancy.
2 Vanya BK
2 Motivation
The major limitations of the vector space model are :
– It assumes that all words are independent.

– It makes the consideration of all words impractical, since each word is a
dimension, and considering all words would imply expensive computations
in a very high-dimensional space.
The above limitations can be overcome by employing methods to extract top-

ics/features of documents and queries using the words present in them which
would help in overcoming the limitation of words considered as independent in
Vector space model. Also these models are efficient in computation and does not
involve very high dimensions
3 Information Retrieval Models
3.1 Vector Space Model
A vector space model is an algebraic model, involving two steps, in first step
the text documents is represented into vector of words and in second step it
is transformed to numerical format so that any text mining techniques such
as information retrieval, information extraction,information filtering etc. can be
applied. The text document represented as vector of words is converted to nu-
merical format using the term frequency and inverse document frequency values.
Term frequency refers to the number of times a word appears in the given docu-
ment and inverse document frequency is given by log(N/n) where N is the total
number of documents present and n is the number of documents in which the
word occurs. These two terms are multiplied to give the value of the word in
the word vector used to represent the documents. These vectors are used to find
document similarities or for retrieving the documents for a given query by rep-
resenting the query in a similar fashion(using IDF(inverse document frequency)
calculated from the documents) and finding the similarity between the docu-
ment and query vectors using similarity measures like cosine similarity, Jaccard
distance, Euclidean distance, etc.
3.2 Latent Semantic Indexing
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a
mathematical technique called singular value decomposition (SVD) to identify
patterns in the relationships between the terms and concepts contained in an
unstructured collection of text. LSI is based on the principle that words that are
used in the same contexts tend to have similar meanings. A key feature of LSI
is its ability to extract the conceptual content of a body of text by establishing
associations between those terms that occur in similar contexts. LSI uncovers
the underlying latent semantic structure in the usage of words in a body of text
Information Retrieval on Cranfield Dataset 3
and how it can be used to extract the meaning of the text in response to user
queries, commonly referred to as concept searches. Queries, or concept searches,
against a set of documents that have undergone LSI will return results that are
conceptually similar in meaning to the search criteria even if the results don’t
share a specific word or words with the search criteria.[2]
3.3 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is a generative probabilistic model for col-

lections of discrete data such as text corpora. LDA is a three-level hierarchical
Bayesian model, in which each item of a collection is modeled as a finite mixture
over an underlying set of topics. Each topic is, in turn, modeled as an infinite
mixture over an underlying set of topic probabilities. In the context of text mod-
eling, the topic probabilities provide an explicit representation of a document.
LDA allows sets of observations to be explained by unobserved groups that ex-
plain why some parts of the data are similar. For example, if observations are
words collected into documents, it posits that each document is a mixture of a
small number of topics and that each word’s presence is attributable to one of
the document’s topics.[1]
3.4 Word2Vec
Word embedding is one of the most popular representation of document vo-

cabulary. It is capable of capturing context of a word in a document, semantic
and syntactic similarity, relation with other words, etc. They are vector rep-
resentations of a particular word. Word2Vec is a method to construct such an
embedding. It can be obtained using two methods (both involving Neural Net-
works):
– Continuous Bag Of Words : This method takes the context of each word as
the input and tries to predict the word corresponding to the context. One hot
encoding of the input word is used to measure the output error compared to
one hot encoding of the target word . In the process of predicting the target
word, the vector representation of the target word is learnt.
– Skip Gram : In this method the target word is used to predict the context
words. In other words, target word is provided as input into the network and
the model outputs C probability distributions. For each context position, C
probability distributions of V probabilities, one for each word, are obtained.
(V-vocabulary size, C-number of context words predicted)[3]
4 Methodology
4.1 Data Preprocessing
The documents are preprocessed by the following techniques :

4 Vanya BK
– Segmentation - Sentences are segmented by any of the two methods :

• Naive segmentation : The sentences are segmented naively using the end
of sentence characters like ’.’,’ ?’,’,’ !’,’”’,’;’,’:’.
• Punkt segmentation : The sentences are segmented using the PunktSen-
tenceTokenizer of the natural language toolkit. This tokenizer divides
a text into a list of sentences by using an unsupervised algorithm to
build a model for abbreviation words, collocations, and words that start
sentences.
– Tokenization
• Naive Tokenization : Sentences are naively split based on whitespace and
commas. The split token are converted to lower case uniformly across the
dataset.
• PennTreeBank Tokenization : Sentences are tokenized using the Tree-
bankWordTokenizer of the natural language toolkit. This tokenizer uses
regular expressions to tokenize text as in Penn Treebank.
– Inflection Reduction : It is used to reduce inflectional forms and sometimes
derivationally related forms of a word to a common base form
• Lemmatization : It is used to reduce inflectional forms by taking into
consideration the morphological analysis of the words. To do so, it is
necessary to have detailed dictionaries which the algorithm can look
through to link the form back to its lemma. Example : studies is con-
verted to study
• Stemming : This algorithm work by cutting off the end or the beginning
of the word, taking into account a list of common prefixes and suffixes
that can be found in an inflected word. Example : studies is converted
to study
– StopWord Removal : It is used to remove the words which does not add much
meaning to a sentence. The list of such words is obtained from collection of
stopwords present in the natural language toolkit corpus.
4.2 Models
After the data preprocessing step, the documents are represented as numerical
vectors/features of the words present in the documents using one of the 4 models
:
– Vector Space Model : The document vectors are constructed by multiplying

the term frequency and inverse document frequency(IDF) values of a term in
a given document. Similarly a query vector is constructed by multiplying the
inverse document frequency(IDF) of words obtained from the term document
matrix and the term frequency of each term in the query. In case the query
contains words which is not seen before a smoothing factor of 1 is included
in IDF formula to give the IDF value of the unseen word as log(N) (N is the
number of documents).
– Latent Semantic Indexing : The lsimodel of Gensim is used to model the doc-
ument vectors. The bag of words format of the document is passed to the
lsimodel which implements a fast truncated SVD (Singular Value Decom-
position).Then the document similarity is computed which helps build an
index for a given set of documents. Using these indices a similairity measure
between the document and the query can be computed.
– Latent Dirichlet Allocation : The ldaModel of gensim is used to obtain the
topic distribution of the documents as well as the queries. The vectors ob-
tained from the model is a percentage distribution of each topic correspond-
ing to that document/query. The number of topic is set to
– Word Embeddings : The publicly available Google word2vec embeddings is
used to provide the word embeddings for each word. The documents and
queries are then modelled using these word embeddings corresponding to
the terms present in them.
The similarity measure is then obtained using the cosine similarity between the
query and document vectors. The retrieved documents are ranked based on the
cosine similarity values of the retrieved document vector with the given query
vector.
4.3 Measurement Techniques
In order to quantitatively measure the quality of the documents retrieved for

each of the queries, the following 5 measurements are used :
– Precision is the number of correct documents retrieved divided by the total

number of documents retrieved(Here total number of documents retrieved is
11 for all queries)
– Recall is the number of correct documents retrieved divided by the total
number of correct documents
– F-Score is given as the harmonic mean of precision and recall. Usually pre-
cision is high when recall is low and vice versa so, finding just a mean of
the two values would not do any good. Whereas F-score penalises the lower
values accordingly and thus helping to build a better model
– Average Precision(mAP) is calculated by taking an average of all the preci-
sion@K values where the Kth position has a correct document.
– Normalized Discounted Cumulative Gain(nDCG)
p p
X reli X reli
DCGp = = rel1 + (1)
i=1
log2 (i + 1) i=2
log2 (i + 1)
DCGp
nDCGp = (2)
IDCGp
|RELp |
X 2reli − 1
IDCGp = (3)
i=1
log2 (i + 1)
6 Vanya BK
where,
IDCG is ideal discounted cumulative gain
RELp represents the list of relevant documents (ordered by their relevance)
in the corpus up to position p.
reli is the relavance score
All the above values are then averaged over all queries.
5 Experiments and Results
5.1 Vector Space Model
Precision, Recall and F-score @ 1.0 : 0.63556 , 0.10865 , 0.17863

MAP, nDCG @ 1.0 : 0.63556 , 0.63556
MAP, nDCG @ 2.0 : 0.68444 , 0.68391
MAP, nDCG @ 3.0 : 0.69111 , 0.69603
MAP, nDCG @ 4.0 : 0.69049 , 0.70460
MAP, nDCG @ 5.0 : 0.68120 , 0.71107
MAP, nDCG @ 6.0 : 0.67178 , 0.71476
MAP, nDCG @ 7.0 : 0.66479 , 0.71667
MAP, nDCG @ 8.0 : 0.65008 , 0.70604
MAP, nDCG @ 9.0 : 0.64613 , 0.70895
MAP, nDCG @ 10.0 : 0.63478 , 0.70521
5.2 Latent Semantic Indexing with number of topics=20

MAP, nDCG @ 1.0 : 0.40000 , 0.40000
MAP, nDCG @ 2.0 : 0.46444 , 0.47442
MAP, nDCG @ 3.0 : 0.48630 , 0.50860
MAP, nDCG @ 4.0 : 0.49037 , 0.52299
MAP, nDCG @ 5.0 : 0.48636 , 0.53013
MAP, nDCG @ 6.0 : 0.47909 , 0.53205
8 Vanya BK
MAP, nDCG @ 7.0 : 0.47535 , 0.53430

MAP, nDCG @ 8.0 : 0.47173 , 0.54072
MAP, nDCG @ 9.0 : 0.46742 , 0.54850
MAP, nDCG @ 10.0 : 0.46381 , 0.55009
5.3 Latent Semantic Indexing with number of topics=300

MAP, nDCG @ 1.0 : 0.67111 , 0.67111
MAP, nDCG @ 2.0 : 0.71556 , 0.70994
MAP, nDCG @ 3.0 : 0.72370 , 0.72341
MAP, nDCG @ 4.0 : 0.71654 , 0.72633

MAP, nDCG @ 5.0 : 0.71075 , 0.73144
MAP, nDCG @ 6.0 : 0.69756 , 0.72794
MAP, nDCG @ 7.0 : 0.69065 , 0.72720
MAP, nDCG @ 8.0 : 0.67861 , 0.72655
MAP, nDCG @ 9.0 : 0.67024 , 0.72598
MAP, nDCG @ 10.0 : 0.66476 , 0.72495
10 Vanya BK
5.4 Latent Direchlet Allocation with number of topics = 20

MAP, nDCG @ 1.0 : 0.00444 , 0.00444 Precision, Recall and F-score @ 2.0 :
0.00444 , 0.00049 , 0.00089
0.00593 , 0.00089 , 0.00155
0.00556 , 0.00114 , 0.00189
0.00622 , 0.00157 , 0.00249
0.00667 , 0.00199 , 0.00305
0.00635 , 0.00224 , 0.00329
0.00611 , 0.00248 , 0.00351
0.00593 , 0.00273 , 0.00372

0.00578 , 0.00298 , 0.00391
MAP, nDCG @ 10.0 : 0.00711 , 0.00869
5.5 Latent Direchlet Allocation with number of topics = 200
Precision, Recall and F-score @ 1 : 0.0044, 0.0004, 0.0008

MAP, nDCG @ 1 : 0.0044, 0.0044
MAP, nDCG @ 2 : 0.0044, 0.0044
MAP, nDCG @ 3 : 0.0044, 0.0044
MAP, nDCG @ 4 : 0.0044, 0.0044
MAP, nDCG @ 5 : 0.0044, 0.0044
MAP, nDCG @ 6 : 0.0044, 0.0044
MAP, nDCG @ 7 : 0.0044, 0.0044
MAP, nDCG @ 8 : 0.0044, 0.0044
MAP, nDCG @ 9 : 0.0044, 0.0044
12 Vanya BK
MAP, nDCG @ 10 : 0.0044, 0.0044
5.6 Word2Vec
MAP, nDCG @ 1 : 0.4967, 0.4985
MAP, nDCG @ 2 : 0.5142, 0.5168
MAP, nDCG @ 3 : 0.5298, 0.5378
MAP, nDCG @ 4 : 0.5361, 0.5512
MAP, nDCG @ 5 : 0.5286, 0.5698
MAP, nDCG @ 6 : 0.5134, 0.5789
MAP, nDCG @ 7 : 0.5076, 0.5890

MAP, nDCG @ 8 : 0.5032, 0.0.5945
MAP, nDCG @ 9 : 0.5010, 0.0.6120
MAP, nDCG @ 10 : 0.5003, 0.6031
6 Comparative study of all the models over the baseline

Baseline = Vector Space Model
6.1 Latent Semantic Indexing

From the results above we can see that Latent Semantic indexing with num-
ber of topics = 300 gives much better results than vector space model. This
14 Vanya BK
is because if the query and the document do not have sufficient word overlap
then the vector space model fails to retrieve any document because it does not
take into consideration synonyms whereas the Latent semantic indexing looks
at feature representation of the documents(not only in terms of the number of
words present) which helps in retrieving similar documents which have very little
word overlap with the query[6]. Because LSI does not depend on literal keyword
matching, it is especially useful when the text input is noisy, as in OCR (Optical
Character Reader), open input, or spelling errors. Thus no document would be
retrieved if the query contains any spelling error in case of vector space model
which is not the case in Latent Semantic Indexing. It can also be seen that when
the number of topics in LSI is increased from 20 to 300 topics the performance
drastically increases because the number of topics is optimal to be around 300 as
shown in [4], leading to better results. Since with the optimal number of topics
it can better model the documents and queries, it gives us better performance.
6.2 Latent Dirichlet allocation

We can observe from the results that the Latent Dirichlet allocation performs
the worst among all the models. A possible reason for this is the small dataset. It
is known that Latent Dirichlet Allocation works well only for considerably large
datasets in order to model the topics properly and generalise it over any unseen
data point as well.The number of topics in the dataset are specified beforehand
which is subjective and doesn’t always highlight the true distribution of topics.
Hence it is observed that with either 20 or 200 topics the performance of the
model still remains worse. The topics are predicted based on the multinomial
distribution and then the words are predicted based on another multinomial
distribution trained specific to that topic. If the true structure is more complex
than a multinomial distribution or if the data to train isn0 t sufficient, then it
might underfit, leading to worse performance than the vector space model. If the
LDA is fast to run, then it is very hard to get to the optimal solution indicating
it is very much sensitive to hyperparameter tuning leading to worse result.[7]
6.3 Word2Vec
Here also it can be observed that the results are comparatively worse compared
to vector space model because of the small dataset. Fine-tuning the word2vec
embeddings on small dataset can cause a lot of issues like :
– There might be no, or very few, examples where the desired-to-be-alike
words are in similar nearby-word contexts. With no examples where there
are shared nearby-words, there’s little basis for nudging the pair to the same
place – the internal Word2Vec task, of predicting nearby words, doesn’t need
them to be near each other, if they’re each predicting completely different
words.
– Words need to move a lot from their original random positions, to eventually
reach the useful ’constellation’ gotten form a successful word2vec session. If
a word only appears in your dataset 10 times, and is trained over the dataset
for 10 iterations, that word gets just 100 CBOW(Continuous Bag of Word)
nudges. If instead the word happens 1000 times, then 10 training iterations
give it 10,000 CBOW updates – a much better chance to move from an
initially-arbitrary location to a useful location.
Since only the co-occurrence is concerned, the word vector contains limited se-
mantic information of the word vector which are not very much efficient in
retrieving the documents containing synonyms. The word2vec embeddings used
are fine tuned on the dataset using the pre-trained model rather than training
from scratch because the training loss decays more quickly for the pre-trained
embedding-based models compared to the embedding-layer-based model trained
from scratch.[8]
Table 1. Comparison of results.
Models Precision Recall F-score MAP nDCG

Vector Space Model 0.27689 0.39306 0.30073 .63478 0.70521
Latent Dirichlet Allocation 0.0044 0.0044 0.0044 .0044 0.0044
Word2Vec 0.1388 0.2231 0.1231 .5003 0.6031
Latent Semantic Indexing 0.30311 0.42705 0.32832 0.66476 0.72495
References
1. Blei, David M. and Ng, Andrew Y. and Jordan, Michael I. Latent dirichlet allocation.
JJ. Mach. Learn. Res. 3 993–1022 (2003)
2. Foltz, Peter W. Latent semantic analysis for text-based research. Behavior Research
Methods, Instruments, and Computers. 28(2) 197-202 (1996)
3. Tomas Mikolov and Kai Chen and Greg Corrado and Jeffrey Dean Efficient Esti-
mation of Word Representations in Vector Space. arXiv:1301.3781 [cs.CL] (1999)
4. Grant, Scott. Cordy, James R.: Estimating the Optimal Number of La-
tent Concepts in Source Code Analysis. 10th IEEE Working Conference
on Source Code Analysis and Manipulation, Timisoara, 2010, pp. 65-74,
https://doi.org/10.1109/SCAM.2010.22.(2010)
5. LNCS Homepage, http://www.springer.com/lncs. Last accessed 4 Oct 2017
6. Rosario Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring
(2000)
7. Revert Felix. An overview of topics extraction in Python with LDA
https://doi.org/https://towardsdatascience.com/the-complete-guide-for-topics-
extraction-in-python-a6aaa6cedbbc
8. Farahmand Meghdad. Pre-trained Word Embeddings or Embedding Layer?
— A Dilemma https://doi.org/https://towardsdatascience.com/pre-trained-word-
embeddings-or-embedding-layer-a-dilemma-8406959fd76c

Information Retrieval On Cranfield Dataset

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Information Retrieval On Cranfield Dataset

Uploaded by

Copyright:

Available Formats

Information Retrieval on Cranfield Dataset

Indian Institute Of Technology, Madras

Abstract. The aim of this work is to quantitatively measure the in-

Keywords: Information Retrieval · Vector Space Model · Latent Se-

The major limitations of the vector space model are :

– It assumes that all words are independent.

The above limitations can be overcome by employing methods to extract top-

3 Information Retrieval Models

3.1 Vector Space Model

3.2 Latent Semantic Indexing

3.3 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a generative probabilistic model for col-

Word embedding is one of the most popular representation of document vo-

4.1 Data Preprocessing

The documents are preprocessed by the following techniques :

– Segmentation - Sentences are segmented by any of the two methods :

– Vector Space Model : The document vectors are constructed by multiplying

4.3 Measurement Techniques

In order to quantitatively measure the quality of the documents retrieved for

– Precision is the number of correct documents retrieved divided by the total

5 Experiments and Results

5.1 Vector Space Model

Precision, Recall and F-score @ 1.0 : 0.63556 , 0.10865 , 0.17863

MAP, nDCG @ 10.0 : 0.63478 , 0.70521

5.2 Latent Semantic Indexing with number of topics=20

MAP, nDCG @ 7.0 : 0.47535 , 0.53430

5.3 Latent Semantic Indexing with number of topics=300

MAP, nDCG @ 4.0 : 0.71654 , 0.72633

5.4 Latent Direchlet Allocation with number of topics = 20

Precision, Recall and F-score @ 1.0 : 0.00444 , 0.00025 , 0.00047

0.00593 , 0.00273 , 0.00372

5.5 Latent Direchlet Allocation with number of topics = 200

Precision, Recall and F-score @ 1 : 0.0044, 0.0004, 0.0008

MAP, nDCG @ 10 : 0.0044, 0.0044

MAP, nDCG @ 7 : 0.5076, 0.5890

6 Comparative study of all the models over the baseline

6.1 Latent Semantic Indexing

6.2 Latent Dirichlet allocation

Table 1. Comparison of results.

Models Precision Recall F-score MAP nDCG

You might also like