Unit 1 Notes-1

1. What is Information Retrieval and where is it used?

2. What are the basic assumptions in IR, and what are the evaluation metrics used to evaluate the performance of a retrieval system?

Evaluation metrics are used to assess the performance of information retrieval systems. Some common evaluation metrics for retrieval systems include:

1. Precision: Precision measures the proportion of retrieved documents that are relevant to the query. It is calculated as the number of relevant documents retrieved divided by the total number of documents retrieved.
2. Recall: Recall measures the proportion of relevant documents that are retrieved by
the system. It is calculated as the number of relevant documents retrieved divided by
the total number of relevant documents in the corpus.
3. F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a
single measure that balances both precision and recall. It is calculated as 2 * (precision
* recall) / (precision + recall).
4. Mean Average Precision (MAP): MAP is the mean of the average precision scores over a set of queries, where the average precision for a single query is the average of the precision values computed at the rank of each relevant document retrieved.
5. Normalized Discounted Cumulative Gain (NDCG): NDCG measures the quality of
the ranked list of retrieved documents. It takes into account both the relevance of
retrieved documents and their positions in the ranked list.
6. Mean Reciprocal Rank (MRR): MRR measures the average of the reciprocal ranks of
the first relevant document retrieved for each query. It provides a single measure of
the system's effectiveness in retrieving relevant documents.
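
For instance, precision, recall, and the F1 score can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a minimal Python sketch; the document IDs and sets used here are made up purely for illustration:

    def precision_recall_f1(retrieved, relevant):
        # retrieved, relevant: sets of document IDs
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    # Hypothetical example: the system returns 4 documents, 3 of which are relevant,
    # and the corpus contains 5 relevant documents in total.
    retrieved = {"d1", "d2", "d3", "d7"}
    relevant = {"d1", "d2", "d3", "d5", "d9"}
    print(precision_recall_f1(retrieved, relevant))   # (0.75, 0.6, 0.666...)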
3. Explain with an example Term document incidence matrix?
Let's create a term-document incidence matrix using a simple example corpus
consisting of three documents:

1. Document 1: "The quick brown fox jumps over the lazy dog."
2. Document 2: "The lazy dog slept in the sun."
3. Document 3: "The quick brown fox played with the lazy dog."

First, we need to create a vocabulary of unique terms present in the corpus. Then,
we'll construct the term-document incidence matrix based on the presence or absence
of each term in each document.

Step 1: Create Vocabulary: The vocabulary consists of all unique terms present in the corpus. In this case, the vocabulary is:

Vocabulary: {The, quick, brown, fox, jumps, over, lazy, dog, slept, in, sun, played, with}

Step 2: Construct Term-Document Incidence Matrix: Each row of the matrix represents a document, and each column represents a term in the vocabulary. The value of each cell indicates the presence (1) or absence (0) of the corresponding term in the document.

              The  quick  brown  fox  jumps  over  lazy  dog  slept  in  sun  played  with
Document 1     1     1      1     1     1     1     1     1     0    0    0     0      0
Document 2     1     0      0     0     0     0     1     1     1    1    1     0      0
Document 3     1     1      1     1     0     0     1     1     0    0    0     1      1

In this matrix:

 Each row represents a document.
 Each column represents a term in the vocabulary.
 The value 1 indicates the presence of the term in the document, while 0 indicates absence.

This term-document incidence matrix serves as a compact representation of the corpus, enabling various text mining and analysis tasks such as information retrieval, text classification, and clustering.
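
A term-document incidence matrix like the one above can be built with a few lines of Python. This is only a sketch of the idea; unlike the example above, it lower-cases all tokens, so "The" and "the" collapse into a single vocabulary entry:

    import re

    documents = [
        "The quick brown fox jumps over the lazy dog.",
        "The lazy dog slept in the sun.",
        "The quick brown fox played with the lazy dog.",
    ]

    def tokenize(text):
        # Lowercase the text and keep alphabetic tokens only.
        return re.findall(r"[a-z]+", text.lower())

    doc_terms = [set(tokenize(doc)) for doc in documents]

    # Vocabulary: all unique terms in the corpus, sorted for a stable column order.
    vocabulary = sorted(set().union(*doc_terms))

    # Incidence matrix: one row per document, one column per term; 1 = present, 0 = absent.
    incidence = [[1 if term in terms else 0 for term in vocabulary] for terms in doc_terms]

    for doc_id, row in enumerate(incidence, start=1):
        print("Document", doc_id, row)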
4. Explain with an example incidence vectors?
Incidence vectors are vectors used to represent the occurrence of items (such as terms or features) in a
collection of documents or data instances. Each element of the vector corresponds to a specific item,
and its value indicates the presence or absence of that item in a particular document or data instance.

Let's illustrate incidence vectors with an example using a small corpus of documents:

Consider a corpus consisting of three documents:

1. Document 1: "The quick brown fox jumps over the lazy dog."
2. Document 2: "The lazy dog slept in the sun."
3. Document 3: "The quick brown fox played with the lazy dog."

We will create incidence vectors to represent the occurrence of terms in each document. First, we need
to define a vocabulary, which consists of all unique terms present in the corpus.

Vocabulary: {The, quick, brown, fox, jumps, over, lazy, dog, slept, in, sun, played, with}

Now, we create incidence vectors for each document:

1. Document 1:
 Incidence Vector: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
 Explanation: The term "The" appears once, "quick" appears once, "brown" appears
once, "fox" appears once, "jumps" appears once, "over" appears once, "lazy" appears
once, "dog" appears once, and all other terms do not appear (0).
2. Document 2:
 Incidence Vector: [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
 Explanation: "The" appears once, "quick" does not appear (0), "brown" does not
appear (0), "fox" does not appear (0), "jumps" does not appear (0), "over" does not
appear (0), "lazy" appears once, "dog" appears once, "slept" appears once, "in"
appears once, "sun" appears once, and all other terms do not appear (0).
3. Document 3:
 Incidence Vector: [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]
 Explanation: "The" appears once, "quick" appears once, "brown" appears once, "fox"
appears once, "jumps" does not appear (0), "over" does not appear (0), "lazy" appears
once, "dog" appears once, "slept" does not appear (0), "in" does not appear (0), "sun"
does not appear (0), "played" appears once, "with" appears once, and all other terms
do not appear (0).

In this example, each incidence vector represents the occurrence of terms in a document. The length of
each vector is equal to the size of the vocabulary, and each element of the vector indicates the presence
(1) or absence (0) of the corresponding term in the document. These incidence vectors are useful for
various text mining and analysis tasks, including information retrieval, text classification, and clustering.

5. Explain with an example Inverted Index and what are the steps to construct Inverted
Index?
Let's walk through an example of constructing an inverted index for a small
corpus of documents. For simplicity, let's consider a corpus of three documents:

1. Document 1: "The quick brown fox jumps over the lazy dog."
2. Document 2: "The lazy dog slept in the sun."
3. Document 3: "The quick brown fox played with the lazy dog."

Steps to Construct an Inverted Index:

1. Tokenization:
 Tokenize each document into terms or tokens. This involves breaking
down the text into individual words while removing punctuation and
other non-alphanumeric characters.
 Tokenized versions of the documents might look like this:
 Document 1: ["The", "quick", "brown", "fox", "jumps", "over",
"the", "lazy", "dog"]
 Document 2: ["The", "lazy", "dog", "slept", "in", "the", "sun"]
 Document 3: ["The", "quick", "brown", "fox", "played", "with",
"the", "lazy", "dog"]
2. Normalization:
 Normalize the terms by converting them to lowercase. This ensures
that variations of the same word (e.g., "The" and "the") are treated as
the same term.
 After normalization, the terms might look like this:
 Document 1: ["the", "quick", "brown", "fox", "jumps", "over",
"the", "lazy", "dog"]
 Document 2: ["the", "lazy", "dog", "slept", "in", "the", "sun"]
 Document 3: ["the", "quick", "brown", "fox", "played", "with",
"the", "lazy", "dog"]
3. Constructing the Inverted Index:
 For each unique term in the vocabulary, create an entry in the inverted
index that maps the term to the documents in which it appears.
 Each entry in the inverted index typically consists of a term and a list
of document identifiers or pointers to the documents containing that
term.
 The inverted index for the corpus might look like this:

      brown  -> [1, 3]
      dog    -> [1, 2, 3]
      fox    -> [1, 3]
      in     -> [2]
      jumps  -> [1]
      lazy   -> [1, 2, 3]
      over   -> [1]
      played -> [3]
      quick  -> [1, 3]
      slept  -> [2]
      sun    -> [2]
      the    -> [1, 2, 3]
      with   -> [3]

 Each row represents a unique term in the vocabulary, and the corresponding list of documents contains the identifiers of the documents where the term appears.
4. Sorting (Optional):
 Optionally, sort the inverted index alphabetically by terms to facilitate
efficient searching.

In summary, constructing an inverted index involves tokenization, normalization, and creating a mapping from terms to the documents in which they appear. This data structure enables efficient retrieval of documents containing specific terms, making it a fundamental component of many information retrieval systems.
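
As a rough sketch of these steps in Python (using the same three-sentence corpus as above):

    import re
    from collections import defaultdict

    documents = {
        1: "The quick brown fox jumps over the lazy dog.",
        2: "The lazy dog slept in the sun.",
        3: "The quick brown fox played with the lazy dog.",
    }

    inverted_index = defaultdict(list)
    for doc_id, text in documents.items():
        # Tokenization and normalization: lowercase, alphabetic tokens only.
        terms = set(re.findall(r"[a-z]+", text.lower()))
        for term in sorted(terms):
            inverted_index[term].append(doc_id)

    # Optional sorting of the terms for display and efficient lookup.
    for term in sorted(inverted_index):
        print(term, "->", inverted_index[term])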

6. Explain Boolean query processing with an example?
Boolean queries involve retrieving documents that match a set of specified conditions using Boolean operators such as AND, OR, and NOT. These queries are based on the Boolean model of information retrieval, which treats documents as binary entities (either relevant or non-relevant) based on the presence or absence of terms. Let's explain Boolean query processing with an example:

Consider a small document collection with three documents:

1. Document 1: "The quick brown fox jumps over the lazy dog."
2. Document 2: "The lazy dog slept in the sun."
3. Document 3: "The quick brown fox played with the lazy dog."

We'll use Boolean queries to retrieve documents that match specific conditions.

Example Boolean Queries:

1. AND Query:
 Retrieve documents that contain both terms.
 Example: Retrieve documents containing both "quick" and "brown."
 Result: Document 1, Document 3
 Explanation: Both "quick" and "brown" appear in Document 1 and Document
3.
2. OR Query:
 Retrieve documents that contain either of the terms.
 Example: Retrieve documents containing either "lazy" or "played."
 Result: Document 1, Document 2, Document 3
 Explanation: "Lazy" appears in Document 1, Document 2, and Document 3.
"Played" appears only in Document 3.
3. NOT Query:
 Retrieve documents that contain the first term but not the second term.
 Example: Retrieve documents containing "lazy" but not "slept."
 Result: Document 1, Document 3
 Explanation: "Lazy" appears in Document 1, Document 2, and Document 3. "Slept" appears only in Document 2, so Document 2 is excluded.
4. Combination of Boolean Queries:
 Boolean queries can be combined using parentheses and multiple operators.
 Example: Retrieve documents containing ("quick" AND "brown") OR ("lazy" AND NOT
"dog").
 Result: Document 1, Document 3
 Explanation: Documents 1 and 3 satisfy "quick" AND "brown." No document satisfies "lazy" AND NOT "dog," because every document containing "lazy" also contains "dog," so the result comes from the first clause.

In Boolean query processing, each term is treated as either present (1) or absent (0) in each
document. The Boolean operators allow for the combination of terms and the specification of
conditions for document retrieval. While Boolean queries provide precise control over
document retrieval, they may produce either too few or too many results, depending on the
query formulation and the nature of the document collection.
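
Using the inverted index from question 5, the Boolean operators map naturally onto set operations over postings lists. A minimal sketch, with the postings sets below hard-coded from that example:

    # Postings for a few terms from the example corpus (document IDs 1-3).
    index = {
        "quick": {1, 3}, "brown": {1, 3}, "lazy": {1, 2, 3},
        "dog": {1, 2, 3}, "played": {3}, "slept": {2},
    }

    print(index["quick"] & index["brown"])    # AND      -> {1, 3}
    print(index["lazy"] | index["played"])    # OR       -> {1, 2, 3}
    print(index["lazy"] - index["slept"])     # AND NOT  -> {1, 3}
    print((index["quick"] & index["brown"]) |
          (index["lazy"] - index["dog"]))     # combined -> {1, 3}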

7. Write merge algorithm?
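In Boolean retrieval, the merge algorithm intersects two postings lists that are sorted by document ID: walk both lists with one pointer each, advance the pointer that sits on the smaller document ID, and record a document whenever both pointers agree. The running time is O(m + n) for lists of lengths m and n. A minimal Python sketch:

    def intersect(p1, p2):
        # Merge (intersect) two postings lists sorted by document ID.
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # Example using postings from question 5: "quick" -> [1, 3], "dog" -> [1, 2, 3]
    print(intersect([1, 3], [1, 2, 3]))   # [1, 3]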


8. What are the steps to process unstructured and semi-structured data?
Any 5 steps:
Processing unstructured and semi-structured data involves several steps to organize, clean,
analyze, and extract valuable insights from the data. Below are the general steps involved in
processing unstructured and semi-structured data:

1. Data Collection:
 Gather data from various sources such as text documents, web pages, social media,
emails, sensor data, etc.
 Data can be collected manually or automatically using web scraping, APIs, or data
feeds.
2. Data Preprocessing:
 Clean the raw data to remove noise, irrelevant information, duplicates, and
inconsistencies.
 Normalize or standardize the data format and structure for consistency.
 Perform text preprocessing tasks such as tokenization, stemming, lemmatization, and stop word removal for textual data (a minimal preprocessing sketch appears after this list).
3. Data Parsing and Structuring:
 Parse semi-structured data formats such as XML, JSON, CSV, or HTML into a
structured format.
 Extract relevant information and attributes from semi-structured data fields.
4. Data Storage:
 Store the processed data in suitable storage systems based on the volume, velocity,
and variety of the data.
 Choose storage solutions such as relational databases, NoSQL databases, data lakes,
or cloud storage based on the requirements of the application.
5. Data Integration:
 Integrate data from different sources and formats into a unified dataset for analysis.
 Resolve schema mismatches, data inconsistencies, and conflicts during integration.
6. Data Analysis and Exploration:
 Perform exploratory data analysis (EDA) to understand the characteristics, patterns,
and relationships within the data.
 Use statistical analysis, visualization techniques, and machine learning algorithms to
gain insights from the data.
7. Text Mining and Natural Language Processing (NLP):
 Apply text mining and NLP techniques to extract structured information from
unstructured textual data.
 Perform tasks such as sentiment analysis, entity recognition, topic modeling, and
document classification.
8. Feature Engineering:
 Create new features or attributes from the existing data to improve the performance
of machine learning models.
 Feature engineering techniques include feature scaling, transformation, selection, and
creation of derived features.
9. Modeling and Prediction:
 Build predictive models using machine learning algorithms to make predictions or
classify data.
 Train the models on labeled data and evaluate their performance using suitable
metrics.
10. Deployment and Monitoring:
 Deploy the trained models into production environments for real-time or batch
processing.
 Monitor the performance of deployed models and update them regularly to maintain
accuracy and relevance.
11. Feedback Loop and Iteration:
 Continuously refine and improve data processing pipelines based on feedback from
users, stakeholders, and performance metrics.
 Iterate through the steps to incorporate new data, adapt to changing requirements,
and enhance the effectiveness of data processing workflows.
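
As referenced in step 2, text preprocessing can be sketched in a few lines of Python. The stop-word list and suffix-stripping rules below are made up for illustration; real pipelines typically use libraries such as NLTK or spaCy:

    import re

    # Illustrative stop-word list (real lists are much longer).
    STOP_WORDS = {"the", "a", "an", "in", "on", "of", "and", "or", "with", "over"}

    def preprocess(text):
        # Tokenization: lowercase and keep alphanumeric tokens only.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Stop-word removal: drop very common words that carry little content.
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # Very crude stemming: strip a few common suffixes (illustration only).
        stemmed = []
        for t in tokens:
            for suffix in ("ing", "ed", "s"):
                if t.endswith(suffix) and len(t) > len(suffix) + 2:
                    t = t[:-len(suffix)]
                    break
            stemmed.append(t)
        return stemmed

    print(preprocess("The quick brown fox jumps over the lazy dog."))
    # ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']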

9. Explain the steps of relationship extraction, regex and pattern-matching, and word embedding for unstructured and semi-structured data?
Relationship extraction, regex and pattern-matching, and word embedding are
techniques used in natural language processing (NLP) to extract structured information from
unstructured or semi-structured text data. Let's discuss the steps involved in each of these
techniques:

1. Relationship Extraction:
Relationship extraction involves identifying and extracting semantic relationships between
entities mentioned in text. These relationships can be binary (e.g., "works for," "married
to") or more complex (e.g., "is a parent of," "is a member of"). Here are the steps involved
in relationship extraction:
 Entity Recognition: Identify and extract entities mentioned in the text, such as
people, organizations, locations, dates, etc.
 Dependency Parsing: Analyze the syntactic structure of the sentences to understand
the relationships between words and phrases. Dependency parsing helps identify
subject-verb-object relationships and other syntactic patterns.
 Pattern Matching: Use predefined patterns or rules to extract specific relationships
based on syntactic or semantic patterns observed in the text.
 Named Entity Recognition (NER): Identify named entities in the text and their types
(e.g., person, organization, location). Named entities often play roles in relationships
and can help identify potential relationships between entities.
2. Regex and Pattern-Matching:
Regular expressions (regex) and pattern-matching techniques are used to search for
specific patterns or sequences of characters within text data. Here are the steps involved:
 Define Patterns: Define regular expressions or patterns that describe the text
patterns you want to extract. These patterns can include literals, wildcards, character
classes, quantifiers, and anchors.
 Compile Patterns: Compile the defined regular expressions into pattern objects that
can be used for matching against text data efficiently.
 Match Patterns: Apply the compiled patterns to the text data to find all occurrences
of the specified patterns. This step involves searching the text for substrings that
match the defined patterns.
 Extract Information: Extract relevant information from the matched patterns, such as dates, email addresses, phone numbers, URLs, etc. (a minimal regex sketch follows this list).
3. Word Embedding:
Word embedding is a technique used to represent words or phrases as dense vectors of
real numbers in a continuous vector space. It captures the semantic relationships between
words based on their contexts in large text corpora. Here are the steps involved:
 Tokenization: Split the text into individual words or tokens.
 Word Representation: Represent each word as a one-hot vector or an index in a
vocabulary.
 Training: Train a word embedding model (e.g., Word2Vec, GloVe, FastText) on a large corpus of text data to learn distributed representations of words based on their co-occurrence patterns.
 Embedding Lookup: Use the trained word embedding model to convert each word
in the text into its corresponding dense vector representation.
 Semantic Similarity: Measure the similarity between word vectors using cosine similarity or other distance metrics. Words with similar meanings will have vectors that are closer together in the embedding space (see the cosine-similarity sketch at the end of this answer).
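
A minimal regex and pattern-matching sketch in Python, as referenced in part 2 above. The patterns are illustrative and deliberately simple, not production-grade:

    import re

    text = "Contact alice@example.com by 2024-05-01 or visit https://example.com/docs"

    # Illustrative patterns for emails, ISO-style dates, and URLs.
    email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
    url_pattern = re.compile(r"https?://\S+")

    print(email_pattern.findall(text))   # ['alice@example.com']
    print(date_pattern.findall(text))    # ['2024-05-01']
    print(url_pattern.findall(text))     # ['https://example.com/docs']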

By following these steps, it's possible to extract structured information, search for specific patterns, and capture semantic relationships between words in unstructured or semi-structured text data, enabling a wide range of NLP applications such as information extraction, text mining, sentiment analysis, and document classification.
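
To illustrate the semantic-similarity step with a toy example: the 4-dimensional vectors below are invented purely for illustration (trained embeddings typically have 100-300 dimensions), and cosine similarity is computed with NumPy:

    import numpy as np

    # Invented toy embeddings; real vectors come from a trained embedding model.
    king  = np.array([0.8, 0.1, 0.7, 0.2])
    queen = np.array([0.7, 0.2, 0.8, 0.3])
    apple = np.array([0.1, 0.9, 0.0, 0.6])

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(king, queen))   # close to 1: similar vectors
    print(cosine_similarity(king, apple))   # much lower: dissimilar vectors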

12. What are the features of Apache Solr, what are fields in Apache Solr?
Apache Solr is a powerful open-source search platform built on Apache Lucene. It provides full-text
search, faceted search, hit highlighting, dynamic clustering, database integration, and rich document
handling capabilities. Here are some of the key features of Apache Solr:

1. Full-Text Search: Solr supports full-text search, allowing users to search for documents based on
the content of indexed fields.
2. Indexing: Solr indexes documents from various sources such as databases, XML files, JSON, and
plain text files. It supports automatic indexing and incremental updates.
3. Scalability: Solr is designed to be highly scalable, allowing users to distribute indexes across
multiple servers and handle large volumes of data.
4. Faceted Search: Solr supports faceted search, enabling users to drill down into search results
based on predefined categories or facets.
5. Geospatial Search: Solr includes support for geospatial search, allowing users to search for
documents based on geographic location.
6. Hit Highlighting: Solr provides hit highlighting functionality, which highlights search terms
within the retrieved documents to make them stand out to users.
7. Dynamic Clustering: Solr supports dynamic clustering, enabling users to group search results
based on common attributes or fields.
8. Document Ranking: Solr includes built-in ranking algorithms that determine the relevance of
search results based on factors such as term frequency, document length, and proximity of terms.
9. Rich Document Handling: Solr supports a wide range of document formats including Word,
PDF, HTML, XML, and JSON. It can extract text and metadata from these documents for
indexing and search.
10. Data Import: Solr includes features for importing data from external sources such as relational
databases, CSV files, and XML files.

Fields in Apache Solr: In Solr, a field is a component of a document that holds a specific type of
data, such as text, numbers, dates, or geographic coordinates. Fields are defined in the schema.xml
configuration file, which specifies the fields present in documents to be indexed and searched.
Here are some common types of fields in Apache Solr:

1. Text Fields: Text fields are used to store textual data such as titles, descriptions, or content. Text
fields are often tokenized during indexing to support full-text search.
2. String Fields: String fields store untokenized strings of text, such as IDs, names, or categories.
String fields are typically used for exact match or sorting purposes.
3. Numeric Fields: Numeric fields store numerical data such as integers, floats, or dates. Numeric
fields support range queries and sorting.
4. Date Fields: Date fields store date and time information. Date fields support date range queries
and date-based sorting.
5. Boolean Fields: Boolean fields store boolean values (true or false). Boolean fields are used for binary attributes such as in-stock status or product availability.
6. Spatial Fields: Spatial fields store geographic coordinates (latitude and longitude) or other spatial
data. Spatial fields support spatial search and distance-based queries.

Fields in Solr are defined with specific attributes such as field name, field type, indexing options,
and search options. By defining fields in the schema.xml file, users can control how documents are
indexed, searched, and retrieved in Apache Solr.
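
As a rough illustration of how fields are used at query time, the sketch below queries Solr's select handler over HTTP with the requests library. The Solr URL, core name ("books"), and field names (title, author, price) are assumptions made up for this example and must match your own schema:

    import requests

    # Hypothetical local Solr instance with a core named "books".
    SOLR_SELECT_URL = "http://localhost:8983/solr/books/select"

    params = {
        "q": "title:solr",           # full-text query against a text field
        "fq": "price:[10 TO 50]",    # filter query: range over a numeric field
        "fl": "id,title,author",     # fields to return
        "rows": 10,                  # maximum number of results
    }

    response = requests.get(SOLR_SELECT_URL, params=params)
    for doc in response.json()["response"]["docs"]:
        print(doc)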
