UNIT-1




‭1. Normalizing Incoming Items:‬

This step is about converting various types of incoming data into a consistent, standard format so that they can be easily processed and searched.

● Language Encoding: Ensure that text from different languages is properly encoded, typically in Unicode, which allows consistent display and search across languages.
● Different File Formats: Convert files from various formats (like text, images, videos) into a standard format. For example:
‭○‬ ‭Videos could be converted to formats like MPEG-2,‬
‭MPEG-1, AVI.‬
‭○‬ ‭Audio files to WAV, Real Audio.‬
‭○‬ ‭Images to GIF, JPEG, BMP.‬
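
The language-encoding step can be illustrated with a minimal Python sketch. It assumes the source encoding of each incoming item is already known, and it picks Unicode NFC as the standard form; both choices, and the function name `normalize_text`, are assumptions made for illustration rather than something the notes prescribe.

```python
import unicodedata

def normalize_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known encoding and return NFC-normalized Unicode."""
    text = raw.decode(source_encoding)
    return unicodedata.normalize("NFC", text)

# The same word arriving in two different byte representations.
latin1_bytes = "caf\u00e9".encode("latin-1")     # precomposed e-acute, Latin-1
utf8_decomposed = "cafe\u0301".encode("utf-8")   # 'e' + combining accent, UTF-8

a = normalize_text(latin1_bytes, "latin-1")
b = normalize_text(utf8_decomposed, "utf-8")
print(a == b)  # True: both items now compare equal and can be searched alike
```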

‭2. Logical Restructuring – Zoning:‬

Break down the content into meaningful sections. For example, if
‭you're processing an academic paper, divide it into sections like‬
‭Title, Author, Abstract, Main Text, Conclusion, References,‬
‭Keywords. This helps in more precise searching and better‬
‭display of search results.‬
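
As a rough sketch of zoning, the Python below splits an item into zones. It assumes, purely for illustration, that each zone starts on a line of the form "Label: text"; real items rarely arrive this cleanly, and `ZONE_LABELS` and `zone_item` are hypothetical names.

```python
ZONE_LABELS = ("Title", "Author", "Abstract", "Main Text",
               "Conclusion", "References", "Keywords")

def zone_item(text: str) -> dict[str, str]:
    """Split an item into zones keyed by label (assumes 'Label: text' lines)."""
    zones: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() in ZONE_LABELS:
            current = label.strip()          # a new zone starts here
            zones[current] = rest.strip()
        elif current is not None and line.strip():
            # continuation line: append to the zone currently being read
            zones[current] = (zones[current] + " " + line.strip()).strip()
    return zones

paper = """Title: Indexing Basics
Author: A. Student
Abstract: How items are tokenized
and stored.
Main Text: Items are normalized first."""

zones = zone_item(paper)
print(zones["Abstract"])                 # How items are tokenized and stored.
print("tokenized" in zones["Abstract"])  # a search restricted to one zone
```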

‭3. Creating a Searchable Data Structure (Indexing):‬

‭This involves several steps:‬

1. Identification of Processing Tokens:
○ Processing Tokens: These are the key pieces of information used in searches, often better defined than just words.
○ Valid Word Symbols: Alphabetic characters and numbers.
○ Inter-Word Symbols: Blanks, periods, semicolons (these don't affect the search).
○ Special Processing Symbols: Hyphens.
2. Words are defined as continuous sequences of valid word symbols separated by inter-word symbols.
3. Stop Algorithm:
○ Stop Words: Remove common words (like 'the', 'and') that appear in almost every document, or words that appear very infrequently, to save system resources.
○ Stop List: A predefined list of such stop words.

4. Characterize Tokens:
○ Word Characteristics: Identify specific features like proper names, acronyms, numbers, dates.
○ Part of Speech Tagging: Determine if the word is a noun, verb, etc.
○ Word Sense Disambiguation: Understand the meaning of a word based on context.
‭5.‬‭Stemming Algorithm:‬
‭○‬ ‭Stemming:‬‭Reduce words to their base or root form.‬
‭For example, 'computing', 'computers', and‬
‭'computation' are all reduced to 'comput'. This reduces‬
‭the number of unique words and saves storage space,‬
‭while also improving search efficiency.‬
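
The token steps above (identify words, apply a stop list, stem) can be sketched in a few lines of Python. This is a deliberately naive illustration: the stop list and suffix list are made-up samples, hyphens are simply treated as separators, and the suffix stripper is not a real stemming algorithm such as Porter's.

```python
import re

STOP_LIST = {"the", "and", "a", "of", "in", "for"}      # illustrative sample only
SUFFIXES = ("ation", "ers", "ing", "es", "er", "s")     # tried in this order

def tokenize(text):
    # Valid word symbols: letters and digits. Anything else separates words.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # Strip the first matching suffix, keeping a stem of at least 3 characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def processing_tokens(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_LIST]

result = processing_tokens("Computing, computers and the computation of indexes.")
print(result)  # ['comput', 'comput', 'comput', 'index']
```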

‭4. Creating the Searchable Data Structure:‬

After the processing tokens pass through the stemming algorithm, they are added to a searchable data structure. This structure could be a signature file, inverted list, or PAT tree, and it represents the semantic concepts of items in the database. Because searches run against this structure rather than against the original items, it limits what a user can find as a result of a search, while keeping retrieval efficient.
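
Of the structures named above, the inverted list is the simplest to sketch. The Python below is a minimal in-memory version under assumed inputs (item ids mapped to already-stemmed tokens); signature files and PAT trees are not shown.

```python
from collections import defaultdict

def build_inverted_list(items):
    """Map each processing token to the sorted ids of items containing it."""
    inverted = defaultdict(set)
    for item_id, tokens in items.items():
        for token in tokens:
            inverted[token].add(item_id)
    return {token: sorted(ids) for token, ids in inverted.items()}

# Hypothetical items, already reduced to processing tokens.
items = {
    1: ["comput", "index"],
    2: ["index", "search"],
    3: ["comput", "search"],
}
inverted = build_inverted_list(items)
print(inverted["index"])   # [1, 2]
print(inverted["comput"])  # [1, 3]
```

A token that never made it into this structure (for example a stop word) cannot be found by any search, which is the sense in which the structure limits what a user can retrieve.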


● Normalization: Convert and standardize different formats and languages.
‭●‬ ‭Zoning:‬‭Break down content into logical sections.‬
‭●‬ ‭Token Identification:‬‭Identify important searchable‬‭tokens‬
‭and remove unnecessary ones.‬
‭●‬ ‭Token Characterization:‬‭Determine the specific features‬
‭and context of tokens.‬
‭●‬ ‭Stemming:‬‭Reduce words to their base form to save‬
‭space and improve search efficiency.‬
‭●‬ ‭Indexing:‬‭Create an internal structure that represents the‬
‭data and enables efficient searching.‬

‭Selective Dissemination of Information (SDI):‬

SDI is a system that automatically matches new information against users' interests and delivers relevant items to them.

‭●‬ ‭How it works:‬

‭○‬ ‭Search Process:‬‭The system continuously searches‬
‭new items.‬
‭○‬ ‭User Profiles:‬‭Each user has a profile that describes‬
‭their interests.‬
○ User Mail Files: Where the system stores items matching user interests.
‭●‬ ‭User Profile:‬
‭○‬ ‭A broad search statement that describes what the‬
‭user is interested in.‬
‭○‬ ‭A list of mail files to receive documents that match the‬
‭search statement.‬
○ When a new item matches the profile, it is sent to the associated mail files.
‭●‬ ‭Difference from Ad Hoc Queries:‬
‭○‬ ‭Profiles have many search terms and cover a wide‬
‭range of interests.‬
‭○‬ ‭Ad hoc queries are short and specific.‬
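
The SDI flow described above (new item, profile match, delivery to mail files) might look like the following Python sketch. The profile layout, the user names, and the `min_matches` threshold are all assumptions for illustration; a real profile is a full search statement, not just a set of terms.

```python
from collections import defaultdict

# Hypothetical profiles: broad term sets plus the mail files to deliver into.
profiles = {
    "alice": {"terms": {"retrieval", "indexing", "stemming"},
              "mail_files": ["alice_ir"]},
    "bob": {"terms": {"video", "audio"},
            "mail_files": ["bob_media", "bob_all"]},
}
mail_files = defaultdict(list)

def disseminate(item_id, item_tokens, min_matches=1):
    """Compare one new item against every profile and file it where it matches."""
    tokens = set(item_tokens)
    for profile in profiles.values():
        if len(profile["terms"] & tokens) >= min_matches:
            for name in profile["mail_files"]:
                mail_files[name].append(item_id)

disseminate("item-1", ["stemming", "reduces", "index", "size"])
disseminate("item-2", ["audio", "formats"])
print(dict(mail_files))
```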

‭Document Database Search:‬

This allows users to search all items that have been received
‭and stored in the system.‬

‭●‬ ‭Components:‬
○ Search Process: The mechanism that handles user queries.
○ User Queries: Specific search statements entered by users.
‭○‬ ‭Document Database:‬‭The collection of all processed‬
‭and stored items.‬
‭●‬ ‭Characteristics of Document Database:‬
‭○‬ ‭Items usually do not change once stored.‬
‭○‬ ‭It can be partitioned by time and allow for archiving.‬
‭●‬ ‭Difference from Profiles:‬‭Queries are short and focused‬
‭on specific interests.‬

‭Index Database Search:‬

Users can save and organize items for future reference through indexing.

● Index Process:
○ Users can add items to an index with extra terms and descriptive text.
‭○‬ ‭The index can point to the original item or contain‬
‭detailed information about it.‬
‭●‬ ‭Components:‬
‭○‬ ‭Indexes:‬‭Like a library card catalog, they help‬
‭organize and find items.‬
‭○‬ ‭Index Database Search Process:‬‭Lets users create‬
‭and search indexes.‬
‭○‬ ‭Users can search the index and retrieve either the‬
‭index itself or the original item.‬
‭●‬ ‭Types of Index Files:‬
‭○‬ ‭Public Index Files:‬‭Managed by library staff and‬
‭include all items in the Document Database.‬
○ Private Index Files: Created by individual users; each user can have multiple private indexes.

‭Combined File Search:‬

This process integrates searches across both the document and
‭index databases.‬

‭●‬ ‭Public vs. Private Index Files:‬

‭○‬ ‭Public index files cover all items and are accessible to‬
‭all users.‬
‭○‬ ‭Private index files are specific to individual users and‬
‭cover a smaller subset of items.‬
‭●‬ ‭Database Management System:‬
○ Often, index files are managed using a structured database management system, such as a relational DBMS (RDBMS).

‭Automatic File Build (Information Extraction):‬

‭This process helps create indexes automatically.‬

‭●‬ ‭How it works:‬

○ Processes new documents and identifies key information such as authors, publication date, and source.
○ Rules for which documents to process and how to extract index terms are stored in Automatic File Build Profiles.
‭●‬ ‭Candidate Index Records:‬
‭○‬ ‭The result of processing new documents.‬
‭○‬ ‭Reviewed and edited by users before updating the‬
‭actual index file.‬
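
A small Python sketch of the idea follows. It assumes items carry "Author:", "Date:" and "Source:" header lines and that the extraction rules are regular expressions; the `BUILD_PROFILE` layout and the `reviewed` flag are hypothetical stand-ins for whatever rules a real system stores.

```python
import re

# Hypothetical extraction rules: field name -> pattern with one capture group.
BUILD_PROFILE = {
    "author": re.compile(r"^Author:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
    "source": re.compile(r"^Source:\s*(.+)$", re.MULTILINE),
}

def candidate_index_record(item_id, text):
    """Build a candidate record; a user still reviews it before it is filed."""
    record = {"item_id": item_id, "reviewed": False}
    for field, pattern in BUILD_PROFILE.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record

item = "Author: J. Doe\nDate: 2020-05-01\nSource: Example Journal\n\nBody text."
record = candidate_index_record("item-7", item)
print(record)
```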


● SDI: Automatically matches new items to user interests and delivers relevant information.
● Document Database Search: Allows users to search all stored items.
● Index Database Search: Enables users to save, organize, and search items using indexes.
● Combined File Search: Integrates document and index database searches.
● Automatic File Build: Automates the creation of index records by extracting key information from new documents.

‭Boolean Logic:‬

● Boolean logic allows users to combine search terms using operators like AND, OR, and NOT. For instance, "cats AND dogs" retrieves items containing both words, "cats OR dogs" retrieves items containing either word, and "cats NOT dogs" retrieves items containing "cats" but excluding those that also contain "dogs".

Proximity:

● Proximity search looks for words that appear close to each other within a specified distance. For example, searching "bake NEAR/5 cake" finds instances where "bake" and "cake" appear within five words of each other, which helps in locating related terms in context.
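
Boolean operators map naturally onto set operations over the items that contain each term, and a proximity test can be done on word positions. The Python sketch below uses three made-up items; interpreting "within five words" as a position difference of at most five is an assumption, since systems define the distance differently.

```python
docs = {
    1: "cats and dogs play".split(),
    2: "cats sleep all day".split(),
    3: "dogs bark".split(),
}

# Term -> set of item ids containing it.
postings = {}
for doc_id, tokens in docs.items():
    for tok in set(tokens):
        postings.setdefault(tok, set()).add(doc_id)

cats, dogs = postings["cats"], postings["dogs"]
both = sorted(cats & dogs)       # cats AND dogs -> [1]
either = sorted(cats | dogs)     # cats OR dogs  -> [1, 2, 3]
only_cats = sorted(cats - dogs)  # cats NOT dogs -> [2]

def near(tokens, a, b, distance):
    """True if some occurrence of a lies within `distance` positions of b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

recipe = "bake the chocolate cake slowly".split()
print(both, either, only_cats)
print(near(recipe, "bake", "cake", 5))  # True
print(near(recipe, "bake", "cake", 2))  # False
```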

‭Contiguous Word Phrases:‬

● This capability searches for exact phrases where words appear together in the same order. For example, searching for "climate change" returns results where these two words are next to each other, ensuring the phrase's specific context is maintained in the search results.
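
A contiguous-phrase test can be sketched as a sliding window over an item's tokens. The Python below is a minimal illustration on plain token lists; production systems usually answer this from positional information in the index instead.

```python
def contains_phrase(tokens, phrase):
    """True if the words of `phrase` occur adjacent and in order in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

phrase = ["climate", "change"]
print(contains_phrase("global climate change policy".split(), phrase))  # True
print(contains_phrase("change in climate".split(), phrase))             # False
```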

‭Fuzzy Searches:‬

‭●‬ ‭Fuzzy searches find words that are similar to the search‬
‭term, accommodating spelling variations and typos. For‬
‭example, searching for "color" might also return "colour."‬
‭This is useful when dealing with documents containing‬
‭typographical errors or different spellings of the same word.‬
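
One common way to define "similar" is edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The Python sketch below uses that measure with a threshold of 1; both the measure and the threshold are assumptions, as the notes do not say how similarity is computed.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def fuzzy_matches(term, vocabulary, max_distance=1):
    return [w for w in vocabulary if edit_distance(term, w) <= max_distance]

vocabulary = ["colour", "color", "collar", "cooler"]
matches = fuzzy_matches("color", vocabulary)
print(matches)  # ['colour', 'color']
```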

‭Term Masking:‬

● Term masking uses wildcards to replace characters in a search term. For example, "comp*" can find "computer," "compete," and "compile." The asterisk (*) represents any number of characters, while a question mark (?) can replace a single character, broadening the search scope.
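
Python's standard `fnmatch` module happens to use the same two wildcards, so it gives a quick way to try term masking against a small vocabulary. The word list here is made up for illustration.

```python
from fnmatch import fnmatchcase

vocabulary = ["computer", "compete", "compile", "company",
              "camp", "woman", "women", "wombat"]

def mask(pattern, words):
    """Return the words matching a pattern with * (any run) and ? (one char)."""
    return [w for w in words if fnmatchcase(w, pattern)]

suffix_mask = mask("comp*", vocabulary)
single_mask = mask("wom?n", vocabulary)
print(suffix_mask)  # ['computer', 'compete', 'compile', 'company']
print(single_mask)  # ['woman', 'women']
```

Note how "comp*" also pulls in "company", which shows how masking broadens a search beyond the terms the user may have had in mind.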

‭Numeric & Date Ranges:‬

● This capability allows searching within specific numeric or date ranges, for example, finding documents from 2010 to 2020 or products priced between $50 and $100. It helps in filtering search results based on quantitative criteria, like dates or numbers.
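
A range search compares values as numbers or dates rather than as text. The Python sketch below filters a few made-up records; the field names and the choice of inclusive bounds are assumptions for illustration.

```python
from datetime import date

items = [
    {"id": 1, "published": date(2009, 12, 31), "price": 45.0},
    {"id": 2, "published": date(2015, 6, 1), "price": 75.0},
    {"id": 3, "published": date(2020, 12, 31), "price": 100.0},
    {"id": 4, "published": date(2021, 1, 1), "price": 120.0},
]

def in_range(records, field, low, high):
    """Ids of records whose field value lies between low and high, inclusive."""
    return [r["id"] for r in records if low <= r[field] <= high]

by_date = in_range(items, "published", date(2010, 1, 1), date(2020, 12, 31))
by_price = in_range(items, "price", 50, 100)
print(by_date, by_price)  # [2, 3] [2, 3]
```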

‭Concept & Thesaurus Expansions:‬

● This search capability includes related concepts or synonyms to broaden search results. For example, searching for "happy" might also retrieve "joyful" or "content." Thesaurus expansions enhance search flexibility by understanding and including variations in terminology, ensuring comprehensive results.
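
Thesaurus expansion can be sketched as a lookup that adds synonyms to the query before it is run. The one-entry thesaurus in the Python below is a toy stand-in taken from the example above; real thesauri are much larger and often let the user pick which expansions to keep.

```python
# Toy thesaurus; the single entry mirrors the "happy" example in the text.
THESAURUS = {"happy": {"joyful", "content"}}

def expand(terms):
    """Return the query terms plus any synonyms the thesaurus lists for them."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= THESAURUS.get(term, set())
    return expanded

query = expand(["happy", "dog"])
print(sorted(query))  # ['content', 'dog', 'happy', 'joyful']
```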

‭Natural Language Queries:‬

● Natural language queries allow users to search using everyday language, mimicking human conversation. For example, instead of using keywords, a user might ask, "What is the capital of France?" The system interprets the question and retrieves relevant information, making searches more intuitive.

‭Multimedia Queries:‬

● Multimedia queries enable searching for various types of content such as images, videos, and audio files. For example, finding all videos related to "wildlife." This capability is essential for databases that include diverse media types, allowing users to locate non-textual information easily.

‭Browse Capabilities‬

‭○‬ ‭Ranking orders search results by relevance or‬
‭importance. This helps users see the most relevant‬
‭items first, based on criteria like keyword matches,‬
‭document popularity, or date of publication. For‬
‭example, a search for "renewable energy" will show‬
‭the most relevant articles at the top.‬
○ Zoning divides a document into logical sections such as title, author, abstract, and main text. This helps in targeted searching within specific sections. For example, a user might search only within the "abstract" zone to find relevant articles.
‭○‬ ‭Highlighting visually emphasizes search terms in the‬
‭results. When users search for a keyword, this‬
‭feature highlights occurrences of that keyword in the‬
‭displayed documents. This makes it easier for users‬
‭to spot the relevant information quickly.‬
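
Ranking and highlighting can both be sketched briefly in Python. The ranking below uses raw term counts, which is only one of many possible relevance criteria, and the `**` markers used for highlighting are an arbitrary choice for plain-text output; both are assumptions for illustration.

```python
import re

def rank(docs, term):
    """Ids of items containing `term`, most occurrences first."""
    scores = {doc_id: text.lower().split().count(term)
              for doc_id, text in docs.items()}
    hits = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(hits, key=lambda doc_id: scores[doc_id], reverse=True)

def highlight(text, term):
    """Wrap each whole-word occurrence of `term` in ** markers."""
    pattern = rf"\b({re.escape(term)})\b"
    return re.sub(pattern, r"**\1**", text, flags=re.IGNORECASE)

docs = {
    1: "solar energy and wind energy",
    2: "energy policy",
    3: "coal mining",
}
ranked = rank(docs, "energy")
print(ranked)                                # [1, 2]
print(highlight("Energy policy", "energy"))  # **Energy** policy
```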

‭Miscellaneous Capabilities‬

‭1.‬‭Vocabulary Browse:‬
‭○‬ ‭Vocabulary browsing allows users to explore terms‬
‭and their relationships within a specific domain or‬
‭subject. It often includes browsing through an index or‬
‭thesaurus to find related terms and expand searches‬
‭effectively. For example, exploring synonyms and‬
‭related terms for "biodiversity."‬
‭2.‬‭Iterative Search & Search History Log:‬
○ Iterative search involves refining searches based on previous results to narrow down to the most relevant information. The search history log keeps track of all search queries, allowing users to revisit and refine past searches for improved results.
‭3.‬‭Canned Query:‬
‭○‬ ‭Canned queries are pre-defined searches created for‬
‭common queries. These saved searches can be‬
‭quickly executed without having to re-enter the search‬
‭criteria. For example, a canned query for "latest‬
‭technology news" would fetch up-to-date articles on‬
‭that topic.‬
4. Multimedia:
○ Multimedia capabilities involve searching and retrieving various types of content like images, videos, and audio files. For instance, users can search for educational videos, photographs, or music files, enabling a richer and more diverse search experience.