Information Retrieval (Tech-Neo Publications)

Table of Contents

CHAPTER 1 : Introduction to Information Retrieval System
1.1 Definition and Goals of Information Retrieval
  1.1.1 Information Retrieval Involves a Range of Tasks and Applications
1.2 Components of an IR System
1.3 Challenges and Applications of IR

CHAPTER 2 : Document Indexing, Storage, and Compression
2.1 Inverted Index
2.2 Inverted Index Construction and Compression Techniques
  2.2.1 Inverted Index Construction (Simple Index Construction, Merging, Data Placement, MapReduce)
  2.2.2 Compression Techniques (Dictionary Compression, Bit-Aligned Codes, Variable-Byte Code)
2.3 Document Representation and Term Weighting
  2.3.1 Document Representation
  2.3.2 Term Weighting

CHAPTER 3 : Retrieval Models
3.1 Boolean Model
3.2 Boolean Operators
3.3 Query Processing
  3.3.1 Document-at-a-Time Query Processing
  3.3.2 Efficient Query Processing with Heaps
  3.3.3 Term-at-a-Time Query Processing
3.4 Vector Space Model
3.5 Probabilistic Model

CHAPTER 4 : Spelling Correction in IR Systems
4.1 Spelling Correction
4.2 Challenges of Spelling Errors in Queries and Documents
4.3 Edit Distance and String Similarity Measures
4.4 Techniques for Spelling Correction in IR Systems
  4.4.1 k-gram Indexes for Spelling Correction
  4.4.2 Context-Sensitive Spelling Correction
  4.4.3 Phonetic Correction

CHAPTER 5 : Performance Evaluation
5.1 Evaluation Metrics
  5.1.1 Recall and Precision
  5.1.2 F-measure
  5.1.3 Average Precision
5.2 Test Collections and Relevance Judgments

CHAPTER 6 : Text Categorization and Filtering
6.1 Text Classification/Categorization Algorithms
  6.1.1 Naive Bayes
  6.1.2 Support Vector Machine (SVM)
6.2 Feature Selection
6.3 Dimensionality Reduction
6.4 Applications of Text Categorization and Filtering

CHAPTER 7 : Text Clustering for Information Retrieval
7.1 Clustering Techniques
  7.1.1 K-means Clustering
  7.1.2 Hierarchical Clustering
7.2 Evaluation of Clustering Results
7.3 Clustering for Query Expansion and Result Grouping

CHAPTER 8 : Web Information Retrieval
8.1 Web Search Architecture and Challenges
  8.1.1 Web Search and Search Engines
  8.1.2 Web Structure
  8.1.3 Challenges of Web Search
  8.1.4 Web Search Architecture
8.2 Crawling and Indexing Web Pages
  8.2.1 Web Crawling
  8.2.2 Indexing the Web Pages or Web Indexes
8.3 Link Analysis and PageRank Algorithm
  8.3.1 Link Analysis
  8.3.2 PageRank Algorithm

CHAPTER 9 : Learning to Rank
9.1 Learning to Rank (LTR) : Algorithms and Techniques
9.2 Pairwise and Listwise Learning to Rank Approaches
  9.2.1 Pairwise Learning to Rank Approaches
  9.2.2 Listwise Learning to Rank Approaches
9.3 Supervised Learning for Ranking : RankSVM, RankBoost
  9.3.1 RankSVM
  9.3.2 RankBoost
9.4 Evaluation Metrics for Learning to Rank
CHAPTER 10 : Link Analysis and its Role in IR Systems
10.1 Web Graph Representation and Link Analysis
  10.1.1 Web Graph
  10.1.2 Link Analysis
  10.1.3 Link Analysis Algorithms
10.2 HITS and PageRank Algorithms
  10.2.1 HITS (Hyperlink-Induced Topic Search) Algorithm
  10.2.2 PageRank Algorithm
10.3 Applications of Link Analysis in IR Systems

CHAPTER 11 : Crawling and Near-Duplicate Page Detection
11.1 Web Page Crawling Techniques : Breadth-First, Depth-First
  11.1.1 Breadth-First
  11.1.2 Depth-First
11.2 Focused Crawling
11.3 Near-Duplicate Detection Algorithms
11.4 Handling Dynamic Web Content During Crawling

CHAPTER 12 : Text Summarization, Question Answering, and Recommender Systems
12.1 Text Summarization
  12.1.1 Extractive Approach
  12.1.2 Abstractive Approach
12.2 Question Answering : Approaches for Finding Precise Answers
12.3 Recommender Systems
  12.3.1 Collaborative Filtering
  12.3.2 Content-based Filtering

CHAPTER 13 : Cross-Lingual and Multilingual Retrieval
13.1 Cross-Lingual Retrieval and Multilingual Retrieval (Cross-Lingual Search and Multilingual Search)
  13.1.1 Cross-Lingual Retrieval or Cross-Lingual Information Retrieval
  13.1.2 Multilingual Retrieval or Multilingual Information Retrieval
13.2 Challenges and Techniques for Cross-Lingual Retrieval
  13.2.1 Techniques for Cross-Lingual Retrieval
  13.2.2 Challenges for Cross-Lingual Retrieval
13.3 Machine Translation (MT) for IR
13.4 Multilingual Document Representations and Query Translation
13.5 Evaluation Techniques for IR Systems

CHAPTER 14 : User-based Evaluation
14.1 User-based Evaluation
  14.1.1 User Studies
  14.1.2 Surveys
14.2 Test Collections and Benchmarking
14.3 Online Evaluation Methods : A/B Testing, Interleaving Experiments
  14.3.1 A/B Testing
  14.3.2 Interleaving Experiments

CHAPTER 1 : Introduction to Information Retrieval System

Introduction to Information Retrieval (IR) systems : Definition and goals of information retrieval, Components of an IR system, Challenges and applications of IR.

1.1 DEFINITION AND GOALS OF INFORMATION RETRIEVAL

• Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections.
• Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
• Information retrieval is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.
• IR systems and services are now widespread, with millions of people depending on them daily to facilitate business, education, and entertainment.
• Web search engines (Google, Bing, and others) are by far the most popular and heavily used IR services, providing access to up-to-date technical information, locating people and organizations, summarizing news and events, and simplifying comparison shopping.
• The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure.
• It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
• In reality, almost no data are truly "unstructured". This is definitely true of all text data if you count the latent linguistic structure of human languages.
• IR is also used to facilitate "semi-structured" search, such as finding a document where the title contains Java and the body contains threading.
• The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents.

1.1.1 Information Retrieval Involves a Range of Tasks and Applications

• The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order.
• The World Wide Web (web search) is by far the most common application involving information retrieval, but search is also a crucial part of applications in corporations, government, and many other domains.
• Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic.
• Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases.
• Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed.
• Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file-sharing tool for music, but can be used in any community based on shared interests, or even shared locality in the case of mobile devices.
• Information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on.
• Search based on a user query (sometimes called ad hoc search, because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering.
• Filtering or tracking involves detecting stories of interest based on a person's interests and providing an alert using email or some other mechanism.

1.2 COMPONENTS OF AN IR SYSTEM

Fig. 1.2.1 : Components of an IR system

• Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic, particularly when it is presented in written form as part of a test collection for IR evaluation.
• As a result of her information need, the user constructs and issues a query to the IR system. Typically, this query consists of a small number of terms, with two to three terms being typical for a Web search.
• We use "term" instead of "word" because a query term may in fact not be a word at all. Depending on the information need, a query term may be a date, a number, a musical note, or a phrase. Wildcard operators and other partial-match operators may also be permitted in query terms. For example, the term "inform*" might match any word starting with that prefix ("inform", "informs", "informal", "informant", "informative", etc.).
• The user's query is processed by a search engine, which may be running on the user's local machine, on a large cluster of machines in a remote geographic location, or anywhere in between. A major task of a search engine is to maintain and manipulate an inverted index for a document collection. This index forms the principal data structure used by the engine for searching and relevance ranking.
• To support relevance ranking algorithms, the search engine maintains collection statistics associated with the index, such as the number of documents containing each term and the length of each document. In addition, the search engine usually has access to the original content of the documents, in order to report meaningful results back to the user.
• Using the inverted index, collection statistics, and other data, the search engine accepts queries from its users, processes these queries, and returns ranked lists of results.
• To perform relevance ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for each document. After sorting documents according to their scores, the result list may be subjected to further processing, such as the removal of duplicate or redundant results.
• For example, a Web search engine might report only one or two results from a single host or domain, eliminating the others in favor of pages from different sources. The problem of scoring documents with respect to a user's query is one of the most fundamental in the field.

1.3 CHALLENGES AND APPLICATIONS OF IR

• Document routing, filtering, and selective dissemination reverse the typical IR process. Whereas a typical search application evaluates incoming queries against a given document collection, a routing, filtering, or dissemination system compares newly created or discovered documents to a fixed set of queries supplied in advance by users, identifying those that match a given query closely enough to be of possible interest to the users. A news aggregator, for example, might use a routing system to separate the day's news into sections such as "business," "politics," and "lifestyle," or to send headlines of interest to particular subscribers.
• Text clustering and categorization systems group documents according to shared properties. The difference between clustering and categorization stems from the information provided to the system. Categorization systems are provided with training data illustrating the various classes : examples of "business," "politics," and "lifestyle" articles might be provided to a categorization system, which would then sort unlabelled articles into the same categories. A clustering system, in contrast, is not provided with training examples. Instead, it sorts documents into groups based on patterns it discovers itself.
• Summarization systems reduce documents to a few key paragraphs, sentences, or phrases describing their content. The snippets of text displayed with Web search results represent one example.
• Information extraction systems identify named entities, such as places and dates, and combine this information into structured records that describe relationships between these entities, for example, creating lists of books and their authors from Web data.
• Topic detection and tracking systems identify events in streams of news articles and similar information sources, tracking these events as they evolve.
• Expert search systems identify members of organizations who are experts in a specified area.
• Question answering systems integrate information from multiple sources to provide concise answers to specific questions. They often incorporate and extend other IR technologies, including search, summarization, and information extraction.
• Multimedia information retrieval systems extend relevance ranking and other IR techniques to images, video, music, and speech.

Chapter Ends...

CHAPTER 2 : Document Indexing, Storage, and Compression

Document Indexing, Storage, and Compression : Inverted index construction and compression techniques, Document representation and term weighting, Storage and retrieval of indexed documents.

2.1 INVERTED INDEX

• Text search is very different from traditional computing tasks, so it calls for its own kind of data structure, the inverted index. The name "inverted index" is really an umbrella term for many different kinds of structures that share the same general philosophy.
• The inverted index (sometimes called an inverted file) is the central data structure in virtually every information retrieval system. At its simplest, an inverted index provides a mapping between terms and their locations of occurrence in a text collection C.
• An inverted index is organized by index term. The index is inverted because usually we think of words being a part of documents, but if we invert this idea, documents are associated with words.
• Index terms are often alphabetized like a traditional book index, but they need not be, since they are often found directly using a hash table. Each index term has its own inverted list that holds the relevant data for that term.

Fig. 2.1.1 : A schema-independent inverted index for Shakespeare's plays. The dictionary provides a mapping from terms to their positions of occurrence.

• The fundamental components of an inverted index, the dictionary and the postings lists, are illustrated in the figure above.
• The dictionary lists the terms contained in the vocabulary V of the collection. Each term has associated with it a postings list of the positions in which it appears, consistent with the positional numbering.
• The index shown above contains not document identifiers but "flat" word positions of the individual term occurrences. This type of index is called a schema-independent index because it makes no assumptions about the structure (usually referred to as schema in the database community) of the underlying text. We chose the schema-independent variant for most of the examples in this chapter because it is the simplest.
• We define an inverted index as an abstract data type (ADT) with four methods :
  o first(t) returns the first position at which the term t occurs in the collection;
  o last(t) returns the last position at which t occurs in the collection;
  o next(t, current) returns the position of t's first occurrence after the current position;
  o prev(t, current) returns the position of t's last occurrence before the current position.

2.2 INVERTED INDEX CONSTRUCTION AND COMPRESSION TECHNIQUES

2.2.1 Inverted Index Construction

Before an index can be used for query processing, it has to be created from the text collection. Building a small index is not particularly difficult, but as input sizes grow, some index construction tricks can be useful.

2.2.1.1 Simple Index Construction

• The process involves only a few steps. A list of documents is passed to the BuildIndex function, and the function parses each document into tokens. These tokens are words, perhaps with some additional processing, such as downcasing or stemming. The function removes duplicate tokens, using, for example, a hash table.
• Then, for each token, the function determines whether a new inverted list needs to be created in the index I, and creates one if necessary.
• Finally, the current document number, n, is added to the inverted list. The result is a hash table of tokens and inverted lists.
• The inverted lists are just lists of integer document numbers and contain no special information. This is enough to do very simple kinds of retrieval, and this indexer can be used for many small tasks, for example, indexing fewer than a few thousand documents. However, it is limited in two ways.
• First, it requires that all of the inverted lists be stored in memory, which may not be practical for larger collections.
• Second, this algorithm is sequential, with no obvious way to parallelize it. The primary barrier to parallelizing this algorithm is the hash table, which is accessed constantly in the inner loop. Adding locks to the hash table would allow parallelism for parsing, but that improvement alone will not be enough to make use of more than a handful of CPU cores.
• Handling large collections will require less reliance on memory and improved parallelism.
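The following is a minimal sketch of this simple in-memory indexer, written to match the description above; the tokenize helper is an illustrative stand-in for real tokenization (downcasing, stemming, punctuation handling), not the book's own code.

    # Simple in-memory index construction (BuildIndex), as described above.
    def tokenize(text):
        # stand-in for real tokenization: downcase and split on whitespace
        return text.lower().split()

    def build_index(documents):
        index = {}                            # token -> inverted list of doc numbers
        for n, document in enumerate(documents):
            tokens = set(tokenize(document))  # remove duplicate tokens
            for token in tokens:
                if token not in index:        # create a new inverted list if needed
                    index[token] = []
                index[token].append(n)        # add the current document number n
        return index

    docs = ["Do you quarrel sir", "Quarrel sir no sir"]
    print(build_index(docs))   # {'do': [0], 'you': [0], 'quarrel': [0, 1], ...}

Everything here lives in one process's memory, which is exactly the limitation that the merging technique below addresses.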
2.2.1.2 Merging

• The classic way to solve the memory problem in the previous example is by merging. We can build the inverted list structure I until memory runs out. When that happens, we write the partial index I to disk, then start making a new one.
• At the end of this process, the disk is filled with many partial indexes, I1, I2, I3, ..., In. The system then merges these files into a single result.
• By definition, it is not possible to hold even two of the partial index files in memory at one time, so the input files need to be carefully designed so that they can be merged in small pieces.
• One way to do this is to store the partial indexes in alphabetical order. It is then possible for a merge algorithm to merge the partial indexes using very little memory.

Fig. 2.2.1 : An example of index merging. The first and second indexes are merged together to produce the combined index.

2.2.1.3 Data Placement

• Before diving into the mechanics of distributed processing, consider the problems of handling huge amounts of data on a single computer.
• Distributed processing and large-scale data processing have one major aspect in common, which is that not all of the input data is available at once.
• In distributed processing, the data might be scattered among many machines. In large-scale data processing, most of the data is on the disk. In both cases, the key to efficient data processing is placing the data correctly.

2.2.1.4 MapReduce

• MapReduce is a distributed programming framework that focuses on data placement and distribution. As we saw in the last few examples, proper data placement can make some problems very simple to compute.
• By focusing on data placement, MapReduce can unlock the parallelism in some common tasks and make it easier to process large amounts of data.
• MapReduce gets its name from the two pieces of code that a user needs to write in order to use the framework : the Mapper and the Reducer.
• The MapReduce library automatically launches many Mapper and Reducer tasks on a cluster of machines.
• The interesting part about MapReduce, though, is the path the data takes between the Mapper and the Reducer. Before we look at how the Mapper and Reducer work, let's look at the foundations of the MapReduce idea.
• The functions map and reduce are commonly found in functional languages. In very simple terms, the map function transforms a list of items into another list of items of the same length. The reduce function transforms a list of items into a single item.
• The MapReduce framework isn't quite so strict with its definitions : both Mappers and Reducers can return an arbitrary number of items. However, the general idea is the same.

Fig. 2.2.2 : MapReduce

• The MapReduce steps are summarized in the figure above. We assume that the data comes in a set of records. The records are sent to the Mapper, which transforms these records into pairs, each with a key and a value.
• The next step is the shuffle, which the library performs by itself. This operation uses a hash function so that all pairs with the same key end up next to each other and on the same machine.
• The final step is the reduce stage, where the records are processed again, but this time in batches, meaning all pairs with the same key are processed at once.
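To make the division of labour concrete, here is a small self-contained sketch in the spirit of the description above, building postings lists from (key, value) pairs. The function names are illustrative, not a real MapReduce library API, and the shuffle, which the framework normally performs across machines, is simulated here by grouping pairs by key.

    # Illustrative Mapper/Reducer pair for building postings lists.
    def mapper(doc_id, text):
        for term in text.lower().split():
            yield (term, doc_id)          # emit one (key, value) pair per occurrence

    def shuffle(pairs):
        groups = {}                       # the library normally performs this step
        for key, value in pairs:
            groups.setdefault(key, []).append(value)
        return groups

    def reducer(term, doc_ids):
        # all pairs with the same key are processed at once
        return (term, sorted(set(doc_ids)))

    docs = {1: "do you quarrel sir", 2: "quarrel sir no sir"}
    pairs = [p for d, t in docs.items() for p in mapper(d, t)]
    index = dict(reducer(k, v) for k, v in shuffle(pairs).items())
    print(index["sir"])   # [1, 2]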
2.2.2 Compression Techniques

GQ. Discuss various index compression techniques.

• Compression techniques are the most powerful tool for managing the memory hierarchy. The inverted lists for a large collection are themselves very large. In fact, when it includes information about word position and document extents, the index can be comparable in size to the document collection.
• Compression allows the same inverted list data to be stored in less space. The obvious benefit is that this could reduce disk or memory requirements, which would save money.
• More importantly, compression allows data to move up the memory hierarchy. If index data is compressed by a factor of four, we can store four times more useful data in the processor cache, and we can feed data to the processor four times faster.
• On disk, compression also squeezes data closer together, which reduces seek times.
• Unfortunately, nothing is free. The space savings of compression comes at a cost : the processor must decompress the data in order to use it. Therefore, it isn't enough to pick the compression technique that can store the most data in the smallest amount of space. In order to increase overall performance, we need to choose a compression technique that reduces space and is easy to decompress.
• Here we consider only lossless compression techniques. Lossless techniques store data in less space, but without losing information.
• There are also lossy data compression techniques, which are often used for video, images, and audio. These techniques achieve very high compression ratios, but do so by throwing away the least important data.
• Inverted list pruning techniques, which we discuss later, could be considered a lossy compression technique, but typically when we talk about compression, we mean only lossless methods.

2.2.2.1 Dictionary Compression

• This section presents a series of dictionary data structures that achieve increasingly higher compression ratios. The dictionary is small compared with the postings file, so why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?
• One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation.
• Thus, the main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput.
• Although dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation may have to index a multiterabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages.
• We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers. Other reasons for wanting to conserve memory are fast startup time and having to share resources with other applications.

2.2.2.2 Bit-Aligned Codes

• In bit-aligned codes, code words are not restricted to end on byte boundaries. In all of the techniques we'll discuss, we are looking at ways to store small numbers in inverted lists (such as word counts, word positions, and delta-encoded document numbers) in as little space as possible.
• One of the simplest codes is the unary code.
• You are probably familiar with binary, which encodes numbers with two symbols, typically 0 and 1. A unary number system is a base-1 encoding, which means it uses a single symbol to encode numbers. Here are some examples :

  Number  Code
  0       0
  1       10
  2       110
  3       1110
  4       11110
  5       111110

• In general, to encode a number k in unary, we output k 1s, followed by a 0. We need the 0 at the end to make the code unambiguous.
• This code is very efficient for small numbers such as 0 and 1, but quickly becomes very expensive. For instance, the number 1023 can be represented in 10 binary bits, but requires 1024 bits to represent in unary code.
• Now we know about two kinds of numeric encodings. Unary is convenient because it is compact for small numbers and is inherently unambiguous. Binary is a better choice for large numbers, but it is not inherently unambiguous.
• A reasonable compression scheme needs to encode frequent numbers with fewer bits than infrequent numbers, which means binary encoding is not useful on its own for compression.

2.2.2.3 Variable-Byte Code

• The codes described in the previous sections are bit-aligned, as they do not represent an integer using a multiple of a fixed number of bits, e.g., a byte. But reading a stream of bits in chunks where each chunk is a byte of memory (or a multiple of a byte, e.g., a memory word of 4 or 8 bytes) is simpler and faster, because the data itself is written in memory in this way.
• Therefore, it could be preferable to use byte-aligned or word-aligned codes when decoding speed is the main concern rather than compression effectiveness.
• Variable-byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are "payload" and encode part of the gap. The first bit of the byte is a continuation bit. It is set to 1 for the last byte of the encoded gap and to 0 otherwise.
• To decode a variable-byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1. We then extract and concatenate the 7-bit parts.
• The main advantage of variable-byte codes is decoding speed : we just need to read one byte at a time until we find a byte with its continuation bit set. Conversely, the number of bits used to encode an integer cannot be less than 8, so variable-byte coding is only suitable for large numbers, and its compression ratio may not be competitive with that of bit-aligned codes for small integers.
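A short sketch of both codes, written from the definitions above; the byte layout follows the convention just described (continuation bit set on the last byte), though real implementations differ in such details.

    # Unary and variable-byte encoding, following the definitions above.
    def unary(k):
        return "1" * k + "0"            # k ones followed by a terminating zero

    def vb_encode(n):
        out = [n % 128]                 # 7-bit payload chunks, most significant first
        while n >= 128:
            n //= 128
            out.insert(0, n % 128)
        out[-1] += 128                  # set the continuation bit on the last byte
        return bytes(out)

    def vb_decode(data):
        numbers, n = [], 0
        for byte in data:
            if byte < 128:
                n = 128 * n + byte      # continuation bit 0: more bytes follow
            else:
                n = 128 * n + byte - 128
                numbers.append(n)       # continuation bit 1: this number is complete
                n = 0
        return numbers

    print(unary(3))                                    # 1110
    print(vb_decode(vb_encode(5) + vb_encode(1023)))   # [5, 1023]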
2.3 DOCUMENT REPRESENTATION AND TERM WEIGHTING

2.3.1 Document Representation

• Document representation is concerned with how textual documents should be represented in various tasks, e.g., text processing, retrieval, and knowledge discovery and mining.
• Its prevailing approach is the vector space model, i.e., a document d_i is represented as a vector of term weights, one weight for each term in T, where T is the collection of terms that occur at least once in the document collection D.

2.3.2 Term Weighting

• Term weighting is a procedure that takes place during the text indexing process in order to assess the value of each term to the document.
• Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.
• Essentially it considers the relative importance of individual words in an information retrieval system, which can improve system effectiveness, since not all the terms in a given document collection are of equal importance.
• Index term weights reflect the relative importance of words in documents, and are used in computing scores for ranking.
• The specific form of a weight is determined by the retrieval model. The weighting component calculates weights using the document statistics and stores them in lookup tables.
• Weighing the terms is the means that enables the retrieval system to determine the importance of a given term in a certain document or a query. It is a crucial component of any information retrieval system, one that has shown great potential for improving retrieval effectiveness.
• Each term in a document is assigned a weight that depends on the number of occurrences of the term in the document. The simplest choice is to assign the weight to be equal to the number of occurrences of term t in document d. This weighting scheme is referred to as term frequency and is denoted tf_{t,d}, with the subscripts denoting the term (t) and the document (d) in order.
• Document frequency : The document frequency df_t is defined to be the number of documents in the collection that contain a term t.
• Denoting as usual the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows :

  idf_t = log (N / df_t)

TF-IDF weighting

• We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by :

  tf-idf_{t,d} = tf_{t,d} × idf_t
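A toy computation of these formulas; the three-document collection and the base-10 logarithm are illustrative choices, not fixed by the definitions above.

    # tf-idf on a toy collection, directly from the formulas above.
    import math

    docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
    N = len(docs)

    def tf(t, d):
        return d.split().count(t)

    def df(t):
        return sum(1 for d in docs if t in d.split())

    def tf_idf(t, d):
        return tf(t, d) * math.log10(N / df(t))

    print(round(tf_idf("cat", docs[0]), 3))  # tf=1, df=1: 1 * log10(3) = 0.477
    print(round(tf_idf("the", docs[0]), 3))  # tf=2, df=2: 2 * log10(1.5) = 0.352

Note how the common term "the" receives a lower idf, and hence a lower weight per occurrence, than the rarer term "cat".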
Chapter Ends...

CHAPTER 3 : Retrieval Models

Retrieval Models : Boolean model : Boolean operators, query processing, Vector space model : TF-IDF, cosine similarity, query-document matching, Probabilistic model : Bayesian retrieval, relevance feedback.

3.1 BOOLEAN MODEL

• Apart from the implicit Boolean filters applied by Web search engines, explicit support for Boolean queries is important in specific application areas such as digital libraries and the legal domain.
• In contrast to ranked retrieval, Boolean retrieval returns sets of documents rather than ranked lists. Under the Boolean retrieval model, a term t is considered to specify the set of documents containing it.
• The standard Boolean operators (AND, OR, and NOT) are used to construct Boolean queries, which are interpreted as operations over these sets, as follows :

3.2 BOOLEAN OPERATORS

  A AND B : intersection of A and B (A ∩ B)
  A OR B : union of A and B (A ∪ B)
  NOT A : complement of A with respect to the document collection (Ā)

where A and B are terms or other Boolean queries.

Table 3.2.1 : A text fragment from Shakespeare's Romeo and Juliet, act 1, scene 1.

  Document ID  Document Content
  1            Do you quarrel, sir?
  2            Quarrel sir! no, sir!
  3            If you do, sir, I am for you: I serve as good a man as you.
  4            No better.
  5            Well, sir.

• For example, over the collection in the table above, the query ("quarrel" OR "sir") AND "you" specifies the set {1, 3}, whereas the query ("quarrel" OR "sir") AND NOT "you" specifies the set {2, 5}.
• Our algorithm for solving Boolean queries is another variant of the phrase searching algorithm.
• The algorithm locates candidate solutions to a Boolean query, where each candidate solution represents a range of documents that together satisfy the Boolean query, such that no smaller range of documents contained within it also satisfies the query.
• When the range represented by a candidate solution has a length of 1, this single document satisfies the query and should be included in the result set.
• To simplify the definition of our Boolean search algorithm, we define two functions that operate over Boolean queries, extending the nextDoc and prevDoc methods of schema-dependent inverted indices :

  docRight(Q, u) : end point of the first candidate solution to Q starting after document u
  docLeft(Q, v) : start point of the last candidate solution to Q ending before document v

• For terms we define :

  docRight(t, u) = nextDoc(t, u)
  docLeft(t, v) = prevDoc(t, v)

and for the AND and OR operators we define :

  docRight(A AND B, u) = max(docRight(A, u), docRight(B, u))
  docLeft(A AND B, v) = min(docLeft(A, v), docLeft(B, v))
  docRight(A OR B, u) = min(docRight(A, u), docRight(B, u))
  docLeft(A OR B, v) = max(docLeft(A, v), docLeft(B, v))

• To determine the result for a given query, these definitions are applied recursively. For example :

  docRight(("quarrel" OR "sir") AND "you", 1)
    = max(docRight("quarrel" OR "sir", 1), docRight("you", 1))
    = max(min(docRight("quarrel", 1), docRight("sir", 1)), nextDoc("you", 1))
    = max(min(nextDoc("quarrel", 1), nextDoc("sir", 1)), 3)
    = max(min(2, 2), 3)
    = 3

  docLeft(("quarrel" OR "sir") AND "you", 4)
    = min(docLeft("quarrel" OR "sir", 4), docLeft("you", 4))
    = min(max(docLeft("quarrel", 4), docLeft("sir", 4)), prevDoc("you", 4))
    = min(max(prevDoc("quarrel", 4), prevDoc("sir", 4)), 3)
    = min(max(2, 3), 3)
    = 3
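For the document-level semantics in Table 3.2.1, the same results can be obtained with plain set operations. This sketch deliberately sidesteps the range-based candidate algorithm above and simply evaluates the operators over postings stored as document-ID sets:

    # Boolean operators as set operations over document-level postings.
    postings = {"quarrel": {1, 2}, "sir": {1, 2, 3, 5}, "you": {1, 3}}
    all_docs = {1, 2, 3, 4, 5}

    def AND(a, b): return a & b          # intersection
    def OR(a, b):  return a | b          # union
    def NOT(a):    return all_docs - a   # complement w.r.t. the collection

    print(AND(OR(postings["quarrel"], postings["sir"]), postings["you"]))
    # {1, 3}
    print(AND(OR(postings["quarrel"], postings["sir"]), NOT(postings["you"])))
    # {2, 5}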
3.3 QUERY PROCESSING

• Efficient query processing is a particularly important problem in web search, which has reached a scale that would have been hard to imagine just 10 years ago. People all over the world type in over half a billion queries every day, searching indexes containing billions of web pages.
• Inverted indexes are at the core of all modern web search engines. The query processing algorithm depends on the retrieval model, and dictates the contents of the index. This works in reverse, too, since we are unlikely to choose a retrieval model that has no efficient query processing algorithm.
• Traditional information retrieval systems usually follow the disjunctive approach, while Web search engines often employ conjunctive query semantics.
• The conjunctive retrieval model leads to faster query processing than the disjunctive model, because fewer documents have to be scored and ranked. However, this performance advantage comes at the cost of lower recall : if a relevant document contains only two of the three query terms, it will never be returned to the user.
• This limitation can be severe. For one example three-term query, of the half-million documents in the TREC collection, 7,834 match the disjunctive interpretation of the query, whereas only a single document matches the conjunctive version. Incidentally, that document is not even relevant.

3.3.1 Document-at-a-Time Query Processing

• The most common form of query processing for ranked retrieval is called the document-at-a-time approach. In this method all matching documents are enumerated, one after the other, and a score is computed for each of them.
• At the end all documents are sorted according to their score, and the top k results (where k is chosen by the user or the application) are returned to the user.

  rankBM25_DocumentAtATime(<t1, ..., tn>, k) ≡
    m ← 0                                  // m is the total number of matching documents
    d ← min{ nextDoc(ti, −∞) : 1 ≤ i ≤ n }
    while d < ∞ do
      results[m].docid ← d
      results[m].score ← Σ_{i=1..n} log(N / N_{ti}) · TF_BM25(ti, d)
      m ← m + 1
      d ← min{ nextDoc(ti, d) : 1 ≤ i ≤ n }
    sort results[0..(m − 1)] in decreasing order of score
    return results[0..(min(k, m) − 1)]

  Fig. 3.3.1 : Document-at-a-time query processing with BM25.

• The overall time complexity of the algorithm is O(m · n + m · log(m)), where n is the number of query terms and m is the number of matching documents (containing at least one query term).
• The term m · n corresponds to the loop starting in line 3 of the algorithm. The term m · log(m) corresponds to the sorting of the search results in line 8.

3.3.2 Efficient Query Processing with Heaps

• We can use heaps to overcome the limitations of the previous algorithm. In the revised version of the algorithm, we employ two heaps : one to manage the query terms and, for each term t, keep track of the next document that contains t; the other to maintain the set of the top k search results seen so far.
• The terms heap contains the set of query terms, ordered by the next document in which the respective term appears (nextDoc). It allows us to perform an efficient multiway merge operation on the n postings lists.
• The results heap contains the top k documents encountered so far, ordered by their scores. It is important to note that the results heap's root node does not contain the best document seen so far, but the kth-best document seen so far.
• This allows us to maintain and continually update the top k search results by replacing the lowest-scoring document in the top k (and restoring the heap property) whenever we find a new document that scores better than the old one.
• The worst-case time complexity of the revised version of the document-at-a-time algorithm is O(N_q · log(n) + N_q · log(k)), where N_q is the total number of postings for all query terms.
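A compact illustration of the results-heap idea using Python's heapq. The scoring is a stand-in (plain term-frequency sums rather than BM25), and the document-at-a-time enumeration is simplified to a loop over matching documents:

    # Top-k document-at-a-time scoring with a min-heap of (score, docid).
    import heapq

    postings = {"quarrel": {1: 1, 2: 1}, "sir": {1: 1, 2: 2, 3: 1, 5: 1}}

    def rank(terms, k):
        matching = sorted(set().union(*(postings.get(t, {}) for t in terms)))
        heap = []                              # the root holds the kth-best result
        for d in matching:                     # one document at a time
            score = sum(postings.get(t, {}).get(d, 0) for t in terms)
            heapq.heappush(heap, (score, d))
            if len(heap) > k:
                heapq.heappop(heap)            # evict the lowest-scoring document
        return sorted(heap, reverse=True)      # best first

    print(rank(["quarrel", "sir"], 2))         # [(3, 2), (2, 1)]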
3.3.3 Term-at-a-Time Query Processing

• As an alternative to the document-at-a-time approach, some search engines process queries in a term-at-a-time fashion.
• Instead of merging the query terms' postings lists by using a heap, the search engine examines, in turn, all (or some) of the postings for each query term. It maintains a set of document score accumulators.
• For each posting inspected, it identifies the corresponding accumulator and updates its value according to the posting's score contribution to the respective document.
• When all query terms have been processed, the accumulators contain the final scores of all matching documents, and a heap may be used to collect the top k search results.
• One of the motivations behind the term-at-a-time approach is that the index is stored on disk and that the query terms' postings lists may be too large to be loaded into memory in their entirety.
• In that situation a document-at-a-time implementation would need to jump back and forth between the query terms' postings lists, reading a small number of postings into memory after each such jump, and incurring the cost of a nonsequential disk access (disk seek).
• For short queries, containing two or three terms, this may not be a problem, as we can keep the number of disk seeks low by allocating an appropriately sized read-ahead buffer for each postings list.
• However, for queries containing more than a dozen terms (e.g., after applying pseudo-relevance feedback), disk seeks may become a problem.
• A term-at-a-time implementation does not exhibit any nonsequential disk access pattern. The search engine processes each term's postings list in a linear fashion, moving on to term t_{i+1} when it is done with term t_i.
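The accumulator version of the same toy ranking, with one postings list processed completely before the next; scores again are simple term-frequency sums for illustration:

    # Term-at-a-time scoring with a dictionary of score accumulators.
    postings = {"quarrel": {1: 1, 2: 1}, "sir": {1: 1, 2: 2, 3: 1, 5: 1}}

    def rank_term_at_a_time(terms, k):
        acc = {}                                   # docid -> accumulated score
        for t in terms:                            # one postings list at a time
            for d, f in postings.get(t, {}).items():
                acc[d] = acc.get(d, 0) + f         # update this document's accumulator
        best = sorted(acc.items(), key=lambda kv: kv[1], reverse=True)
        return best[:k]                            # collect the top k at the end

    print(rank_term_at_a_time(["quarrel", "sir"], 2))   # [(2, 3), (1, 2)]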
3.4 VECTOR SPACE MODEL

GQ. Write a brief note on the Vector Space Model.

• The vector space model is one of the oldest and best known of the information retrieval models.
• The vector space model is intimately associated with the field as a whole and has been adapted to many IR problems beyond ranked retrieval, including document clustering and classification, in which it continues to play an important role.
• In recent years, the vector space model has been largely overshadowed by probabilistic models, language models, and machine learning approaches.
• Naturally, for a collection of even modest size, this vector space model produces vectors with millions of dimensions.
• This high dimensionality might appear inefficient at first glance, but in many circumstances the query vector is sparse, with all but a few components being zero. For example, the vector corresponding to the query "william", "shakespeare", "marriage" has only three nonzero components. To compute the length of this vector, or its dot product with a document vector, we need only consider the components corresponding to these three terms.
• On the other hand, a document vector typically has a nonzero component for each unique term contained in the document, which may consist of thousands of terms. However, the length of a document vector is independent of the query. It may be precomputed and stored in a frequency or positional index along with other document-specific information, or it may be applied to normalize the document vector in advance, with the components of the normalized vector taking the place of term frequencies in the postings lists.
• As a ranking method, the cosine similarity measure has intuitive appeal and natural simplicity. If we can appropriately represent queries and documents as vectors, cosine similarity may be used to rank the documents with respect to the queries.
• In representing a document or query as a vector, a weight must be assigned to each term that represents the value of the corresponding component of the vector. Throughout the long history of the vector space model, many formulae for assigning these weights have been proposed and evaluated. With few exceptions, these formulae may be characterized as belonging to a general family known as TF-IDF weights.
• When assigning a weight in a document vector, the TF-IDF weights are computed by taking the product of a function of term frequency (f_{t,d}) and a function of the inverse of document frequency (1/N_t).
• When assigning a weight to a query vector, the within-query term frequency (q_t) may be substituted for f_{t,d}, in essence treating the query as a tiny document. It is also possible (and not at all unusual) to use different TF and IDF functions to determine weights for document vectors and query vectors.

  TF-IDF weight = (function of term frequency) × (function of inverse document frequency)

• We emphasize that a TF-IDF weight is a product of functions of term frequency and inverse document frequency. A common error is to use the raw f_{t,d} value for the term frequency component, which may lead to poor performance.
• Over the years a number of variants for both the TF and the IDF functions have been proposed and evaluated.
• The IDF functions typically relate the document frequency to the total number of documents in the collection (N). The basic intuition behind the IDF functions is that a term appearing in many documents should be assigned a lower weight than a term appearing in few documents. Of the two functions, IDF comes closer to having a "standard form" :

  IDF = log (N / N_t)

• Two further capabilities build on these foundations. The first one, ranked retrieval, allows the search engine to rank search results according to their predicted relevance to the query. The second one, lightweight structure, is a natural extension of the Boolean model to the sub-document level. Instead of restricting the search process to entire documents, it allows the user to search for arbitrary text passages satisfying Boolean-like constraints (e.g., "show me all passages that contain 'apothecary' and 'drugs' within 10 words").
• Cosine similarity is a metric that measures the similarity between two vectors in a multi-dimensional space, such as the vectors representing documents in the VSM. In the context of the VSM, it quantifies how alike two documents are based on their vector representations.
• The key idea behind cosine similarity is to calculate the cosine of the angle between two vectors. If the vectors are very similar, their angle will be small, and the cosine value will be close to 1. Conversely, if the vectors are dissimilar, the angle will be large, and the cosine value will approach 0.

How is Cosine Similarity Calculated ?

The formula for calculating cosine similarity between two vectors A and B is as follows :

  Cosine Similarity (A, B) = (A · B) / (||A|| ||B||)

Where :
• A · B represents the dot product of vectors A and B.
• ||A|| and ||B|| represent the Euclidean norms (magnitudes) of vectors A and B, respectively.
• The cosine similarity value ranges from -1 (completely dissimilar) to 1 (completely similar); for the non-negative term weights used in IR, the value lies between 0 and 1. A higher cosine similarity score indicates greater similarity between the two vectors.
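Computed directly from the formula; the two vectors below are toy tf-idf weights for a three-term vocabulary:

    # Cosine similarity of two term-weight vectors.
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    query = [0.5, 0.0, 0.3]      # hypothetical weights for three terms
    doc   = [0.4, 0.2, 0.1]
    print(round(cosine(query, doc), 3))   # 0.861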
Cosine Similarity in a Vector Space Model

In a VSM, cosine similarity is crucial for information retrieval and document ranking. Here is how it works in practice :
• Vector Representation : We represent documents and queries as vectors using techniques like TF-IDF. Each document in the corpus and the query are converted into vectors in the same high-dimensional space.
• Cosine Similarity Calculation : To determine the relevance of a document to a query, we calculate the cosine similarity between the query vector and the vectors representing each document in the corpus.
• Ranking : Documents with higher cosine similarity scores to the query are considered more relevant and are ranked higher. Those with lower scores are ranked lower.

Cosine similarity has several advantages when applied to text data :
• Scale Invariance : Cosine similarity is scale-invariant, meaning it is not affected by the magnitude of the vectors. This makes it suitable for documents of different lengths.
• Angle Measure : It focuses on the direction of vectors rather than their absolute values, which is crucial for text similarity, where document length can vary.
• Efficiency : Calculating cosine similarity is computationally efficient, making it suitable for large-scale text datasets.

Query-document matching

• In the vector space model, there is an implicit assumption that relevance is related to the similarity of query and document vectors. In other words, documents "closer" to the query are more likely to be relevant. This is primarily a model of topical relevance, although features related to user relevance could be incorporated into the vector representation.
• Relevance feedback is a technique for query modification based on user-identified relevant documents. This technique was first introduced using the vector space model. The well-known Rocchio algorithm was based on the concept of an optimal query, which maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents.

3.5 PROBABILISTIC MODEL

• One of the features that a retrieval model should provide is a clear statement about the assumptions upon which it is based. The Boolean and vector space approaches make implicit assumptions about relevance and text representation that impact the design and effectiveness of ranking algorithms.
• The ideal situation would be to show that, given the assumptions, a ranking algorithm based on the retrieval model will achieve better effectiveness than any other approach.
• One early theoretical statement about effectiveness, known as the Probability Ranking Principle (Robertson, 1977/1997), encouraged the development of probabilistic retrieval models, which are the dominant paradigm today.
• These models have achieved this status because probability theory is a strong foundation for representing and manipulating the uncertainty that is an inherent part of the information retrieval process.

Bayesian retrieval

• In any retrieval model that assumes relevance is binary, there will be two sets of documents, the relevant documents and the non-relevant documents, for each query.
• Given a new document, the task of a search engine could be described as deciding whether the document belongs in the relevant set or the non-relevant set. That is, the system should classify the document as relevant or non-relevant, and retrieve it if it is relevant.
• Given some way of calculating the probability that the document is relevant and the probability that it is non-relevant, it would seem reasonable to classify the document into the set that has the highest probability.
• In other words, we would decide that a document D is relevant if P(R|D) > P(NR|D), where P(R|D) is a conditional probability representing the probability of relevance given the representation of that document, and P(NR|D) is the conditional probability of non-relevance.
• This is known as the Bayes Decision Rule, and a system that classifies documents this way is called a Bayes classifier.

Fig. 3.5.1 : Classifying a document as relevant or non-relevant

Relevance feedback

• It is possible to represent the topic of a query as a language model. Instead of calling this the query language model, we use the name relevance model, since it represents the topic covered by relevant documents. The query can be viewed as a very small sample of text generated from the relevance model, and relevant documents are much larger samples of text from the same model.
• Given some examples of relevant documents for a query, we could estimate the probabilities in the relevance model and then use this model to predict the relevance of new documents.
• In fact, this is a version of the classification model where we interpret P(D|R) as the probability of generating the text in a document given a relevance model. This is also called the document likelihood model.
• Although this model, unlike the binary independence model, directly incorporates term frequency, it turns out that P(D|R) is difficult to calculate and compare across documents. This is because documents contain a large and extremely variable number of words compared to a query.
• Consider two documents D_a and D_b, for example, containing 5 and 500 words respectively. Because of the large difference in the number of words involved, the comparison of P(D_a|R) and P(D_b|R) for ranking will be more difficult than comparing P(Q|D_a) and P(Q|D_b), which use the same query and smoothed representations for the documents.
• In addition, we still have the problem of obtaining examples of relevant documents. Ranking based on relevance models actually requires two passes. The first pass ranks documents using query likelihood to obtain the weights that are needed for relevance model estimation. In the second pass, we use KL-divergence to rank documents by comparing the relevance model and the document model.
• Note also that we are in effect adding words to the query by smoothing the relevance model using documents that are similar to the query. Many words that had zero probabilities in the relevance model based on query frequency estimates will now have non-zero values.
• What we are describing here is exactly the pseudo-relevance feedback process. In other words, relevance models provide a formal retrieval model for pseudo-relevance feedback and query expansion.

Chapter Ends...

CHAPTER 4 : Spelling Correction in IR Systems

Spelling Correction in IR Systems : Challenges of spelling errors in queries and documents, Edit distance and string similarity measures, Techniques for spelling correction in IR systems.

4.1 SPELLING CORRECTION

• We look at the problem of correcting spelling errors in queries. For instance, we may wish to retrieve documents containing the term carrot when the user types the query carot.
• Google reports that the following are all treated as misspellings of the query britney spears : britian spears, britney's spears, brandy spears and prittany spears.
• We look at two steps to solving this problem : the first based on edit distance and the second based on k-gram overlap.
• Before getting into the algorithmic details of these methods, we first review how search engines provide spell correction as part of a user experience.

4.2 CHALLENGES OF SPELLING ERRORS IN QUERIES AND DOCUMENTS

• Spell checking is an extremely important part of query processing. Approximately 10-15% of queries submitted to web search engines contain spelling errors, and people have come to rely on the "Did you mean: ..." feature to correct these errors.
• These errors are similar to those that may be found in a word processing document. In addition, however, there will be many queries containing words related to websites, products, companies, and people that are unlikely to be found in any standard spelling dictionary. Some examples from the same query log are :
  1. realstateisting.bc.com
  2. akia 1080i manunal
  3. ultimatwarcade
  4. mainscourcebank
  5. yan
  6. dellottitouche
• The wide variety in the type and severity of possible spelling errors in queries presents a significant challenge.
• In order to discuss which spelling correction techniques are the most effective for search engine queries, we first have to review how spelling correction is done for general text.
• The basic approach used in many spelling checkers is to suggest corrections for words that are not found in the spelling dictionary. Suggestions are found by comparing the word that was not found in the dictionary to words that are in the dictionary, using a similarity measure.
• A given spelling error may have many possible corrections. For example, the spelling error "lawers" has the following possible corrections (among others) at edit distance 1 : lowers, lawyers, layers, lasers, lagers.
• The spelling corrector has to decide whether to present all of these to the user, and in what order to present them.
• The noisy channel model for spelling correction is a general framework that can address the issues of ranking, context, and run-on errors. The model is called a "noisy channel" because it is based on Shannon's theory of communication. The intuition is that a person chooses a word w to output (i.e., write), based on a probability distribution P(w). The person then tries to write the word w, but the noisy channel (presumably the person's brain) causes the person to write the word e instead, with probability P(e|w).
• The probabilities P(w), called the language model, capture information about the frequency of occurrence of a word in text (e.g., what is the probability of the word "lawyer" occurring in a document or query?), and contextual information such as the probability of observing a word given that another word has just been observed (e.g., what is the probability of "lawyer" following the word "trial"?).
• The probabilities P(e|w), called the error model, represent information about the frequency of different types of spelling errors. The probabilities for words (or strings) that are edit distance 1 away from the word w will be quite high, for example. Words with higher edit distances will generally have lower probabilities, although homophones will have high probabilities.
• Note that the error model will have probabilities for writing the correct word (P(w|w)) as well as probabilities for spelling errors. This enables the spelling corrector to suggest a correction for all words, even if the original word was correctly spelled. If the highest-probability correction is the same word, then no correction is suggested to the user.
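A sketch of how the two distributions combine to rank corrections. Both tables here are toy values, since real systems estimate P(w) from document or query-log frequencies and P(e|w) from observed error pairs:

    # Noisy channel spelling correction: rank candidates w by P(w) * P(e|w).
    language_model = {"lowers": 0.0010, "lawyers": 0.0090, "layers": 0.0020,
                      "lasers": 0.0012, "lagers": 0.0005}   # toy P(w)
    error_model = {w: 0.01 for w in language_model}         # toy P(e|w), uniform

    def correct(error_word, candidates):
        return max(candidates, key=lambda w: language_model[w] * error_model[w])

    print(correct("lawers", list(language_model)))   # lawyers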
4.3 EDIT DISTANCE AND STRING SIMILARITY MEASURES
• Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2. Most commonly, the edit operations allowed for this purpose are:
(i) insert a character into a string,
(ii) delete a character from a string,
(iii) replace a character of a string by another character.
• For these operations, edit distance is sometimes known as Levenshtein distance. For example, the edit distance between cat and dog is 3.
• In fact, the notion of edit distance can be generalized to allow different weights for different kinds of edit operations; for instance, a higher weight may be placed on replacing the character s by the character p than on replacing it by the character a (the latter being closer to s on the keyboard). Setting weights in this way, depending on the likelihood of letters substituting for each other, is very effective in practice.
• However, the remainder of our treatment here will focus on the case in which all edit operations have the same weight.

Example 1
• Input : str1 = "cat", str2 = "cut"
• Output : 1
• Explanation : We can convert str1 into str2 by replacing 'a' with 'u'.

Example 2
• Input : str1 = "sunday", str2 = "saturday"
• Output : 3
• Explanation : The last three characters and the first character are the same; we basically need to convert "un" into "atur". This can be done using three operations: replace 'n' with 'r', insert t, insert a.
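Edit distance is normally computed by dynamic programming. The following is a minimal Python implementation of the standard unit-cost Levenshtein recurrence (a textbook formulation supplied here for illustration):

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                # delete all i characters of s1
    for j in range(n + 1):
        dp[0][j] = j                # insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # replacement (or match)
    return dp[m][n]

print(edit_distance("cat", "cut"))          # 1
print(edit_distance("sunday", "saturday"))  # 3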
4.4 TECHNIQUES FOR SPELLING CORRECTION IN IR SYSTEMS
GQ. Discuss any two techniques for spelling correction in detail.

4.4.1 k-gram Indexes for Spelling Correction
• To further limit the set of vocabulary terms for which we compute edit distances to the query term, we now show how to invoke the k-gram index to assist with retrieving vocabulary terms with low edit distance to the query q.
• Once we retrieve such terms, we can then find the ones of least edit distance from q.
• In fact, we will use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query. We will argue that, for reasonable definitions of "many k-grams in common," the retrieval process is essentially that of a single scan through the postings for the k-grams in the query string q.

Fig. 4.4.1 : Matching at least two of the three 2-grams in the query bord (the postings shown include the terms aboard, about, boardroom and border)

K-Grams
• k-grams are k-length subsequences of a string. Here, k can be 1, 2, 3 and so on. For k = 1, each resulting subsequence is called a "unigram"; for k = 2, a "bigram"; and for k = 3, a "trigram". These are the most widely used k-grams for spelling correction, but the value of k really depends on the situation and context.
• As an example, consider the string "catastrophic". In this case,
o Unigrams : c, a, t, a, s, t, r, o, p, h, i, c
o Bigrams : ca, at, ta, as, st, tr, ro, op, ph, hi, ic
o Trigrams : cat, ata, tas, ast, str, tro, rop, oph, phi, hic
• The 2-gram (or bigram) index shown in the figure above contains (a portion of) the postings for the three bigrams in the query bord.
• Suppose we wanted to retrieve vocabulary terms that contained at least two of these three bigrams. A single scan of the postings would let us enumerate all such terms; in the example of the figure given above, we would enumerate aboard, boardroom and border.
• The steps involved in spelling correction are:
o Find the k-grams of the misspelled word.
o For each k-gram, linearly scan through the postings list in the k-gram index.
o Find the k-gram overlaps after having linearly scanned the lists (no extra time complexity, because we are finding the Jaccard coefficient).
o Return the terms with the maximum k-gram overlaps.
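A minimal Python sketch of these steps (an added illustration; the tiny vocabulary is invented) builds a bigram index over the dictionary and ranks candidate terms by their Jaccard coefficient with the bigrams of the misspelled query:

from collections import defaultdict

def kgrams(term, k=2):
    return {term[i:i + k] for i in range(len(term) - k + 1)}

vocabulary = ["aboard", "about", "boardroom", "border"]

# Inverted k-gram index: bigram -> set of vocabulary terms containing it.
index = defaultdict(set)
for term in vocabulary:
    for gram in kgrams(term):
        index[gram].add(term)

def correct(query, k=2):
    query_grams = kgrams(query, k)
    # Candidates: terms sharing at least one k-gram with the query.
    candidates = set().union(*(index[g] for g in query_grams if g in index))
    def jaccard(term):
        grams = kgrams(term, k)
        return len(grams & query_grams) / len(grams | query_grams)
    return sorted(candidates, key=jaccard, reverse=True)

print(correct("bord"))  # ['border', 'aboard', 'boardroom', 'about']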
4.4.2 Context-Sensitive Spelling Correction
• Isolated-term correction would fail to correct typographical errors such as flew form Heathrow, where all three query terms are correctly spelled.
• When a phrase such as this retrieves few documents, a search engine may like to offer the corrected query flew from Heathrow.
• The simplest way to do this is to enumerate corrections of each of the three query terms, even though each query term is correctly spelled, and then try substitutions of each correction in the phrase.
• For the example flew form Heathrow, we enumerate such phrases as fled form Heathrow and flew fore Heathrow.
• For each such substitute phrase, the search engine runs the query and determines the number of matching results.
• This enumeration can be expensive if we find many corrections of the individual terms, since we could encounter a large number of combinations of alternatives. Several heuristics are used to trim this space.
• In the example above, as we expand the alternatives for flew and form, we retain only the most frequent combinations in the collection or in the query logs, which contain previous queries by users.

4.4.3 Phonetic Correction
• Our final technique for tolerant retrieval has to do with phonetic correction: misspellings that arise because the user types a query that sounds like the target term. Such algorithms are especially applicable to searches on the names of people.
• The main idea here is to generate, for each term, a "phonetic hash", so that similar-sounding terms hash to the same value.
• The idea owes its origins to work in international police departments from the early 20th century, seeking to match names of wanted criminals despite the names being spelled differently in different countries.
• It is mainly used to correct phonetic misspellings in proper nouns. Algorithms for such phonetic hashing are commonly and collectively known as soundex algorithms.
• However, there is an original soundex algorithm, with various variants, built on the following scheme:
o Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.
o Do the same with query terms.
o When the query calls for a soundex match, search this soundex index.
• The variations in different soundex algorithms have to do with the conversion of terms to 4-character forms. A commonly used conversion results in a 4-character code, with the first character being a letter of the alphabet and the other three being digits between 0 and 9.
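A minimal Python sketch of one common simplified variant of this conversion (added for illustration; real soundex implementations differ in details such as the treatment of h and w):

def soundex(term):
    # Digit codes for consonant groups; vowels and h, w, y get no code.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    term = term.lower()
    digits = [codes.get(ch, "") for ch in term]
    collapsed = []
    for d in digits:
        # Keep a digit only if it does not repeat the previous one; an
        # uncoded letter (empty string) breaks a run of identical digits.
        if not d or not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    # Keep the first letter itself, then the remaining digits, padded to 4.
    result = term[0].upper() + "".join(d for d in collapsed[1:] if d)
    return (result + "000")[:4]

print(soundex("herman"))   # H655
print(soundex("hermann"))  # H655 - same code for the similar-sounding name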
Chapter Ends...

CHAPTER 5 : Performance Evaluation

Performance Evaluation : Evaluation metrics: precision, recall, F-measure, average precision, Test collections and relevance judgments, Experimental design and significance testing.

5.1 EVALUATION METRICS

5.1.1 Recall and Precision
GQ. What are the various performance evaluation metrics?
GQ. Explain recall and precision as evaluation metrics.
• The two most common effectiveness measures, recall and precision, were introduced in the Cranfield studies to summarize and compare search results.
• Intuitively, recall measures how well the search engine is doing at finding all the relevant documents for a query, and precision measures how well it is doing at rejecting non-relevant documents.
• The definition of these measures assumes that, for a given query, there is a set of documents that is retrieved and a set that is not retrieved (the rest of the documents). This obviously applies to the results of a Boolean search, but the same definition can also be used with a ranked search.
• Recall (R) is the proportion (fraction) of relevant documents that are retrieved:
Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)
i.e., Recall = (number of relevant documents retrieved) / (total number of relevant documents)
• Precision (P) is the proportion (fraction) of retrieved documents that are relevant:
Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
i.e., Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
• There is an implicit assumption in using these measures that the task involves retrieving as many of the relevant documents as possible and minimizing the number of non-relevant documents retrieved. In other words, even if there are 500 relevant documents for a query, the user is interested in finding them all.
• In a nutshell, recall indicates the fraction of relevant documents that appear in the result set, whereas precision indicates the fraction of the result set that is relevant.
• Precision is also known as the positive predictive value, and it is often used in medical diagnostic tests, where the probability that a positive test is correct is particularly important.

5.1.2 F measure
• The F measure is an effectiveness measure based on recall and precision that is used for evaluating classification performance and also in some search applications. It has the advantage of summarizing effectiveness in a single number.
• A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall. With equal weights on precision and recall, this gives the balanced F measure:
F1 = 2 · P · R / (P + R)
• The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by values that are unusually large (outliers).
• A search result that returned nearly the entire document collection, for example, would have a recall of 1.0 and a precision near 0. The arithmetic mean of these values is 0.5, but the harmonic mean will be close to 0. The harmonic mean is clearly a better summary of the effectiveness of this retrieved set.

5.1.3 Average Precision
• For a single information need, Average Precision (AP) is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved; this value is then averaged over information needs.
• Average precision has a number of advantages. It is a single number that is based on the ranking of all the relevant documents, but the value depends heavily on the highly ranked relevant documents.
• Average precision can also be calculated as the weighted mean of precisions at each threshold, where the weight is the increase in recall from the prior threshold.
• Mean Average Precision (mAP) is the average of the AP values over a number of classes (or queries); it is calculated by finding the Average Precision for each class and then averaging over the number of classes. However, the interpretation of AP and mAP varies in different contexts.
• The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN). This property makes mAP a suitable metric for most detection applications.
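A minimal Python sketch computing these metrics for a single query (added for illustration; the ranked result list and the relevance judgments are invented):

def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def average_precision(ranking, relevant):
    # Average of the precision values at the rank of each relevant document.
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranking = ["d3", "d7", "d1", "d9", "d4"]   # ranked result list for one query
relevant = ["d3", "d9", "d5"]              # relevance judgments
print(precision_recall_f1(ranking, relevant))  # (0.4, 0.666..., 0.5)
print(average_precision(ranking, relevant))    # (1/1 + 2/4) / 3 = 0.5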
5.2 TEST COLLECTIONS AND RELEVANCE JUDGMENTS
GQ. Explain the significance of test collections in performance evaluation.
• A central goal of TREC is to create test collections that may be re-used for later experiments. For instance, if a new IR technique or ranking formula is proposed, its inventor may use an established test collection to compare it against standard methods.
• Reusable test collections may also be employed to tune retrieval formulae, adjusting parameters to optimize performance.
• If a test collection is to be reusable, it is traditionally assumed that the judgments should be as exhaustive as possible. Ideally, all relevant documents would be located. Thus, many evaluation experiments actively encourage manual runs (involving human intervention) in order to increase the number of known relevant documents.
• Here is a list of the most standard test collections and evaluation series. We focus particularly on test collections for ad hoc information retrieval system evaluation, but we also mention a couple of similar test collections for text classification.
• The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness, but it is nowadays too small for anything but the most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.
• Text REtrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and are specified in detailed text passages.
• Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents.
• TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are the most consistent. Because the test document collections are so large, there are no exhaustive relevance judgments.
• GOV2. In more recent years, NIST has done evaluations on larger document collections, including the 25-million-page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously, and GOV2 is now the largest web collection easily available for research purposes.
• NII Test Collections for IR Systems (NTCIR). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian languages and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages.
• Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language information retrieval.

Relevance judgments
• A set of relevance judgments is, standardly, a binary assessment of either relevant or nonrelevant for each query-document pair.
• The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance.
• Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session.
• Clickthrough and other log data are strongly correlated with relevance, so they can be used to evaluate search; but search engine companies still use relevance judgments in addition to log data to ensure the validity of their ground truth results.

Chapter Ends...

CHAPTER 6 : Text Categorization and Filtering

Text Categorization and Filtering : Text classification algorithms: Naive Bayes, Support Vector Machines, Feature selection and dimensionality reduction, Applications of text categorization and filtering.

6.1 TEXT CLASSIFICATION/CATEGORIZATION ALGORITHMS
GQ. Explain text categorization.
• Text categorization, also termed text classification, is the task of automatically sorting a set of documents into categories (classes) from a predefined set. We consider classification and categorization to be the same process.
• A related problem is to partition documents into subsets with no labels. Since each subset has no label, it is not a class; instead, each subset is called a cluster, and the partitioning process is called clustering. We consider clustering a simpler variant of text classification.
Example
• We can classify emails into spam or non-spam, news articles into different categories like politics, stock market, sports, etc.; academic papers are often classified by technical domains and sub-domains.

The Text Classification Problem
• A classifier can be formally defined as follows. Given
o D : a collection of documents, and
o C = {c1, c2, ..., cL} : a set of L classes with their respective labels,
a text classifier is a binary function F : D × C → {0, 1}, which assigns to each pair (dj, cp), dj ∈ D and cp ∈ C, a value of
o 1, if dj is a member of class cp,
o 0, if dj is not a member of class cp.
• This is a broad definition that admits both supervised and unsupervised algorithms. For high accuracy, use a supervised algorithm.
o multi-label : one or more labels are assigned to each document
o single-label : a single class is assigned to each document
• The classification function F, defined as a binary function of the document-class pair [dj, cp], can be modified to compute a degree of membership of dj in cp. Documents are then candidates for membership in class cp, and the candidates are sorted by decreasing values of F(dj, cp).

Text Classification Algorithms
GQ. Discuss text classification algorithms.
• Text categorization is an effective activity that can be accomplished using a variety of classification algorithms. Text classification algorithms are categorized into two groups:
o Supervised algorithms
o Unsupervised algorithms

Supervised Algorithms
• These depend on a training set, which is used to learn a classification function. The larger the number of training examples, the better is the fine-tuning of the classifier.
• Overfitting : the classifier becomes too specific to the training examples.
• To evaluate the classifier, use a set of unseen objects, commonly referred to as the test set.

Unsupervised Algorithms : Clustering
• Input data : a set of documents to classify; not even class labels are provided.
• Task of the classifier : separate the documents into subsets (clusters) automatically; the separating procedure is called clustering.

6.1.1 Naive Bayes
GQ. Explain the Naive Bayes algorithm.
GQ. Explain Bayes' theorem.
• The Naive Bayes algorithm is a supervised learning algorithm which is based on Bayes' theorem and is used for solving classification problems.
• It is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
• It is mainly used in text classification involving a high-dimensional training dataset.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular examples of the Naive Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
• The Naive Bayes algorithm is comprised of two words, Naive and Bayes, which can be described as follows.
• Naive : It is called naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes : It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes' Theorem
• Bayes' theorem is also known as Bayes' rule or Bayes' law, and it is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
• P(A|B) is the posterior probability : the probability of hypothesis A given the observed event B.
• P(B|A) is the likelihood probability : the probability of the evidence given that the hypothesis is true.
• P(A) is the prior probability : the probability of the hypothesis before observing the evidence.
• P(B) is the marginal probability : the probability of the evidence.

Example : Consider the given dataset. Apply the Naive Bayes algorithm and predict the type of a fruit that has the following properties: Fruit = {Yellow, Sweet, Long}.

Frequency table (samples for classification with the Naive Bayes theorem; reconstructed from the calculations below):

Fruit  | Yellow | Sweet | Long | Total
Mango  |  350   |  450  |   0  |  650
Banana |  400   |  300  | 350  |  400
Others |   50   |  100  |  50  |  150
Total  |  800   |  850  | 400  | 1200

1. Mango : P(X | Mango) = P(Yellow | Mango) · P(Sweet | Mango) · P(Long | Mango)
(a) P(Yellow | Mango) = P(Mango | Yellow) · P(Yellow) / P(Mango) = ((350/800) · (800/1200)) / (650/1200) = 350/650 ≈ 0.53 ...(1)
(b) P(Sweet | Mango) = P(Mango | Sweet) · P(Sweet) / P(Mango) = ((450/850) · (850/1200)) / (650/1200) = 450/650 ≈ 0.69 ...(2)
(c) P(Long | Mango) = P(Mango | Long) · P(Long) / P(Mango) = ((0/400) · (400/1200)) / (650/1200) = 0 ...(3)
Multiplying (1), (2) and (3): P(X | Mango) = 0.53 · 0.69 · 0 = 0

2. Banana : P(X | Banana) = P(Yellow | Banana) · P(Sweet | Banana) · P(Long | Banana)
(a) P(Yellow | Banana) = P(Banana | Yellow) · P(Yellow) / P(Banana) = ((400/800) · (800/1200)) / (400/1200) = 400/400 = 1 ...(4)
(b) P(Sweet | Banana) = P(Banana | Sweet) · P(Sweet) / P(Banana) = ((300/850) · (850/1200)) / (400/1200) = 300/400 = 0.75 ...(5)
(c) P(Long | Banana) = P(Banana | Long) · P(Long) / P(Banana) = ((350/400) · (400/1200)) / (400/1200) = 350/400 = 0.875 ...(6)
Multiplying (4), (5) and (6): P(X | Banana) = 1 · 0.75 · 0.875 = 0.6562

3. Others : P(X | Others) = P(Yellow | Others) · P(Sweet | Others) · P(Long | Others)
(a) P(Yellow | Others) = P(Others | Yellow) · P(Yellow) / P(Others) = ((50/800) · (800/1200)) / (150/1200) = 50/150 ≈ 0.34 ...(7)
(b) P(Sweet | Others) = P(Others | Sweet) · P(Sweet) / P(Others) = ((100/850) · (850/1200)) / (150/1200) = 100/150 ≈ 0.67 ...(8)
(c) P(Long | Others) = P(Others | Long) · P(Long) / P(Others) = ((50/400) · (400/1200)) / (150/1200) = 50/150 ≈ 0.34 ...(9)
Multiplying (7), (8) and (9): P(X | Others) = 0.34 · 0.67 · 0.34 ≈ 0.0774

So, finally, from P(X | Mango) = 0, P(X | Banana) = 0.6562 and P(X | Others) = 0.0774, we can conclude that the fruit {Yellow, Sweet, Long} is a Banana.
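The same computation can be scripted directly. Here is a minimal Python sketch of the worked example above (added for illustration; it follows the hand calculation in comparing class-conditional likelihoods only, with priors omitted exactly as in the example):

# Counts from the frequency table above.
counts = {
    "Mango":  {"Yellow": 350, "Sweet": 450, "Long": 0,   "Total": 650},
    "Banana": {"Yellow": 400, "Sweet": 300, "Long": 350, "Total": 400},
    "Others": {"Yellow": 50,  "Sweet": 100, "Long": 50,  "Total": 150},
}

def likelihood(features, cls):
    # P(X | class) = product over the features of P(feature | class)
    score = 1.0
    for f in features:
        score *= counts[cls][f] / counts[cls]["Total"]
    return score

features = ["Yellow", "Sweet", "Long"]
scores = {cls: likelihood(features, cls) for cls in counts}
print(scores)                       # Banana: 0.65625, Mango: 0.0, Others: ~0.074
print(max(scores, key=scores.get))  # Banana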
Training Algorithm
• Let V be the vocabulary of all words in the document collection D.
For each category Ci ∈ C:
    Let Di be the subset of documents in category Ci.
    P(Ci) = |Di| / |D|
    Let Ti be the concatenation of all documents in Di.
    Let ni be the total number of word occurrences in Ti.
    For each word Wj ∈ V:
        Let nij be the number of occurrences of Wj in Ti.
        Let P(Wj | Ci) = (nij + 1) / (ni + |V|)
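A minimal Python implementation of this training procedure (added for illustration; it uses the Laplace-smoothed estimate given above, and the two-document training set is invented):

from collections import Counter

def train_naive_bayes(labelled_docs):
    # labelled_docs: list of (list_of_words, class_label) pairs
    vocabulary = {w for words, _ in labelled_docs for w in words}
    classes = {c for _, c in labelled_docs}
    priors, cond = {}, {}
    for c in classes:
        docs_c = [words for words, label in labelled_docs if label == c]
        priors[c] = len(docs_c) / len(labelled_docs)        # P(Ci) = |Di|/|D|
        text_c = [w for words in docs_c for w in words]     # concatenation Ti
        n_c = len(text_c)                                   # ni
        freq = Counter(text_c)
        # Laplace-smoothed estimate P(Wj|Ci) = (nij + 1) / (ni + |V|)
        cond[c] = {w: (freq[w] + 1) / (n_c + len(vocabulary)) for w in vocabulary}
    return priors, cond

docs = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda"], "ham")]
priors, cond = train_naive_bayes(docs)
print(priors["spam"], cond["spam"]["cheap"])  # 0.5 and (1+1)/(3+5) = 0.25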
6.1.2 Support Vector Machine (SVM)
• Unlike the Naive Bayes classifier, which is based purely on probabilistic principles, the next classifier we describe is based on geometric principles.
• Support Vector Machines, often called SVMs, treat inputs such as documents as points in some geometric space.
• For simplicity, we first describe how SVMs are applied to classification problems with binary class labels, which we will refer to as the "positive" and "negative" classes.
• In this setting, the goal of SVMs is to find a hyperplane that separates the positive examples from the negative examples.
• The Support Vector Machine is a very popular model. SVM applies a geometric interpretation of the data. By default, it is a binary classifier. It maps the data points in space so as to maximize the distance between the two categories.
• For SVM, data points are N-dimensional vectors, and the method looks for an (N-1)-dimensional hyperplane to separate them. This is called a linear classifier.
• Many hyperplanes could satisfy this condition. Thus, the best hyperplane is the one that gives the largest margin, or distance, between the two categories; it is called the maximum margin hyperplane.
• In the accompanying figure, we can see a set of points corresponding to two categories, blue and green. The red line indicates the maximum margin hyperplane that separates both groups of points. The points on the dashed lines are called the support vectors.
• It frequently happens that the sets are not linearly separable in the original space. In that case, the original space is mapped into a higher-dimensional space where the separation can be obtained. SVMs can efficiently perform a non-linear classification using the so-called kernel trick.
• The kernel trick consists of using specific kernel functions, which simplify the mapping between the original space and a higher-dimensional space.
• Naive Bayes comes under the class of generative models for classification. It models the posterior probability from the class-conditional densities, so its output is a probability of belonging to a class.
• SVM, on the other hand, is based on a discriminant function given by y = w · x + b. Here the weights w and the bias parameter b are estimated from the training data. It tries to find a hyperplane that maximizes the margin, and there is an optimization function in this regard.
• Performance-wise, SVMs using the radial basis function (RBF) kernel are more likely to perform better, as they can handle non-linearities in the data.
• Naive Bayes performs best when the features are independent of each other, which often does not happen in reality. Having said that, it still performs well even when the features are not independent.
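A minimal illustration using scikit-learn (an assumed third-party dependency, used here only for demonstration; the 2-D points are invented): a linear SVM fitted on a toy dataset, exposing the learned w and b of the discriminant function y = w · x + b:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D points: the "positive" class sits apart from the "negative" class.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 6], [7, 8], [8, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # linear kernel: maximum margin hyperplane
clf.fit(X, y)

print(clf.coef_, clf.intercept_)      # weights w and bias b of y = w.x + b
print(clf.support_vectors_)           # the points that define the margin
print(clf.predict([[2, 2], [7, 7]]))  # -> [0 1]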
6.2 FEATURE SELECTION
• Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
• Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features.
• A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set. Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting.

SELECTFEATURES(D, c, k)
1  V ← EXTRACTVOCABULARY(D)
2  L ← []
3  for each t ∈ V
4  do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
5     APPEND(L, ⟨A(t, c), t⟩)
6  return FEATURESWITHLARGESTVALUES(L, k)

Fig. 6.2.1 : Basic feature selection algorithm for selecting the k best features
• We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features).
• The basic feature selection algorithm is shown in the figure above. For a given class c, we compute a utility measure A(t, c) for each term of the vocabulary and select the k terms that have the highest values of A(t, c). All other terms are discarded and not used in classification. We will introduce three different utility measures in this section: mutual information, A(t, c) = I(Ut; Cc); the χ² test, A(t, c) = X²(t, c); and frequency, A(t, c) = N(t, c).
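A minimal Python version of SELECTFEATURES (added for illustration; it uses the simplest of the three utility measures, frequency, here read as the number of class-c training documents containing t; the tiny corpus is invented):

def select_features(docs, c, k):
    # docs: list of (set_of_terms, label) pairs; returns k best terms for c.
    vocabulary = set().union(*(terms for terms, _ in docs))
    def utility(t):
        # Frequency utility A(t, c): count of class-c documents containing t.
        return sum(1 for terms, label in docs if label == c and t in terms)
    scored = sorted(vocabulary, key=utility, reverse=True)
    return scored[:k]

docs = [({"export", "trade", "china"}, "China"),
        ({"trade", "deficit", "china"}, "China"),
        ({"football", "league"}, "Sports")]
print(select_features(docs, "China", 2))  # e.g. ['china', 'trade']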
6.3 DIMENSIONALITY REDUCTION
• Dimensionality reduction refers to techniques that transform a high-dimensional dataset into a lower-dimensional representation while preserving its essential structure and characteristics.
• The aim is to reduce computational complexity, improve visualization, and eliminate redundant or noisy features.

Advantages of Dimensionality Reduction
Dimensionality reduction offers several advantages:
1. Improved computational efficiency : Reducing the number of dimensions simplifies the data representation and accelerates the training and inference process.
2. Enhanced visualization : By reducing the dataset to two or three dimensions, we can visualize and explore the data more effectively.
3. Noise and outlier removal : Dimensionality reduction techniques can help filter out noisy features or outliers that may negatively impact the model's performance.

GQ. Differentiate between feature selection and dimensionality reduction.
While both feature selection and dimensionality reduction aim to reduce the number of features, they differ in their approach:
• Feature selection : Selects a subset of relevant features while keeping the original feature space intact. The focus is on identifying the most informative features for modelling.
• Dimensionality reduction : Projects the data onto a lower-dimensional space by transforming the feature space. The objective is to create a compressed representation that captures the essence of the original data.
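One standard way to perform such a projection on text data is a truncated singular value decomposition (SVD) of the term-document matrix, the basis of latent semantic analysis. A minimal numpy sketch (added for illustration; the 4-term × 5-document matrix is invented):

import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
A = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 1, 0],
              [0, 0, 3, 1, 2],
              [0, 1, 2, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                               # target number of dimensions
docs_k = np.diag(s[:k]) @ Vt[:k]    # each document as a k-dimensional vector

print(docs_k.shape)                 # (2, 5): 5 documents, now 2 dimensions each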
6.4 APPLICATIONS OF TEXT CATEGORIZATION AND FILTERING
GQ. Explain in brief the applications of text categorization and filtering.
GQ. Discuss the various applications of text categorization.

Text categorization
• Text categorization is a machine learning technique that assigns a set of predefined categories to open-ended text.
• Text classifiers can be used to organize, structure, and categorize pretty much any kind of text - from documents, medical studies and files, to text all over the web.
• For example, news articles can be organized by topics; support tickets can be organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.
• The most common applications are spam detection, sentiment classification, and online advertisement classification.

1. Spam detection
o Classification techniques can be used to help detect and eliminate various types of spam. Spam is broadly defined to be any content that is generated for malevolent purposes, such as unsolicited advertisements, deceptively increasing the ranking of a web page, or spreading a virus.
o One important characteristic of spam is that it tends to have little, if any, useful content. This definition of spam is very subjective, because what may be useful to one person may not be useful to another. For this reason, it is often difficult to come up with an objective definition of spam.
o There are many types of spam, including email spam, advertisement spam, blog spam, and web page spam. Spammers use different techniques for different types of spam. Therefore, there is no single spam classification technique that works for all types of spam. Instead, very specialized spam classifiers are built for the different types of spam, each taking into account domain-specific information.
o Much has been written about email spam, and filtering programs such as SpamAssassin are in common use. SpamAssassin computes a score for the email that is compared to a threshold (default value 5.0) to determine whether it is spam. The score is based on a combination of features, one of the most important of which is the output of a Bayes classifier.

2. Sentiment analysis
o Perhaps the most popular example of text classification is sentiment analysis (or opinion mining): the automated process of reading a text for opinion polarity (positive, negative, neutral, and beyond).
o Companies use sentiment classifiers in a wide range of applications, like product analytics, brand monitoring, market research, customer support, workforce analytics, and much more.
o Sentiment analysis allows you to automatically analyze all forms of text for the feeling and emotion of the writer.
o For example, if the user is interested in purchasing a product, then links to online shopping sites can be provided to help the user complete the purchase. It may also be possible that the user already owns the product and is searching for accessories or enhancements. The search engine could then derive revenue from the query by displaying advertisements for related accessories and services.

3. Classifying advertisements
o Sponsored search and content match are two different advertising models widely used by commercial search engines. The former matches advertisements to queries, whereas the latter matches advertisements to web pages. Both sponsored search and content match use a pay-per-click pricing model, which means that advertisers must pay the search engine only if a user clicks on the advertisement.
o A user may click on an advertisement for a number of reasons. Clearly, if the advertisement is "topically relevant," then the user may click on it. However, this is not the only reason why a user may click.

Other applications
o Customers often use social media to express their opinions about and experiences of products or services. Text classification is often used to identify the tweets that brands must respond to.
o Text classification is also used in language identification, such as identifying the language of new tweets or posts. For example, Google Translate has an automatic language identification feature.
o Authorship attribution, or identifying the unknown authors of texts from a pool of authors, is another popular use case of text classification, and it is used in a range of fields from forensic analysis to literary studies. Text classification has also been used to segregate fake news from real news.
o Language detection is another great example of text classification, that is, the process of classifying incoming text according to its language. These text classifiers are often used for routing purposes (e.g., routing support tickets according to their language to the appropriate team).

Text Filtering
• Filtering is the process of evaluating documents on an ongoing basis according to some standing information need. Generally, the outcome of filtering is to deliver the document to zero or more destinations, depending on the information need.
• It removes redundant or unwanted information from an information stream using automated or computerized methods.
• A filtering system consists of several tools that help people find the most valuable information, so that the limited time one can dedicate to reading, listening or viewing is spent on correctly directed and valuable documents.
• It reduces or eliminates harmful information.

Types of Text Filtering
There are three types of text filtering:
1. Content-based filtering
2. Collaborative filtering
3. Hybrid filtering

1. Content-based filtering
• Objects to be filtered : generally texts; the filter engine is based on content analysis. These filtering methods are based on the description of an item and a profile of the user's preferred choices.
• In a content-based recommendation system, keywords are used to describe the items; in addition, a user profile is built to state the type of item this user likes.
• The algorithms try to recommend products which are similar to the ones that the user has liked in the past. The idea of content-based filtering is that if you like an item, you will also like a 'similar' item.

2. Collaborative filtering
• Objects to be filtered : products/goods; the filter engine is based on usage analysis.
• This filtering method is usually based on collecting and analyzing information on users' behaviors, their activities or preferences, and predicting what they will like based on their similarity to other users (a minimal sketch follows below).
• A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, and thus it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself.

3. Hybrid filtering
• A combination of the two previous approaches.
• Recent research shows that combining collaborative and content-based recommendation can be more effective.
• Hybrid approaches can be implemented by making content-based and collaborative-based predictions separately and then combining them; by adding content-based capabilities to a collaborative-based approach (and vice versa); or by unifying the approaches into one model.

Applications of text filtering
1. If the user is trying to search for a particular book, the search engine will recommend similar titles based on the user's past likes. This technology is used by major companies like Netflix and Pandora in their search engines. Such systems are mostly used with text documents.
2. If a person wants to watch a movie, he or she might ask other users or friends for their opinions about the particular movie. Because different people have different opinions, in this case only people with similar interests are consulted. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).
3. Even job searching uses a hybrid filtering system, which is the combination of the content-based and collaborative filtering approaches. The main motto is to make job search easy for users. The recommendation depends on the user's past experience, and it makes it easy for users to get recommendations of various job profiles on the basis of their past experiences, projects, internships, skills, etc.
4. Searching for friends online on Facebook (whom to be friends with) is also part of collaborative filtering.
5. Song listings based on previous history or choice in Spotify are another example of collaborative filtering.
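A minimal sketch of the user-based collaborative filtering idea described above (added for illustration; the user-item ratings are invented, with cosine similarity over co-rated items and a prediction computed as a similarity-weighted average):

import math

ratings = {
    "alice": {"MovieA": 5, "MovieB": 3, "MovieC": 4},
    "bob":   {"MovieA": 5, "MovieB": 3, "MovieD": 2},
    "carol": {"MovieA": 1, "MovieB": 5, "MovieC": 2},
}

def cosine(u, v):
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(r * r for r in ratings[u].values()))
    nv = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def predict(user, item):
    # Weighted average of other users' ratings for the item,
    # weighted by their similarity to this user.
    pairs = [(cosine(user, v), ratings[v][item])
             for v in ratings if v != user and item in ratings[v]]
    total = sum(sim for sim, _ in pairs)
    return sum(sim * r for sim, r in pairs) / total if total else 0.0

print(predict("alice", "MovieD"))  # only bob rated MovieD, so this equals 2.0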
GQ. Differentiate between information filtering and information retrieval.

Information Filtering | Information Retrieval
Information filtering is about processing a stream of information to match a static set of likes, tastes and preferences. | Information retrieval is about fulfilling immediate queries from an available library of information.
Example : a clipper service which reads all the news articles published today and serves you the content that is relevant to you, based on your likes and interests. | Example : you have a deal store containing 100 deals and a query comes from a user; you show the deals that are relevant to that query.
Information filtering is concerned with repeated uses of the system, by a person or persons with long-term goals or interests. | IR is typically concerned with single uses of the system, by a person with a one-time goal and a one-time query.
Filtering assumes that profiles can be correct specifications of information interests. | IR recognizes inherent problems in the adequacy of queries as representations of information needs.
Filtering is mainly concerned with the distribution of texts to groups or individuals. | IR is concerned with the collection and organization of texts.
Filtering is concerned with long-term changes over a series of information-seeking episodes. | IR is concerned with responding to the user's interaction with texts within a single information-seeking episode.
Models : probabilistic model. | Models : Boolean IR model, vector space IR model, probabilistic IR model, language model.

Classification vs. Clustering

Classification | Clustering
Classification is a supervised learning approach, where a specific label is provided to the machine to classify new observations. Here the machine needs proper testing and mining for label verification. | Clustering is an unsupervised learning approach, where grouping is done on a similarity basis.
Supervised learning approach. | Unsupervised learning approach.
It uses a training dataset. | It does not use a training dataset.
It uses algorithms to categorize new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with the same features.
In classification, there are labels for the training data. | In clustering, there are no labels for the training data.
Its objective is to find which class a new object belongs to, from a set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.
It is more complex as compared to clustering. | It is less complex as compared to classification.

Chapter Ends...

CHAPTER 7 : Text Clustering for Information Retrieval

Text Clustering for Information Retrieval : Clustering techniques: K-means, hierarchical clustering, Evaluation of clustering results, Clustering for query expansion and result grouping.

7.1 CLUSTERING TECHNIQUES
• Clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar; documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning.
• After clustering, each cluster is assigned a number called a cluster ID. We look at two of the most popular clustering algorithms in detail: K-means and hierarchical clustering.

Types of Clustering Methods
• The clustering methods are broadly divided into hard clustering (a data point belongs to only one group) and soft clustering (a data point can also belong to another group).
• But there are also various other approaches to clustering. The main clustering methods are given below.

1. Partitioning Clustering
• It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-means clustering algorithm (a minimal sketch follows below).
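Since K-means is the canonical partitioning method, here is a minimal pure-Python sketch (added for illustration; toy two-dimensional points, Euclidean distance, a fixed number of iterations, and no convergence test):

import random

def kmeans(points, k, iterations=10, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # initial centroids
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7)]
centroids, clusters = kmeans(points, 2)
print(centroids)  # one centroid near (1.2, 1.2), one near (8.5, 8)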
2. Density-Based Clustering
• The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. The algorithm does this by identifying different clusters in the dataset and connecting the areas of high density into clusters.
• The dense areas in data space are separated from each other by sparser areas.

3. Distribution Model-Based Clustering
• In the distribution model-based clustering method, the data is divided based on the probability of how a dataset belongs to a particular distribution.
• The grouping is done by assuming some distribution, most commonly the Gaussian distribution.

4. Connectivity-Based Clustering
• As the name suggests, these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches.
• In the first approach, they start by classifying all data points into separate clusters and then aggregating them as the distance decreases (a minimal sketch of this bottom-up approach follows below). In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. Also, the choice of distance function is subjective.
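A minimal sketch of the first (bottom-up, agglomerative) approach with single-link distances (added for illustration; toy one-dimensional points, merging the two closest clusters until k remain):

def agglomerative(points, k):
    # Start with every point in its own cluster (bottom-up approach).
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Single-link distance: smallest pairwise distance between clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge the two closest clusters
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.2, 1.4, 8.0, 8.3, 9.0], 2))
# [[1.0, 1.2, 1.4], [8.0, 8.3, 9.0]] - the nearest points merge first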
