CLIR Techniques Ver 1
Techniques to select documents in one language based on an information need / query Q expressed in another language.
1.2
FROM
Oard and Diekema identified three basic approaches to CLIR: query translation, document translation, and interlingual techniques [6]. (D. W. Oard and A. R. Diekema, "Cross-Language Information Retrieval," Annual Review of Information Science and Technology, vol. 33.)
HOW,RESULT
Fast MT engine
A. Pirkola, "The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language Information Retrieval," Proc. SIGIR Conference on Research and Development in Information Retrieval, 1998.
ist.org/fileadmin/engage/public_jakarta2006/presentations/A1_02_Adriani.pdf, last visited Dec 2007.
D. Hiemstra, "Using Language Models for Information Retrieval," Ph.D. thesis, University of Twente, Enschede, 2001.
J. S. McCarley, "Should We Translate the Documents or the Queries in Cross-Language Information Retrieval?" Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
Comparing the two opposite translation directions (document vs. query) on an identical English-French data set using MT, the authors found that hybrid systems combining document and query translation outperform systems that depend on query translation only, even with human query translation.
D. Hiemstra, "Using Language Models for Information Retrieval," Ph.D. thesis, University of Twente, Enschede, 2001.
It is difficult to answer this question, and it needs further research. Problems:
- Query translation approach: search is done in the document language; one translation must be picked for each query term.
- Document translation approach: search is done in the language of the query; more than one translation can be used to reach the document.
[This question has been an obstacle for two decades now, as Dr. Oard said in his email.]
1.4
Free Text
1.4.1.1
Parallel Corpora
Analyze the text and automatically extract the needed information from a collection that contains the original documents and their translations. However, suitable collections for applying the technique are limited, because translated documents are expensive to create.
1.4.1.2
Comparable Corpora
The collection is composed of documents on similar subjects but written in different languages. This overcomes the parallel-corpora disadvantage, but similarity depends on the topics addressed in the documents.
1.4.1.3
Monolingual Corpora
1.4.2.1
Dictionary based
Replace each term in the query Q with a suitable term or terms in the target language. Note that entering new terms will motivate the need to describe them, as in Controlled Vocabulary.
Disadvantages of the dictionary-based approach (Oard 1997) [Douglas W. Oard, "Alternative Approaches for Cross-Language Text Retrieval," in Cross-Language Text and Speech Retrieval, AAAI Technical Report]:
1. Many words don't have a unique translation; moreover, a translation may have a different meaning.
2. A translation may require additional words to be added.
3. The query may address a topic outside the dictionary's range.
4. The user may use slang or abbreviations that are not in the dictionary.
This approach will be discussed in more detail later in this report.
1.4.2.2
Ontology based
Ontology is a "sophisticated knowledge structure"
Need reading
1.4.2.3 Thesaurus based
"A Machine Translation Approach to Cross-Language Text Retrieval," M.Sc. thesis, University of Glasgow, Glasgow, United Kingdom; advisor: Mark Sanderson; 1997. Boca Raton, Florida, USA, 2005; http://www.dissertation.com/book.php?book=1581122675&method=ISBN
A multilingual thesaurus organizes terminology from more than one language. It represents term and concept relationships in a way understandable to humans. It can be used manually or automatically, and concept-retrieval techniques are used when concepts are added to the thesaurus.
http://www.glue.umd.edu/~oard/teaching/796/spring04/slides/10/
There are three construction techniques: building a thesaurus from scratch, translating an existing thesaurus, and merging monolingual thesauri.
Type: Controlled Vocabulary | Free Text
Free-text aspects: Corpus based | Knowledge based
Corpus-based sources:
- Parallel corpora (document aligned, sentence aligned, term aligned)
- Comparable corpora
- Monolingual corpora
Knowledge-based sources:
- Dictionary based
- Ontology based
- Thesaurus based
1. Traditional approaches
2. MT approaches
3. Corpus-based methods
4. Dictionary-based methods
These approaches will be discussed in the following sections.
2.1.1 MT refers to
Translating automatically, with or without human aid.
2.1.2.1
Carry out word-by-word translation over the text: translate all words. Morphological knowledge is used, mediated by a huge bilingual dictionary containing word-to-word translation information. After word translation, simple local reordering can be done.
[Figure: direct MT pipeline — source text → morphological analysis → bilingual dictionary lookup → local reordering → morphological generation → target text]
From Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze, MIT Press, 1999.
2.1.2.2
There is no one-to-one association between words. Languages have different word orders, and naive direct translation usually orders words incorrectly. Syntactic ambiguity is also a problem.
From www.stanford.edu/class/linguist180/, Introduction to Computational Linguistics, which uses the Jurafsky book's MT chapters.
2.1.3.1
Depends on applying contrastive knowledge, i.e., knowledge of the differences between languages; it also needs a bilingual dictionary, as in direct MT. Steps:
Analysis: parse the source language syntactically.
Transfer: turn the source-language parse into the target language using rules.
Generation: generate the target sentence from the parse tree.
E.g., an English-to-French rule.
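As a minimal sketch of one transfer rule of this kind (the POS tags, the adjective/noun reordering rule, and the example words are illustrative assumptions, not the document's actual rule):

```python
# Toy syntactic-transfer rule: English "Adj Noun" order becomes the
# French-style "Noun Adj" order. A real system would apply such rules to a
# full parse tree, not a flat tag sequence.
def transfer_adj_noun(tagged):
    """Reorder (word, POS) pairs: an ADJ immediately before a NOUN is swapped."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

# "white house" -> "house white" (the maison blanche order)
print(transfer_adj_noun([("white", "ADJ"), ("house", "NOUN")]))
```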
Such an approach is hard to implement and maintain. From Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze, MIT Press, 1999.
Syntactic Transfer
2.1.3.1.1 Disadvantages
Syntactic ambiguity. Unsuitable semantics: e.g., German expresses "I like to eat" with a verb + adverb construction that English lacks, so a literal transfer yields "I eat gladly | with pleasure".
2.1.3.2
From Foundations of Statistical Natural Language Processing. by Christopher D. Manning and Hinrich Schtze. MIT Press, 1999.
Represent the source sentence's meaning: parse the sentence, then derive its meaning, and finally generate the target language.
Semantic Transfer
2.1.3.2.1 Disadvantages
Even if the literal | exact meaning is correct, the result could be unintelligible | make no sense. E.g., the Spanish phrase the book talked about, which I couldn't understand. [NEED EXAMPLE]
From www.stanford.edu/class/linguist180/ , Introduction to Computational Linguistics.htm that uses the Jurafsky book MT approaches
2.1.4 Interlingua
Depends on meaning, using a meaning representation instead of language-to-language rules. 1. Create a meaning representation [a knowledge-representation formalism independent of any language, capturing literal meaning] by translating the source sentence into it. 2. Use the meaning representation to generate the target sentence. It is easier to write rules this way, but useful information is lost.
From Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze, MIT Press, 1999, chapter 21.
2.1.4.1.1 Disadvantages
It is difficult to design an efficient knowledge representation, because of the huge amount of ambiguity in translating from natural language to the interlingua knowledge representation.
2.2 Modern statistical machine translation (SMT) approaches; focus on results
2.2.1.1 Statistical MT system
ESSLLI Summer Course on SMT (2005), day1, 2, 3, 4, 5 by Chris Callison-Burch and Philipp Koehn , http://www.statmt.org/
Find the most probable English sentence [target language] given the foreign-language sentence [source language].
ê = argmax_e P(e) · P(f | e)
where ê is the best translation, P(e) measures fluency, and P(f | e) measures faithfulness.
From Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze, MIT Press, 1999, chapter 13.
SMT makes compromises: language cultures differ in parallel constructions, tense, and metaphor, so SMT does what human translators do and compromises to be acceptable in both languages. To do so we need three things: a way to quantify fluency (the language model); a way to quantify faithfulness (the translation model); and an algorithm to find the best translation, i.e., the one that maximizes the product P(e) · P(f | e) (the decoding algorithm).
From SMT Tutorial (2003) by Kevin Knight and Philipp Koehn Information Sciences Institute University of Southern California , http://www.statmt.org/
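As a toy sketch of this argmax (the candidate list and all probabilities below are invented for illustration; real decoders search an enormous hypothesis space rather than a fixed list):

```python
# Noisy-channel selection over a fixed candidate list: pick the target
# sentence e that maximizes P(e) * P(f|e).
def best_translation(f, candidates, lm, tm):
    """candidates: list of e; lm[e] = P(e) (fluency); tm[(f, e)] = P(f|e) (faithfulness)."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

# Toy models: the translation model cannot decide, so the language model
# (fluency) breaks the tie in favor of the grammatical word order.
lm = {"the house": 0.6, "house the": 0.1}
tm = {("la maison", "the house"): 0.5, ("la maison", "house the"): 0.5}
print(best_translation("la maison", ["the house", "house the"], lm, tm))
```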
ESSLLI Summer Course on SMT (2005), day1, 2, 3, 4, 5 by Chris Callison-Burch and Philipp Koehn , http://www.statmt.org/
2.2.2.1
A probabilistic context-free grammar (PCFG), or stochastic context-free grammar, is a context-free grammar in which each production [each possible right-hand side] has a probability; the probability of a parse / derivation is the product of the probabilities of the productions used in the derivation. For consistency, the probabilities of all right-hand sides of a given nonterminal symbol must sum to 1.
E.g.
S → NP VP [1.0]
NP → Det N [0.6]
NP → Det Adj N [0.2]
NP → John [0.1]
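The derivation-probability rule above can be sketched directly (the grammar copies the example; a real parser would also search over alternative derivations):

```python
# Probability of a PCFG derivation = product of the probabilities of the
# productions used in it.
from functools import reduce

grammar = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("Det", "Adj", "N")): 0.2,
    ("NP", ("John",)): 0.1,
}

def derivation_prob(productions):
    return reduce(lambda p, rule: p * grammar[rule], productions, 1.0)

# Derivation using S -> NP VP and NP -> John: 1.0 * 0.1 = 0.1
p = derivation_prob([("S", ("NP", "VP")), ("NP", ("John",))])
```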
A bilingual dictionary does not tell which word corresponds to which in a particular sentence pair, but a parallel corpus does. To use one, however, alignment problems must be solved first: word alignment, e.g., Melamed's method or EM-based methods.
ESSLLI Summer Course on SMT (2005), day1, 2, 3, 4, 5 by Chris Callison-Burch and Philipp Koehn , http://www.statmt.org/
The translation model gives higher probability to sentence pairs with related meanings. It is estimated from bilingual corpora; a training corpus is used to estimate the parameters. It is impossible to calculate P(f | e) for whole sentences directly, so sentences are broken into smaller chunks and aligned:
P(f | e) = Σ_a P(a, f | e), where a is an alignment between the words of f and e
P(a, f | e) = ∏_{j=1..m} t(f_j | e_i)
t(f_j | e_i) = count(f_j, e_i) / count(e_i)
From http://www.stanford.edu/class/linguist180/
b_{i-1} is the end position of the foreign phrase generated by the (i-1)-th source phrase e_{i-1}.
2.3 SMT evaluation
Statistical machine translation methods consider many human-produced translation samples, from which SMT algorithms automatically learn how to translate; SMT can be described as the use of machine-learning methods for translation. An SMT system's accuracy depends on: data accuracy, data quantity, data quality, and data domain.
From "Machine Translation in the Year 2004", (K. Knight, D. Marcu), Proc. ICASSP, 2005.
MT system evaluation is a subjective process: since translation is a generation process, not a classification process, evaluating a system is unclear. How can the distance between a human translation and MT output be measured? E.g., from Chinese to English (from the paper): "At least 12 people were killed in the battle last week"; "At least 12 people lost their lives in last week's fight"; "Last week's fight took at least 12 lives"; etc. It has 11 different translations.
From Jurafsky and Martin. 2007. Speech and Language Processing, chapter 25 [I NEED THE BOOK]. Via www.stanford.edu/class/linguist180/ (LINGUIST 180, Introduction to Computational Linguistics), which uses the Jurafsky book's MT chapters.
A good translation depends on: Faithfulness: how close the target meaning is to the source text ("does the translation cause the reader to draw the same inferences as the original would have"); and Fluency: how natural the translation is in the target language.
Translate the query by finding associations between query and documents (implicit dependencies and co-occurrence) independent of language differences. Suppose we have collection C1 in language A, collection C2 in language B, and a collection C3 of documents translated from language A to B; C3 can be observed in both spaces, related to both C1 and C2. If a query Q in language B is submitted, the C3 documents related to it are retrieved and used to retrieve related documents from C1, which uses language A. So query translation and expansion are done using a base of multilingual terminology (terms? "is 'terms' the right word??") derived from parallel or comparable corpora. (Ari Pirkola, 1998) holds the view that in corpus-based methods the query is translated and expanded using a comparable document collection or parallel corpora; a variety of methods have been used, and some report results close to monolingual queries. But approaches that report near-monolingual performance need a training sample representative of the full data and sufficient parallel or comparable corpora to be available. (We don't have that in our data set.)
Problem terms: compound words, proper names, spelling variants, and special terms {I think special terms relate to the medical domain}.
o Compound words are easier to decompose than phrases are to identify.
o Proper names and spelling variants: to translate spelling variants, a variety of methods can be used, such as n-gram based matching, approximate string matching, or transliteration based on phonetic similarity. N-gram matching proves effective compared with the poor results of the other two methods; for that reason they tested the n-gram method and query structuring together, and they claim it can be used in both monolingual and cross-language retrieval.
o Special terms: their translation problem can be solved by using a general plus a domain-specific dictionary; it is possible to combine limited-content dictionaries to cover the general language and many specific domains. Note that a special dictionary reduces the translation-ambiguity problem. Pirkola (1998) used this combination to translate health documents and summarized his experience: sequential translation, in which the special dictionary gets more weight, is appropriate for scientific text.
o Inflected words: stemming is used to deal with inflected words; the output is a root or stem. Morphological analysis is also used, normalizing words into base forms and splitting compound words. The stemming problem is that different headwords can belong to one stem, and if index terms are stemmed, dictionary output words have to be stemmed too. Morphological analysis's effectiveness is limited by the size of the morphological lexicon; it is impossible to list all of a language's words in a lexicon. Note that with stemming, recall improves, since more documents are retrieved; on the other hand, precision improvement depends on the language, as experiments on Arabic, English, and French show.
o Phrases: identified using collection statistics to recognize sub-phrases within phrases, and part-of-speech (POS) tagging.
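The n-gram matching of spelling variants mentioned above can be sketched with a character-bigram Dice coefficient (one common form of n-gram matching; the word pair and the 0.5 threshold are illustrative assumptions):

```python
# Character-bigram Dice similarity for linking spelling variants across
# languages: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|).
def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

# A cross-language variant pair scores high; unrelated words score 0.
print(dice("toksoplasmoosi", "toxoplasmosis") > 0.5)  # True
print(dice("cat", "dog"))  # 0.0
```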
Phrase translation: if phrases are not identified and translated correctly, the effect on queries may be fatal. Phrase translation may be done via a phrase dictionary, though it is impossible to list all phrases, or based on word collocation statistics <???>.
o Translation ambiguity: CLIR dictionary-based techniques to handle ambiguity involve part-of-speech tagging, which had different effects on different languages, and corpus-based disambiguation, including query expansion (Pirkola). <I didn't complete the paper MISSING>
4.2
Three challenges [Oard and Diekema]: What to translate? How to obtain translation knowledge? How to use translation knowledge? All of them are affected by the available translation resources.
Phrase identification
- Syntactic parsing: constructs syntactic structure, e.g., verb phrases. Pre-translation stop-word removal is worthwhile, since stop words could otherwise be incorrectly translated into high-weight terms.
1. Acquire the translation knowledge resource. Note that despite translation knowledge resources, out-of-vocabulary (OOV) terms such as names and geographic locations decrease CLIR system effectiveness if not translated.
2. Extract the knowledge from the resource.
How to do 1 and 2 (acquire and extract knowledge) using the following resources:
Bilingual dictionaries. Acquired by:
These dictionaries often have low coverage and don't cover most languages.
Bilingual corpora. Acquired from: parallel corpora from international organizations, e.g., UN rules published in several languages, European Union publications, and the Bible; and the Web, for both parallel and comparable corpora. [A previous study shows that parallel pages can be found using text related to web-page anchors; mining the web to find parallel pages is a promising technique.]
Techniques used to extract translation knowledge from bilingual corpora include: statistical MT (the GIZA++ toolkit lets the user select a model and change its parameters); and cross-language latent semantic indexing (CL-LSI).
When more than one translation is available [polysemy = a term has more than one meaning], ambiguity must be resolved.
How ambiguity is resolved with a bilingual dictionary:
- The first (early) method selects the single best translation, but there is no guarantee that the first listed translation is the most appropriate; even if it is the right translation, it may not fit the specific sentence being translated, and not all dictionaries put the most equivalent translation first.
- The second approach uses more than one dictionary to find the translation; it seems logically sound to take the translation that appears in most of them.
- The third and most efficient method depends on phrase translation rather than word translation: it identifies and translates phrases. But phrase coverage in bilingual dictionaries is limited.
How ambiguity is resolved with corpora:
- Davis and his colleagues used the POS tag of each English query term and selected the Spanish translation with a matching POS tag; done on the TREC-5 collection, this improved CLIR effectiveness by more than 70%. But conflicting results were obtained: when disambiguation was used independently, it decreased effectiveness.
- Term statistics are also used; the simplest method is to calculate unigram frequency in the target corpus and choose the most frequent translation. But no study differentiates between synonymous ambiguity and polysemous ambiguity, which seems to have the worse influence on CLIR.
- Adding co-occurrence-based phrase translation to dictionary translation improves CLIR effectiveness.
- Another study uses word co-occurrence information to translate and resolve ambiguity, but this improves CLIR effectiveness only slightly; the researchers point to the small training set or domain as the reason for the slight improvement.
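The simplest corpus-based disambiguation described above, picking the dictionary translation with the highest unigram frequency in the target corpus, can be sketched as follows (the toy corpus and alternatives are illustrative assumptions):

```python
# Unigram-frequency disambiguation: among a term's dictionary translations,
# keep the one that occurs most often in the target-language corpus.
from collections import Counter

target_corpus = "the bank approved the loan near the river".split()
freq = Counter(target_corpus)

def pick_translation(alternatives):
    return max(alternatives, key=lambda w: freq[w])

print(pick_translation(["bank", "shore"]))  # "bank" is more frequent
```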
How to deal with out-of-vocabulary (OOV) terms. Three major techniques:
1. Translation. If the languages share similar alphabets:
A) Orthographic mapping: use rules to transform language A's spelling into language B's spelling, e.g., treat an English word as a misspelled French word and take the best-fitting candidate as its translation. If the two languages do not share an alphabet, then B) is considered.
B) Phonetic mapping: OOV words are converted to a phonetic version, which is converted to the target language using phonetic rules [derived by experts or by statistical approaches] and finally transformed into a character sequence.
2. Backoff translation [it shows improvement in retrieval on the CLEF 2000 collection]. Oard developed a backoff translation technique that enhances CLIR retrieval effectiveness. It is a term-by-term translation technique requiring a lexicon [dictionary] in which each French word f is associated with a ranked list of English translations {e1, e2, e3}:
A) Match the input term's surface form [English] against source-language surface forms in the translation dictionary. If this fails, go to B).
B) Match the input term's stem against source-language surface forms in the translation dictionary. If this fails, go to C).
C) Match the input term's surface form against source-language stems in the translation dictionary. If this fails, go to D).
D) Match the input term's stem against source-language stems in the translation dictionary.
3. Pre-translation expansion [query expansion]. Query expansion can be done before or after translation.
Pre-translation expansion: first, monolingual retrieval is performed on a comparable document collection; a new query is created by adding terms from the top-ranked documents; the new query set is translated; then the target collection is searched using the new translated set. Advantages vary according to the experiment; some experiments show it affects retrieval positively when long queries are used, and also when poor translation resources are used.
Post-translation expansion: expand the translated query by adding top terms from the top-ranked documents retrieved using the translated query. Advantages: useful words are translated to the target language; this method is very effective for short queries.
[Document expansion] For CLIR, done as follows: translate the documents to the query language; use each translated document as a query to search the comparable collection; select top-ranked terms from the top-ranked documents and expand the translated document. This shows a slight improvement in retrieval.
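The four-stage backoff lookup above can be sketched as follows (the crude suffix-stripping stemmer and the tiny lexicons are placeholder assumptions standing in for a real stemmer and Oard's actual resources):

```python
# Backoff translation: try surface/stem combinations of the input term
# against surface-form and stem-form dictionary keys, in order A-D.
def stem(w):
    """Crude suffix stripper standing in for a real stemmer."""
    for suf in ("ing", "es", "s", "e"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def backoff_translate(term, surface_lex, stem_lex):
    """surface_lex: surface form -> translations; stem_lex: stem -> translations."""
    for key, lex in ((term, surface_lex),        # A) surface vs. surface
                     (stem(term), surface_lex),  # B) stem vs. surface
                     (term, stem_lex),           # C) surface vs. stems
                     (stem(term), stem_lex)):    # D) stem vs. stems
        if key in lex:
            return lex[key]
    return None  # still out of vocabulary

surface_lex = {"house": ["maison"]}
stem_lex = {"hous": ["maison"]}
print(backoff_translate("houses", surface_lex, stem_lex))  # ['maison'] via stage D
```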
Using the web: the following method may hold for languages with non-cognate character sets; it depends on the observation that particular technical terms stay as they are when borrowed from another language.
2. Weighting translation choices. Techniques developed to weight choices include the following.
Weighted Boolean model (translation-probability knowledge not available). Based on an independence assumption: the user query is constructed as in the monolingual Boolean model, with terms linked by conjunctions (OR, AND, NOT). This model performs better than the vector space model when manual query processing is used. Notation: d = document; t_i (i = 1, 2, …, n) = query term.
P(A AND B) = P(A ∧ B) = P(A)·P(B); weighting function: P(t1 AND … AND tn | d) = ∏_{i=1..n} P(ti | d)
P(A OR B) = P(A ∨ B) = P(A) + P(B) − P(A)·P(B) = 1 − [1 − P(A)]·[1 − P(B)]; weighting function: P(t1 OR … OR tn | d) = 1 − ∏_{i=1..n} [1 − P(ti | d)]
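The AND/OR weighting functions can be sketched directly under the independence assumption (the per-term probabilities P(ti|d) below are toy values for illustration):

```python
# Weighted Boolean scoring: AND multiplies term probabilities, OR combines
# them via the complement product.
from math import prod

def p_and(ps):
    return prod(ps)

def p_or(ps):
    return 1 - prod(1 - p for p in ps)

# Query (t1 OR t2) AND t3 with P(t1|d)=0.5, P(t2|d)=0.5, P(t3|d)=0.8:
score = p_and([p_or([0.5, 0.5]), 0.8])  # (1 - 0.5*0.5) * 0.8 = 0.6
```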
P(NOT A) = 1 − P(A); weighting function: 1 − P(ti | d).
Balanced translation (Levow and Oard, 2002) (translation-probability knowledge not available): a source term's weight is the arithmetic mean of its translations' weights. The balanced-translation problem is that it gives more weight to rare terms. This is logical if the rare translation is correct; but suppressing rare terms when no translation-probability information is available and a common alternative is present is generally good.
Structured queries, the "Pirkola approach" (translation-probability knowledge not available) {why not apply it for medical image retrieval?}: compute a source term's TF and DF from its translations' TF and DF. Queries are structured via the synonym operator of the InQuery retrieval system, which was originally designed for monolingual retrieval by thesaurus expansion. Pseudo / artificial terms are grouped with the synonym operator, and TF (normalized by DF) and DF are computed as follows: TF = the sum of the TFs of each translation alternative found in the document; DF = the number of documents in which at least one translation alternative is found = |∪_{i=1..n} {d | t_i ∈ d}|. Both TF and DF are pre-computed for each possible query term at indexing time. When no translation probabilities are known, the most effective approach is the structured query (SQ).
Later, Kwok simplified Pirkola's method by replacing the union with a sum in the DF equation without harming retrieval effectiveness, so DF becomes the sum of the DFs of the translation alternatives.
NOTE that in SQ, a translation that appears in many documents gets low weight, since its DF will be high and hence its term weight low. Balanced translation is more sensitive to rare translations than SQ; for example, in Chinese, if word segmentation has many errors, rare translations result, and balanced translation gives poor results.
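The Pirkola-style synonym-operator statistics can be sketched as follows (the toy document list and the translation alternatives are illustrative assumptions; a real system would use the InQuery index):

```python
# Pirkola structured-query statistics: a source term's TF in a document is
# the sum of its translation alternatives' TFs; its DF is the number of
# documents containing at least one alternative (a union, not a sum).
def pirkola_tf(doc_terms, alternatives):
    return sum(doc_terms.count(t) for t in alternatives)

def pirkola_df(docs, alternatives):
    return sum(1 for d in docs if any(t in d for t in alternatives))

docs = [["maison", "grande"], ["demeure"], ["rue"]]
alts = ["maison", "demeure"]      # translation alternatives of "house"
print(pirkola_tf(docs[0], alts))  # 1
print(pirkola_df(docs, alts))     # 2
```

Kwok's simplification would replace the union in `pirkola_df` with the plain sum of per-alternative DFs.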
Weighted IDF (translation-probability knowledge available). Translation-probability knowledge becomes available through: using more than one bilingual dictionary;
use of bilingual corpora. This method uses translation probabilities to modify the inverse document frequency (IDF) across the translation alternatives. The balanced-translation approach can be considered a special case of weighted IDF.
Probabilistic structured query (PSQ) {why not apply it for medical image retrieval?}:
where P(ti | c) is the probability of query term c translating into document term ti. PSQ can be better than the structured query as the number of alternatives increases.
Language modeling approach {why not apply it for medical image retrieval?}: Hiemstra divided the experiment into three sections: use one translation per source term; translate the query as an unstructured query [bag of words] of all possible translations of each source query term; use a structured query over all possible translations of each source query term. Hiemstra showed that the structured query consistently outperforms the one-best translation and the unstructured query, but is less effective than monolingual retrieval. Hiemstra experimented on a small Dutch-English collection with some limitations, such as using only 24 topics, which makes the tests less reliable.
Kraaij developed CLIR models of probabilistic query translation, document translation, and document-plus-query translation, where both document and query are translated into a third language using the language modeling approach; he found that all of them performed like the structured query. He noted that poor structured-query performance results from a large number of translation alternatives and from translation probabilities not being used.
Bidirectional translation: merge document and query translation; more than one study shows improvement in CLIR when this method is used. [I will not go through this; mainly I want to depend on the query, at least for now.]
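The PSQ-style weighting can be sketched as follows (the translation probabilities and term counts are illustrative assumptions; real PSQ also aggregates DF this way inside the retrieval engine):

```python
# PSQ-style term frequency: each translation alternative's raw TF is
# weighted by its translation probability P(t_i | c) before summing,
# instead of counting all alternatives equally as in Pirkola's SQ.
def psq_tf(doc_tf, trans_prob):
    """doc_tf: term -> raw TF in the document; trans_prob: term -> P(t_i | c)."""
    return sum(p * doc_tf.get(t, 0) for t, p in trans_prob.items())

doc_tf = {"maison": 3, "demeure": 1}
trans_prob = {"maison": 0.8, "demeure": 0.2}  # alternatives of "house"
weight = psq_tf(doc_tf, trans_prob)  # 0.8*3 + 0.2*1 = 2.6
```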
"The first approach is often referred to as query translation if done at query time and as document translation if done at indexing time, but in practice both approaches require that document-language evidence be used to compute query-language term weights that can then be combined as if the documents had been written in the query language" THE "Authors focus on" Focus on machine readable bilingual dictionary list for its simplicity
1. Document processing
A document, for an IR system, consists of terms, for example term1, term2, and term3; categorizing those terms helps when they, or one of them, appear in a query. This process consists of: extraction of indexing terms; document expansion. Extraction differs from one language to another, but extracting terms at white spaces is simple. It can be done by 1) automatically segmenting the text stream into a single chain (string) of non-overlapping words, which might then be subject to further processing such as stemming, or 2) indexing overlapping character sequences.
1.1.
Language | Method
English and French | Both are written with space delimiters. A) Tokenization, clitic splitting, and stemming: simple pattern-based approaches {what are they, such as ??} separate words and punctuation. Clitic splitting is done at the apostrophe; one technique is a twofold process of separating the clitic and then expanding it, for example "you're" → "you re" → "you are". B) For the resulting words, morphological analysis or a stemmer is used. Morphological analysis maps verbs to their roots. Stemming involves removing prefixes, suffixes, or both; a widely used stemmer is Porter (1980). Stemming increases matches between normalized documents and queries, since stemming collapses parts of speech rather than preserving them as morphological analysis does.
German | German uses white-space delimiters for word separation, plus decompounding. Morphological analysis can be done using the same approaches as in English and French.
{NOTE: in Hiemstra's PhD thesis, when he talks about tokenization he says to just keep digits and letters, removing other symbols and replacing them with spaces.}
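The tokenization plus clitic-expansion step above can be sketched as follows (the regex and the tiny clitic table are assumptions covering only the example; a real system would use a fuller table and language-specific rules):

```python
# Minimal tokenizer with clitic expansion: lowercase, pull out word-like
# tokens, then expand known clitic forms ("you're" -> "you are").
import re

CLITICS = {"you're": ["you", "are"], "don't": ["do", "not"]}

def tokenize(text):
    tokens = []
    for raw in re.findall(r"[A-Za-z']+", text.lower()):
        tokens.extend(CLITICS.get(raw, [raw]))
    return tokens

print(tokenize("You're here."))  # ['you', 'are', 'here']
```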
1.2.
Document expansion
This method was first proposed by Singhal and Pereira (1999) in the context of spoken document retrieval. It can be applied in CLIR to find words that the author could possibly have used. The authors used this method on news stories, expanding each document with related terms from other documents as follows:
- Extract the document's terms.
- Weight all terms equally, as in a query.
- Use the collection as a comparable corpus that is searched for terms to enrich the document.
- Rank the collection and select the top 5 documents, excluding the original one.
- Select terms from these documents by ranking terms based on IDF.
- Add an instance of each selected term to the original document, until the document length has doubled.
As with documents, query processing depends on extracting terms; query expansion can be used too to augment the representation.
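The expansion loop above can be sketched as follows (plain term overlap stands in for a real ranking function, and the toy collection is an illustrative assumption):

```python
# Document expansion sketch: retrieve related documents from the collection,
# then append their highest-IDF unseen terms until the document doubles.
from math import log

def expand_document(doc, collection, top_docs=5, max_len_factor=2):
    n = len(collection)
    df = {}
    for d in collection:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    idf = {t: log(n / df[t]) for t in df}
    overlap = lambda d: len(set(d) & set(doc))
    # rank other documents by term overlap with doc (stand-in for a ranker)
    ranked = sorted((d for d in collection if d is not doc and overlap(d) > 0),
                    key=overlap, reverse=True)[:top_docs]
    candidates = sorted({t for d in ranked for t in d} - set(doc),
                        key=lambda t: idf[t], reverse=True)
    expanded = list(doc)
    for t in candidates:
        if len(expanded) >= max_len_factor * len(doc):
            break
        expanded.append(t)
    return expanded

coll = [["nuclear", "power", "plant"], ["nuclear", "reactor"], ["stock", "market"]]
print(expand_document(coll[0], coll))  # ['nuclear', 'power', 'plant', 'reactor']
```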
1.1.
The same techniques as for documents are used, but the important difference is that query terms are not used for matching; they are used as the translation base. The authors apply the greedy longest-match technique, since translating multiword expressions as a unit is useful for CLIR {the authors mention a paper about multiword units in CLIR (Ballesteros & Croft, 1997)}.
1.2.
A short query can cause ambiguity, which reduces precision, or omission of terms the document authors used, which reduces recall; but query expansion generally increases effectiveness. The authors also mention that an English-Spanish study found that query expansion pre- and post-translation enhanced precision and recall: pre-translation expansion improves both recall and precision, while post-translation expansion increases recall only. The authors perform query expansion in the following steps:
o Construct the initial query for INQUERY.
o Obtain the expansion set using INQUERY relevance feedback from the top-ranked 10 documents.
o Concatenate the expansion set with the original query and use it as the translation base.
2. Translation knowledge and query translation 2.1. Bilingual term lists and optimizing coverage
Bilingual term lists are easily found, but they are constructed for various purposes and may be taken from various resources; they also vary in number of entries, source, number of multi-word entries, degree of ambiguity, and the mix of surface-form and root-form entries. Note that in their experiment, construction of "an English-Arabic term list [was] done by sending each unique word found in a large collection of English news stories to two Internet-accessible English-to-Arabic translation services and merging the results."
2.2.
Weight mapping
A previously used technique in dictionary-based query translation includes all translations of all query words. When used with "bag of terms" retrieval, such as the vector space model where terms are treated independently / separately, query terms with many translations receive undue focus, which is undesirable for an IR system, since terms with fewer translations usually give better results.
The balanced-translation problem is that it gives more weight to rare terms; this is logical if the rare translation is correct. But suppressing rare terms when no translation-probability information is available and a common alternative is present is generally good.
2.3.
2.6.7 LIG: MRIM-LIG, Grenoble, France. MRIM-LIG submitted 6 runs, all of them textual. Besides the best textual results, this was also the best overall result in 2007. {looking for the paper}
Julien Gobeill, Henning Müller, Patrick Ruch, "Query and Document Translation by Automatic Text Categorization: A Simple Approach to Establish a Strong Textual Baseline for ImageCLEFmed 2006," Working Notes of the Cross Language Evaluation Forum (CLEF 2006), Alicante, Spain, 2006.
- Can be applied with any controlled vocabulary {the experiment uses MeSH}.
- No need for training data.
- This approach is competitive with effective CLIR approaches.
Each document is annotated with a varying number of MeSH categories (3, 5, or 8); all queries are annotated with 3 MeSH categories.
An automatic text categorizer that contains MeSH in English, French, and German.
Categorizer {can be built using Bayes, k-nearest neighbor, boosting, or rule-based methods}: this categorizer relies on poor data, so they categorize by treating MeSH terms as if they were documents and the input documents as queries.
Regular-expression pattern matcher: applied to canonical MeSH augmented (extended) with the MeSH thesaurus. The system moves a window along the abstract and cuts it into 5-token phrases; a distance is calculated between these 5 tokens and each MeSH term, where an insertion costs 1 and a deletion costs 2, and terms are ranked using this distance measure.
Vector-space component.
Visual retrieval using GIFT.
Based on a general IR engine with tf.idf ("SMART representation") and a list of 544 stop words. For the weighting factor, cosine normalization is used "for its effective results on documents of similar length". Even with a controlled vocabulary, IDF was able to discriminate between content-bearing and non-content-bearing features such as "syndrome" and "disease".
The pattern matcher gives better results than the vector space engine. - A (non-linear) combination of the first two classifiers is done as follows => the list returned by the vector space engine is used as the reference list (RL) and the RegExp list as the boosting (enhancing) list (BL); besides that, long and compound terms are favoured over short and single terms.
=> The combined retrieval status value (cRSV) is computed as follows: for each concept t in the RL list, cRSVt = {equation !!!}
Categorizer and indexing
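The pattern matcher's windowed distance described above can be sketched as a token-level edit distance with the stated asymmetric costs (insertion 1, deletion 2), minimized over the 5-token windows of the abstract. The function names and the substitution handling (modeled as delete + insert) are my own assumptions:

```python
def edit_distance(a, b, ins_cost=1, del_cost=2):
    """Token-level edit distance where inserting a token costs 1 and
    deleting one costs 2 (substitution = delete + insert)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else del_cost + ins_cost
            d[i][j] = min(d[i - 1][j] + del_cost,   # delete from a
                          d[i][j - 1] + ins_cost,   # insert into a
                          d[i - 1][j - 1] + sub)    # match / substitute
    return d[m][n]

def best_window_distance(abstract_tokens, mesh_tokens, window=5):
    """Slide a 5-token window over the abstract and keep the smallest
    distance to the MeSH term; lower distance ranks the term higher."""
    spans = [abstract_tokens[i:i + window]
             for i in range(max(1, len(abstract_tokens) - window + 1))]
    return min(edit_distance(list(s), list(mesh_tokens)) for s in spans)
```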
To translate queries and documents, the English, French, and German versions of MeSH are merged in the categorizer, and the stop-word lists of each language are merged as well.
Results
Textual retrieval results (runs generated automatically): the English retrieval results were the second best among the groups.
Visual and mixed runs: all visual runs are not as good as the textual ones; this needs further investigation.
MIRACLE: MIRACLE submitted 36 runs in total, the most runs of all groups. The textual runs were among the best, whereas visual retrieval was in the midfield. The combined runs were worse than text alone and also only in the midfield.
Appendix A , Definitions
1.1. Polysemy
Polysemy = multiple meanings. A word is judged to be polysemous if it has two senses whose meanings are related. http://en.wikipedia.org/wiki/Polysemy
1.2. Medical Literature Analysis and Retrieval System Online (MEDLINE)
"MEDLINE is an international literature database of life sciences and biomedical information. It covers the fields of medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care. MEDLINE covers much of the literature in biology and biochemistry, and fields with no direct medical connection, such as molecular evolution. Listing of an article in MEDLINE does not mean endorsement of that article. Compiled by the U.S. National Library of Medicine (NLM), MEDLINE is freely available on the Internet and searchable via PubMed and NLM's National Center for Biotechnology Information's Entrez system."
http://en.wikipedia.org/wiki/MEDLINE
1.3 Descriptors
Descriptors are arranged in a hierarchy; a descriptor may appear in several places in the hierarchy, so a descriptor can carry several tree numbers, and each descriptor has a unique ID.
1.4 Description
These enlarge the thesaurus and have links to the closest fitting descriptor.
http://en.wikipedia.org/wiki/Medical_Subject_Headings
SMART
"The SMART (Salton's Magic Automatic Retriever of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s. Many important concepts in information retrieval were developed as part of research on the SMART system, including the vector space model and relevance feedback."
http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
Stemming
For stemming algorithms, see Porter (1980) and Frakes (1992), as in http://www.dissertation.com/book.php?book=1581122675&method=ISBN. What characterizes a word to be indexed is the fulfillment of the following requirements:
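As a toy illustration of suffix-stripping stemming (a drastic simplification; the real Porter (1980) algorithm applies ordered rule sets with measure conditions on the remaining stem):

```python
def simple_stem(word):
    """A much-simplified suffix stripper in the spirit of Porter (1980):
    strip the first matching suffix, but only if a stem of at least
    3 characters remains."""
    for suffix in ("ational", "ization", "fulness", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Conflates morphological variants onto a shared index form:
# "retrieving" -> "retriev", "cats" -> "cat", "run" -> "run"
```

For indexing, this conflation is what matters: query and document variants of the same word map to one index entry.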
Pelissier and others [86]. Merging thesauri in different languages is discussed by Kalachkina [87].
Clitic
"An unstressed word, typically a function word, that is incapable of standing on its own and attaches in pronunciation to a stressed word, with which it forms a single accentual unit. Examples of clitics are the pronoun 'em in I see 'em and the definite article in French l'arme, 'the arm.'"
INQUERY
"J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Proceedings of the Third International Conference on Database and Expert Systems Applications, pages 78-83, Valencia, Spain, 1992. Springer-Verlag."
A well-known retrieval system based on a probabilistic retrieval model (a Bayesian inference net); it supports complicated indexing and multipart/complex queries.
INQUERY Architecture
Based on a document retrieval inference network (a type of Bayesian network). The inference net consists of two networks: the first for documents (which are represented using different techniques) and the second for queries (whose nodes are always true). Query nodes are related to document nodes through concept and content nodes, and this may be done within more than one concept or a single node; it is not a one-to-one mapping. Nodes have the value true or false, and arcs have a belief value between 0 and 1; if a document node represents the query of the user, it gets the value true.
Mapping (parsing) is done using five components:
1. Lexical analyzer: provides lexical tokens to the syntactic analyzer.
2. Syntactic analyzer: ensures that the document is in the expected format.
3. Concept identification: identifies high-level concepts such as dates and names of persons and companies, and also defines sentence and paragraph boundaries in the documents.
4. Dictionary storage: first stored in a B-tree, then a hash table is used, so that when needed a word or sentence is replaced with its entry in the dictionary.
5. Transaction generation.
"Language modeling is the art of determining the probability of a sequence of words."
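A common way to apply this to retrieval (as in Hiemstra's thesis cited earlier) is the query-likelihood model: score each document by the smoothed probability its unigram language model assigns to the query. A minimal sketch with Jelinek-Mercer smoothing; the parameter lam = 0.5 and the sample documents are arbitrary illustrations:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Log-probability of the query under the document's unigram
    language model, smoothed with the collection model:
    P(q) = lam * P(q|doc) + (1 - lam) * P(q|collection)."""
    doc_tf, doc_len = Counter(doc), len(doc)
    col_tf, col_len = Counter(collection), len(collection)
    score = 0.0
    for q in query:
        p_doc = doc_tf[q] / doc_len
        p_col = col_tf[q] / col_len
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

# A document containing the query term scores higher than one that
# matches only through the collection-wide smoothing term:
doc1, doc2 = ["heart", "disease"], ["lung", "cancer"]
collection = doc1 + doc2
s1 = query_likelihood(["heart"], doc1, collection)
s2 = query_likelihood(["heart"], doc2, collection)
```

Smoothing is what keeps a missing query term from zeroing out the whole document score.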
John Coleman
From Oard's lecture: Natural Language Processing for Information Retrieval, Douglas W. Oard.