Professional Documents
Culture Documents
Databases and Computerized Information Retrieval
Databases and Computerized Information Retrieval
****
What is a database?
A database is a collection of similar data records stored in a common file (or collection of files).
****
***-
Presentation of information
***-
***-
***-
***-
Database
Search engine
Search interface
User User
**--
Retrieval
User User
**--
10
?? Question ??
The Therecords recordsinput inputin ina adatabase databasesystem systemto tobe beindexed indexed do not necessarily appear completely do not necessarily appear completely in inthe theoutput outputphase; phase; that thatis: is:they theyare arenot notshown showncompletely completely to tothe theuser userof ofthe thesystem systemin inthe theresults resultsof ofa aquery. query. Can Canyou youillustrate illustratethis? this?
****
11
***-
12
***-
13
***-
14
***-
15
***-
16
***-
17
***-
18
***-
19
**--
20
!! Question !!
The Thedatabase databasearchitecture architecturedescribed describedhere hereis issimple, simple, but which factors make retrieval but which factors make retrieval nevertheless neverthelessa acomplex complexprocedure procedure in inmany manyreal realdatabases databaseswith withthis thisarchitecture? architecture?
**--
21
**--
22
Repeated fields
****
23
***-
24
***-
25
Retrieval can be enhanced by coping with the problems caused by the use of natural language. Contributions to this enhancement of retrieval can be made by the database producer the computerized retrieval system the searcher/user
(The distinction between these is not very sharp and clear in all cases.)
**--
26
!! Task - Assignment !!
Read Readabout about Language Languageand andinformation informationretrieval retrieval by byLarge, Large,Andrew, Andrew,Tedd, Tedd,Lucy LucyA., A.,and andHartley, Hartley,R.J. R.J. Chapter Chapter4 4in: in:Information Informationseeking seekingin inthe theonline onlineage: age: principles and practice. principles and practice. London London::Bowker-Saur, Bowker-Saur,1999, 1999,308 308pp. pp.
**--
27
!! Task - Assignment !!
Read Readabout about Information .. Informationorganization organization By ByLarge, Large,Andrew, Andrew,Tedd, Tedd,Lucy LucyA., A.,and andHartley, Hartley,R.J. R.J. Chapter Chapter5 5in: in:Information Informationseeking seekingin inthe theonline onlineage: age: principles and practice. principles and practice. London .. London::Bowker-Saur, Bowker-Saur,1999, 1999,308 308pp pp
****
28
****
29
!
**** 30
****
31
!
**** 32
***-
33
adding to each database record those codes from a classification system or terms from a thesaurus system that are relevant, and providing the user with knowledge about the system used; in some cases, this process is computerized (with intellectual intervention or completely automatic)
***-
34
Moreover, in practice, most users of the resulting database do not exploit this method offered.
***-
35
offering to the user a partly computerized access to the particular subject description system used by the database producer, and then linking to the database for searching computerized, automatic, analysis of the free text search terms applied in a query by the user, for transparent mapping to the corresponding particular classification codes, categories, or thesaurus terms used by the database producer
**--
36
offering the searching user access to a (general) thesaurus system, even when the database producer has not categorised the database contents; in this way, the user can refine his/her query better, and more generally: computerized, automatic expansion of the query terms introduced by the user, based on a general thesaurus! (however, not many retrieval systems offer this feature)
**--
37
offering the possibility to the user to truncate a search term explicitly computerized, automatic, transparent truncation without explicit user action
**--
38
computerized, automatic, transparent, intelligent morphological analysis of the query terms: stemming of the free text search terms used by the user; however, this does not work perfectly and has not (yet) been implemented in most retrieval systems; for languages that have a richer morphology than English, this can offer even a larger pay-off
****
39
?? Question ??
Which Whichproblems problemsin intext textretrieval retrieval are areillustrated illustratedby bythe thefollowing followingsentences? sentences?
!
****Examples 40
****Examples
41
****Examples
42
****
43
****Example
44
****Example
45
!
**** 46
***-
47
adding to each database record codes from a classification system or terms from a thesaurus system, and providing the user with knowledge about the system used; in some cases, this process is computerized (completely automatic or with intellectual intervention);
***-
48
offering to the user a partly computerized access to the subject description system and then linking to the database for searching
***-
49
the public access module of the book catalogue of the library automation system VUBIS at VUB, Belgium, when a searching items that were assigned a particular keyword
*---
50
!! Task - Assignment !!
Search SearchClusty Clustyor orVivisimo Vivisimoor orWisenut Wisenut as an example of a system that as an example of a system thatapplies applies automatic, computerized automatic, computerized subject subjectcategorization categorization of database of databaserecords. records.
***-
51
Natural language processing of the queries: linguistic analysis to determine possible meanings of the query, which includes disambiguation of words in their context: lexical analysis = at the level of the word semantic analysis = at the level of the sentence However, most queries are short and therefore it is difficult to apply semantic analysis for disambiguation.
***-
52
Natural language processing of the documents: linguistic analysis to determine possible meanings of a sentence, which includes disambiguation of words in their context: lexical analysis = at the level of the word semantic analysis = at the level of the sentence However, most retrieval systems do not apply this complicated method.
****
53
Word2
Concept2
Word3
Concept3
The most simple relation between words and concepts is NOT valid.
****
54
Word2
Irrelevant concept 2
Word3
Irrelevant concept3
A concept cannot be covered by only 1 word or term; this may be the cause of low recall of a search. The meaning of many words is ambiguous; this may be the cause of low precision of a search.
**--
55
**--
56
**--
57
using a categorization system and also adapting this continuously to the changing reality and meanings of terms
***-
58
***-
59
the user can and should indicate explicitly that a few words should be considered together by the retrieval system as forming a phrase/term (for instance in many Internet search engines by putting the phrase in quotes like three word phrase)
***-
60
better: the retrieval system automatically recognizes a phrase/term relying on a term bank that has been created in advance; examples: the Internet search engines AltaVista and Scirus work in this way
**--
61
!
**-62
**--
63
!
**--Examples 64
**--Examples
65
!
66
**--
**--
67
**--
68
***-
69
?? Question ??
Which Whichare aresome someof ofthe theproblems problems caused by the use of language caused by the use of language in ininformation informationretrieval? retrieval?
!
**-70
**--
71
**--
72
translating the query of the user, by using a general multilingual thesaurus; however, most free text queries are quite short, which makes it difficult to use the context to limit possible ambiguity; disambiguation by user-computer interaction offered by the query interface, can increase the effectiveness here.
**--
73
!
**-74
**--
75
!
**-76
**--
77
**--
78
**--
79
accepts words / terms / phrases in the query of the user maps the words to corresponding concepts presents these concepts to the user who can then select the appropriate, relevant concept (disambiguation) searches for this concept, even in documents written in another language presents the resulting, retrieved documents in the language preferred by the user
**--
80
Natural language processing of the documents AND of the query Comparison and matching of both
***-
81
**--
82
!! Task - Assignment !!
Recommended Recommendedreading: reading: Veal, D.C. Veal, D.C. Progress Progressin indocumentation: documentation: Techniques of document Techniques of documentmanagement: management: a review of text retrieval and related a review of text retrieval and relatedtechnologies. technologies. J. J.Doc., Doc.,Vol. Vol.57, 57,No. No.2, 2,March March2001, 2001,pp. pp.192-217. 192-217.
**--
83
!! Task - Assignment !!
Recommended Recommendedreading: reading: Chowdhury, G. G., and Chowdhury, Chowdhury, G. G., and Chowdhury,Sudatta Sudatta Information Informationretrieval retrievalin indigital digitallibraries. libraries. In: In:Introduction Introductionto todigital digitallibraries. libraries. London London::Facet FacetPublishing, Publishing,2003, 2003,354 354pp. pp.
**--
84
?? Question ??
Explain Explainthe thebasic basicrelations/similarities relations/similaritiesin in speech (speech speechrecognition recognition (speechto totext) text) translation translationof ofa atext text summarizing summarizingtexts texts text textretrieval retrieval cross-language cross-languagetext textretrieval retrieval (text (textto totext) text) (text (textto tosummary) summary) (query (queryto totexts) texts) (( combination )) combination
****
85
****
86
****
87
****
88
****
89
****
90
**--
91
****
92
Boolean operators: OR = obtain records that contain one or both search terms AND = obtain records that contain both search terms NOT or ANDNOT or AND NOT = exclude records that contain a search term
****
93
AND ...
***-
94
?? Question ??
Suppose Supposethat thatyou youwant wantto tosearch searchfor fora atopic topic that has several synonyms that has several synonyms (for (forexample, example,young youngpeople, people,adolescents, adolescents,teenagers, teenagers,teens). teens). Then Thenwhich whichone oneof ofthe thefollowing followingoperators operators would wouldyou youuse usein inyour yourquery? query? ADJ AND NEAR NOT ADJ AND NEAR NOT OR OR
****
95
****
96
***-
97
?? Question ??
Why Whyis isit itimportant importantto toknow knowthe thedefault defaultBoolean Booleanoperator operator in the search system that you in the search system that youuse? use? You Youcan canalso alsoexplain explainthis thiswith withan anexample. example.
***-
98
!! Task - Assignment !!
You Youcan canread read Cohen, Laura Cohen, Laura Boolean Booleansearching searchingon onthe theInternet. Internet.[online] [online] Available from: Available from: http://library.albany.edu/internet/boolean.html http://library.albany.edu/internet/boolean.html University UniversityLibraries, Libraries,University Universityat atAlbany, Albany,USA. USA. [cited 2006] [cited 2006]
***-
99
?? Question ??
You Youwant wantto tosearch searcha adatabase databasefor fora alow-fat low-fatrecipe recipe for pasta with either shrimp or chicken. for pasta with either shrimp or chicken. Which Whichquery querydemonstrates demonstratesthe theproper properuse useof ofnesting nesting to get many search results that are very relevant? to get many search results that are very relevant? 1. 1. noodles noodlesor or(pasta (pastaand andshrimp) shrimp)or orchicken chickenand andlow-fat low-fat 2. (noodles or pasta) and (shrimp or chicken) and 2. (noodles or pasta) and (shrimp or chicken) andlow-fat low-fat 3. 3. noodles noodlesor orpasta pastaand and(shrimp (shrimpor orchicken) chicken)and andlow-fat low-fat 4. 4. (noodles (noodlesor orpasta) pasta)and andshrimp shrimpor or(chicken (chickenand andlow-fat) low-fat) 5. noodles or pasta and shrimp or chicken and low-fat 5. noodles or pasta and shrimp or chicken and low-fat
***-
100
?? Question ??
You Youneed needinformation informationon onthe thecommunication communicationstrategies strategies applied appliedby bythe thepopular popularstar starMadonna. Madonna. Which Whichquery querywill willprobably probablybe bethe themost mostefficient efficientone one in insome someparticular particulardatabase, database, (of (ofcourse coursein inthe thecase casethat thatthe thedatabase databaseunderstands understandsthe theoperators operatorsapplied) applied) 1. 1. 2. 2. 3. 3. 4. 4. 5. 5. Communication CommunicationAND ANDstrategies strategies Madonna AND communication Madonna AND communicationAND ANDstrategies strategies Madonna MadonnaOR ORcommunication communicationOR ORstrategies strategies Strategies OR communication Strategies OR communication Madonna Madonna
****
101
?? Question ??
How Howmany many(and (andwhich) which)concepts/facets concepts/facets do you see in a search do you see in a searchfor for general reviews general reviews about about monitoring seawater monitoring seawaterpollution pollution that is due to effluents in that is due to effluents inTanzania? Tanzania?
****
102
!! Task - Assignment !!
Prepare Prepareoff-line, off-line,on onpaper, paper,a asuitable suitablesearch searchquery query in ina ageneric genericformat, format,to tofind find general generalreviews reviews about about monitoring seawater pollution monitoring seawater pollutionthat thatis isdue dueto toeffluents effluents as the basis for later, concrete searches in databases. as the basis for later, concrete searches in databases. (Limit (Limityourself yourselfto to1 1of ofthe theconcepts.) concepts.)
***-Example
103
****
104
?? Question ??
What Whatdid didyou youlearn learn from the exercise from the exercise on the formulation on the formulationof ofa aquery? query?
**--
105
!! Task - Assignment !!
Prepare Prepareoff-line, off-line,on onpaper, paper,a asuitable suitablesearch searchquery query in a generic format, to find documents about in a generic format, to find documents about how howto toevaluate evaluatethe theability ability to find scientific information to find scientific information of starting university of starting universitystudents studentsup upto toprofessional professionalscientists scientists as the basis for later, concrete searches in databases. as the basis for later, concrete searches in databases. (Limit (Limityourself yourselfto to1 1of ofthe theconcepts.) concepts.)
***-
106
?? Question ??
How Howcan canwe weexploit exploitin insome somesearches searchesthe thefact fact that many bibliographic databases that many bibliographic databases (in particular (in particularthe thecommercial, commercial,expensive expensiveones) ones) offer records with a field structure? offer records with a field structure?
**--
107
!! Task - Assignment !!
Read Read
Luther, Luther,Judy, Judy,Kelly, Kelly,Maureen, Maureen,and andBeagle, Beagle,Donald Donald Visualize this Visualize this (Visualization (Visualizationsoftware softwaremay maybecome becomeaapowerful powerfulnew newway wayto tosearch search or oraafootnote footnotein intechnology technologyhistory). history). Library LibraryJournal, Journal,March March1, 1,2005, 2005,pp. pp.34-37. 34-37.
****
108
Feedback
Results
****
109
**--
110
!! Task - Assignment !!
Search Searchin inthe thefreely freelyaccessible accessibleERIC ERICdatabase database for documents on for documents on courses coursesoffered offeredthrough throughthe theweb web in the field of architecture, or history, or computer in the field of architecture, or history, or computerapplications. applications. This Thisis isnot noteasy, easy, because becausewords wordslike likeweb, web,architecture, architecture,history, history,and andcomputers, computers, can canhave haveother othermeanings meaningsthan thantitles titlesof ofcourses. courses. Therefore, Therefore,find findand anduse usethe thecontrolled controlledsubject subjectterms terms that thatare areadded addedby bythe thedatabase databaseproducer producer and andsee seethat thatthe theresults resultsare arebetter. better.
****
111
The ability to ask the right question is more than half the battle of finding the answer. Thomas J. Watson
****
112
**** You are free to copy, distribute, display this work under the following conditions: Attribution: You must mention the author. Noncommercial: You may not use this work for commercial purposes. No Derivative Works: You may not change, modify, alter, transform, or build upon this work. For any reuse or distribution, you must make clear to others the license terms of this work.
113