Information Retrieval

Department of Electrical and Computer Engineering
University of the Peloponnese
Information Systems, Data Mining and Business Intelligence
Introduction to Information Retrieval Systems
Vasileios Tampakas
Professor
Lecture | Subject | Type | Assignments
1 | Indexing and information retrieval topics (1) | Theory + Lab | Assignment 1
2 | Indexing and information retrieval topics (2) | Theory + Lab |
3 | Advanced topics in graph and document databases | Theory + Lab |
4 | Distributed Big Data processing systems | Theory |
5 | Crawling and spiders topics | Lab + Theory | Assignment 2
6 | Crawling and spiders topics | Lab + Theory |
2 mandatory assignments: 40% of the final grade
Final theory exam: 60% of the final grade
Teams of 2 must be formed for the assignments.
Back-End architecture for big data mining and analysis
Data according to their structure
Structured data
Have a specific, strict structure and schema.
E.g. the data of a relational database.
Unstructured data
Have no structure or schema.
E.g. video, audio, spatial data, weather data.
Semi-structured data
Also referred to as self-describing data.
Have limited structure or schema. They contain tags or other markers that separate semantic elements and enforce hierarchies.
E.g. email, CSV, XML and JSON documents, HTML.
80% of the world's data is in unstructured or semi-structured form.
[Figure: BackEnd Architecture, technologies and applications. Input: unstructured and semi-structured data from the Web and social networks, acquired through crawling, APIs, and streaming. Core: processing engine with data storage (databases, indexing). Output: communication interfaces (API), querying, analytics, Big Data analytics. Applications shown: graph construction and processing, text management.]
BackEnd Architecture: technologies
• Distributed processing engine over a cluster of nodes: Spark (mining Big Data and analytics)
• Data storage, NoSQL databases:
– Neo4j: graph construction and processing
– MongoDB: text management
• Information retrieval tools, document analytics: Lucene and Elasticsearch (indexing and efficient retrieval)
• Data acquisition:
– Social networks through APIs
– Social networks through crawlers
– Web through crawlers
• Type of processing: batch or streaming (e.g. Kafka, Flume)
Information Retrieval Tools
Definition
Text Retrieval Systems provide indexing, efficient retrieval, and management of free text.
What is text indexing?
Information Retrieval tools vs Document Databases
• Information retrieval tools: Lucene, Elasticsearch, ELK, Solr
• Document databases: e.g. MongoDB
Lucene
Lucene is a set of libraries that can be used for full text search.
Lucene is a high-performance, scalable information retrieval (IR) library.
It’s a free, open source project implemented in Java
Lucene is currently the most popular free IR library.

ELK
ELK is an acronym for Elasticsearch, Logstash and Kibana

Elasticsearch is a search server built on top of Lucene.
It supports distributed search in a multitenant environment.
It is a scalable search engine: capacity can be increased easily by adding machines.
It provides a full-text search engine with a RESTful web interface and JSON documents.
Logstash can process almost any kind of data and normalize it to a common format before forwarding it to a data store; in the ELK stack, that data store is Elasticsearch.
Kibana is a visualization platform that is built on top of Elasticsearch and leverages its functionality.
Data Streaming
What is Streaming Data?
Streaming data is the continuous flow of data generated by various sources.
With streaming, data can be processed, stored, analyzed, and acted upon in real time, as it is generated.
Batch Processing vs Real-Time Streams
Batch processing methods require data to be downloaded in batches before it can be processed, stored, or analyzed.
Streaming data flows in continuously, so it can be processed in real time, at the moment it is generated.
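The contrast between batch and streaming processing can be illustrated with a minimal Python sketch. The function names and the sample readings are hypothetical and serve only to show the two styles; real pipelines would use tools such as Kafka or Flume rather than an in-memory list.

```python
from typing import Iterable, Iterator, List


def batch_mean(readings: List[float]) -> float:
    """Batch style: the whole data set must be available before processing."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update and emit the running mean as each value arrives."""
    total = 0.0
    for count, value in enumerate(readings, start=1):
        total += value
        yield total / count


sensor_values = [2.0, 4.0, 6.0]  # hypothetical sample data

# One result, produced only after all data has been collected.
print(batch_mean(sensor_values))

# One result per incoming value, available immediately.
for running in streaming_mean(iter(sensor_values)):
    print(running)
```

The streaming version never needs the complete data set in memory, which is what makes it suitable for unbounded sources.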
Apache Flume
Combined use of ELK and Flume
Crawlers and Data Mining
Crawlers and Spiders
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically operated by search engines for the purpose of Web indexing (web spidering).
An Internet bot, web robot, robot or simply bot, is a software application that runs automated tasks (scripts) over the Internet.
From Wikipedia
Apache Nutch
• Open-source software
• Flexible, because it can be combined with various tools to build an optimal system
• Distributed operation based on a distributed file system
• Uses a search engine based on Lucene
[Figure: Apache Nutch architecture]
Information Indexing and Retrieval Techniques
Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
[Figure: Unstructured (text) vs. structured (database) data in the mid-nineties. Bar chart comparing data volume and market cap.]
[Figure: Unstructured (text) vs. structured (database) data today. Bar chart comparing data volume and market cap.]
Sec. 1.1
Basic assumptions of Information Retrieval
• Collection: A set of documents
– Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
The process of information retrieval from text
Evaluating the answer in Information Retrieval
What makes a user happy?
• Speed of response
• Uncluttered UI
• No cost (free)
Relevance: none of the above is enough. Extremely fast but useless answers do not satisfy a user.
How is relevance measured?
Problem
Consider a collection of 1,000,120 documents and a query for which there are 80 relevant documents.
The answer returned by the Information Retrieval System (IRS) contains 60 documents, of which 20 are relevant and 40 are non-relevant.
How "good" is it?
How do we measure its relevance?
TP: true positive (should have been retrieved and were retrieved)
FP: false positive (should not have been retrieved but were retrieved)
FN: false negative (should have been retrieved but were not retrieved)
TN: true negative (should not have been retrieved and were not retrieved)
Precision and Recall
Precision (P) is the fraction of retrieved documents that are relevant:
P = TP / (TP + FP)
Recall (R) is the fraction of relevant documents that are retrieved:
R = TP / (TP + FN)
For the collection of 1,000,120 documents with 80 relevant documents, where the IR system returns 60 documents (20 relevant, 40 non-relevant):
TP = 20, FP = 40, FN = 80 - 20 = 60, TN = 1,000,120 - 20 - 40 - 60 = 1,000,000
Precision = 20/60 = 1/3, Recall = 20/80 = 1/4
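The same calculation can be checked with a short Python sketch. The function name `precision_recall` is an illustrative choice, not part of any library; the numbers are those of the problem above (60 retrieved, 20 of them relevant, 80 relevant overall).

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from the confusion counts."""
    precision = tp / (tp + fp)   # relevant retrieved / all retrieved
    recall = tp / (tp + fn)      # relevant retrieved / all relevant
    return precision, recall


relevant_total = 80    # relevant documents in the collection
retrieved_total = 60   # documents returned by the system
tp = 20                # returned and relevant
fp = retrieved_total - tp   # returned but not relevant
fn = relevant_total - tp    # relevant but not returned

p, r = precision_recall(tp, fp, fn)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Note that TN does not appear in either formula, which is why the very large number of non-relevant documents in the collection does not inflate the scores.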
Precision vs Recall
P = correctly retrieved / total retrieved
R = correctly retrieved / total relevant
Recall can (possibly) be increased by returning more documents.
Recall is a non-decreasing function of the number of documents retrieved (a system that returns all documents has 100% recall!).
The opposite (usually) holds for precision: it is easy to achieve high precision with very low recall (consider returning a single document that is relevant).
In a good system, precision decreases as more documents are retrieved, that is, as recall increases.
Which of the two matters more depends on the application (e.g. web vs email search).
Exercise
Comment on points 1, 2, and 3 of the diagram, as well as its two coloured regions.
Preparing a text-processing database
• Tokenization
Sentences and/or words are extracted from the text.
• Normalization
Equivalent forms of a term are mapped to the same form.
• Stopping
Very common words (e.g. articles, conjunctions, prepositions) are removed.
• Stemming
The suffixes of the remaining words are removed and the stems are kept, e.g.
connect: connecting, connection, connections
Finally:
Each document consists of a set of stems.
Often a code is created for each stem, and the code is used in place of the stem.
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
– Deal with cases such as "John's" and "a state-of-the-art solution"
• Normalization
– Map text and query term to same form
– You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
– authorize, authorization
• Stop words
– We may omit very common words (or not)
– the, a, to, of
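The four stages above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny sample, and the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer such as Porter's. All names are hypothetical.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "to", "of", "and", "is"}
# Naive suffix list, longest first; NOT the Porter algorithm.
SUFFIXES = ("izations", "ization", "ations", "ation",
            "ions", "ion", "ing", "ize", "s")


def tokenize(text: str) -> List[str]:
    """Cut the character sequence into word tokens."""
    return re.findall(r"[A-Za-z0-9.']+", text)


def normalize(token: str) -> str:
    """Map variants to one form: lowercase, drop possessives and periods."""
    return token.lower().replace("'s", "").replace("'", "").replace(".", "")


def stem(term: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text: str) -> List[str]:
    """Tokenize, normalize, remove stop words, then stem."""
    terms = [normalize(tok) for tok in tokenize(text)]
    return [stem(t) for t in terms if t and t not in STOP_WORDS]


print(preprocess("You want U.S.A. and USA to match"))
print(stem("connecting"), stem("connection"), stem("connections"))
```

With this sketch, "U.S.A." and "USA" both become `usa`, and the three forms of "connect" collapse to the same stem, so a query and a document that use different surface forms can still match.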
Inverted file index
• A catalogue called the inverted index is maintained.
• This index contains all the terms that appear in the documents (after stopping and stemming have been applied).
• Each term is associated with a list of the codes of the documents that contain it, so finding the documents in which a term occurs is very easy.
A query usually consists of several terms, so the document sets corresponding to each term must be combined appropriately to produce the final set of documents returned to the user. Most systems allow the query terms to be combined with operators such as NOT, AND, and OR. This method is also known as the Boolean retrieval model.
Sec. 1.2
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer, producing the token stream: Friends Romans Countrymen
Linguistic modules, producing the modified tokens: friend roman countryman
Indexer, producing the inverted index:
friend: 2 4
roman: 1 2
countryman: 13 16
Example
Suppose we have four documents with the following terms:
D1 (t1, t2, t3, t4)
D2 (t3, t5, t6)
D3 (t1, t4, t6)
D4 (t2, t3, t7)
Then the inverted index is:
TERM  DOCUMENTS
t1    D1 D3
t2    D1 D4
t3    D1 D2 D4
t4    D1 D3
t5    D2
t6    D2 D3
t7    D4
The query t1 AND t3 gives the result:
(D1, D3) AND (D1, D2, D4) = D1
The query t1 OR t2 gives the result:
(D1, D3) OR (D1, D4) = D1, D3, D4
Note: the most relevant documents are not identified!
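The four-document example can be reproduced with a small Python sketch. In the code the documents are named D1 to D4 and the terms t1 to t7; `build_inverted_index` is an illustrative helper, and Python sets stand in for postings lists.

```python
from collections import defaultdict
from typing import Dict, List, Set

# The four documents of the example and the terms they contain.
documents: Dict[str, List[str]] = {
    "D1": ["t1", "t2", "t3", "t4"],
    "D2": ["t3", "t5", "t6"],
    "D3": ["t1", "t4", "t6"],
    "D4": ["t2", "t3", "t7"],
}


def build_inverted_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every term to the set of documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


index = build_inverted_index(documents)

# Boolean operators become set operations on the postings.
print(sorted(index["t1"] & index["t3"]))       # t1 AND t3
print(sorted(index["t1"] | index["t2"]))       # t1 OR t2
print(sorted(set(documents) - index["t3"]))    # NOT t3
```

The results are plain sets: every matching document is returned, with no indication of which one is more relevant, which is exactly the limitation noted in the example.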
Indexing
Sec. 1.2
Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2
Indexer steps: Sort
• Sort by terms
– At least conceptually
• And then by docID
This is the core indexing step.
Sec. 1.2
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Doc. frequency information is added.
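The three indexer steps (token sequence, sort, dictionary and postings) can be sketched in Python as below. This is a simplified, in-memory illustration: the tokenizer is a basic regular expression, no stemming or stop-word removal is applied, and `build_index` is a hypothetical name.

```python
import re
from itertools import groupby
from typing import Dict, List, Tuple

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def build_index(docs: Dict[int, str]) -> Dict[str, Tuple[int, List[int]]]:
    """Return term -> (document frequency, sorted postings list)."""
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in re.findall(r"[a-z']+", text.lower())
    ]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge duplicate entries, split into dictionary and postings,
    # and record the document frequency.
    index: Dict[str, Tuple[int, List[int]]] = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index


index = build_index(docs)
for term in ("brutus", "caesar", "killed"):
    print(term, index[term])
```

Storing the document frequency next to each dictionary entry costs little and pays off later, when queries are ordered by how rare their terms are.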
Sec. 1.2
Where do we pay in storage?
• Terms and counts
• Pointers
• Lists of docIDs
Query processing: AND
• Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary; retrieve its postings.
– Locate Caesar in the Dictionary; retrieve its postings.
– "Merge" the two postings (intersect the document sets):
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries.
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Result: 2 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
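A minimal Python sketch of the merge is shown below, using the Brutus and Caesar postings from the slide. The function name `intersect` is illustrative; the two-pointer walk only works because both lists are sorted by docID.

```python
from typing import List


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer: List[int] = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: part of the answer.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer that is behind
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

Each step advances at least one pointer, so the loop runs at most x + y times.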
Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 16 21 34
Calpurnia: 13 16
Query: Brutus AND Calpurnia AND Caesar
Query optimization example
• Process in order of increasing freq:
– start with smallest set, then keep cutting further.
This is why we kept document freq. in the dictionary.
Execute the query as (Calpurnia AND Brutus) AND Caesar.
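The ordering heuristic can be sketched in Python as follows. The helper names `intersect` and `and_query` are hypothetical; the length of each postings list is used as the document frequency, and the postings are the ones shown on the slide.

```python
from typing import Dict, List


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Linear-time merge of two postings lists sorted by docID."""
    result: List[int] = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(terms: List[str], index: Dict[str, List[int]]) -> List[int]:
    """AND a non-empty list of terms, starting from the rarest one."""
    # len(postings) is the document frequency of the term.
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:   # intermediate result already empty: stop early
            break
        result = intersect(result, index[term])
    return result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
print(and_query(["Brutus", "Calpurnia", "Caesar"], index))
```

Starting with Calpurnia keeps every intermediate result at most two entries long, so the later merges touch far fewer postings than they would in the original query order.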
More general optimization
• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get doc. freq.'s for all terms.
• Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
• Process in increasing order of OR sizes.
Weaknesses of the Boolean model
Rigid: AND means all, OR means any
• Difficulties
– Controlling the size of the answer
• All matched documents will be returned
– Satisfactory precision often means unacceptable recall
– Formulating queries is difficult for many users
– It does not tell us how to rank the answer
• All matched documents logically satisfy the query
– Supporting relevance feedback is not easy
• If a document is identified by the user as relevant or irrelevant, how should the query be modified?
Strengths of the Boolean model
• Predictable, easy to explain
• Effective when you know exactly what you are looking for and what the collection contains
• Efficient implementation
Vector Space Model
Proposed by Gerard Salton in the early 1960s.
The model is based on transforming documents into vectors with real-valued coordinates, using the concept of a vector space from linear algebra.
The same is done for user queries.
When a query is submitted, a similarity measure must be computed between the query and all documents that contain terms of the query.
Vector Space Model at a glance
• For each document, a list (vector) is kept of the terms that appear in it (after stopping and stemming have been applied).
• For each term of the document, the list contains the term's code and a corresponding weight.
• There are various ways of assigning weights to terms, e.g. a combination of the term's frequency in the given document, the number of documents that contain the term, and the total number of documents in the collection.
• The vector also contains the terms that do not occur in the document at all (but occur in some other document of the collection), with weight zero.
• Each document is therefore represented by a vector of fixed length. In every such vector the position of each term is fixed, so the term codes can be omitted. For example, the vector
D9 = {2, 3, 0, …}
means that in document 9 the 1st term has weight 2, the 2nd term has weight 3, the 3rd term has weight 0, and so on.
The documents of the collection form a vector space, and each document corresponds to a point in that space.
The Vector Space Model
Consider the database of a hypothetical library. Books are distinguished by the keywords that define their content, e.g. {Mathematics, Physics, History, etc.}. So that the model can be drawn, we restrict ourselves to three subjects only: {Mathematics, Physics, Computer Science}. Suppose one book is only about mathematics (document d1), another is exclusively about physics (document d2), and a third is 30% about mathematics and 70% about computer science (document d3). In the three dimensions (Mathematics, Physics, Computer Science) these documents can be represented by the vectors:
d1 = (1, 0, 0)
d2 = (0, 1, 0)
d3 = (0.3, 0, 0.7)
Query: we are looking for a book that is 50% about mathematics and 50% about computer science.
q = (0.5, 0, 0.5)
     Mathematics  Physics  Computer Science
d1   1            0        0
d2   0            1        0
d3   0.3          0        0.7
q    0.5          0        0.5
Observations:
1. Two concepts that form an angle of 90 degrees are unrelated to each other (e.g. books d1 and d2).
2. Two concepts that form an angle of 0 degrees coincide.
3. We can therefore measure the relatedness of two vectors (concepts, documents) by the cosine of the angle φ between them:
cos φ = 0: the documents are unrelated
cos φ = 1: the documents are identical
4. All the book vectors together make up the library's database, i.e. a matrix D (3 x N), where 3 is the number of terms and N the number of documents.
5. For each vector in the database we can measure how close the book is to the query q from the angle the two vectors form.
A reminder from vector calculus
Inner product of two non-zero vectors x = (x1, …, xn) and y = (y1, …, yn):
x · y = x1*y1 + x2*y2 + … + xn*yn
The cosine of the angle between two non-zero vectors x, y is defined as their inner product divided by the product of their norms:
cos(x, y) = (x · y) / (|x| * |y|), where |x| = sqrt(x1^2 + … + xn^2)
d1 = (1, 0, 0)
d2 = (0, 1, 0)
d3 = (0.3, 0, 0.7)
q = (0.5, 0, 0.5)
cos(d1, q) = ((0.5*1) + (0*0) + (0.5*0)) / (sqrt(0.5^2 + 0^2 + 0.5^2) * sqrt(1^2 + 0^2 + 0^2)) = 0.5 / 0.707 = 0.707
cos(d2, q) = ((0.5*0) + (0*1) + (0.5*0)) / (sqrt(0.5^2 + 0^2 + 0.5^2) * sqrt(0^2 + 1^2 + 0^2)) = 0
cos(d3, q) = ((0.5*0.3) + (0*0) + (0.5*0.7)) / (sqrt(0.5^2 + 0^2 + 0.5^2) * sqrt(0.3^2 + 0^2 + 0.7^2)) = 0.5 / (0.707 * 0.762) = 0.5 / 0.539 = 0.928
The closer the cosine is to 1, the closer the two vectors are and the more relevant the database items are considered to the query.
The document most relevant to the query is therefore d3.
The vector space model algorithm
The algorithm of the vector space model is described as follows:
1. for each di ∈ D
2. calculate cos(di, q);
3. sort di by cos(di, q) in decreasing order;
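The three steps of the algorithm can be sketched in Python with the library example (d1, d2, d3 and the query q). The `cosine` helper is written out by hand here so the sketch needs only the standard library; it assumes non-zero vectors.

```python
import math
from typing import Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Dimensions: (Mathematics, Physics, Computer Science)
docs = {
    "d1": (1.0, 0.0, 0.0),
    "d2": (0.0, 1.0, 0.0),
    "d3": (0.3, 0.0, 0.7),
}
q = (0.5, 0.0, 0.5)

# Steps 1-2: compute cos(di, q) for every document.
scores = {name: cosine(vec, q) for name, vec in docs.items()}
# Step 3: sort by decreasing cosine, most relevant first.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

The ranking places d3 first, followed by d1 and then d2, in agreement with the hand calculation.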
The Vector Space Model: a closer look
Representing a document as a binary vector
Representing a document as a frequency vector
Bag of words model
We do not consider the order of words in a document.
"John is quicker than Mary" and "Mary is quicker than John" are represented the same way.
This is called a bag of words model.
In a sense, this is a step back: a positional index would be able to distinguish these two documents.
Term frequency (tf or TF)
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores.
Raw term frequency is not what we want because:
A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
Instead of raw frequency: log frequency weighting
w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, otherwise 0
Therefore the "normalized" TF must be taken into account when computing the weight of each word in a document.
Is there anything else?
Frequency in document vs. frequency in collection
In addition to term frequency (the frequency of the term in the document) . . .
. . . we also want to use the frequency of the term in the collection for weighting and ranking.
Which is considered the more important factor for weighting words?
Desired weight for rare terms
Rare terms are more informative than frequent terms.
Consider a term in the query that is rare in the collection

(e.g., arachnocentric).
A document containing this term is very likely to be relevant.

→ We want high weights for rare terms like arachnocentric.
Desired weight for frequent terms
Frequent terms are less informative than rare terms.
Consider a term in the query that is frequent in the collection

(e.g., good, increase, line).
A document containing this term is more likely to be relevant
than a document that doesn’t . . .
. . . but words like good, increase and line are not sure
indicators of relevance.
→ For frequent terms like good, increase, and line, we

want positive weights . . .
. . . but lower weights than for rare terms.
Document frequency
We want high weights for rare terms like arachnocentric.
We want low (positive) weights for frequent words like

good, increase, and line.
We will use document frequency to factor this into

computing the matching score.
The document frequency is the number of documents

in the collection that the term occurs in.
idf (Inverse Document Frequency) weight (or IDF)
df_t is the document frequency, the number of documents that t occurs in.
df_t is an inverse measure of the informativeness of term t.
We define the idf weight of term t as follows:
idf_t = log10(N / df_t)
(N is the number of documents in the collection.)
idf_t is a measure of the informativeness of the term.
We use [log N/df_t] instead of [N/df_t] to "dampen" the effect of idf.
Note that we use the log transformation for both term frequency and document frequency.
Examples for idf
Collection frequency vs. Document frequency
Collection frequency of t: number of tokens of t in the

collection
Document frequency of t: number of documents t occurs in
Why these numbers?
Which word is a better search term (and should get a higher
weight)?
This example suggests that df (and idf) is better for
weighting than cf (and “icf”).
tf-idf weighting (or TF-IDF)
The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t), for tf_{t,d} > 0
Best known weighting scheme in information retrieval (?)
Alternative names: tf.idf, tf x idf
Summary: tf-idf
Assign a tf-idf weight for each term t in each document d, as the product of its tf weight and its idf weight.
The tf-idf weight . . .
. . . increases with the number of occurrences within a document (term frequency).
. . . increases with the rarity of the term in the collection (inverse document frequency).
Therefore
Alternative ways of representing documents:
• Binary incidence matrix
• Count matrix
• Weight matrix
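The three representations can be compared on a toy collection with the Python sketch below. The three short documents are hypothetical and chosen only for illustration; the weights use logarithmic tf and idf = log10(N / df), as defined in this section.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical toy collection, not taken from the lecture.
docs: Dict[str, str] = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}

vocab: List[str] = sorted({t for text in docs.values() for t in text.split()})
counts = {d: Counter(text.split()) for d, text in docs.items()}
N = len(docs)
# Document frequency: number of documents that contain the term.
df = {t: sum(1 for c in counts.values() if c[t] > 0) for t in vocab}


def tf_weight(tf: int) -> float:
    """Logarithmic term-frequency weight."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(term: str) -> float:
    """Inverse document frequency of a term in the collection."""
    return math.log10(N / df[term])


# One row per document, one column per vocabulary term.
binary_matrix = {d: [1 if counts[d][t] else 0 for t in vocab] for d in docs}
count_matrix = {d: [counts[d][t] for t in vocab] for d in docs}
weight_matrix = {d: [tf_weight(counts[d][t]) * idf(t) for t in vocab] for d in docs}

print(vocab)
for d in docs:
    print(d, binary_matrix[d], [round(w, 3) for w in weight_matrix[d]])
```

In this toy collection the term "post" occurs in a single document and therefore gets a higher weight than "new" or "york", which occur in two of the three documents.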
TF-IDF and the Vector Space Model
Documents as vectors
Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.
So we have a |V|-dimensional real-valued vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to web search engines.
Each vector is very sparse: most entries are zero.

Queries as vectors
Key idea 1: do the same for queries: represent them as
vectors in the high-dimensional space
Key idea 2: Rank documents according to their proximity

to the query
proximity = similarity = relevance
We’re doing this because we want to get away from the

you’re-either-in-or-out Boolean model.
Instead: rank relevant documents higher than non-relevant
documents
How do we formalize vector space similarity?
• Distance between two points (= distance between the end points of the two vectors), e.g. Euclidean distance?
Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea
Use angle instead of distance
Rank documents according to angle with query.
Thought experiment: take a document d and append it to itself. Call this document d′. d′ is twice as long as d.
"Semantically" d and d′ have the same content.
The angle between the two documents is 0, corresponding to maximal similarity . . .
. . . even though the Euclidean distance between the two documents can be quite large.
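The thought experiment can be checked numerically with a short Python sketch. The count vector `d` is a hypothetical example; appending a document to itself doubles every term count, giving `d_prime`.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


d = [3.0, 1.0, 2.0]               # hypothetical term counts of document d
d_prime = [2 * v for v in d]      # d appended to itself: every count doubles

print("Euclidean distance:", round(euclidean(d, d_prime), 3))
print("Cosine similarity: ", round(cosine(d, d_prime), 3))
```

The distance grows with the length of the document, while the cosine stays at 1 because the direction of the vector does not change.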
From angles to cosines
The following two notions are equivalent:
• Rank documents in increasing order of the angle between query and document
• Rank documents in decreasing order of cosine(query, document)
Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°].
Cosine similarity between query and document
cos(q, d) = (q · d) / (|q| * |d|)
For length-normalized vectors, the cosine is simply the inner product.
Example:
|y| = sqrt(1^2 + 4^2 + 2^2) = 4.58
Normalized x′ = <0.33, 0.66, 0.66>
Normalized y′ = <1/4.58, 4/4.58, 2/4.58> = <0.22, 0.87, 0.44>
cos(θ) = 0.33*0.22 + 0.66*0.87 + 0.66*0.44 = 0.94
Find the similarity between the following novels, based on the given terms / words.
Novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice
WH: Wuthering Heights
Terms
affection, jealous, gossip, wuthering
Assume the following term frequencies:
[Table: term frequencies of the four terms in each novel]
Log frequency weighting, e.g. 1 + log10(115) = 3.06, 1 + log10(58) = 2.76
For simplicity, idf is not used.
Length normalization, e.g.
3.06 / sqrt(3.06^2 + 2.0^2 + 1.3^2 + 0^2) = 0.789
2.0 / sqrt(3.06^2 + 2.0^2 + 1.3^2 + 0^2) = 0.515
2.76 / sqrt(2.76^2 + 1.85^2 + 0^2 + 0^2) = 0.832
For example:
cos(SaS, PaP) ≈ 0.789 * 0.832 + 0.515 * 0.555 + 0.335 * 0.0 + 0.0 * 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.789 * 0.524 + 0.515 * 0.465 + 0.335 * 0.405 + 0.0 * 0.588 ≈ 0.79
cos(PaP, WH) ≈ 0.69
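The whole calculation can be reproduced with the Python sketch below. Only the counts 115 and 58 appear explicitly in the text above; the remaining raw counts are an assumption, taken from the widely used textbook version of this example, and they reproduce the log-weighted and normalized values shown in the worked lines.

```python
import math
from typing import Dict, List

# Term order: affection, jealous, gossip, wuthering.
# ASSUMPTION: apart from 115 and 58, these raw counts are not given in the
# text; they are chosen to match the normalized weights shown above.
counts: Dict[str, List[int]] = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH": [20, 11, 6, 38],
}


def log_tf(tf: int) -> float:
    """Logarithmic term-frequency weight (no idf, as in the exercise)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def normalize(vec: List[float]) -> List[float]:
    """Divide every weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec]


unit = {name: normalize([log_tf(tf) for tf in tfs]) for name, tfs in counts.items()}


def cos(a: str, b: str) -> float:
    """Cosine of two length-normalized vectors = their inner product."""
    return sum(x * y for x, y in zip(unit[a], unit[b]))


for pair in (("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")):
    print(pair, round(cos(*pair), 2))
```

Because the vectors are normalized first, the similarity of each pair reduces to a plain inner product, which is why the worked lines above only multiply and add.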
Notations
We often use different weightings for queries and documents.
Notation: ddd.qqq
Example: lnc.ltn
Document: lnc
l = logarithmic tf, n = no df weighting, c = cosine normalization
Query: ltn
l = logarithmic tf, t = idf, n = no normalization
Fill in the following table for the query "best car insurance" and the document "car insurance auto insurance", using the lnc.ltn notation.
What comes after TF-IDF + VSM?
The Probability Ranking
Principle (PRP)
“If a reference retrieval system’s response to each request is a
ranking of the documents in the collection in order of
decreasing probability of relevance to the user who submitted
the request, where the probabilities are estimated as
accurately as possible on the basis of whatever data have
been made available to the system for this purpose, the
overall effectiveness of the system to its user will be the best
that is obtainable on the basis of those data.”
• [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
6. BM25
Thank you for your attention!
