Information Retrieval and Filtering
-
2. (Clustering)
2.1
2.2
2.2.1 Hierarchical agglomerative clustering
2.2.2
3
3.1
3.2
3.3
4 Web Crawling
4.1 crawler
4.2 crawler
4.3 Crawling
4.4 crawler
5
5.1
5.1.1
1. -
1.1 ( )
, .
:
1.3
,
Docid , (docid).
.
tokens ,
docID, ...
,
1.4
,
postings, 1.4 .
,
. ,
,
.. Unix,
1.4
(Boolean),
,
. postings docID.
.
ad hoc . ,
posting.
, , postings
, 5
.
postings.
, . posting
.
postings ( ,
),
skip ( 2.3), .
.
.
. postings ,
( ) postings,
( 1.3), - postings
posting .
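The construction outlined above (tokens paired with a docID, sorted, then grouped into postings) can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing, neither of which is stated in the text. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build {term: sorted list of docIDs} from {docID: text}."""
    # Step 1: collect (token, docID) pairs for every document.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: group the pairs into postings lists, dropping duplicates.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)


def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result
```

Because every postings list is kept sorted by docID, the intersection needs only one linear pass over both lists.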
1.2 block
1.4.
-docID.
docID
., docIDs posting list
. ,
. ,
.
, ID (
1.4), ID .
termIDs
, ,
.
. 4.7
, ,
.
4.1
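Reuters-RCV1 has 100 million tokens.
Storing all termID-docID pairs of the collection, with 4 bytes each for the termID and the docID,
therefore requires 0.8 GB.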
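The 0.8 GB figure can be checked with a short calculation. The sketch assumes 100 million tokens (the count that matches 0.8 GB at 8 bytes per pair) and a decimal gigabyte of 10^9 bytes. The variable names are illustrative only.

```python
# Assumption: 100 million tokens, one termID-docID pair per token.
TOKENS = 100_000_000
BYTES_PER_ID = 4  # 4 bytes for the termID and 4 bytes for the docID

total_bytes = TOKENS * 2 * BYTES_PER_ID
total_gb = total_bytes / 10**9  # decimal gigabytes
```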
Reuters-RCV1.
termID-docid .
,
5 , , postings
.
1.3
H block
,
termIDs. ,
.
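The SPIMI algorithm is shown in Figure 4.4. The part of the algorithm that parses documents and turns them into a stream of term-docID pairs, which we call tokens here, has been omitted. SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed. Tokens are processed one by one (line 4) during each successive call of SPIMI-INVERT.
When a term occurs for the first time, it is added to the dictionary (best implemented as a hash), and a new postings list is created (line 6).
The call in line 7 returns this posting list for subsequent occurrences of the term.
One difference between BSBI and SPIMI is that SPIMI adds a posting directly to its posting list (line 10). Instead of first collecting all termID-docID pairs and then sorting them (as in BSBI), each posting list is dynamic (that is, its size is adjusted as it grows) and is immediately available to collect postings. This has two advantages: it is faster because no sorting is required, and it saves memory because the term a posting list belongs to is tracked, so no termID needs to be stored with each posting.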
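The single-pass idea can be sketched as follows. This is a simplified in-memory illustration, not the pseudocode of Figure 4.4: the block limit is counted in postings rather than bytes, blocks are returned as Python lists instead of being written to disk, and the names `spimi_invert` and `spimi_blocks` are hypothetical.

```python
def spimi_invert(token_stream, max_postings):
    """Consume (term, docID) tokens until the block is full; return a sorted block."""
    dictionary = {}  # term -> postings list, grown dynamically
    count = 0
    for term, doc_id in token_stream:
        # A new term gets an empty postings list; an existing term reuses its list.
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # add the posting directly, no global sort
            count += 1
        if count >= max_postings:
            break  # block is full; the caller starts a new dictionary
    # Terms are sorted once per block so that blocks can later be merged.
    return sorted(dictionary.items())


def spimi_blocks(pairs, max_postings):
    """Call spimi_invert repeatedly on one shared stream until it is exhausted."""
    stream = iter(pairs)
    blocks = []
    while True:
        block = spimi_invert(stream, max_postings)
        if not block:
            break
        blocks.append(block)
    return blocks
```

Only the terms of a block are sorted, once per block. The postings themselves arrive in docID order and are never re-sorted.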
1.4 (Dynamic Indexing)
, .
(.., ).
, .
, posting
.
. ,
, ,
, .
,
, largemain .
.
. bit.
.
,
.
. posting ,
posting
. ,
.
Mave , Mave
,
.
, posting list ,
.
, ,
posting lists.
, ( 4.7). ,
.
4.7
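Each posting is touched roughly T/n times, where n is the size of the auxiliary index and T is the total number of postings. The time complexity of index construction with this scheme is therefore $\Theta(T^2/n)$. (We ignore the representation of terms here and consider only the docIDs. For the purposes of time complexity, a posting list is simply a list of docIDs.)
We can do better than $\Theta(T^2/n)$ by introducing $\log_2(T/n)$ indexes $I_0, I_1, I_2, \ldots$ of sizes $2^0 n, 2^1 n, 2^2 n, \ldots$. Postings move up this sequence of indexes and are processed only once on each level. This scheme is called logarithmic merging (Figure 4.7).
As before, up to n postings are accumulated in an in-memory auxiliary index, which we call $Z_0$. When the limit n is reached, the $2^0 n$ postings in $Z_0$ are transferred to a new index $I_0$ created on disk. The next time $Z_0$ is full, it is merged with $I_0$ to create an index $Z_1$ of size $2^1 n$. $Z_1$ is then either stored as $I_1$ (if there is no $I_1$ yet) or merged with $I_1$ into $Z_2$ (if $I_1$ exists), and so on.
Queries are answered by searching the in-memory $Z_0$ and all currently valid indexes $I_i$ on disk and merging the results.
The overall index construction time is $\Theta(T \log(T/n))$, because each posting is processed only once on each of the $\log(T/n)$ levels.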
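The logarithmic merging scheme can be sketched as below. This is a toy illustration under simplifying assumptions: postings are bare docIDs, the on-disk indexes are modelled as a Python dict keyed by level, and the function name `add_posting` is hypothetical.

```python
import heapq


def add_posting(doc_id, z0, disk_indexes, n):
    """Add one posting to the in-memory index Z0 and merge upwards when it is full.

    z0           -- list holding the in-memory auxiliary index Z0
    disk_indexes -- dict {level i: sorted postings of I_i}, standing in for the disk
    n            -- capacity of Z0
    """
    z0.append(doc_id)
    if len(z0) < n:
        return
    # Z0 is full: turn it into a sorted run and empty the in-memory index.
    z = sorted(z0)
    z0.clear()
    level = 0
    # While I_level exists, merge it with the current run to form Z_(level+1).
    while level in disk_indexes:
        z = list(heapq.merge(z, disk_indexes.pop(level)))
        level += 1
    # No index exists at this level yet, so the run is stored as I_level.
    disk_indexes[level] = z
```

With n = 2, six postings leave an I_0 of size 2 and an I_1 of size 4. Two more postings cascade through both levels into a single I_2 of size 8.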
,
log (T / n) (
). ,
, , (
),
.
.
, , 3.3
.
bit, hits
. , IR- ,
, , '
.
-- .
.
, .
, .
2. (Clustering)
cluster , clustering
, - .
.
.
, .
.
,
.
.
, .
, .
.
, ,
.
. (
)
1.
hard soft
. (hard) ,
.
soft ,
.
soft .
, ,
soft .
2.1
2.1.1
(cluster hypothesis)
.
:
.
,
.
.
.
2.
.
,
-
, ,
.
(cluster hypothesis).
(Search
Result Clustering)
,
(query).
.
.
,
.
group,
.
. ( )
3.
Scatter - Gather ( )
Scatter Gather,
. Scatter Gather
group .
group
.
.
(collection clustering)
, , ,
.
Google , Columbia News
Blaster, . ,
,
.
, ,
,
.
(cluster hypothesis)
, .
standard
(query),
,
. ,
(car)
(automobile),
[,
. automobile, vehicle etc.]. ,
group (mutual) .
2.2
.
deterministic,
.
.
.
. ,
(
). ,
HAC
. .
y
, singleton clusters.
, .
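For example, the cluster consisting of "Lloyd's CEO questioned" and "Lloyd's chief / U.S. grilling" has a similarity of about 0.56.
The combination similarity of a singleton cluster is defined as the self-similarity of its document
(1.0 for cosine similarity).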
.
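A naive version of hierarchical agglomerative clustering can be sketched as follows. The sketch assumes single-link similarity (the similarity of two clusters is that of their most similar members) and stops at a requested number of clusters. Both choices are made here for illustration and are not taken from the text. Documents are assumed to be non-zero numeric vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hac_single_link(vectors, k):
    """Merge the two most similar clusters repeatedly until k clusters remain."""
    # Every document starts as its own singleton cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    cosine(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

This version recomputes all pairwise similarities in every iteration, which is fine for a handful of documents but far too slow for a real collection.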
2.2.2
,
. ( 5)
.
.
.
5. 30
,
( 6).
.
.
.
.
6. 30
3 .
: ,
.
, ,
.
, Vannevar Bush
1940
1970, (
) 1990.
, ,
:
(HTTP: Hypertext Transfer Protocol) ,
HTML: Hypertext Markup Language.
O , (
), .
.
:
1. Full-text index search engines such as AltaVista, Excite and Infoseek.
2.
Yahoo!
H ,
.
.
,
. (indexing), (query serving)
(ranking)
,
.
.
,
.
(ranking)
(spam)
.
, .
(query) ,
(website) .
3.1
.
.
( )
,
(query languages) ,
.
keywords) 2 3.
3.2 .
:
1.
2.
3. ( transactional )
.
1.
,
. ,
.
.
2.
. British Airways.
British Airways.
3.
, (download)
, .
.
7 ,
(crawler),
.
7. crawler
3.3
, ,
,
.
,
. http://www.yahoo.com/any_string
HTML :
Yahoo!
(web servers)
. ,
(spider traps)
crawler
(spammer)
(site).
:
;
, :
1. M , ,
(
) . ,
.
, p
,
p.
2.
, .
, (web page)
(website)
.
,
,
.
4 Web crawling ( )
Web crawling
(web),
. crawling
,
. web crawler
spider ().
4.1 crawler
: O
(spider traps),
crawler
domain ( ). crawlers
.
,
(website).
: (web servers)
crawler. .
4.2 crawler
: O crawler
.
: crawler
crawl
(bandwidth).
: To crawl
,
.
:
, crawler
.
: , crawler :
. ,
crawler , (index)
. crawling, o crawler
crawl
.
: crawlers
,
. crawler
.
4.3 Crawling
H crawler ( ,
) . crawler
URLS (seed
set). URL ,
URL.
(links) (
URL).
. (URLS)
(frontier) URL URLS
crawler. , URL
, , URLS
URL. crawling, URL
.
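The basic crawl loop (seed set, frontier, fetch, extract links, add unseen URLs to the frontier) can be sketched as below. To keep the example runnable without network access, fetching and parsing are replaced by a caller-supplied function `fetch_links`. That function, the `max_pages` limit and the FIFO frontier are assumptions made for illustration.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl starting from the seed set.

    fetch_links(url) must return the URLs linked from the page at url.
    Returns the URLs in the order they were fetched.
    """
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    seen = set(seed_urls)        # URLs already placed in the frontier
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            # Only URLs not encountered before enter the frontier.
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```

For instance, with a toy web given as a dictionary, `crawl(["a"], lambda u: web.get(u, []))` visits every page reachable from `a` exactly once.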
4.4 crawler
crawling
1. URL, URLS
crawl ( crawling,
URL
.
2. DNS (Domain Name System)
3. URL.
4. ( )
http (hyper text transfer protocol)
URL.
5.
(links) .
6.
URL
.
8. crawler
crawling (threads),
.
. URL
. URL
( Fetch),
, ( crawling)
URL .
crawler URL
URL, http.
,
, . ,
(parsed) .
( tag () . )
.
(anchor text)
(ranking) .
,
(tests) URL
.
, (tests)
URL.
(checksum)
Doc FPs ( 8).
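A content-seen test based on a checksum can be sketched as follows. The use of SHA-1 as the fingerprint and the class name `ContentSeenTest` are assumptions for illustration. An exact checksum detects only byte-identical copies, not near-duplicates.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Checksum of a page's raw content (SHA-1 chosen here as an assumption)."""
    return hashlib.sha1(content).hexdigest()


class ContentSeenTest:
    """Keeps the fingerprints of all pages seen so far (the 'Doc FPs' store)."""

    def __init__(self):
        self._fingerprints = set()

    def is_new(self, content: bytes) -> bool:
        """Return True the first time a given content is seen, False afterwards."""
        fp = fingerprint(content)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        return True
```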
, URL
URL
. , crawl
domains ( , all.com URLS)
(test) URL com
domain. test .
host
crawling, , Robots Exclusion
Protocol. robots.txt
URL site. robot.txt.file
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
robot.txt robot
URL
/yoursite/temp/, robot searchengine.
robots.txt
URL robot,
URL .
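The effect of such rules can be checked with Python's standard `urllib.robotparser` module. The sketch below feeds the parser a robots.txt equivalent to the example (no robot may visit URLs starting with /yoursite/temp/, except the robot called searchengine). The host `www.example.com`, the agent name `otherbot` and the helper `may_fetch` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the lines of an already downloaded robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would normally cache the parsed rules per host instead of parsing the file again for every URL.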
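Next, the URL should be normalized in the following sense: the HTML of a page p often encodes the target of a link relative to p. For example, the following relative link appears in the HTML of the page en.wikipedia.org/wiki/Main_Page:
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
It points to the URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer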
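Resolving a relative link against the URL of the page that contains it can be sketched with `urllib.parse.urljoin` from the Python standard library. The helper name `normalize_link` is hypothetical, and the `http` scheme of the base URL is assumed.

```python
from urllib.parse import urljoin


def normalize_link(page_url, href):
    """Resolve the href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


absolute = normalize_link(
    "http://en.wikipedia.org/wiki/Main_Page",
    "/wiki/Wikipedia:General_disclaimer",
)
```

Here `absolute` is the full URL `http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer`. Links that are already absolute are returned unchanged.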
, URL : URL
( crawl)
(crawled), . URL ,
(fetching).
.
5
.
(ranking).
,
.
.
,
.
, ,
in-links ( ) .
,
in-links. link spam. ,
. ,
()
.
5.1
9.
:
1.
.
2.
, . ,
,
. ,
(copyright), . ,
.
5.1.1 .
HTML
(homepage) ACM :
<a href="http://www.acm.org/jacm/">Journal of the ACM.</a>
http://www.acm.org/jacm/
Journal of the ACM.
.
(= http://www.acm.org/jacm/)
.
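Extracting the link target and its anchor text from HTML such as the ACM example can be sketched with the standard `html.parser` module. The class `AnchorExtractor` and the function `extract_anchors` are hypothetical names. Nested or malformed anchors are not handled.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text is recorded only while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return the list of (href, anchor text) pairs found in html."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

An indexer could then store the anchor text with the target URL, so that the target page can be found by words that appear only in links pointing to it.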
;
.
,
,
marketing.
,C,D,
.
(PageRank)
.
An Introduction to Information Retrieval
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
Cambridge: http://www.cam.ac.uk/ (10/12/09)
All about Retrieve: http://zaphod.mindlab.umd.edu/docSeminar/pdfs/10.1.1.27.7690.pdf (10/12/09)
http://www.ericward.com/linkalert (10/12/09)
Information Retrieval: http://www.dcs.gla.ac.uk/Keith/Preface.html (10/12/09)
http://vivliothmmy.ee.auth.gr/118/1/diou_c.pdf (10/12/09)