
1. -
2. (Clustering)
2.1
2.2
2.2.1 Hierarchical agglomerative clustering
2.2.2
3
3.1
3.2
3.3
4 Web Crawling
4.1 crawler
4.2 crawler
4.3 Crawling
4.4 crawler
5
5.1
5.1.1

1. -
1.1 ( )


, .
:

1.3
,
Docid , (docid).

.
tokens ,
docID, ...


,
1.4
,
postings, 1.4 .
,
. ,
,
.. Unix,

1.4
(Boolean),
,
. postings docID.
.

ad hoc . ,
posting.
, , postings
, 5
.
postings.
, . posting

.
postings ( ,
),
skip ( 2.3), .

.
.

. postings ,
( ) postings,
( 1.3), - postings
posting .
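
Because the postings of each term are kept sorted by docID, a Boolean AND of two terms can be answered with a single linear pass over both lists. The following is a minimal sketch of that merge, not a definitive implementation; the function name `intersect` and the sample docIDs are assumptions made for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists of docIDs sorted in increasing order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


# Hypothetical postings lists for two terms.
common = intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101])
```

The pass touches each posting at most once, so the cost is linear in the combined length of the two lists.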

1.2 block
1.4.
-docID.
docID
., docIDs posting list
. ,
. ,
.
, ID (
1.4), ID .

termIDs
, ,

.
. 4.7

, ,
.

4.1
Reuters-RCV1 has 100 million tokens. Storing every termID-docID pair with
4 bytes each for the termID and the docID therefore requires 0.8 GB of
storage.
Reuters-RCV1.
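
The 0.8 GB figure can be checked with a short calculation. This is a sketch under stated assumptions: roughly 100 million tokens in Reuters-RCV1, one termID-docID pair per token, and 1 GB counted as 10^9 bytes. The constant and function names are illustrative only.

```python
# Assumed figures: 100 million tokens, 4 bytes per termID, 4 bytes per docID.
TOKENS = 100_000_000
BYTES_PER_TERM_ID = 4
BYTES_PER_DOC_ID = 4


def pairs_size_gb(tokens, term_id_bytes, doc_id_bytes):
    """Size in GB (10**9 bytes) of one termID-docID pair per token."""
    return tokens * (term_id_bytes + doc_id_bytes) / 10**9


size_gb = pairs_size_gb(TOKENS, BYTES_PER_TERM_ID, BYTES_PER_DOC_ID)
print(f"{size_gb:.1f} GB")
```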

termID-docid .

,
5 , , postings
.

1.3
H block
,
termIDs. ,
.

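A more scalable alternative is single-pass in-memory indexing, or SPIMI.
SPIMI uses terms instead of termIDs, writes the dictionary of each block to
disk, and then starts a new dictionary for the next block. SPIMI can index
collections of any size as long as enough disk space is available. The SPIMI
algorithm is shown in Figure 4.4.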

4.4

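The part of the algorithm that parses documents and turns them into a stream
of term-docID pairs, called tokens here, is omitted. SPIMI-INVERT is called
repeatedly on the token stream until the whole collection has been processed.
Tokens are processed one by one (line 4) during each call of SPIMI-INVERT.
When a term occurs for the first time, it is added to the dictionary (best
implemented as a hash) and a new postings list is created (line 6). The call
in line 7 returns this posting list for later occurrences of the term.
One difference between BSBI and SPIMI is that SPIMI adds a posting directly to
its posting list (line 10). Instead of first collecting all termID-docID pairs
and then sorting them (as in BSBI), each posting list is dynamic (its size is
adjusted as it grows) and is immediately available to collect postings. This
has two advantages: it is faster, because no sorting is required, and it saves
memory, because the term a posting list belongs to is known, so no termID
needs to be stored with each posting.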
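
The block-building step described above can be sketched as follows. This is an illustrative approximation, not the algorithm of Figure 4.4 itself: the names `spimi_invert` and `build_blocks` are hypothetical, the memory limit is simulated by a count of processed pairs (`max_pairs`), and blocks are returned as lists instead of being written to disk.

```python
from collections import defaultdict


def spimi_invert(token_stream, max_pairs):
    """Consume up to max_pairs (term, doc_id) pairs and build one block.

    The block is returned as a list of (term, postings) sorted by term,
    standing in for the sorted dictionary that would be written to disk.
    """
    dictionary = defaultdict(list)  # term -> postings list, created on demand
    used = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]
        # Append directly to the postings list; no termID-docID sorting.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        used += 1
        if used >= max_pairs:  # assumed stand-in for "memory is full"
            break
    return sorted(dictionary.items())


def build_blocks(pairs, max_pairs):
    """Call spimi_invert repeatedly until the token stream is exhausted."""
    stream = iter(pairs)
    blocks = []
    while True:
        block = spimi_invert(stream, max_pairs)
        if not block:
            break
        blocks.append(block)
    return blocks


# Hypothetical token stream of (term, docID) pairs in document order.
blocks = build_blocks([("a", 1), ("b", 1), ("a", 2), ("c", 2)], max_pairs=2)
```

In a full indexer the blocks would afterwards be merged into one index; that step is left out of the sketch.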

1.4

(Dynamic

Indexing)

, .
(.., ).
, .
, posting
.
. ,

, ,
, .

,
, large main .
.
. bit.
.
,
.
. posting ,
posting
. ,
.
Mave , Mave
,
.
, posting list ,
.
, ,
posting lists.
, ( 4.7). ,
.

4.7

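In this scheme each posting is touched in every merge, where n is the size of
the auxiliary index and T is the total number of postings. The overall time
complexity is therefore $\Theta(T^2/n)$. (The representation of terms is
neglected here and only docIDs are considered; for the purpose of time
complexity, a posting list is simply a list of docIDs.) We can do better than
$\Theta(T^2/n)$ by introducing $\log_2(T/n)$ indexes $I_0, I_1, I_2, \ldots$
of sizes $2^0 \times n$, $2^1 \times n$, $2^2 \times n$, and so on. Postings
percolate up this sequence of indexes and are processed only once on each
level. This scheme is called logarithmic merging (Figure 4.7). As before, up
to n postings are accumulated in an in-memory auxiliary index, called $Z_0$.
When the limit n is reached, the $2^0 \times n$ postings in $Z_0$ are
transferred to a new index $I_0$ created on disk. The next time $Z_0$ is full,
it is merged with $I_0$ to create an index $Z_1$ of size $2^1 \times n$. Then
$Z_1$ is either stored as $I_1$ (if there is no $I_1$ yet) or merged with
$I_1$ into $Z_2$ (if $I_1$ exists), and so on. Search requests are served by
querying the in-memory $Z_0$ and all currently valid indexes $I_i$ on disk and
merging the results.

Overall index construction time is $\Theta(T \log(T/n))$, because each posting
is processed only once on each of the $\log(T/n)$ levels. This gain is traded
for slower query processing: results from $\log(T/n)$ indexes must now be
merged, as opposed to just two (the main and the auxiliary index). As in the
auxiliary index scheme, very large indexes still have to be merged
occasionally (which slows down the search system during the merge), but this
happens less often and the indexes involved are smaller on average.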
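
The logarithmic merging scheme can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: postings are plain comparable values, the "on-disk" indexes are Python lists, and the class and attribute names (`LogarithmicIndex`, `z0`, `levels`) are hypothetical.

```python
import heapq


class LogarithmicIndex:
    """Toy model: levels[i] is None or a sorted index of 2**i * n postings."""

    def __init__(self, n):
        self.n = n        # capacity of the in-memory auxiliary index
        self.z0 = []      # in-memory auxiliary index
        self.levels = []  # stand-ins for the on-disk indexes I_0, I_1, ...

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) < self.n:
            return
        carry = sorted(self.z0)  # the full auxiliary index
        self.z0 = []
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry  # free slot: store the index here
                return
            # Slot occupied: merge both indexes and move up one level.
            carry = list(heapq.merge(self.levels[i], carry))
            self.levels[i] = None
            i += 1

    def sizes(self):
        return [len(level) if level else 0 for level in self.levels]


index = LogarithmicIndex(n=2)
for posting in range(1, 7):
    index.add(posting)
sizes_after_six = index.sizes()
for posting in range(7, 9):
    index.add(posting)
sizes_after_eight = index.sizes()
```

The occupied levels behave like the bits of a binary counter, which is why each posting takes part in at most one merge per level.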
.
, , 3.3
.
bit, hits
. , IR- ,
, , '
.
-- .
.
, .
, .

2. (Clustering)
cluster , clustering

, - .
.

.
, .


.
,
.

.
, .

, .
.
, ,
.
. (
)

1.

(Flat and Hierarchical


Clustering)

.
(
).

(Hard and Soft Clustering)


hard soft
. (hard) ,
.
soft ,
.
soft .
, ,
soft .

2.1
2.1.1
(cluster hypothesis)

.
:
.

,
.
.
.


2.


.
,
-
, ,
.
(cluster hypothesis).

(Search
Result Clustering)

,

(query).
.
.
,
.
group,
.
. ( )


3.

Scatter - Gather ( )
Scatter Gather,
. Scatter Gather
group .
group
.
.


(collection clustering)

, , ,

.
Google , Columbia News
Blaster, . ,
,

.
, ,
,
.

(cluster hypothesis)
, .
standard
(query),
,
. ,
(car)
(automobile),
[,
. automobile, vehicle etc.]. ,
group (mutual) .

2.2


.


deterministic,
.

.

.



. ,

(
). ,

2.2.1 Hierarchical agglomerative clustering (HAC)


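Hierarchical clustering algorithms are either top-down or bottom-up.
Bottom-up algorithms treat each document as a singleton cluster at the outset
and then successively merge (or agglomerate) pairs of clusters until all
clusters have been merged into a single cluster that contains all documents.
Bottom-up hierarchical clustering is therefore called hierarchical
agglomerative clustering, or HAC. Top-down clustering requires a method for
splitting a cluster and proceeds by splitting clusters recursively until
individual documents are reached. HAC is used more often in IR than top-down
clustering.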

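An HAC clustering is typically visualized as a dendrogram. Each merge is
represented by a horizontal line. The y-coordinate of the horizontal line is
the similarity of the two clusters that were merged, where documents are
viewed as singleton clusters. This similarity is called the combination
similarity of the merged cluster. For example, the combination similarity of
the cluster consisting of "Lloyd's CEO questioned" and "Lloyd's chief / U.S.
grilling" is approximately 0.56. The combination similarity of a singleton
cluster is defined as the self-similarity of its document (1.0 for cosine
similarity).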


.
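
The bottom-up procedure can be sketched in a few lines. This is a naive illustration, assuming cosine similarity between dense vectors and the single-link criterion (cluster similarity is the similarity of the two most similar members); the source does not fix a particular criterion, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hac_single_link(vectors):
    """Return the merges as (cluster_a, cluster_b, similarity), in order."""
    clusters = [(i,) for i in range(len(vectors))]  # singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        sim, a, b = best
        merges.append((clusters[a], clusters[b], sim))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Three toy document vectors; documents 0 and 1 are nearly parallel.
merges = hac_single_link([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

The recorded similarities are the heights at which the horizontal lines of a dendrogram would be drawn. The pairwise search makes this sketch much slower than practical HAC implementations.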

2.2.2

,
. ( 5)

.
.
.

5. 30

,
( 6).
.
.
.
.


6. 30

3 .

: ,


.
, ,
.


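The idea of hypertext, envisioned by Vannevar Bush in the 1940s and first
realized in working systems in the 1970s, significantly precedes the
formation of the World Wide Web (the Web) in the 1990s.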

, ,
:

(http:
hypertext transfer protocol) ,

HTML: Hyper text markup language.

O , (
), .



.
:
1. Full-text index search engines such as AltaVista, Excite and Infoseek.
2. Taxonomies of web pages organized into categories, such as Yahoo!
H ,
.

.

,

. (indexing), (query serving)
(ranking)
,

.

.
,
.
(ranking)
(spam)
.


, .

(query) ,

(website) .

3.1
.



.
( )
,
(query languages) ,
.

keywords) 2 3.

3.2 .

:
1. Informational
2. Navigational
3. Transactional

.
1.
,
. ,
.
.
2.

. British Airways.


British Airways.
3.

, (download)
, .

.
7 ,
(crawler),
.

7. crawler

3.3
, ,

,
.
,
. http://www.yahoo.com/any_string
HTML :
Yahoo!
(web servers)
. ,

(spider traps)
crawler
(spammer)
(site).

:
;
, :
1. M , ,
(
) . ,
.
, p
,
p.
2.
, .
, (web page)
(website)

.
,
,
.

4 Web crawling ( )
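Web crawling is the process by which pages are gathered from the Web in order
to index them and support a search engine. The goal of crawling is to gather
as many useful web pages as possible, quickly and efficiently, together with
the link structure that connects them. A web crawler is sometimes also called
a spider.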

4.1 crawler
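Robustness: The Web contains servers that create spider traps, generators of
web pages that mislead crawlers into getting stuck fetching an unbounded
number of pages in one domain. Crawlers must be designed to be resilient to
such traps. Not all traps are malicious; some are the unintended side effect
of faulty website development.
Politeness: Web servers have policies regulating the rate at which a crawler
may visit them. These policies must be respected.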

4.2 crawler
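Distributed: The crawler should be able to run in a distributed fashion across
multiple machines.
Scalable: The crawler architecture should allow the crawl rate to be scaled up
by adding machines and bandwidth.
Performance and efficiency: The crawl system should make efficient use of
system resources, including processor, storage and network bandwidth.
Quality: Because a significant fraction of all web pages is of little use for
serving user queries, the crawler should be biased towards fetching useful
pages first.
Freshness: In many applications the crawler should operate in continuous mode:
it should obtain fresh copies of previously fetched pages. A search engine
crawler, for example, can thereby ensure that the index holds a fairly current
representation of each indexed web page. For such continuous crawling, the
crawler should be able to crawl a page with a frequency that approximates the
rate of change of that page.
Extensible: Crawlers should be designed to be extensible in many ways, for
example to cope with new data formats and new fetch protocols. This requires
the crawler architecture to be modular.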

4.3 Crawling
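The basic operation of any hypertext crawler (whether for the Web, an intranet
or another hypertext collection) is as follows. The crawler begins with one or
more URLs that form a seed set. It picks a URL from this seed set and fetches
the web page at that URL. The fetched page is then parsed to extract both the
text and the links of the page (each of which points to another URL). The
extracted text is passed to a text indexer. The extracted links (URLs) are
added to a URL frontier, which at all times consists of URLs whose pages have
yet to be fetched by the crawler. Initially, the URL frontier contains the
seed set; as pages are fetched, the corresponding URLs are deleted from the
URL frontier. In continuous crawling, the URL of a fetched page is added back
to the frontier so that it is fetched again in the future.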
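
The interplay of seed set and URL frontier described above can be sketched as a simple loop. This is a minimal, non-continuous sketch with no network access: `fetch_links` is a hypothetical callback that returns the links found on a page, and the toy `web` dictionary stands in for real fetching and parsing.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed set."""
    frontier = deque(seed_urls)  # URLs whose pages are yet to be fetched
    seen = set(seed_urls)        # URLs ever added to the frontier
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # remove the URL from the frontier
        visited.append(url)       # stand-in for fetch, parse and index
        for link in fetch_links(url):
            if link not in seen:  # duplicate elimination
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link structure used instead of real HTTP requests.
web = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
order = crawl(["a"], lambda url: web.get(url, []))
```

A first-in, first-out frontier yields a breadth-first traversal; a production crawler would replace it with a prioritized, politeness-aware queue.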

4.4 crawler
crawling

1. URL, URLS
crawl ( crawling,
URL
.
2. DNS (Domain Name System)

3. URL.
4. ( )
http (hyper text transfer protocol)
URL.
5.
(links) .
6.
URL
.


8. crawler

crawling (threads),

.

. URL

. URL
( Fetch),
, ( crawling)
URL .
crawler URL
URL, http.
,
, . ,
(parsed) .
( tag () . )
.
(anchor text)
(ranking) .
,
(tests) URL
.
, (tests)
URL.
(checksum)
Doc FPs ( 8).
, URL
URL
. , crawl
domains ( , all.com URLS)
(test) URL com
domain. test .
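Many hosts place parts of their websites off-limits to crawling, under a
standard known as the Robots Exclusion Protocol. This is done by placing a
file named robots.txt at the root of the URL hierarchy of the site. An example
robots.txt file:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow: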


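This robots.txt file specifies that no robot should visit any URL whose path
starts with /yoursite/temp/, except the robot called searchengine.
The robots.txt file must be fetched from a website in order to test whether a
URL under consideration passes the robot restrictions and can therefore be
added to the URL frontier.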
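
A crawler can apply such rules with the `urllib.robotparser` module of the Python standard library. The sketch below parses an example rule set from a string instead of fetching it over the network; the host `www.example.com` and the robot name `otherbot` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules: everything under /yoursite/temp/ is off-limits to all
# robots except the one called "searchengine".
RULES = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

temp_url = "http://www.example.com/yoursite/temp/page.html"
home_url = "http://www.example.com/index.html"

other_may_fetch_temp = parser.can_fetch("otherbot", temp_url)
engine_may_fetch_temp = parser.can_fetch("searchengine", temp_url)
other_may_fetch_home = parser.can_fetch("otherbot", home_url)
```

In a real crawler the parsed rules would typically be cached per host, so that robots.txt is not fetched again for every URL.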
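Next, a URL should be normalized in the following sense: often the HTML
encoding of a link from a web page p gives the target of the link relative to
the page p. For example, the following relative link appears in the HTML of
the page en.wikipedia.org/wiki/Main_Page:
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
It points to the URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer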
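
Resolving a relative link against the URL of the page that contains it can be done with `urllib.parse.urljoin` from the Python standard library. The sketch assumes the page address is http://en.wikipedia.org/wiki/Main_Page and the relative link is /wiki/Wikipedia:General_disclaimer.

```python
from urllib.parse import urljoin

# URL of the page p on which the link was found (assumed for illustration).
base = "http://en.wikipedia.org/wiki/Main_Page"
# Relative target as it appears in the href attribute.
href = "/wiki/Wikipedia:General_disclaimer"

# Resolve the relative link against the page URL to get an absolute URL.
absolute = urljoin(base, href)
print(absolute)
```

Only the absolute form is suitable for the duplicate tests and for storage in the URL frontier.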
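Finally, the URL is checked for duplicate elimination: if the URL is already
in the frontier or (in the case of a non-continuous crawl) has already been
crawled, it is not added to the frontier. When a URL is added to the frontier,
it is assigned a priority that determines when it is removed from the frontier
for fetching.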

.

5

.

(ranking).

,
.

.

,

.
, ,
in-links ( ) .

,
in-links. link spam. ,



. ,
()
.

5.1

9.

:
1.
.
2.
, . ,
,
. ,


(copyright), . ,

.

5.1.1 .
HTML
(homepage) ACM :
<a href="http://www.acm.org/jacm/">Journal of the ACM.</a>
http://www.acm.org/jacm/
Journal of the ACM.
.
(= http://www.acm.org/jacm/)

.
;

.
,
,
marketing.

10. surfer 1-3


,C,D.
surfer 1-3

,C,D,

.
(PageRank)
.
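
The random surfer behind PageRank can be approximated by repeatedly redistributing probability mass along the links. The sketch below is illustrative only: the four-node graph (A links to B, C and D, which each link back to A), the teleport probability of 0.1, and the fixed number of iterations are assumptions, not values taken from the text.

```python
def pagerank(links, teleport=0.1, iterations=50):
    """Approximate the random surfer's long-run visit probabilities."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(iterations):
        # With probability `teleport` the surfer jumps to a random page.
        new_rank = {v: teleport / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                # Otherwise an out-link is followed with equal probability.
                share = (1 - teleport) * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                # A page without out-links spreads its mass uniformly.
                for w in nodes:
                    new_rank[w] += (1 - teleport) * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical graph: from A the surfer moves to B, C or D with
# probability 1/3 each when following a link.
graph = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = pagerank(graph)
```

Pages that receive many in-links from well-visited pages, such as A in this toy graph, end up with the highest score.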



An Introduction to Information Retrieval
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
Cambridge:
http://www.cam.ac.uk/
( 10/12/09)

:
All about Retrieve
:
http://zaphod.mindlab.umd.edu/docSeminar/pdfs/10.1.1.27.7690.pdf
( 10/12/09)

: http://www.ericward.com/linkalert
( 10/12/09)
Information Retrieval
: http://www.dcs.gla.ac.uk/Keith/Preface.html
( 10/12/09)

: http://vivliothmmy.ee.auth.gr/118/1/diou_c.pdf
( 10/12/09)

