
Course report for DATA MINING: A study of GRAPH MINING
2012
Le Ngoc Hieu, CH1101012, K6 UIT, occbuu@gmail.com

VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY

COURSE REPORT:
DATA MINING

TOPIC

A STUDY OF
GRAPH MINING
Student: Le Ngoc Hieu, Student ID: CH1101012
Class: CH K6 - UIT
Advisor: Assoc. Prof. Dr. Do Phuc



TABLE OF CONTENTS

Table of contents
Preface

Chapter I: Introduction to graph data mining (Graph Mining)
1) Concepts and definitions
2) Mining graph data: frequent graph patterns
3) Some algorithms for graph mining

Chapter II: Graph mining algorithms
1) The Apriori algorithm
2) The pattern-growth algorithm
3) Characteristics of graph mining algorithms

Chapter III: Graph classification
1) Structure-based classification
2) Pattern-based classification
3) Decision-tree-based classification
4) Kernel-based classification

Chapter IV: Graph compression

Chapter V: Applying graph mining to reputation management on the Internet
1) Notation
2) Related measures
3) Global clustering structure
4) Local clustering structure
5) Topology

Chapter VI: Conclusion

References


PREFACE

Graph data mining is an established research area, yet it is still fairly new in Vietnam.

This report, the final assignment for the Data Mining and Data Warehousing course, has helped
me better understand the applications of graph data mining, their goals, purposes and practical
results. It provides a solid basis for further research and development during my studies at
the university.

In completing this report, I sincerely thank Assoc. Prof. Dr. Do Phuc, who inspired me, guided
me attentively, and provided the information, materials and valuable lectures that allowed this
work to reach the level of a first step in research.

The topic is neither brand new nor outdated, but because of limited time and research effort,
this remains a course essay: it surveys the issues at an overview level and does not analyze
them as thoroughly as a scientific research paper would.

I hope for your understanding.

Ho Chi Minh City, November 2012.

CHAPTER I: INTRODUCTION TO GRAPH DATA MINING (GRAPH MINING)

I.1) CONCEPTS AND DEFINITIONS
1. Why mine graph data?
- Graphs are found everywhere in everyday life, for example:
a. Network systems such as the Internet (co-expression network)
b. Social networks
c. Program flow
d. Chemical compounds
e. Protein structures
- Much of the large-scale data on today's networked systems can be represented as graphs, with
relationships given by:
a. Physical links and connections
b. Connections between networks at the network layer
c. Relationships in social networks
d. Hyperlinks between web pages
e. Complex interactions between entities
- These graphs hold valuable information for network applications such as:
a. Community detection and discovery of common interests
b. Classification
c. Recommender systems
d. Web search
e. Peer-to-peer (P2P) search and retrieval
f. Trust and reputation
- To work with such data in graph form, we need to:
a. Define metrics that describe the global structure of a graph
b. Find the community structures of the network
c. Define metrics that describe the patterns of local interaction inside the graph
d. Develop and apply efficient algorithms for mining data in networked systems
e. Understand the models by which such graphs are generated.
- In general, graphs are more general than itemsets, sequences, trees and lattices. Graphs can
model many problems of high computational complexity.
2. Notation and terminology:
- A graph can be viewed as a 5-tuple (V, E, F, Lv, Le).
- D = {G1, G2, ..., Gn} is a data set of transactions.

- The transactions in D are undirected labeled graphs.
- The support of a graph G is the percentage of graphs in D that have G as a subgraph.
- A graph is frequent if its support is at least a given threshold (the threshold is usually
specified in advance).
Example of a frequent subgraph: [figure]

3. Overview of graph mining:

[Figure: taxonomy of graph mining. Labels recovered from the figure:]
- Frequent Subgraph Mining (FSM)
o Apriori-based: AGM, FSG, PATH
o Pattern-growth-based: gSpan, MoFa, GASTON, FFSM, SPIN
- Approximate methods: SUBDUE, GBI
- Variant subgraph pattern mining
o Closed subgraph mining: CloseGraph
o Coherent subgraph mining: CSA, CLAN
o Dense subgraph mining: CloseCut, Splat, CODENSE
- Applications of frequent subgraph mining
o Classification: kernel methods (graph kernels)
o Clustering
o Indexing and search: GraphGrep, Daylight, gIndex (Grafil)


I.2) MINING GRAPH DATA: FREQUENT GRAPHS
1. Frequent graph mining (Graph Pattern Mining):

- Introduction: [figure: example sets of graphs]

2. Graph patterns:

- Interesting and useful measures, from which actions can be taken toward a given goal:
o Frequency: frequent graph patterns
o Discriminative power: extracting the information that distinguishes graphs
o Significance.


3. Frequent graph patterns
Given a graph data set D, find every subgraph g such that
$freq(g) \geq \theta$
where freq(g) is the percentage of graphs in D that contain g and $\theta$ is the minimum
support threshold.
Example 1 of a frequent subgraph
- Chemical compounds:

(a) caffeine (b) diurobromine (c) Viagra

- The frequent subgraph shared by the compounds above: [figure]

Example 2 of a frequent subgraph
- Graphs representing the function-call relationships of a program: [figure]

- This gives the following frequent subgraph, with support 2: [figure]
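
The definition of freq(g) can be illustrated with a small sketch. It assumes undirected, vertex-labeled graphs without self-loops, and it tests containment by brute force over all vertex mappings, which is only practical for tiny graphs. The helper names `make_graph`, `contains` and `freq` are illustrative and do not come from the report.

```python
from itertools import permutations


def make_graph(labels, edges):
    """Build a graph as (vertex -> label dict, set of undirected edges)."""
    return labels, {frozenset(e) for e in edges}


def contains(graph, pattern):
    """Return True if `pattern` is isomorphic to a subgraph of `graph`."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue  # vertex labels do not match
        if all(frozenset((mapping[a], mapping[b])) in g_edges for a, b in p_edges):
            return True  # every pattern edge is present in the graph
    return False


def freq(pattern, database):
    """Fraction of graphs in the database that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)


database = [
    make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3), (1, 3)]),
    make_graph({1: "C", 2: "O"}, [(1, 2)]),
    make_graph({1: "C", 2: "N"}, [(1, 2)]),
]
c_o_bond = make_graph({"a": "C", "b": "O"}, [("a", "b")])
print(freq(c_o_bond, database))
```

With a threshold of, say, 0.5, the C-O pattern above would count as frequent because two of the three graphs contain it.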


I.3) SOME ALGORITHMS FOR GRAPH MINING
- Inductive Logic Programming (ILP): lies at the intersection of logic programming and inductive
learning; it combines machine learning and logic programming techniques and applies them to graph data mining.
- Graph-based algorithms:
+ Apriori-based approach: finds the frequent patterns
_ AGM/AcGM: by Inokuchi (2000)
_ FSG: by Kuramochi & Karypis (ICDM 2001)
_ PATH: by Vanetik and Gudes (ICDM 2002, 2004)
_ FFSM: by Huan (ICDM 2003) and SPIN: by Huan (KDD 2004)
_ FTOSM: by Horvath (KDD 2006)
+ Pattern-growth approach
_ Subdue: by Holder (KDD 1994)
_ MoFa: by Borgelt and Berthold (ICDM 2002)
_ gSpan: by Yan and Han (ICDM 2002)
_ Gaston: by Nijssen and Kok (KDD 2004)
_ CMTreeMiner: by Chi (TKDE 2005), LEAP: by Yan (SIGMOD 2008)


CHAPTER II: GRAPH MINING ALGORITHMS

II.1) THE APRIORI ALGORITHM
1. Principle:
- If a graph is frequent, then all of its subgraphs are also frequent.


2. Characteristics of the Apriori algorithm
- The algorithm has two main steps:
o Join step: generate the set of candidate subgraphs
o Prune step: check the frequency of each candidate subgraph
- Most work concentrates on optimizing the first step; the second step then relies on subgraph
isomorphism testing.
- Quantities used to measure the size of a subgraph: vertices, edges, or edge-disjoint paths
(path number)
- Execution flow of the algorithm: [figure]


- AGM (Inokuchi): the size of a graph is its number of vertices (#vertices).
- FSG (Karypis): the size of a graph is its number of edges (#edges).
o Candidates are generated by edge count: the subgraph size grows by one edge per iteration.
o Join step: two subgraphs of size k are merged if and only if they share a common core of
size k-1.

- PATH (Vanetik): the size of a graph is its path number, that is, the minimum number of
edge-disjoint paths into which the graph can be decomposed.
- However, the join step that generates candidates is complex and costly and consumes a lot of
memory (when BFS is used); the prune step also has weaknesses, being inefficient at testing
subgraph isomorphism. This motivated algorithms based on the pattern-growth
approach.
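
The join and prune steps of an edge-based (FSG-style) level-wise search can be sketched as follows. To stay short, the sketch assumes every vertex label is unique, so a graph is just a set of labeled edges and the subgraph test reduces to set inclusion; real Apriori-based miners need a full isomorphism test instead. The function names are illustrative only.

```python
from itertools import combinations


def is_connected(edges):
    """True if the edge set forms a single connected component."""
    edges = list(edges)
    seen = set(edges[0])
    grew = True
    while grew:
        grew = False
        for a, b in edges:
            if (a in seen) != (b in seen):  # edge leaves the reached part
                seen.update((a, b))
                grew = True
    return all(a in seen for a, _ in edges)


def apriori_subgraphs(database, min_support):
    """Level-wise mining; graphs are frozensets of (u, v) label pairs."""
    def support(pattern):
        return sum(pattern <= g for g in database)

    level = {frozenset([e]) for g in database for e in g}
    level = {p for p in level if support(p) >= min_support}
    frequent = {}
    while level:
        frequent.update({p: support(p) for p in level})
        candidates = set()
        # Join: two k-edge patterns sharing k-1 edges give a (k+1)-edge candidate.
        for p, q in combinations(level, 2):
            union = p | q
            if len(union) != len(p) + 1 or not is_connected(union):
                continue
            # Prune: every connected k-edge sub-pattern must itself be frequent.
            if all(union - {e} in level for e in union if is_connected(union - {e})):
                candidates.add(union)
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


database = [
    frozenset({("A", "B"), ("B", "C"), ("C", "D")}),
    frozenset({("A", "B"), ("B", "C")}),
    frozenset({("A", "B"), ("C", "D")}),
]
for pattern, count in apriori_subgraphs(database, min_support=2).items():
    print(sorted(pattern), count)
```

The unique-label assumption removes exactly the part the text identifies as expensive (isomorphism checking), so the sketch shows the control flow of join and prune rather than their true cost.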
3. Analysis of the algorithm:
- Cost of the algorithm: [figure]



II.2) THE PATTERN-GROWTH ALGORITHM
1) Basic idea:
- Avoid the complexity of the join step of the Apriori algorithm.
- Extend a pattern directly by adding a new edge e; the newly generated candidate is written
$g \diamond_x e$:
o If e introduces a new vertex, the extension is a forward extension and x = f.
o Otherwise x = b, meaning a backward extension.
- Recursively extend the frequent pattern g until no frequent graph containing g can be
found.
2) Framework:
- Input: a frequent subgraph g, a graph data set D, a minimum support threshold $\theta$, and the
set S of frequent subgraphs found so far.
- Algorithm PatternGrowth(g, D, $\theta$, S):
o Duplicate check:
if g is in S then
return;
else
add g to S;
o Extension step:
scan D and find every edge e such that
g can be extended to $g \diamond_x e$;
o Recursion step:
for each frequent $g \diamond_x e$ do
call PatternGrowth($g \diamond_x e$, D, $\theta$, S);
o return;
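
A minimal sketch of this framework is given below, again under the simplifying assumption that vertex labels are unique, so that a graph is a set of labeled edges and containment is set inclusion. The `visits` counter is an addition for illustration: it records how often each pattern is reached, including the visits rejected by the duplicate check against S.

```python
def pattern_growth(g, database, min_support, found, visits):
    """Recursive pattern growth; graphs are frozensets of (u, v) label pairs."""
    visits[g] = visits.get(g, 0) + 1
    if g in found:  # duplicate check: g was already reached another way
        return
    found.add(g)
    vertices = {v for edge in g for v in edge}
    # Extension step: every data edge that touches g and is not yet part of it.
    extensions = {e for graph in database for e in graph
                  if e not in g and (e[0] in vertices or e[1] in vertices)}
    for e in extensions:
        grown = g | {e}
        if sum(grown <= graph for graph in database) >= min_support:
            pattern_growth(grown, database, min_support, found, visits)


def mine(database, min_support):
    """Start pattern growth from every frequent single edge."""
    found, visits = set(), {}
    for e in {e for graph in database for e in graph}:
        seed = frozenset([e])
        if sum(seed <= graph for graph in database) >= min_support:
            pattern_growth(seed, database, min_support, found, visits)
    return found, visits


database = [
    frozenset({("A", "B"), ("B", "C"), ("C", "D")}),
    frozenset({("A", "B"), ("B", "C")}),
    frozenset({("A", "B"), ("C", "D")}),
]
found, visits = mine(database, min_support=2)
for pattern in found:
    print(sorted(pattern), "reached", visits[pattern], "time(s)")
```

In this toy data set the two-edge pattern A-B, B-C is reached twice, once from each of its one-edge subgraphs, which is the kind of repeated work that canonical-form methods such as gSpan are designed to avoid.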
3) Drawback: the extension step is inefficient because:
- The same graph is discovered many times:
o For example, a graph with n edges can be reached from n different graphs with n-1 edges.
- Repeatedly generating duplicate graphs and repeating the duplicate checks wastes memory,
resources and running time.
4) The gSpan algorithm:
- Uses DFS to traverse the graph.
- DFS order: vertices are numbered in the order in which they are visited in the DFS tree.

- Main ideas:
o Restrict extension by allowing it only in certain directions (along the rightmost
path).
o Backward extension: a new edge from the rightmost vertex $v_n$ to any other vertex on the
rightmost path.
o Forward extension: introduce a new vertex and connect it to a vertex on the rightmost path.
o Problem: a graph has many DFS trees, which leads to duplicates.
o Solution: choose one of the duplicate representations as the canonical one and extend only
along its rightmost path.
- DFS lexicographic search tree: [figure]

5) Extending pattern mining: mining closed frequent graphs
- The Apriori principle states: if a graph is frequent, then all of its subgraphs are also frequent.
- A frequent graph with n edges can therefore have up to $2^n$ frequent subgraphs.
- Example: among the 423 chemical compounds confirmed active against AIDS in the data set, there
are about 1 million frequent graph patterns with support of at least 5%.
- This motivates mining closed frequent subgraphs.
- Closed frequent graphs:
o A frequent graph G is closed if no supergraph of G has the same support
as G.
o The set of closed frequent subgraphs carries the same information as the full set of
frequent subgraphs.
o In the AIDS antiviral data set there are 1 million frequent subgraphs, but only about 2,000
of them are closed.
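
The closedness test can be sketched as a filter over an already mined pattern table. The sketch assumes each pattern is stored as a frozenset of labeled edges with unique vertex labels, so "supergraph" becomes "proper superset"; the function name and the sample supports are illustrative.

```python
def closed_patterns(frequent):
    """Keep the patterns that have no proper super-pattern with equal support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == t for q, t in frequent.items())
    }


# Pattern -> support (number of graphs containing it), as a miner might return.
frequent = {
    frozenset({("A", "B")}): 3,
    frozenset({("B", "C")}): 2,
    frozenset({("C", "D")}): 2,
    frozenset({("A", "B"), ("B", "C")}): 2,
}
for pattern, support in closed_patterns(frequent).items():
    print(sorted(pattern), support)
```

Here the single edge B-C is dropped because the larger pattern A-B, B-C has the same support, while A-B is kept because its support is strictly higher than that of any pattern extending it.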
- Advantages of closed frequent graphs:
o The number of closed frequent graphs is far smaller than the number of frequent graphs.
o In applications, closed frequent graphs can replace frequent graphs with equivalent effect.
- CloseGraph:
o An efficient pattern-growth algorithm for mining closed frequent graphs (CFG).
o Builds on the pattern extension of gSpan.
- Handling hard cases: [figure]

6) The SUBDUE algorithm:
- Start with single vertices.
- Expand the best substructures by one new edge.
- Limit the number of best substructures kept.
o A substructure is evaluated by how well it compresses the input
graph.
o Uses the minimum description length (DL).
o The best substructure S in graph G minimizes DL(S) + DL(G\S).
- Stop when no new substructure is found.

- Advantages:
o Matching is approximate rather than exact, which allows slightly different
structures.
o Reduces the number of frequent subgraph patterns.
- Applications:
o Hierarchical clustering of poorly understood structures
o Graph compression
o Graph grammar learning
7) Maximal graph pattern mining
- Tree-based equivalence classes
o Trees are arranged in a fixed (canonical) order.
o Graphs belong to the same equivalence class if they share the same canonical tree.

- Locally maximal
o A subgraph g is locally maximal if it is maximal within its equivalence class, that is,
g has no frequent supergraph that shares the same canonical tree as g.
o Every maximal graph pattern must be locally maximal.
o Subgraphs that are not locally maximal are pruned.

II.3) CHARACTERISTICS OF GRAPH MINING ALGORITHMS
1) Search order:
- As with graph algorithms in general, the search can proceed depth-first (DFS) or
breadth-first (BFS).
- Complete versus incomplete search.
2) Generations of graph mining algorithms: [figure]

3) Pattern discovery order
- Free extension: [figure]

- Rightmost-path extension: [figure]


4) Coherent subgraph mining
- Motivation:
o Reduce the damage caused by high dimensionality while still mining frequent
patterns.
o Basic idea: remove redundant features that contribute no additional
information.
- A graph G is coherent if the mutual information between G and each of its subgraphs meets
a threshold (in other words, G is strongly correlated with all of its subgraphs).
- The mutual information between a graph G and one of its subgraphs G' is given by:
$I(G, G') = \sum_{X_G, X_{G'}} P(X_G, X_{G'}) \log \frac{P(X_G, X_{G'})}{P(X_G)\, P(X_{G'})}$
- $P(X_G, X_{G'})$ is the joint distribution.
- $P(X_G = 1) = \mathrm{support}(G)$
- The mining procedure is similar to gSpan, but candidates are pruned based on mutual
information.
5) Dense subgraph mining
- Relational graph: every node has a unique label, as in social network or biological network models.
- The problem is to mine frequent, highly connected (dense) subgraphs from relational graphs. Examples:
o Mining data from social networks
o Sets of genes with the same function, which are usually organized in a certain biological order.
- Density can be measured, for example, by the average degree of a vertex.
- A set of edges whose removal disconnects the graph is called an edge cut.
- A minimum cut is a cut with the fewest edges.
- A graph is dense when the size of its minimum cut is not less than a given threshold.
- Decomposition step:
o Decompose the relational graphs to find maximal subgraphs that satisfy the connectivity constraint.
- Intersection step:
o Intersect the decomposed graphs to obtain highly connected graphs (level by level).
o If the intersection no longer satisfies the connectivity constraint, decompose it again
to obtain smaller candidates.

6) Graph search and graph indexing
- Problem: given a graph database and a query graph, find all graphs in the database that
contain the query graph.

- Naive solution:
o Sequential scan (disk I/O)
o Subgraph isomorphism test (NP-complete)
- Problem: this does not scale.
- The number of substructures grows exponentially with graph size, so indexing every subgraph
would produce an enormous number of index entries.
- Intuition: if a graph G contains the query graph Q, then G must contain every substructure of Q.
o Step 1: build the index
Enumerate the distinct structures of the graphs
Build an inverted index between graphs and structures
o Step 2: query processing
Enumerate the structures in the query graph
Compute the candidate graphs that contain all of these structures
Eliminate false positives by running the subgraph isomorphism test.
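
The two steps can be sketched with label paths as the indexed structures. This is a simplified illustration, not the scheme of any particular system: graphs are assumed to be given as a vertex-label dict plus adjacency lists, paths of up to `max_len` edges are used as features, and the final isomorphism check on the candidates is left out.

```python
from collections import defaultdict


def label_paths(labels, adj, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    found = set()

    def walk(path):
        found.add(tuple(labels[v] for v in path))
        if len(path) <= max_len:  # the path has len(path) - 1 edges so far
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for v in labels:
        walk([v])
    return found


def build_index(database, max_len=2):
    """Step 1: inverted index from each path feature to the graphs containing it."""
    index = defaultdict(set)
    for gid, (labels, adj) in database.items():
        for feature in label_paths(labels, adj, max_len):
            index[feature].add(gid)
    return index


def candidates(index, database, query, max_len=2):
    """Step 2: graphs that contain every path feature of the query."""
    result = set(database)
    for feature in label_paths(query[0], query[1], max_len):
        result &= index.get(feature, set())
    return result  # still needs a subgraph isomorphism check to drop false positives


database = {
    "g1": ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]}),
    "g2": ({0: "C", 1: "N"}, {0: [1], 1: [0]}),
    "g3": ({0: "C", 1: "O", 2: "N"}, {0: [1], 1: [0, 2], 2: [1]}),
}
query = ({0: "C", 1: "O"}, {0: [1], 1: [0]})
print(sorted(candidates(build_index(database), database, query)))
```

Only the candidates returned by the filter need the expensive isomorphism test, which is the point of building the index first.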

- Path-based indexing approaches
o Daylight (commercial system)
o GraphGrep: by Shasha
o Grace: by Srinath Srinivasan
- Basic idea of path-based indexing and search:
o Enumerate all paths in the database up to a maximum length (or a given
threshold).
o Build an inverted index between paths and graphs.
o Use the index to identify the candidate graphs that contain all the paths, up to the
maximum length, of the query graph.
- Advantages:
o Paths are easier to handle than trees and graphs.
o The path space is well defined, so working with it is efficient.
- Limitations:
o May yield no useful result in some cases.
o Unsuitable for complex graph queries.
- The gIndex algorithm:
o Idea:
Find the frequent structures of the database
Identify a small set of discriminative structures
Build an inverted index between the discriminative structures and the graphs
o Intuition:
Redundant structures need not be indexed.
Index only the structures that provide more information than the structures already indexed.
o Discriminative structures:
Given a set of structures $f_1, f_2, \ldots, f_n$ and a new structure x, the extra indexing
power provided by x is measured by
$P(x \mid f_1, f_2, \ldots, f_n), \quad f_i \subset x$
When P is small enough, x is a discriminative structure and should be included in the index.
The index built from discriminative structures is an order of magnitude smaller than one
built from all frequent structures.
7) Substructure similarity search:
- Motivation:
o Exact search is too restrictive for graphs such as molecular and biological structures.
o Similarity (approximate) search is necessary and important.
- Approximate search:
o It is hard to build an index that covers similar subgraphs.
o Basic idea:
Build the index first


Select features in the query space rather than in the database
space.
- Similarity measure:
o Each graph is represented by a feature vector.
o Similarity is defined as the distance between the vectors.
o Easy to index and fast.
- Query relaxation:
o Some edges may be dropped, regardless of their position.


CHAPTER III: GRAPH CLASSIFICATION

III.1) STRUCTURE-BASED APPROACH
- Basic idea:
o Map each graph in the database to a vector:
$G \rightarrow x = [x_1, x_2, \ldots, x_n]$
o Here $x_i$ is the frequency of the i-th structure (the i-th graph pattern) in G. Each vector
is labeled with a class, and the vectors are classified in the vector space.
- Structural features:
o Local structures in a graph, for example the edges around a vertex or paths of fixed
length.

III.2) PATTERN-BASED APPROACH
1) Classification based on graph patterns from data mining:
- Sequence patterns (De Raedt and Kramer)
- Frequent subgraphs
- Coherent frequent subgraphs
- Closed frequent subgraphs
- Acyclic subgraphs (Wale and Karypis, 2006)
2) Decision-tree-based classification
- Basic idea:
o Partition the data top-down and build the tree by choosing the best feature at
each step.
o Split the data set into two subsets: one that contains the feature and one that does not.

3) Classification with boosting: [figure]

III.3) KERNEL-BASED APPROACH
- Motivation:
o Kernel-based learning methods do not need access to the data points themselves; they rely
only on the kernel function between pairs of data points.
o They can be applied to complex structures, as long as a kernel function can be defined on
them.
- Basic idea:
o Map each graph to a meaningful set of patterns
o Define the kernel on the corresponding pattern sets
- Random walk kernels (Gartner et al., Borgwardt et al., Inokuchi et al.)
o Basic idea: count the matching walks between two graphs
- Some basics:
o Adjacency matrix: A[i,j] = 1 if there is an edge between vertices i and j, and 0 otherwise.
o Transition matrix: $P = D^{-1} A$, where D is the diagonal matrix of vertex degrees.


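- From a vertex i, the walk jumps at random to an adjacent vertex j.
- The probability of jumping to j is proportional to A[i,j].
- Random walks:
o For walks of length n:
Entry (i,j) of $A^n$ is the number of walks of length n from i to j.
Entry (i,j) of $P^n$, where $P = D^{-1} A$ is the transition matrix, is the probability that a random walk of length n goes from i to j.
o Comparing graphs:
Idea: count the walks that match in the two graphs.
Two graphs are similar if they share many matching walks.
Down-weight longer walks, and take care with cycles.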
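
These quantities can be computed directly for small graphs. The sketch below uses plain Python lists as matrices. `walk_kernel` is a simplified walk-counting kernel written for illustration: it counts label-matching walks up to `max_len` edges in the direct product of the two graphs and weights walks of length n by `decay ** n`. It is not claimed to be the exact formulation of any of the cited authors, and `decay` is an assumed parameter.

```python
def mat_mul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transition_matrix(A):
    """P = D^{-1} A: each row of A divided by the degree of its vertex."""
    return [[a / sum(row) if sum(row) else 0.0 for a in row] for row in A]


def count_walks(A, n):
    """A^n: entry [i][j] is the number of walks of length n from i to j."""
    result = A
    for _ in range(n - 1):
        result = mat_mul(result, A)
    return result


def walk_kernel(A1, labels1, A2, labels2, max_len, decay=0.5):
    """Count matching walks in both graphs, down-weighting longer walks."""
    pairs = [(i, j) for i in range(len(A1)) for j in range(len(A2))
             if labels1[i] == labels2[j]]
    # Product graph: (i, j) is adjacent to (k, l) iff i-k in G1 and j-l in G2.
    Ax = [[A1[i][k] * A2[j][l] for (k, l) in pairs] for (i, j) in pairs]
    total, power = 0.0, Ax
    for n in range(1, max_len + 1):
        total += decay ** n * sum(map(sum, power))
        power = mat_mul(power, Ax)
    return total


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph on three vertices
print(count_walks(path, 2))
print(transition_matrix(path))
edge = [[0, 1], [1, 0]]
print(walk_kernel(edge, ["C", "O"], edge, ["C", "O"], max_len=3))
```

The decay factor implements the advice to discount long walks; walks that bounce back and forth along one edge are still counted, which is the cycle issue the text warns about.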



[Figure: graph classification applied to program debugging]

[Figure: graph classification applied to malware detection]

CHAPTER IV: GRAPH COMPRESSION

CHAPTER V: APPLYING GRAPH MINING TO REPUTATION MANAGEMENT
IN NETWORKS
V.1 NOTATION
$G = (V, E)$: a graph
$V$: the set of N vertices
$E \subseteq V \times V$: the set of directed or undirected edges
$N(u) = \{v \mid (u, v) \in E\}$: the neighbourhood of u
$d(u) = |N(u)|$: the degree of u

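V.2 RELATED MEASURES
1) Degree distribution:
- $C_k = |\{u : d(u) = k\}|$: the number of vertices with (in-/out-) degree k.
We have the power law
$C_k = c\, k^{-\gamma}$ with $\gamma > 1$, or equivalently
$\log C_k = \log c - \gamma \log k$
- Plotting $\log C_k$ against $\log k$ therefore gives a straight line with slope $-\gamma$.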
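
As a rough illustration of how the exponent can be read off a log-log plot, the sketch below computes $C_k$ from an undirected edge list and fits a least-squares line to the points $(\log k, \log C_k)$. Plain least squares on raw counts is a simplification and can be biased on real data; the function names are illustrative.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """C_k: number of vertices of degree k, from an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def loglog_slope(counts):
    """Least-squares slope of log C_k against log k (an estimate of -gamma)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Requires at least two distinct degrees, otherwise the denominator is zero.
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)


star = [(0, 1), (0, 2), (0, 3), (0, 4)]  # one hub with four leaves
counts = degree_distribution(star)
print(dict(counts), loglog_slope(counts))
```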
2) The Internet graph [Faloutsos 1999]: [figure]

3) In-degree of the web graph [Broder et al., 2000; Donato et al., 2007]: [figure]

4) Out-degree of the web graph [Broder et al., 2000; Donato et al., 2007]: [figure]

5) Other degree-related measures:
- Edge reciprocity: the percentage of edges that are linked in both directions
- Degree / average degree of the neighbouring (adjacent) vertices
- Average out-degree of the neighbouring vertices
- Average in-degree of the neighbouring vertices
6) Computing degree measures over neighbours up to distance d:
Semi-streaming model: graph on disk
for node : 1 ... N do
    INITIALIZE-MEM(node)
end for
for distance : 1 ... d do {Iteration step}
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
7) Computing the average degree of the neighbouring vertices
Semi-streaming model: graph on disk
for node : 1 ... N do
    SUMDEG(node) := 0
end for
for distance : 1 ... 1 do {Iteration step}
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            SUMDEG(src) := SUMDEG(src) + DEG(dest)
        end for
    end for
    for node : 1 ... N do
        AVGDEG(node) := SUMDEG(node) / DEG(node)
    end for
end for
return AVGDEG
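
The pseudocode above can be written as a short runnable sketch. It assumes the graph is available as a re-readable stream of (src, dest) links, keeps only O(N) counters in memory, and computes DEG in a first pass because the pseudocode takes it as given. The name `edge_stream` is a stand-in for whatever reads the links from disk.

```python
def average_neighbour_degree(n, edge_stream):
    """edge_stream() yields (src, dest) pairs; it is scanned twice, never stored."""
    deg = [0] * n
    for src, _ in edge_stream():  # pass 1: out-degree of every vertex
        deg[src] += 1
    sumdeg = [0] * n
    for src, dest in edge_stream():  # pass 2: add up the neighbours' degrees
        sumdeg[src] += deg[dest]
    # Vertices without links get 0.0 instead of a division by zero.
    return [s / d if d else 0.0 for s, d in zip(sumdeg, deg)]


# Undirected path 0 - 1 - 2, stored as links in both directions.
links = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(average_neighbour_degree(3, lambda: iter(links)))
```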
8) Number of supporters within distance d: [figure]

Simple solution
Run a BFS of depth d from every vertex u: this is not feasible for large networks.
Semi-streaming solution, first version
At iteration i of the algorithm, each vertex u stores the set of vertices at distance at most i.
At iteration i + 1, each vertex u takes the union of the distance-i sets of all its neighbours.
Problem: this needs O(n) bits per vertex.
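
The first semi-streaming version can be sketched as below. It follows links from src to dest, so `reach[u]` holds the vertices reachable from u; to count vertices pointing towards u instead, the edge list would be reversed. Including each vertex in its own starting set is an assumption of this sketch, as are the names.

```python
def count_within_distance(n, edges, d):
    """For each vertex, count the other vertices reachable in at most d hops."""
    reach = [{u} for u in range(n)]
    for _ in range(d):
        updated = [set(s) for s in reach]
        for src, dest in edges:  # one sequential scan of the edge list
            updated[src] |= reach[dest]
        reach = updated
    return [len(s) - 1 for s in reach]  # do not count the vertex itself


chain = [(0, 1), (1, 2), (2, 3)]  # directed path 0 -> 1 -> 2 -> 3
print(count_within_distance(4, chain, d=2))
```

Each vertex may end up storing a set of size up to n, which is the O(n) bits per vertex that makes this version impractical on large graphs.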

9) Probabilistic counting:
Used to estimate the number of supporters within distance d.
Runs simultaneously for all vertices, for d iterations.
At each step, every vertex propagates a bit vector to all its neighbours.
At each step, every vertex aggregates by OR-ing the vectors received from its neighbours.
Works with bit vectors of logarithmic size.




10) General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux := 0^k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[src] := Aux[src] OR K[dest, ·]
        end for
    end for
    K := Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] := ESTIMATE(K[node, ·])
end for
return Supporters
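
A hedged sketch of this scheme follows. Several details are assumptions rather than the report's own choices: each bit is set independently with probability `eps`, each vertex keeps its own bits when aggregating (so the count covers distance at most d and includes the vertex itself), and the estimator inverts the relation P(bit = 1) = 1 - (1 - eps)^n that holds for the OR of n independent vectors.

```python
import math
import random


def estimate_supporters(n, edges, d, k=64, eps=0.05, seed=0):
    """Propagate k-bit vectors by OR for d rounds, then estimate set sizes."""
    rng = random.Random(seed)
    # INIT: each of the k bits of every vertex is 1 with probability eps.
    bits = [sum((rng.random() < eps) << b for b in range(k)) for _ in range(n)]
    for _ in range(d):
        aux = list(bits)  # assumption: a vertex keeps its own bits
        for src, dest in edges:  # one sequential scan per round
            aux[src] |= bits[dest]
        bits = aux
    estimates = []
    for vec in bits:
        ones = bin(vec).count("1")
        if ones == k:
            estimates.append(math.inf)  # vector saturated, eps is too large
        else:
            # Invert E[ones / k] = 1 - (1 - eps) ** size.
            estimates.append(math.log(1 - ones / k) / math.log(1 - eps))
    return bits, estimates


bits, estimates = estimate_supporters(3, [(0, 1), (1, 2)], d=2)
print(estimates)
```

With a single value of `eps` the estimate is only reliable for a limited range of set sizes, which is presumably why the procedure is repeated with different parameter values.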

- Initialization: for every vertex, each bit is set to 1 (one) with a fixed probability. [formula missing in source]
- Estimation: [formula missing in source]
- Prediction: [conditions missing in source]; the procedure is repeated with different parameter
values, counting the ones in the bit vector.
11) Convergence and error:
- Number of iterations the algorithm performs: [bound missing in source]


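V.3 GLOBAL CLUSTERING STRUCTURE
1) Global and local clustering structure:
- Transitivity coefficient of the graph G: [formula missing in source]
the probability that a random pair of adjacent vertices are connected to each other.

2) Clustering coefficient:
- Clustering coefficient of a vertex v:
$C(v) = \frac{|\{(u, w) \in E : u \in N(v),\ w \in N(v)\}|}{\binom{|N(v)|}{2}}$
the probability that a random pair of neighbours of v are connected to each other.
- The clustering coefficient $C_G$ of the graph G is the average of the clustering coefficients
of its vertices.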
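
The vertex and graph clustering coefficients can be computed directly on a small undirected graph. The sketch assumes an adjacency dict that maps each vertex to the set of its neighbours and, by convention, assigns coefficient 0 to vertices with fewer than two neighbours.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves adjacent."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0  # convention used here for vertices with under two neighbours
    linked = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return linked / (len(nbrs) * (len(nbrs) - 1) / 2)


def graph_clustering(adj):
    """C_G: average of the local coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Triangle 0-1-2 with an extra vertex 3 attached to vertex 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(adj, 0), graph_clustering(adj))
```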


3) Counting triangles:
- Exact computation reduces to matrix multiplication, which is infeasible even for medium-sized
networks.
- Alternative: random sampling in the streaming and semi-streaming models.
There are two models:
_ Edges are stored in arbitrary order.
_ Ordered edges: all edges incident to a vertex are stored consecutively. [figure]

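4) Brute-force computation
- The brute-force algorithm checks every triple of vertices.
- Let $T_0, T_1, T_2, T_3$ denote the sets of triples that contain 0, 1, 2 and 3 edges, respectively.
- $|T_0| + |T_1| + |T_2| + |T_3| = \binom{|V|}{3}$: the total number of triples.
- [A further relation among $|T_1|$, $|T_2|$ and $|T_3|$ is missing in the source.]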

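5) Naive sampling
- r: the number of independent samples, each consisting of 3 distinct vertices (a, b, c) of the graph.
- For the i-th sample, output 1 if (a, b, c) is a triangle; otherwise output 0.
- [formula missing in source]
- Estimate: the fraction of samples that output 1, multiplied by the total number of triples $\binom{|V|}{3}$.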
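
A minimal sketch of naive sampling is shown below, assuming the whole edge set fits in memory so that each sampled triple can be checked directly. The estimator used here, hit fraction times the number of vertex triples, is the natural one for this scheme and is stated as an assumption; the function name and the fixed `seed` are illustrative.

```python
import random
from math import comb


def naive_triangle_estimate(vertices, edges, r, seed=0):
    """Estimate the triangle count from r uniformly sampled vertex triples."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(vertices, 3)  # three distinct vertices
        if all(frozenset(p) in edge_set for p in ((a, b), (b, c), (a, c))):
            hits += 1
    return hits / r * comb(len(vertices), 3)


vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
print(naive_triangle_estimate(vertices, edges, r=1000))
```

On sparse graphs almost every sampled triple misses, so a very large r is needed for a stable estimate, which motivates sampling paths of length 2 instead of arbitrary triples.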


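Optimized sampling in the semi-streaming model: the 3-pass algorithm [Buriol 2006]
- Pass 1: count the number of paths of length 2 in the stream.
- Pass 2: select a path of length 2, (a, b, c), at random.

- Pass 3: if (a, c) ∈ E then output 1, otherwise output 0.
Triangle count: [formula missing in source]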
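
The three passes can be sketched as follows. The sketch assumes a simple undirected graph whose edges each appear once in a re-readable stream, and it draws r sample paths at once. The final scaling uses the fact that every triangle contains exactly three paths of length 2; that estimator, like the names, is an assumption of this sketch since the report's formula did not survive.

```python
import random
from collections import Counter


def three_pass_estimate(edge_stream, r, seed=0):
    """Estimate the triangle count with three scans of the edge stream."""
    rng = random.Random(seed)
    # Pass 1: a vertex of degree d is the middle of d * (d - 1) / 2 two-edge paths.
    degree = Counter()
    for u, v in edge_stream():
        degree[u] += 1
        degree[v] += 1
    centres = [v for v in degree if degree[v] >= 2]
    weights = [degree[v] * (degree[v] - 1) // 2 for v in centres]
    total_paths = sum(weights)
    if total_paths == 0:
        return 0.0
    # Pass 2: pick middle vertices in proportion to their path counts, then
    # read their neighbours from the stream and pick two of them at random.
    chosen = rng.choices(centres, weights=weights, k=r)
    nbrs = {b: [] for b in set(chosen)}
    for u, v in edge_stream():
        if u in nbrs:
            nbrs[u].append(v)
        if v in nbrs:
            nbrs[v].append(u)
    samples = [frozenset(rng.sample(nbrs[b], 2)) for b in chosen]
    # Pass 3: check which sampled paths (a, b, c) are closed by an edge (a, c).
    needed = set(samples)
    present = {frozenset(e) for e in edge_stream() if frozenset(e) in needed}
    closed = sum(s in present for s in samples)
    return closed / r * total_paths / 3


edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
print(three_pass_estimate(lambda: iter(edges), r=1000))
```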

6) Extending the algorithm: [figure]

V.4 LOCAL CLUSTERING STRUCTURE
1) Local clustering coefficient

- Requires computing the number of triangles for every vertex.
- Not feasible when applied in the semi-streaming model.

2) Estimating set intersections: [figure]



3) Algorithm:
Z = 0
for i : 1 ... m do {Independent trials}
    for u : 1 ... |V| do {Assign labels}
        l_i(u) = hash_i(u) {Min-wise linear permutation}
    end for
    for u : 1 ... |V| do {Compute fingerprints}
        F_i(u) = min_{v ∈ S(u)} l_i(v)
    end for {1 scan of G}
    for u : 1 ... |V| do {Update counters}
        for v ∈ S(u) do
            if F_i(u) == F_i(v) then {Minima are equal}
                Z_uv = Z_uv + 1 {Z_uv is stored on disk}
            end if
        end for
    end for
end for
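
The algorithm can be sketched in memory as follows. Under min-wise hashing, the fraction of trials in which two fingerprints agree estimates the overlap of the two sets S(u) and S(v) as the ratio of their intersection to their union. The linear permutation modulo a fixed prime, the in-memory dict standing in for the on-disk counters, and the function name are all assumptions of this sketch.

```python
import random


def minhash_similarity(S, m, seed=0):
    """S maps each vertex to its set S(u); returns Z_uv / m for v in S(u)."""
    rng = random.Random(seed)
    vertices = sorted(S)
    index = {u: i for i, u in enumerate(vertices)}
    p = 2_147_483_647  # a prime assumed to be larger than |V|
    Z = {}
    for _ in range(m):  # independent trials
        a, b = rng.randrange(1, p), rng.randrange(p)
        # Assign labels with a random linear permutation of the vertex indices.
        label = {u: (a * index[u] + b) % p for u in vertices}
        # Fingerprint: smallest label found in S(u); empty sets are skipped.
        F = {u: min(label[v] for v in S[u]) for u in vertices if S[u]}
        for u in F:  # update counters
            for v in S[u]:
                if v in F and F[u] == F[v]:  # minima are equal
                    Z[(u, v)] = Z.get((u, v), 0) + 1
    return {pair: z / m for pair, z in Z.items()}


# Triangle 0-1-2 plus vertex 3 attached to 0; each S(u) includes u itself.
S = {0: {0, 1, 2, 3}, 1: {0, 1, 2}, 2: {0, 1, 2}, 3: {0, 3}}
similarity = minhash_similarity(S, m=200)
print(similarity.get((1, 2)), similarity.get((0, 3)))
```

Multiplying the estimated ratio by the size of the union, which can be derived from the degrees, gives an estimate of the number of common neighbours and hence of the triangles through an edge.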
4) Computation: [figure]

V.5 TOPOLOGICAL MEASURES:
1) PageRank: [figure]

2) Computing PageRank

Semi-streaming version of the power iteration method
for node : 1 ... N do
    PR(node) := 1/N
end for
for distance : 1 ... d do {Iteration step}
    for node : 1 ... N do
        Aux(node) := 0
    end for
    for dest : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux(dest) := Aux(dest) + PR(src) * T(src, dest)
        end for
    end for
    PR := Aux
end for
return PR
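
A runnable sketch of the power iteration is given below. It makes choices that the pseudocode leaves open and that should be read as assumptions: the transition weight T(src, dest) is taken as 1 / outdegree(src), a damping factor of 0.85 is added, and the rank of vertices without out-links is spread evenly over all vertices.

```python
def pagerank(n, edges, iterations=50, damping=0.85):
    """Power iteration over an edge list of (src, dest) links, vertices 0..n-1."""
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    pr = [1.0 / n] * n
    for _ in range(iterations):
        aux = [(1.0 - damping) / n] * n
        for src, dest in edges:  # one sequential scan per iteration
            aux[dest] += damping * pr[src] / out_deg[src]
        # Rank held by vertices with no out-links is shared by everyone.
        dangling = sum(pr[u] for u in range(n) if out_deg[u] == 0)
        pr = [x + damping * dangling / n for x in aux]
    return pr


links = [(1, 0), (2, 0), (0, 1)]  # vertex 0 receives two links
print(pagerank(3, links))
```

Only the two rank vectors are kept in memory while the links are read sequentially, which is what makes the method fit the semi-streaming model.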
3) TrustRank: [figure]

CHAPTER VI: CONCLUSION
Graph data mining opens up a direction of research and application development in information
technology, offering ways to handle very large, real-world data problems at a high level,
grounded in practice and in the problems that are arising now.

In Vietnam, at present and in the near future, the use and development of graph data mining has
not received due attention, and its role and position are not properly recognized. Research on
graph data mining is still very limited. Developing graph data mining applications will help
Vietnamese industry grow, especially biotechnology and biochemistry.

REFERENCES

English
[1] D.V. Janardhan Rao, Prof. Prasad Tadepalli, "A Study of Graph Mining Algorithms", 2007.
[2] Deepayan Chakrabarti and Christos Faloutsos, "Graph Mining: Laws, Generators, and
Algorithms", Yahoo! Research and Carnegie Mellon University, 2006.
[3] Karsten Borgwardt and Xifeng Yan, "Graph Mining", Max Planck Institute for
Developmental Biology, 2008.
[4] Stefano Leonardi, "Graph Mining and its applications to Reputation Management in
Networks", Sapienza University of Rome, Rome, Italy, 2008.
