Data Mining from Job Websites (Khai Phá Dữ Liệu Từ Website Việc Làm)
ACKNOWLEDGEMENTS
I would like to sincerely thank the lecturers of the Information Technology department at Hai Phong Private University (Đại học Dân lập Hải Phòng), who devotedly taught me over the past four years. I am also grateful for the encouragement of my family and friends, which supported my own best efforts.
In particular, I wish to express my deep gratitude to Dr. Phùng Văn Ổn, who wholeheartedly guided and encouraged me in carrying out this project.
I look forward to comments from all lecturers, friends and colleagues so that this project can be developed and improved further.
Hai Phong, July 2010
Author
Nguyễn Ngọc Châu
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
INTRODUCTION
Chapter 1: OVERVIEW OF DATA MINING AND KNOWLEDGE DISCOVERY
I. Overview of data mining
2. Overview of knowledge discovery and data mining techniques (KDD: Knowledge Discovery and Data Mining)
II. Applying association rules to data mining
Agents
3.4 Depth limit
Limitations of robots
The problem
Data specification
CONCLUSION
REFERENCES
$$\mathrm{supp}(X) = \frac{|D_X|}{|D|}$$
where $D_X$ is the set of transactions in D that contain X.
TID   Itemset
1     ABDE
2     BCE
3     ABDE
4     ABCE
5     ABCDE
6     BCD
An itemset X ⊆ I with supp(X) ≥ minsup is called a frequent itemset with respect to the threshold minsupp.
Frequent itemsets                        Support
B                                        100% (6/6)
E, BE                                    83% (5/6)
A, C, D, AB, AE, BC, BD, ABE             67% (4/6)
AD, CE, DE, ABD, ADE, BCE, BDE, ABDE     50% (3/6)
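A rule X ⇒ Y has a support (supp). supp(X ⇒ Y) is defined as the probability that a transaction supports the attributes in both X and Y, that is:
support(X ⇒ Y) = support(X ∪ Y).
A rule X ⇒ Y also has a confidence c (conf). The confidence c is defined as the probability that a transaction T which supports X also supports Y. In other words, c is the percentage of transactions that also contain Y among the transactions that contain X.
The confidence c is computed as follows (expressed as a percentage):
$$\mathrm{conf}(X \Rightarrow Y) = p(Y \subseteq T \mid X \subseteq T) = \frac{p(Y \subseteq T \wedge X \subseteq T)}{p(X \subseteq T)} = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$
A rule X ⇒ Y is said to hold on D if support(X ⇒ Y) ≥ minsup and confidence(X ⇒ Y) ≥ minconf.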
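The definitions of support and confidence can be checked on the six-transaction example database (ABDE, BCE, ABDE, ABCE, ABCDE, BCD). The following is a minimal sketch; the function names `supp`, `conf` and `holds` are illustrative choices, not names taken from the thesis.

```python
from fractions import Fraction

# The six-transaction example database (TIDs 1..6).
TRANSACTIONS = [frozenset(t) for t in ["ABDE", "BCE", "ABDE", "ABCE", "ABCDE", "BCD"]]

def supp(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in TRANSACTIONS if items <= t)
    return Fraction(hits, len(TRANSACTIONS))

def conf(x, y):
    """confidence(X => Y) = supp(X u Y) / supp(X)."""
    return supp(frozenset(x) | frozenset(y)) / supp(x)

def holds(x, y, minsup, minconf):
    """A rule holds when it reaches both the support and the confidence threshold."""
    return supp(frozenset(x) | frozenset(y)) >= minsup and conf(x, y) >= minconf

print(supp("BE"))      # 5/6
print(conf("A", "E"))  # 1
print(conf("E", "A"))  # 4/5
```

Exact fractions are used instead of floats so that comparisons against the thresholds are not affected by rounding.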
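confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
In a rule X ⇒ Y, X may be empty, whereas Y must be non-empty.
Association rules have the following properties:
1) No union of rules: if X ⇒ Z and Y ⇒ Z hold on D, the rule X ∪ Y ⇒ Z does not necessarily hold. Likewise, from X ⇒ Y and X ⇒ Z one cannot infer X ⇒ Y ∪ Z.
2) No decomposition of rules: if X ∪ Y ⇒ Z holds on D, then X ⇒ Z and Y ⇒ Z may not hold.
3) No transitivity: from X ⇒ Y and Y ⇒ Z one cannot infer X ⇒ Z.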
For example, suppose T(X) ⊇ T(Y) ⊇ T(Z) and confidence(X ⇒ Y) = confidence(Y ⇒ Z) = minconf. Then confidence(X ⇒ Z) = minconf² < minconf because minconf < 1, which means that the rule X ⇒ Z does not reach the minimum confidence.
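The non-transitivity argument can be illustrated numerically. The sketch below uses a hypothetical four-transaction database (not data from the thesis) in which every transaction containing Y also contains X, and every transaction containing Z also contains Y.

```python
from fractions import Fraction

# Hypothetical database chosen so that T(X) contains T(Y), which contains T(Z).
DB = [{"X"}, {"X"}, {"X", "Y"}, {"X", "Y", "Z"}]

def confidence(antecedent, consequent, db=DB):
    """Share of the transactions containing `antecedent` that also contain `consequent`."""
    having_antecedent = [t for t in db if antecedent <= t]
    having_both = [t for t in having_antecedent if consequent <= t]
    return Fraction(len(having_both), len(having_antecedent))

MINCONF = Fraction(1, 2)
c_xy = confidence({"X"}, {"Y"})
c_yz = confidence({"Y"}, {"Z"})
c_xz = confidence({"X"}, {"Z"})

print(c_xy, c_yz, c_xz)      # 1/2 1/2 1/4
print(c_xz == MINCONF ** 2)  # True
```

Both X ⇒ Y and Y ⇒ Z reach minconf = 1/2 exactly, while X ⇒ Z only reaches 1/4.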
4) If the rule A ⇒ (L − A) does not reach the minimum confidence, then no rule B ⇒ (L − B) reaches the minimum confidence either, where L, A and B are itemsets and B ⊆ A.
Indeed, by property TC1, since B ⊆ A we have support(B) ≥ support(A), and by the definition of confidence:
$$\mathrm{confidence}(B \Rightarrow (L - B)) = \frac{\mathrm{support}(L)}{\mathrm{support}(B)} \le \frac{\mathrm{support}(L)}{\mathrm{support}(A)} < \mathit{minconf}.$$
Let I → {1, …, |I|} be a one-to-one mapping from the items x ∈ I to the natural numbers. The items can then be regarded as totally ordered by the relation < on the natural numbers. Furthermore, for X ⊆ I, let X.item: {1, …, |X|} → I, n ↦ X.item_n be the mapping in which X.item_n is the n-th item of X when the items x ∈ X are sorted in increasing order under <. The n-prefix of an itemset X, with n ≤ |X|, is defined as P = {X.item_m | 1 ≤ m ≤ n}.
Let the classes E(P), P ⊆ I, with E(P) = {X ⊆ I | |X| = |P| + 1 and P is a prefix of X}, be the nodes of a tree. Two nodes are connected by an edge if all itemsets of a class E can be generated by joining two itemsets of its parent class E', as in the example of Figure 3.
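The prefix classes can be sketched as follows. This is an illustration only: it assumes that items are single characters ordered alphabetically (standing in for the mapping onto natural numbers), and the helper names `n_prefix`, `prefix_classes` and `join_class` are hypothetical.

```python
from collections import defaultdict

def n_prefix(itemset, n):
    """Return the n smallest items of `itemset` under the total order <."""
    return tuple(sorted(itemset)[:n])

def prefix_classes(itemsets):
    """Group each itemset X under its (|X|-1)-prefix P, i.e. build the classes E(P)."""
    classes = defaultdict(list)
    for x in itemsets:
        ordered = tuple(sorted(x))
        classes[ordered[:-1]].append(ordered)
    return {p: sorted(members) for p, members in classes.items()}

def join_class(members):
    """Join every pair of itemsets of one class to generate the child itemsets."""
    children = []
    for index, a in enumerate(members):
        for b in members[index + 1:]:
            children.append(a + (b[-1],))
    return children

classes = prefix_classes(["AB", "AC", "AD", "BC"])
print(n_prefix("DBA", 2))           # ('A', 'B')
print(classes[("A",)])              # [('A', 'B'), ('A', 'C'), ('A', 'D')]
print(join_class(classes[("A",)]))  # [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D')]
```

Joining two members of E(A) yields itemsets that belong to the child classes E(AB) and E(AC), which is the edge relation of the tree described above.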
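L1 = {frequent 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
  Ck = apriori_gen(Lk-1);   // generate the candidate set from Lk-1
  for (each transaction T ∈ D) do
  begin
    CT = subset(Ck, T);     // candidates contained in T
    for (each candidate c ∈ CT) do c.count++;
  end;
  Lk = {c ∈ Ck | c.count ≥ minsup}
end;
return ∪k Lk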
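function apriori_gen(Lk-1: frequent (k-1)-itemsets): candidate k-itemsets
Begin
  Ck = ∅;
  For (each L1 ∈ Lk-1) do
    For (each L2 ∈ Lk-1) do
    begin
      If ((L1[1]=L2[1]) ∧ (L1[2]=L2[2]) ∧ ... ∧ (L1[k-2]=L2[k-2]) ∧ (L1[k-1]<L2[k-1])) then
      begin
        c = L1 ⋈ L2;                 // join step
        If not has_infrequent_subset(c, Lk-1) then
          Ck = Ck ∪ {c};             // add c to Ck
      end;
    end;
  Return Ck;
End;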
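Function that checks whether some (k-1)-subset of a candidate k-itemset is not frequent:
function has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)
Begin
  // uses the frequent itemsets of the previous level
  For (each (k-1)-subset s of c) do
    If s ∉ Lk-1 then return TRUE;
  return FALSE;
End;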
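The join and prune steps described by apriori_gen and has_infrequent_subset can be sketched in Python as follows. Itemsets are assumed to be represented as sorted tuples, and the last join condition uses a strict `<` so that each candidate is produced once.

```python
from itertools import combinations

def has_infrequent_subset(candidate, prev_frequent):
    """True if some (k-1)-subset of the candidate k-itemset is not frequent."""
    k = len(candidate)
    return any(s not in prev_frequent for s in combinations(candidate, k - 1))

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets from the set of frequent (k-1)-itemsets."""
    candidates = set()
    for l1 in prev_frequent:
        for l2 in prev_frequent:
            # Join step: same first k-2 items, last item of l1 smaller than that of l2.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: drop c if any (k-1)-subset is infrequent.
                if not has_infrequent_subset(c, prev_frequent):
                    candidates.add(c)
    return candidates

L2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(sorted(apriori_gen(L2)))  # [('B', 'C', 'E')]
```

With L2 = {AC, BC, BE, CE}, only BC and BE share a prefix, and all three 2-subsets of BCE are frequent, so BCE is the single candidate.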
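The pruning part of the Apriori_gen function above can be described by the following scheme:
For (each c ∈ Ck) do
  For (each (k-1)-subset s of c) do
    If (s ∉ Lk-1) then delete c from Ck;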
D (database)
TID   Items
1     {A, C, D}
2     {B, C, E}
3     {A, B, C, E}
4     {B, E}

Scan all of D to obtain C1
1-itemset   Count - support
{A}         2 - 50%
{B}         3 - 75%
{C}         3 - 75%
{D}         1 - 25%
{E}         3 - 75%

Remove the items with support < minsup to obtain L1
1-itemset   Count - support
{A}         2 - 50%
{B}         3 - 75%
{C}         3 - 75%
{E}         3 - 75%

Join L1 & L1 to create C2
2-itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

Scan all of D to count C2
2-itemset   Count - support
{A, B}      1 - 25%
{A, C}      2 - 50%
{A, E}      1 - 25%
{B, C}      2 - 50%
{B, E}      3 - 75%
{C, E}      2 - 50%

Remove the items with support < minsup to obtain L2
2-itemset   Count - support
{A, C}      2 - 50%
{B, C}      2 - 50%
{B, E}      3 - 75%
{C, E}      2 - 50%

Join L2 & L2 to create C3, then scan all of D
3-itemset   Count - support
{B, C, E}   2 - 50%

Remove the items with support < minsup to obtain L3
3-itemset   Count - support
{B, C, E}   2 - 50%
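The trace above can be reproduced with a short level-wise search. This is a sketch under two assumptions: minsup is taken to be 50% (2 of 4 transactions), which is consistent with the itemsets removed in the tables, and candidates are generated by a simplified join (the union of two frequent (k-1)-itemsets that has exactly k items) followed by the usual subset pruning.

```python
from itertools import combinations

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
MIN_COUNT = 2  # assumption: minsup = 50% of 4 transactions

def count_support(candidates, db):
    """Scan the database once and count each candidate."""
    return {c: sum(1 for t in db if set(c) <= t) for c in candidates}

def apriori(db, min_count):
    """Return a list of dicts, one per level k, mapping frequent k-itemsets to counts."""
    items = sorted({item for t in db for item in t})
    candidates = [(item,) for item in items]
    levels = []
    while candidates:
        counts = count_support(candidates, db)
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k = len(candidates[0]) + 1
        # Simplified join: unions of two frequent sets that have exactly k items.
        joined = {tuple(sorted(set(a) | set(b)))
                  for a in frequent for b in frequent
                  if len(set(a) | set(b)) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = sorted(c for c in joined
                            if all(s in frequent for s in combinations(c, k - 1)))
    return levels

for level in apriori(D, MIN_COUNT):
    print(sorted(level.items()))
```

The three printed levels correspond to L1, L2 and L3 in the tables: four frequent items, the pairs AC, BC, BE, CE, and the single triple BCE with count 2.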
for all t ∈ C̄k-1 do
begin
  // determine the candidates in Ck that are contained in the transaction
  // with identifier t.TID (transaction code)
  Ct = {c ∈ Ck | (c − c[k]) ∈ t.Set_of_ItemSets ∧ (c − c[k-1]) ∈ t.Set_of_ItemSets};
  for all candidates c ∈ Ct do c.count++;
  if (Ct ≠ ∅) then C̄k += <t.TID, Ct>;
end
Lk = {c ∈ Ck | c.count ≥ minsup};
End
return ∪k Lk;
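t maps subsets of I to subsets of T: t(X) = {y ∈ T | ∀x ∈ X, x R y}, X ⊆ I.
i maps subsets of T to subsets of I: i(S) = {x ∈ I | ∀y ∈ S, x R y}, S ⊆ T.
Here x R y means that item x occurs in transaction y.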
Example: given the database D with six transactions (1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD), we have:
t(AB) = 1345;
t(BCD) = 56;
t(E) = 12345
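For X ⊆ I and S ⊆ T, define the composite mappings:
Cit(X) = i(t(X)), which maps subsets of I to subsets of I
Cti(S) = t(i(S)), which maps subsets of T to subsets of T
Example:
Cit(AB) = i(t(AB)) = i(1345) = ABE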
A closure mapping f satisfies X ⊆ f(X), and if X ⊆ Y then f(X) ⊆ f(Y).
For X ⊆ I, the closure of X is X+ = Cit(X).
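Example: consider the database D above.
A+ = ABE because Cit(A) = i(t(A)) = i(1345) = ABE
B+ = B because Cit(B) = i(t(B)) = i(123456) = B
AC+ = ABCE because Cit(AC) = i(t(AC)) = i(45) = ABCE (transactions 4 and 5 are ABCE and ABCDE, and B occurs in every transaction)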
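The mappings t and i and the closure Cit can be sketched directly on the six-transaction database. The representation below (a dict from TID to a string of items) is an assumption made for illustration.

```python
# Six-transaction example database, keyed by TID.
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
ITEMS = frozenset("ABCDE")

def t(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    return frozenset(tid for tid, row in D.items() if set(itemset) <= set(row))

def i(tids):
    """Items common to all transactions in `tids` (all items if `tids` is empty)."""
    common = set(ITEMS)
    for tid in tids:
        common &= set(D[tid])
    return frozenset(common)

def closure(itemset):
    """Cit(X) = i(t(X))."""
    return i(t(itemset))

def show(itemset):
    return "".join(sorted(itemset))

print(sorted(t("AB")))       # [1, 3, 4, 5]
print(show(closure("AB")))   # ABE
print(show(closure("B")))    # B
print(show(closure("AC")))   # ABCE, since i({4, 5}) is the intersection of ABCE and ABCDE
```

An itemset X is closed exactly when `closure(X) == frozenset(X)`.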
Definition 5: Closed frequent itemset
Let X ⊆ I be a frequent itemset with respect to the threshold minsupp. X is called a closed frequent itemset with respect to minsupp if X = X+ = Cit(X).
For example, in the database above, B and BC are closed frequent itemsets for minsupp = 0.4, because Cit(B) = B, Cit(BC) = BC, supp(B) = 1 and supp(BC) = 0.67.
BCD is not a closed frequent itemset for minsupp = 0.4: Cit(BCD) = BCD, but supp(BCD) = 0.33 < minsupp.
Definition 6: Closure of a family of itemsets
Let K be a family of subsets of I that satisfy minsupp. We define K+ = {X+ | X ∈ K} as the closure of the family K.
Output: K = {X+ | X ∈ I, supp(X) ≥ minsupp}
Method:
  K := ∅;
  For each X ∈ I do
    If (supp(X) ≥ minsupp) then
      K := K ∪ {Cit(X)}
    Endif;
  Endfor;
  Return (K);
Algorithm 2: Find the closed sets, i.e. the fixed points Fix(Cit)
Format: Fix(T, I, minsupp)
Input: database D, minsupp, the set of items I
       K = Fred_1_Item(T, I, minsupp)
Output: K+ = {X ∈ K | X = X+ and supp(X) ≥ minsupp}
Method:
  K+ := ∅;
  While (K ≠ K+) do
    K+ := K;
    K1 := {X ∪ Y | X, Y ∈ K};
    K2 := ∅;
    For each X ∈ K1 do
      K2 := K2 ∪ {X+}
    Endfor
    Frequent(K1, K2, minsupp, K+);
    K := K1;
  Endwhile;
  Return(K+);
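Algorithm 3: Find the frequent sets of K
Format: Frequent(K1, K, minsupp, K+)
Input: K, a family of subsets of I; minsupp; K1, sets already known to satisfy supp(X) ≥ minsupp
Output: K1 extended with the sets X ∈ K such that supp(X) ≥ minsupp
Method:
  K2 := ∅;
  For each X ∈ K do
    If (∃Y ∈ K1) and (X ⊆ Y) then
      K1 := K1 ∪ {X}
    Else
      If not((∃Y ∈ K2) and (X ⊇ Y)) then
        If supp(X) ≥ minsupp then
          K1 := K1 ∪ {X}
        Else
          K2 := K2 ∪ {X};
        Endif;
      Endif;
    Endif;
  Endfor;
  Return(K1);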
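Example: consider the database D above, with I = {A, B, C, D, E} = ABCDE; T = {1, 2, 3, 4, 5, 6} = 123456; minsupp = 0.4 (corresponding to 3 transactions).
Applying Algorithm 1 gives K = {ABE, B, BC, BD, BE}.
Applying Algorithm 2 with input K = {ABE, B, BC, BD, BE}, the pairwise unions not already in K are ABCE, ABDE, BCD, BCE and BDE. Taking closures gives K2 = {ABCE, ABDE, BCD, BCE}, because BDE+ = i(t(BDE)) = i(135) = ABDE.
Applying Algorithm 3 with input K1 = {ABE, B, BC, BD, BE} gives the output {ABE, B, BC, BD, BE, ABDE, BCE}; ABCE and BCD occur in only 2 transactions and are discarded.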
Remark: the above presents an improvement to frequent itemset mining that uses theoretical results on closure mappings and closures. The algorithm avoids having to find all frequent itemsets; instead it only has to find the smaller number of closed frequent itemsets, which considerably improves computation speed when the data volume is large.
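The idea of reaching the closed frequent itemsets as a fixed point can be sketched as follows. This is not a line-by-line transcription of Algorithms 1 to 3: it starts from the closures of the frequent single items, repeatedly closes the frequent pairwise unions, and stops when nothing new appears. The threshold of 3 transactions corresponds to minsupp = 0.4 on six transactions.

```python
D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
ITEMS = "ABCDE"

def t(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    return frozenset(tid for tid, row in D.items() if set(itemset) <= set(row))

def closure(itemset):
    """Cit(X): items shared by all transactions that contain X."""
    common = set(ITEMS)
    for tid in t(itemset):
        common &= set(D[tid])
    return frozenset(common)

def closed_frequent(min_count):
    """Iterate 'union, filter by support, close' until a fixed point is reached."""
    k = {closure(x) for x in ITEMS if len(t(x)) >= min_count}
    while True:
        unions = {a | b for a in k for b in k}
        new = {closure(u) for u in unions if len(t(u)) >= min_count}
        if new <= k:
            return k
        k |= new

result = sorted("".join(sorted(x)) for x in closed_frequent(3))
print(result)  # ['ABDE', 'ABE', 'B', 'BC', 'BCE', 'BD', 'BE']
```

BDE does not appear in the result: it is frequent, but transactions 1, 3 and 5 all contain A as well, so its closure is ABDE.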
With Apriori, C2 contains $\binom{|L_1|}{2}$ candidates, so determining L2 from C2 by scanning the entire database and checking every transaction against C2 is very expensive. By building a considerably smaller C2, the DHP algorithm performs the counting over C2 much faster than Apriori.
generate the rule "s ⇒ (L − s)" if support(L)/support(s) ≥ min_conf
L = { 1, 2, 3, 5, 7 }, X = { 1, 3 }
Consequents shown: { 5 }, { 7 }, { 2, 5 }, { 2, 7 }, { 5, 7 }, { 2, 5, 7 }
Figure 6: illustrative example
In the example above, the rule {1, 3} => {7} does not satisfy minConf, so the rules {1, 3} => {2, 7}, {1, 3} => {5, 7} and {1, 3} => {2, 5, 7} no longer need to be considered.
With this observation, some of the improvements used for the frequent itemset problem can also be applied here. Note, however, that the cardinality |L| is not very large here, and the values supp(X ∪ Y) and supp(Y) can be assumed to have been stored already (see the PHP algorithm), so some of those improvements may become unnecessary.
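The pruning described above (once X => Y fails minConf, every consequent that contains Y is skipped) can be sketched as follows. The five-transaction database and the threshold 0.6 are hypothetical, chosen so that {1, 3} => {7} fails as in the example; the function name is also an illustrative choice.

```python
from itertools import combinations

# Hypothetical database in which the antecedent {1, 3} occurs in every transaction.
DB = [{1, 2, 3, 5, 7}, {1, 2, 3, 5}, {1, 2, 3, 5}, {1, 2, 3, 5}, {1, 2, 3}]

def rules_for_antecedent(L, X, db, minconf):
    """Return the consequents Y of the rules X => Y that reach minconf,
    and the number of confidence evaluations actually performed."""
    X = frozenset(X)
    rest = sorted(frozenset(L) - X)
    count = lambda s: sum(1 for t in db if s <= t)
    base = count(X)  # assumed non-zero: X occurs in the database
    valid, failed, evaluated = [], [], 0
    for size in range(1, len(rest) + 1):
        for y in map(frozenset, combinations(rest, size)):
            if any(f <= y for f in failed):
                continue  # Y contains a consequent that already failed
            evaluated += 1
            if count(X | y) / base >= minconf:
                valid.append(y)
            else:
                failed.append(y)
    return valid, evaluated

valid, evaluated = rules_for_antecedent({1, 2, 3, 5, 7}, {1, 3}, DB, 0.6)
print(sorted(sorted(y) for y in valid))  # [[2], [2, 5], [5]]
print(evaluated)                         # 4
```

Only 4 of the 7 possible consequents are evaluated: {2, 7}, {5, 7} and {2, 5, 7} are skipped because {7} failed.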
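4.2 Improvement 1.a: avoid generating meaningless rules
Another property worth noting is that if a rule r: X => Y satisfies conf(r) ≥ minConf, then the rule obtained by moving an item i ∈ Y to the left-hand side also satisfies the minimum confidence minConf:
If r: X => Y, conf(r) ≥ minConf and i ∈ Y, then conf(X ∪ {i} => Y \ {i}) ≥ minConf,
because both rules cover the same itemset X ∪ Y, while supp(X ∪ {i}) ≤ supp(X).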
[Figure: components of a search engine: Robots, Search Engine, Internet, Query Server, Database]
3.4 Depth limit
One issue for robots is the maximum depth they are allowed to reach while traversing a Web site. In the depth traversal example above, the start page has depth 0, and the grey levels of the pages indicate three levels of links with depths 1, 2 and 3. For some Web sites, the most important information is usually close to the home page, and pages at greater depth are usually less relevant to the main topic. On other sites, the first levels contain mostly links, while the detailed content lies at deeper levels. In that case, robots must make sure that the detail pages are indexed, because they are valuable to people who want to search the Web site. There are also robots that index only the first few levels in order to save storage space.
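A depth limit can be sketched with a breadth-first traversal over an in-memory link graph. The site structure below is hypothetical, and no network access is performed; breadth-first order is used here so that each page receives the smallest number of links separating it from the start page.

```python
from collections import deque

def crawl(links, start, max_depth):
    """Return {page: depth} for every page reachable within `max_depth` links."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical site: the home page links to listing pages, which link to details.
LINKS = {
    "/": ["/jobs", "/about"],
    "/about": ["/"],
    "/jobs": ["/jobs/it", "/jobs/accounting"],
    "/jobs/it": ["/jobs/it/1"],
    "/jobs/it/1": ["/jobs/it/1/apply"],
}

print(crawl(LINKS, "/", 2))
```

With `max_depth=2` the detail page `/jobs/it/1` is never visited, which illustrates the trade-off described above: a small limit saves storage but can miss the pages that hold the detailed content.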
3.5 Network congestion
Web robots, like browsers, can open many connections to a Web server to read data. However, this can overload servers by forcing them to answer a large number of robot requests. When monitoring server activity or analysing the log of external requests, a network administrator may notice a great many requests coming from the same IP address and may block the robot from accessing any further information from that server.
Many Web robots have a mechanism for setting a delay between requests to the same server. This is extremely important when the robot originates from a …
Job seekers
  Summary: full name, age, address
  Job title
  Requirements
  Abilities
  Required experience
  Type of work
  Salary
http://works.vn
Job offers
  Company overview: overview, size, address
http://www.timviecnhanh.com
Job seekers
  Summary: full name, date of birth, gender, marital status, address, phone, education level, email
  Desired job: job title, job description, salary
Job offers
  Company overview: company, address, description, phone, size, operating criteria, website
  Job details: job title/position, number of vacancies, field/industry, work location, minimum skills, minimum education level, required experience, gender requirement, form of employment, salary
Table Ngành (Industry)
Int
Nvarchar(100)
TenUngVien
Nguyn Vn dng
Nguyn Vn h
Nguyn Th Linh
Nguyn Th Hng Ngn
inh Mnh Dng
Phm th Linh
Phm Cng Tm
Phm th thu h
Trn thanh tng
th h
Trn bc thy
Trn th thy
Trn th phng
Phm thanh tng
Phm thanh hng
.
[Table: candidate data with the age attribute split into ranges. Columns: TenUngVien (candidate name), GioiTinh (gender: Nam = male, Nữ = female), Dotuoi<20, 20<Dotuoi<23, 23<Dotuoi<26, 26<Dotuoi (0/1 indicators), industry code (values CNTT and Kt), TenCv (job title). Job titles in the table: Lập trình viên (programmer), Quản trị mạng (network administrator), Kỹ thuật viên (technician), Đồ họa máy tính (computer graphics), kế toán viên (accountant), kế toán trưởng (chief accountant).]
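Construction industry (Ngành Xây dựng), symbol:
Culture and tourism industry (Ngành Văn hóa du lịch), symbol:
Accounting industry (Ngành Kế toán), symbol: F
Management industry (Ngành Quản trị), symbol:
Age: Dotuoi <20, symbol:
Gender: Nam (male), symbol: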
[Table: the same candidate data after encoding. Columns: TenUngVien, TenCv, MaNganh, Dotuoi (Dotuoi<20, 20<Dotuoi<23, 23<Dotuoi<26, 26<Dotuoi), GioiTinh (Nam, Nữ). Codes appearing in the cells: F, H, I, K, M, N.]
Association rule                                 Conf
…                                   0.7104       0.9023
…                                   0.4409       0.9266
…                                   0.554        0.9687
…                                   0.854        0.9885
…                                   0.5573       0.9765
…                                   0.4901       0.9654
Accounting => Dotuoi[20-23]         0.4409       0.9605
Accounting => Dotuoi[23-26]         0.6737       0.9722
Accounting => Dotuoi[>26]           0.5081       0.9117
Accounting => Male                  0.5409       0.9166
Accounting => Female                0.5737