Bài 5. Phân L P

KHAI THÁC DỮ LIỆU & KHAI PHÁ TRI THỨC
Data Mining & Knowledge Discovery
Bài 5. Phân lớp/ Classification
Mã MH: 505043
TS. HOÀNG Anh
1
Chương 8. Phân loại: Các khái niệm cơ bản
n Bài toán Classification: Các khái niệm cơ bản
n Cây quyết định/ Decision Tree Induction
n Phân loại Bayes/ Bayes Classification Methods
n Phân loại dựa trên luật/ Rule-Based Classification
n Đánh giá và lựa chọn model/ Model Evaluation and
Selection
n Những kỹ thuật tăng độ chính xác trong bài toán phân
loại: Ensemble Methods
n Tóm tắt/ Summary
2
Học có giám sát vs. Học không giám sát
Supervised vs. Unsupervised Learning
n Học có giám sát/ Supervised learning (Phân loại/ classification)

n Giám sát/ Supervision: Dữ liệu huấn luyện có nhãn/ labels
cho biết lớp/class dữ liệu đó thuộc về
n Dữ liệu mới được phân lớp dựa trên dữ liệu huấn luyện
n Học không giám sát/ Unsupervised learning (Gom cụm/
clustering)
n Dữ liệu huấn luyện không bao gồm nhãn
n Tập các điểm dữ liệu được gom cụm dựa trên sự tương
đồng hay khác biệt.
3
Vấn đề dự đoán:
Dự đoán lớp vs. Dự đoán số
n Phân lớp/ Classification
n Dự đoán nhãn lớp phân loại (rời rạc hoặc danh nghĩa)
n Phân loại dữ liệu (constructs a model) dựa trên tập huấn

luyện và giá trị của nhãn (class labels) trong thuộc tính được
phân loại, và sử dụng model để phân lớp dữ liệu mới.
n Dự đoán số/ Numeric Prediction
n Mô hình các hàm có giá trị liên tục, ví dụ dự đoán giá trị bị
mất/ missing values
n Các ứng dụng:
n Phê duyệt tín dụng/ cho vay: Credit/loan approval
n Chẩn đoán y tế/ Medical diagnosis
n Phát hiện gian lận/ Fraud detection
n Phân loại trang/ Web page categorization
4
Phân lớp/ Classification
A Two-Step Process
n Xây dựng mô hình/ Model construction: mô tả tập các lớp đã xác định
n Mỗi dữ liệu/ tuple/sample được xem đã thuộc về một lớp, được xác định
bởi thuộc tính nhãn lớp/ class label attribute

n Tập dữ liệu được dùng xây dựng mô hình là tập huấn luyện/ training set
n Mô hình được biểu diễn dưới dạng luật phân loại, cây quyết định, hoặc
công thức toán học.

n Sử dụng mô hình/ Model usage: để phân loại các đối tượng chưa biết
n Ước tính độ chính xác của mô hình
n Nhãn đã biết của mẫu thử được so sánh với kết quả được phân loại từ
mô hình
n Tỉ lệ chính xác là tỉ lệ phần trăm mẫu thử được phân loại chính xác
theo mô hình
n Tập kiểm tra độc lập với tập huấn luyện (otherwise overfitting)
n Khi độ chính xác được chấp nhận, sử dụng mô hình để classify new data
n Lưu ý: Nếu test set được dùng để chọn mô hình, nó được gọi là validation
(test) set 5
Quá trình (1): Xây dựng mô hình
Classification
Algorithms
Training
Data
NAM E RANK YEARS TENURED Classifier

M ike Assistant Prof 3 no (Model)
M ary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
IF rank = ‘professor’
Dave Assistant Prof 6 no
OR years > 6
Anne Associate Prof 3 no
THEN tenured = ‘yes’
6
Quá trình (2): Sử dụng mô hình trong dự đoán
Classifier
Testing
Data Unseen Data
(Jeff, Professor, 4)
NAM E RANK YEARS TENURED
Tom Assistant Prof 2 no Tenured?
M erlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
7
Selection
8
Cây quyết định: ví dụ
age income student credit_rating buys_computer
<=30 high no fair no
q Tập huấn luyện: Mua máy tính <=30 high no excellent no
q The data set follows an example of 31…40 high no fair yes
>40 medium no fair yes
Quinlan’s ID3 (Playing Tennis) >40 low yes fair yes
q Kết quả: >40 low yes excellent no
31…40 low yes excellent yes
age? <=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
<=30 overcast
31..40 >40 31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
student? yes credit rating?
no yes excellent fair
no yes no yes
9
Thuật toán cây quyết định
n Thuật toán cơ bản (a greedy algorithm)
n Cây được xây dựng theo cách chia để trị đệ qui từ trên xuống
n Khi bắt đầu, tất cả dữ liệu huấn luyện đều ở gốc/ root node
n Các thuộc tính được phân loại (if continuous-valued, they are
discretized in advance)
n Các điểm dữ liệu được phân vùng đệ qui dựa trên các thuộc tính
đã chọn
n Các thuộc tính kiểm tra được chọn ngẫu nhiện hoặc trên cơ sở
đo lường thống kê (e.g., information gain)

n Điều kiện dừng phân vùng/ partitioning
n Tất cả điểm dữ liệu trên 1 node đều thuộc về cùng 1 lớp/ class
n Không còn thuộc tính nào để phân vùng tiếp – majority voting
is employed for classifying the leaf

n Không còn điểm dữ liệu nào!
10
Entropy
n
m=2
11
Chọn thuộc tính: Độ lợi thông tin
Information Gain (ID3/C4.5)
n Lựa chọn đặc trưng với độ lợi thông tin cao nhất
n Đặt pi là xác suất để một điểm dữ liệu trong D thuộc về lớp Ci,
ước tính bằng |Ci, D|/|D|
n Thông tin kỳ vọng/ Expected information (entropy) cần thiết để
m
phân lớp dữ liệu trong D: Info( D) = -å pi log 2 ( pi )
i =1
n Thông tin cần thiết/ Information needed (after using A to split D

into v partitions) phân lớp D: v | D |
Info A ( D ) = å
j
´ Info( D j )
j =1 | D |
n Độ lợi thông tin/ Information gained bằng cách phân nhánh trên
thuộc tính A Gain(A) = Info(D) - Info (D) A
12
Attribute Selection: Information Gain
g Class P: buys_computer = “yes” 5 4
Infoage ( D ) = I (2,3) + I (4,0)
g Class N: buys_computer = “no” 14 14
9 9 5 5 5
Info( D) = I (9,5) = - log 2 ( ) - log 2 ( ) =0.940 + I (3,2) = 0.694
14 14 14 14 14
age pi ni I(pi, ni) 5
<=30 2 3 0.971 I (2,3) means “age <=30” has 5 out of
14
14 samples, with 2 yes’es and 3
31…40 4 0 0
>40 3 2 0.971 no’s. Hence
age
<=30
income student credit_rating
high no fair
buys_computer
no
Gain(age) = Info( D) - Infoage ( D) = 0.246
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes Similarly,
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes Gain(income) = 0.029
<=30 medium no fair no
<=30
>40
low
medium
yes fair
yes fair
yes
yes
Gain( student ) = 0.151
<=30
31…40
medium
medium
yes excellent
no excellent
yes
yes Gain(credit _ rating ) = 0.048
>40 medium no excellent no 13
Tính Information-Gain
cho các thuộc tính có giá trị liên tục
n Thuộc tính A có các giá trị liên tục
n Phải xác định điểm phân chia tốt nhất/ best split point cho A
n Sắp xếp giá trị của A tăng dần
n Thông thường, điểm trung bình/ midpoint mỗi cặp giá trị liền
kề được xem là split point
n (ai+ai+1)/2 là điểm trung bình giữa 2 giá trị ai và ai+1
n Điểm minimum expected information requirement cho A được
chọn làm điểm split-point cho A
n Phân chia/ Split:
n D1 là tập dữ liệu con của D thỏa mãn A ≤ split-point, và D2 là
tập dữ liệu con của D thỏa mãn A > split-point
14
Gain Ratio cho lựa chọn thuộc tính (C4.5)
n Thước đo độ lợi thông tin thiên về các thuộc tính có số lượng giá
trị lớn.
n C4.5 (tiếp theo ID3) sử dụng gain ratio để khắc phục vấn đề trên
(normalization to information gain)
v | Dj | | Dj |
SplitInfo A ( D) = -å ´ log 2 ( )
j =1 |D| |D|
n GainRatio(A) = Gain(A)/SplitInfo(A)
n Ví dụ.
n gain_ratio(income) = 0.029/1.557 = 0.019

n Thuộc tính với giá trị cực đại gain ratio sẽ được chọn làm điểm
phân chia thuộc tính đó.
15
Chỉ số Gini
(CART, IBM IntelligentMiner)
n Nếu tập dữ liệu D chứa các mẫu từ n lớp, chỉ số gini, gini(D)
được định nghĩa n 2
gini( D) = 1- å p j
j =1
trong đó pj là tần số tương đối của lớp j trong D
n Nếu một tập dữ liệu D được chia trên A thành 2 tập con D1 và
D2, chỉ số gini |D | |D |
gini A ( D) = 1 gini( D1) + 2 gini( D 2)
|D| |D|
n Giảm tạp chất:
Dgini( A) = gini( D) - giniA ( D)
n Thuộc tính có ginisplit(D) nhỏ nhất (or the largest reduction in
impurity) được chọn để chia D (need to enumerate all the
possible splitting points for each attribute)
16
Tính chỉ số Gini
n Ví dụ. D có 9 mẫu buys_computer = “yes” và 5 mẫu “no”
2 2
æ9ö æ5ö
gini ( D) = 1 - ç ÷ - ç ÷ = 0.459
è 14 ø è 14 ø
n Giả sử thuộc tính thu nhập/incomes chia D thành 10 D1: {low,
medium} và 4 D2 {high}
æ 10 ö æ4ö
giniincomeÎ{low,medium} ( D) = ç ÷Gini( D1 ) + ç ÷Gini( D2 )
è 14 ø è 14 ø
Gini{low,high} là 0.458; Gini{medium,high} là 0.450. Do đó, chia theo

{low,medium} (and {high}) vì chỉ số Gini là bé nhất
n Tất cả thuộc tính được mặc định có giá trị là liên tục
n Cần các công cụ khác, như gom cụm, để có các giá trị phân chia
n Có thể được thay đổi với các thuộc tính có giá trị phi số 17
So sánh các phương pháp lựa chọn thuộc tính
n Thông thường, có 3 phương pháp cho kết quả tốt, nhưng:

n Độ lợi thông tin/ Information gain:
n Thiên về các thuộc tính đa giá trị
n Gain ratio:
n Có xu hướng phân chia mất cân đối, trong đó một phân
vùng nhỏ hơn nhiều so với các phân vùng khác
n Chỉ số Gini index:
n Thiên về các thuộc tính đa giá trị
n Khó phân vùng khi số lượng lớp tăng lên
n Có xu hướng các phân vùng có kích thước bằng nhau
18
Các phương pháp lựa chọn thuộc tính khác
n CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
n C-SEP: performs better than info. gain and gini index in certain cases
n G-statistic: has a close approximation to χ2 distribution
n MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
n The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
n Multivariate splits (partition based on multiple variable combinations)
n CART: finds multivariate splits based on a linear comb. of attrs.
n Which attribute selection measure is the best?
n Most give good results, none is significantly superior than others
19
Overfitting và Tree Pruning
n “Quá khớp”/ Overfitting: Cây qui nạp có thể overfit tập dữ liệu
huấn luyện
n Quá nhiều nhánh, một số phản ánh sự bất lượng do nhiễu
n Độ chính xác thấp đối với điểm dữ liệu mới
n 2 phương pháp tránh overfitting

n Cắt tỉa trước/ Prepruning: Halt tree construction early ̵ dừng
tách node nếu việc này làm giảm chỉ số đo, dưới ngưỡng
n Khó chọn ngưỡng
n Cắt tỉa sau/ Postpruning: Remove branches từ cây hoàn
chỉnh—thu được chuỗi các cây được cắt tỉa dần

n Sử dụng tập dữ liệu khác với tập huấn luyện để thu được
“best pruned tree”

20
Các cải tiến trên Basic Decision Tree Induction
n Cho phép giá trị đặc trưng là liên tục/ continuous-valued

n Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
n Xử lý missing attribute values
n Assign the most common value of the attribute
n Assign probability to each of the possible values
n Xây dựng thuộc tính/ Attribute construction
n Create new attributes based on existing ones that are sparsely
represented
n This reduces fragmentation, repetition, and replication
21
Phân lớp đối với dữ liệu lớn
n Phân lớp—một vấn đề cổ điển được nghiên cứu rộng rãi bởi các
nhà thống kê và các nhà nghiên cứu học máy.
n Khả năng mở rộng: Phân loại tập dữ liệu với hàng triệu mẫu, với
hàng trăm thuộc tính, với tốc độ hợp lý
n Tại sao decision tree induction phổ biến?
n Tốc độ học nhanh (than other classification methods)
n Chuyển đổi đơn giãn và dễ hiểu thành các qui tắc phân loại
n Có thể sử dụng truy vấn SQL để truy cập CSDL
n Độ chính xác có thể so sánh được với các pp phân loại khác
n RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

n Builds an AVC-list (attribute, value, class label)
22
Khả năng mở rộng với RainForest
n Separates the scalability aspects from the criteria that determine

the quality of the tree
n Builds an AVC-list: AVC (Attribute, Value, Class_label)
n AVC-set (of an attribute X )
n Projection of training dataset onto the attribute X and class
label where counts of individual class label are aggregated
n AVC-group (of a node n )
n Set of AVC-sets of all predictor attributes at the node n
23
Rainforest: Training Set and Its AVC Sets
Training Examples AVC-set on Age AVC-set on income

age income studentcredit_rating
buys_computerAge Buy_Computer income Buy_Computer
<=30 high no fair no yes no
<=30 high no excellent no yes no
high 2 2
31…40 high no fair yes <=30 2 3
31..40 4 0 medium 4 2
>40 medium no fair yes
>40 low yes fair yes >40 3 2 low 3 1
AVC-set on
<=30 medium no fair no AVC-set on Student
credit_rating
student Buy_Computer
>40 medium yes fair yes Credit
Buy_Computer
<=30 medium yes excellent yes yes no rating yes no

31…40 medium no excellent yes yes 6 1 fair 6 2
31…40 high yes fair yes no 3 4 excellent 3 3
24
BOAT (Bootstrapped Optimistic
Algorithm for Tree Construction)
n Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
n Each subset is used to create a tree, resulting in several trees
n These trees are examined and used to construct a new tree T’
n It turns out that T’ is very close to the tree that would be
generated using the whole data set together
n Adv: requires only two scans of DB, an incremental alg.
25
Presentation of Classification Results
August 25, 2023 Data Mining: Concepts and Techniques 26

Visualization of a Decision Tree in SGI/MineSet 3.0
August 25, 2023 Data Mining: Concepts and Techniques 27

Interactive Visual Mining
by Perception-Based Classification (PBC)
Data Mining: Concepts and Techniques 28

Selection
29
Phân loại Bayesian: Tại sao?
n Phân loại thống kê: thực hiện dự đoán xác suất, như, dự đóan
xác suất thành viên lớp
n Foundation: dựa trên định lý Bayes’
n Performance: Phân loại Bayes đơn giản, naïve Bayesian
classifier, có hiệu suất tương đương cây quyết định và mạng
thần kinh
n Incremental: Mỗi dữ liệu huấn luyện làm tăng/giảm xác suất một
giả thuyết là đúng— kiến thức trước đó có thể kết hợp với dữ
liệu quan sát được
n Standard: Ngay cả khi phương pháp Bayes khó tính toán, chúng
vẫn có thể cung cấp một tiêu chuẩn cho việc ra quyết định tối ưu
mà các phương pháp khác có thể đo lường được.
30
Định lý Bayes’: Cơ bản
M
n Định lý xác suất tổng: P(B) = å P(B | A )P( A )
i i
i =1
n Định lý Bayes’: P( H | X) = P(X | H )P(H ) = P(X | H )´ P(H ) / P(X)

P(X)
n Với X là điểm dữ liệu (“evidence”): nhãn lớp không xác định
n H là giả thuyết/ hypothesis rằng X thuộc lớp C
n Phân lớp là để xác định P(H|X), (i.e., posteriori probability): xác suất hậu
nghiệm, là xác suất mà giả thuyết đúng với điều kiện X
n P(H) (prior probability): xác suất tiền nghiệm
n E.g., X will buy computer, regardless of age, income, …
n P(X): xác suất dữ liệu mẫu được quan sát

n P(X|H) (likelihood): xác suất quan sát mẫu X, với điều kiện giả thuyết
đúng
n E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
31
Dự đoán dựa trên định lý Bayes
n Cho tập huấn luyện X, posteriori probability of a hypothesis H,
P(H|X), định lý Bayes
P( H | X) = P( X | H ) P( H ) = P(X | H )´ P(H ) / P(X)

P(X)
n Informally, this can be viewed as
posteriori = likelihood x prior/evidence
n Dự đoán X thuộc lớp Ci nếu xác suất P(Ci|X) là cực đại trong số
P(Ck|X) đối với k lớp
n Khó khăn thực tế: đòi hỏi kiến thức ban đầu về nhiều xác suất,
liên quan đến chi phí tính toán
32
Phân loại: cực đại xác suất hậu nghiệm
n Tập huấn luyện D có nhãn, biểu diễn điểm dữ liệu bằng vector
X = (x1, x2, …, xn)
n Giả sử có m lớp C1, C2, …, Cm.
n Phân lớp để đạt cực đại xác suất hậu nghiệm, i.e., the maximal
P(Ci|X)
n Được thực hiện qua định lý Bayes
P(X | C )P(C )
P(C | X) = i i
i P(X)
n Vì P(X) như nhau với mọi lớp,
Cần tối đa: P(C | X) = P(X | C )P(C )

i i i
33
Phân loại Naïve Bayes
n Gỉa sử: các thuộc tính đều độc lập (i.e., no dependence relation
between attributes):
n
P( X | C i) = Õ P( x | C i) = P( x | C i) ´ P( x | C i) ´ ... ´ P( x | C i)
k 1 2 n
k =1
n Điều này giảm được chi phí tính toán: Chỉ tính phân phối lớp
n Nếu Ak là categorical, P(xk|Ci) là số điểm dữ liệu thuộc lớp Ci
mang giá trị xk trong Ak chia cho |Ci, D| (# of tuples of Ci in D)
n If Ak is continous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with a mean μ and standard deviation σ
and P(xk|Ci) is 1 -
( x-µ )2
g ( x, µ , s ) = e 2s 2
2p s
P ( X | C i ) = g ( xk , µ Ci , s Ci )
34
Phân loại Naïve Bayes: Tập huấn luyên
age income studentcredit_rating
buys_computer
Class: <=30 high no excellent no
C1:buys_computer = 31…40 high no fair yes
‘yes’ >40 medium no fair yes
C2:buys_computer = ‘no’ >40 low yes fair yes
Data to be classified:
X = (age <=30, <=30 low yes fair yes
Income = medium, >40 medium yes fair yes
Student = yes <=30 medium yes excellent yes
Credit_rating = Fair) 31…40 medium no excellent yes
35
Phân loại Naïve Bayes: Ví dụ age income studentcredit_rating
buys_computer
<=30 high no excellent no
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643

31…40 high no fair yes
n >40 medium no fair yes
>40 low yes fair yes
P(buys_computer = “no”) = 5/14= 0.357 >40
31…40
low
low
yes excellent
yes excellent
no
yes
n Compute P(X|Ci) for each class

>40 medium yes fair yes
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222<=30
31…40
medium yes excellent
medium no excellent
yes
yes
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 31…40

>40
high
medium
yes fair
no excellent
yes
no
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444

P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
n X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”) 36
Tránh xác suất bằng 0
n Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
n
P( X | C i) = Õ P( x k | C i)
k =1
n Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
n Use Laplacian correction (or Laplacian estimator)
n Adding 1 to each case
Prob(income = low) = 1/1003

Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
n The “corrected” prob. estimates are close to their
“uncorrected” counterparts
37
Phân loại Naïve Bayes: Bình luận
n Ưu điểm
n Easy to implement
n Good results obtained in most of the cases
n Hạn chế
n Assumption: class conditional independence, therefore loss of
accuracy
n Practically, dependencies exist among variables
n E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,

diabetes, etc.
n Dependencies among these cannot be modeled by Naïve
Bayes Classifier
n How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
38
Selection
39
Sử dụng luật Nếu-Thì/ IF-THEN
với bài toán phân lớp
n Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
n Rule antecedent/precondition vs. rule consequent
n Assessment of a rule: coverage and accuracy

n ncovers = # of tuples covered by R
n ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */

accuracy(R) = ncorrect / ncovers
n If more than one rule are triggered, need conflict resolution
n Size ordering: assign the highest priority to the triggering rules that has the
“toughest” requirement (i.e., with the most attribute tests)

n Class-based ordering: decreasing order of prevalence or misclassification
cost per class

n Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts

40
Trích xuất qui tắc từ cây quyết định
n Rules are easier to understand than
large trees age?
n One rule is created for each path from <=30 31..40 >40
the root to a leaf student? credit rating?
yes
n Each attribute-value pair along a path
no yes excellent fair
forms a conjunction: the leaf holds the
no yes no yes
class prediction
n Rules are mutually exclusive and
exhaustive
n Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes 41
Qui nạp: Phương pháp che phủ tuần tự
n Sequential covering algorithm: Extracts rules directly from training
data
n Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
n Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
n Steps:
n Rules are learned one at a time
n Each time a rule is learned, the tuples covered by the rules are
removed
n Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
n Comp. w. decision-tree induction: learning a set of rules
simultaneously
42
Thuật toán che phủ tuần tự
Sequential Covering Algorithm
while (enough target tuples left)
generate a rule
remove positive target tuples satisfying this rule
Examples covered
Examples covered by Rule 2
by Rule 1 Examples covered
by Rule 3
Positive
examples
43
Rule Generation
n To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
A3=1&&A1=2
A3=1&&A1=2
&&A8=5A3=1
Positive Negative
examples examples
44
Học 1 luật như thế nào/ Learn-One-Rule?
n Start with the most general rule possible: condition = empty
n Adding new attributes by adopting a greedy depth-first strategy
n Picks the one that most improves the rule quality
n Rule-Quality measures: consider both coverage and accuracy

n Foil-gain (in FOIL & RIPPER): assesses info_gain by
extending condition pos ' pos

FOIL _ Gain = pos '´(log 2 - log 2 )
pos '+ neg ' pos + neg
n favors rules that have high accuracy and cover many positive tuples
n Rule pruning based on an independent set of test tuples
pos - neg
FOIL _ Prune( R) =
pos + neg
Pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
45
Selection
46
Đánh giá và lựa chọn mô hình
n Các chỉ số đánh giá/ Evaluation metrics: Độ chính xác/
Accuracy? Và các chỉ số khác?
n Sử dụng validation test set thay vì tập huấn luyện/ training set
khi đánh giá độ chính xác.
n Các phương pháp ước tính độ chính xác:
n Holdout method, random subsampling
n Cross-validation
n Bootstrap
n So sánh các thuật toán phân lớp:
n Khoảng tin cậy/ Confidence intervals
n Phân tích/ Cost-benefit, và đường cong ROC
47
Các chỉ số đánh giá phân loại:
Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Example of Confusion Matrix:

Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000
n Cho m lớp, CMi,j là confusion matrix với số điểm dữ liệu i

được dán nhãn bởi lớp j
n Tổng số được tổng hợp ở hàng/cột mở rộng.
48
Chỉ số đánh giá: Accuracy, Error Rate,
Sensitivity và Specificity
A\P C ¬C n Vấn đề mất cân bằng lớp:
C TP FN P
n Một lớp có thể có rất ít điểm
¬C FP TN N
dữ liệu, ví dụ: fraud, or HIV-
P’ N’ All
positive
n Độ chính xác/ Accuracy, or n Significant majority of the
recognition rate: tỉ lệ bộ dữ liệu negative class and minority of

kiểm tra/ test, được phân loại the positive class
chính xác
Accuracy = (TP + TN)/All n Sensitivity: Tỉ lệ đúng thực sự
n Sensitivity = TP/P
n Tỉ lệ lỗi: 1 – accuracy, or n Specificity: Tỉ lệ sai thực sự
Error rate = (FP + FN)/All n Specificity = TN/N
49
Các chỉ số đánh giá:
Precision, Recall, và F-measures
n Precision: exactness – phần trăm bộ dữ liệu được thuật toán
phân loại là dương/ positive thực sự là dương.
n Recall: sự đầy đủ/ Cmpleteness – phần trăm bộ dữ liệu dương/

positive, được thuật toán phân loại là dương
n Điểm tuyệt đối là 1.0
n Mối quan hệ nghịch đảo giữa precision & recall
n F measure (F1 or F-score): trung bình điều hòa của precision
và recall,
n Fß: thước đo trọng số của precision và recall

n assigns ß times as much weight to recall as to precision
50
Các chỉ số đánh giá: Ví dụ
Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

cancer = yes 90 210 300 30.00 (sensitivity
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.40 (accuracy)
n Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
51
Đánh gía độ chính xác:
Phương pháp Holdout & Cross-Validation
n Holdout method
n Dữ liệu đã cho được phân chia ngẫu nhiên thành 2 phần độc lập
n Tập huấn luyện (e.g., 2/3) để xây dựng mô hình
n Tập kiểm tra (e.g., 1/3) để ước tính độ chính xác
n Lấy mẫu ngẫu nhiên: một biến thể của pp holdout
n Lặp lại holdout k lần, độ chính xác = trung bình của các lần
holdout.
n Cross-validation (k-fold, với k = 10 là phổ biến)
n Phân vùng ngẫu nhiên dữ liệu thành k mutually exclusive tập
con, với kích thước xấp xỉ nhau.
n Tại vòng lặp thứ i-th, sử dụng Di làm tập kiểm tra, và phần còn
lại làm tập huấn luyện.
n Leave-one-out: k folds where k = # of tuples, for small sized
data
n *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
52
Đánh gía độ chính xác: Bootstrap
n Bootstrap
n Làm việc tốt với tập dữ liệu nhỏ
n Lấy mẫu tập huấn luyện, có thể lặp/ replacement
n i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
n Một vài pp Boostrap, và phổ biến là .632 boostrap
n A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since (1
– 1/d)d ≈ e-1 = 0.368)
n Repeat the sampling procedure k times, overall accuracy of the model:
53
Ước tính khoảng tin cậy:
Classifier Models M1 vs. M2
n Giả sử có 2 thuật toán phân lớp, M1 và M2, cái nào tốt hơn?
n Sử dụng 10-fold cross-validation: và
n These mean error rates are just estimates of error on the true
population of future data cases
n What if the difference between the 2 error rates is just attributed

to chance?
n Use a test of statistical significance
n Obtain confidence limits for our error estimates
54
Null Hypothesis
n Thực hiện 10-fold cross-validation
n Giả sử mẫu tuân theo phân bổ/ t distribution với k–1 degrees of
freedom (here, k=10)
n Sử dụng t-test (or Student’s t-test)
n Null Hypothesis: M1 & M2 là như nhau
n Nếu có thể loại bỏ/ reject null hypothesis, thì
n Kết luận sự khác nhau giữa M1 và M2 mang tính chất thống
kê/ statistically significant
n Chọn mô hình với tỉ lệ lỗi thấp
55
Ước tính khoảng tin cậy: t-test
n Nếu chỉ có 1 tập kiểm tra: pairwise comparison

n For ith round of 10-fold cross-validation, the same cross
partitioning is used to obtain err(M1)i and err(M2)i
n Average over 10 rounds to get
and
n t-test computes t-statistic with k-1 degrees of
freedom:
where
n Nếu có 2 tập kiểm tra: sử dụng non-paired t-test

where
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
56
Table for t-distribution
n Symmetric
n Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
n Confidence limit, z
= sig/2
57
Statistical Significance
n Are M1 & M2 significantly different?
n Compute t. Select significance level (e.g. sig = 5%)
n Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)

n t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit

z=sig/2 (here, 0.025)
n If t > z or t < -z, then t value lies in rejection region:
n Reject null hypothesis that mean error rates of M1 & M2
are same
n Conclude: statistically significant difference between M1
& M2
n Otherwise, conclude that any difference is chance
58
Lựa chọn mô hình: ROC Curves
n ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
n Originated from signal detection theory
n Shows the trade-off between the true
positive rate and the false positive rate
n The area under the ROC curve is a n Vertical axis
measure of the accuracy of the model represents the true
positive rate
n Rank the test tuples in decreasing order: n Horizontal axis rep.
the one that is most likely to belong to the false positive rate
the positive class appears at the top of n The plot also shows a
the list diagonal line
n The closer to the diagonal line (i.e., the n A model with perfect
closer the area is to 0.5), the less accuracy will have an
accurate is the model area of 1.0
59
Các vấn đề ảnh hưởng việc chọn mô hình
n Độ chính xác/ Accuracy
n classifier accuracy: predicting class label
n Tốc độ/ Speed
n time to construct the model (training time)
n time to use the model (classification/prediction time)
n Robustness: handling noise and missing values
n Khả năng mở rộng/ Scalability: efficiency in disk-resident
databases
n Khả năng giải thích/ Interpretability
n understanding and insight provided by the model
n Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules 60
Selection
n Những kỹ thuật tăng độ chính xác trong bài toán
phân loại: Ensemble Methods
61
PP tổng hợp: Tăng độ chính xác
n Ensemble methods
n Sử dụng nhiều mô hình, tăng độ chính xác của thuật toán
n Kết hợp k mô hình học được, M1, M2, …, Mk, với mục đích
đạt được mô hình tốt hơn M*

n Các phương pháp tổng hợp phổ biến
n Bagging: averaging the prediction over a collection of
classifiers
n Boosting: weighted vote with a collection of classifiers
n Ensemble: combining a set of heterogeneous classifiers
62
Bagging: Boostrap Aggregation
n Analogy: Diagnosis based on multiple doctors’ majority vote
n Training
n Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)

n A classifier model Mi is learned for each training set Di
n Classification: classify an unknown sample X

n Each classifier Mi returns its class prediction
n The bagged classifier M* counts the votes and assigns the class with the
most votes to X
n Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
n Accuracy
n Often significantly better than a single classifier derived from D
n For noise data: not considerably worse, more robust
n Proved improved accuracy in prediction

63
Boosting
n Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
n How boosting works?
n Weights are assigned to each training tuple
n A series of k classifiers is iteratively learned
n After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
n The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
n Boosting algorithm can be extended for numeric prediction
n Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data 64
Adaboost (Freund and Schapire, 1997)
n Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
n Initially, all the weights of tuples are set the same (1/d)
n Generate k classifiers in k rounds. At round i,
n Tuples from D are sampled (with replacement) to form a training set Di
of the same size
n Each tuple’s chance of being selected is based on its weight
n A classification model Mi is derived from Di
n Its error rate is calculated using Di as a test set
n If a tuple is misclassified, its weight is increased, o.w. it is decreased
n Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weights of the misclassified tuples:
d
error ( M i ) = å w j ´ err ( X j )
j
n The weight of classifier Mi’s vote is 1 - error ( M i )
log
error ( M i )
65
Random Forest (Breiman 2001)
n Random Forest:
n Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to determine

the split
n During classification, each tree votes and the most popular class is
returned
n Two Methods to construct Random Forest:
n Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
n Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces

the correlation between individual classifiers)
n Comparable in accuracy to Adaboost, but more robust to errors and outliers
n Insensitive to the number of attributes selected for consideration at each split,
and faster than bagging or boosting
66
Classification of Class-Imbalanced Data Sets
n Class-imbalance problem: Rare positive example but numerous

negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
n Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
n Typical methods for imbalance data in 2-class classification:
n Oversampling: re-sampling of data from positive class
n Under-sampling: randomly eliminate tuples from negative

class
n Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
n Ensemble techniques: Ensemble multiple classifiers introduced
above
n Still difficult for class imbalance problem on multiclass tasks
67
Selection
68
Tóm tắt (I)
n Phân lớp/ Classification là một dạng phân tích dữ liệu rút ra mô
hình/ models mô tả các lớp dữ liệu quan trọng.
n Các phương pháp hiệu quả, có khả năng mở rộng đã được phát
triển cho decision tree induction, Naive Bayesian classification,
rule-based classification, và nhiều phương pháp phân lớp khác.
n Các chỉ số đánh giá/ Evaluation metrics: accuracy, sensitivity,
specificity, precision, recall, F measure, và Fß measure.
n Stratified k-fold cross-validation được đề xuất để tính độ chính
xác. Bagging và boosting có thể được dùng tăng độ chính xác
bằng việc học và kết hợp các mô hình đơn lẽ.
69
Tóm tắt (II)
n Significance tests và ROC curves là hữu ích cho việc lựa chọn
mô hình.
n Chủ đề nghiên cứu: Phương pháp so sánh các mô hình phân lớp
khác nhau/ comparisons of the different classification
n Không có một mô hình nào là tốt nhất đối với tất cả các kiểu/ loại
dữ liệu.
n Các vấn đề như độ chính xác, thời gian huấn luyện, sự mạnh mẽ,
khả năng mở rộng, và khả năng giải thích phải được nhắc đến và
cân đối để tìm ra mô hình phân lớp tốt nhất đối với từng bài toán
cụ thể.
70
References (1)
n C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
n C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
n L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression
Trees. Wadsworth International Group, 1984
n C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition.
Data Mining and Knowledge Discovery, 2(2): 121-168, 1998
n P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned
data for scaling machine learning. KDD'95
n H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern
Analysis for Effective Classification, ICDE'07
n H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
n W. Cohen. Fast effective rule induction. ICML'95
n G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups
for gene expression data. SIGMOD'05
71
References (2)
n A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall,
1990.
n G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
n R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
n U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
n Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. J. Computer and System Sciences, 1997.
n J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision
tree construction of large datasets. VLDB’98.
n J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision
Tree Construction. SIGMOD'99.
n T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
n D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
n W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on
Multiple Class-Association Rules, ICDM'01. 72
References (3)
n T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification
algorithms. Machine Learning, 2000.
n J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
n M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. EDBT'96.
n T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
n S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
n J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
n J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
n J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
n J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
73
References (4)
n R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building
and pruning. VLDB’98.
n J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for
data mining. VLDB’96.
n J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan
Kaufmann, 1990.
n P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
n S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification
and Prediction Methods from Statistics, Neural Nets, Machine Learning, and
Expert Systems. Morgan Kaufman, 1991.
n S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
n I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
n X. Yin and J. Han. CPAR: Classification based on predictive association rules.
SDM'03
n H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03. 74
CS412 Midterm Exam Statistics
n Opinion Question Answering:
n Like the style: 70.83%, dislike: 29.16%
n Exam is hard: 55.75%, easy: 0.6%, just right: 43.63%
n Time: plenty:3.03%, enough: 36.96%, not: 60%
n Score distribution: # of students (Total: 180)

n >=90: 24 n 60-69: 37 n <40: 2
n 80-89: 54 n 50-59: 15
n 70-79: 46 n 40-49: 2
n Final grading are based on overall score accumulation

and relative class distributions
76
Issues: Evaluating Classification Methods
n Accuracy
n classifier accuracy: predicting class label
n predictor accuracy: guessing value of predicted attributes
n Speed
n time to construct the model (training time)
n time to use the model (classification/prediction time)
n Robustness: handling noise and missing values

n Scalability: efficiency in disk-resident databases
n Interpretability
n understanding and insight provided by the model
n Other measures, e.g., goodness of rules, such as decision tree

size or compactness of classification rules
77
Predictor Error Measures
n Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
n Loss function: measures the error betw. yi and the predicted value yi’
n Absolute error: | yi – yi’|
n Squared error: (yi – yi’)2
n Test error (generalization error): the average loss over the test set
d
å | yi - yi ' |
d
n Mean absolute error:
i =1
Mean squared error: å(y
i =1
i - yi ' ) 2
d d
d
Relative absolute error: å | y - yi ' | Relative squared error: d
å(y
n i
i =1
d i - yi ' ) 2
å| y
i =1
i -y| i =1
d
The mean squared-error exaggerates the presence of outliers å(y

i =1
i - y)2
Popularly use (square) root mean-square error, similarly, root relative squared
error
78
Scalable Decision Tree Induction Methods
n SLIQ (EDBT’96 — Mehta et al.)

n Builds an index for each attribute and only class list and the
current attribute list reside in memory

n SPRINT (VLDB’96 — J. Shafer et al.)
n Constructs an attribute list data structure
n PUBLIC (VLDB’98 — Rastogi & Shim)

n Integrates tree splitting and tree pruning: stop growing the
tree earlier
n RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
n Builds an AVC-list (attribute, value, class label)
n BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)

n Uses bootstrapping to create several small samples
79
Data Cube-Based Decision-Tree Induction
n Integration of generalization with decision-tree induction
(Kamber et al.’97)
n Classification at primitive concept levels
n E.g., precise temperature, humidity, outlook, etc.
n Low-level concepts, scattered classes, bushy classification-
trees
n Semantic interpretation problems
n Cube-based multi-level classification
n Relevance analysis at multi-levels
n Information-gain analysis with dimension + level
80

Bài 5. Phân L P

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bài 5. Phân L P

Uploaded by

Copyright:

Available Formats

KHAI THÁC DỮ LIỆU & KHAI PHÁ TRI THỨC

Data Mining & Knowledge Discovery

Bài 5. Phân lớp/ Classification

n Học có giám sát/ Supervised learning (Phân loại/ classification)

n Phân loại dữ liệu (constructs a model) dựa trên tập huấn

n Chẩn đoán y tế/ Medical diagnosis

n Phát hiện gian lận/ Fraud detection

n Phân loại trang/ Web page categorization

bởi thuộc tính nhãn lớp/ class label attribute

công thức toán học.

NAM E RANK YEARS TENURED Classifier

student? yes credit rating?

no yes excellent fair

đo lường thống kê (e.g., information gain)

is employed for classifying the leaf

n Thông tin cần thiết/ Information needed (after using A to split D

n gain_ratio(income) = 0.029/1.557 = 0.019

Gini{low,high} là 0.458; Gini{medium,high} là 0.450. Do đó, chia theo

n Thông thường, có 3 phương pháp cho kết quả tốt, nhưng:

n Độ chính xác thấp đối với điểm dữ liệu mới

n 2 phương pháp tránh overfitting

n Cắt tỉa sau/ Postpruning: Remove branches từ cây hoàn

chỉnh—thu được chuỗi các cây được cắt tỉa dần

“best pruned tree”

n Cho phép giá trị đặc trưng là liên tục/ continuous-valued

n Có thể sử dụng truy vấn SQL để truy cập CSDL

n RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

n Separates the scalability aspects from the criteria that determine

Training Examples AVC-set on Age AVC-set on income

<=30 medium yes excellent yes yes no rating yes no

August 25, 2023 Data Mining: Concepts and Techniques 26

August 25, 2023 Data Mining: Concepts and Techniques 27

Data Mining: Concepts and Techniques 28

n Định lý Bayes’: P( H | X) = P(X | H )P(H ) = P(X | H )´ P(H ) / P(X)

n P(X): xác suất dữ liệu mẫu được quan sát

P( H | X) = P( X | H ) P( H ) = P(X | H )´ P(H ) / P(X)

Cần tối đa: P(C | X) = P(X | C )P(C )

P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643

n Compute P(X|Ci) for each class

P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 31…40

P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444

Prob(income = low) = 1/1003

n Good results obtained in most of the cases

n E.g., hospitals: patients: Profile: age, family history, etc.

Symptoms: fever, cough etc., Disease: lung cancer,

n Assessment of a rule: coverage and accuracy

n ncorrect = # of tuples correctly classified by R

coverage(R) = ncovers /|D| /* D: training data set */

“toughest” requirement (i.e., with the most attribute tests)

cost per class

priority list, according to some measure of rule quality or by experts

n Rule-Quality measures: consider both coverage and accuracy

extending condition pos ' pos

Example of Confusion Matrix:

n Cho m lớp, CMi,j là confusion matrix với số điểm dữ liệu i

recognition rate: tỉ lệ bộ dữ liệu negative class and minority of

n Tỉ lệ lỗi: 1 – accuracy, or n Specificity: Tỉ lệ sai thực sự

Error rate = (FP + FN)/All n Specificity = TN/N

n Recall: sự đầy đủ/ Cmpleteness – phần trăm bộ dữ liệu dương/

n Fß: thước đo trọng số của precision và recall

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

n Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

n Tập huấn luyện (e.g., 2/3) để xây dựng mô hình