DATA MINING
Data Exploration
Statistical Analysis, Querying and Reporting
• Bayes Theorem
• Regression Analysis
• Algorithm Design Techniques
• EM Algorithm
• Algorithm Analysis
• K-Means Clustering
• Data Structures
• Time Series Analysis
• Neural Networks
• Decision Tree Algorithms
1. Classification
2. Clustering
3. Association Mining
4. Web Mining
• Prediction Methods
– Use some variables to predict unknown
or future values of other variables.
• Description Methods
– Find human-interpretable patterns that
describe the data.
CLASSIFICATION
[Figure: previous customers are fed into a classifier, which produces a decision tree]
Bayes' Theorem:
P[H | E] = P[E | H] . P[H] / P[E]
First, we estimate the likelihood that the example is a defaulter, given its attribute
values: P[H1 | E] = P[E | H1] . P[H1] (denominator omitted*)
P[Status = DEFAULTS | Delhi, Many, Medium]
= P[Delhi | DEFAULTS] x P[Many | DEFAULTS] x P[Medium | DEFAULTS] x P[DEFAULTS]
= 1 x 1 x 0.5 x 0.5 = 0.25
Then we estimate the likelihood that the example is a payer, given its attributes:
P[H2 | E] = P[E | H2] . P[H2] (denominator omitted*)
P[Status = PAYS | Delhi, Many, Medium]
= P[Delhi | PAYS] x P[Many | PAYS] x P[Medium | PAYS] x P[PAYS]
= 1 x 0 x 0.5 x 0.5 = 0
As the conditional likelihood of being a defaulter is higher (because 0.25 > 0), we
conclude that the new example is a defaulter.
Worked Example 2
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=High:
First, we estimate the likelihood that the example is a defaulter, given its attribute values:
P[Status = DEFAULTS | Delhi, Many, High]
= P[Delhi | DEFAULTS] x P[Many | DEFAULTS] x P[High | DEFAULTS] x P[DEFAULTS]
= 1 x 1 x 0 x 0.5 = 0
Then we estimate the likelihood that the example is a payer, given its attributes:
P[Status = PAYS | Delhi, Many, High]
= P[Delhi | PAYS] x P[Many | PAYS] x P[High | PAYS] x P[PAYS]
= 1 x 0 x 0.5 x 0.5 = 0
As the conditional likelihood of being a defaulter is the same as that for being
a payer, we can come to no conclusion for this example.
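
A minimal Python sketch of the two naive Bayes calculations above. The priors and conditional probabilities are the values used in the worked examples; the training table they were estimated from is not reproduced here.

    # Naive Bayes scoring for the two worked examples above. The probabilities
    # below are taken from the worked examples, not re-estimated from data.
    priors = {"DEFAULTS": 0.5, "PAYS": 0.5}

    likelihoods = {
        "DEFAULTS": {"City=Delhi": 1.0, "Children=Many": 1.0,
                     "Income=Medium": 0.5, "Income=High": 0.0},
        "PAYS":     {"City=Delhi": 1.0, "Children=Many": 0.0,
                     "Income=Medium": 0.5, "Income=High": 0.5},
    }

    def score(example, cls):
        """P[E | cls] . P[cls], with the common denominator P[E] omitted."""
        p = priors[cls]
        for attribute_value in example:
            p *= likelihoods[cls][attribute_value]
        return p

    for example in (["City=Delhi", "Children=Many", "Income=Medium"],
                    ["City=Delhi", "Children=Many", "Income=High"]):
        print(example, {cls: score(example, cls) for cls in priors})
    # First example:  DEFAULTS 0.25, PAYS 0.0 -> classified as a defaulter
    # Second example: DEFAULTS 0.0,  PAYS 0.0 -> no conclusion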
[Figure: decision tree for the weather data, with Outlook at the root, subtrees testing humidity and windy, and leaves labelled P and N]

I(p, n) = - p/(p+n) · log2( p/(p+n) ) - n/(p+n) · log2( n/(p+n) )
Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv}
  – If Si contains pi examples of P and ni examples of N, the entropy, or the
    expected information needed to classify objects in all subtrees Si, is
    E(A) = Σ (i = 1 to v) [ (pi + ni) / (p + n) ] · I(pi, ni)
• “Outlook” = “Overcast”:
  info([4,0]) = entropy(1, 0) = -1·log(1) - 0·log(0) = 0 bits
  (Note: 0·log(0) is normally not defined; here it is taken to be 0.)
• “Outlook” = “Rainy”:
  info([3,2]) = entropy(3/5, 2/5) = -3/5·log(3/5) - 2/5·log(2/5) = 0.971 bits
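
A minimal Python sketch of these entropy calculations, with I(p, n) and E(A) written as in the formulas above. Only the two branches shown above are evaluated; this is a sketch, not the slides' own code.

    from math import log2

    def info(p, n):
        """I(p, n) = -p/(p+n)·log2(p/(p+n)) - n/(p+n)·log2(n/(p+n)).
        The 0·log2(0) case noted above is treated as 0."""
        total = p + n
        result = 0.0
        for c in (p, n):
            if c > 0:                      # skip the undefined 0·log2(0) term
                result -= (c / total) * log2(c / total)
        return result

    def expected_info(partitions):
        """E(A) = sum over subsets Si of (pi + ni)/(p + n) · I(pi, ni)."""
        p = sum(pi for pi, ni in partitions)
        n = sum(ni for pi, ni in partitions)
        return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)

    print(info(4, 0))  # "Outlook" = "Overcast": 0.0 bits
    print(info(3, 2))  # "Outlook" = "Rainy":   ~0.971 bits
    # expected_info(...) would combine all branches of "Outlook"; the remaining
    # branch is not shown above, so E(Outlook) is not computed here.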
Clustering
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should
be clustered along continental faults
Clustering vs. Classification
• No prior knowledge
– Number of clusters
– Meaning of clusters
– Cluster results are dynamic
• Unsupervised learning
• Interval-scaled variables:
• Binary variables:
• Nominal, ordinal, and ratio variables:
• Variables of mixed types:
• Minkowski distance:
  d(i, j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q )^(1/q)
  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional
  data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
  d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|
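
A minimal Python sketch of this distance measure; the two sample objects are hypothetical, just for illustration.

    def minkowski(i, j, q):
        """Minkowski distance between two p-dimensional objects i and j;
        q = 1 gives the Manhattan distance shown above."""
        return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

    i = (1.0, 2.0, 3.0)        # hypothetical data objects
    j = (4.0, 0.0, 3.0)
    print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
    print(minkowski(i, j, 2))  # q = 2 (Euclidean): sqrt(9 + 4 + 0) ≈ 3.61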
Clustering Problem
• Given a database D={t1,t2,…,tn} of tuples and
an integer value k, the Clustering Problem is
to define a mapping f: D → {1, …, k} where each ti
is assigned to one cluster Kj, 1 ≤ j ≤ k.
• A Cluster, Kj, contains precisely those tuples
mapped to it.
• Unlike the classification problem, the clusters are
not known a priori.
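
As an illustration of such a mapping, here is a minimal k-means sketch in Python (k-means is one of the techniques listed earlier; the six 2-dimensional tuples are made up for the example).

    import random

    def kmeans(points, k, iterations=20, seed=0):
        """A k-means sketch: returns an assignment list that maps each tuple in
        `points` (here 2-d) to one of k clusters, i.e. a mapping f: D -> {1,..,k}
        (clusters are numbered 0..k-1 below)."""
        random.seed(seed)
        centroids = random.sample(points, k)
        assignment = [0] * len(points)
        for _ in range(iterations):
            # Assign each point to its nearest centroid.
            for idx, (x, y) in enumerate(points):
                assignment[idx] = min(
                    range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
                )
            # Recompute each centroid as the mean of its assigned points.
            for c in range(k):
                members = [points[i] for i in range(len(points)) if assignment[i] == c]
                if members:
                    centroids[c] = (sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members))
        return assignment

    D = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
    print(kmeans(D, k=2))   # e.g. [0, 0, 0, 1, 1, 1]; cluster labels may be swapped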
ASSOCIATION RULES
Example: Market Basket Data
• Items frequently purchased together:
Computer → Printer
• Uses:
– Placement
– Advertising
– Sales
– Coupons
• Objective: increase sales and reduce costs
• Called Market Basket Analysis, Shopping Cart
Analysis
Transaction Data: Supermarket Data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
… …
tn: {biscuit, jam, milk}
• Concepts:
– An item: an item/article in a basket
– I: the set of all items sold in the store
– A Transaction: items purchased in a basket; it may
have TID (transaction ID)
– A Transactional dataset: A set of transactions
Transaction Data: A Set Of Documents
• A text document data set. Each document is
treated as a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
Example
• Transaction data:
  t2: Butter, Cheese
  t3: Cheese, Boots
  t4: Butter, Cocoa, Cheese
  t5: Butter, Cocoa, Clothes, Cheese, Milk
  t6: Cocoa, Clothes, Milk
  t7: Cocoa, Milk, Clothes
• Assume:
minsup = 30%
minconf = 80%
• An example frequent itemset:
  {Cocoa, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
  Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]
  … …
  Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3]
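
A minimal Python sketch of the support and confidence computations for this example. Transaction t1 is not visible in this extract, so the counts below are over the six transactions shown; the confidences still come out as 3/3, as on the slide.

    def support_count(itemset, transactions):
        """Number of transactions containing every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t)

    def confidence(lhs, rhs, transactions):
        """conf(lhs -> rhs) = sup(lhs ∪ rhs) / sup(lhs)."""
        return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

    transactions = [            # t2 .. t7 from the example above
        {"Butter", "Cheese"},
        {"Cheese", "Boots"},
        {"Butter", "Cocoa", "Cheese"},
        {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
        {"Cocoa", "Clothes", "Milk"},
        {"Cocoa", "Milk", "Clothes"},
    ]

    print(support_count({"Cocoa", "Clothes", "Milk"}, transactions))  # 3
    print(confidence({"Clothes"}, {"Milk", "Cocoa"}, transactions))   # 3/3 = 1.0
    print(confidence({"Clothes", "Cocoa"}, {"Milk"}, transactions))   # 3/3 = 1.0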
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => expensive since M = 2^d !!!
TID  Items
1    Bread, Milk
2    Bread, Biscuit, FruitJuice, Eggs
3    Milk, Biscuit, FruitJuice, Coke
4    Bread, Milk, Biscuit, FruitJuice
5    Bread, Milk, Biscuit, Coke
(N = number of transactions, w = maximum transaction width, as used in the O(NMw) estimate above)
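
A minimal Python sketch of the brute-force approach on this table: every non-empty itemset is a candidate, and each candidate's support is counted by scanning all N transactions. The absolute minsup of 3 is an assumption for illustration (the slide does not fix one for this table).

    from itertools import combinations

    transactions = [                      # the five transactions from the table above
        {"Bread", "Milk"},
        {"Bread", "Biscuit", "FruitJuice", "Eggs"},
        {"Milk", "Biscuit", "FruitJuice", "Coke"},
        {"Bread", "Milk", "Biscuit", "FruitJuice"},
        {"Bread", "Milk", "Biscuit", "Coke"},
    ]
    items = sorted(set().union(*transactions))   # the d distinct items
    minsup_count = 3                             # assumed absolute support threshold

    # Brute force: all 2^d - 1 non-empty itemsets are candidates (M = 2^d), and each
    # one is matched against every transaction.
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= minsup_count:
                frequent[candidate] = count

    for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
        print(itemset, count)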
[Figure: itemset lattice over the items A, B, C, D, E. Once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
Itemset {Bread, Milk, Biscuit}: count = 3
If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidate itemsets.
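
A minimal Python sketch of support-based pruning on the same table (minsup of 3 again assumed): a size-k itemset is only counted if all of its (k-1)-subsets were frequent, so supersets of infrequent itemsets are never examined.

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Biscuit", "FruitJuice", "Eggs"},
        {"Milk", "Biscuit", "FruitJuice", "Coke"},
        {"Bread", "Milk", "Biscuit", "FruitJuice"},
        {"Bread", "Milk", "Biscuit", "Coke"},
    ]
    minsup_count = 3                      # assumed threshold, as in the sketch above
    items = sorted(set().union(*transactions))

    frequent = set()                      # frequent itemsets found so far
    counted = 0                           # candidates whose support was actually counted

    for k in range(1, 4):                 # itemsets of size 1..3, as in the 41-candidate count
        for cand in combinations(items, k):
            # Apriori principle: skip any candidate with an infrequent (k-1)-subset.
            if k > 1 and not all(frozenset(s) in frequent for s in combinations(cand, k - 1)):
                continue
            counted += 1
            if sum(1 for t in transactions if set(cand) <= t) >= minsup_count:
                frequent.add(frozenset(cand))

    print(counted)        # 6 + 6 + 1 = 13 candidates counted, versus 6C1 + 6C2 + 6C3 = 41
    print(len(frequent))  # 8 frequent itemsets at this threshold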
• Join (CD => AB, BD => AC) would produce the candidate rule D => ABC.
• Prune rule D => ABC if its subset rule AD => BC does not have high confidence
  (moving items from the antecedent into the consequent can only lower confidence,
  so if AD => BC already fails the threshold, D => ABC must fail it as well).
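
A minimal Python sketch of the rule-generation step for the frequent itemset {Cocoa, Clothes, Milk} from the earlier example, with minconf = 80% as on that slide. It simply checks every binary partition of the itemset rather than using the join/prune shortcut described above, and it works over the six transactions shown (t1 is not in this extract).

    from itertools import combinations

    transactions = [                      # t2 .. t7 from the earlier example
        {"Butter", "Cheese"},
        {"Cheese", "Boots"},
        {"Butter", "Cocoa", "Cheese"},
        {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
        {"Cocoa", "Clothes", "Milk"},
        {"Cocoa", "Milk", "Clothes"},
    ]
    minconf = 0.8

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t)

    itemset = frozenset({"Cocoa", "Clothes", "Milk"})

    # Each rule is a binary partitioning of the frequent itemset: antecedent -> consequent.
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = sup(itemset) / sup(antecedent)
            if conf >= minconf:
                print(set(antecedent), "->", set(consequent), "conf =", round(conf, 2))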
[Figure: cam-conditional FP-tree, consisting of the single node f:3]
Single FP-tree Path Generation
• Suppose an FP-tree T has a single path P
• The complete set of frequent patterns of T can be generated by
enumerating all the combinations of the sub-paths of P
[Figure: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3. All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.]
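
A minimal Python sketch of this single-path enumeration, using the m-conditional FP-tree in the figure. The support of m itself is taken to be 3, consistent with the f:3, c:3, a:3 counts on the path (an assumption, since the full tree is not shown here).

    from itertools import combinations

    path = [("f", 3), ("c", 3), ("a", 3)]   # the single path of the m-conditional FP-tree
    suffix, suffix_count = "m", 3           # assumed support of m itself

    # Every combination of items on the path, joined with the suffix m, is a frequent
    # pattern; its support is the smallest count along the chosen sub-path.
    patterns = {}
    for r in range(len(path) + 1):
        for combo in combinations(path, r):
            pattern = "".join(item for item, _ in combo) + suffix
            patterns[pattern] = min([suffix_count] + [count for _, count in combo])

    print(patterns)
    # {'m': 3, 'fm': 3, 'cm': 3, 'am': 3, 'fcm': 3, 'fam': 3, 'cam': 3, 'fcam': 3}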