
OVERVIEW

DATA MINING

DATA MINING VESIT M.VIJAYALAKSHMI 1


Outline of the Presentation

– Motivation & Introduction
– Data Mining Algorithms
– Teaching Plan

DATA MINING VESIT M.VIJAYALAKSHMI 2


Why Data Mining? Commercial
Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more
powerful
• Competitive Pressure is strong
– Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
DATA MINING VESIT M.VIJAYALAKSHMI 3
Typical Decision Making
• Given a database of 100,000 names, which
persons are the least likely to default on their
credit cards?
• Which of my customers are likely to be the
most loyal?
• Which claims in insurance are potential frauds?
• Who may not pay back loans?
• Who are consistent players to bid for in IPL?
• Who can be potential customers for a new toy?
Data Mining helps extract such information
DATA MINING VESIT M.VIJAYALAKSHMI 4
Why Mine Data?
Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
DATA MINING VESIT M.VIJAYALAKSHMI 5
Mining Large Data Sets -
Motivation

• There is often information “hidden” in the data that is not readily evident.
• Human analysts may take weeks to discover useful information.

DATA MINING VESIT M.VIJAYALAKSHMI 6


Data Mining works with
Warehouse Data
• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
DATA MINING VESIT M.VIJAYALAKSHMI 7
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
• Alternative names and their “inside stories”:
– Knowledge discovery(mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence,
etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
DATA MINING VESIT M.VIJAYALAKSHMI 8
Potential Applications
• Market analysis and management
– target marketing, CRM, market basket
analysis, cross selling, market segmentation
• Risk analysis and management
– Forecasting, customer retention, quality
control, competitive analysis
• Fraud detection and management
• Text mining (news group, email, documents) and
Web analysis.
– Intelligent query answering

DATA MINING VESIT M.VIJAYALAKSHMI 9


Other Applications
• Sports: game statistics analyzed to gain competitive advantage
• Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Web analysis: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
DATA MINING VESIT M.VIJAYALAKSHMI 10
What makes data mining
possible?
• Advances in the following areas are making
data mining deployable:
– data warehousing
– better and more data (i.e., operational, behavioral,
and demographic)
– the emergence of easily deployed data mining tools
and
– the advent of new data mining techniques.
– Gartner Group

DATA MINING VESIT M.VIJAYALAKSHMI 11


What is Not Data Mining
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
milk. (association rules)

DATA MINING VESIT M.VIJAYALAKSHMI 12


Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW

DATA MINING VESIT M.VIJAYALAKSHMI 13


Data Mining Models And Tasks

DATA MINING VESIT M.VIJAYALAKSHMI 14


Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of
patterns; not all of them are interesting.
• Interestingness measures:
– A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a
user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, etc.
DATA MINING VESIT M.VIJAYALAKSHMI 15
Can We Find All and Only Interesting Patterns?

• Find all the interesting patterns: Completeness
– Association vs. classification vs. clustering
• Search for only the interesting patterns:
– First generate all the patterns and then filter out the uninteresting ones, or
– Generate only the interesting patterns.
DATA MINING VESIT M.VIJAYALAKSHMI 16
Data Mining vs. KDD

• Knowledge Discovery in Databases


(KDD): process of finding useful
information and patterns in data.

• Data Mining: Use of algorithms to


extract the information and patterns
derived by the KDD process.

DATA MINING VESIT M.VIJAYALAKSHMI 17


KDD Process

• Selection: Obtain data from various sources.


• Preprocessing: Cleanse data.
• Transformation: Convert to common format.
Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user
in meaningful manner.
DATA MINING VESIT M.VIJAYALAKSHMI 18
Data Mining and Business
Intelligence
Increasing potential to support business decisions (bottom layer to top):
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
• Data Warehouses / Data Marts; OLAP, MDA (DBA)
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Mining: Information Discovery (Data Analyst)
• Data Presentation: Visualization Techniques (Business Analyst)
• Making Decisions (End User)
DATA MINING VESIT M.VIJAYALAKSHMI 19
Data Mining Development
Techniques and concepts that data mining builds on:
• Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
• Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
• Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
• Algorithm Design Techniques, Algorithm Analysis, Data Structures
• Neural Networks, Decision Tree Algorithms
DATA MINING VESIT M.VIJAYALAKSHMI 20


Data Mining Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application

DATA MINING VESIT M.VIJAYALAKSHMI 21


Social Implications of DM
• Privacy
• Profiling
• Unauthorized use

DATA MINING VESIT M.VIJAYALAKSHMI 22


Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

DATA MINING VESIT M.VIJAYALAKSHMI 23


Data Mining Algorithms

1. Classification
2. Clustering
3. Association Mining
4. Web Mining

DATA MINING VESIT M.VIJAYALAKSHMI 24


Data Mining Tasks

• Prediction Methods
– Use some variables to predict unknown
or future values of other variables.

• Description Methods
– Find human-interpretable patterns that
describe the data.

DATA MINING VESIT M.VIJAYALAKSHMI 25


Data Mining Algorithms
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]

DATA MINING VESIT M.VIJAYALAKSHMI 26


Data Mining Algorithms

CLASSIFICATION

DATA MINING VESIT M.VIJAYALAKSHMI 27


Classification
Given old data about customers and payments, predict a new applicant’s loan eligibility.

[Figure: previous customers’ records (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which learns a decision tree with tests such as “Salary > 5 K” and “Prof. = Exec” leading to “good” or “bad” leaves; the tree is then applied to the new applicant’s data.]

DATA MINING VESIT M.VIJAYALAKSHMI 28
Classification Problem
• Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
• The mapping actually divides D into equivalence classes.
• Prediction is similar, but may be viewed as having an infinite number of classes.

DATA MINING VESIT M.VIJAYALAKSHMI 29


Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
DATA MINING VESIT M.VIJAYALAKSHMI 30
Overview of Naive Bayes
The goal of Naive Bayes is to work out whether a new example
is in a class given that it has a certain combination of attribute
values. We work out the likelihood of the example being in
each class given the evidence (its attribute values), and take
the highest likelihood as the classification.

Bayes Rule (E = the evidence/event that has occurred, H = the hypothesis):

  P[H | E] = P[E | H] · P[H] / P[E]

P[H] is called the prior probability (of the hypothesis).

P[H|E] is called the posterior probability (of the hypothesis given the evidence).

DATA MINING VESIT M.VIJAYALAKSHMI 31


Worked Example 1
Take the following training data, from bank loan applicants:

ApplicantID City Children Income Status


1 Delhi Many Medium DEFAULTS
2 Delhi Many Low DEFAULTS
3 Delhi Few Medium PAYS
4 Delhi Few High PAYS

• P[City=Delhi | Status = DEFAULTS] = 2/2 = 1


• P[City=Delhi | Status = PAYS] = 2/2 = 1
• P[Children=Many | Status = DEFAULTS] = 2/2 = 1
• P[Children=Few | Status = DEFAULTS] = 0/2 = 0
• etc.
DATA MINING VESIT M.VIJAYALAKSHMI 32
Worked Example 1
Summarizing, we have the following probabilities:
Probability of...      ...given DEFAULTS   ...given PAYS
City=Delhi             2/2 = 1             2/2 = 1
Children=Few           0/2 = 0             2/2 = 1
Children=Many          2/2 = 1             0/2 = 0
Income=Low             1/2 = 0.5           0/2 = 0
Income=Medium          1/2 = 0.5           1/2 = 0.5
Income=High            0/2 = 0             1/2 = 0.5

and P[Status = DEFAULTS] = 2/4 = 0.5


P[Status = PAYS] = 2/4 = 0.5
For example, the probability of (Income=Medium) given that the applicant DEFAULTs =
the number of applicants with Income=Medium who DEFAULT
divided by the number of applicants who DEFAULT
= 1/2 = 0.5
DATA MINING VESIT M.VIJAYALAKSHMI 33
Worked Example 1
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=Medium:

First, we estimate the likelihood that the example is a defaulter, given its attribute
values: P[H1|E] = P[E|H1].P[H1] (denominator omitted*)
P[Status = DEFAULTS | Delhi,Many,Medium] =
P[Delhi|DEFAULTS] x P[Many|DEFAULTS] x P[Medium|DEFAULTS]
x P[DEFAULTS] =
1 x 1 x 0.5 x 0.5 = 0.25

Then we estimate the likelihood that the example is a payer, given its attributes:
P[H2|E] = P[E|H2].P[H2] (denominator omitted*)
P[Status = PAYS | Delhi,Many,Medium] =
P[Delhi|PAYS] x P[Many|PAYS] x P[Medium|PAYS] x
P[PAYS] =
1 x 0 x 0.5 x 0.5 =0
As the conditional likelihood of being a defaulter is higher (because 0.25 > 0), we
conclude that the new example is a defaulter.
DATA MINING VESIT M.VIJAYALAKSHMI 34
Worked Example 1
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=High:

First, we estimate the likelihood that the example is a defaulter, given its
attribute values:
P[Status = DEFAULTS | Delhi,Many,High] =
P[Delhi|DEFAULTS] x P[Many|DEFAULTS] x P[High|DEFAULTS] x
P[DEFAULTS] = 1 x 1 x 0 x 0.5 =0

Then we estimate the likelihood that the example is a payer, given its
attributes:
P[Status = PAYS | Delhi,Many,High] =
P[Delhi|PAYS] x P[Many|PAYS] x P[High|PAYS] x
P[PAYS] = 1 x 0 x 0.5 x 0.5 = 0

As the conditional likelihood of being a defaulter is the same as that for being
a payer, we can come to no conclusion for this example.

DATA MINING VESIT M.VIJAYALAKSHMI 35
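The two calculations above can be reproduced in a few lines of Python. This is an illustrative sketch written for this overview, not code from the slides; the training table and the test examples come from Worked Example 1, the function name is my own, and the score returned for each class is P[E|H]·P[H] with the denominator P[E] omitted, exactly as on the slides.

```python
from collections import Counter

# Training data from Worked Example 1: (City, Children, Income) -> Status
data = [
    (("Delhi", "Many", "Medium"), "DEFAULTS"),
    (("Delhi", "Many", "Low"),    "DEFAULTS"),
    (("Delhi", "Few",  "Medium"), "PAYS"),
    (("Delhi", "Few",  "High"),   "PAYS"),
]

def naive_bayes_scores(example, data):
    """Return P[E|H] * P[H] for each class H (denominator P[E] omitted)."""
    class_counts = Counter(label for _, label in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)                       # prior P[H]
        for i, value in enumerate(example):               # conditional P[attr=value | H]
            matches = sum(1 for attrs, lab in data
                          if lab == label and attrs[i] == value)
            score *= matches / n_label
        scores[label] = score
    return scores

print(naive_bayes_scores(("Delhi", "Many", "Medium"), data))
# {'DEFAULTS': 0.25, 'PAYS': 0.0}  -> classified as DEFAULTS
print(naive_bayes_scores(("Delhi", "Many", "High"), data))
# {'DEFAULTS': 0.0, 'PAYS': 0.0}   -> no conclusion (zero-frequency problem)
```

The second call illustrates the sparse-data weakness discussed next: a single zero conditional probability forces the whole score to zero.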


Weaknesses
• Naive Bayes assumes that variables are equally
important and that they are independent which is
often not the case in practice.
• Naive Bayes is damaged by the inclusion of
redundant (strongly dependent) attributes.
• Sparse data: If some attribute values are not present
in the data, then a zero probability for P[E|H] might
exist. This would lead P[H|E] to be zero no matter
how high P[E|H] is for other attribute values. Small
positive values which estimate the so-called ‘prior
probabilities’ are often used to correct this.

DATA MINING VESIT M.VIJAYALAKSHMI 36


Classification Using Decision
Trees
• Partitioning based: Divide search space into
rectangular regions.
• Tuple placed into class based on the region
within which it falls.
• DT approaches differ in how the tree is built:
DT Induction
• Internal nodes associated with attribute and arcs
with values for that attribute.
• Algorithms: ID3, C4.5, CART

DATA MINING VESIT M.VIJAYALAKSHMI 37


DT Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits
• Tree Structure
• Stopping Criteria
• Training Data
• Pruning

DATA MINING VESIT M.VIJAYALAKSHMI 38


DECISION TREES
• An internal node represents a test on an attribute.
• A branch represents an outcome of the test, e.g.,
Color=red.
• A leaf node represents a class label or class label
distribution.
• At each node, one attribute is chosen to split training
examples into distinct classes as much as possible
• A new case is classified by following a matching path
to a leaf node.

DATA MINING VESIT M.VIJAYALAKSHMI 39


Training Set
Outlook   Temperature  Humidity  Windy  Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
DATA MINING VESIT M.VIJAYALAKSHMI 40
Example

The learned tree:
• Outlook = sunny → test Humidity: high → N, normal → P
• Outlook = overcast → P
• Outlook = rain → test Windy: true → N, false → P

DATA MINING VESIT M.VIJAYALAKSHMI 41
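Read as a set of rules, the tree above can be applied directly. A small illustrative function (written for this overview, not taken from the slides) that classifies a new day with the learned tree:

```python
def classify(outlook, humidity, windy):
    """Apply the decision tree for the 'play' data: returns 'P' (play) or 'N' (don't play)."""
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"
    if outlook == "overcast":
        return "P"
    # outlook == "rain": the decision depends only on Windy
    return "N" if windy else "P"

print(classify("sunny", "high", False))    # N
print(classify("overcast", "high", True))  # P
print(classify("rain", "normal", True))    # N
```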


Building Decision Tree
• Top-down tree construction
– At start, all training examples are at the root.
– Partition the examples recursively by choosing one
attribute each time.
• Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up manner,
to improve the estimated accuracy on new cases.
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the
decision tree

DATA MINING VESIT M.VIJAYALAKSHMI 42


Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-
conquer manner
– At start, all the training examples are at the root
– Attributes are categorical
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
DATA MINING VESIT M.VIJAYALAKSHMI 43
Choosing the Splitting Attribute

• At each node, available attributes are


evaluated on the basis of separating the
classes of the training examples. A Goodness
function is used for this purpose.
• Typical goodness functions:
– information gain (ID3/C4.5)
– information gain ratio
– gini index

DATA MINING VESIT M.VIJAYALAKSHMI 44


Which attribute to select?

DATA MINING VESIT M.VIJAYALAKSHMI 45


A criterion for attribute selection
• Which is the best attribute?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the
“purest” nodes
• Popular impurity criterion: information gain
– Information gain increases with the average
purity of the subsets that an attribute produces
• Strategy: choose attribute that results in greatest
information gain

DATA MINING VESIT M.VIJAYALAKSHMI 46


Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

  I(p, n) = −(p/(p+n)) · log2(p/(p+n)) − (n/(p+n)) · log2(n/(p+n))
DATA MINING VESIT M.VIJAYALAKSHMI 47
Information Gain in Decision Tree
Induction
• Assume that using attribute A a set S will be
partitioned into sets {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

  E(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) · I(pi, ni)

• The encoding information that would be gained by branching on A is

  Gain(A) = I(p, n) − E(A)
DATA MINING VESIT M.VIJAYALAKSHMI 48
Example: attribute “Outlook”
• “Outlook” = “Sunny”:
  info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
• “Outlook” = “Overcast”:
  info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
  (Note: 0 log(0) is normally not defined; it is taken as 0 here.)
• “Outlook” = “Rainy”:
  info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
• Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
DATA MINING VESIT M.VIJAYALAKSHMI 49
Computing the information gain
• Information gain = information before splitting − information after splitting

  gain("Outlook") = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits

• Information gain for the attributes from the weather data:
  gain("Outlook") = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity") = 0.152 bits
  gain("Windy") = 0.048 bits
DATA MINING VESIT M.VIJAYALAKSHMI 50
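The gain figures above can be verified with a short script. This is an illustrative sketch written for this overview (not part of the slides); the rows are copied from the Training Set slide and the function names are my own.

```python
from math import log2
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Windy) -> Class, from the Training Set slide
rows = [
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]

def info(labels):
    """I(p, n): entropy of a class distribution, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(A) = I(p, n) - E(A) for the attribute at attr_index."""
    labels = [r[-1] for r in rows]
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr_index]].append(r[-1])     # class labels per attribute value
    expected = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return info(labels) - expected

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(name, round(gain(rows, i), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```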
Continuing to split

gain(" Temperatur e" )  0.571 bits


gain(" Humidity")  0.971 bits
gain(" Windy" )  0.020 bits
DATA MINING VESIT M.VIJAYALAKSHMI 51
The final decision tree

• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can’t be split any further
DATA MINING VESIT M.VIJAYALAKSHMI 52
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree
—get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best pruned tree”
DATA MINING VESIT M.VIJAYALAKSHMI 53
Data Mining Algorithms

Clustering
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should
be clustered along continent faults
DATA MINING VESIT M.VIJAYALAKSHMI 56
Clustering vs. Classification
• No prior knowledge
– Number of clusters
– Meaning of clusters
– Cluster results are dynamic
• Unsupervised learning

DATA MINING VESIT M.VIJAYALAKSHMI 57


Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data

DATA MINING VESIT M.VIJAYALAKSHMI 58


Clustering Methods
• Many different method and algorithms:
– For numeric and/or symbolic data
– Deterministic vs. probabilistic
– Exclusive vs. overlapping
– Hierarchical vs. flat
– Top-down vs. bottom-up

DATA MINING VESIT M.VIJAYALAKSHMI 59


Clustering Issues
• Outlier handling
• Dynamic data
• Interpreting results
• Evaluating results
• Number of clusters
• Data to be used
• Scalability

DATA MINING VESIT M.VIJAYALAKSHMI 60


Clustering Evaluation
• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
– distance measures
– high similarity within a cluster, low across
clusters

DATA MINING VESIT M.VIJAYALAKSHMI 61


Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms
of a distance function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very different
for interval-scaled, boolean, categorical, ordinal and ratio
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.

DATA MINING VESIT M.VIJAYALAKSHMI 62


Type of data in clustering analysis

• Interval-scaled variables:
• Binary variables:
• Nominal, ordinal, and ratio variables:
• Variables of mixed types:

DATA MINING VESIT M.VIJAYALAKSHMI 63


Similarity and Dissimilarity
Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• A popular one is the Minkowski distance:

  d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects and q is a positive integer
• If q = 1, d is the Manhattan distance:

  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
DATA MINING VESIT M.VIJAYALAKSHMI 64
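A minimal sketch of the Minkowski distance, written for this overview rather than taken from the slides; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. The sample points are made up for illustration.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

p1, p2 = (1, 2, 3), (4, 0, 3)
print(minkowski(p1, p2, q=1))  # Manhattan: 3 + 2 + 0 = 5
print(minkowski(p1, p2, q=2))  # Euclidean: sqrt(9 + 4) ~= 3.61
```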
Clustering Problem
• Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
• A cluster Kj contains precisely those tuples mapped to it.
• Unlike classification problem, clusters are
not known a priori.

DATA MINING VESIT M.VIJAYALAKSHMI 65


Types of Clustering
• Hierarchical – Nested set of clusters
created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at
a time.
• Simultaneous – All elements handled
together.
• Overlapping/Non-overlapping

DATA MINING VESIT M.VIJAYALAKSHMI 66


Clustering Approaches
Clustering
• Hierarchical: Agglomerative, Divisive
• Partitional
• Categorical
• Large DB: Sampling, Compression
DATA MINING VESIT M.VIJAYALAKSHMI 67


Cluster Parameters

DATA MINING VESIT M.VIJAYALAKSHMI 68


Distance Between Clusters

• Single Link: smallest distance between points


• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids

DATA MINING VESIT M.VIJAYALAKSHMI 69


Hierarchical Clustering
• Clusters are created in levels actually creating sets
of clusters at each level.
• Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
• Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down

DATA MINING VESIT M.VIJAYALAKSHMI 70


Hierarchical Clustering
• Use distance matrix as clustering criteria. This
method does not require the number of clusters k as
an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) merges objects a, b, c, d, e bottom-up over steps 0–4 into {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}; divisive clustering (DIANA) splits in the reverse order.]
DATA MINING VESIT M.VIJAYALAKSHMI 71
Dendrogram
• A tree data structure which
illustrates hierarchical
clustering techniques.
• Each level shows clusters for
that level.
– Leaf – individual clusters
– Root – one cluster
• A cluster at level i is the
union of its children clusters
at level i+1.

DATA MINING VESIT M.VIJAYALAKSHMI 72


A Dendrogram Shows How the Clusters are
Merged Hierarchically

• Decomposes data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

DATA MINING VESIT M.VIJAYALAKSHMI 73


DIANA (Divisive Analysis)

• Implemented in statistical analysis packages, e.g., Splus


• Inverse order of AGNES
• Eventually each node forms a cluster on its own

DATA MINING VESIT M.VIJAYALAKSHMI 74


Partitional Clustering
• Nonhierarchical
• Creates clusters in one step as opposed to
several steps.
• Since only one set of clusters is output, the
user normally has to input the desired
number of clusters, k.
• Usually deals with static sets.

DATA MINING VESIT M.VIJAYALAKSHMI 75


K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements
in a cluster is obtained.
• Given a cluster Ki={ti1,ti2,…,tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)

DATA MINING VESIT M.VIJAYALAKSHMI 76


K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3,m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
• K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
• K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
• K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
• Stop as the clusters with these means are the same.

DATA MINING VESIT M.VIJAYALAKSHMI 77


The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in
4 steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the
clusters of the current partition. The centroid is
the center (mean point) of the cluster.
– Assign each object to the cluster with the
nearest seed point.
– Go back to Step 2, stop when no more new
assignment.
DATA MINING VESIT M.VIJAYALAKSHMI 78
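A compact one-dimensional K-Means sketch, written for this overview rather than taken from the slides, that follows the four steps above. Run with the data {2,4,10,12,3,20,30,11,25}, k = 2 and initial means 3 and 4, it reproduces the earlier worked example; it assumes no cluster ever becomes empty.

```python
def kmeans_1d(points, means, max_iter=100):
    """Simple 1-D K-Means: assign points to the nearest mean, recompute means, repeat."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in points:                                    # assignment step
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) for c in clusters]     # update step (assumes no empty cluster)
        if new_means == means:                              # converged: means no longer change
            return clusters, means
        means = new_means
    return clusters, means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3.0, 4.0])
print(clusters, means)
# [[2, 4, 10, 12, 3, 11], [20, 30, 25]] with means [7.0, 25.0], as in the K-Means Example slide
```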
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing and
genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes

DATA MINING VESIT M.VIJAYALAKSHMI 79


The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids)
– starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
– Handles outliers well.
– Ordering of input does not impact results.
– Does not scale well.
– Each cluster represented by one item, called the medoid.
– Initial set of k medoids randomly chosen.
• PAM works effectively for small data sets, but does not scale well
for large data sets
DATA MINING VESIT M.VIJAYALAKSHMI 80
PAM (Partitioning Around Medoids)

• PAM - Use real object to represent the cluster


– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected object i,
calculate the total swapping cost TCih
– For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
– repeat steps 2-3 until there is no change
DATA MINING VESIT M.VIJAYALAKSHMI 81
PAM

DATA MINING VESIT M.VIJAYALAKSHMI 82


DATA MINING

ASSOCIATION RULES
Example: Market Basket Data
• Items frequently purchased together:
  Computer → Printer
• Uses:
– Placement
– Advertising
– Sales
– Coupons
• Objective: increase sales and reduce costs
• Called Market Basket Analysis, Shopping Cart
Analysis
DATA MINING VESIT M.VIJAYALAKSHMI 84
Transaction Data: Supermarket Data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
… …
tn: {biscuit, jam, milk}
• Concepts:
– An item: an item/article in a basket
– I: the set of all items sold in the store
– A Transaction: items purchased in a basket; it may
have TID (transaction ID)
– A Transactional dataset: A set of transactions
DATA MINING VESIT M.VIJAYALAKSHMI 85
Transaction Data: A Set Of Documents
• A text document data set. Each document is
treated as a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

DATA MINING VESIT M.VIJAYALAKSHMI 86


Association Rule Definitions
• Association Rule (AR): an implication X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s), X ⇒ Y: the percentage of transactions that contain X ∪ Y
• Confidence of AR (α), X ⇒ Y: the ratio of the number of transactions that contain X ∪ Y to the number that contain X

DATA MINING VESIT M.VIJAYALAKSHMI 87


Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
• Link Analysis
• NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y.
DATA MINING VESIT M.VIJAYALAKSHMI 88
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
DATA MINING VESIT M.VIJAYALAKSHMI 89
Example
• Transaction data:
  t1: Butter, Cocoa, Milk
  t2: Butter, Cheese
  t3: Cheese, Boots
  t4: Butter, Cocoa, Cheese
  t5: Butter, Cocoa, Clothes, Cheese, Milk
  t6: Cocoa, Clothes, Milk
  t7: Cocoa, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset:
  {Cocoa, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
  Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]
  … …
  Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3]
DATA MINING VESIT M.VIJAYALAKSHMI 90
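The support and confidence figures above can be checked with two small helper functions. This is an illustrative sketch written for this overview, not code from the slides; the transaction list is the one shown above.

```python
transactions = [
    {"Butter", "Cocoa", "Milk"},
    {"Butter", "Cheese"},
    {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"},
    {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"},
    {"Cocoa", "Milk", "Clothes"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(lhs -> rhs) = support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Cocoa", "Clothes", "Milk"}, transactions))        # 3/7 ~= 0.43
print(confidence({"Clothes"}, {"Milk", "Cocoa"}, transactions))   # 3/3 = 1.0
```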
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support  minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
• Frequent itemset generation is still
computationally expensive
DATA MINING VESIT M.VIJAYALAKSHMI 91
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d
  (N = number of transactions, M = number of candidate itemsets, w = maximum transaction width)

  TID  Items
  1    Bread, Milk
  2    Bread, Biscuit, FruitJuice, Eggs
  3    Milk, Biscuit, FruitJuice, Coke
  4    Bread, Milk, Biscuit, FruitJuice
  5    Bread, Milk, Biscuit, Coke
DATA MINING VESIT M.VIJAYALAKSHMI 92


Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following anti-monotone property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
DATA MINING VESIT M.VIJAYALAKSHMI 93


Illustrating Apriori Principle
[Figure: the lattice of itemsets over {A, B, C, D, E}, from the null set down to ABCDE. Once an itemset (e.g., AB) is found to be infrequent, all of its supersets (ABC, ABD, ABE, …, ABCDE) are pruned from the search.]
DATA MINING VESIT M.VIJAYALAKSHMI 94


Illustrating Apriori Principle
Items (1-itemsets), minimum support = 3:
  Bread 4, Coke 2, Milk 4, FruitJuice 3, Biscuit 4, Eggs 1

Pairs (2-itemsets) – no need to generate candidates involving Coke or Eggs:
  {Bread,Milk} 3, {Bread,FruitJuice} 2, {Bread,Biscuit} 3,
  {Milk,FruitJuice} 2, {Milk,Biscuit} 3, {FruitJuice,Biscuit} 3

Triplets (3-itemsets):
  {Bread,Milk,Biscuit} 3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates;
with support-based pruning, 6 + 6 + 1 = 13.
DATA MINING VESIT M.VIJAYALAKSHMI 95


Apriori Algorithm
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are
identified
– Generate length (k+1) candidate itemsets from
length k frequent itemsets
– Prune candidate itemsets containing subsets of
length k that are infrequent
– Count the support of each candidate by scanning the
DB
– Eliminate candidates that are infrequent, leaving
only those that are frequent
DATA MINING VESIT M.VIJAYALAKSHMI 96
Example – Finding Frequent Itemsets
Dataset T (minsup = 0.5):
  T100: 1, 3, 4
  T200: 2, 3, 5
  T300: 1, 2, 3, 5
  T400: 2, 5

1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}
DATA MINING VESIT M.VIJAYALAKSHMI 97
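The level-wise search above can be sketched in a few dozen lines. The following illustrative implementation (written for this overview, not taken from the slides) follows the Apriori loop: count candidates, keep the frequent ones, join them to form the next level, and prune candidates that have an infrequent subset. On dataset T with minsup = 0.5 it produces the same F1, F2 and F3 as the example.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent itemset (frozenset): support count}, built level by level."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in items]   # k = 1 candidates
    k = 1
    while level:
        # count candidate support with one pass over the data
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: s for c, s in counts.items() if s / n >= minsup}
        frequent.update(current)
        # generate (k+1)-candidates by joining frequent k-itemsets,
        # keeping only those whose k-subsets are all frequent (Apriori pruning)
        level, seen = [], set()
        for a, b in combinations(list(current), 2):
            cand = a | b
            if len(cand) == k + 1 and cand not in seen:
                if all(frozenset(s) in current for s in combinations(cand, k)):
                    seen.add(cand)
                    level.append(cand)
        k += 1
    return frequent

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(T, 0.5).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
# {1}:2 {2}:3 {3}:3 {5}:3 {1,3}:2 {2,3}:2 {2,5}:3 {3,5}:2 {2,3,5}:2
```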


Apriori Adv/Disadv
• Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans.

DATA MINING VESIT M.VIJAYALAKSHMI 98


Step 2: Generating Rules From Frequent
Itemsets
• Frequent itemsets → association rules
• One more step is needed to generate association rules
• For each frequent itemset X,
  for each proper nonempty subset A of X:
  – Let B = X − A
  – A → B is an association rule if confidence(A → B) ≥ minconf, where
    support(A → B) = support(A ∪ B) = support(X)
    confidence(A → B) = support(A ∪ B) / support(A)

DATA MINING VESIT M.VIJAYALAKSHMI 99


Generating Rules: An example
• Suppose {2,3,4} is frequent, with sup = 50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
  • 2,3 → 4, confidence = 100%
  • 2,4 → 3, confidence = 100%
  • 3,4 → 2, confidence = 67%
  • 2 → 3,4, confidence = 67%
  • 3 → 2,4, confidence = 67%
  • 4 → 2,3, confidence = 67%
• All rules have support = 50%
DATA MINING VESIT M.VIJAYALAKSHMI 100
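The rule-generation step can be sketched as follows. This illustrative code (written for this overview, not taken from the slides) assumes the supports of all frequent itemsets were already recorded during itemset generation; with minconf = 80% it keeps only the two 100%-confidence rules from the example above.

```python
from itertools import combinations

def rules_from_itemset(X, supports, minconf):
    """Generate rules A -> X-A from frequent itemset X whose confidence >= minconf.

    supports maps frozensets to their support (fraction of transactions)."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):                       # every proper nonempty subset size
        for A in map(frozenset, combinations(X, r)):
            conf = supports[X] / supports[A]         # confidence(A -> X-A)
            if conf >= minconf:
                rules.append((set(A), set(X - A), supports[X], conf))
    return rules

# supports taken from the example: {2,3,4}=50%, {2,3}=50%, {2,4}=50%, {3,4}=75%, ...
supports = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
for lhs, rhs, sup, conf in rules_from_itemset({2, 3, 4}, supports, minconf=0.8):
    print(f"{lhs} -> {rhs}  sup={sup:.0%}  conf={conf:.0%}")
# {2, 3} -> {4}  sup=50%  conf=100%
# {2, 4} -> {3}  sup=50%  conf=100%
```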
Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD,
  BD → AC, CD → AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
DATA MINING VESIT M.VIJAYALAKSHMI 101
Generating Rules
• To recap, in order to obtain A → B, we need
to have support(A ∪ B) and support(A)
• All the required information for confidence
computation has already been recorded in
itemset generation. No need to see the data T
any more.
• This step is not as time-consuming as frequent
itemsets generation.

DATA MINING VESIT M.VIJAYALAKSHMI 102


Rule Generation
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– e.g., L = {A,B,C,D}:
  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
DATA MINING VESIT M.VIJAYALAKSHMI 103
Rule Generation for Apriori
Algorithm
Lattice of rules for the frequent itemset ABCD:
[Figure: rules from ABCD arranged by consequent size, from ABCD ⇒ {} down to A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC. If a rule such as BCD ⇒ A has low confidence, every rule whose consequent is a superset of A (CD ⇒ AB, BD ⇒ AC, …, D ⇒ ABC, …) is pruned.]
DATA MINING VESIT M.VIJAYALAKSHMI 104


Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• Join(CD ⇒ AB, BD ⇒ AC) produces the candidate rule D ⇒ ABC
• Prune rule D ⇒ ABC if its subset rule AD ⇒ BC does not have high confidence
DATA MINING VESIT M.VIJAYALAKSHMI 105


APriori - Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
– Use database scan and pattern matching to collect counts for
the candidate itemsets
• Bottleneck of Apriori: candidate generation
– Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
– Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern
DATA MINING VESIT M.VIJAYALAKSHMI 106
Mining Frequent Patterns
Without Candidate Generation
• Compress a large database into a compact, Frequent-
Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern
mining method
– A divide-and-conquer methodology: decompose mining
tasks into smaller ones
– Avoid candidate generation: sub-database test only!

DATA MINING VESIT M.VIJAYALAKSHMI 107


Construct FP-tree From A Transaction DB
min_support = 0.5

TID  Items bought                 (ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o}              {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Order the frequent items in frequency-descending order
3. Scan the DB again and construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3
[Figure: the resulting FP-tree rooted at {}, with branches f:4–c:3–a:3–m:2–p:2, f:4–c:3–a:3–b:1–m:1, f:4–b:1, and c:1–b:1–p:1; each header-table entry links to the nodes carrying that item.]
DATA MINING VESIT M.VIJAYALAKSHMI 108
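The first two steps of the construction (find the frequent items, then rewrite each transaction to contain only frequent items in frequency-descending order) can be sketched as follows. This is illustrative code written for this overview, not from the slides; items that tie on frequency may come out in a slightly different order than in the table above.

```python
from collections import Counter

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_count = 3   # = 0.5 * 5 transactions, rounded up

# Pass 1: frequent single items and the f-list (frequency-descending order)
counts = Counter(i for t in transactions for i in t)
flist = [i for i, c in counts.most_common() if c >= min_count]

# Pass 2: keep only frequent items, ordered by the f-list; these ordered lists
# are the paths inserted one by one into the FP-tree
ordered = [[i for i in flist if i in t] for t in transactions]
print(flist)    # ['f', 'c', 'a', 'm', 'p', 'b']  (count-descending; ties keep first-seen order)
print(ordered)  # [['f','c','a','m','p'], ['f','c','a','m','b'], ['f','b'], ['c','p','b'], ['f','c','a','m','p']]
```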
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more
likely to be shared
– never larger than the original database (not counting node-links and counts)

DATA MINING VESIT M.VIJAYALAKSHMI 109


Mining Frequent Patterns Using FP-tree
• General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and
then its conditional FP-tree
– Repeat the process on each newly created conditional FP-
tree
– Until the resulting FP-tree is empty, or it contains only one
path (single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern)

DATA MINING VESIT M.VIJAYALAKSHMI 110


Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in
the FP-tree
2) Construct conditional FP-tree from each conditional
pattern-base
3) Recursively mine conditional FP-trees and grow
frequent patterns obtained so far
• If the conditional FP-tree contains a single path, simply
enumerate all the patterns

DATA MINING VESIT M.VIJAYALAKSHMI 111


Step 1: FP-tree to Conditional Pattern Base
• Starting at the frequent header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item
• Accumulate all of transformed prefix paths of that item to form
a conditional pattern base
Conditional pattern bases (the header table and FP-tree are as on the construction slide):
  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1
DATA MINING VESIT M.VIJAYALAKSHMI 112


Step 2: Construct Conditional FP-tree
• For each pattern-base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern
base
m-conditional pattern base: fca:2, fcab:1
[Figure: the m-conditional FP-tree is the single path {} – f:3 – c:3 – a:3; b is dropped because its count (1) is below the minimum support.]
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
DATA MINING VESIT M.VIJAYALAKSHMI 113
Mining Frequent Patterns by Creating
Conditional Pattern-Bases
Item Conditional pattern-base Conditional FP-tree
p {(fcam:2), (cb:1)} {(c:3)}|p
m {(fca:2), (fcab:1)} {(f:3, c:3, a:3)}|m
b {(fca:1), (f:1), (c:1)} Empty
a {(fc:3)} {(f:3, c:3)}|a
c {(f:3)} {(f:3)}|c
f Empty Empty
DATA MINING VESIT M.VIJAYALAKSHMI 114
Step 3: Recursively mine the conditional
FP-tree
Starting from the m-conditional FP-tree ({} – f:3 – c:3 – a:3):
• Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} – f:3 – c:3
• Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} – f:3
• Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} – f:3
DATA MINING VESIT M.VIJAYALAKSHMI 115
Single FP-tree Path Generation
• Suppose an FP-tree T has a single path P
• The complete set of frequent pattern of T can be generated by
enumeration of all the combinations of the sub-paths of P

{}
All frequent patterns
concerning m
f:3
m,
c:3  fm, cm, am,
fcm, fam, cam,
a:3
fcam
m-conditional FP-tree

DATA MINING VESIT M.VIJAYALAKSHMI 116


Why Is Frequent Pattern Growth
Fast?
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori,
and is also faster than tree-projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
DATA MINING VESIT M.VIJAYALAKSHMI 117
