4 - Data Mining & Preprocessing - L - 11,12,13,14,15,16
Pre-processing
Lecture-11,12,13,14,15,16
Dr. Sumit Dhariwal
School of Computing and Information Technology
Manipal University Jaipur
India
Outline
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
• E.g. Most customers with income level 60k – 80k with food expenses $600 - $800 a month live in that area
• Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of CD player
• Fraud detection
• Find outliers of unusual transactions
• Financial planning
• Summarize and compare the resources and spending
[Figure: the value pyramid of data mining — from data exploration (statistical summary, querying, and reporting) at the base up to end-user decision making at the top, with increasing potential to support business decisions.]
[Figure: data mining at the confluence of multiple disciplines — database technology, statistics, machine learning, information science, visualization, and other disciplines.]
• Data are organized around major subjects, e.g. customer, item, supplier and
activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• User can perform drill-down or roll-up operation to view the data at
different degrees of summarization
• Cluster Analysis
• Class label is unknown: group data to form new classes
• Clusters of objects are formed based on the principle of maximizing intra-
class similarity & minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.
• Outlier Analysis
• Data that do not comply with the general behavior or model.
• Outliers are usually discarded as noise or exceptions.
• Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
• Evolution Analysis
• Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
• Subjective measures
• Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.
1.6 Classification of data mining systems
• Database
• Relational, data warehouse, transactional, stream, object-oriented/relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
• Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
1.7 Data Mining Task Primitives
(1) The set of task-relevant data – which portion of the database is to be used
• Database or data warehouse name
(2) The kind of knowledge to be mined – e.g. characterization, association, classification
(3) Background knowledge – e.g. concept hierarchies
(4) Interestingness measures and thresholds – e.g. support, confidence
(5) Visualization methods – what form to display the result, e.g. rules, tables, charts, graphs
Data Mining: Concepts and Techniques 40
Why Data Mining Query Language?
• Semi-tight
– Efficient implementations of a few essential data mining primitives in a
DB/DW system are provided, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, precomputation of some stat
functions
– Enhanced DM performance
• Tight
– DM is smoothly integrated into a DB/DW system, mining query is
optimized based on mining query analysis, data structures, indexing, query
processing methods of a DB/DW system
– A uniform information processing environment, highly desirable
1.9 Major Issues in Data Mining
• Mining methodology and User interaction
• Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge discovery tasks
• Users should be able to use the database in different ways
• Require the development of numerous data mining techniques
• Interactive mining of knowledge at multiple levels of abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
• Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different levels of abstraction
• Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
• Should be integrated with a DB/DW query language
• 1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this
part, data cleaning is done. It involves handling of missing data, noisy
data etc.
• (a). Missing Data:
This situation arises when some values are missing from the data. It can be
handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
• Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having several independent variables).
• Clustering:
This approach groups similar data into clusters; outliers fall outside the
clusters or may go undetected.
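A minimal sketch of two of the strategies above — ignoring sparse tuples, then filling a remaining gap by fitting a linear regression — in plain Python. The (age, income) tuples are made up for illustration:

```python
# Hypothetical toy data: (age, income); None marks a missing value.
rows = [(25, 30000), (40, None), (None, None), (30, 38000), (35, 46000)]

# "Ignore the tuples": drop rows where every value is missing.
kept = [r for r in rows if any(v is not None for v in r)]

# "Regression": fit income = slope * age + intercept on the complete rows,
# then predict the still-missing incomes from age.
complete = [r for r in kept if None not in r]
n = len(complete)
mean_age = sum(a for a, _ in complete) / n
mean_inc = sum(i for _, i in complete) / n
slope = (sum((a - mean_age) * (i - mean_inc) for a, i in complete)
         / sum((a - mean_age) ** 2 for a, _ in complete))
intercept = mean_inc - slope * mean_age

filled = [(a, i if i is not None else slope * a + intercept) for a, i in kept]
```

With these numbers the fit is income = 1600·age − 10000, so the missing income at age 40 is filled with 54000.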
2. Data Transformation
• This step transforms the data into forms appropriate for the mining process.
It involves the following:
1.Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0)
2.Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes
to help the mining process.
3.Discretization:
This is done to replace the raw values of numeric attribute by interval levels
or conceptual levels.
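Normalization and discretization can be sketched in a few lines of plain Python; the price values and the interval cut points here are made up for illustration:

```python
prices = [20, 35, 50, 80, 100]

# Normalization: min-max scaling into the range [0.0, 1.0]
lo, hi = min(prices), max(prices)
normalized = [(p - lo) / (hi - lo) for p in prices]

# Discretization: replace raw numeric values with interval (conceptual) labels
def bucket(p):
    if p < 40:
        return "low"
    if p < 80:
        return "medium"
    return "high"

labels = [bucket(p) for p in prices]
```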
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.
4. Dimensionality Reduction:
This reduces the size of the data via encoding mechanisms. It can be lossy or
lossless: if the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is lossy. Two
effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
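The lossless case can be illustrated with a generic standard-library encoder (zlib stands in here for the "encoding mechanism"; wavelet and PCA reductions are typically lossy):

```python
import zlib

# A repetitive transaction log compresses well.
data = b"milk,bread;" * 200

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

# Lossless: the original data is fully recoverable from the compressed form,
# and the compressed form is smaller than the original.
assert restored == data
assert len(compressed) < len(data)
```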
WHAT IS FREQUENT PATTERN MINING?
• Frequent Pattern Mining is also known as Association Rule Mining. Pattern
mining is the inquisitive process of finding frequent patterns, causal
structures, and associations in data sets. Given a series of transactions,
pattern mining's main motive is to find rules that enable us to predict the
occurrence of an item based on the occurrence of other items in the
transaction.
• For instance, a set of items, such as pen and ink, that often appears
together in a set of data transactions is called a frequent itemset. If
purchasing a personal computer, later a digital camera, and then a hard
disk repeatedly occurs in the history of a shopping database, it is a
(frequent) sequential pattern. If a substructure occurs regularly in a
graph database, it is called a (frequent) structural pattern.
HOW DOES FREQUENT PATTERN MINING SUPPORT BUSINESS ANALYSIS?
• Pattern mining is applicable in assessing data for varied business operations
and industries.
• Basket Data Analysis: It helps in scrutinizing the association between the
items purchased in a single purchase.
• Selling and Cross Marketing: To identify and operate with businesses that
complement our own business and disregard competitors. For instance,
manufacturers and vehicle dealerships get into cross-marketing campaigns
with gas and oil companies for apparent reasons.
• Catalog Design: Catalogs are designed so that the items on offer act as
mutual complements: buying one item leads to purchasing another, because the
items complement each other or are closely related.
• Medical Treatments: The listed and diagnosed set of illnesses of every patient
is depicted as a transaction, from which the diseases that are probable to
occur sequentially/ simultaneously can be anticipated.
TO UNDERSTAND THE VALUE OF THIS APPLIED TECHNIQUE
• Let’s look at two business scenarios:
• Case One
• Problem: A retail store manager wants to administer a Market Basket Analysis
in order to prepare a better strategy for product bundling and product placement.
• Business Solution: Elicited from the rules of pattern mining, the income of the
store can be increased by strategically placing the complementary products
together or in series, which leads to an upswing in sales. Offers like “Buy this and
enjoy % off ”, “Buy this and get this free” or “Buy one and get three” can be
developed based on the rules designed.
• Case Two
• Problem: A bank marketing manager wants to analyze which products are
frequently and sequentially purchased together.
• Business solution: Based on the rules of pattern mining, bank products can be
cross-sold to every prospective or existing customer, increasing bank revenue
and sales. For example, if a personal loan, savings account and credit card are
sequentially bought, then along with a credit card and personal loan, a new
savings account can be cross-sold to a customer.
Mining Frequent Patterns, Association, and
Correlations
• Pseudo-code:
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
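The pseudo-code above can be turned into a small runnable sketch — a straightforward, unoptimized Apriori in plain Python:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    # L1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Ck+1: join pairs of frequent k-itemsets, prune by the Apriori
        # property (every k-subset of a candidate must itself be frequent)
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(u, k)):
                    Ck1.add(u)
        # Scan the database, counting candidates contained in each transaction
        counts = {c: 0 for c in Ck1}
        for t in transactions:
            tset = set(t)
            for c in Ck1:
                if c <= tset:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent
```

For example, `apriori([{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}], 2)` finds nine frequent itemsets, including {2, 3, 5} with support 2.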
Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4={abcd}
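The L3 → C4 example above can be checked with a short sketch of the two steps: the self-join (merging itemsets that share their first k−1 items in sorted order) and the prune:

```python
from itertools import combinations

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}

# Step 1: self-join — merge itemsets sharing their first k-1 sorted items
joined = set()
for a in L3:
    for b in L3:
        ta, tb = tuple(sorted(a)), tuple(sorted(b))
        if ta[:-1] == tb[:-1] and ta[-1] < tb[-1]:
            joined.add(a | b)   # yields abcd (abc+abd) and acde (acd+ace)

# Step 2: prune — drop any candidate that has an infrequent 3-subset
C4 = {c for c in joined
      if all(frozenset(s) in L3 for s in combinations(c, 3))}
# acde is removed because ade is not in L3, leaving C4 = {abcd}
```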
How to Count Supports of Candidates?
• Candidate itemsets are stored in a hash tree; a subset function finds all
candidates contained in a given transaction.
[Figure: a hash tree over the candidate 3-itemsets 124, 125, 136, 145, 159,
234, 345, 356, 357, 367, 368, 457, 458, 567, 689. Interior nodes hash items by
the rule 1,4,7 / 2,5,8 / 3,6,9; the subset function matches transaction
{1 2 3 5 6} against the tree.]
Challenges of Frequent Pattern Mining
• Challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below
the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Hash entries: {ab, ad, ae} {bd, be, de} …
• Frequent 1-itemset: a, b, d, e
• ab is not a candidate 2-itemset if the sum of the count of {ab,
ad, ae} is below the support threshold
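A sketch of the bucket-count filter (a DHP-style idea): one scan hashes every 2-itemset of every transaction into a small counter table, and any pair whose bucket total is already below min_support can be pruned before candidate generation. The transactions and bucket count here are made up:

```python
from itertools import combinations

transactions = [{"a", "b", "d"}, {"b", "d", "e"}, {"a", "b", "e"}, {"a", "d", "e"}]
min_support = 2
NUM_BUCKETS = 7

# One scan: hash every 2-itemset of every transaction into a bucket counter.
buckets = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % NUM_BUCKETS] += 1

def may_be_frequent(pair):
    """A pair whose bucket count is below min_support cannot be frequent:
    the bucket total is an upper bound on the pair's true support."""
    return buckets[hash(tuple(sorted(pair))) % NUM_BUCKETS] >= min_support
```

Note the filter is one-sided: a passing pair may still turn out infrequent (hash collisions inflate bucket counts), but a failing pair is safely pruned.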
Sampling for Frequent Patterns
• A, B, C, D, E 10 A,B,C,D,E
20 B,C,D,E,
• 2nd scan: find support for 30 A,C,D,F
• AB, AC, AD, AE, ABCDE
• BC, BD, BE, BCDE
Potential max-
• CD, CE, CDE, DE, patterns
• Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in
later scan
• R. Bayardo. Efficiently mining long patterns from databases. In
SIGMOD’98
Mining Frequent Closed Patterns: CLOSET
[Figure: multi-level mining example — Level 1 item Milk with support = 10%,
mined at min_sup = 5%.]
• Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
• Inter-dimension assoc. rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
• Hybrid-dimension assoc. rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
• Quantitative Attributes: numeric, implicit ordering among values—
discretization, clustering, and gradient approaches
Mining Quantitative Associations
• Succinctness:
• Given A1, the set of items satisfying a succinctness constraint
C, then any set S satisfying C is based on A1 , i.e., S contains
a subset belonging to A1
• Idea: Without looking at the transaction database, whether
an itemset S satisfies constraint C can be determined based
on the selection of items
• min(S.Price) ≤ v is succinct
• sum(S.Price) ≤ v is not succinct
• Optimization: If C is succinct, C is pre-counting pushable
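A sketch of why min(S.Price) ≤ v is succinct: given a (hypothetical) item price table, the sets satisfying the constraint can be characterized from the items alone, without scanning the transaction database — S satisfies it iff S contains at least one item priced ≤ v:

```python
# Hypothetical item price table, for illustration only.
price = {"pen": 2, "ink": 5, "notebook": 12, "printer": 90}
v = 10

# Items that individually satisfy price <= v.
cheap = {item for item, p in price.items() if p <= v}

def satisfies_min_price(S):
    """min(S.Price) <= v holds iff S contains some item from `cheap` —
    decidable from the item selection alone, hence succinct."""
    return any(item in cheap for item in S)

# By contrast, sum(S.Price) <= v has no such item-level characterization:
# whether S qualifies depends on the whole combination, so it is not succinct.
```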
The Apriori Algorithm — Example
Database D (min_sup = 2):
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidates from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2
Naïve Algorithm: Apriori + Constraint
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
• Convert tough constraints into anti-monotone or monotone by properly
ordering items
• Examine C: avg(S.profit) ≥ 25

Constraint                    Antimonotone  Monotone  Succinct
sum(S) ≤ v (∀a ∈ S, a ≥ 0)    yes           no        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)    no            yes       no
range(S) ≤ v                  yes           no        no
range(S) ≥ v                  no            yes       no
support(S) ≤ ξ                no            yes       no
A Classification of Constraints
[Figure: classification of constraints — the antimonotone, monotone, and
succinct classes; convertible anti-monotone and convertible monotone
constraints, whose overlap is the strongly convertible class; all remaining
constraints are inconvertible.]
Frequent-Pattern Mining: Summary