4 - Data Mining & Preprocessing - L - 11,12,13,14,15,16
Pre-processing
Lecture-11,12,13,14,15,16
Dr. Sumit Dhariwal
School of Computing and Information Technology
Manipal University Jaipur
India
Outline
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
• E.g. Most customers with income level 60k – 80k with food expenses $600 - $800 a month live in that area
• Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of CD player
• Fraud detection
• Find outliers of unusual transactions
• Financial planning
• Summarize and compare the resources and spending
[Figure: the value pyramid of data mining — from data exploration (statistical summary, querying, and reporting) at the base up to end-user decision making at the top, with increasing potential to support business decisions.]
[Figure: data mining at the confluence of multiple disciplines — database technology, statistics, machine learning, information science, visualization, and other disciplines.]
• Data are organized around major subjects, e.g. customer, item, supplier and
activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• User can perform drill-down or roll-up operation to view the data at
different degrees of summarization
• Cluster Analysis
• Class label is unknown: group data to form new classes
• Clusters of objects are formed based on the principle of maximizing intra-
class similarity & minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.
• Outlier Analysis
• Data that do not comply with the general behavior or model.
• Outliers are usually discarded as noise or exceptions.
• Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
• Evolution Analysis
• Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
• Subjective measures
• Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.
1.6 Classification of data mining systems
• Database
• Relational, data warehouse, transactional, stream, object-oriented/relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
• Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
1.7 Data Mining Task Primitives
(1) The set of task-relevant data – which portion of the database is to be used
• Database or data warehouse name
(2) The kind of knowledge to be mined – e.g. characterization, association, classification
(3) Background knowledge – e.g. concept hierarchies
(4) Interestingness measures and thresholds – e.g. support, confidence
(5) Visualization methods – what form to display the result, e.g. rules, tables, charts, graphs
Data Mining: Concepts and Techniques 40
Why Data Mining Query Language?
• Semi-tight
– Efficient implementations of a few essential data mining primitives in a
DB/DW system are provided, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, precomputation of some stat
functions
– Enhanced DM performance
• Tight
– DM is smoothly integrated into a DB/DW system, mining query is
optimized based on mining query analysis, data structures, indexing, query
processing methods of a DB/DW system
– A uniform information processing environment, highly desirable
1.9 Major Issues in Data Mining
• Mining methodology and User interaction
• Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge discovery tasks
• Users should be able to use the database in different ways
• Require the development of numerous data mining techniques
• Interactive mining of knowledge at multiple levels of abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
• Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different levels of abstraction
• Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
• Should be integrated with a DB/DW query language
• 1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this
part, data cleaning is done. It involves handling of missing data, noisy
data etc.
• (a). Missing Data:
This situation arises when some values are missing from the data. It can be
handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
• Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having several independent variables).
• Clustering:
This approach groups similar data into clusters; outliers fall outside the
clusters or may go undetected.
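A minimal sketch of two of the strategies above — ignoring sparse tuples, then filling a remaining gap by fitting a linear regression — in plain Python. The (age, income) tuples are made up for illustration:

```python
# Hypothetical toy data: (age, income); None marks a missing value.
rows = [(25, 30000), (40, None), (None, None), (30, 38000), (35, 46000)]

# "Ignore the tuples": drop rows where every value is missing.
kept = [r for r in rows if any(v is not None for v in r)]

# "Regression": fit income = slope * age + intercept on the complete rows,
# then predict the still-missing incomes from age.
complete = [r for r in kept if None not in r]
n = len(complete)
mean_age = sum(a for a, _ in complete) / n
mean_inc = sum(i for _, i in complete) / n
slope = (sum((a - mean_age) * (i - mean_inc) for a, i in complete)
         / sum((a - mean_age) ** 2 for a, _ in complete))
intercept = mean_inc - slope * mean_age

filled = [(a, i if i is not None else slope * a + intercept) for a, i in kept]
```

With these numbers the fit is income = 1600·age − 10000, so the missing income at age 40 is filled with 54000.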
2. Data Transformation
• This step transforms the data into forms appropriate for the mining process.
It involves the following:
1.Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0)
2.Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes
to help the mining process.
3.Discretization:
This is done to replace the raw values of numeric attribute by interval levels
or conceptual levels.
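Normalization and discretization can be sketched in a few lines of plain Python; the price values and the interval cut points here are made up for illustration:

```python
prices = [20, 35, 50, 80, 100]

# Normalization: min-max scaling into the range [0.0, 1.0]
lo, hi = min(prices), max(prices)
normalized = [(p - lo) / (hi - lo) for p in prices]

# Discretization: replace raw numeric values with interval (conceptual) labels
def bucket(p):
    if p < 40:
        return "low"
    if p < 80:
        return "medium"
    return "high"

labels = [bucket(p) for p in prices]
```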
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.
4. Dimensionality Reduction:
This reduces the size of the data via encoding mechanisms. It can be lossy or
lossless: if the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is lossy. Two
effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
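The lossless case can be illustrated with a generic standard-library encoder (zlib stands in here for the "encoding mechanism"; wavelet and PCA reductions are typically lossy):

```python
import zlib

# A repetitive transaction log compresses well.
data = b"milk,bread;" * 200

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

# Lossless: the original data is fully recoverable from the compressed form,
# and the compressed form is smaller than the original.
assert restored == data
assert len(compressed) < len(data)
```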
WHAT IS FREQUENT PATTERN MINING?
• Frequent Pattern Mining is also known as Association Rule Mining. Pattern
mining is the inquisitive process of finding frequent patterns, causal
structures, and associations in data sets. Given a series of transactions,
pattern mining's main motive is to find rules that enable us to predict the
occurrence of an item based on the occurrence of other items in the
transaction.
• For instance, a set of items, such as pen and ink, that often appears
together in a set of data transactions is called a frequent itemset. If
purchasing a personal computer, later a digital camera, and then a hard
disk repeatedly occurs in the history of a shopping database, it is a
(frequent) sequential pattern. If a substructure occurs regularly in a
graph database, it is called a (frequent) structural pattern.
HOW DOES FREQUENT PATTERN MINING SUPPORT BUSINESS ANALYSIS?
• Pattern mining is applicable in assessing data for varied business operations
and industries.
• Basket Data Analysis: It helps in scrutinizing the association between the
items purchased in a single purchase.
• Selling and Cross Marketing: To identify and operate with businesses that
complement our own business and disregard competitors. For instance,
manufacturers and vehicle dealerships get into cross-marketing campaigns
with gas and oil companies for apparent reasons.
• Catalog Design: Catalogs are designed so that the items on offer act as
mutual complements: buying one item leads to purchasing another, because the
items complement each other or are closely related.
• Medical Treatments: The listed and diagnosed set of illnesses of every patient
is depicted as a transaction, from which the diseases that are probable to
occur sequentially/ simultaneously can be anticipated.
TO UNDERSTAND THE VALUE OF THIS APPLIED TECHNIQUE
• Let’s look at two business scenarios:
• Case One
• Problem: A retail store manager wants to administer a Market Basket Analysis
in order to prepare a better strategy for product bundling and product placement.
• Business Solution: Elicited from the rules of pattern mining, the income of the
store can be increased by strategically placing the complementary products
together or in series, which leads to an upswing in sales. Offers like “Buy this and
enjoy % off ”, “Buy this and get this free” or “Buy one and get three” can be
developed based on the rules designed.
• Case Two
• Problem: A bank marketing manager wants to analyze which products are
frequently and sequentially purchased together.
• Business solution: Based on the rules of pattern mining, bank products can be
cross-sold to every prospective or existing customer, increasing bank revenue
and sales. For example, if a personal loan, savings account and credit card are
sequentially bought, then along with a credit card and personal loan, a new
savings account can be cross-sold to a customer.
Mining Frequent Patterns, Association, and
Correlations
• Pseudo-code:
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
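The pseudo-code above can be turned into a small runnable sketch — a straightforward, unoptimized Apriori in plain Python:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    # L1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Ck+1: join pairs of frequent k-itemsets, prune by the Apriori
        # property (every k-subset of a candidate must itself be frequent)
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(u, k)):
                    Ck1.add(u)
        # Scan the database, counting candidates contained in each transaction
        counts = {c: 0 for c in Ck1}
        for t in transactions:
            tset = set(t)
            for c in Ck1:
                if c <= tset:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent
```

For example, `apriori([{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}], 2)` finds nine frequent itemsets, including {2, 3, 5} with support 2.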
Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4={abcd}
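The L3 → C4 example above can be checked with a short sketch of the two steps: the self-join (merging itemsets that share their first k−1 items in sorted order) and the prune:

```python
from itertools import combinations

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}

# Step 1: self-join — merge itemsets sharing their first k-1 sorted items
joined = set()
for a in L3:
    for b in L3:
        ta, tb = tuple(sorted(a)), tuple(sorted(b))
        if ta[:-1] == tb[:-1] and ta[-1] < tb[-1]:
            joined.add(a | b)   # yields abcd (abc+abd) and acde (acd+ace)

# Step 2: prune — drop any candidate that has an infrequent 3-subset
C4 = {c for c in joined
      if all(frozenset(s) in L3 for s in combinations(c, 3))}
# acde is removed because ade is not in L3, leaving C4 = {abcd}
```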
How to Count Supports of Candidates?
• Candidate itemsets are stored in a hash tree; a subset function finds all
candidates contained in a given transaction.
[Figure: a hash tree over the candidate 3-itemsets 124, 125, 136, 145, 159,
234, 345, 356, 357, 367, 368, 457, 458, 567, 689. Interior nodes hash items by
the rule 1,4,7 / 2,5,8 / 3,6,9; the subset function matches transaction
{1 2 3 5 6} against the tree.]
Challenges of Frequent Pattern Mining
• Challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below
the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Hash entries: {ab, ad, ae} {bd, be, de} …
• Frequent 1-itemset: a, b, d, e
• ab is not a candidate 2-itemset if the sum of the count of {ab,
ad, ae} is below the support threshold
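A sketch of the bucket-count filter (a DHP-style idea): one scan hashes every 2-itemset of every transaction into a small counter table, and any pair whose bucket total is already below min_support can be pruned before candidate generation. The transactions and bucket count here are made up:

```python
from itertools import combinations

transactions = [{"a", "b", "d"}, {"b", "d", "e"}, {"a", "b", "e"}, {"a", "d", "e"}]
min_support = 2
NUM_BUCKETS = 7

# One scan: hash every 2-itemset of every transaction into a bucket counter.
buckets = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % NUM_BUCKETS] += 1

def may_be_frequent(pair):
    """A pair whose bucket count is below min_support cannot be frequent:
    the bucket total is an upper bound on the pair's true support."""
    return buckets[hash(tuple(sorted(pair))) % NUM_BUCKETS] >= min_support
```

Note the filter is one-sided: a passing pair may still turn out infrequent (hash collisions inflate bucket counts), but a failing pair is safely pruned.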
Sampling for Frequent Patterns
• A, B, C, D, E 10 A,B,C,D,E
20 B,C,D,E,
• 2nd scan: find support for 30 A,C,D,F
• AB, AC, AD, AE, ABCDE
• BC, BD, BE, BCDE
Potential max-
• CD, CE, CDE, DE, patterns
• Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in
later scan
• R. Bayardo. Efficiently mining long patterns from databases. In
SIGMOD’98
Mining Frequent Closed Patterns: CLOSET
[Figure: multi-level mining example — Level 1 item Milk with support = 10%,
mined at min_sup = 5%.]
• Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
• Inter-dimension assoc. rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
• Hybrid-dimension assoc. rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
• Quantitative Attributes: numeric, implicit ordering among values—
discretization, clustering, and gradient approaches
Mining Quantitative Associations
• Succinctness:
• Given A1, the set of items satisfying a succinctness constraint
C, then any set S satisfying C is based on A1 , i.e., S contains
a subset belonging to A1
• Idea: Without looking at the transaction database, whether
an itemset S satisfies constraint C can be determined based
on the selection of items
• min(S.Price) ≤ v is succinct
• sum(S.Price) ≤ v is not succinct
• Optimization: If C is succinct, C is pre-counting pushable
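A sketch of why min(S.Price) ≤ v is succinct: given a (hypothetical) item price table, the sets satisfying the constraint can be characterized from the items alone, without scanning the transaction database — S satisfies it iff S contains at least one item priced ≤ v:

```python
# Hypothetical item price table, for illustration only.
price = {"pen": 2, "ink": 5, "notebook": 12, "printer": 90}
v = 10

# Items that individually satisfy price <= v.
cheap = {item for item, p in price.items() if p <= v}

def satisfies_min_price(S):
    """min(S.Price) <= v holds iff S contains some item from `cheap` —
    decidable from the item selection alone, hence succinct."""
    return any(item in cheap for item in S)

# By contrast, sum(S.Price) <= v has no such item-level characterization:
# whether S qualifies depends on the whole combination, so it is not succinct.
```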
The Apriori Algorithm — Example
Database D (min_sup = 2):
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidates from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2
Naïve Algorithm: Apriori + Constraint
TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
• Convert tough constraints into anti-monotone or monotone by properly
ordering items
• Examine C: avg(S.profit) ≥ 25

Constraint                    Antimonotone  Monotone  Succinct
sum(S) ≤ v (∀a ∈ S, a ≥ 0)    yes           no        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)    no            yes       no
range(S) ≤ v                  yes           no        no
range(S) ≥ v                  no            yes       no
support(S) ≤ ξ                no            yes       no
A Classification of Constraints
[Figure: classification of constraints — the antimonotone, monotone, and
succinct classes; convertible anti-monotone and convertible monotone
constraints, whose overlap is the strongly convertible class; all remaining
constraints are inconvertible.]
Frequent-Pattern Mining: Summary