Data Mining and Data Warehousing
Relational Database Theory
The process of normalization generally breaks a table into many independent tables.
A normalized database yields a flexible model, making it easy to maintain dynamic relationships between business entities.
A relational database system is effective and efficient for operational databases, which see many updates, because it is optimized for update performance.
Problems
A fully normalized data model can perform very inefficiently for queries.
Historical data are usually large, with static relationships:
Unnecessary joins may take an unacceptably long time
Historical data are diverse
Database vs. Data Warehouse
Data helps in making decisions; a decision made without considering data is simply a guess.
Data is mostly stored in databases or data warehouses.
Database: purpose is data retrieval and updating.
Data warehouse: purpose is data analysis and decision making.
Data comes from many heterogeneous sources (the World Wide Web, scientific databases, digital libraries), which raises integration problems:
Different interfaces
Different data representations
Duplicate and inconsistent information
The Warehousing Approach
Information is integrated in advance and stored in the warehouse for direct querying and analysis.
(Diagram: clients query the data warehouse, which an integration system, guided by metadata, populates from multiple sources.)
Data Warehousing
A data warehouse is a relational database that is designed and developed for query and analysis rather than for transaction processing.
It contains historical and cumulative data derived from transaction data from single or multiple sources.
A DWH is a single version of the truth for an organization, created to help with decision making and forecasting.
A data warehouse is not loaded every time new data is generated.
There are timelines, determined by the business, for when a data warehouse needs to be loaded: daily, monthly, or once a quarter.
Properties of Data Warehousing
Subject-oriented
It focuses on a subject rather than on ongoing operations.
A subject can be a specific business area in an organization, e.g. sales, marketing, or products.
It helps to focus on modelling and analysis of data for decision making.
Integrated
Integrates data from multiple data sources (different formats).
Transforms data from different sources into a consistent format.
Must keep consistent naming conventions, formats and encodings.
Time-variant
Data in the warehouse is only accurate and valid at some point in time or over some time interval.
The time-variance of the data warehouse is also shown in the extended time that the data is held, as a series of snapshots.
Helps to study trends and changes.
Non-volatile
Data should not change once in the warehouse; new data is always added as a supplement to the database, rather than as a replacement.
Previous data is not erased when new data is added to the DWH.
Data is read-only and periodically refreshed, which enables analysing historical data and understanding what happened and when.
Introduction
What is Data Mining?
Data Mining is the process of collecting large amounts of raw data and transforming that data into useful information (operational and financial reports).
Advantages of the Warehousing Approach
High query performance
But not necessarily the most current information
Doesn't interfere with local processing at the sources:
Complex queries run at the warehouse
OLTP stays at the information sources
Why Do We Need a DW?
The primary reason for a DW is for a company to get that extra edge over its competitors.
This extra edge can be gained by making smarter decisions.
Let's consider some strategic questions a manager has to answer to gain that edge:
How do we increase the market share of this company by 5%?
Which product is not doing well in the market?
Which agent needs help with selling policies?
Key Terminologies: OLTP vs OLAP

OLTP                                            OLAP
Contains current data                           Contains historical data
Useful in running the business                  Useful in analyzing the business
Provides primitive, highly detailed data        Provides summarized data
Used for writing data into the DB               Used for reading data from the DW
Size ranges from 100 MB to 1 GB                 Size ranges from 100 GB to 1 TB
Ex: all bank transactions made by a customer    Ex: bank transactions made by a customer at a particular time
Examples

Standard DB (OLTP):
A bank server which records every time a transaction is made for a particular account.
A railway reservation server which records the transactions of passengers.
When did that order ship?
How many units are in inventory?
Does this customer have unpaid bills?
Are any of customer X's line items on backorder?

Warehouse (OLAP):
A bank manager wants to know how many customers are utilizing the ATMs of his branch; based on this he may make some decisions.
What factors affect order processing time?
How did each product contribute to profit last quarter?
Which products have the lowest gross margin?
ETL: Extract, Transform and Load
ETL is the process of extracting data from various sources, transforming this data to meet your requirements, and then loading it into a target data warehouse (a popular ETL tool is Informatica).
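A toy ETL pipeline might look like the following (purely illustrative; the function names, the tuple-shaped source rows and the in-memory "warehouse" list are my own assumptions, not Informatica's or any tool's API):

```python
# A minimal ETL sketch: extract raw rows, transform them into a consistent
# format, and load them into a target "warehouse" (here just a list).

def extract(rows):
    """Pull raw records from a source (a list standing in for a DB/file)."""
    return list(rows)

def transform(records):
    """Normalize to a consistent format: uppercase region codes, amounts as floats."""
    out = []
    for region, amount in records:
        out.append({"region": region.strip().upper(), "amount": float(amount)})
    return out

def load(records, warehouse):
    """Append transformed records to the target warehouse table."""
    warehouse.extend(records)
    return warehouse

warehouse = []
source = [("north ", "120.50"), ("South", "99")]
load(transform(extract(source)), warehouse)
print(warehouse)
```

The key point is the ordering: transformation into one consistent format happens before the load, so the warehouse only ever holds cleaned, uniform records.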
Data Mart
A data mart is a smaller version of the DWH which deals with a single subject.
Data marts are focused on one area. Hence, they draw data from a limited number of sources.
The time taken to build a data mart is much less than the time taken to build a DWH.
It is a lower-cost, scaled-down version of the DW.
Metadata
Metadata is defined as data about data.
Metadata in a DWH describes the source data.
Metadata is used to define which table is the source and which the target, and which concept is used to build the business logic.
It also records where the data is stored and what the size of the data is.
For example, a line in a sales database may contain: 4056 KJ596 223.45
Without metadata describing these fields (e.g. as an order number, a product code and an amount), the values are meaningless.
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe
the data.
Data Mining techniques
Information Visualization
k-nearest neighbor
decision trees
neural networks
association rules
…
Data Mining: Association Rules
The model: data
I = {i1, i2, …, im}: a set of items.
Transaction t: t is a set of items, such that t ⊆ I.
Transaction database T: a set of transactions T = {t1, t2, …, tn}.
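This model can be written down directly (a small sketch; the concrete item and transaction values are made up for illustration):

```python
# I is the set of all items; each transaction t is a subset of I;
# the database T is a collection of transactions.
I = {"i1", "i2", "i3", "i4"}
T = [
    frozenset({"i1", "i2"}),
    frozenset({"i2", "i3", "i4"}),
    frozenset({"i1"}),
]
# every transaction t satisfies t ⊆ I
print(all(t <= I for t in T))
```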
Association Rules
Association rules: finding frequent patterns, associations, and correlation structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Given:
1. A database of transactions
2. Each transaction is a list of items (e.g. purchased by a customer)
Find: all rules that correlate the presence of one set of items with that of another set of items.
Association Rules
Association rules help to analyze and predict customer behavior.
They are if/then statements.
Bread => Milk: if the customer buys Bread, there is a probability that they will also buy Milk.
Buys{onions, potatoes} => Buys{tomatoes}
Such data can be used for marketing activities such as product promotion.
Association Rules
Data mining can typically be used with transactional databases (e.g. in shopping cart analysis).
The aim can be to build association rules about the shopping events.
These are based on itemsets, such as:
{milk, cocoa powder} (a 2-itemset)
{milk, corn flakes, bread} (a 3-itemset)
Association Rules
Items that often occur together can be associated with each other.
These co-occurring items form a frequent itemset.
Conclusions based on the frequent itemsets form association rules.
For example, {milk, cocoa powder} can yield the rule cocoa powder => milk.
Rules
Body ==> Consequent [Support, Confidence]
Body: represents the examined data.
Consequent: represents a discovered property for the examined data.
Support: the percentage of records satisfying both the body and the consequent, i.e. the probability that a transaction contains both A and B.
Confidence: the percentage of records satisfying both the body and the consequent among those satisfying the body, i.e. the probability that a transaction containing A also contains B.
Quality of Rules
We need to estimate how interesting the rules are.
There are subjective and objective measures (e.g. support).
Practical Example
Association rule: an implication expression of the form X => Y, where X and Y are non-overlapping itemsets.
Example: {Milk, Diaper} => {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
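Using the five transactions above, the support and confidence of {Milk, Diaper} => {Beer} can be computed directly (a sketch, not any library's API):

```python
# Support and confidence for the rule {Milk, Diaper} => {Beer},
# computed over the five-transaction example table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

body = {"Milk", "Diaper"}
rule = body | {"Beer"}
supp = support(rule)                  # transactions containing Milk, Diaper AND Beer
conf = support(rule) / support(body)  # of those containing the body, share that also contains Beer
print(supp, conf)                     # 0.4 and 2/3: the rule holds in 2 of 5 transactions,
                                      # and in 2 of the 3 transactions containing the body
```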
Example transaction databases (transaction ID: items):
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Apriori Algorithm
As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets. Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Creating Frequent Sets
Let's define:
Ck as a candidate itemset of size k
Lk as a frequent itemset of size k
The main steps of an iteration are:
1) Find the frequent set Lk-1
2) Join step: Ck is generated by joining Lk-1 with itself (Cartesian product Lk-1 x Lk-1)
3) Prune step (apriori property): any (k-1)-size itemset that is not frequent cannot be a subset of a frequent k-size itemset, and hence should be removed
4) The frequent set Lk has been achieved
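The join and prune steps above can be sketched in a few lines (an illustration only; the helper name apriori_gen and the frozenset representation are my own choices, not from the slides):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate set C_k from frequent set L_{k-1}: join, then prune."""
    L_prev = set(L_prev)
    candidates = set()
    # Join step: merging two (k-1)-itemsets that differ in one item gives a k-itemset.
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step (apriori property): every (k-1)-subset of a candidate
    # must itself be frequent, otherwise the candidate is removed.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = [frozenset({1, 3}), frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})]
print(apriori_gen(L2, 3))  # only {2, 3, 5} survives the prune step
```

With this L2, the join produces {1,2,3}, {1,3,5} and {2,3,5}, but the first two are pruned because {1,2} and {1,5} are not frequent.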
Creating Frequent Sets (2)
The algorithm uses breadth-first search and a hash tree structure to generate candidate itemsets efficiently.
Then the occurrence frequency of each candidate itemset is counted.
Those candidate itemsets whose frequency exceeds the minimum support threshold qualify as frequent itemsets.
Apriori Algorithm (by Agrawal et al. at IBM Almaden Research Center) can be used to generate all frequent itemsets:
1. Pass 1: Generate the candidate itemsets in C1
2. Save the frequent itemsets in L1
3. Pass k: Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1
   1. Join Lk-1 p with Lk-1 q, as follows:
      insert into Ck
      select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
   2. Generate all (k-1)-subsets from the candidate itemsets in Ck
   3. Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset Lk-1
   4. Scan the transaction database to determine the support for each candidate itemset in Ck
   5. Save the frequent itemsets in Lk
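The passes above can be sketched end-to-end as follows (an illustrative sketch, not the authors' code; here min_support is an absolute transaction count and the function name is my own):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets occurring in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items and keep the frequent ones as L1.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Pass k, join step: merge (k-1)-itemsets into k-item candidates.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be in L(k-1).
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database to determine each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in C}
        L = {c for c, n in counts.items() if n >= min_support}
        frequent |= L
        k += 1
    return frequent
```

Running it on the classic four-transaction database used in the worked example below, apriori([{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}], 2) includes {2,3,5} but not {4}, matching the hand computation.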
The Apriori Algorithm: Example
Min support = 50% (i.e. 2 transactions)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D for C1:    {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (frequent):    {1}:2, {2}:3, {3}:3, {5}:3
C2 (from L1):     {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D for C2:    {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (frequent):    {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3:               {2 3 5}   (to create C3, only join itemsets that have the same first item)
Scan D for C3:    L3 = {2 3 5}:2
Example 2
Suppose you have records of a large number of transactions at a shopping center, as follows.
We say an itemset is frequently bought if it is bought at least 60% of the time; here that means it should be bought at least 3 times.

T1  {M, O, N, K, E, Y}
T2  {D, O, N, K, E, Y}
T3  {M, A, K, E}
T4  {M, U, C, K, Y}
T5  {C, O, O, K, I, E}
Example 2
Step 1: Count the number of transactions in which each item occurs. Note that 'O' (Onion) is bought 4 times in total, but it occurs in just 3 transactions.

Item  No. of transactions
M     3
O     3
N     2
K     5
E     4
Y     3
D     1
A     1
U     1
C     2
I     1
Example 2
Step 2: Now remember we said an item is frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought fewer than 3 times from the above table, and we are left with:

Item  Number of transactions
M     3
O     3
K     5
E     4
Y     3

These are the single items that are bought frequently. Now let's say we want to find pairs of items that are bought frequently. We continue from the table above.
Example 2
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we start with the second item, like OK, OE, OY. We do not do OM because we already did MO when making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get:

Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Example 2
Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only once, in {M,O,N,K,E,Y}, while M and K are bought together 3 times, in {M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}. After doing that for all the pairs we get:
Item pairs  Number of transactions
MO          1
MK          3
ME          2
MY          2
OK          3
OE          3
OY          2
KE          4
KY          3
EY          2
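Steps 3 and 4 can be checked with a few lines of code (a sketch using the five transactions and the single-letter item codes from the slides):

```python
from itertools import combinations

# The five transactions of Example 2 (sets, so the duplicate O in T5 counts once).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "O", "K", "I", "E"},
]
frequent_items = ["M", "O", "K", "E", "Y"]

# Step 3: make each pair once (combinations never emits both MO and OM);
# Step 4: count the transactions containing both items of the pair.
pair_counts = {}
for pair in combinations(frequent_items, 2):
    pair_counts["".join(pair)] = sum(1 for t in transactions if set(pair) <= t)
print(pair_counts)
```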
Example 2
Step 5: Remove all the item pairs with a number of transactions less than three, and we are left with the pairs of items frequently bought together:

Item pairs  Number of transactions
MK          3
OK          3
OE          3
KE          4
KY          3
Example 2
Now let's say we want to find sets of three items that are bought together. We use the above table (the table in step 5) and make sets of 3 items.
Step 6: To make the sets of three items we need one more rule (it's termed self-join).
Example 2
While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and you want to generate itemsets of 4 items: you look for two sets having the same first two letters.
ABC and ABD -> ABCD
ACD and ACE -> ACDE
And so on. In general, you have to look for sets differing only in the last letter/item.
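The self-join rule can be sketched directly, writing itemsets as sorted strings (the helper name self_join is my own, not from the slides):

```python
def self_join(itemsets):
    """Extend frequent k-itemsets (sorted strings) to (k+1)-item candidates
    by merging two sets that share everything but the last letter/item."""
    out = []
    for i, a in enumerate(itemsets):
        for b in itemsets[i + 1:]:
            if a[:-1] == b[:-1]:        # same prefix, different last item
                out.append(a + b[-1])
    return out

print(self_join(["ABC", "ABD", "ACD", "ACE", "BCD"]))  # ['ABCD', 'ACDE']
```

The same rule applied to the pairs from step 5 merges OK with OE and KE with KY, which is exactly what the next step does by hand.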
Example 2
It simply means that, from the item pairs in the table above, we find two pairs with the same first letter, so we get:
OK and OE, which gives OKE
KE and KY, which gives KEY
Then we find how many times O, K, E are bought together in the original table, and the same for K, E, Y, and we get the following table:

Item set  Number of transactions
OKE       3
KEY       2
Rule Generation
Given a frequent itemset X, find all non-empty subsets Y ⊂ X such that Y => X - Y satisfies the minimum confidence requirement.
If {A, B, C, D} is a frequent itemset, the candidate rules are:
ABC => D, ABD => C, ACD => B, BCD => A,
A => BCD, B => ACD, C => ABD, D => ABC,
AB => CD, AC => BD, AD => BC, BC => AD,
BD => AC, CD => AB
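Enumerating the candidate rules Y => X - Y for one frequent itemset can be sketched as follows (confidence filtering is omitted, and the function name is my own):

```python
from itertools import combinations

def candidate_rules(X):
    """All rules body => head where body is a non-empty proper subset of X."""
    X = set(X)
    rules = []
    for r in range(1, len(X)):                  # subset sizes 1 .. |X|-1
        for body in combinations(sorted(X), r):
            head = X - set(body)
            rules.append((frozenset(body), frozenset(head)))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 2^4 - 2 = 14 candidate rules, as listed above
```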
T1   X, Y, Z
T2   X, Y
T3   W
T4   X
T5   Y, Z
T6   A, B, Z
T7   X, Z, B
T8   X, Z, W
T9   A, X, Z
T10  Z, Y
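A first pass over this transaction table, counting each item's support, might look like the following (a sketch; I read the stray W in the original layout as belonging to T3, which is an assumption):

```python
# Item support counts for the ten-transaction table above: the first
# (candidate-generation) pass of Apriori over this exercise data.
transactions = [
    {"X", "Y", "Z"}, {"X", "Y"}, {"W"}, {"X"}, {"Y", "Z"},
    {"A", "B", "Z"}, {"X", "Z", "B"}, {"X", "Z", "W"}, {"A", "X", "Z"}, {"Z", "Y"},
]
items = sorted(set().union(*transactions))
supports = {i: sum(1 for t in transactions if i in t) for i in items}
print(supports)  # Z is the most frequent item, appearing in 7 transactions
```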