Data Mining and Data Warehousing
Relational Database Theory
The process of normalization generally breaks a table into many independent tables.
A normalized database yields a flexible model, making it easy to maintain dynamic relationships between business entities.
A relational database system is effective and efficient for operational databases, which see many updates, because it is optimized for update performance.
Problems
A fully normalized data model can perform very inefficiently for queries.
Historical data are usually large, with static relationships:
Unnecessary joins may take an unacceptably long time
Historical data are diverse
Database vs. Data Warehouse
Data helps in making decisions; a decision made without considering data is simply a guess.
Data is mostly stored in databases or data warehouses.
Database: purpose is data retrieval and updating.
Data warehouse: purpose is data analysis and decision making.
Data comes from many heterogeneous sources (the World Wide Web, scientific databases, digital libraries), which raises integration problems:
Different interfaces
Different data representations
Duplicate and inconsistent information
The Warehousing Approach
Information is integrated in advance and stored in the warehouse for direct querying and analysis.
(Diagram: clients query the data warehouse, which an integration system, guided by metadata, populates from multiple sources.)
Data Warehousing
A data warehouse is a relational database that is designed and developed for query and analysis rather than for transaction processing.
It contains historical and cumulative data derived from transaction data from single or multiple sources.
A DWH is a single version of the truth for an organization, created to help with decision making and forecasting.
A data warehouse is not loaded every time new data is generated.
There are timelines, determined by the business, for when a data warehouse needs to be loaded: daily, monthly, or once a quarter.
Properties of Data Warehousing
Subject-oriented
It focuses on a subject rather than on ongoing operations.
A subject can be a specific business area in an organization, e.g. sales, marketing, or products.
It helps to focus on modelling and analysis of data for decision making.
Integrated
Integrates data from multiple data sources (different formats).
Transforms data from different sources into a consistent format.
Must keep consistent naming conventions, formats and encodings.
Time-variant
Data in the warehouse is only accurate and valid at some point in time or over some time interval.
The time-variance of the data warehouse is also shown in the extended time that the data is held, as a series of snapshots.
Helps to study trends and changes.
Non-volatile
Data should not change once in the warehouse; new data is always added as a supplement to the database, rather than as a replacement.
Previous data is not erased when new data is added to the DWH.
Data is read-only and periodically refreshed, which enables analysing historical data and understanding what happened and when.
Introduction
What is Data Mining?
Data Mining is the process of collecting large amounts of raw data and transforming that data into useful information (operational and financial reports).
Advantages of the Warehousing Approach
High query performance
But not necessarily the most current information
Doesn't interfere with local processing at the sources:
Complex queries run at the warehouse
OLTP stays at the information sources
Why Do We Need a DW?
The primary reason for a DW is for a company to get that extra edge over its competitors.
This extra edge can be gained by making smarter decisions.
Let's consider some strategic questions a manager has to answer to gain that edge:
How do we increase the market share of this company by 5%?
Which product is not doing well in the market?
Which agent needs help with selling policies?
Key Terminologies: OLTP vs OLAP

OLTP                                            OLAP
Contains current data                           Contains historical data
Useful in running the business                  Useful in analyzing the business
Provides primitive, highly detailed data        Provides summarized data
Used for writing data into the DB               Used for reading data from the DW
Size ranges from 100 MB to 1 GB                 Size ranges from 100 GB to 1 TB
Ex: all bank transactions made by a customer    Ex: bank transactions made by a customer at a particular time
Examples

Standard DB (OLTP):
A bank server which records every time a transaction is made for a particular account.
A railway reservation server which records the transactions of passengers.
When did that order ship?
How many units are in inventory?
Does this customer have unpaid bills?
Are any of customer X's line items on backorder?

Warehouse (OLAP):
A bank manager wants to know how many customers are utilizing the ATMs of his branch; based on this he may make some decisions.
What factors affect order processing time?
How did each product contribute to profit last quarter?
Which products have the lowest gross margin?
ETL: Extract, Transform and Load
ETL is the process of extracting data from various sources, transforming this data to meet your requirements, and then loading it into a target data warehouse (a popular ETL tool is Informatica).
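A toy ETL pipeline might look like the following (purely illustrative; the function names, the tuple-shaped source rows and the in-memory "warehouse" list are my own assumptions, not Informatica's or any tool's API):

```python
# A minimal ETL sketch: extract raw rows, transform them into a consistent
# format, and load them into a target "warehouse" (here just a list).

def extract(rows):
    """Pull raw records from a source (a list standing in for a DB/file)."""
    return list(rows)

def transform(records):
    """Normalize to a consistent format: uppercase region codes, amounts as floats."""
    out = []
    for region, amount in records:
        out.append({"region": region.strip().upper(), "amount": float(amount)})
    return out

def load(records, warehouse):
    """Append transformed records to the target warehouse table."""
    warehouse.extend(records)
    return warehouse

warehouse = []
source = [("north ", "120.50"), ("South", "99")]
load(transform(extract(source)), warehouse)
print(warehouse)
```

The key point is the ordering: transformation into one consistent format happens before the load, so the warehouse only ever holds cleaned, uniform records.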
Data Mart
A data mart is a smaller version of the DWH which deals with a single subject.
Data marts are focused on one area. Hence, they draw data from a limited number of sources.
The time taken to build a data mart is much less than the time taken to build a DWH.
It is a lower-cost, scaled-down version of the DW.
Metadata
Metadata is defined as data about data.
Metadata in a DWH describes the source data.
Metadata is used to define which table is the source and which the target, and which concept is used to build the business logic.
It also records where the data is stored and what the size of the data is.
For example, a line in a sales database may contain: 4056 KJ596 223.45
Without metadata describing these fields (e.g. as an order number, a product code and an amount), the values are meaningless.
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe
the data.
Data Mining techniques
Information Visualization
k-nearest neighbor
decision trees
neural networks
association rules
…
Data Mining: Association Rules
The model: data
I = {i1, i2, …, im}: a set of items.
Transaction t: t is a set of items, such that t ⊆ I.
Transaction database T: a set of transactions T = {t1, t2, …, tn}.
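This model can be written down directly (a small sketch; the concrete item and transaction values are made up for illustration):

```python
# I is the set of all items; each transaction t is a subset of I;
# the database T is a collection of transactions.
I = {"i1", "i2", "i3", "i4"}
T = [
    frozenset({"i1", "i2"}),
    frozenset({"i2", "i3", "i4"}),
    frozenset({"i1"}),
]
# every transaction t satisfies t ⊆ I
print(all(t <= I for t in T))
```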
Association Rules
Association rules: finding frequent patterns, associations, and correlation structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Given:
1. A database of transactions
2. Each transaction is a list of items (e.g. purchased by a customer)
Find: all rules that correlate the presence of one set of items with that of another set of items.
Association Rules
Association rules help to analyze and predict customer behavior.
They are if/then statements.
Bread => Milk: if the customer buys Bread, there is a probability that they will also buy Milk.
Buys{onions, potatoes} => Buys{tomatoes}
Such data can be used for marketing activities such as product promotion.
Association Rules
Data mining can typically be used with transactional databases (e.g. in shopping cart analysis).
The aim can be to build association rules about the shopping events.
These are based on itemsets, such as:
{milk, cocoa powder} (a 2-itemset)
{milk, corn flakes, bread} (a 3-itemset)
Association Rules
Items that often occur together can be associated with each other.
These co-occurring items form a frequent itemset.
Conclusions based on the frequent itemsets form association rules.
For example, {milk, cocoa powder} can yield the rule cocoa powder => milk.
Rules
Body ==> Consequent [Support, Confidence]
Body: represents the examined data.
Consequent: represents a discovered property for the examined data.
Support: the percentage of records satisfying both the body and the consequent, i.e. the probability that a transaction contains both A and B.
Confidence: the percentage of records satisfying both the body and the consequent among those satisfying the body, i.e. the probability that a transaction containing A also contains B.
Quality of Rules
We need to estimate how interesting the rules are.
There are subjective and objective measures (e.g. support).
Practical Example
Association rule: an implication expression of the form X => Y, where X and Y are non-overlapping itemsets.
Example: {Milk, Diaper} => {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
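Using the five transactions above, the support and confidence of {Milk, Diaper} => {Beer} can be computed directly (a sketch, not any library's API):

```python
# Support and confidence for the rule {Milk, Diaper} => {Beer},
# computed over the five-transaction example table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

body = {"Milk", "Diaper"}
rule = body | {"Beer"}
supp = support(rule)                  # transactions containing Milk, Diaper AND Beer
conf = support(rule) / support(body)  # of those containing the body, share that also contains Beer
print(supp, conf)                     # 0.4 and 2/3: the rule holds in 2 of 5 transactions,
                                      # and in 2 of the 3 transactions containing the body
```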
Example transaction databases (transaction ID: items):
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Apriori Algorithm
As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets. Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Creating Frequent Sets
Let's define:
Ck as a candidate itemset of size k
Lk as a frequent itemset of size k
The main steps of an iteration are:
1) Find the frequent set Lk-1
2) Join step: Ck is generated by joining Lk-1 with itself (Cartesian product Lk-1 x Lk-1)
3) Prune step (apriori property): any (k-1)-size itemset that is not frequent cannot be a subset of a frequent k-size itemset, and hence should be removed
4) The frequent set Lk has been achieved
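The join and prune steps above can be sketched in a few lines (an illustration only; the helper name apriori_gen and the frozenset representation are my own choices, not from the slides):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate set C_k from frequent set L_{k-1}: join, then prune."""
    L_prev = set(L_prev)
    candidates = set()
    # Join step: merging two (k-1)-itemsets that differ in one item gives a k-itemset.
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step (apriori property): every (k-1)-subset of a candidate
    # must itself be frequent, otherwise the candidate is removed.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = [frozenset({1, 3}), frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})]
print(apriori_gen(L2, 3))  # only {2, 3, 5} survives the prune step
```

With this L2, the join produces {1,2,3}, {1,3,5} and {2,3,5}, but the first two are pruned because {1,2} and {1,5} are not frequent.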
Creating Frequent Sets (2)
The algorithm uses breadth-first search and a hash tree structure to generate candidate itemsets efficiently.
Then the occurrence frequency of each candidate itemset is counted.
Those candidate itemsets whose frequency exceeds the minimum support threshold qualify as frequent itemsets.
Apriori Algorithm (by Agrawal et al. at IBM Almaden Research Center) can be used to generate all frequent itemsets:
1. Pass 1: Generate the candidate itemsets in C1
2. Save the frequent itemsets in L1
3. Pass k: Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1
   1. Join Lk-1 p with Lk-1 q, as follows:
      insert into Ck
      select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
   2. Generate all (k-1)-subsets from the candidate itemsets in Ck
   3. Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset Lk-1
   4. Scan the transaction database to determine the support for each candidate itemset in Ck
   5. Save the frequent itemsets in Lk
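The passes above can be sketched end-to-end as follows (an illustrative sketch, not the authors' code; here min_support is an absolute transaction count and the function name is my own):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets occurring in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items and keep the frequent ones as L1.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Pass k, join step: merge (k-1)-itemsets into k-item candidates.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be in L(k-1).
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database to determine each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in C}
        L = {c for c, n in counts.items() if n >= min_support}
        frequent |= L
        k += 1
    return frequent
```

Running it on the classic four-transaction database used in the worked example below, apriori([{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}], 2) includes {2,3,5} but not {4}, matching the hand computation.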
The Apriori Algorithm: Example
Min support = 50% (i.e. 2 transactions)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D for C1:    {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (frequent):    {1}:2, {2}:3, {3}:3, {5}:3
C2 (from L1):     {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D for C2:    {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (frequent):    {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3:               {2 3 5}   (to create C3, only join itemsets that have the same first item)
Scan D for C3:    L3 = {2 3 5}:2
Example 2
Suppose you have records of a large number of transactions at a shopping center, as follows.
We say an itemset is frequently bought if it is bought at least 60% of the time; here that means it should be bought at least 3 times.

T1  {M, O, N, K, E, Y}
T2  {D, O, N, K, E, Y}
T3  {M, A, K, E}
T4  {M, U, C, K, Y}
T5  {C, O, O, K, I, E}
Example 2
Step 1: Count the number of transactions in which each item occurs. Note that 'O' (Onion) is bought 4 times in total, but it occurs in just 3 transactions.

Item  No. of transactions
M     3
O     3
N     2
K     5
E     4
Y     3
D     1
A     1
U     1
C     2
I     1
Example 2
Step 2: Now remember we said an item is frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought fewer than 3 times from the above table, and we are left with:

Item  Number of transactions
M     3
O     3
K     5
E     4
Y     3

These are the single items that are bought frequently. Now let's say we want to find pairs of items that are bought frequently. We continue from the table above.
Example 2
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we start with the second item, like OK, OE, OY. We do not do OM because we already did MO when making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get:

Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Example 2
Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only once, in {M,O,N,K,E,Y}, while M and K are bought together 3 times, in {M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}. After doing that for all the pairs we get:
Item pairs  Number of transactions
MO          1
MK          3
ME          2
MY          2
OK          3
OE          3
OY          2
KE          4
KY          3
EY          2
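Steps 3 and 4 can be checked with a few lines of code (a sketch using the five transactions and the single-letter item codes from the slides):

```python
from itertools import combinations

# The five transactions of Example 2 (sets, so the duplicate O in T5 counts once).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "O", "K", "I", "E"},
]
frequent_items = ["M", "O", "K", "E", "Y"]

# Step 3: make each pair once (combinations never emits both MO and OM);
# Step 4: count the transactions containing both items of the pair.
pair_counts = {}
for pair in combinations(frequent_items, 2):
    pair_counts["".join(pair)] = sum(1 for t in transactions if set(pair) <= t)
print(pair_counts)
```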
Example 2
Step 5: Remove all the item pairs with a number of transactions less than three, and we are left with the pairs of items frequently bought together:

Item pairs  Number of transactions
MK          3
OK          3
OE          3
KE          4
KY          3
Example 2
Now let's say we want to find sets of three items that are bought together. We use the above table (the table in step 5) and make sets of 3 items.
Step 6: To make the sets of three items we need one more rule (it's termed self-join).
Example 2
While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and you want to generate itemsets of 4 items: you look for two sets having the same first two letters.
ABC and ABD -> ABCD
ACD and ACE -> ACDE
And so on. In general, you have to look for sets differing only in the last letter/item.
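The self-join rule can be sketched directly, writing itemsets as sorted strings (the helper name self_join is my own, not from the slides):

```python
def self_join(itemsets):
    """Extend frequent k-itemsets (sorted strings) to (k+1)-item candidates
    by merging two sets that share everything but the last letter/item."""
    out = []
    for i, a in enumerate(itemsets):
        for b in itemsets[i + 1:]:
            if a[:-1] == b[:-1]:        # same prefix, different last item
                out.append(a + b[-1])
    return out

print(self_join(["ABC", "ABD", "ACD", "ACE", "BCD"]))  # ['ABCD', 'ACDE']
```

The same rule applied to the pairs from step 5 merges OK with OE and KE with KY, which is exactly what the next step does by hand.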
Example 2
It simply means that, from the item pairs in the table above, we find two pairs with the same first letter, so we get:
OK and OE, which gives OKE
KE and KY, which gives KEY
Then we find how many times O, K, E are bought together in the original table, and the same for K, E, Y, and we get the following table:

Item set  Number of transactions
OKE       3
KEY       2
Rule Generation
Given a frequent itemset X, find all non-empty subsets Y ⊂ X such that Y => X - Y satisfies the minimum confidence requirement.
If {A, B, C, D} is a frequent itemset, the candidate rules are:
ABC => D, ABD => C, ACD => B, BCD => A,
A => BCD, B => ACD, C => ABD, D => ABC,
AB => CD, AC => BD, AD => BC, BC => AD,
BD => AC, CD => AB
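Enumerating the candidate rules Y => X - Y for one frequent itemset can be sketched as follows (confidence filtering is omitted, and the function name is my own):

```python
from itertools import combinations

def candidate_rules(X):
    """All rules body => head where body is a non-empty proper subset of X."""
    X = set(X)
    rules = []
    for r in range(1, len(X)):                  # subset sizes 1 .. |X|-1
        for body in combinations(sorted(X), r):
            head = X - set(body)
            rules.append((frozenset(body), frozenset(head)))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 2^4 - 2 = 14 candidate rules, as listed above
```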
T1   X, Y, Z
T2   X, Y
T3   W
T4   X
T5   Y, Z
T6   A, B, Z
T7   X, Z, B
T8   X, Z, W
T9   A, X, Z
T10  Z, Y
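A first pass over this transaction table, counting each item's support, might look like the following (a sketch; I read the stray W in the original layout as belonging to T3, which is an assumption):

```python
# Item support counts for the ten-transaction table above: the first
# (candidate-generation) pass of Apriori over this exercise data.
transactions = [
    {"X", "Y", "Z"}, {"X", "Y"}, {"W"}, {"X"}, {"Y", "Z"},
    {"A", "B", "Z"}, {"X", "Z", "B"}, {"X", "Z", "W"}, {"A", "X", "Z"}, {"Z", "Y"},
]
items = sorted(set().union(*transactions))
supports = {i: sum(1 for t in transactions if i in t) for i in items}
print(supports)  # Z is the most frequent item, appearing in 7 transactions
```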