Designing Fact Tables
the level of detail stored, and the retention period, need to be weighed against the
cost of achieving them. Data warehouses have been built with fact tables larger than
1TB. This does not mean that we should aim to create large fact tables; rather, the
technological limit of database size can generally be ignored, allowing us to focus on
the real business requirement.
This section will explain the techniques and mechanisms used in order to reduce
the size of fact tables without compromising the value of the data inherent within
them. Most solutions will require a number of these techniques to be applied.
5.4. OPTION I: IDENTIFY THE SIGNIFICANT HISTORICAL
PERIOD FOR EACH SUPPORTED FUNCTION
Initial business requirements on the retention period of data always tend to be
aggressive and simplistic: aggressive, in that the requirement may suggest a long
retention period, say five to ten years; simplistic, in that the sizing will invariably
start off on the premise of storing the most detailed transactions (such as account
transactions, telephone call records, or EPOS transactions).
Most data warehouses use a mix of detailed data and different degrees of
aggregation to obtain the best balance between query performance and volume of
data. In order to achieve the appropriate balance, we need to determine what degree
of detail is necessary for each business function.
Broadly speaking, this activity examines the various requirements to retain
Figure 5.8 Graph showing retention period of functions within a retail operation.
between various conditions. The analysis can be carried out effectively using a subset
of the full data. Conversely, sampling may be inappropriate to satisfy campaign
management, because it would require access to all the detailed customer records to
build up direct mailing lists.
For example, let's take the case of a retail sales analysis data warehouse, where
basket transactions are stored for a 15% spread of stores across the US. Although
this information can be used to spot trends across all stores, it might be
inappropriate to determine product-buying patterns in all stores located at seaside
resorts, as the statistical sample may be too small.
The next option is to remove all columns from the fact entity that are not required to
answer decision support questions. This will typically be status fields, intermediate
values, aggregations, and bits of reference data replicated for query performance.
A good technique is to examine each attribute in turn, and ask yourself the
following questions:
Is this column telling me something about a factual event?
Is there any other place I can derive this data from?
Does the business care, or is it for control purposes only?
This should allow you to remove columns that are not required to satisfy the user
requirements. Derived data and aggregated data can be produced more effectively on
the fly, rather than by storing the aggregated value in the fact table. Performance
improvements to queries should be addressed by aggregations within summary
tables, not base level tables.

GUIDELINE 5.8 Do not store aggregated columns within fact tables. It is usually
cheaper to aggregate the columns on the fly.
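
As a rough illustration of Guideline 5.8, the sketch below uses Python's built-in sqlite3 module to keep the base-level fact table free of aggregated columns, aggregating either on the fly or through a separately built summary table. The table names (sales_item, sales_by_product), the column names, and the sample rows are assumptions made for the example, not taken from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base-level fact table: no stored aggregates.
    create table sales_item (
        product_id    integer,
        store_id      integer,
        date_of_sale  text,
        quantity_sold integer,
        revenue       real
    );
    insert into sales_item values
        (1, 10, '1997-01-17', 8, 24.60),
        (1, 10, '1997-01-18', 3,  9.30),
        (2, 10, '1997-01-17', 5, 15.75),
        (1, 11, '1997-01-17', 2,  6.20);

    -- Hypothetical summary table, built from the base level rather than
    -- stored as extra columns inside the fact table.
    create table sales_by_product as
        select product_id,
               sum(quantity_sold) as total_sold,
               sum(revenue)       as total_revenue
        from sales_item
        group by product_id;
""")

# Aggregate on the fly from the base-level fact table.
on_the_fly = conn.execute(
    "select sum(quantity_sold) from sales_item where product_id = 1"
).fetchone()[0]

# Read the same figure from the pre-built summary table.
from_summary = conn.execute(
    "select total_sold from sales_by_product where product_id = 1"
).fetchone()[0]

print(on_the_fly, from_summary)
```

Both routes give the same total; the summary table trades extra storage and refresh work for cheaper queries, while the fact table itself stays at base level.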
For example, in a sales analysis data warehouse, the columns that are typically
required are:
product reference, store reference, date of transaction, number sold (may or may
not be a required aggregation), revenue achieved, and, optionally,
customer identifier, time of transaction, till at which the transaction took place,
and teller/assistant that carried out the transaction.
The optional items are useful if the basket transaction is being analyzed by customer,
for customer-profiling reasons. Till and teller information is used in order to analyze
the effectiveness of specific tellers, or the relationship between levels of sales and the
positions of tills within a store. The time when a transaction took place would be
used in analyses of buying patterns over different periods of time within the day (for
example, product peaking during rush hours/after business hours).
84 DATABASE SCHEMA
In a telco call analysis data warehouse, the columns that are typically required are:
NON-INTELLIGENT KEYS
As with any relational system, foreign keys within a fact table can be structured in
two ways:
using intelligent keys, where each key represents the unique identifier for the item
in the real world; and
using non-intelligent keys, where each unique key is generated automatically, and
refers to the unique identifier of the item in the real world (Figure 5.9).
For example, in a retail sales analysis data warehouse, the foreign keys could be:
Figure 5.9 Star schema using non-intelligent keys in the fact table. The Sales item fact
table (product id, store id, time id, quantity sold, revenue) references the Product
(product id, product name, SKU, product description), Location (store id, store name),
and Time (time id, date) dimension tables.
The use of intelligent keys provides some performance advantages. Specifically,
in a situation where the query refers directly to the intelligent key, the query can be
satisfied by the fact table alone, which means the dimension table need not be
accessed, because the values we are filtering on are already in the same row:
    select sum(quantity_sold)
    from sales_item
    where SKU = '236712'
    and date_of_sale between '17-JAN-97' and '21-JAN-97';

as opposed to:

    select sum(quantity_sold)
    from sales_item s, product p, time t
    where p.SKU = '236712'
    and t.date_of_sale between '17-JAN-97' and '21-JAN-97'
    and p.id = s.product_id
    and t.id = s.time_id;
As we can see, the first query makes no direct access to any of the dimension tables.
This may improve query performance, by avoiding the need to scan and join the
product, location, and time dimension tables.
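
The comparison can be tried out with a small sketch in Python's built-in sqlite3 module. Everything here is an assumption for illustration: the two fact table variants (sales_item_intelligent and sales_item), the time_dim table name (used in place of time), the ISO date strings, and the sample rows. It only shows that the two query shapes return the same answer, not how they perform on a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table product  (id integer primary key, sku text);
    create table time_dim (id integer primary key, date_of_sale text);

    -- Variant A (hypothetical): intelligent keys held in the fact row itself.
    create table sales_item_intelligent (
        sku text, date_of_sale text, quantity_sold integer
    );
    -- Variant B (hypothetical): non-intelligent, generated keys.
    create table sales_item (
        product_id integer, time_id integer, quantity_sold integer
    );

    insert into product  values (1, '236712'), (2, '998877');
    insert into time_dim values (1, '1997-01-16'), (2, '1997-01-17'), (3, '1997-01-21');

    insert into sales_item_intelligent values
        ('236712', '1997-01-16', 4),
        ('236712', '1997-01-17', 8),
        ('236712', '1997-01-21', 3),
        ('998877', '1997-01-17', 5);
    insert into sales_item values (1, 1, 4), (1, 2, 8), (1, 3, 3), (2, 2, 5);
""")

# Intelligent keys: the fact table alone satisfies the query.
direct_total = conn.execute("""
    select sum(quantity_sold)
    from sales_item_intelligent
    where sku = '236712'
      and date_of_sale between '1997-01-17' and '1997-01-21'
""").fetchone()[0]

# Non-intelligent keys: the dimension tables must be joined to filter.
joined_total = conn.execute("""
    select sum(s.quantity_sold)
    from sales_item s
    join product  p on p.id = s.product_id
    join time_dim t on t.id = s.time_id
    where p.sku = '236712'
      and t.date_of_sale between '1997-01-17' and '1997-01-21'
""").fetchone()[0]

print(direct_total, joined_total)
```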
The disadvantage of using intelligent keys is that if any of the unique item
identifiers changes or is reassigned over the life of the data warehouse, the fact
table will have to be updated to reflect the new identifiers. This exercise can be costly
and time consuming, and should be avoided if at all possible.

GUIDELINE 5.10 Use non-intelligent keys in fact tables, unless you are certain that
identifiers will not change during the lifetime of the data warehouse.
In practice, unless you are absolutely certain that the identifiers will not change
or be reassigned in the future, it is safer to use non-intelligent keys. The performance
loss of the query can be recovered by other means, such as pre-building appropriate
summary tables. The one exception to guideline 5.10 is storing the time dimension,
which is discussed next.
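
To make the cost of reassigned identifiers concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, the replacement SKU '500001', and the three sample rows are all hypothetical. It counts how many rows must be rewritten when a SKU changes under each keying scheme.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension table and two fact table variants.
    create table product (id integer primary key, sku text);
    create table sales_item (product_id integer, quantity_sold integer);
    create table sales_item_intelligent (sku text, quantity_sold integer);

    insert into product values (1, '236712');
    insert into sales_item values (1, 4), (1, 8), (1, 3);
    insert into sales_item_intelligent values
        ('236712', 4), ('236712', 8), ('236712', 3);
""")

# Non-intelligent keys: only the dimension row carries the SKU.
dimension_rows_updated = conn.execute(
    "update product set sku = '500001' where sku = '236712'"
).rowcount

# Intelligent keys: every fact row holding the old SKU must be rewritten.
fact_rows_updated = conn.execute(
    "update sales_item_intelligent set sku = '500001' where sku = '236712'"
).rowcount

# The fact table with generated keys is untouched and still resolves
# to the product under its new SKU.
total_for_new_sku = conn.execute("""
    select sum(s.quantity_sold)
    from sales_item s
    join product p on p.id = s.product_id
    where p.sku = '500001'
""").fetchone()[0]

print(dimension_rows_updated, fact_rows_updated, total_for_new_sku)
```

With three fact rows the difference is trivial; on a fact table holding years of transactions, the second update is the costly exercise the text warns about.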
It is highly probable that the fact table is partitioned by time, usually in a unit of time
that is meaningful to the business (week, month, quarter, etc.). When we consider
SKU      Store     No. sold  Revenue  Date offset
6754987  CA_SF_67  8         24.6
9284374  CA_LA_32            15.75
6754778  MA_BO_2   3         1.8
7678163  NY_NY_45            5.65     10
    as select (..., date_of_event - '1-Jan-97'
    from (...);
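[Table: stock position fact table with columns SKU, Store, Position, Start date, and
End date. Rows: 6754987, CA_SF_67, position 157; 9284374, CA_LA_32, position 96;
6754778, MA_BO_2, position 34; 7678163, NY_NY_45, position 79. Each row carries a date
range falling between 13 Jan 97 and 23 Jan 97.]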
A number of access tools may not be able to process fact tables in this format.
In the same way as with storing an offset from an inherent start of the table, this can
be addressed by creating a database view to make the fact table appear to have a row
for every day within the date range. However, in this case the query itself would also
have to change, with the between being replaced by an equals condition, as
follows:
This view can utilize a time dimension table that has a row for every day of the year,
for every year within the data warehouse. This table is joined against the date range
in the fact table, in order to produce a Cartesian product. Access tools can then
operate against the view in the normal way (Figure 5.12).
Unfortunately, the cost of creating the Cartesian product is very high, requiring
significant processing power and temporary space. In effect, the portion of the
original fact table required to satisfy the query must be re-created for every query.
This means that the processing power to perform the expansion must be available, as
well as enough temporary space to accommodate the Cartesian product.
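
The expansion described above can be sketched with Python's built-in sqlite3 module. The table and view names (stock_position, time_dim, daily_stock_position), the column names, and the two sample rows are assumptions for illustration and do not reproduce the book's figures. The view joins each date-range row against a one-row-per-day time dimension, so that a query can use an equals condition on a single date.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stock position fact table: one row per date range.
    create table stock_position (
        sku text, store text, position integer,
        start_date text, end_date text
    );
    -- Hypothetical time dimension: one row per day.
    create table time_dim (date_id text primary key);

    insert into stock_position values
        ('1111111', 'STORE_A', 157, '1997-01-13', '1997-01-17'),
        ('2222222', 'STORE_B',  96, '1997-01-14', '1997-01-16');

    -- The view makes the fact table appear to hold one row per day
    -- by joining every range against the days it covers.
    create view daily_stock_position as
        select s.sku, s.store, s.position, t.date_id as position_date
        from stock_position s
        join time_dim t
          on t.date_id between s.start_date and s.end_date;
""")

# Populate the time dimension with seven consecutive days.
first_day = date(1997, 1, 13)
days = [((first_day + timedelta(days=n)).isoformat(),) for n in range(7)]
conn.executemany("insert into time_dim values (?)", days)

# Two stored rows expand to 5 + 3 daily rows in the view.
expanded_rows = conn.execute(
    "select count(*) from daily_stock_position"
).fetchone()[0]

# The query uses an equals condition on the date instead of a between.
position_on_15th = conn.execute("""
    select position
    from daily_stock_position
    where sku = '1111111' and position_date = '1997-01-15'
""").fetchone()[0]

print(expanded_rows, position_on_15th)
```

Even in this toy case the view produces several times as many rows as are stored, which is the expansion cost the text describes.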
Figure 5.12 View expanding a stock position fact table using date ranges, by joining against a
time dimension.
a combinatory view check that it does