Designing Fact Tables


DATABASE SCHEMA

... entities that tend to be structured as either facts or dimensions.

Table 5.3

Entity      Fact or dimension  Condition
Customer    Fact               In a customer profiling or customer marketing database, it is probably a fact table
Customer    Dimension          In a retail sales analysis data warehouse, or any other variation where the customer is not the focus, but is used as the basis for analysis
Promotions  Fact               In a promotions analysis data warehouse, promotions are the focal transactions, so are probably facts
Promotions  Dimension          In all other situations

As we can see, a guiding principle is to consider what the focal point of analysis
is within the business. If the business requirement is geared toward analysis of the
entity that is currently a candidate dimension, chances are that it is probably more
appropriate to make it a fact table. A good sanity check to assist within this process
is to ask yourself the question: By how many other dimensions can I view this entity?
If the answer is more than three, it's probably a fact.
For example, in a retail sales analysis data warehouse, when you look at the
promotion dimension, ask yourself:

Can I view promotions by time?
Can I view promotions by product?
Can I view promotions by store?
Can I view promotions by supplier?

If the answer to all four questions is "yes," promotions are fact tables in your
situation.
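
The sanity check above can be written down as a tiny rule. The following is a minimal sketch only: the function name, the threshold parameter, and the second example entity are hypothetical, and the rule of thumb remains a guide rather than a test.

```python
def classify_entity(viewable_by, threshold=3):
    """Apply the rule of thumb: more than `threshold` viewing dimensions suggests a fact."""
    # Duplicates are ignored so that each dimension is counted once.
    return "fact" if len(set(viewable_by)) > threshold else "dimension"


# Promotions can be viewed by four dimensions in the retail example.
promotions = classify_entity(["time", "product", "store", "supplier"])
# Hypothetical entity that can be viewed by only two dimensions.
other_entity = classify_entity(["time", "store"])
print(promotions, other_entity)
```

The threshold is kept as a parameter because the text presents "more than three" as a probable indicator, not a fixed law.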

5.4 DESIGNING FACT TABLES


A common question is: How big should a fact table be? In reality, there is no
practical, inherent limit to the size of a fact table, given the appropriate hardware
architecture, database design, and budget.

For a database designer, the challenge is to recommend a good balance between
the value of information made available and the cost of producing it. Factors such as
the level of detail stored, and the retention period, need to be weighed against the
cost of achieving them. Data warehouses have been built with fact tables larger than
1 TB. This does not mean that we should aim to create large fact tables; rather, the
technological limit of database size can generally be ignored, allowing us to focus on
the real business requirement.

This section will explain the techniques and mechanisms used in order to reduce
the size of fact tables without compromising the value of the data inherent within
them. Most solutions will require a number of these techniques to be applied to the
database design unless the business has deep pockets.

Consider the following options:


1 Understand the significant historical period for each supported function. This
  makes sure that we are only storing in the fact table the period of history that is
  required for the analyses.
2 Determine whether statistical samples on subsets of data will satisfy the
  requirement for detailed data. This impacts on sizing and costs.
3 Select the appropriate columns of data to hold (just because it is there, is no
  reason to hold it).
4 Minimize the column sizes within the fact table.
5 Determine the use of intelligent or non-intelligent keys.
6 Design time into the fact table. This determines how time should be stored, and
  the impact of each option on the query performance and size of the fact table.
7 Partition the fact table. This breaks up the fact table into a number of smaller
  tables in order to aid manageability. This is covered in detail in Chapter 6.

5.4.1 OPTION 1: IDENTIFY THE SIGNIFICANT HISTORICAL PERIOD FOR EACH SUPPORTED FUNCTION

Initial business requirements on the retention period of data always tend to be
aggressive and simplistic: aggressive, in that the requirement may suggest a long
retention period, say five to ten years; simplistic, in that the sizing will invariably
start off on the premise of storing the most detailed transactions (such as account
transactions, telephone call records, or EPOS transactions).

Most data warehouses use a mix of detailed data and different degrees of
aggregation to obtain the best balance between query performance and volume of
data. In order to achieve the appropriate balance, we need to determine what degree
of detail is necessary for each business function.

Broadly speaking, this activity examines the various requirements to retain
detailed data, and pins down the minimum retention period for each one. Some
guidelines are provided here to assist with this process.

Decision making within organizational functions is effective only if it deals with
business patterns that transpired within a reasonable time period.

GUIDELINE 5.6 Identify the historical period significant to decision-making
processes, and the degree of detail required. This could substantially reduce the
volume of data required within the fact table.
[Figure 5.8: retention period chart. Legible labels: shrinkage analysis, sales analysis,
lifestyle profiling; 15 months monthly, 3 months detail, 6 months weekly; time axis from
1 Jan 96 to 1 Jan 97.]

Figure 5.8 Graph showing retention period of functions within a retail operation.

For example, if a retail merchant is trying to decide whether adequate volumes
of a product exist to cover next week's sales, it may be inappropriate to examine
information on the buying patterns of that product for the last six months. It is more
likely that the merchant requires the buying patterns month-to-date and for the
equivalent period a year ago.

The significant retention period will vary between each function, but should be
readily definable by examining the business processes implemented by those
functions. We recommend that you draw a retention period graph showing the
period and detail necessary for each business function (Figure 5.8). Once you have
the chart, it becomes easier to understand what degree of detail is necessary for what
period of time.

After completing this process, you often find that the requirement for detailed
information is more limited than was originally requested. Profiles similar to that in
Figure 5.8 are not uncommon. Needless to say, as far as the business user is
concerned, any switch between detailed data and daily aggregated data should be
seamless.

5.4.2 OPTION 2: DETERMINE WHETHER SAMPLES WILL SATISFY THE REQUIREMENT FOR DETAILED DATA

An alternative mechanism to reduce the volume of detailed information is to retain a
sample of detailed information, and daily or weekly aggregations for the rest. As
long as the sample is representative (for example, 15% of the total data spread
demographically, for customer profiling data warehouses), this should meet the
analysis needs.

This technique may be appropriate in situations where the analysis requirement
is to analyze trends: that is, analyze detailed information in a variety of ways, in
order to spot patterns of behavior between various conditions. The analysis can be
carried out effectively using a subset of the full data. Conversely, sampling may be
inappropriate to satisfy campaign management, because it would require access to
all the detailed customer records to build up direct mailing lists.

GUIDELINE 5.7 If the business requirement does not require all the detailed fact
data, consider storing samples and aggregate the rest.

For example, let's take the case of a retail sales analysis data warehouse, where
basket transactions are stored for a 15% spread of stores across the US. Although
this information can be used to spot trends across all stores, it might be
inappropriate to determine product-buying patterns in all stores located at seaside
resorts, as the statistical sample may be too small.
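
The seaside-resort caveat can be illustrated with a short sketch. Everything here is an assumption made for illustration: the store list is synthetic, the regions and the `seaside` flag are invented, and `MIN_SUBGROUP` is a hypothetical threshold rather than a statistical rule.

```python
import random


def sample_stores(stores_by_region, percent=15, seed=0):
    """Draw the same percentage of stores from every region (a simple stratified sample)."""
    rng = random.Random(seed)
    sample = []
    for region in sorted(stores_by_region):
        stores = stores_by_region[region]
        k = max(1, len(stores) * percent // 100)  # integer arithmetic, at least one store
        sample.extend(rng.sample(stores, k))
    return sample


# Synthetic data: 4 regions of 50 stores, 3 seaside stores in each region.
stores_by_region = {
    region: [{"id": f"{region}_{i}", "seaside": i % 20 == 0} for i in range(50)]
    for region in ("CA", "MA", "NY", "TX")
}

sample = sample_stores(stores_by_region)
seaside_in_sample = [s for s in sample if s["seaside"]]

MIN_SUBGROUP = 5  # hypothetical minimum number of stores for a sub-group analysis
enough_seaside = len(seaside_in_sample) >= MIN_SUBGROUP
print(len(sample), len(seaside_in_sample), enough_seaside)
```

A sample that is adequate across all stores can still contain too few members of a narrow sub-group, which is the situation the seaside example describes.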

5.4.3 OPTION 3: SELECT THE APPROPRIATE COLUMNS

The next option is to remove all columns from the fact entity that are not required to
answer decision support questions. This will typically be status fields, intermediate
values, aggregations, and bits of reference data replicated for query performance.

A good technique is to examine each attribute in turn, and ask yourself the
following questions:

Is this column telling me something about a factual event?
Is there any other place I can derive this data from?
Does the business care, or is it for control purposes only?

This should allow you to remove columns that are not required to satisfy the user
requirements. Derived data and aggregated data can be produced more effectively on
the fly, rather than by storing the aggregated value in the fact table. Performance
improvements to queries should be addressed by aggregations within summary
tables, not base level tables.

GUIDELINE 5.8 Do not store aggregated columns within fact tables. It is usually
cheaper to aggregate the columns on the fly.

For example, in a sales analysis data warehouse, the columns that are typically
required are:

product reference, store reference, date of transaction, number sold (may or may
not be a required aggregation), revenue achieved, and optionally, customer
identifier, time of transaction, till at which the transaction took place,
and teller/assistant that carried out the transaction.

The optional items are useful if the basket transaction is being analyzed by customer,
for customer-profiling reasons. Till and teller information is used in order to analyze
the effectiveness of specific tellers, or the relationship between levels of sales and the
positions of tills within a store. The time when a transaction took place would be
used in analyses of buying patterns over different periods of time within the day (for
example, product peaking during rush hours/after business hours).

In a telco call analysis data warehouse, the columns that are typically required are:

initiating phone number, destination phone number, date and time of
transaction, tariff band, and duration of call.
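
Guideline 5.8 can be sketched with a small in-memory example. This is an illustration under assumptions: SQLite (through Python's `sqlite3` module) stands in for the warehouse database, the column names follow the sales item table of Figure 5.9, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base-level fact table: only columns that describe the factual event.
    create table sales_item (
        product_id integer, store_id integer, date_of_sale text,
        quantity_sold integer, revenue real
    );
    insert into sales_item values
        (1, 10, '1997-01-17', 3, 4.50),
        (1, 11, '1997-01-17', 2, 3.00),
        (2, 10, '1997-01-18', 1, 9.25);
""")

# Totals are computed on the fly instead of being stored as extra fact columns.
totals = conn.execute("""
    select product_id, sum(quantity_sold), sum(revenue)
    from sales_item
    group by product_id
    order by product_id
""").fetchall()
print(totals)
```

If such a query proves too slow, the text points to a separate summary table rather than extra aggregate columns in the base table.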

5.4.4 OPTION 4: MINIMIZE THE COLUMN SIZES WITHIN THE FACT TABLE

Another option is to reduce the size of the row within the fact table. Because fact
tables tend to be large because of the requirement to store large numbers of
transactions, a small saving per row can have a significant effect on the total table size.

For example, if a fact table for a telco call analysis data warehouse contains 3.65
billion rows (that is, 2 million customers, with 2.5 transactions per day per customer,
2 year retention period), a saving of 10 bytes per row will save us:

10 x 3.65 billion bytes = 33.99 GB (taking 1 GB as 2^30 bytes)

This kind of saving is fairly typical in situations where the business requirement
calls for large periods of detailed data. This is why care should be taken to
consider the value of every byte that is designed into a large fact table.

GUIDELINE 5.9 Ensure that every byte in the column definitions within a large fact
table is needed. Savings here will have a substantial effect on the size and complexity
of the fact table.
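
The arithmetic behind the telco example can be reproduced directly. The sketch below uses only the figures quoted in the text; the variable names are illustrative, and a year is taken as 365 days.

```python
customers = 2_000_000
calls_per_customer_per_day = 2.5
retention_days = 2 * 365  # two-year retention period

rows = int(customers * calls_per_customer_per_day * retention_days)
saved_bytes = 10 * rows          # 10 bytes saved on every row
saved_gb = saved_bytes / 2**30   # 1 GB taken as 2^30 bytes

print(rows, round(saved_gb, 2))
```

The same three inputs can be varied to see how quickly a per-row saving scales with retention period or transaction rate.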

5.4.5 OPTION 5: DETERMINE THE USE OF INTELLIGENT OR NON-INTELLIGENT KEYS

As with any relational system, foreign keys within a fact table can be structured in
two ways:

using intelligent keys, where each key represents the unique identifier for the item
in the real world;
and
using non-intelligent keys, where each unique key is generated automatically, and
refers to the unique identifier of the item in the real world (Figure 5.9).

For example, in a retail sales analysis data warehouse, the foreign keys could be:

product SKU, store identifier, and physical date;

or

product code (which maps to SKU in the product dimension table);
store code (which maps to store id in the location dimension table);
date id (which maps to physical date in the time dimension table).

[Figure 5.9: star schema diagram. The fact table Sales item (product id, store id,
time id, quantity sold, revenue) is linked to the dimension tables Product (product id,
product name, SKU, product description), Location (store id, store name), and Time
(time id, date).]

Figure 5.9 Star schema using non-intelligent keys in the fact table.

The use of intelligent keys provides some performance advantages. Specifically,
in a situation where the query refers directly to the intelligent key, the query can be
satisfied by the fact table alone, which means the dimension table need not be
accessed, because the values we are filtering out are already in the same row, and
so we avoid a database join.

If a query required the total sales of large diet cola bottles across the company
last week, using intelligent keys the query would look like:

    select sum(quantity_sold)
    from sales_item
    where SKU = '236712'
    and date_of_sale between '17-JAN-97' and '21-JAN-97';

as opposed to:

    select sum(s.quantity_sold)
    from sales_item s, product p, time t
    where p.SKU = '236712'
    and t.date_of_sale between '17-JAN-97' and '21-JAN-97'
    and p.id = s.product_id
    and t.id = s.time_id;

As we can see, the first query makes no direct access to any of the dimension tables.
This may improve query performance, by avoiding the need to scan and join the
product, location, and time dimension tables.
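
The two query shapes can be compared side by side in a minimal sketch. The assumptions are stated up front: SQLite stands in for the warehouse database, dates are stored as ISO text so that `between` compares correctly, the time dimension is named `time_dim`, the intelligent-key copy of the fact table is named `sales_item_ik`, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables with generated (non-intelligent) keys.
    create table product (id integer primary key, sku text, product_name text);
    create table time_dim (id integer primary key, date_of_sale text);
    -- Fact table using non-intelligent keys.
    create table sales_item (product_id integer, time_id integer, quantity_sold integer);
    -- The same facts keyed directly by SKU and physical date (intelligent keys).
    create table sales_item_ik (sku text, date_of_sale text, quantity_sold integer);

    insert into product values (1, '236712', 'large diet cola'), (2, '555001', 'other');
    insert into time_dim values (1, '1997-01-16'), (2, '1997-01-17'), (3, '1997-01-21');
    insert into sales_item values (1, 1, 4), (1, 2, 6), (1, 3, 5), (2, 2, 9);
    insert into sales_item_ik values
        ('236712', '1997-01-16', 4), ('236712', '1997-01-17', 6),
        ('236712', '1997-01-21', 5), ('555001', '1997-01-17', 9);
""")

# Intelligent keys: the fact table alone answers the question.
direct_total = conn.execute("""
    select sum(quantity_sold)
    from sales_item_ik
    where sku = '236712'
      and date_of_sale between '1997-01-17' and '1997-01-21'
""").fetchone()[0]

# Non-intelligent keys: the filters live in the dimensions, so two joins are needed.
joined_total = conn.execute("""
    select sum(s.quantity_sold)
    from sales_item s
    join product p on p.id = s.product_id
    join time_dim t on t.id = s.time_id
    where p.sku = '236712'
      and t.date_of_sale between '1997-01-17' and '1997-01-21'
""").fetchone()[0]

print(direct_total, joined_total)
```

Both queries return the same total; the difference lies in how many tables must be read to produce it.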

The disadvantage of using intelligent keys is that if any of the unique item
identifiers changes or is reassigned over the life of the data warehouse, the fact
table will have to be updated to reflect the new identifiers. This exercise can be costly
and time consuming, and should be avoided if at all possible.

GUIDELINE 5.10 Use non-intelligent keys in fact tables, unless you are certain that
identifiers will not change during the lifetime of the data warehouse.

In practice, unless you are absolutely certain that the identifiers will not change
or be reassigned in the future, it is safer to use non-intelligent keys. The performance
loss of the query can be recovered by other means, such as pre-building appropriate
summary tables. The one exception to guideline 5.10 is storing the time dimension,
which is discussed next.

5.4.6 OPTION 6: DESIGN TIME INTO THE FACT TABLE

Time information can be stored in a number of ways within the fact table. In most
cases, the most appropriate structure is quite different from the way in which the
other dimensions are stored.

The starting point for the storage of time within the fact table is the use of a
foreign key into a time dimension table. Actual physical dates are stored within the
dimension itself.

Before we can start designing more effective storage structures, we have to
determine whether the business requirement exists for the date of transaction
information only, or for the date and time of transaction. This point will have a
material effect on the data reduction strategies that we can apply.

For example, in a retail sales analysis data warehouse, we need to determine
whether the business will ever need to analyze sales trends through the day, for
example looking at the sale of a product during rush-hour. If the requirement to do
that does not exist, we can apply options of encoding dates within the fact table.
Possible techniques are:

storing the physical date,
storing an offset from the inherent start of the table,
storing a date range.

Storing the physical date


This option is very effective in situations where the business may require access to the
time of transaction, not just the date. Physical dates within a relational database
typically contain the time stamp within their structure, which can be ignored within
queries if not required. However, the one argument for using non-intelligent keys
within the fact table does not apply to dates, because the date specified is not going
to be reassigned or changed in the future.

The cost of storing a physical date, as opposed to a unique reference to a time
dimension table, is minimal compared with the query performance improvement.
That statement holds true in all situations where the bulk of the queries are
constrained by time, which is the expected case in all decision support analyses of
historical data.

GUIDELINE 5.11 Use physical dates in fact tables, rather than foreign keys that
reference rows in a time dimension table.

Storing an offset from the inherent start of the table

It is highly probable that the fact table is partitioned by time, usually in a unit of time
that is meaningful to the business (week, month, quarter, etc.). When we consider
storing time within the fact table, we can exploit the period inherent within the
partitioned table itself, by referring to dates as offsets from the start date of the table.

For example, in a customer profiling data warehouse, if the fact table that stores
customer events is partitioned on a monthly basis, an event that occurred on the 9th
of the month can be represented as the number 8 from the inherent start (0 means no
offset and therefore corresponds to day 1 of the month).

The advantage of this technique is that the storage costs of the date column are
very low: in Figure 5.10 two bytes are large enough to store up to 31 numbers. In a
similar vein, weekly table partitions would require only a single byte to store the date
offset.

It is worth pointing out that the physical start date need not be stored anywhere,
because it can be implied by the table name: for example, sales_Jan_97, or
customer_event_1997_week_16.

The disadvantage of using this technique is that queries would have to be
constructed to convert physical dates into date offsets. For example, if we wished to
return the number of transactions that were completed on 9 January 1997, the query
would look like:

    select count(*)
    from customer_events_Jan97
    where date_of_event = date '1997-01-09' - date '1997-01-01'

Note that the offset from the start would have to start from 0 for the first day, 1 for
the second, 2 for the third, and so on.

This form of encoding may pose problems for a number of user access tools that
generate SQL on the fly. In some cases, they may not be able to cope with these
structures, because they don't conform to a simple star schema model. This issue can
be addressed by creating a database view that applies the mathematical logic to the
row:

    create view customer_events_Jan_97
    as select ..., date '1997-01-01' + date_of_event as date_of_event
    from ...;

GUIDELINE 5.12 Consider using date offsets from the implied start date of a fact
table partition. [Where] date offsets are [used, create a view] that looks like the
logical table, by adding the offset to the start date.

[Figure 5.10: sample fact table with columns SKU, Store, No. sold, Revenue, and Date
offset, with rows for SKU 6754987 (store CA_SF_67), 9284374 (CA_LA_32), 6754778
(MA_BO_2), and 7678163 (NY_NY_45).]

Figure 5.10 Fact table using date offsets.
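
The offset encoding and the compensating view can be tried out in a small sketch. It rests on several assumptions: SQLite stands in for the warehouse database and its `date()` function performs the date arithmetic, the column holding the offset is named `date_offset`, the view name `customer_events_jan97_v` is hypothetical, and the events are invented.

```python
import sqlite3
from datetime import date

PARTITION_START = date(1997, 1, 1)  # implied by the table name, not stored in any row


def to_offset(d):
    """Days since the start of the partition: 0 for the first day, 1 for the second."""
    return (d - PARTITION_START).days


conn = sqlite3.connect(":memory:")
conn.execute("create table customer_events_jan97 (customer_id integer, date_offset integer)")

events = [(101, date(1997, 1, 1)), (102, date(1997, 1, 9)), (103, date(1997, 1, 9))]
conn.executemany(
    "insert into customer_events_jan97 values (?, ?)",
    [(customer_id, to_offset(d)) for customer_id, d in events],
)

# Querying the raw table: the physical date must be converted to an offset first.
count_by_offset = conn.execute(
    "select count(*) from customer_events_jan97 where date_offset = ?",
    (to_offset(date(1997, 1, 9)),),
).fetchone()[0]

# The view adds the offset back onto the start date, so tools can filter on a real date.
conn.execute("""
    create view customer_events_jan97_v as
    select customer_id,
           date('1997-01-01', '+' || date_offset || ' days') as date_of_event
    from customer_events_jan97
""")
count_by_view = conn.execute(
    "select count(*) from customer_events_jan97_v where date_of_event = '1997-01-09'"
).fetchone()[0]

print(to_offset(date(1997, 1, 9)), count_by_offset, count_by_view)
```

The view hides the encoding from access tools, at the price of computing a date for every row that a query touches.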

Storing a date range


If we consider the content of a fact table, it may be the case that the rows report on
factual counts which do not vary substantially over time. Date ranges allow us to
exploit this relationship, by inserting new records only when a factual count has
changed.

For example, in the retail sector, many retailers have a product catalog that is
far greater than the number of active products sold every day: that is, not every
product will sell every day. The percentage of products sold compared with the
product catalog can be as low as 10-15% of the total.

In a shrinkage analysis data warehouse, we can exploit this relationship by only
inserting new stock position records every time the stock position is altered. At any
point in time, if the stock position for a product has not changed today (that is, it is
the same as yesterday's figure), the existing record is updated to reflect this fact. In
practice, this is achieved by incrementing a date range within the row (Figure 5.11).

This technique can produce a significant saving in disk capacity and query
performance, because it is directly related to the relative proportion of changed
records to unchanged records. If changed records account for 15% of the total fact
table, the use of date ranges will reduce the table to approximately 18% of its
original size (we have added an additional column for the date range).
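
The text does not show how the 18% figure is derived. One way to arrive at it, offered here only as an assumption, is to treat all columns as equally wide and the original row as having five columns, so that the extra date column grows each remaining row by one fifth. The function and parameter names below are illustrative.

```python
def relative_size(changed_fraction, original_columns, added_columns=1):
    """Estimate the new table size as a fraction of the original size.

    Assumes equal-width columns: only the changed rows are kept, and each
    kept row grows by the added date-range column(s).
    """
    row_growth = (original_columns + added_columns) / original_columns
    return changed_fraction * row_growth


# Assumed five-column row: 15% of the rows, each 6/5 as wide.
five_columns = relative_size(0.15, original_columns=5)
# A four-column row under the same assumptions gives a slightly higher estimate.
four_columns = relative_size(0.15, original_columns=4)
print(round(five_columns, 2), round(four_columns, 4))
```

Real rows have columns of different widths, so the estimate should be redone with actual byte counts for a given design.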

As with the previous code example, the structure of the query would have to
change to take into account the date range. In effect, queries would have to be
modified to query against a date range using a between statement.

Following on the shrinkage analysis example, if the user asks a question
requiring the stock count of baked beans on 3 February 1997, the query would look
like:

    select sum(sp.stock_count)
    from stock_position_fact_table sp, product_dimension p
    where p.id = sp.p_id
    and p.product_group = 'baked beans'
    and '3-FEB-97' between sp.start_date and sp.end_date

SKU      Store     Position  Start date  End date
6754987  CA_SF_67  157       13 Jan 97   17 Jan 97
9284374  CA_LA_32  96        14 Jan 97   16 Jan 97
6754778  MA_BO_2   34        14 Jan 97   23 Jan 97
7678163  NY_NY_45  79        15 Jan 97   18 Jan 97

Figure 5.11 Stock position fact table with a date range.

A number of access tools may not be able to process fact tables in this format.
In the same way as with storing an offset from an inherent start of the table, this can
be addressed by creating a database view to make the fact table appear to have a row
for every day within the date range. However, in this case the query itself would also
have to change, with the between being replaced by an equals condition, as
follows:

    select sum(sv.stock_count)
    from stock_position_view sv, product_dimension p
    where p.id = sv.p_id
    and p.product_group = 'baked beans'
    and sv.date = '3-FEB-97'

This view can utilize a time dimension table that has a row for every day of the year,
for every year within the data warehouse. This table is joined against the date range
in the fact table, in order to produce a Cartesian product. Access tools can then
operate against the view in the normal way (Figure 5.12).
Unfortunately, the cost of creating the Cartesian product is very high, requiring
significant processing power and temporary space. In effect, the portion of the
original fact table required to satisfy the query must be re-created for every query.
This means that the processing power to perform the expansion must be available, as
well as enough temporary space to accommodate the Cartesian product.
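
The expansion described above can be sketched on two of the rows from Figure 5.11. The assumptions are that SQLite stands in for the warehouse database, that dates are stored as ISO text, and that the names `time_dim`, `date_id`, and `stock_date` are illustrative; the product dimension is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fact table with one row per unchanged run of days.
    create table stock_position (
        sku text, store text, stock_count integer, start_date text, end_date text
    );
    -- Time dimension with one row per day.
    create table time_dim (date_id text);

    insert into stock_position values
        ('6754987', 'CA_SF_67', 157, '1997-01-13', '1997-01-17'),
        ('9284374', 'CA_LA_32',  96, '1997-01-14', '1997-01-16');
    insert into time_dim values
        ('1997-01-13'), ('1997-01-14'), ('1997-01-15'), ('1997-01-16'), ('1997-01-17');

    -- The view re-creates one row per day by joining each range against the days it covers.
    create view stock_position_view as
        select sp.sku, sp.store, sp.stock_count, t.date_id as stock_date
        from stock_position sp
        join time_dim t on t.date_id between sp.start_date and sp.end_date;
""")

# Direct query against the date range.
range_total = conn.execute(
    "select sum(stock_count) from stock_position where ? between start_date and end_date",
    ("1997-01-15",),
).fetchone()[0]

# Equivalent query against the expanded view, using an equals condition.
view_total = conn.execute(
    "select sum(stock_count) from stock_position_view where stock_date = ?",
    ("1997-01-15",),
).fetchone()[0]

expanded_rows = conn.execute("select count(*) from stock_position_view").fetchone()[0]
print(range_total, view_total, expanded_rows)
```

Two stored rows expand to eight rows in the view, which hints at why the expansion becomes expensive on a full-size fact table.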


[Figure 5.12: the Stock position table (SKU, Store, Position, Start date, End date) is
joined against the Time table (Date ID, 13 Jan 97 to 17 Jan 97) to give a Cartesian
product (SKU, Store, Position, Date) with one row per day in each range; for example,
SKU 6754987 at store CA_SF_67 with position 157 expands to one row for each day from
13 Jan 97 to 17 Jan 97.]

Figure 5.12 View expanding a stock position fact table using date ranges, by joining against a
time dimension.

In practice, this may be unacceptably high, and is probably not viable in normal
circumstances. We recommend you consider the use of date ranges only if the access
tools can cope directly with the structure, without requiring the generation of a
Cartesian product.

GUIDELINE 5.13 Consider the use of date ranges in fact tables, where the access tool
can cope directly with the structure. If used, make sure that no on-the-fly data
expansion is occurring within the access tool.

A word of warning: if an access tool appears to operate directly with the data
(that is, without using a combinatory view), check that it does not do the same thing
internally. A number of access tools create internal data stores in order to execute
queries, which are used as query caches. If the cache store is expanding the data on
the fly, the same capacity and performance issues will apply.

5.4.7 OPTION 7: PARTITION THE FACT TABLE



This option is extremely detailed, and is covered separately in Chapter 6.
