
CHAPTER 5

Database schema

5.1 INTRODUCTION
The objective of this chapter is to explain how to design appropriate data warehouse
schemas from the logical requirements model. This is a critical part of designing a
data warehouse, and should be covered in detail by any technical architect or
database designer. By the end of this chapter, you should be able to design the
database component of a data warehouse.
This chapter assumes that you have read Part Two: Data Warehouse
Architecture, even if you are not a technical architect. The design guidelines within
this chapter follow on from the broad architectural concepts covered in the previous
chapters.

5.1.1 PROCESS
This design activity is defined in the build the vision stage, after the requirements
analysis and technical blueprint stages are complete (Figure 5.1). All design
decisions should be based on the system architecture defined in the technical
blueprint. However, this is an iterative process, and it is possible to encounter a
number of design issues that may not be resolved without changing the technical
blueprint. For this reason the technical architect needs to remain involved in the
design process.

5.2 STARFLAKE SCHEMAS

One of the key questions to be answered by the database designer is: How can we
design a database that allows unknown queries to be performant? This question
encapsulates the differences between designing for a data warehouse and designing
for an OLTP system. In a data warehouse you design to support the business process
rather than specific query requirements. In order to achieve this, you must
understand the way in which the information within the data warehouse will be used.
In a decision support data warehouse, a large number of queries tend to ask
questions about an essential fact, analyzed in a variety of ways. For example,
reporting on:

the average number of sales of beans per store over the last month (Figure 5.2);
the ten most popular cable programs over the past week (Figure 5.3);
projected sales of Christmas puddings compared with the actual stock level
(Figure 5.4);
the top 20% spending customers over the past quarter;
all customers with an average balance in excess of $25000 month-to-date.

Each of these queries has one thing in common: they are all based on factual data.
The content and presentation of the results may differ between examples, but the
factual and transactional nature of the underlying data is the same.

Figure 5.2 Sales of baked beans. [Chart: month-to-date sales of baked beans (total,
normal, low salt, low fat), by week from 6 Jan 97 to 3 Feb 97.]

Figure 5.3 Ten most popular cable programs. [Chart: top ten cable programs (Star Trek,
The Simpsons, Friends, Babylon 5, Frasier, Pocahontas, News, Mission Impossible,
Murder One, Independence Day), by day from 12 Jan 97 to 18 Jan 97.]

Fact data possesses some characteristics that allow the underlying information
in the database to be structured. Facts are transactions that have occurred at some
point in the past, and are unlikely to change in the future. Facts can be analyzed in
different ways by cross-referencing the facts with different reference information.
For example, we can look at sales by store, sales by region, or sales by product. In a
data warehouse facts also tend to have few attributes, because there are no
operational data overheads. For each of the examples described, the attributes of the
fact could be as listed in Table 5.1.
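
The cross-referencing described above can be illustrated with a minimal Python
sketch. It is not taken from any particular product: the class name, the field
names (loosely modeled on the EPOS attributes in Table 5.1), and all sample values
are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EposSale:
    """One EPOS fact row; field names are hypothetical spellings of Table 5.1."""
    sku: str
    store_id: str
    sold_on: str      # ISO date, kept as text for brevity
    quantity: int
    revenue: float


# Reference data, held apart from the facts. All values are invented.
STORE_REGION = {"S1": "North", "S2": "North", "S3": "South"}
PRODUCT_NAME = {"B-NORM": "Baked beans, normal", "B-LSALT": "Baked beans, low salt"}

SALES = [
    EposSale("B-NORM", "S1", "1997-01-06", 10, 4.50),
    EposSale("B-LSALT", "S1", "1997-01-06", 4, 2.00),
    EposSale("B-NORM", "S2", "1997-01-07", 6, 2.70),
    EposSale("B-NORM", "S3", "1997-01-07", 5, 2.25),
]


def total_quantity(sales, key):
    """Sum quantity sold, grouped by whatever `key` derives from each fact."""
    totals = defaultdict(int)
    for sale in sales:
        totals[key(sale)] += sale.quantity
    return dict(totals)


# The same unchanged facts, viewed through three different references.
by_store = total_quantity(SALES, lambda s: s.store_id)
by_region = total_quantity(SALES, lambda s: STORE_REGION[s.store_id])
by_product = total_quantity(SALES, lambda s: PRODUCT_NAME[s.sku])
```

The fact rows are frozen, and each new view is obtained only by changing the
reference lookup passed as `key`.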

One of the major technical challenges within the design of a data warehouse is
to structure a solution that will be effective for a reasonable period of time (three to
five years). This implies that the data should not have to be restructured when the
business changes or the query profiles change. This is an important point, because in
more traditional applications it is not uncommon to restructure the underlying data
in order to address query performance issues.

Figure 5.4 Sales and stock coverage of Christmas puddings, week-to-date. [Chart: sales
last year and stock cover for Traditional, Luxury and Individual puddings.]

Table 5.1 Example attributes of fact tables.

Requirement               Fact                            Attributes

Sales of beans            EPOS transaction                Quantity sold
                                                          Product identifier (SKU)
                                                          Store identifier
                                                          Date and time
                                                          Revenue achieved

Cable programs            Cable pay-per-view transaction  Customer identifier
                                                          Cable channel watched
                                                          Program watched
                                                          Date and time
                                                          Duration
                                                          Household identifier

Customer spend            Loyalty card transaction        Customer identifier
                                                          Store identifier
                                                          Transaction value
                                                          Date and time

Customer account balance  Account transactions            Customer identifier
                                                          Account number
                                                          Type of transaction
                                                          Amount of transaction
                                                          Destination account number

Star schemas exploit the fact that the content of factual transactions is unlikely
to change, regardless of how it is analyzed. Because the bulk of information in the
data warehouse is represented within the facts, it can be very effective to treat fact
data as primarily read-only data, and reference data as data that will change over a
period of time.
If and when reference information needs to change, the underlying fact data
should not have to change as well.

GUIDELINE 5.1 Avoid embedding reference data into the fact table; this will protect
the fact data from restructuring when the reference data changes.

Star schemas are physical database structures that store the factual data in the
"center," surrounded by the reference (dimension) data (Figure 5.5). The different
types of star schemas are discussed in Section 5.6.

Figure 5.5 A customer-profiling star schema, of the style used in retail banking.
[Diagram: fact table "Customer events" in the center, surrounded by its dimensions.]
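
As a rough illustration of this layout, the following sketch builds a very small
star schema with Python's standard sqlite3 module. The table names, column names
and rows are hypothetical, chosen to echo the retail sales example; a real
warehouse would use its own database platform and far larger tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reference (dimension) tables: expected to change over time.
    CREATE TABLE store_dim   (store_id TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE product_dim (sku TEXT PRIMARY KEY, description TEXT);

    -- Fact table in the "center": keys to the dimensions plus a few measures.
    CREATE TABLE sales_fact (
        sku       TEXT REFERENCES product_dim(sku),
        store_id  TEXT REFERENCES store_dim(store_id),
        sold_at   TEXT,
        quantity  INTEGER,
        revenue   REAL
    );

    INSERT INTO store_dim VALUES ('S1', 'North'), ('S2', 'South');
    INSERT INTO product_dim VALUES ('B-NORM', 'Baked beans, normal');
    INSERT INTO sales_fact VALUES
        ('B-NORM', 'S1', '1997-01-06 09:15', 10, 4.50),
        ('B-NORM', 'S2', '1997-01-06 10:02',  6, 2.70);
""")

SALES_BY_REGION = """
    SELECT d.region, SUM(f.quantity)
    FROM sales_fact AS f JOIN store_dim AS d ON d.store_id = f.store_id
    GROUP BY d.region ORDER BY d.region
"""

before = conn.execute(SALES_BY_REGION).fetchall()

# A reference-data change: store S2 is reassigned to the North region.
# Only the dimension row is updated; no fact row is touched.
conn.execute("UPDATE store_dim SET region = 'North' WHERE store_id = 'S2'")
after = conn.execute(SALES_BY_REGION).fetchall()
fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Because the region is held only in `store_dim`, the reassignment changes the
report without any update to `sales_fact`. Had the region been embedded in each
fact row, every affected fact would have needed rewriting.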

5.3 IDENTIFYING FACTS AND DIMENSIONS

When we are presented with a large entity model, it can be difficult to determine
which entities should become fact tables within the database. Making this
determination is not always straightforward. However, because the query
performance of the data warehouse will hinge on the correct identification of fact
tables, it is very important to get this right.
We have found the following steps and guidelines effective in determining facts
from dimensions.
1 Look for the elemental transactions within the business process. This identifies
  entities that are candidates to be fact tables.
2 Determine the key dimensions that apply to each fact. This identifies entities that
  are candidates to be dimension tables.
3 Check that a candidate fact is not actually a dimension with embedded facts.
4 Check that a candidate dimension is not actually a fact table within the context of
  the decision support requirement.
If steps 3 and 4 change the assignment of an entity (fact to dimension or vice versa),
the process should be restarted from step 2, in order to ensure that the correct
dimensions are identified (Figure 5.6).

Figure 5.6 Flowchart of fact table identification process. [Flowchart: Look for
elemental transactions; Determine key dimensions; Check if fact is a dimension;
Check if dimension is a fact.]
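
The control flow of the four steps, including the restart from step 2, might be
sketched as follows. The function and its four callable parameters are
hypothetical stand-ins for the designer's own judgment. They are not part of any
tool, and the sample entities are invented.

```python
def identify_facts(entities, is_elemental, key_dimensions,
                   fact_is_dimension, dimension_is_fact, max_passes=10):
    """Classify entities as facts or dimensions following steps 1 to 4."""
    # Step 1: elemental transactions are the candidate facts.
    facts = {e for e in entities if is_elemental(e)}
    for _ in range(max_passes):
        # Step 2: key dimensions that apply to each candidate fact.
        dimensions = {d for f in facts for d in key_dimensions(f)} - facts
        # Step 3: candidate facts that are really dimensions.
        demoted = {f for f in facts if fact_is_dimension(f)}
        # Step 4: candidate dimensions that are really facts.
        promoted = {d for d in dimensions if dimension_is_fact(d)}
        if not demoted and not promoted:
            return facts, dimensions
        # An assignment changed, so restart from step 2.
        facts = (facts - demoted) | promoted
    raise RuntimeError("classification did not settle")


# Invented example: a warehouse in which promotions are a focus of analysis.
KEY_DIMS = {
    "EPOS transaction": ["Store", "Product", "Promotion"],
    "Promotion": ["Store", "Product"],
}
facts, dimensions = identify_facts(
    entities=["EPOS transaction", "Store", "Product", "Promotion"],
    is_elemental=lambda e: e == "EPOS transaction",
    key_dimensions=lambda f: KEY_DIMS.get(f, []),
    fact_is_dimension=lambda f: False,
    dimension_is_fact=lambda d: d == "Promotion",
)
```

In this example the first pass promotes Promotion to a fact, and the second pass
confirms Store and Product as the remaining dimensions.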

5.3.1 STEP 1: LOOK FOR THE ELEMENTAL
TRANSACTIONS WITHIN THE BUSINESS PROCESS

The first step in the process of identifying fact tables is where we examine the
business (enterprise model), and identify the transactions that may be of interest.
They will tend to be transactions that describe events fundamental to the business:
for example, the record of phone calls made by a telco customer, or the record of
account transactions made by a banking customer. A series of examples are included
in Table 5.2, in order to highlight the entities that tend to become facts within a star
schema.

Table 5.2 Candidate fact tables for each industry/business requirement.

Sector and business         Fact table

Retail
  Sales                     EPOS transaction
  Shrinkage analysis        Stock movement and stock position
Retail Banking
  Customer profiling        Customer events
  Customer profitability    Account transactions
Insurance
  Product profitability     Customer claims and receipts
Telco
  Call analysis             Call event
  Customer analysis         Customer events (e.g. installation, disconnection, payment)

For each potential fact ask yourself the question: Is this information operated on
by the business processes? Don't assume that the transactions within the operational
systems are the sole candidates for fact data transactions. Or, put another way, the
degree of detail within the operational system may have more to do with operational
constraints or legacy issues than with the reporting requirements.

GUIDELINE 5.2 Always determine the factual transactions being used by the business
processes. Don't assume that the reported transactions within an operational system
are fact data.

For example, let us consider sales information within a small retail operation.
It may be the case that the only information being captured at the retail outlets is
product sales per day per store. This information at first glance may look like
candidate fact data.
If we examine the business processes more closely, we have to ask whether they
operate with daily aggregations of sales, or with actual basket transactions. The
answer in this case is that the business processes operate on the
transaction: that is, the sale of individual items. Sales may be being captured or
reported at the aggregated level, purely because the operational infrastructure to
capture basket transactions is not in place.
The appropriate design would be one in which the basket transaction is
identified as the fact, and the daily aggregation is structured to be a large summary
table. The load process should be designed to create the daily aggregation from the
current operational system, with hooks in place to generate it automatically from
basket data once the basket information becomes available.
When considering the level of detail at which to capture transactions, you
should design to store the most detailed transaction within the data warehouse, and
accept that, initially, only aggregated data may be available. This process will protect
the data warehouse design from major restructuring, should the business move to
capturing detailed transactions in the future.
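
One possible shape for such a load process is sketched below in Python. The
function names, the tuple layout of a basket line, and the sample data are all
assumptions. The point is only that the summary table has a single definition,
which can be fed either from the aggregated figures available today or from basket
data later.

```python
from collections import Counter

# Hypothetical basket-level facts: (sale_date, store_id, sku, quantity).
BASKET_LINES = [
    ("1997-01-06", "S1", "B-NORM", 2),
    ("1997-01-06", "S1", "B-NORM", 1),
    ("1997-01-06", "S1", "B-LSALT", 1),
    ("1997-01-07", "S1", "B-NORM", 3),
]


def daily_summary_from_baskets(basket_lines):
    """Roll basket lines up to product sales per day per store."""
    summary = Counter()
    for sale_date, store_id, sku, quantity in basket_lines:
        summary[(sale_date, store_id, sku)] += quantity
    return dict(summary)


def load_daily_summary(daily_feed=None, basket_lines=None):
    """Build the summary table from whichever source is available.

    `daily_feed` stands for the aggregated figures the operational system
    reports today; `basket_lines` is the detailed source expected later.
    """
    if basket_lines is not None:
        # The "hook": once basket data exists, derive the summary from it.
        return daily_summary_from_baskets(basket_lines)
    return dict(daily_feed or {})


summary_today = load_daily_summary(daily_feed={("1997-01-06", "S1", "B-NORM"): 3})
summary_later = load_daily_summary(basket_lines=BASKET_LINES)
```

Queries written against the summary keep working when the source switches,
because the key (day, store, product) is the same in both cases.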

5.3.2 STEP 2: DETERMINE THE KEY DIMENSIONS THAT APPLY
TO EACH FACT

The next step is to identify the main dimensions for each candidate fact table. This
can be achieved by looking at the logical model, and finding out which entities are
associated with the entity representing the fact table. The challenge here is to focus
on the key dimension entities, which may not be the ones directly associated to the
fact entity.
For example, in a retail banking data warehouse, we may have identified
account transactions as a candidate fact table, but the relationship between account
transaction and customer may be indirect, such as through the account entity and the
account-owned-by relationship.

At this point, ask yourself what the focus of the business analysis will be. Is it
likely to be "analyze account transactions by account," or "analyze how customers
use our services"? If the focus of the data warehouse is analysis of customer usage,
then structure the dimension to represent the customer entity, not the account
entity. By all means, store the relationship between each customer and the accounts
they own, but key into the account transaction by the customer identifier.

GUIDELINE 5.3 Structure dimensions to represent the key focal points of analysis of
factual transactions. Use the relationships to the fact table within the model as a
starting point; restructure where necessary to ensure that you can apply suitable
foreign keys.

In the same way, apply the same test to each dimension, making sure that what
you end up with is a candidate fact table with key dimensions.
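
To make the retail banking example concrete, here is a small Python sketch of a
load step that resolves the indirect relationship once, so that each fact row
carries the customer identifier directly. The identifiers and field names are
invented, and the sketch assumes, purely for brevity, that every account has a
single owner.

```python
# Hypothetical source rows. The operational system links a transaction to an
# account, and an account to its owner (the account-owned-by relationship).
# Assumption for this sketch: each account has exactly one owner.
ACCOUNT_OWNED_BY = {"ACC-1": "CUST-A", "ACC-2": "CUST-A", "ACC-3": "CUST-B"}

SOURCE_TRANSACTIONS = [
    {"account": "ACC-1", "type": "deposit", "amount": 100.0},
    {"account": "ACC-2", "type": "withdrawal", "amount": 40.0},
    {"account": "ACC-3", "type": "deposit", "amount": 25.0},
]


def to_fact_rows(transactions, owned_by):
    """Add the customer key to each transaction while loading the fact table."""
    return [{**txn, "customer": owned_by[txn["account"]]} for txn in transactions]


FACT_ROWS = to_fact_rows(SOURCE_TRANSACTIONS, ACCOUNT_OWNED_BY)


def transaction_count_by_customer(fact_rows):
    """Analyze by customer directly, with no lookup through the account."""
    counts = {}
    for row in fact_rows:
        counts[row["customer"]] = counts.get(row["customer"], 0) + 1
    return counts


counts = transaction_count_by_customer(FACT_ROWS)
```

The account number is still stored on each row, so analysis by account remains
possible, but the customer key is what the customer dimension joins to.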

5.3.3 STEP 3: CHECK THAT A CANDIDATE FACT IS NOT
ACTUALLY A DIMENSION TABLE WITH DENORMALIZED
FACTS

We now need to start checking that the candidate facts and dimensions are what they
appear to be. In some instances, entities within the operational system can appear to
be candidate fact tables, and can turn out to combine both facts and dimensions.
When the source entity represents information held in a single table within the
operational system, it is very likely that the table has been designed to capture every
operational detail about the source entity. In other words, consider whether the
attributes of the table were designed around operational data entry requirements: for
instance, data entered in a single screen. It may be the case that some of the
attributes turn out to be denormalized facts embedded in the dimension entity.
For example, let us consider the address entity in a cable company customer
profiling data warehouse. The address table within the source system could contain:

the street address and zip code;
dates that stipulate
    when cables were laid past the address,
    when a salesperson visited the household at that address,
    when the household was connected to the cable service,
    when promotional material was posted to that address,
    when a subscription service to each product was initiated,
    when a subscription was cancelled,
    etc.

If we examine the attributes in detail, we can see that a substantial proportion of the
address entity is dates on which various events took place. In practice, the address
entity could be mistaken for a fact table, because users will query
information about addresses. However, a more accurate representation would be to
say that a number of operational events occurred at specific addresses (Figure 5.7).

Figure 5.7 Star schema for addresses and operational events. [Diagram: fact table
"Operational events" with its dimensions.]

This means that each date value within an existing row within the address table
becomes a row in a new fact table. This will affect the database sizing calculations, but
the benefits are that users can now easily query operational events. For example:

Report on the number of connections quarter-to-date.
Report the time lag between cables being laid and subscriptions being taken out.
Report the conversion rate between promotional events and subscriptions.

GUIDELINE 5.4 Look for denormalized dimensions within candidate fact tables. It may
be the case that the candidate fact table is a dimension containing groups of factual
attributes.

In addition, we should avoid designs that require fact data to be updated as new
transactions take place. Read-only fact tables allow us to make use of very
specific database facilities that improve manageability and query performance.

GUIDELINE 5.5 Design fact tables to store rows that will not vary over time. [...]
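
The restructuring of the address table can be sketched in a few lines of Python.
The column names, the sample rows and the two report functions are hypothetical
and far simpler than a real cable company source system. The sketch only shows
each populated date column becoming one row in an operational events fact table.

```python
from datetime import date

# Hypothetical source rows: one address row with a date column per event.
ADDRESS_ROWS = [
    {"address_id": 1, "zip": "10001", "cable_laid": date(1996, 3, 1),
     "promotion_posted": date(1996, 4, 10), "subscribed": date(1996, 5, 15)},
    {"address_id": 2, "zip": "10002", "cable_laid": date(1996, 3, 1),
     "promotion_posted": date(1996, 4, 10), "subscribed": None},
]
EVENT_COLUMNS = ("cable_laid", "promotion_posted", "subscribed")


def split_address_rows(rows):
    """Return (address dimension rows, operational event fact rows)."""
    dimension, events = [], []
    for row in rows:
        dimension.append({"address_id": row["address_id"], "zip": row["zip"]})
        for column in EVENT_COLUMNS:
            if row[column] is not None:  # a missing date means no event yet
                events.append((row["address_id"], column, row[column]))
    return dimension, events


ADDRESS_DIM, EVENT_FACTS = split_address_rows(ADDRESS_ROWS)


def conversion_rate(events, cause="promotion_posted", effect="subscribed"):
    """Share of addresses with a `cause` event that also have an `effect` event."""
    caused = {addr for addr, kind, _ in events if kind == cause}
    converted = {addr for addr, kind, _ in events if kind == effect} & caused
    return len(converted) / len(caused)


def lag_days(events, address_id, start="cable_laid", end="subscribed"):
    """Days between two events at one address."""
    when = {kind: day for addr, kind, day in events if addr == address_id}
    return (when[end] - when[start]).days
```

A later subscription at address 2 would be recorded by appending a new event row,
so no existing fact row has to be updated.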

5.3.4 STEP 4: CHECK THAT A CANDIDATE DIMENSION IS NOT
ACTUALLY A FACT TABLE

The final step is to ensure that none of the key dimensions are in themselves fact
tables. This is a common problem, because there are a number of entities that at first
glance can appear to be dimensions or facts. Table 5.3 contains examples of two
entities that could be either facts or dimensions.

Table 5.3 Entities that tend to be structured as either facts or dimensions.

Entity       Fact or dimension   Condition

Customer     Fact                In a customer profiling or customer marketing
                                 database, it is probably a fact table
             Dimension           In a retail sales analysis data warehouse, or any
                                 other variation where the customer is not the
                                 focus, but is used as the basis for analysis
Promotions   Fact                In a promotions analysis data warehouse,
                                 promotions are the focal transactions, so are
                                 probably facts
             Dimension           In all other situations

As we can see, a guiding principle is to consider what the focal point of analysis
is within the business. If the business requirement is geared toward analysis of the
entity that is currently a candidate dimension, chances are that it is probably more
appropriate to make it a fact table. A good sanity check to assist within this process
is to ask yourself the question: By how many other dimensions can I view this entity?
If the answer is more than three, it's probably a fact.
For example, in a retail sales analysis data warehouse, when you look at the
promotion dimension, ask yourself:

Can I view promotions by time?
Can I view promotions by product?
Can I view promotions by store?
Can I view promotions by supplier?

If the answer to all four questions is "yes," promotions are fact tables in your
situation.
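
The sanity check is simple enough to write down as a one-line rule. The function
name and the default threshold parameter below are our own, and the second sample
list is invented for contrast. The rule is a heuristic only, not a substitute for
examining the business requirement.

```python
def probably_a_fact(viewable_by, threshold=3):
    """Rule of thumb: an entity viewable by more than `threshold` dimensions."""
    return len(set(viewable_by)) > threshold


# The four questions asked of the promotion dimension, all answered "yes".
promotion_views = ["time", "product", "store", "supplier"]
# A hypothetical entity that can be viewed in only two ways.
other_views = ["region", "time"]

promotion_is_fact = probably_a_fact(promotion_views)
other_is_fact = probably_a_fact(other_views)
```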
