Database Schema
5 INTRODUCTION
The objective of this chapter is to explain how to design appropriate data warehouse
schemas from the logical requirements model. This is a critical part of designing a
data warehouse, and should be covered in detail by any technical architect or
database designer. By the end of this chapter, you should be able to design the
database component of a data warehouse.
This chapter assumes that you have read Part Two: Data Warehouse
Architecture, even if you are not a technical architect. The design guidelines within
this chapter follow on from the broad architectural concepts covered in the previous
chapters.
5.1 PROCESS
This design activity is defined in the build the vision stage, after the requirements
analysis and technical blueprint stages are complete (Figure 5.1). All design
decisions should be based on the system architecture defined in the technical
blueprint. However, this is an iterative process, and it is possible to encounter a
number of design issues that cannot be resolved without changing the technical
blueprint. For this reason the technical architect needs to remain involved in the
design process.
5.2 STARFLAKE SCHEMAS
One of the key questions to be answered by the database designer is: How can we
design a database that allows unknown queries to be performant? This question
encapsulates the differences between designing for a data warehouse and designing
for an OLTP system. In a data warehouse you design to support the business process
rather than specific query requirements. In order to achieve this, you must
understand the way in which the information within the data warehouse will be used.
In a decision support data warehouse, a large number of queries tend to ask
questions about an essential fact, analyzed in a variety of ways. For example,
reporting on:
the average number of sales of beans per store over the last month (Figure 5.2);
the ten most popular cable programs over the past week (Figure 5.3);
projected sales of Christmas puddings compared with the actual stock level (Figure 5.4);
the top 20% spending customers over the past quarter;
all customers with an average balance in excess of $25000 month-to-date.
Each of these queries has one thing in common: they are all based on factual data.
The content and presentation of the results may differ between examples, but the
factual and transactional nature of the underlying data is the same.
Figure 5.3 Top ten cable programs.
Fact data possesses some characteristics that allow the underlying information
in the database to be structured. Facts are transactions that have occurred at some
point in the past, and are unlikely to change in the future. Facts can be analyzed in
different ways by cross-referencing the facts with different reference information.
For example, we can look at sales by store, sales by region, or sales by product. In a
data warehouse facts also tend to have few attributes, because there are no
operational data overheads. For each of the examples described, the attributes of the
fact could be as listed in Table 5.1.
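The idea of cross-referencing one set of facts with different reference information can be sketched in a few lines. The following is a minimal illustration only, using Python's built-in sqlite3 module; the table names, column names and sample rows are assumptions made for this sketch and do not come from the chapter.

```python
import sqlite3

# Hypothetical schema: one fact table plus two reference tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store   (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales_fact (
        store_id   INTEGER REFERENCES store(store_id),
        product_id INTEGER REFERENCES product(product_id),
        sale_date  TEXT,
        quantity   INTEGER
    );
    INSERT INTO store VALUES
        (1, 'High Street', 'North'), (2, 'Riverside', 'North'), (3, 'Harbour', 'South');
    INSERT INTO product VALUES (10, 'Beans'), (11, 'Christmas pudding');
    INSERT INTO sales_fact VALUES
        (1, 10, '1997-01-12', 5),
        (2, 10, '1997-01-12', 3),
        (3, 10, '1997-01-13', 4),
        (3, 11, '1997-01-13', 2);
""")

def total_quantity_by(dimension_table, key, label):
    """Sum the same fact rows, grouped by one attribute of a reference table."""
    sql = f"""
        SELECT d.{label}, SUM(f.quantity)
        FROM sales_fact AS f
        JOIN {dimension_table} AS d ON f.{key} = d.{key}
        GROUP BY d.{label}
        ORDER BY d.{label}
    """
    return conn.execute(sql).fetchall()

# The fact rows never change; only the reference data they are joined to does.
sales_by_store = total_quantity_by("store", "store_id", "store_name")
sales_by_region = total_quantity_by("store", "store_id", "region")
sales_by_product = total_quantity_by("product", "product_id", "product_name")
```

Each of the three results is computed from the same four fact rows, which is the property that lets a data warehouse serve queries that were not known at design time.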
One of the major technical challenges within the design of a data warehouse is
to structure a solution that will be effective for a reasonable period of time (three to
five years). This implies that the data should not have to be restructured when the
business changes or the query profiles change. This is an important point, because in
Figure 5.4 Sales of Christmas puddings (sales last year and stock cover for traditional, luxury and individual puddings).
Figure 5.5 A customer-profiling star schema, of the style used in retail banking.
(Diagram labels: customer identifier, customer account balance, account transactions,
account number, type of transaction, amount of transaction, destination account
number, customer events.)
data as primarily read-only data, and reference data as data that will change over a
period of time.
If and when reference information needs to change, the underlying fact data
should not have to change as well. Star schemas are physical database
structures that store the factual data in the "center," surrounded by the reference
(dimension) data (Figure 5.5). The different types of star schemas are discussed in
Section 5.6.
GUIDELINE 5.1 Avoid embedding reference data into the fact table, because that will
protect the fact data from restructuring when the reference data changes.
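The guideline on keeping reference data out of the fact table can be illustrated with a small sketch. It assumes a hypothetical store table and sales_fact table (names, columns and values are invented for illustration) and uses Python's built-in sqlite3 module. How the history of reference changes should be kept is outside the scope of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reference (dimension) data lives in its own table ...
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, region TEXT);
    -- ... and the fact table holds only a key that points at it.
    CREATE TABLE sales_fact (
        store_id  INTEGER REFERENCES store(store_id),
        sale_date TEXT,
        amount    REAL
    );
    INSERT INTO store VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales_fact VALUES
        (1, '1997-01-12', 20.0),
        (1, '1997-01-13', 30.0),
        (2, '1997-01-13', 10.0);
""")

facts_before = conn.execute("SELECT * FROM sales_fact ORDER BY rowid").fetchall()

# A reference change: store 1 is reassigned to a different region.
cursor = conn.execute("UPDATE store SET region = 'East' WHERE store_id = 1")
dimension_rows_changed = cursor.rowcount

facts_after = conn.execute("SELECT * FROM sales_fact ORDER BY rowid").fetchall()

# Queries pick up the new reference data without any change to the facts.
sales_by_region = conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM sales_fact AS f JOIN store AS s USING (store_id)
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
```

Had the region been stored as a column of sales_fact, the same change would have required rewriting every fact row for that store.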
When we are presented with a large entity model, it can be difficult to determine
which entities should become fact tables within the database. Making this
determination is not always straightforward. However, because the query
performance of the data warehouse will hinge on the correct identification of fact
tables, it is very important to get this right.
We have found the following steps and guidelines effective in determining facts
from dimensions.
1 Look for the elemental transactions within the business process. This identifies
Figure 5.6 Flowchart of fact table identification process (steps shown include:
determine key dimensions; check if fact is a dimension; check if dimension is a fact).
The first step in the process of identifying fact tables is where we examine the
business (enterprise model), and identify the transactions that may be of interest.
They will tend to be transactions that describe events fundamental to the business:
for example, the record of phone calls made by a telco customer, or the record of
account transactions made by a banking customer. A series of examples are included
in Table 5.2, in order to highlight the entities that tend to become facts within a star
schema.
For each potential fact ask yourself the question: Is this information operated on
by the business processes? Don't assume that the transactions within the operational
systems are the sole candidates for fact data transactions. Or, put another way, the
degree of detail within the operational system may have more to do with operational
constraints or legacy issues than with the reporting requirements.
For example, let us consider sales information within a small retail operation.
It may be the case that the only information being captured at the retail outlets is
product sales per day per store. This is an aggregation of the basket
transaction: that is, the sale of individual items. Sales may be being captured or
reported at the aggregated level, purely because the operational infrastructure to
capture basket transactions is not in place.
GUIDELINE 5.2 Always determine the fact transactions being used by the business
processes. Don't assume that the reported transactions within an operational
system are the fact data.
Table 5.2 Entities that tend to become facts within a star schema.
Sector and business area      Fact table
Retail
  Sales                       EPOS transaction
  Shrinkage analysis          Stock movement and stock position
Retail Banking
  Customer profiling          Customer events
  Customer profitability      Account transactions
Insurance
  Product profitability       Customer claims and receipts
Telco
  Call analysis               Call event
  Customer analysis           Customer events (e.g. installation, disconnection, payment)
The appropriate design would be one in which the basket transaction is
identified as the fact, and the daily aggregation is structured to be a large summary
table. The load process should be designed to create the daily aggregation from the
current operational system, with hooks in place to generate it automatically from
basket data once the basket information becomes available.
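A load process with such a hook might look something like the following sketch. The function names, record layout and sample rows are hypothetical; the point is only that the daily summary is built from the operational daily feed today, and from basket-level facts as soon as they exist.

```python
from collections import defaultdict

def summarize_baskets(basket_lines):
    """Aggregate basket-level fact rows to (date, store, product) -> quantity."""
    summary = defaultdict(int)
    for line in basket_lines:
        summary[(line["date"], line["store"], line["product"])] += line["quantity"]
    return dict(summary)

def load_daily_summary(daily_feed, basket_lines=None):
    """Build the daily summary table.

    Uses basket data when it is available (the "hook"); otherwise falls back
    to the pre-aggregated figures supplied by the operational system.
    """
    if basket_lines:
        return summarize_baskets(basket_lines)
    return {(r["date"], r["store"], r["product"]): r["quantity"] for r in daily_feed}

# Hypothetical sample data: the same day's trading at two levels of detail.
daily_feed = [
    {"date": "1997-01-12", "store": "S1", "product": "beans", "quantity": 3},
    {"date": "1997-01-12", "store": "S1", "product": "bread", "quantity": 1},
]
basket_lines = [
    {"date": "1997-01-12", "store": "S1", "product": "beans", "quantity": 2},
    {"date": "1997-01-12", "store": "S1", "product": "beans", "quantity": 1},
    {"date": "1997-01-12", "store": "S1", "product": "bread", "quantity": 1},
]

summary_from_feed = load_daily_summary(daily_feed)
summary_from_baskets = load_daily_summary(daily_feed, basket_lines)
```

Because both paths produce a summary with the same keys, queries written against the summary table should not need to change when the business moves to capturing basket transactions.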
When considering the level of detail at which to capture transactions, you
should design to store the most detailed transaction within the data warehouse, and
accept that, initially, only aggregated data may be available. This process will protect
the data warehouse design from major restructuring, should the business move to
capturing detailed transactions in the future.
FACTS
We now need to start checking that the candidate facts and dimensions are what they
appear to be. In some instances, entities within the operational system can appear to
be candidate fact tables, and can turn out to combine both facts and dimensions.
When the source entity represents information held in a single table within the
operational system, it is very likely that the table has been designed to capture every
operational detail about the source entity. In other words, consider whether the
attributes of the table were designed around operational data entry requirements: for
instance, data entered in a single screen. It may be the case that some of the
attributes turn out to be denormalized facts embedded in the dimension entity.
For example, let us consider the address entity in a cable company customer
profiling data warehouse. The address table within the source system could contain:
As we can see, a guiding principle is to consider what the focal point of analysis
is within the business. If the business requirement is geared toward analysis of the
entity that is currently a candidate dimension, it is probably more
appropriate to make it a fact table. A good sanity check to assist within this process
is to ask yourself the question: By how many other dimensions can I view this entity?
If the answer is more than three, it's probably a fact.
For example, in a retail sales analysis data warehouse, when you look at the
promotion dimension, ask yourself:
If the answer to all four questions is "yes," promotions are fact tables in your
situation.
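The "more than three dimensions" sanity check described above can be written down as a small helper. This is only a sketch of the rule of thumb, not a substitute for the judgment the text asks for; the function name and the example dimension lists are assumptions made for illustration.

```python
def is_probably_fact(viewing_dimensions, threshold=3):
    """Rule of thumb: an entity that can be viewed by more than `threshold`
    other dimensions is probably a fact rather than a dimension."""
    return len(set(viewing_dimensions)) > threshold

# Hypothetical examples: the dimension lists below are invented.
promotion_views = ["product", "store", "time", "customer"]
address_views = ["customer"]

promotion_is_fact = is_probably_fact(promotion_views)
address_is_fact = is_probably_fact(address_views)
```

With the invented lists above, the promotion entity passes the check and the address entity does not.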