Designing Dimension Tables


5.5 DESIGNING DIMENSION TABLES


This section will explain the considerations used in designing the dimension tables.
These techniques should be applied after the fact tables have been designed, because
it helps to know which fact tables exist.
Getting the design for the dimension tables wrong is not a major disaster. The
volumes should be relatively small (probably below 5 GB in total), so restructuring
costs will be small as long as the primary keys to the fact table(s) are not changed.

5.5.1 CREATING A STAR DIMENSION

Star dimensions speed up query performance by denormalizing reference
information into a single table. Star dimensions rely on the perceived use of
information by typical queries, where the bulk of the queries are likely to be
analyzing facts by applying a number of constraints against a single dimension.
For example, in a retail sales analysis data warehouse, typical queries will
analyze sales information by the product dimension. That is, the queries will tend to
ask questions about:

a particular product group, such as ladies' wear;

attributes of products, such as size, color, style: "16-inch collar shirts," or "white
shirts," or "dress shirts."

[Entity model: Product (name, color, style, size) and the related entities Price,
Store, Region, Section, Department, and Business unit.]

Figure 5.13 Entity model for product information.

Because, in all cases, the query constrains the set of products in a variety of ways, the
query can be speeded up if all the constraining information is in the same table. In
practice, this is achieved by denormalizing all the additional entities related to the
product entity into a single product table (Figure 5.13).
For example, we would take all the product hierarchy information and
denormalize it into the same row (that is, one column per level in the hierarchy).
Attributes such as size, color, material, style, would be added to each row as well
(Figure 5.14).
This technique works well in situations where there are a number of entities
related to the key dimension entity, which are accessed often. There is a performance
saving from not having to join additional tables to access those attributes.
This technique may not be appropriate in situations where the additional data is
not accessed very often, because the overhead of scanning the expanded dimension
table may not be offset by the gain in the query. If the bulk of queries don't access
those columns, this technique will speed up a minority of queries by slowing down
the majority.
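
The denormalization described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the book's own schema: the snowflake table names, the column names, the sample rows, and the nesting order (product within department within section within business unit) are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical snowflake tables: one table per hierarchy level.
    CREATE TABLE business_unit (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE section (id INTEGER PRIMARY KEY, name TEXT,
                          business_unit_id INTEGER REFERENCES business_unit(id));
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT,
                             section_id INTEGER REFERENCES section(id));
    CREATE TABLE product (sku INTEGER PRIMARY KEY, style TEXT, colour TEXT,
                          size TEXT,
                          department_id INTEGER REFERENCES department(id));

    INSERT INTO business_unit VALUES (1, 'Fashion');
    INSERT INTO section VALUES (10, 'Clothing', 1);
    INSERT INTO department VALUES (100, 'Menswear', 10);
    INSERT INTO product VALUES (1001, 'dress', 'white', '16in collar', 100);
    INSERT INTO product VALUES (1002, 'casual', 'blue', 'M', 100);

    -- Star dimension: one row per SKU, one column per hierarchy level,
    -- generated from the snowflake tables.
    CREATE TABLE product_dimension AS
    SELECT p.sku,
           s.name AS section,
           d.name AS department,
           b.name AS business_unit,
           p.style, p.colour, p.size
    FROM product p
    JOIN department d ON d.id = p.department_id
    JOIN section s ON s.id = d.section_id
    JOIN business_unit b ON b.id = s.business_unit_id;
""")

# Several constraints on the product are now applied to a single table.
white_menswear = conn.execute(
    "SELECT sku FROM product_dimension "
    "WHERE department = 'Menswear' AND colour = 'white' ORDER BY sku"
).fetchall()
```

Because the star table is derived from the snowflake tables, it can be dropped and regenerated if the set of denormalized columns needs to change.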

[Product dimension columns: SKU, Section, Department, Business unit, style,
colour, size.]

Figure 5.14 Product dimension with denormalized attributes.


92 DATABASE SCHEMA

In order to retain a reasonable balance, insert the columns that are going to be
accessed often, based on your understanding of how the information will be used.
The star dimension table is generated from the snowflake data, so if you need to
change the star dimension in future to add extra columns, it's easy to do. Note that
deleting or modifying columns may have an effect on any existing canned queries,
and they may have to be modified as well.

GUIDELINE 5.14 Denormalize entities accessed often into the star dimension
table. All other entities should remain in the snowflake structure.

5.5.2 HIERARCHIES AND NETWORKS

It may not be possible to denormalize all entities into star dimensions. Specifically,
all entities that are related through many-to-many relationships should not be
denormalized into a star dimension: that is, it is not efficient to denormalize multiple
values into a single row.
For example, many retailers use multiple product hierarchies to represent
different product views. These could be:
the product group hierarchies (e.g. business unit, sub-business unit, section,
department, SKU);
and a number of hierarchies where each one represents:

products sold by competitors;


products supplied by a single, major supplier;
products that are part of a single promotion (for example, Italian meals);
specific uses within the business (for example, private hierarchies, products sold
during Christmas).
In these situations, it is more effective to determine the hierarchy likely to be used by
the largest number of queries. This hierarchy is then denormalized into the star
dimension table. All queries that require the main product hierarchy should be
directed to the star dimension; all other queries should use the snowflake.
For example, the most common hierarchy in the retail product dimension is
likely to be the main product hierarchy. That is then denormalized into the star
product table, in the same way as described in section 5.5.1. All other hierarchies are
accessed through the snowflake.
This technique of finding the common route through hierarchies is effective
when the route is unlikely to change in the future. Or, put another way, this
technique is effective as long as the query profiles don't change to a point where a
different hierarchy becomes the most popular one, or the hierarchy itself is unstable.
If there exists a strong possibility that this may occur, it may be more cost-effective
not to denormalize any of the hierarchies into the star dimension. This is because the
cost of modifying any canned queries may be substantial if the first denormalized
hierarchy is removed.
A good compromise is that you start off by denormalizing the most accessed
hierarchy, and accept that if the query profiles change in the future, a set of
columns representing the new hierarchy are added to the star dimension. As long
as existing columns are not replaced, there should not be any impact on existing
canned queries. The cost of updating tools that generate ad hoc queries should be
minimal, as only their metadata definitions would need to be updated.

GUIDELINE 5.15 If data in a dimension is networked, denormalize the most
commonly accessed hierarchy into the star dimension. If the query profile changes
in the future, add columns that denormalize the new (commonly accessed)
hierarchy.
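
The compromise of adding columns rather than replacing them can be sketched as below. This is an illustration only, using Python's sqlite3 module: the supplier hierarchy, the product_supplier table, the supplier_group column, and the sample data are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star dimension holding the main product hierarchy.
    CREATE TABLE product_dimension (sku INTEGER PRIMARY KEY, department TEXT,
                                    section TEXT, business_unit TEXT);
    INSERT INTO product_dimension VALUES (1001, 'Menswear', 'Clothing', 'Fashion');
    INSERT INTO product_dimension VALUES (2001, 'Canned foods', 'Grocery', 'Food');

    -- Hypothetical snowflake table holding a second hierarchy (by supplier).
    CREATE TABLE product_supplier (sku INTEGER PRIMARY KEY, supplier_group TEXT);
    INSERT INTO product_supplier VALUES (1001, 'Supplier A');
    INSERT INTO product_supplier VALUES (2001, 'Supplier B');
""")

# An existing canned query written against the original columns.
CANNED_QUERY = "SELECT sku FROM product_dimension WHERE department = 'Menswear'"
before = conn.execute(CANNED_QUERY).fetchall()

# The query profile changes: denormalize the supplier hierarchy as a NEW
# column, leaving every existing column in place.
conn.executescript("""
    ALTER TABLE product_dimension ADD COLUMN supplier_group TEXT;
    UPDATE product_dimension
    SET supplier_group = (SELECT ps.supplier_group
                          FROM product_supplier ps
                          WHERE ps.sku = product_dimension.sku);
""")

after = conn.execute(CANNED_QUERY).fetchall()
by_supplier = conn.execute(
    "SELECT sku FROM product_dimension WHERE supplier_group = 'Supplier B'"
).fetchall()
```

The canned query returns the same rows before and after the change, while queries on the new hierarchy no longer need to join to the snowflake table.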

5.5.3 DIMENSIONS THAT VARY OVER TIME

In many cases, some (if not all) of the dimensions will vary over time. This is
particularly true for dimensions that use hierarchies or networks to group basic
concepts, because the business will probably change the way in which it categorizes
the dimension over a period of time.
For example, in the retail sector, the product dimension typically contains a
hierarchy used to categorize products into departments, sections, business units, and so
on. As the business changes, it is standard practice to re-categorize these products. Tee
shirts could move from menswear into unisex; baked beans could move from canned
foods to canned vegetables.
Depending on the business requirement, it may be necessary to support queries
that compare facts within a grouping that exists at present, with the grouping that
existed at a point of time in the past. These queries tend to be used to understand
whether various business policies have been successful (referred to as "as is, as was"
queries).
For example, take a case where a department store manager is investigating
whether a policy of upgrading the quality of products in the menswear department
was successful. In order to address that query, we must examine the revenue and
profit generated by the menswear department to date, compared with a year ago.
In order to determine the revenue achieved by the menswear department, we
have to sum the sales of all products in the department today compared with a year
ago. Because the list of products in menswear will be different today compared with
a year ago, two queries have to be executed, where each one looks for a different set
of products comprising menswear.


If this requirement exists, it is necessary to store date ranges on the dimension
table, which represent the dates in which the values within that row were valid. In
other words, if there is a change to any of the values within the dimension, a new row
should be inserted, rather than updating the values of the old row (the old row is
updated in order to complete the date range). Adding a date range to the star
dimension table is an effective way of supporting this requirement.
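
The insert-a-new-row rule can be sketched as a small helper. This is a minimal illustration using Python's sqlite3 module, under several assumptions that are not in the text: dates are stored as ISO strings, the row that is still current carries a sentinel end date of 9999-12-31, and the helper name recategorize and the single department attribute are hypothetical.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel meaning "this row is still current"


def recategorize(conn, product_id, new_department, change_date):
    """Record a change by closing the current row and inserting a new one."""
    # The old row is updated only to complete its date range.
    conn.execute(
        "UPDATE product_dimension SET end_date = ? "
        "WHERE id = ? AND end_date = ?",
        (change_date, product_id, OPEN_END),
    )
    # The changed values go into a new row that is valid from the change date.
    conn.execute(
        "INSERT INTO product_dimension (id, department, start_date, end_date) "
        "VALUES (?, ?, ?, ?)",
        (product_id, new_department, change_date, OPEN_END),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_dimension ("
    " id INTEGER, department TEXT, start_date TEXT, end_date TEXT,"
    " PRIMARY KEY (id, start_date))"
)
conn.execute(
    "INSERT INTO product_dimension VALUES (7, 'Menswear', '1995-01-01', ?)",
    (OPEN_END,),
)

# Tee shirts (product 7) move from menswear into unisex.
recategorize(conn, 7, "Unisex", "1997-03-01")

history = conn.execute(
    "SELECT department, start_date, end_date FROM product_dimension "
    "WHERE id = 7 ORDER BY start_date"
).fetchall()
```

After the call, the dimension holds both the old grouping and the new one, each with the period in which it was valid.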

In the menswear example, the basic query structure would be:

select sum(s.revenue_achieved)
from sales_year_to_date s, product_dimension pd
where s.product_id = pd.id
and pd.department = 'Menswear'
and pd.end_date > '1-Jan-96'
and pd.start_date < '31-Dec-96';


This query will extract all transactions for all products that were in the menswear
department over any dates in 1996. The result of this query can then be compared
against the result from a similar query that examines dates over 1997.
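
The pair of year-on-year queries can be tried out on a toy data set. The sketch below uses Python's sqlite3 module and keeps the query shape from the text; the sample rows, the ISO date strings (which compare correctly as text), the 9999-12-31 sentinel for rows that are still current, and the helper name menswear_revenue are assumptions made for illustration. As in the text, only the dimension rows are constrained by date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dimension (id INTEGER, department TEXT,
                                    start_date TEXT, end_date TEXT);
    -- Product 7 was in Menswear until March 1997, then moved to Unisex.
    INSERT INTO product_dimension VALUES (7, 'Menswear', '1995-01-01', '1997-03-01');
    INSERT INTO product_dimension VALUES (7, 'Unisex',   '1997-03-01', '9999-12-31');
    -- Product 8 joined Menswear in June 1997.
    INSERT INTO product_dimension VALUES (8, 'Menswear', '1997-06-01', '9999-12-31');

    CREATE TABLE sales_year_to_date (product_id INTEGER, revenue_achieved REAL);
    INSERT INTO sales_year_to_date VALUES (7, 100.0);
    INSERT INTO sales_year_to_date VALUES (8, 50.0);
""")

QUERY = """
    select sum(s.revenue_achieved)
    from sales_year_to_date s, product_dimension pd
    where s.product_id = pd.id
      and pd.department = 'Menswear'
      and pd.end_date > ?
      and pd.start_date < ?
"""


def menswear_revenue(first_day, last_day):
    """Sum revenue for products whose Menswear row overlaps the period."""
    return conn.execute(QUERY, (first_day, last_day)).fetchone()[0]


revenue_1996 = menswear_revenue("1996-01-01", "1996-12-31")  # product 7 only
revenue_1997 = menswear_revenue("1997-01-01", "1997-12-31")  # products 7 and 8
```

Each call selects a different set of products for the same department name, which is why two queries are needed.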
A minor variation on the theme is when the query needs to compare a
significant event in the corporate calendar, year on year: for example, if the previous
query were constrained to determine profitability in the menswear department
during the summer sales. The dates for the summer sales event will vary year on year,
so the start and end dates to be applied to the queries will vary. In these situations,
we recommend that you store the date range for each event within the time
dimension.
The previous query would change to:

select sum(s.revenue_achieved)
from sales_year_to_date s, product_dimension pd,
time_dimension td
where s.product_id = pd.id
and pd.department = 'Menswear'
and pd.end_date > td.start_date
and pd.start_date < td.end_date
and td.event_name = 'Summer Sales 1996';
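
A small runnable version of the event-based query is sketched below with Python's sqlite3 module. The event dates, the sample products and revenues, the ISO date strings, and the helper name event_revenue are all invented for illustration; only the table and column names used in the query come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per named event; the dates differ from year to year.
    CREATE TABLE time_dimension (event_name TEXT PRIMARY KEY,
                                 start_date TEXT, end_date TEXT);
    INSERT INTO time_dimension VALUES ('Summer Sales 1996', '1996-07-01', '1996-07-21');
    INSERT INTO time_dimension VALUES ('Summer Sales 1997', '1997-06-23', '1997-07-13');

    CREATE TABLE product_dimension (id INTEGER, department TEXT,
                                    start_date TEXT, end_date TEXT);
    INSERT INTO product_dimension VALUES (7, 'Menswear', '1995-01-01', '1997-03-01');
    INSERT INTO product_dimension VALUES (7, 'Unisex',   '1997-03-01', '9999-12-31');
    INSERT INTO product_dimension VALUES (8, 'Menswear', '1996-01-01', '9999-12-31');

    CREATE TABLE sales_year_to_date (product_id INTEGER, revenue_achieved REAL);
    INSERT INTO sales_year_to_date VALUES (7, 100.0);
    INSERT INTO sales_year_to_date VALUES (8, 50.0);
""")

EVENT_QUERY = """
    select sum(s.revenue_achieved)
    from sales_year_to_date s, product_dimension pd, time_dimension td
    where s.product_id = pd.id
      and pd.department = 'Menswear'
      and pd.end_date > td.start_date
      and pd.start_date < td.end_date
      and td.event_name = ?
"""


def event_revenue(event_name):
    """Sum revenue for products that were in Menswear during the event."""
    return conn.execute(EVENT_QUERY, (event_name,)).fetchone()[0]


summer_1996 = event_revenue("Summer Sales 1996")  # products 7 and 8
summer_1997 = event_revenue("Summer Sales 1997")  # product 8 only
```

The query text stays the same from year to year; only the event name changes, because the varying dates live in the time dimension.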

[Schema model: Day (date) grouped into Week, Six-week period, Month, and Year,
and into arbitrary event groupings such as Easter and Summer sales.]

Figure 5.15 Schema model of time dimension with arbitrary groupings.


If the query requires more unusual dates (for example, queries against "the last
Friday in every month"), we suggest it is addressed by joining the product dimension
to a time dimension table that contains a row for every day of the year (for each year
that is covered by the data warehouse).
Any arbitrary grouping of dates can then be supported by creating foreign keys
against the set of days required. Avoid using the in statement in SQL, because it can
substantially affect query performance (the in statement may cause the database to
scan the table being joined, once for every value in the statement).
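
A day-per-row time dimension with an arbitrary grouping can be sketched as follows, using Python's sqlite3 and datetime modules. The table names day_group and daily_sales, the group name, and the flat revenue of 1.0 per day are hypothetical; for brevity the join is shown against a made-up daily fact table, and the point is only that the grouping is reached through a join on the day key rather than an in list.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row for every day covered by the warehouse.
    CREATE TABLE time_dimension (day TEXT PRIMARY KEY);
    -- Arbitrary groupings: each row is a foreign key to one day.
    CREATE TABLE day_group (group_name TEXT,
                            day TEXT REFERENCES time_dimension(day));
    -- Hypothetical daily fact table used only to show the join.
    CREATE TABLE daily_sales (day TEXT, revenue REAL);
""")

day = date(1996, 1, 1)
while day.year == 1996:
    iso = day.isoformat()
    conn.execute("INSERT INTO time_dimension VALUES (?)", (iso,))
    conn.execute("INSERT INTO daily_sales VALUES (?, 1.0)", (iso,))
    # A Friday is the last one in its month if the next Friday falls in
    # a different month.
    if day.weekday() == 4 and (day + timedelta(days=7)).month != day.month:
        conn.execute(
            "INSERT INTO day_group VALUES ('last Friday of month', ?)", (iso,)
        )
    day += timedelta(days=1)

last_fridays = [
    row[0]
    for row in conn.execute(
        "SELECT t.day FROM time_dimension t "
        "JOIN day_group g ON g.day = t.day "
        "WHERE g.group_name = 'last Friday of month' ORDER BY t.day"
    )
]

# The grouping is applied with a join, not with an in list of dates.
revenue = conn.execute(
    "SELECT SUM(s.revenue) FROM daily_sales s "
    "JOIN day_group g ON g.day = s.day "
    "WHERE g.group_name = 'last Friday of month'"
).fetchone()[0]
```

New groupings are added by inserting more rows into the grouping table, without changing the time dimension itself.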

5.5.4 MANAGING LARGE DIMENSION TABLES

A large dimension table, particularly one that stores values as they change over
time, may grow to a size where the table becomes too large to be treated as a
dimension table within queries. This point is reached when the dimension table
reaches a size similar to a partition of the fact table. Another indication to
watch out for is when a full-table scan of the dimension table starts taking an
appreciable amount of time.

GUIDELINE 5.16 If a dimension table grows to a size similar to a fact table
partition, or scanning the dimension table absorbs a significant percentage of the
available query time, consider partitioning the table horizontally and creating a
combinatory view.
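
One possible reading of horizontal partitioning with a combinatory view is sketched below using Python's sqlite3 module. Splitting the rows into a current table and a history table is an assumption made for this example, as are the table names and sample rows; the text does not say which partitioning key to use. The combinatory view is interpreted here as a view that unions the partitions back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Horizontal partitions: same columns, different subsets of rows.
    CREATE TABLE product_dimension_current (id INTEGER, department TEXT,
                                            start_date TEXT, end_date TEXT);
    CREATE TABLE product_dimension_history (id INTEGER, department TEXT,
                                            start_date TEXT, end_date TEXT);

    INSERT INTO product_dimension_current VALUES (7, 'Unisex',   '1997-03-01', '9999-12-31');
    INSERT INTO product_dimension_current VALUES (8, 'Menswear', '1997-06-01', '9999-12-31');
    INSERT INTO product_dimension_history VALUES (7, 'Menswear', '1995-01-01', '1997-03-01');

    -- View that presents the partitions as a single dimension table.
    CREATE VIEW product_dimension AS
        SELECT id, department, start_date, end_date FROM product_dimension_current
        UNION ALL
        SELECT id, department, start_date, end_date FROM product_dimension_history;
""")

# Queries about the present grouping scan only the small current partition.
current_rows = conn.execute(
    "SELECT COUNT(*) FROM product_dimension_current"
).fetchone()[0]

# Queries that need past groupings go through the view.
all_rows = conn.execute("SELECT COUNT(*) FROM product_dimension").fetchone()[0]
menswear_1996 = conn.execute(
    "SELECT id FROM product_dimension "
    "WHERE department = 'Menswear' "
    "AND end_date > '1996-01-01' AND start_date < '1996-12-31'"
).fetchall()
```

Existing queries written against the product_dimension name keep working through the view, while queries that only need current values can be pointed at the smaller partition.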
