Lecture 1 Notes

The goal of a data warehouse is to make querying fast and easy. To do so, we will denormalize the data
and create a star schema. This is the essence of Data Warehousing.

Dimension Tables

The number of rows in a Dimension table is orders of magnitude smaller than the number of rows in a
fact table. We can afford to have ‘extra’ columns (descriptions, hierarchies, etc.) without needing to
worry about the impact on storage.

Dimensions normally contain hierarchies (e.g. city, county, state, country). When a structure is
denormalized, you will have many instances of the same value in a table.

Assume in the DIM_CITY table:

• 10 cities in each county
• 25 counties in each state
• 50 states in each country
• 1 country

• The total number of rows in the table would be 12,500 (10 * 25 * 50 * 1)
• Number of rows with the same city = 1
• Number of rows with the same county = 10
• Number of rows with the same state = 250
• Number of rows with the same country = 12,500

You can see there is a lot of duplicate data in the columns, but adding all these attributes to the
dimension allows us to query the data by city, county, state or country without the need to do joins.
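
The row counts above can be checked with a short sketch. It builds a denormalized DIM_CITY in memory using the sizes from the example; the generated names (country_0, state_3, and so on) and the column names are placeholders invented for illustration, not part of the lecture.

```python
from collections import Counter
from itertools import product

# Sizes taken from the DIM_CITY example; all generated names are made up.
CITIES_PER_COUNTY = 10
COUNTIES_PER_STATE = 25
STATES_PER_COUNTRY = 50
COUNTRIES = 1


def build_dim_city():
    """Return one denormalized row per city, carrying its full hierarchy."""
    rows = []
    for n, s, c, t in product(
        range(COUNTRIES),
        range(STATES_PER_COUNTRY),
        range(COUNTIES_PER_STATE),
        range(CITIES_PER_COUNTY),
    ):
        country = f"country_{n}"
        state = f"{country}/state_{s}"
        county = f"{state}/county_{c}"
        city = f"{county}/city_{t}"
        rows.append(
            {"city": city, "county": county, "state": state, "country": country}
        )
    return rows


dim_city = build_dim_city()
total_rows = len(dim_city)

# For each level, count how many rows repeat the same value.
rows_per_value = {
    level: max(Counter(row[level] for row in dim_city).values())
    for level in ("city", "county", "state", "country")
}

# Filtering by state needs no join: the state is already on every city row.
cities_in_one_state = sum(
    1 for row in dim_city if row["state"] == "country_0/state_3"
)

print(total_rows, rows_per_value, cities_in_one_state)
```

The state filter at the end reads a single table, which is the point of carrying the whole hierarchy on every row.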

You should add any Dimension columns you think you might need.

For State you might add state_abbr (PA), state_short_name (PENNA), state_full_name (PENNSYLVANIA),
state_name_mixed_case (Pennsylvania).

You can see how adding all of these when loading the dimension makes it easier to handle the different
formats that reports require.

The design is all about getting data out easily. We trade off increases in load complexity (done once per
row) for the efficiency and simplicity of reporting (done frequently).

Dimension tables can have inserts and updates. They do not have deletes.

Loading Dimensions

Dimensions are frequently loaded from a different source than the fact data. The DIM_STORE table
might come from a store management system, while the DIM_PRODUCT might come from the
purchasing system. Each dimension would have a different load strategy and perhaps a different load
frequency.

Date Dimensions are usually not loaded from an operational system. You can load DIM_DATE once a
year with a CSV file created from an Excel spreadsheet. This table should contain any formatting of
dates that may be used in reporting (1/1/99, Jan 1, 1999, January 1, 1999) along with things like day of
week, day of month, day of year, fiscal month, fiscal quarter and all the formats of day (M, MON, Mon,
MONDAY, Monday). Only 365 rows (366 in a leap year) are added each year, so the impact of long rows
is minimal. The benefit of reporting without having to add formatting logic is significant.
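
As an alternative to a spreadsheet, the yearly DIM_DATE file could be generated with a script. The sketch below is one possible approach, not the method used in the lecture; the column names are invented for illustration, and the fiscal columns are left out because they depend on a fiscal calendar that these notes do not define.

```python
import csv
import io
from datetime import date, timedelta


def build_dim_date(year):
    """Return one row per calendar day of `year` with pre-formatted columns."""
    rows = []
    day = date(year, 1, 1)
    while day.year == year:
        rows.append(
            {
                # Column names are hypothetical. Month and day names come from
                # strftime and are English under Python's default "C" locale.
                "date_key": f"{day:%Y%m%d}",
                "date_short": f"{day.month}/{day.day}/{day:%y}",
                "date_medium": f"{day:%b} {day.day}, {day.year}",
                "date_long": f"{day:%B} {day.day}, {day.year}",
                "day_of_week": day.isoweekday(),  # Monday = 1 ... Sunday = 7
                "day_of_month": day.day,
                "day_of_year": day.timetuple().tm_yday,
                "day_abbr_upper": f"{day:%a}".upper(),
                "day_abbr_mixed": f"{day:%a}",
                "day_name_upper": f"{day:%A}".upper(),
                "day_name_mixed": f"{day:%A}",
            }
        )
        day += timedelta(days=1)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dim_date_1999 = build_dim_date(1999)
csv_text = to_csv(dim_date_1999)
print(dim_date_1999[0])
```

Because every format is stored as its own column, a report can select the column it needs instead of formatting dates at query time.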

Fact Tables

Fact tables contain the foreign keys to the dimension tables and the measures. Measures are what we
want to report on (number sold, net sales, etc.). Measures should always be numeric. If you have a
Yes/No value, convert it to 1/0.

If a measure can be aggregated (summed, for example) in a meaningful way across all dimensions, it is
ADDITIVE.

If a measure can only be aggregated in a meaningful way across some dimensions, it is SEMI-ADDITIVE.
Semi-additive measures should be used with caution since users could incorrectly aggregate them across
a non-additive dimension.

If a measure cannot be aggregated in a meaningful way across any dimension, it is NON-ADDITIVE and
should not be included in the fact table.
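
The difference between additive and semi-additive measures is easier to see with numbers. The sketch below uses a made-up daily snapshot with two measures; the table layout, the store names and the choice of an on-hand inventory balance as the semi-additive measure are assumptions for illustration and do not come from the lecture.

```python
# Hypothetical fact rows: (date, store, units_sold, units_on_hand)
facts = [
    ("2024-01-01", "A", 5, 100),
    ("2024-01-01", "B", 3, 50),
    ("2024-01-02", "A", 7, 93),
    ("2024-01-02", "B", 2, 48),
]

# units_sold is additive: a sum over any mix of dates and stores is meaningful.
total_units_sold = sum(sold for _, _, sold, _ in facts)

# units_on_hand is semi-additive: summing across stores for ONE date gives
# the stock held on that date, which is a meaningful number.
on_hand_jan_2 = sum(
    on_hand for day, _, _, on_hand in facts if day == "2024-01-02"
)

# Summing units_on_hand across dates counts the same stock twice.
# Store A never held this many units at any point in time.
misleading_store_a = sum(
    on_hand for _, store, _, on_hand in facts if store == "A"
)

print(total_units_sold, on_hand_jan_2, misleading_store_a)
```

The last figure is the kind of result a user can produce by accident, which is why semi-additive measures need caution.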

Fact tables are large. Adding metrics for future use (at create time) is not costly. It is hard, if it is
even possible, to add columns later, so plan before you build. Data is inserted into fact tables. There is
no update or delete of fact data.

The goal is to capture the fact data at the lowest granularity (GRAIN) possible. You can always roll up to
a higher grain, but you cannot unroll to a lower grain. You have to determine the right grain. For a store
sales transaction, you would not want to create a row for each of the dozen containers of yogurt that
went over the scanner one at a time; you want to consolidate that to one row of yogurt with a
quantity of 12.
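
The yogurt example can be sketched as a small consolidation step at load time. The transaction id, the SKU strings and the extra milk item are hypothetical values added for illustration.

```python
from collections import Counter

# Hypothetical scanner events for one transaction: (transaction_id, sku).
# The twelve yogurt containers are scanned one at a time.
scans = [("T1", "YOGURT")] * 12 + [("T1", "MILK")]

# Consolidate to the chosen grain: one row per product per transaction.
quantities = Counter(scans)
fact_rows = [
    {"transaction_id": txn, "sku": sku, "quantity": qty}
    for (txn, sku), qty in sorted(quantities.items())
]

# Rolling up to a higher grain (units per transaction) is still possible.
units_per_transaction = Counter()
for row in fact_rows:
    units_per_transaction[row["transaction_id"]] += row["quantity"]

print(fact_rows)
print(dict(units_per_transaction))
```

Thirteen scan events become two fact rows, and the total per transaction can still be derived from them.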

Keys

Each Dimension must have a natural key. The natural key can be a composite key. The natural key is
used to link source data together.

Dimension tables should always have surrogate keys. These are normally sequential numbers created
when loading the dimension row. Surrogate keys are primary keys and unique.

Surrogate keys are used to buffer the fact table from business-related changes to the natural key. If we
change the SKU (natural key) on the product, the product information does not change, and so we
should be able to report across both old and new SKUs. Without a surrogate key, the Fact table would
have to have its keys updated. This would be a very costly and potentially risky exercise.

Surrogate keys solve this problem.
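
A minimal sketch of the SKU change, using Python's built-in sqlite3 module with an in-memory database. The table names, column names and SKU values are hypothetical, and updating the SKU in place is only one way a dimension load might handle the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        sku          TEXT NOT NULL,                      -- natural key
        product_name TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product (product_key),
        quantity    INTEGER NOT NULL
    );
    """
)

# Load the dimension row; the surrogate key is generated at load time.
cursor = conn.execute(
    "INSERT INTO dim_product (sku, product_name) VALUES (?, ?)",
    ("SKU-OLD", "Yogurt"),
)
product_key = cursor.lastrowid

# Facts recorded under the old SKU reference only the surrogate key.
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (product_key, 12))

# The business changes the SKU: one dimension row is updated and no fact
# row is touched.
conn.execute(
    "UPDATE dim_product SET sku = ? WHERE product_key = ?",
    ("SKU-NEW", product_key),
)

# Facts recorded after the change use the same surrogate key.
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (product_key, 6))

# Reporting spans sales made under both the old and the new SKU.
sales_by_product = conn.execute(
    """
    SELECT p.product_name, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    """
).fetchall()

print(sales_by_product)
```

Had fact_sales stored the SKU itself, the change would have required rewriting every affected fact row.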
