Download as pdf or txt
Download as pdf or txt
You are on page 1of 48

The Digital Firm

Data-Driven Decision-making
Dimensional Modeling

1 GEST S492 The Digital Firm

Data Organization in DWH: Star Schema


DIMENSIONAL MODELING

2 GEST S492 The Digital Firm

Example of Star Schema


time

item

time_key
day
day_of_the_week
month
quarter
year

Sales Fact Table


time_key
item_key
branch_key

branch
branch_key
branch_name
branch_type

location_key
units_sold
dollars_sold
avg_sales

Measures
3 GEST S492 The Digital Firm

item_key
item_name
brand
type
supplier_type

location
location_key
street
city
province_or_street
country

Conceptual Modeling of DWH


Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set
of dimension tables
Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
4 GEST S492 The Digital Firm

Example of Snowflake Schema


time
time_key
day
day_of_the_week
month
quarter
year

item
Sales Fact Table
time_key
item_key
branch_key

branch

location_key

branch_key
branch_name
branch_type

units_sold
dollars_sold
avg_sales

Measures
5 GEST S492 The Digital Firm

item_key
item_name
brand
type
supplier_key

supplier

supplier_key
supplier_type

location
location_key
street
city_key

city
city_key
city
province_or_street
country

Example of Fact Constellation


time
time_key
day
day_of_the_week
month
quarter
year

item
Sales Fact Table
time_key
item_key

item_key
item_name
brand
type
supplier_type

location_key

branch_key
branch_name
branch_type

units_sold
dollars_sold
avg_sales

Measures
6 GEST S492 The Digital Firm

time_key
item_key
shipper_key
from_location

branch_key
branch

Shipping Fact Table

location

to_location

location_key
street
city
province_or_street
country

dollars_cost
units_shipped
shipper
shipper_key
shipper_name
location_key
shipper_type

Star Schema
Fact tables contain
factual or quantitative
data

Dimension tables are


denormalized to
maximize
performance

1:N relationship
between dimension
tables and fact tables

Dimension tables contain


descriptions about the
subjects of the business
7 GEST S492 The Digital Firm

An example
Fact table provides statistics for sales
broken down by product, period and store
dimensions

8 GEST S492 The Digital Firm

An example with data

9 GEST S492 The Digital Firm

Modeling dates

Fact tables contain time-period data


Date dimensions are important
10 GEST S492 The Digital Firm

Issues Regarding Star Schema


Dimension table keys must be surrogate (nonintelligent and non-business related), because:
Keys will change over time when history on dimensions is
built
Length/format consistency

Granularity of Fact Table what level of detail do


you want?
Transactional grain finest level
Aggregated grain more summarized
Finer grain better market basket analysis capability
Finer grain more dimension tables, more rows in fact table
11 GEST S492 The Digital Firm

Star Schema: SQL for OLAP implementation


DIMENSIONAL MODELING

12 GEST S492 The Digital Firm

STAR for Vehicule sales

13 GEST S492 The Digital Firm

How does it work?


Slice 1: Number of Trucks sold in Europe in 1997
SELECT sum(Vehicles_Sold)
FROM Location L, Product P, Time T, Fact F
WHERE L.Location_Key = F.Location_Key
AND P.Product_Key = F.Product_Key
AND T.Time_Key = F.Time_Key
AND L.Continent = Europe
AND P.Vehicle_Kind = Truck
AND T.Year =1997

Step 1

Joining necessary
dimensions & Fact
Using connected
dimensions to
constraint

T.Time_Key= (3,4)
P.Product_Key= (3,4)
L.Location_Key= (1,2)

Step 2

F.Location_Key in (1,2) and F.Product_Key in (3, 4)


and T.Time_Key in (3,4)
GEST S492 The Digital Firm

Step 3
14

L.Location_Key = F.Location_Key = (1,2)


P.Product_Key = F.Product_Key= (3, 4)
T.Time_Key = F.Time_Key = (3,4)

SQL on Facts
Slice 1
SELECT sum(Vehicles_Sold)
FROM Location L, Product P, Time T, Fact F
WHERE L.Location_Key = F.Location_Key
AND P.Product_Key = F.Product_Key
AND T.Time_Key = F.Time_Key
AND L.Continent = Europe
AND P.Vehicle_Kind = Truck
AND T.Year =1997

Joining necessary
dimensions & Fact
Using connected
dimensions to constraint

Slice 2: nbre of vehicles by country, size & year


SELECT L.Country, P.Vehicle_Size, T.Year, SUM(Vehicles_Sold)
FROM Location L, Product P, Time T, Fact F
WHERE L.Location_Key = F.Location_Key
AND P.Product_Key = F.Product_Key
AND T.Time_Key = F.Time_Key
AND L.Country = USA
GROUP BY L.Country, P.Vehicle_Size, T.Year

Roll-up and Drill-down are implemented by Group By and aggregate functions

15 GEST S492 The Digital Firm

SQL for OLAP operations


Analyzing the sales by Continents
SELECT L.Continent , SUM(F.Vehicles_Sold)
FROM Location L, Product P, Time T, Fact F
WHERE L.Location_Key = F.Location_Key
AND P.Product_Key = F.Product_Key
AND T.Time_Key = F.Time_Key

L.Continent
Drill down to countries
GROUP BY

SELECT L.Continent, L.Country, SUM(F.Vehicles_Sold)


FROM Location L, Product P, Time T, Fact F
WHERE L.Location_Key = F.Location_Key
AND P.Product_Key = F.Product_Key
AND T.Time_Key = F.Time_Key
GROUP BY

L.Continent, L.Country

16 GEST S492 The Digital Firm

An example
Star for Grocery stores
TVDM BI 6 Star Grocery.mdb
OLAP in Pivot Table Excel
TVDM BI 6 Star Grocery Pivot.xlsx

17 GEST S492 The Digital Firm

Building Star Schema


DIMENSIONAL MODELING

18 GEST S492 The Digital Firm

Dimensional Models
A denormalized relational model
Made up of tables with attributes
Relationships defined by keys and foreign keys
Organized for understandability and ease of reporting
rather than update
Queried and maintained by SQL or special purpose
management tools
Because the semantic structure is known, and the surrogate
key mechanism is standard, tools can be optimized for query
and indexing

19 GEST S492 The Digital Firm

Building a DWH from a Normalized Database


The steps
Analyse the normalized entity-relationship
identify business processes to model each separately
Translate this into a dimensional model
Select m-n relationships containing numeric and additive
facts: these will be the fact tables
Denormalize the remaining tables into flat tables with singlepart key that connects directly to fact tables: these are the
dimension tables
Translate this into the physical model. This reflects the
changes necessary to reach the performance objectives

20 GEST S492 The Digital Firm

Steps in dimensional modeling


Select an associative entity for a fact table (table containing additive
measures)
Determine granularity
Replace operational keys with surrogate keys
Promote the keys from all hierarchies to the fact table
Can be in attributes: Date/week/month; Product/Category/Brand
Can be in dimensions: Customer / Segment
Add date dimension
Split all compound attributes
Rue Auguste Rodin 45, 1050 Bruxelles, Belgique
Add necessary categorical attributes: Reference Tables !
DOB: baby, young, teenager, adult, older
Fact (varies with time) / Attribute (constant)

21 GEST S492 The Digital Firm

CUSTOMER
customer_ID (PK)
customer_name
purchase_profile
credit_profile
address

ERD
1

1
n
n

STORE
store_ID (PK)
store_name
address
district
floor_type

PRODUCT
SKU (PK)
description
brand
category

ORDER
order_num (PK)
customer_ID (FK)
store_ID (FK)
clerk_ID (FK)
date

n
1

ORDER-LINE
order_num (PK) (FK)
SKU (PK) (FK)
promotion_key (FK)
dollars_sold
units_sold
dollars_cost
n
1

CLERK
clerk_id (PK)
clerk_name
clerk_grade

SKU = Stock Keeping Unit


22 GEST S492 The Digital Firm

PROMOTION
promotion_NUM (PK)
promotion_name
price_type
ad_type

CUSTOMER
customer_ID (PK)
customer_name
purchase_profile
credit_profile
address

ERD
1

Where are the measures?


Where are the dimensions?
n

STORE
store_ID (PK)
store_name
address
district
floor_type
CLERK
clerk_id (PK)
clerk_name
clerk_grade

PRODUCT
SKU (PK)
description
brand
category

ORDER
order_num (PK)
customer_ID (FK)
store_ID (FK)
clerk_ID (FK)
date

n
1

Dimensions Measures
1

SKU = Stock Keeping Unit


23 GEST S492 The Digital Firm

This date will be used to link the time


dimension to the fact
What if this date is missing in OLT?

1
n

ORDER-LINE
order_num (PK) (FK)
SKU (PK) (FK)
promotion_key (FK)
dollars_sold
units_sold
dollars_cost
n
1

PROMOTION
promotion_NUM (PK)
promotion_name
price_type
ad_type

TIME
time_key (PK)
SQL_date
day_of_week
month
STORE
store_key (PK)
store_ID
store_name
address
district
floor_type
CLERK
clerk_key (PK)
clerk_id
clerk_name
clerk_grade

24 GEST S492 The Digital Firm

DIMENSONAL
MODEL
FACT
time_key (FK)
store_key (FK)
clerk_key (FK)
product_key (FK)
customer_key (FK)
promotion_key (FK)
dollars_sold
units_sold
dollars_cost

PRODUCT
product_key (PK)
SKU
description
brand
category
CUSTOMER
customer_key (PK)
customer_name
purchase_profile
credit_profile
address

PROMOTION
promotion_key (PK)
promotion_name
price_type
ad_type

The 5 Step Design Process

Choose the Data Mart


Choose the Grain
Choose the Dimensions
Choose the Facts
Define the Measures

25 GEST S492 The Digital Firm

Choosing the Mart


A specific set of business questions (belongs to a BL)
Finance & control, marketing, sales, procurement, etc.
A set of related fact and dimension tables
Costs, sales, number of contacts
Clients, Products, Campaign, Channel, etc.
Single source or multiple source
ERP, legacy, Excel, etc.
Conformed dimensions
Dimensions shared by all facts allow to join them
Typically have a fact table for each process
Campaign selection, Campaign tracking, etc.
26 GEST S492 The Digital Firm

Grain (unit of analysis)


The grain determines the level of detail of each fact record
For example
Individual transactions: sales transaction, product shipment
Snapshots (points in time): sales of the day
Generally it is better to focus on the smallest grain
Moving from low grain to more aggregate is possible, but
inverse is not
Retailer: sales ticket = transaction, Sales per item per client per week
will, for ever, make any basket analysis impossible

Levels can be used for going from high granularity to lowest


granularity Question: how to do this?
There is a price to pay: the finest grain will create the largest
fact tables, and hence aggregates will have to be computed
27 GEST S492 The Digital Firm

Dimensions
A table (or hierarchy of tables) connected with the fact table with keys
and foreign keys
Dimension tables contain text or numeric attributes that gives the
context of the measures available in a Fact Table
Time, client, status, shop, product, employee, etc.
Preferably in a 1-M relationship with Facts
1 context can have many measures: 1 client having many sales
If not a bridge table must be created
Dimensions attributes will be the source of query constraints
Select sum(Sales) from where Client.ZIP = 1150
Customer (region, status); Product (Type, Price_range); Time (Mondays, WE, 2010)

Connected with surrogate (generated) keys, not operational keys


Surrogate key will be used to store the history of the dimensions. They
are the central point of the whole dimensional architecture
28 GEST S492 The Digital Firm

Fact Tables
Represent a process or reporting environment interesting
for some business users
Contains many foreign keys connecting to Dimensions
It is important to determine the identity of the fact table and
specify exactly what it represents
Sales in shops, direct-mail campaigns, shipment, production of item x
(family)

Typically correspond to an associative entity in the ER


model
Tend to have huge numbers of records
The max number of records is the combination of all records of each
attribute (Cartesian product)

Useful facts tend to be numeric and additive


Quantity ordered, Price paid, Number of responders
29 GEST S492 The Digital Firm

Measures
Measurements associated with fact records at the fact
table granularity
Normally numeric and additive
Sales, number of contacts, etc.
Semi-additive (stock or level measures)
Measures that are non additive on time but additive along
other dimensions: Inventory, balance
Non-additive (Value-per-unit measures)
Measure that are non additive at all: exchange rate
Attributes in dimension tables are constants. Facts attributes vary with
the granularity of the fact table: when aggregations are done, they
should smoothly aggregate as well.
30 GEST S492 The Digital Firm

Dimensions with m:n Fact relation


Use a Bridge Table
Problem for additive measures will be counted N times
Add a weighting factor to correct fact addition
Define a business rule to identify main record (winner-takes-all)
Bridge tables introduce
complexity in the query process
to take care of additive facts
user impact
Campaign
Campaign
Cpg_key Desc
1
hello world
2
Contest
3
Credit
4
loan it
5
Mbre2Mbre
6
Picnic
7
Card
8
Safe it
1 (one Cpg
can befound
once in Cpg
table)

Launch_date
1/01/2009
15/01/2009
17/01/2009
1/02/2009
19/02/2009
30/02/2009
1/03/2009
12/03/2009

31 GEST S492 The Digital Firm

Group_Campaign
GRPCpg_keyCpg_key
1
1
1
3
1
5
2
1
2
3
2
7
2
8

Main
1
0
0
0
1
0
0

n (one Cpg
can be n
times in
Group_Cpg
table)

Client

Group_
Campaign
- Weight
- Main

Credit FACT
- Amount
- Rate

The choice of the weight and


groups is done at the loading of
the fact

Day

Good Dimension Attributes


Verbose
Gives a rich context to the measure
Descriptive
User crystal clear naming and semantics
Complete
All modalities must be represented ooops Risk A is not loaded
All attributes within a dimension must describe completely the object
represented by the dimension
Quality assured
No trash value like 99 for age

32 GEST S492 The Digital Firm

Good Dimension Attributes


Indexed (b-tree vs bitmap)
Fast for SQL where clauses
Equally available
Attribute must be available for most of the objects it represents age is
available for only 1% of our clients
Documented
From a business point: semantic and point in time (risk profile attribute
of clients are updated every night)
From a technical point: Risk profile is loaded from Table X with process
Y, etc.

33 GEST S492 The Digital Firm

Hierarchies in Star Schema


DIMENSIONAL MODELING

34 GEST S492 The Digital Firm

Hierarchies
Hierarchies are the key relationship for roll-up and drilldown in OLAP environment
Sales per Day Week Month Quarter
Sales for cheese milk_product food
Hierarchies can be represented in 2 ways in dimensions
Explicit Attribute sets:
Dim.Time: date; DayofWeek; Week; Month; Quarter;
Dim.product: Product (Cheese); Family (Milk_product); Category
(Food)

Snowflake:
Create a table per level and link tables by 1:N relationships

35 GEST S492 The Digital Firm

Snowflaking & Attribute Hierarchies


Efficiency vs Space
Snowflaking normalizes the attributes
within a hierarchy and hence reduces
redundancy in a dimension table
For each combination of attributes
involved in the hierarchy, a dimension will
have one additional record

Understandability
The user must join tables to get the
right level of aggregation

36 GEST S492 The Digital Firm

Product.level3
Food
Product.level2
milk
Product.level1
cheese
Yougourt
milk

Product.level1
cheese
Yougourt
milk

Product.level2
milk
milk
milk

Product.level3
Food
Food
Food

Building the contextual history


DIMENSIONAL MODELING

37 GEST S492 The Digital Firm

Slowly Changing Dimensions


Type 1: Store only the current value
No history is created: all facts are like the value was always
like the last one
Client
Key
1

Name
Kasparov

zip
1150

Date_in Date_close
1/01/2010

Type 1
Client
Key
1

Name
Kasparov

zip
1950

Date_in Date_close
1/01/2010

Sales
client_key week_key Sale
1
1
10

week
Key
1
2
3

Week Month
1
1
2
1
3
1

Year
2010
2010
2010

Sales
client_key week_key Sale
1
1
10
1
2
8

week
Key
1
2
3

Week Month
1
1
2
1
3
1

Year
2010
2010
2010

Select week.year, Client.zip, Sum(Sales.sale) as Sales


From Sales, Client, Week
Where Sales.client_key=Client.key and
Sales.week_key= week.key and
week.year=2010
Group by week.year, Client.zip
38 GEST S492 The Digital Firm

Year
2010

Zip
1950

Sales
18

Slowly Changing Dimensions


Type 2: Create a new dimension record for each new value
of the attribute
This will create a new surrogate key to be used as from this moment in all
new records of the Fact table, creating the historical context of the measures
Client
Key
1

Name
Kasparov

zip
1150

Date_in Date_close
1/01/2010

Sales
client_key week_key Sale
1
1
10

week
Key
1
2
3

Week Month
1
1
2
1
3
1

Year
2010
2010
2010

Type 2
Client
Key
1
2

Name
Kasparov
Kasparov

zip
1150
1950

Date_in Date_close
1/01/2010 7/01/2010
8/01/2010

Sales
client_key week_key Sale
1
1
10
2
2
8

week
Key
1
2
3

Week Month
1
1
2
1
3
1

Year
2010
2010
2010

Select week.year, Client.zip, Sum(Sales.sale) as Sales


From Sales, Client, Week
Where Sales.client_key=Client.key and
Sales.week_key= week.key and
week.year=2010
Group by week.year, Client.zip
39 GEST S492 The Digital Firm

Year
2010
2010

Zip
1150
1950

Sales
10
8

Slowly Changing Dimensions


Type 3: Create an attribute in the dimension record for
previous value
This will not build a history at the level of the facts, but only in dimension
Client
Key
1

Type 3
Client
Key
1

Name
Kasparov

Name
Kasparov

zip
1950

zip
1150

Date_in Date_close
1/01/2010

zip_prev Date_in Date_close


1150 1/01/2010

Sales
client_key week_key Sale
1
1
10

week
Key
1
2
3

Week Month
1
1
2
1
3
1

Year
2010
2010
2010

Sales
client_key week_key Sale
1
1
10
1
2
8

week
Key
1
2
3

Week Month
1
1
2
1
3
1

Year
2010
2010
2010

Select week.year, Client.zip, Sum(Sales.sale) as Sales


From Sales, Client, Week
Where Sales.client_key=Client.key and
Sales.week_key= week.key and
week.year=2010
Group by week.year, Client.zip
40 GEST S492 The Digital Firm

Year
2010

Zip
1950

Sales
18

Dimensional Model Strengths & Weaknesses


DIMENSIONAL MODELING

41 GEST S492 The Digital Firm

Strengths of Dimensional Modeling


Predictable and understandable, standard framework
The semantic of table links is not hidden from the user as it is the case for
relational models
Users can understand it easily

Respond well to changes in user reporting needs


Most of the reports will be created through aggregation of low level facts. Hence
any reports given that the model uses conformed dimensions can be build on
top

Relatively easy to add data without reloading tables


If a new measure must be added to a fact (credit line level on an account), given
that the historical data is available, adding this measure in existing fact will be
easy and reveal all historical context as well
Adding new attributes to an existing dimension might be more difficult if the
added attribute history must be synchronized with the dimension history
surrogate key conflict
Adding an attribute adds a new granularity and hence some records must be split
(product family is the actual level and we want to have product shelf location)
42 GEST S492 The Digital Firm

Strengths of Dimensional Modeling


Standard design approaches have been developed
Allowing to quickly designing facts and dimensions, even in complex
environments. The framework might be understood by all participants
including the final users through the conceptual star description
There exist a number of products supporting the dimensional model
Because the semantic of the dimensional modeling is fixed, software can
leverage this knowledge and propose tools that provide automatic
mechanism for Type1, Type2 and Type3 updates, that accept explicit
description of hierarchies, etc. OLAP functionalities creation can be fully
automated from a constellation of fact tables. The programming and BDA
work is largely reduced.
Multidimensional models are more efficient for queries & Analysis
Fact tables have many records (millions), and dimensions will normally
be relatively small (product = 10.000, Shops = 500). Joining small tables
on large ones can be efficient with the right indexes.
43 GEST S492 The Digital Firm

Weaknesses of Dimensional Modeling


Some dimensions might be too large
Client with address 10% change/year
Large dimensions must be cut

A relation between two dimensional objects only exist


through their link in a fact

Sales

A client with no sales has no geo info attached

normalized and indexed relational models more flexible


Granularity choice can be avoided
Relationships among objects do not depend on some events
stored in facts
The cost is: lose all advantages of DM!
In practice, we often use a combination of both at low level
(high granularity) and build departmental and personal data
marts using dimensional models only.

44 GEST S492 The Digital Firm

Client
Key
Name
Status
street
n
ZIP
Region
Country
Type_suburb
Client
Key
Name
Status

Sales
Geo
Key
street
n
ZIP
Region
Country
Type_suburb

Weaknesses of Dimensional Modeling


In practice, we often use a combination of both at low level
(high granularity) and build departmental and personal
data marts using dimensional models only.
A link will be created between Client and Geo as soon as
client record is added in the dimension. The client record
will eventuelly closed even before it is used in the fact.
Client
Key
Name
Status
Sales
Geo
Key
street
n
ZIP
Region
Country
Type_suburb

45 GEST S492 The Digital Firm

Dimensional Model example


DIMENSIONAL MODELING

46 GEST S492 The Digital Firm

Campaign Datamart

Factless fact table


Is a table that creates
a relationship between
objects. No measure is
stored.

47 GEST S492 The Digital Firm

Tracking campaign effects

Non additive

48 GEST S492 The Digital Firm

You might also like