
Advanced MD Modeling and MD Database Implementation

Original slides were written by Torben Bach Pedersen

Overview
• Last lecture: basic multidimensional databases/modeling
• Advanced multidimensional modeling
  ▪ Mainly handling changes in dimensions
• Large-scale dimensional modeling
  ▪ Coordinating cubes/data marts
• Multidimensional database implementation
  ▪ MS SQL Server
  ▪ MS Analysis Services

Aalborg University 2007 - DWML course 2

Business Dimensional Lifecycle
[Figure: lifecycle diagram with the boxes Project Planning; Business Requirements Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; Data Staging Design & Development; End-User Application Specification; End-User Application Development; Deployment; Maintenance & Growth; Project Management]

Advanced MD modeling I - Overview
• Handling change over time
• Changes in dimensions
  ▪ No special handling
  ▪ Versioning dimension values
  ▪ Capturing the previous and the current value
  ▪ Timestamping
  ▪ Split into changing and constant attributes

Changing Dimensions
• So far, we assume that dimensions are stable over time
  ▪ Existing rows do not change
  ▪ New rows in dimension tables can be inserted
• “Slowly changing dimensions” phenomenon
  ▪ Dimension information change, but changes are not frequent
  ▪ Still assume that the schema is fixed
• Study techniques for handling changes in dimensions

Example
[Figure: star schema. Fact table: TimeID, StoreID, ProductID, …, ItemsSold, Amount. Time dimension: TimeID, Weekday, Week, Month, Quarter, Year, DayNo, Holiday. Store dimension: StoreID, Address, City, District, Size, SCategory. Product dimension: ProductID, Description, Brand, PCategory.]
• Attribute values in dimensions vary over time
  ▪ A store changes Size
  ▪ A product changes Description
  ▪ Districts are changed
• Problems
  ▪ Update dimensions → wrong information in historical data
  ▪ Don’t update dimensions → DW is not up-to-date
[Figure: timeline with a change; the periods before and after the change are marked “?”]

Solution 1: no special handling

Sales fact table              Store dimension table
StoreID  …  ItemsSold  …      StoreID  …  Size  …
001         2000              001         250

store size changes

StoreID  …  ItemsSold  …      StoreID  …  Size  …
001         2000              001         450

new fact arrives

StoreID  …  ItemsSold  …      StoreID  …  Size  …
001         2000              001         450
001         2500

Solution 1
• Solution 1: Overwrite the old values that change, in the dimension tables
• New facts are facts that are inserted after the dimension tables change
• Consequences
  ▪ New facts point to rows with correct information
  ▪ Old facts point to rows in the dimension tables with incorrect information!
• Pros
  ▪ Easy to implement
  ▪ Useful if the attribute is not significant or the (old) value should be updated (e.g., error correction)
• Cons
  ▪ Old facts may point to “incorrect” rows in dimensions

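Solution 1 can be sketched in a few lines. This is an illustration only: it uses Python's standard sqlite3 module in place of MS SQL Server so that it runs anywhere, and the table names store and sales and the function name solution1_demo are assumptions, not part of the slides.

```python
import sqlite3


def solution1_demo():
    """Replay the Store example with Solution 1 (overwrite in place)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE store (StoreID TEXT PRIMARY KEY, Size INTEGER);
        CREATE TABLE sales (StoreID TEXT REFERENCES store, ItemsSold INTEGER);
        INSERT INTO store VALUES ('001', 250);
        INSERT INTO sales VALUES ('001', 2000);  -- old fact, store had Size 250
    """)
    # The store size changes: the old value is simply overwritten.
    con.execute("UPDATE store SET Size = 450 WHERE StoreID = '001'")
    # A new fact arrives after the change.
    con.execute("INSERT INTO sales VALUES ('001', 2500)")
    rows = con.execute("""
        SELECT f.ItemsSold, d.Size
        FROM sales f JOIN store d ON d.StoreID = f.StoreID
        ORDER BY f.ItemsSold
    """).fetchall()
    con.close()
    return rows
```

Both facts are now reported with Size 450, although the first one was recorded while the store had Size 250.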
Solution 2: versioning of rows

StoreID  …  ItemsSold  …      StoreID  …  Size  …
001         2000              001         250

different versions of a store

StoreID  …  ItemsSold  …      StoreID  …  Size  …
001         2000              001         250
                              002         450

StoreID  …  ItemsSold  …      StoreID  …  Size  …
001         2000              001         250
002         2500              002         450

Solution 2
• Solution 2: Versioning of rows with changing attributes
  ▪ The key that links dimension and fact table identifies a version of a row, not just a “row”
  ▪ No need for changes if “surrogate” (“simple”, “non-information-bearing”) keys are used
• Consequences
  ▪ Larger dimension tables
• Pros
  ▪ Correct information captured in DW
  ▪ No problems when formulating queries
• Cons
  ▪ Cannot capture the development over time of the subjects the dimensions describe
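A minimal sketch of Solution 2, again with Python's sqlite3 module standing in for the DW. The column StoreNo is a hypothetical business key added here only to show that both rows describe the same store; the slides show only the surrogate StoreID. The INTEGER PRIMARY KEY column plays the role that IDENTITY plays in SQL Server.

```python
import sqlite3


def solution2_demo():
    """Replay the Store example with Solution 2 (one row per version)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- StoreID is a surrogate key: it identifies one version of a store.
        -- StoreNo is a hypothetical business key, not shown on the slides.
        CREATE TABLE store (StoreID INTEGER PRIMARY KEY, StoreNo TEXT, Size INTEGER);
        CREATE TABLE sales (StoreID INTEGER REFERENCES store, ItemsSold INTEGER);
        INSERT INTO store VALUES (1, 'S-17', 250);
        INSERT INTO sales VALUES (1, 2000);
    """)
    # Size changes: insert a new version instead of updating the old row.
    cur = con.execute("INSERT INTO store (StoreNo, Size) VALUES ('S-17', 450)")
    new_key = cur.lastrowid  # surrogate key generated by the database
    # New facts refer to the newest version.
    con.execute("INSERT INTO sales VALUES (?, 2500)", (new_key,))
    rows = con.execute("""
        SELECT d.StoreID, d.Size, f.ItemsSold
        FROM sales f JOIN store d ON d.StoreID = f.StoreID
        ORDER BY d.StoreID
    """).fetchall()
    con.close()
    return rows
```

Each fact stays joined to the Size that was valid when it was recorded. Nothing in the tables says when version 2 replaced version 1, which is the limitation listed under Cons.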

Solution 3: two versions of changing attribute

StoreID  …  ItemsSold  …      StoreID  …  DistrictOld  DistrictNew  …
001         2000              001         37           37

versions of an attribute

StoreID  …  ItemsSold  …      StoreID  …  DistrictOld  DistrictNew  …
001         2000              001         37           73

StoreID  …  ItemsSold  …      StoreID  …  DistrictOld  DistrictNew  …
001         2000              001         37           73
001         2100

Solution 3
• Solution 3: Create two versions of each changing attribute
  ▪ One attribute contains the current value
  ▪ The other attribute contains the previous value
• Consequences
  ▪ Two values are attached to each dimension row
• Pros
  ▪ Possible to compare across the change in dimension value (which is a problem with Solution 2)
    ◆ Such comparisons are interesting when we need to work simultaneously with two alternative values
    ◆ Example: Categorization of stores and products
• Cons
  ▪ Not possible to see when the old value changed to the new
  ▪ Only possible to capture the two latest values

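A small sketch of Solution 3 using Python's sqlite3 module (an assumption made for runnability; table and function names are illustrative). It replays the District example and then aggregates the same facts under the old and the new district.

```python
import sqlite3


def solution3_demo():
    """Replay the District example with Solution 3 (old and new attribute)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE store (StoreID TEXT PRIMARY KEY,
                            DistrictOld INTEGER, DistrictNew INTEGER);
        CREATE TABLE sales (StoreID TEXT REFERENCES store, ItemsSold INTEGER);
        INSERT INTO store VALUES ('001', 37, 37);
        INSERT INTO sales VALUES ('001', 2000);
    """)
    # District changes from 37 to 73. In one UPDATE the right-hand sides see
    # the values from before the statement, so DistrictOld keeps 37.
    con.execute("""UPDATE store
                   SET DistrictOld = DistrictNew, DistrictNew = 73
                   WHERE StoreID = '001'""")
    con.execute("INSERT INTO sales VALUES ('001', 2100)")
    query = """SELECT d.{col}, SUM(f.ItemsSold)
               FROM sales f JOIN store d ON d.StoreID = f.StoreID
               GROUP BY d.{col}"""
    by_old = con.execute(query.format(col="DistrictOld")).fetchall()
    by_new = con.execute(query.format(col="DistrictNew")).fetchall()
    con.close()
    return by_old, by_new
```

All facts can be viewed under either categorization, but a second change would overwrite the value 37.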
Solution 2A: inserting special facts

StoreID  TimeID  …  ItemsSold  …      StoreID  …  Size  …
001      234        2000              001         250

special fact for capturing changes

StoreID  TimeID  …  ItemsSold  …      StoreID  …  Size  …
001      234        2000              001         250
002      345        -                 002         450

StoreID  TimeID  …  ItemsSold  …      StoreID  …  Size  …
001      234        2000              001         250
002      345        -                 002         450
002      456        2500

Solution 2A
• Solution 2A: Use special facts for capturing changes in dimensions via the Time dimension.
  ▪ Assume that no simultaneous, new fact refers to the new dimension row
  ▪ Insert a new special fact that points to the new dimension row, and through its reference to the Time dimension, timestamps the row
• Pros
  ▪ Possible to capture the development over time of the subjects that the dimensions describe
• Cons
  ▪ Even larger tables
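A sketch of Solution 2A with Python's sqlite3 module. Storing the "-" measure of the special fact as NULL is an assumption of this sketch; the slides do not say how the empty measure is represented.

```python
import sqlite3


def solution2a_demo():
    """Replay the Store example with Solution 2A (special facts)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE store (StoreID INTEGER PRIMARY KEY, Size INTEGER);
        CREATE TABLE sales (StoreID INTEGER REFERENCES store,
                            TimeID INTEGER, ItemsSold INTEGER);
        INSERT INTO store VALUES (1, 250);
        INSERT INTO sales VALUES (1, 234, 2000);
        -- Size changes at TimeID 345: new version plus a special fact.
        INSERT INTO store VALUES (2, 450);
        INSERT INTO sales VALUES (2, 345, NULL);  -- special fact, no measure
        -- An ordinary fact arrives later.
        INSERT INTO sales VALUES (2, 456, 2500);
    """)
    # The special facts tell when each new version came into effect.
    change_times = con.execute("""
        SELECT d.StoreID, d.Size, f.TimeID
        FROM sales f JOIN store d ON d.StoreID = f.StoreID
        WHERE f.ItemsSold IS NULL
    """).fetchall()
    # SUM skips NULL, so special facts do not distort the measure.
    total = con.execute("SELECT SUM(ItemsSold) FROM sales").fetchone()[0]
    con.close()
    return change_times, total
```

Queries that count fact rows, rather than sum a measure, would have to filter the special facts out explicitly.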

Solution 2B: timestamping

StoreID  TimeID  …  ItemsSold  …      StoreID  Size  From  To
001      234        2000              001      250   98    -

attributes: “From”, “To”

StoreID  TimeID  …  ItemsSold  …      StoreID  Size  From  To
001      234        2000              001      250   98    99
                                      002      450   00    -

StoreID  TimeID  …  ItemsSold  …      StoreID  Size  From  To
001      234        2000              001      250   98    99
002      456        2500              002      450   00    -

Solution 2B
• Solution 2B: Versioning of rows with changing attributes like in Solution 2 + timestamping of rows
• Pros
  ▪ Correct information captured in DW
• Cons
  ▪ Larger dimension tables
  ▪ Consider whether Time dimension values and timestamps describe the same aspect of time

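A sketch of Solution 2B with Python's sqlite3 module. The column names ValidFrom and ValidTo are assumptions (FROM and TO are SQL keywords), the two-digit years 98, 99, 00 of the slide are written as 1998, 1999, 2000, and an open-ended "To" is stored as NULL. With more than one store, a business key would also be needed in the lookup.

```python
import sqlite3


def make_store_dim():
    """Build the timestamped Store dimension from the slide."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE store (StoreID INTEGER PRIMARY KEY, Size INTEGER,
                            ValidFrom INTEGER, ValidTo INTEGER);
        INSERT INTO store VALUES (1, 250, 1998, 1999);
        INSERT INTO store VALUES (2, 450, 2000, NULL);  -- current version
    """)
    return con


def version_at(con, year):
    """Return (StoreID, Size) of the version valid in the given year."""
    return con.execute("""
        SELECT StoreID, Size FROM store
        WHERE ValidFrom <= ? AND (ValidTo IS NULL OR ValidTo >= ?)
    """, (year, year)).fetchone()
```

For example, version_at(make_store_dim(), 1999) finds the row with Size 250, and any year from 2000 onwards finds the row with Size 450.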
Solution 2B
• Solution 2B: examples
• Product descriptions are versioned when products are changed, e.g., new package sizes
• Old versions are still in the stores, so new facts can refer to both the newest and older versions of products
• Time value for a fact is not necessarily between the “From” and “To” values in the fact’s Product dimension row
• Unlike changes in Size for a store, where all facts from a certain point in time will refer to the newest Size value
• Unlike alternative categorizations that one wants to choose between.

Rapidly Changing Dimensions
• Difference between “slowly” and “rapidly” is subjective
  ▪ Solution 2 is often still feasible
  ▪ The problem is the size of the dimension
• Example
  ▪ Assume an Employee dimension with 100,000 employees, each using 2K bytes and many changes every year
  ▪ Solution 2B is recommended
• Other typical examples of (large) dimensions with many changes are Product and Customer
• The more attributes in a dimension table, the more changes per row can be expected
• Example:
  ▪ A Customer dimension with 100M customers and many attributes
  ▪ Solution 2 yields a dimension that is too large
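The size argument can be made concrete with a little arithmetic. The sketch below assumes that every change adds one full row version (as in Solution 2), that "2K bytes" means 2048 bytes, and it uses a hypothetical rate of 5 changes per employee per year over 3 years; the slides only say "many changes every year".

```python
def dimension_rows(members, changes_per_year, years):
    """Rows in a versioned dimension: one initial row plus one per change."""
    return members * (1 + changes_per_year * years)


def dimension_bytes(members, row_bytes, changes_per_year, years):
    """Approximate size of the versioned dimension in bytes."""
    return dimension_rows(members, changes_per_year, years) * row_bytes


# Employee dimension from the slide: 100,000 employees at 2K bytes each.
employee_base = dimension_bytes(100_000, 2 * 1024, 0, 0)
# Hypothetical change rate: 5 changes per employee per year for 3 years.
employee_versioned = dimension_bytes(100_000, 2 * 1024, 5, 3)
```

Under these assumptions the dimension grows from roughly 0.2 GB to roughly 3.3 GB, which is still manageable. The same change rate applied to 100M customers gives 1.6 billion rows, which illustrates why Solution 2 yields a dimension that is too large in that case.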

Solution 4: dimension splitting

[Figure: the Customer dimension before and after splitting]
Before: Customer dimension with CustID, Name, PostalAddress, Gender, DateofBirth, Customerside, …, NoKids, MaritalStatus, CreditScore, BuyingStatus, Income, Education, …
After: Customer dimension with CustID, Name, PostalAddress, Gender, DateofBirth, Customerside, …
After: Demographics minidimension with DemographyID, NoKids, MaritalStatus, CreditScoreGroup, BuyingStatusGroup, IncomeGroup, EducationGroup, …

Solution 4
• Solution 4
  ▪ Make a “minidimension” with the often-changing (demographic) attributes
  ▪ Convert (numeric) attributes with many possible values into attributes with few discrete or banded values
    ◆ Why? Any Information Loss?
  ▪ Insert rows for all combinations of values from these new domains
    ◆ With 6 attributes with 10 possible values each, the dimension gets 1,000,000 rows
  ▪ If the minidimension is too large, it can be further split into more minidimensions
    ◆ Here, synchronous/correlated attributes must be considered (and placed in the same minidimension)
    ◆ The same attribute can be repeated in another minidimension

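The banding and the "all combinations" step of Solution 4 can be sketched as follows. The band boundaries in INCOME_EDGES, the choice of 10 values per attribute, and the mixed-radix numbering used for DemographyID are assumptions for illustration; the slides do not prescribe how keys are assigned.

```python
import bisect
import math

# The six attributes of the Demographics minidimension on the slide.
ATTRIBUTES = ["NoKids", "MaritalStatus", "CreditScoreGroup",
              "BuyingStatusGroup", "IncomeGroup", "EducationGroup"]
CARDINALITIES = [10] * len(ATTRIBUTES)  # 10 possible values per attribute

# Hypothetical income band boundaries: 9 edges give 10 bands (0..9).
INCOME_EDGES = [10_000 * i for i in range(1, 10)]


def band(value, edges):
    """Map a numeric value to the index of its band."""
    return bisect.bisect_right(edges, value)


def demography_id(bands, cardinalities=CARDINALITIES):
    """Number one combination of band values (mixed-radix encoding)."""
    key = 0
    for b, n in zip(bands, cardinalities):
        if not 0 <= b < n:
            raise ValueError("band value out of range")
        key = key * n + b
    return key


# One row per combination of values: 10 * 10 * 10 * 10 * 10 * 10.
MINIDIMENSION_ROWS = math.prod(CARDINALITIES)
```

An income of 25,000 falls into band 2. When a customer's income moves to another band, only the DemographyID referenced by new facts changes; no row is added to either dimension. The exact income is no longer visible, which is the information loss the slide asks about.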
Changing Dimensions
• Pros
  ▪ DW size (dimension tables) is kept down
  ▪ Changes in a customer’s demographic values do not result in changes in dimensions
• Cons
  ▪ More dimensions and more keys in the star schema
  ▪ Using value groups gives less detail
  ▪ The construction of groups is irreversible and makes it hard to make other groupings
  ▪ Navigation of customer attributes is more cumbersome as these are in more than one dimension

Changing dimensions - Summary
• Why change dimensions?
  ▪ Applications change
  ▪ The modeled reality changes
• Multidimensional models realized as star schemas support change over time to a large extent
• A number of techniques for handling change over time at the instance level were described
  ▪ Solution 2 (and the derived 2A and 2B) is the most useful
  ▪ Possible to capture change precisely

Overview
• Advanced multidimensional modeling
  ▪ Mainly handling changes in dimensions
• Large-scale dimensional modeling
  ▪ Coordinating cubes/data marts
• Multidimensional database implementation
  ▪ MS SQL Server
  ▪ MS Analysis Services

DW Bus Architecture
• What method for DW construction?
  ▪ Everything at once, top-down DW (“monoliths”)
  ▪ Separate, independent marts (“stovepipes”, “data islands”)
• Architecture-guided step-by-step method in practice
  ▪ Combines the advantages of the first two methods
• A data mart can be built much faster than a DW
  ▪ ETL is the hardest - minimize risk with a simple mart
  ▪ Data marts must be compatible → comparable views of the enterprise
• Start with single-source data marts
  ▪ Facts from only one source make everything easier

DW Bus Architecture
• Data marts built independently by departments
  ▪ Good (small projects, focus, independence,…)
  ▪ Problems with “stovepipes” (reuse across marts impossible)
• Conformed dimensions
  ▪ Same structure and content across data marts
  ▪ Take data from the best source
  ▪ Dimensions are copied to data marts (not a space problem)
• Conformed fact definitions
  ▪ The same definition across data marts (price excl. sales tax)
  ▪ Facts are not copied between data marts (facts > 95% of data)
  ▪ Observe units of measurement (also currency, etc.)
  ▪ Use the same name only if it is exactly the same concept
• Allows several data marts to work together
  ▪ Combining data from several fact tables is no problem

DW Bus Architecture
• Dimension content managed by dimension owner
  ▪ The Customer dimension is made and published in one place
• Tools query each data mart separately
  ▪ Separate (SQL) queries to each data mart
  ▪ Results combined (outer join) by tool (or OLAP server)
• Hard to make conformed dimensions and facts
  ▪ Organizational and political challenge, not technical
  ▪ Get everyone together and
  ▪ Get a top manager (CIO) to back the conformance decision
• Exception: business areas totally separated
  ▪ No common management/control
  ▪ Build several DWs
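The "separate queries, then outer join" step can be sketched as below. The two dictionaries stand for the results of two separate queries, one per data mart, each grouped by the same conformed Product dimension; the product names, the numbers, and the function name drill_across are hypothetical.

```python
def drill_across(left, right):
    """Full outer join of two per-mart results on a conformed dimension value.

    Each argument maps a dimension value to an aggregated measure. A value
    missing from one mart yields None for that mart's measure.
    """
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k), right.get(k)) for k in keys]


# Hypothetical results of one query against a Sales mart and one against
# a Costs mart, both grouped by the conformed Product dimension.
sales_by_product = {"Milk": 2000, "Bread": 2500}
costs_by_product = {"Bread": 900, "Butter": 400}

report = drill_across(sales_by_product, costs_by_product)
```

The join only makes sense because both marts use the same Product values; with non-conformed dimensions the keys would not match and the rows could not be lined up.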

Large Scale Cube Design
• The design is never “finished”
  ▪ The dimensional modeler is always looking for new information to include in dimensions and facts
  ▪ A sign of success!
• New dimensions and measures introduced gracefully
  ▪ Existing queries will give same result
  ▪ Example: Location dimension can be added for old+new facts
  ▪ Can usually be done if data has sufficiently fine granularity
• Data mart granularity
  ▪ Always as fine as possible (transaction level detail)
  ▪ Makes the mart insensitive to changes

Coordinating Data Marts
• Multi-source data marts
  ▪ Not built initially due to too large complexity
  ▪ Built “on top of” several single-source data marts (building blocks)
  ▪ Relatively simple due to conformed dimensions and facts
  ▪ Can be done physically or virtually (in OLAP server)
  ▪ Example: profitability data mart
  ▪ Important to have fine (single transaction?) granularity
• Saving “stovepipes”
  ▪ Converting “stovepipes” to coordinated data marts
  ▪ Can “stovepipe dimensions” be directly mapped to conformed dimensions?
  ▪ Watch out for indirect mappings based on wrong assumptions (week → day → month?)

Matrix Method
• DW Bus Architecture Matrix
• Planning Process
  ▪ Make list of data marts
  ▪ Make list of dimensions
  ▪ Mark co-occurrences (which marts have which dimensions)
  ▪ Time dimension occurs in (almost) all marts

Dimensions: Time, Customer, Product, Supplier
Data marts:
Sales   + + +
Costs   + +
Profit  + + + +

Overview
• Advanced multidimensional modeling
  ▪ Mainly handling changes in dimensions
• Large-scale dimensional modeling
  ▪ Coordinating cubes/data marts
• Multidimensional database implementation
  ▪ MS SQL Server
  ▪ MS Analysis Services
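The bus matrix from the Matrix Method slide can be kept as a small data structure. Note the assumption: the slide gives only the number of marks per mart (3, 2, 4), so the assignment of marks to columns below, in particular Costs using Time and Product, is a guess made for illustration.

```python
DIMENSIONS = ["Time", "Customer", "Product", "Supplier"]

# Which mart uses which dimension. The column placement is an assumption.
BUS_MATRIX = {
    "Sales": {"Time", "Customer", "Product"},
    "Costs": {"Time", "Product"},
    "Profit": {"Time", "Customer", "Product", "Supplier"},
}


def render(matrix, dimensions):
    """Render the matrix as fixed-width text, one row per data mart."""
    header = "{:<8}".format("") + "".join("{:<10}".format(d) for d in dimensions)
    lines = [header]
    for mart, dims in matrix.items():
        marks = "".join("{:<10}".format("+" if d in dims else "")
                        for d in dimensions)
        lines.append("{:<8}".format(mart) + marks)
    return "\n".join(line.rstrip() for line in lines)


def shared_dimensions(matrix, dimensions=DIMENSIONS):
    """Dimensions used by at least two marts; these have to be conformed."""
    return sorted(d for d in dimensions
                  if sum(d in dims for dims in matrix.values()) >= 2)
```

Calling print(render(BUS_MATRIX, DIMENSIONS)) gives the matrix as text, and shared_dimensions(BUS_MATRIX) lists the dimensions that need a single owner across marts.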

MS SQL Server 2005
• Microsoft’s RDBMS
  ▪ Runs only on Windows OS
• Nice features built-in
  ▪ Transact-SQL stored procedures
  ▪ Replication
  ▪ Integration Services
  ▪ Analysis Services
  ▪ Reporting Services
• Easy to use
  ▪ Graphical “Management Studio” and “BI Development Studio” (demo)

SQL Server Data types
• Character data
  ▪ CHAR, VARCHAR, ……
• Binary data
  ▪ BINARY, VARBINARY, ……
• Date and time data
  ▪ DATETIME, SMALLDATETIME
  ▪ DATEADD(SS,dwml.dbo.[sales].[date],'19700101') converts UNIX time (seconds since 1970-01-01) to a DATETIME
• Numeric data
  ▪ INT, FLOAT, ……
• Keys: IDENTITY property generates unique integer keys
  ▪ Useful for DW (surrogate) keys !
• Other types ......

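For checking ETL results outside the database, the DATEADD expression on the Data types slide has a direct counterpart in Python. The function name unix_to_datetime is illustrative; like the SQL expression, the sketch adds seconds to 1970-01-01 and does no time zone handling, so the result is a naive UTC value.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # same base date as '19700101' in the SQL


def unix_to_datetime(seconds):
    """Convert UNIX time (seconds since 1970-01-01) to a naive datetime."""
    return EPOCH + timedelta(seconds=seconds)
```

For example, unix_to_datetime(1167609600) is midnight on 1 January 2007.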
Transact-SQL
• SQL Server’s SQL dialect = SQL + procedural code
• Procedures/functions, variables, IF-THEN, loops, DDL…
• Built-in functions
• Exception handling (RAISERROR)

CREATE PROCEDURE CustOrdersDetail @OrderID int
AS
SELECT ProductName,
    UnitPrice=ROUND(Od.UnitPrice, 2),
    Quantity,
    Discount=CONVERT(int, Discount * 100),
    ExtendedPrice=ROUND(CONVERT(money, Quantity*(1 - Discount)*Od.UnitPrice), 2)
FROM Products P, [Order Details] Od
WHERE Od.ProductID = P.ProductID and Od.OrderID = @OrderID
GO

Replication
• Publisher
  ▪ Server that publishes data to other servers
  ▪ Has one or more publications consisting of articles (tables, etc.)
• Distributor
  ▪ Server that manages distribution of data
  ▪ Keeps track of publications and subscriptions, history, etc.
• Subscriber
  ▪ Server that receives data
  ▪ Has subscriptions to a number of publications
• Push/pull both possible
• Types of replication
  ▪ Snapshot replication - copies data at a specific point in time
  ▪ Transactional replication - first send snapshot, then send updates
  ▪ Merge replication - distributed, disconnected replication

MS Analysis Services
• Cheap, easy to use, good enough → “a real MS product”
• (R/M/H)OLAP technology
  ▪ Data placement as desired
• Intelligent pre-aggregation
• Server and client parts
  ▪ Reporting Services a separate service
• MS OLE DB for OLAP interface

MS Analysis Services
• BI Development Studio (BIDS): demo
1) Build a relational DW in SQL Server
2) Create an Analysis Services project in BIDS
3) New data source
4) New data source view
5) New cube
    1) Wizard guesses hierarchies
    2) Dimensions can be built first, if desired
6) Cube can be browsed/queried

Mini Project
• New subtasks
  ▪ See mini project page
• After this part, you should have the following:
  ▪ Description of business processes
  ▪ Choice of data source(s) + considerations described
  ▪ A “true” multidimensional schema + description
  ▪ A relational DW schema design, e.g., star schema + description
  ▪ An implementation in SQL Server and Analysis Services
  ▪ You are welcome to discuss your design with me
• “Building Cubes” subtask can only be completed after the ETL subtask is completed
  ▪ Cube must be refreshed or rebuilt after ETL is complete
  ▪ But “test cube” can be built based on data you type
  ▪ Create Relational DW tables by ETL or directly by MS SQL

Summary
• Advanced multidimensional modeling
  ▪ Mainly handling changes in dimensions
• Large-scale dimensional modeling
  ▪ Coordinating cubes/data marts
• Multidimensional database implementation
  ▪ MS SQL Server
  ▪ MS Analysis Services
