Data Warehouse Implementation

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 37

Data Warehouse Introduction

Spring 2016
Zafar Heydari, Farid
Outlines About Lecturer

OLAP vs OLTP

Why Data Warehouse

Designing Data Warehouse

Designing Dimensions

Designing Fact Tables


OLAP vs. OLTP

ROLAP

OLAP

MOLAP

OLTP SSIS
OLTP- Online Transactional Processing
 Designed to solve everyday work transactions (i.e. sales, customer care, manufacturing)
 Points where the data is captured and recorded in the company
 Very efficient in the management of specific information : ERP, CRM, HRM, SCM, Email, Others

Disadvantages Advantages

• Difficult to manage historical • Reduces paper trails


information • Simple & effective
• High cost of maintenance and • Highly accurate Advantages
gathering distributed data
• Need technical skills and
resources
OLAP – Online Analytical Processing

 Designed specifically to obtain information, analyze and solve business problems


 Specific analytical information is added to the data
 They use a different database technology, optimized to extract information.
 Analytical systems unify all of the company’s data in one system: the Data Warehouse
What’s in Cube?
OLAP Cubes

 They let you analyze all of the information available in the Data
Warehouse
 Each cube stores a set of specific information, and contains different
“measures” and “dimensions”
 Measures
numbers (amounts, quantities, percentages)
 Dimensions
contain attributes to filter and order information
Samples of Dimension & MeasuresDimension

Sales Stock / Inventory Supply Orders

• Dimensions: • Dimensions: • Dimensions: • Dimensions:


• Date / Time • Date / Time • Date / Time • Date / Time
• Customer • Store / • Supplier • Product
• Branch / Branch • Product • Customer
Store • Product • Contract • Salesperson
• Product • Contract • Terms of sale
• Measures: • Measures: terms • Measures:
• Quantity • Quantity • Type of • Quantity
• Price • Price transactions • Amount
• Discount • Cost • Measures: • Discount
• Profit • Quantity
• Amount
• Cost
Why is it multidimensional?

A dimension is one of the “edges” of your business

It is called multidimensional because you can see the


information from different “edges” at the same time
 Customers
 Suppliers
 Employees
 Products
 Places (Factories, Branches,…)
 Time
 Transaction Types ( Invoices, Orders, Quotes,…)
OLAP vs. OLTP
OLTP DB of Sales

 Highly Normalized
 Lots of Joins
 Archived Data from 1/1/1
 Low Performance
 Multiple Data Bases
 Different Naming &
definitions
Why Data Warehouses?

 Simplified Reporting

 Archived Data

 Different Data Sources

 Consistent Central Hub for Data Analysis


 OLTP is highly Normalized design
 OLAP/DW are highly DE-Normalized design
Designing Data Warehouse (1)
FACT Tables
(Numbers)
Production De-Normalized
Environment Dimension

Dimension
(Attributes)

DIM Sub-Category

Prototype Normalized DIM Region


DIM Category
Environment Dimension DIM Country
Designing Data Warehouse (2)

Star Schema : Focuses on an area of OLTP data

Central Fact table contains Metric data for Analysis

Dimensions contain attributes to describe fact data

Snowflake Schema extends De-normalized Star Schema


Designing Dimensions (1)
 Dimension tables contain:
 Attributes : to filter/slice Fact
data
 Labels: to get context of the
report
 Keys: Primary Keys(Surrogate
Keys) & Natural Keys (OLTP
Key)
 Lineage Key: Track where &
when the data comes (Master
lineage data)
Designing Dimensions (2)

 SCD: Slowly Changing Dimensions


 How do we deal with updates
 Determine historical aspects of dimension
data
 Type 1: Don’t maintain history
 Type 2: Save history by adding a new
record
 Type 3: Only Save Previous Value (Less
Popular)
 Sample : Department name :
Marketing Sales  Marketing
Designing Dimensions (3)

 Natural, Durable, and Supernatural Keys


 employee number (natural key)
 durable key/ durable supernatural key: persist key, independent of the original business process

 Flags and Indicators as Textual Attributes


 Cryptic abbreviations, true/false flags, and operational indicators should be supplemented in
dimension tables with full text meaningful words

 Null Attributes in Dimensions


 a given dimension row has not been fully populated,
 or when there are attributes that are not applicable to all the dimension’s rows.
 In both cases, we recommend substituting a descriptive string, such as Unknown or Not Applicable
Designing Dimensions (4)

 Calendar Date Dimensions


 Dim Date has different attributes such as week
number, month name, fiscal period, and national
holiday indicator, …
 To facilitate partitioning, the primary key of a date
dimension can be more meaningful, such as an
integer representing YYYYMMDD
 the date dimension table needs a special row to
represent unknown or to-be-determined dates
 a separate date/time stamp can be added to the
fact table. The date/time stamp is not a foreign key
to a dimension table
Designing Dimensions (5)

 Role-Playing Dimensions
 A single physical dimension can be referenced multiple times in a fact table
 For instance: A fact table can have several dates, each of which is represented by a foreign
key to the date dimension. It is essential that each foreign key refers to a separate view of
the date dimension so that the references are independent.
 These separate dimension views (with unique attribute column names) are called roles.
Designing Dimensions (6)

 Hierarchies:
 De-normalized Flattened Dimensions (Star)
 Multiple Hierarchies in Dimensions (Snow flake)

 Which Data levels (Star/ Snowflake?) : Depends on usability in all fact tables
Star Sample
Snowflake Sample
Hybrid Sample
Designing Fact Tables (1)

 Fact tables contain:


 Key: Primary Keys , Dimension
Keys
 Measure: Calculated Numbers
 Lineage: Track where & when
the data comes (Master
lineage data)
 Supportive labels
Designing Fact Tables (2)

 Measures Additivity:
 Additive: All-dimensions
 Non-Additive: No dimensions
(Unit Cost)
 Semi-Additive: Some
dimensions (All except time)
(Units Balance)
Designing Fact Tables (3)

 Fact-less Table:
represent Many-to-Many
relationships in Data
Warehouse.
Designing Fact Tables (4)

 Nulls in Fact Tables:


 The aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do
the “right thing” with null facts.

 Nulls must be avoided in the fact table’s foreign keys

 The associated dimension table must have a default row (and


surrogate key) representing the unknown or not applicable
condition.
Designing Fact Tables (5)

 Conformed Facts
 If the same measurement appears in separate fact
tables:
 If the separate fact definitions are consistent, the
conformed facts should be identically named

 but if they are incompatible, they should be differently


named to alert the business users and BI applications
Designing Fact Tables (6)

 Periodic Snapshot Fact Tables


 A row in a periodic snapshot fact table summarizes many
measurement events occurring over a standard period, such as
a day, a week, or a month.
 The grain is the period, not the individual transaction.
 if no activity takes place during the period, a row is typically
inserted in the fact table containing a zero or null for each fact.
Designing Fact Tables (7)

 Accumulating Snapshot Fact Tables


 A row in an accumulating snapshot
fact table summarizes the
measurement events occurring at
predictable steps between the
beginning and the end of a
process.
 Samples: Pipeline or workflow
processes, such as order
fulfillment or claim processing
 There is a date foreign key in the
fact table for each critical
milestone in the process.
 As pipeline progress occurs, the
accumulating fact table row is
revisited and updated.
Designing Fact Tables (8)

 Consolidated Fact Tables


 to combine facts from multiple processes together into
a single consolidated fact table.
(Sales actuals can be consolidated with sales forecasts)
Designing Fact Tables (9)

 Header/Line Fact Tables


 parent/child schemas), all the header-level dimension foreign keys and degenerate
dimensions should be included on the line-level fact table.
Managing Fact Data

 Indexing Fact Tables:


 Clustered on 1 small numeric PK
 Non-clustered on FK
 Column Store on Large Tables

Lineage Key
 Partitioning on Large table
P-Schema
 Simplified maintenance
 Improve performance P-Function

 Lineage
 Directly in Fact tables
 Centralized Lineage Dimension
 Meta data about columns
Kimball Dimensional Modeling Techniques

 Gather Business Requirements and Data Realities


 Collaborative Dimensional Modeling Workshops
 Four-Step Dimensional Design Process:
 1. Select the business process.
 2. Declare the grain. (atomic grain)
 3. Identify the dimensions.
 who, what, where, when, why, and how OF CONTECT OF BUSINESS
PROCESS EVENT
 4. Identify the facts.
Exercise 1
Answer 1
Any Question?
Spring 2016
Zafar Heydari, Farid

You might also like