
Data Warehouse

Life Cycle
Data Warehouse Defined

“A data warehouse is a collection of
corporate information, derived directly
from operational systems and some
external data sources. Its specific purpose
is to support business decisions, not
business operations”
Characteristics of a DW
• Subject-oriented Data
– collects all data for a subject, from different sources
• Read-only Requests
– loaded during off-hours, read-only during day hours
• Interactive Features, ad-hoc query
– flexible design to handle spontaneous user queries
• Pre-aggregated data
– to improve runtime performance
• Highly denormalized data structures
– fat tables with redundant columns
Components of a Data Warehouse
[Figure: Components of a Data Warehouse. Source Systems feed the Data Staging Area (Storage: flat files, RDBMS; Processing; no user query services), which feeds the DWH Servers (Data Mart 1 and Data Mart 2, dimensional, conforming to the DW Bus), which serve End User Data Access (query tools, report writers, mining tools).]
Data Modeling
WHAT IS A DATA MODEL?
A data model is an abstraction of some aspect of
the real world (system).
WHY A DATA MODEL?
– Helps to visualise the business
– A model is a means of communication.
– Models help elicit and document requirements.
– Models reduce the cost of change.
– Model is the essence of DW architecture based on
which DW will be implemented
What do we want to do with the data?

Model depends on what kind of data analysis
we want to do:
• Different Data Analysis Techniques
– Query and reporting
• Display Query Results
– Multidimensional analysis
• Analyse data content by looking at it in different
perspectives
– Data mining
• discover patterns and clustering attributes in data
Impact of Data Analysis
Techniques on DM
• Query and reporting
• Normalized data model
• Select associated data elements
• summarize and group by category
• present results
• direct table scan
• An ER model, normalized or denormalized, is appropriate
Impact of Data Analysis
Techniques on DM
• Multidimensional analysis
• Fast and easy access to data
• Any number of analysis dimensions in any
combinations
• ER will mean many joins
• Dimensional model appropriate
Levels of modeling
• Conceptual modeling
– Describe data requirements from a business
point of view without technical details
• Logical modeling
– Refine conceptual models
– Data structure oriented, platform
independent
• Physical modeling
– Detailed specification of what is physically
implemented using specific technology
Conceptual Model

• A conceptual model shows data through
business eyes.
• All entities which have business
meaning.
• Important relationships
• Few significant attributes in the entities.
• Few identifiers or candidate keys.
Logical Model

• Replaces many-to-many relationships with
associative entities.
• Defines a full population of entity
attributes.
• May use non-physical entities for domains
and sub-types.
• Establishes entity identifiers.
• Has no specifics for any RDBMS or
configuration.
Physical Model

• A physical data model may include
– Referential Integrity
– Indexes
– Views
– Alternate keys and other constraints
– Tablespaces and physical storage objects.
What needs to be modeled during
a data warehouse project
• STAGING AREA
– YES ! (maybe multiple data models are
required)
• ODS
– YES !
• DATAWAREHOUSE/DATAMART
– YES!
Data Modeling - Techniques

• Modeling techniques

– E-R Modeling
– Dimensional Modeling
Implementation and modeling
styles

• Modeling versus implementation


– Modeling: describe what should be built to
non-technical folks
– Implementation: describe what is actually
built to technical folks
Implementation and modeling
styles
• Relational modeling
– Use for implementation
– Difficult to understand by non-technical folks
• Dimensional modeling
– Use for modeling during analysis and design
phases
– Can be implemented using other modeling
styles e.g. object-oriented, relational
Limitations of E-R Modeling

• Poor Performance
• Tend to be very complex and difficult to
navigate.
Dimensional Modeling

• Dimensional modeling uses three basic
concepts: measures, facts, dimensions.
• Is powerful in representing the
requirements of the business user in the
context of database tables.
• Focuses on numeric data, such as values,
counts, weights, balances and occurrences.
Dimensional modeling

• Must identify
– Business process to be supported
– Grain (level of detail)
– Dimensions
– Facts
Conventions used in
Dimensional modeling
• Facts
• Measures(Variables)
• Dimensions
– Dimension members
– Dimension hierarchies
Facts
• A fact is a collection of related data items,
consisting of measures and context data.
• Each fact typically represents a business
item, a business transaction, or an event
that can be used in analyzing the
business or business process.
• Facts are measured, “continuously
valued”, rapidly changing information.
Can be calculated and/or derived.
Fact Table

• A table that is used to store business
information (measures) that can be used
in mathematical equations.
– Quantities
– Percentages
– Prices
Dimensions

• A dimension is a collection of members or
units of the same type of views.
• Dimensions determine the contextual
background for the facts.
• Dimensions represent the way business
people talk about the data resulting from a
business process, e.g., who, what, when,
where, why, how
Dimension Table

• Table used to store qualitative data about
fact records
– Who
– What
– When
– Where
– Why
Dimension data should be

• verbose, descriptive
• complete
• no misspellings, impossible values
• indexed
• equally available
• documented ( metadata to explain origin,
interpretation of each attribute)
Dimensional model
• Visualise a dimensional model as a CUBE (hypercube
because dimensions can be more than
3 in number)
• Operations for OLAP
– Drill Down: higher level of detail
– Roll Up: summarized level of data
(The navigation path is determined by hierarchies within
dimensions.)
– Slice: cuts through the cube. Users can focus on specific
perspectives
– Dice: rotates the cube to another perspective (change the
dimension)
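The roll-up, drill-down and slice operations can be sketched on a tiny in-memory cube. This is a minimal illustration only: the cell values, the `MONTH_TO_QUARTER` hierarchy and the function names are assumptions made for the example, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (month, region, product); values are sales.
cube = {
    ("Jan", "East", "Tents"): 100,
    ("Jan", "West", "Tents"): 80,
    ("Feb", "East", "Tents"): 120,
    ("Feb", "East", "Canoes"): 50,
    ("Apr", "West", "Canoes"): 70,
}

# Assumed hierarchy within the time dimension: month -> quarter.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}


def roll_up_to_quarter(cells):
    """Roll up: summarize months into quarters along the time hierarchy."""
    result = defaultdict(int)
    for (month, region, product), sales in cells.items():
        result[(MONTH_TO_QUARTER[month], region, product)] += sales
    return dict(result)


def drill_down(cells, quarter):
    """Drill down: return the month-level cells behind one quarter."""
    return {k: v for k, v in cells.items() if MONTH_TO_QUARTER[k[0]] == quarter}


def slice_cube(cells, region):
    """Slice: fix one member of the region dimension, keep the other two."""
    return {(m, p): v for (m, r, p), v in cells.items() if r == region}
```

Rolling up and then drilling down on "Q1" returns to the four month-level cells that make up the quarter totals.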
Drill down …. Roll up
Slice and Dice
Dimensions
• Collection of members or units of the same type of
views.
• determine the contextual background for the facts.
• the parameters over which we want to perform
OLAP (eg. Time, Location/region, Customers)
• Member is a distinct name to determine data item’s
position (eg. Time - Month, quarter)
• Hierarchies arrange members into levels
Hierarchies

• Allow for the ‘rollup’ of data to more
summarized levels.
– Time
• day
• month
• quarter
• year
Hierarchies
Aggregates

• Aggregate tables are pre-stored
summarized tables, created at a higher
level of granularity across any or all of the
dimensions.

• If the existing granularity is day-wise sales,
then creating a separate month-wise sales
table is an example of an aggregate table.
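The day-to-month example can be sketched as follows. The rows, dates and the name `build_monthly_aggregate` are hypothetical; a real warehouse would normally build the aggregate table in the database during the load.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day-grain fact rows: (sale_date, product, sales_amount).
daily_sales = [
    (date(2024, 1, 5), "Tents", 200.0),
    (date(2024, 1, 20), "Tents", 150.0),
    (date(2024, 2, 3), "Tents", 300.0),
    (date(2024, 1, 9), "Canoes", 400.0),
]


def build_monthly_aggregate(rows):
    """Summarize day-grain rows to (year, month, product) grain."""
    totals = defaultdict(float)
    for sale_date, product, amount in rows:
        totals[(sale_date.year, sale_date.month, product)] += amount
    return dict(totals)


# The pre-stored aggregate: queries at month level read this smaller table
# instead of scanning every daily row.
monthly_sales = build_monthly_aggregate(daily_sales)
```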
Aggregates

• The use of such aggregates is the single
most effective tool the data warehouse
designer has to improve query
performance.

• Usage of aggregates can increase the
performance of queries by several
times.
Measures
• A measure is a numeric attribute of a fact,
representing the performance or behaviour
of the business relative to dimensions.
• The actual numbers are called variables.
eg. sales in money, sales volume, quantity supplied, supply
cost, transaction amount
• A measure is determined by combinations of
the members of the dimensions and is
located on facts.
The Cube
Types of Facts
• Additive
– Able to add the facts along all the dimensions
– Discrete numerical measures eg. Retail sales in $
• Semi Additive
– Snapshot, taken at a point in time
– Measures of Intensity
– Not additive along time dimension eg. Account
balance, Inventory balance
– Added and divided by number of time period to
get a time-average
Types of Facts

• Non Additive
– Numeric measures that cannot be added across
any dimensions
– Intensity measure averaged across all
dimensions eg. Room temperature
– Textual facts - AVOID THEM
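The difference between additive and semi-additive facts can be shown with account balances. This is a sketch with made-up accounts and periods: balances may be summed across accounts at one point in time, but along time they are added and divided by the number of periods.

```python
# Hypothetical month-end balance snapshots per account (a semi-additive fact).
balances = {
    "acct-1": {"Jan": 100.0, "Feb": 200.0, "Mar": 300.0},
    "acct-2": {"Jan": 50.0, "Feb": 50.0, "Mar": 50.0},
}


def total_balance_at(period):
    """Adding across the account dimension at one point in time is meaningful."""
    return sum(by_period[period] for by_period in balances.values())


def time_average_balance(account):
    """Along time, add the snapshots and divide by the number of periods."""
    by_period = balances[account]
    return sum(by_period.values()) / len(by_period)
```

Summing the three monthly balances of `acct-1` would give 600, a figure with no business meaning; the time-average of 200 is the usable number.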
Common structures for
Data Marts :
Denormalize!
• Star
– Single fact table surrounded by denormalized
dimension tables
– The fact table primary key is the composite of the
foreign keys (primary keys of dimension tables)
– Fact table contains transaction type information.
– Many star schemas in a data mart
– Easily understood by end users, more disk storage
required
Example of Star Schema
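A minimal star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_region`) and the rows are assumptions for illustration; the fact table's primary key is the composite of the dimension foreign keys, as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized dimension tables with single-column surrogate keys.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    CREATE TABLE dim_region (
        region_key INTEGER PRIMARY KEY,
        region_name TEXT
    );
    -- Fact table: its primary key is the composite of the foreign keys.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        region_key INTEGER REFERENCES dim_region(region_key),
        sales_amount REAL,
        PRIMARY KEY (product_key, region_key)
    );
    INSERT INTO dim_product VALUES (1, 'Tents', 'Camping'), (2, 'Canoes', 'Boating');
    INSERT INTO dim_region VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales VALUES (1, 1, 240.0), (2, 2, 250.0), (1, 2, 425.0);
    """
)

# A typical star join: one join per dimension, then group by its attribute.
sales_by_region = conn.execute(
    """
    SELECT r.region_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_region AS r ON f.region_key = r.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
    """
).fetchall()
```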
Common structures for
Data Marts:
Denormalize!
• Snowflake
– Single fact table surrounded by normalized dimension
tables
– Normalizes dimension table to save data storage
space.
– When dimensions become very very large
– Less intuitive, slower performance due to joins

• May want to use both approaches, especially if
supporting multiple end-user tools.
Example of Snow flake schema
Snowflake - Disadvantages

• Normalization of dimensions makes the model
difficult for users to understand
• Decreases the query performance because
it involves more joins
• Dimension tables are normally smaller
than fact tables - space may not be a
major issue to warrant snowflaking
Keys …

• Primary Keys
– uniquely identify a record
• Foreign Keys
– primary key of another table referred here
• Surrogate Keys
– system-generated key for dimensions
– key on its own has no meaning
– integer key, less space
More Keys …
• Smart Keys
– primary key out of various attributes of
dimension
– AVOID THEM!
– Join to Fact table should be on single
surrogate key
• Production Keys
– DO NOT USE Production defined attributes
– Business may reuse/change them - DW
cannot!
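Surrogate key assignment can be sketched as a lookup from production key to a system-generated integer. The class name and the sample production keys are hypothetical; in practice the mapping would be held in the dimension table or a key-lookup table.

```python
import itertools


class SurrogateKeyGenerator:
    """Hands out meaningless integer keys, one per production key."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._by_production_key = {}

    def key_for(self, production_key):
        # Reuse the existing surrogate key if this production key was seen
        # before; otherwise generate the next integer.
        if production_key not in self._by_production_key:
            self._by_production_key[production_key] = next(self._next_key)
        return self._by_production_key[production_key]


product_keys = SurrogateKeyGenerator()
k1 = product_keys.key_for("SKU-TENT-01")   # hypothetical production keys
k2 = product_keys.key_for("SKU-CANOE-07")
k1_again = product_keys.key_for("SKU-TENT-01")
```

Because the fact table stores only the integer, a later change to the production key in the source system touches the lookup and not the fact rows.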
Basic Dimensional Modeling
Techniques
• Slowly changing Dimensions
• Rapidly changing Small Dimensions
• Large Dimensions
• Rapidly changing Large Dimensions
• Degenerate Dimensions
• Junk Dimensions
Slowly Changing Dimensions

A dimension is considered a Slowly
Changing Dimension when its attributes
remain almost constant over time,
requiring relatively minor alterations to
represent the evolved state.
The Time Dimension
Time_key
day_of_week
day_number_in_month
day_number_overall
week_number_in_year
month
quarter
fiscal_period
holiday_flag
weekday_flag
last_day_in_month_flag
season
event
Time Dimension

• A dedicated Time dimension is required
because SQL date semantics and
functions cannot generate several important
attributes required for analytical purposes.

• Attributes like fiscal period, holidays,
season and events depend on business
calendars, so SQL date functions alone
cannot generate them; weekday and weekend
flags are simpler to query as stored columns.
Time Dimension

• Moreover, SQL date stamps occupy more
space than an integer key, increasing the
size of the fact table.

• Joins on such SQL date stamps are costly
and can slow queries significantly.
Time Dimension

• The day of week (Monday, ...) is useful
to create reports comparing, for example,
Monday sales to Friday sales.
• The day number in month is useful for
comparing measures for the same day
in each month.
• The last day in month flag is useful for
performing payday analysis.
Time Dimension

• The holiday flag and season attributes
are useful for holiday vs. non-holiday
analysis and seasonal business analysis.

• The event attribute is needed to record
special days like strike days, etc.
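Populating the calendar-derived columns of the Time dimension can be sketched as below. The function names and the holiday set are assumptions; `fiscal_period`, `season` and `event` are left out because they depend on business rules that the slides do not specify.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def time_dimension_row(day, time_key, holidays=frozenset()):
    """Build one Time dimension row for a calendar day."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return {
        "time_key": time_key,                      # surrogate integer key
        "day_of_week": DAY_NAMES[day.weekday()],
        "day_number_in_month": day.day,
        "week_number_in_year": day.isocalendar()[1],  # ISO week number
        "month": day.month,
        "quarter": (day.month - 1) // 3 + 1,
        "holiday_flag": day in holidays,           # holidays supplied by the business
        "weekday_flag": day.weekday() < 5,         # Monday to Friday
        "last_day_in_month_flag": day.day == last_day,
    }


def build_time_dimension(start, days, holidays=frozenset()):
    """Generate consecutive daily rows starting at `start`."""
    return [
        time_dimension_row(start + timedelta(days=i), i + 1, holidays)
        for i in range(days)
    ]


rows = build_time_dimension(date(2024, 1, 1), 31, holidays={date(2024, 1, 1)})
```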
Data Modeling for Data Warehouse

12 Steps :

1. Study ER
2. Evaluate and Analyse
3. Review Dimension
4. Add Time Dimension
5. Identify Facts
6. Granularity
7. Merge Facts
8. Review Facts
9. Name Facts
10. Size the model
11. Record Metadata
12. Validate model
ETVL Overview
Introduction

Extraction, Transformation, Validation, Load

[Figure: Source Systems 1, 2 and 3 pass through ETVL into the Staging Area, and a second ETVL step moves data from the Staging Area into the Data warehouse.]
Extraction

– Source Systems (Multiple Source Systems)


• Flat files, Excel, Legacy Systems, RDBMS etc.
– Frequency of Extraction
– Staging Area (If any? How many?)
– Most Transformations from Source to Staging
– Cleansing and Data Quality
• Data integrity, De-duplication, completeness,
correctness
Transformation
– Usage of tools
• Reusability of Transformations
• Reusability of Mappings
– Different tools
• Informatica
• Warehouse Builder
• ETI
• Sagent
• PL/SQL scripts
Loading

– Loading Frequency
– Optimized Loading
• Indexing
• Partitioning
– Aggregation
• Sum
• Average
• Max
– Update Strategy
– Error Handling
Synopsis

– Source systems: flat files, Excel, legacy systems,
RDBMS etc.
– Implement Business Rules
– ODBC Connectivity
– Scheduling the ETVL
– Frequency of Extraction
– Staging Area
– Most Transformations from Source to Staging
Synopsis

– Cleansing and Data Quality
• Data integrity, De-duplication, completeness,
correctness
– Rejected Records
– Exception Handling and Error Log
– Optimized Loading
– Re-usability
– Aggregation of data
– Update Strategy
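The ETVL flow with rejected records and an error log can be sketched as a small pipeline. The field names (`customer`, `amount`) and the validation rules are assumptions chosen for the example; real pipelines would apply the business rules agreed for each source.

```python
def extract(source_rows):
    """Extraction: pull raw records from a source (here, an in-memory list)."""
    return list(source_rows)


def transform(row):
    """Transformation: normalize text and convert the amount to a number."""
    return {
        "customer": row["customer"].strip().upper(),
        "amount": float(row["amount"]),
    }


def validate(row):
    """Validation: completeness and correctness checks."""
    return bool(row["customer"]) and row["amount"] >= 0


def run_etvl(source_rows):
    """Return (loaded rows, rejected records with a reason for the error log)."""
    loaded, rejected, seen = [], [], set()
    for raw in extract(source_rows):
        try:
            row = transform(raw)
        except (KeyError, ValueError) as exc:
            rejected.append((raw, f"transform error: {exc}"))
            continue
        if not validate(row):
            rejected.append((raw, "failed validation"))
            continue
        key = (row["customer"], row["amount"])
        if key in seen:  # de-duplication
            rejected.append((raw, "duplicate"))
            continue
        seen.add(key)
        loaded.append(row)  # Load: append to the target (a list in this sketch)
    return loaded, rejected


source = [
    {"customer": " acme ", "amount": "10.5"},
    {"customer": "ACME", "amount": "10.5"},
    {"customer": "", "amount": "3"},
    {"customer": "Bolt", "amount": "abc"},
]
loaded, rejected = run_etvl(source)
```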
STAGING AREA - Some Clarity

• Staging Area
– optional
– to cleanse the source data
– Accepts data from different sources
– Data model is required at staging area
– Multiple data models may be required for
parking different sources and for transformed
data to be pushed out to warehouse
ODS - Some Clarity

• Operational Data Store
– Optional
– Granular, detailed level data
– May feed warehouse (eg when warehouse is
aggregated)
– Usually a relational model
– May keep data for a smaller time period than
warehouse
A look at different DW architectures

[Figure: Operational data and external data enter through the Load Manager; the warehouse holds detailed information, summary information and meta data under the Warehouse Manager; the Query Manager serves OLAP access.]
Data Warehouse Architecture - 2
Data Warehouse Architecture - 3
Data Warehouse Architecture - 4
DW Architecture

• Architecture Choices depend on
– Current infrastructure
– Business environment
– Desired management and control structure
– resources
– commitment …..
• Data Warehouse/data mart
DW Architecture

• Architecture Choices determine
– Where will DW reside?
• Centrally / locally / distributed
– Where will it be managed from?
• Centrally / independently
• 3 choices
• Global
• Independent
• Interconnected
(or) a combination of these three
DW Architecture

• Global Architecture
– related to scope of data access and storage
– does not mean centralized
– can be physically centralized or distributed
– enterprise view of data
– time-consuming & costly to implement
Global Architecture
DW Architecture

• Independent Architecture
– stand-alone
– controlled by a department
– minimal integration
– no global view
– very fast to implement
DW Architecture

• Interconnected Architecture
– distributed
– integrated and interconnected
– gives a global view of enterprise
– more complexity
• who manages / controls data
• another tier in architecture to share common data
between multiple data marts
• have a data sharing schema across data marts
Independent
&
Interconnected Architecture
Types of Data Warehouse

• Enterprise Data Warehouse


• Data Mart

Enterprise
Data Warehouse

Datamart Datamart Datamart


Enterprise data warehouse
• Contains data drawn from multiple
operational systems
• Supports time- series and trend analysis
across different business areas
• Can be used as a transient storage area to
clean all data and ensure consistency
• Can be used to populate data marts
• Can be used for everyday and strategic
decision making
Data Mart
• Logical subset of enterprise data
warehouse
• Organized around a single business
process
• Based on granular data
• May or may not contain aggregates
• Object of analytical processing by the
end user.
• Less expensive and much smaller than a
full blown corporate data warehouse.
Distributed and Centralized
Data warehouses

• DW sitting on a monolithic machine -


unrealistic
• Separate machines, different OS, different
DB systems - reality

Solution
• Share a uniform architecture to allow them
to be fused coherently
Classical Architectures

• Physical data warehouse (physical)


– Data warehouse --> data marts
– Data marts --> data warehouse
– Parallel data warehouse and data marts
Physical data warehouse:
Data warehouse --> data marts
[Figure: Physical Data Warehouse, Data Warehouse --> Data Marts. Source data (operational data and external data) passes through the Staging Area into the Data Warehouse, which then populates the Data Marts.]
Physical data warehouse:
Data marts --> data warehouse
[Figure: Physical Data Warehouse, Data Marts --> Data Warehouse. Source data (operational data and external data) passes through the Staging Area into the Data Marts, which then populate the Data Warehouse.]
Physical Data Warehouse:
Parallel Data Warehouse and
Data Mart
[Figure: Physical Data Warehouse, Parallel Data Warehouse & Data Marts. Source data (operational data and external data) passes through the Staging Area, which feeds the Data Warehouse and the Data Marts in parallel.]
DW Implementation
Approaches
• Top Down
• Bottom-up
• Combination of both
• Choices depend on:
– current infrastructure
– resources
– architecture
– ROI
– Implementation speed
Top Down Implementation
Bottom Up Implementation
DW Implementation
Approaches
Top Down
• More planning and design initially
• Involve people from different work-groups,
departments
• Data marts may be built later from Global DW
• Overall data model to be decided up-front

Bottom Up
• Can plan initially without waiting for
global infrastructure
• Built incrementally
• Can be built before or in parallel with Global DW
• Less complexity in design
DW Implementation
Approaches

Top Down
• Consistent data definition and enforcement of business
rules across enterprise
• High cost, lengthy process, time consuming
• Works well when there is a centralized IS department
responsible for all H/W and resources

Bottom Up
• Data redundancy and inconsistency between
data marts may occur
• Integration requires great planning
• Less cost of H/W and other resources
• Faster pay-back
DW Implementation Approaches
Combined Approach
• Determine degree of planning and design for a global
approach to integrate data marts being built by
bottom-up approach
• Develop base level infrastructure definition for global
DW at business level
• Develop plan to handle data elements needed by
multiple data marts
• Build a common data store to be used by data marts
and global DW
OLAP Concepts
OLAP - Definition

OLAP - On Line Analytical Processing


• OLAP enables analysts, managers, and
executives to gain insight into data
through fast, consistent, interactive
access to a wide variety of possible
views of information.
• OLAP transforms raw data so that it
reflects the real dimensionality of the
enterprise as understood by the user.
Data Warehousing vs. OLAP

OLAP focuses on
• Data transformed into information that
meets the end-user’s analytical requirements
• Data modeling and computation processes
that are consistent
• OLTP and DW provide the source data,
whereas OLAP turns that data into
information.
OLAP - Functionality
• OLAP functionality is characterized by
– Dynamic multi-dimensional analysis of consolidated
enterprise data supporting end user analytical and
navigational activities.
– Calculations and modeling applied across dimensions,
through hierarchies and/or across members
– Trend analysis over sequential time periods
– Slicing subsets for on-screen viewing
– Drill-down to deeper levels of consolidation
– Reach-through to underlying detail data
– Rotation to new dimensional comparisons in the viewing area
OLAP - Functionality

• OLAP is implemented in a multi-user client/server
mode and offers consistently rapid response to
queries, regardless of database size and
complexity.

• OLAP helps the user synthesize enterprise
information through comparative, personalized
viewing, as well as through analysis of historical
and projected data in various "what-if" data model
scenarios.
OLAP
Functional Requirements
• Fast Access and Calculations
– Speed is critical to maintain an analyst’s train
of thought.
– An analyst needs to navigate throughout the
data, which requires aggregations, or roll-ups.
• Powerful Analytical Capabilities
– OLAP involves more complicated calculations
than simple aggregations or roll-ups.
OLAP
Functional Requirements
• Flexibility
– viewing: graphs, charts, rows or columns
– definitions: format of numbers, name
changes
– analysis: Sales analyzes data differently
than Marketing
– interfaces: section wise, report looks
OLAP

Fast and Selective Access to Summarized Data

[Figure: a cube of summarized data with labels Department, Accounting, Actuals, Budget and Time, viewed four ways: Dept. Mgr. View, Accounting Dir. View, Budget Dir. View and Ad Hoc View.]
OLAP - Features

• Dimensions
• The ability to display cubes or dimensions
• Hierarchies
• Formulas and Links
OLAP - Features

• Lower-Dimensional Data sets
– usually thought of as two dimensions (rows
and columns)
• Adding a Third Dimension
– usually thought of as a cube (x, y and z axis)
• Adding an Nth Dimension
– usually NOT thought of . . .
OLAP - Hierarchies

• Aggregation is the Substructure of Hierarchies
– A hierarchy is an attribute of a dimension that
provides a means to group data together.
• Dimensional Hierarchies
– Time dimension is a form of a hierarchy - Year,
periods, quarters, months and weeks
• Single Dimensions can have multiple hierarchies
– Groups of products, groups of customers and so on can
roll up differently within the same dimension.
– Products sold may be sold commercially and residentially.
OLAP - Formula & Links
• Formulas turn data into information
– Aggregation is the simplest type of formula
– Ratios and trends are more difficult formulas
• Define data
– Numeric, non-numeric, data attribute of a dimension,
cell-based, graphic, sound
• Define links to provide data consistency
– Structure link: structure information about the dimension
– Attribute link: maps attribute information to a dimension
– Content link: maps data
• Tie data and links to formulas
Multidimensional Analysis

• Multidimensional data
storage
• Dimensions &
Variables
• Summarized data
• Calculation Support

Multidimensional Analysis
Comparative and Relative Reporting
• How do my actual expenses compare to budgeted expenses?
• How do expenses of Labor compare to expenses in Supplies?
• What was the % expenses growth of Labor relative to Payroll?

Expense    Qtr1  Qtr2  Qtr3
Labor       120   115   123
Supplies     60    75    73
Travel       92    87   106

[Figure: the table above is one face of a cube whose third dimension is Division (A, B, C).]
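Two of the comparative questions can be worked through with the expense values shown on the slide. This is only a sketch: the function names are hypothetical, and the values are taken as one face of the cube without assuming which division it belongs to.

```python
# Expense values from the slide, by expense line and quarter.
expenses = {
    "Labor": {"Qtr1": 120, "Qtr2": 115, "Qtr3": 123},
    "Supplies": {"Qtr1": 60, "Qtr2": 75, "Qtr3": 73},
    "Travel": {"Qtr1": 92, "Qtr2": 87, "Qtr3": 106},
}


def ratio(line_a, line_b, quarter):
    """Comparative reporting: how one expense line compares to another."""
    return expenses[line_a][quarter] / expenses[line_b][quarter]


def growth_pct(line, start, end):
    """Relative reporting: percentage growth of a line between two quarters."""
    first, last = expenses[line][start], expenses[line][end]
    return (last - first) * 100 / first
```

For example, Labor was twice Supplies in Qtr1, and Labor grew 2.5% from Qtr1 to Qtr3.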
Multidimensional Analysis
Exception and Trend Reporting
• Which expenses are 5% or more below budget and
represent more than 2% of total expenses?
• Display all expense lines where the
trend over the last 6 months is negative.
• How has expense mix changed
over the past 52 weeks?

[Figure: expense cube by Division (A, B, C), Expense (Labor, Supplies, Travel) and quarter (Qtr1 to Qtr3).]
Multidimensional Analysis
Modeling, Forecasting, etc.
• What is the lag factor of expenses when adding new
employees?
• What if I add three more employees in Division A Group?
• Project next quarter’s expenses based on the last 12 months.

[Figure: expense cube by Division (A, B, C), Expense (Labor, Supplies, Travel) and quarter (Qtr1 to Qtr3).]
Fundamental Data Model is the same

Relational table (PRODUCT, REGION and MONTH act as keys; in the cube they become dimensions):

RECORD  PRODUCT   REGION   MONTH   SALES
#1      TENTS     EAST     Dec-93  240
#2      CANOES    WEST     Jan-94  250
#3      RACQUETS  CENTRAL  Feb-94  690
#4      TENTS     WEST     Mar-94  425
#5      CANOES    EAST     Apr-94  300
#6      TENTS     WEST     May-94  500
#7      RACQUETS  CENTRAL  Jun-94  125
#8      CANOES    WEST     Jul-94  400
#9      TENTS     EAST     Aug-94  800

[Figure: the same data as a cube with dimensions Region (Central, West, East), Month (Jan, Feb, Mar) and Product (Tents, Canoes, Racquets), with SALES in the cells.]
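The point that the records and the cube hold the same data can be sketched by pivoting the nine records into a nested structure indexed by dimension. The nested-dictionary layout and the function names are assumptions for illustration, not how any particular multidimensional engine stores cells.

```python
# The nine records from the slide: (product, region, month, sales).
records = [
    ("TENTS", "EAST", "Dec-93", 240),
    ("CANOES", "WEST", "Jan-94", 250),
    ("RACQUETS", "CENTRAL", "Feb-94", 690),
    ("TENTS", "WEST", "Mar-94", 425),
    ("CANOES", "EAST", "Apr-94", 300),
    ("TENTS", "WEST", "May-94", 500),
    ("RACQUETS", "CENTRAL", "Jun-94", 125),
    ("CANOES", "WEST", "Jul-94", 400),
    ("TENTS", "EAST", "Aug-94", 800),
]


def build_cube(rows):
    """Index sales by product, then region, then month."""
    cube = {}
    for product, region, month, sales in rows:
        cube.setdefault(product, {}).setdefault(region, {})[month] = sales
    return cube


def total_for_product(cube, product):
    """Sum one product's sales across all regions and months."""
    return sum(
        sales
        for by_month in cube[product].values()
        for sales in by_month.values()
    )


cube = build_cube(records)
```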
Why Special Technology ?

• Offset addressing
• More powerful analytics
• Better performance

[Figure: cube with dimensions District (Boston, New York, Philadelphia), Product (Tents, Canoes, Racquets, Sportswear, Footwear) and Quarter (Q1 to Q4).]
Derived Measures

UNITS * PRICE = SALES

[Figure: three cubes (UNITS, PRICE and SALES) over District (Boston, New York, Philadelphia), Product (Tents, Canoes, Racquets, Sportswear, Footwear) and Quarter (Q1 to Q4). For the Tents row, UNITS of 3, 4, 1 multiplied by PRICE of 2, 2, 3 give SALES of 6, 8, 3.]
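The derived measure UNITS * PRICE = SALES can be sketched as a cell-by-cell calculation. The Tents values (units 3, 4, 1 and prices 2, 2, 3) come from the slide; assigning them to Q1 through Q3 is an assumption, since the figure's layout was lost.

```python
# Tents row of the UNITS and PRICE cubes (quarter labels are assumed).
units = {"Q1": 3, "Q2": 4, "Q3": 1}
price = {"Q1": 2, "Q2": 2, "Q3": 3}


def derive_sales(units_by_cell, price_by_cell):
    """Derived measure: SALES = UNITS * PRICE, computed for every cell."""
    return {
        cell: units_by_cell[cell] * price_by_cell[cell]
        for cell in units_by_cell
    }


sales = derive_sales(units, price)
```

The result matches the 6, 8, 3 shown in the SALES cube.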
Data Storage

[Figure: the cube of District (Boston, New York, Philadelphia), Product (Tents, Canoes, Racquets, Sportswear, Footwear) and Quarter (Q1 to Q4) is stored in data pages, each holding a block of Sales cells.]
Sample of Built-In Functions

Numeric/Time Series Functions
• Average
• Cumulative Sums
• Lag/Lead
• Variance
• Moving Average/Total
• Smallest/Largest
• Standard Deviation
• Total

Financial Functions
• Depreciation
• Growth Rate
• Net Present Value
• Internal Rate of Return

Other Functions
• Forecasting
• Regression

...or create your own user defined functions
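A few of the time-series functions listed above can be sketched in plain Python. These are illustrative user-defined versions with hypothetical names; they are not the built-in functions of any specific OLAP tool.

```python
from itertools import accumulate


def cumulative_sums(series):
    """Running total of the series."""
    return list(accumulate(series))


def lag(series, periods=1):
    """Shift values later by `periods`; leading positions have no value."""
    values = list(series)
    shift = min(periods, len(values))
    return [None] * shift + values[: len(values) - shift]


def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    values = list(series)
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


labor_by_quarter = [120, 115, 123]  # sample values for three quarters
```

For the sample series, the cumulative sums are 120, 235, 358 and the two-quarter moving averages are 117.5 and 119.0.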


OLAP Architecture

MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)

• MOLAP: In this type of OLAP, a cube is aggregated from the
relational data source (in general, from a data warehouse).
When the user generates a report request, the MOLAP tool can
generate the report quickly because all data is already
pre-aggregated within the cube.

• ROLAP: In this type of OLAP, instead of pre-aggregating
everything into a cube, the ROLAP engine essentially acts as a
smart SQL generator.
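The idea of a ROLAP engine as a SQL generator can be sketched as a function that turns a choice of dimensions into a star-join query. The table names, column names and the `generate_sql` function are hypothetical; a real tool would take these mappings from its metadata and validate every identifier.

```python
# Hypothetical metadata: how each dimension maps to the relational tables.
DIMENSION_COLUMNS = {
    "region": "dim_region.region_name",
    "product": "dim_product.product_name",
}
JOINS = {
    "region": "JOIN dim_region ON fact_sales.region_key = dim_region.region_key",
    "product": "JOIN dim_product ON fact_sales.product_key = dim_product.product_key",
}


def generate_sql(dimensions, measure="sales_amount"):
    """Build an aggregate query for the requested dimensions."""
    columns = [DIMENSION_COLUMNS[d] for d in dimensions]
    joins = [JOINS[d] for d in dimensions]
    select_list = ", ".join(columns + [f"SUM(fact_sales.{measure}) AS total"])
    sql = f"SELECT {select_list} FROM fact_sales"
    if joins:
        sql += " " + " ".join(joins)
    if columns:
        sql += " GROUP BY " + ", ".join(columns)
    return sql
```

Nothing is pre-aggregated here: each report request produces a new query that the relational database answers at run time.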
OLAP Architecture

• The ROLAP tool typically comes with a 'Designer' piece, where
the data warehouse administrator can specify the relationship
between the relational tables, as well as how dimensions,
attributes, and hierarchies map to the underlying database
tables.
• DOLAP (Distributed OLAP) is the architecture that deploys
multidimensional analysis capabilities onto the desks of every
knowledge worker in an enterprise.
• Distributed OLAP puts on the desktop PC the full power of an
OLAP database engine, along with built-in access to additional
features found in major OLAP engines running on large server
computers.
OLAP Architecture

WOLAP - Web Based Online Analytical Processing


WOLAP, or Web-enabled OLAP, uses a browser to
deliver OLAP. The combination is powerful, say many business
managers. The delivery capability of the Web, coupled with the
business intelligence tool of OLAP, will allow a broader number
of business analysts to benefit from the software.

Example :
Functional ROLAP Vs MOLAP

Tactical
• Detailed Data
• Simple Calculations
• Analyze past trends

Strategic
• Summary Data
• Complex Calculations
• Predict future trends
Summary
• DW and OLAP are different
– DW focuses on processing accurate and concise data
– OLAP focuses on the end-user's analytical requirements
• Functional requirements for an OLAP tool
– Fast Access and Calculations
– Powerful Analytical Capabilities
– Flexibility
– Multi-End-User Support
• Fundamental requirements for building a multidimensional cube
– Cubes or dimensions
– Displaying cubes or dimensions
– Hierarchies
– Formulas and links
THANK YOU
