Data Warehouse Concepts: Avinash Kanumuru Diya Jana Debyajit Majumder

Data Warehouse Concepts

Avinash Kanumuru
Diya Jana
Debyajit Majumder

© 2009 Wipro Ltd - Confidential


Content

1 An Overview of Data Warehouse

2 Data Warehouse Architecture

3 Data Modeling for Data Warehouse

4 Overview of Data Cleansing

5 Data Extraction, Transformation, Load

Content [contd…]

6 Metadata Management

7 OLAP

8 Data Warehouse Testing

An Overview
Understanding What is a Data Warehouse

What is a Data Warehouse?

Definitions of Data Warehouse

• A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant
collection of data in support of management's decisions.
– W. H. Inmon
• A data warehouse is a repository of data summarized or aggregated in
simplified form from operational systems. End-user-oriented data access
and reporting tools let users get at the data for decision support.
– Babcock
• A data warehouse is a copy of transaction data
specifically structured for query and analysis.
– Ralph Kimball
• Put simply: data warehousing is the collection of data from different systems
to support business decisions, analysis and reporting.

Data Warehouse def. by WH Inmon

A common way of introducing data warehousing is to refer to the characteristics of a data
warehouse as set forth by William Inmon:
Subject Oriented
• Data that gives information about a particular subject instead of about a company's
ongoing operations.
Integrated
• Data that is gathered into the data warehouse from a variety of sources and merged into
a coherent whole.
Nonvolatile
• Data is stable in a data warehouse. More data is added, but data is never removed. This
enables management to gain a consistent picture of the business.
Time Variant
• In order to discover trends in business, analysts need large amounts of data. This is very
much in contrast to online transaction processing (OLTP) systems, where performance
requirements demand that historical data be moved to an archive. All data in the data
warehouse is identified with a particular time period.

Data Warehouse Architecture
What makes a Data Warehouse

Components of a Warehouse

• Source Tables: Real-time, volatile data held in relational databases for
transaction processing (OLTP). Sources can be any relational databases or flat files.
• ETL Tools: Extract, cleanse, transform (aggregate, join) and load the data from
sources to target.
• Maintenance and Administration Tools: Authorize and monitor access to the data,
set up users, and schedule jobs to run during off-peak periods.
• Modeling Tools: Used to design the data warehouse for high performance using
dimensional data modeling techniques, and to map source files to target files.
• Databases: Target databases and data marts, which are part of the data warehouse.
These are structured for analysis and reporting purposes.
• End-user tools for analysis and reporting: Produce reports and analyze the data from
target tables. Different types of querying, data mining and OLAP tools are used for this
purpose.

Data Warehouse Architecture

This is a basic design, where there are source
files, which are loaded to a warehouse, and
users query the data for different purposes.

This design has a staging area, where the data
is loaded and tested after cleansing and
transformation. It is later loaded directly to the
target database/warehouse, which is
divided into data marts that can be accessed
by different users for their reporting and
analysis purposes.

Data Modeling
Effective way of using a Data Warehouse

Data Modeling

The E-R data model is commonly used in OLTP; in OLAP, the dimensional data
model is commonly used.
E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics,
such as employee, book or student.
Relationship: Relates entities to other entities.

• Different Perspectives of Data Modeling
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

• Types of Dimensional Data Models (most commonly used):
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms
commonly used in this type of modeling:
• Dimension: A category of information. For example, the time dimension.
• Attribute: A unique level within a dimension. For example, Month is an attribute
in the Time dimension.
• Hierarchy: The specification of levels that represents the relationship between
different attributes within a dimension. For example, one possible hierarchy in
the Time dimension is Year → Quarter → Month → Day.
• Fact Table: A table that contains the measures of interest.
• Lookup Table: Provides detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of
the quarters available in the data warehouse.
• Surrogate Keys: Used to avoid data integrity problems. They are
helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one
or more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key
columns in the lookup tables.
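
The role of surrogate keys in a slowly changing dimension can be sketched as below. This is only an illustration under stated assumptions: the CustomerDim class, its field names, and the choice to add a new row whenever an attribute changes (often called a Type 2 change, a term the slides do not define) are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class CustomerDim:
    """Hypothetical customer lookup table keyed by a surrogate key."""
    rows: list = field(default_factory=list)      # every version ever loaded
    current: dict = field(default_factory=dict)   # natural key -> index of current row
    _next_key: count = field(default_factory=lambda: count(1))

    def upsert(self, cust_id, city):
        """Return the surrogate key for (cust_id, city), adding a row if the city changed."""
        idx = self.current.get(cust_id)
        if idx is not None and self.rows[idx]["city"] == city:
            return self.rows[idx]["sk"]           # unchanged: reuse the existing key
        sk = next(self._next_key)                 # new or changed: issue a fresh key
        self.rows.append({"sk": sk, "cust_id": cust_id, "city": city})
        self.current[cust_id] = len(self.rows) - 1
        return sk


dim = CustomerDim()
k1 = dim.upsert(53, "sfo")   # first load of customer 53
k2 = dim.upsert(53, "sfo")   # nothing changed
k3 = dim.upsert(53, "la")    # customer moved: history is kept, new key issued
```

Fact rows loaded before the move keep pointing at the first key, so older facts still join to the old city while new facts join to the new one.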
Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
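
The star schema above can be tried out with a small in-memory database. The sketch below uses Python's built-in sqlite3 module with the table and column names from the slide; the column types and the query are assumptions chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES (53,'joe','10 main','sfo'), (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('105','3/8/97',111,'p1','c3',5,50);
""")

# Total sales amount per store city: a single join from the fact table to one dimension.
sales_by_city = con.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale AS f JOIN store AS s ON s.storeId = f.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(sales_by_city)
```

Every dimension is one join away from the fact table, which is what keeps typical reporting queries short.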

Snowflake Schema

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Fact Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly
found in dimensional data warehouses and data
marts where speed of data retrieval is more
important than the efficiency of data manipulations.
As such, the tables in these schemas are not
normalized much, and are frequently designed at a
level of normalization short of third normal form.
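
A snowflake schema chains lookup tables, so some attributes are more than one join away. The sketch below loads the slide's store, city, region and sType tables into an in-memory sqlite3 database; the column types and the query are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY, cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId), mgr TEXT);
INSERT INTO region VALUES ('north','cold region'), ('south','warm region');
INSERT INTO city   VALUES ('sfo','1M','north'), ('la','5M','south');
INSERT INTO sType  VALUES ('t1','small','downtown'), ('t2','large','suburbs');
INSERT INTO store  VALUES ('s5','sfo','t1','joe'), ('s7','sfo','t2','fred'),
                          ('s9','la','t1','nancy');
""")

# Reaching the region name takes two hops: store -> city -> region.
stores_with_region = con.execute("""
    SELECT st.storeId, st.mgr, r.name
    FROM store AS st
    JOIN city   AS c ON c.cityId = st.cityId
    JOIN region AS r ON r.regId  = c.regId
    ORDER BY st.storeId
""").fetchall()
print(stores_with_region)
```

Compared with a star schema, the extra join is the price paid for not repeating the region name on every city row.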

Overview of Data Cleansing

The Need For Data Quality

• Difficulty in decision making
• Time delays in operation
• Organizational mistrust
• Data ownership conflicts
• Customer attrition
• Costs associated with
– error detection
– error rework
– customer service
– fixing customer problems

Six Steps To Data Quality

Step 1: Understand Information Flow In Organization
• Identify authoritative data sources
• Interview employees & customers

Step 2: Identify Potential Problem Areas & Assess Impact
• Data entry points
• Cost of bad data

Step 3: Measure Quality Of Data
• Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect
values

Step 4: Clean & Load Data
• Use data cleansing tools to clean data at the source
• Load only clean data into the data warehouse

Step 5: Continuous Monitoring
• Schedule periodic cleansing of source data

Step 6: Identify Areas of Improvement
• Identify & correct cause of defects
• Refine data capture mechanisms at source
• Educate users on importance of DQ
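
The "measure quality of data" step can be illustrated with a very small profiling function. This is a sketch under assumptions, not a substitute for the rule discovery tools the slide refers to: the profile function, its parameters and the sample records are hypothetical.

```python
from collections import Counter


def profile(records, key, required, valid_range):
    """Count basic data quality defects in a list of dict records.

    key: field expected to be unique; required: fields that must be non-empty;
    valid_range: {field: (low, high)} bounds for numeric fields.
    """
    report = Counter()
    seen = set()
    for rec in records:
        if rec.get(key) in seen:
            report["duplicate"] += 1
        seen.add(rec.get(key))
        for name in required:
            if rec.get(name) in (None, ""):
                report["missing"] += 1
        for name, (low, high) in valid_range.items():
            value = rec.get(name)
            if value is not None and not (low <= value <= high):
                report["out_of_range"] += 1
    return dict(report)


# Hypothetical sample rows, invented for illustration.
customers = [
    {"custId": 1, "name": "joe", "age": 34},
    {"custId": 2, "name": "", "age": 41},        # missing name
    {"custId": 2, "name": "fred", "age": 230},   # duplicate id, impossible age
]
quality = profile(customers, "custId", ["name"], {"age": (0, 120)})
print(quality)
```

Counts like these give a baseline that can be tracked over time as part of continuous monitoring.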

Data Quality Solution
Customized Programs
• Strengths
– Address specific needs
– No bulky one-time investment
• Limitations
– Large numbers of custom programs in different environments are difficult to
manage
– Minor alterations demand coding effort
Data Quality Assessment Tools
• Strengths
– Provide automated assessment
• Limitations
– No measure of data accuracy

Data Quality Solution

Business Rule Discovery Tools
• Strengths
– Detect correlations in data values
– Can detect patterns of behavior that indicate fraud
• Limitations
– Not all variables can be discovered
– Some discovered rules might not be pertinent
– There may be performance problems with large files or with many
fields

Data Reengineering & Cleansing Tools
• Strengths
– Usually are integrated packages with cleansing features as add-ons
• Limitations
– Error prevention at source is usually absent
– The ETL tools have limited cleansing facilities

Tools In The Market
• Business Rule Discovery Tools
– Integrity Data Reengineering Tool from Vality Technology
– Trillium Software System from Harte-Hanks Data Technologies
– Migration Architect from DB Star
• Data Reengineering & Cleansing Tools
– Carleton Pureview from Oracle
– ETI-Extract from Evolutionary Technologies
– PowerMart from Informatica Corp
– Sagent Data Mart from Sagent Technology
• Data Quality Assessment Tools
– Migration Architect, Evoke Axio from Evoke Software
– Wizrule from Wizsoft
• Name & Address Cleansing Tools
– Centrus Suite from Sagent
– I.d.centric from First Logic

Data Extraction, Transformation, Load

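ETL Architecture

[Figure: ETL architecture. Data collection: visitors' web browsers reach the
web server over the Internet; inputs are web server logs and e-commerce
transaction data, external data (demographics, household, webographics, income)
and other OLTP systems. Scheduled extraction moves the data into flat files and
an RDBMS. In the staging area the data is cleaned, transformed, matched and
merged, with a metadata repository alongside. Scheduled loading moves it into
the enterprise data warehouse. Stages, left to right: Data Collection, Data
Extraction, Data Transformation, Data Loading, Data Storage & Integration.]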

ETL Architecture

Data Extraction:
• Rummages through a file or database
• Uses some criteria for selection
• Identifies qualified data
• Transports the data over onto another file or database

Data Extraction – Cleanup:
• Restructuring of records or fields
• Removal of operational-only data
• Supply of missing field values
• Data integrity checks
• Data consistency and range checks, etc.

Data Transformation:
• Integrating dissimilar data types
• Changing codes
• Adding a time attribute
• Summarizing data
• Calculating derived values
• Renormalizing data

Data Loading:
• Initial and incremental loading
• Updating of metadata
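
The extraction, cleanup, transformation and loading steps listed above can be sketched end to end in a few functions. Everything here is a hypothetical illustration: the sample rows, the field names, the default of 1 for a missing quantity and the non-negative amount check are assumptions, and a plain list stands in for the target database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical source rows, invented for illustration.
SOURCE = [
    {"store": "c1", "qty": 1, "amt": 12, "status": "posted"},
    {"store": "c1", "qty": 2, "amt": 11, "status": "posted"},
    {"store": "c3", "qty": None, "amt": 50, "status": "posted"},  # missing qty
    {"store": "c3", "qty": 1, "amt": -5, "status": "posted"},     # fails range check
    {"store": "c2", "qty": 1, "amt": 7, "status": "draft"},       # not selected
]
STORE_CITY = {"c1": "nyc", "c2": "sfo", "c3": "la"}  # code table used for "changing codes"


def extract(rows):
    """Select qualifying rows by a simple criterion and copy them out of the source."""
    return [dict(r) for r in rows if r["status"] == "posted"]


def clean(rows):
    """Remove operational-only fields, supply missing values, apply a range check."""
    good, rejected = [], []
    for r in rows:
        del r["status"]                  # operational-only field
        if r["qty"] is None:
            r["qty"] = 1                 # assumed default for a missing quantity
        (good if r["amt"] >= 0 else rejected).append(r)
    return good, rejected


def transform(rows, load_date):
    """Change store codes to city names, summarize amounts, add a time attribute."""
    totals = defaultdict(int)
    for r in rows:
        totals[STORE_CITY[r["store"]]] += r["amt"]
    return [{"city": c, "total_amt": t, "load_date": load_date}
            for c, t in sorted(totals.items())]


def load(target, rows):
    """Append the rows to the target table and report how many were written."""
    target.extend(rows)
    return len(rows)


warehouse = []
good, rejected = clean(extract(SOURCE))
loaded = load(warehouse, transform(good, date(1997, 1, 7)))
```

Keeping the rejected rows in their own list, rather than silently dropping them, makes it possible to reconcile the load against the source later.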

Why ETL?

• Companies have valuable data lying around throughout their networks that
needs to be moved from one place to another.

• The data lies in all sorts of heterogeneous systems, and therefore in all sorts
of formats.

• To solve the problem, companies use extract, transform and load (ETL)
software.

• The data used in ETL processes can come from any source:
a mainframe application, an ERP application, a CRM tool, a flat file, or
an Excel spreadsheet.

Major components involved in ETL Processing

• Design manager
Lets developers define source-to-target mappings, transformations, process flows, and jobs.
• Metadata management
Provides a repository to define, document, and manage information about the ETL design and runtime
processes.
• Extract
The process of reading data from a database.
• Transform
The process of converting the extracted data into the form required by the target database.
• Load
The process of writing the data into the target database.
• Transport services
ETL tools use network and file protocols to move data between
source and target systems, and in-memory protocols to move data
between ETL run-time components.
• Administration and operation
ETL utilities let administrators schedule, run and monitor ETL jobs, log
all events, manage errors, recover from failures, and reconcile outputs
with source systems.

ETL Tools

• Provide a facility to specify a large number of transformation rules with a GUI
• Generate programs to transform data
• Handle multiple data sources
• Handle data redundancy
• Generate metadata as output
• Most tools exploit parallelism by running on multiple low-cost servers in
a multi-threaded environment

ETL Tools - Second-Generation

• PowerCenter/PowerMart from Informatica
• Data Mart Solution from Sagent Technology
• DataStage from Ascential

Metadata Management

What Is Metadata?

Metadata is information...

• That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
• About the data being captured and loaded into the warehouse
• Documented in IT tools that improve both business and technical understanding of data
and data-related processes

Importance Of Metadata
Locating information
• How much time is spent looking for information?
• How often is the information found?
• What poor decisions were made based on incomplete information?
• How much money was lost or earned as a result?
Interpreting information
• How many times have businesses needed to rework or recall products?
• What impact does it have on the bottom line?
• How many mistakes were due to misinterpretation of existing documentation?
• How much interpretation results from too much metadata?
• How much time is spent trying to determine if any of the metadata is accurate?
Integrating information
• How do various data perspectives connect together?
• How much time is spent trying to figure that out?
• How much do inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management

• Provide a simple catalogue of business metadata descriptions and views
• Document/manage metadata descriptions from an integrated development
environment
• Enable DW users to identify and invoke pre-built queries against the data stores
• Design and enhance new data models and schemas for the data warehouse
• Capture data transformation rules between the operational and data
warehousing databases
• Provide change impact analysis and update across these technologies

Consumers of Metadata

• Technical Users
– Warehouse administrator
– Application developer
• Business Users (business metadata)
– Meanings
– Definitions
– Business rules
• Software Tools
– Used in DW life-cycle development
– Metadata requirements for each tool must be identified
– The tool-specific metadata should be analysed for inclusion in the enterprise
metadata repository
– Previously captured metadata should be electronically transferred from the
enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third Party Bridging Tools

• Oracle Exchange
– Technology of choice for a long list of repository, enterprise and
workgroup vendors
• Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata
• Ardent Software / Dovetail Software - Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability
– Ardent is focusing on its own engagements, not selling it as an independent
product
• Informix's Metadata Plug-ins
– Available with Ardent DataStage version 3.6.2 free of cost for ERwin,
Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools
Metadata Repositories
• IBM, Oracle and Microsoft to offer free or near-free basic
repository services
• Enable organisations to reuse metadata across technologies
• Integrate DB design, data transformation and BI tools from
different vendors
• Multi-tool vendors are taking a bridged or federated rather than
integrated approach to sharing metadata
• Both IBM and Oracle have multiple repositories for different lines
of products (e.g., one for AD and one for DW), with bridges
between them

Trends in the Metadata Management Tools

Metadata Interchange Standards

• CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard
– Addresses only a limited subset of metadata artifacts
• OMG (Object Management Group) - CWM
– XML addresses context and data meaning, not presentation
– Can enable exchange over the web employing industry standards for
storing and sharing programming data
– Will allow sharing of UML and MOF objects between various development
tools and repositories
• MDC (Metadata Coalition)
– Based on XML/UML standards
– Promoted by Microsoft along with 20 partners including Object
Management Group (OMG), Oracle Carleton Group, CA-PLATINUM
Technology (Founding Member), Viasoft

OLAP

Agenda
• OLAP Definition
• Distinction between OLTP and OLAP
• MDDB Concepts
• Implementation Techniques
• Architectures
• Features
• Representative Tools

OLAP: On-Line Analytical Processing

• OLAP can be defined as a technology that allows
users to view aggregate data across measurements (like
Maturity Amount, Interest Rate, etc.) along with a set of
related parameters called dimensions (like Product,
Organization, Customer, etc.)
– Used interchangeably with ‘BI’
– Multidimensional view of data is the foundation of OLAP
– Users: analysts, decision makers

Distinction between OLTP and OLAP

                     OLTP System                        OLAP System

Source of data       Operational data; OLTPs are        Consolidation data; OLAP data
                     the original source of the data    comes from the various OLTP
                                                        databases

Purpose of data      To control and run                 Decision support
                     fundamental business tasks

What the data        A snapshot of ongoing              Multi-dimensional views of
reveals              business processes                 various kinds of business
                                                        activities

Inserts and updates  Short and fast inserts and         Periodic long-running
                     updates initiated by end           batch jobs refresh the data
                     users
MDDB Concepts

A multidimensional database is a computer software system
designed to allow for efficient and convenient storage and
retrieval of data that is
• intimately related and
• stored, viewed and analyzed from different perspectives
(dimensions).

A hypercube represents a collection of multidimensional data.
• The edges of the cube are called dimensions
• Individual items within each dimension are called members

RDBMS v/s MDDB: Increased Complexity...

Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...           ...    ...      ...

27 rows x 4 columns = 108 cells

MDDB

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan),
COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr).]

3 x 3 x 3 = 27 cells
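
The cell counts above can be reproduced by loading relational rows into a three-dimensional array. This is a minimal sketch using nested Python lists rather than a real MDDB engine; only the first four rows of the slide's table are loaded, and the variable names are assumptions.

```python
MODELS = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
COLORS = ["BLUE", "RED", "WHITE"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# The first rows of the relational table (MODEL, COLOR, DEALER, VOL.).
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
]

# cube[m][c][d] holds the volume; member names are stored once, on the edges.
cube = [[[None] * len(DEALERS) for _ in COLORS] for _ in MODELS]
for model, color, dealer, vol in rows:
    cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)] = vol

relational_cells = 27 * 4   # 27 rows x 4 columns once the table is complete
cube_cells = len(MODELS) * len(COLORS) * len(DEALERS)
print(relational_cells, cube_cells)
```

The relational form repeats the member names on every row, while the array stores each name once and uses position to identify a cell.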

Benefits of MDDB over RDBMS
• Ease of Data Presentation & Navigation
– A great deal of information is gleaned immediately upon direct inspection of
the array
– The user is able to view data along presorted dimensions, with data arranged in an
inherently more organized and accessible fashion than the one offered by the
relational table
• Storage Space
– Very low space consumption compared to a relational DB
• Performance
– Gives much better performance
– A relational DB may give comparable results only through database tuning
(indexing, keys, etc.), which may not be possible for ad-hoc queries
• Ease of Maintenance
– No overhead, as data is stored in the same way it is viewed. In a relational DB,
indexes, sophisticated joins, etc. are used, which require considerable storage
and maintenance

Issues with MDDB

• Sparsity
– Input data in applications is typically sparse
– Increases with increased dimensions
• Data Explosion
– Due to sparsity
– Due to summarization
• Performance
– Doesn't perform better than RDBMS at high data
volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell
is left behind.

Sales Volumes

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

Employee Age

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age as a two-dimensional array with LAST NAME on one axis
(Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) and EMPLOYEE # on
the other (31, 41, 23, 01, 14, 54, 03, 12, 33). Each employee fills one cell
with an age; the remaining cells are blank.]
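
How sparse the Employee Age array is can be worked out directly from the table. The sketch below assumes the array is laid out as LAST NAME by EMPLOYEE #, as the figure's axis labels suggest; the variable names are hypothetical.

```python
# (last name, employee number, age) rows from the Employee Age table.
employees = [("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
             ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
             ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19)]

names = [name for name, _, _ in employees]     # members of the LAST NAME dimension
numbers = [emp for _, emp, _ in employees]     # members of the EMPLOYEE # dimension

# Array keyed by (last name, employee number); only real combinations hold an age.
array = {(name, emp): age for name, emp, age in employees}

total_cells = len(names) * len(numbers)
filled_cells = len(array)
blank_cells = total_cells - filled_cells
density = filled_cells / total_cells
print(total_cells, filled_cells, blank_cells)
```

Because each last name pairs with exactly one employee number, nearly all cells stay blank, whereas the Sales Volumes array has a value for every model and color combination.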

OLAP Features

• Calculations applied across dimensions, through
hierarchies and/or across members
• Trend analysis over sequential time periods
• What-if scenarios
• Slicing / dicing subsets for on-screen viewing
• Rotation to new dimensional comparisons in the
viewing area
• Drill-down/up along the hierarchy
• Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

• Complex queries and sorts in a relational environment are
translated to a simple rotation.

Sales Volumes, View #1 (MODEL by COLOR)

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

Sales Volumes, View #2 (COLOR by MODEL, after a 90 degree rotation)

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.


Features of OLAP - Rotation

[Figure: the Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan),
COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr), shown in six
orientations, View #1 to View #6, each reached by a 90 degree rotation.]

A 3-dimensional array has 6 views.
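
The view counts follow from the number of ways the axes can be ordered: 2 orderings for two dimensions and 6 for three. The sketch below counts them with itertools.permutations and shows that rotating a two-dimensional array amounts to a transpose; the sample values are the MODEL by COLOR sales volumes from the rotation slide.

```python
from itertools import permutations

dims = ("MODEL", "COLOR", "DEALERSHIP")
views_3d = list(permutations(dims))        # each ordering of the axes is one view
views_2d = list(permutations(dims[:2]))

# Rotating a two-dimensional array swaps rows and columns (a transpose).
sales = [[6, 5, 4],    # Mini Van: Blue, Red, White
         [3, 5, 5],    # Coupe
         [4, 3, 2]]    # Sedan
rotated = [list(col) for col in zip(*sales)]   # rows are now Blue, Red, White
print(len(views_2d), len(views_3d), rotated)
```

No data is recomputed in either case; only the order in which the axes are presented changes.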


Features of OLAP - Slicing / Filtering

• MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced to MODEL (Mini Van, Coupe),
COLOR (Normal Blue, Metal Blue) and DEALERSHIP (Carr, Clyde).]
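
Slicing can be sketched as filtering the cells of a cube by the members wanted on each dimension. The example is an assumption-laden illustration: the cube is a plain dictionary, slice_cube is a hypothetical helper, and the cells are taken from the earlier sales volume table, which has a single BLUE member instead of the Normal Blue and Metal Blue members in the figure.

```python
# A few cells of the sales cube, keyed by (model, color, dealership).
cube = {
    ("MINI VAN", "BLUE", "Clyde"): 6, ("MINI VAN", "BLUE", "Gleason"): 3,
    ("MINI VAN", "BLUE", "Carr"): 2, ("MINI VAN", "RED", "Clyde"): 5,
    ("SPORTS COUPE", "BLUE", "Clyde"): 3, ("SPORTS COUPE", "BLUE", "Carr"): 3,
    ("SEDAN", "BLUE", "Clyde"): 4,
}


def slice_cube(cube, models=None, colors=None, dealers=None):
    """Keep only cells whose members are in the requested sets (None keeps all)."""
    wanted = (models, colors, dealers)
    return {
        key: vol for key, vol in cube.items()
        if all(w is None or member in w for member, w in zip(key, wanted))
    }


blue_view = slice_cube(cube, models={"MINI VAN", "SPORTS COUPE"},
                       colors={"BLUE"}, dealers={"Carr", "Clyde"})
print(blue_view)
```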
Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION      Midwest
DISTRICT    Chicago, St. Louis, Gary
DEALERSHIP  Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales at Region/District/Dealership level

• Moving up and moving down in a hierarchy is referred to
as “drill-up” / “roll-up” and “drill-down”.
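
Roll-up and drill-down along the organization hierarchy can be sketched with two small functions. The parent links and sales figures below are assumptions: the slide names the members of each level but does not say which dealership belongs to which district, and it gives no values.

```python
from collections import defaultdict

# Assumed parent links; the slide lists the members but not these assignments.
DISTRICT_OF = {"Clyde": "Chicago", "Gleason": "Chicago", "Carr": "St. Louis",
               "Levi": "St. Louis", "Lucas": "Gary", "Bolton": "Gary"}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest level (dealership).
sales = {"Clyde": 6, "Gleason": 3, "Carr": 2, "Levi": 4, "Lucas": 1, "Bolton": 5}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(values, parent_of, parent):
    """Show the children of one member together with their own values."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(sales, DISTRICT_OF)      # dealership -> district
by_region = roll_up(by_district, REGION_OF)    # district -> region
chicago = drill_down(sales, DISTRICT_OF, "Chicago")
print(by_region, by_district, chicago)
```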

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) and Year (1999, 2000).]

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, by quarter
(1st Qtr to 4th Qtr).]

• Drill-down from Year to Quarter

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for the 1st Qtr of Year 1999,
by month (January, February, March).]

• Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

• MOLAP - Multidimensional OLAP
– Multidimensional databases for the database and application logic layer
• ROLAP - Relational OLAP
– Accesses data stored in a relational data warehouse for OLAP analysis
– Database and application logic provided as separate layers
• HOLAP - Hybrid OLAP
– OLAP server routes queries first to the MDDB, then to the RDBMS, and
the result is processed on-the-fly in the server
• DOLAP - Desktop OLAP
– Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube feeds the OLAP calculation engine,
which serves web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

• Powerful analytical capabilities (e.g.,
financial, forecasting, statistical)
• Aggregation and calculation capabilities
• Read/write analytic applications
• Specialized data structures for
– Maximum query performance
– Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture with MDDB-relational mapping. The relational DW is
queried through SQL by the OLAP calculation engine, which serves web browsers,
OLAP tools and OLAP applications.]

ROLAP - Features

• Three-tier hardware/software architecture:
– GUI on client; multidimensional processing on mid-tier
server; target database on database server
– Processing split between mid-tier and database
servers
• Ad hoc query capabilities to very large databases
• DW integration
• Data scalability

HOLAP - Combination of RDBMS and MDDB

[Figure: HOLAP architecture. An OLAP cube and a relational DW (accessed through
SQL) both feed the OLAP calculation engine, which serves any client: web
browsers, OLAP tools and OLAP applications.]

HOLAP - Features

• RDBMS used for detailed data stored in large
databases
• MDDB used for fast, read/write OLAP analysis and
calculations
• Scalability of RDBMS and MDDB performance
• Calculation engine provides full analysis features
• Source of data transparent to end user
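
The HOLAP idea of answering from the MDDB when a summary exists and falling back to the RDBMS otherwise can be sketched as a small router. Everything here is hypothetical: the HolapServer class, its total method and the sample data are illustrations, a dictionary stands in for the MDDB and a list of tuples for the RDBMS. The source label in the return value is there only to make the routing visible; a real server would hide it from the end user.

```python
class HolapServer:
    """Hypothetical query router: try the summary cube first, then the detail rows."""

    def __init__(self, cube, detail_rows):
        self.cube = cube                # stands in for the MDDB: precomputed summaries
        self.detail_rows = detail_rows  # stands in for the RDBMS: transaction-level data

    def total(self, model, color=None):
        """Return (volume, source) for a model, optionally restricted to one color."""
        if color is None and model in self.cube:
            return self.cube[model], "MDDB"
        # Not summarized in the cube: compute on the fly from the detail data.
        amount = sum(vol for m, c, vol in self.detail_rows
                     if m == model and (color is None or c == color))
        return amount, "RDBMS"


detail = [("MINI VAN", "BLUE", 6), ("MINI VAN", "RED", 5), ("SEDAN", "BLUE", 4)]
server = HolapServer(cube={"MINI VAN": 11, "SEDAN": 4}, detail_rows=detail)
summary = server.total("MINI VAN")          # answered from the cube
drill = server.total("MINI VAN", "RED")     # falls through to the detail rows
print(summary, drill)
```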

Architecture Comparison

Definition
– MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
– ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
– HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
– MOLAP: With good design, 3 to 10 times
– ROLAP: No sparsity
– HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
– MOLAP: High (may go beyond control; estimation is very important)
– ROLAP: To the necessary extent
– HOLAP: To the necessary extent

Query execution speed
– MOLAP: Fast (depends upon the size of the MDDB)
– ROLAP: Slow
– HOLAP: Optimum. If the data is fetched from the RDBMS it is like ROLAP,
otherwise like MOLAP

Cost
– MOLAP: Medium: MDDB server + large disk space cost
– ROLAP: Low: only RDBMS + disk space cost
– HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
– MOLAP: Small transactional data + complex model + frequent summary analysis
– ROLAP: Very large transactional data that needs to be viewed / sorted
– HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

• Oracle Express Products
• Hyperion Essbase
• Cognos - PowerPlay
• Seagate - Holos
• SAS
• MicroStrategy - DSS Agent
• Informix MetaCube
• Brio Query
• Business Objects / Web Intelligence

Sample OLAP Applications

• Sales Analysis
• Financial Analysis
• Profitability Analysis
• Performance Analysis
• Risk Management
• Profiling & Segmentation
• Scorecard Application
• NPA Management
• Strategic Planning
• Customer Relationship Management (CRM)

Data Warehouse Testing

Data Warehouse Testing Overview
• There is an exponentially increasing cost associated with finding
software defects later in the development lifecycle. In data
warehousing, this is compounded by the additional business
costs of using incorrect data to make critical business decisions.

• The methodology required for testing a data warehouse is different
from that for testing a typical transaction system.

Difference In Testing Data warehouse and
Transaction System

Data warehouse testing is different on the following counts:
– User-triggered vs. system-triggered
– Volume of test data
– Possible scenarios / test cases
– Programming for testing challenge

Difference In Testing Data warehouse and
Transaction System…
• User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. Most of
the production/source system testing covers the processing of individual
transactions, which are driven by some input from the users
(application form, servicing request, etc.). There are very few test
cycles that cover system-triggered scenarios (such as billing or
valuation).

Difference In Testing Data warehouse and
Transaction System…
• Volume of test data
The test data in a transaction system is a very small sample of the
overall production data. A data warehouse typically has large test
data, as one tries to fill up the maximum possible combinations of
dimensions and facts.
• Possible scenarios / test cases
In a data warehouse, the permutations and combinations one
can possibly test are virtually unlimited, because the core objective of
a data warehouse is to allow all possible views of data.

Difference In Testing Data warehouse and
Transaction System…

• Programming for testing challenge

In transaction systems, users/business analysts typically test
the output of the system. In a data warehouse, most of the
data quality testing and ETL testing is done at the
back end by running separate stand-alone scripts. These scripts
compare the data before transformation with the data after transformation.
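
A stand-alone comparison script of the kind described above might check row counts and control totals. The sketch below is one possible shape under assumptions: the reconcile function, the two checks it performs and the sample rows are hypothetical, and real scripts would read from the source and target databases instead of in-memory lists.

```python
def reconcile(source_rows, target_rows, rejected_rows, amount_field):
    """Compare pre-transformation and post-transformation data with simple checks."""
    return {
        # Every source row must be either loaded or rejected.
        "row_count": len(source_rows) == len(target_rows) + len(rejected_rows),
        # Loaded amounts must equal source amounts minus rejected amounts.
        "amount_total": (
            sum(r[amount_field] for r in source_rows)
            - sum(r[amount_field] for r in rejected_rows)
            == sum(r[amount_field] for r in target_rows)
        ),
    }


# Hypothetical data, invented for illustration.
source = [{"id": 1, "amt": 12}, {"id": 2, "amt": 11}, {"id": 3, "amt": -5}]
target = [{"id": 1, "amt": 12}, {"id": 2, "amt": 11}]
rejected = [{"id": 3, "amt": -5}]

result = reconcile(source, target, rejected, "amt")
broken = reconcile(source, target[:1], rejected, "amt")   # one row lost in the load
print(result, broken)
```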

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:
• 'Back-end' testing, where the source system data is compared to the end-result data
in the loaded area
• 'Front-end' testing, where the user checks the data by comparing their MIS with the
data displayed by end-user tools such as OLAP
Testing phases consist of:
• Requirements testing
• Unit testing
• Integration testing
• Performance testing
• Acceptance testing

Requirements testing

The main aim of requirements testing is to check the stated
requirements for completeness.
Requirements can be tested on the following factors:
• Are the requirements complete?
• Are the requirements singular?
• Are the requirements unambiguous?
• Are the requirements developable?
• Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL
procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
• Whether the ETLs access and pick up the right data from the right source.
• Whether all data transformations are correct according to the business rules and
the data warehouse is correctly populated with the transformed data.
• Testing the rejected records that don't fulfil the transformation rules.
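
The checks above can be sketched as a small unit test of one transformation. This is an illustration only: the business rule (quantity and amount must be present and positive) and the field names are assumptions, not rules taken from any particular ETL tool.

```python
def transform(rows):
    """Apply a hypothetical business rule; split rows into loaded and rejected."""
    loaded, rejected = [], []
    for row in rows:
        qty, amt = row.get("qty"), row.get("amt")
        # Assumed rule: quantity and amount must be present and positive.
        if qty is None or amt is None or qty <= 0 or amt <= 0:
            rejected.append({**row, "reason": "qty/amt missing or not positive"})
            continue
        # Derived value that a unit test can verify against the rule.
        loaded.append({**row, "unit_price": amt / qty})
    return loaded, rejected

source_rows = [
    {"order_id": "o100", "qty": 1, "amt": 12},
    {"order_id": "o101", "qty": 0, "amt": 7},
    {"order_id": "o102", "qty": 2, "amt": 11},
    {"order_id": "o103", "qty": 3, "amt": None},
]
loaded, rejected = transform(source_rows)

# Every source row must end up either loaded or rejected, never lost.
assert len(loaded) + len(rejected) == len(source_rows)
```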

Unit Testing (contd.)

Unit testing the report data:

• Verify report data with source:
Data in a data warehouse is stored at an aggregate level compared to the
source systems. The QA team should verify the granular data stored in the data
warehouse against the available source data.
• Field-level data verification:
The QA team must understand the linkages for the fields displayed in the report,
trace them back and compare them with the source systems.
• Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing will involve the following:
• Sequence of ETL jobs in a batch.
• Initial loading of records into the data warehouse.
• Incremental loading of records at a later date to verify the newly
inserted or updated data.
• Testing the rejected records that don't fulfil the transformation rules.
• Error log generation.
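
The initial and incremental load steps listed above can be exercised with a sketch like the one below. The dim_customer table and the update-then-insert strategy are assumptions chosen for illustration; actual ETL tools may apply changes differently.

```python
import sqlite3

def load(conn, rows):
    """Update rows whose key already exists and insert the rest."""
    for cust_id, city in rows:
        updated = conn.execute(
            "UPDATE dim_customer SET city = ? WHERE cust_id = ?", (city, cust_id)
        ).rowcount
        if updated == 0:
            conn.execute(
                "INSERT INTO dim_customer (cust_id, city) VALUES (?, ?)",
                (cust_id, city),
            )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY, city TEXT)")

load(conn, [(53, "sfo"), (81, "sfo")])   # initial load
load(conn, [(81, "la"), (111, "la")])    # incremental load: one update, one insert

final_rows = conn.execute(
    "SELECT cust_id, city FROM dim_customer ORDER BY cust_id"
).fetchall()
print(final_rows)
```

An integration test would then assert that the updated key carries the new value and that no key is duplicated.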

Performance Testing

Performance Testing should check for :

 ETL processes completing within time window.

 Monitoring and measuring the data quality issues.

 Refresh times for standard/complex reports.

Acceptance testing

Here the system is tested with full functionality and is expected to
function as in production. At the end of UAT, the system should be
acceptable to the client for use in terms of ETL process integrity,
business functionality and reporting.

Data Warehouse Architecture
What makes a Data Warehouse



Components of Warehouse

• Source Tables: Real-time, volatile data held in relational databases for
transaction processing (OLTP). Sources can be any relational databases or flat files.
• ETL Tools: Extract, cleanse, transform (aggregate, join) and load the data from
the sources to the target.
• Maintenance and Administration Tools: Authorize and monitor access to the data,
set up users, and schedule jobs to run during off-peak periods.
• Modeling Tools: Used to design the data warehouse for high performance with the
dimensional data modeling technique, and to map the source and target files.
• Databases: Target databases and data marts, which are part of the data warehouse.
These are structured for analysis and reporting purposes.
• End-user tools for analysis and reporting: Produce reports and analyze the data from
the target tables. Different types of querying, data mining and OLAP tools are used for
this purpose.

Data Warehouse Architecture

Basic design: source files are loaded into a warehouse, and users
query the data for different purposes.

Design with a staging area: the data is cleansed, transformed, loaded
and tested in the staging area, and is later loaded into the target
database/warehouse. The warehouse is divided into data marts, which
can be accessed by different users for their reporting and analysis
purposes.

Data Modeling
Effective way of using a Data Warehouse



Data Modeling

The E-R data model is commonly used in OLTP; the dimensional data
model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and
characteristics, such as employee, book or student.
Relationship: Relates entities to other entities.

• Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

• Most commonly used types of dimensional data models:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms
commonly used in this type of modeling:
• Dimension: A category of information. For example, the time dimension.
• Attribute: A unique level within a dimension. For example, Month is an attribute
in the Time dimension.
• Hierarchy: The specification of levels that represents the relationship between
different attributes within a dimension. For example, one possible hierarchy in
the Time dimension is Year -> Quarter -> Month -> Day.
• Fact Table: A table that contains the measures of interest.
• Lookup Table: Provides detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of
the quarters available in the data warehouse.
• Surrogate Keys: To avoid data integrity issues, surrogate keys are used. They are
helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one
or more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key
columns in the lookup tables.
Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
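
To illustrate how a star schema is queried, the sketch below loads the slide's sample rows into an in-memory SQLite database and joins the fact table to two dimension tables. The column types and the choice of SQLite are assumptions made for the example; the slide does not prescribe a database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                           address TEXT, city TEXT);
    CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER,
                       prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);

    INSERT INTO product  VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
    INSERT INTO store    VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
    INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                                (81, 'fred', '12 main', 'sfo'),
                                (111, 'sally', '80 willow', 'la');
    INSERT INTO sale VALUES ('o100', '1/7/97', 53,  'p1', 'c1', 1, 12),
                            ('o102', '2/7/97', 53,  'p2', 'c1', 2, 11),
                            ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Each dimension is one join away from the fact table.
sales_by_product_city = conn.execute("""
    SELECT p.name, s.city, SUM(f.amt) AS total_amt
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store   AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(sales_by_product_city)
```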

Snowflake Schema

Fact Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schema are most commonly found in dimensional
data warehouses and data marts where speed of data retrieval is more
important than the efficiency of data manipulations. As such, the
tables in these schema are not normalized much, and are frequently
designed at a level of normalization short of third normal form.
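
In a snowflake schema a dimension can itself point to another dimension, so a query may need a chain of joins. The following sketch uses the slide's sample rows in SQLite; the column types are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
    CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
    CREATE TABLE store  (storeId TEXT PRIMARY KEY, cityId TEXT, tId TEXT, mgr TEXT);

    INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
    INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
    INSERT INTO sType  VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
    INSERT INTO store  VALUES ('s5', 'sfo', 't1', 'joe'),
                              ('s7', 'sfo', 't2', 'fred'),
                              ('s9', 'la',  't1', 'nancy');
""")

# store -> city -> region is the "snowflaked" chain of lookups.
stores_with_region = conn.execute("""
    SELECT st.storeId, st.mgr, t.size, r.name
    FROM store AS st
    JOIN sType  AS t ON t.tId = st.tId
    JOIN city   AS c ON c.cityId = st.cityId
    JOIN region AS r ON r.regId = c.regId
    ORDER BY st.storeId
""").fetchall()
print(stores_with_region)
```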

Overview of Data Cleansing



The Need For Data Quality

 Difficulty in decision making


 Time delays in operation
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with
– error detection
– error rework
– customer service
– fixing customer problems

Six Steps To Data Quality

1. Understand Information Flow In Organization
• Identify authoritative data sources
• Interview employees & customers

2. Identify Potential Problem Areas & Assess Impact
• Data entry points
• Cost of bad data

3. Measure Quality Of Data
• Use business rule discovery tools to identify data with
inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
• Use data cleansing tools to clean data at the source
• Load only clean data into the data warehouse

5. Continuous Monitoring
• Schedule periodic cleansing of source data

6. Identify Areas of Improvement
• Identify & correct cause of defects
• Refine data capture mechanisms at source
• Educate users on importance of DQ
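
As an illustration of the "measure quality of data" step, the sketch below counts records with missing required fields and keys that occur more than once. It is not a business rule discovery tool; the record layout and the profile function are hypothetical.

```python
from collections import Counter

def profile(records, key, required):
    """Count records with missing required fields and list duplicated keys."""
    missing = sum(
        1 for r in records
        if any(r.get(field) in (None, "") for field in required)
    )
    key_counts = Counter(r.get(key) for r in records)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    return {"total": len(records), "missing": missing,
            "duplicate_keys": duplicate_keys}

# Hypothetical customer extract with one empty city and one repeated key.
customers = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},
    {"custId": 111, "name": "sally", "city": "la"},
    {"custId": 53, "name": "joe", "city": "sfo"},
]
report = profile(customers, key="custId", required=("name", "city"))
print(report)
```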

Data Quality Solution
Customized Programs
 Strengths:
– Addresses specific needs
– No bulky one time investment
 Limitations
– Tons of Custom programs in different environments are difficult to
manage
– Minor alterations demand coding efforts
Data Quality Assessment tools
 Strength
– Provide automated assessment
 Limitation
– No measure of data accuracy

Data Quality Solution

Business Rule Discovery tools


 Strengths
– Detect Correlation in data values
– Can detect Patterns of behavior that indicate fraud
 Limitations
– Not all variables can be discovered
– Some discovered rules might not be pertinent
– There may be performance problems with large files or with many
fields.

Data Reengineering & Cleansing tools


 Strengths
– Usually are integrated packages with cleansing features as Add-on
 Limitations 
– Error prevention at source is usually absent
– The ETL tools have limited cleansing facilities

Tools In The Market
 Business Rule Discovery Tools
– Integrity Data Reengineering Tool from Vality Technology
– Trillium Software System from Harte -Hanks Data Technologies
– Migration Architect from DB Star
 Data Reengineering & Cleansing Tools
– Carlton Pureview from Oracle
– ETI-Extract from Evolutionary Technologies
– PowerMart from Informatica Corp
– Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools
– Migration Architect, Evoke Axio from Evoke Software
– Wizrule from Wizsoft
 Name & Address Cleansing Tools
– Centrus Suite from Sagent
– I.d.centric from First Logic

Data Extraction, Transformation, Load



ETL Architecture

[Figure: ETL architecture in five stages: Data Collection, Data Extraction,
Data Transformation, Data Loading, Data Storage & Integration.
Sources: visitors' web browsers over the Internet, web server logs and
e-commerce transaction data, external data (demographics, household,
webographics, income), flat files, RDBMS and other OLTP systems.
Scheduled extraction feeds a staging area (clean, transform, match, merge)
backed by a metadata repository; scheduled loading populates the
enterprise data warehouse.]

ETL Architecture

Data Extraction:
• Rummages through a file or database
• Uses some criteria for selection
• Identifies qualified data and
• Transports the data over onto another file or database

Data Extraction – Cleanup:
• Restructuring of records or fields
• Removal of operational-only data
• Supply of missing field values
• Data integrity checks
• Data consistency and range checks, etc.

Data transformation:
• Integrating dissimilar data types
• Changing codes
• Adding a time attribute
• Summarizing data
• Calculating derived values
• Renormalizing data

Data loading:
• Initial and incremental loading
• Updation of metadata
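
The extraction, cleanup, transformation and loading steps above can be sketched end to end in a few lines. Everything specific here is hypothetical: the CSV layout, the status code mapping, the fact_order table and the rule that a missing quantity becomes 0.

```python
import csv
import io
import sqlite3

RAW = """order_id,status,qty,amt
o100,S,1,12
o101,X,,7
o102,S,2,11
"""
STATUS_CODES = {"S": "SHIPPED", "X": "CANCELLED"}  # hypothetical code mapping

def extract(text):
    """Read source records from a flat file (here: an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, load_date):
    """Supply missing values, change codes and add a time attribute."""
    out = []
    for r in rows:
        qty = int(r["qty"]) if r["qty"] else 0  # assumed default for missing qty
        out.append((r["order_id"], STATUS_CODES[r["status"]], qty,
                    int(r["amt"]), load_date))
    return out

def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_order "
                 "(order_id TEXT, status TEXT, qty INTEGER, amt INTEGER, "
                 "load_date TEXT)")
    conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW), "2009-01-31"))
loaded = conn.execute("SELECT * FROM fact_order ORDER BY order_id").fetchall()
print(loaded)
```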

Why ETL?

• Companies have valuable data lying around throughout their networks that
needs to be moved from one place to another.

• The data lies in all sorts of heterogeneous systems, and therefore in all sorts
of formats.

• To solve the problem, companies use extract, transform and load (ETL)
software.

• The data used in ETL processes can come from any source:
a mainframe application, an ERP application, a CRM tool, a flat file, or
an Excel spreadsheet.

Major components involved in ETL Processing

 Design manager
Lets developers define source-to-target mappings, transformations, process flows, and jobs
 Meta data management
Provides a repository to define, document, and manage information about the ETL design and runtime
processes
 Extract
The process of reading data from a database.
 Transform
The process of converting the extracted data
 Load
The process of writing the data into the target database.
 Transport services
ETL tools use network and file protocols to move data between
source and target systems and in-memory protocols to move data
between ETL run-time components.
 Administration and operation
ETL utilities let administrators schedule, run, monitor ETL jobs, log
all events, manage errors, recover from failures, reconcile outputs
with source systems

ETL Tools

• Provide a facility to specify a large number of transformation rules with a GUI
• Generate programs to transform data
• Handle multiple data sources
• Handle data redundancy
• Generate metadata as output
• Most tools exploit parallelism by running on multiple low-cost servers in a
multi-threaded environment

ETL Tools - Second-Generation

• PowerCenter/PowerMart from Informatica
• Data Mart Solution from Sagent Technology
• DataStage from Ascential

Metadata Management



What Is Metadata?

Metadata is Information...

 That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
 About the data being captured and loaded into the Warehouse
 Documented in IT tools that improves both business and technical understanding of data
and data-related processes

Importance Of Metadata
Locating information
How much time is spent looking for information?
How often is information found?
What poor decisions were made based on incomplete information?
How much money was lost or earned as a result?
Interpreting information
How many times have businesses needed to rework or recall products?
What impact does it have on the bottom line?
How many mistakes were due to misinterpretation of existing documentation?
How much interpretation results from too much metadata?
How much time is spent trying to determine if any of the metadata is accurate?
Integrating information
How do various data perspectives connect together?
How much time is spent trying to figure that out?
How much do inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management

• Provide a simple catalogue of business metadata descriptions and views
• Document/manage metadata descriptions from an integrated development
environment
• Enable DW users to identify and invoke pre-built queries against the data stores
• Design and enhance new data models and schemas for the data warehouse
• Capture data transformation rules between the operational and data
warehousing databases
• Provide change impact analysis, and update across these technologies

Consumers of Metadata

 Technical Users
• Warehouse administrator
• Application developer
 Business Users -Business metadata
• Meanings
• Definitions
• Business Rules
 Software Tools
• Used in DW life-cycle development
• Metadata requirements for each tool must be identified
• The tool-specific metadata should be analysed for inclusion in the enterprise
metadata repository
• Previously captured metadata should be electronically transferred from the
enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third Party Bridging Tools


 Oracle Exchange
– Technology of choice for a long list of repository, enterprise and
workgroup vendors
 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata
 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability
– Ardent focussing on own engagements, not selling it as independent
product
 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin,
Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools
Metadata Repositories
 IBM, Oracle and Microsoft to offer free or near-free basic
repository services
 Enable organisations to reuse metadata across technologies
 Integrate DB design, data transformation and BI tools from
different vendors
 Multi-tool vendors taking a bridged or federated rather than
integrated approach to sharing metadata
 Both IBM and Oracle have multiple repositories for different lines
of products — e.g., One for AD and one for DW, with bridges
between them

Trends in the Metadata Management Tools

Metadata Interchange Standards


 CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard
– Addresses only a limited subset of metadata artifacts
 OMG (Object Management Group)-CWM
– XML-addresses context and data meaning, not presentation
– Can enable exchange over the web employing industry standards for
storing and sharing programming data
– Will allow sharing of UML and MOF objects b/w various development
tools and repositories
 MDC (Metadata Coalition)
– Based on XML/UML standards
– Promoted by Microsoft Along With 20 partners including Object
Management Group (OMG), Oracle Carleton Group, CA-PLATINUM
Technology (Founding Member), Viasoft

OLAP

Agenda
 OLAP Definition
 Distinction between OLTP and OLAP
 MDDB Concepts
 Implementation Techniques
 Architectures
 Features
 Representative Tools

OLAP: On-Line Analytical Processing

• OLAP can be defined as a technology which allows users to view
aggregate data across measurements (like Maturity Amount, Interest
Rate, etc.) along with a set of related parameters called dimensions
(like Product, Organization, Customer, etc.)
• Used interchangeably with 'BI'
• Multidimensional view of data is the foundation of OLAP
• Users: analysts, decision makers

Distinction between OLTP and OLAP

                      OLTP System                       OLAP System

Source of data        Operational data; OLTPs are       Consolidation data; OLAP data
                      the original source of the data   comes from the various OLTP
                                                        databases

Purpose of data       To control and run fundamental    Decision support
                      business tasks

What the data         A snapshot of ongoing business    Multi-dimensional views of
reveals               processes                         various kinds of business
                                                        activities

Inserts and Updates   Short and fast inserts and        Periodic long-running batch
                      updates initiated by end users    jobs refresh the data

MDDB Concepts

A multidimensional database is a computer software system designed to
allow for efficient and convenient storage and retrieval of data that is
• intimately related and
• stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
• The edges of the cube are called dimensions
• Individual items within each dimension are called members

RDBMS v/s MDDB: Increased Complexity...

Relational DBMS
MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...           ...    ...      ...

MDDB
[Figure: the same Sales Volumes data as a cube with axes MODEL (Mini Van,
Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr).]

Relational DBMS: 27 x 4 = 108 cells          MDDB: 3 x 3 x 3 = 27 cells
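
The cell-count comparison can be reproduced with a small sketch that indexes relational rows by their dimension members. Only the nine MINI VAN rows from the table are used, so the counts are 36 versus 9 rather than 108 versus 27; the dictionary-based "cube" is an illustration, not how an MDDB product stores data.

```python
# (model, color, dealer, volume): the nine MINI VAN rows from the table
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6), ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2), ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3), ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3), ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]

def build_cube(rows):
    """Index volumes by (model, color, dealer) and collect dimension members."""
    cube = {}
    members = [[], [], []]
    for *coords, volume in rows:
        cube[tuple(coords)] = volume
        for axis, member in zip(members, coords):
            if member not in axis:
                axis.append(member)
    return cube, members

cube, (models, colors, dealers) = build_cube(rows)

# Relational form repeats the dimension values in every row (4 columns per row);
# the array form stores one cell per combination of members.
relational_cells = len(rows) * 4
cube_cells = len(models) * len(colors) * len(dealers)
print(relational_cells, cube_cells, cube[("MINI VAN", "RED", "Carr")])
```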

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation
– A great deal of information is gleaned immediately upon direct inspection of
the array
– User is able to view data along presorted dimensions with data arranged in an
inherently more organized, and accessible fashion than the one offered by the
relational table.
 Storage Space
– Very low Space Consumption compared to Relational DB
 Performance
– Gives much better performance.
– Relational DB may give comparable results only through database tuning
(indexing, keys etc), which may not be possible for ad-hoc queries.
 Ease of Maintenance
– No overhead as data is stored in the same way it is viewed. In Relational DB,
indexes, sophisticated joins etc. are used which require considerable storage
and maintenance

Issues with MDDB

• Sparsity
- Input data in applications are typically sparse
-Increases with increased dimensions
• Data Explosion
-Due to Sparsity
-Due to Summarization
• Performance
-Doesn’t perform better than RDBMS at high data
volumes (>20-30 GB)
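
Why sparsity grows with the number of dimensions can be seen from simple arithmetic: the number of cells is the product of the dimension sizes, while the number of facts stays the same. The figures below (50 facts, 10 members per dimension) are hypothetical.

```python
def sparsity(filled_cells, dimension_sizes):
    """Return (total cells, fraction of empty cells) for a dense array."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return total, 1 - filled_cells / total

# Hypothetical figures: 50 facts, 10 members per dimension.
for n_dims in (2, 3, 4):
    total, empty = sparsity(50, [10] * n_dims)
    print(f"{n_dims} dimensions: {total} cells, {empty:.1%} empty")
```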

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell
is left behind.

Relational table:
LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the dense 3 x 3 Sales Volumes array (MODEL x COLOR) shown next to an
Employee Age array with LAST NAME on one axis and EMPLOYEE # on the other.
Each employee number belongs to exactly one last name, so only 9 of the
9 x 9 = 81 cells hold an age and the rest are blank.]

OLAP Features

• Calculations applied across dimensions, through hierarchies and/or
across members
• Trend analysis over sequential time periods
• What-if scenarios
• Slicing / dicing subsets for on-screen viewing
• Rotation to new dimensional comparisons in the viewing area
• Drill-down/up along the hierarchy
• Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

• Complex queries & sorts in a relational environment are translated to a
simple rotation.

Sales Volumes

View #1 (rows: MODEL, columns: COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(ROTATE 90°)

View #2 (rows: COLOR, columns: MODEL)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

2 dimensional array has 2 views.
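
Rotation of a two-dimensional view amounts to swapping rows and columns. The sketch below uses the Sales Volumes figures from the slide; the helper names rotate and count_views are hypothetical. It also counts the possible axis orderings, which is n! for n dimensions (2 for two dimensions, 6 for three).

```python
from itertools import permutations

models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
view1 = [[6, 5, 4],   # rows: models, columns: colors
         [3, 5, 5],
         [4, 3, 2]]

def rotate(view):
    """Swap the two axes of a 2-D view (rows become columns)."""
    return [list(column) for column in zip(*view)]

view2 = rotate(view1)  # rows: colors, columns: models

def count_views(dimensions):
    """Number of distinct axis orderings of an n-dimensional array."""
    return len(list(permutations(dimensions)))

print(view2)
print(count_views(["MODEL", "COLOR"]),
      count_views(["MODEL", "COLOR", "DEALERSHIP"]))
```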


Features of OLAP - Rotation

[Figure: six views of the Sales Volumes cube, each obtained from the
previous one by a 90° rotation. Axes as vertical x horizontal x depth:
View #1: MODEL x COLOR x DEALERSHIP
View #2: COLOR x MODEL x DEALERSHIP
View #3: COLOR x DEALERSHIP x MODEL
View #4: DEALERSHIP x COLOR x MODEL
View #5: DEALERSHIP x MODEL x COLOR
View #6: MODEL x DEALERSHIP x COLOR]

3 dimensional array has 6 views.


Features of OLAP - Slicing / Filtering

• MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL = Mini Van, Coupe; COLOR = Normal Blue, Metal Blue; DEALERSHIP = Carr, Clyde.]
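
Slicing can be pictured as keeping only the cells whose members fall in a chosen subset of each dimension. The sketch below is a plain-Python illustration, not how an MDDB stores data; the cube contents and the helper `slice_cube` are made up for the example:

```python
# A few illustrative cells of a sales cube keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Coupe", "Blue", "Clyde"): 3,
    ("Coupe", "Blue", "Gleason"): 3,
    ("Sedan", "Blue", "Carr"): 2,
}


def slice_cube(cube, models=None, colors=None, dealerships=None):
    """Keep only cells whose members are in the requested sets (None keeps all)."""
    def keep(member, allowed):
        return allowed is None or member in allowed

    return {
        key: value
        for key, value in cube.items()
        if keep(key[0], models) and keep(key[1], colors) and keep(key[2], dealerships)
    }


subset = slice_cube(
    cube,
    models={"Mini Van", "Coupe"},
    colors={"Blue"},
    dealerships={"Carr", "Clyde"},
)
print(subset)
```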
Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION       Midwest
DISTRICT     Chicago, St. Louis, Gary
DEALERSHIP   Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales at Region / District / Dealership level

• Moving up and moving down in a hierarchy is referred to as “drill-up” / “roll-up” and “drill-down”.
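
A rough sketch of roll-up and drill-down over this hierarchy. The slide does not say which dealership belongs to which district, and it gives no sales figures, so the mapping and the numbers below are hypothetical:

```python
from collections import defaultdict

# Hypothetical assignment of dealerships to districts, and made-up sales figures.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}
sales = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 8, "Lucas": 4, "Bolton": 6}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(parent, values, parent_of):
    """Show the child-level values that make up one parent member."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


by_district = roll_up(sales, district_of)    # dealership -> district
by_region = roll_up(by_district, region_of)  # district -> region
chicago_detail = drill_down("Chicago", sales, district_of)
print(by_region, by_district, chicago_detail)
```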

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for the years 1999 and 2000; vertical axis from 0 to 200.]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the 1st to 4th quarters of 1999; vertical axis from 0 to 90.]

• Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March 1999; vertical axis from 0 to 20.]

• Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

• MOLAP - Multidimensional OLAP
  – A multidimensional database provides both the database and the application logic layer
• ROLAP - Relational OLAP
  – Accesses data stored in a relational data warehouse for OLAP analysis
  – Database and application logic are provided as separate layers
• HOLAP - Hybrid OLAP
  – The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server
• DOLAP - Desktop OLAP
  – Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Diagram: an OLAP cube feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

• Powerful analytical capabilities (e.g., financial, forecasting, statistical)
• Aggregation and calculation capabilities
• Read/write analytic applications
• Specialized data structures for
  – Maximum query performance
  – Optimum space utilization

ROLAP - Standard SQL storage

[Diagram: a relational DW is queried through SQL by the OLAP calculation engine (MDDB - relational mapping), which serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features

• Three-tier hardware/software architecture:
  – GUI on client; multidimensional processing on mid-tier server; target database on database server
  – Processing split between mid-tier and database servers
• Ad hoc query capabilities to very large databases
• DW integration
• Data scalability
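
As a rough illustration of the relational approach, a multidimensional summary can be expressed as an ordinary SQL aggregation over a relational table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the database server; the table name `sales` and its rows are illustrative:

```python
import sqlite3

# In-memory database standing in for the relational data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, dealer TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Mini Van", "Blue", "Clyde", 6),
        ("Mini Van", "Blue", "Gleason", 3),
        ("Mini Van", "Red", "Clyde", 5),
        ("Coupe", "Blue", "Clyde", 3),
        ("Coupe", "Red", "Carr", 6),
    ],
)

# Summarize volume by MODEL and COLOR, aggregating over the dealer dimension.
rows = conn.execute(
    "SELECT model, color, SUM(volume) FROM sales "
    "GROUP BY model, color ORDER BY model, color"
).fetchall()
print(rows)
```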

HOLAP - Combination of RDBMS and MDDB

[Diagram: an OLAP cube and a relational DW (accessed through SQL) both feed the OLAP calculation engine, which serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

• RDBMS used for detailed data stored in large databases
• MDDB used for fast, read/write OLAP analysis and calculations
• Scalability of RDBMS and MDDB performance
• Calculation engine provides full analysis features
• Source of data transparent to end user

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: With good design, 3 – 10 times
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

• Oracle Express Products
• Hyperion Essbase
• Cognos - PowerPlay
• Seagate - Holos
• SAS
• MicroStrategy - DSS Agent
• Informix MetaCube
• Brio Query
• Business Objects / Web Intelligence

Sample OLAP Applications

• Sales Analysis
• Financial Analysis
• Profitability Analysis
• Performance Analysis
• Risk Management
• Profiling & Segmentation
• Scorecard Application
• NPA Management
• Strategic Planning
• Customer Relationship Management (CRM)

Data Warehouse Testing

Data Warehouse Testing Overview

• There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

• The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference in Testing Data Warehouse and Transaction System

Data warehouse testing is different on the following counts:

– User-triggered vs. system-triggered
– Volume of test data
– Possible scenarios / test cases
– Programming for testing challenge

Difference in Testing Data Warehouse and Transaction System…

• User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. By contrast, most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). In those systems, very few test cycles cover system-triggered scenarios (like billing or valuation).

Difference in Testing Data Warehouse and Transaction System…

• Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has a large volume of test data, because one tries to cover the maximum possible combinations of dimensions and facts.

• Possible scenarios / test cases
In a data warehouse, the number of permutations and combinations one can possibly test is virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

Difference in Testing Data Warehouse and Transaction System…

• Programming for testing challenge

In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
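
A stand-alone comparison script of this kind could look roughly like the sketch below. It assumes the pre- and post-transformation rows have already been fetched into lists of dictionaries; the function `reconcile` and the field names `order_id` and `amt` are hypothetical:

```python
def reconcile(source_rows, target_rows, key, amount):
    """Compare row counts, key sets and amount totals between source and target."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    source_total = sum(row[amount] for row in source_rows)
    target_total = sum(row[amount] for row in target_rows)
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
        "amount_difference": target_total - source_total,
    }


# Illustrative data: one source row did not reach the target.
source = [
    {"order_id": 1, "amt": 12},
    {"order_id": 2, "amt": 11},
    {"order_id": 3, "amt": 50},
]
target = [
    {"order_id": 1, "amt": 12},
    {"order_id": 2, "amt": 11},
]

report = reconcile(source, target, key="order_id", amount="amt")
print(report)
```

Real scripts usually run the equivalent checks as SQL against the source and target databases; the structure of the comparison stays the same.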

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:

• 'Back-end' testing, where the source system data is compared to the end-result data in the loaded area
• 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

The testing phases consist of:
• Requirements testing
• Unit testing
• Integration testing
• Performance testing
• Acceptance testing

Requirements testing

The main aim of requirements testing is to check the stated requirements for completeness.
Requirements can be tested on the following factors:
• Are the requirements complete?
• Are the requirements singular?
• Are the requirements unambiguous?
• Are the requirements developable?
• Are the requirements testable?

Unit Testing

Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:

• Check that the ETLs access and pick up the right data from the right source.

• Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.

• Test the rejected records that don't fulfil the transformation rules.
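
A minimal sketch of such a unit test in plain Python. The transformation and its business rule (quantity must be a positive integer, amount = quantity × price) are invented for illustration; a real ETL mapping would be tested through the ETL tool or its generated code:

```python
def transform(records):
    """Apply a hypothetical business rule and separate loaded from rejected rows."""
    loaded, rejected = [], []
    for record in records:
        qty = record.get("qty")
        if isinstance(qty, int) and qty > 0:
            # Derived value: amount = quantity * price.
            loaded.append({**record, "amt": qty * record["price"]})
        else:
            rejected.append(record)
    return loaded, rejected


records = [
    {"id": 1, "qty": 2, "price": 5},      # valid
    {"id": 2, "qty": 0, "price": 5},      # rejected: non-positive quantity
    {"id": 3, "qty": None, "price": 10},  # rejected: missing quantity
]

loaded, rejected = transform(records)

# Checks corresponding to the bullets above: the transformation is correct,
# and rejected records are captured rather than silently dropped.
assert [row["amt"] for row in loaded] == [10]
assert [row["id"] for row in rejected] == [2, 3]
assert len(loaded) + len(rejected) == len(records)
```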

Unit Testing…

Unit testing the report data:

• Verify report data with the source:
Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
• Field-level data verification:
The QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems.
• Derivation formulae/calculation rules should be verified.

Integration Testing

Integration testing involves the following:
• Sequence of ETL jobs in the batch
• Initial loading of records into the data warehouse
• Incremental loading of records at a later date, to verify the newly inserted or updated data
• Testing the rejected records that don't fulfil the transformation rules
• Error log generation

Performance Testing

Performance testing should check for:

• ETL processes completing within the time window
• Monitoring and measuring the data quality issues
• Refresh times for standard/complex reports

Acceptance testing

Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions

Thank You

Components of Warehouse

• Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
• ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
• Maintenance and Administration Tools: Authorize and monitor access to the data and set up users. Schedule jobs to run in off-peak periods.
• Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
• Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
• End-user tools for analysis and reporting: Produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture

This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis.

Data Modeling
Effective way of using a Data Warehouse



Data Modeling

The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is commonly used.

E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book or student.
Relationship: Relates entities to other entities.

• Different perspectives of data modeling:
  o Conceptual Data Model
  o Logical Data Model
  o Physical Data Model

• Most commonly used types of dimensional data models:
  o Star Schema
  o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
• Dimension: A category of information. For example, the time dimension.
• Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
• Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
• Fact Table: A table that contains the measures of interest.
• Lookup Table: Provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
• Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema

Dimension Table: product
prodId   name   price
p1       bolt   10
p2       nut    5

Dimension Table: store
storeId   city
c1        nyc
c2        sfo
c3        la

Fact Table: sale
orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
o105      3/8/97   111      p1       c3        5     50

Dimension Table: customer
custId   name    address     city
53       joe     10 main     sfo
81       fred    12 main     sfo
111      sally   80 willow   la
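
To show how a star schema is typically queried, the sketch below loads the slide's tables into an in-memory SQLite database (an assumption made for illustration; any relational database would do) and joins the fact table to two dimension tables. The customer dimension is omitted for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER,
                   prodId TEXT REFERENCES product(prodId),
                   storeId TEXT REFERENCES store(storeId),
                   qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('o105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total sales amount by store city and product name:
# the fact table joins directly to each dimension table.
totals = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store ON sale.storeId = store.storeId
    JOIN product ON sale.prodId = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
print(totals)
```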

Snowflake Schema

Dimension Table: sType
tId   size    location
t1    small   downtown
t2    large   suburbs

Fact Table: store
storeId   cityId   tId   mgr
s5        sfo      t1    joe
s7        sfo      t2    fred
s9        la       t1    nancy

Dimension Table: city
cityId   pop   regId
sfo      1M    north
la       5M    south

region
regId   name
north   cold region
south   warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
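
In a snowflake schema a query walks a chain of tables rather than a single join per dimension. A brief sketch with the slide's store, city and region tables, again using an in-memory SQLite database as an assumed stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city (cityId TEXT PRIMARY KEY, pop TEXT,
                   regId TEXT REFERENCES region(regId));
CREATE TABLE store (storeId TEXT PRIMARY KEY,
                    cityId TEXT REFERENCES city(cityId),
                    tId TEXT, mgr TEXT);
INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO store VALUES ('s5', 'sfo', 't1', 'joe'),
                         ('s7', 'sfo', 't2', 'fred'),
                         ('s9', 'la', 't1', 'nancy');
""")

# Number of stores per region: store -> city -> region (two joins deep).
stores_per_region = conn.execute("""
    SELECT region.name, COUNT(*)
    FROM store
    JOIN city ON store.cityId = city.cityId
    JOIN region ON city.regId = region.regId
    GROUP BY region.name
    ORDER BY region.name
""").fetchall()
print(stores_per_region)
```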

Overview of Data Cleansing



The Need For Data Quality

• Difficulty in decision making
• Time delays in operation
• Organizational mistrust
• Data ownership conflicts
• Customer attrition
• Costs associated with
  – error detection
  – error rework
  – customer service
  – fixing customer problems

Six Steps To Data Quality

1. Understand Information Flow In Organization
   • Identify authoritative data sources
   • Interview employees & customers

2. Identify Potential Problem Areas & Assess Impact
   • Data entry points
   • Cost of bad data

3. Measure Quality Of Data
   • Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
   • Use data cleansing tools to clean data at the source
   • Load only clean data into the data warehouse

5. Continuous Monitoring
   • Schedule periodic cleansing of source data

6. Identify Areas of Improvement
   • Identify & correct cause of defects
   • Refine data capture mechanisms at source
   • Educate users on importance of DQ
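
The "measure quality of data" step can be illustrated with a very small profiling routine. This is a hand-written sketch rather than a business rule discovery tool; the function `profile` and the sample records are hypothetical:

```python
from collections import Counter


def profile(records, field):
    """Count missing and duplicated values of one field."""
    values = [record.get(field) for record in records]
    missing = sum(1 for value in values if value is None or value == "")
    counts = Counter(value for value in values if value not in (None, ""))
    duplicates = {value: n for value, n in counts.items() if n > 1}
    return {"total": len(values), "missing": missing, "duplicates": duplicates}


# Illustrative customer records with one duplicate key and one missing key.
customers = [
    {"custId": 53, "name": "joe"},
    {"custId": 53, "name": "joe"},
    {"custId": None, "name": "fred"},
    {"custId": 111, "name": "sally"},
]

summary = profile(customers, "custId")
print(summary)
```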

Data Quality Solution

Customized Programs
• Strengths
  – Address specific needs
  – No bulky one-time investment
• Limitations
  – Large numbers of custom programs in different environments are difficult to manage
  – Minor alterations demand coding efforts

Data Quality Assessment tools
• Strength
  – Provide automated assessment
• Limitation
  – No measure of data accuracy

Data Quality Solution

Business Rule Discovery tools
• Strengths
  – Detect correlation in data values
  – Can detect patterns of behavior that indicate fraud
• Limitations
  – Not all variables can be discovered
  – Some discovered rules might not be pertinent
  – There may be performance problems with large files or with many fields

Data Reengineering & Cleansing tools
• Strengths
  – Usually are integrated packages with cleansing features as add-ons
• Limitations
  – Error prevention at source is usually absent
  – The ETL tools have limited cleansing facilities

Tools In The Market

• Business Rule Discovery Tools
  – Integrity Data Reengineering Tool from Vality Technology
  – Trillium Software System from Harte-Hanks Data Technologies
  – Migration Architect from DB Star
• Data Reengineering & Cleansing Tools
  – Carlton Pureview from Oracle
  – ETI-Extract from Evolutionary Technologies
  – PowerMart from Informatica Corp
  – Sagent Data Mart from Sagent Technology
• Data Quality Assessment Tools
  – Migration Architect, Evoke Axio from Evoke Software
  – Wizrule from Wizsoft
• Name & Address Cleansing Tools
  – Centrus Suite from Sagent
  – I.d.centric from First Logic

Data Extraction, Transformation, Load



ETL Architecture

[Diagram: ETL data flow in five stages.]
• Data Collection: visitors' web browsers over the Internet; external data (demographics, household, webographics, income)
• Data Extraction: scheduled extraction from web server logs and e-commerce transaction data (flat files, RDBMS) and from other OLTP systems
• Data Transformation: staging area with a metadata repository (clean, transform, match, merge)
• Data Loading: scheduled loading
• Data Storage & Integration: enterprise data warehouse

ETL Architecture

Data Extraction
• Rummages through a file or database
• Uses some criteria for selection
• Identifies qualified data and
• Transports the data over onto another file or database

Data Extraction – Cleanup
• Restructuring of records or fields
• Removal of operational-only data
• Supply of missing field values
• Data integrity checks
• Data consistency and range checks, etc.

Data Transformation
• Integrating dissimilar data types
• Changing codes
• Adding a time attribute
• Summarizing data
• Calculating derived values
• Renormalizing data

Data Loading
• Initial and incremental loading
• Updating of metadata
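
The extract, cleanup, transform and load steps listed above can be sketched end to end in a few lines. This is a toy pipeline using only the Python standard library, not a depiction of any ETL tool; the input text, the code mapping `REGION_NAMES` and the table `sales` are all made up:

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract; the last row has a missing amount.
RAW = "cust_id,region_code,amount\n53,N,12\n53,N,11\n111,S,\n"

REGION_NAMES = {"N": "north", "S": "south"}  # hypothetical code mapping


def extract(text):
    """Read qualifying records from a flat file."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, load_date):
    """Clean up and transform: drop incomplete rows, change codes, add a time attribute."""
    clean = []
    for row in rows:
        if not row["amount"]:  # consistency check: skip rows with no amount
            continue
        clean.append((
            int(row["cust_id"]),
            REGION_NAMES[row["region_code"]],
            float(row["amount"]),
            load_date,
        ))
    return clean


def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(cust_id INTEGER, region TEXT, amount REAL, load_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW), load_date="1997-07-01"))
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)
```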

Why ETL?

• Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.

• The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

• To solve the problem, companies use extract, transform and load (ETL) software.

• The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

Major components involved in ETL Processing

• Design manager
  Lets developers define source-to-target mappings, transformations, process flows, and jobs.
• Metadata management
  Provides a repository to define, document, and manage information about the ETL design and runtime processes.
• Extract
  The process of reading data from a source database.
• Transform
  The process of converting the extracted data into the form required by the target.
• Load
  The process of writing the data into the target database.
• Transport services
  ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
• Administration and operation
  ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools

• Provide a facility to specify a large number of transformation rules with a GUI
• Generate programs to transform data
• Handle multiple data sources
• Handle data redundancy
• Generate metadata as output
• Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation

• PowerCenter/Mart from Informatica
• Data Mart Solution from Sagent Technology
• DataStage from Ascential

Metadata Management



What Is Metadata?

Metadata is information...

• That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
• About the data being captured and loaded into the warehouse
• Documented in IT tools that improve both business and technical understanding of data and data-related processes

Importance Of Metadata

Locating information
• How much time is spent looking for information?
• How often is the information found?
• What poor decisions were made based on incomplete information?
• How much money was lost or earned as a result?

Interpreting information
• How many times have businesses needed to rework or recall products?
• What impact does it have on the bottom line?
• How many mistakes were due to misinterpretation of existing documentation?
• How much interpretation results from too much metadata?
• How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
• How do various data perspectives connect together?
• How much time is spent trying to figure that out?
• How much do the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management

• Provide a simple catalogue of business metadata descriptions and views
• Document/manage metadata descriptions from an integrated development environment
• Enable DW users to identify and invoke pre-built queries against the data stores
• Design and enhance new data models and schemas for the data warehouse
• Capture data transformation rules between the operational and data warehousing databases
• Provide change impact analysis, and update across these technologies

Consumers of Metadata

• Technical Users
  – Warehouse administrator
  – Application developer
• Business Users - Business metadata
  – Meanings
  – Definitions
  – Business rules
• Software Tools
  – Used in DW life-cycle development
  – Metadata requirements for each tool must be identified
  – The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  – Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third Party Bridging Tools

• Oracle Exchange
  – Technology of choice for a long list of repository, enterprise and workgroup vendors
• Reischmann-Informatik-Toolbus
  – Features include facilitation of selective bridging of metadata
• Ardent Software / Dovetail Software - Interplay
  – 'Hub and Spoke' solution for enabling metadata interoperability
  – Ardent focussing on own engagements, not selling it as an independent product
• Informix's Metadata Plug-ins
  – Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools

Metadata Repositories
• IBM, Oracle and Microsoft to offer free or near-free basic repository services
• Enable organisations to reuse metadata across technologies
• Integrate DB design, data transformation and BI tools from different vendors
• Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
• Both IBM and Oracle have multiple repositories for different lines of products (e.g., one for AD and one for DW), with bridges between them

Trends in the Metadata Management Tools

Metadata Interchange Standards

• CDIF (CASE Data Interchange Format)
  – Most frequently used interchange standard
  – Addresses only a limited subset of metadata artifacts
• OMG (Object Management Group) - CWM
  – XML addresses context and data meaning, not presentation
  – Can enable exchange over the web employing industry standards for storing and sharing programming data
  – Will allow sharing of UML and MOF objects between various development tools and repositories
• MDC (Metadata Coalition)
  – Based on XML/UML standards
  – Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP

Agenda
• OLAP Definition
• Distinction between OLTP and OLAP
• MDDB Concepts
• Implementation Techniques
• Architectures
• Features
• Representative Tools

OLAP: On-Line Analytical Processing

• OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
• Used interchangeably with 'BI'
• Multidimensional view of data is the foundation of OLAP
• Users: analysts, decision makers

Distinction between OLTP and OLAP

Source of data
  OLTP System: Operational data; OLTPs are the original source of the data
  OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP System: To control and run fundamental business tasks
  OLAP System: Decision support

What the data reveals
  OLTP System: A snapshot of ongoing business processes
  OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
  OLTP System: Short and fast inserts and updates initiated by end users
  OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts

A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
• intimately related and
• stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
• The edges of the cube are called dimensions
• Individual items within each dimension are called members

RDBMS v/s MDDB: Increased Complexity...

Relational DBMS

MODEL          COLOR   DEALER    VOL.
MINI VAN       BLUE    Clyde     6
MINI VAN       BLUE    Gleason   3
MINI VAN       BLUE    Carr      2
MINI VAN       RED     Clyde     5
MINI VAN       RED     Gleason   3
MINI VAN       RED     Carr      1
MINI VAN       WHITE   Clyde     3
MINI VAN       WHITE   Gleason   1
MINI VAN       WHITE   Carr      4
SPORTS COUPE   BLUE    Clyde     3
SPORTS COUPE   BLUE    Gleason   3
SPORTS COUPE   BLUE    Carr      3
SPORTS COUPE   RED     Clyde     4
SPORTS COUPE   RED     Gleason   3
SPORTS COUPE   RED     Carr      6
SPORTS COUPE   WHITE   Clyde     2
SPORTS COUPE   WHITE   Gleason   3
SPORTS COUPE   WHITE   Carr      5
SEDAN          BLUE    Clyde     4
SEDAN          BLUE    Gleason   3
SEDAN          BLUE    Carr      2
...            …       …         ...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr).]

3 x 3 x 3 = 27 cells
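
The cell counts on the slide, and the change of representation behind them, can be reproduced with a short sketch. It uses only the nine Mini Van rows from the relational table; the nested-dictionary cube is a simplification chosen for illustration, not how an MDDB lays out storage:

```python
# The nine Mini Van rows from the relational table: (model, color, dealer, volume).
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3),
    ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]


def build_cube(rows):
    """Index volumes by model, then color, then dealer."""
    cube = {}
    for model, color, dealer, volume in rows:
        cube.setdefault(model, {}).setdefault(color, {})[dealer] = volume
    return cube


def cell_counts(n_models, n_colors, n_dealers, n_columns=4):
    """Cells needed by the relational table versus the multidimensional array."""
    combinations = n_models * n_colors * n_dealers
    return combinations * n_columns, combinations


cube = build_cube(rows)
print(cube["MINI VAN"]["RED"]["Carr"])  # direct lookup by dimension members
relational_cells, cube_cells = cell_counts(3, 3, 3)
print(relational_cells, cube_cells)
```

In the relational form every row repeats the three dimension values next to the measure, which is where the factor of four comes from.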

Benefits of MDDB over RDBMS
• Ease of Data Presentation & Navigation
  – A great deal of information is gleaned immediately upon direct inspection of the array
  – The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
• Storage Space
  – Very low space consumption compared to a relational DB
• Performance
  – Gives much better performance
  – A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries
• Ease of Maintenance
  – No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

• Sparsity
  – Input data in applications are typically sparse
  – Increases with increased dimensions
• Data Explosion
  – Due to sparsity
  – Due to summarization
• Performance
  – Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If members of different dimensions do not interact, a blank cell is left behind.

Employee Age (relational table)

LAST NAME   EMP#   AGE
SMITH       01     21
REGAN       12     19
FOX         31     63
WELD        14     31
KELLY       54     27
LINK        03     56
KRANZ       41     45
LUCAS       33     41
WEISS       23     19

[Figure: the same data as a two-dimensional array of LAST NAME by EMPLOYEE # (31, 41, 23, 01, 14, 54, 03, 12, 33). Only the cell where a name meets its own employee number holds an age, so 9 of the 81 cells are filled and the rest are blank.]

For contrast, the Sales Volumes array of MODEL by COLOR is fully populated:

            Blue   Red   White
Mini Van     6      5      4
Coupe        3      5      5
Sedan        4      3      2
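
The degree of sparsity in the employee example can be worked out directly. A small sketch in plain Python, using the nine employees from the slide:

```python
# (last name, employee number, age) from the Employee Age table.
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

names = [name for name, _, _ in employees]
numbers = [number for _, number, _ in employees]

# Two-dimensional array of LAST NAME by EMPLOYEE #; None marks a blank cell.
grid = {(name, number): None for name in names for number in numbers}
for name, number, age in employees:
    grid[(name, number)] = age

total_cells = len(grid)
filled_cells = sum(1 for value in grid.values() if value is not None)
sparsity = 1 - filled_cells / total_cells
print(total_cells, filled_cells, round(sparsity, 3))
```

Each additional dimension multiplies the number of cells, while the number of filled cells stays bounded by the number of input records, which is why sparsity increases with dimensionality.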

OLAP Features

 Calculations applied across dimensions, through


hierarchies and/or across members
 Trend analysis over sequential time periods,
 What-if scenarios.
 Slicing / Dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the
viewing area
 Drill-down/up along the hierarchy
 Reach-through / Drill-through to underlying detail data

12/08/21 206
206 ©
© 2009
2009 Wipro
Wipro Ltd
Ltd -- Confidential
Confidential
Features of OLAP - Rotation

• Complex Queries & Sorts in Relational environment


translated to simple rotation.
Sales Volumes

M
Mini Van
6 5 4 C Blue 6 3 4
O O
D Coupe
3 5 5 L Red 5 5 3
E O
L R
Sedan 4 3 2 o
White 4 5 2
( ROTATE 90 ) Mini Van Coupe Sedan
Blue Red White

COLOR MODEL

View #1 View #2

2 dimensional array has 2 views.


Features of OLAP - Rotation

Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr). Each 90° rotation yields a new view:

View   Vertical axis   Horizontal axis   Depth axis
#1     MODEL           COLOR             DEALERSHIP
#2     COLOR           MODEL             DEALERSHIP
#3     COLOR           DEALERSHIP        MODEL
#4     DEALERSHIP      COLOR             MODEL
#5     DEALERSHIP      MODEL             COLOR
#6     MODEL           DEALERSHIP        COLOR

3 dimensional array has 6 views.
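
The view counts on these two slides (2 views for 2 dimensions, 6 for 3) match the number of ways to order the axes. A minimal sketch, assuming a "view" is simply an ordering of the dimensions:

```python
from itertools import permutations

dimensions = ("MODEL", "COLOR", "DEALERSHIP")

# Every ordering of the axes (vertical, horizontal, depth) is one view.
views = list(permutations(dimensions))

for number, axes in enumerate(views, start=1):
    print(f"View #{number}: {axes}")

print(len(views), "views for", len(dimensions), "dimensions")
```

The numbering printed here follows Python's permutation order, not the slide's view numbers.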


Features of OLAP - Slicing / Filtering

• MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube (MODEL by COLOR by DEALERSHIP) sliced down to models Mini Van and Coupe, colors Normal Blue and Metal Blue, and dealerships Carr and Clyde]
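
Slicing can be sketched as filtering cube cells by the members wanted on each dimension. The sales volumes and the `slice_cube` helper below are hypothetical; only the member names come from the slide.

```python
# Hypothetical sales volumes keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 6,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Sedan", "Normal Blue", "Gleason"): 3,
    ("Coupe", "Red", "Carr"): 6,
}


def slice_cube(cube, models=None, colors=None, dealerships=None):
    """Keep only cells whose members are in the requested sets (None keeps all)."""
    def keep(value, allowed):
        return allowed is None or value in allowed

    return {
        key: volume
        for key, volume in cube.items()
        if keep(key[0], models) and keep(key[1], colors) and keep(key[2], dealerships)
    }


subset = slice_cube(
    cube,
    models={"Mini Van", "Coupe"},
    colors={"Normal Blue", "Metal Blue"},
    dealerships={"Carr", "Clyde"},
)
print(subset)
```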
Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION       Midwest
DISTRICT     Chicago, St. Louis, Gary
DEALERSHIP   Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales at Region/District/Dealership level

• Moving up and moving down in a hierarchy is referred to as “drill-up” / “roll-up” and “drill-down”.
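
A roll-up along this hierarchy can be sketched as summing each member into its parent. The slide does not say which dealership belongs to which district, so the mapping and the sales figures below are hypothetical.

```python
from collections import defaultdict

# Hypothetical parent mappings; the slide only lists the members of each level.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest (dealership) level.
dealership_sales = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 8, "Lucas": 4, "Bolton": 6}


def roll_up(sales, parent_of):
    """Aggregate each member's value into its parent one level up."""
    totals = defaultdict(int)
    for member, amount in sales.items():
        totals[parent_of[member]] += amount
    return dict(totals)


district_sales = roll_up(dealership_sales, district_of)
region_sales = roll_up(district_sales, region_of)
print(district_sales)
print(region_sales)
```

Drill-down is the reverse navigation: starting from a region total and listing the district or dealership values that make it up.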

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M, axis 0 to 200) by year (1999, 2000), with one series per region: East, West, Central]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M, axis 0 to 90) for the four quarters of 1999, with one series per region: East, West, Central]

• Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M, axis 0 to 20) for January, February and March 1999, with one series per region: East, West, Central]

• Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

• MOLAP - Multidimensional OLAP
  - Multidimensional databases for database and application logic layer
• ROLAP - Relational OLAP
  - Access data stored in the relational Data Warehouse for OLAP analysis
  - Database and application logic provided as separate layers
• HOLAP - Hybrid OLAP
  - OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server
• DOLAP - Desktop OLAP
  - Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Diagram: an OLAP cube feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

MOLAP - Features

• Powerful analytical capabilities (e.g., financial, forecasting, statistical)
• Aggregation and calculation capabilities
• Read/write analytic applications
• Specialized data structures for
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

MDDB - Relational Mapping

[Diagram: the relational DW is queried through SQL by the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

ROLAP - Features

• Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier & database servers
• Ad hoc query capabilities to very large databases
• DW integration
• Data scalability

HOLAP - Combination of RDBMS and MDDB

[Diagram: an OLAP cube and a relational DW (accessed through SQL) both feed the OLAP calculation engine, which serves any client: web browsers, OLAP tools and OLAP applications]

HOLAP - Features

• RDBMS used for detailed data stored in large databases
• MDDB used for fast, read/write OLAP analysis and calculations
• Scalability of RDBMS and MDDB performance
• Calculation engine provides full analysis features
• Source of data transparent to end user

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: With a good design, 3 to 10 times
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

• Oracle Express Products
• Hyperion Essbase
• Cognos - PowerPlay
• Seagate - Holos
• SAS
• MicroStrategy - DSS Agent
• Informix MetaCube
• Brio Query
• Business Objects / Web Intelligence

Sample OLAP Applications

 Sales Analysis
 Financial Analysis
 Profitability Analysis
 Performance Analysis
 Risk Management
 Profiling & Segmentation
 Scorecard Application
 NPA Management
 Strategic Planning
 Customer Relationship Management (CRM)

Data Warehouse Testing

Data Warehouse Testing Overview
• There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.

• The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data warehouse and
Transaction System

Data warehouse testing is different on the following counts:


– User-Triggered vs. System triggered
– Volume of Test Data
– Possible scenarios/ Test Cases
– Programming for testing challenge

Difference In Testing Data warehouse and
Transaction System…
• User-Triggered vs. System triggered

In a data warehouse, most of the testing is system-triggered. In production (source) systems, most testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).

Difference In Testing Data warehouse and
Transaction System…
• Volume of Test Data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has a large volume of test data, because one tries to cover the maximum possible combinations of dimensions and facts.
• Possible scenarios / Test Cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

Difference In Testing Data warehouse and
Transaction System…

• Programming for testing challenge

In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before and after transformation.
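
A stand-alone comparison script of this kind might look like the following sketch. The `reconcile` helper, the field names and the sample rows are assumptions for illustration; real scripts would read from the source and target databases.

```python
def reconcile(source_rows, target_rows, amount_field="amount"):
    """Compare row counts and the total of one numeric field before and after loading."""
    source_total = sum(row[amount_field] for row in source_rows)
    target_total = sum(row[amount_field] for row in target_rows)
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "total_match": source_total == target_total,
    }


# Hypothetical pre-transformation (source) and post-transformation (target) rows.
source = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

result = reconcile(source, target)
print(result)
```

Row counts and control totals are only a first check; field-level comparisons are usually layered on top.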

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:
• 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
• 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP
Testing phases consist of:
• Requirements testing
• Unit testing
• Integration testing
• Performance testing
• Acceptance testing

Requirements testing

The main aim of requirements testing is to check the stated requirements for completeness.
Requirements can be tested on the following factors:
• Are the requirements complete?
• Are the requirements singular?
• Are the requirements unambiguous?
• Are the requirements developable?
• Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL
procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:

•Whether ETLs are accessing and picking up right data from right source.

•All the data transformations are correct according to the business rules and data
warehouse is correctly populated with the transformed data.

•Testing the rejected records that don’t fulfil transformation rules.
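
A unit test for a single ETL transformation, including its rejected records, could be sketched as below. The transformation rule (age must be a non-negative integer, names are upper-cased) is a hypothetical business rule chosen only for the example.

```python
def transform(records):
    """Apply a hypothetical rule: valid ages are non-negative integers; names are upper-cased."""
    loaded, rejected = [], []
    for record in records:
        age = record.get("age")
        if isinstance(age, int) and age >= 0:
            loaded.append({"name": record["name"].upper(), "age": age})
        else:
            rejected.append(record)  # kept aside so rejects can be inspected
    return loaded, rejected


def test_transform_separates_rejects():
    loaded, rejected = transform([
        {"name": "smith", "age": 21},
        {"name": "fox", "age": -3},     # violates the range rule
        {"name": "weld", "age": None},  # missing value
    ])
    assert loaded == [{"name": "SMITH", "age": 21}]
    assert [r["name"] for r in rejected] == ["fox", "weld"]


test_transform_separates_rejects()
print("transform unit test passed")
```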

Unit Testing…

Unit Testing the Report data:

•Verify report data with source:
Data present in a data warehouse will be stored at an aggregate level compared to
source systems. The QA team should verify the granular data stored in the data warehouse
against the source data available.
•Field-level data verification:
The QA team must understand the linkages for the fields displayed in the report and
should trace them back and compare them with the source systems.
•Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing will involve the following:
• Sequence of ETL jobs in batch
• Initial loading of records into the data warehouse
• Incremental loading of records at a later date to verify the newly inserted or updated data
• Testing the rejected records that don’t fulfil transformation rules
• Error log generation

Performance Testing

Performance Testing should check for :

 ETL processes completing within time window.

 Monitoring and measuring the data quality issues.

 Refresh times for standard/complex reports.

Acceptance testing

Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions

Thank You

Components of Warehouse

• Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
• ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
• Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
• Modeling Tools: Used to design the data warehouse for high performance using dimensional data modeling techniques, and to map the source and target files.
• Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
• End-user tools for analysis and reporting: Produce reports and analyze the data from target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture

This is a basic design, where source files are loaded into a warehouse and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for reporting and analysis.

Data Modeling
Effective way of using a Data Warehouse



Data Modeling

The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is commonly used.
E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics, such as employee, book, student…
Relationship: Relates entities to other entities.

• Different perspectives of data modeling:
  o Conceptual Data Model
  o Logical Data Model
  o Physical Data Model

• Types of dimensional data models most commonly used:
  o Star Schema
  o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
• Dimension: A category of information. For example, the time dimension.
• Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
• Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
• Fact Table: A table that contains the measures of interest.
• Lookup Table: Provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
• Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
Star Schema

Dimension Table: product
prodId   name   price
p1       bolt   10
p2       nut    5

Dimension Table: store
storeId   city
c1        nyc
c2        sfo
c3        la

Fact Table: sale
orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
105       3/8/97   111      p1       c3        5     50

Dimension Table: customer
custId   name    address     city
53       joe     10 main     sfo
81       fred    12 main     sfo
111      sally   80 willow   la
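
To show how a star schema is queried, the sketch below loads the product, store and sale rows from the slide into an in-memory SQLite database and joins the fact table to two dimensions. SQLite is an assumption for the example, and the date and custId columns are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale    (orderId TEXT, prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store   VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale    VALUES ('o100', 'p1', 'c1', 1, 12),
                           ('o102', 'p2', 'c1', 2, 11),
                           ('105',  'p1', 'c3', 5, 50);
""")

# The fact table joins directly to each dimension table (one hop per dimension).
sales_by_city = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId   = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()

print(sales_by_city)
```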

Snowflake Schema

Fact Table: store
storeId   cityId   tId   mgr
s5        sfo      t1    joe
s7        sfo      t2    fred
s9        la       t1    nancy

Dimension Table: sType
tId   size    location
t1    small   downtown
t2    large   suburbs

Dimension Table: city
cityId   pop   regId
sfo      1M    north
la       5M    south

region
regId   name
north   cold region
south   warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing



The Need For Data Quality

 Difficulty in decision making


 Time delays in operation
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with
– error detection
– error rework
– customer service
– fixing customer problems

Six Steps To Data Quality

1. Understand information flow in the organization
   • Identify authoritative data sources
   • Interview employees & customers
2. Identify potential problem areas & assess impact
   • Data entry points
   • Cost of bad data
3. Measure quality of data
   • Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & load data
   • Use data cleansing tools to clean data at the source
   • Load only clean data into the data warehouse
5. Continuous monitoring
   • Schedule periodic cleansing of source data
6. Identify areas of improvement
   • Identify & correct cause of defects
   • Refine data capture mechanisms at source
   • Educate users on importance of DQ

Data Quality Solution
Customized Programs
 Strengths:
– Addresses specific needs
– No bulky one time investment
 Limitations
– Tons of Custom programs in different environments are difficult to
manage
– Minor alterations demand coding efforts
Data Quality Assessment tools
 Strength
– Provide automated assessment
 Limitation
– No measure of data accuracy

Data Quality Solution

Business Rule Discovery tools


 Strengths
– Detect Correlation in data values
– Can detect Patterns of behavior that indicate fraud
 Limitations
– Not all variables can be discovered
– Some discovered rules might not be pertinent
– There may be performance problems with large files or with many
fields.

Data Reengineering & Cleansing tools


 Strengths
– Usually are integrated packages with cleansing features as Add-on
 Limitations 
– Error prevention at source is usually absent
– The ETL tools have limited cleansing facilities

Tools In The Market
 Business Rule Discovery Tools
– Integrity Data Reengineering Tool from Vality Technology
– Trillium Software System from Harte -Hanks Data Technologies
– Migration Architect from DB Star
 Data Reengineering & Cleansing Tools
– Carlton Pureview from Oracle
– ETI-Extract from Evolutionary Technologies
– PowerMart from Informatica Corp
– Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools
– Migration Architect, Evoke Axio from Evoke Software
– Wizrule from Wizsoft
 Name & Address Cleansing Tools
– Centrus Suite from Sagent
– I.d.centric from First Logic

Data Extraction, Transformation, Load



ETL Architecture

[Diagram: ETL data flow in five stages]
• Data Collection: visitors (web browsers, the Internet) produce web server logs & e-comm transaction data; other inputs are external data (demographics, household, webographics, income) and other OLTP systems
• Data Extraction: scheduled extraction into flat files and an RDBMS
• Data Transformation: staging area (clean, transform, match, merge), supported by a metadata repository
• Data Loading: scheduled loading
• Data Storage & Integration: enterprise data warehouse

ETL Architecture

Data Extraction:
• Rummages through a file or database
• Uses some criteria for selection
• Identifies qualified data and
• Transports the data over onto another file or database

Data Extraction – Cleanup
• Restructuring of records or fields
• Removal of operational-only data
• Supply of missing field values
• Data integrity checks
• Data consistency and range checks, etc.

Data transformation
• Integrating dissimilar data types
• Changing codes
• Adding a time attribute
• Summarizing data
• Calculating derived values
• Renormalizing data

Data loading
• Initial and incremental loading
• Updating of metadata
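
The four stages above can be sketched end to end in a few lines. Everything here is hypothetical (function names, field names, the status code map and the sample rows); it only illustrates selection by criteria, cleanup with a range check and a default for missing values, code changes with an added time attribute, and loading.

```python
from datetime import date


def extract(rows, predicate):
    """Select qualified rows and copy them out of the source."""
    return [dict(row) for row in rows if predicate(row)]


def clean(rows, default_city="UNKNOWN"):
    """Drop rows failing a range check and supply missing field values."""
    cleaned = []
    for row in rows:
        if row.get("qty") is None or row["qty"] < 0:  # range check
            continue
        cleaned.append({**row, "city": row.get("city") or default_city})
    return cleaned


def transform(rows, load_date, code_map):
    """Change codes and add a time attribute."""
    return [
        {**row, "status": code_map.get(row["status"], row["status"]), "load_date": load_date}
        for row in rows
    ]


def load(target, rows):
    """Append rows to the target and report how many were loaded."""
    target.extend(rows)
    return len(rows)


source_rows = [
    {"id": 1, "status": "A", "qty": 5, "city": "nyc"},
    {"id": 2, "status": "I", "qty": -1, "city": "sfo"},  # fails the range check
    {"id": 3, "status": "A", "qty": 2, "city": None},    # missing field value
    {"id": 4, "status": "X", "qty": 7, "city": "la"},    # not selected by the criteria
]
warehouse = []

extracted = extract(source_rows, lambda row: row["status"] in {"A", "I"})
cleaned = clean(extracted)
transformed = transform(cleaned, date(2009, 1, 1), {"A": "ACTIVE", "I": "INACTIVE"})
loaded = load(warehouse, transformed)
print(loaded, "rows loaded")
```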

Why ETL?

• Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.

• The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.

• To solve the problem, companies use extract, transform and load (ETL) software.

• The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

Major components involved in ETL Processing

 Design manager
Lets developers define source-to-target mappings, transformations, process flows, and jobs
 Meta data management
Provides a repository to define, document, and manage information about the ETL design and runtime
processes
 Extract
The process of reading data from a database.
 Transform
The process of converting the extracted data
 Load
The process of writing the data into the target database.
 Transport services
ETL tools use network and file protocols to move data between
source and target systems and in-memory protocols to move data
between ETL run-time components.
 Administration and operation
ETL utilities let administrators schedule, run, monitor ETL jobs, log
all events, manage errors, recover from failures, reconcile outputs
with source systems

ETL Tools

• Provide a facility to specify a large number of transformation rules with a GUI
• Generate programs to transform data
• Handle multiple data sources
• Handle data redundancy
• Generate metadata as output
• Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation

• PowerCenter/PowerMart from Informatica
• Data Mart Solution from Sagent Technology
• DataStage from Ascential

Metadata Management



What Is Metadata?

Metadata is Information...

 That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
 About the data being captured and loaded into the Warehouse
 Documented in IT tools that improves both business and technical understanding of data
and data-related processes

Importance Of Metadata
Locating Information
Time spent in looking for information.
How often information is found?
What poor decisions were made based on the incomplete information?
How much money was lost or earned as a result?
Interpreting information
How many times have businesses needed to rework or recall products?
 What impact does it have on the bottom line ?
How many mistakes were due to misinterpretation of existing documentation?
How much interpretation results from too much metadata?
How much time is spent trying to determine if any of the metadata is accurate?
Integrating information
How various data perspectives connect together?
How much time is spent trying to figure out that?
How much does the inefficiency and lack of metadata affect decision making

Requirements for DW Metadata Management

• Provide a simple catalogue of business metadata descriptions and views
• Document/manage metadata descriptions from an integrated development environment
• Enable DW users to identify and invoke pre-built queries against the data stores
• Design and enhance new data models and schemas for the data warehouse
• Capture data transformation rules between the operational and data warehousing databases
• Provide change impact analysis, and update across these technologies

Consumers of Metadata

 Technical Users
• Warehouse administrator
• Application developer
 Business Users -Business metadata
• Meanings
• Definitions
• Business Rules
 Software Tools
• Used in DW life-cycle development
• Metadata requirements for each tool must be identified
• The tool-specific metadata should be analysed for inclusion in the enterprise
metadata repository
• Previously captured metadata should be electronically transferred from the
enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third Party Bridging Tools


 Oracle Exchange
– Technology of choice for a long list of repository, enterprise and
workgroup vendors
 Reischmann-Informatik-Toolbus
– Features include facilitation of selective bridging of metadata
 Ardent Software/ Dovetail Software -Interplay
– ‘Hub and Spoke’ solution for enabling metadata interoperability
– Ardent focussing on own engagements, not selling it as independent
product
 Informix's Metadata Plug-ins
– Available with Ardent Datastage version 3.6.2 free of cost for Erwin,
Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools
Metadata Repositories
 IBM, Oracle and Microsoft to offer free or near-free basic
repository services
 Enable organisations to reuse metadata across technologies
 Integrate DB design, data transformation and BI tools from
different vendors
 Multi-tool vendors taking a bridged or federated rather than
integrated approach to sharing metadata
 Both IBM and Oracle have multiple repositories for different lines
of products — e.g., One for AD and one for DW, with bridges
between them

Trends in the Metadata Management Tools

Metadata Interchange Standards


 CDIF (CASE Data Interchange Format)
– Most frequently used interchange standard
– Addresses only a limited subset of metadata artifacts
 OMG (Object Management Group)-CWM
– XML-addresses context and data meaning, not presentation
– Can enable exchange over the web employing industry standards for
storing and sharing programming data
– Will allow sharing of UML and MOF objects b/w various development
tools and repositories
 MDC (Metadata Coalition)
– Based on XML/UML standards
– Promoted by Microsoft Along With 20 partners including Object
Management Group (OMG), Oracle Carleton Group, CA-PLATINUM
Technology (Founding Member), Viasoft

OLAP

Agenda
 OLAP Definition
 Distinction between OLTP and OLAP
 MDDB Concepts
 Implementation Techniques
 Architectures
 Features
 Representative Tools

OLAP: On-Line Analytical Processing

• OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
  • Used interchangeably with ‘BI’
  • Multidimensional view of data is the foundation of OLAP
  • Users: analysts, decision makers

Distinction between OLTP and OLAP

Source of data
  OLTP system: Operational data; OLTPs are the original source of the data
  OLAP system: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP system: To control and run fundamental business tasks
  OLAP system: Decision support

What the data reveals
  OLTP system: A snapshot of ongoing business processes
  OLAP system: Multi-dimensional views of various kinds of business activities

Inserts and updates
  OLTP system: Short and fast inserts and updates initiated by end users
  OLAP system: Periodic long-running batch jobs refresh the data

MDDB Concepts

A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
• intimately related and
• stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
• The edges of the cube are called dimensions
• Individual items within each dimension are called members

RDBMS v/s MDDB: Increased Complexity...
Relational DBMS

MODEL          COLOR   DEALER    VOL.
MINI VAN       BLUE    Clyde     6
MINI VAN       BLUE    Gleason   3
MINI VAN       BLUE    Carr      2
MINI VAN       RED     Clyde     5
MINI VAN       RED     Gleason   3
MINI VAN       RED     Carr      1
MINI VAN       WHITE   Clyde     3
MINI VAN       WHITE   Gleason   1
MINI VAN       WHITE   Carr      4
SPORTS COUPE   BLUE    Clyde     3
SPORTS COUPE   BLUE    Gleason   3
SPORTS COUPE   BLUE    Carr      3
SPORTS COUPE   RED     Clyde     4
SPORTS COUPE   RED     Gleason   3
SPORTS COUPE   RED     Carr      6
SPORTS COUPE   WHITE   Clyde     2
SPORTS COUPE   WHITE   Gleason   3
SPORTS COUPE   WHITE   Carr      5
SEDAN          BLUE    Clyde     4
SEDAN          BLUE    Gleason   3
SEDAN          BLUE    Carr      2
...            ...     ...       ...

27 rows x 4 columns = 108 cells

MDDB

[Figure: Sales Volumes cube with MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr) as its three dimensions]

3 x 3 x 3 = 27 cells
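
The cell counts can be reproduced with a short sketch that loads relational rows into a three-dimensional array. Only the Mini Van rows from the slide are loaded; the nested-list layout is an assumption for illustration.

```python
models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# Relational rows (model, color, dealer, volume) taken from the slide.
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6), ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2), ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3), ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3), ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]

# 3-D array indexed [model][color][dealer]; None marks cells not loaded here.
cube = [[[None] * len(dealers) for _ in colors] for _ in models]
for model, color, dealer, volume in rows:
    cube[models.index(model)][colors.index(color)][dealers.index(dealer)] = volume

combinations = len(models) * len(colors) * len(dealers)
relational_cells = combinations * 4  # every row repeats three keys plus the value
cube_cells = combinations            # the cube stores one value per combination

print(relational_cells, "relational cells vs", cube_cells, "cube cells")
```

In the cube, the dimension members are stored once as axis labels instead of being repeated on every row.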

Benefits of MDDB over RDBMS
 Ease of Data Presentation & Navigation
– A great deal of information is gleaned immediately upon direct inspection of
the array
– User is able to view data along presorted dimensions with data arranged in an
inherently more organized, and accessible fashion than the one offered by the
relational table.
 Storage Space
– Very low Space Consumption compared to Relational DB
 Performance
– Gives much better performance.
– Relational DB may give comparable results only through database tuning
(indexing, keys etc), which may not be possible for ad-hoc queries.
 Ease of Maintenance
– No overhead as data is stored in the same way it is viewed. In Relational DB,
indexes, sophisticated joins etc. are used which require considerable storage
and maintenance

12/08/21 273
273 ©
© 2009
2009 Wipro
Wipro Ltd
Ltd -- Confidential
Confidential
Issues with MDDB

• Sparsity
- Input data in applications are typically sparse
-Increases with increased dimensions
• Data Explosion
-Due to Sparsity
-Due to Summarization
• Performance
-Doesn’t perform better than RDBMS at high data
volumes (>20-30 GB)

12/08/21 274
274 ©
© 2009
2009 Wipro
Wipro Ltd
Ltd -- Confidential
Confidential
Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact , then blank cell


is left behind. Employee Age
LAST NAME EMP# AGE Smith 21
SMITH 01 21 Regan 19
REGAN 12 Sales Volumes
19
FOX 31 63 L
Fox 63

WELD M 14 6 5 314
Miini Van A
S
Weld 31
O T
KELLY D Coupe54 3 5 275 Kelly 27
E N
LINK L 03 56 A
Sedan 4 3 2 M Link 56
KRANZ 41 45 E
Blue Red White Kranz 45
LUCUS 33 COLOR41
WEISS 23 19 Lucas 41

Weiss 19

31 41 23 01 14 54 03 12 33

EMPLOYEE #

12/08/21 275
275 ©
© 2009
2009 Wipro
Wipro Ltd
Ltd -- Confidential
Confidential
OLAP Features

 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods
 What-if scenarios
 Slicing / Dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down/up along the hierarchy
 Reach-through / Drill-through to underlying detail data

Features of OLAP - Rotation

• Complex queries and sorts in a relational environment are translated to a
simple rotation.

Sales Volumes

View #1 (MODEL by COLOR)

            Blue   Red   White
Mini Van     6      5      4
Coupe        3      5      5
Sedan        4      3      2

( ROTATE 90° )

View #2 (COLOR by MODEL)

            Mini Van   Coupe   Sedan
Blue           6         3       4
Red            5         5       3
White          4         5       2

A 2-dimensional array has 2 views.
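
As a minimal sketch, the rotation of the 2-dimensional Sales Volumes array amounts to swapping its row and column axes. The values come from the slide; the `rotate` helper is a hypothetical name used only for illustration.

```python
MODELS = ["Mini Van", "Coupe", "Sedan"]
COLORS = ["Blue", "Red", "White"]

# View #1: rows are models, columns are colors (values from the slide).
view1 = [
    [6, 5, 4],  # Mini Van
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]

def rotate(view):
    """Swap the row and column axes of a 2-D array."""
    return [list(column) for column in zip(*view)]

# View #2: rows are colors, columns are models.
view2 = rotate(view1)
for color, row in zip(COLORS, view2):
    print(color, row)
```

No query or sort is re-run here; the same cells are simply read along the other axis.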


Features of OLAP - Rotation

Sales Volumes

[Figure: six views of the same 3-dimensional array, each reached by a 90°
rotation. Dimension members: MODEL (Mini Van, Coupe, Sedan), COLOR (Blue,
Red, White), DEALERSHIP (Carr, Gleason, Clyde).]

View   Vertical axis   Horizontal axis   Depth axis
#1     MODEL           COLOR             DEALERSHIP
#2     COLOR           MODEL             DEALERSHIP
#3     COLOR           DEALERSHIP        MODEL
#4     DEALERSHIP      COLOR             MODEL
#5     DEALERSHIP      MODEL             COLOR
#6     MODEL           DEALERSHIP        COLOR

A 3-dimensional array has 6 views.
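
The counts of 2 and 6 views follow from the number of ways to order the axes. The sketch below enumerates them; the ordering numbers it prints are not meant to match the view numbers on the slide.

```python
from itertools import permutations

DIMENSIONS = ("MODEL", "COLOR", "DEALERSHIP")

def views(dimensions):
    """Every ordering of the axes corresponds to one view of the array."""
    return list(permutations(dimensions))

for number, axes in enumerate(views(DIMENSIONS), start=1):
    print(f"ordering {number}: {' x '.join(axes)}")

print(len(views(("MODEL", "COLOR"))))  # views of a 2-dimensional array
print(len(views(DIMENSIONS)))          # views of a 3-dimensional array
```

In general an n-dimensional array has n! such orderings, so a fourth dimension would give 24 views.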


Features of OLAP - Slicing / Filtering

 An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL = Mini Van, Coupe;
DEALERSHIP = Carr, Clyde; COLOR = Normal Blue, Metal Blue.]
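
A slice like the one in the figure can be sketched as a filter over cube coordinates. The cube contents and the `slice_cube` helper below are hypothetical; only the member names are taken from the slide.

```python
# Hypothetical cube: (model, color, dealership) -> sales volume.
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 6,
    ("Mini Van", "Metal Blue", "Clyde"): 4,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Coupe", "Red", "Carr"): 5,
    ("Sedan", "Metal Blue", "Gleason"): 2,
}
AXES = ("model", "color", "dealership")

def slice_cube(cells, **wanted):
    """Keep the cells whose members fall inside the requested set for each named axis."""
    def keep(coords):
        return all(coords[AXES.index(axis)] in members
                   for axis, members in wanted.items())
    return {coords: value for coords, value in cells.items() if keep(coords)}

subset = slice_cube(
    cube,
    model={"Mini Van", "Coupe"},
    color={"Normal Blue", "Metal Blue"},
    dealership={"Carr", "Clyde"},
)
for coords, volume in subset.items():
    print(coords, volume)
```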
Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

REGION Midwest

DISTRICT Chicago St. Louis Gary

DEALERSHIP Clyde Gleason Carr Levi Lucas Bolton

Sales at region/District/Dealership Level

• Moving up and moving down in a hierarchy is referred to as
“drill-up” / “roll-up” and “drill-down”.
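
Roll-up and drill-down along the organization dimension can be sketched as follows. The sales figures are hypothetical, and the assignment of dealerships to districts is an assumption, since the slide only lists the members of each level.

```python
from collections import defaultdict

# Assumed parent links; the slide does not state which dealership is in which district.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales per dealership.
dealership_sales = {"Clyde": 10, "Gleason": 20, "Carr": 15,
                    "Levi": 5, "Lucas": 30, "Bolton": 20}

def roll_up(sales, parent_of):
    """Aggregate sales one level up the hierarchy."""
    totals = defaultdict(int)
    for member, amount in sales.items():
        totals[parent_of[member]] += amount
    return dict(totals)

def drill_down(sales, parent_of, parent):
    """Return the child-level figures that make up one parent total."""
    return {m: amount for m, amount in sales.items() if parent_of[m] == parent}

district_sales = roll_up(dealership_sales, DISTRICT_OF)
region_sales = roll_up(district_sales, REGION_OF)
print(region_sales)
print(district_sales)
print(drill_down(dealership_sales, DISTRICT_OF, "Gary"))
```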

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by Region (East, West, Central) for the years 1999 and 2000]

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by Region (East, West, Central) for the four quarters of 1999]

• Drill-down from Year to Quarter

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by Region (East, West, Central) for January, February and March 1999]

• Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

 MOLAP - Multidimensional OLAP
 A multidimensional database provides both the database and the application logic layer
 ROLAP - Relational OLAP
 Accesses data stored in a relational data warehouse for OLAP analysis
 Database and application logic provided as separate layers
 HOLAP - Hybrid OLAP
 The OLAP server routes queries first to the MDDB, then to the RDBMS, and
the result is processed on the fly in the server
 DOLAP - Desktop OLAP
 Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Diagram components: OLAP Cube; OLAP Calculation Engine; clients: Web Browser,
OLAP Tools, OLAP Applications]

MOLAP - Features

 Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 Aggregation and calculation capabilities
 Read/write analytic applications
 Specialized data structures for
 Maximum query performance
 Optimum space utilization

ROLAP - Standard SQL storage

[Diagram components: Relational DW; SQL; MDDB - Relational Mapping; OLAP
Calculation Engine; clients: Web Browser, OLAP Tools, OLAP Applications]

ROLAP - Features

 Three-tier hardware/software architecture:
 GUI on client; multidimensional processing on mid-tier server; target
database on database server
 Processing split between mid-tier & database servers
 Ad hoc query capabilities to very large databases
 DW integration
 Data scalability
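
The split between mid-tier and database server can be sketched with Python's built-in `sqlite3` module standing in for the relational data warehouse. The `sales` table, its rows and the `rolap_query` helper are hypothetical; the point is only that a multidimensional request is translated into SQL and the aggregation runs in the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, dealership TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Mini Van", "Blue", "Carr", 6),
        ("Mini Van", "Red", "Clyde", 5),
        ("Coupe", "Blue", "Carr", 3),
        ("Coupe", "Blue", "Clyde", 2),
        ("Sedan", "White", "Gleason", 2),
    ],
)

ALLOWED_DIMENSIONS = {"model", "color", "dealership"}

def rolap_query(connection, dimensions):
    """Translate a list of dimensions into a GROUP BY query and run it."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    columns = ", ".join(dimensions)
    sql = f"SELECT {columns}, SUM(volume) FROM sales GROUP BY {columns} ORDER BY {columns}"
    return connection.execute(sql).fetchall()

by_model = rolap_query(conn, ["model"])
by_model_color = rolap_query(conn, ["model", "color"])
print(by_model)
print(by_model_color)
```

Column names are checked against a fixed set before being placed in the SQL text, because identifiers cannot be passed as bound parameters.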

HOLAP - Combination of RDBMS and MDDB

[Diagram components: OLAP Cube; Relational DW; SQL; OLAP Calculation Engine;
Any Client: Web Browser, OLAP Tools, OLAP Applications]

HOLAP - Features

 RDBMS used for detailed data stored in large databases
 MDDB used for fast, read/write OLAP analysis and calculations
 Scalability of RDBMS and MDDB performance
 Calculation engine provides full analysis features
 Source of data transparent to end user
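
How a hybrid server might hide the source of the data can be sketched as below. A dictionary stands in for the MDDB summary and an in-memory `sqlite3` database for the RDBMS; the table, the figures and the `holap_query` function are all hypothetical.

```python
import sqlite3

# Relational store with transaction-level detail (hypothetical rows).
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE sales (district TEXT, dealership TEXT, volume INTEGER)")
rdbms.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Chicago", "Clyde", 10), ("Chicago", "Gleason", 20), ("Gary", "Lucas", 30)],
)

# Summary "cube" standing in for the MDDB: district -> pre-computed total.
mddb = {"Chicago": 30, "Gary": 30}

def holap_query(district, dealership=None):
    """Answer from the MDDB summary when possible, otherwise from the RDBMS detail."""
    if dealership is None and district in mddb:
        return mddb[district], "MDDB"
    sql = "SELECT COALESCE(SUM(volume), 0) FROM sales WHERE district = ?"
    params = [district]
    if dealership is not None:
        sql += " AND dealership = ?"
        params.append(dealership)
    (total,) = rdbms.execute(sql, params).fetchone()
    return total, "RDBMS"

print(holap_query("Chicago"))             # summary level, served by the MDDB
print(holap_query("Chicago", "Gleason"))  # detail level, served by the RDBMS
```

The caller uses one function for both levels; the second element of the result is returned here only to make the routing visible.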

Architecture Comparison

Criteria compared: MOLAP / ROLAP / HOLAP

Definition
  MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
  ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
  MOLAP: Good design: 3 - 10 times
  ROLAP: No Sparsity
  HOLAP: Sparsity exists only in MDDB part

Data explosion due to Summarization
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query Execution Speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum - if the data is fetched from RDBMS then it's like ROLAP,
         otherwise like MOLAP

Cost
  MOLAP: Medium: MDDB Server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB Server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data & it needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

 Oracle Express Products
 Hyperion Essbase
 Cognos - PowerPlay
 Seagate - Holos
 SAS
 MicroStrategy - DSS Agent
 Informix MetaCube
 Brio Query
 Business Objects / Web Intelligence

Sample OLAP Applications

 Sales Analysis
 Financial Analysis
 Profitability Analysis
 Performance Analysis
 Risk Management
 Profiling & Segmentation
 Scorecard Application
 NPA Management
 Strategic Planning
 Customer Relationship Management (CRM)

Data Warehouse Testing

Data Warehouse Testing Overview
 There is an exponentially increasing cost associated with finding
software defects later in the development lifecycle. In data
warehousing, this is compounded because of the additional business
costs of using incorrect data to make critical business decisions.

 The methodology required for testing a Data Warehouse is different
from testing a typical transaction system.

Difference in Testing Data Warehouse and Transaction System

Data warehouse testing differs from transaction system testing on the following counts:

– User-triggered vs. system-triggered
– Volume of test data
– Possible scenarios / test cases
– Programming for testing challenge

Difference in Testing Data Warehouse and Transaction System [contd…]
 User-triggered vs. system-triggered

In a data warehouse, most of the testing is system-triggered. In a
production/source system, most of the testing covers the processing of
individual transactions, which are driven by some input from the users
(application form, servicing request). Very few test cycles there cover
system-triggered scenarios (such as billing or valuation).

Difference in Testing Data Warehouse and Transaction System [contd…]
 Volume of test data
The test data in a transaction system is a very small sample of the
overall production data. A data warehouse typically needs a large volume
of test data, because one tries to cover the maximum possible combinations
of dimensions and facts.
 Possible scenarios / test cases
In a data warehouse, the permutations and combinations one can possibly
test are virtually unlimited, because the core objective of a data
warehouse is to allow all possible views of the data.

Difference in Testing Data Warehouse and Transaction System [contd…]

• Programming for testing challenge

In transaction systems, users/business analysts typically test the output
of the system. In a data warehouse, most of the data quality testing and
ETL testing is done at the back end by running separate stand-alone
scripts. These scripts compare the data before transformation with the
data after transformation.
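
A stand-alone comparison script of this kind might look like the sketch below. The record layout, the transformation rule and the `reconcile` helper are hypothetical; a real script would read both sides from the source system and the warehouse.

```python
def reconcile(source_rows, target_rows, transform, key):
    """Apply the expected transformation to the source and diff it against the target."""
    expected = {key(row): row for row in map(transform, source_rows)}
    actual = {key(row): row for row in target_rows}
    return {
        "missing_in_target": sorted(expected.keys() - actual.keys()),
        "unexpected_in_target": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys() if expected[k] != actual[k]
        ),
    }

def transform(row):
    """Hypothetical rule: trim and upper-case the name, convert cents to currency units."""
    return {"id": row["id"],
            "name": row["name"].strip().upper(),
            "amount": row["amount_cents"] / 100}

source = [
    {"id": 1, "name": " smith ", "amount_cents": 1250},
    {"id": 2, "name": "regan", "amount_cents": 990},
    {"id": 3, "name": "fox", "amount_cents": 500},
]
target = [
    {"id": 1, "name": "SMITH", "amount": 12.5},
    {"id": 2, "name": "REGAN", "amount": 9.0},  # deliberately wrong amount
]

report = reconcile(source, target, transform, key=lambda row: row["id"])
print(report)
```

The report separates rows that never arrived from rows that arrived with wrong values, which usually point to different kinds of ETL defects.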

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:

 'Back-end' testing, where the source systems' data is compared to the
end-result data in the loaded area
 'Front-end' testing, where the user checks the data by comparing their
MIS with the data displayed by end-user tools such as OLAP
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

Requirements testing

The main aim of requirements testing is to check the stated requirements
for completeness.
Requirements can be tested on the following factors:
 Are the requirements Complete?
 Are the requirements Singular?
 Are the requirements Unambiguous?
 Are the requirements Developable?
 Are the requirements Testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL
procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:

• Whether the ETLs are accessing and picking up the right data from the right source.

• Whether all the data transformations are correct according to the business rules and
the data warehouse is correctly populated with the transformed data.

• Testing the rejected records that don’t fulfil transformation rules.
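
A unit test for one transformation and its reject handling could be sketched like this. The business rule (age between 18 and 65) and the function names are hypothetical and only show the shape of such a check.

```python
def transform_record(record):
    """Hypothetical business rule: age must be an integer between 18 and 65."""
    age = int(record["age"])  # raises ValueError for non-numeric input
    if not 18 <= age <= 65:
        raise ValueError(f"age out of range: {age}")
    return {"emp_no": record["emp_no"],
            "last_name": record["last_name"].title(),
            "age": age}

def run_etl(records):
    """Split the input into loaded rows and rejected rows with a reason."""
    loaded, rejected = [], []
    for record in records:
        try:
            loaded.append(transform_record(record))
        except (KeyError, ValueError) as error:
            rejected.append((record, str(error)))
    return loaded, rejected

loaded, rejected = run_etl([
    {"emp_no": "01", "last_name": "SMITH", "age": "21"},
    {"emp_no": "31", "last_name": "FOX", "age": "sixty-three"},  # not numeric
    {"emp_no": "99", "last_name": "DOE", "age": "17"},           # out of range
])
print(loaded)
for record, reason in rejected:
    print("rejected:", record["emp_no"], "-", reason)
```

Checking the rejected list is as important as checking the loaded rows, since a record that is silently dropped would otherwise go unnoticed.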

Unit Testing…

Unit testing the report data:

• Verify report data with source:
Data in a data warehouse is stored at an aggregate level compared to the
source systems. The QA team should verify the granular data stored in the
data warehouse against the source data available.
• Field level data verification:
The QA team must understand the linkages for the fields displayed in the report
and should trace them back and compare them with the source systems.
• Derivation formulae/calculation rules should be verified.

Integration Testing
Integration testing will involve the following:
 Sequence of ETL jobs in a batch.
 Initial loading of records into the data warehouse.
 Incremental loading of records at a later date to verify the newly
inserted or updated data.
 Testing the rejected records that don’t fulfil transformation rules.
 Error log generation.
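
The initial load, incremental load and error log checks can be sketched together. The in-memory `warehouse` dictionary, the row layout and the `load` function are hypothetical stand-ins for a real batch job.

```python
def load(warehouse, batch, error_log):
    """Upsert a batch keyed by 'id'; rows without an id go to the error log."""
    inserted = updated = 0
    for row in batch:
        if row.get("id") is None:
            error_log.append(f"rejected row without id: {row}")
            continue
        if row["id"] in warehouse:
            updated += 1
        else:
            inserted += 1
        warehouse[row["id"]] = row
    return inserted, updated

warehouse, error_log = {}, []

# Initial load into an empty warehouse.
initial = load(warehouse, [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}], error_log)

# Incremental load at a later date: one update, one insert, one reject.
incremental = load(
    warehouse,
    [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}, {"id": None, "amount": 5}],
    error_log,
)
print(initial, incremental)
print(error_log)
```

An integration test would assert on the insert and update counts of each run, on the final state of the updated row, and on the contents of the error log.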

Performance Testing

Performance testing should check for:

 ETL processes completing within the time window.
 Monitoring and measuring the data quality issues.
 Refresh times for standard/complex reports.
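
Checking that a job finishes within its window can be sketched with a simple timer. The job below is a trivial stand-in and the five-second window is an arbitrary assumption; a real check would wrap the actual ETL batch or report refresh.

```python
import time

def run_within_window(job, window_seconds):
    """Run a job and report its result, elapsed time and whether it met the window."""
    start = time.perf_counter()
    result = job()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= window_seconds

def sample_etl_job():
    """Stand-in for a real ETL step: sums a hypothetical batch of amounts."""
    return sum(range(100_000))

result, elapsed, on_time = run_within_window(sample_etl_job, window_seconds=5.0)
print(f"finished in {elapsed:.4f}s, within window: {on_time}")
```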

Acceptance testing

Here the system is tested with full functionality and is expected to
function as in production. At the end of UAT, the system should be
acceptable to the client for use in terms of ETL process integrity and
business functionality and reporting.

Questions

Thank You
