
Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
5 Data Extraction, Transformation, Load


Content [contd]
6 Metadata Management
7 OLAP
8 Data Warehouse Testing


An Overview
Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
- "A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." - WH Inmon
- "A Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user oriented data access and reporting tools let users get at the data for decision support." - Babcock
- "A data warehouse is a relational database - a copy of transaction data specifically structured for query and analysis." - Ralph Kimball
- In simple terms: data warehousing is the collection of data from different systems to support business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). These can be any relational databases or flat files.
- ETL Tools: To extract, cleanse, transform (aggregate, join) and load the data from the sources to the target.
- Maintenance and Administration Tools: To authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using dimensional data modeling techniques, and to map the source and target files.
- Databases: The target databases and data marts that make up the data warehouse, structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Used to get reports and analyze the data in the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. From there it is loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
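A minimal sketch of this staging-area flow, using SQLite purely for illustration; the table and column names (stg_sales, dw_sales, mart_east_sales) are invented and not part of the original deck. Records are checked in staging, only valid rows are loaded into the warehouse table, and a data mart is exposed for one user group.

import sqlite3

# Hypothetical illustration of the staging -> warehouse -> data mart flow.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_sales (sale_id TEXT, sale_date TEXT, region TEXT, amount REAL);
    CREATE TABLE dw_sales  (sale_id TEXT, sale_date TEXT, region TEXT, amount REAL);
""")

# Cleansed, transformed records land in the staging area first.
con.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?)",
                [("o100", "1997-07-01", "East", 12.0),
                 ("o102", "1997-07-02", "West", None)])   # bad record: missing amount

# Only rows that pass the staging checks are loaded into the warehouse.
con.execute("""
    INSERT INTO dw_sales
    SELECT * FROM stg_sales
    WHERE amount IS NOT NULL AND sale_date IS NOT NULL
""")

# A data mart for one user group is carved out of the warehouse (here, as a view).
con.execute("CREATE VIEW mart_east_sales AS SELECT * FROM dw_sales WHERE region = 'East'")
print(con.execute("SELECT * FROM mart_east_sales").fetchall())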


Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP systems; in OLAP, the dimensional data model is most common.

E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, book or student.
- Relationship: Relates entities to other entities.

Different perspectives of data modeling:
- Conceptual Data Model
- Logical Data Model
- Physical Data Model

Types of dimensional data models most commonly used:
- Star Schema
- Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the Time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about an attribute. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
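A hedged, minimal sketch of why surrogate keys help with Slowly Changing Dimensions; the dictionary layout and the apply_scd2 helper are invented for illustration. A Type-2 change expires the old dimension row and issues a new surrogate key, so facts loaded earlier still join to the attribute values that were true at load time.

# Minimal Type-2 slowly-changing-dimension sketch; names are illustrative only.
customer_dim = [
    # surrogate key (sk), natural key (cust_id), attribute, current-version flag
    {"sk": 1, "cust_id": 53, "city": "sfo", "current": True},
]
next_sk = 2

def apply_scd2(cust_id, new_city):
    """Expire the current row for cust_id and add a new row with a fresh surrogate key."""
    global next_sk
    for row in customer_dim:
        if row["cust_id"] == cust_id and row["current"]:
            if row["city"] == new_city:
                return                      # nothing changed
            row["current"] = False          # expire the old version
    customer_dim.append({"sk": next_sk, "cust_id": cust_id,
                         "city": new_city, "current": True})
    next_sk += 1

apply_scd2(53, "la")
# Facts loaded before the move keep surrogate key 1 (city sfo);
# facts loaded afterwards reference surrogate key 2 (city la).
print(customer_dim)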



Star Schema
Dimension Table - product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

Dimension Table - store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

Fact Table - sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  105      3/8/97  111     p1      c3       5    50

Dimension Table - customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la
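A hedged sketch of how such a star schema is queried, using SQLite only as a stand-in relational engine: the fact table joins to each dimension on its key, and the measures are aggregated by dimension attributes.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
    CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER,
                           prodId TEXT, storeId TEXT, qty INTEGER, amt REAL);

    INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
    INSERT INTO store   VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
    INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                                (81,'fred','12 main','sfo'),
                                (111,'sally','80 willow','la');
    INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                            ('o102','2/7/97',53,'p2','c1',2,11),
                            ('105','3/8/97',111,'p1','c3',5,50);
""")

# Typical star-schema query: total sales amount by product name and store city.
rows = con.execute("""
    SELECT p.name, s.city, SUM(f.amt) AS total_amt
    FROM sale f
    JOIN product p ON f.prodId  = p.prodId
    JOIN store   s ON f.storeId = s.storeId
    GROUP BY p.name, s.city
""").fetchall()
print(rows)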


Snowflake Schema
Dimension Table - store
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

Dimension Table - sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

Dimension Table - city
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

Dimension Table - region
  regId  name
  north  cold region
  south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where the speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not highly normalized and are frequently designed at a level of normalization short of third normal form.
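A short, hedged illustration of the extra joins a snowflaked dimension introduces, again using SQLite as a stand-in. The small sale fact table added here is an assumption (the slide shows only the dimension tables): reporting by region must walk store -> city -> region instead of reading a region column directly off a single store dimension.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE store  (storeId TEXT, cityId TEXT, tId TEXT, mgr TEXT);
    CREATE TABLE sType  (tId TEXT, size TEXT, location TEXT);
    CREATE TABLE city   (cityId TEXT, pop TEXT, regId TEXT);
    CREATE TABLE region (regId TEXT, name TEXT);
    CREATE TABLE sale   (storeId TEXT, amt REAL);   -- assumed fact table for the example

    INSERT INTO store  VALUES ('s5','sfo','t1','joe'), ('s7','sfo','t2','fred'), ('s9','la','t1','nancy');
    INSERT INTO sType  VALUES ('t1','small','downtown'), ('t2','large','suburbs');
    INSERT INTO city   VALUES ('sfo','1M','north'), ('la','5M','south');
    INSERT INTO region VALUES ('north','cold region'), ('south','warm region');
    INSERT INTO sale   VALUES ('s5',12), ('s7',11), ('s9',50);
""")

# Snowflaking normalizes the store dimension, so reporting by region
# needs the extra store -> city -> region joins.
print(con.execute("""
    SELECT r.name, SUM(f.amt)
    FROM sale f
    JOIN store  st ON f.storeId = st.storeId
    JOIN city   c  ON st.cityId = c.cityId
    JOIN region r  ON c.regId   = r.regId
    GROUP BY r.name
""").fetchall())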


Overview of Data Cleansing


The Need For Data Quality


- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with error detection, error rework, customer service and fixing customer problems


Six Steps To Data Quality


1. Understand Information Flow In Organization
   - Identify authoritative data sources
   - Interview employees & customers
   - Identify data entry points
   - Estimate the cost of bad data
2. Identify Potential Problem Areas & Assess Impact
3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values (a small sketch of such checks follows this list)
4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous Monitoring
   - Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   - Identify & correct the cause of defects
   - Refine data capture mechanisms at source
   - Educate users on the importance of DQ
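A minimal sketch of the "measure quality" and "load only clean data" steps referenced in the list above; the records, rules and field names are invented for illustration.

# Illustrative data-quality screen; rules and field names are made up.
records = [
    {"cust_id": 53,  "age": 34,  "email": "joe@example.com"},
    {"cust_id": 53,  "age": 34,  "email": "joe@example.com"},   # duplicate
    {"cust_id": 81,  "age": 217, "email": "fred@example.com"},  # out of range
    {"cust_id": 111, "age": 29,  "email": None},                # missing value
]

def quality_issues(rec, seen_ids):
    issues = []
    if rec["cust_id"] in seen_ids:
        issues.append("duplicate cust_id")
    if rec["email"] is None:
        issues.append("missing email")
    if not 0 <= rec["age"] <= 120:
        issues.append("age out of range")
    return issues

clean, rejected, seen = [], [], set()
for rec in records:
    issues = quality_issues(rec, seen)
    (rejected if issues else clean).append((rec, issues))
    seen.add(rec["cust_id"])

print(len(clean), "clean records loaded;", len(rejected), "sent back for cleansing")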

Data Quality Solution


Customized Programs
- Strengths: Address specific needs; no bulky one-time investment
- Limitations: Large numbers of custom programs in different environments are difficult to manage; minor alterations demand coding effort

Data Quality Assessment Tools
- Strength: Provide automated assessment
- Limitation: No measure of data accuracy


Data Quality Solution


Business Rule Discovery Tools
- Strengths: Detect correlation in data values; can detect patterns of behavior that indicate fraud
- Limitations: Not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths: Usually integrated packages with cleansing features as add-ons
- Limitations: Error prevention at source is usually absent; the ETL tools have limited cleansing facilities

Tools In The Market


- Business Rule Discovery Tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star
- Data Reengineering & Cleansing Tools: Carlton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology
- Data Quality Assessment Tools: Migration Architect, Evoke Axio from Evoke Software; Wizrule from Wizsoft
- Name & Address Cleansing Tools: Centrus Suite from Sagent; I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Diagram: ETL architecture. Data sources - visitors' web browsers on the Internet (web server logs and e-commerce transaction data as flat files), external data (demographics, household, webographics, income) and other OLTP systems - feed a scheduled extraction into a staging area (clean, transform, match, merge) backed by a metadata repository. Scheduled loading then moves the data into the RDBMS-based enterprise data warehouse. The stages are: data collection, data extraction, data transformation, data loading, and data storage & integration.]

ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata

(A minimal sketch of these extract, transform and load steps follows.)
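A minimal, hypothetical sketch of these steps; the file layout, field names, currency rule and table name are invented for illustration.

import csv, sqlite3, io

# --- Extract: read qualified rows from a (here in-memory) source file. ---
source = io.StringIO("order_id,amount,currency\no100,12.0,USD\no102,,USD\no105,50.0,EUR\n")
rows = [r for r in csv.DictReader(source) if r["amount"]]          # selection criterion

# --- Transform: change codes, add a time attribute, derive values. ---
fx = {"USD": 1.0, "EUR": 1.1}                                      # illustrative rates
load_date = "2009-01-01"
transformed = [
    (r["order_id"], float(r["amount"]) * fx[r["currency"]], load_date)
    for r in rows
]

# --- Load: initial load into the target table; real jobs would also run incremental loads. ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_orders (order_id TEXT, amount_usd REAL, load_date TEXT)")
con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
print(con.execute("SELECT COUNT(*), SUM(amount_usd) FROM fact_orders").fetchone())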


Why ETL ?
- Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
- The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
- To solve the problem, companies use extract, transform and load (ETL) software.
- The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.



Major components involved in ETL Processing


- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes
- Extract: The process of reading data from a database
- Transform: The process of converting the extracted data
- Load: The process of writing the data into the target database
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is Information...

- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes
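A small, hedged illustration of the what/when/who/where/how idea: the ColumnMetadata fields below are invented, not a standard, and simply show the kind of record a metadata repository might keep for one warehouse column.

from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    # Illustrative technical/business metadata for one warehouse column.
    table: str            # WHERE the data lives in the warehouse
    column: str
    business_name: str    # WHAT it means to the business
    source_system: str    # WHERE it was captured
    transformation: str   # HOW it was derived
    loaded_by: str        # WHO / which job loaded it
    load_schedule: str    # WHEN it is refreshed

m = ColumnMetadata(
    table="fact_sales", column="amt_usd",
    business_name="Sale amount in US dollars",
    source_system="order-entry OLTP", transformation="amt * fx_rate(currency)",
    loaded_by="nightly_etl_job", load_schedule="daily 02:00",
)
print(m)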


Importance Of Metadata
Locating information
- How much time is spent looking for information? How often is information found?
- What poor decisions were made based on incomplete information? How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products? What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together? How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis and updates across these technologies

Consumers of Metadata
- Technical users: warehouse administrators, application developers
- Business users: business metadata - meanings, definitions, business rules
- Software tools used in DW life-cycle development:
  - Metadata requirements for each tool must be identified
  - The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  - Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool



Trends in the Metadata Management Tools


Third Party Bridging Tools
- Oracle Exchange: Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik Toolbus: Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software Interplay: Hub-and-spoke solution for enabling metadata interoperability; Ardent is focussing on its own engagements, not selling it as an independent product
- Informix's Metadata Plug-ins: Available with Ardent Datastage version 3.6.2, free of cost, for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated, rather than integrated, approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards
- CDIF (CASE Data Interchange Format): Most frequently used interchange standard; addresses only a limited subset of metadata artifacts
- OMG (Object Management Group) CWM: XML based; addresses context and data meaning, not presentation; can enable exchange over the web employing industry standards for storing and sharing programming data; will allow sharing of UML and MOF objects between various development tools and repositories
- MDC (Metadata Coalition): Based on XML/UML standards; promoted by Microsoft along with 20 partners including the Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member) and Viasoft


OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools


OLAP: On-Line Analytical Processing


- OLAP can be defined as a technology which allows users to view aggregate data across measurements (like maturity amount, interest rate, etc.) along with a set of related parameters called dimensions (like product, organization, customer, etc.)
- Used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers


Distinction between OLTP and OLAP


- Source of data - OLTP: Operational data; OLTPs are the original source of the data. OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
- Purpose of data - OLTP: To control and run fundamental business tasks. OLAP: Decision support.
- What the data reveals - OLTP: A snapshot of ongoing business processes. OLAP: Multi-dimensional views of various kinds of business activities.
- Inserts and updates - OLTP: Short and fast inserts and updates initiated by end users. OLAP: Periodic long-running batch jobs refresh the data.

MDDB Concepts
A multidimensional database is a computer software system designed to allow for the efficient and convenient storage and retrieval of data that is:
- intimately related, and
- stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data:
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.
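A hedged sketch of the hypercube idea in plain Python; the dimension members are taken from the car-sales example used on the following slides, and the dictionary representation is purely illustrative.

from itertools import product

# A 3-dimensional sales-volume cube: MODEL x COLOR x DEALERSHIP.
models  = ["Mini Van", "Coupe", "Sedan"]
colors  = ["Blue", "Red", "White"]
dealers = ["Clyde", "Gleason", "Carr"]

# Store the cube as a dict keyed by one member from each dimension (27 cells).
cube = {(m, c, d): 0 for m, c, d in product(models, colors, dealers)}
cube[("Mini Van", "Blue", "Clyde")] = 6
cube[("Coupe", "Red", "Gleason")] = 3

# Slice: fix COLOR = 'Blue' to get a 2-D MODEL x DEALERSHIP view.
blue_slice = {(m, d): v for (m, c, d), v in cube.items() if c == "Blue"}

# Roll up along DEALERSHIP: total volume per (MODEL, COLOR).
rollup = {}
for (m, c, d), v in cube.items():
    rollup[(m, c)] = rollup.get((m, c), 0) + v

print(len(cube), "cells;", blue_slice[("Mini Van", "Clyde")], "blue Mini Vans at Clyde")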


RDBMS v/s MDDB: Increased Complexity...


Relational DBMS: a sales table holds one row per MODEL / COLOR / DEALER combination with a VOL. column (e.g. MINI VAN, BLUE, Clyde, 6), giving 27 rows x 4 columns = 108 cells.

MDDB: the same Sales Volumes data is held in a single cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr), giving 3 x 3 x 3 = 27 cells.

Benefits of MDDB over RDBMS


- Ease of data presentation & navigation: A great deal of information is gleaned immediately upon direct inspection of the array. The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than that offered by a relational table.
- Storage space: Very low space consumption compared to a relational DB.
- Performance: Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.
- Ease of maintenance: No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

- Sparsity: Input data in applications is typically sparse; sparsity increases with the number of dimensions.
- Data Explosion: Due to sparsity and due to summarization.
- Performance: Doesn't perform better than an RDBMS at high data volumes (>20-30 GB).


Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, a blank cell is left behind.

[Example: a Sales Volumes cube built over Employee #, Last Name and Employee Age from a nine-row employee table (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss). Since each employee number corresponds to exactly one last name and one age, only nine cells of the cube are populated and the rest are blank.]

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data

(A small sketch of slicing, rotation and roll-up follows.)
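A hedged sketch of slicing, rotation (pivoting) and roll-up using pandas; pandas is only one possible tool here, and the data is a made-up fragment of the Sales Volumes example.

import pandas as pd

sales = pd.DataFrame({
    "model":  ["Mini Van", "Mini Van", "Coupe", "Sedan", "Coupe"],
    "color":  ["Blue", "Red", "Blue", "White", "Red"],
    "dealer": ["Clyde", "Carr", "Gleason", "Clyde", "Carr"],
    "qty":    [6, 5, 3, 2, 4],
})

# Rotation / pivot: MODEL x COLOR view of the cube.
view1 = sales.pivot_table(index="model", columns="color", values="qty",
                          aggfunc="sum", fill_value=0)

# Slice: keep only the Blue sub-cube.
blue = sales[sales["color"] == "Blue"]

# Drill-up (roll-up): aggregate away the color dimension.
by_model = sales.groupby("model")["qty"].sum()

print(view1, blue, by_model, sep="\n\n")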


Features of OLAP - Rotation

Complex queries and sorts in the relational environment are translated to simple rotation.

[Figure: rotating (pivoting) a 2-dimensional Sales Volumes array between a MODEL x COLOR view and a COLOR x MODEL view.]

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: the six possible orientations (views) of a 3-dimensional Sales Volumes array over MODEL, COLOR and DEALERSHIP, obtained by successive 90-degree rotations.]

A 3-dimensional array has 6 views.

Features of OLAP - Slicing / Filtering


An MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: slicing the Sales Volumes cube down to the Mini Van and Coupe models, the Clyde and Carr dealerships, and the Normal Blue and Metal Blue colors.]

Features of OLAP - Drill Down / Up

[Figure: the ORGANIZATION dimension hierarchy - REGION (Midwest) > DISTRICT (Chicago, St. Louis, Gary) > DEALERSHIP (Clyde, Gleason, Carr, Levi, Lucas, Bolton) - with sales viewable at the region, district or dealership level.]

Moving up and down a hierarchy is referred to as drill-up (roll-up) and drill-down.

OLAP Reporting - Drill Down

[Charts: Inflows ($M) by region (East, West, Central). The first view shows inflows by year (1999, 2000); drilling down from Year to Quarter shows the four quarters of 1999; drilling down from Quarter to Month shows January, February and March of 1999.]

Implementation Techniques - OLAP Architectures

- MOLAP - Multidimensional OLAP: Multidimensional databases for the database and application logic layer
- ROLAP - Relational OLAP: Accesses data stored in a relational data warehouse for OLAP analysis; database and application logic are provided as separate layers
- HOLAP - Hybrid OLAP: The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server
- DOLAP - Desk OLAP: Personal MDDB server and application on the desktop


MOLAP - MDDB storage

[Diagram: MOLAP storage - an OLAP cube served by an OLAP calculation engine, accessed by web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

- Powerful analytical capabilities (e.g. financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization

ROLAP - Standard SQL storage

[Diagram: ROLAP storage - an MDDB-to-relational mapping over a relational data warehouse; the OLAP calculation engine generates SQL against the relational DW and serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features
- Three-tier hardware/software architecture: GUI on the client, multidimensional processing on a mid-tier server, target database on a database server; processing is split between the mid-tier and database servers
- Ad hoc query capabilities against very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram: HOLAP - an OLAP cube holding summary data combined with a relational data warehouse holding detail; the OLAP calculation engine issues SQL to the relational DW and serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB

Data explosion due to sparsity
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: With good design, 3-10 times
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum - if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: Medium - MDDB server + large disk space
- ROLAP: Low - only RDBMS + disk space
- HOLAP: High - RDBMS + disk space + MDDB server

Where to apply
- MOLAP: Small transactional data + complex model + frequent summary analysis
- ROLAP: Very large transactional data that needs to be viewed/sorted
- HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

- Oracle Express products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- Micro Strategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence


Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Applications
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


- There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge


Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered: In a data warehouse, most of the testing is system-triggered. Most production/source-system testing covers the processing of individual transactions, which are driven by input from users (application forms, servicing requests); very few test cycles cover system-triggered scenarios (such as billing or valuation).


Difference In Testing Data warehouse and Transaction System


- Volume of test data: The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.
- Possible scenarios / test cases: In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge: In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare pre-transformation data with post-transformation data.


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools, such as OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs access and pick up the right data from the right source
- All data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules

(A sketch of such a back-end check follows.)
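A hedged sketch of the kind of stand-alone back-end check described above; the table names, the transformation rule and the reject handling are invented for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (order_id TEXT, amount REAL);
    CREATE TABLE dw_orders  (order_id TEXT, amount REAL);
    CREATE TABLE rejects    (order_id TEXT, reason TEXT);

    INSERT INTO src_orders VALUES ('o100', 12.0), ('o102', NULL), ('o105', 50.0);
    -- Simulated ETL output: NULL amounts are rejected, the rest are loaded.
    INSERT INTO dw_orders  VALUES ('o100', 12.0), ('o105', 50.0);
    INSERT INTO rejects    VALUES ('o102', 'missing amount');
""")

# Reconciliation check: every source row is either loaded or rejected, never both or neither.
loaded   = con.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
rejected = con.execute("SELECT COUNT(*) FROM rejects").fetchone()[0]
source   = con.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
assert source == loaded + rejected, "row counts do not reconcile"

# Rule check: no loaded row violates the transformation rule (amount must be present).
bad = con.execute("SELECT COUNT(*) FROM dw_orders WHERE amount IS NULL").fetchone()[0]
assert bad == 0, "warehouse contains rows that should have been rejected"
print("ETL unit checks passed")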


Unit Testing
Unit testing the report data:
- Verify report data against the source: data in a data warehouse is stored at an aggregate level compared with the source systems; the QA team should verify the granular data stored in the data warehouse against the available source data
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems
- Derivation formulae / calculation rules should be verified


Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in the batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error log generation


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring data quality issues
- Refresh times for standard/complex reports


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

94

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

95

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

96

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

97

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

98

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

99

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

101

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

102

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

103

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

104

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

105

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

106

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

108

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

109

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

110

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

111

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

112

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

113

2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

114

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

115

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

116

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
117
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
118
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

119

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south

Dimension Table
city cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
120

region regId name north cold region south warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

121

2009 Wipro Ltd - Confidential

The Need For Data Quality


      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

122

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


Understand Information Flow In Organization
y Identify authoritative data sources y Interview Employees & Customers y Data Entry Points y Cost of bad data

Identify Potential Problem Areas & Asses Impact

Measure Quality Of Data

y Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values


y Use data cleansing tools to clean data at the source y Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

y Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

123

Data Quality Solution


Customized Programs  Strengths: Addresses specific needs No bulky one time investment  Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools  Strength Provide automated assessment  Limitation No measure of data accuracy

124

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery tools  Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud  Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths Usually are integrated packages with cleansing features as Add-on  Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
125
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star  Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft  Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

126

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

127

2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge

Meta Data Repository

Scheduled Extraction

RDBMS

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

128

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction Cleanup


Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

129

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.  To solve the problem, companies use extract, transform and load (ETL) software.  The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

130

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

131

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems

ETL Tools
 - Provide a facility to specify a large number of transformation rules with a GUI
 - Generate programs to transform data
 - Handle multiple data sources
 - Handle data redundancy
 - Generate metadata as output
 - Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
 - PowerCentre/Mart from Informatica
 - Data Mart Solution from Sagent Technology
 - DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
 - That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 - About the data being captured and loaded into the warehouse
 - Documented in IT tools that improve both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
 - How much time is spent looking for information, and how often is it found?
 - What poor decisions were made based on incomplete information?
 - How much money was lost or earned as a result?

Interpreting information
 - How many times have businesses needed to rework or recall products, and what impact does that have on the bottom line?
 - How many mistakes were due to misinterpretation of existing documentation?
 - How much misinterpretation results from too much metadata?
 - How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
 - How do the various data perspectives connect together, and how much time is spent trying to figure that out?
 - How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management


 - Provide a simple catalogue of business metadata descriptions and views
 - Document and manage metadata descriptions from an integrated development environment
 - Enable DW users to identify and invoke pre-built queries against the data stores
 - Design and enhance new data models and schemas for the data warehouse
 - Capture data transformation rules between the operational and data warehousing databases
 - Provide change impact analysis and updates across these technologies

(A sketch of the kind of catalogue entry such a repository might hold follows.)
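As an illustration only, here is a minimal sketch of a catalogue entry tying a warehouse field back to its source, transformation rule and business meaning. The table, rule text, owner and field names are hypothetical and not drawn from any particular metadata tool.

# Minimal sketch of a business/technical metadata catalogue entry.
# All names (tables, columns, rules, owners) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    target_table: str                 # where the data lands in the warehouse
    source: str                       # operational system it came from
    transformation_rule: str          # how source data becomes target data
    business_definition: str          # what the measure means to the business
    owner: str                        # who to ask about it
    refresh: str = "daily"            # load schedule
    upstream: list[str] = field(default_factory=list)  # for change impact analysis

entry = MetadataEntry(
    target_table="sales_fact.amount_usd",
    source="orders.csv (e-commerce OLTP extract)",
    transformation_rule="qty * unit_price, converted to USD",
    business_definition="Gross order value before returns",
    owner="finance-dw-team",
    upstream=["orders.qty", "orders.unit_price", "fx_rates.usd_rate"],
)
print(entry)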

Consumers of Metadata
 - Technical users
   - Warehouse administrator
   - Application developer
 - Business users (business metadata)
   - Meanings
   - Definitions
   - Business rules
 - Software tools used in DW life-cycle development
   - Metadata requirements for each tool must be identified
   - The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
   - Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools
 - Oracle Exchange
   - Technology of choice for a long list of repository, enterprise and workgroup vendors
 - Reischmann-Informatik Toolbus
   - Features include facilitation of selective bridging of metadata
 - Ardent Software / Dovetail Software - Interplay
   - Hub-and-spoke solution for enabling metadata interoperability
   - Ardent focusing on its own engagements, not selling it as an independent product
 - Informix's Metadata Plug-ins
   - Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools


Metadata Repositories
 - IBM, Oracle and Microsoft to offer free or near-free basic repository services
 - Enable organisations to reuse metadata across technologies
 - Integrate DB design, data transformation and BI tools from different vendors
 - Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
 - Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards
 - CDIF (CASE Data Interchange Format)
   - Most frequently used interchange standard
   - Addresses only a limited subset of metadata artifacts
 - OMG (Object Management Group) - CWM
   - XML addresses context and data meaning, not presentation
   - Can enable exchange over the web employing industry standards for storing and sharing programming data
   - Will allow sharing of UML and MOF objects between various development tools and repositories
 - MDC (Metadata Coalition)
   - Based on XML/UML standards
   - Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (founding member) and Viasoft
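As a small illustration of exchanging metadata as XML, the sketch below serializes one catalogue entry. The element names are invented for the example; this is not the CDIF or CWM/XMI schema.

# Minimal sketch of exporting a catalogue entry as XML for interchange.
# Element names are made up for illustration, not a real interchange schema.
import xml.etree.ElementTree as ET

entry = {
    "targetTable": "sales_fact.amount_usd",
    "source": "orders (e-commerce OLTP)",
    "transformationRule": "qty * unit_price, converted to USD",
    "owner": "finance-dw-team",
}

root = ET.Element("metadataEntry")
for tag, text in entry.items():
    ET.SubElement(root, tag).text = text

print(ET.tostring(root, encoding="unicode"))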

OLAP


Agenda
 - OLAP definition
 - Distinction between OLTP and OLAP
 - MDDB concepts
 - Implementation techniques / architectures
 - Features
 - Representative tools


OLAP: On-Line Analytical Processing


 - OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
 - Used interchangeably with BI
 - A multidimensional view of data is the foundation of OLAP
 - Users: analysts, decision makers


Distinction between OLTP and OLAP


Source of data
 - OLTP System: operational data; OLTPs are the original source of the data
 - OLAP System: consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
 - OLTP System: to control and run fundamental business tasks
 - OLAP System: decision support

What the data reveals
 - OLTP System: a snapshot of ongoing business processes
 - OLAP System: multi-dimensional views of various kinds of business activities

Inserts and updates
 - OLTP System: short and fast inserts and updates initiated by end users
 - OLAP System: periodic long-running batch jobs refresh the data


MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
 - intimately related, and
 - stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
 - The edges of the cube are called dimensions
 - Individual items within each dimension are called members
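A tiny Python sketch of the hypercube idea, keeping cells in a dict keyed by dimension members; the dimensions mirror the car-sales example on the following slides, and the figures are made up for illustration.

# Minimal hypercube sketch: cells keyed by (model, color, dealer) members.
# Figures are illustrative only.
from itertools import product

models  = ["Mini Van", "Coupe", "Sedan"]
colors  = ["Blue", "Red", "White"]
dealers = ["Clyde", "Gleason", "Carr"]

cube = {("Mini Van", "Blue", "Clyde"): 6,
        ("Mini Van", "Red", "Gleason"): 3,
        ("Sedan", "White", "Carr"): 5}           # sparse: unsold combinations are absent

def total(model=None, color=None, dealer=None):
    """Aggregate sales volume over any subset of the three dimensions."""
    return sum(v for (m, c, d), v in cube.items()
               if model in (None, m) and color in (None, c) and dealer in (None, d))

print(total(model="Mini Van"))                   # roll-up over color and dealer -> 9
print(total(color="Blue"))                       # slice on one member -> 6
print(len(list(product(models, colors, dealers))))  # 27 potential cells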


RDBMS v/s MDDB: Increased Complexity...


[Figure: the same Sales Volumes data held two ways. In a relational DBMS it is a table with columns MODEL, COLOR, DEALER and VOL., one row per model/color/dealer combination (MINI VAN, SPORTS COUPE, SEDAN x BLUE, RED, WHITE x Clyde, Gleason, Carr), i.e. 27 rows x 4 columns = 108 cells. In an MDDB it is a cube whose edges are MODEL, COLOR and DEALERSHIP, i.e. 3 x 3 x 3 = 27 cells.]


Benefits of MDDB over RDBMS


 - Ease of data presentation and navigation
   - A great deal of information is gleaned immediately upon direct inspection of the array
   - The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
 - Storage space
   - Very low space consumption compared to a relational DB
 - Performance
   - Gives much better performance; a relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
 - Ease of maintenance
   - No overhead, as data is stored in the same way it is viewed; in a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
 - Input data in applications is typically sparse
 - Increases with increased dimensions

Data Explosion
 - Due to sparsity
 - Due to summarization

Performance
 - Does not perform better than an RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, then a blank cell is left behind.
[Figure: a relational table of nine employees (LAST NAME, EMP#, AGE) recast as an Employee Age cube with LAST NAME and EMPLOYEE # as the dimensions and AGE as the cell value. Because each last name pairs with only one employee number, only 9 of the 81 cells hold data and the rest remain blank.]
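A quick calculation of the sparsity in this example, assuming the nine employees shown and a 9 x 9 grid of last names by employee numbers:

# Sparsity of the employee-age cube sketched above (figures from the example).
filled_cells = 9
total_cells = 9 * 9
print(f"density  = {filled_cells / total_cells:.1%}")      # ~11.1%
print(f"sparsity = {1 - filled_cells / total_cells:.1%}")  # ~88.9%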


OLAP Features
 - Calculations applied across dimensions, through hierarchies and/or across members
 - Trend analysis over sequential time periods; what-if scenarios
 - Slicing / dicing subsets for on-screen viewing
 - Rotation to new dimensional comparisons in the viewing area
 - Drill-down / drill-up along the hierarchy
 - Reach-through / drill-through to underlying detail data

(A short sketch illustrating slicing, rotation and drill-down follows.)
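To make rotation, slicing and drill-down concrete, here is a minimal pandas sketch over a made-up sales table; the data and column names are hypothetical and only loosely follow the car-sales example used on the next slides.

# Minimal sketch of OLAP-style operations with pandas (illustrative data).
import pandas as pd

sales = pd.DataFrame({
    "model":  ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan", "Sedan"],
    "color":  ["Blue", "Red", "Blue", "White", "Red", "White"],
    "dealer": ["Clyde", "Gleason", "Carr", "Clyde", "Carr", "Gleason"],
    "volume": [6, 3, 2, 5, 3, 1],
})

# Rotation: model x color view, then the same data pivoted to color x model.
view1 = sales.pivot_table(index="model", columns="color", values="volume", aggfunc="sum")
view2 = view1.T

# Slicing: fix one member of the color dimension.
blue_only = sales[sales["color"] == "Blue"]

# Drill-down / roll-up: totals by model, then broken down by model and dealer.
by_model = sales.groupby("model")["volume"].sum()
by_model_dealer = sales.groupby(["model", "dealer"])["volume"].sum()

print(view1, blue_only, by_model_dealer, sep="\n\n")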


Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

[Figure: the Sales Volumes grid viewed as MODEL by COLOR (View #1) and, after a 90-degree rotation, as COLOR by MODEL (View #2). A 2-dimensional array has 2 views.]



Features of OLAP - Rotation


[Figure: the Sales Volumes cube (MODEL x COLOR x DEALERSHIP) rotated through successive 90-degree turns to give Views #1 through #6. A 3-dimensional array has 6 views.]



Features of OLAP - Slicing / Filtering


 - MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to the Mini Van and Coupe models, the Carr and Clyde dealerships, and the Normal Blue and Metal Blue colors.]

Features of OLAP - Drill Down / Up

[Figure: the ORGANIZATION dimension hierarchy. The Midwest REGION contains the Chicago, St. Louis and Gary DISTRICTs, which contain the DEALERSHIPs Clyde, Gleason, Carr, Levi, Lucas and Bolton; sales can be viewed at the region, district or dealership level.]

Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for Years 1999 and 2000.]


OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the four quarters of Year 1999.]

Drill-down from Year to Quarter



OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.]

Drill-down from Quarter to Month


Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
 - Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
 - Accesses data stored in a relational data warehouse for OLAP analysis; database and application logic provided as separate layers

HOLAP - Hybrid OLAP
 - The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
 - Personal MDDB server and application on the desktop


MOLAP - MDDB storage

[Diagram: MOLAP storage. An OLAP cube sits behind the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]



MOLAP - Features

 - Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 - Aggregation and calculation capabilities
 - Read/write analytic applications
 - Specialized data structures for
   - Maximum query performance
   - Optimum space utilization

ROLAP - Standard SQL storage

[Diagram: ROLAP storage. The OLAP calculation engine performs an MDDB-to-relational mapping and issues SQL against the relational data warehouse, serving web browsers, OLAP tools and OLAP applications.]

ROLAP - Features
 - Three-tier hardware/software architecture
   - GUI on the client; multidimensional processing on the mid-tier server; target database on the database server
   - Processing split between the mid-tier and database servers
 - Ad hoc query capabilities against very large databases
 - DW integration
 - Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram: HOLAP storage. The OLAP calculation engine draws on both an OLAP cube and, via SQL, the relational data warehouse, serving any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

 - RDBMS used for detailed data stored in large databases
 - MDDB used for fast, read/write OLAP analysis and calculations
 - Scalability of RDBMS and MDDB performance
 - Calculation engine provides full analysis features
 - Source of data transparent to the end user


Architecture Comparison

Definition
 - MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB
 - ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
 - HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB

Data explosion due to sparsity
 - MOLAP: High (may go beyond control; estimation is very important)
 - ROLAP: No sparsity
 - HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
 - MOLAP: With a good design, 3-10 times
 - ROLAP: To the necessary extent
 - HOLAP: To the necessary extent

Query execution speed
 - MOLAP: Fast (depends upon the size of the MDDB)
 - ROLAP: Slow
 - HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
 - MOLAP: Medium (MDDB server + large disk space)
 - ROLAP: Low (only RDBMS + disk space)
 - HOLAP: High (RDBMS + disk space + MDDB server)

Where to apply?
 - MOLAP: Small transactional data + complex model + frequent summary analysis
 - ROLAP: Very large transactional data that needs to be viewed / sorted
 - HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

 - Oracle Express products
 - Hyperion Essbase
 - Cognos PowerPlay
 - Seagate Holos
 - SAS
 - MicroStrategy DSS Agent
 - Informix MetaCube
 - Brio Query
 - Business Objects / WebIntelligence


Sample OLAP Applications

 - Sales analysis
 - Financial analysis
 - Profitability analysis
 - Performance analysis
 - Risk management
 - Profiling & segmentation
 - Scorecard applications
 - NPA management
 - Strategic planning
 - Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview


 - There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions.
 - The methodology required for testing a data warehouse is different from testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
 - User-triggered vs. system-triggered
 - Volume of test data
 - Possible scenarios / test cases
 - Programming for testing challenge


Difference In Testing Data warehouse and Transaction System.


 - User-triggered vs. system-triggered
   In a data warehouse, most of the testing is system-triggered. Most production/source-system testing is the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). There are very few test cycles which cover the system-triggered scenarios (like billing, valuation, etc.).


Difference In Testing Data warehouse and Transaction System


 - Volume of test data
   The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has a large volume of test data, as one tries to cover the maximum possible combinations of dimensions and facts.
 - Possible scenarios / test cases
   In the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data warehouse and Transaction System


 - Programming for testing challenge
   In the case of transaction systems, users/business analysts typically test the output of the system. In the case of a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
 - 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
 - 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools like OLAP

Testing phases consist of:
 - Requirements testing
 - Unit testing
 - Integration testing
 - Performance testing
 - Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 - Are the requirements complete?
 - Are the requirements singular?
 - Are the requirements ambiguous?
 - Are the requirements developable?
 - Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
 - Whether the ETLs are accessing and picking up the right data from the right source
 - All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
 - Testing the rejected records that don't fulfil the transformation rules


Unit Testing
Unit testing the report data:
 - Verify report data with the source: data present in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
 - Field-level data verification: the QA team must understand the linkages for the fields displayed in the report and should trace them back and compare them with the source systems.
 - Derivation formulae / calculation rules should be verified.

(A minimal sketch of such a back-end reconciliation check follows.)
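As an illustration of this kind of back-end check, here is a minimal Python sketch that reconciles a source aggregate against the warehouse fact table; the database files, table names and columns are assumptions for illustration only.

# Minimal sketch of an ETL reconciliation test: compare row counts and a
# summed measure between a source table and the warehouse fact table.
# Database paths, table and column names are hypothetical.
import sqlite3

def fetch_one(db_path, sql):
    con = sqlite3.connect(db_path)
    try:
        return con.execute(sql).fetchone()
    finally:
        con.close()

def test_sales_reconciliation():
    src_count, src_amount = fetch_one(
        "source.db", "SELECT COUNT(*), SUM(qty * unit_price) FROM orders")
    dw_count, dw_amount = fetch_one(
        "warehouse.db", "SELECT COUNT(*), SUM(amount_usd) FROM sales_fact")
    assert src_count == dw_count, f"row count mismatch: {src_count} vs {dw_count}"
    assert abs(src_amount - dw_amount) < 0.01, "summed measure mismatch"

if __name__ == "__main__":
    test_sales_reconciliation()
    print("reconciliation checks passed")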


Integration Testing
Integration testing will involve the following:
 - Sequence of ETL jobs in batch
 - Initial loading of records into the data warehouse
 - Incremental loading of records at a later date, to verify the newly inserted or updated data
 - Testing the rejected records that don't fulfil the transformation rules
 - Error log generation


Performance Testing
Performance testing should check for:
 - ETL processes completing within the time window
 - Monitoring and measuring data quality issues
 - Refresh times for standard/complex reports


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

202

2009 Wipro Ltd - Confidential

203

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

205

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

206

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

208

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

209

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

210

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

211

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

212

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

213

2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

214

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

215

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

216

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

217

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

218

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
219
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
220
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

221

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
222
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
223
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

224

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south

Dimension Table
city cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
225

region regId name north cold region south warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

226

2009 Wipro Ltd - Confidential

The Need For Data Quality


      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

227

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


Understand Information Flow In Organization
y Identify authoritative data sources y Interview Employees & Customers y Data Entry Points y Cost of bad data

Identify Potential Problem Areas & Asses Impact

Measure Quality Of Data

y Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values


y Use data cleansing tools to clean data at the source y Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

y Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

228

Data Quality Solution


Customized Programs  Strengths: Addresses specific needs No bulky one time investment  Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools  Strength Provide automated assessment  Limitation No measure of data accuracy

229

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery tools  Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud  Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths Usually are integrated packages with cleansing features as Add-on  Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
230
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star  Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft  Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

231

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

232

2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge

Meta Data Repository

Scheduled Extraction

RDBMS

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

233

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction Cleanup


Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

234

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.  To solve the problem, companies use extract, transform and load (ETL) software.  The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

235

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

236

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

   

237

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
238

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

239

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

  

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes

240

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?

241

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
242
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

243

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

244

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools  Oracle Exchange
Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay


Hub and Spoke solution for enabling metadata interoperability Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins


Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
245
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products e.g., One for AD and one for DW, with bridges between them

246

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM


XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)


Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member),Viasoft
247
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools  Oracle Exchange
Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay


Hub and Spoke solution for enabling metadata interoperability Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins


Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
248
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products e.g., One for AD and one for DW, with bridges between them

249

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM


XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)


Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member),Viasoft
250
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP

251

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools

1/13/2012

252

252

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) Used interchangeably with BI Multidimensional view of data is the foundation of OLAP Users :Analysts, Decision makers

1/13/2012

253

253

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


OLTP System Source of data OLAP System

Operational data; OLTPs are Consolidation data; OLAP the original source of the data comes from the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the data
254

Purpose of data What the data reveals

Inserts and Updates Short and fast inserts and updates initiated by end users
1/13/2012

254

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is  intimately related and  stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data.  The edges of the cube are called dimensions  Individual items within each dimensions are called members

255

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


[Figure: the same Sales Volumes data stored two ways. As a relational table with columns MODEL, COLOR, DEALER and VOL (one row per Model/Color/Dealer combination) it occupies 27 rows x 4 columns = 108 cells; as a multidimensional cube with dimensions MODEL (Mini Van, Sports Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr) it occupies only 3 x 3 x 3 = 27 cells.]
256

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS


 Ease of Data Presentation & Navigation
A great deal of information is gleaned immediately upon direct inspection of the array. The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table.
 Storage Space
Very low space consumption compared to a relational DB.
 Performance
Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.
 Ease of Maintenance
No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.
1/13/2012
257
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

257

Issues with MDDB

Sparsity
- Input data in applications is typically sparse (a density sketch follows this list)
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
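To make the sparsity point concrete, here is a hedged sketch that measures how many cube cells are actually filled by a set of transaction rows; the table and column names are invented.

```python
# Illustrative density check: filled cells vs. total cells of the implied cube.
import numpy as np
import pandas as pd

facts = pd.DataFrame({
    "model":  ["Mini Van", "Coupe", "Coupe"],
    "color":  ["Blue", "Blue", "Red"],
    "dealer": ["Clyde", "Gleason", "Carr"],
    "volume": [6, 5, 3],
})

dims = ["model", "color", "dealer"]
total_cells  = int(np.prod([facts[d].nunique() for d in dims]))  # 2 x 2 x 3 = 12
filled_cells = len(facts[dims].drop_duplicates())                # 3 filled cells
print(f"density = {filled_cells}/{total_cells} = {filled_cells / total_cells:.0%}")
```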

1/13/2012
258
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

258

Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, then a blank cell is left behind.

[Figure: an Employee Age table (LAST NAME, EMP#, AGE) mapped to a two-dimensional array with LAST NAME and EMPLOYEE # as dimensions. Each last name pairs with exactly one employee number, so only one cell per row holds an age and the rest of the array is blank, illustrating sparsity.]

1/13/2012
259
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

259

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods; what-if scenarios
 Slicing / dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down / drill-up along the hierarchy
 Reach-through / drill-through to underlying detail data

1/13/2012
260
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

260

Features of OLAP - Rotation

Complex queries and sorts in the relational environment translate to a simple rotation.

[Figure: a Sales Volumes array with MODEL on the rows and COLOR on the columns (View #1), rotated 90 degrees so that COLOR is on the rows and MODEL on the columns (View #2).]

A 2-dimensional array has 2 views.
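A hedged pandas sketch of the rotation above: once MODEL and COLOR are real axes, swapping which one runs along the rows is just a transpose. The numbers are illustrative, not taken from the figure.

```python
# Rotation of a 2-D view: View #2 is simply the transpose of View #1.
import pandas as pd

view1 = pd.DataFrame(
    [[6, 5, 4], [3, 5, 5], [4, 3, 2]],
    index=pd.Index(["Mini Van", "Coupe", "Sedan"], name="MODEL"),
    columns=pd.Index(["Blue", "Red", "White"], name="COLOR"),
)
view2 = view1.T   # rotate 90 degrees: COLOR on the rows, MODEL on the columns
print(view2)
```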


1/13/2012
261
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

261

Features of OLAP - Rotation


[Figure: the Sales Volumes cube (MODEL x COLOR x DEALERSHIP) shown in its six possible orientations, View #1 through View #6, each obtained by rotating the cube 90 degrees so that a different pair of dimensions faces the viewer.]

A 3-dimensional array has 6 views.
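The view count generalises: an n-dimensional array has n! axis orderings. The tiny sketch below simply confirms the count for the three dimensions of this example.

```python
# Each "view" corresponds to one ordering of the cube's axes: 3! = 6 for 3 dimensions.
from itertools import permutations

views = list(permutations(["MODEL", "COLOR", "DEALERSHIP"]))
print(len(views))  # 6
```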


1/13/2012
262
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

262

Features of OLAP - Slicing / Filtering


 MDDB allows the end user to quickly slice in on the exact view of the data required (a pandas sketch follows the figure).

Sales Volumes

[Figure: the Sales Volumes cube filtered down to a slice covering the Mini Van and Coupe models, the Normal Blue and Metal Blue colours, and the Carr and Clyde dealerships.]
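A hedged sketch of slicing and dicing with pandas, holding the cube as a Series with a MultiIndex; the member names echo the figure, the values are invented.

```python
# Slicing (fix one member of a dimension) and dicing (reduce to a sub-cube).
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Mini Van", "Coupe"], ["Normal Blue", "Metal Blue"], ["Carr", "Clyde"]],
    names=["MODEL", "COLOR", "DEALERSHIP"],
)
sales = pd.Series(range(8), index=index, name="volume")

blue_slice = sales.xs("Normal Blue", level="COLOR")   # slice: fix one colour member
coupe_dice = sales.xs("Coupe", level="MODEL")         # dice: a smaller sub-cube
print(blue_slice)
print(coupe_dice)
```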
1/13/2012
263
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

263

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

[Figure: the organization hierarchy used for drill-down. REGION: Midwest; DISTRICTs: Chicago, St. Louis, Gary; DEALERSHIPs: Clyde, Gleason, Carr, Levi, Lucas, Bolton. Sales can be viewed at the Region, District or Dealership level.]

Moving up and moving down a hierarchy is referred to as drill-up / roll-up and drill-down.
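A minimal sketch of roll-up and drill-down as aggregation at different levels of the Region > District > Dealership hierarchy; the district assignments and volumes below are invented.

```python
# Roll-up / drill-down: the same fact aggregated at coarser or finer hierarchy levels.
import pandas as pd

sales = pd.DataFrame({
    "region":     ["Midwest", "Midwest", "Midwest", "Midwest"],
    "district":   ["Chicago", "Chicago", "St. Louis", "Gary"],
    "dealership": ["Clyde", "Gleason", "Carr", "Levi"],
    "volume":     [6, 3, 2, 5],
})

by_region     = sales.groupby("region")["volume"].sum()                      # rolled up
by_district   = sales.groupby(["region", "district"])["volume"].sum()        # drill down
by_dealership = sales.groupby(["region", "district", "dealership"])["volume"].sum()
```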

1/13/2012
264
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

264

OLAP Reporting - Drill Down

Inflows ( Region , Year)


[Bar chart: Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000.]

1/13/2012
265
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

265

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999)


[Bar chart: Inflows ($M) by region (East, West, Central) for the four quarters of Year 1999.]

Drill-down from Year to Quarter


1/13/2012
266
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

266

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999 - 1st Qtr)


[Bar chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.]

Drill-down from Quarter to Month
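A hedged sketch of the Year to Quarter to Month drill-down shown above, using a pandas date column; the regions and amounts are invented.

```python
# Drill-down along the time hierarchy: Year -> Quarter -> Month.
import pandas as pd

inflows = pd.DataFrame({
    "date":   pd.to_datetime(["1999-01-15", "1999-02-10", "1999-04-02", "2000-01-20"]),
    "region": ["East", "West", "East", "Central"],
    "amount": [20.0, 15.0, 30.0, 25.0],
})

by_year    = inflows.groupby([inflows["date"].dt.year, "region"])["amount"].sum()
by_quarter = inflows.groupby([inflows["date"].dt.to_period("Q"), "region"])["amount"].sum()
by_month   = inflows.groupby([inflows["date"].dt.to_period("M"), "region"])["amount"].sum()
```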

267

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP


Multidimensional Databases for database and application logic layer

ROLAP - Relational OLAP


Access Data stored in relational Data Warehouse for OLAP Analysis. Database and Application logic provided as separate layers

HOLAP - Hybrid OLAP


The OLAP server routes queries first to the MDDB, then to the RDBMS, and the results are processed on the fly in the server

DOLAP - Desktop OLAP


Personal MDDB Server and application on the desktop

1/13/2012
268
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

268

MOLAP - MDDB storage

[Diagram: an OLAP cube stored in the MDDB, with an OLAP calculation engine serving web browsers, OLAP tools and OLAP applications.]


1/13/2012
269
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

269

MOLAP - Features

 Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 Aggregation and calculation capabilities
 Read/write analytic applications
 Specialized data structures for maximum query performance and optimum space utilization
1/13/2012
270
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

270

ROLAP - Standard SQL storage

[Diagram: the OLAP calculation engine maps the MDDB view onto relational structures and issues SQL against the relational data warehouse; results are delivered to web browsers, OLAP tools and OLAP applications.]
1/13/2012
271
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

271

ROLAP - Features
 Three-tier hardware/software architecture: GUI on the client, multidimensional processing on a mid-tier server, target database on the database server
 Processing split between the mid-tier and database servers
 Ad hoc query capabilities against very large databases (illustrated by the SQL-generation sketch below)
 DW integration
 Data scalability
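A rough sketch of the ROLAP idea under stated assumptions: the OLAP layer turns a dimensional request into SQL against a star schema. The table and column names (sale_fact, product_dim, store_dim, prod_id, store_id) are hypothetical, not taken from the slides.

```python
# Hypothetical ROLAP-style SQL generation for a star schema.
def rolap_query(measure: str, dimensions: list[str]) -> str:
    select_list = ", ".join(dimensions + [f"SUM({measure}) AS total_{measure}"])
    group_list = ", ".join(dimensions)
    return (
        f"SELECT {select_list} "
        f"FROM sale_fact f "
        f"JOIN product_dim p ON f.prod_id = p.prod_id "
        f"JOIN store_dim s ON f.store_id = s.store_id "
        f"GROUP BY {group_list}"
    )

print(rolap_query("amt", ["p.name", "s.city"]))
```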

1/13/2012
272
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

272

HOLAP - Combination of RDBMS and MDDB


[Diagram: a hybrid configuration in which the OLAP calculation engine serves summaries from an OLAP cube and issues SQL against the relational DW for detail, with any client (web browser, OLAP tools, OLAP applications) on top.]
1/13/2012
273
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

273

HOLAP - Features

 RDBMS used for detailed data stored in large databases
 MDDB used for fast, read/write OLAP analysis and calculations
 Scalability of the RDBMS combined with MDDB performance
 Calculation engine provides full analysis features
 Source of data is transparent to the end user (a routing sketch follows)
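A hedged sketch of the routing idea behind HOLAP: answer from an MDDB summary when one covers the requested dimensions, otherwise push the query down to the RDBMS. The summary definitions here are invented for illustration.

```python
# Hypothetical HOLAP router: use the MDDB summary if it can answer the request.
def route(query_dims: set[str], mddb_summaries: list[set[str]]) -> str:
    if any(query_dims <= summary for summary in mddb_summaries):
        return "served from MDDB summary"
    return "pushed down to the RDBMS as SQL"

summaries = [{"region", "year", "product"}]
print(route({"region", "year"}, summaries))   # served from MDDB summary
print(route({"customer"}, summaries))         # pushed down to the RDBMS as SQL
```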

1/13/2012
274
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

274

Architecture Comparison

Comparison of MOLAP, ROLAP and HOLAP:

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB

Data explosion due to sparsity
- MOLAP: high (may go beyond control; estimation is very important)
- ROLAP: no sparsity
- HOLAP: sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: with good design, 3-10 times
- ROLAP: to the necessary extent
- HOLAP: to the necessary extent

Query execution speed
- MOLAP: fast (depends upon the size of the MDDB)
- ROLAP: slow
- HOLAP: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: medium (MDDB server + large disk space)
- ROLAP: low (only RDBMS + disk space)
- HOLAP: high (RDBMS + disk space + MDDB server)

Where to apply?
- MOLAP: small transactional data + complex model + frequent summary analysis
- ROLAP: very large transactional data that needs to be viewed / sorted
- HOLAP: large transactional data + frequent summary analysis

1/13/2012
275
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

275

Representative OLAP Tools:

 Oracle Express products
 Hyperion Essbase
 Cognos PowerPlay
 Seagate Holos
 SAS
 MicroStrategy DSS Agent
 Informix MetaCube
 Brio Query
 Business Objects / Web Intelligence

1/13/2012
276
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

276

Sample OLAP Applications

 Sales Analysis
 Financial Analysis
 Profitability Analysis
 Performance Analysis
 Risk Management
 Profiling & Segmentation
 Scorecard Applications
 NPA Management
 Strategic Planning
 Customer Relationship Management (CRM)
1/13/2012
277
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

277

Data Warehouse Testing

278

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
 The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

279

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
 User-triggered vs. system-triggered
 Volume of test data
 Possible scenarios / test cases
 Programming for the testing challenge

280

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System.


 User-triggered vs. system-triggered: In a data warehouse, most of the testing is system-triggered. Most production/source-system testing is the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). There are very few test cycles that cover the system-triggered scenarios (like billing or valuation).

281

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of test data: The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.
 Possible scenarios / test cases: In the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

282

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Programming for the testing challenge: In the case of transaction systems, users/business analysts typically test the output of the system. In the case of a data warehouse, most of the data-quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.

283

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
'Back-end' testing, where the source system's data is compared to the end-result data in the loaded area.
'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP.
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

284

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements unambiguous?
 Are the requirements developable?
 Are the requirements testable?

285

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
 Whether the ETLs are accessing and picking up the right data from the right source.
 Whether all data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data.
 Testing the rejected records that don't fulfil the transformation rules.
A minimal reconciliation sketch follows.
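A minimal back-end test sketch, assuming the source extract and the loaded warehouse table are available as DataFrames sharing a key column and a measure column (the names are hypothetical): it reconciles row counts, key sets and a summed measure.

```python
# Hypothetical ETL reconciliation check: source extract vs. loaded warehouse table.
import pandas as pd

def reconcile(source: pd.DataFrame, target: pd.DataFrame, key: str, measure: str) -> bool:
    checks = {
        "row count": len(source) == len(target),
        "key set":   set(source[key]) == set(target[key]),
        "measure":   abs(source[measure].sum() - target[measure].sum()) < 1e-6,
    }
    for name, ok in checks.items():
        print(f"{name:<9}: {'PASS' if ok else 'FAIL'}")
    return all(checks.values())
```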

286

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing the report data:
 Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the source data available.
 Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace back and compare them with the source systems.
 Derivation formulae / calculation rules should be verified.

287

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following:
 Sequence of ETL jobs in a batch.
 Initial loading of records into the data warehouse.
 Incremental loading of records at a later date, to verify the newly inserted or updated data.
 Testing the rejected records that don't fulfil the transformation rules.
 Error log generation.

288

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance testing should check for:
 ETL processes completing within the load window (a timing sketch follows).
 Monitoring and measuring of data quality issues.
 Refresh times for standard/complex reports.
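A small sketch, assuming an ETL step can be invoked as a Python callable (hypothetical), that checks whether it completes within its load window.

```python
# Hypothetical load-window check for an ETL step.
import time

def runs_within_window(job, window_seconds: float) -> bool:
    start = time.monotonic()
    job()                                  # the ETL step under test
    elapsed = time.monotonic() - start
    print(f"elapsed {elapsed:.1f}s of a {window_seconds:.0f}s window")
    return elapsed <= window_seconds
```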

289

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

290

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

291

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

292

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

294

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

295

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

296

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

297

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

298

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

299

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

301

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

302

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

303

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

304

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

305

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

306

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

308

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

309

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

310

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

311

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

312

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

313

2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

314

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

315

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

316

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
317
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
318
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

319

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south

Dimension Table
city cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
320

region regId name north cold region south warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

321

2009 Wipro Ltd - Confidential

The Need For Data Quality


      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

322

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


Understand Information Flow In Organization
y Identify authoritative data sources y Interview Employees & Customers y Data Entry Points y Cost of bad data

Identify Potential Problem Areas & Asses Impact

Measure Quality Of Data

y Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values


y Use data cleansing tools to clean data at the source y Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

y Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

323

Data Quality Solution


Customized Programs  Strengths: Addresses specific needs No bulky one time investment  Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools  Strength Provide automated assessment  Limitation No measure of data accuracy

324

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery tools  Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud  Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths Usually are integrated packages with cleansing features as Add-on  Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
325
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star  Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft  Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

326

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

327

2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge

Meta Data Repository

Scheduled Extraction

RDBMS

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

328

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction Cleanup


Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

329

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.  To solve the problem, companies use extract, transform and load (ETL) software.  The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

330

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

331

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

   

332

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
333

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

334

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

  

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes

335

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?

336

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
337
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

338

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools  Oracle Exchange
Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay


Hub and Spoke solution for enabling metadata interoperability Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins


Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
339
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products e.g., One for AD and one for DW, with bridges between them

340

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM


XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)


Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member),Viasoft
341
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP

342

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools

1/13/2012

343

343

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) Used interchangeably with BI Multidimensional view of data is the foundation of OLAP Users :Analysts, Decision makers

1/13/2012

344

344

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


OLTP System Source of data OLAP System

Operational data; OLTPs are Consolidation data; OLAP the original source of the data comes from the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the data
345

Purpose of data What the data reveals

Inserts and Updates Short and fast inserts and updates initiated by end users
1/13/2012

345

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is  intimately related and  stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data.  The edges of the cube are called dimensions  Individual items within each dimensions are called members

346

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS
MODEL MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SEDAN SEDAN SEDAN ... COLOR BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE DEALER Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr VOL. 6 3 2 5 3 1 3 1 4 3 3 3 4 3 6 2 3 5 4 3 2 ...

MDDB

Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

DEALERSHIP

COLOR

27 x 4 = 108 cells
347

3 x 3 x 3 = 27 cells

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS


 Ease of Data Presentation & Navigation A great deal of information is gleaned immediately upon direct inspection of the array User is able to view data along presorted dimensions with data arranged in an inherently more organized, and accessible fashion than the one offered by the relational table.  Storage Space Very low Space Consumption compared to Relational DB  Performance Gives much better performance. Relational DB may give comparable results only through database tuning (indexing, keys etc), which may not be possible for ad-hoc queries.  Ease of Maintenance No overhead as data is stored in the same way it is viewed. In Relational DB, indexes, sophisticated joins etc. are used which require considerable storage and maintenance
1/13/2012
348
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

348

Issues with MDDB

Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions

Data Explosion
-Due to Sparsity -Due to Summarization

Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)

1/13/2012
349
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

349

Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact , then blank cell is left behind.
Employee Age
21 19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33

LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 D Coupe 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19

Smith

Regan

Fox

L A S T N A M E

Weld

Kelly

Link

Kranz

Lucas

Weiss

EMPLOYEE #

1/13/2012
350
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

350

OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data

1/13/2012
351
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

351

Features of OLAP - Rotation

Complex Queries & Sorts in Relational environment translated to simple rotation.


Sales Volumes

M O D E L

Mini Van

6 3 4
Blue

5 5 3
Red

4 5 2
( ROTATE 90 )
White
o

Coupe

C O L O R

Blue

6 5 4

3 5 5
MODEL

4 3 2
Sedan

Red

Sedan

White

Mini Van Coupe

COLOR

View #1

View #2

2 dimensional array has 2 views.


1/13/2012
352
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

352

Features of OLAP - Rotation


Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

C O L O R

Blue

Red White Sedan Coupe Mini Van Carr Gleason Clyde

C O L O R

Blue

Red White Carr Gleason Clyde Mini Van Coupe Sedan

COLOR

( ROTATE 90 )

MODEL

DEALERSHIP
( ROTATE 90 )
o

( ROTATE 90 )

DEALERSHIP

DEALERSHIP

MODEL

View #1
D E A L E R S H I P D E A L E R S H I P

View #2

View #3

Carr

Carr Gleason Blue Red White Mini Van Coupe Sedan

Mini Van

Gleason Mini Van Coupe Sedan White Red Blue

Clyde

Clyde

M O D E L

Coupe Blue Red White Clyde Gleason Carr

Sedan

COLOR

( ROTATE 90 )

MODEL

DEALERSHIP
( ROTATE 90 )
o

MODEL

COLOR

COLOR

View #4

View #5

View #6

3 dimensional array has 6 views.


1/13/2012
353
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

353

Features of OLAP - Slicing / Filtering


 MDDB allows end user to quickly slice in on exact view of the data required.

Sales Volumes

M O D E L

Mini Van Mini Van

Coupe Coupe Normal Metal Blue Blue

Carr Clyde

Carr Clyde

Normal Blue

Metal Blue

DEALERSHIP

COLOR
1/13/2012
354
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

354

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION Midwest

DISTRICT

Chicago

St. Louis

Gary

DEALERSHIP

Clyde

Gleason

Carr

Levi

Lucas

Bolton

Sales at region/District/Dealership Level

Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down

1/13/2012
355
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

355

OLAP Reporting - Drill Down

Inflows ( Region , Year)


200 150 Inflows 100 ($M) 50 0 Year Year 1999 2000 Years

East West Central

1/13/2012
356
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

356

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999)


90 80 70 60 50 Inflows ( $M) 40 30 20 10 0 1st Qtr 2nd Qtr 3rd Qtr Year 1999 4th Qtr

East West Central

Drill-down from Year to Quarter


1/13/2012
357
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

357

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999 - 1st Qtr)


20 15 Inflows ( $M 10 ) 5 0 January February March Year 1999 East West Central

Drill-down from Quarter to Month

358

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP


Multidimensional Databases for database and application logic layer

ROLAP - Relational OLAP


Access Data stored in relational Data Warehouse for OLAP Analysis. Database and Application logic provided as separate layers

HOLAP - Hybrid OLAP


OLAP Server routes queries first to MDDB, then to RDBMS and result processed on-the-fly in Server

DOLAP - Desk OLAP


Personal MDDB Server and application on the desktop

1/13/2012
359
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

359

MOLAP - MDDB storage

OLAP
Cube
OLAP Calculation Engine

Web Browser

OLAP Tools

OLAP Appli cations


1/13/2012
360
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

360

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
1/13/2012
361
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

361

ROLAP - Standard SQL storage

MDDB - Relational Mapping

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
1/13/2012
362
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

362

ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on midtier server; target database on database server Processing split between mid-tier & database servers

Ad hoc query capabilities to very large databases DW integration Data scalability

1/13/2012
363
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

363

HOLAP - Combination of RDBMS and MDDB


OLAP Cube
Any Client

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
1/13/2012
364
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

364

HOLAP - Features

RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user

1/13/2012
365
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

365

Architecture Comparison

MOLAP
Definition

ROLAP

HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent

MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent

Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed

Slow

Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis

Cost

Medium: MDDB Server + large disk space cost

Low: Only RDBMS + disk space cost

Where to apply?

Very large transactional Small transactional data + complex model + data & it needs to be viewed / sorted frequent summary analysis

1/13/2012
366
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

366

Representative OLAP Tools:

Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS

Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence

1/13/2012
367
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

367

Sample OLAP Applications

Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
1/13/2012
368
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

368

Data Warehouse Testing

369

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions  The methodology required for testing a Data Warehouse is different from testing a typical transaction system

370

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts: User-Triggered vs. System triggered Volume of Test Data Possible scenarios/ Test Cases Programming for testing challenge

371

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System.


 User-Triggered vs. System triggered In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing,Valuation.)

372

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of Test Data The test data in a transaction system is a very small sample of the overall production data. Data Warehouse has typically large test data as one does try to fill-up maximum possible combination of dimensions and facts.  Possible scenarios/ Test Cases In case of Data Warehouse, the permutations and combinations one can possibly test is virtually unlimited due to the core objective of Data Warehouse is to allow all possible views of data.

373

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge In case of transaction systems, users/business analysts typically test the output of the system. In case of data warehouse, most of the 'Data Warehouse data Quality testing' and ETL testing is done at backend by running separate stand-alone scripts. These scripts compare pre-Transformation to post Transformation of data.

374

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process


Data-Warehouse testing is basically divided into two parts : 'Back-end' testing where the source systems data is compared to the end-result data in Loaded area 'Front-end' testing where the user checks the data by comparing their MIS with the data displayed by the end-user tools like OLAP. Testing phases consists of :  Requirements testing  Unit testing  Integration testing  Performance testing  Acceptance testing

375

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors.  Are the requirements Complete?  Are the requirements Singular?  Are the requirements Ambiguous?  Are the requirements Developable?  Are the requirements Testable?

376

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source. All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data. Testing the rejected records that dont fulfil transformation rules.

377

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit Testing the Report data: Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified

378

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve following:  Sequence of ETLs jobs in batch.  Initial loading of records on data warehouse.  Incremental loading of records at a later date to verify the newly inserted or updated data.  Testing the rejected records that dont fulfil transformation rules.  Error log generation

379

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance Testing should check for :  ETL processes completing within time window.  Monitoring and measuring the data quality issues.  Refresh times for standard/complex reports.

380

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

381

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

382

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

383

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

384

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

385

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

386

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
387
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
 Dimension: a category of information. For example, the Time dimension.
 Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
 Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
 Fact Table: a table that contains the measures of interest.
 Lookup Table: provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
 Surrogate Keys: to avoid data-integrity issues, surrogate keys are used; they are helpful for Slowly Changing Dimensions and act as index/primary keys (see the sketch below).
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
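As a rough illustration of the surrogate-key idea (not taken from the slides), the sketch below assigns a new surrogate key to every version of a customer row in a Type-2 slowly changing dimension; the table, column and function names are made up for the example.

from datetime import date

# In-memory customer dimension: each version of a customer row gets its own
# surrogate key (cust_key), while cust_id remains the natural (business) key.
customer_dim = []          # list of dimension rows
next_surrogate_key = 1     # simple sequence generator

def upsert_customer(cust_id, city, as_of=None):
    """Type-2 style change: expire the current row and insert a new version."""
    global next_surrogate_key
    as_of = as_of or date.today()
    for row in customer_dim:
        if row["cust_id"] == cust_id and row["current"]:
            if row["city"] == city:
                return row["cust_key"]      # no change, reuse the surrogate key
            row["current"] = False          # expire the old version
            row["end_date"] = as_of
    row = {"cust_key": next_surrogate_key, "cust_id": cust_id, "city": city,
           "start_date": as_of, "end_date": None, "current": True}
    customer_dim.append(row)
    next_surrogate_key += 1
    return row["cust_key"]

# Fact rows reference the surrogate key, so history is preserved when a customer moves.
k1 = upsert_customer(53, "sfo")
k2 = upsert_customer(53, "la")    # same business key, new surrogate key
print(k1, k2, customer_dim)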
388
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table: product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

Dimension Table: store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

Fact Table: sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la
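To make the star schema concrete, here is a minimal, self-contained sketch in Python using SQLite. It loads the example rows shown above and runs a typical star query: join the fact table to each needed dimension and aggregate the measures. The SQL column types and the query itself are illustrative, not part of the original material.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables
cur.execute("CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER)")
cur.execute("CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT)")
# Fact table referencing the dimensions
cur.execute("""CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER,
               prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER)""")

cur.executemany("INSERT INTO product VALUES (?,?,?)", [("p1", "bolt", 10), ("p2", "nut", 5)])
cur.executemany("INSERT INTO store VALUES (?,?)", [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
cur.executemany("INSERT INTO customer VALUES (?,?,?,?)",
                [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                 (111, "sally", "80 willow", "la")])
cur.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                 ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                 ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Typical star-schema query: one join per dimension used, then aggregate the measures.
cur.execute("""SELECT s.city, p.name, SUM(f.qty) AS total_qty, SUM(f.amt) AS total_amt
               FROM sale f
               JOIN store s   ON f.storeId = s.storeId
               JOIN product p ON f.prodId  = p.prodId
               GROUP BY s.city, p.name""")
print(cur.fetchall())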

389

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table: store
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

Dimension Table: sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

Dimension Table: city
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

Dimension Table: region
  regId  name
  north  cold region
  south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not normalized much and are frequently designed at a level of normalization short of third normal form.
390
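A small sketch of querying the snowflaked store dimension, assuming (as reconstructed above) that the city table carries the regId foreign key. The point it illustrates is that reaching a region attribute now costs extra joins compared with the star schema; everything beyond the table and column names shown on the slide is illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The store dimension is snowflaked into sType, city and region tables.
cur.execute("CREATE TABLE store (storeId TEXT, cityId TEXT, tId TEXT, mgr TEXT)")
cur.execute("CREATE TABLE sType (tId TEXT, size TEXT, location TEXT)")
cur.execute("CREATE TABLE city (cityId TEXT, pop TEXT, regId TEXT)")
cur.execute("CREATE TABLE region (regId TEXT, name TEXT)")

cur.executemany("INSERT INTO store VALUES (?,?,?,?)",
                [("s5", "sfo", "t1", "joe"), ("s7", "sfo", "t2", "fred"), ("s9", "la", "t1", "nancy")])
cur.executemany("INSERT INTO sType VALUES (?,?,?)",
                [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
cur.executemany("INSERT INTO city VALUES (?,?,?)",
                [("sfo", "1M", "north"), ("la", "5M", "south")])
cur.executemany("INSERT INTO region VALUES (?,?)",
                [("north", "cold region"), ("south", "warm region")])

# Reaching a region attribute takes two extra joins compared with the star schema.
cur.execute("""SELECT st.storeId, st.mgr, r.name AS region_name
               FROM store st
               JOIN city c   ON st.cityId = c.cityId
               JOIN region r ON c.regId   = r.regId""")
print(cur.fetchall())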

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

391

2009 Wipro Ltd - Confidential

The Need For Data Quality


 Difficulty in decision making
 Time delays in operation
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with error detection, error rework, customer service, and fixing customer problems

392

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


1. Understand information flow in the organization
   - Identify authoritative data sources
   - Interview employees & customers
   - Locate data entry points
   - Estimate the cost of bad data
2. Identify potential problem areas & assess impact
3. Measure quality of data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values (see the sketch after this list)
4. Clean & load data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous monitoring
   - Schedule periodic cleansing of source data
6. Identify areas of improvement
   - Identify & correct causes of defects
   - Refine data capture mechanisms at source
   - Educate users on the importance of DQ
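The sketch referenced in step 3 above: a toy example of the kind of rule-based profiling such tools automate, flagging missing, out-of-range and duplicate values so that only clean rows are loaded. The records, rules and thresholds are invented for illustration.

# Minimal data-quality profiling sketch (illustrative records and rules only).
records = [
    {"cust_id": 53,  "age": 27,  "email": "joe@example.com"},
    {"cust_id": 81,  "age": 212, "email": ""},                  # out-of-range age, missing email
    {"cust_id": 81,  "age": 41,  "email": "fred@example.com"},  # duplicate business key
]

def profile(rows):
    """Return a list of (row index, issue) pairs found by simple rules."""
    issues = []
    seen = set()
    for i, r in enumerate(rows):
        if not r["email"]:
            issues.append((i, "missing email"))
        if not (0 <= r["age"] <= 120):
            issues.append((i, "age out of range"))
        if r["cust_id"] in seen:
            issues.append((i, "duplicate cust_id"))
        seen.add(r["cust_id"])
    return issues

# Only rows with no issues would be loaded into the warehouse; the rest are
# reported back so the defect can be corrected at the source.
print(profile(records))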
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

393

Data Quality Solution


Customized Programs
 Strengths: addresses specific needs; no bulky one-time investment
 Limitations: tons of custom programs in different environments are difficult to manage; minor alterations demand coding effort
Data Quality Assessment Tools
 Strength: provide automated assessment
 Limitation: no measure of data accuracy

394

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery Tools
 Strengths: detect correlations in data values; can detect patterns of behavior that indicate fraud
 Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields
Data Reengineering & Cleansing Tools
 Strengths: usually integrated packages with cleansing features as add-ons
 Limitations: error prevention at source is usually absent; the ETL tools have limited cleansing facilities
395
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star
 Data Reengineering & Cleansing Tools: Carlton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools: Migration Architect / Evoke Axio from Evoke Software; Wizrule from Wizsoft
 Name & Address Cleansing Tools: Centrus Suite from Sagent; I.d.centric from First Logic

396

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

397

2009 Wipro Ltd - Confidential

ETL Architecture

[Diagram: visitors' web browsers on the Internet generate web server logs and e-commerce transaction data in flat files; together with external data (demographics, household, webographics, income) and other OLTP systems, this is collected into a staging area where it is cleaned, transformed, matched and merged. Scheduled extraction and scheduled loading move the data into an RDBMS-based enterprise data warehouse, supported by a metadata repository. The stages are data collection, data extraction, data transformation, data loading, and data storage & integration.]

398

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updation of metadata
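A compact sketch of the extract, transform and load steps described above, written as three plain Python functions over an in-memory CSV extract and an SQLite target. The source fields, selection criterion and transformation rules are hypothetical; the intent is only to show the shape of the pipeline.

import csv, io, sqlite3

# Illustrative source extract (in practice this would be a file or a source database).
SOURCE_CSV = """order_id,order_date,amount,currency
o100,1997-07-01,12,USD
o102,1997-07-02,,USD
o105,1997-08-03,50,usd
"""

def extract(text):
    """Read qualified rows from the source (selection criterion: non-empty amount)."""
    return [row for row in csv.DictReader(io.StringIO(text)) if row["amount"]]

def transform(rows):
    """Convert types, standardize codes and add a derived time attribute."""
    out = []
    for r in rows:
        out.append({
            "order_id": r["order_id"],
            "year": int(r["order_date"][:4]),      # added time attribute
            "amount": float(r["amount"]),          # type conversion
            "currency": r["currency"].upper(),     # code standardization
        })
    return out

def load(rows, conn):
    """Initial load into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_order (order_id TEXT, year INT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO fact_order VALUES (:order_id, :year, :amount, :currency)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
print(conn.execute("SELECT * FROM fact_order").fetchall())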

399

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
 The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

400

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

401

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


 Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs
 Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes
 Extract: the process of reading data from a database
 Transform: the process of converting the extracted data
 Load: the process of writing the data into the target database
 Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
 Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


402

ETL Tools
 Provide a facility to specify a large number of transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
 PowerCentre/Mart from Informatica
 Data Mart Solution from Sagent Technology
 DataStage from Ascential
403

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

404

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is information...
 that describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 about the data being captured and loaded into the warehouse
 documented in IT tools that improves both business and technical understanding of data and data-related processes
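As a rough sketch of what such metadata might look like in practice, the snippet below records the what/where/how/when/who for one warehouse column and answers a simple lineage question from it. All field names and values are invented for illustration.

# Illustrative metadata entry for one warehouse column (field names are hypothetical).
column_metadata = {
    "table": "fact_order",
    "column": "amount",
    "business_definition": "Order amount in the customer's billing currency",
    "source": {"system": "orders_oltp", "table": "ORDERS", "column": "AMT"},   # the WHERE
    "transformation": "cast to decimal; currency code converted to upper case", # the HOW
    "load_job": "nightly_order_load",                                           # the WHEN / WHO
    "last_loaded": "2009-06-30T02:15:00",
    "steward": "finance.data.owner@example.com",
}

def lineage(meta):
    """Answer the 'where did this number come from?' question from the metadata."""
    src = meta["source"]
    return f'{meta["table"]}.{meta["column"]} <- {src["system"]}.{src["table"]}.{src["column"]}'

print(lineage(column_metadata))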

405

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating information
- How much time is spent looking for information? How often is the information found?
- What poor decisions were made based on incomplete information? How much money was lost or earned as a result?
Interpreting information
- How many times have businesses needed to rework or recall products? What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation? How much misinterpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?
Integrating information
- How do the various data perspectives connect together? How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?

406

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views
 Document/manage metadata descriptions from an integrated development environment
 Enable DW users to identify and invoke pre-built queries against the data stores
 Design and enhance new data models and schemas for the data warehouse
 Capture data transformation rules between the operational and data warehousing databases
 Provide change impact analysis and updates across these technologies
407
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical users: warehouse administrator, application developer
 Business users - business metadata: meanings, definitions, business rules
 Software tools used in DW life-cycle development:
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

408

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools
 Oracle Exchange: technology of choice for a long list of repository, enterprise and workgroup vendors
 Reischmann-Informatik Toolbus: features include facilitation of selective bridging of metadata
 Ardent Software / Dovetail Software Interplay: hub-and-spoke solution for enabling metadata interoperability; Ardent focusing on its own engagements, not selling it as an independent product
 Informix's Metadata Plug-ins: available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy
409
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories
 IBM, Oracle and Microsoft to offer free or near-free basic repository services
 Enable organisations to reuse metadata across technologies
 Integrate DB design, data transformation and BI tools from different vendors
 Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
 Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

410

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards
 CDIF (CASE Data Interchange Format): most frequently used interchange standard; addresses only a limited subset of metadata artifacts
 OMG (Object Management Group) CWM: XML addresses context and data meaning, not presentation; can enable exchange over the web employing industry standards for storing and sharing programming data; will allow sharing of UML and MOF objects between various development tools and repositories
 MDC (Metadata Coalition): based on XML/UML standards; promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member) and Viasoft
459
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


OLAP

463

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools

1/13/2012

464

464

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
 Used interchangeably with BI
 A multidimensional view of data is the foundation of OLAP
 Users: analysts, decision makers

1/13/2012

465

465

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


Source of data
  OLTP System: operational data; OLTP systems are the original source of the data
  OLAP System: consolidated data; OLAP data comes from the various OLTP databases
Purpose of data
  OLTP System: to control and run fundamental business tasks
  OLAP System: decision support
What the data reveals
  OLTP System: a snapshot of ongoing business processes
  OLAP System: multi-dimensional views of various kinds of business activities
Inserts and updates
  OLTP System: short and fast inserts and updates initiated by end users
  OLAP System: periodic long-running batch jobs refresh the data

466

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data:
 The edges of the cube are called dimensions
 Individual items within each dimension are called members
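A minimal sketch of a hypercube as a mapping from one member per dimension to a cell value, using the Sales Volumes dimensions from the following slides; the cell values are illustrative. It also shows how the 3 x 3 x 3 = 27 addressable cells relate to the (possibly much smaller) set of populated cells.

from itertools import product as cartesian

# Dimensions and their members (from the Sales Volumes example).
MODELS  = ["Mini Van", "Coupe", "Sedan"]
COLORS  = ["Blue", "Red", "White"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# A hypercube as a mapping from a coordinate (one member per dimension) to a cell value.
cube = {}
cube[("Mini Van", "Blue", "Clyde")] = 6   # illustrative values
cube[("Coupe", "Red", "Gleason")] = 3
cube[("Sedan", "White", "Carr")] = 5

# Every combination of members addresses one of the 3 x 3 x 3 = 27 cells;
# combinations that never occur simply stay absent (sparsity).
all_cells = list(cartesian(MODELS, COLORS, DEALERS))
print(len(all_cells), "cells,", len(cube), "populated")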

467

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS: sales volumes are held in a flat table with columns MODEL, COLOR, DEALER and VOL (e.g., MINI VAN / BLUE / Clyde / 6), one row per combination - 27 rows x 4 columns = 108 cells.

MDDB: the same Sales Volumes data is held as a cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr) - 3 x 3 x 3 = 27 cells.
468

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS


 Ease of data presentation & navigation: a great deal of information is gleaned immediately upon direct inspection of the array; the user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by a relational table.
 Storage space: very low space consumption compared to a relational DB.
 Performance: gives much better performance; a relational DB may give comparable results only through database tuning (indexing, keys etc.), which may not be possible for ad-hoc queries.
 Ease of maintenance: no overhead, as data is stored in the same way it is viewed; in a relational DB, indexes, sophisticated joins etc. are used, which require considerable storage and maintenance.
1/13/2012
469
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

469

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Sparsity increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)

1/13/2012
470
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

470

Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, a blank cell is left behind. For example, crossing the Employee # and Age dimensions for the employees SMITH (01, 21), REGAN (12, 19), FOX (31, 63), WELD (14, 31), KELLY (54, 27), LINK (03, 56), KRANZ (41, 45), LUCUS (33, 41) and WEISS (23, 19) populates only one cell per employee, leaving the rest of the Employee # x Age plane empty.

1/13/2012
471
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

471

OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members
Trend analysis over sequential time periods; what-if scenarios
Slicing/dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down/up along the hierarchy
Reach-through / drill-through to underlying detail data
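Assuming pandas is available, the sketch below shows rotation and slicing on a small, invented sales-volume data set: the same facts are pivoted with Model or Color on the rows (rotation), and filtered to one dealership before pivoting (a slice).

import pandas as pd

# Illustrative sales-volume facts (Model x Color x Dealership).
sales = pd.DataFrame({
    "model":  ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan", "Sedan"],
    "color":  ["Blue", "Red", "Blue", "White", "Red", "White"],
    "dealer": ["Clyde", "Carr", "Gleason", "Clyde", "Carr", "Gleason"],
    "volume": [6, 5, 3, 2, 4, 5],
})

# View #1: Model on the rows, Color on the columns.
view1 = sales.pivot_table(index="model", columns="color", values="volume", aggfunc="sum")
# Rotation: swap the dimensions shown on the axes.
view2 = sales.pivot_table(index="color", columns="model", values="volume", aggfunc="sum")
# Slice: restrict the cube to one dealership before pivoting.
clyde_slice = sales[sales["dealer"] == "Clyde"]

print(view1, view2, clyde_slice, sep="\n\n")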

1/13/2012
472
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

472

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation of the cube.

[Diagram: the 2-dimensional Sales Volumes array shown with MODEL (Mini Van, Coupe, Sedan) on the rows and COLOR (Blue, Red, White) on the columns (View #1), then rotated 90 degrees so that COLOR is on the rows and MODEL on the columns (View #2).]

A 2-dimensional array has 2 views.
1/13/2012
473
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

473

Features of OLAP - Rotation


[Diagram: the 3-dimensional Sales Volumes cube (MODEL x COLOR x DEALERSHIP) rotated 90 degrees at a time, producing six different orientations of the dimensions in the viewing area (Views #1 through #6).]

A 3-dimensional array has 6 views.
1/13/2012
474
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

474

Features of OLAP - Slicing / Filtering


 MDDB allows the end user to quickly slice in on the exact view of the data required.

[Diagram: the Sales Volumes cube sliced down to the Mini Van and Coupe models, the Normal Blue and Metal Blue colors, and the Carr and Clyde dealerships.]
1/13/2012
475
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

475

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION: Region (Midwest) -> District (Chicago, St. Louis, Gary) -> Dealership (Clyde, Gleason, Carr, Levi, Lucas, Bolton). Sales can be viewed at the Region, District or Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
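A small sketch of roll-up and drill-down along a Region -> Year -> Quarter hierarchy, assuming pandas and using invented inflow figures: the same measure is aggregated first at the Region/Year level and then one level lower.

import pandas as pd

# Illustrative inflows with a Region -> Year -> Quarter -> Month hierarchy.
inflows = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "Central", "Central"],
    "year":    [1999, 1999, 1999, 2000, 1999, 2000],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q1", "Q2"],
    "month":   ["January", "April", "February", "March", "January", "May"],
    "amount":  [80, 60, 50, 70, 40, 55],
})

# Roll-up: totals at the Region/Year level.
by_year = inflows.groupby(["region", "year"])["amount"].sum()
# Drill-down: the same measure one level lower, at the Quarter level.
by_quarter = inflows.groupby(["region", "year", "quarter"])["amount"].sum()

print(by_year, by_quarter, sep="\n\n")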

1/13/2012
476
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

476

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000.]

1/13/2012
477
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

477

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the four quarters of Year 1999.]

Drill-down from Year to Quarter


1/13/2012
478
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

478

OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.]

Drill-down from Quarter to Month

479

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases provide the database and application logic layer

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis; database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the results are processed on the fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop

1/13/2012
480
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

480

MOLAP - MDDB storage

[Diagram: an OLAP cube stored in the MDDB is served by the OLAP calculation engine and accessed from web browsers, OLAP tools and OLAP applications.]
1/13/2012
481
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

481

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
1/13/2012
482
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

482

ROLAP - Standard SQL storage

[Diagram: the OLAP calculation engine performs an MDDB-to-relational mapping, issuing SQL against the relational data warehouse; it is accessed from web browsers, OLAP tools and OLAP applications.]
1/13/2012
483
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

483

ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on midtier server; target database on database server Processing split between mid-tier & database servers

Ad hoc query capabilities to very large databases DW integration Data scalability

1/13/2012
484
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

484

HOLAP - Combination of RDBMS and MDDB


[Diagram: the OLAP calculation engine serves an OLAP cube for summary data and issues SQL against the relational data warehouse for detail data; any client - web browsers, OLAP tools, OLAP applications - can access it.]
1/13/2012
485
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

485

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of the RDBMS plus MDDB performance
Calculation engine provides full analysis features
Source of data is transparent to the end user

1/13/2012
486
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

486

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part
Data explosion due to summarization
  MOLAP: With good design, 3-10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent
Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum - if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
Cost
  MOLAP: Medium - MDDB server + large disk space cost
  ROLAP: Low - only RDBMS + disk space cost
  HOLAP: High - RDBMS + disk space + MDDB server cost
Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed/sorted
  HOLAP: Large transactional data + frequent summary analysis

1/13/2012
487
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

487

Representative OLAP Tools:

Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / WebIntelligence

1/13/2012
488
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

488

Sample OLAP Applications

Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Application
NPA Management
Strategic Planning
Customer Relationship Management (CRM)
1/13/2012
489
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

489

Data Warehouse Testing

490

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
 The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

491

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

492

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System.


 User-triggered vs. system-triggered: in a data warehouse, most of the testing is system-triggered. Most production/source-system testing is the processing of individual transactions driven by some input from the users (application form, servicing request, etc.). There are very few test cycles that cover system-triggered scenarios (like billing or valuation).

493

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of test data: the test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.
 Possible scenarios / test cases: in the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

494

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge: in the case of transaction systems, users/business analysts typically test the output of the system. In the case of a data warehouse, most of the data-quality and ETL testing is done at the back end by running separate stand-alone scripts that compare pre-transformation data with post-transformation data.

495

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools like OLAP
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

496

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements ambiguous?
 Are the requirements developable?
 Are the requirements testable?

497

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source
- All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data (see the sketch below)
- Testing the rejected records that don't fulfil the transformation rules
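A toy example of the kind of stand-alone back-end check described here: reconciling a row count and a control total between a source table and the loaded warehouse table, using SQLite in memory. Table names, the control measure and the test data are illustrative only.

import sqlite3

# Back-end ETL check sketch: compare row counts and a control total
# between a source table and the loaded warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id TEXT, amount REAL);
    CREATE TABLE dw_fact_order (order_id TEXT, amount REAL);
    INSERT INTO src_orders VALUES ('o100', 12), ('o102', 11), ('o105', 50);
    INSERT INTO dw_fact_order VALUES ('o100', 12), ('o102', 11), ('o105', 50);
""")

def reconcile(conn, source, target, measure):
    """Fail loudly if the target table does not match the source on count or control total."""
    src_count, src_sum = conn.execute(f"SELECT COUNT(*), SUM({measure}) FROM {source}").fetchone()
    tgt_count, tgt_sum = conn.execute(f"SELECT COUNT(*), SUM({measure}) FROM {target}").fetchone()
    assert src_count == tgt_count, f"row count mismatch: {src_count} vs {tgt_count}"
    assert src_sum == tgt_sum, f"control total mismatch: {src_sum} vs {tgt_sum}"
    return "OK"

print(reconcile(conn, "src_orders", "dw_fact_order", "amount"))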

498

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing the report data:
- Verify report data with the source: data present in a data warehouse is stored at an aggregate level compared to the source systems; the QA team should verify the granular data stored in the data warehouse against the available source data
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report and should trace back and compare them with the source systems
- Derivation formulae / calculation rules should be verified

499

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following:
 Sequence of ETL jobs in a batch
 Initial loading of records into the data warehouse
 Incremental loading of records at a later date to verify the newly inserted or updated data
 Testing the rejected records that don't fulfil the transformation rules
 Error log generation

500

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance Testing should check for :  ETL processes completing within time window.  Monitoring and measuring the data quality issues.  Refresh times for standard/complex reports.

501

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

502

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

503

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

504

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

506

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

507

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

508

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

509

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon



Data Warehouse Architecture


The basic design has source systems whose data is loaded into the warehouse, and users query that data for different purposes.

A more complete design adds a staging area, where the data is cleansed, transformed and tested before being loaded into the target database/warehouse. The warehouse is in turn divided into data marts, which different users access for their reporting and analysis purposes.

527

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

528

2009 Wipro Ltd - Confidential

Data Modeling
The E-R data model is commonly used in OLTP systems; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, e.g. employee, book, student.
Relationship: relates entities to other entities.

 Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

 Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema
529
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
 Dimension: A category of information. For example, the time dimension.
 Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
 Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
 Fact Table: A table that contains the measures of interest.
 Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
 Surrogate Keys: Used to protect data integrity when natural keys change or collide. They are helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
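To make the surrogate-key point concrete, the sketch below shows one common use, a Type-2 Slowly Changing Dimension. This is a minimal illustration, not from the deck; the table layout, column names and the add_new_version helper are assumptions.

    # Minimal sketch of a Type-2 Slowly Changing Dimension using surrogate keys.
    # The natural key (cust_id) repeats across versions; the surrogate key (cust_sk)
    # is unique per row, so the fact table can point at the version that was current.
    customer_dim = [
        {"cust_sk": 1, "cust_id": 53, "name": "joe", "city": "sfo", "current": True},
    ]

    def add_new_version(dim, cust_id, **changed):
        """Expire the current row for cust_id and insert a new version with a fresh surrogate key."""
        current = next(r for r in dim if r["cust_id"] == cust_id and r["current"])
        current["current"] = False
        new_row = dict(current, **changed)          # copy the old attributes, apply the changes
        new_row["cust_sk"] = max(r["cust_sk"] for r in dim) + 1
        new_row["current"] = True
        dim.append(new_row)
        return new_row["cust_sk"]

    add_new_version(customer_dim, 53, city="la")    # customer 53 moves; history is preserved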
530
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table: product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

Dimension Table: store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

Fact Table: sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la
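As a minimal sketch of this star schema in practice, the example below builds the same tables and runs a typical fact-to-dimension query. It uses SQL through Python's built-in sqlite3 module purely for illustration; the deck does not prescribe a database, and only the table and column names above come from the slide.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
        CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
        CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
        CREATE TABLE sale     (orderId TEXT, date TEXT,
                               custId INTEGER REFERENCES customer(custId),
                               prodId TEXT REFERENCES product(prodId),
                               storeId TEXT REFERENCES store(storeId),
                               qty INTEGER, amt INTEGER);
    """)
    conn.executemany("INSERT INTO product VALUES (?,?,?)", [("p1", "bolt", 10), ("p2", "nut", 5)])
    conn.executemany("INSERT INTO store VALUES (?,?)", [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
    conn.executemany("INSERT INTO customer VALUES (?,?,?,?)",
                     [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                      (111, "sally", "80 willow", "la")])
    conn.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                     [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                      ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                      ("105",  "3/8/97", 111, "p1", "c3", 5, 50)])

    # Typical star-schema query: total sales amount by store city and product name.
    for row in conn.execute("""
        SELECT s.city, p.name, SUM(f.amt) AS total_amt
        FROM sale f
        JOIN store s   ON f.storeId = s.storeId
        JOIN product p ON f.prodId  = p.prodId
        GROUP BY s.city, p.name
    """):
        print(row)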

531

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
In the snowflake schema the store dimension is further normalized into sType, city and region tables:

store
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

city
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

region
  regId  name
  north  cold region
  south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not highly normalized and are frequently designed at a level of normalization short of third normal form.
532
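The practical cost of snowflaking is the extra joins a query has to walk. A minimal sketch, assuming a sale fact table keyed by storeId (the query below is illustrative and not from the deck):

    # Total sales quantity by region name, walking store -> city -> region.
    # Compared with the star schema, two extra joins are needed to reach the region attribute.
    snowflake_query = """
        SELECT r.name AS region, SUM(f.qty) AS total_qty
        FROM sale f
        JOIN store  s ON f.storeId = s.storeId
        JOIN city   c ON s.cityId  = c.cityId
        JOIN region r ON c.regId   = r.regId
        GROUP BY r.name
    """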

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

533

2009 Wipro Ltd - Confidential

The Need For Data Quality


 Difficulty in decision making
 Time delays in operation
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with: error detection, error rework, customer service, fixing customer problems

534

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


1. Understand Information Flow In Organization
   o Identify authoritative data sources
   o Interview employees & customers
   o Data entry points
   o Cost of bad data

2. Identify Potential Problem Areas & Assess Impact

3. Measure Quality Of Data
   o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
   o Use data cleansing tools to clean data at the source
   o Load only clean data into the data warehouse

5. Continuous Monitoring
   o Schedule periodic cleansing of source data

6. Identify Areas of Improvement
   o Identify & correct cause of defects
   o Refine data capture mechanisms at source
   o Educate users on importance of DQ
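To make the "measure quality of data" step concrete, here is a minimal profiling sketch. The column names and the rules (duplicate customer id, missing email, age range) are illustrative assumptions, not part of the deck.

    # Minimal data-quality profiling sketch: flag missing, duplicate and out-of-range values.
    records = [
        {"cust_id": 53, "age": 21, "email": "joe@example.com"},
        {"cust_id": 53, "age": 21, "email": "joe@example.com"},   # duplicate natural key
        {"cust_id": 81, "age": -4, "email": None},                # out-of-range age, missing email
    ]

    def profile(rows):
        issues = []
        seen = set()
        for i, r in enumerate(rows):
            if r["cust_id"] in seen:
                issues.append((i, "duplicate cust_id"))
            seen.add(r["cust_id"])
            if r["email"] is None:
                issues.append((i, "missing email"))
            if not (0 <= r["age"] <= 120):
                issues.append((i, "age out of range"))
        return issues

    print(profile(records))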
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

535

Data Quality Solution


Customized Programs
 Strengths
   o Addresses specific needs
   o No bulky one-time investment
 Limitations
   o Tons of custom programs in different environments are difficult to manage
   o Minor alterations demand coding efforts

Data Quality Assessment Tools
 Strength
   o Provide automated assessment
 Limitation
   o No measure of data accuracy

536

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery Tools
 Strengths
   o Detect correlation in data values
   o Can detect patterns of behavior that indicate fraud
 Limitations
   o Not all variables can be discovered
   o Some discovered rules might not be pertinent
   o There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
 Strengths
   o Usually are integrated packages with cleansing features as add-ons
 Limitations
   o Error prevention at source is usually absent
   o The ETL tools have limited cleansing facilities
537
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools
   o Integrity Data Reengineering Tool from Vality Technology
   o Trillium Software System from Harte-Hanks Data Technologies
   o Migration Architect from DB Star
 Data Reengineering & Cleansing Tools
   o Carlton Pureview from Oracle
   o ETI-Extract from Evolutionary Technologies
   o PowerMart from Informatica Corp
   o Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools
   o Migration Architect, Evoke Axio from Evoke Software
   o Wizrule from Wizsoft
 Name & Address Cleansing Tools
   o Centrus Suite from Sagent
   o I.d.centric from First Logic

538

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

539

2009 Wipro Ltd - Confidential

ETL Architecture

[Diagram] Data flows from the sources (web server logs and e-commerce transaction data generated by visitors' web browsers over the Internet, external data such as demographics, household, webographics and income, plus other OLTP systems) into a staging area as flat files and RDBMS extracts. In the staging area the data is cleaned, transformed, matched and merged, with a metadata repository describing the process. Scheduled extraction feeds the staging area and scheduled loading moves the result into the enterprise data warehouse. The stages are: Data Collection, Data Extraction, Data Transformation, Data Loading, and Data Storage & Integration.

540

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction
   o Rummages through a file or database
   o Uses some criteria for selection
   o Identifies qualified data
   o Transports the data over onto another file or database

Data Transformation
   o Integrating dissimilar data types
   o Changing codes
   o Adding a time attribute
   o Summarizing data
   o Calculating derived values
   o Renormalizing data

Data Extraction Cleanup
   o Restructuring of records or fields
   o Removal of operational-only data
   o Supply of missing field values
   o Data integrity checks
   o Data consistency and range checks, etc.

Data Loading
   o Initial and incremental loading
   o Updating of metadata
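A minimal end-to-end sketch of these steps in Python is shown below. The file name, column names, selection criterion and transformation rules are illustrative assumptions, not taken from the deck; sqlite3 stands in for the target database.

    import csv, sqlite3

    def extract(path):
        """Extraction: read qualifying rows from a source flat file."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["status"] == "COMPLETE":        # selection criterion (assumed)
                    yield row

    def transform(row):
        """Transformation: change codes, add a time attribute, calculate derived values."""
        return {
            "order_id":  row["id"],
            "country":   {"US": "United States", "IN": "India"}.get(row["country"], "Other"),
            "load_date": "2009-01-01",                 # time attribute added at load time
            "amount":    round(float(row["qty"]) * float(row["unit_price"]), 2),
        }

    def load(rows, conn):
        """Loading: initial/incremental insert into the target table."""
        conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                     "(order_id TEXT, country TEXT, load_date TEXT, amount REAL)")
        conn.executemany("INSERT INTO fact_orders VALUES "
                         "(:order_id, :country, :load_date, :amount)", rows)
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect("warehouse.db")
        load((transform(r) for r in extract("orders.csv")), conn)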

541

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
 The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

542

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

543

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


 Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs
 Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes
 Extract: The process of reading data from a database
 Transform: The process of converting the extracted data
 Load: The process of writing the data into the target database
 Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
 Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


544

ETL Tools
 Provide facility to specify a large number of transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
 PowerCentre/Mart from Informatica
 Data Mart Solution from Sagent Technology
 DataStage from Ascential
545

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

546

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is information...
 That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
 About the data being captured and loaded into the warehouse
 Documented in IT tools that improves both business and technical understanding of data and data-related processes
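As an illustration of the what/when/who/where/how, metadata for a single warehouse column might be captured as a simple record like the one below. The field names and values are illustrative assumptions, not a prescribed metadata model.

    # Illustrative metadata record for one warehouse column.
    column_metadata = {
        "what":  {"table": "fact_orders", "column": "amount", "datatype": "REAL",
                  "business_definition": "Order value in USD after discounts"},
        "where": {"source_system": "orders.csv", "source_field": "qty * unit_price"},
        "how":   {"transformation": "round(qty * unit_price, 2)", "etl_job": "load_fact_orders"},
        "when":  {"last_loaded": "2009-01-01", "refresh_frequency": "daily"},
        "who":   {"data_owner": "Sales Ops", "data_steward": "DW team"},
    }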

547

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating information
   o Time spent looking for information
   o How often is the information found?
   o What poor decisions were made based on incomplete information?
   o How much money was lost or earned as a result?
Interpreting information
   o How many times have businesses needed to rework or recall products? What impact does it have on the bottom line?
   o How many mistakes were due to misinterpretation of existing documentation?
   o How much misinterpretation results from too much metadata?
   o How much time is spent trying to determine if any of the metadata is accurate?
Integrating information
   o How do various data perspectives connect together? How much time is spent trying to figure that out?
   o How much does the inefficiency and lack of metadata affect decision making?

548

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views
 Document/manage metadata descriptions from an integrated development environment
 Enable DW users to identify and invoke pre-built queries against the data stores
 Design and enhance new data models and schemas for the data warehouse
 Capture data transformation rules between the operational and data warehousing databases
 Provide change impact analysis, and update across these technologies
549
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical users
   o Warehouse administrator
   o Application developer
 Business users - business metadata
   o Meanings
   o Definitions
   o Business rules
 Software tools used in DW life-cycle development
   o Metadata requirements for each tool must be identified
   o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
   o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

550

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools
 Oracle Exchange
   o Technology of choice for a long list of repository, enterprise and workgroup vendors
 Reischmann-Informatik - Toolbus
   o Features include facilitation of selective bridging of metadata
 Ardent Software / Dovetail Software - Interplay
   o Hub-and-spoke solution for enabling metadata interoperability
   o Ardent focusing on its own engagements, not selling it as an independent product
 Informix's Metadata Plug-ins
   o Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy
551
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories
 IBM, Oracle and Microsoft to offer free or near-free basic repository services
 Enable organisations to reuse metadata across technologies
 Integrate DB design, data transformation and BI tools from different vendors
 Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
 Both IBM and Oracle have multiple repositories for different lines of products, e.g. one for AD and one for DW, with bridges between them

552

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards
 CDIF (CASE Data Interchange Format)
   o Most frequently used interchange standard
   o Addresses only a limited subset of metadata artifacts
 OMG (Object Management Group) - CWM
   o XML-based; addresses context and data meaning, not presentation
   o Can enable exchange over the web employing industry standards for storing and sharing programming data
   o Will allow sharing of UML and MOF objects between various development tools and repositories
 MDC (Metadata Coalition)
   o Based on XML/UML standards
   o Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (founding member), Viasoft
553
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP

554

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools

1/13/2012

555

555

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows users to view aggregate data across measures (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
 Used interchangeably with BI
 Multidimensional view of data is the foundation of OLAP
 Users: analysts, decision makers

1/13/2012

556

556

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


Source of data
   o OLTP system: Operational data; OLTPs are the original source of the data
   o OLAP system: Consolidation data; OLAP data comes from the various OLTP databases
Purpose of data
   o OLTP system: To control and run fundamental business tasks
   o OLAP system: Decision support
What the data reveals
   o OLTP system: A snapshot of ongoing business processes
   o OLAP system: Multi-dimensional views of various kinds of business activities
Inserts and updates
   o OLTP system: Short and fast inserts and updates initiated by end users
   o OLAP system: Periodic long-running batch jobs refresh the data
557
1/13/2012

557

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for the efficient and convenient storage and retrieval of data that is:
 intimately related, and
 stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data:
 The edges of the cube are called dimensions
 Individual items within each dimension are called members

558

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


[Diagram] Relational DBMS: the same sales data requires one row per combination, with columns MODEL (Mini Van, Sports Coupe, Sedan, ...), COLOR (Blue, Red, White), DEALER (Clyde, Gleason, Carr) and VOL. - 27 rows x 4 columns = 108 cells.

MDDB: a Sales Volumes cube with three dimensions - MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr) - holds the same information in 3 x 3 x 3 = 27 cells.
559
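A minimal sketch of the same idea in Python: the relational form is a list of rows, while the multidimensional form holds one cell per (model, color, dealer) combination. The sample volumes below are made up for illustration.

    # Relational form: one row per (model, color, dealer) with a volume column.
    rows = [
        ("Mini Van", "Blue", "Clyde",   6),
        ("Mini Van", "Blue", "Gleason", 3),
        ("Mini Van", "Blue", "Carr",    2),
        # ... 27 rows x 4 columns = 108 cells in total
    ]

    # Multidimensional form: the dimension members index the cell directly.
    cube = {(model, color, dealer): vol for model, color, dealer, vol in rows}

    # Direct cell lookup instead of scanning and filtering rows.
    print(cube[("Mini Van", "Blue", "Gleason")])   # -> 3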

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS


 Ease of data presentation & navigation
   o A great deal of information is gleaned immediately upon direct inspection of the array
   o The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
 Storage space
   o Very low space consumption compared to a relational DB
 Performance
   o Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys etc.), which may not be possible for ad-hoc queries
 Ease of maintenance
   o No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins etc. are used, which require considerable storage and maintenance
1/13/2012
560
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

560

Issues with MDDB

Sparsity
   o Input data in applications is typically sparse
   o Increases with increased dimensions

Data Explosion
   o Due to sparsity
   o Due to summarization

Performance
   o Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

1/13/2012
561
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

561

Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, the corresponding cell is left blank.

[Example] An employee list (LAST NAME, EMP#, AGE: SMITH 01 21, REGAN 12 19, FOX 31 63, WELD 14 31, KELLY 54 27, LINK 03 56, KRANZ 41 45, LUCUS 33 41, WEISS 23 19) modeled as a cube with EMPLOYEE # and AGE as dimensions: since each employee has exactly one age, only one cell per employee is populated and the rest of the EMPLOYEE # x AGE array is blank.
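A quick way to see the effect is to compute the cube's density; a hedged sketch (the employee data follows the example above, the density metric is just illustrative):

    # Density of the EMPLOYEE# x AGE cube from the example: one filled cell per employee.
    employees = {"01": 21, "12": 19, "31": 63, "14": 31, "54": 27,
                 "03": 56, "41": 45, "33": 41, "23": 19}

    emp_dim = sorted(employees)                  # 9 employee members
    age_dim = sorted(set(employees.values()))    # 8 distinct age members
    filled  = len(employees)                     # 9 populated cells

    total = len(emp_dim) * len(age_dim)
    print(f"{filled}/{total} cells populated ({filled / total:.1%} dense)")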

1/13/2012
562
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

562

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods, what-if scenarios
 Slicing / dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down/up along the hierarchy
 Reach-through / drill-through to underlying detail data

1/13/2012
563
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

563

Features of OLAP - Rotation

Complex queries and sorts in the relational environment translate to a simple rotation of the cube.

[Diagram] The Sales Volumes array with MODEL on the rows and COLOR on the columns (View #1) is rotated 90 degrees so that COLOR is on the rows and MODEL on the columns (View #2).

A 2-dimensional array has 2 views.
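Rotation corresponds to swapping which dimensions sit on the rows and columns of the view. A minimal sketch with pandas, which is an assumption on my part (the deck does not prescribe a tool), with made-up sales figures:

    import pandas as pd

    sales = pd.DataFrame({
        "model": ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan", "Sedan"],
        "color": ["Blue", "Red", "Blue", "Red", "Blue", "Red"],
        "qty":   [6, 5, 3, 5, 4, 3],
    })

    # View #1: MODEL on rows, COLOR on columns.
    view1 = sales.pivot_table(index="model", columns="color", values="qty", aggfunc="sum")

    # View #2 (rotate 90 degrees): COLOR on rows, MODEL on columns.
    view2 = sales.pivot_table(index="color", columns="model", values="qty", aggfunc="sum")
    # view2 holds the same cells with rows and columns swapped (the transpose of view1).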


1/13/2012
564
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

564

Features of OLAP - Rotation


[Diagram] The three-dimensional Sales Volumes cube (MODEL x COLOR x DEALERSHIP) can be rotated so that any pair of dimensions faces the viewer: MODEL by COLOR, COLOR by DEALERSHIP, DEALERSHIP by MODEL, and their transposes (View #1 through View #6).

A 3-dimensional array has 6 views.


1/13/2012
565
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

565

Features of OLAP - Slicing / Filtering


 MDDB allows the end user to quickly slice in on the exact view of the data required.

[Diagram] From the full Sales Volumes cube, the user slices out only the Mini Van and Coupe models, the Carr and Clyde dealerships, and the Normal Blue and Metal Blue colors.
1/13/2012
566
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

566

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

[Diagram] REGION: Midwest -> DISTRICT: Chicago, St. Louis, Gary -> DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton. Sales can be viewed at the region, district or dealership level.

Moving up and moving down a hierarchy is referred to as drill-up / roll-up and drill-down.
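Drill-down amounts to re-aggregating the same measure at a lower level of the hierarchy. A hedged pandas sketch (the hierarchy follows the example above; the sales figures are made up):

    import pandas as pd

    sales = pd.DataFrame({
        "region":     ["Midwest"] * 4,
        "district":   ["Chicago", "Chicago", "St. Louis", "Gary"],
        "dealership": ["Clyde", "Gleason", "Carr", "Levi"],
        "amount":     [120, 90, 75, 60],
    })

    # Roll-up: totals at the region level.
    print(sales.groupby("region")["amount"].sum())

    # Drill-down: the same measure one level lower, then at the leaf level.
    print(sales.groupby(["region", "district"])["amount"].sum())
    print(sales.groupby(["region", "district", "dealership"])["amount"].sum())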

1/13/2012
567
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

567

OLAP Reporting - Drill Down

[Chart] Inflows ($M) by region (East, West, Central) for Year 1999 and Year 2000.

1/13/2012
568
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

568

OLAP Reporting - Drill Down

[Chart] Drill-down from Year to Quarter: inflows ($M) by region (East, West, Central) for the four quarters of Year 1999.


1/13/2012
569
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

569

OLAP Reporting - Drill Down

[Chart] Drill-down from Quarter to Month: inflows ($M) by region (East, West, Central) for January, February and March of Year 1999.

570

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
   o Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
   o Accesses data stored in a relational data warehouse for OLAP analysis; database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
   o The OLAP server routes queries first to the MDDB, then to the RDBMS, and the results are processed on-the-fly in the server

DOLAP - Desktop OLAP
   o Personal MDDB server and application on the desktop

1/13/2012
571
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

571

MOLAP - MDDB storage

[Diagram] The OLAP cube and OLAP calculation engine hold the multidimensional data; web browsers, OLAP tools and OLAP applications query the engine directly.
1/13/2012
572
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

572

MOLAP - Features

 Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 Aggregation and calculation capabilities
 Read/write analytic applications
 Specialized data structures for:
   o Maximum query performance
   o Optimum space utilization
1/13/2012
573
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

573

ROLAP - Standard SQL storage

[Diagram] An MDDB-to-relational mapping sits between the OLAP calculation engine and the relational DW; the engine generates SQL against the relational DW, and web browsers, OLAP tools and OLAP applications query the engine.
1/13/2012
574
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

574

ROLAP - Features
Three-tier hardware/software architecture:
   o GUI on the client; multidimensional processing on the mid-tier server; target database on the database server
   o Processing split between mid-tier and database servers

 Ad hoc query capabilities against very large databases
 DW integration
 Data scalability

1/13/2012
575
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

575

HOLAP - Combination of RDBMS and MDDB


[Diagram] The OLAP calculation engine serves any client (web browser, OLAP tools, OLAP applications) from both an OLAP cube and the relational DW, issuing SQL to the relational side as needed.
1/13/2012
576
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

576

HOLAP - Features

 RDBMS used for detailed data stored in large databases
 MDDB used for fast, read/write OLAP analysis and calculations
 Scalability of RDBMS and MDDB performance
 Calculation engine provides full analysis features
 Source of data transparent to the end user

1/13/2012
577
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

577

Architecture Comparison

Definition
   o MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
   o ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
   o HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
   o MOLAP: High (may go beyond control; estimation is very important)
   o ROLAP: No sparsity
   o HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
   o MOLAP: With good design, 3-10 times
   o ROLAP: To the necessary extent
   o HOLAP: To the necessary extent

Query execution speed
   o MOLAP: Fast (depends upon the size of the MDDB)
   o ROLAP: Slow
   o HOLAP: Optimum - if the data is fetched from the RDBMS then it is like ROLAP, otherwise like MOLAP

Cost
   o MOLAP: Medium (MDDB server + large disk space cost)
   o ROLAP: Low (only RDBMS + disk space cost)
   o HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
   o MOLAP: Small transactional data + complex model + frequent summary analysis
   o ROLAP: Very large transactional data that needs to be viewed / sorted
   o HOLAP: Large transactional data + frequent summary analysis

1/13/2012
578
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

578

Representative OLAP Tools:

Oracle Express products
Hyperion Essbase
Cognos - PowerPlay
Seagate - Holos
SAS
MicroStrategy - DSS Agent
Informix MetaCube
Brio Query
Business Objects / Web Intelligence

1/13/2012
579
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

579

Sample OLAP Applications

Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Applications
NPA Management
Strategic Planning
Customer Relationship Management (CRM)
1/13/2012
580
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

580

Data Warehouse Testing

581

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business cost of using incorrect data to make critical business decisions.
 The methodology required for testing a data warehouse is different from testing a typical transaction system.

582

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
 User-triggered vs. system-triggered
 Volume of test data
 Possible scenarios / test cases
 Programming for testing challenge

583

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System.


 User-triggered vs. system-triggered
In a data warehouse, most of the testing is system-triggered. Most of the production/source-system testing is the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). There are very few test cycles that cover the system-triggered scenarios (like billing, valuation, etc.).

584

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data volumes, as one tries to cover the maximum possible combinations of dimensions and facts.
 Possible scenarios / test cases
In the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

585

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data-quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare pre-transformation and post-transformation data.

586

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools like OLAP

Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

587

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements ambiguous?
 Are the requirements developable?
 Are the requirements testable?

588

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
 Whether ETLs are accessing and picking up the right data from the right source
 All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
 Testing the rejected records that don't fulfil transformation rules

589

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing the report data:
 Verify report data with the source: data present in a data warehouse is often stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data
 Field-level data verification: the QA team must understand the linkages for the fields displayed in the report and should trace back and compare them with the source systems
 Derivation formulae / calculation rules should be verified
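A minimal sketch of such a back-end check in Python is shown below. The table names, column names and rejection log are illustrative assumptions; in practice the queries would run against the real source and target databases rather than sqlite3.

    import sqlite3

    def check_aggregate(source_conn, target_conn):
        """Compare a source-side aggregate with the value loaded into the warehouse."""
        src_total = source_conn.execute(
            "SELECT ROUND(SUM(qty * unit_price), 2) FROM orders WHERE status = 'COMPLETE'"
        ).fetchone()[0]
        tgt_total = target_conn.execute(
            "SELECT ROUND(SUM(amount), 2) FROM fact_orders"
        ).fetchone()[0]
        assert src_total == tgt_total, f"Mismatch: source={src_total}, target={tgt_total}"

    def check_row_counts(source_conn, target_conn):
        """Loaded plus rejected rows should account for every qualifying source row."""
        src      = source_conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        loaded   = target_conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
        rejected = target_conn.execute("SELECT COUNT(*) FROM reject_log").fetchone()[0]
        assert src == loaded + rejected, f"Unaccounted rows: {src - loaded - rejected}"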

590

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following:
 Sequence of ETL jobs in the batch
 Initial loading of records into the data warehouse
 Incremental loading of records at a later date to verify the newly inserted or updated data
 Testing the rejected records that don't fulfil transformation rules
 Error log generation

591

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance Testing should check for :  ETL processes completing within time window.  Monitoring and measuring the data quality issues.  Refresh times for standard/complex reports.

592

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

593

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

594

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

595

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

596

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

597

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

598

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
599
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
600
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

601

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south

Dimension Table
city cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
602

region regId name north cold region south warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

603

2009 Wipro Ltd - Confidential

The Need For Data Quality


      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

604

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


Understand Information Flow In Organization
y Identify authoritative data sources y Interview Employees & Customers y Data Entry Points y Cost of bad data

Identify Potential Problem Areas & Asses Impact

Measure Quality Of Data

y Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values


y Use data cleansing tools to clean data at the source y Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

y Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

605

Data Quality Solution


Customized Programs  Strengths: Addresses specific needs No bulky one time investment  Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools  Strength Provide automated assessment  Limitation No measure of data accuracy

606

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery tools  Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud  Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths Usually are integrated packages with cleansing features as Add-on  Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
607
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star  Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft  Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

608

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

609

2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge

Meta Data Repository

Scheduled Extraction

RDBMS

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

610

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction Cleanup


Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

611

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.  To solve the problem, companies use extract, transform and load (ETL) software.  The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

612

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

614

2009 Wipro Ltd - Confidential

615

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

617

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

618

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

620

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

621

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

622

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

623

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

624

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

625

2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

626

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

627

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

628

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

629

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

630

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
631
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
632
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

633

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
634
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
635
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

636

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south

Dimension Table
city cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
637

region regId name north cold region south warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

638

2009 Wipro Ltd - Confidential

The Need For Data Quality


      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

639

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


Understand Information Flow In Organization
y Identify authoritative data sources y Interview Employees & Customers y Data Entry Points y Cost of bad data

Identify Potential Problem Areas & Asses Impact

Measure Quality Of Data

y Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values


y Use data cleansing tools to clean data at the source y Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

y Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

640

Data Quality Solution


Customized Programs  Strengths: Addresses specific needs No bulky one time investment  Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools  Strength Provide automated assessment  Limitation No measure of data accuracy

641

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery tools  Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud  Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths Usually are integrated packages with cleansing features as Add-on  Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
642
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star  Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft  Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

643

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

644

2009 Wipro Ltd - Confidential

ETL Architecture

[Diagram: ETL architecture. Visitors reach web browsers over the Internet; web server logs & e-commerce transaction data (flat files), external data (demographics, household, webographics, income) and other OLTP systems feed scheduled extraction into a staging area where data is cleaned, transformed, matched and merged; scheduled loading then moves it into the enterprise data warehouse (RDBMS), with a meta data repository supporting the stages of data collection, data extraction, data transformation, data loading, and data storage & integration.]

645

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata

646

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
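A minimal sketch of the extraction, transformation/cleanup and loading steps listed above, assuming a small hypothetical order feed and an SQLite target; the field names, code mappings and derivations are illustrative, not the deck's actual rules.

# Minimal sketch of the extract -> transform/cleanup -> load flow described
# above. The source rows, field names, and target table are illustrative
# assumptions, not the deck's actual system.
import sqlite3
from datetime import date

def extract(source_rows):
    """Selection criteria: keep only completed orders."""
    return [r for r in source_rows if r["status"] == "COMPLETE"]

def transform(rows):
    """Change codes, supply missing values, add a time attribute, derive amount."""
    out = []
    for r in rows:
        qty = int(r["qty"] or 0)                                   # supply missing value
        out.append({
            "order_id":  r["id"],
            "region":    {"N": "north", "S": "south"}.get(r["region_code"], "unknown"),
            "qty":       qty,
            "amount":    qty * float(r["unit_price"]),             # derived value
            "load_date": date.today().isoformat(),                 # time attribute
        })
    return out

def load(rows, conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                    (order_id TEXT, region TEXT, qty INTEGER,
                     amount REAL, load_date TEXT)""")
    conn.executemany(
        "INSERT INTO sales_fact VALUES (:order_id, :region, :qty, :amount, :load_date)",
        rows)
    conn.commit()

source = [
    {"id": "o100", "status": "COMPLETE", "region_code": "N", "qty": "2",  "unit_price": "10.0"},
    {"id": "o101", "status": "OPEN",     "region_code": "S", "qty": "1",  "unit_price": "5.0"},
    {"id": "o102", "status": "COMPLETE", "region_code": "S", "qty": None, "unit_price": "7.5"},
]

conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT * FROM sales_fact").fetchall())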

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
 The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
 To solve the problem, companies use extract, transform and load (ETL) software.
 The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

647

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

648

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


 Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs
 Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes
 Extract: The process of reading data from a database
 Transform: The process of converting the extracted data
 Load: The process of writing the data into the target database
 Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
 Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


649
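To illustrate what a design manager's source-to-target mapping might hold, here is a small, purely hypothetical Python structure; real ETL tools keep this information as repository metadata rather than code, so treat the names and rules below as placeholders.

# Illustrative sketch of a source-to-target mapping of the kind a design
# manager component lets developers define. Field names and rules are
# hypothetical; real ETL tools store this as repository metadata.
mapping = {
    "source": "crm.customer",            # hypothetical source table
    "target": "dw.dim_customer",         # hypothetical warehouse dimension
    "columns": [
        {"source": "cust_no", "target": "customer_key", "rule": "surrogate key lookup"},
        {"source": "fname",   "target": "first_name",   "rule": "trim + initcap"},
        {"source": "dob",     "target": "birth_date",   "rule": "parse DD/MM/YYYY"},
        {"source": None,      "target": "load_date",    "rule": "system date"},
    ],
}

def describe(m):
    print(f"{m['source']} -> {m['target']}")
    for c in m["columns"]:
        print(f"  {c['source'] or '(constant)':<10} -> {c['target']:<14} [{c['rule']}]")

describe(mapping)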

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI
 Generate programs to transform data
 Handle multiple data sources
 Handle data redundancy
 Generate metadata as output
 Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
 PowerCentre/Mart from Informatica
 Data Mart Solution from Sagent Technology
 DataStage from Ascential
650

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

651

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

 That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 About the data being captured and loaded into the warehouse
 Documented in IT tools that improve both business and technical understanding of data and data-related processes

652

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
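As a concrete (and entirely hypothetical) illustration of the WHAT / WHEN / WHO / WHERE / HOW idea above, the sketch below shows one metadata entry for a warehouse table expressed as a Python dictionary.

# Illustrative sketch only: one metadata entry for a warehouse table, covering
# the WHAT / WHEN / WHO / WHERE / HOW questions above. All values are
# hypothetical examples, not real repository content.
import json

metadata_entry = {
    "what":  {"object": "sales_fact", "grain": "one row per order line",
              "business_definition": "Completed sales transactions"},
    "when":  {"last_loaded": "2009-06-30T02:15:00", "refresh_frequency": "daily"},
    "who":   {"owner": "Sales Operations", "steward": "DW admin team"},
    "where": {"source_systems": ["order_entry", "web_store"],
              "target": "enterprise data warehouse"},
    "how":   {"etl_job": "load_sales_fact",
              "transformation_rules": ["currency converted to USD",
                                       "rejected rows logged to error table"]},
}

print(json.dumps(metadata_entry, indent=2))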

Importance Of Metadata
Locating information
- Time spent looking for information
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products? What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do various data perspectives connect together? How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?

653

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views
 Document/manage metadata descriptions from an integrated development environment
 Enable DW users to identify and invoke pre-built queries against the data stores
 Design and enhance new data models and schemas for the data warehouse
 Capture data transformation rules between the operational and data warehousing databases
 Provide change impact analysis, and update across these technologies
654
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users: Warehouse administrator, application developer
 Business Users (business metadata): Meanings, definitions, business rules
 Software Tools: Used in DW life-cycle development; metadata requirements for each tool must be identified; the tool-specific metadata should be analysed for inclusion in the enterprise metadata repository; previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

655

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

656

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools
 Oracle Exchange: Technology of choice for a long list of repository, enterprise and workgroup vendors
 Reischmann-Informatik - Toolbus: Features include facilitation of selective bridging of metadata
 Ardent Software / Dovetail Software - Interplay: Hub-and-spoke solution for enabling metadata interoperability; Ardent focussing on own engagements, not selling it as an independent product
 Informix's Metadata Plug-ins: Available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy
657
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories
 IBM, Oracle and Microsoft to offer free or near-free basic repository services
 Enable organisations to reuse metadata across technologies
 Integrate DB design, data transformation and BI tools from different vendors
 Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
 Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

658

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards
 CDIF (CASE Data Interchange Format): Most frequently used interchange standard; addresses only a limited subset of metadata artifacts
 OMG (Object Management Group) - CWM: XML addresses context and data meaning, not presentation; can enable exchange over the web employing industry standards for storing and sharing programming data; will allow sharing of UML and MOF objects between various development tools and repositories
 MDC (Metadata Coalition): Based on XML/UML standards; promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle, Carleton Group, CA-PLATINUM Technology (founding member), Viasoft
659
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


OLAP

663

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques / Architectures
Features
Representative Tools

1/13/2012

664

664

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
 Used interchangeably with BI
 Multidimensional view of data is the foundation of OLAP
 Users: Analysts, decision makers

1/13/2012

665

665

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


Source of data
  OLTP: Operational data; OLTPs are the original source of the data
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP: To control and run fundamental business tasks
  OLAP: Decision support

What the data reveals
  OLTP: A snapshot of ongoing business processes
  OLAP: Multi-dimensional views of various kinds of business activities

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users
  OLAP: Periodic long-running batch jobs refresh the data
1/13/2012

666

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
 intimately related, and
 stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
 The edges of the cube are called dimensions
 Individual items within each dimension are called members

667

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


[Diagram: the same Sales Volumes data shown two ways. The relational DBMS table needs one row per MODEL / COLOR / DEALER combination plus a VOL. column (e.g., MINI VAN, BLUE, Clyde, 6), i.e. 27 rows x 4 columns = 108 cells. The MDDB holds the same volumes in a cube with MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr) as its edges, i.e. 3 x 3 x 3 = 27 cells.]
668

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
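A minimal sketch of the contrast above, assuming a few illustrative sales figures: the relational view scans rows and filters, while the multidimensional view addresses a cell directly by its dimension members.

# Minimal sketch of the contrast above: the same sales volumes held as
# relational rows (one row per model/color/dealer) and as a multidimensional
# array keyed by (model, color, dealer). Figures are illustrative.
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6), ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 2),   ("SEDAN", "WHITE", "Carr", 4),
]

# Relational view: scan rows and filter.
mini_van_total = sum(vol for model, color, dealer, vol in rows if model == "MINI VAN")

# MDDB view: direct cell addressing by dimension members.
cube = {(model, color, dealer): vol for model, color, dealer, vol in rows}
print("cell (MINI VAN, BLUE, Clyde):", cube[("MINI VAN", "BLUE", "Clyde")])
print("MINI VAN total:", mini_van_total)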

Benefits of MDDB over RDBMS


 Ease of Data Presentation & Navigation: A great deal of information is gleaned immediately upon direct inspection of the array; the user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by a relational table.
 Storage Space: Very low space consumption compared to a relational DB.
 Performance: Gives much better performance; a relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.
 Ease of Maintenance: No overhead, as data is stored in the same way it is viewed; in a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.
1/13/2012
669
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

669

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)

1/13/2012
670
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

670

Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, then a blank cell is left behind.

[Diagram: an employee table (LAST NAME, EMP#, AGE) with rows such as SMITH/01/21, REGAN/12/19, FOX/31/63, WELD/14/31, KELLY/54/27, LINK/03/56, KRANZ/41/45, LUCUS/33/41, WEISS/23/19, shown next to a cube with LAST NAME and EMPLOYEE # as dimensions and AGE as the measure; because each employee number pairs with only one last name, almost every cell in the LAST NAME x EMPLOYEE # grid is blank, illustrating sparsity.]

1/13/2012
671
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

671
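A minimal sketch of the sparsity point above, using the employee example: a dense LAST NAME x EMPLOYEE # array allocates a cell for every pair, while a sparse (dictionary) representation stores only the populated cells. The names and ages follow the slide; the storage comparison itself is illustrative.

# Minimal sketch of the sparsity issue above: each employee number pairs with
# exactly one last name, so a dense LAST NAME x EMPLOYEE# array is mostly
# empty, while a sparse (dict) representation stores only the filled cells.
employees = [("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
             ("WELD", "14", 31),  ("KELLY", "54", 27), ("LINK", "03", 56)]

names = sorted({e[0] for e in employees})
numbers = sorted({e[1] for e in employees})

dense_cells = len(names) * len(numbers)                       # every (name, number) pair
sparse_cells = {(n, num): age for n, num, age in employees}   # filled cells only

print(f"dense cube cells : {dense_cells}")
print(f"populated cells  : {len(sparse_cells)}")
print(f"sparsity         : {100 * (1 - len(sparse_cells) / dense_cells):.0f}% empty")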

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods; what-if scenarios
 Slicing / dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down / up along the hierarchy
 Reach-through / drill-through to underlying detail data

1/13/2012
672
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

672

Features of OLAP - Rotation

Complex Queries & Sorts in Relational environment translated to simple rotation.


[Diagram: a 2-dimensional Sales Volumes array with MODEL (Mini Van, Coupe, Sedan) and COLOR (Blue, Red, White) as its axes; rotating it 90 degrees swaps the axes, giving View #1 (MODEL by COLOR) and View #2 (COLOR by MODEL).]

A 2-dimensional array has 2 views.


1/13/2012
673
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

673

Features of OLAP - Rotation


[Diagram: the 3-dimensional Sales Volumes array with MODEL, COLOR and DEALERSHIP (Carr, Gleason, Clyde) as its axes, rotated 90 degrees repeatedly to produce six orientations (View #1 through View #6).]

A 3-dimensional array has 6 views.


1/13/2012
674
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

674
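A minimal sketch of rotation, assuming illustrative volumes for the MODEL x COLOR slice of the sales cube: the same cells are printed in both orientations, which is all a 90-degree rotation does.

# Minimal sketch of rotation (pivoting): the same MODEL x COLOR slice of the
# sales cube printed in both orientations. Volumes are illustrative.
cube = {("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
        ("Coupe", "Blue"): 3,    ("Coupe", "Red"): 5,    ("Coupe", "White"): 5,
        ("Sedan", "Blue"): 4,    ("Sedan", "Red"): 3,    ("Sedan", "White"): 2}

models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]

def show(rows, cols, value):
    # Print a simple grid: row labels down the side, column labels across the top.
    print("".rjust(10) + "".join(c.rjust(10) for c in cols))
    for r in rows:
        print(r.rjust(10) + "".join(str(value(r, c)).rjust(10) for c in cols))

print("View #1: MODEL x COLOR")
show(models, colors, lambda m, c: cube[(m, c)])

print("\nView #2: COLOR x MODEL (rotated 90 degrees)")
show(colors, models, lambda c, m: cube[(m, c)])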

Features of OLAP - Slicing / Filtering


 MDDB allows the end user to quickly slice in on the exact view of the data required.

[Diagram: slicing the Sales Volumes cube down to the subset the user wants to see, e.g. MODEL limited to Mini Van and Coupe, COLOR limited to Normal Blue and Metal Blue, and DEALERSHIP limited to Carr and Clyde.]
1/13/2012
675
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

675

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION hierarchy:
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region / District / Dealership level.

Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down

1/13/2012
676
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

676
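A minimal sketch of slicing and drill-up/drill-down on the organization hierarchy above; the sales figures and the district-to-dealership groupings are illustrative assumptions.

# Minimal sketch of slicing and drill-up/drill-down on a small cube keyed by
# (region, district, dealership). Sales figures and groupings are illustrative.
from collections import defaultdict

sales = {("Midwest", "Chicago", "Clyde"): 6,   ("Midwest", "Chicago", "Gleason"): 3,
         ("Midwest", "Chicago", "Carr"): 2,    ("Midwest", "St. Louis", "Levi"): 5,
         ("Midwest", "St. Louis", "Lucas"): 4, ("Midwest", "Gary", "Bolton"): 7}

# Slice: restrict one dimension to a chosen member (district = Chicago).
chicago = {k: v for k, v in sales.items() if k[1] == "Chicago"}
print("Slice (Chicago dealerships):", chicago)

# Drill-up / roll-up: aggregate dealership-level cells to district level.
by_district = defaultdict(int)
for (region, district, dealership), vol in sales.items():
    by_district[district] += vol
print("Roll-up to district:", dict(by_district))

# Drill-down is the reverse: expand a district back into its dealership cells.
print("Drill-down St. Louis:", {k[2]: v for k, v in sales.items() if k[1] == "St. Louis"})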

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: inflows in $M by region (East, West, Central) for Year 1999 and Year 2000.]

1/13/2012
677
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

677

OLAP Reporting - Drill Down

Inflows (Region, Year 1999)

[Chart: inflows in $M by region (East, West, Central) for the four quarters of 1999.]

Drill-down from Year to Quarter


1/13/2012
678
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

678

OLAP Reporting - Drill Down

Inflows (Region, Year 1999, 1st Qtr)

[Chart: inflows in $M by region (East, West, Central) for January, February and March 1999.]

Drill-down from Quarter to Month

679

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
  Multidimensional databases for the database and application logic layer

ROLAP - Relational OLAP
  Accesses data stored in a relational data warehouse for OLAP analysis; database and application logic provided as separate layers

HOLAP - Hybrid OLAP
  The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server

DOLAP - Desktop OLAP
  Personal MDDB server and application on the desktop

1/13/2012
680
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

680

MOLAP - MDDB storage

[Diagram: MOLAP storage. An OLAP cube (MDDB) sits behind an OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]


1/13/2012
681
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

681

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for maximum query performance and optimum space utilization
1/13/2012
682
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

682

ROLAP - Standard SQL storage

[Diagram: ROLAP storage. Web browsers, OLAP tools and OLAP applications go through an OLAP calculation engine that holds an MDDB-to-relational mapping and issues SQL against the relational data warehouse.]
1/13/2012
683
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

683

ROLAP - Features
Three-tier hardware/software architecture:
- GUI on the client; multidimensional processing on the mid-tier server; target database on the database server
- Processing split between the mid-tier and database servers

Ad hoc query capabilities against very large databases
DW integration
Data scalability

1/13/2012
684
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

684

HOLAP - Combination of RDBMS and MDDB


OLAP Cube
Any Client

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
1/13/2012
685
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

685

HOLAP - Features

RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data transparent to the end user

1/13/2012
686
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

686

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction level data + summary in MDDB
  ROLAP: Relational OLAP = transaction level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: With good design, 3-10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
  MOLAP: Medium (MDDB server + large disk space cost)
  ROLAP: Low (only RDBMS + disk space cost)
  HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

1/13/2012
687
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

687

Representative OLAP Tools:

Oracle Express products
Hyperion Essbase
Cognos PowerPlay
Seagate Holos
SAS
MicroStrategy DSS Agent
Informix MetaCube
Brio Query
Business Objects / WebIntelligence

1/13/2012
688
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

688

Sample OLAP Applications

Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Applications
NPA Management
Strategic Planning
Customer Relationship Management (CRM)
1/13/2012
689
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

689

Data Warehouse Testing

690

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
 The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

691

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

692

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System.


 User-Triggered vs. System-Triggered: In a data warehouse, most of the testing is system-triggered. Most of the production/source system testing is the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). There are very few test cycles which cover the system-triggered scenarios (like billing and valuation).

693

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of Test Data: The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.
 Possible Scenarios / Test Cases: In the case of a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.

694

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Programming for Testing Challenge: In the case of transaction systems, users/business analysts typically test the output of the system. In the case of a data warehouse, most of the data warehouse data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.

695

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
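As an illustration of such a stand-alone back-end script, the sketch below compares a source-side (pre-transformation) aggregate with the warehouse-side (post-transformation) aggregate; the rows and the region-code rule are hypothetical.

# Minimal sketch of the kind of stand-alone back-end test script described
# above: compare a pre-transformation (source) aggregate with the
# post-transformation (warehouse) aggregate. Data and rule are hypothetical.
source_rows = [("o100", "N", 12.0), ("o102", "S", 11.0), ("o105", "S", 50.0)]
warehouse_rows = [("o100", "north", 12.0), ("o102", "south", 11.0), ("o105", "south", 50.0)]

def totals_by_region(rows, code_map=None):
    # Sum amounts per region, optionally translating source region codes
    # the same way the ETL transformation is supposed to.
    out = {}
    for _, region, amount in rows:
        region = code_map.get(region, region) if code_map else region
        out[region] = out.get(region, 0.0) + amount
    return out

expected = totals_by_region(source_rows, code_map={"N": "north", "S": "south"})
actual = totals_by_region(warehouse_rows)

assert expected == actual, f"Mismatch: expected {expected}, got {actual}"
print("Pre- vs post-transformation totals match:", actual)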

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source system data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

696

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements ambiguous?
 Are the requirements developable?
 Are the requirements testable?

697

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source
- All the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules

698

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing the report data:
- Verify report data with source: Data present in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace back and compare them with the source systems.
- Derivation formulae / calculation rules should be verified.

699

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following:
 Sequence of ETL jobs in batch
 Initial loading of records into the data warehouse
 Incremental loading of records at a later date to verify the newly inserted or updated data
 Testing the rejected records that don't fulfil transformation rules
 Error log generation

700

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance testing should check for:
 ETL processes completing within the time window
 Monitoring and measuring of data quality issues
 Refresh times for standard/complex reports

701

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

702

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

703

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

704

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

706

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

707

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

708

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

709

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

710

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

711

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

713

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

714

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

715

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

716

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

717

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

718

2009 Wipro Ltd - Confidential

Data Warehouse Concepts

Avinash Kanumuru Diya Jana Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing

5 Data Extraction,Transformation, Load

720

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing

721

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

An Overview
Understanding What is a Data Warehouse

722

2009 Wipro Ltd - Confidential

What is Data Warehouse?


Definitions of Data Warehouse  A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. WH Inmon  Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let user get at the data for decision support Babcock  A data warehouse is a relational database a copy of transaction data specifically structured for query and analysis Ralph Kimball  In simple: Data warehousing is collection of data from different systems, which helps in Business Decisions, Analysis and Reporting.

723

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented  Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated  Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Nonvolatile  Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Time Variant  In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

724

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


What makes a Data Warehouse

725

2009 Wipro Ltd - Confidential

Components of Warehouse
 Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.  ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target.  Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods.  Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files.  Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes.  End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.

726

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Architecture


This is a basic design, where there are source files, which are loaded to a warehouse and users query the data for different purposes.

This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.

727

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

728

2009 Wipro Ltd - Confidential

Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.

 Different Perceptive of Data Modeling.


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
729
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: A category of information. For example, the time dimension.  Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.  Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.  Fact Table: A table that contains the measures of interest.  Lookup Table: It provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: To avoid the data integrity, surrogate keys are used. They are helpful for Slow Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
730
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5

Dimension Table
store storeId c1 c2 c3 city nyc sfo la

Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50

Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la

731

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south

Dimension Table
city cityId pop sfo 1M la 5M

The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
732

region regId name north cold region south warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

733

2009 Wipro Ltd - Confidential

The Need For Data Quality


      Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with error detection error rework customer service fixing customer problems

734

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


Understand Information Flow In Organization
y Identify authoritative data sources y Interview Employees & Customers y Data Entry Points y Cost of bad data

Identify Potential Problem Areas & Asses Impact

Measure Quality Of Data

y Use business rule discovery tools to identify data with

inconsistent, missing, incomplete, duplicate or incorrect values


y Use data cleansing tools to clean data at the source y Load only clean data into the data warehouse

Clean & Load Data

Continuous Monitoring

y Schedule Periodic Cleansing of Source Data

Identify Areas of Improvement

y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

735

Data Quality Solution


Customized Programs  Strengths: Addresses specific needs No bulky one time investment  Limitations Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts Data Quality Assessment tools  Strength Provide automated assessment  Limitation No measure of data accuracy

736

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery tools  Strengths Detect Correlation in data values Can detect Patterns of behavior that indicate fraud  Limitations Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields. Data Reengineering & Cleansing tools  Strengths Usually are integrated packages with cleansing features as Add-on  Limitations Error prevention at source is usually absent The ETL tools have limited cleansing facilities
737
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools Integrity Data Reengineering Tool from Vality Technology Trillium Software System from Harte -Hanks Data Technologies Migration Architect from DB Star  Data Reengineering & Cleansing Tools Carlton Pureview from Oracle ETI-Extract from Evolutionary Technologies PowerMart from Informatica Corp Sagent Data Mart from Sagent Technology  Data Quality Assessment Tools Migration Architect, Evoke Axio from Evoke Software Wizrule from Wizsoft  Name & Address Cleansing Tools Centrus Suite from Sagent I.d.centric from First Logic

738

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

739

2009 Wipro Ltd - Confidential

ETL Architecture

Visitors

Web Browsers

The Internet

External Data Demographics, Household, Webographics, Income

Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge

Meta Data Repository

Scheduled Extraction

RDBMS

Scheduled Loading

Enterprise Data Warehouse

Other OLTP Systems

Data Collection

Data Extraction

Data Transformation

Data Loading

Data Storage & Integration

740

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database

Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data

Data Extraction Cleanup


Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, etc...

Data loading
Initial and incremental loading Updation of metadata

741

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.  To solve the problem, companies use extract, transform and load (ETL) software.  The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.

742

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

743

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


  Design manager Lets developers define source-to-target mappings, transformations, process flows, and jobs Meta data management Provides a repository to define, document, and manage information about the ETL design and runtime processes Extract The process of reading data from a database. Transform The process of converting the extracted data Load The process of writing the data into the target database. Transport services ETL tools use network and file protocols to move data between source and target systems and in-memory protocols to move data between ETL run-time components. Administration and operation ETL utilities let administrators schedule, run, monitor ETL jobs, log all events, manage errors, recover from failures, reconcile outputs with source systems
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

   

744

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
745

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

746

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

  

That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes

747

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products?  What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?

748

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
749
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

750

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools  Oracle Exchange
Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay


Hub and Spoke solution for enabling metadata interoperability Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins


Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
751
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products e.g., One for AD and one for DW, with bridges between them

752

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM


XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)


Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member),Viasoft
753
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP

754

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools

1/13/2012

755

755

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.) Used interchangeably with BI Multidimensional view of data is the foundation of OLAP Users :Analysts, Decision makers

1/13/2012

756

756

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


OLTP System Source of data OLAP System

Operational data; OLTPs are Consolidation data; OLAP the original source of the data comes from the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the data
757

Purpose of data What the data reveals

Inserts and Updates Short and fast inserts and updates initiated by end users
1/13/2012

757

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is  intimately related and  stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data.  The edges of the cube are called dimensions  Individual items within each dimensions are called members

758

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS
MODEL MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN MINI VAN SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SPORTS COUPE SEDAN SEDAN SEDAN ... COLOR BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE DEALER Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr Clyde Gleason Carr VOL. 6 3 2 5 3 1 3 1 4 3 3 3 4 3 6 2 3 5 4 3 2 ...

MDDB

Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

DEALERSHIP

COLOR

27 x 4 = 108 cells
759

3 x 3 x 3 = 27 cells

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS


 Ease of Data Presentation & Navigation A great deal of information is gleaned immediately upon direct inspection of the array User is able to view data along presorted dimensions with data arranged in an inherently more organized, and accessible fashion than the one offered by the relational table.  Storage Space Very low Space Consumption compared to Relational DB  Performance Gives much better performance. Relational DB may give comparable results only through database tuning (indexing, keys etc), which may not be possible for ad-hoc queries.  Ease of Maintenance No overhead as data is stored in the same way it is viewed. In Relational DB, indexes, sophisticated joins etc. are used which require considerable storage and maintenance
1/13/2012
760
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

760

Issues with MDDB

Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions

Data Explosion
-Due to Sparsity -Due to Summarization

Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)

1/13/2012
761
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

761

Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact , then blank cell is left behind.
Employee Age
21 19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33

LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 D Coupe 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19

Smith

Regan

Fox

L A S T N A M E

Weld

Kelly

Link

Kranz

Lucas

Weiss

EMPLOYEE #

1/13/2012
762
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

762

OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data

1/13/2012
763
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

763

Features of OLAP - Rotation

Complex Queries & Sorts in Relational environment translated to simple rotation.


Sales Volumes

M O D E L

Mini Van

6 3 4
Blue

5 5 3
Red

4 5 2
( ROTATE 90 )
White
o

Coupe

C O L O R

Blue

6 5 4

3 5 5
MODEL

4 3 2
Sedan

Red

Sedan

White

Mini Van Coupe

COLOR

View #1

View #2

2 dimensional array has 2 views.


1/13/2012
764
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

764

Features of OLAP - Rotation


Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

C O L O R

Blue

Red White Sedan Coupe Mini Van Carr Gleason Clyde

C O L O R

Blue

Red White Carr Gleason Clyde Mini Van Coupe Sedan

COLOR

( ROTATE 90 )

MODEL

DEALERSHIP
( ROTATE 90 )
o

( ROTATE 90 )

DEALERSHIP

DEALERSHIP

MODEL

View #1
D E A L E R S H I P D E A L E R S H I P

View #2

View #3

Carr

Carr Gleason Blue Red White Mini Van Coupe Sedan

Mini Van

Gleason Mini Van Coupe Sedan White Red Blue

Clyde

Clyde

M O D E L

Coupe Blue Red White Clyde Gleason Carr

Sedan

COLOR

( ROTATE 90 )

MODEL

DEALERSHIP
( ROTATE 90 )
o

MODEL

COLOR

COLOR

View #4

View #5

View #6

3 dimensional array has 6 views.


1/13/2012
765
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

765

Features of OLAP - Slicing / Filtering


 MDDB allows end user to quickly slice in on exact view of the data required.

Sales Volumes

M O D E L

Mini Van Mini Van

Coupe Coupe Normal Metal Blue Blue

Carr Clyde

Carr Clyde

Normal Blue

Metal Blue

DEALERSHIP

COLOR
1/13/2012
766
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

766

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION Midwest

DISTRICT

Chicago

St. Louis

Gary

DEALERSHIP

Clyde

Gleason

Carr

Levi

Lucas

Bolton

Sales at region/District/Dealership Level

Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down

1/13/2012
767
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

767

OLAP Reporting - Drill Down

Inflows ( Region , Year)


200 150 Inflows 100 ($M) 50 0 Year Year 1999 2000 Years

East West Central

1/13/2012
768
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

768

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999)


90 80 70 60 50 Inflows ( $M) 40 30 20 10 0 1st Qtr 2nd Qtr 3rd Qtr Year 1999 4th Qtr

East West Central

Drill-down from Year to Quarter


1/13/2012
769
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

769

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999 - 1st Qtr)


20 15 Inflows ( $M 10 ) 5 0 January February March Year 1999 East West Central

Drill-down from Quarter to Month

770

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP


Multidimensional Databases for database and application logic layer

ROLAP - Relational OLAP


Access Data stored in relational Data Warehouse for OLAP Analysis. Database and Application logic provided as separate layers

HOLAP - Hybrid OLAP


The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP


Personal MDDB Server and application on the desktop

1/13/2012
771
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

771

MOLAP - MDDB storage

[Diagram: the OLAP cube is held in a multidimensional database; an OLAP calculation engine sits on top of it and serves web browsers, OLAP tools and OLAP applications.]


1/13/2012
772
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

772

MOLAP - Features

 Powerful analytical capabilities (e.g., financial, forecasting, statistical)
 Aggregation and calculation capabilities
 Read/write analytic applications
 Specialized data structures for maximum query performance and optimum space utilization
1/13/2012
773
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

773

ROLAP - Standard SQL storage

[Diagram: an OLAP calculation engine performs the MDDB-to-relational mapping, issuing SQL against the relational data warehouse on behalf of web browsers, OLAP tools and OLAP applications.]
1/13/2012
774
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

774

ROLAP - Features
Three-tier hardware/software architecture:
 GUI on the client; multidimensional processing on a mid-tier server; target database on the database server
 Processing split between the mid-tier and database servers
 Ad hoc query capabilities against very large databases
 DW integration
 Data scalability
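As a hedged sketch of what the mid-tier does (an added illustration; the table and column names are hypothetical): the multidimensional request is translated into an aggregate SQL statement that the relational warehouse executes.

```python
def rollup_sql(dimensions, measure, fact_table="sales_fact"):
    """Build a simple GROUP BY statement for the requested dimensions."""
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, SUM({measure}) AS total_{measure}\n"
        f"FROM {fact_table}\n"
        f"GROUP BY {dims}"
    )

# "Sales volume by model and color" becomes:
print(rollup_sql(["model", "color"], "volume"))
```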

1/13/2012
775
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

775

HOLAP - Combination of RDBMS and MDDB


[Diagram: the OLAP calculation engine combines an MDDB OLAP cube (summary data) with SQL access to the relational data warehouse (detail data), serving any client, including web browsers, OLAP tools and OLAP applications.]
1/13/2012
776
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

776

HOLAP - Features

 RDBMS used for the detailed data stored in large databases
 MDDB used for fast, read/write OLAP analysis and calculations
 Scalability of the RDBMS together with MDDB performance
 Calculation engine provides full analysis features
 Source of the data is transparent to the end user
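The routing idea behind HOLAP can be sketched in a few lines of plain Python (purely illustrative pseudo-logic, not any vendor's API): answer from the MDDB summary when the requested level is pre-aggregated there, otherwise fall back to SQL against the relational warehouse.

```python
# Hypothetical pre-aggregated summaries held in the MDDB part of a HOLAP server.
MDDB_SUMMARIES = {
    ("region", "year"): {("East", 1999): 150, ("West", 1999): 120},
}

def answer_query(level, key, run_sql):
    """Route to the MDDB summary first, then to the RDBMS detail data."""
    summary = MDDB_SUMMARIES.get(level)
    if summary is not None and key in summary:
        return summary[key]          # fast, MOLAP-style answer
    return run_sql(level, key)       # ROLAP-style fallback to detail rows

def fake_sql(level, key):
    # Stand-in for a real SQL call against the relational data warehouse.
    print(f"running SQL for {level} = {key}")
    return 42

print(answer_query(("region", "year"), ("East", 1999), fake_sql))          # from MDDB
print(answer_query(("region", "month"), ("East", "Jan-1999"), fake_sql))   # from RDBMS
```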

1/13/2012
777
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

777

Architecture Comparison

MOLAP
 Definition: MDDB OLAP = transaction-level data + summary in the MDDB
 Data explosion due to sparsity: High (may go beyond control; estimation is very important)
 Data explosion due to summarization: With good design, 3 to 10 times
 Query execution speed: Fast (depends upon the size of the MDDB)
 Cost: Medium (MDDB server + large disk space)
 Where to apply: Small transactional data + complex model + frequent summary analysis

ROLAP
 Definition: Relational OLAP = transaction-level data + summary in the RDBMS
 Data explosion due to sparsity: No sparsity
 Data explosion due to summarization: To the necessary extent
 Query execution speed: Slow
 Cost: Low (only RDBMS + disk space)
 Where to apply: Very large transactional data that needs to be viewed / sorted

HOLAP
 Definition: Hybrid OLAP = ROLAP + summary in an MDDB
 Data explosion due to sparsity: Exists only in the MDDB part
 Data explosion due to summarization: To the necessary extent
 Query execution speed: Optimum (like ROLAP when the data is fetched from the RDBMS, otherwise like MOLAP)
 Cost: High (RDBMS + disk space + MDDB server)
 Where to apply: Large transactional data + frequent summary analysis

1/13/2012
778
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

778

Representative OLAP Tools:

 Oracle Express products
 Hyperion Essbase
 Cognos PowerPlay
 Seagate Holos
 SAS
 MicroStrategy DSS Agent
 Informix MetaCube
 Brio Query
 Business Objects / WebIntelligence

1/13/2012
779
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

779

Sample OLAP Applications

 Sales Analysis
 Financial Analysis
 Profitability Analysis
 Performance Analysis
 Risk Management
 Profiling & Segmentation
 Scorecard Applications
 NPA Management
 Strategic Planning
 Customer Relationship Management (CRM)
1/13/2012
780
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

780

Data Warehouse Testing

781

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
 The methodology required for testing a data warehouse is different from testing a typical transaction system.

782

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
 User-triggered vs. system-triggered
 Volume of test data
 Possible scenarios / test cases
 Programming for testing challenge

783

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 User-triggered vs. system-triggered: in a data warehouse, most of the testing is system-triggered. Most production/source-system testing covers the processing of individual transactions driven by some input from the users (an application form, a servicing request, and so on); there are very few test cycles that cover system-triggered scenarios (such as billing or valuation).

784

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of test data: the test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.
 Possible scenarios / test cases: in a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of the data warehouse is to allow all possible views of the data.

785

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge: in transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data-quality testing and ETL testing is done at the back end by running separate stand-alone scripts that compare the data before transformation with the data after transformation.
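A minimal sketch of such a stand-alone reconciliation script (an added illustration; the frames, column names and checks are hypothetical, and in practice the data would come from files or database queries): it verifies that totals and distinct counts survive the transformation.

```python
import pandas as pd

# Inline stand-ins for the pre- and post-transformation extracts keep the
# sketch self-contained; a real script would read them from staging and
# warehouse tables.
pre = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [100.0, 50.0, 25.0, 80.0]})
post = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [100.0, 75.0, 80.0]})  # aggregated

checks = {
    "total amount":       (pre["amount"].sum(), post["amount"].sum()),
    "distinct customers": (pre["cust_id"].nunique(), post["cust_id"].nunique()),
}

failures = {name: v for name, v in checks.items() if v[0] != v[1]}
for name, (before, after) in failures.items():
    print(f"MISMATCH {name}: pre={before} post={after}")
print("pre/post checks passed" if not failures else "pre/post checks FAILED")
```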

786

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by the end-user tools, such as OLAP
Testing phases consist of:
 Requirements testing
 Unit testing
 Integration testing
 Performance testing
 Acceptance testing

787

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
 Are the requirements complete?
 Are the requirements singular?
 Are the requirements unambiguous?
 Are the requirements developable?
 Are the requirements testable?

788

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures covers:
 Whether the ETLs are accessing and picking up the right data from the right source
 Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
 Testing the rejected records that don't fulfil the transformation rules
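One way to unit-test the rejection logic is sketched below (an added illustration with a made-up business rule): split the batch on the rule, then assert that every rejected record really violates it and no loaded record does.

```python
import pandas as pd

# Made-up rule for illustration: quantity must be positive and the product
# code must be one the warehouse knows about.
VALID_PRODUCTS = {"p1", "p2"}

def split_records(batch):
    """Return (records to load, records to reject) under the rule above."""
    ok = batch["qty"].gt(0) & batch["prod_id"].isin(VALID_PRODUCTS)
    return batch[ok], batch[~ok]

batch = pd.DataFrame({"prod_id": ["p1", "p9", "p2", "p1"],
                      "qty":     [5, 3, -1, 2]})
loaded, rejected = split_records(batch)

# The unit test: every rejected row violates the rule, no loaded row does.
violates = rejected["qty"].le(0) | ~rejected["prod_id"].isin(VALID_PRODUCTS)
assert violates.all()
assert (loaded["qty"].gt(0) & loaded["prod_id"].isin(VALID_PRODUCTS)).all()
print(f"{len(loaded)} records loaded, {len(rejected)} rejected")
```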

789

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing the report data:
 Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the source data available
 Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems
 Derivation formulae / calculation rules should be verified

790

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following:
 Sequence of ETL jobs in the batch
 Initial loading of records into the data warehouse
 Incremental loading of records at a later date, to verify the newly inserted or updated data
 Testing the rejected records that don't fulfil the transformation rules
 Error log generation

791

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance testing should check for:
 ETL processes completing within the time window
 Monitoring and measuring of data quality issues
 Refresh times for standard/complex reports

792

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

793

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

794

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

795

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact , then blank cell is left behind.
Employee Age
21 19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33

LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 D Coupe 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19

Smith

Regan

Fox

L A S T N A M E

Weld

Kelly

Link

Kranz

Lucas

Weiss

EMPLOYEE #

1/13/2012
868
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

868

OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data

1/13/2012
869
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

869

Features of OLAP - Rotation

Complex Queries & Sorts in Relational environment translated to simple rotation.


Sales Volumes

M O D E L

Mini Van

6 3 4
Blue

5 5 3
Red

4 5 2
( ROTATE 90 )
White
o

Coupe

C O L O R

Blue

6 5 4

3 5 5
MODEL

4 3 2
Sedan

Red

Sedan

White

Mini Van Coupe

COLOR

View #1

View #2

2 dimensional array has 2 views.


1/13/2012
870
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

870

Features of OLAP - Rotation


Sales Volumes

M O D E L

Mini Van

Coupe Carr Gleason Clyde Blue Red White

Sedan

C O L O R

Blue

Red White Sedan Coupe Mini Van Carr Gleason Clyde

C O L O R

Blue

Red White Carr Gleason Clyde Mini Van Coupe Sedan

COLOR

( ROTATE 90 )

MODEL

DEALERSHIP
( ROTATE 90 )
o

( ROTATE 90 )

DEALERSHIP

DEALERSHIP

MODEL

View #1
D E A L E R S H I P D E A L E R S H I P

View #2

View #3

Carr

Carr Gleason Blue Red White Mini Van Coupe Sedan

Mini Van

Gleason Mini Van Coupe Sedan White Red Blue

Clyde

Clyde

M O D E L

Coupe Blue Red White Clyde Gleason Carr

Sedan

COLOR

( ROTATE 90 )

MODEL

DEALERSHIP
( ROTATE 90 )
o

MODEL

COLOR

COLOR

View #4

View #5

View #6

3 dimensional array has 6 views.


1/13/2012
871
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

871

Features of OLAP - Slicing / Filtering


 MDDB allows end user to quickly slice in on exact view of the data required.

Sales Volumes

M O D E L

Mini Van Mini Van

Coupe Coupe Normal Metal Blue Blue

Carr Clyde

Carr Clyde

Normal Blue

Metal Blue

DEALERSHIP

COLOR
1/13/2012
872
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

872

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION Midwest

DISTRICT

Chicago

St. Louis

Gary

DEALERSHIP

Clyde

Gleason

Carr

Levi

Lucas

Bolton

Sales at region/District/Dealership Level

Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down

1/13/2012
873
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

873

OLAP Reporting - Drill Down

Inflows ( Region , Year)


200 150 Inflows 100 ($M) 50 0 Year Year 1999 2000 Years

East West Central

1/13/2012
874
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

874

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999)


90 80 70 60 50 Inflows ( $M) 40 30 20 10 0 1st Qtr 2nd Qtr 3rd Qtr Year 1999 4th Qtr

East West Central

Drill-down from Year to Quarter


1/13/2012
875
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

875

OLAP Reporting - Drill Down

Inflows ( Region , Year - Year 1999 - 1st Qtr)


20 15 Inflows ( $M 10 ) 5 0 January February March Year 1999 East West Central

Drill-down from Quarter to Month

876

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP


Multidimensional Databases for database and application logic layer

ROLAP - Relational OLAP


Access Data stored in relational Data Warehouse for OLAP Analysis. Database and Application logic provided as separate layers

HOLAP - Hybrid OLAP


OLAP Server routes queries first to MDDB, then to RDBMS and result processed on-the-fly in Server

DOLAP - Desk OLAP


Personal MDDB Server and application on the desktop

1/13/2012
877
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

877

MOLAP - MDDB storage

OLAP
Cube
OLAP Calculation Engine

Web Browser

OLAP Tools

OLAP Appli cations


1/13/2012
878
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

878

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
1/13/2012
879
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

879

ROLAP - Standard SQL storage

MDDB - Relational Mapping

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
1/13/2012
880
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

880

ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on midtier server; target database on database server Processing split between mid-tier & database servers

Ad hoc query capabilities to very large databases DW integration Data scalability

1/13/2012
881
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

881

HOLAP - Combination of RDBMS and MDDB


OLAP Cube
Any Client

Relational DW

Web Browser
OLAP Calculation Engine

SQL

OLAP Tools

OLAP Applications
1/13/2012
882
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

882

HOLAP - Features

RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user

1/13/2012
883
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

883

Architecture Comparison

MOLAP
Definition

ROLAP

HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent

MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent

Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed

Slow

Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis

Cost

Medium: MDDB Server + large disk space cost

Low: Only RDBMS + disk space cost

Where to apply?

Very large transactional Small transactional data + complex model + data & it needs to be viewed / sorted frequent summary analysis

1/13/2012
884
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

884

Representative OLAP Tools:

Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS

Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence

1/13/2012
885
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

885

Sample OLAP Applications

Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
1/13/2012
886
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

886

Data Warehouse Testing

887

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Overview


 There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions  The methodology required for testing a Data Warehouse is different from testing a typical transaction system

888

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts: User-Triggered vs. System triggered Volume of Test Data Possible scenarios/ Test Cases Programming for testing challenge

889

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System.


 User-Triggered vs. System triggered In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing,Valuation.)

890

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


 Volume of Test Data The test data in a transaction system is a very small sample of the overall production data. Data Warehouse has typically large test data as one does try to fill-up maximum possible combination of dimensions and facts.  Possible scenarios/ Test Cases In case of Data Warehouse, the permutations and combinations one can possibly test is virtually unlimited due to the core objective of Data Warehouse is to allow all possible views of data.

891

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Difference In Testing Data warehouse and Transaction System


Programming for testing challenge In case of transaction systems, users/business analysts typically test the output of the system. In case of data warehouse, most of the 'Data Warehouse data Quality testing' and ETL testing is done at backend by running separate stand-alone scripts. These scripts compare pre-Transformation to post Transformation of data.

892

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Warehouse Testing Process


Data-Warehouse testing is basically divided into two parts : 'Back-end' testing where the source systems data is compared to the end-result data in Loaded area 'Front-end' testing where the user checks the data by comparing their MIS with the data displayed by the end-user tools like OLAP. Testing phases consists of :  Requirements testing  Unit testing  Integration testing  Performance testing  Acceptance testing

893

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors.  Are the requirements Complete?  Are the requirements Singular?  Are the requirements Ambiguous?  Are the requirements Developable?  Are the requirements Testable?

894

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source. All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data. Testing the rejected records that dont fulfil transformation rules.

895

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Unit Testing
Unit testing the report data: Verify report data against the source: data in a data warehouse is often stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the available source data. Field-level data verification: the QA team must understand the linkages for the fields displayed in the report and should trace them back and compare them with the source systems. Derivation formulae/calculation rules should also be verified.

896

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Integration Testing
Integration testing will involve the following:  Sequence of ETL jobs in the batch.  Initial loading of records into the data warehouse.  Incremental loading of records at a later date to verify the newly inserted or updated data.  Testing the rejected records that don't fulfil the transformation rules.  Error log generation.

897

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Performance Testing
Performance testing should check for:  ETL processes completing within the allotted time window.  Monitoring and measuring of data quality issues.  Refresh times for standard/complex reports. (A small timing sketch follows.)
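One simple way to automate such checks is a timing harness like the sketch below; the time limits, and the sleep calls standing in for the real ETL batch and report refresh, are purely illustrative assumptions.

import time

ETL_WINDOW_SECONDS = 4 * 3600      # hypothetical nightly load window
REPORT_SLA_SECONDS = 30            # hypothetical refresh-time target per report

def run_timed(label, func, limit):
    # Run the given callable, measure elapsed time and compare it to the limit.
    start = time.perf_counter()
    func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s (limit {limit}s) -> "
          f"{'OK' if elapsed <= limit else 'EXCEEDED'}")

# Stand-ins for the real ETL batch and report refresh calls.
run_timed("nightly ETL batch", lambda: time.sleep(0.1), ETL_WINDOW_SECONDS)
run_timed("standard sales report", lambda: time.sleep(0.05), REPORT_SLA_SECONDS)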

898

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.

899

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Questions

900

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Thank You

901

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


Data Warehouse Architecture


This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.

908

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Modeling
Effective way of using a Data Warehouse

909

2009 Wipro Ltd - Confidential

Data Modeling
Commonly, the E-R data model is used in OLTP systems, while the dimensional data model is commonly used in OLAP. E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, book or student. Relationship: relates entities to other entities.

 Different perspectives of data modeling:


o Conceptual Data Model o Logical Data Model o Physical Data Model

 Types of Dimensional Data Models most commonly used:


o Star Schema o Snowflake Schema
910
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:  Dimension: a category of information. For example, the time dimension.  Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.  Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.  Fact Table: a table that contains the measures of interest.  Lookup Table: provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.  Surrogate Keys: used to preserve data integrity; they are helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables. A small sketch of such a model follows.
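A minimal sketch of these terms in code, using SQLite: a fact table joined to lookup/dimension tables through surrogate keys, with part of the Year → Quarter → Month hierarchy carried as attributes of the time dimension. The table and column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension (lookup) tables keyed by surrogate keys; natural keys kept as attributes.
cur.execute("""CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,      -- surrogate key
    day TEXT, month TEXT, quarter TEXT, year INTEGER)""")
cur.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    prod_id TEXT, name TEXT)""")

# Fact table holds the measures plus foreign keys to the dimensions.
cur.execute("""CREATE TABLE fact_sales (
    time_key INTEGER, product_key INTEGER, qty INTEGER, amt REAL)""")

cur.execute("INSERT INTO dim_time VALUES (1, '1997-07-01', 'Jul', 'Q3', 1997)")
cur.execute("INSERT INTO dim_product VALUES (1, 'p1', 'bolt')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 12.0)")

# Roll the measure up along the time hierarchy (Year, Quarter).
cur.execute("""SELECT t.year, t.quarter, SUM(f.amt)
               FROM fact_sales f JOIN dim_time t ON f.time_key = t.time_key
               GROUP BY t.year, t.quarter""")
print(cur.fetchall())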
911
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
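For illustration (not part of the original deck), the sketch below loads the sample rows above into an in-memory SQLite database and runs a typical star-schema query: the fact table joined to two of its dimension tables, with the measures aggregated.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (prodId TEXT, name TEXT, price REAL)")
cur.execute("CREATE TABLE store (storeId TEXT, city TEXT)")
cur.execute("CREATE TABLE sale (orderId TEXT, date TEXT, custId TEXT, "
            "prodId TEXT, storeId TEXT, qty INTEGER, amt REAL)")

cur.executemany("INSERT INTO product VALUES (?,?,?)",
                [("p1", "bolt", 10), ("p2", "nut", 5)])
cur.executemany("INSERT INTO store VALUES (?,?)",
                [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
cur.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                [("o100", "1/7/97", "53", "p1", "c1", 1, 12),
                 ("o102", "2/7/97", "53", "p2", "c1", 2, 11),
                 ("105",  "3/8/97", "111", "p1", "c3", 5, 50)])

# Typical star-schema query: join the fact table to its dimensions and aggregate.
cur.execute("""SELECT p.name, s.city, SUM(f.qty) AS units, SUM(f.amt) AS revenue
               FROM sale f
               JOIN product p ON f.prodId = p.prodId
               JOIN store   s ON f.storeId = s.storeId
               GROUP BY p.name, s.city""")
for row in cur.fetchall():
    print(row)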

912

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Snowflake Schema
Dimension tables (snowflaked):

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
913

region
regId  name
north  cold region
south  warm region

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Overview of Data Cleansing

914

2009 Wipro Ltd - Confidential

The Need For Data Quality


 Difficulty in decision making
 Time delays in operation
 Organizational mistrust
 Data ownership conflicts
 Customer attrition
 Costs associated with: error detection, error rework, customer service, fixing customer problems

915

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Six Steps To Data Quality


1. Understand information flow in the organization: identify authoritative data sources, interview employees and customers, locate data entry points, estimate the cost of bad data.

2. Identify potential problem areas and assess their impact.

3. Measure the quality of data: use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values (see the sketch after this list).

4. Clean and load data: use data cleansing tools to clean data at the source, and load only clean data into the data warehouse.

5. Continuous monitoring: schedule periodic cleansing of source data.

6. Identify areas of improvement: identify and correct the causes of defects, refine data capture mechanisms at the source, and educate users on the importance of data quality.
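A minimal sketch of the "measure the quality of data" step: a few hand-written business-rule checks for duplicate keys, missing values and out-of-range values over an invented customer extract. The rules and data are assumptions, not part of the deck.

from collections import Counter

# Illustrative extract; in practice this would come from the source system.
customers = [
    {"cust_id": "53", "age": 34,  "email": "joe@example.com"},
    {"cust_id": "81", "age": 212, "email": ""},                 # out-of-range age, missing email
    {"cust_id": "53", "age": 34,  "email": "joe@example.com"},  # duplicate key
]

issues = []
key_counts = Counter(c["cust_id"] for c in customers)
issues += [f"duplicate cust_id {k}" for k, n in key_counts.items() if n > 1]
issues += [f"missing email for cust_id {c['cust_id']}" for c in customers if not c["email"]]
issues += [f"age out of range for cust_id {c['cust_id']}" for c in customers
           if not 0 <= c["age"] <= 120]

print(f"checked {len(customers)} records, found {len(issues)} issue(s)")
for issue in issues:
    print(" -", issue)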
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

916

Data Quality Solution


Customized Programs
 Strengths: addresses specific needs; no bulky one-time investment
 Limitations: tons of custom programs in different environments are difficult to manage; minor alterations demand coding effort

Data Quality Assessment Tools
 Strengths: provide automated assessment
 Limitations: no measure of data accuracy

917

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Quality Solution


Business Rule Discovery Tools
 Strengths: detect correlations in data values; can detect patterns of behavior that indicate fraud
 Limitations: not all variables can be discovered; some discovered rules might not be pertinent; there may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
 Strengths: usually integrated packages with cleansing features as add-ons
 Limitations: error prevention at source is usually absent; the ETL tools have limited cleansing facilities
918
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Tools In The Market


 Business Rule Discovery Tools: Integrity Data Reengineering Tool from Vality Technology; Trillium Software System from Harte-Hanks Data Technologies; Migration Architect from DB Star
 Data Reengineering & Cleansing Tools: Carlton Pureview from Oracle; ETI-Extract from Evolutionary Technologies; PowerMart from Informatica Corp; Sagent Data Mart from Sagent Technology
 Data Quality Assessment Tools: Migration Architect, Evoke Axio from Evoke Software; Wizrule from Wizsoft
 Name & Address Cleansing Tools: Centrus Suite from Sagent; I.d.centric from First Logic

919

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Data Extraction, Transformation, Load

920

2009 Wipro Ltd - Confidential

ETL Architecture

[Diagram: ETL architecture. Data from visitors' web browsers (via the Internet), external data (demographics, household, webographics, income) and other OLTP systems flows into a staging area holding web server logs, e-commerce transaction data and flat files, where it is cleaned, transformed, matched and merged. Scheduled extraction and scheduled loading move it through an RDBMS into the enterprise data warehouse, supported by a metadata repository. Stages: data collection → data extraction → data transformation → data loading → data storage & integration.]

921

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

ETL Architecture
Data Extraction:
 Rummages through a file or database
 Uses some criteria for selection
 Identifies qualified data
 Transports the data over onto another file or database

Data Transformation:
 Integrating dissimilar data types
 Changing codes
 Adding a time attribute
 Summarizing data
 Calculating derived values
 Renormalizing data

Data Cleanup:
 Restructuring of records or fields
 Removal of operational-only data
 Supply of missing field values
 Data integrity checks
 Data consistency and range checks, etc.

Data Loading:
 Initial and incremental loading
 Updating of metadata

A toy sketch of these steps follows.
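To tie the steps together, here is a small, self-contained Python sketch of extract, transform, initial plus incremental load, and a metadata update. The source rows, transformation rules and the "last_loaded" marker are illustrative assumptions, not part of any specific ETL tool.

import datetime

source = [
    {"id": 1, "status": "A", "amount": "100.5", "updated": "2009-01-01"},
    {"id": 2, "status": "X", "amount": "7.0",   "updated": "2009-01-02"},
    {"id": 3, "status": "A", "amount": "42.0",  "updated": "2009-01-03"},
]
warehouse, metadata = [], {"last_loaded": None, "rows_loaded": 0}

def extract(rows, since=None):
    # Selection criteria: only active records, optionally only those changed since the last load.
    return [r for r in rows if r["status"] == "A"
            and (since is None or r["updated"] > since)]

def transform(rows):
    # Change codes, derive values and add a load-time attribute.
    return [{"id": r["id"], "amount": float(r["amount"]),
             "status": "ACTIVE", "load_date": str(datetime.date.today())}
            for r in rows]

def load(rows):
    warehouse.extend(rows)                      # initial or incremental append
    metadata["rows_loaded"] += len(rows)        # update the metadata after each load
    # Track the source high-water mark for the next incremental run (simplified).
    metadata["last_loaded"] = max((r["updated"] for r in source), default=None)

load(transform(extract(source)))                                    # initial load
source.append({"id": 4, "status": "A", "amount": "9.9", "updated": "2009-01-04"})
load(transform(extract(source, since=metadata["last_loaded"])))     # incremental load
print(len(warehouse), "rows in warehouse;", metadata)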

922

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Why ETL ?
 Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.  The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.  To solve the problem, companies use extract, transform and load (ETL) software.  The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

923

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing

924

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Major components involved in ETL Processing


 Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs (see the sketch below)
 Metadata management: provides a repository to define, document, and manage information about the ETL design and runtime processes
 Extract: the process of reading data from a database
 Transform: the process of converting the extracted data
 Load: the process of writing the data into the target database
 Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
 Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems
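As a sketch of the kind of source-to-target mapping a design manager captures, the snippet below applies a small declarative mapping to one source row. The mapping format, field names and transformation rules are hypothetical.

# Hypothetical source-to-target mapping, of the kind an ETL design manager stores.
MAPPING = {
    "target_table": "dim_customer",
    "columns": {
        "customer_key": {"source": "CUST_NO", "transform": int},
        "full_name":    {"source": "CUST_NM", "transform": str.title},
        "city":         {"source": "CITY",    "transform": str.upper},
    },
}

def apply_mapping(source_row, mapping):
    """Build one target row from a source row using the declarative mapping."""
    return {col: spec["transform"](source_row[spec["source"]])
            for col, spec in mapping["columns"].items()}

row = {"CUST_NO": "0053", "CUST_NM": "joe bloggs", "CITY": "sfo"}
print(apply_mapping(row, MAPPING))
# -> {'customer_key': 53, 'full_name': 'Joe Bloggs', 'city': 'SFO'}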
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential


925

ETL Tools
 Provides facility to specify a large number of transformation rules with a GUI  Generate programs to transform data  Handle multiple data sources  Handle data redundancy  Generate metadata as output  Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation  PowerCentre/Mart from Informatica  Data Mart Solution from Sagent Technology  DataStage from Ascential
926

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management

927

2009 Wipro Ltd - Confidential

What Is Metadata?
Metadata is Information...

 That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
 About the data being captured and loaded into the warehouse
 Documented in IT tools that improve both business and technical understanding of data and data-related processes
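As a hedged illustration of what one such description might contain (all field names and values below are assumptions), a single metadata entry for a warehouse column could look like this:

# Hypothetical metadata entry for one warehouse column, combining business and
# technical metadata: what it means, where it comes from, when and how it is loaded.
column_metadata = {
    "table": "fact_sales",
    "column": "amt",
    "business_definition": "Net sale amount in USD after discounts",        # WHAT
    "source": {"system": "orders_oltp", "table": "SALES", "field": "AMT"},  # WHERE from
    "transformation": "AMT * exchange_rate, rounded to 2 decimals",         # HOW
    "load_job": "nightly_sales_load",                                       # WHO/HOW loaded
    "last_loaded": "2009-06-30T02:15:00",                                   # WHEN
    "steward": "finance_team",
}
print(column_metadata["business_definition"])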

928

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Importance Of Metadata
Locating information: How much time is spent looking for information? How often is the information found? What poor decisions were made based on incomplete information? How much money was lost or earned as a result?
Interpreting information: How many times have businesses needed to rework or recall products? What impact does that have on the bottom line? How many mistakes were due to misinterpretation of existing documentation? How much misinterpretation results from too much metadata? How much time is spent trying to determine whether any of the metadata is accurate?
Integrating information: How do the various data perspectives connect together? How much time is spent trying to figure that out? How much does the inefficiency and lack of metadata affect decision making?

929

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Requirements for DW Metadata Management


 Provide a simple catalogue of business metadata descriptions and views  Document/manage metadata descriptions from an integrated development environment  Enable DW users to identify and invoke pre-built queries against the data stores  Design and enhance new data models and schemas for the data warehouse  Capture data transformation rules between the operational and data warehousing databases  Provide change impact analysis, and update across these technologies
930
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Consumers of Metadata
 Technical users: warehouse administrator, application developer
 Business users: business metadata (meanings, definitions, business rules)
 Software tools: used in DW life-cycle development; metadata requirements for each tool must be identified; the tool-specific metadata should be analysed for inclusion in the enterprise metadata repository; previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

931

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Third Party Bridging Tools  Oracle Exchange
Technology of choice for a long list of repository, enterprise and workgroup vendors

 Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata

 Ardent Software/ Dovetail Software -Interplay


Hub and Spoke solution for enabling metadata interoperability Ardent focussing on own engagements, not selling it as independent product

 Informix's Metadata Plug-ins


Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy
932
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Repositories  IBM, Oracle and Microsoft to offer free or near-free basic repository services  Enable organisations to reuse metadata across technologies  Integrate DB design, data transformation and BI tools from different vendors  Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata  Both IBM and Oracle have multiple repositories for different lines of products e.g., One for AD and one for DW, with bridges between them

933

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Trends in the Metadata Management Tools


Metadata Interchange Standards  CDIF (CASE Data Interchange Format)
Most frequently used interchange standard Addresses only a limited subset of metadata artifacts

 OMG (Object Management Group)-CWM


XML-addresses context and data meaning, not presentation Can enable exchange over the web employing industry standards for storing and sharing programming data Will allow sharing of UML and MOF objects b/w various development tools and repositories

 MDC (Metadata Coalition)


Based on XML/UML standards Promoted by Microsoft Along With 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member),Viasoft
934
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP

935

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools

1/13/2012

936

936

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

OLAP: On-Line Analytical Processing


 OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)  The term is used interchangeably with BI  A multidimensional view of data is the foundation of OLAP  Users: analysts, decision makers

1/13/2012

937

937

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Distinction between OLTP and OLAP


Source of data: OLTP system - operational data; OLTPs are the original source of the data. OLAP system - consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data: OLTP system - to control and run fundamental business tasks. OLAP system - decision support.
What the data reveals: OLTP system - a snapshot of ongoing business processes. OLAP system - multi-dimensional views of various kinds of business activities.
Inserts and updates: OLTP system - short and fast inserts and updates initiated by end users. OLAP system - periodic long-running batch jobs refresh the data.
1/13/2012

938

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is  intimately related and  stored, viewed and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data.  The edges of the cube are called dimensions  Individual items within each dimension are called members (a small sketch follows)
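A minimal sketch of these ideas in code, assuming a toy three-dimensional cube whose members echo the slide examples; the sales figures themselves are made up.

import itertools

dimensions = {
    "model":  ["Mini Van", "Coupe", "Sedan"],
    "color":  ["Blue", "Red", "White"],
    "dealer": ["Clyde", "Gleason", "Carr"],
}

# Cells of the hypercube, keyed by one member from each dimension.
cube = {("Mini Van", "Blue", "Clyde"): 6,
        ("Mini Van", "Red", "Gleason"): 3,
        ("Coupe", "White", "Carr"): 5}

# Retrieve a slice: all cells for a given member of the 'model' dimension.
mini_van = {k: v for k, v in cube.items() if k[0] == "Mini Van"}
print("Mini Van cells:", mini_van)

# The cube has len(model) x len(color) x len(dealer) = 27 possible cells.
total = len(list(itertools.product(*dimensions.values())))
print(f"{len(cube)} populated of {total} possible cells")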

939

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS: the sales volumes are held as a flat table with columns MODEL (Mini Van, Sports Coupe, Sedan, ...), COLOR (Blue, Red, White), DEALER (Clyde, Gleason, Carr) and VOL., i.e. 27 rows x 4 columns = 108 cells.

MDDB: the same sales volumes are held as a 3 x 3 x 3 cube over MODEL, COLOR and DEALERSHIP = 27 cells.

940

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Benefits of MDDB over RDBMS


 Ease of Data Presentation & Navigation A great deal of information is gleaned immediately upon direct inspection of the array User is able to view data along presorted dimensions with data arranged in an inherently more organized, and accessible fashion than the one offered by the relational table.  Storage Space Very low Space Consumption compared to Relational DB  Performance Gives much better performance. Relational DB may give comparable results only through database tuning (indexing, keys etc), which may not be possible for ad-hoc queries.  Ease of Maintenance No overhead as data is stored in the same way it is viewed. In Relational DB, indexes, sophisticated joins etc. are used which require considerable storage and maintenance
1/13/2012
941
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

941

Issues with MDDB

Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions

Data Explosion
-Due to Sparsity -Due to Summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
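To put rough numbers on sparsity and data explosion, the following illustrative simulation (the dimension sizes and the roughly 1% density are assumptions) populates a small fraction of a 50 x 20 x 12 cube and then counts how many cells exist once every "ALL" aggregate combination is pre-computed.

import itertools, random

random.seed(0)
# Hypothetical cube: 50 products x 20 stores x 12 months = 12,000 possible base cells.
sizes = (50, 20, 12)
base = {tuple(random.randrange(n) for n in sizes) for _ in range(120)}  # ~1% density

# Pre-compute every aggregate combination by replacing members with an 'ALL' marker.
cells = set()
for cell in base:
    for mask in itertools.product((False, True), repeat=len(cell)):
        cells.add(tuple("ALL" if m else v for v, m in zip(cell, mask)))

print(f"possible base cells: {50 * 20 * 12:,}")
print(f"populated base cells (sparsity): {len(base)}")
print(f"cells after summarization: {len(cells)} "
      f"(about {len(cells) / len(base):.1f}x explosion)")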

1/13/2012
942
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

942

Issues with MDDB - Sparsity Example


 If dimension members of different dimensions do not interact, a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCUS      33    41
WEISS      23    19

[Figure: arranging this data as a LAST NAME x EMPLOYEE # array holding AGE populates only the nine cells where a last name meets its own employee number; every other cell is blank, illustrating sparsity.]

1/13/2012
943
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

943

OLAP Features
 Calculations applied across dimensions, through hierarchies and/or across members
 Trend analysis over sequential time periods
 What-if scenarios
 Slicing/dicing subsets for on-screen viewing
 Rotation to new dimensional comparisons in the viewing area
 Drill-down/up along the hierarchy
 Reach-through/drill-through to underlying detail data
(a sketch of several of these operations follows)
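A hedged sketch of rotation, slicing and drill-down, using pandas purely as a convenient stand-in for an OLAP front end; the sales figures and dimension members are invented.

import pandas as pd

# Toy sales cube held as a flat table.
df = pd.DataFrame({
    "model":  ["Mini Van", "Mini Van", "Coupe", "Coupe", "Sedan", "Sedan"],
    "color":  ["Blue", "Red", "Blue", "White", "Red", "White"],
    "dealer": ["Clyde", "Carr", "Gleason", "Clyde", "Carr", "Gleason"],
    "year":   [1999, 1999, 1999, 2000, 2000, 2000],
    "qty":    [6, 5, 3, 2, 4, 1],
})

# Rotation: a model x color view, then the axes swapped.
view1 = df.pivot_table(index="model", columns="color", values="qty", aggfunc="sum")
view2 = view1.T

# Slicing: restrict the view to one member of the model dimension.
mini_van = df[df["model"] == "Mini Van"]

# Drill-down / roll-up along a hierarchy: year level, then year + model level.
by_year = df.groupby("year")["qty"].sum()
by_year_model = df.groupby(["year", "model"])["qty"].sum()

print(view1, view2, mini_van, by_year, by_year_model, sep="\n\n")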

1/13/2012
944
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

944

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation in OLAP.

[Figure: a MODEL x COLOR sales-volume view (View #1) rotated by 90 degrees into a COLOR x MODEL view (View #2).]

A 2-dimensional array has 2 views.


1/13/2012
945
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

945

Features of OLAP - Rotation


[Figure: rotating a 3-dimensional MODEL x COLOR x DEALERSHIP sales-volume cube by 90 degrees at a time produces six different 2-D views (View #1 to View #6).]

A 3-dimensional array has 6 views.


1/13/2012
946
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

946

Features of OLAP - Slicing / Filtering


 MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: slicing the sales-volume cube down to the Mini Van and Coupe models, the Clyde and Carr dealerships, and the normal blue and metallic blue colors.]
1/13/2012
947
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

947

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

[Figure: hierarchy of the organization dimension: Region (Midwest) > District (Chicago, St. Louis, Gary) > Dealership (Clyde, Gleason, Carr, Levi, Lucas, Bolton); sales can be viewed at the region, district or dealership level.]

Moving up and moving down a hierarchy is referred to as drill-up/roll-up and drill-down.

1/13/2012
948
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

948

OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by region (East, West, Central) for the years 1999 and 2000.]

1/13/2012
949
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

949

OLAP Reporting - Drill Down

Inflows (Region, Year: 1999)

[Chart: Inflows ($M) by region (East, West, Central) for the four quarters of 1999.]

Drill-down from Year to Quarter


1/13/2012
950
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

950

OLAP Reporting - Drill Down

Inflows (Region, Year: 1999, 1st Qtr)

[Chart: Inflows ($M) by region (East, West, Central) for January, February and March 1999.]

Drill-down from Quarter to Month

951

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP


Multidimensional Databases for database and application logic layer

ROLAP - Relational OLAP


Access Data stored in relational Data Warehouse for OLAP Analysis. Database and Application logic provided as separate layers

HOLAP - Hybrid OLAP


OLAP Server routes queries first to MDDB, then to RDBMS and result processed on-the-fly in Server

DOLAP - Desk OLAP


Personal MDDB Server and application on the desktop

1/13/2012
952
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

952

MOLAP - MDDB storage

[Diagram: MOLAP: an OLAP calculation engine operating directly on a multidimensional OLAP cube serves web browsers, OLAP tools and OLAP applications.]


1/13/2012
953
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

953

MOLAP - Features

Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
1/13/2012
954
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

954

ROLAP - Standard SQL storage

[Diagram: ROLAP: an OLAP calculation engine holding an MDDB-to-relational mapping translates multidimensional requests into SQL against the relational data warehouse and serves web browsers, OLAP tools and OLAP applications.]
1/13/2012
955
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

955

ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on midtier server; target database on database server Processing split between mid-tier & database servers

Ad hoc query capabilities to very large databases DW integration Data scalability

1/13/2012
956
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

956

HOLAP - Combination of RDBMS and MDDB


[Diagram: HOLAP: the OLAP calculation engine combines an OLAP cube (summary data) with SQL access to the relational data warehouse, serving any client, web browsers, OLAP tools and OLAP applications.]
1/13/2012
957
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

957

HOLAP - Features

RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user

1/13/2012
958
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

958

Architecture Comparison

Definition: MOLAP - MDDB OLAP = transaction-level data + summary in MDDB. ROLAP - Relational OLAP = transaction-level data + summary in RDBMS. HOLAP - Hybrid OLAP = ROLAP + summary in MDDB.

Data explosion due to sparsity: MOLAP - high (may go beyond control; estimation is very important). ROLAP - no sparsity. HOLAP - sparsity exists only in the MDDB part.

Data explosion due to summarization: MOLAP - 3 to 10 times with a good design. ROLAP - to the necessary extent. HOLAP - to the necessary extent.

Query execution speed: MOLAP - fast (depends upon the size of the MDDB). ROLAP - slow. HOLAP - optimum (if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP).

Cost: MOLAP - medium (MDDB server + large disk space). ROLAP - low (only RDBMS + disk space). HOLAP - high (RDBMS + disk space + MDDB server).

Where to apply: MOLAP - small transactional data + complex model + frequent summary analysis. ROLAP - very large transactional data that needs to be viewed/sorted. HOLAP - large transactional data + frequent summary analysis.

1/13/2012
959
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

959

Representative OLAP Tools:

Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS

Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence

1/13/2012
960
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

960

Sample OLAP Applications

Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
1/13/2012
961
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

961

