
Data Management

October 20, 2010 TCS Public


Contents

• Data Warehouse Concepts


• Data Modeling
• Dimensional Modeling
• Implementation and Maintenance
• Data Management
 Data Quality Analysis
 Metadata Management
 Data Governance
 Master Data Management
 Data Storage, Movement and Access



Data Warehouse Concepts



Data Warehouse Concepts Agenda

A. What is a Data Warehouse (DW)?
B. What are the components of a DW?
C. What are the various architectures/formats of a DW?
D. Examples of Data Warehousing tools in use



Need for Data Warehousing – Business View

• Customer Centricity
 Single view of each customer and his/her activities
 Integrated information from heterogeneous sources
• Adaptability to rapidly changing business needs
 Multiple ways to view business performance
 Low cycle time, faster analytics
• Increased Global competition
 Crunch more and more data, faster and faster
• Mergers and Acquisitions
 With each acquisition comes another set of disparate IT systems
affecting consistency and performance



Need for Data Warehousing – Systemic View

• Performance Optimization
 OLTP systems get overloaded with large analytical
queries
 Data Models for OLTP and OLAP are very different
• Reduce reliance on IT to produce reports
 Report creation on OLTP systems is very technical
• OLTP systems not built to hold history data
• Data Security
 To prevent unauthorized access to sensitive data



Data Warehouse Defined
Data Warehouse is a –
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
collection of data enabling management
decision making



Subject Orientation

Figure: transactional storage is process oriented — a single order-entry
record mixes fields from many subjects (Sales Rep, Quantity Sold, Part
Number, Date, Customer Name, Product Description, Unit Price, Mail
Address). Data warehouse storage is subject oriented — the same data is
reorganized into subject areas: Sales, Customers and Products.


Data Volatility

Figure: transactional storage is volatile — records are inserted,
changed, deleted and accessed one record at a time. Data warehouse
storage is non-volatile — data is mass loaded and then only accessed.


Time Variance

Figure: transactional storage holds current data; data warehouse storage
holds historical data — e.g. a chart of first-quarter 1997 sales (in
lakhs) by region (East, West, North) for January, February and March.


Data Warehouse Characteristics

• Stores large volumes of data used frequently by DSS

• Is maintained separately from operational databases

• Is relatively static, with infrequent updates

• Contains data integrated from several, possibly heterogeneous,
operational databases

• Supports queries processing large data volumes



Three views of Data Warehousing

• Strategic or Business view


 Define key business drivers of data warehouse
 How can business-driven approach achieve high ROI?
• Architectural or Technology view
 Alternative data warehousing architectures
 How can the right architecture achieve a high ROI?
• Methodology or Implementation view
 Development and implementation methodology
 How can the right methodology achieve a rapid ROI?





Data Warehouse Components

Figure: feeder systems FS1, FS2 … FSn (including legacy systems) flow
over the network into a staging area, where extraction, cleansing,
transformation, aggregation, summarization, transmission and data mart
population take place. The staging area loads the ODS and the DW, which
in turn populate data marts DM1, DM2 … DMn serving OLAP analysis and
knowledge discovery. A metadata layer spans all components.


Data Warehouse Build Lifecycle

• Data extraction

• Data Cleansing and Transformation

• Data Load and refresh

• Build derived data and views

• Service queries

• Administer the warehouse





Data Warehouse Architectures

• Virtual Data Warehouse

• Enterprise Data Warehouse

• Data Marts

• Distributed Data Marts

• Multi-tiered warehouse



Virtual Data Warehouse

Figure: users run a reporting tool directly against the operational
systems data — legacy, client/server, OLTP applications and external
sources — with no separate warehouse store.


Enterprise Data Warehouse

Figure: data preparation (select, extract, transform, integrate,
maintain) moves operational systems data — legacy, client/server, OLTP
and external sources — into a single enterprise data warehouse with a
metadata repository. Users access the warehouse through an API via a
reporting tool.


Data Marts

Figure: the same data preparation steps (select, extract, transform,
integrate, maintain) load operational systems data directly into a data
mart with its own metadata repository, accessed by users through an API
via a reporting tool.


Distributed Data Marts

Figure: data preparation loads operational systems data into several
independent data marts, each accessed by users through an API via a
reporting tool; there is no central warehouse.


Multi-tiered Data Warehouse: Option 1

Figure: data preparation loads operational systems data into an
enterprise-wide data warehouse with a metadata repository; dependent
data marts are then populated from the warehouse and accessed by users
through an API via a reporting tool.


Multi-tiered Data Warehouse: Option 2

Figure: data preparation loads operational systems data into data marts
first; the data marts then feed the data warehouse (with its metadata
repository), and users access the marts through an API via a reporting
tool.


Relative Data sizes in a Data Warehouse

Figure: a pyramid of data volumes, smallest at the top — highly
summarized data, then lightly summarized data, then current detail data
and, at the base, older detail data — with metadata alongside all
layers.
Data Warehouse - Example

Figure: the same pyramid populated — monthly sales by region and by
product for 1991-94 at the top; weekly sales by region and by
product/sub-product for 1991-94 below; sales detail for 1991-94 at the
base; older sales detail for 1985-90 archived; metadata alongside.


Building a Data Warehouse - Steps

• Identify key business drivers, sponsorship, risks, ROI

• Survey information needs, identify desired functionality and define
functional requirements for the initial subject area

• Architect the long-term data warehousing architecture

• Evaluate and finalize DW tools & technology

• Conduct a Proof-of-Concept
Building a Data Warehouse - Steps

• Design the target database schema

• Build data mapping, extract, transformation, cleansing and
aggregation/summarization rules

• Build the initial data mart using an exact subset of the enterprise
data warehousing architecture, and expand to the enterprise architecture
over subsequent phases

• Maintain and administer the data warehouse


Data Warehouse Concepts Agenda

A.What is a Data Warehouse (DW) ?


B.What are the components of a DW ?
C.What are the various architectures/formats of a DW ?
D.Examples of Data Warehousing tools in use

October 20, 2010 28


Representative DW Tools

Tool Category    Product
ETL Tools        ETI, Ora
OLAP Server      Ora, OLA
Data Warehousing - Insights

• An enabling technology that facilitates improved business
decision-making

• A process, not a product

• A technique for assembling and managing a wide variety of data from
heterogeneous systems for decision support

October 20, 2010 30


Data Modeling



Modeling – ER Model

Definition
A logical and graphical representation of an organization's information needs

Process
• Classifying
 Entities
• Characterizing
 Attributes
• Inter-relating
 Relationships



Modeling – Logical Model

Definition
Representation of a business problem without regard to
implementation, technology and organizational structure

Features
• Represents business requirements completely, correctly and
consistently
• Removes redundancy
• Does not presuppose data granularity
• Is not implemented directly



Modeling – Example of ER model



Modeling – Physical Model

Definition
Specification of what is implemented

Features
• Optimized
• Efficient
• Buildable
• Robust



Dimensional Modeling

• Form of analytical design (or physical model) in which data is
pre-classified as a fact or dimension
• Improves performance by matching the data structure to the queries

“Give this period’s total sales volume and revenue by product, business
unit and package”




Dimensional Modeling Agenda

A. What is Dimensional Modeling?
B. What are the various types of Dimension Tables?
C. What are the various types of Fact Tables?
D. How do I model a star schema?



A Quick Recap of Data Warehousing
Data Warehouse is a SUBJECT ORIENTED, INTEGRATED, TIME-VARIANT,
NON-VOLATILE collection of data enabling Management Decision Making

Figure: source systems (RDBMS, ERP, CRM, mainframe DBs, PC DBs) are
extracted over the network into a staging area for cleansing,
transformation, validation and massaging; the staging area feeds the ODS
and the DW; aggregation, summarization, data mart population, dimension
loading and fact loading build the data marts; client browsers then
consume reports, cubes, analysis, data mining, dashboards, MIS reports,
company quarterly reports, etc.
Dimensional Modeling: In Perspective
Dimensional Modeling is an effective, efficient and successful technique
to design Enterprise Data Warehouse and Distributed Data Mart database
schemas for maximum query performance

Figure: the same pipeline as the recap — sources (RDBMS, ERP, CRM,
mainframe DBs, PC DBs) through the staging area into the ODS, DW and
data marts — with the DW and data marts highlighted as the dimensional
modeling area.


Dimensional Model: Strengths
• Predictable, Standard Framework
 Query tools and user interfaces can be created to provide a consistent
way of reporting data
 Most filter conditions being on dimensional attributes allows
performance boosting through bit-vector indexes on dimension table
columns
 Metadata functionality in the query tools can make use of the known
cardinality of dimension values to offer facilities such as
value-selection drop-downs and value-selection windows

• Resilience to unexpected user querying patterns


 All dimensions are equivalent entry points into the fact table (number
of joins to fact table is same = 1)
 Symmetrical query strategies and SQL
 Logical Design can be done independent of query patterns



Dimensional Model: Strengths
• Graceful Extensibility
 Allows adding new unanticipated facts as long as they are
consistent with the fact table grain
 Allows adding new dimension tables as long as a single value of
that dimension is defined for each existing fact record
 Allows adding new unanticipated dimensional attributes

• Standard approaches for common modeling situations


 Slowly Changing Dimensions (SCDs)
 Heterogeneous products (e.g. Savings account, current account)
 Pay-in-advance databases
 Event-handling databases (factless facts)



What is a Fact?

Figure: facts are business measures — e.g. Sales Volume and Revenue.


Facts and Fact Tables

Example fact measures: Revenue, Cost, No. of Accounts.

Sales fact table: Date Key (int), Store Key (int), Product Key (int),
Sales (float), Qty Sold (int), Price (float), Discount (float)

Billing fact table: Date Key (int), Customer Key (int), Service Line Key
(int), Rate Plan Key (int), Number of Total Minutes, Number of Calls
(int), Service Charge (float), Taxes (float)

The term FACT represents a single business measure, e.g. Sales, Qty
Sold.
Each fact has a GRAIN, which is the set of perspectives or attributes
that define/qualify the fact completely.
E.g. the grain of “Sales” could be “for each PRODUCT, at each STORE, on
each/every DAY”.
A FACT TABLE is the primary table in a dimensional model where the
business measures or FACTS are stored.
A business measure or FACT is a row in a FACT TABLE. All FACTS in a FACT
TABLE must be at the SAME GRAIN.


Fact tables: some features
• Fact tables express MANY-MANY RELATIONSHIPS between dimensions in
dimensional models
 One product can be sold in many stores while a single store typically sells
many different products at the same time
 The same customer can visit many stores and a single store typically has
many customers
• The fact table is typically the MOST NORMALIZED TABLE in a dimensional model
 No repeating groups (1NF); no redundant columns (2NF)
 No columns that are not dependent on keys; all facts are dependent on
keys (3NF)
 No independent multiple relationships (4NF)
• Fact tables can contain HUGE DATA VOLUMES running into millions of rows
• Facts can be identified by answering the question: “WHAT ARE WE
MEASURING?”
• The grain of the fact table can be identified by answering the question: “HOW DO
YOU DESCRIBE A SINGLE ROW IN THE FACT TABLE?”
• All facts within the same fact table must be at the SAME GRAIN



Fact tables: some features
• The grain of the fact table is the LIST OF DIMENSIONS that uniquely define each
row of the fact table
– Each row of sales fact table is uniquely identified by a unique combination
of store, product and time: which product, which store, when sold
– Additional foreign keys may exist in the fact table that point to other
dimension tables, e.g. payment type, but which do not contribute to the grain

• Every foreign key in a Fact table is usually a DIMENSION TABLE PRIMARY KEY

• Every foreign key in a Fact table is usually an INTEGER KEY

• Fact tables are TYPICALLY used in GROUP BY SQL queries

• Every column in a fact table is either a foreign key to a dimension table primary
key or a fact

• Every non-key column in the fact table is typically used in the SELECT clause of
a SQL query
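The query pattern described above — dimension attributes in the WHERE and GROUP BY clauses, facts aggregated in the SELECT clause — can be sketched with a tiny in-memory star schema. The table and column names here are illustrative, not from the deck:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: one row per store / product, descriptive attributes only.
cur.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT)")
cur.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT)")

# Fact table: every column is either a foreign key to a dimension or a fact.
cur.execute("""CREATE TABLE sales_fact (
    store_key INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    sales REAL, qty_sold INTEGER)""")

cur.executemany("INSERT INTO store_dim VALUES (?,?,?)",
                [(1, "Downtown", "East"), (2, "Mall", "West")])
cur.executemany("INSERT INTO product_dim VALUES (?,?,?)",
                [(1, "Soap", "Acme"), (2, "Shampoo", "Acme")])
cur.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)",
                [(1, 1, 100.0, 10), (1, 2, 50.0, 5), (2, 1, 80.0, 8)])

# Typical dimensional query: group on a dimension attribute, aggregate the facts.
rows = cur.execute("""
    SELECT s.region, SUM(f.sales) AS total_sales, SUM(f.qty_sold) AS total_qty
    FROM sales_fact f
    JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY s.region
    ORDER BY s.region""").fetchall()
print(rows)  # [('East', 150.0, 15), ('West', 80.0, 8)]
```

The dimension table (store) supplies the entry point and the grouping attribute; the fact table supplies only keys and measures.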



What is a Dimension?

A Data Warehouse is a Subject-Oriented, Integrated, Time-Variant,
Non-volatile collection of data in support of management’s decisions.

Figure: subjects map to dimensions — e.g. Product, Package and Business
Unit are dimensions.


Dimensions and Dimension Tables

Example dimensions (perspectives): Time, Customer, Geography.

Store dimension table: Store Key (int), Store name (char), Street
Address (char), City (char), State (char), Region (char), Country (char)

Product dimension table: Product Key (int), Product id (char/int),
Product name (char), Product Group, Brand, Department

The term DIMENSION represents a single category or perspective by which
associated FACTS are interpreted and understood.
E.g. “Store” is a perspective by which sales are understood. It is the
answer to the question “Where did the sales occur?”
A DIMENSION TABLE is a table which holds a list of attributes or
qualities of the dimension most often used in queries and reports.
E.g. the “Store” dimension can have attributes such as the street and
block number, the city, the region and the country where it is located,
in addition to its name.
Every row in the DIMENSION TABLE represents a unique instance of that
DIMENSION and has a unique identifier called the DIMENSION KEY.


Dimension tables: some features
• Dimensions can be identified by answering the question: “HOW DO BUSINESS
PEOPLE DESCRIBE THE DATA THAT RESULTS FROM A BUSINESS PROCESS?”
• Dimension tables are ENTRY POINTS into the fact table. Typically -
 The number of rows selected and processed from the fact table depends on
the conditions (“WHERE” clauses) the user applies on the dimensional
attributes selected
• Dimension tables are typically DE-NORMALIZED in order to reduce the number of
joins in resulting queries
• Dimension table attributes are generally STATIC, DESCRIPTIVE fields describing
aspects of the dimension
• Dimension tables are typically designed to hold INFREQUENT CHANGES to
attribute values over time, using SCD concepts
• Dimension tables are TYPICALLY used in GROUP BY SQL queries
• Dimension Tables serve to simplify SQL GROUP BY queries
• Every column in the dimension table is TYPICALLY either the primary key or a
dimensional attribute
• Every non-key column in the dimension table is typically used in the GROUP BY
clause of a SQL Query



Some more jargons…..

• Hierarchy
• Level
• Member
• Attribute
• Grain Level

Figure: the Geography dimension as a hierarchy linked by parent
relations — World level: World; Continent level: America, Europe, Asia;
Country level: USA, Canada, Argentina; State level: FL, GA, VA, CA, WA;
City level: Miami, Tampa, Orlando, Naples. Each city is a dimension
member / business entity with attributes such as Population and
Tourist’s Place.


Some more jargons…..

• A Dimension can have one or more hierarchies
• A hierarchy can have one or more levels, or grains
• Each level can have one or more members
• Each member can have one or more attributes

Figure: another hierarchy on the Geography dimension — Global level:
Global; Economy level: Developed, Developing, Third world; Financial
Class level: Upper, Middle, Lower; Regional level: Metro, Suburb, Town,
Village; City level: Chennai, Delhi, Mumbai, Kolkatta. Each city member
again carries attributes such as Population and Tourist’s Place.
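The parent relation between levels can be sketched as a simple lookup structure; rolling a member up the hierarchy just follows parents until the target level is reached. The encoding below is a hypothetical sketch of the Geography example, not a production design:

```python
# Hypothetical encoding of the Geography hierarchy: each member records
# its parent member and its level.
parent = {
    "America": "World", "Europe": "World", "Asia": "World",
    "USA": "America", "Canada": "America", "Argentina": "America",
    "FL": "USA", "GA": "USA", "CA": "USA",
    "Miami": "FL", "Tampa": "FL", "Orlando": "FL", "Naples": "FL",
}
level = {
    "World": "World",
    "America": "Continent", "Europe": "Continent", "Asia": "Continent",
    "USA": "Country", "Canada": "Country", "Argentina": "Country",
    "FL": "State", "GA": "State", "CA": "State",
    "Miami": "City", "Tampa": "City", "Orlando": "City", "Naples": "City",
}

def roll_up(member, target_level):
    """Follow the parent relation until the member at target_level is reached."""
    while level[member] != target_level:
        member = parent[member]
    return member

print(roll_up("Miami", "Country"))  # USA
```

In a warehouse these parent links usually live as level columns on a denormalized dimension row (City, State, Country, …), which is what makes GROUP BY roll-ups cheap.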


The Star Schema: Linking Facts and Dimensions

Date/Time dimension: Date Key (int), Date (dd/mm/yy), Day Of Week
(char), Day of Month (int), Month (int), Quarter (int), Year (int)
Store dimension: Store Key (int), Store name (char), Street Address
(char), City (char), State (char), Region (char), Country (char)
Product dimension: Product Key (int), Product id (char/int), Product
name (char), Product Group, Brand, Department
Sales fact: Date Key (int), Store Key (int), Product Key (int), Customer
Key (int), Payment Type Key (int), Sales (float), Qty Sold (int), Price
(float), Discount (float)

The Star Join Schema or STAR SCHEMA is a single FACT TABLE joined to a
set of DIMENSION TABLES — here, the Sales fact surrounded by the Time,
Store, Payment Type, Customer and Product dimensions.
Simple, Symmetric, Extensible and Optimized!
The GRAIN of the Star Schema is the grain of its central Fact table!
Star Schema
• Particular form of a dimensional model
• Central fact table containing Measures
• Surrounded by one perimeter of descriptors - Dimensions



Star Schema
Fact Table: this table is the core of the star schema structure and
contains the Facts or Measures available through the Data Warehouse.

These Facts answer the questions of “What”, “How Much”, or “How Many”.

Some Examples:
Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income,
Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.


Star Schema
Dimension Tables: these tables describe the Facts or Measures. They
contain the Attributes and may also be Hierarchical.

These Dimensions answer the questions of “Who”, “What”, “When”, or
“Where”.

Some Examples:
• Day, Week, Month, Quarter, Year
• Sales Person, Sales Manager, VP of Sales
• Product, Product Category, Product Line
• Cost Center, Unit, Segment, Business, Company


Star Schema

Figure: Sales_Fact at the centre holds the required data (business
metrics or measures) plus the keys TimeKey, EmployeeKey, ProductKey,
CustomerKey and ShipperKey. It is surrounded by Employee_Dim
(EmployeeKey, EmployeeID, …), Time_Dim (TimeKey, TheDate, …),
Product_Dim (ProductKey, ProductID, …), Shipper_Dim (ShipperKey,
ShipperID, …) and Customer_Dim (CustomerKey, CustomerID, …).


Snowflake Schema
• Complex dimensions are re-normalized
• Different levels or hierarchies of a dimension are kept separate
• A given dimension level has a relationship to other levels of the same
dimension



Star and Snowflake Schemas are De-normalized
• They violate 3NF in dimensions by collapsing higher-level dimensions
into the lowest level, as in Brand and Category.
• They violate 2NF in facts by collapsing common fact data from Order
Header into the transaction, such as Order Date.
• They often violate Boyce-Codd Normal Form (BCNF) by recording
redundant relationships, such as the relationships both from Customer
and Customer Demographics to Booked Order.
• However, they support changing dimensions by preserving 1NF in
Customer and Customer Demographics.



A shot at Dimensional Modeling….
STEP 1
• Identify Subjects (Dimensions)
• Identify Hierarchies of a Dimension
• Identify Attributes of levels in Hierarchies
• Define Grain

Figure: the Customer dimension with hierarchies rolling up from
Customer — Industry (Industry Type → Industry Segment), Geography
(City → State → Country) and Financial Class.


A shot at Dimensional Modeling….
STEP 2
• Use KPIs to identify the Facts
• Group the Facts in a logical set

Financial Transactions: Trans. Amount, No. of Bonds, No. of
Transactions, Service Cost, ...
Non-Financial Transactions: No. of Cheques Cleared, No. of Visits to a
Branch, No. of DEMAT Transactions, ...


A shot at Dimensional Modeling….
STEP 3
• Link the Group of Facts to the Dimensions that participate in the
Facts

Figure: the Financial Transactions fact linked to the Customer, Product,
Time, Organization and Channel dimensions.


A shot at Dimensional Modeling….
STEP 4
• Define Granularity for each Group of Facts

Figure: the same star with grains declared — Customer (Customer),
Product (Scheme), Time (Day-Hour), Organization (Branch), Channel
(Channel).




Types of Dimensions
• Primary Dimension
 Contributes to the fact grain
 A set of these uniquely defines the associated fact
 E.g. SALES fact is typically completely defined by store, product and time

• Secondary Dimension
 Does NOT contribute to the fact grain
 Non-primary dimensions such as payment type, customer, manufacturer are
still important for analysis of SALES fact
 Useful for rich analytic slicing and dicing, e.g. Top 10 customers.

• Degenerate Dimension
 A dimension without any attributes; but useful for analysis
 Generally stored in the associated fact table alongside the facts
 E.g. invoice number, by itself, in a shipping fact



Types of Dimensions
• Conformed Dimension
 A dimension used across the enterprise
 Requires standardized structure and definition
 Must be designed up front, before individual schemas are designed
 Plugs into multiple stars as either primary or secondary dimension
 E.g. Customer, Product, Store, Time, Employee
 Customer could be captured at the store card swiping machine (sales
fact), be part of Marketing promotion strategy (campaign fact) and
also could be serviced by call center for warranty replacements
(warranty fact)
 Employee may be a Sales-rep claiming credit for sales (Sales fact) or
may be a Finance manager authorizing vendor payments (vendor
Payment fact) or a call center person taking customer calls (Service
Call fact)



Types of Dimensions
• Slowly Changing Dimension
 Dimensional attributes change over time
 Need to capture these changing realities as history
 Requires special design techniques to keep it single valued for
each fact row while still retaining history
 E.g. Customer City, Marriage status, salesrep department
 Type 1 (Overwrite previous values)
 Type 2 (Create additional time-stamped dimension record)
 Type 2 automatically partitions history
 Type 3 (Create additional attribute column to retain any one
previous value, e.g. first value or previous value)
 Requires dimension key to be generalized
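The Type 2 mechanics described above can be sketched as follows. This is a minimal in-memory sketch with hypothetical column names (cust_key as the generalized surrogate key, customer_id as the natural key), not a full ETL implementation:

```python
from datetime import date

# A hypothetical Type 2 customer dimension: each change inserts a new
# time-stamped row with a new surrogate key; the natural key
# (customer_id) stays the same across versions.
dim = [
    {"cust_key": 1, "customer_id": "C100", "city": "Chennai",
     "valid_from": date(2008, 1, 1), "valid_to": None},  # current row
]
next_key = 2

def apply_scd2(rows, customer_id, new_city, change_date, key):
    """Close the current row and insert a new version (Type 2 change)."""
    for row in rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["city"] == new_city:
                return key  # no change, nothing to do
            row["valid_to"] = change_date  # expire the old version
    rows.append({"cust_key": key, "customer_id": customer_id,
                 "city": new_city, "valid_from": change_date,
                 "valid_to": None})
    return key + 1

next_key = apply_scd2(dim, "C100", "Mumbai", date(2010, 6, 1), next_key)
# dim now holds two versions of C100: facts loaded after June 2010 carry
# cust_key 2, while older facts keep cust_key 1 - history is partitioned
# automatically, as the slide notes.
print(len(dim), dim[1]["city"])  # 2 Mumbai
```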



Types of Dimensions
• Rapidly Changing Small Dimension
 Same as SCD except frequency is higher
 Need to track changes to attributes E.g. Employee attributes such as
appraisal rating
 E.g. telecom product: rate plans keep changing

• Large Dimension
 Size increases with decreasing level of granularity
 Typical of public utility companies, government agencies
 Human records kept by supermarkets e.g. Shopper’s Stop
 Do NOT create SCDs to address slow changes/ history
 See “…Monster Dimension” for SCD strategy
 Choose indexing strategies to reduce query run times
 Choose RDBMS wisely e.g. SQLServer vs. Oracle vs. Teradata



Types of Dimensions
• Rapidly Changing Monster Dimension
 Similar to large dimension but typical of a large insurance customer
dimension
 Customers and Claims are rapidly created and changed
 Need to track history for credit and legal reasons
 Remove the continuously changing attributes to another dimension
table e.g. demographic
 Reduce the cardinality of these attributes by banding them e.g.
income_band, credit_band, etc.
 Then create all possible combinations of these attributes
 Then assign a dimension key to each unique set of these
combinations; this is the demographics table
 For each combination that represents the customer’s status in a
particular period, plug the demographic key into the fact as an
additional key
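The banding step above can be sketched as follows; the band boundaries and the attribute name are hypothetical and would come from the business, not from this deck:

```python
# A sketch of attribute banding for a demographics mini-dimension.
INCOME_BANDS = [(0, "0-25K"), (25_000, "25K-50K"), (50_000, "50K-100K"),
                (100_000, "100K+")]

def band(value, bands):
    """Return the label of the highest band whose lower bound <= value."""
    label = bands[0][1]
    for lower, name in bands:
        if value >= lower:
            label = name
    return label

# Continuously changing raw values collapse into a handful of band
# labels, shrinking the cardinality of the demographics dimension so
# that all combinations can be pre-enumerated.
print(band(37_500, INCOME_BANDS))   # 25K-50K
print(band(250_000, INCOME_BANDS))  # 100K+
```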



Types of Dimensions
• Junk Dimension
 A convenient grouping of random flags and attributes to get
them out of the fact table
 Retain only useful fields
 Remove fields that make no sense at all
 Remove fields that are inconsistently filled
 Remove fields that are of operational interest only
 Design similar to demographics; maximum unique
combinations, assign integer key, plug into fact
 Create new combination (insert new dimension record) at ETL
run-time
 E.g. Yes/No Flags in old retail transaction data
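The "maximum unique combinations, assign integer key" design above can be sketched like this. The flag names are illustrative; a real junk dimension would use whichever low-cardinality flags survive the pruning steps listed on the slide:

```python
from itertools import product

# Enumerate every combination of the retained low-cardinality flags and
# assign each combination an integer dimension key.
flags = {
    "payment_type": ["Cash", "Card"],
    "promo_applied": ["Y", "N"],
    "returned": ["Y", "N"],
}
junk_dim = {combo: key + 1
            for key, combo in enumerate(product(*flags.values()))}

def junk_key(payment_type, promo_applied, returned):
    """ETL-time lookup: map a transaction's flags to its junk dimension key."""
    return junk_dim[(payment_type, promo_applied, returned)]

print(len(junk_dim))  # 8 rows: 2 x 2 x 2 combinations
print(junk_key("Cash", "Y", "N"))  # integer key plugged into the fact row
```

As the slide notes, an alternative to pre-enumerating is inserting new combinations lazily at ETL run-time, which matters when the theoretical combination count is large but few combinations actually occur.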



Types of Dimensions
• Role-playing Dimension
 A dimension appears several times in the same fact table
 Typically, Date/Time dimension plays many roles
 E.g. Order Fulfillment is a typical retail fact table having the
following dimensions:
 Order Date, Packaging date, Shipping date, Delivery date
 Create one fact table key for each role
 Create one SQL view of the dimension for each role
 Use view names to run SQL queries
 In Business Objects, this scenario is designed using Aliases
and Contexts
 E.g. (2) – Employee dimension: Salesrep, Manager, Appraiser,
Appraisee in Sales Compensation fact and Employee Appraisal
fact respectively.
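The one-view-per-role technique can be sketched with two date roles on a fulfillment fact; the table, view and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One physical date dimension...
cur.execute("CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT)")
cur.executemany("INSERT INTO date_dim VALUES (?,?)",
                [(1, "2010-10-01"), (2, "2010-10-05")])

# ...playing two roles in the fulfillment fact via one view per role.
cur.execute("""CREATE VIEW order_date AS
               SELECT date_key AS order_date_key, full_date AS order_date
               FROM date_dim""")
cur.execute("""CREATE VIEW ship_date AS
               SELECT date_key AS ship_date_key, full_date AS ship_date
               FROM date_dim""")

# The fact table carries one key per role.
cur.execute("CREATE TABLE fulfillment_fact (order_date_key INTEGER, ship_date_key INTEGER, qty INTEGER)")
cur.execute("INSERT INTO fulfillment_fact VALUES (1, 2, 10)")

# Queries join each role through its own view, so the SQL stays unambiguous.
row = cur.execute("""
    SELECT o.order_date, s.ship_date, f.qty
    FROM fulfillment_fact f
    JOIN order_date o ON f.order_date_key = o.order_date_key
    JOIN ship_date  s ON f.ship_date_key  = s.ship_date_key""").fetchone()
print(row)  # ('2010-10-01', '2010-10-05', 10)
```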



Dimensional Hierarchy

Figure: the Geography dimension hierarchy linked by parent relations —
World level: World; Continent level: America, Europe, Asia; Country
level: USA, Canada, Argentina; State level: FL, GA, VA, CA, WA; City
level: Miami, Tampa, Orlando, Naples. Each member is a dimension member
/ business entity with attributes such as Population and Tourist’s
Place.




Types of Facts
Value Based Classification
• Numeric Facts
• Count / Occurrence Based (e.g. Employees assigned to a project)
• Non-numeric Facts (e.g. Comments in fact tables)
Summary Based Classification
• Additive (along all dimensions)
• Semi Additive (mostly along Time dimension)
• Non Additive (cannot be added along any dimension)
In the example discussed earlier in these slides, Sales and Number of
Total Minutes are value based, additive facts, as they can be added
across all dimensions.
Price and Quantity Sold, by contrast, are value based but semi-additive
facts, as they can be added only across some dimensions: Price cannot be
added across the product dimension, because a “total price” fact does
not make sense — an average price across products is more meaningful.
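The additivity distinction can be shown on a few illustrative fact rows (the values are made up for this sketch):

```python
# Fact rows at the grain "product by store by day"; values are illustrative.
facts = [
    {"product": "Soap",    "store": "S1", "day": 1, "sales": 100.0, "price": 10.0},
    {"product": "Soap",    "store": "S2", "day": 1, "sales": 80.0,  "price": 10.0},
    {"product": "Shampoo", "store": "S1", "day": 1, "sales": 60.0,  "price": 20.0},
]

# "sales" is fully additive: summing across every dimension is meaningful.
total_sales = sum(f["sales"] for f in facts)
print(total_sales)  # 240.0

# "price" cannot be added across products: a "total price" of 40.0 is
# meaningless, but an average price across the fact rows makes sense.
avg_price = sum(f["price"] for f in facts) / len(facts)
print(round(avg_price, 2))  # 13.33
```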



Types of Fact Tables
Fact tables are classified based upon:
• the type of grain they address or the level of detail they contain AND
• the way the measurements are taken with respect to time

Thus we have:
• Transaction Fact Table
• Snapshot Fact Table
• Summary Fact Table
Figure 1
The context of a transaction is modeled as a set of generally independent
dimensions. Figure 1 shows seven such dimensions.
The measured transaction amount is in a fact table that refers to all the
dimensions by foreign keys pointing outward to their respective dimension tables.
The clean removal of all the context detail from the transaction record is an
important normalization step and is why fact tables are “highly
normalized.”



Types of Fact Tables
Kimball mentions 3 fundamental types of fact table grain:
• Transaction: A transaction is a set of data fields that record a basic
business event, e.g. a point-of-sale in a supermarket, attendance in a
classroom, an insurance claim, etc. The measurements group nicely
together into a single fact table with the same grain.
• Periodic Snapshot: A snapshot is a measurement of status at a specific
point in time. E.g. in Figure 2, earned premium is the fraction of the
total policy premium that the insurance company can book as revenue
during the particular reporting period. The periodic-snapshot-grained
fact table represents a predefined time span.

Figure 2


Types of Fact Tables
• Accumulating Snapshot: The accumulating-snapshot-grained fact table
represents an indeterminate time span, covering the entire history
starting when the collision coverage was created for the car in our
example and ending with the present moment.

In dramatic contrast to the other fact-table types, we frequently
revisit accumulating-snapshot fact records to update the facts.
Remember that in this table there is generally only one fact record for
the collision coverage on a particular customer’s car.
As history unfolds, we must revisit the same record several times to
revise the accumulating status.




Visualizing a dimensional model
The most popular way of visualizing a dimensional model is to draw a
cube: a three-dimensional model can be represented directly as a cube.
Usually a dimensional model consists of more than three dimensions and
is then referred to as a hypercube. However, a hypercube is difficult to
visualize, so a cube is the more commonly used term.

In Figure 1, the measurement is the volume of production, which is
determined by the combination of three dimensions: location, product,
and time.

Figure 1
The location dimension and product dimension each have their own two
levels of hierarchy. For example, the location dimension has the region
level and the plant level. In each dimension there are members, such as
the east region and west region of the location dimension.

Although not shown in the figure, the time dimension has its members,
such as 1996 and 1997. Each sub-cube has its own number, which
represents the volume of production as a measurement.

For example, in a specific time period (not expressed in the figure),
the Armonk plant in the East region has produced 11,000 cell phones of
model number 1001.
Data Warehouse Bus Matrix
All dimensional models together form the logical design of the data
warehouse. To decide which dimensional models to build, we start with a
top-down planning approach called the Data Warehouse Bus Architecture
Matrix.

This matrix forces us to list all the possible data marts we could
possibly build and to name all the dimensions present in those data
marts (at a high level).

A dimensional model is made up of one or more star schemas. Some of
these star schemas may be snowflaked for better organization and
storage.

Figure: a sample bus matrix — subject areas (Accounts, Sales, Quotes,
General Ledger, Shipment, Parts/Finance) as rows, shared dimensions
(Calendar, Customer, Employee, Organization, Accounts, Equipment,
Vendor, Outage) as columns, with a check mark wherever a subject area
uses a dimension.
Dimensional Modeling Approach
[Figure: CDM → LDM → PDM (conceptual, logical and physical data models)]
Each star schema has a single fact table at its centre surrounded by multiple dimension tables. With the
bus matrix in place, we can start the design of each individual fact table/star schema using a 4-step process.

STEP 1: Identify Subject Area/ Business Process


Start the model by choosing a single business process or a business sub-process to model so that you
have only one fact table: e.g. SALES business process.

STEP 2: Define Fact Table Grain


Choose the GRAIN of the central fact table.
e.g. i) Each Sales transaction is a fact record: Grain is sales by product by store by transaction
ii) Each daily product sales total in each store is a fact record: Grain is sales by product by store
by day

STEP 3a: Identify Dimensions


Choose the DIMENSIONS as follows:
• Primary dimensions from fact grain e.g. product, store, day (or date)
• Additional dimensions based on user interviews, reports analysis
• Ensure each dimension is at its lowest level of detail possible while still being single valued.

October 20, 2010 81


Dimensional Modeling Approach
STEP 3b: Identify grain of dimension table
Ensure that each dimension table grain is NOT lower than the central fact table grain: e.g. the store
dimension should have one row for each store. Each store may have departments but the store
dimension’s row should represent only the store and not the department.
STEP 3c: Identify all dimensional Attributes
For each dimension choose only SINGLE VALUED attributes e.g. if Region is an attribute of the store
dimension then it should have one and only one value for each store.
STEP 3d: Identify Dimension Hierarchies and attributes of levels in Hierarchies

[Figure: Customer dimension hierarchies (Industry: Industry → Segment → Industry Type; Geography:
Country → State → City → Fin. Class) alongside a Sales fact table with surrogate keys Date Key (int),
Store Key (int), Product Key (int), Customer Key (int) and measures Sales (float), Qty Sold (int),
Price (float), Discount (float).]
STEP 4a: Choose Facts
Choose each fact for the fact table, making sure that the fact is relevant and also has the same
grain as the fact table; e.g. for a SALES fact table, typical facts would be price, quantity sold and
sales amount, as these are all dimensioned by product, by store, by day.
October 20, 2010 82
Dimensional Modeling Approach
STEP 4b: Connect Fact to dimension tables by means of surrogate keys
[Figure: A Financial Transactions fact table connected via surrogate keys to its dimension tables:
Customer (Customer), Product (Scheme), Time (Day-Hour), Organization (Branch), Channel (Channel).]

Important Notes:
1. Each dimension table has a meaningless single-part integer primary key called a surrogate key.
This key also occurs as part of the central fact table's composite primary key.
2. All components of the fact table's primary key should ideally be foreign keys to the corresponding
dimension tables.

October 20, 2010 83
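The two notes above can be made concrete with a small star schema held in an in-memory database. This is an illustrative sketch only; the table and column names (dim_product, fact_sales, etc.) are invented for the example and are not part of any methodology described here.

```python
import sqlite3

# A minimal star schema: the fact table's composite primary key is made up
# of surrogate keys, each a foreign key to a dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, store_name   TEXT);
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, calendar_day TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    qty_sold    INTEGER,
    sales_amt   REAL,
    PRIMARY KEY (date_key, store_key, product_key)
);
INSERT INTO dim_product VALUES (1, 'Cell Phone 1001');
INSERT INTO dim_store   VALUES (1, 'Armonk');
INSERT INTO dim_date    VALUES (20100101, '2010-01-01');
INSERT INTO fact_sales  VALUES (20100101, 1, 1, 11000, 550000.0);
""")

# A typical star join: facts aggregated by dimension attributes.
row = conn.execute("""
    SELECT s.store_name, p.product_name, SUM(f.qty_sold)
    FROM fact_sales f
    JOIN dim_store   s ON s.store_key   = f.store_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY s.store_name, p.product_name
""").fetchone()
print(row)  # ('Armonk', 'Cell Phone 1001', 11000)
```

Note how queries always reach facts through the surrogate keys, never through business keys, which is what lets dimension rows be versioned independently of the source systems.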


Implementation and Maintenance
DWH Design, Deployment and Maintenance

October 20, 2010 84


Implementation & Maintenance Agenda
A.Physical Design Steps
B.Physical Design Considerations
C.Physical Storage
D.Indexing
E.Performance Enhancement techniques
F.Deployment Activities
G.Security
H.Backup and recovery
I. Monitoring Data Warehouse
J. User Training and Support
K.Managing Data Warehouse
October 20, 2010 85
DWH Physical Design Process

October 20, 2010 86


Physical Design Process

•Develop Standards.
•Process Standards
•Naming Standards
Database Objects
Word Separators
Names in Logical and Physical Model
Physical File naming standards
Naming of files & tables in Staging area

October 20, 2010 87


Continue…

•Create Aggregates Plan


Identify granularity level

•Determine the Data Partitioning Scheme


Selecting fact and dimensions
Horizontal or Vertical
Number of partitions
Criteria for partitions

October 20, 2010 88


Continue…
•Establish Clustering Options
Placing and managing related units of data together

•Prepare Indexing Strategy


Identify the columns.
Identify the sequence.

•Assign Storage Structures


•Complete Physical Model
Review all the above activities
Create physical model

October 20, 2010 89


Physical Design Considerations

•Improve Performance
•Ensure Scalability
•Manage Storage
•Provide Ease of Administration
•Design Flexibility
•Assign Storage Structures

October 20, 2010 91


Physical Storage

• Types of Storage Structure


Files
Facts
Dimensions
Indexed Data Structures

October 20, 2010 93


Storage Considerations

•Set correct units of database space allocation


Data Block

•Set proper block usage parameters


Free and Used Space

•Manage data migration


Row Chaining
Row Migrating

•Manage block utilization


Should have less free space

October 20, 2010 94


Continue…

•Resolve dynamic extension


Inserting a new record.
Updating the existing record.

•Employ file striping techniques


Splitting files into multiple physical parts
Enables Concurrent I/O

October 20, 2010 95


Indexing the Data Warehouse

•What are Indexes?


•Why are Indexes required?
•When should I create an Index?
•What are different types of Indexes?

October 20, 2010 97


Types of Indexes
•B-tree: The default and most common type.
•B-tree cluster: Defined specifically for a table cluster.
•Hash cluster: Defined specifically for a hash cluster.
•Bitmap indexes: Compact; work best for columns with a
small set of distinct values.
•Bitmap join indexes: An index on one table that involves
columns of one or more other tables through a join.
•Function-based: Contain the pre-computed value of a
function/expression.

October 20, 2010 98


B-tree Index

October 20, 2010 99


B-tree Index - Advantages

•All leaf blocks of the tree are at the same depth, so


retrieval of any record from anywhere in the index takes
approximately the same amount of time.
•B-tree indexes automatically stay balanced.
•All blocks of the B-tree are three-quarters full on average.
•B-trees provide excellent retrieval performance for a wide
range of queries, including exact match and range
searches.
•Inserts, updates, and deletes are efficient, maintaining key
order for fast retrieval.
•B-tree performance is good for both small and large tables
and does not degrade as the size of a table grows.

October 20, 2010 100


Bitmapped Index

October 20, 2010 101


Bitmapped Index - Advantages

•Reduced response time for large classes of ad hoc queries
•A substantial reduction of space use compared to other indexing techniques
•Dramatic performance gains even on very low-end hardware
•Very efficient parallel DML and loads

October 20, 2010 102
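To make the bitmap approach concrete, here is a toy sketch of how a bitmap index answers predicates with bitwise operations. This is not how a real DBMS implements bitmap indexes; the column values are invented for the example.

```python
# Toy bitmap index: one integer bitmask per distinct column value.
# Row i sets bit i in the mask of its value; predicates become AND/OR.
def build_bitmap(values):
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

region = build_bitmap(["East", "West", "East", "North", "East"])
status = build_bitmap(["Y", "Y", "N", "Y", "Y"])

# WHERE region = 'East' AND status = 'Y' — a single bitwise AND.
hits = region["East"] & status["Y"]
rows = [i for i in range(5) if hits >> i & 1]
print(rows)  # [0, 4]
```

This is why bitmap indexes suit low-cardinality columns: one compact bitmask per distinct value, and multi-predicate ad hoc queries reduce to cheap bitwise operations.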


Clustered Index

October 20, 2010 103


Performance Enhancement Techniques
•Data Partitioning
Decomposing tables into smaller and more
manageable pieces called partitions
Range, list, hash & composite partitioning.

•Data Clustering
•Parallel Processing
•Summary levels
•Referential Integrity Checks
•Initialization Parameters
•Data Arrays
October 20, 2010 105
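The partitioning schemes named above (range, list, hash) can be sketched as routing functions that assign each row to a partition. The cut-off dates, value lists and partition counts below are invented for the illustration.

```python
import hashlib

def range_partition(sale_date, cutoffs=("2009-01-01", "2010-01-01")):
    # Range partitioning: route by ordered boundaries
    # (ISO date strings compare correctly as text).
    for i, cutoff in enumerate(cutoffs):
        if sale_date < cutoff:
            return i
    return len(cutoffs)

def list_partition(region, lists=({"East", "West"}, {"North", "South"})):
    # List partitioning: route by explicit value lists.
    for i, members in enumerate(lists):
        if region in members:
            return i
    return None  # unmatched values would go to a default/overflow partition

def hash_partition(key, n_partitions=4):
    # Hash partitioning: spread keys evenly via a stable hash
    # (Python's built-in hash() is not stable across runs).
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_partitions

print(range_partition("2009-06-15"))  # 1: between the two cut-offs
print(list_partition("West"))         # 0
```

Composite partitioning simply nests two of these schemes, e.g. range partitioning by date with hash sub-partitioning within each range.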
Major Deployment Activities

•Complete User Acceptance


•Perform Initial Loads
•Get User Desktops Ready
•Complete Initial User Training

October 20, 2010 107


Deployment Approaches
•Top-Down Approach
Deploy the overall enterprise DWH first, followed by the
dependent data marts, one by one.
•Bottom-Up Approach
Gather departmental requirements and deploy the
independent data marts, one by one.
•Practical/General Approach
Deploy the subject data marts one by one, with a
flexible approach built on fully conformed dimensions,
following a waterfall model.
Note – It is always advisable to deploy in stages
October 20, 2010 108
Security

•Prepare a Security Policy


Should cover scope of information, physical security,
network and connections, DB access privileges, &
access matrix.
•Manage user privileges
•Password considerations
•Security tools

October 20, 2010 110


Backup and Recovery

•Why Backup is required?


•What is Data warehouse Administration?
•What are the roles of Data warehouse Administrator
(DWA)?

October 20, 2010 112


DWA - Roles

•Building the data warehouse


•Ongoing monitoring and maintenance for the data
warehouse
•Coordinating usage of the data warehouse
•Management feedback as to successes and failures
•Competition for resources for making the data
warehouse a reality
•Selection of hardware and software platforms

October 20, 2010 113


Backup Strategy

•Should the data be actually discarded or should the data


be removed to lower-cost, bulk storage?
•What criteria should be applied to data to determine
whether it is a candidate for removal?
•Should the data be condensed (profile records, rolling
summarization, etc.)? If so, what condensation technique
should be used?
•How should the data be indexed once it is removed (if
there is ever to be any attempt to retrieve the data)?
•Where and how should metadata be stored once the data
is removed?

October 20, 2010 114


Continue…

•Should metadata be allowed to be stored in the data


warehouse for data that is not actively stored in the
warehouse?
•What release information should be stored for the base
technology (i.e., the DBMS) so that the data as stored will
not become stale and unreadable?
•How reliable (physically) is the media that the data will be
stored on?
•What seek time to first record is there for the data upon
retrieval?

October 20, 2010 115


Monitoring the Data Warehouse

•Collection of Statistics
•Using Statistics for growth planning.
•Using Statistics for Fine-Tuning.
•Publishing Trends for users.

October 20, 2010 117


Support

•Help Desk support.


•Hotline Support
•Technical Support.
•User representative.

Note - We should always follow a multi-tier support structure

October 20, 2010 119


User Training

•User Training Contents


Should provide enough Data Content.
Should talk about all the applications involved.
Should cover the features and usage of the tools used.
•Identifying the users to be trained.
•Delivering the training program.

October 20, 2010 120


Managing the Data Warehouse

•Platform Upgrades.
•Managing Data Growth.
•Storage Management.
•ETL Management.
•Data Model Revisions.
•Information Delivery Enhancements.
•Ongoing fine tuning.

October 20, 2010 122


Data Management

October 20, 2010 123


[Figure: Enterprise Data Management Framework – Data Governance sits at the centre of the Data
Architecture, surrounded by Data Quality Management, Master Data Management, Metadata Management,
and Data Storage, Movement and Access.]

Enterprise Data Management Framework

October 20, 2010 124
Enterprise Data Management Framework explained
• Data Governance
Data governance (DG) refers to the overall management of the availability, usability,
integrity, and security of the data employed in an enterprise. Data governance
encompasses the people, processes and procedures required to create a consistent,
enterprise view of a company's data.

• Data Storage, Movement and Access
Data movement involves translating/moving data from one format/storage device to
another. Data security is the means of ensuring that data is kept safe from corruption
and that access to it is suitably controlled.

• Metadata Management
Metadata is data about data. Metadata describes how and when and by whom a particular
set of data was collected, and how the data is formatted. Metadata management is
becoming very important because, as systems become more interdependent, it is vital
to know the impact of altering data.

• Data Quality Management
Data quality assurance (DQA) is the process of verifying the reliability and
effectiveness of data. Maintaining data quality requires going through the data
periodically and scrubbing it. Typically this involves updating it, standardizing it,
and de-duplicating records to create a single view of the data, even if it is stored
in multiple disparate systems.

• Data Architecture (at the centre of the framework)
October 20, 2010 125


Data Quality Analysis

October 20, 2010 126


Data Quality Analysis Agenda

A.Why Data Quality Management ?


B.Elements for Data Quality
C.Classification of Data Quality Issues
D.Dimensions of Data Quality
E.TCS’ DQM Approach
F. DQM Architecture Options
G.BIDSTM DQM Methodology
H.Tools and Technology

October 20, 2010 127


Why DQM?
Typical user questions that signal poor data quality:
• “Why is this NULL?”
• “Where can I get just one view for all the data?”
• “Is empid the same as emp_id?”
• “So many duplicate products on this list…”
• “I am still not able to see the latest data…”
• “Holland??? Is this customer in Europe or the USA?”
…and Returns on Investment are below expectations.

October 20, 2010 128




Elements for Data Quality

• Data Quality can be hampered by errors in following


elements:
 Definitions
 Domains
 Completeness
 Validity
 Data Flows
 Structural Integrity
 Business Rules
 Transformations

October 20, 2010 130


Definition

• This indicates how entities are referenced throughout the enterprise


• Definition problems can be further categorized as below:

Synonyms – The fields EMP_ID, EMPID, and EM01 may or may not all actually refer to the
same type of data

Homonyms – Fields that are spelled the same but really aren't the same (id or ID)

Relationships – Just because a field is named FK_INVOICE doesn't mean that it is really a
foreign key to the invoice file

October 20, 2010 131


Domains

• Domains describe the range and types of values that can be present in a data set
• Some examples of domain errors are:
 Unexpected values - e.g. Home State = one of {Kan, Mic, Min, …}
 Cardinality - A Yes/No field can have only two credible values
 Uniqueness – for a field, 98% of data is NULL
 Constant
 Outliers
 Length of field
 Precision
 Scale
 Internationalization – Date formats, postal codes, time zones, etc

October 20, 2010 132
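Several of the domain checks listed above (unexpected values, cardinality, mostly-NULL fields, length) can be automated with a small profiling pass. A minimal sketch, assuming the column arrives as a Python list; the sample data, thresholds and expected-value set are invented.

```python
def profile_column(values, expected=None):
    """Report basic domain statistics for one column."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    report = {
        "null_ratio": 1 - len(non_null) / len(values),   # flags mostly-NULL fields
        "cardinality": len(distinct),                    # e.g. a Y/N field should be 2
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "constant": len(distinct) == 1,                  # suspicious single-valued column
    }
    if expected is not None:                             # unexpected-values check
        report["unexpected"] = sorted(distinct - expected)
    return report

states = ["Kan", "Mic", "Min", "Kan", None, "XXX"]
print(profile_column(states, expected={"Kan", "Mic", "Min"}))
```

In practice a profiler runs checks like these over every column and flags any report entry that breaches an agreed threshold for analyst review.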


Completeness

• This indicates whether or not all of the data is actually


present

• Completeness of dataset can be gauged by its


 Integrity - Is actual data mapping to our definition of
data?
 Accuracy – Name and address matching, demographics
check
 Reliability – Zip code should match to city and state
 Redundancy – Data duplication
 Consistency – Is same invoice number referenced with
different amounts?

October 20, 2010 133


Validity

• Validity indicates whether or not the data is valid

• Validity checks used to spot data problems are


 Acceptability – A product part number should be a 7-character alphanumeric string: two
letters followed by 5 digits
 Anomalies
 Timeliness

October 20, 2010 134
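The acceptability rule above (a 7-character part number: two letters then five digits) can be expressed as a regular expression. The exact pattern is our reading of the rule, not an official format.

```python
import re

# Two alphabetic characters followed by exactly five digits, 7 characters total.
PART_NUMBER = re.compile(r"[A-Za-z]{2}\d{5}\Z")

def is_valid_part_number(value):
    return bool(PART_NUMBER.match(value))

print(is_valid_part_number("AB12345"))   # True
print(is_valid_part_number("A123456"))   # False: only one letter
print(is_valid_part_number("AB1234"))    # False: six characters
```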


Data Flows

• These checks are related to the aggregate results of


movement of data from source to target
• Many data quality problems can be traced back to incorrect
data loads, missed loads or system failures that go
unnoticed
• Data flow checks to ensure data quality are
 Record counts – Reconciliation of source and target
record counts
 Checksums
 Timestamps
 Process Time

October 20, 2010 135
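The record-count and checksum checks above can be sketched as a load-reconciliation step. The checksum used here, an order-independent sum of per-row hashes, is one possible choice rather than a standard; the row layouts are invented.

```python
import hashlib

def row_hash(row):
    # Stable per-row hash; summing hashes makes the total checksum
    # independent of the order rows arrive in.
    return int(hashlib.sha256("|".join(map(str, row)).encode()).hexdigest(), 16)

def reconcile(source_rows, target_rows):
    """Compare record counts and an order-independent checksum."""
    return {
        "count_match": len(source_rows) == len(target_rows),
        "checksum_match": sum(map(row_hash, source_rows))
                          == sum(map(row_hash, target_rows)),
    }

src = [(1, "A", 10.0), (2, "B", 20.0)]
tgt = [(2, "B", 20.0), (1, "A", 10.0)]        # same data, different order
print(reconcile(src, tgt))                    # both checks pass
print(reconcile(src, tgt + [(3, "C", 5.0)]))  # extra row is caught
```

Run after every load and logged with a timestamp, such a check catches the missed or duplicated loads that otherwise go unnoticed.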


Structural Integrity

• These checks ensure that when data is taken as a whole,


you are getting correct results

• Structural integrity checks include


 Cardinality Checks between tables
 Primary keys – Are these unique?
 Referential integrity – Product available on invoice but
missing from product catalog

October 20, 2010 136


Business Rules

• Business rule checks measure the degree of compliance


between actual data and expected data
• These checks consist of
 Constraints – Does the data comply to a known set of
validations?
 Computational rules – Is formula for deriving amount
correct?
 Comparisons
 Functional dependencies
 Conditions

October 20, 2010 137
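One way to implement the constraint and computational-rule checks above is to express each business rule as a named predicate over a record. A minimal sketch; the field names and the rules themselves are invented examples.

```python
# Each business rule is a (name, predicate) pair evaluated per record.
RULES = [
    # Constraint: quantities must be positive.
    ("qty_positive", lambda r: r["qty"] > 0),
    # Computational rule: amount must equal qty * price less discount.
    ("amount_formula",
     lambda r: abs(r["amount"] - (r["qty"] * r["price"] - r["discount"])) < 0.01),
    # Conditional rule: a discount requires a reason code.
    ("discount_reason", lambda r: r["discount"] == 0 or r.get("reason") is not None),
]

def violations(record):
    """Return the names of all rules the record fails."""
    return [name for name, rule in RULES if not rule(record)]

good = {"qty": 2, "price": 10.0, "discount": 1.0, "amount": 19.0, "reason": "PROMO"}
bad  = {"qty": 2, "price": 10.0, "discount": 1.0, "amount": 25.0}
print(violations(good))  # []
print(violations(bad))   # ['amount_formula', 'discount_reason']
```

Keeping rules as named, data-driven entries like this is what makes a Business Rules Repository possible: rules can be listed, versioned and reported on independently of the code that applies them.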


Transformations

• Transformation checks examine the impact of data


transformations as data moves from system to system
• Quality of data can be affected by incorrect transformation
logic
• Only way to identify these are to compare source data set
with target data set and verify transformations for
 Computations
 Merging
 Filtering
 Relationships

October 20, 2010 138




Data Quality Issues - Classification

[Figure: Classification of data quality issues – Physical Issues (addressed by data profiling,
for cleansing), Logical Issues (addressed by business rules), and Unmanaged Data Issues
(addressed by data parsing using rules, text mining, etc.).]

October 20, 2010 140




Dimensions of Data Quality
The FIVE dimensions of Data Quality:

• Sufficiency – for the purpose of Business Intelligence
• Consistency – of the definition of data across the Data Warehouse
• Accuracy – as defined by business rules
• Redundancy – none across the warehouse
• Latency – no major change of data between the instant of data capture and when it is processed

Data Quality is measured across these dimensions!


October 20, 2010 142


TCS’ DQM Approach

• Analysis of source data quality


 User driven
 Data driven
• Characterization of quality data
 This implies identification of necessary and
sufficient criteria that define quality data
 Domain validations, Business Rules validations
are touched upon in this
• Feasibility Analysis
 Mapping of data elements to rules
 Mapping of relationships to rules
 Assessment of grain of data
• Design and Implementation
 BIDS™ DQM Methodology

October 20, 2010 144




DQM Architecture

• DQM at Source
• DQM as part of ETL processes
• DQM in the target

October 20, 2010 146




DQM Methodology

• Modular approach to
building solutions
• Clear and well defined
guidelines, checklists
and standards
• Supports the Onsite-
Offshore delivery model
• Flexibility to adapt with
other methodologies
• E-T-V-X criteria reinforced by best
practices and TCS’
quality initiatives

October 20, 2010 148




Tools and Technology

Common software products for Name and Address cleansing


• Trillium
• First Logic
• TCS’ DataClean™

Common ETL Tools


• Informatica
• Ab Initio

October 20, 2010 150


Tools and Technology
TCS has expertise in industry-standard tools and products ranging from RDBMS, ETL, and
CGI & Web products, used in conjunction with in-house developed tools. The TCS
Knowledge Base also includes a number of specialised tools proficient in
data cleansing, validation and trending.
The common software products in use -

Trillium Software : Used for cleansing the name and address data. The software is able to identify
and match households, business contacts and other relationships to eliminate duplicates in large databases
using fuzzy matching techniques

Ab Initio : Provides a suite of software packages used for ETL in data warehouses. Its features include
parallel data transformation, validation and filtering, real-time data capture, integration
with relational DBMS systems, and data profiling capability

Unitech and Actuate : Comprise a set of reporting tools used for trend analysis, point-to-point
reconciliation and detection of data inconsistencies.

October 20, 2010 151


Case Study: British Telecom – Retail :
SWIFT

October 20, 2010 152


Client Profile and Business Drivers
Client Profile
• BT Retail is a significant player in the communications market in the
UK
• BT Retail has three main customer groups, namely, consumer,
business and major business or corporate. The products and services
cover the entire range from traditional telephony service, mobile
technology, internet access and web-based services.

Business drivers
• To unify existing marketing systems for providing centralized customer
repository to cater to
 Developing, targeting and presenting propositions
 Managing customer relationships
 Undertaking rapid tactical marketing
 Improving campaign effectiveness
 Reducing marketing operational costs

October 20, 2010 153


Business Objectives
• Reduction in marketing operational costs
• Reduction of marketing cycle times
• 360 degree view of customers
• Delivery of consistent messages across all customer channels
• Increased customer focus, better understanding and segmentation of
customers
• Event driven campaigns targeted at focused customer group
• Improvement in campaign effectiveness
• Maintaining a large data volume - One of the largest data warehouses
in Europe — 3.36 TB of data with a growth projection of 1% per week

October 20, 2010 154


Challenges
• Data quality issues in the vital data attributes in BT’s operational data
store
• Cleansing a backlog of erroneous information stored in database
• Decommissioning and migration of data from legacy systems
• Decommissioning of 30 TB of RDBMS
• Maintaining data integrity

October 20, 2010 155


Proposed Profiling and DQM Solution
• As part of the solution, the team deployed Business Rules Repository (BRR) to store
all business rules scattered across the enterprise. This enabled
 Sharing of information within business and IT stakeholders in an effective and
efficient manner
 Storage for basic information about each Business Rule with a history of changes
applied to it over time
• Types of Business Rules
 Format check – numeric, character or date with a specific pattern
 Cross Attribute Value Check within dataset (compare multiple attributes in a
dataset)
 List of Value – for small list of valid values
 Lookup – for large list of values like list of country codes, etc
 Uniqueness or duplication check
 Data integrity check
 Cross Attribute Value Check across datasets
• Data Profiler reports were created to report
 Structure and statistics for each data element
 Data value, range, distribution, pattern and format of each attribute
 Relationship of various attributes within and across datasets – join key, primary
key, potential foreign key, data dependencies, etc
• Web based application “Quest” delivered as a value-add for Data Quality Management

October 20, 2010 156


Integration of Data Profiler and BRR

[Figure: Integration of Data Profiler and BRR – business rules from source, target and common
systems are stored in a shared Business Rules Repository (BRR); an adaptor embeds these rules
into the Data Profiler; source-system and target-system test data are then profiled (covering
schema and transformations), analyzed, and compared.]

October 20, 2010 157


Software and Hardware

Software
• Database Layer – Oracle
• Application Layer – Ab Initio, Trillium, Unitech and Actuate

Hardware
• IBM Sequent NUMA-Q server 16 quad machine and 2GB
RAM

October 20, 2010 158


Application Architecture
[Figure: Application architecture – source systems (Legacy, Sales, ERP, CRM, Billing, 3rd-party
data) are profiled at full volume; the profiled DQ information supports data analysts and
business owners through requirement analysis, solution design, high/low-level design, test and
deployment phases; the Business Rules Repository drives the DQ Monitors; source-system and
target-system test data are profiled and compared to validate the transformation process; live
target data is profiled for ongoing data audit and Data Quality Monitors (DQM). Noted benefits:
reduces data assumptions; profiler output is reused across phases and to build the BRR; rules
defined in the BRR are embedded into the Data Profiler.]
October 20, 2010 159
Benefits to Client
• A huge backlog of data quality issues resolved leading to millions of
pounds worth of saving to BT
• A generic name and address data cleansing methodology was designed that can be reused
as a prototype for similar requirements, saving time
• Profiling of Live data on an ongoing basis to check compliance over
time
• BRR was used for developing Data Quality monitors (DQM)
• Development of uniform data dictionary for all disparate source
systems
• Reduced Risks and accurate planning
Client Speak
“...We have made fantastic progress in managing to roll out some
really big, complex deliveries... all thanks to your commitment,
and your ability to work as a team in order to resolve issues
quickly whilst under a lot of pressure. Well done everyone.”
- Simon Riley

October 20, 2010 160


Metadata Management

October 20, 2010 161


Metadata
•“Data about Data”
• For every data element
 definition
 characteristics
 relationships to other data elements

Metadata categories: Business Metadata, Technical Metadata
Metadata types: Control Metadata, Process Metadata
Metadata currency: Static (slowly changing) Metadata, Dynamic Metadata

October 20, 2010 162
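The definition/characteristics/relationships structure above can be sketched as a tiny metadata registry, including the impact-analysis lookup that makes metadata management valuable as systems become interdependent. The element names and fields are illustrative only.

```python
# Minimal metadata registry: for each data element record its definition,
# technical characteristics, and relationships to other elements.
registry = {}

def register(name, definition, characteristics, relationships=()):
    registry[name] = {
        "definition": definition,
        "characteristics": characteristics,
        "relationships": list(relationships),
    }

def impact_of(name):
    # Which elements would altering `name` affect? (simple reverse lookup
    # over recorded relationships)
    return [e for e, m in registry.items()
            if any(name in rel for rel in m["relationships"])]

register("EMP_ID",
         definition="Unique identifier of an employee",
         characteristics={"type": "INTEGER", "nullable": False},
         relationships=["SALARY.EMP_ID (foreign key)"])
register("SALARY.EMP_ID",
         definition="Employee reference on the salary record",
         characteristics={"type": "INTEGER", "nullable": False})

print(impact_of("SALARY.EMP_ID"))  # ['EMP_ID'] references it
```

The products listed on the next slide provide this same capability (plus lineage, versioning and sharing) at enterprise scale.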


Metadata Products
Candidate Products Vendor
SuperGlue Informatica
System Architect Popkin
MetaStage Ascential
Platinum Repository Platinum Technologies

Advantage Computer Associates


Rochade Allen Systems Group
Microsoft Repository Platinum/Microsoft Partnership

MetaCentre Repository Data Advantage Group

MDM I2

October 20, 2010 163


Data Governance

October 20, 2010 164


Data Governance

What is Data Governance?
Data governance (DG) refers to the overall
management of the availability, usability,
integrity, and security of the data employed
in an enterprise.

Why go in for it?
• Increase consistency & confidence in decision making
• Decrease the risk of regulatory fines
• Improve data security

October 20, 2010 165


Master Data Management

October 20, 2010 166


What is Master Data Management

• Master Data Management (MDM) is a discipline in Information Technology (IT),
spanning practices, processes and technologies, that focuses on the management
of reference or master data that is shared by several disparate IT systems and
groups.
• MDM enables consistent computing between diverse system architectures and
business functions.
• MDM integrates dimensional and master data across BI, data warehouse,
financial & operational systems, providing for accurate, consistent and
compliant enterprise reporting.
• MDM supplies metadata for aggregating and integrating transactional data.

October 20, 2010 167


Typical Requirements for MDM

• Role Definition Support: Support for definition of roles with access rights enforced
depending on the responsibilities assigned for that role
• ETL: ETL capabilities for extracting master data/reference data files or tables from multiple
sources and loading the data into the master data repository
• Data Cleansing: Data cleansing capabilities for de-duplication and matching of master
data records
• Collaborative platform: A collaborative platform for coordinating decisions on master
data reconciliation and rationalization. The platform should be supported by standards, if
available, or via industry knowledge of a master data domain. An example is a standard
product hierarchy for a particular industry
• Data synchronization and replication support: For applying changes established in a
central server to each consuming application. Incremental change support is important for
performance reasons
• Version control and Change monitoring: Version control at the central policy hub
combined with change monitoring across all of the participating systems. This is needed in
order to track changes to master data over time.

October 20, 2010 168
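The de-duplication and matching requirement above can be sketched with the standard library's difflib. Real MDM and cleansing tools use far richer matching (phonetic keys, address standardization, household linking), and the 0.85 threshold here is arbitrary.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1] over normalized names; difflib is a stand-in for the
    # fuzzy-matching engines that dedicated MDM/cleansing tools provide.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records, threshold=0.85):
    """Return index pairs of master-data records that look like duplicates."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

customers = ["Acme Corporation", "ACME Corporation ", "Globex Ltd", "Acme Corp."]
print(find_duplicates(customers))  # [(0, 1)]
```

Note the pairwise loop is O(n²); production matching engines first block records into candidate groups (e.g. by postcode or name key) before scoring pairs.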


Processes required for Master Data Management

• Master data is managed via a
policy hub, as shown in the
figure
• The policy hub for master
data management collects
master data from participating
analytical and transactional
systems
• Collaborative applications run
on the central policy hub to
coordinate decisions among
team members on master
data policies
• The standard master data is
published to each
participating system
(transactional and analytical)
so that they are synchronized
with the hub

Processes required for Master Data Management
• Steps in the Process for Managing and Maintaining Master Data

– Assign business responsibility for each master data domain, such as products,
customers, suppliers, and organizational structure
– Extract master data for a domain from separate operational and reporting systems to
a central server
– Apply data quality standards, such as de-duplication and matching of master data
records, to get a clean set of master data for the domain
– Reconcile and rationalize the master data records. This process entails setting
policies pertaining to an optimal product hierarchy, organizational structure, or
preferred supplier list
– Synchronize participating operational and reporting systems with the centrally
managed, canonical master data
– Monitor changes or updates to master data in each participating system, then repeat
the preceding steps for ongoing maintenance. Over time, with the centralization of
master data management responsibilities, the origination of master data changes
moves from the participating systems to the master data management hub or server
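The maintenance loop above can be sketched as a small hub-and-spoke program. `PolicyHub` and its method names are hypothetical; real MDM hubs add matching, workflow, and access control on top of this collect/publish cycle.

```python
# Minimal sketch of a master data policy hub (class and method names are illustrative)

class PolicyHub:
    def __init__(self):
        self.golden = {}       # canonical master records, keyed by business key
        self.version = 0       # bumped on every change, for change monitoring
        self.subscribers = []  # participating operational/reporting systems

    def collect(self, records):
        """Extract step: merge records from a source system into the hub."""
        for key, attrs in records.items():
            if self.golden.get(key) != attrs:
                self.golden[key] = attrs
                self.version += 1

    def subscribe(self, system):
        self.subscribers.append(system)

    def publish(self):
        """Synchronize step: push the canonical records to every subscriber."""
        for system in self.subscribers:
            system.update(dict(self.golden))

hub = PolicyHub()
crm, erp = {}, {}          # two participating systems, modeled as plain dicts
hub.subscribe(crm)
hub.subscribe(erp)
hub.collect({"CUST-001": {"name": "Acme Corp"}})
hub.publish()
print(crm == erp)          # True: both systems now share one canonical view
```

The version counter illustrates the change-monitoring requirement: unchanged records do not create a new version, so incremental synchronization stays cheap.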

Data Storage, Movement and Access

Data Security and Access

• Data security is the means of ensuring that data is kept safe from
corruption and that access to it is suitably controlled. Data security
thus helps to ensure privacy and protects personal data.

• It is the process of protecting data from unauthorized access, use,
disclosure, destruction, modification, or disruption.

• Protecting confidential information is a business requirement and, in
many cases, also a legal requirement; many would argue it is simply
the right thing to do.
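The "suitably controlled" access above is often enforced with role-based access control. A minimal sketch, assuming a hypothetical role-to-permission table with default deny:

```python
# Role-based access check (role names and permission sets are illustrative)

ROLE_PERMISSIONS = {
    "data_steward": {"read", "update", "approve"},
    "analyst": {"read"},
}

def can_access(role: str, action: str) -> bool:
    """Grant access only if the role explicitly holds the permission."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(can_access("analyst", "read"))     # True
print(can_access("analyst", "update"))   # False: default deny
```

Defaulting to an empty permission set for unknown roles means undefined roles are denied everything, which matches the "prevent unauthorized access" goal rather than failing open.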

Question & Answer Session...
