02 Data Management
• Customer Centricity
Single view of each customer and his/her activities
Integrated information from heterogeneous sources
• Adaptability to rapidly changing business needs
Multiple ways to view business performance
Low cycle time, faster analytics
• Increased global competition
Crunch more and more data, faster and faster
• Mergers and Acquisitions
Each acquisition brings another set of disparate IT systems, affecting consistency and performance
• Performance Optimization
OLTP systems get overloaded with large analytical queries
Data models for OLTP and OLAP are very different
• Reduced reliance on IT to produce reports
Report building on OLTP systems is very technical
• OLTP systems are not built to hold historical data
• Data Security
To prevent unauthorized access to sensitive data
[Figure: an operational entry system records Sales, Customers, and Products data (quantity sold, part number, date, customer name, product description, unit price, mailing address). The OLTP side performs insert/change/delete/access operations; the warehouse side performs only insert/load/access.]
[Figure: sample bar chart of sales (in lakhs) for the East, West, and North regions across January-March of year 1997.]
[Figure: end-to-end data warehouse architecture. Feeding systems (FS1...FSn, including legacy systems) pass through a staging area where extraction, cleansing, transformation, aggregation, summarization, and population occur. Data flows into an ODS and the data warehouse, which populates data marts (DM1...DMn) via transmission over the network. A metadata layer spans the architecture, and the warehouse serves OLAP analysis and knowledge discovery.]
• Data extraction
• Service queries
• Data Marts
• Multi-tiered warehouse
[Figures: common warehouse architecture alternatives. Each is fed by operational systems data (legacy, OLTP applications, client/server, external data) and serves users through a reporting tool:
• Reporting directly against the operational systems, with no warehouse layer
• A centralized data warehouse: data is selected, extracted, transformed, integrated, and maintained in a single warehouse with a metadata repository
• A single independent data mart, built through the same select/extract/transform/integrate/maintain steps
• Multiple independent data marts, each with its own data preparation
• A hub-and-spoke architecture: a central data warehouse with a metadata repository feeding dependent data marts
• An enterprise-wide data warehouse populating dependent data marts]
Metadata
October 20, 2010 24
Data Warehouse - Example
[Figure: sales detail for 1991-94 is summarized into data marts such as weekly sales by region for 1991-94 and weekly sales by product/sub-product for 1991-94, all described by metadata.]
• Conduct Proof-of-Concept
Building a Data Warehouse - Steps
Definition
Logical & Graphical representation of the information needs
Process
• Classifying: Entities
• Characterizing: Attributes
• Inter-relating: Relationships
Definition
Representation of a business problem without regard to implementation, technology, and organizational structure
Features
• Represents business requirements completely, correctly, and consistently
• Removes redundancy
• Does not presuppose data granularity
• Is not itself implemented
Definition
Specification of what is implemented
Features
• Optimized
• Efficient
• Buildable
• Robust
[Figure: source systems (CRM, ERP, RDBMS, PC databases) feed a staging area over the network; extraction loads the ODS and the DW, and a dimensional modeling area serves data to client browsers.]
Dimensional Modeling
Facts (Measures): e.g. Sales Volume, Revenue
The term FACT represents a single business measure, e.g. Sales, Qty Sold.
Each fact has a GRAIN, which is the set of perspectives or attributes that define/qualify the fact completely. E.g. the grain of "Sales" could be "for each PRODUCT, at each STORE, on each/every DAY".
A FACT TABLE is the primary table in a dimensional model where the business measures or FACTS are stored. A business measure or FACT is a row in a FACT TABLE. All FACTS in a FACT TABLE must be at the SAME GRAIN.
• Every foreign key in a fact table is usually a dimension table primary key
• Every column in a fact table is either a foreign key to a dimension table primary key or a fact
• Every non-key column in the fact table is typically used in the SELECT clause of a SQL query
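The rules above can be sketched as a tiny star schema in SQLite (table and column names here are illustrative, not from the slides):

```python
import sqlite3

# Minimal sketch of a dimensional model: two dimension tables and one
# fact table whose every column is a foreign key or a measure.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: a surrogate primary key plus descriptive attributes.
cur.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT)")
cur.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, region TEXT)")

# Fact table: foreign keys to dimensions, then the facts themselves.
cur.execute("""CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    qty_sold    INTEGER,
    sales_amt   REAL)""")

cur.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
cur.execute("INSERT INTO store_dim VALUES (1, 'East')")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 5, 50.0)")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 3, 30.0)")

# Typical analytical query: facts aggregated, dimension attributes in GROUP BY.
row = cur.execute("""SELECT s.region, SUM(f.sales_amt)
                     FROM sales_fact f
                     JOIN store_dim s ON f.store_key = s.store_key
                     GROUP BY s.region""").fetchone()
print(row)  # ('East', 80.0)
```

Note how the non-key columns (qty_sold, sales_amt) are exactly the ones that appear in the SELECT clause, as the last rule states.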
A Data Warehouse is a
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
collection of data in support of management's decision-making.
[Figure: examples of subjects and dimensions, e.g. Product, Package, Business Unit.]
Hierarchy - Geography
• Hierarchy
• Level
• Member
• Attribute
• Grain Level
[Figure: a geography dimension hierarchy. The World level (member: World) is linked by a parent relation to the Continent level (members: America, Europe, Asia), which is linked by a parent relation to the State level (members: FL, GA, VA, CA, WA). An Economy level (Developed, Developing, Third World) provides an alternative parent relation.]
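A minimal sketch of how such a hierarchy supports rollups, using the parent relation as a plain dictionary (the sales figures are made up):

```python
# Each member maps to its parent member at the next level up,
# following the geography hierarchy in the figure above.
parent = {
    "FL": "America", "GA": "America", "VA": "America",
    "CA": "America", "WA": "America",
    "America": "World", "Europe": "World", "Asia": "World",
}

# Illustrative state-level (leaf-grain) sales.
state_sales = {"FL": 10, "GA": 5, "CA": 7}

def rollup(member, sales):
    """Sum sales for all leaf members that roll up to `member`."""
    total = 0
    for leaf, amount in sales.items():
        node = leaf
        while node is not None:
            if node == member:
                total += amount
                break
            node = parent.get(node)
    return total

print(rollup("America", state_sales))  # 22
print(rollup("World", state_sales))    # 22
```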
Some examples of facts: Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.
Some Examples:
[Figure: a star schema. The Sales_Fact table holds TimeKey, EmployeeKey, ProductKey, CustomerKey, and ShipperKey as foreign keys, plus the required data (business metrics or measures). The surrounding dimension tables are Time_Dim (TimeKey, TheDate, ...), Product_Dim (ProductKey, ProductID, ...), Customer_Dim (CustomerKey, CustomerID, ...), and Shipper_Dim (ShipperKey, ShipperID, ...).]
[Figure: candidate dimensions and their hierarchies for the sales model: Customer (Industry, Segment, Customer) and Geography (Country, State), plus Product and Channel.]
STEP 4
• Define granularity for each group of facts
[Figure: the grain is set by one member from each primary dimension: Customer (customer), Product (scheme), Channel (channel).]
• Secondary Dimension
Does NOT contribute to the fact grain
Non-primary dimensions such as payment type, customer, or manufacturer are still important for analysis of the SALES fact
Useful for rich analytic slicing and dicing, e.g. top 10 customers
• Degenerate Dimension
A dimension without any attributes, but still useful for analysis
Generally included in the associated fact table, before the facts
E.g. an invoice number, by itself, in a shipping fact
• Large Dimension
Size increases with decreasing level of granularity
Typical of public utility companies, government agencies, and customer records kept by supermarkets, e.g. Shopper's Stop
Do NOT create SCDs to address slow changes/history; see "...Monster Dimension" for the SCD strategy
Choose indexing strategies to reduce query run times
Choose the RDBMS wisely, e.g. SQL Server vs. Oracle vs. Teradata
[Figure: within the geography dimension, the Continent level (America, Europe, Asia) has a parent relation to the State level (FL, GA, VA, CA, WA). Each member, such as a state, is a business entity with attributes like Population and Tourist's Place.]
Thus we have:
• Transaction Fact Table
• Snapshot Fact Table
• Summary Fact Table
Figure 1
The context of a transaction is modeled as a set of generally independent
dimensions. Figure 1 shows seven such dimensions.
The measured transaction amount is in a fact table that refers to all the
dimensions by foreign keys pointing outward to their respective dimension tables.
The clean removal of all the context detail from the transaction record is an
important normalization step and is why fact tables are “highly
normalized.”
• The measurements group nicely together into a single fact table with the same
grain.
• Periodic Snapshot: A snapshot is a measurement of status at a specific point in
time. E.g. In Figure 2, earned premium is the fraction of the total policy premium
that the insurance company can book as revenue during the particular reporting
period. The periodic-snapshot-grained fact table represents a predefined time span.
The accumulating-snapshot-grained fact table represents an indeterminate time span, covering the entire history starting when the collision coverage was created for the car in our example and ending with the present moment.
Figure 2
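The distinction between the two snapshot grains can be sketched by deriving both from transaction-grained rows; the policy and amounts below are illustrative, not taken from the figure:

```python
from collections import defaultdict
from datetime import date

# Illustrative transaction-grained facts: (date, policy, premium amount).
transactions = [
    (date(1996, 1, 5), "policy-1", 100.0),
    (date(1996, 1, 20), "policy-1", 50.0),
    (date(1996, 2, 3), "policy-1", 75.0),
]

# Periodic snapshot: one row per policy per predefined period (here, a month).
snapshot = defaultdict(float)
for d, policy, amount in transactions:
    snapshot[(d.year, d.month, policy)] += amount

# Accumulating snapshot: one row per policy, covering its entire history
# from inception to the present moment.
accumulating = defaultdict(float)
for _, policy, amount in transactions:
    accumulating[policy] += amount

print(snapshot[(1996, 1, "policy-1")])  # 150.0
print(accumulating["policy-1"])         # 225.0
```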
Although not shown in the figure, the time dimension has its numbers, such as 1996 and 1997. Each
sub-cube has its own numbers, which represent the volume of production as a measurement.
For example, in a specific time period (not expressed in the figure), the Armonk plant in East region
has produced 11,000 Cell Phones, of model number 1001.
Data Warehouse Bus Matrix
All dimensional models together form the logical design of the data warehouse. To decide which dimensional models to build, we start with a top-down planning approach called the Data Warehouse Bus Architecture Matrix.
This matrix forces us to list all the data marts we could possibly build and to name all the dimensions that are present in those data marts (at a high level).
A dimensional model is made up of one or more star schemas. Some of these star schemas may be snowflaked for better organization and storage.
[Table: the bus matrix. Columns are the candidate dimensions (Organization, Equipment, Employee, Customer, Accounts, Calendar, Outage, Vendor); rows are the subject areas (Accounts, Sales, Quotes, General Ledger, Shipment, Parts/Finance). Checkmarks indicate which dimensions participate in each subject area; e.g. General Ledger and Parts/Finance each use six of the eight dimensions.]
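One way to work with a bus matrix programmatically is as a mapping from subject area to its dimensions; the intersection yields the conformed dimensions that must be designed once and shared across data marts. The assignments below are illustrative, since the slide's exact checkmark placement is not recoverable:

```python
# Illustrative bus matrix: subject area -> set of participating dimensions.
bus_matrix = {
    "Sales":    {"Calendar", "Customer", "Employee", "Organization", "Accounts"},
    "Quotes":   {"Calendar", "Customer", "Employee"},
    "Shipment": {"Calendar", "Customer", "Vendor", "Equipment", "Organization"},
}

# Conformed dimensions: shared by every subject area. They must be built
# once, with agreed keys and attributes, so the data marts integrate later.
conformed = set.intersection(*bus_matrix.values())
print(sorted(conformed))  # ['Calendar', 'Customer']
```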
Dimensional Modeling Approach
CDM, LDM, PDM (conceptual, logical, and physical data models)
Each star schema has a single fact table at its centre surrounded by multiple dimension tables. Once we
do this, we can then start the design of each individual fact table/star schema using a 4-step process.
[Figure: the Sales fact table, with foreign keys Date Key (int), Store Key (int), Product Key (int), and Customer Key (int), and facts Sales (float), Qty Sold (int), Price (float), and Discount (float). Alongside it, the customer dimension hierarchies (Industry, Geography): Industry Type, Segment, Customer; and Country, State, City; Fin. Class is a customer attribute.]
STEP 4a: Choose Facts
Choose each fact for the fact table, making sure that the fact is relevant and also has the same grain as the fact table. E.g. for a SALES fact table, typical facts would be price, quantity sold, and sales amount, as these are all dimensioned by product, by store, by day.
Dimensional Modeling Approach
STEP 4b: Connect Fact to dimension tables by means of surrogate keys
[Figure: the fact connected to the Customer (customer), Product (scheme), and Channel (channel) dimensions by surrogate keys.]
Important Notes:
1. Each dimension table will have a meaningless, single-part integer primary key called a surrogate key. This key also occurs as part of the central fact table's primary key.
2. All fact table primary key components should ideally be foreign keys to the corresponding dimension tables.
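A minimal sketch of surrogate key assignment during dimension loading, consistent with note 1 above (the natural keys are made up):

```python
# Assign a meaningless, single-part integer surrogate key to each incoming
# natural (source-system) key, reusing the same key for repeat arrivals.
class DimensionLoader:
    def __init__(self):
        self.key_map = {}   # natural key -> surrogate key
        self.next_key = 1

    def surrogate_key(self, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self.key_map:
            self.key_map[natural_key] = self.next_key
            self.next_key += 1
        return self.key_map[natural_key]

loader = DimensionLoader()
print(loader.surrogate_key("CUST-0042"))  # 1
print(loader.surrogate_key("CUST-0099"))  # 2
print(loader.surrogate_key("CUST-0042"))  # 1 (same natural key, same surrogate)
```

Fact rows are then loaded with these integers, never with the source-system keys, which keeps the fact table keys meaningless and stable.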
• Develop Standards
• Process Standards
• Naming Standards
Database objects
Word separators
Names in logical and physical models
Physical file naming standards
Naming of files & tables in the staging area
• Improve Performance
• Ensure Scalability
• Manage Storage
• Provide Ease of Administration
• Design Flexibility
• Assign Storage Structures
• Data Clustering
• Parallel Processing
• Summary Levels
• Referential Integrity Checks
• Initialization Parameters
• Data Arrays
Implementation & Maintenance Agenda
A. Physical Design Steps
B. Physical Design Considerations
C. Physical Storage
D. Indexing
E. Performance Enhancement Techniques
F. Deployment Activities
G. Security
H. Backup and Recovery
I. Monitoring the Data Warehouse
J. User Training and Support
K. Managing the Data Warehouse
Major Deployment Activities
• Collection of statistics
• Using statistics for growth planning
• Using statistics for fine-tuning
• Publishing trends for users
• Platform upgrades
• Managing data growth
• Storage management
• ETL management
• Data model revisions
• Information delivery enhancements
• Ongoing fine-tuning
Master Data Management
Metadata Management
Enterprise Data Management Framework
[Figure: typical symptoms of unmanaged master data: "So many duplicate products on this list...", "I am still not able to see the latest data...", "Holland??? Is this customer in Europe or the USA?", and returns on investment below expectations.]
Homonyms: fields that are spelled the same but really aren't the same (id or ID)
• Domains describe the range and types of values that can be present in a data set
• Some examples of domain errors are:
Unexpected values, e.g. Home State = one of {Kan, Mic, Min, ...}
Cardinality: a Yes/No field can have only two credible values
Uniqueness: for a field, 98% of the data is NULL
Constant
Outliers
Length of field
Precision
Scale
Internationalization: date formats, postal codes, time zones, etc.
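A few of the domain checks above, sketched against a made-up column of values:

```python
# Illustrative column: a Yes/No field that should contain only "Y" or "N".
values = ["Y", "N", "Y", "MAYBE", None, "Y"]

expected_domain = {"Y", "N"}
non_null = [v for v in values if v is not None]

# Unexpected values: anything outside the declared domain.
unexpected = [v for v in non_null if v not in expected_domain]

# Cardinality: a Yes/No field should have at most two distinct values.
cardinality = len(set(non_null))

# NULL rate: flag fields that are mostly empty (the slide's 98% example).
null_rate = (len(values) - len(non_null)) / len(values)

print(unexpected)           # ['MAYBE']
print(cardinality)          # 3, one more than a Yes/No field allows
print(round(null_rate, 2))  # 0.17
```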
[Figure: unmanaged data issues fall into physical issues and logical issues; data profiling (for cleansing) and business rules drive data parsing and data quality processing, using rules, text mining, etc.]
• DQM at Source
• DQM as part of ETL processes
• DQM in the target
• Modular approach to building solutions
• Clear and well-defined guidelines, checklists, and standards
• Supports the onsite-offshore delivery model
• Flexibility to adapt to other methodologies
• E-T-V-X criteria reinforced by best practices and TCS' quality initiatives
Trillium Software: used for cleansing name and address data. The software can identify and match households, business contacts, and other relationships to eliminate duplicates in large databases using fuzzy matching techniques.
Ab Initio: provides a suite of software packages used for ETL in data warehouses. Its features include parallel data transformation, validation and filtering, real-time data capture, integration with relational DBMSs, and data profiling capability.
Unitech and Actuate: a set of reporting tools used for trend analysis, point-to-point reconciliation, and detecting data inconsistency.
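The fuzzy matching idea attributed to Trillium above can be illustrated with plain stdlib string similarity. This is not Trillium's actual algorithm; the names and threshold here are made up:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Shoppers Stop Ltd", "Shopper's Stop Limited", "Acme Corp"]

# Pair up records whose similarity clears a (tunable) threshold;
# near-identical spellings are flagged as candidate duplicates.
duplicates = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) > 0.8
]
print(duplicates)  # only the two Shopper's Stop variants pair up
```

Real cleansing tools add parsing (splitting name, street, city), standardization, and householding on top of raw string similarity.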
Business drivers
• To unify existing marketing systems, providing a centralized customer repository that caters to:
Developing, targeting and presenting propositions
Managing customer relationships
Undertaking rapid tactical marketing
Improving campaign effectiveness
Reducing marketing operational costs
[Figure: business rules from the source and target systems are consolidated into common system rules and a business rules repository (BRR) for the source and target systems. An adaptor embeds the business rules into a data profiler, which profiles and analyzes test data from both the source and target systems across the schema transformation; the results are then compared.]
Software
• Database Layer – Oracle
• Application Layer – Ab Initio, Trillium, Unitech and Actuate
Hardware
• IBM Sequent NUMA-Q server, 16-quad machine with 2 GB RAM
MDM
• Role Definition Support: Support for definition of roles with access rights enforced
depending on the responsibilities assigned for that role
• ETL: ETL capabilities for extracting master data/reference data files or tables from multiple
sources and loading the data into the master data repository
• Data Cleansing: Data cleansing capabilities for de-duplication and matching of master
data records
• Collaborative platform: A collaborative platform for coordinating decisions on master
data reconciliation and rationalization. The platform should be supported by standards, if
available, or via industry knowledge of a master data domain. An example is a standard
product hierarchy for a particular industry
• Data synchronization and replication support: For applying changes established in a
central server to each consuming application. Incremental change support is important for
performance reasons
• Version control and Change monitoring: Version control at the central policy hub
combined with change monitoring across all of the participating systems. This is needed in
order to track changes to master data over time.
• Assign business responsibility for each master data domain, such as products, customers, suppliers, and organizational structure
• Extract master data for a domain from separate operational and reporting systems to a central server
• Apply data quality standards, such as de-duplication and matching of master data records, to get a clean set of master data for the domain
• Reconcile and rationalize the master data records. This process entails setting policies pertaining to an optimal product hierarchy, organizational structure, or preferred supplier list
• Synchronize participating operational and reporting systems with the centrally managed, canonical master data
• Monitor changes or updates to master data in each participating system, then repeat the preceding steps for ongoing maintenance. Over time, with the centralization of master data management responsibilities, the origination of master data changes moves from the participating systems to the master data management hub or server
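The change-monitoring step above can be sketched with record hashing, so that only changed rows flow from the hub to consuming systems (the incremental support the capabilities list calls out for performance); the record layout is made up:

```python
import hashlib

def record_hash(record):
    """Stable hash of a record's field values, used to detect changes."""
    joined = "|".join(str(v) for v in record.values())
    return hashlib.sha256(joined.encode()).hexdigest()

# Illustrative master data at the hub, keyed by product id.
hub = {
    "P1": {"name": "Widget", "hierarchy": "Tools"},
    "P2": {"name": "Gadget", "hierarchy": "Electronics"},
}

# Snapshot of hashes at the last synchronization.
last_synced = {key: record_hash(rec) for key, rec in hub.items()}

# A change lands at the hub; only the changed record needs replication.
hub["P1"]["hierarchy"] = "Hardware"
changed = [k for k, rec in hub.items() if record_hash(rec) != last_synced[k]]
print(changed)  # ['P1']
```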
• Data security is the means of ensuring that data is kept safe from corruption and that access to it is suitably controlled. Data security thus helps ensure privacy and protects personal data.