02 Data Management
• Customer Centricity
Single view of each customer and his/her activities
Integrated information from heterogeneous sources
• Adaptability to rapidly changing business needs
Multiple ways to view business performance
Low cycle time, faster analytics
• Increased global competition
Crunch more and more data, faster and faster
• Mergers and Acquisitions
Each acquisition brings another set of disparate IT systems, affecting consistency and performance
• Performance Optimization
OLTP systems get overloaded with large analytical queries
Data models for OLTP and OLAP are very different
• Reduced reliance on IT to produce reports
Report building on OLTP systems is very technical
• OLTP systems are not built to hold historical data
• Data Security
To prevent unauthorized access to sensitive data
[Figure: an operational entry system records Sales, Customers, and Products data (quantity sold, part number, date, customer name, product description, unit price, mailing address). The OLTP side performs insert/change/delete/access operations; the warehouse side performs only insert/load/access.]
[Figure: sample bar chart of sales (in lakhs) for the East, West, and North regions across January-March of year 1997.]
[Figure: end-to-end data warehouse architecture. Feeding systems (FS1...FSn, including legacy systems) pass through a staging area where extraction, cleansing, transformation, aggregation, summarization, and population occur. Data flows into an ODS and the data warehouse, which populates data marts (DM1...DMn) via transmission over the network. A metadata layer spans the architecture, and the warehouse serves OLAP analysis and knowledge discovery.]
• Data extraction
• Service queries
• Data Marts
• Multi-tiered warehouse
[Figures: common warehouse architecture alternatives. Each is fed by operational systems data (legacy, OLTP applications, client/server, external data) and serves users through a reporting tool:
• Reporting directly against the operational systems, with no warehouse layer
• A centralized data warehouse: data is selected, extracted, transformed, integrated, and maintained in a single warehouse with a metadata repository
• A single independent data mart, built through the same select/extract/transform/integrate/maintain steps
• Multiple independent data marts, each with its own data preparation
• A hub-and-spoke architecture: a central data warehouse with a metadata repository feeding dependent data marts
• An enterprise-wide data warehouse populating dependent data marts]
Metadata
October 20, 2010 24
Data Warehouse - Example
[Figure: sales detail for 1991-94 is summarized into data marts such as weekly sales by region for 1991-94 and weekly sales by product/sub-product for 1991-94, all described by metadata.]
• Conduct Proof-of-Concept
Building a Data Warehouse - Steps
Definition
Logical & Graphical representation of the information needs
Process
• Classifying: Entities
• Characterizing: Attributes
• Inter-relating: Relationships
Definition
Representation of a business problem without regard to implementation, technology, and organizational structure
Features
• Represents business requirements completely, correctly, and consistently
• Removes redundancy
• Does not presuppose data granularity
• Is not itself implemented
Definition
Specification of what is implemented
Features
• Optimized
• Efficient
• Buildable
• Robust
[Figure: source systems (CRM, ERP, RDBMS, PC databases) feed a staging area over the network; extraction loads the ODS and the DW, and a dimensional modeling area serves data to client browsers.]
Dimensional Modeling
Facts (Measures): e.g. Sales Volume, Revenue
The term FACT represents a single business measure, e.g. Sales, Qty Sold.
Each fact has a GRAIN, which is the set of perspectives or attributes that define/qualify the fact completely. E.g. the grain of "Sales" could be "for each PRODUCT, at each STORE, on each/every DAY".
A FACT TABLE is the primary table in a dimensional model where the business measures or FACTS are stored. A business measure or FACT is a row in a FACT TABLE. All FACTS in a FACT TABLE must be at the SAME GRAIN.
• Every foreign key in a fact table is usually a dimension table primary key
• Every column in a fact table is either a foreign key to a dimension table primary key or a fact
• Every non-key column in the fact table is typically used in the SELECT clause of a SQL query
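The rules above can be sketched as a tiny star schema in SQLite (table and column names here are illustrative, not from the slides):

```python
import sqlite3

# Minimal sketch of a dimensional model: two dimension tables and one
# fact table whose every column is a foreign key or a measure.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: a surrogate primary key plus descriptive attributes.
cur.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT)")
cur.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, region TEXT)")

# Fact table: foreign keys to dimensions, then the facts themselves.
cur.execute("""CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    qty_sold    INTEGER,
    sales_amt   REAL)""")

cur.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
cur.execute("INSERT INTO store_dim VALUES (1, 'East')")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 5, 50.0)")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 3, 30.0)")

# Typical analytical query: facts aggregated, dimension attributes in GROUP BY.
row = cur.execute("""SELECT s.region, SUM(f.sales_amt)
                     FROM sales_fact f
                     JOIN store_dim s ON f.store_key = s.store_key
                     GROUP BY s.region""").fetchone()
print(row)  # ('East', 80.0)
```

Note how the non-key columns (qty_sold, sales_amt) are exactly the ones that appear in the SELECT clause, as the last rule states.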
A Data Warehouse is a
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
collection of data in support of management's decision-making.
[Figure: examples of subjects and dimensions, e.g. Product, Package, Business Unit.]
Hierarchy - Geography
• Hierarchy
• Level
• Member
• Attribute
• Grain Level
[Figure: a geography dimension hierarchy. The World level (member: World) is linked by a parent relation to the Continent level (members: America, Europe, Asia), which is linked by a parent relation to the State level (members: FL, GA, VA, CA, WA). An Economy level (Developed, Developing, Third World) provides an alternative parent relation.]
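A minimal sketch of how such a hierarchy supports rollups, using the parent relation as a plain dictionary (the sales figures are made up):

```python
# Each member maps to its parent member at the next level up,
# following the geography hierarchy in the figure above.
parent = {
    "FL": "America", "GA": "America", "VA": "America",
    "CA": "America", "WA": "America",
    "America": "World", "Europe": "World", "Asia": "World",
}

# Illustrative state-level (leaf-grain) sales.
state_sales = {"FL": 10, "GA": 5, "CA": 7}

def rollup(member, sales):
    """Sum sales for all leaf members that roll up to `member`."""
    total = 0
    for leaf, amount in sales.items():
        node = leaf
        while node is not None:
            if node == member:
                total += amount
                break
            node = parent.get(node)
    return total

print(rollup("America", state_sales))  # 22
print(rollup("World", state_sales))    # 22
```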
Some examples of facts: Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.
Some Examples:
[Figure: a star schema. The Sales_Fact table holds TimeKey, EmployeeKey, ProductKey, CustomerKey, and ShipperKey as foreign keys, plus the required data (business metrics or measures). The surrounding dimension tables are Time_Dim (TimeKey, TheDate, ...), Product_Dim (ProductKey, ProductID, ...), Customer_Dim (CustomerKey, CustomerID, ...), and Shipper_Dim (ShipperKey, ShipperID, ...).]
[Figure: candidate dimensions and their hierarchies for the sales model: Customer (Industry, Segment, Customer) and Geography (Country, State), plus Product and Channel.]
STEP 4
• Define granularity for each group of facts
[Figure: the grain is set by one member from each primary dimension: Customer (customer), Product (scheme), Channel (channel).]
• Secondary Dimension
Does NOT contribute to the fact grain
Non-primary dimensions such as payment type, customer, or manufacturer are still important for analysis of the SALES fact
Useful for rich analytic slicing and dicing, e.g. top 10 customers
• Degenerate Dimension
A dimension without any attributes, but still useful for analysis
Generally included in the associated fact table, before the facts
E.g. an invoice number, by itself, in a shipping fact
• Large Dimension
Size increases with decreasing level of granularity
Typical of public utility companies, government agencies, and customer records kept by supermarkets, e.g. Shopper's Stop
Do NOT create SCDs to address slow changes/history; see "...Monster Dimension" for the SCD strategy
Choose indexing strategies to reduce query run times
Choose the RDBMS wisely, e.g. SQL Server vs. Oracle vs. Teradata
[Figure: within the geography dimension, the Continent level (America, Europe, Asia) has a parent relation to the State level (FL, GA, VA, CA, WA). Each member, such as a state, is a business entity with attributes like Population and Tourist's Place.]
Thus we have:
• Transaction Fact Table
• Snapshot Fact Table
• Summary Fact Table
Figure 1
The context of a transaction is modeled as a set of generally independent
dimensions. Figure 1 shows seven such dimensions.
The measured transaction amount is in a fact table that refers to all the
dimensions by foreign keys pointing outward to their respective dimension tables.
The clean removal of all the context detail from the transaction record is an
important normalization step and is why fact tables are “highly
normalized.”
• The measurements group nicely together into a single fact table with the same
grain.
• Periodic Snapshot: A snapshot is a measurement of status at a specific point in
time. E.g. In Figure 2, earned premium is the fraction of the total policy premium
that the insurance company can book as revenue during the particular reporting
period. The periodic-snapshot-grained fact table represents a predefined time span.
The accumulating-snapshot-grained fact table represents an indeterminate time span, covering the entire history starting when the collision coverage was created for the car in our example and ending with the present moment.
Figure 2
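The distinction between the two snapshot grains can be sketched by deriving both from transaction-grained rows; the policy and amounts below are illustrative, not taken from the figure:

```python
from collections import defaultdict
from datetime import date

# Illustrative transaction-grained facts: (date, policy, premium amount).
transactions = [
    (date(1996, 1, 5), "policy-1", 100.0),
    (date(1996, 1, 20), "policy-1", 50.0),
    (date(1996, 2, 3), "policy-1", 75.0),
]

# Periodic snapshot: one row per policy per predefined period (here, a month).
snapshot = defaultdict(float)
for d, policy, amount in transactions:
    snapshot[(d.year, d.month, policy)] += amount

# Accumulating snapshot: one row per policy, covering its entire history
# from inception to the present moment.
accumulating = defaultdict(float)
for _, policy, amount in transactions:
    accumulating[policy] += amount

print(snapshot[(1996, 1, "policy-1")])  # 150.0
print(accumulating["policy-1"])         # 225.0
```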
Although not shown in the figure, the time dimension has its numbers, such as 1996 and 1997. Each
sub-cube has its own numbers, which represent the volume of production as a measurement.
For example, in a specific time period (not expressed in the figure), the Armonk plant in East region
has produced 11,000 Cell Phones, of model number 1001.
Data Warehouse Bus Matrix
All dimensional models together form the logical design of the data warehouse. To decide which dimensional models to build, we start with a top-down planning approach called the Data Warehouse Bus Architecture Matrix.
This matrix forces us to list all the data marts we could possibly build and to name all the dimensions that are present in those data marts (at a high level).
A dimensional model is made up of one or more star schemas. Some of these star schemas may be snowflaked for better organization and storage.
[Table: the bus matrix. Columns are the candidate dimensions (Organization, Equipment, Employee, Customer, Accounts, Calendar, Outage, Vendor); rows are the subject areas (Accounts, Sales, Quotes, General Ledger, Shipment, Parts/Finance). Checkmarks indicate which dimensions participate in each subject area; e.g. General Ledger and Parts/Finance each use six of the eight dimensions.]
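One way to work with a bus matrix programmatically is as a mapping from subject area to its dimensions; the intersection yields the conformed dimensions that must be designed once and shared across data marts. The assignments below are illustrative, since the slide's exact checkmark placement is not recoverable:

```python
# Illustrative bus matrix: subject area -> set of participating dimensions.
bus_matrix = {
    "Sales":    {"Calendar", "Customer", "Employee", "Organization", "Accounts"},
    "Quotes":   {"Calendar", "Customer", "Employee"},
    "Shipment": {"Calendar", "Customer", "Vendor", "Equipment", "Organization"},
}

# Conformed dimensions: shared by every subject area. They must be built
# once, with agreed keys and attributes, so the data marts integrate later.
conformed = set.intersection(*bus_matrix.values())
print(sorted(conformed))  # ['Calendar', 'Customer']
```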
Dimensional Modeling Approach
CDM, LDM, PDM (conceptual, logical, and physical data models)
Each star schema has a single fact table at its centre surrounded by multiple dimension tables. Once we
do this, we can then start the design of each individual fact table/star schema using a 4-step process.
[Figure: the Sales fact table, with foreign keys Date Key (int), Store Key (int), Product Key (int), and Customer Key (int), and facts Sales (float), Qty Sold (int), Price (float), and Discount (float). Alongside it, the customer dimension hierarchies (Industry, Geography): Industry Type, Segment, Customer; and Country, State, City; Fin. Class is a customer attribute.]
STEP 4a: Choose Facts
Choose each fact for the fact table, making sure that the fact is relevant and also has the same grain as the fact table. E.g. for a SALES fact table, typical facts would be price, quantity sold, and sales amount, as these are all dimensioned by product, by store, by day.
Dimensional Modeling Approach
STEP 4b: Connect Fact to dimension tables by means of surrogate keys
[Figure: the fact connected to the Customer (customer), Product (scheme), and Channel (channel) dimensions by surrogate keys.]
Important Notes:
1. Each dimension table will have a meaningless, single-part integer primary key called a surrogate key. This key also occurs as part of the central fact table's primary key.
2. All fact table primary key components should ideally be foreign keys to the corresponding dimension tables.
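A minimal sketch of surrogate key assignment during dimension loading, consistent with note 1 above (the natural keys are made up):

```python
# Assign a meaningless, single-part integer surrogate key to each incoming
# natural (source-system) key, reusing the same key for repeat arrivals.
class DimensionLoader:
    def __init__(self):
        self.key_map = {}   # natural key -> surrogate key
        self.next_key = 1

    def surrogate_key(self, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self.key_map:
            self.key_map[natural_key] = self.next_key
            self.next_key += 1
        return self.key_map[natural_key]

loader = DimensionLoader()
print(loader.surrogate_key("CUST-0042"))  # 1
print(loader.surrogate_key("CUST-0099"))  # 2
print(loader.surrogate_key("CUST-0042"))  # 1 (same natural key, same surrogate)
```

Fact rows are then loaded with these integers, never with the source-system keys, which keeps the fact table keys meaningless and stable.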
• Develop Standards
• Process Standards
• Naming Standards
Database objects
Word separators
Names in logical and physical models
Physical file naming standards
Naming of files & tables in the staging area
• Improve Performance
• Ensure Scalability
• Manage Storage
• Provide Ease of Administration
• Design Flexibility
• Assign Storage Structures
• Data Clustering
• Parallel Processing
• Summary Levels
• Referential Integrity Checks
• Initialization Parameters
• Data Arrays
Implementation & Maintenance Agenda
A. Physical Design Steps
B. Physical Design Considerations
C. Physical Storage
D. Indexing
E. Performance Enhancement Techniques
F. Deployment Activities
G. Security
H. Backup and Recovery
I. Monitoring the Data Warehouse
J. User Training and Support
K. Managing the Data Warehouse
Major Deployment Activities
• Collection of statistics
• Using statistics for growth planning
• Using statistics for fine-tuning
• Publishing trends for users
• Platform upgrades
• Managing data growth
• Storage management
• ETL management
• Data model revisions
• Information delivery enhancements
• Ongoing fine-tuning
Master Data Management
Metadata Management
Enterprise Data Management Framework
[Figure: typical symptoms of unmanaged master data: "So many duplicate products on this list...", "I am still not able to see the latest data...", "Holland??? Is this customer in Europe or the USA?", and returns on investment below expectations.]
Homonyms: fields that are spelled the same but really aren't the same (id or ID)
• Domains describe the range and types of values that can be present in a data set
• Some examples of domain errors are:
Unexpected values, e.g. Home State = one of {Kan, Mic, Min, ...}
Cardinality: a Yes/No field can have only two credible values
Uniqueness: for a field, 98% of the data is NULL
Constant
Outliers
Length of field
Precision
Scale
Internationalization: date formats, postal codes, time zones, etc.
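A few of the domain checks above, sketched against a made-up column of values:

```python
# Illustrative column: a Yes/No field that should contain only "Y" or "N".
values = ["Y", "N", "Y", "MAYBE", None, "Y"]

expected_domain = {"Y", "N"}
non_null = [v for v in values if v is not None]

# Unexpected values: anything outside the declared domain.
unexpected = [v for v in non_null if v not in expected_domain]

# Cardinality: a Yes/No field should have at most two distinct values.
cardinality = len(set(non_null))

# NULL rate: flag fields that are mostly empty (the slide's 98% example).
null_rate = (len(values) - len(non_null)) / len(values)

print(unexpected)           # ['MAYBE']
print(cardinality)          # 3, one more than a Yes/No field allows
print(round(null_rate, 2))  # 0.17
```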
[Figure: unmanaged data issues fall into physical issues and logical issues; data profiling (for cleansing) and business rules drive data parsing and data quality processing, using rules, text mining, etc.]
• DQM at Source
• DQM as part of ETL processes
• DQM in the target
• Modular approach to building solutions
• Clear and well-defined guidelines, checklists, and standards
• Supports the onsite-offshore delivery model
• Flexibility to adapt to other methodologies
• E-T-V-X criteria reinforced by best practices and TCS' quality initiatives
Trillium Software: used for cleansing name and address data. The software can identify and match households, business contacts, and other relationships to eliminate duplicates in large databases using fuzzy matching techniques.
Ab Initio: provides a suite of software packages used for ETL in data warehouses. Its features include parallel data transformation, validation and filtering, real-time data capture, integration with relational DBMSs, and data profiling capability.
Unitech and Actuate: a set of reporting tools used for trend analysis, point-to-point reconciliation, and detecting data inconsistency.
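The fuzzy matching idea attributed to Trillium above can be illustrated with plain stdlib string similarity. This is not Trillium's actual algorithm; the names and threshold here are made up:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Shoppers Stop Ltd", "Shopper's Stop Limited", "Acme Corp"]

# Pair up records whose similarity clears a (tunable) threshold;
# near-identical spellings are flagged as candidate duplicates.
duplicates = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) > 0.8
]
print(duplicates)  # only the two Shopper's Stop variants pair up
```

Real cleansing tools add parsing (splitting name, street, city), standardization, and householding on top of raw string similarity.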
Business drivers
• To unify existing marketing systems, providing a centralized customer repository that caters to:
Developing, targeting and presenting propositions
Managing customer relationships
Undertaking rapid tactical marketing
Improving campaign effectiveness
Reducing marketing operational costs
[Figure: business rules from the source and target systems are consolidated into common system rules and a business rules repository (BRR) for the source and target systems. An adaptor embeds the business rules into a data profiler, which profiles and analyzes test data from both the source and target systems across the schema transformation; the results are then compared.]
Software
• Database Layer – Oracle
• Application Layer – Ab Initio, Trillium, Unitech and Actuate
Hardware
• IBM Sequent NUMA-Q server, 16-quad machine with 2 GB RAM
MDM
• Role Definition Support: Support for definition of roles with access rights enforced
depending on the responsibilities assigned for that role
• ETL: ETL capabilities for extracting master data/reference data files or tables from multiple
sources and loading the data into the master data repository
• Data Cleansing: Data cleansing capabilities for de-duplication and matching of master
data records
• Collaborative platform: A collaborative platform for coordinating decisions on master
data reconciliation and rationalization. The platform should be supported by standards, if
available, or via industry knowledge of a master data domain. An example is a standard
product hierarchy for a particular industry
• Data synchronization and replication support: For applying changes established in a
central server to each consuming application. Incremental change support is important for
performance reasons
• Version control and Change monitoring: Version control at the central policy hub
combined with change monitoring across all of the participating systems. This is needed in
order to track changes to master data over time.
• Assign business responsibility for each master data domain, such as products, customers, suppliers, and organizational structure
• Extract master data for a domain from separate operational and reporting systems to a central server
• Apply data quality standards, such as de-duplication and matching of master data records, to get a clean set of master data for the domain
• Reconcile and rationalize the master data records. This process entails setting policies pertaining to an optimal product hierarchy, organizational structure, or preferred supplier list
• Synchronize participating operational and reporting systems with the centrally managed, canonical master data
• Monitor changes or updates to master data in each participating system, then repeat the preceding steps for ongoing maintenance. Over time, with the centralization of master data management responsibilities, the origination of master data changes moves from the participating systems to the master data management hub or server
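The change-monitoring step above can be sketched with record hashing, so that only changed rows flow from the hub to consuming systems (the incremental support the capabilities list calls out for performance); the record layout is made up:

```python
import hashlib

def record_hash(record):
    """Stable hash of a record's field values, used to detect changes."""
    joined = "|".join(str(v) for v in record.values())
    return hashlib.sha256(joined.encode()).hexdigest()

# Illustrative master data at the hub, keyed by product id.
hub = {
    "P1": {"name": "Widget", "hierarchy": "Tools"},
    "P2": {"name": "Gadget", "hierarchy": "Electronics"},
}

# Snapshot of hashes at the last synchronization.
last_synced = {key: record_hash(rec) for key, rec in hub.items()}

# A change lands at the hub; only the changed record needs replication.
hub["P1"]["hierarchy"] = "Hardware"
changed = [k for k, rec in hub.items() if record_hash(rec) != last_synced[k]]
print(changed)  # ['P1']
```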
• Data security is the means of ensuring that data is kept safe from corruption and that access to it is suitably controlled. Data security thus helps ensure privacy and protects personal data.