An Introduction To Data Warehousing
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Lnt Infotech Use Only
Data, Data everywhere
yet ...
• A data warehouse is a subject-oriented, integrated, time-varying, non-volatile
collection of data that is used primarily in organizational decision making
A process of transforming data into information and making it available to
users in a timely enough manner to make a difference
• Unfriendly
• Slow
• Dependent on IS programmers
• Inflexible
• Analysis limited to defined reports
Focus on Reporting
Evolution of Data Warehousing
• Trend Analysis
• What If ?
• Moving Averages
• Cross Dimensional Comparisons
• Statistical profiles
• Automated pattern and rule discovery
Focus on Online Analysis
Data Warehousing Concepts and Terms
Remember
Between OLTP and Data Warehouse systems
hardware is different
Time of day
• Performance
• special data organization, access methods, and
implementation methods are needed to support
multidimensional views and operations typical of OLAP
• Complex OLAP queries would degrade performance for
operational transactions
• Concurrency control and recovery modes of OLTP are
not compatible with OLAP analysis
• Function
• missing data: Decision support requires historical data which
operational DBs do not typically maintain
• data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources:
operational DBs, external sources
• data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled.
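The data consolidation and data quality points above can be sketched in a few lines of Python. This is a minimal illustration, not a real ETL tool: the two source extracts, the region codes, and the `REGION_CODES` mapping are all made-up assumptions chosen only to show inconsistent codes being reconciled before aggregation.

```python
from collections import defaultdict

# Hypothetical extracts from two operational sources that encode the
# same region differently (illustrative data, not from the slides).
orders_system_a = [
    {"region": "S", "amount": 120.0},
    {"region": "N", "amount": 80.0},
]
orders_system_b = [
    {"region": "south", "amount": 200.0},
    {"region": "NORTH", "amount": 50.0},
]

# Assumed mapping that reconciles each source's codes to one standard form.
REGION_CODES = {"s": "South", "south": "South", "n": "North", "north": "North"}


def reconcile(rows):
    """Return rows with region codes translated to the standard form."""
    return [
        {"region": REGION_CODES[row["region"].lower()], "amount": row["amount"]}
        for row in rows
    ]


def consolidate(*sources):
    """Reconcile every source, then aggregate amounts per region."""
    totals = defaultdict(float)
    for rows in sources:
        for row in reconcile(rows):
            totals[row["region"]] += row["amount"]
    return dict(totals)


totals = consolidate(orders_system_a, orders_system_b)
print(totals)
```

Without the reconciliation step, "S" and "south" would be summed as two different regions, which is the kind of inconsistency the data quality bullet describes.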
[Figure: sources A and B feed an ODS (operational) and a Data Warehouse (DSS, historical)]
Has the incidence of tuberculosis increased in the last 5 years in the Southern
region?
OLTP Vs ODS Vs DWH
• Order processing
– 2 second response time
– Last 6 months orders
– Sends daily closed orders to the Data Warehouse
• Product Price/inventory
– 10 second response time
– Last 10 price changes
– Last 20 inventory transactions
– Sends weekly product price/inventory to the Data Warehouse
• Marketing
– 30 second response time
– Last 2 years programs
– Sends weekly marketing programs to the Data Warehouse
• Data Warehouse
– Last 5 years data
– Response time 2 seconds to 60 minutes
– Data is not modified
• Order processing: Customer orders, Product price, Available Inventory
• Product Price/inventory: Product price, Product Inventory, Product Price changes
• Marketing: Customer Profile, Product price, Marketing programs
• Data Warehouse: Customers, Products, Orders, Product Inventory, Product Price
[Figure: the Order Processing System sends daily closed orders to the Data Warehouse; inventory, which moves up and down, is captured as a weekly inventory snapshot]
• Order states: Open, Backorder, Shipped, Closed
• Data Warehouse: Orders (Closed), Inventory snapshot 1, Inventory snapshot 2
[Figure: Operational Data and Distributed data pass through ETL into the Data Warehouse and Data Marts, which serve OLAP Tools, DSS Tools, and Data Mining]
Typical Data Warehouse Architecture
[Figure: Typical Data Warehouse Architecture]
• Operational Systems/Data
• Data Preparation: Select, Extract, Transform, Integrate, Maintain
• Data Warehouse and Data Marts, described by Metadata
• Middleware/API
• Front ends: EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining
[Figure: Data Warehouse architecture with an ODS]
• Operational Systems/Data
• Data Preparation: Select, Extract, Transform, Integrate, Maintain (Metadata)
• ODS
• Data Preparation: Select, Extract, Transform, Load (Metadata)
• Data Warehouse
• Data Marts
• OLAP
• Data Mining
Measure: A numeric value that describes some aspect of the business.
Dimension: A category of information that describes the measure,
for example the Time dimension.
Attribute: A unique level within a dimension. For example, Month is an
attribute within the Time dimension.
Hierarchy: The specification of levels that represents the relationship
between different attributes within a dimension. For example, one
possible hierarchy in the Time dimension is
Year--Quarter--Month--Day
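A small Python sketch may help make these terms concrete. It treats the sales amount as the measure and rolls it up along the Year--Quarter--Month--Day hierarchy of the Time dimension. The function names (`time_attributes`, `roll_up`) and the sample figures are illustrative assumptions, not part of any OLAP product.

```python
from collections import defaultdict
from datetime import date

# Levels of the Time hierarchy, coarsest first.
TIME_HIERARCHY = ["year", "quarter", "month", "day"]


def time_attributes(d):
    """Derive every attribute of the Time dimension for one calendar date."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
    }


def roll_up(facts, level):
    """Sum the measure at the given level of the Time hierarchy."""
    depth = TIME_HIERARCHY.index(level) + 1
    totals = defaultdict(float)
    for d, amount in facts:
        attrs = time_attributes(d)
        # The key keeps all levels above the requested one, e.g. (year, quarter).
        key = tuple(attrs[name] for name in TIME_HIERARCHY[:depth])
        totals[key] += amount
    return dict(totals)


# Illustrative (date, sales amount) pairs; the values are made up.
sales = [
    (date(2023, 1, 15), 100.0),
    (date(2023, 2, 3), 50.0),
    (date(2023, 5, 20), 70.0),
]

by_quarter = roll_up(sales, "quarter")
by_year = roll_up(sales, "year")
print(by_quarter)
print(by_year)
```

Moving from `by_quarter` to `by_year` is a roll-up; going the other way, to "month" or "day", is a drill-down along the same hierarchy.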
OLAP EXAMPLE:
An example OLAP database may be comprised of sales data which has
been aggregated by region, product type, and sales channel. A typical
OLAP query might access a multi-year sales database in order to find
all product sales in each region for each product type.
After reviewing the results, an analyst might further refine the query to
find sales volume for each sales channel within region/product
classifications.
As a last step the analyst might want to perform year-to-year or quarter-
to-quarter comparison for each sales channel. This whole process must
be carried out on-line with rapid response time so that the analysis
process is undisturbed.
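The three analysis steps in this example can be sketched in plain Python. This is only an illustration under stated assumptions: the `SALES` rows, the column names, and the `aggregate` helper are hypothetical, and a real OLAP server would answer such queries from pre-aggregated structures rather than by scanning rows.

```python
from collections import defaultdict

# Illustrative sales rows: (year, region, product_type, channel, amount).
SALES = [
    (2022, "East", "Fruit", "Retail", 100.0),
    (2022, "East", "Fruit", "Online", 40.0),
    (2023, "East", "Fruit", "Retail", 120.0),
    (2023, "East", "Fruit", "Online", 90.0),
    (2023, "West", "Dairy", "Retail", 60.0),
]
COLUMNS = ("year", "region", "product_type", "channel")


def aggregate(rows, group_by):
    """Sum the amount for each combination of the group_by columns."""
    positions = [COLUMNS.index(name) for name in group_by]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)


# Step 1: product sales in each region for each product type, all years.
by_region_type = aggregate(SALES, ["region", "product_type"])

# Step 2: refine the query to sales channel within region/product type.
by_channel = aggregate(SALES, ["region", "product_type", "channel"])

# Step 3: year-to-year comparison for each sales channel.
by_channel_year = aggregate(SALES, ["channel", "year"])
growth = {
    channel: by_channel_year[(channel, 2023)] - by_channel_year[(channel, 2022)]
    for channel in ("Retail", "Online")
}
print(by_region_type, by_channel, growth, sep="\n")
```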
[Figure: Sales cube with three dimensions]
• Location: Atlanta, Denver, Detroit
• Product: Grapes, Cherries, Melons, Apples, Pears
• Time: Q1, Q2, Q3, Q4
Online Analytical Processing
Data mining has varied fields of application, some of which are listed
below:
RETAIL/MARKETING
• Identify buying patterns from customers
• Find associations among customer demographic characteristics
• Predict response to mailing campaigns
BANKING
• Detect patterns of fraudulent credit card use
• Identify loyal customers
• Determine credit card spending by customer groups
• Find hidden correlations between different financial indicators
• Entity-Relationship Modeling
– Traditional modeling technique
– Technique of choice for OLTP
– Suited for corporate data warehouse
• Dimensional Modeling
– Analyzing business measures in the specific business
context
– Helps visualize very abstract business questions
– End users can easily understand and navigate the data
structure
[Figure: entity-relationship diagram linked by foreign keys (FK): City, Sales District, Sales Region, Salesrep table, Order Header, Order Details, Customer Table, Item Table, Product Category]
Entity-Relationship Modeling - Basic Concepts
• Entity
– Object that can be observed and classified by its properties
and characteristics
– Business definition with a clear boundary
– Characterized by a noun
– Example
• Product
• Employee
• Relationship
– Relationship between entities - structural interaction and
association
– Described by a verb
– Cardinality
• 1-1
• 1-M
• M-M
– Example : Books belong to Printed Media
• Attributes
– Characteristics and properties of entities
– Example :
• Book Id, Description, book category are attributes of entity
“Book”
– Attribute name should be unique and self-explanatory
– Primary Key, Foreign Key, Constraints are defined on
Attributes
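The entity, relationship, and attribute concepts above can be sketched with Python's built-in `sqlite3` module. The table and column names (`printed_media`, `book`, `media_id`) are assumptions derived from the "Books belong to Printed Media" example, and the sample rows are made up. The sketch models a 1-M relationship with a primary key on each entity and a foreign key on the "many" side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Entity "Printed Media": the "one" side of the 1-M relationship.
CREATE TABLE printed_media (
    media_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
-- Entity "Book" with attributes Book Id, Description, and book category.
CREATE TABLE book (
    book_id       INTEGER PRIMARY KEY,
    description   TEXT NOT NULL,
    book_category TEXT,
    -- Foreign key expressing "Books belong to Printed Media".
    media_id      INTEGER NOT NULL REFERENCES printed_media(media_id)
);
""")

conn.execute("INSERT INTO printed_media VALUES (1, 'Printed Media')")
conn.execute("INSERT INTO book VALUES (10, 'Intro to SQL', 'Technical', 1)")

# A book that points at a non-existent parent violates the constraint.
try:
    conn.execute("INSERT INTO book VALUES (11, 'Orphan', 'Fiction', 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

book_count = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]
print(fk_enforced, book_count)
```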
[Figure: star schema with a central Sales Fact table and three dimension tables]
• Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
• Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
• Product Dimension: product_key, description, brand, category
• Store Dimension: store_key, store_name, address, floor_plan_type
Star Schema Architecture
FACT TABLES
The fact tables are where the numerical measurements of the business
are stored.
Sparse
Dimension Tables
The dimension tables are where the textual descriptions of the
dimensions of the business are stored.
Dimension tables are designed especially for selection and grouping.
There is no access control on these tables; all users can view this
information.
These tables are much smaller than the fact tables and may contain
10,000 rows of data.
• Dimension Tables
Each dimension table has a single-part primary key that
corresponds exactly to one of the components of the
multipart key in the fact table.
Dimension tables most often contain descriptive textual
information
Determine contextual background for facts
Examples :
• Time
• Location/Region
• Customers
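The key relationship described above, where each dimension table has a single-part primary key that matches one component of the fact table's multipart key, can be sketched with Python's built-in `sqlite3` module. The column names follow the star schema figure in this deck, but the reduced set of columns and all sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: single-part primary keys plus descriptive text.
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, description TEXT, category TEXT
);
-- Fact table: multipart key made of the dimension keys, plus measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    dollars_sold REAL,
    units_sold   INTEGER,
    PRIMARY KEY (time_key, product_key)
);
INSERT INTO time_dim VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO product_dim VALUES (1, 'Apples', 'Fruit'), (2, 'Pears', 'Fruit');
INSERT INTO sales_fact VALUES (1, 1, 100.0, 10), (1, 2, 50.0, 5), (2, 1, 70.0, 7);
""")

# Dimensions supply the selection and grouping; the fact supplies the measure.
rows = conn.execute("""
    SELECT t.quarter, p.category, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY t.quarter, p.category
    ORDER BY t.quarter
""").fetchall()
print(rows)
```

The query joins each dimension on its single key, which is why dimension tables suit selection and grouping while the fact table carries the numbers being summed.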
[Figure: Sales_Fact at the center, joined to four dimension tables]
• Sales_Fact
– Dimensional keys (multipart key): TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey
– Measures: RequiredDate, ...
• Time_Dim: TimeKey, TheDate, ...
• Product_Dim: ProductKey, ProductID, ...
• Shipper_Dim: ShipperKey, ShipperID, ...
• Customer_Dim: CustomerKey, CustomerID, ...
Fact Table & Dimension Tables
[Figure: a fact table surrounded by four dimension tables; example attributes shown: store, sType, city, region]
[Figure: Product_Dim linked to separate product brand and product category tables]
• Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, RequiredDate, ...
• Product_Dim: ProductKey, Product Name, Product Size, Product Brand ID
• Product brand table: Product_Brand_ID, Product Brand, Product Category ID
• Product category table: Product_Category_ID, Product Category
Conformed Dimensions
• ERWIN
– Supports Data Warehouse design as a modeling technique
• Powersoft WarehouseArchitect
– Module of Power Designer specifically for DW Modeling
• Oracle Designer
– Can be extended for Warehouse modeling
• Others like Infomodeler, Silverrun are also used