Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 51

Data Warehouse

A producer wants to know….


Which are our
lowest/highest margin
customers ?
Who are my customers
What is the most and what products
effective distribution are they buying?
channel?

What product prom- Which customers


-otions have the biggest are most likely to go
impact on revenue? to the
What impact will competition ?
new products/services
have on revenue
and margins?
Introduction – Decision Support System (DSS)
 A Decision Support System (DSS) is an interactive computer-
based system or subsystem intended to help decision makers use
communications technologies, data, documents, knowledge
and/or models to identify and solve problems, complete decision
process tasks, and make decisions.
 It is clear that DSS belong to an environment with
multidisciplinary foundations, including (but not exclusively):
– Database research,
– Artificial intelligence,
– Human-computer interaction,
– Simulation methods,
– Software engineering, and
– Telecommunications.
Introduction – Decision Support System (DSS)
• DSSs serve the management, operations, and planning
levels of an organization (usually mid and higher
management) and help to make decisions, which may
be rapidly changing and not easily specified in advance
(Unstructured and Semi-Structured decision
problems).
• Decision support systems can be either fully
computerized, human or a combination of both.
Operational Database
 Operational database management systems are used to
manage dynamic data in real-time.

 These types of databases allow you to do more than simply


view archived data. Operational databases allows to modify
that data (add, change or delete data), doing it in real-time.

 Operational databases are increasingly supporting


distributed database architecture that provides high
availability and fault tolerance through replication and scale
out ability.
Data Warehousing brief History
• The concept of data warehousing dates back to the late 1980s
when IBM researchers Barry Devlin and Paul Murphy
developed the "business data warehouse".
• 1960s - General Mills and Dartmouth College, in a joint
research project, develop the terms dimensions and facts.
• 1970s - ACNielsen and IRI provide dimensional data marts for
retail sales.
• 1983 – Tera data introduces a database management system
specifically designed for decision support.
• 1988 - Barry Devlin and Paul Murphy publish the article An
architecture for a business and information systems in IBM
Systems Journal where they introduce the term "business data
warehouse".
Data warehouse Introduction

7
Data warehouse Introduction
Subject
“Data Warehouse is a Oriented

subject oriented,
integrated, time-
variant and non-
volatile collection of Non-
volatile
Data
Warehous
Integrated

data in support of e

management’s decision
making process.” – W.
H. Inmon Time
Varian
t
Data warehouse Usage
 Three kinds of data warehouse applications
– Information processing
 supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
– Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools.
 Differences among the three tasks

9
Data warehouse: Subject Oriented
 Organized around major subjects, such as customer, product,
sales.

 Focusing on the modeling and analysis of data for decision


makers, not on daily operations or transaction processing.

 Provide a simple and concise view around particular


subject issues by excluding data that are not useful in the
decision support process.

10
Data warehouse: Subject Oriented

Operational Data
Warehouse

Leads Prospects Customers Products

Quotes Orders Regions Time

11
Data warehouse: Integrated
 Constructed by integrating multiple, heterogeneous
data sources
– relational databases, flat files, on-line transaction records

 Data cleaning and data integration techniques are applied.


– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.

12
Data warehouse: Time Varying
 The time horizon for the data warehouse is significantly longer
than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not
contain “time element”.
13
Data warehouse: Time Varying

Operational Data
Warehouse

Current Value data Snapshot data


• time horizon : 60-90 days • time horizon : 5-10 years
• data warehouse stores historical
data

14
Data warehouse: Non-Volatile
 A physically separate store of data transformed from
the
operational environment.
 Operational update of data does not occur in the
data warehouse environment.
– Does not require transaction processing,
recovery, and concurrency control mechanisms
– Requires only two operations in data accessing:
 initial loading of data and access of data.
15
Data warehouse: Non-Volatile
insert change

Operational Data
Warehouse
insert
delete
load
read only
access
replace
change

16
Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented

• I can’t use the data I found


– results are unexpected
– data needs to be transformed from
one form to other
What Is a Data Warehouse?
“A subject-oriented, integrated, time-variant, non-
volatile collection of data in support of
management’s decision making process.”
W.H. Inmon

“The data warehouse is where we publish used


data.”
Ralph Kimball

6
Difference between Database and data warehouse
FEATURES DATABASE DATA WAREHOUSE
Characteristic It is based on Operational Processing. It is based on Informational Processing.
Data It mainly stores the Current data which It usually stores the Historical data whose
always guaranteed to be up-to-date. accuracy is maintained over time.

Function It is used for day-to-day operations. It is used for long-term informational


requirements and decision support.

User The common users are clerk, DBA, The common users are knowledge worker
database professional. (e.g., manager, executive, analyst)

Unit of work Its work consists of short and simple The operations on it consists of complex
transaction. queries..

Focus The focus is on “Data IN” The focus is on “Information OUT”

Orientation The orientation is on Transaction. The orientation is on Analysis.


DB design The designing of database is ER based The designing is done using star/snowflake
and application-oriented. schema and its subject-oriented.

Summarization The data is primitive and highly The data is summarized and in consolidated
detailed. form.

View The view of the data is flat relational. The view of the data is multidimensional.
19
Difference between Database and data warehouse
FEATURES DATABASE DATA WAREHOUSE
Function It is used for day-to-day operations. It is used for long-term informational
requirements and decision support.
User The common users are clerk, DBA, The common users are knowledge worker
database professional. (e.g., manager, executive, analyst)
Access The most frequent type of access type is It mostly use the read access for the
read/write. stored data.
Operations The main operation is index/hash on For any operation it needs a lot of scans.
primary key.
Number of A few tens of records. A bunch of millions of records.
records accessed

Number of users In order of thousands. In the order of hundreds only.

DB size 100 MB to GB. 100 GB to TB.


Priority High performance, high availability High flexibility, end-user autonomy

Metric To measure the efficiency, transaction To measure the efficiency, query


throughput is measured. throughput and response time is
measured.
20
Evolution in organizational use of data warehouses
Organizations generally start off with relatively simple use of data
warehousing. Over time, more sophisticated use of data warehousing evolves.
The following general stages of use of the data warehouse can be
distinguished:
• Off line Operational Database
–Data warehouses in this initial stage are developed by simply copying the
data off an operational system to another server where the processing load
of reporting against the copied data does not impact the operational
system's performance.
• Off line Data Warehouse
–Data warehouses at this stage are updated from data in the operational
systems on a regular basis and the data warehouse data is stored in a data
structure designed to facilitate reporting.
• Real Time Data Warehouse
–Data warehouses at this stage are updated every time an operational
system performs a transaction (e.g. an order or a delivery or a booking.)
• Integrated Data Warehouse
–Data warehouses at this stage are updated every time an operational
system performs a transaction. The data warehouses then generate
transactions that are passed back into the operational systems.
Data Warehouse Architecture
Client Client

Query &
Analysis

Metadata Warehouse

Integration

Source Source
Source
Why a Warehouse?
· Two Approaches:
- Query-Driven (Lazy)
- Warehouse (Eager)

Source Source
The Traditional Research Approach
· Query-driven (lazy, on-demand)
Clients

Integration System Metadata

...

Wrapper Wrapper Wrapper

...
Source Source Source
Disadvantages of Query-Driven Approach
¨ Delay in query processing
¨ Slow or unavailable information sources
¨ Complex filtering and integration
¨ Inefficient and potentially expensive for frequent
queries
¨ Competes with local processing at sources
¨ Hasn’t caught on in industry
The Warehousing Approach
· Information Clients
integrated in
advance Data
Warehouse
· Stored in wh
for direct
querying and Integration System Metadata

analysis ...

Extractor/ Extractor/ Extractor/


Monitor Monitor Monitor

...
Source Source Source
Advantages of Warehousing Approach
High query performance
But not necessarily most current information
Doesn’t interfere with local processing at sources
Complex queries at warehouse
OLTP at information sources
Information copied at warehouse
Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, no auditing
Has caught on in industry
Not Either-Or Decision
Query-driven approach still better for
Rapidly changing information
Rapidly changing information sources
Truly vast amounts of data from large numbers of
sources
Clients with unpredictable needs
Data Warehouse? A Practitioners Viewpoint
“A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of
sources and made available to end users in a way they
can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
Data Warehouse Architectures: Conceptual View
Operational Informational
Single-layer systems systems

Every data element is stored once only


Virtual warehouse “Real-time data”

Two-layer
Real-time + derived data
Operational Informational
Most commonly used approach in systems systems

industry today
Derived Data

Real-time data
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data
really requires two steps

Operational Informational
systems systems

View level
“Particular informational
Derived Data
needs”

Reconciled Data
Physical Implementation
of the Data Warehouse

Real-time data
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”
Both rich research areas
Industry has focused on (2)
Issues

in Data Warehousing
Warehouse Design
Extraction
Wrappers, monitors (change detectors)
Integration
Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
Warehouse is a Specialized DB
Standard DB (OLTP) Warehouse (OLAP)
Mostly updates · Mostly reads
Many small transactions · Queries are long and
Mb - Gb of data complex
Current snapshot · Gb - Tb of data
Index/hash on p.k. · History
Raw data · Lots of scans
Thousands of users (e.g., · Summarized, reconciled
clerical users) data
· Hundreds of users (e.g.,
decision-makers, analysts)
DATA WAREHOUSES
Data Warehouses:
Data spread in several databases –
physically located at numerous sites
Data warehouse – repository of multiple
DBs in single schema; resides at single site.
Data warehousing processes

1.Data Cleaning 2. Data Integration 3. Data


Transformation
4. Data Loading 5. Periodic data refreshing
Data warehouse diagram
Data cleaning:-Data Cleaning includes,
filling in missing values, smoothing noisy
data, identifying or removing outliers, and
resolving inconsistencies.
Data integration:-Data Integration includes
integration of multiple databases, data
cubes, or files.
Data transformation:-Convert data from
legacy or host format to warehouse format.
Load :-sort;summarize,consolidate;compute
views; check integrity.Build indices and
partitions.
Refresh:-Propagates the update from data
sources to the warehouse.

Data in a data warehouse are
organized around major subjects.

Data provide information on
historical perspective – summarized
on periodic dimension.
 Eg. Sales of an item for a region in a
period

Data warehouse model –
multidimensional database structure / data
cube
 Dimensions – Attributes / set of attributes

Facts – Aggregated measures
(Count / Sales amount)
Facts about data warehousing:-
Issues involved in warehousing include
techniques for dealing with errors and
techniques for efficient storage and
indexing of large volumes of data.
This system is used for reporting and data
analysis.
It usually contains historical data
derived from transaction data.
Data warehousing is not meant for current
“live”data.
Components of a data warehouse
• Sources –Data source interaction
• Data Transformation
• Data warehouse (data storage )
• Reporting (Data presentation )
• Metadata
Tiers of data warehouse architecture
Top tier:

Middle tier

Bottom tier

Data
warehouse
server

Backend tools
fig:- A three tier data warehousing
1)Bottom tier:-The bottom tier is a warehouse database
server that is always a relational database system.
 Back-end tools and utilities are used to feed data into the
bottom tier from operational databases or other external
sources. These tools and utilities perform data
extraction,cleaning and transformation as well as load and
refresh functions to update the data warehouse.
 The date extracted using application program
interfaces known as gateways.
 Example of gateways are ODBC(open database
connection)and OLEDB(Open Linking and embedding for
database) by microsoft and jdbc(java database
connecton).
 This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2.)Middle tier:- The middle tier is an OLAP server
that is typically implemented using either:-

a) A relational OLAP (ROLAP) model that is,an


extended relation DBMS that maps
operations.Intermediate server b/w relational
back-end server and client front end tools.
b) A multidimentional OLAP (MOLAP) model that is,
a special purpose server that directly
implements multidimentional data and
operations. Supports multidimention views.
3.)Top tier:-The top tier is a front –end client
layer ,which contains query and reporting tools ,analysis
tools,and or data mining tools.
Note:-

OLAP – Online Analytical Processing:


 This is the major task of Data Warehousing
System.
 Useful for complex data analysis and
decision making.
 Market oriented –used by
managers,executives and data analyst.
Needs for Data Warehousing
 Better business intelligence for end-users

 Reduction in time to locate, access, and analyze information

 Consolidation of disparate information sources

 Strategic advantage over competitors

 Faster time-to-market for products and services

 Replacement of older, less-responsive decision support systems

 Reduction in demand on IS to generate reports


47
Why separate Data Warehouse?
 High performance for both systems
– DBMS— tuned for OLTP: access methods,
indexing,
concurrency control, recovery
– Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
 Different functions and different data:
– missing data: Decision support requires historical data
which operational DBs do not typically maintain
– data consolidation: DS requires
consolidation (aggregation, summarization) of data from
heterogeneous sources
– data quality: different sources typically use inconsistent
data representations, codes and formats which have to be
reconciled
Applications of Data Warehouse
Industry Application
Finance Credit card Analysis
Insurance Claims, Fraud
Telecommunicatio Analysis Call record
n Analysis Logistics
Transport Consumer management
goods Promotion Analysis
Popular Tools of Data Warehouse
Tools Company
Infomix IBM
Oracle Warehouse Builder ORACLE
Microsoft SQL Server Microsoft
Integration
for your attention!

You might also like