
DATA WAREHOUSE CONCEPTS
• Data, Information, and Knowledge
• Data is the reality that a computer records, stores, and processes.
• The use of computers can be referred to as data processing. At the lowest
level data has no significance for people. This lowest level in the perception
of reality is sometimes referred to as "raw data".
• Information is what a person is able to understand about reality.
• Information systems use computers to organize data in such a way that
people can understand the results.
• Knowledge is what a business uses to make decisions.

• The goal of business intelligence and data warehousing: changing data into information and knowledge.
Definition

• A Data Warehouse:
– is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of
management's decision making process
– provides a historical perspective of information
– is most often, but not exclusively, used for
decision support applications and business
information queries
– can be more than one database
Definition
• Subject-Oriented: A data warehouse can be used to analyze a particular subject
area. For example, "sales".
• Integrated: A data warehouse integrates data from multiple data sources. For
example, source A and source B may have different ways of identifying a product,
but in a data warehouse, there will be only a single way of identifying a product.
• Time-Variant: Historical data is kept in a data warehouse. For example, one can
retrieve data from 3 months, 6 months, or 12 months ago, or even older data, from a data
warehouse. This contrasts with a transaction system, where often only the most
recent data is kept. For example, a transaction system may hold the most recent
address of a customer, whereas a data warehouse can hold all addresses associated
with a customer.
• Non-volatile: Once data is in the data warehouse, it will not change. So, historical
data in a data warehouse should never be altered.
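The integrated, time-variant, and non-volatile properties can be illustrated with a minimal Python sketch. The source names, product identifiers, and helper functions below are hypothetical and chosen only for illustration; a real warehouse would keep this data in database tables.

```python
from datetime import date

# Hypothetical mapping from each source system's product identifier
# to the single warehouse-wide product key (the "integrated" property).
PRODUCT_KEY_MAP = {
    ("source_a", "P-001"): 1,
    ("source_b", "WIDGET_SM"): 1,  # same product, identified differently
    ("source_b", "WIDGET_LG"): 2,
}


def warehouse_product_key(source, source_id):
    """Resolve a source-specific product ID to the one warehouse key."""
    return PRODUCT_KEY_MAP[(source, source_id)]


# Time-variant and non-volatile: every address is appended with an
# effective date, and rows already stored are never modified.
address_history = []


def record_address(customer_id, address, effective):
    address_history.append((customer_id, address, effective))


def address_as_of(customer_id, as_of):
    """Return the address in effect on the given date, or None."""
    rows = [r for r in address_history if r[0] == customer_id and r[2] <= as_of]
    return max(rows, key=lambda r: r[2])[1] if rows else None


record_address(42, "1 Old Road", date(2020, 1, 1))
record_address(42, "9 New Street", date(2023, 6, 1))
```

Calling `address_as_of(42, date(2021, 1, 1))` returns the older address, because the newer row was added without overwriting the earlier one.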
Why a warehouse?
• Analysis and Decision support
• End-user access to data captured and stored in an organization’s operational or production systems
• This data is stored
– in multiple formats
– on multiple platforms
– in multiple data structures,
– with multiple names
– created using different business rules
• Most of the data is stored in relational databases, which are difficult for end users to access.
• Obtaining the data or a report usually requires waiting for a programmer to either develop the report
or provide a customized download program.
• Not all of the data may be consistent as of the same point in time.
• The operational systems may not keep enough copies of the data for historical reporting.
• End users do not know what is kept in the existing data stores.
Advantages
• Improved end-user access to a wide variety of University data
• Increased data consistency
• Additional documentation of the data
• Potentially lower computing costs and increased productivity
• Providing a place to combine related data from separate sources
• Creation of a computing infrastructure that can support changes in
computer systems and business structures
• Empowering end-users to perform any level of ad-hoc queries or
reports without impacting the performance of the operational
systems
Why a central data store
[Diagram: A Good Reason for a Central Data Store. Labels shown: Life, LTD, Voluntary, Non-Medical Benefits, Individual, Financial, Disability, Account Management, Underwriting, Customer Service, Sales/Marketing, Financial Analysis]


Interesting Statistics
• 95% of the Fortune 1000 companies have, are implementing, or are looking at, data warehouses (Meta Group)
• 95% of all information processing organizations will be pursuing a data warehouse strategy in the next three years (Meta Group)
• The Decision Support industry will be a $100 Billion industry by 2020 (IDC & Forrester)
Architecture
• Operational database layer
– The source data for the data warehouse — An organization's
Enterprise Resource Planning systems fall into this layer.
• Data access layer
– The interface between the operational and informational access layer — Tools to
extract, transform, load data into the warehouse fall into this layer.
• Metadata layer
– The data dictionary — This is usually more detailed than an operational system
data dictionary. There are dictionaries for the entire warehouse and sometimes
dictionaries for the data that can be accessed by a particular reporting and
analysis tool.
• Informational access layer
– The data accessed for reporting and analyzing and the tools for reporting and
analyzing data — This is also called the data mart. Business intelligence tools fall
into this layer. The Inmon-Kimball differences about design methodology,
discussed later in this article, have to do with this layer.
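The flow through these layers can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the `orders` and `sales` tables, their columns, and the cleanup rules are hypothetical, and two in-memory databases stand in for the operational and informational layers.

```python
import sqlite3

# Operational database layer: a stand-in for a source system.
operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE orders (id INTEGER, state TEXT, amount_cents INTEGER)"
)
operational.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ny", 1250), (2, "CA", 400), (3, "Ny", 999)],
)

# Informational access layer: the table that reporting tools would query.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (order_id INTEGER, state TEXT, amount REAL)")


# Data access layer: extract, transform, load.
def extract(conn):
    return conn.execute(
        "SELECT id, state, amount_cents FROM orders ORDER BY id"
    ).fetchall()


def transform(rows):
    # Standardize the state code and convert cents to currency units.
    return [(oid, state.upper(), cents / 100) for oid, state, cents in rows]


def load(conn, rows):
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(operational)))

# A reporting query against the loaded data.
totals = dict(
    warehouse.execute("SELECT state, SUM(amount) FROM sales GROUP BY state")
)
```

After the load, the inconsistent spellings "ny" and "Ny" from the source are reported under the single code "NY".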
Data Warehouse Evolution - Stage 0
• No end-user access to production files
• “What we print” is “what you get”
[Diagram labels: Production Files, Application Reports, End User]
Data Warehouse Evolution - Stage 1
• End users denied direct access to production files
• Snapshots or copies of production files are made available instead
• Solution: Provide end users access to production systems
[Diagram labels: Production Files, Snapshot Files, Application Reports, End User]
No Integration Between Systems
[Diagram labels: System developed in 1979, Purchased company system, Purchased Package, New Application development, Rebuilt Application]
Data Warehouse Evolution - Stage 2
[Diagram: many Document icons grouped around Desktop computers, a Mainframe, a Server or Midrange system, and a 4GL]
Data Characteristics
Type                   Production                 Warehouse
Data Use               Operational                Mgt Reporting
Level of detail        Detailed                   Summary
Currency               Real time, latest value    Multiple generations
Longevity              Relatively brief           Forever
Stability              Dynamic                    Static
Scope of definition    Application wide           Enterprise wide
Data Operations        Capture/update             Read only
Data values            Coded                      Decoded
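Two of the contrasts above, coded versus decoded values and detailed versus summary data, can be shown in a short Python sketch. The status codes and policy rows are invented for illustration; actual code tables come from the source applications.

```python
from collections import defaultdict

# Hypothetical code table for a coded production column.
STATUS_CODES = {"A": "Active", "L": "Lapsed", "C": "Cancelled"}

# Production-style rows: detailed and coded.
production_rows = [
    {"policy": 101, "status": "A", "premium": 120.0},
    {"policy": 102, "status": "L", "premium": 80.0},
    {"policy": 103, "status": "A", "premium": 200.0},
]


def summarize(rows):
    """Decode each status and total the premium per decoded status."""
    totals = defaultdict(float)
    for row in rows:
        totals[STATUS_CODES[row["status"]]] += row["premium"]
    return dict(totals)


# Warehouse-style result: summarized and decoded.
summary = summarize(production_rows)
```

The summary is keyed by readable names such as "Active" rather than by the single-letter codes stored in production.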
Transforming the Logical Model
[Diagram labels: Brand Group, Brand, Product, SKU, Size, Type, Class, Shipment, Order Item, customer, time, Day, Week, Month, Calendar Year, market, Sales Rep, Route, District, Region]
Key Differences - Part 1
• Key differences between “data jails” (operational databases) and warehouses
– Subject orientation - operational systems are application-segmented
(e.g. for banks: auto loans, demand deposit accounting, or mortgages).
Subject areas for banks would be customer and each financial product
– Level of integration - warehouses resolve years of application
inconsistency in encoding/decoding, data name rationalization, etc.
– Update volatility - record-at-a-time updates in the operational
database vs. bulk loads in the data warehouse
– Time variance - norms include 30-90 days of transactions for
operational systems, 1-10 years for data warehouses
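The update-volatility and time-variance differences can be sketched with `sqlite3`. The `account` and `account_snapshot` tables and the snapshot dates are hypothetical; the point is only that the operational row is overwritten in place, while the warehouse table receives dated bulk loads that are never changed afterward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute(
    "CREATE TABLE account_snapshot (snapshot_date TEXT, id INTEGER, balance REAL)"
)
conn.execute("INSERT INTO account VALUES (1, 100.0)")


def bulk_load_snapshot(snapshot_date):
    # Warehouse style: append a whole batch stamped with the load date.
    # Snapshots loaded earlier are left untouched.
    conn.execute(
        "INSERT INTO account_snapshot SELECT ?, id, balance FROM account",
        (snapshot_date,),
    )


bulk_load_snapshot("2024-01-31")

# Operational style: one record updated in place; the old value is gone.
conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")

bulk_load_snapshot("2024-02-29")

current = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
history = conn.execute(
    "SELECT snapshot_date, balance FROM account_snapshot ORDER BY snapshot_date"
).fetchall()
```

Here `current` holds only the latest balance, while `history` keeps one row per load date.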
Key Differences - Part 2

Characteristic        Operational                      Warehouse
Transaction volume    High                             Low to huge
Response time         Very fast                        Reasonable
Updating              High volume                      Very low
Time Period           Current period                   Past to future
Scope                 Internal                         External
Activities            Focused, clerical, operational   Exploratory, analytical, managerial
Queries               Predictable, periodic            Can be unpredictable, ad hoc
Types of Warehouse Configurations

• Enterprise
• Division
• Functional
– Financial
– Personnel
– Engineering/Product
• Departmental
• Special Project
What’s Really Involved?
Data Warehouse Components
[Diagram labels:
Sources: Mainframe Applications (DB/2, VSAM, IMS), PC Applications (DB2/2), Midrange (DB/6000, DB/400), External Sources
Movement into the warehouse: Extract Programs, Data Cleansers/Scrubbers, Translators/Transformers, Timing Tools, Data Loading Rates, Data Policies, File Transfer
Combined Warehouse: Reserves, Customers, Claims, Premiums
Uses: Management Reporting, Sales/Marketing, Customer Relations, Reserve Analysis, Risk Analysis, Decision Support Tools]
Typical Users of a Data Warehouse
• Decision Support Analysts, Business Analysts
– Marketing, Actuaries, Financial, Sales, Executive
• Grocery Store attitudes
– Going to the store, not knowing what they want
– Close proximity says give me “everything”
• Explorers
– Don’t know what they want
– Search on a random basis, non-repetitively
– Frequently find nothing, but when they do find something, the
rewards are huge
• Farmers
– Know what they want
– Search non-randomly and find frequent “flakes of gold”
– Find small amounts of data
Advanced Warehouse Topics
• Metadata repositories
– Information about the data in the warehouse
• Like a library card catalog
• Data about when the information was created, which files were
accessed, and how much data was involved
• Data about changes in business rules, processes
• Context versus Content
– “What does it mean?”
• Data Mining
– Drilling down into databases with tools to find specific anomalies
• Online Analytical Processing (OLAP)
– Really means summary data
OLTP Vs OLAP
Bill Inmon Vs Ralph Kimball
Bill Inmon                                  Ralph Kimball
DW = Normalized Central Data Storage        DW = Union of Data Marts
Top Down Approach                           Bottom Up Approach
E-R Modeling                                Denormalized Dimensional
3rd Normal Form                             Multi-Dimensional DB design
Time attributes spread in diff. tables      Time Dimension
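As a rough illustration of the dimensional side of this comparison, the sketch below builds a tiny star schema in `sqlite3` with one fact table, a product dimension, and an explicit time dimension. Table names, column names, and sample rows are assumptions made for the example, not a prescribed design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE dim_time (
        time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, sku TEXT, brand TEXT
    );
    CREATE TABLE fact_sales (
        time_key INTEGER REFERENCES dim_time,
        product_key INTEGER REFERENCES dim_product,
        units INTEGER
    );
    INSERT INTO dim_time VALUES
        (1, '2024-01-15', '2024-01', 2024),
        (2, '2024-02-03', '2024-02', 2024);
    INSERT INTO dim_product VALUES (1, 'SKU-1', 'Acme'), (2, 'SKU-2', 'Bolt');
    INSERT INTO fact_sales VALUES (1, 1, 5), (1, 2, 3), (2, 1, 7);
    """
)

# Time attributes live in one dimension table, so grouping by month
# is a single join rather than date logic spread across tables.
units_by_month_brand = db.execute(
    """
    SELECT t.month, p.brand, SUM(f.units)
    FROM fact_sales f
    JOIN dim_time t ON t.time_key = f.time_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY t.month, p.brand
    ORDER BY t.month, p.brand
    """
).fetchall()
```

In a normalized (third normal form) design the same question would typically involve more tables and joins, which is the trade-off the comparison above points at.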
• Aggregations - Information stored in a data warehouse in a summarized form

• Alert - A message that is sent automatically by a computer system when a certain situation occurs.
• Attribute - Additional information included with a dimension, that is not used in defining the levels of the dimension.
• Business Intelligence Tools - Software that enables business users to see and use large amounts of complex data.
• Changing Dimensions - A dimension that has level or attribute data that needs to be updated.
• Conformed Dimension - A dimension that is used in more than one Fact/Cube.
• Cube - Also Known As: Multidimensional Cube. The fundamental structure for data in a multidimensional (OLAP) system.
• Data Cleansing - Removing errors and inconsistencies from data being imported into a data warehouse.
• Data Mart - Also Known As: Local Data Warehouse or Datamart.
• Data Migration - The movement of data from one environment to another.
• Data Mining - The process of finding hidden patterns and relationships in the data.
• Data Quality Assurance - Also Known As: Data Cleansing or Data Scrubbing.
• Data Warehouse - A database where data is collected for the purpose of being analyzed.
• Data Warehousing - The process of visioning, planning, building, using, managing, maintaining, and enhancing data warehouses and/or data marts.
• Database Management System (DBMS) - The software that is used to store, access, and manage data.
• Decision Support System (DSS) - A computer system designed to assist an organization in making decisions.
• Dimension Table - In a star schema, a table that contains the data for one of the perspectives that can be used to analyze the data.
• Drill Down - Changing the view of the data to a greater level of detail.
• Drill Up - Changing the view of the data to a higher level of aggregation.
• ETL (Extract, Transform, and Load) - ETL refers to the process of getting data out of one data store (Extract), modifying it (Transform), and inserting it into a different data store (Load).
• EDW – Enterprise Data Warehouse
• Fact table - In a star schema, the central table which contains the individual facts being stored in the database.
• Granularity - The level of detail of the facts stored in a data warehouse.
• Hierarchy - Organization of data into a logical tree structure.
• Measure - A numeric value stored in a fact table and in an OLAP cube.
• Metadata - Data that describes the data in the warehouse.
• Non-Volatile - Data that does not change.
• Normalization - The process of organizing data in accordance with the rules of a relational database.
• ODS – Operational Data Store
• OLAP (On-Line Analytical Processing) - The use of computers to analyze an organization's data.
• OLTP (OnLine Transaction Processing) - The use of computers to run the on-going operation of a business.
• Relational Database Management System (RDBMS) - A Database Management System based on relational theory.
• Replication - The physical copying of data from one database to another.
• Schema - The logical organization of data in a database.
• Slice and Dice - The ability to move between different combinations of dimensions when viewing data.
• Slowly Changing Dimensions (SCD) - A dimension that has levels or attributes that are changing on an occasional basis.
• SQL (Structured Query Language) - The standard language for accessing relational databases.
• Snowflaking - Normalization applied to the dimension tables of a star schema.
• Star Schema (Business Definition) - A method of organizing information in a data warehouse that allows the business information to be viewed from many perspectives.
• Star Schema (Technical Definition) - A database design that consists of a fact table and one or more dimension tables.
• Time-variant data - Data that is identified with a particular time period.
• XML (eXtensible Markup Language) - A method of sharing data between disparate data systems, without needing a direct connection between them.
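
Several of the terms above (measure, hierarchy, drill down, drill up, slice and dice) can be illustrated with a small Python sketch. The fact rows, the level names, and the helper functions are hypothetical; OLAP tools perform the same kind of regrouping over far larger data.

```python
from collections import defaultdict

# Hypothetical facts: (year, month, region, sales). "sales" is the measure.
facts = [
    (2024, 1, "East", 100),
    (2024, 1, "West", 50),
    (2024, 2, "East", 70),
    (2023, 12, "East", 40),
]

# Position of each dimension level within a fact row.
LEVELS = {"year": 0, "month": 1, "region": 2}


def aggregate(rows, by):
    """Total the sales measure grouped by the named dimension levels."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[LEVELS[name]] for name in by)] += row[3]
    return dict(totals)


def slice_rows(rows, level, value):
    """Keep only the rows where one dimension level has a fixed value."""
    return [row for row in rows if row[LEVELS[level]] == value]


by_year = aggregate(facts, ["year"])                 # drilled up
by_year_month = aggregate(facts, ["year", "month"])  # drilled down
east_by_year = aggregate(slice_rows(facts, "region", "East"), ["year"])  # slice
```

Moving from `by_year` to `by_year_month` is a drill down along the time hierarchy, and going back is a drill up.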
