
Recent Developments in Data Warehousing
Hugh J. Watson
Terry College of Business
University of Georgia
hwatson@terry.uga.edu
http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt
Tutorial Objectives
 Provide an overview of data
warehousing
 Provide materials to support the
teaching of data warehousing
 Discuss recent developments in data
warehousing
The Importance of Data
Warehousing
 Provide a “single version of the truth”
 Improve decision making
 Support key corporate initiatives such as
performance management, B2C and B2B
e-commerce, and customer relationship
management
 Estimated to be a $113.5 billion market in
2002 for systems, software, services, and
in-house expenditures (Palo Alto
Management Group)
A Simple Definition

A data warehouse is a collection of data created to support
decision-making applications.
Data Warehouse
Characteristics
 Subject oriented -- data are organized
around sales, products, etc.
 Integrated -- data are integrated to
provide a comprehensive view
 Time variant -- historical data are
maintained
 Nonvolatile -- data are not updated by
users
Another Definition

Data warehousing is the entire process of data extraction,
transformation, and loading of data to the warehouse and the
access of the data by end users and applications.
Data Mart

 A data mart stores data for a limited number of subject areas,
such as marketing and sales data. It is used to support specific
applications.
 An independent data mart is created directly from source systems.
 A dependent data mart is populated from a data warehouse.
Operational Data Store

 An operational data store consolidates data from multiple source
systems and provides a near real-time, integrated view of volatile,
current data.
 Its purpose is to provide integrated data for operational
purposes. It has add, change, and delete functionality.
 It may be created to avoid a full-blown ERP implementation.
[Figure: data warehousing architecture, read left to right in five layers.
Data Sources: transaction data (production, marketing, HR, finance, accounting), other internal data, ERP, Web clickstream data, and external demographic data.
ETL Software: extract, clean/scrub, transform, and load.
Data Stores: staging area, operational data store, data warehouse, data marts (finance, marketing, sales), and meta data.
Data Analysis Tools and Applications: SQL, queries, reporting, DSS/EIS, data mining, and Web browser.
Users: analysts, managers, executives, operational personnel, and customers/suppliers.
Vendor names shown: IBM, Oracle, Sybase, Informix, SAP, Ascential, Informatica, Sagent, SAS, Harte-Hanks, Firstlogic, Teradata, Microsoft, Cognos, Essbase, MicroStrategy, Siebel, and Business Objects.]
Topics Covered
 Definitions and concepts
 Two case studies: Harrah’s Entertainment (first)
and Owens & Minor (last)
 The data mart and enterprise-wide data
warehouse strategies
 Data extraction, cleansing, transformation and
loading
 Meta data
 Data stores
 Online analytical processing (OLAP)
 Warehouse users, tools, and applications
Harrah’s Entertainment
 Harrah’s Entertainment -- data warehousing
supported a successful shift to a CRM-oriented
corporate strategy. Winner of the 2000 TDWI
Leadership Award.
 Operates 21 casinos across the country
 In 1993, the gaming laws changed, which allowed
Harrah’s to expand
 Harrah’s decided to compete using a brand
strategy supported by information technology
 Needed to know their customers exceptionally well
Harrah’s Data Warehousing
Architecture
 WINet sources data from the casino,
hotel, and event systems
 The patron data base serves as an
operational data store
 The marketing workbench serves as
the data warehouse
Sample Applications
 Operational personnel use PDB to
check the preferences, history, and
value of customers
 Analysts use PDB and MWB to create
offers to visit a Harrah’s casino
 Analysts use MWB to support
predictive modeling efforts
 Predict the value of a customer
 Market based on that expected value
 Track transactions that are linked to marketing initiatives
 Evaluate the effectiveness
 Track profitability
 Refine marketing approaches
[Figure: marketing cycle. Define (objectives, tests, control cells); Execute customer treatment (right offer, right message, right time); Track customer action/non-action; Measure (profit and loss, behavior change, new test report); Learn.]
Customer Relationship Lifecycle
[Figure: annual revenue plotted against length of relationship, across three phases: Establish, Strengthen, Reinvigorate.]
Two Data Warehousing
Strategies
 Enterprise-wide warehouse, top
down, the Inmon methodology
 Data mart, bottom up, the Kimball
methodology
 When properly executed, both result
in an enterprise-wide data
warehouse
The Data Mart Strategy
 The most common approach
 Begins with a single mart and architected marts
are added over time for more subject areas
 Relatively inexpensive and easy to implement
 Can be used as a proof of concept for data
warehousing
 Can perpetuate the “silos of information” problem
 Can postpone difficult decisions and activities
 Requires an overall integration plan
The Enterprise-wide Strategy
 A comprehensive warehouse is built initially
 An initial dependent data mart is built using
a subset of the data in the warehouse
 Additional data marts are built using subsets
of the data in the warehouse
 Like all complex projects, it is expensive,
time consuming, and prone to failure
 When successful, it results in an integrated,
scalable warehouse
Data Sources and Types
 Primarily from legacy, operational
systems
 Almost exclusively numerical data at the
present time
 External data may be included, often
purchased from third-party sources
 Technology exists for storing unstructured
data, which is expected to become more
important over time
Extraction, Transformation,
and Loading (ETL) Processes
 The “plumbing” work of data
warehousing
 Data are moved from source to
target data bases
 A very costly, time consuming part
of data warehousing
Recent Development:
More Frequent Updates
 Updates can be done in bulk and
trickle modes
 Business requirements, such as
trading partner access to a Web site,
require current data
 For international firms, there is no
good time to load the warehouse
Recent Development:
Clickstream Data
 Results from clicks at web sites
 A dialog manager handles user
interactions. An ODS helps to custom
tailor the dialog
 The clickstream data is filtered and
parsed and sent to a data warehouse
where it is analyzed
 Software is available to analyze the
clickstream data
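The filter-and-parse step can be sketched as follows. This is a minimal illustration, not a description of any particular product: it assumes the web server writes Common Log Format lines, and the function name parse_clicks, the ignored file suffixes, and the sample log lines are all hypothetical.

```python
import re
from datetime import datetime

# Common Log Format pattern; the field layout is an assumption about the web server.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Requests for these file types are page furniture, not user clicks (assumed list).
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def parse_clicks(lines):
    """Filter and parse raw log lines into click records ready for loading."""
    clicks = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # filter: malformed line
        if match["status"] != "200":
            continue  # filter: failed or redirected request
        if match["path"].lower().endswith(IGNORED_SUFFIXES):
            continue  # filter: embedded images, style sheets, scripts
        clicks.append({
            "host": match["host"],
            "time": datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z"),
            "page": match["path"],
        })
    return clicks


sample_log = [
    '10.0.0.1 - - [15/Jan/2001:10:00:00 -0500] "GET /products/tv.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [15/Jan/2001:10:00:01 -0500] "GET /images/logo.gif HTTP/1.0" 200 812',
    '10.0.0.2 - - [15/Jan/2001:10:02:30 -0500] "GET /missing.html HTTP/1.0" 404 0',
    'not a log line',
]
clicks = parse_clicks(sample_log)
```

Only the first sample line survives the filters; the image request, the failed request, and the malformed line are dropped before anything reaches the warehouse.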
Recent Development:
Further Automation of ETL Processes

 MetaRecon from Metagenix reverse engineers data into information
 Analyzes and profiles source systems
 Uncovers problems in source systems
 Recommends primary and secondary
keys, dimensions and measures, etc.
 Generates ETL scripts
Data Extraction
 Often performed by COBOL routines
(not recommended because of high program
maintenance and no automatically generated
meta data)
 Sometimes source data is copied to the
target database using the replication
capabilities of a standard RDBMS (not
recommended because of “dirty data” in the
source systems)
 Increasingly performed by specialized ETL
software
Sample ETL Tools
 DataStage from Ascential Software
 SAS System from SAS Institute
 PowerMart/PowerCenter from
Informatica
 Sagent Solution from Sagent
Software
 Hummingbird Genio Suite from
Hummingbird Communications
Reasons for “Dirty” Data
 Dummy Values
 Absence of Data
 Multipurpose Fields
 Cryptic Data
 Contradicting Data
 Inappropriate Use of Address Lines
 Violation of Business Rules
 Reused Primary Keys
 Non-Unique Identifiers
 Data Integration Problems
Data Cleansing
 Source systems contain “dirty data” that must
be cleansed
 ETL software contains rudimentary data
cleansing capabilities
 Specialized data cleansing software is often
used. Important for performing name and
address correction and householding functions
 Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and
Firstlogic (i.d.Centric)
Steps in Data Cleansing
 Parsing
 Correcting
 Standardizing
 Matching
 Consolidating
Parsing
 Parsing locates and identifies
individual data elements in the
source files and then isolates these
data elements in the target files.
 Examples include parsing the first,
middle, and last name; street
number and street name; and city
and state.
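A minimal sketch of parsing, assuming simple, well-behaved input. The function names and field names are hypothetical; commercial cleansing tools handle far more variation (suffixes, compound surnames, unit numbers) than this illustration does.

```python
import re


def parse_name(full_name):
    """Split a free-form name into first, middle, and last elements."""
    parts = full_name.split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return {"first": first, "middle": middle, "last": last}


def parse_street(street_line):
    """Isolate the street number from the street name."""
    match = re.match(r"\s*(\d+)\s+(.*\S)\s*$", street_line)
    if match is None:
        # No leading number found; keep the whole line as the street name.
        return {"street_number": "", "street_name": street_line.strip()}
    return {"street_number": match.group(1), "street_name": match.group(2)}


def parse_city_state(text):
    """Split 'City, ST' at the first comma."""
    city, _, state = text.partition(",")
    return {"city": city.strip(), "state": state.strip()}


name = parse_name("John Q. Public")
street = parse_street("123 Main Street")
place = parse_city_state("Athens, GA")
```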
Correcting
 Corrects parsed individual data
components using sophisticated data
algorithms and secondary data
sources.
 Examples include replacing a vanity
address and adding a zip code.
Standardizing
 Standardizing applies conversion
routines to transform data into its
preferred (and consistent) format
using both standard and custom
business rules.
 Examples include adding a pre
name, replacing a nickname, and
using a preferred street name.
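The three examples above can be sketched with small lookup tables. The tables, the field names, and the rule for deriving a pre name from a sex code are assumptions made for illustration; real tools ship with much larger reference data and let custom business rules override the defaults.

```python
# Hypothetical reference tables; real products ship far larger ones.
NICKNAMES = {"bob": "Robert", "bill": "William", "beth": "Elizabeth"}
STREET_SUFFIXES = {"st": "Street", "st.": "Street", "ave": "Avenue",
                   "ave.": "Avenue", "rd": "Road", "rd.": "Road"}
PRE_NAMES = {"M": "Mr.", "F": "Ms."}


def standardize(record):
    """Return a copy of a parsed record in the preferred, consistent format."""
    out = dict(record)

    # Replace a nickname with the formal first name.
    first = record["first"].strip()
    out["first"] = NICKNAMES.get(first.lower(), first.title())

    # Add a pre name when the record does not already carry one.
    if not record.get("pre_name"):
        out["pre_name"] = PRE_NAMES.get(record.get("sex", ""), "")

    # Use the preferred spelling of the street suffix.
    words = record["street_name"].split()
    if words:
        words[-1] = STREET_SUFFIXES.get(words[-1].lower(), words[-1])
    out["street_name"] = " ".join(words)
    return out


clean = standardize({"first": "bob", "sex": "M", "street_name": "Main St."})
```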
Matching
 Searching and matching records
within and across the parsed,
corrected and standardized data
based on predefined business rules
to eliminate duplications.
 Examples include identifying similar
names and addresses.
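One simple way to identify similar names and addresses is a string-similarity score with a cutoff. The sketch below uses Python's standard difflib module; the 0.85 threshold and the choice to compare name and address as one combined string are assumptions, standing in for the predefined business rules a real matching tool would apply.

```python
from difflib import SequenceMatcher
from itertools import combinations

MATCH_THRESHOLD = 0.85  # assumed cutoff; a business rule in practice


def match_key(record):
    """Build a comparison string from the standardized name and address."""
    return f'{record["name"]}|{record["address"]}'.lower()


def find_matches(records, threshold=MATCH_THRESHOLD):
    """Return index pairs of records that look like the same party."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = SequenceMatcher(None, match_key(a), match_key(b)).ratio()
        if score >= threshold:
            pairs.append((i, j))
    return pairs


customers = [
    {"name": "Jon Smith", "address": "12 Main Street"},
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Mary Jones", "address": "9 Oak Avenue"},
]
duplicates = find_matches(customers)
```

Comparing every pair of records grows quadratically with the number of records, so this approach only suits small files; it is shown here because it keeps the idea visible.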
Consolidating
 Analyzing and identifying
relationships between matched
records and consolidating/merging
them into ONE representation.
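A minimal sketch of consolidation, assuming a simple survivorship rule: for each field, the first non-empty value among the matched records wins. The rule and the field names are illustrative assumptions; real tools let the business choose which source is most trusted for each field.

```python
def consolidate(matched_records):
    """Merge records already identified as the same party into one record."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            # Keep the first non-empty value seen for each field.
            if value and not merged.get(field):
                merged[field] = value
    return merged


golden = consolidate([
    {"name": "John Smith", "phone": "", "zip": "30602"},
    {"name": "Jon Smith", "phone": "555-0100", "zip": ""},
])
```

Here the merged record takes its name and zip code from the first record and its phone number from the second.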
Data Staging
 Often used as an interim step between data
extraction and later steps
 Accumulates data from asynchronous sources
using native interfaces, flat files, FTP sessions, or
other processes
 At a predefined cutoff time, data in the staging
file is transformed and loaded to the warehouse
 There is usually no end user access to the
staging file
 An operational data store may be used for data
staging
Data Transformation
 Transforms the data in accordance
with the business rules and
standards that have been
established
 Examples include format changes,
deduplication, splitting up fields,
replacement of codes, derived
values, and aggregates
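Several of these transformations can be sketched on one source row. The source layout, the gender code table, and the function names below are hypothetical; the point is only to show a format change, a field split, a code replacement, a derived value, and an aggregate side by side.

```python
from collections import defaultdict
from datetime import datetime

GENDER_CODES = {"1": "M", "2": "F"}  # hypothetical source-system codes


def transform(row):
    """Apply business rules to one extracted row."""
    last, _, first = row["name"].partition(",")  # split one field into two
    units = int(row["units"])
    return {
        "first_name": first.strip(),
        "last_name": last.strip(),
        # Format change: MM/DD/YYYY text becomes an ISO date string.
        "sale_date": datetime.strptime(row["sale_date"], "%m/%d/%Y").date().isoformat(),
        # Replacement of codes, with "U" for unknown values.
        "gender": GENDER_CODES.get(row["gender_code"], "U"),
        "units": units,
        # Derived value computed from two source fields.
        "sales_amount": round(units * float(row["unit_price"]), 2),
    }


def aggregate_by_date(rows):
    """Aggregate: total sales amount per day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sale_date"]] += row["sales_amount"]
    return dict(totals)


source_rows = [
    {"name": "Smith, John", "sale_date": "01/15/2001", "gender_code": "1",
     "units": "3", "unit_price": "250.00"},
    {"name": "Jones, Mary", "sale_date": "01/15/2001", "gender_code": "9",
     "units": "1", "unit_price": "100.50"},
]
transformed = [transform(row) for row in source_rows]
daily_totals = aggregate_by_date(transformed)
```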
Data Loading
 Data are physically moved to the
data warehouse
 The loading takes place within a
“load window”
 The trend is to near real time
updates of the data warehouse as
the warehouse is increasingly used
for operational applications
Meta Data
 Data about data
 Needed by both information technology
personnel and users
 IT personnel need to know data sources and
targets; database, table and column names;
refresh schedules; data usage measures; etc.
 Users need to know entity/attribute
definitions; reports/query tools available;
report distribution information; help desk
contact information, etc.
Recent Development:
Meta Data Integration
 A growing realization that meta data is
critical to data warehousing success
 Progress is being made on getting
vendors to agree on standards and to
incorporate the sharing of meta data
among their tools
 Vendors like Microsoft, Computer
Associates, and Oracle have entered the
meta data marketplace with significant
product offerings
Database Vendors
 High end (i.e., terabyte plus)
vendors include IBM (DB2) and
NCR-Teradata (Teradata)
 Oracle (8i) and Microsoft (SQL
Server 7) are major players for
smaller databases
On-line Analytical
Processing (OLAP)
 A set of functionality that facilitates
multidimensional analysis
 Allows users to analyze data in ways
that are natural to them
 Comes in many varieties -- ROLAP,
MOLAP, DOLAP, etc.
ROLAP
 Relational OLAP
 Uses an RDBMS to implement an OLAP
environment
 Typically involves a star schema to
provide the multidimensional capabilities
 OLAP tool manipulates RDBMS star
schema data
 Called slowlap by MOLAP vendors
MOLAP
 Multidimensional OLAP
 Uses an MDDBS (e.g., Essbase) to
store and access data
 Usually requires proprietary
(non SQL) data access tools
 Provides exceptionally fast response
times
Star Schema
 Creates non-normalized data
structures
 Easier for users to understand
 Optimized for OLAP
 Uses fact tables (the facts or measures in the
business) and dimension tables (which
establish the context of the facts)
OLAP Tools
 Products come from vendors such as Brio, Cognos,
Hyperion, and BusinessObjects
 Typically available as a fat or thin (i.e., browser) client
 In a web environment, the browser communicates with a
web server, which talks to an application server, which
connects to backend databases
 The application server provides query, reporting, and OLAP
analysis functionality over the web
 Java applets or downloaded components augment the thin
client
 A broadcast server may be used to schedule, run, publish,
and broadcast reports, alerts, and responses over the LAN,
email, or personal digital assistant.
Star Schema
[Figure: star schema with a central Claim fact table and five dimension tables; # marks a key column]
 Claim (fact table) -- #Physician ID, #Patient ID, #Service Code, #Payer ID, #Claim Number, #Line Item Number, #Claim Date, Date of Services, Amount of Charge, Unit of Services
 Patient -- #Patient ID, Patient Name, Address, Age, Sex, Insurance ID
 Physician -- #Physician ID, Physician Name, Specialty ID, Credential ID
 Service -- #Service Code, Service Description, #Category Code
 Payer -- #Payer ID, Name, Address, Phone Number, EDI Number
 Time Periods -- #Claim Date, Year, Month, Quarter, Week
Dimension Table Examples
 Retail -- store name, zip code, product
name, product category, day of week
 Telecommunications -- call origin, call
destination
 Banking -- customer name, account
number, branch, account officer
 Insurance -- policy type, insured party
Fact Table Examples
 Retail -- number of units sold, sales
amount
 Telecommunications -- length of
call in minutes, average number of
calls
 Banking -- average monthly
balance
 Insurance -- claims amount
The Fact Table Key Concatenates
the Dimension Keys
Assume that you want to know the
number of television sets sold
to Best Buys on January 15, 2001.
The query might be:
SELECT CLIENT.CUSNAME, SALES.NOSOLD
FROM CLIENT, PRODUCT, TIME, SALES
WHERE CLIENT.CUSNAME = SALES.CUSNAME
  AND PRODUCT.PRODNAME = SALES.PRODNAME
  AND TIME.DATE = SALES.DATE
  AND CLIENT.CUSNAME = "BEST BUYS"
  AND PRODUCT.PRODNAME = "TELEVISION"
  AND TIME.DATE = #01/15/2001#
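The same question can be tried end to end with Python's built-in sqlite3 module. This is a sketch under stated assumptions, not the tutorial's own schema: the integer surrogate keys (client_key, product_key, time_key), the table name time_period (chosen to avoid the reserved-looking names TIME and DATE), the ISO date text, and the sample rows are all hypothetical. The fact table's primary key is the concatenation of the three dimension keys, as the slide title describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client      (client_key  INTEGER PRIMARY KEY, cusname   TEXT);
CREATE TABLE product     (product_key INTEGER PRIMARY KEY, prodname  TEXT);
CREATE TABLE time_period (time_key    INTEGER PRIMARY KEY, sale_date TEXT);

-- Fact table: its key concatenates the dimension keys.
CREATE TABLE sales (
    client_key  INTEGER REFERENCES client(client_key),
    product_key INTEGER REFERENCES product(product_key),
    time_key    INTEGER REFERENCES time_period(time_key),
    nosold      INTEGER,
    PRIMARY KEY (client_key, product_key, time_key)
);

INSERT INTO client      VALUES (1, 'BEST BUYS'), (2, 'OTHER STORE');
INSERT INTO product     VALUES (1, 'TELEVISION'), (2, 'RADIO');
INSERT INTO time_period VALUES (1, '2001-01-15'), (2, '2001-01-16');
INSERT INTO sales VALUES (1, 1, 1, 40), (1, 2, 1, 15), (2, 1, 1, 7), (1, 1, 2, 22);
""")

# Join the fact table to each dimension, then filter on dimension attributes.
rows = conn.execute(
    """
    SELECT c.cusname, s.nosold
    FROM sales AS s
    JOIN client      AS c ON c.client_key  = s.client_key
    JOIN product     AS p ON p.product_key = s.product_key
    JOIN time_period AS t ON t.time_key    = s.time_key
    WHERE c.cusname = ? AND p.prodname = ? AND t.sale_date = ?
    """,
    ("BEST BUYS", "TELEVISION", "2001-01-15"),
).fetchall()
conn.close()
```

Joining on compact surrogate keys rather than on names keeps the fact table narrow and lets a dimension attribute such as a customer name change without touching the fact rows.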
Warehouse Users
 Analysts
 Managers
 Executives
 Operational personnel
 Customers and suppliers
Warehouse Tools and
Applications
 SQL queries
 Managed query environments
 Structured and ad hoc reports
 DSS/EIS
 Portals
 Data mining
 Packaged applications
 Custom-built applications
Recent Development:
Growing Dominance of MS SQL
Server 7.0 with OLAP Services
 Low cost, integration of bundled
DSS components from one vendor,
and extended SQL for OLAP
 Competitors are either leaving the
market or are repositioning their
products to be complementary
Recent Development:
Enterprise Intelligence Portals
 Offers users an effective way to access
information scattered across networked
enterprise systems through a simple and
personalized Web interface
 Provides access to structured and
unstructured data
 Potentially integrates data warehousing
and knowledge management
Owens & Minor
 Owens & Minor -- data warehousing has supported
integration along the supply chain. Winner of the
1999 TDWI Leadership Award
 The nation's leading distributor of name-brand
medical and surgical supplies
 Has transformed its business model by integrating
supply chain management, e-business, data
warehousing, and Internet technologies
 As part of this initiative, WISDOM
(WebIntelligence Supporting Decisions from
Owens & Minor) has been especially valuable
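[Figure: supply chain running from Raw Materials Suppliers to Manufacturer to Owens & Minor to Provider to Patient, with PRODUCT and INFORMATION flows along the chain. Owens & Minor sits between 1,400+ manufacturers and 4,000+ acute care facilities.]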
WISDOM
 A Web-based decision support system that
provides information to OM’s employees,
suppliers and customers
 Accesses data from a data warehouse that
maintains supplier and customer transaction
data
 Sold to trading partners as a value-added
product
 WISDOM II provides data about the
transactions that suppliers and customers
have with all of their trading partners
Sample Applications
 Supports reporting and queries for internal
personnel
 Supports an EIS for senior management
 Suppliers can determine their market share
in specific hospitals
 Hospitals can identify which products are
being bought off contract
 WISDOM II extends data warehousing to
trading partners through an outsourcing
arrangement
Articles
 Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue, "Data Warehousing
Supports Corporate Strategy at First American Corporation," MIS Quarterly,
(December 2000), pp. 547-567. Provides a case study of how the First American
Corporation turned their strategy and fortunes around through the use of data
warehousing.
 Stoller, Wixom, and Watson, “WISDOM Provides Competitive Advantage at
Owens & Minor,” (http://terry.uga.edu/~watson/owens&minor.doc) Provides a
case study of how data warehousing can support supply chain integration.
 Watson, Wixom, Buonamici, and Revak, “Sherwin-Williams' Data Mart Strategy:
Creating Intelligence Across the Supply Chain,” Communications of the AIS, April
2001. Provides a textbook example of how to implement a data mart strategy.
 Watson, H.J., D.A. Annino, B.H. Wixom, K.L. Avery, and M. Rutherford, “Current
Practices in Data Warehousing,” Information Systems Management, (Winter,
2001), pp. 47-55. Provides data on companies’ data warehousing experiences,
with an emphasis on the benefits being realized.
 Watson, H.J. and L. Volonino, “Harrah’s High Payoff from Customer
Information,” (http://www.terry.uga.edu/~hwatson/harrahs.doc) Provides a
case study of how Harrah’s Entertainment has implemented a CRM strategy
facilitated by data warehousing.
Books
 Devlin, Data Warehouse -- Architecture to Implementation, Addison-
Wesley, 1997.
 Gray and Watson, Decision Support in the Data Warehouse, Prentice-Hall,
1998.
 Kimball, The Data Warehouse Toolkit, Wiley, 1996.
 Kimball and Merz, The Data Webhouse Toolkit, Wiley, 2000.
 Inmon, Building the Operational Data Store, second edition, Wiley, 1999.
 Inmon, Imhoff, and Sousa, Corporate Information Factory, Wiley, 1999.
Websites
 http://www.olapreport.com
(provides detailed information about the OLAP
market, products, and applications)
 http://www.firstlogic.com
(includes an interactive demo of their data cleansing
tool)
 http://www.billinmon.com
(a wealth of current information from “the father of
data warehousing”)
 http://www.metagenix.com
(illustrates recent advances in ETL tools)
 http://www.microstrategy.com
(excellent materials from one of the leading DSS
vendors)
Questions
