Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 60

Data Warehousing

Version 6.0 - 04/18/2000

CIS 465 - Data Warehousing 1


Can your database answer
questions like these?
• What is the cost of staff to break into a new line of
business?
• What are the travel routes of my competition’s
inventory?
• At what velocity is my competitor moving toward a
common goal?
• How will a transaction on a certain date be affected
by currency exchange rates?
• Is a foreign labor source likely to produce a higher
quality product?
• Which 20% of the problem creates 80% of the
problems?

CIS 465 - Data Warehousing 2


Can your database answer
questions like these?
• By product and location, how can we regain a
lost customer base?
• Which skill and staff levels are most likely to
accept the voluntary layoff package?

CIS 465 - Data Warehousing 3


Data is Difficult to Manage

• Amount of data increases exponentially; past


data must be kept for long times; new data are
added rapidly.
• Data are scattered throughout organizations;
collected by many individuals; using different
methods and devices.
• Only small portions of an organization’s data
are relevant for specific decisions.
• An ever-increasing amount of external data
needs to be considered in making
organizational decisions.

CIS 465 - Data Warehousing 4


Data is Difficult to Manage - contd.

• Raw data may be stored in different computing


systems, formats, and human and computer
languages.
• Legal requirements relating to data differ
among countries and change frequently.
• Selecting data management tools can be
difficult because of the large number of tools
available.
• Data security, quality, and integrity are critical,
yet easily jeopardized.

CIS 465 - Data Warehousing 5


Data and Knowledge
Managementment
• Businesses do not run on data. They run on
information and their knowledge of how to put
that information to use successfully.

CIS 465 - Data Warehousing 6


Some Information Concepts
• Data: Unorganized facts and figures. (raw material)
• Information: Data that has been processed into a form
that is meaningful to the recipient and is of real of
perceived value in current or prospective actions or
decisions.
• Information:
– adds to a representation
– corrects or confirms previous information
– has “surprise” value in that it tells us something we did not know,
or could not predict.
– What is a “finished product” to one, may be “raw materials” to
someone else.

CIS 465 - Data Warehousing 7


Definitions: Information vs.
Knowledge

• Knowledge: a combination of instincts, ideas,


rules, and procedures that guide actions and
decisions.
• Helping to provide the best available
knowledge to decision-making is another role
of information systems

CIS 465 - Data Warehousing 8


Relationship Between Data,
Information, and Knowledge
• The difference between data and information is easy to
remember.
• It is often cited as the reason why systems that collect large
amounts of information fail to meet management’s
information needs.
• There are many methods of converting data into information
for decision making.
• Managers take action based on information about a current
situation plus their accumulated knowledge. Actions taken
feed the process of accumulating more knowledge
(experience).
• Example: How do medical students become competent
physicians?

CIS 465 - Data Warehousing 9


Relationship Between Data,
Information, and Knowledge

CIS 465 - Data Warehousing 10


Attributes of Quality
Information
• Timeliness
• Completeness
• Conciseness
• Relevance
• Accuracy
• Precision
• Appropriateness of Form

CIS 465 - Data Warehousing 11


Special Characteristics of
Information
• Usefulness - depends on combination of
quality,accessibility,and presentation.
• One person’s information may be another
person’s noise.
• Soft data may be as important as hard data.
• Ownership of information may be hard to
maintain.
• More information is not always better
(information overload).
• Politics can often hide or distort information.

CIS 465 - Data Warehousing 12


Sources of Data

• Internal Data
– Data about people, products, services, processes.
– Often stored in corporate data bases (e.g. Sales or HR).
– Some data may be disparate in different regions, but
accessible by networks.
• Personal Data
– Individuals document expertise by creating personal data -
subjective estimates. Some is kept in heads, or mental
models.
– Can be store on PCs, or available on the Web.
• External Data
– Many sources.

CIS 465 - Data Warehousing 13


Some Sources of Business
External Data
• Federal Publications • Other-contd.
– Survey of Current Business – ABI/Inform
– Monthly Labor Review – Annual Editor & Publisher
– Federal Reserve Bulletin Market Guide
– Employment and Earnings – Thomas Register On-line
– Commerce Business Daily • Indexes
– Census Bureau – Encyclopedia of Business
• Other Information Sources
– International Monetary
Fund
– Moody’s
– Standard & Poor’s
– Advertising Age
– Dialog and Lexis/Nexis

CIS 465 - Data Warehousing 14


What is a data warehouse?

• A data warehouse is a pool of data organized in


a format that enables users to interpret data and
convert it into useful information to gain
knowledge from this interpretation.
• It is a single place that contains complete and
consistent data from multiple sources.
• Data warehousing is the act of a business
person extracting business value from the data
stored in the data warehouse.

CIS 465 - Data Warehousing 15


Collecting Raw Data

• Is not Easy
– collect in the field
– elicit from people
– collect manually, electronically, or by sensors.
• Data collection technology has not kept pace
with advances of data storage technology.
• Data collection from external sources is not easy
either.
• Bottom Line: Garbage IN, Garbage OUT - GIGO.
• Data Quality is an Important Issue.

CIS 465 - Data Warehousing 16


Data Quality Issues

• Intrinsic DQ:
– accuracy, objectivity, believability, reputation.
• Accessibility DQ:
– Accessibility and access security
• Contextual DQ
– relevancy, value-added, timeliness,completeness, amount of
data.
• Representation DQ:
– Interpretability, ease of understanding,concise
representation, consistent representation

CIS 465 - Data Warehousing 17


Why Data Warehousing?

• Managers do not make decisions that are


“good” or “bad”, they make decisions on the
basis of good or bad information.
• Management information =
– a. the right information
– b. in the right form
– c. at the right time.”
• Most transaction-based information systems
have difficulty delivering this information.

CIS 465 - Data Warehousing 18


Why Data Warehousing - 2

• Not the right information:


– data not easily accessible
– meaning is subtly (or significantly different)
from the question context.
– Information is presented with too much or
too little detail, covers the wrong time spans,
or is in the wrong intervals.

CIS 465 - Data Warehousing 19


Why Data Warehousing - 3

• Not the right time:


– Getting this information may require the
efforts of highly skilled professionals who
are not generally available at the whim of
business managers.
– Data comes from a variety of different
systems which are resident on a variety of
different technology platforms.

CIS 465 - Data Warehousing 20


Why Data Warehousing - 4

• Not the right format:


– If data is extracted, merged, and converted
into a meaningful information, often it is not
in a usable format.
– Users will want it loadable into a particular
PC tool or spreadsheet with which he/she is
familiar.
– Printouts weighing 10 pounds are not in the
right format.
– a diskette with a COBOL file description is
not in the right format.

CIS 465 - Data Warehousing 21


The Dilemma for Corporate IT

• How to control scarce IT resources consumed


by insatiable user demand for ad-hoc reports.
• Each ad-hoc report generated by IT and
analyzed by the user generates three more
reports to further illuminate the insights
gleamed in the first.
• Often the extract programs have few reusable
components.
• The user is on a “voyage of discovery in a sea of
data”.

CIS 465 - Data Warehousing 22


The Response of Corporate IT

• New methodologies: Align the IT systems with


the business goals and requirements.
• These techniques concentrate on business
process requirements, not decision support
requirements.
• Transaction systems must be rigorously
specified in advance. The are an intersection
between the organization and the customer.
• These systems should not be a “voyage of
discovery” for either.

CIS 465 - Data Warehousing 23


Transaction Systems vs.
Analytical Support Systems
• Transaction Systems:
– Insert an order for 300 baseballs
– Update this passenger’s airline reservation.
– close-out accounts payable records for this vendor.
– What is the current checking account balance for this
customer?
• Analytical Support Systems:
– Did the sales promotion last quarter do better than the same
promotion last year?
– Is the five-day moving average for this security leading or
trailing actual prices?
– Which product line sells best in middle-America and how
does this correlate to demographic data.

CIS 465 - Data Warehousing 24


Analytical Processing

• Analytical Processing today includes what in


the past have been called:
– DSS (Decision Support Systems)
– EIS (Executive Information Systems)
– ESS (Executive Support Systems)
• It is an evolution of “End-User Computing”
• Placing strategic data access in the hands of
decision makers aids productivity and enables
them to be better decision makers.

CIS 465 - Data Warehousing 25


Key Difference: OLTP vs. OLAP

• OLTP: Processing specific functions


• OLAP: providing flexibility for undetermined
analysis.

CIS 465 - Data Warehousing 26


A Multidimensional database
Data for Decision Support

• The data must be integrated - requires data from


many separate internal corporate databases.
• The data must be enriched - through integration
with other external data.
• The data must be available - and not
constrained by machine resources.

CIS 465 - Data Warehousing 28


Sources of Data

• Internal Data:
– Financial Systems
– Logistics Systems
– Sales Systems
– Production Systems
– Personnel Systems
– Billing Systems
– Information Systems
• External Data Needs:
– to recognize opportunities
– to detect threats
– to identify synergies

CIS 465 - Data Warehousing 29


Sources of Data - 2

• External Data Categories


– Competitor Data
– Economic Data
– Industry Data
– Credit Data
– Commodity Data
– Econometric Data
– Psychometric Data
– Meteorological Data
– Demographic Data
– Sales & Marketing Data

CIS 465 - Data Warehousing 30


Operational Control vs.
Operational Strategy
• Data is a source not just of operational control,
but of operational strategy.
• Operational strategy is an attempt to describe
the need, in a competitive and turbulent
market, to continually innovate and re-align
strategy with time scales too short to be
comprehended by strategic planning in the
conventional corporate sense.

CIS 465 - Data Warehousing 31


Comparison of Control and
Strategy Data:
• Operational Data:
– short-lived, rapidly changing
– requires record-level access
– repetitive standard transactions and access patterns
– updated in real-time
– event-driven; process generates data
• Strategic Data:
– long-living, static
– data aggregated into sets (which is why warehouse data is
friendly to RDBMS).
– ad-hoc queries with some periodic reporting
– updated periodically with mass loads
– data-driven; data governs process.

CIS 465 - Data Warehousing 32


Information Requirements by
Management Level
(Source: Gorry and Scott Morton)
Characteristics Operational Management Strategic
of Information Control Control Planning
Source Largely Internal External

Scope Well defined, Very wide


narrow
Level of Detailed Aggregate
Aggregation
Time Horizon Historical Future

Currency Highly current Quite Old

Required High Low


Accuracy
Frequency of Use Very frequent Infrequent

CIS 465 - Data Warehousing 33


Dimensional Modeling

• Dimensional Modeling gives us a way to


visualize data.
• The CEO’s perspective:
– “We sell products in various markets, and
we measure our performance over time.”
• From the data warehouse designer’s
perspective, we hear three dimensions:
– We sell Products
– in various Markets
– and measure performance over time.

CIS 465 - Data Warehousing 34


Dimensional Modeling - contd.

• Management may be interested in examining


sales figures in a certain city by product, by
time period, by salesperson, and by store.
• The more dimensions involved, the more
difficult it is to represent in a single table or
graph.
• The ability to add and modify the dimensions
used in a table or graph is often known as
“slicing and dicing” the data.

CIS 465 - Data Warehousing 35


Dimensional Model of the
Business
e
m
i
T

M
a
r
k
e
t

Product

CIS 465 - Data Warehousing 36


Data Dependencies Model of a Business

Ship
Shipper Ship To Product
Type

Product
District Order Contact Line
Credit Item Location

Sales Customer
Order Location

Product
Group

Contract
Contract Customer Contact
Type

Sales Sales Sales Sales


Rep District Region Division

CIS 465 - Data Warehousing 37


Transaction Processing

• The Relation Model was full of promises for


equal access to data.
• In the early 1980’s the relational model was a
dream. Typical transaction rates were one per
second.
• Today the SABRE system typically processes
4,000 transactions per second, with peak bursts
of 13,000 per second.
• OLTP - (On-line Transaction Processing) The
point is to get data “in” to the database.

CIS 465 - Data Warehousing 38


Segregating Operational and
Warehouse Data
• In the past, data administrators were constantly told
to build data sharing, normalization, and non-
redundant corporate databases.
• Early attempts at data warehousing tried to share the
data with transaction-based systems. This resulted in
LONG response times for complex queries.
• The idea today is to keep the two separate.
• Separate databases, and perhaps separate DBMS
products and processor platforms are used.
• Controlled and practical redundancy is better than
out-of-control theoretical purity.

CIS 465 - Data Warehousing 39


Fundamental Obstacles With
Traditional Systems
• Systems Integration - Disintegration grew slowly
from islands of automation.
– ownership, planning, economic, organizational
development issues all contribute.
• Hardware Architecture
• Inconsistent Data
• Data Pollution:
– Bad Application Design (semantic and syntactical
differences).
– Ownership
– Data Entry Conventions

CIS 465 - Data Warehousing 40


The Data Warehouse

• Active, tactical, and current events flow from the


operational systems to the data warehouse to
become static, strategic, and historical data.
• The data warehouse becomes a “middle ground”
where a large number of disparate and
incompatible “legacy systems” are tied to an
equally diverse collection of end-user workstations.
• Legacy systems usually comprise a hodge-podge of
assorted hardware, software, and operational
systems accumulated over many decades, are by
nature, incompatible with one another and unique
to each organization.

CIS 465 - Data Warehousing 41


Practical Facts About the
Warehouse
• The chances are remote that any single vendor
will be able to develop a product that can
interface with all “legacy systems” painlessly
and “seamlessly” and at the same time,
combine data from modern platforms and data
external to the organization.
• Instead warehouse product vendors develop
specialized capabilities to work with various
environments.

CIS 465 - Data Warehousing 42


Typical Dimensional Model
Product Dimension
Sales Fact
Time Dimension Product_key
description
Time_key brand
product_key category
store_key
Time_key dollars_sold
day-of-week units_sold
month dollars_cost
quarter
year
holiday_flag Store Dimension

Store_key
store_name
address
floor_plan_type

CIS 465 - Data Warehousing 43


Fact Table

• Fact Table is where numerical measurements of


the business are stored.
• Each measurement is taken at the intersection
of all the dimensions.
• The “best” facts are numeric, continuously
valued and additive.
• For every query made against the fact table may
use hundreds of thousands of individual
records to construct an answer set.

CIS 465 - Data Warehousing 44


Dimension Tables

• Dimension tables are where textual


descriptions of the business are stored.
• Each textual description helps to describe a
member of the dimension.
• Example: each member in the product
dimension is a specific product. The product
dimension database has many attributes to
describe the product. A key role of the
dimension table attribute is to serve as the
source of constraints in a query.

CIS 465 - Data Warehousing 45


Example

Brand Dollar Sales Unit Sales


Axon 780 263
Framis 1044 509
Widget 213 444
Zapper 95 39

CIS 465 - Data Warehousing 46


Example Query

• Find all product brands that were sold in the


first quarter of 1995 and present the total dollar
sales as well as the number of units.
• Brand is a collection of individual products.
• To construct:
– A. Drag attribute brand from product dimension. Place as
Row Header.
– B. Drag Dollar Sales and Units Sold from the Fact Table,
and place to the right of the Brand row header.
– C. Specify row constraint “1st Q 1995” on the quarter
attribute in the Time Dimension Table.

CIS 465 - Data Warehousing 47


Multidimensionality

• Examples of dimensions
– products, salespeople, market segments, business units,
geographical locations, distribution channels, country,
industry
• Examples of Facts or Measures:
– money, sales volume, head count, inventory, profit, actual
vs. forecast.
• Examples of Time:
– daily, weekly, monthly, quarterly, yearly

CIS 465 - Data Warehousing 48


Components of a Data
Warehouse - 1
• Acquisition - The first component handles
acquisition of data from legacy systems and
outside sources.
• Data is identified, copied, formatted, and
prepared for loading into a warehouse.
• Vendors provide tools for extraction and
preparation.

CIS 465 - Data Warehousing 49


Components of a Data
Warehouse - 2
• Storage Area - The second component is the
storage area managed by relational databases,
multi-dimensional databases, specialized
hardware - symmetric multiprocessor (SMP) or
massively parallel processors (MPP) machines -
or by software.
• The storage component hold the data so that
many different data mining, executive
information and decision support systems can
make use of it effectively.

CIS 465 - Data Warehousing 50


Components of a Data
Warehouse - 3
• Access - The third component of the warehouse
is the access area.
• Different end-user PCs and workstations draw
data from the warehouse with the help of multi-
dimensional analysis tools, neural networks,
data discover tools, or analysis tools.
• These “smart” data-mining tools are the driving
force behind the data warehouse concept.
• What good is it to store all the information
without some way to understand it in new and
different ways.

CIS 465 - Data Warehousing 51


Data Warehouse Access Tools

• Intelligent Agents and Agencies - tools work


and think for user.
• Query Facilities and Managed Query
environments.
• Statistical Analysis - One of the biggest
surprises in the data warehousing marketplace
is the resurgence of interest in traditional
statistical analysis, and the concomitant
resurrection of the popularity to products like
SAS and SPSS.

CIS 465 - Data Warehousing 52


Data Warehouse Access Tools - 2

• Data Discovery - A large class of tools formerly


classified as decision support, artificial
intelligence and expert systems. They now
make use of neural networks, fuzzy logic,
decision trees, and other tools from advanced
mathematics to allow a user to “sift” through
massive amounts of raw data to “discover” new,
interesting, insightful, and in many cases useful
things about the organization, its operations,
and its markets.
• Currently there are nearly 60 different data
discovery tools/products on the market.

CIS 465 - Data Warehousing 53


Data Warehouse Access Tools - 3
• OLAP - On-line Analytical Processing often uses
multi-dimensional spreadsheet tools allowing users to
look at information from many different angles.
• Users are able to “slice and dice” reports and to look at
the same kinds of information at different levels at the
same time.
• Typical OLAP application might allow a product
manager to view sales figures for a given product at
the national level, see them broken down by division,
drill down to see territories within a division, check
sales numbers for each store within a territory, and
then compare them against sales of stores from
another territory.

CIS 465 - Data Warehousing 54


Data Warehouse Access Tools - 4

• Data Visualization - These tools turn ugly, boring


numbers into exciting visual presentations.
• These tools bring graphical representation to new
heights. Example: Geographical information
systems turn data about stores, individuals, or
anything else into compelling, easy to
understand, dynamic maps.
• PC-based Geographic Information systems have
the ability to display spatial occurrences and the
relationship between and among geographically
specific variables.

CIS 465 - Data Warehousing 55


Developing the Data Warehouse

• The most expensive warehousing ventures


involve major new hardware acquisitions and
significant investments in training, analysis, and
systems development costs.
• Typical startup projects allocate 60% of budget for
hardware and software for creation of a powerful
storage component. 30% on data mining and
acquisition tools.
• Budgeting for Systems Analysis and
Development has 50% of budget on acquisition
capabilities, 30% fund user solutions, 20%
creation of databases in the storage component.

CIS 465 - Data Warehousing 56


Developing the Data Warehouse

• Clarify what you want to do with the


Warehouse - How Will It be Used.
• Scrutinize the offerings of vendors and systems
integrators. Make sure you understand which
functions they provide, and which you must
build.
• Most successful projects start as small, tightly
defined tactical systems to solve pressing
business needs, and develop into larger systems
over time.

CIS 465 - Data Warehousing 57


DW Summary: Key Concepts

• The DW is a “collection of integrated, subject-


oriented databases designed to support the
decision support function where each unit of
data is non-volatile and relevant to some
moment in time: (W.H. Inmon, 1992).
• Implicit Assumptions:
– physically separate from operational systems
– hold aggregated data and transactional (atomic) data for
management separate from those used for OLTP.

CIS 465 - Data Warehousing 58


DW Summary: Characteristics

• Subject-orientation
• integrated
• non-volatile (i.e. not updated)
• time variant (kept for long periods, for forecasting
and trend analysis)
• summarized
• large volume
• not normalized
• metadata
• data sources

CIS 465 - Data Warehousing 59


Data Mining

CIS 465 - Data Warehousing 60

You might also like