Data Warehouse: TobiasGroup, Inc.
TobiasGroup, Inc.
8536 Crow Drive
Suite 218
Macedonia, Ohio USA 44056
330.468.2468
information@tobiasgroup.com
1. Introduction
2. Data Warehousing
   History
   The Goals of a Data Warehouse
      Makes an organization's information accessible.
      Makes the organization's information consistent.
      Is an adaptive and resilient source of information.
      Is a secure bastion that protects the organization's information asset.
      Is the foundation for decision making.
   Data Warehouse Information Flow
      Data Access
      Data Cleansing
      Business Rule Application
      Data Translation
      Warehouse Databases
      Querying
      Information Access
   Basic Elements of a Data Warehouse
      Source System
      Staging Area
      Presentation Area
      End User Data Access Tools
      Metadata
   Basic Processes of the Data Warehouse
      Conforming Dimensions
      Extracting
      Transforming
      Loading and Indexing
      Quality Assurance Checking
      Release/Publishing
      Updating
      Querying
      Data Feedback/Feeding in Reverse
      Auditing
      Securing
      Backing Up and Recovering
3. Terms and Definitions
   Data Warehousing
   Common Terms
1. Introduction
At any given time the optimal IT architecture depends on a few important factors. These
include the business requirements of the enterprise, the available technology of the time,
and the enterprise's accumulated investments in earlier technologies.
In his book The Data Warehouse Toolkit, Dr. Ralph Kimball describes the data
warehouse as "The place where people can go to access their data."
Although relatively new to the data warehousing market, the Microsoft™ set of tools is
recommended because of its price and relative ease of use. Skills are easily
transferable from one tool to another. The individual components of the Microsoft Data
Warehousing Framework (Data Transformation Services – DTS, the Relational Database
Management System – RDBMS, OLAP Services, and the Repository) are becoming the
de facto standard tools for data warehousing.
2. Data Warehousing
History
Data warehousing methodology has grown out of the need for immediate and
comprehensive access to enterprise information. Fast, informed business decisions are
no longer a competitive advantage, but a requirement.
In the past, information technology has focused almost entirely on OLTP (On Line
Transaction Processing) systems. OLTP or Operational Systems track our customers
and orders, process general ledger and other accounting data, tell us about inventory
levels and how much was spent on raw material last year, but do little to answer questions
requiring data from multiple operational systems. When someone asks such a question,
traditional MIS gathers information from the OLTP systems and delivers a new report
containing the answer. Hopefully, this only takes a few days. The goal of Data
Warehousing is to answer questions in a few minutes or seconds rather than days.
Operational systems are designed for fast data entry and storage, and immediate retrieval
of simple information, usually based on a simple query: "name and address of customer
#34865" or "quantity of item AB-11 in stock." OLTP systems do deliver sophisticated
reports, but usually not in real time. To paraphrase one IT professional: "Operational
reporting gives a perfect picture of the wake of the boat, but does little to help steer a course."
To answer questions in seconds, new system design and implementation philosophies are
required. Data Warehouses are Information Access Systems. Data must be stored in
new ways for faster access. OLTP systems are optimized for fast entry and quick record
processing. On the other hand, data warehouses must retrieve large amounts of
information quickly and deliver it to an end user’s desktop. Business information must be
extracted from the OLTP system and loaded into the Data Warehouse’s new fast access
formats.
A Data Warehouse must then be able to deliver loaded information in a variety of formats.
Hardcopy and spreadsheet "straight line" processing is no longer sufficient. A
comprehensive repository of business rules information is also required. Business
definitions about sales periods, or even the simple "what is a customer?", have previously
been defined differently in different departments and facilities. "Well yes, but my number
also includes…" or "Our period ends on…, so you can't apply that number to these
figures" are familiar comments in today's conference room. Redefining and consolidating
all the business rules in the warehouse eliminates these difficulties.
End users have many new tools for information access that require complex information.
Most of these new tools translate business language questions into database queries that
were previously only written in the MIS department. In addition, there are now many
applications that perform very sophisticated calculations and modeling.
The Goals of a Data Warehouse
One of the most important assets of an organization is its information. These assets are
usually kept by an organization in three forms: the operational systems of record;
distributed, ad-hoc, or departmental documents and databases utilized to satisfy reporting
requirements; and the data warehouse. While most of the ad-hoc documents, usually in
spreadsheets, will be replaced over time with the warehoused data, the data warehouse
will never be a substitute for the operational systems. The data warehouse has profoundly
different needs, clients, structures, and rhythms than the operational systems. The
operational systems of record are where data is put in, and the data warehouse is where
the data is taken out. While the basic goals of the operational systems are to capture the
daily transactions and to aid in the day-to-day running of that business, the data
warehouse:
Makes an organization's information accessible.
The contents of the data warehouse are understandable and navigable, and the access is
characterized by fast performance. Understandable means correctly labeled and obvious.
Navigable means recognizing the destination on the screens and getting there in one click.
Fast performance means zero wait time. In addition, accessibility of the data means the
data warehouse can be the source of information for many data-hungry business
improvement efforts such as process modeling and simulation, budgeting and forecasting,
activity-based costing, and new product development.
Makes the organization's information consistent.
Information from one part of the organization can be matched with information from
another part of the organization. If two measures of an organization have the same name,
then they must mean the same thing. Conversely, if two measures don’t mean the same
thing, then they are labeled differently. Consistent information means high quality
information. It means that all of the information is accounted for and is complete.
Is an adaptive and resilient source of information.
The data warehouse is designed for continuous change. When new questions are asked
of the data warehouse, the existing data and the technologies are not changed or
disrupted. When new data is added to the data warehouse, the existing data are not
changed or disrupted. The design of the separate data marts that make up the data
warehouse must be distributed and incremental.
Is a secure bastion that protects the organization's information asset.
The data warehouse not only controls access to the data effectively, but gives its owners
great visibility into the uses and abuses of that data, even after it has left the data
warehouse.
Is the foundation for decision making.
The data warehouse has the right data in it to support decision-making. There is only one
true output from a data warehouse: the decisions that are made after the data warehouse
has presented its evidence.
Data Warehouse Information Flow
Over the past two decades, information technology has been adapted and integrated into
all aspects of the enterprise. Information access was never ignored, but was usually
designed by operational systems professionals. OLTP and reporting systems were
departmentalized and so diverse that information gathering was problematic. Most of
today’s information access systems look something like the following:
There are several important considerations in the process model above. Data generally,
but not exclusively, moves from the left to the right. In the Data Access layer, information
is gathered from the existing operational systems. Data Staging is arguably the most
important and the most complex layer. It can be divided into Data Cleansing, Business
Rule Application, and Data Translation. Information requests are compiled in the
Querying layer. Finally, data is processed, displayed, and reported in the Information
Access layer.
Process Management applications drive the warehouse and control each layer using
Metadata Functions. Metadata, or "data about data", is used in every layer and defines
each process. At a minimum, Metadata includes data warehouse group, table, and field
names with descriptions; original data source; allowable values and formats; and simple
business rules. Ideally, complex business rules, cleansing information, source data
formats, and end user data formats are also included.
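As a sketch of what such a repository entry might contain, the following hypothetical Python example describes a single warehouse field; all names and values here are illustrative, not taken from any actual TobiasGroup metadata:

```python
# A minimal, hypothetical metadata entry for one warehouse field.
# A real repository would store far more detail than this sketch.
customer_region_meta = {
    "group": "Sales",
    "table": "dim_customer",
    "field": "region",
    "description": "Sales region assigned to the customer",
    "source": "ORDERS.CUSTMAST.RGN_CD",   # original data source (illustrative)
    "allowable_values": ["NORTH", "SOUTH", "EAST", "WEST"],
    "format": "CHAR(5)",
    "business_rule": "Region follows the ship-to address, not the bill-to",
}

def validate(value, meta):
    """Check a value against the allowable values in its metadata entry."""
    return value in meta["allowable_values"]

print(validate("NORTH", customer_region_meta))      # True
print(validate("NORTHWEST", customer_region_meta))  # False
```

Every layer of the warehouse can consult entries like this one rather than hard-coding field names, formats, and rules.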
© TobiasGroup – March 1999
Data Access
The Data Access layer is the first step when loading the warehouse with new information.
Any new data since the last load must be retrieved from all existing OLTP and operational
systems. Data Access contains tools that understand all the mainframe and PC database
formats. Metadata is used to specify which data is included and where it resides in the
operational systems.
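The incremental pull described above can be sketched as follows. Here sqlite3 merely stands in for an operational source; real Data Access tools would reach mainframe and PC database formats through their own drivers, and the table and column names are invented for illustration:

```python
import sqlite3

# An in-memory table standing in for an operational orders system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, entered TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "1999-02-01"),
    (2, 250.0, "1999-03-02"),
    (3, 75.0, "1999-03-03"),
])

# Metadata records where the data resides and when it was last pulled.
last_load = "1999-03-01"

# Retrieve only the rows entered since the last load.
new_rows = src.execute(
    "SELECT id, amount, entered FROM orders WHERE entered > ? ORDER BY id",
    (last_load,),
).fetchall()
print(new_rows)
```

Only orders 2 and 3 are retrieved; order 1 was already loaded in an earlier pass.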
Data Cleansing
There is virtually no computer-based information that is 100% accurate. The old OLTP
saying "garbage in, garbage out" applies even more dramatically to data warehousing.
Correcting data entry errors is only a small part of the cleansing process. Typically, data
relationship problems are the most daunting. The simple question "Who is our largest
customer?" may be answered incorrectly if data is not cleansed properly. OLTP systems
in different divisions may have separate codes for the "unrelated" customers "Digital
Equipment Corporation", "Digital", and "DEC". Products and services may be just as hard
to reconcile. Stephen Brown of Vality Technology, Inc. reports that 10 times the allotted
resources are usually spent implementing the cleansing layer.
Any rules for correcting inconsistencies should be stored in the Metadata repository for
reference and ease of modification.
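A minimal sketch of such a metadata-driven cleansing rule, using the customer example above, might look like this in Python; the surrogate code and the exact-match lookup are illustrative simplifications, since real cleansing tools rely on fuzzy matching:

```python
# Hypothetical cleansing rule, as it might be stored in the Metadata
# repository: map divergent customer spellings to one surrogate code.
customer_rules = {
    "DIGITAL EQUIPMENT CORPORATION": "C-0042",
    "DIGITAL": "C-0042",
    "DEC": "C-0042",
}

def cleanse_customer(raw_name, rules):
    """Return the conformed customer code, or None for an unknown name."""
    return rules.get(raw_name.strip().upper())

# All three source spellings now resolve to a single customer.
codes = {cleanse_customer(n, customer_rules)
         for n in ["Digital Equipment Corporation", "digital ", "DEC"]}
print(codes)
```

With the rule applied, "Who is our largest customer?" sums all three spellings under one code instead of three.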
Business Rule Application
Business Rules (as defined in Metadata) are applied in the data-staging layer. Consistent
periods and relationships are necessary to correlate departmental and divisional
information.
Data Translation
While Cleansing and Business Rule applications occur, data is translated into standard
formats and stored in the warehouse.
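As an illustration of translation into a standard format, suppose three source systems deliver the same order date in three different layouts; the source names and format strings below are assumptions for the sketch, not actual TobiasGroup systems:

```python
from datetime import datetime

# Hypothetical per-source date layouts, as Metadata might record them.
SOURCE_FORMATS = {
    "mainframe": "%y%j",        # two-digit year plus Julian day, e.g. 99061
    "accounting": "%m/%d/%Y",   # e.g. 03/02/1999
    "sales": "%d-%b-%Y",        # e.g. 02-Mar-1999
}

def translate_date(raw, source):
    """Translate a source-specific date string to the standard ISO format."""
    return datetime.strptime(raw, SOURCE_FORMATS[source]).date().isoformat()

print(translate_date("99061", "mainframe"))
print(translate_date("03/02/1999", "accounting"))
print(translate_date("02-Mar-1999", "sales"))
```

All three source values land in the warehouse as the single standard form 1999-03-02.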
Warehouse Databases
The physical warehouse data store may contain one or many standardized database
formats. Optimization of information access speed is determined by varying format
selections. Data Normalization was once thought to be a rule for any database design, but
is now known to apply to OLTP-like systems only. In a warehouse, data is frequently
“denormalized”. For example, records may contain redundant and uncoded information;
additionally, summary data is stored separately or alongside detail. The many different
data formats are defined in Metadata.
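A small sketch of denormalization, with invented table and column names: the product description is stored redundantly and uncoded on every sales row, so a business question can be answered in one scan with no joins:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# A denormalized sales table: the full product description is repeated
# (uncoded) on every detail row, trading storage for access speed.
db.execute("""CREATE TABLE sales (
    order_id INTEGER, product_desc TEXT, qty INTEGER, amount REAL)""")
db.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "Widget, blue, 3-inch", 10, 50.0),
    (2, "Widget, blue, 3-inch", 4, 20.0),
])

# No join to a product table is needed - one scan answers the question.
total = db.execute(
    "SELECT SUM(amount) FROM sales WHERE product_desc LIKE 'Widget%'"
).fetchone()[0]
print(total)
```

In a normalized OLTP design the description would live in a separate product table and every such query would pay for a join.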
Querying
Querying is the simple and hopefully speedy process of compiling warehouse information
for delivery to an end user. In the query layer, data is gathered per user request and
translated from the standardized warehouse formats into any new formats required by end
user tools. Again, information about all the different formats is stored in Metadata.
Information Access
On the end user’s desktop are the varieties of information access tools mentioned above.
Tools with built-in warehousing technology may query the warehouse directly and produce
reports. Other tools can be populated with data by applications in the Querying layer.
Basic Elements of a Data Warehouse
From the information process flow above, the following basic elements are derived and
illustrated below:
Source System
Staging Area
A storage area and set of processes that clean, transform, combine, de-duplicate,
household, archive, and prepare source data for use in the presentation server. In many
cases, the primary objects in this area are a set of flat-file tables representing extracted
(from the source systems) data, loading and transformation routines, and a resulting set of
tables containing clean data – Dynamic Data Store. This area does not usually provide
query and presentation services.
Presentation Area
The presentation area is the set of target physical machines on which the data warehouse
data is organized and stored for direct querying by end users, report writers, and other
applications. The set of presentable data, or Analytical Data Store, normally takes the
form of dimensionally modeled tables when stored in a relational database, and cube files
when stored in an OLAP database.
End User Data Access Tools
End user data access tools are any clients of the data warehouse. An end user access
tool can be as simple as an ad hoc query tool, or can be as complex as a sophisticated
data mining or modeling application.
Metadata
All of the information in the data warehouse environment that is not the actual data itself.
This data about data is catalogued, versioned, documented, and backed up.
Basic Processes of the Data Warehouse
Conforming Dimensions
The process of aligning business users' understanding of the dimensions used in the data
warehouse. The resulting conformed dimensions are dimensions that mean the same
thing with every possible fact table to which they can be joined. Examples of obvious
conformed dimensions include customer, product, location, and calendar (time).
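The idea can be sketched with one conformed calendar dimension shared by two fact tables, so that "month" means the same thing whether the measure is sales or shipments; all table names, keys, and figures below are invented for the example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One conformed date dimension joined to two separate fact tables.
db.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales (date_key INTEGER, amount REAL);
CREATE TABLE fact_shipments (date_key INTEGER, units INTEGER);
INSERT INTO dim_date VALUES (19990301, '1999-03'), (19990402, '1999-04');
INSERT INTO fact_sales VALUES (19990301, 500.0), (19990402, 300.0);
INSERT INTO fact_shipments VALUES (19990301, 12);
""")

# Because both facts share the same dimension, their measures line up
# by month without any reconciliation.
row = db.execute("""
    SELECT d.month, SUM(s.amount), SUM(sh.units)
    FROM dim_date d
    JOIN fact_sales s ON s.date_key = d.date_key
    LEFT JOIN fact_shipments sh ON sh.date_key = d.date_key
    WHERE d.month = '1999-03'
    GROUP BY d.month
""").fetchone()
print(row)
```

If each fact table carried its own private calendar, the two measures could never be compared safely across months.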
Extracting
The extract step is the first step of getting data into the data warehouse environment.
Extracting means reading and understanding the source data, and copying the parts that
are needed to the data staging area for further work.
Transforming
Once the data is extracted into the data staging area, there are many possible
transformation steps, including cleansing, combining data from multiple sources,
de-duplicating, and householding the data.
At the end of the transformation process, the data is in the form of load record images.
Loading and Indexing
Loading in the data warehouse environment usually takes the form of replicating the
dimension tables and fact tables and presenting these tables to the bulk loading facilities
of the presentation area servers.
Quality Assurance Checking
When each presentation server is loaded, indexed, and supplied with appropriate
aggregates, the last step before publishing is the quality assurance step. Quality
assurance can be checked by running a comprehensive exception report over the entire
set of newly loaded data. All of the reporting categories must be present, and the counts
and totals must be satisfactory. All reported values must be consistent with the time series
of similar values that preceded them. The exception report is probably built with an end
user report writing facility. All issues dealing with transformation should have already been
resolved.
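A toy version of the exception report described above can be sketched as follows: each newly loaded category total is compared against the average of prior loads, with a hypothetical 50% tolerance standing in for whatever "satisfactory" means in a real warehouse:

```python
# Prior load totals per reporting category (illustrative figures).
history = {"EAST": [100.0, 110.0, 105.0], "WEST": [200.0, 190.0, 210.0]}
new_load = {"EAST": 104.0, "WEST": 20.0}   # WEST looks wrong

def exceptions(history, new_load, tolerance=0.5):
    """Flag categories that are missing or outside the tolerance band."""
    flagged = []
    for category, prior in history.items():
        if category not in new_load:       # every category must be present
            flagged.append((category, "missing"))
            continue
        avg = sum(prior) / len(prior)
        if abs(new_load[category] - avg) > tolerance * avg:
            flagged.append((category, "out of range"))
    return flagged

print(exceptions(history, new_load))
```

Here EAST is consistent with its time series while WEST is flagged, so the load would be held back from publishing until the discrepancy is explained.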
Release/Publishing
The user community is notified that the new data is ready – “ring the data bell.”
Updating
Incorrect data should obviously be corrected. Changes in labels, hierarchies, status, and
corporate ownership often trigger necessary changes in the original data stored in the data
warehouse. In general, these are managed load updates, not transactional updates.
Querying
Querying is a broad term that encompasses all the activities of requesting data, report
writing, complex decision support applications, requests from models, and full-fledged
data mining. Querying never takes place in the staging area.
Data Feedback/Feeding in Reverse
When modeling tools are used in a data warehousing environment, results from these
tools are sometimes loaded into the warehouse.
Auditing
At times it is critically important to know where the data came from and what calculations
were performed.
Securing
Every data warehouse has an exquisite dilemma: the need to publish the data widely to as
many users as possible with the easiest-to-use interface, but at the same time protect the
valuable sensitive data from hackers, snoopers, and industrial spies. Data warehouse
security must be managed centrally while users must be able to access all the constituent
data with a single sign-on.
Backing Up and Recovering
The project team will decide where to take the necessary snapshots of the data for
archival purposes and disaster recovery.
3. Terms and Definitions
Data Warehousing
Data warehousing, then, is the process of building and maintaining a data
warehouse.
Below is a closer look at each of these components followed by a listing of the common
terms used within a data warehousing effort.
Integrated
The data warehouse is comprised of data from many systems. Each system may be
similar in nature or have a totally different use. For example, customer A may be an active
buyer of goods and services and their information is stored in an accounting system. They
may also be tracked in a marketing database used to track new construction projects.
Each of these systems stores data about the same customer, but neither of them knows
anything about the other. A data warehouse captures information from both of these
systems, integrates the information so they can be related, and provides new, meaningful
ways of looking at the data.
Subject Oriented
The data warehouse takes a different approach than the traditional OLTP systems. It
looks at subjects like customers, sales, and profits as opposed to the systems that focus
on one department or process.
Databases
The term data warehouse refers to the entire collection of tools, processes and hardware
required to plan, develop, implement, and use the system. At its core is a very large,
typically read-only database that collects both internal and external data and provides
unique ways of viewing the data. Internal data comes from the operational systems within
the organization. External data may come from customers, the government, research
firms, and other organizations that sell data related to your organization.
Decision-Making
Traditional operational systems are typically built on normalized relational databases that
are designed to maintain a high level of relational data integrity. Data warehouses are
denormalized in order to make the data more meaningful to the users. A data warehouse
is designed for presentation and performance, and allows for many different views.
Product managers may
be interested in sales per region while a financial manager is interested in profitability.
Common Terms
Analytical Tools
This is an umbrella phrase used to connote software that employs some sort of
mathematical algorithm(s) to analyze the data contained in the warehouse. Data Mining,
OLAP, ROLAP and other terms are used to designate types of these tools with different
functionality. In practice, however, a given analytical tool may provide more than one type
of analysis procedure and may also encompass some middleware functionality. Such
tools are difficult to classify.
Ad Hoc Query
The process of extracting and reporting information from a database through the issuance
of a structured query. Programmers usually write queries using special languages that are
associated with database management systems. Most relational database managers use
a variant of SQL (Structured Query Language, originally developed by IBM). An example of
an ad hoc query might be "How many customers called the UK between the hours of 6-8
am?" Several packages are available that make the construction of queries more
user-friendly than writing language constructs. These usually employ some sort of
graphic/visualization
front end.
Business Intelligence
A phrase coined by (or at least popularized by) Gartner Group that covers any
computerized process used to extract and/or analyze business data.
Data Cleansing
The process by which data is extracted from an operational database, cleaned and then
transformed into a format useful for a data warehouse-based application.
Data Mart
A Data Mart is a data warehouse that is restricted to dealing with a single subject or topic.
The operational data that feeds a data mart generally comes from a single set or source of
operational data.
Data Mining
Data Mining is a process by which the computer looks for trends and patterns in the data
and flags potentially significant information. An example of a data-mining query might be
"What are the psychological factors associated with child abusers?"
DBMS
A Database Management System: the software that stores, retrieves, and manages the
data in a database.
Enterprise Data Warehouse
Refers to a single collection of data designed to serve the diverse needs of an enterprise.
The opposing concept is that of a collection of smallish databases, each designed to
support a limited requirement.
A single repository holding data from several operational sources that serves many
different users, typically in different divisions or departments. An enterprise data
warehouse for a large company might, for example, contain data from several separate
divisions, and serve the needs of both those divisions and of corporate users wishing to
analyze consolidated information.
ERP (Enterprise Resource Planning)
ERP systems are comprised of software programs which tie together all of an enterprise's
various functions -- such as finance, manufacturing, sales and human resources. This
software also provides for the analysis of the data from these areas to plan production,
forecast sales and analyze quality. Today many organizations are realizing that to
maximize the value of the information stored in their ERP systems, it is necessary to
extend the ERP architectures to include more advanced reporting, analytical and decision
support capabilities. This is best accomplished through the application of data
warehousing tools and techniques.
Knowledge Discovery
A phrase coined by (or at least popularized by) Gartner Group defined as the process of
discovering meaningful new correlations, patterns and trends by sifting through large
amounts of data stored in repositories (e.g., data warehouses), using such technologies
as pattern recognition, statistics and other mathematical techniques. Knowledge
Discovery is really the same thing as data mining.
Knowledge Management
An umbrella term that is used by some in the same context as Business Intelligence.
A person or organization that designs the software for the DW/BI application.
Middleware
An umbrella term used to describe software that bridges various parts of a DW/DSS
system. For example, software that extracts, cleans or separates data.
MOLAP
A set of user interfaces, applications, and proprietary database technologies that have a
strongly dimensional flavor.
OLAP/ROLAP/MOLAP
The general activity of querying and presenting text and number data from data
warehouses, as well as a specifically dimensional style of querying and presenting that is
exemplified by a number of "OLAP" vendors. The OLAP vendors' technology is non-
relational and is almost always based on an explicit multidimensional cube of data. OLAP
databases are also known as multidimensional databases, or MDDBs.
Operational Data
Operational data is the data collected from operations such as order processing,
accounting, manufacturing, marketing, etc. Most modern companies collect most of this
data using a form of OnLine Transaction Processing (OLTP). Data generated by these
systems is generally not in a format that makes for efficient query processing or analysis.
Relational Data
Data that has been formatted and organized to work in a database designed around a
relational schema.
ROLAP
A set of user interfaces and applications that give a relational database a dimensional
flavor.