Chapter 31
Data Warehousing Concepts
Chapter Objectives

In this chapter you will learn:
- How Online Transaction Processing (OLTP) systems differ from data warehousing.
- The concept of a data mart and the main reasons for implementing a data mart.
- The main issues associated with the development and management of data marts.
- The issues associated with the integration of a data warehouse and the importance of managing metadata.
We have already noted in earlier chapters that database management systems are pervasive throughout industry, with relational database management systems being the dominant system. These systems have been designed to handle high transaction throughput, with transactions typically making small changes to the organization's operational data, that is, the data that the organization requires to handle its day-to-day operations. These types of system are called Online Transaction Processing (OLTP) systems. The size of OLTP databases can range from small databases of a few megabytes (MB), to medium-sized databases of several gigabytes (GB), to large databases requiring terabytes (TB) or even petabytes (PB) of storage.
Corporate decision-makers require access to all the organization's data, wherever it is located. Comprehensive analysis of the organization, its business, its requirements, and any trends requires access not only to the current values in the database but also to historical data. To facilitate this type of analysis, the data warehouse was created to hold data drawn from several data sources, maintained by different operating units, together with historical and summary data.
In Section 31.1 we outline what data warehousing is and how it evolved, and also describe the potential benefits and problems associated with this approach. In Section 31.2 we describe the architecture and main components of a data warehouse. In Sections 31.3 and 31.4 we identify and discuss the important data flows or processes of a data warehouse, and the associated tools and technologies, respectively. In Section 31.5 we introduce data marts and the issues associated with their development and management. Finally, in Section 31.6 we present an overview of how Oracle supports a data warehouse environment. The examples in this chapter are taken from the DreamHome case study described in Section 10.4 and Appendix A.
31.1 Introduction to Data Warehousing
In this section we discuss the origin and evolution of the concept of data warehousing.
We then discuss the main benefits associated with data warehousing. We next identify the
main characteristics of data warehousing systems in comparison with Online Transaction
Processing (OLTP) systems. We conclude this section by examining the problems of
developing and managing a data warehouse.
31.1.2 The Evolution of Data Warehousing
A typical organization has numerous operational systems with overlapping and sometimes contradictory definitions, such as differing data types for the same data. The challenge for an organization is to turn its archives of data into a source of knowledge, so that a single integrated/consolidated view of the organization's data is presented to the user. The concept of a data warehouse was deemed the solution to meet the requirements of a system capable of supporting decision-making, receiving data from multiple operational data sources.
Data warehousing: A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process (Inmon, 1993).
The original concept of a data warehouse was devised by IBM as the 'information warehouse' and presented as a solution for accessing data held in non-relational systems. The information warehouse was proposed to allow organizations to use their data archives to help them gain a business advantage. However, due to the sheer complexity and performance problems associated with the implementation of such solutions, the early attempts at creating an information warehouse were mostly rejected. Since then, the concept of data warehousing has been raised several times, but it is only in recent years that data warehousing has come to be seen as a valuable and viable solution. The latest and most successful advocate for data warehousing is Bill Inmon, who has earned the title of 'father of data warehousing' due to his active promotion of the concept.
In this definition, the data is:
- Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.
- Integrated, because of the coming together of source data from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data source must be made consistent to present a unified view of the data to the users.
- Time-variant, because data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
- Non-volatile, as the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
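The 'integrated' characteristic is easiest to see in code. Below is a minimal Python sketch of reconciling two operational encodings of the same attribute into one consistent warehouse representation; the source-system names and codes are invented for illustration:

```python
# Two hypothetical operational systems encode property type differently;
# the warehouse load must unify them into one consistent representation.
def unify_property_type(source, value):
    if source == "sales_system":      # assumed to use single-letter codes
        return {"F": "Flat", "H": "House"}[value]
    if source == "lettings_system":   # assumed to use numeric codes
        return {1: "Flat", 2: "House"}[value]
    raise ValueError(f"unknown source: {source}")

print(unify_property_type("sales_system", "F"))   # Flat
print(unify_property_type("lettings_system", 2))  # House
```

The same pattern applies to date formats, units of measurement, and naming conventions across source systems.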
There are numerous definitions of data warehousing, with the earlier definitions focusing on the characteristics of the data held in the warehouse. Alternative definitions widen the scope of data warehousing to include the processing associated with accessing the data, from the original sources to the delivery of the data to the decision-makers (Anahory and Murray, 1997).

Whatever the definition, the ultimate goal of data warehousing is to integrate enterprise-wide corporate data into a single repository from which users can easily run queries, produce reports, and perform analysis. In summary, a data warehouse is data management and data analysis technology.
In recent years a new term associated with data warehousing has appeared, namely the Data Webhouse.

Data Webhouse: A distributed data warehouse that is implemented over the Web with no central data repository.
31.1.4 Comparison of OLTP Systems and Data Warehousing

Table 31.1 Comparison of OLTP systems and data warehousing systems.
The data held by OLTP systems is subject to change, and may contain duplicate or missing entries. As such, the operational data must be cleaned up before it can be used in the data warehouse. We discuss the tasks associated with this process in Section 31.3.1.

OLTP systems are not built to quickly answer ad hoc queries, and they tend not to store historical data, which is necessary to analyze trends. Basically, OLTP offers large amounts of raw data, which is not easily analyzed. The data warehouse allows more complex queries to be answered besides just simple aggregations such as 'What is the average selling price for properties in the major cities of Great Britain?'. The types of queries that a data warehouse is expected to answer range from the relatively simple to the highly complex and are dependent on the types of end-user access tools used (see Section 31.2.10). Examples of the range of queries that the DreamHome data warehouse may be capable of supporting include:
- What was the total revenue for Scotland in the third quarter of 2004?
- What was the total revenue for property sales for each type of property in Great Britain in 2003?
- What are the three most popular areas in each city for the renting of property in 2004, and how does this compare with the results for the previous two years?
- What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?
- What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and government taxes went down by 1.5% for properties over £100,000?
- Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain, and how does this correlate to demographic data?
- What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?
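Queries like the first two above are essentially aggregations over a fact table. As a rough illustration, the sketch below uses Python's built-in sqlite3 with a hypothetical, much-simplified PropertySale table; the table and column names are not the actual DreamHome schema:

```python
import sqlite3

# Hypothetical, simplified fact table of property sales.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PropertySale (
    propertyType TEXT, region TEXT, saleYear INTEGER, price REAL)""")
conn.executemany(
    "INSERT INTO PropertySale VALUES (?, ?, ?, ?)",
    [("Flat",  "Scotland", 2004,  90000.0),
     ("House", "Scotland", 2004, 150000.0),
     ("Flat",  "Scotland", 2004, 110000.0),
     ("House", "Wales",    2003, 120000.0)])

# "What was the total revenue for each type of property in 2004?"
rows = conn.execute(
    """SELECT propertyType, SUM(price)
       FROM PropertySale
       WHERE saleYear = 2004
       GROUP BY propertyType
       ORDER BY propertyType""").fetchall()
print(rows)  # [('Flat', 200000.0), ('House', 150000.0)]
```

The more complex examples (rolling comparisons, what-if analysis, correlation with demographic data) go beyond a single GROUP BY and typically rely on the end-user access tools discussed in Section 31.2.10.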
31.1.5 Problems of Data Warehousing

The problems associated with developing and managing a data warehouse are listed in Table 31.2 (Greenfield, 1996).

Table 31.2 Problems of data warehousing.
Data homogenization
Large-scale data warehousing can become an exercise in data homogenization that lessens the value of the data. For example, in producing a consolidated and integrated view of the organization's data, the warehouse designer may be tempted to emphasize similarities rather than differences in the data used by different application areas such as property sales and property renting.

High demand for resources
Decision-support database designs such as star and snowflake schemas (see Chapter 32) result in the creation of very large fact tables. If there are many dimensions to the factual data, the combination of aggregate tables and indexes to the fact tables can use up more space than the raw data.

Data ownership
Data warehousing may change the attitude of end-users to the ownership of data. Sensitive data that was originally viewed and used only by a particular department or business area, such as sales or marketing, may now be made accessible to others in the organization.

High maintenance
Data warehouses are high-maintenance systems. Any reorganization of the business processes and the source systems may affect the data warehouse. To remain a valuable resource, the data warehouse must remain consistent with the organization that it supports.

Long-duration projects
A data warehouse represents a single data resource for the organization. However, the building of a warehouse can take up to three years, which is why some organizations are building data marts instead (see Section 31.5). Data marts support only the requirements of a particular department or functional area and can therefore be built more rapidly.

Complexity of integration
The most important area for the management of a data warehouse is its integration capabilities. An organization must spend a significant amount of time determining how well the various data warehousing tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the data warehouse, which must integrate well for the warehouse to work to the organization's benefit.
31.2 Data Warehouse Architecture
In this section we present an overview of the architecture and major components of a data
warehouse (Anahory and Murray, 1997). The processes, tools, and technologies associated
with data warehousing are described in more detail in the following sections of this chapter.
The typical architecture of a data warehouse is shown in Figure 31.1.
31.2.1 Operational Data

The sources of data for the data warehouse include:
- Departmental data held in proprietary file systems such as VSAM and RMS, and in relational DBMSs such as Informix and Oracle.
- Private data held on workstations and private servers.
- External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.

31.2.2 Operational Data Store

Building an operational data store (ODS) can be a helpful step towards building a data warehouse because an ODS can supply data that has already been extracted from the source systems and cleaned. This means that the remaining work of integrating and restructuring the data for the data warehouse is simplified (see Section 32.3).
31.2.4 Warehouse Manager

The warehouse manager performs all the operations associated with the management of the data in the warehouse. This component is constructed using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include:
- analysis of data to ensure consistency;
- transformation and merging of source data from temporary storage into data warehouse tables;
- creation of indexes and views on base tables;
- generation of denormalizations (if necessary);
- generation of aggregations (if necessary);
- backing-up and archiving data.

In some cases, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate. A query profile can be generated for each user, group of users, or the data warehouse as a whole, and is based on information that describes the characteristics of the queries, such as frequency, target table(s), and size of result sets.
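A query profile of this kind can be pictured as a simple accumulator. The class below is illustrative only (the group names, threshold, and `hot_tables` heuristic are invented), but it shows the kind of frequency and target-table information a warehouse manager might collect to decide where indexes and aggregations would pay off:

```python
from collections import defaultdict

class QueryProfiler:
    """Accumulates per-group query statistics: frequency, target tables,
    and result-set sizes. A sketch, not a real warehouse manager."""
    def __init__(self):
        self.profiles = defaultdict(lambda: {"count": 0,
                                             "tables": defaultdict(int),
                                             "rows_returned": 0})

    def record(self, group, tables, rows_returned):
        p = self.profiles[group]
        p["count"] += 1
        p["rows_returned"] += rows_returned
        for t in tables:
            p["tables"][t] += 1

    def hot_tables(self, group, min_hits=2):
        """Tables queried often enough to justify indexes/aggregations."""
        p = self.profiles[group]
        return sorted(t for t, n in p["tables"].items() if n >= min_hits)

profiler = QueryProfiler()
profiler.record("sales", ["PropertySale"], 120)
profiler.record("sales", ["PropertySale", "Branch"], 40)
profiler.record("sales", ["Branch"], 5)
print(profiler.hot_tables("sales"))  # ['Branch', 'PropertySale']
```

A real warehouse manager would derive this information from the DBMS's own query logs rather than explicit calls.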
31.2.5 Query Manager

The query manager performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries. In some cases, the query manager also generates query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
31.2.6 Detailed Data
This area of the warehouse stores all the detailed data in the database schema. In most
cases, the detailed data is not stored online but is made available by aggregating the data
to the next level of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.
31.2.8 Archive/Backup Data

This area of the warehouse stores the detailed and summarized data for the purposes of archiving and backup. Even though summary data is generated from detailed data, it may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data. The data is transferred to storage archives such as magnetic tape or optical disk.
31.2.9 Metadata

This area of the warehouse stores all the metadata (data about data) definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes, including:
- the extraction and loading processes: metadata is used to map data sources to a common view of the data within the warehouse;
- the warehouse management process: metadata is used to automate the production of summary tables;
- the query management process: metadata is used to direct a query to the most appropriate data source.
The structure of metadata differs between each process, because the purpose is different.
This means that multiple copies of metadata describing the same data item are held within
the data warehouse. In addition, most vendor tools for copy management and end-user
data access use their own versions of metadata. Specifically, copy management tools use
metadata to understand the mapping rules to apply in order to convert the source data into
a common form. End-user access tools use metadata to understand how to build a query.
The management of metadata within the data warehouse is a very complex task that should
not be underestimated. The issues associated with the management of metadata in a data
warehouse are discussed in Section 31.4.3.
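The mapping rules that copy management tools hold as metadata can be pictured as a small table of source-field-to-warehouse-field conversions. The sketch below is hypothetical (the field names and transforms are invented), but it shows how metadata drives the conversion of source data into a common form:

```python
# Hypothetical mapping metadata: for each warehouse field, where it comes
# from and how to convert it to the warehouse's common form.
mapping_metadata = {
    "customerName": {"source": "crm.cust_nm",   "transform": str.title},
    "price":        {"source": "sales.amt_gbp", "transform": float},
}

def to_common_form(source_row):
    """Apply the mapping rules to one source record."""
    return {field: rule["transform"](source_row[rule["source"]])
            for field, rule in mapping_metadata.items()}

row = {"crm.cust_nm": "jOHN smith", "sales.amt_gbp": "125000"}
print(to_common_form(row))
# {'customerName': 'John Smith', 'price': 125000.0}
```

Because each tool keeps its own copy of such mappings, keeping them synchronized across products is the integration challenge discussed in Section 31.4.3.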
or because the user interaction requires an unreasonably high level of expertise by the user. In this situation, user access may require the development of in-house applications using graphical data access tools designed primarily for client-server environments. Some of these application development tools integrate with popular OLAP tools, and can access all major database systems, including Oracle, Sybase, and Informix.

31.3 Data Warehouse Information Flows

A data warehouse must manage five primary information flows: the inflow, upflow, downflow, outflow, and metaflow.
- Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
- Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distributing the data.
- Downflow: the processes associated with archiving and backing-up the data in the warehouse.
- Outflow: the processes associated with making the data available to the end-users.
- Metaflow: the processes associated with managing the metadata.
31.3.1 Inflow

Inflow: The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

The inflow is concerned with taking data from the source systems to load into the data warehouse. Alternatively, the data may first be loaded into the operational data store (ODS) (see Section 31.2.2) before being transferred to the data warehouse. As the source data is generated predominantly by OLTP systems, the data must be reconstructed for the purposes of the data warehouse: cleansed of dirty or inconsistent entries and restructured to suit the requirements of the warehouse.

31.3.2 Upflow

Upflow: The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distributing the data.

The activities associated with the upflow include:
- Summarizing the data by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end-users. Summarizing extends beyond simple relational operations to involve sophisticated statistical analysis, including identifying trends, clustering, and sampling the data.
- Packaging the data by converting the detailed or summarized data into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.
- Distributing the data to appropriate groups to increase its availability and accessibility.

While adding value to the data, consideration must also be given to supporting the performance requirements of the data warehouse and to minimizing the ongoing operational costs. These requirements tend to pull the design in opposing directions, forcing restructuring either to improve query performance or to lower operational costs. In other words, the data warehouse administrator must identify the most appropriate database design to meet all requirements, which often necessitates a degree of compromise.
31.3.3 Downflow

Downflow: The processes associated with archiving and backing-up the data in the warehouse.

Archiving old data plays an important role in maintaining the effectiveness and performance of the warehouse by transferring older data of limited value to a storage archive such as magnetic tape or optical disk. However, if the correct partitioning scheme is selected for the database, the amount of data online should not affect performance. Partitioning is a useful design option for very large databases that enables the fragmentation of a table storing enormous numbers of records into several smaller tables. The rule for partitioning a given table can be based on characteristics of the data such as timespan or area of the country. For example, the PropertySale table of DreamHome could be partitioned according to the countries of the UK.

The downflow of data also includes the processes to ensure that the current state of the data warehouse can be rebuilt following data loss or software/hardware failures. Archived data should be stored in a way that allows the re-establishment of the data in the warehouse when required.
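A partitioning rule such as "by country of the UK" amounts to routing each row into one of several smaller tables. A minimal sketch in Python, where the row layout and partition names are illustrative:

```python
from collections import defaultdict

# Hypothetical horizontal partitioning of a large PropertySale table
# into per-country partitions, following the example in the text.
PARTITION_KEY = "country"
PARTITIONS = {"England", "Scotland", "Wales", "Northern Ireland"}

def route(rows):
    """Route each row to the partition named by its country."""
    partitions = defaultdict(list)
    for row in rows:
        country = row[PARTITION_KEY]
        if country not in PARTITIONS:
            raise ValueError(f"no partition for {country!r}")
        partitions[f"PropertySale_{country.replace(' ', '')}"].append(row)
    return partitions

parts = route([{"country": "Scotland", "price": 90000},
               {"country": "Wales",    "price": 120000},
               {"country": "Scotland", "price": 110000}])
print(sorted((name, len(rows)) for name, rows in parts.items()))
# [('PropertySale_Scotland', 2), ('PropertySale_Wales', 1)]
```

A real DBMS applies such a rule declaratively (range or list partitioning), but the effect is the same: queries and archiving operations can target one small partition instead of the whole table.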
31.3.4 Outflow

Outflow: The processes associated with making the data available to the end-users.

The outflow is where the real value of warehousing is realized by the organization. This may require re-engineering the business processes to achieve competitive advantage (Hackathorn, 1995). The two key activities involved in the outflow are:
- Accessing, which is concerned with satisfying end-users' requests for the data they need. The main issue is to create an environment in which users can effectively use the query tools to access the most appropriate data source. The frequency of user accesses can vary from ad hoc, to routine, to real-time. It is important to ensure that the system's resources are used in the most effective way when scheduling the execution of user queries.
- Delivering, which is concerned with proactively delivering information to the end-users' workstations.

An important issue in managing the outflow is the active marketing of the data warehouse to users, which will contribute to its overall impact on an organization's operations. There are additional operational activities in managing the outflow, including directing queries to the appropriate target table(s) and capturing information on the query profiles associated with user groups to determine which aggregations to generate.

Data warehouses that contain summary data potentially provide a number of distinct data sources to respond to a specific query, including the detailed data itself and any number of aggregations that satisfy the query's data needs. However, the performance of the query will vary considerably depending on the characteristics of the target data, the most obvious being the volume of data to be read. As part of managing the outflow, the system must determine the most efficient way to answer a query.
31.3.5 Metaflow

Metaflow: The processes that move metadata (data about the other flows) around the warehouse.

The previous flows describe the management of the data warehouse with regard to how the data moves in and out of the warehouse. Metadata is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing. We discuss issues associated with the management of metadata in a data warehouse in Section 31.4.3.

To respond to changing business needs, legacy systems are constantly changing. The warehouse must therefore respond to these continuous changes, reflecting both the changes to the source legacy systems and the changing business environment. The metaflow (metadata) must be continuously updated with these changes.
31.4 Data Warehousing Tools and Technologies

31.4.1 Extraction, Cleansing, and Transformation Tools

The tools for the extraction, cleansing, and transformation of source data fall into three categories:
- code generators;
- database data replication tools;
- dynamic transformation engines.
Code generators
Code generators create customized 3GL/4GL transformation programs based on source and
target data definitions. The main issue with this approach is the management of the large
number of programs required to support a complex corporate data warehouse. Vendors
recognize this issue and some are developing management components employing techniques such as workflow methods and automated scheduling systems.
Table 31.3 The requirements for a data warehouse RDBMS.
- Load performance
- Load processing
- Data quality management
- Query performance
- Terabyte scalability
- Mass user scalability
- Networked data warehouse
- Warehouse administration
- Integrated dimensional analysis
- Advanced query functionality
Load performance
Data warehouses require incremental loading of new data on a periodic basis within
narrow time windows. Performance of the load process should be measured in hundreds
of millions of rows or gigabytes of data per hour and there should be no maximum limit
that constrains the business.
Load processing
Many steps must be taken to load new or updated data into the data warehouse including
data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and
metadata update. Although each step may in practice be atomic, the load process should
appear to execute as a single, seamless unit of work.
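The "single, seamless unit of work" requirement maps naturally onto a database transaction. The following sketch uses Python's built-in sqlite3 (the table and integrity checks are invented) to show a multi-step load, conversion, check, insert, that either commits completely or rolls back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sale (branch TEXT NOT NULL, revenue REAL NOT NULL)")

def load_batch(conn, raw_rows):
    """Load a batch of 'branch,revenue' strings as one unit of work."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            for raw in raw_rows:
                branch, revenue = raw.split(",")   # conversion step
                revenue = float(revenue)
                if revenue < 0:                    # integrity-check step
                    raise ValueError("negative revenue")
                conn.execute("INSERT INTO Sale VALUES (?, ?)",
                             (branch.strip(), revenue))
        return True
    except ValueError:
        return False

assert load_batch(conn, ["B001, 5000", "B002, 7500"]) is True
assert load_batch(conn, ["B003, 1000", "B004, -50"]) is False  # rolled back
count = conn.execute("SELECT COUNT(*) FROM Sale").fetchone()[0]
print(count)  # 2 -- the failed batch left no partial rows
```

Production load tools add bulk paths, indexing, and metadata updates to this pipeline, but the all-or-nothing behavior is the same idea.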
Data quality management
The shift to fact-based management demands the highest data quality. The warehouse
must ensure local consistency, global consistency, and referential integrity despite dirty
sources and massive database sizes. While loading and preparation are necessary steps,
they are not sufficient. The ability to answer end-users queries is the measure of success
for a data warehouse application. As more questions are answered, analysts tend to ask
more creative and complex questions.
Query performance
Fact-based management and ad hoc analysis must not be slowed or inhibited by the
performance of the data warehouse RDBMS. Large, complex queries for key business
operations must complete in reasonable time periods.
Terabyte scalability
Data warehouse sizes are growing at enormous rates, with sizes ranging from a few gigabytes, to hundreds of gigabytes, to terabyte-sized (10^12 bytes) and petabyte-sized (10^15 bytes) databases. The RDBMS must not have any architectural limitations on the size of the database and should support modular and parallel management. In the event of failure, the RDBMS should support continued availability and provide mechanisms for recovery. The RDBMS must support mass storage devices such as optical disk and hierarchical storage management devices. Lastly, query performance should not be dependent on the size of the database, but rather on the complexity of the query.
Mass user scalability
Current thinking is that access to a data warehouse is limited to relatively low numbers
of managerial users. This is unlikely to remain true as the value of data warehouses is
realized. It is predicted that the data warehouse RDBMS should be capable of supporting
hundreds, or even thousands, of concurrent users while maintaining acceptable query
performance.
Networked data warehouse
Data warehouse systems should be capable of cooperating in a larger network of data warehouses. The data warehouse must include tools that coordinate the movement of subsets
of data between warehouses. Users should be able to look at, and work with, multiple data
warehouses from a single client workstation.
Warehouse administration
The very-large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource
limits, chargeback accounting to allocate costs back to users, and query prioritization to
address the needs of different user classes and activities. The RDBMS must also provide
for workload tracking and tuning so that system resources may be optimized for maximum
performance and throughput. The most visible and measurable value of implementing
a data warehouse is evidenced in the uninhibited, creative access to data it provides for
end-users.
Integrated dimensional analysis
The power of multi-dimensional views is widely accepted, and dimensional support
must be inherent in the warehouse RDBMS to provide the highest performance for
relational OLAP tools (see Chapter 33). The RDBMS must support fast, easy creation of
pre-computed summaries common in large data warehouses, and provide maintenance
tools to automate the creation of these pre-computed aggregates. Dynamic calculation of
aggregates should be consistent with the interactive performance needs of the end-user.
Advanced query functionality
End-users require advanced analytical calculations, sequential and comparative analysis,
and consistent access to detailed and summarized data. Using SQL in a client-server
point-and-click tool environment may sometimes be impractical or even impossible
due to the complexity of the users queries. The RDBMS must provide a complete and
advanced set of analytical operations.
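A comparative calculation such as "monthly revenue against the same month last year" is a typical example of analysis that is awkward to express in plain SQL but simple procedurally. A hypothetical sketch, with invented figures:

```python
def year_on_year(monthly):
    """monthly: list of (year, month, revenue) tuples in chronological order.
    Returns the change versus the same month one year earlier, where known."""
    index = {(y, m): r for y, m, r in monthly}
    return {(y, m): r - index[(y - 1, m)]
            for y, m, r in monthly if (y - 1, m) in index}

monthly = [(2003, 1, 100.0), (2003, 2, 120.0),
           (2004, 1, 130.0), (2004, 2, 115.0)]
print(year_on_year(monthly))
# {(2004, 1): 30.0, (2004, 2): -5.0}
```

Modern SQL window functions (e.g. LAG) can express the same comparison; the point of the requirement is that the RDBMS, or the tools above it, must make such sequential and comparative analysis practical.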
31.4.2 Parallel DBMSs
Data warehousing requires the processing of enormous amounts of data and parallel database technology offers a solution to providing the necessary growth in performance. The
success of parallel DBMSs depends on the efficient operation of many resources including processors, memory, disks, and network connections. As data warehousing grows
in popularity, many vendors are building large decision-support DBMSs using parallel
technologies. The aim is to solve decision-support problems using multiple nodes working on the same problem. The major characteristics of parallel DBMSs are scalability,
operability, and availability.
The parallel DBMS performs many database operations simultaneously, splitting
individual tasks into smaller parts so that tasks can be spread across multiple processors.
Parallel DBMSs must be capable of running parallel queries. In other words, they must
be able to decompose large complex queries into subqueries, run the separate subqueries
simultaneously, and reassemble the results at the end. The capability of such DBMSs must
also include parallel data loading, table scanning, and data archiving and backup. There
are two main parallel hardware architectures commonly used as database server platforms
for data warehousing: symmetric multiprocessing (SMP) and massively parallel processing (MPP).
The SMP and MPP parallel architectures were described in detail in Section 22.1.1.
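The decompose-run-reassemble pattern described above can be sketched in a few lines. Here threads stand in for the processors or nodes of a real parallel DBMS, and the data and partition layout are invented:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Subquery: total revenue over one partition of the data."""
    return sum(row["price"] for row in partition)

# Hypothetical partitions, e.g. one per disk or node.
partitions = [
    [{"price": 90000}, {"price": 110000}],
    [{"price": 150000}],
    [{"price": 120000}, {"price": 80000}],
]

# Run the subqueries concurrently, then reassemble the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)
print(total)  # 550000
```

Aggregates like SUM, COUNT, MIN, and MAX decompose this cleanly; operations such as joins require more elaborate decomposition strategies, which is where parallel DBMS query optimizers earn their keep.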
31.4.3 Data Warehouse Metadata

In this section we examine the management of metadata in the warehouse. Metadata describes every data item in the warehouse, including each table, index, and view, and any associated constraints. This information is held in the DBMS system catalog; however, there are additional requirements for the purposes of the warehouse. For example, metadata should also describe any fields associated with aggregations, including a description of the aggregation that was performed. In addition, table partitions should be described, including information on the partition key and the data range associated with each partition.
The metadata described above is also required by the query manager to generate appropriate queries. In turn, the query manager generates additional metadata about the queries that are run, which can be used to build a history of all queries and a query profile for each user, group of users, or the data warehouse as a whole. There is also metadata associated with the users of queries, including, for example, information describing what the terms 'price' or 'customer' mean in a particular database and whether the meaning has changed over time.
Synchronizing metadata
The major integration issue is how to synchronize the various types of metadata used
throughout the data warehouse. The various tools of a data warehouse generate and use
their own metadata, and to achieve integration, we require that these tools are capable of
sharing their metadata. The challenge is to synchronize metadata between different products from different vendors using different metadata stores. For example, it is necessary
to identify the correct item of metadata at the right level of detail from one product and
map it to the appropriate item of metadata at the right level of detail in another product,
then sort out any coding differences between them. This has to be repeated for all other
metadata that the two products have in common. Further, any changes to the metadata (or even meta-metadata) in one product need to be conveyed to the other product. The task of synchronizing two products is highly complex, and therefore repeating this process for the six or more products that may make up a data warehouse can be resource intensive.
However, integration of the metadata must be achieved.
Originally there were two major standards for metadata and modeling in the areas of data warehousing and component-based development, proposed by the Meta Data Coalition (MDC) and the Object Management Group (OMG). However, these two industry organizations jointly announced that the MDC would merge into the OMG. As a result, the MDC discontinued independent operations and work continued in the OMG to integrate the two standards.

The merger of the MDC into the OMG marked an agreement by the major data warehousing and metadata vendors to converge on one standard, incorporating the best of the MDC's Open Information Model (OIM) with the best of the OMG's Common Warehouse Metamodel (CWM). This work is now complete, and the resulting specification, issued by the OMG as the next version of the CWM, is discussed in Section 27.1.3. A single standard allows users to exchange metadata freely between different products from different vendors.

The OMG's CWM builds on various standards, including the OMG's UML (Unified Modeling Language), XMI (XML Metadata Interchange), and MOF (Meta Object Facility), and on the MDC's OIM. The CWM was developed by a number of companies, including IBM, Oracle, Unisys, Hyperion, Genesis, NCR, UBS, and Dimension EDI.
31.4.4 Administration and Management Tools

A data warehouse requires tools to support the administration and management of such a complex environment. These tools are relatively scarce, especially those that are well integrated with the various types of metadata and the day-to-day operations of the data warehouse. The data warehouse administration and management tools must be capable of supporting the following tasks:
- monitoring data loading from multiple sources;
- data quality and integrity checks;
- managing and updating metadata;
- monitoring database performance to ensure efficient query response times and resource utilization;
- auditing data warehouse usage to provide user chargeback information;
- replicating, subsetting, and distributing data;
- maintaining efficient data storage management;
- purging data;
- archiving and backing-up data;
- implementing recovery following failure;
- security management.

31.5 Data Marts

Accompanying the rapid emergence of data warehouses is the related concept of data marts. In this section we describe what data marts are, the reasons for building data marts, and the issues associated with the development and use of data marts.

Data mart: A subset of a data warehouse that supports the requirements of a particular department or business function.

A data mart holds a subset of the data in a data warehouse, normally in the form of summary data relating to a particular department or business function. The data mart can be standalone or linked centrally to the corporate data warehouse. As a data warehouse grows larger, the ability to serve the various needs of the organization may be compromised. The popularity of data marts stems from the fact that corporate-wide data warehouses are proving difficult to build and use. The typical architecture for a data warehouse and associated data mart is shown in Figure 31.3. The characteristics that differentiate data marts and data warehouses include:
- a data mart focuses on only the requirements of users associated with one department or business function;
- data marts do not normally contain detailed operational data, unlike data warehouses;
- as data marts contain less data compared with data warehouses, data marts are more easily understood and navigated.
1171
1172
|
Chapter 31 z Data Warehousing Concepts
31.5.2
31.5.1
as data marts contain less data compared with data warehouses, data marts are more
easily understood and navigated.
There are several approaches to building data marts. One approach is to build several data marts with a view to eventually integrating them into a data warehouse; another is to build the infrastructure for a corporate data warehouse while at the same time building one or more data marts to satisfy immediate business needs.
Data mart architectures can be built as two-tier or three-tier database applications. The
data warehouse is the optional first tier (if the data warehouse provides the data for the
data mart), the data mart is the second tier, and the end-user workstation is the third tier,
as shown in Figure 31.3. Data is distributed among the tiers.
There are many reasons for creating a data mart, including giving users access to the data they most often need to analyze, and improving query response time as a consequence of the reduced volume of data.
The issues associated with data marts include the following. Data marts must be capable of delivering increasing functionality, such as Online Analytical Processing (OLAP) and/or data mining tools. Further, hundreds of users must be capable of remotely accessing the data mart. The complexity and size of some data marts now match the characteristics of small-scale corporate data warehouses.
Data mart Internet/intranet products normally sit between a Web server and the data analysis product. Vendors are developing products with increasingly advanced Web capabilities, including Java and ActiveX support. We discussed Web and DBMS integration in detail in Chapter 29.
31.6 Data Warehousing Using Oracle

31.6.1 Oracle9i
Oracle9i Enterprise Edition is one of the leading relational DBMSs for data warehousing. Oracle has achieved this success by focusing on the basic, core requirements for data warehousing: performance, scalability, and manageability. As data warehouses store larger volumes of data, support more users, and require faster performance, these core requirements remain key factors in the successful implementation of data warehouses.
However, Oracle goes beyond these core requirements and is the first true data warehouse
platform. Data warehouse applications require specialized processing techniques to allow
support for complex, ad hoc queries running against large amounts of data. To address
these special requirements, Oracle offers a variety of query processing techniques, sophisticated query optimization to choose the most efficient data access path, and a scalable
architecture that takes full advantage of all parallel hardware configurations. Successful
data warehouse applications rely on superior performance when accessing the enormous
amounts of stored data. Oracle provides a rich variety of integrated indexing schemes,
join methods, and summary management features to deliver answers quickly to data warehouse users. Oracle also addresses applications that have mixed workloads, in which administrators want to control which users, or groups of users, have priority when executing transactions or queries. In this section we provide an overview of the main features of Oracle that are particularly aimed at supporting data warehousing applications. These features include:

- summary management;
- analytical functions;
- bitmapped indexes;
- advanced join methods;
- sophisticated SQL optimizer;
- resource management.
Summary management
In a data warehouse application, users often issue queries that summarize detail data by
common dimensions, such as month, product, or region. Oracle provides a mechanism for
storing multiple dimensions and summary calculations on a table. Thus, when a query
requests a summary of detail records, the query is transparently re-written to access the
stored aggregates rather than summing the detail records every time the query is issued.
This results in dramatic improvements in query performance. These summaries are automatically maintained from data in the base tables. Oracle also provides summary advisory
functions that assist database administrators in choosing which summary tables are the
most effective, depending on actual workload and schema statistics. Oracle Enterprise
Manager supports the creation and management of materialized views and related dimensions and hierarchies via a graphical interface, greatly simplifying the management of
materialized views.
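Oracle performs this rewrite transparently against materialized views; the effect can be sketched outside Oracle with an explicitly maintained summary table. The table and column names below are illustrative and not taken from the DreamHome schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Detail (fact) table: one row per property sale.
cur.execute("CREATE TABLE property_sale (city TEXT, month TEXT, selling_price INTEGER)")
cur.executemany("INSERT INTO property_sale VALUES (?, ?, ?)",
                [("Glasgow", "2004-01", 90000),
                 ("Glasgow", "2004-01", 120000),
                 ("London",  "2004-01", 250000)])

# A pre-computed summary table plays the role of a materialized view.
cur.execute("""
    CREATE TABLE sales_summary AS
    SELECT city, month, SUM(selling_price) AS total_revenue, COUNT(*) AS n_sales
    FROM property_sale GROUP BY city, month
""")

# A summary query is answered from the stored aggregates rather than
# by re-summing the detail rows.
total = cur.execute("SELECT total_revenue FROM sales_summary "
                    "WHERE city = 'Glasgow' AND month = '2004-01'").fetchone()[0]
print(total)  # → 210000
```

In Oracle itself the detail query would be rewritten automatically; in this sketch the summary table must be refreshed whenever the base table changes, which is exactly the maintenance burden Oracle's summary management automates.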
Analytical functions

Oracle9i includes a range of SQL functions for business intelligence and data warehousing applications. These functions are collectively called analytical functions, and they provide improved performance and simplified coding for many business analysis queries. Some examples of the new capabilities are:

- ranking (for example, who are the top ten sales reps in each region of Great Britain?);
- moving aggregates (for example, what is the three-month moving average of property sales?);
- other functions, including cumulative aggregates, lag/lead expressions, period-over-period comparisons, and ratio-to-report.

Oracle also includes the CUBE and ROLLUP operators for OLAP analysis via SQL. These analytical and OLAP functions significantly extend the capabilities of Oracle for analytical applications (see Chapter 33).
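SQLite (version 3.25 and later) supports comparable window functions, so the ranking and moving-aggregate capabilities can be sketched as follows; the data is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (month INTEGER, amount INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10), (2, 20), (3, 30), (4, 40)])

# Moving aggregate: a three-month moving average of sales
# (the current month plus the two preceding months).
moving = cur.execute("""
    SELECT month,
           AVG(amount) OVER (ORDER BY month
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM sales
""").fetchall()
print(moving)  # → [(1, 10.0), (2, 15.0), (3, 20.0), (4, 30.0)]

# Ranking: months ordered by sales volume.
top = cur.execute(
    "SELECT month, RANK() OVER (ORDER BY amount DESC) FROM sales").fetchall()
print(top[0])  # → (4, 1)
```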
Bitmapped indexes

A bitmapped index represents each distinct value of a low-cardinality column as a string of bits covering the rows of the table, so that complex predicates can be resolved by combining bitmaps with fast bitwise operations.
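The idea behind a bitmapped index, sketched here in plain Python rather than as Oracle's on-disk structure, is to keep one bitmap per distinct value of a low-cardinality column so that a compound predicate reduces to bitwise ANDs over the bitmaps:

```python
# Build a bitmap index over a low-cardinality column: one bit per row,
# one bitmap (stored as an int) per distinct column value.
def bitmap_index(rows, column):
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps.setdefault(row[column], 0)
        bitmaps[row[column]] |= 1 << i
    return bitmaps

rows = [
    {"city": "Glasgow", "type": "Flat"},
    {"city": "London",  "type": "House"},
    {"city": "Glasgow", "type": "House"},
    {"city": "Glasgow", "type": "Flat"},
]
by_city = bitmap_index(rows, "city")
by_type = bitmap_index(rows, "type")

# WHERE city = 'Glasgow' AND type = 'Flat': AND the two bitmaps,
# then read off the set bits as matching row positions.
match = by_city["Glasgow"] & by_type["Flat"]
hits = [i for i in range(len(rows)) if match >> i & 1]
print(hits)  # → [0, 3]
```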
Resource management
Managing CPU and disk resources in a multi-user data warehouse or OLTP application
is challenging. As more users require access, contention for resources becomes greater.
Oracle has resource management functionality that provides control over the system resources assigned to users. Important online users, such as order entry clerks, can be given a high priority, while other users, such as those running batch reports, receive lower priorities. Users are assigned to resource classes, such as order entry or batch, and each resource class is then assigned an appropriate percentage of machine resources. In this way, high-priority users are given more system resources than lower-priority users.
Chapter Summary

- A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. Data warehousing is both a data management and a data analysis technology.
- The operational data source for the data warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
- A DBMS built for Online Transaction Processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind. For example, OLTP systems are designed to maximize transaction processing capacity, while data warehouses are designed to support ad hoc query processing.
- The potential benefits of data warehousing are high returns on investment, substantial competitive advantage, and increased productivity of corporate decision-makers.
- A Data Webhouse is a distributed data warehouse that is implemented over the Web with no central data repository.
- The major components of a data warehouse include the operational data sources, operational data store, load manager, warehouse manager, query manager, detailed, lightly and highly summarized data, archive/backup data, metadata, and end-user access tools.
- The operational data store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
- The load manager (also called the frontend component) performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.
- The warehouse manager performs all the operations associated with the management of the data in the warehouse. These operations include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing-up of data.
- The query manager (also called the backend component) performs all the operations associated with the management of user queries. These operations include directing queries to the appropriate tables and scheduling the execution of queries.
- End-user access tools can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, Online Analytical Processing (OLAP) tools, and data mining tools.
- Data warehousing focuses on the management of five primary data flows: the inflow, upflow, downflow, outflow, and metaflow.
- Inflow is the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
- Upflow is the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
- Downflow is the processes associated with archiving and backing-up of data in the warehouse.
- Outflow is the processes associated with making the data available to the end-users.
- Metaflow is the processes associated with the management of the metadata (data about data).
- The requirements for a data warehouse RDBMS include load performance, load processing, data quality management, query performance, terabyte scalability, mass user scalability, networked data warehouse, warehouse administration, integrated dimensional analysis, and advanced query functionality.
- A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function. The issues associated with data marts include functionality, size, load performance, users' access to data in multiple data marts, Internet/intranet access, administration, and installation.
Review Questions

31.6 Describe the processes associated with each of the following data warehouse data flows: (a) inflow; (b) upflow; (c) downflow; (d) outflow; (e) metaflow.
31.7 What are the three main approaches taken by vendors to provide data extraction, cleansing, and transformation tools?
31.8 Describe the specialized requirements of a relational database management system (RDBMS) suitable for use in a data warehouse environment.
31.9 Discuss how parallel technologies can support the requirements of a data warehouse.
31.10 Discuss the importance of managing metadata and how this relates to the integration of the data warehouse.
31.11 Discuss the main tasks associated with the administration and management of a data warehouse.
31.12 Discuss how data marts differ from data warehouses and identify the main reasons for implementing a data mart.
31.13 Identify the main issues associated with the development and management of data marts.
31.14 Describe the features of Oracle that support the core requirements of data warehousing.

Exercise

31.15 You are asked by the Managing Director of DreamHome to investigate and report on the applicability of data warehousing for the organization. The report should compare data warehouse technology with OLTP systems, identify the advantages and disadvantages, and discuss any problem areas associated with implementing a data warehouse. The report should reach a fully justified set of conclusions on the applicability of a data warehouse for DreamHome.
Chapter 32
Data Warehousing Design

Chapter Objectives
In Section 32.1 we highlight the major issues associated with data warehouse design.
In Section 32.2 we describe the basic concepts associated with dimensionality modeling and then compare this technique with traditional Entity-Relationship modeling.
In Section 32.3 we describe and demonstrate a step-by-step methodology for designing
a data warehouse database using worked examples taken from an extended version of
the DreamHome case study described in Section 10.4 and Appendix A. In Section 32.4
we describe criteria for assessing the dimensionality of a data warehouse. Finally, in
Section 32.5 we describe how to design a data warehouse using an Oracle product called
Oracle Warehouse Builder.
32.1 Designing a Data Warehouse Database
Designing a data warehouse database is highly complex. To begin a data warehouse project, we need answers to questions such as: which user requirements are most important
and which data should be considered first? Also, should the project be scaled down into
something more manageable yet at the same time provide an infrastructure capable of
ultimately delivering a full-scale enterprise-wide data warehouse? Questions such as these
highlight some of the major issues in building data warehouses. For many enterprises the
solution is data marts, which we described in Section 31.5. Data marts allow designers
to build something that is far simpler and achievable for a specific group of users. Few
designers are willing to commit to an enterprise-wide design that must meet all user
requirements at one time. However, despite the interim solution of building data marts,
the goal remains the same: the ultimate creation of a data warehouse that supports the
requirements of the enterprise.
The requirements collection and analysis stage (see Section 9.5) of a data warehouse
project involves interviewing appropriate members of staff such as marketing users,
finance users, sales users, operational users, and management to enable the identification
of a prioritized set of requirements for the enterprise that the data warehouse must meet.
At the same time, interviews are conducted with members of staff responsible for Online
Transaction Processing (OLTP) systems to identify which data sources can provide clean,
valid, and consistent data that will remain supported over the next few years.
The interviews provide the necessary information for the top-down view (user requirements) and the bottom-up view (which data sources are available) of the data warehouse.
With these two views defined we are ready to begin the process of designing the data warehouse database.
The database component of a data warehouse is described using a technique called dimensionality modeling. In the following sections, we first describe the concepts associated
with a dimensional model and contrast this model with the traditional Entity-Relationship
(ER) model (see Chapters 11 and 12). We then present a step-by-step methodology for
creating a dimensional model using worked examples from an extended version of the
DreamHome case study.
32.2 Dimensionality Modeling

Dimensionality modeling: A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data (which can be denormalized).
The star schema exploits the characteristics of factual data such that facts are generated
by events that occurred in the past, and are unlikely to change, regardless of how they are
analyzed. As the bulk of data in a data warehouse is represented as facts, the fact tables
can be extremely large relative to the dimension tables. As such, it is important to treat
fact data as read-only reference data that will not change over time. The most useful fact
tables contain one or more numerical measures, or facts, that occur for each record. In
Figure 32.1, the facts are offerPrice, sellingPrice, saleCommission, and saleRevenue. The most
useful facts in a fact table are numeric and additive because data warehouse applications
almost never access a single record; rather, they access hundreds, thousands, or even
millions of records at a time and the most useful thing to do with so many records is to
aggregate them.
Dimension tables, by contrast, generally contain descriptive textual information.
Dimension attributes are used as the constraints in data warehouse queries. For example,
the star schema shown in Figure 32.1 can support queries that require access to sales
of properties in Glasgow using the city attribute of the PropertyForSale table, and on sales
of properties that are flats using the type attribute in the PropertyForSale table. In fact, the usefulness of a data warehouse depends on the appropriateness of the data held in the dimension tables.
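A cut-down version of the Figure 32.1 schema shows dimension attributes acting as query constraints; only a few of the figure's columns are modeled here and the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE PropertyForSale (propertyID INTEGER PRIMARY KEY, city TEXT, type TEXT);
    CREATE TABLE PropertySale (propertyID INTEGER, sellingPrice INTEGER);
    INSERT INTO PropertyForSale VALUES (1, 'Glasgow', 'Flat'),
                                       (2, 'Glasgow', 'House'),
                                       (3, 'London',  'Flat');
    INSERT INTO PropertySale VALUES (1, 95000), (2, 180000), (3, 260000), (1, 99000);
""")

# Dimension attributes act as constraints on the fact table:
# total revenue from sales of flats in Glasgow.
total = cur.execute("""
    SELECT SUM(s.sellingPrice)
    FROM PropertySale s
    JOIN PropertyForSale p ON s.propertyID = p.propertyID
    WHERE p.city = 'Glasgow' AND p.type = 'Flat'
""").fetchone()[0]
print(total)  # → 194000
```

Note that the aggregate is computed over many fact rows at once, which is the additive access pattern described above.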
Figure 32.1  Star schema for property sales of DreamHome.
Efficiency: The consistency of the underlying database structure allows more efficient access to the data by various tools, including report writers and query tools.
Figure 32.2  Part of the star schema for property sales of DreamHome with a normalized version of the Branch dimension table.
There is a variation to the star schema called the snowflake schema, which allows
dimensions to have dimensions. For example, we could normalize the location data (city,
region, and country attributes) in the Branch dimension table of Figure 32.1 to create two
new dimension tables called City and Region. A normalized version of the Branch dimension table of the property sales schema is shown in Figure 32.2. In a snowflake schema
the location data in the PropertyForSale, ClientBuyer, Staff, and Owner dimension tables would
also be removed and the new City and Region dimension tables would be shared with these
tables.
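The normalization just described can be sketched as DDL; the surrogate-key column names such as cityID and regionID are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaking the Branch dimension: location data moves into its own
    -- dimension tables, which other dimensions can then share.
    CREATE TABLE Region (regionID INTEGER PRIMARY KEY, region TEXT, country TEXT);
    CREATE TABLE City   (cityID   INTEGER PRIMARY KEY, city TEXT,
                         regionID INTEGER REFERENCES Region);
    CREATE TABLE Branch (branchID INTEGER PRIMARY KEY, branchNo TEXT,
                         cityID   INTEGER REFERENCES City);
""")
tables = conn.execute("SELECT name FROM sqlite_master ORDER BY name").fetchall()
print(tables)  # → [('Branch',), ('City',), ('Region',)]
```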
Starflake schema: A combination of denormalized star and normalized snowflake schemas.

The most appropriate database schemas use a mixture of denormalized star and normalized snowflake schemas. This combination is called a starflake schema. Some dimensions may be present in both forms to cater for different query requirements. Whether the schema is star, snowflake, or starflake, the predictable and standard form of the underlying dimensional model offers important advantages within a data warehouse environment, including:
Ability to handle changing requirements: The star schema can adapt to changes in the users' requirements, as all dimensions are equivalent in terms of providing access to the fact table. This means that the design is better able to support ad hoc user queries.
Ability to model common business situations: There are a growing number of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces; for example, slowly changing dimensions, where a constant dimension such as Branch or Staff actually evolves slowly and asynchronously. We discuss slowly changing dimensions in more detail in Section 32.3, Step 8.

Extensibility: The dimensional model is extensible; for example, typical changes that a dimensional model must support include: (a) adding new facts, as long as they are consistent with the fundamental granularity of the existing fact table; (b) adding new dimensions, as long as there is a single value of that dimension defined for each existing fact record; (c) adding new dimensional attributes; and (d) breaking existing dimension records down to a lower level of granularity from a certain point in time forward.

Predictable query processing: Data warehouse applications that drill down will simply be adding more dimension attributes from within a single star schema. Applications that drill across will be linking separate fact tables together through the shared (conformed) dimensions. Even though the overall suite of star schemas in the enterprise dimensional model is complex, the query processing is very predictable because, at the lowest level, each fact table should be queried independently.
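As a preview of Step 8, the common "Type 2" treatment of a slowly changing dimension versions each dimension record rather than overwriting it. This sketch uses plain Python with illustrative field names:

```python
# Type 2 handling of a slowly changing dimension: on a change, close the
# current record and insert a new version with its own surrogate key, so
# facts recorded before the change still join to the old attribute values.
def apply_change(dim_rows, branch_no, new_city, change_date):
    current = next(r for r in dim_rows
                   if r["branchNo"] == branch_no and r["endDate"] is None)
    current["endDate"] = change_date
    dim_rows.append({"branchKey": len(dim_rows) + 1,   # new surrogate key
                     "branchNo": branch_no, "city": new_city,
                     "startDate": change_date, "endDate": None})

branch_dim = [{"branchKey": 1, "branchNo": "B003", "city": "Glasgow",
               "startDate": "2003-01-01", "endDate": None}]
apply_change(branch_dim, "B003", "Edinburgh", "2004-06-01")
print([(r["branchKey"], r["city"], r["endDate"]) for r in branch_dim])
# → [(1, 'Glasgow', '2004-06-01'), (2, 'Edinburgh', None)]
```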
32.3 Database Design Methodology for Data Warehouses

Table 32.1  The nine-step methodology for dimensional database design.

Step  Activity
1     Choosing the process
2     Choosing the grain
3     Identifying and conforming the dimensions
4     Choosing the facts
5     Storing pre-calculations in the fact table
6     Rounding out the dimension tables
7     Choosing the duration of the database
8     Tracking slowly changing dimensions
9     Deciding the query priorities and the query modes
Step 1: Choosing the process

The business processes of DreamHome include:

- property sales;
- property rentals (leasing);
- property viewing;
- property advertising;
- property maintenance.

Figure 32.3  ER diagram of the business processes of DreamHome.
The data requirements associated with these processes are shown in the ER diagram of
Figure 32.3. Note that this ER diagram forms part of the design documentation, which
describes the Online Transaction Processing (OLTP) systems required to support the business processes of DreamHome. The ER diagram of Figure 32.3 has been simplified by
labeling only the main entities and relationships and is created by following Steps 1 and 2
of the database design methodology described earlier in Chapters 15 and 16. The shaded
entities represent the core facts for each business process of DreamHome. The business
process selected to be the first data mart is property sales. The part of the original ER
diagram that represents the data requirements of the property sales business process is shown in Figure 32.4.

Figure 32.4  Part of the ER diagram in Figure 32.3 that represents the data requirements of the property sales business process of DreamHome.
Step 2: Choosing the grain
Choosing the grain means deciding exactly what a fact table record represents. For example,
the PropertySale entity shown with shading in Figure 32.4 represents the facts about each
property sale and becomes the fact table of the property sales star schema shown
previously in Figure 32.1. Therefore, the grain of the PropertySale fact table is individual
property sales.
Only when the grain for the fact table is chosen can we identify the dimensions of the
fact table. For example, the Branch, Staff, Owner, ClientBuyer, PropertyForSale, and Promotion
entities in Figure 32.4 will be used to reference the data about property sales and will become the dimension tables of the property sales star schema shown previously in Figure 32.1.
We also include Time as a core dimension, which is always present in star schemas.
The grain decision for the fact table also determines the grain of each of the dimension
tables. For example, if the grain for the PropertySale fact table is an individual property sale,
then the grain of the ClientBuyer dimension is the details of the client who bought a particular property.
Step 3: Identifying and conforming the dimensions
Dimensions set the context for asking questions about the facts in the fact table. A well-built set of dimensions makes the data mart understandable and easy to use. We identify
dimensions in sufficient detail to describe things such as clients and properties at the
correct grain. For example, each client of the ClientBuyer dimension table is described by
the clientID, clientNo, clientName, clientType, city, region, and country attributes, as shown previously in Figure 32.1. A poorly presented or incomplete set of dimensions will reduce the
usefulness of a data mart to an enterprise.
Figure 32.5  Star schemas for property sales and property advertising with Time, PropertyForSale, Branch, and Promotion as conformed (shared) dimension tables.
If any dimension occurs in two data marts, they must be exactly the same dimension, or
one must be a mathematical subset of the other. Only in this way can two data marts share
one or more dimensions in the same application. When a dimension is used in more than
one data mart, the dimension is referred to as being conformed. Examples of dimensions that must conform between property sales and property advertising are the Time,
PropertyForSale, Branch, and Promotion dimensions. If these dimensions are not synchronized
or if they are allowed to drift out of synchronization between data marts, the overall data
warehouse will fail, because the two data marts will not be able to be used together.
For example, in Figure 32.5 we show the star schemas for property sales and property
advertising with Time, PropertyForSale, Branch, and Promotion as conformed dimensions with
light shading.
Figure 32.6  Star schema for property rentals of DreamHome: an example of a badly structured fact table with non-numeric facts, a non-additive fact, and a numeric fact whose granularity is inconsistent with the other facts in the table.

Step 4: Choosing the facts
The grain of the fact table determines which facts can be used in the data mart. All the
facts must be expressed at the level implied by the grain. In other words, if the grain
of the fact table is an individual property sale, then all the numerical facts must refer
to this particular sale. Also, the facts should be numeric and additive. In Figure 32.6 we
use the star schema of the property rental process of DreamHome to illustrate a badly
structured fact table. This fact table is unusable with non-numeric facts (promotionName
and staffName), a non-additive fact (monthlyRent), and a fact (lastYearRevenue) at a different
granularity from the other facts in the table. Figure 32.7 shows how the Lease fact
table shown in Figure 32.6 could be corrected so that the fact table is appropriately
structured.
Additional facts can be added to a fact table at any time provided they are consistent
with the grain of the table.
Figure 32.7  Star schema for property rentals of DreamHome. This is the schema shown in Figure 32.6 with the problems corrected.
Figure 32.8  Dimensional model (fact constellation) for the DreamHome data warehouse.
We integrate the star schemas for the business processes of DreamHome using the conformed dimensions. For example, all the fact tables share the Time and Branch dimensions
as shown in Table 32.2. A dimensional model, which contains more than one fact table
sharing one or more conformed dimension tables, is referred to as a fact constellation.
The fact constellation for the DreamHome data warehouse is shown in Figure 32.8. The
model has been simplified by displaying only the names of the fact and dimension tables.
Note that the fact tables are shown with dark shading and all the dimension tables being
conformed are shown with light shading.
Table 32.2  Fact and dimension tables for each business process of DreamHome.

Business process        Fact table
Property sales          PropertySale
Property rentals        Lease
Property viewing        PropertyViewing
Property advertising    Advert
Property maintenance    PropertyMaintenance

32.4 Criteria for Assessing the Dimensionality of a Data Warehouse
Table 32.3  Criteria for assessing the dimensionality of a data warehouse.

Group           Criteria
Architecture    Explicit declaration; Conformed dimensions and facts; Dimensional integrity; Open aggregate navigation; Dimensional symmetry; Dimensional scalability; Sparsity tolerance
Administration  Graceful modification; Dimensional replication; Changed dimension notification; Surrogate key administration; International consistency
Expression criteria are mostly analytic capabilities that are needed in real-life situations. The end-user community experiences all expression criteria directly. The expression criteria for dimensional systems are not the only features users look for in a data warehouse, but they are all capabilities needed to exploit the power of a dimensional system. A system that supports most or all of these dimensional criteria would be adaptable, easier to administer, and able to address many real-world applications. The major point of dimensional systems is that they are business-issue and end-user driven. For further details of the criteria in Table 32.3, the interested reader is referred to Kimball (2000a,b).
The Expression group of criteria comprises: multiple-dimension hierarchies; ragged-dimension hierarchies; multiple-valued dimensions; slowly changing dimensions; roles of a dimension; hot-swappable dimensions; on-the-fly fact range dimensions; and on-the-fly behavior dimensions.
32.5 Data Warehousing Design Using Oracle
We introduced the Oracle DBMS in Section 8.2. In this section, we describe Oracle
Warehouse Builder (OWB) as a key component of the Oracle Warehouse solution,
enabling the design and deployment of data warehouses, data marts, and e-Business intelligence applications. OWB is a design tool and an extraction, transformation, and loading
(ETL) tool. An important aspect of OWB from the customer's perspective is that it allows
the integration of the traditional data warehousing environments with the new e-Business
environments (Oracle Corporation, 2000). This section first provides an overview of the
components of OWB and the underlying technologies and then describes how the user
would apply OWB to typical data warehousing tasks.
The architecture of the Oracle Warehouse Builder is shown in Figure 32.9. Oracle Warehouse Builder is a key component of the larger Oracle data warehouse and must work alongside the other Oracle products within that environment.
Figure 32.9  Oracle Warehouse Builder architecture.
Defining sources
Once the requirements have been determined and all the data sources have been identified,
a tool such as OWB can be used for constructing the data warehouse. OWB can handle
a diverse set of data sources by means of integrators. OWB also has the concept of a
module, which is a logical grouping of related objects. There are two types of modules:
data source and warehouse. For example, a data source module might contain all the
definitions of the tables in an OLTP database that is a source for the data warehouse.
And a module of type warehouse might contain definitions of the facts, dimensions, and
staging tables that make up the data warehouse. It is important to note that modules merely
contain definitions, that is, metadata, about either sources or warehouses, and not objects
that can be populated or queried. A user identifies the integrators that are appropriate
for the data sources, and each integrator accesses a source and imports the metadata
that describes it.
Oracle sources
To connect to an Oracle database, the user chooses the integrator for Oracle databases.
Next, the user supplies some more detailed connection information, for example user
name, password, and SQL*Net connection string. This information is used to define a
database link in the database that hosts the OWB repository. OWB uses this database link
to query the system catalog of the source database and extract metadata that describes the
tables and views of interest to the user. The user experiences this as a process of visually
inspecting the source and selecting objects of interest.
Non-Oracle sources
Non-Oracle databases are accessed in exactly the same way as Oracle databases. What
makes this possible is the Transparent Gateway technology of Oracle. In essence, a
Transparent Gateway allows a non-Oracle database to be treated in exactly the same
way as if it were an Oracle database. On the SQL level, once the database link pointing to
the non-Oracle database has been defined, the non-Oracle database can be queried via
SELECT just like any Oracle database. In OWB, all the user has to do is identify the type
of database, so that OWB can select the appropriate Transparent Gateway for the database
link definition. In the case of MVS mainframe sources, OWB and Oracle PureExtract
provide data extraction from sources such as IMS, DB2, and VSAM. The plan is that
Oracle PureExtract will ultimately be integrated with the OWB technology.
Flat files
OWB supports two kinds of flat files: character-delimited and fixed-length files. If the data
source is a flat file, the user selects the integrator for flat files and specifies the path and
file name. The process of creating the metadata that describes a file is different from the
process used for a table in a database. With a table, the owning database itself stores
extensive information about the table such as the table name, the column names, and data
types. This information can be easily queried from the catalog. With a file, on the other
hand, the user assists in the process of creating the metadata with some intelligent guesses
supplied by OWB. In OWB, this process is called sampling.
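OWB's sampling step can be imagined as inferring column metadata from the first few records of the file. The inference rule below (any non-numeric value forces a column to TEXT) is a simplification of my own, not OWB's actual behavior:

```python
import csv, io

def sample_metadata(f, n=100):
    """Guess a type for each column from the first n data rows of a CSV file."""
    reader = csv.reader(f)
    header = next(reader)
    guesses = [set() for _ in header]
    for i, row in enumerate(reader):
        if i >= n:
            break
        for g, value in zip(guesses, row):
            try:
                float(value)
                g.add("NUMBER")
            except ValueError:
                g.add("TEXT")
    # Any non-numeric value seen in a column forces it to TEXT.
    return {name: ("TEXT" if "TEXT" in g else "NUMBER")
            for name, g in zip(header, guesses)}

data = io.StringIO("branchNo,city,salesTotal\nB003,Glasgow,120000\nB005,London,250000\n")
meta = sample_metadata(data)
print(meta)  # → {'branchNo': 'TEXT', 'city': 'TEXT', 'salesTotal': 'NUMBER'}
```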
Web data
With the proliferation of the Internet, the new challenge for data warehousing is to capture
data from Web sites. There are different types of data in e-Business environments: transactional Web data stored in the underlying databases; clickstream data stored in Web server
log files; registration data in databases or log files; and consolidated clickstream data in
the log files of Web analysis tools. OWB can address all these sources with its built-in
features for accessing databases and flat files.
Data quality
A solution to the challenge of data quality is OWB with Oracle PureIntegrate. Oracle PureIntegrate is customer data integration software that automates the creation of consolidated profiles of customers and related business data to support e-Business and customer relationship management applications. PureIntegrate complements OWB by providing advanced data transformation and cleansing features designed specifically to meet the requirements of database applications. These include:
- integrated name and address processing to standardize, correct, and enhance representations of customer names and locations;
- advanced probabilistic matching to identify unique consumers, businesses, households, super-households, or other entities for which no common identifiers exist;
- powerful rule-based merging to resolve conflicting data and create the best possible integrated result from the matched data.
Chapter 32 Data Warehousing Design
Generating code
The Code Generator is the OWB component that reads the target definitions and source-to-target mappings and generates code to implement the warehouse. The type of generated
code varies depending on the type of object that the user wants to implement.
Logical versus physical design
Before generating code, the user has primarily been working on the logical level, that is,
on the level of object definitions. On this level, the user is concerned with capturing all the
details and relationships (the semantics) of an object, but is not yet concerned with
defining any implementation characteristics. For example, consider a table to be implemented in an Oracle database. On the logical level, the user may be concerned with the
table name, the number of columns, the column names and data types, and any relationships that the table has to other tables. On the physical level, however, the question
becomes: how can this table be optimally implemented in an Oracle database? The user
must now be concerned with things like tablespaces, indexes, and storage parameters (see
Section 8.2.2). OWB allows the user to view and manipulate an object on both the logical
and physical level. The logical definition and physical implementation details are automatically synchronized.
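To make the distinction concrete, consider the following sketch (all names, sizes, and storage parameters are hypothetical). The column definitions and foreign key references capture the logical level; the clauses at the end, and the index, belong to the physical level:

```sql
-- Logical level: table name, columns, data types, relationships
CREATE TABLE property_sale (
  time_id     NUMBER       NOT NULL REFERENCES time_dim,
  branch_id   NUMBER       NOT NULL REFERENCES branch_dim,
  sale_amount NUMBER(10,2)
)
-- Physical level: where and how the table is stored
TABLESPACE dw_data
PCTFREE    5
STORAGE    (INITIAL 100M NEXT 100M);

-- Physical level: an index to support common access paths
CREATE INDEX property_sale_time_idx
  ON property_sale (time_id)
  TABLESPACE dw_idx;
```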
Configuration
In OWB, the process of assigning physical characteristics to an object is called configuration. The specific characteristics that can be defined depend on the object that is being configured; they include, for example, storage parameters, indexes, tablespaces, and partitions.
Validation
It is good practice to check the object definitions for completeness and consistency prior
to code generation. OWB offers a validate feature to automate this process. Errors detectable by the validation process include, for example, data type mismatches between sources
and targets, and foreign key errors.
Generation
The following are some of the main types of code that OWB produces:
- SQL Data Definition Language (DDL) commands. A warehouse module with its definitions of fact and dimension tables is implemented as a relational schema in an Oracle database. OWB generates SQL DDL scripts that create this schema. The scripts can either be executed from within OWB or saved to the file system for later, manual execution.
- PL/SQL programs. A source-to-target mapping results in a PL/SQL program if the source is a database, whether Oracle or non-Oracle. The PL/SQL program accesses the source database via a database link, performs the transformations as defined in the mapping, and loads the data into the target table.
- SQL*Loader control files. If the source in a mapping is a flat file, OWB generates a control file for use with SQL*Loader.
- Tcl scripts. OWB also generates Tcl scripts. These can be used to schedule PL/SQL and SQL*Loader mappings as jobs in Oracle Enterprise Manager, for example, to refresh the warehouse at regular intervals.
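The shape of such a generated mapping program can be sketched as follows; the database link, tables, and transformation expression are hypothetical, not actual OWB output:

```sql
-- Sketch of a generated-style PL/SQL mapping: extract over a database link,
-- apply the transformation defined in the mapping, load the target fact table
BEGIN
  INSERT INTO sales_fact (time_id, branch_id, revenue)
  SELECT o.time_id,
         o.branch_id,
         o.quantity * o.unit_price   -- transformation defined in the mapping
  FROM   orders@src_link o;
  COMMIT;
END;
/
```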
Dimension tables are also updated, albeit much less frequently than fact tables. An example of a slowly changing dimension is the Customer table, in which a customer's address, marital status, or name may all change over time. In addition to INSERT, OWB also supports other ways of manipulating the warehouse:
- UPDATE
- DELETE
- INSERT/UPDATE (insert a row; if it already exists, update it)
- UPDATE/INSERT (update a row; if it does not exist, insert it)
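In Oracle SQL, the INSERT/UPDATE behavior corresponds to a MERGE statement. A sketch with hypothetical dimension and staging tables:

```sql
-- INSERT/UPDATE semantics: insert the customer row, or update it if it exists
MERGE INTO customer_dim d
USING customer_stg s
ON    (d.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET d.address        = s.address,
             d.marital_status = s.marital_status
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, marital_status)
  VALUES (s.customer_id, s.address, s.marital_status);
```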
These features give the OWB user a variety of tools for ongoing maintenance. OWB interfaces with Oracle Enterprise Manager for repetitive tasks, for example, a fact table refresh that is scheduled to occur at a regular interval. For complex dependencies, OWB integrates with Oracle Workflow.
Metadata integration
OWB is based on the Common Warehouse Model (CWM) standard (see Section 31.4.3).
It can seamlessly exchange metadata with Oracle Express and Oracle Discoverer as well
as other business intelligence tools that comply with the standard.
Chapter Summary
- There are many approaches that offer alternative routes to the creation of a data warehouse. One of the more successful approaches is to decompose the design of the data warehouse into more manageable parts, namely data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
- Dimensionality modeling is a design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.
- Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table. In other words, the primary key of the fact table is made up of two or more foreign keys. This characteristic star-like structure is called a star schema or star join.
- Star schema is a logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data (which can be denormalized).
- The star schema exploits the characteristics of factual data such that facts are generated by events that occurred in the past, and are unlikely to change, regardless of how they are analyzed. As the bulk of data in the data warehouse is represented within facts, the fact tables can be extremely large relative to the dimension tables.
- The most useful facts in a fact table are numerical and additive, because data warehouse applications almost never access a single record; rather, they access hundreds, thousands, or even millions of records at a time, and the most useful thing to do with so many records is to aggregate them.
- Dimension tables most often contain descriptive textual information. Dimension attributes are used as the constraints in data warehouse queries.
- Snowflake schema is a variant of the star schema where dimension tables do not contain denormalized data.
- Starflake schema is a hybrid structure that contains a mixture of star and snowflake schemas.
- The key to understanding the relationship between dimensional models and ER models is that a single ER model normally decomposes into multiple DMs. The multiple DMs are then associated through conformed (shared) dimension tables.
- The Nine-Step Methodology specifies the steps required for the design of a data mart / warehouse. The steps include: Step 1 Choosing the process, Step 2 Choosing the grain, Step 3 Identifying and conforming the dimensions, Step 4 Choosing the facts, Step 5 Storing pre-calculations in the fact table, Step 6 Rounding out the dimensions, Step 7 Choosing the duration of the database, Step 8 Tracking slowly changing dimensions, and Step 9 Deciding the query priorities and query modes.
- There are criteria to measure the extent to which a system supports the dimensional view of data warehousing. The criteria are divided into three broad groups: architecture, administration, and expression.
- Oracle Warehouse Builder (OWB) is a key component of the Oracle Warehouse solution, enabling the design and deployment of data warehouses, data marts, and e-Business intelligence applications. OWB is both a design tool and an extraction, transformation, and loading (ETL) tool.
Review Questions
31.1
31.2
31.3
31.4
31.5
31.6
Exercises
31.11 Use the Nine-Step Methodology for data warehouse database design to produce dimensional models for the
case studies described in Appendix B.
31.12 Use the Nine-Step Methodology for data warehouse database design to produce a dimensional model for all
or part of your organization.
Chapter 33
OLAP
Chapter Objectives
In Section 33.1 we introduce Online Analytical Processing (OLAP) and discuss the
relationship between OLAP and data warehousing. In Section 33.2 we describe OLAP
applications and identify the key features and potential benefits associated with OLAP
applications. In Section 33.3 we discuss how multi-dimensional data can be represented
and describe the main concepts associated with multi-dimensional analysis. In Section
33.4 we describe the rules for OLAP tools and highlight the characteristics and issues
associated with OLAP tools. In Section 33.5 we discuss how the SQL standard has been
extended to include OLAP functions. Finally, in Section 33.6, we describe how Oracle
supports OLAP. The examples in this chapter are taken from the DreamHome case study
described in Section 10.4 and Appendix A.
Over the past few decades, we have witnessed the increasing popularity and prevalence of
relational DBMSs such that we now find a significant proportion of corporate data is housed
in such systems. Relational databases have been used primarily to support traditional
Online Transaction Processing (OLTP) systems. To provide appropriate support for OLTP
systems, relational DBMSs have been developed to enable the highly efficient execution
of a large number of relatively simple transactions.
In the past few years, relational DBMS vendors have targeted the data warehousing
market and have promoted their systems as tools for building data warehouses. As discussed in Chapter 31, a data warehouse stores current and historical data drawn from operational systems and is expected to support
a wide range of queries from the relatively simple to the highly complex. However, the
ability to answer particular queries is dependent on the types of end-user access tools
available for use on the data warehouse. General-purpose tools such as reporting and query
tools can easily support 'who?' and 'what?' questions about past events. A typical query submitted directly to a data warehouse is: 'What was the total revenue for Scotland in the third quarter of 2004?' In this section we focus on a tool that can support more advanced queries, namely Online Analytical Processing (OLAP).
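Against a star schema, that query might be expressed as follows (the fact and dimension table names and columns are hypothetical):

```sql
-- "What was the total revenue for Scotland in the third quarter of 2004?"
SELECT SUM(f.revenue) AS total_revenue
FROM   sales_fact f
       JOIN branch_dim b ON b.branch_id = f.branch_id
       JOIN time_dim   t ON t.time_id   = f.time_id
WHERE  b.region        = 'Scotland'
AND    t.calendar_year = 2004
AND    t.quarter_no    = 3;
```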
33.1 Online Analytical Processing (OLAP)
OLAP is a term that describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced
analysis (Codd et al., 1995). OLAP enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access
to a wide variety of possible views of the data. OLAP allows the user to view corporate data in a way that better reflects the true dimensionality of the enterprise.
While OLAP systems can easily answer 'who?' and 'what?' questions, it is their ability to answer 'what if?' and 'why?' type questions that distinguishes them from general-purpose