DM Lect4

DATA MINING
Data Warehousing
Lec 4
Relational Database Theory
• Relational database modeling relies on normalization:
relations (tables) are progressively decomposed into
smaller relations, to the point where all attributes in a
relation are very tightly coupled with the primary key of
the relation.
Relational Database Theory
• The process of normalization generally breaks a
table into many independent tables.
• A normalized database yields a flexible model,
making it easy to maintain dynamic relationships
between business entities.
• A relational database system is effective and
efficient for operational databases, which see a lot of
updates (the design aims at optimizing update
performance).
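The decomposition described above can be sketched in a few lines. This is only an illustration: the table, its column names, and the values are hypothetical and do not come from the lecture.

```python
# Hypothetical denormalized rows: customer details repeat on every order.
flat_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Cairo", "amount": 120.0},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Cairo", "amount": 80.0},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Giza", "amount": 50.0},
]

# customers: each attribute depends only on the key customer_id.
customers = {
    row["customer_id"]: {"city": row["customer_city"]} for row in flat_orders
}

# orders: each attribute depends only on the key order_id;
# customer_id stays behind as a foreign key into customers.
orders = {
    row["order_id"]: {"customer_id": row["customer_id"], "amount": row["amount"]}
    for row in flat_orders
}
```

After the split, a customer's city is stored once, so changing it is a single update. That is the property that makes the normalized form attractive for update-heavy operational systems.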
Problems
• A fully normalized data model can perform very
inefficiently for queries.
• Historical data are usually large, with static
relationships:
• Unnecessary joins may take an unacceptably long time
• Historical data are diverse
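To see why joins matter for queries, consider a minimal sketch of an analytical question asked of a normalized model. The tables and the function name `sales_by_city` are hypothetical; the point is only that every order row needs a lookup into another table before it can be aggregated.

```python
from collections import defaultdict

# Hypothetical normalized tables.
customers = {"C1": {"city": "Cairo"}, "C2": {"city": "Giza"}}
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C1", "amount": 80.0},
    {"order_id": 3, "customer_id": "C2", "amount": 50.0},
]

def sales_by_city(orders, customers):
    """Total order amount per city; needs one join step per order row."""
    totals = defaultdict(float)
    for order in orders:
        city = customers[order["customer_id"]]["city"]  # the join step
        totals[city] += order["amount"]
    return dict(totals)

result = sales_by_city(orders, customers)  # {"Cairo": 200.0, "Giza": 50.0}
```

With years of history and several tables between the fact and the attribute of interest, these lookups multiply, which is the inefficiency the slide refers to.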
Heterogeneous Information Sources
Different interfaces
Different data representations
Duplicate and inconsistent information
The Need for Data Warehouses
• Two major factors drive the need for data
warehousing in most organizations today:
• Business requires an integrated company-wide view of
high-quality information.
• The IS department must separate informational from
operational systems in order to dramatically improve
performance in managing company data.
Need for a Company Wide View
• Data in operational systems is typically fragmented
and of poor quality.
• Generally distributed on a variety of incompatible
HW and SW platforms:
• Unix running the Oracle DBMS
• IBM MVS running the DB2 DBMS
• Often necessary to provide a single, corporate
view of that information for decision making.
Need to Separate Operational and
Informational Systems
• Operational systems are used to run a business in real
time, based on current data.
• E.g. sales order processing, reservation systems, patient
registration.
• They process large volumes of relatively simple read/write
transactions while providing fast response.
• Informational systems are designed to support decision
making based on historical data.
• Designed for complex, read-only queries or data
mining applications.
• E.g. sales trend analysis, customer segmentation, and human
resource planning.
Goal: Unified Access to Data
Collects and combines information
Provides integrated view, uniform
user interface
Supports sharing
The Traditional Research Approach
• Query-driven (lazy, on-demand)
The Warehousing Approach
Historical background
What is a Data Warehouse?
• “A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it in a
business context.”
-- Barry Devlin, IBM Consultant
What is a Data Warehouse?
• “A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in organizational
decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992
• A data warehouse is based on a multidimensional data
model, which views data in the form of a data cube.
The Data Warehouse
• Key characteristics
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile
Subject Oriented
• Data is stored by business subject rather than by
application
• A data warehouse does not focus on ongoing
operations. Instead, it puts the emphasis on modeling
and analysis of data for decision making. It also
provides a simple and concise view of a
specific subject by excluding data that is not
helpful in supporting the decision process.
Integrated
• Data is stored once in a single integrated location
• Integration means establishing a common
unit of measure for all similar data from the
dissimilar databases. The data also needs to be
stored in the data warehouse in a common and
universally acceptable manner.
• A data warehouse is developed by integrating data
from varied sources such as mainframes, relational
databases, flat files, etc. Moreover, it must keep
consistent naming conventions, formats, and coding.
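A minimal sketch of what integration involves when two sources disagree on units and codes. The source layouts, field names, and the `integrate` helper are assumptions made for illustration, not part of the lecture.

```python
# Hypothetical records from two source systems with different conventions.
source_a = [{"product": "P1", "weight_kg": 2.0, "gender": "M"}]
source_b = [{"product": "P2", "weight_lb": 4.0, "gender": "female"}]

LB_TO_KG = 0.45359237  # exact definition of the international pound
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def integrate(record):
    """Map a source record onto the warehouse's common units and codes."""
    if "weight_kg" in record:
        weight_kg = record["weight_kg"]
    else:
        weight_kg = record["weight_lb"] * LB_TO_KG
    return {
        "product": record["product"],
        "weight_kg": round(weight_kg, 3),
        "gender": GENDER_CODES[record["gender"].lower()],
    }

warehouse_rows = [integrate(r) for r in source_a + source_b]
```

Every row in `warehouse_rows` now uses kilograms and a single-letter gender code, regardless of which source it came from.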
Time-variant
• Data is stored as a series of snapshots or views
that record data content and context across
time.
• Data is tagged with some element of time:
creation date, as-of date/to, etc.
• Data is available for long periods of time, for
example five or more years.
Non-volatile
• Existing data in the warehouse is not overwritten
or updated.
• It also means that previous data is not erased
when new data is entered.
• Data is read-only and periodically refreshed.
• It does not require transaction processing, recovery,
or concurrency control mechanisms.
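The time-variant and non-volatile properties can be sketched together as an append-only store of dated snapshots. The class `SnapshotStore` and its methods are hypothetical names chosen for this illustration.

```python
from datetime import date

class SnapshotStore:
    """Append-only store: loads add dated rows, nothing is overwritten."""

    def __init__(self):
        self._rows = []  # list of (snapshot_date, key, value)

    def load(self, snapshot_date, key, value):
        # A periodic refresh only appends; earlier snapshots stay intact.
        self._rows.append((snapshot_date, key, value))

    def history(self, key):
        """All snapshots for a key, oldest first."""
        return sorted((d, v) for d, k, v in self._rows if k == key)

    def as_of(self, key, when):
        """Value of the latest snapshot taken on or before `when`."""
        candidates = [(d, v) for d, k, v in self._rows if k == key and d <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]

store = SnapshotStore()
store.load(date(2023, 1, 1), "C1", "Cairo")
store.load(date(2024, 1, 1), "C1", "Giza")
```

Because the 2023 snapshot survives the 2024 load, `store.as_of("C1", date(2023, 6, 1))` still returns `"Cairo"`, while a query dated after the second load returns `"Giza"`.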
Data Cube
• Data Cube
• Defined by dimensions and facts
• Dimension
• the perspectives or entities with respect to which an organization
wants to keep records
• Fact
• numerical measure that is recorded and analyzed with respect to
the dimensions
• Allows data to be modeled and viewed in multiple
dimensions
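One simple way to picture a data cube is as a mapping from a tuple of dimension values to a numerical measure. The sketch below assumes three hypothetical dimensions (quarter, item, city) and a hypothetical measure (units sold); the helper `build_cube` is an illustrative name, not a library function.

```python
from collections import defaultdict

DIMENSIONS = ("quarter", "item", "city")  # assumed dimensions

# Hypothetical sales records; "units" is the fact (measure).
sales = [
    {"quarter": "Q1", "item": "phone", "city": "Toronto", "units": 6},
    {"quarter": "Q1", "item": "phone", "city": "Toronto", "units": 4},
    {"quarter": "Q2", "item": "laptop", "city": "Chicago", "units": 3},
]

def build_cube(records, dimensions, measure):
    """Sum the measure into one cell per combination of dimension values."""
    cube = defaultdict(int)
    for rec in records:
        cell = tuple(rec[d] for d in dimensions)
        cube[cell] += rec[measure]
    return dict(cube)

cube = build_cube(sales, DIMENSIONS, "units")
# {("Q1", "phone", "Toronto"): 10, ("Q2", "laptop", "Chicago"): 3}
```

Each key is one cell of the cube, and each value is the aggregated fact for that cell.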
OLAP Operations
• OLAP operations are the operations that can be
performed on a data cube in order to view data
from different angles. There are four basic
operations that can be applied to a data
cube:
• Roll Up
• Drill down
• Slice and dice
• Pivot
Roll Up
• The roll-up operation summarizes or aggregates the data
along a dimension, either by dimension reduction or
by climbing up a concept hierarchy.
• The figure below shows an example of a roll-up
operation performed on the location dimension of the
data cube seen above.
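Both forms of roll-up can be sketched on a small cube held as a dictionary. The cell values and the city-to-country hierarchy are hypothetical, as are the function names.

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, item, city) -> units sold.
cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "phone", "Vancouver"): 5,
    ("Q1", "phone", "Chicago"): 7,
    ("Q2", "phone", "Chicago"): 3,
}
# Assumed concept hierarchy on the location dimension: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

def roll_up_location(cube, hierarchy):
    """Climb the hierarchy: replace each city by its country and re-aggregate."""
    rolled = defaultdict(int)
    for (quarter, item, city), units in cube.items():
        rolled[(quarter, item, hierarchy[city])] += units
    return dict(rolled)

def drop_location(cube):
    """Dimension reduction: aggregate the location dimension away entirely."""
    reduced = defaultdict(int)
    for (quarter, item, _city), units in cube.items():
        reduced[(quarter, item)] += units
    return dict(reduced)

by_country = roll_up_location(cube, CITY_TO_COUNTRY)
by_quarter_item = drop_location(cube)
```

Here `by_country` holds 15 units for Canada in Q1, and `by_quarter_item` holds 22 units for phones in Q1 across all cities.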
Drill Down
• When the drill-down operation is performed on a
dimension, the data along that dimension is broken down
into a more granular form.
• The figure below shows the drill-down operation
on the time dimension, where quarters such as Q1 and Q2
are broken down into months.
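Drill-down is the reverse of roll-up, and it only works if the finer-grained data is still available. A minimal sketch, assuming the warehouse keeps monthly facts and presents quarters as a summary (all names and numbers are hypothetical):

```python
from collections import defaultdict

# Hypothetical base facts at monthly granularity: (month, item) -> units.
monthly = {
    ("Jan", "phone"): 4,
    ("Feb", "phone"): 6,
    ("Mar", "phone"): 5,
    ("Apr", "phone"): 3,
}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def quarterly_view(monthly):
    """Summary level: months aggregated into quarters."""
    out = defaultdict(int)
    for (month, item), units in monthly.items():
        out[(MONTH_TO_QUARTER[month], item)] += units
    return dict(out)

def drill_down(monthly, quarter):
    """Detail level: the monthly cells that make up one quarter."""
    return {
        cell: units
        for cell, units in monthly.items()
        if MONTH_TO_QUARTER[cell[0]] == quarter
    }

summary = quarterly_view(monthly)
q1_months = drill_down(monthly, "Q1")
```

The summary shows 15 units for Q1; drilling down reveals the January, February, and March cells behind that figure.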
Slice
• The slice operation selects one
dimension of the data cube, fixes it to a single value,
and forms a subcube from it. The figure below represents the
slice operation on a data cube where the data
cube is sliced based on time.
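A slice on the time dimension can be sketched as a filter that fixes one axis and drops it from the result. The cube contents and the helper `slice_cube` are hypothetical.

```python
# Hypothetical cube cells: (quarter, item, city) -> units sold.
cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "laptop", "Chicago"): 7,
    ("Q2", "phone", "Toronto"): 4,
    ("Q2", "laptop", "Chicago"): 3,
}

def slice_cube(cube, axis, value):
    """Keep cells whose coordinate on `axis` equals `value`, then drop that axis."""
    return {
        cell[:axis] + cell[axis + 1:]: units
        for cell, units in cube.items()
        if cell[axis] == value
    }

q1 = slice_cube(cube, axis=0, value="Q1")
# {("phone", "Toronto"): 10, ("laptop", "Chicago"): 7}
```

The result has one dimension fewer than the original cube, since time is now fixed at Q1.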
Dice
• The dice operation selects on more than one
dimension to form a subcube. In the figure
below, the subcube is formed by
selecting on the dimensions location, item,
and time.
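A dice can be sketched as a filter with one set of permitted values per dimension. The data and the helper `dice` are hypothetical.

```python
# Hypothetical cube cells: (quarter, item, city) -> units sold.
cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "laptop", "Toronto"): 2,
    ("Q2", "phone", "Vancouver"): 5,
    ("Q3", "phone", "Toronto"): 8,
    ("Q2", "phone", "Chicago"): 6,
}

def dice(cube, allowed):
    """Keep cells whose value on every dimension is in that dimension's set."""
    return {
        cell: units
        for cell, units in cube.items()
        if all(value in ok for value, ok in zip(cell, allowed))
    }

subcube = dice(cube, ({"Q1", "Q2"}, {"phone"}, {"Toronto", "Vancouver"}))
# {("Q1", "phone", "Toronto"): 10, ("Q2", "phone", "Vancouver"): 5}
```

Unlike a slice, the subcube keeps all three dimensions; each is just restricted to a subset of its values.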
Pivot
• Pivot is not a computational operation; it simply
rotates the data cube in order to view it
from different dimensions.
• The figure below shows the pivot operation
performed on the data cube.
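On a two-dimensional view, pivoting amounts to swapping rows and columns without changing any value. A minimal sketch with a hypothetical quarter-by-item table:

```python
# Hypothetical 2-D view of a cube: rows are quarters, columns are items.
table = {
    "Q1": {"phone": 10, "laptop": 7},
    "Q2": {"phone": 4, "laptop": 3},
}

def pivot(table):
    """Rotate the view: former columns become rows and vice versa."""
    rotated = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

by_item = pivot(table)
# {"phone": {"Q1": 10, "Q2": 4}, "laptop": {"Q1": 7, "Q2": 3}}
```

No aggregation takes place, and pivoting twice returns the original view.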
Data Cube advantages
• A data cube makes it easier to aggregate and summarize
the data.
• A data cube provides better visualization of data.
• A data cube stores huge amounts of data in a very
simplified way.
• A data cube increases the overall efficiency of the
data warehouse.
• The aggregated data in a data cube helps in
analysing the data quickly, thereby reducing the
access time.
ANY QUESTIONS
