
DATA MINING

Data Warehousing
Lec 4
Relational Database Theory
• In the relational database modeling process
(normalization), relations or tables are
progressively decomposed into smaller relations,
to the point where all attributes in a relation are
tightly coupled with the primary key of the relation.
Relational Database Theory
• The process of normalization generally breaks a
table into many independent tables.

• A normalized database yields a flexible model,
making it easy to maintain dynamic relationships
between business entities.

• A relational database system is effective and
efficient for operational databases, which handle
many updates (the design aims at optimizing
update performance).
Problems
• A fully normalized data model can perform very
inefficiently for queries.

• Historical data are usually large, with static
relationships:
• Unnecessary joins may take an unacceptably long time
• Historical data are diverse
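To make the join cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns, and rows (customer, product, sale, sale_flat) are hypothetical and only illustrate the idea: the normalized schema needs two joins to answer a summary query, while a pre-joined table answers the same query with none.

```python
import sqlite3

# Hypothetical normalized schema: every name and row here is illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id),
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Toronto'), (2, 'Chicago');
INSERT INTO product  VALUES (1, 'phone'), (2, 'laptop');
INSERT INTO sale     VALUES (1, 1, 1, 100.0), (2, 1, 2, 900.0), (3, 2, 1, 120.0);
""")

# Normalized model: the summary query has to join three tables.
normalized = con.execute("""
    SELECT c.city, p.category, SUM(s.amount)
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
    GROUP BY c.city, p.category
    ORDER BY c.city, p.category
""").fetchall()

# Pre-joined (denormalized) table: the joins are paid once, at load time.
con.execute("""
    CREATE TABLE sale_flat AS
    SELECT c.city AS city, p.category AS category, s.amount AS amount
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
""")
flat = con.execute("""
    SELECT city, category, SUM(amount)
    FROM sale_flat
    GROUP BY city, category
    ORDER BY city, category
""").fetchall()
```

Both queries return the same rows; on large historical data with static relationships, the second form avoids repeating the joins on every query.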
Heterogeneous Information Sources

Different interfaces
Different data representations
Duplicate and inconsistent information
The Need for Data Warehouses
• Two major factors drive the need for data
warehousing in most organizations today:
• Business requires an integrated company-wide view of
high-quality information.
• The IS department must separate informational from
operational systems in order to dramatically improve
performance in managing company data.
Need for a Company Wide View
• Data in operational systems is typically fragmented
and of poor quality.
• It is generally distributed across a variety of
incompatible HW and SW platforms:
• Unix running the Oracle DBMS
• IBM MVS running the DB2 DBMS
• It is often necessary to provide a single, corporate
view of that information for decision making.
Need to Separate Operational and
Informational Systems
• Operational systems are used to run a business in real
time, based on current data.
• E.g. sales order processing, reservation systems, patient
registration.
• They process large volumes of relatively simple read/write
transactions, while providing fast response.
• Informational systems are designed to support decision
making based on historical data.
• They are designed for complex, read-only queries or data
mining applications.
• E.g. sales trend analysis, customer segmentation, and human
resource planning.
Goal: Unified Access to Data

Collects and combines information
Provides integrated view, uniform
user interface
Supports sharing
The Traditional Research Approach
• Query-driven (lazy, on-demand)
The Warehousing Approach
Historical background
What is a Data Warehouse?
• “A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it in a
business context.”
-- Barry Devlin, IBM Consultant
What is a Data Warehouse?
• “A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in organizational
decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992

• A data warehouse is based on a multidimensional data model,
which views data in the form of a data cube.
The Data Warehouse
• Key characteristics
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile
Subject Oriented
• Data is stored by business subject rather than by
application

• A data warehouse does not focus on ongoing
operations. Instead, it puts the emphasis on modeling
and analysis of data for decision making. It also
provides a simple and concise view of a
specific subject by excluding data that is not
helpful in supporting the decision process.
Integrated
• Data is stored once, in a single integrated location.
• Integration means establishing a common
unit of measure for all similar data from the
dissimilar databases. The data also needs to be
stored in the data warehouse in a common and
universally acceptable manner.

• A data warehouse is developed by integrating data
from varied sources such as mainframes, relational
databases, flat files, etc. Moreover, it must keep
naming conventions, formats, and coding consistent.
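As a rough illustration of integration, the sketch below maps records from two source systems onto one naming convention, one gender coding, and one unit of measure. The source layouts, field names, and codes are assumptions made up for this example.

```python
# Hypothetical rows from two source systems; names and codes are illustrative.
mainframe_rows = [{"cust": "C1", "gender": "M", "weight_lb": 220.0}]
crm_rows = [{"customer_id": "C2", "sex": "female", "weight_kg": 60.0}]

LB_TO_KG = 0.45359237  # one international pound in kilograms

# One agreed coding for gender, whatever the source system used.
GENDER_CODES = {"M": "M", "male": "M", "F": "F", "female": "F"}


def from_mainframe(row):
    """Rename fields and convert pounds to the common unit (kg)."""
    return {
        "customer_id": row["cust"],
        "gender": GENDER_CODES[row["gender"]],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 2),
    }


def from_crm(row):
    """Rename fields and recode gender; weight is already in kg."""
    return {
        "customer_id": row["customer_id"],
        "gender": GENDER_CODES[row["sex"]],
        "weight_kg": row["weight_kg"],
    }


# Every warehouse row now has the same field names, codes, and units.
warehouse_rows = [from_mainframe(r) for r in mainframe_rows] + [
    from_crm(r) for r in crm_rows
]
```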
Time-variant
• Data is stored as a series of snapshots or views
that record data content and context across
time.
• Data is tagged with some element of time:
creation date, as of date/to, etc.
• Data is available for long periods of time, for
example five or more years.
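One way to picture time-variant storage is a table in which every load appends rows tagged with an as-of date, so earlier states stay queryable. The sketch below assumes a hypothetical price table; the function names load_snapshot and price_as_of are illustrative, not taken from any product.

```python
from datetime import date

# The warehouse table: one row per (product, as_of) snapshot.
snapshots = []


def load_snapshot(as_of, prices):
    """Append a dated snapshot; earlier rows are never modified."""
    for product, price in sorted(prices.items()):
        snapshots.append({"product": product, "price": price, "as_of": as_of})


def price_as_of(product, when):
    """Return the latest snapshot price on or before `when`, or None."""
    rows = [
        r for r in snapshots
        if r["product"] == product and r["as_of"] <= when
    ]
    if not rows:
        return None
    return max(rows, key=lambda r: r["as_of"])["price"]


load_snapshot(date(2023, 1, 1), {"pen": 1.00})
load_snapshot(date(2024, 1, 1), {"pen": 1.20})
```

Because the 2023 row is still present after the 2024 load, a query dated mid-2023 sees the old price while a query dated mid-2024 sees the new one.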
Non-volatile
• Existing data in the warehouse is not overwritten
or updated.

• It also means that previous data is not erased
when new data is entered.

• Data is read-only and periodically refreshed.

• It does not require transaction processing, recovery,
and concurrency control mechanisms.
Data Cube
• Data Cube
• Defined by dimensions and facts
• Dimension
• the perspectives or entities with respect to which an organization
wants to keep records
• Fact
• a numerical measure that is analyzed with respect to the dimensions
• Allows data to be modeled and viewed in multiple
dimensions
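A small data cube can be sketched as a mapping from one coordinate per dimension to a numeric fact. The dimensions used here (quarter, item, location) and the sales figures are made-up sample values, and the dictionary layout is only one possible representation.

```python
# Each key holds one member per dimension; the value is the fact (units sold).
DIMENSIONS = ("quarter", "item", "location")

cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "phone", "Chicago"): 20,
    ("Q1", "laptop", "Toronto"): 5,
    ("Q2", "phone", "Toronto"): 15,
    ("Q2", "laptop", "Chicago"): 8,
}


def members(cube, dimension):
    """List the distinct values that appear along one dimension."""
    index = DIMENSIONS.index(dimension)
    return sorted({key[index] for key in cube})
```

For example, members(cube, "location") lists the cities present in the cube, and cube[("Q1", "phone", "Toronto")] looks up a single cell.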
OLAP Operations
• OLAP operations are the operations that can be
performed on a data cube in order to view data
from different angles. There are four basic
operations that can be applied to a data
cube:
• Roll Up
• Drill down
• Slice and dice
• Pivot
Roll Up
• The roll-up operation summarizes or aggregates data
along a dimension, either by dimension reduction or by
climbing up a concept hierarchy.
• The figure below shows an example of a roll-up
operation performed on the location dimension of the
data cube we have seen above.
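Both forms of roll-up can be sketched on a dictionary-based cube. The city-to-country hierarchy and the sales figures below are assumptions for illustration, not values from the figure.

```python
from collections import defaultdict

# Hypothetical cube keyed by (quarter, item, city) with units sold as the fact.
cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "phone", "Vancouver"): 5,
    ("Q1", "phone", "Chicago"): 20,
    ("Q2", "phone", "Toronto"): 15,
}

# Assumed concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up_location(cube, hierarchy):
    """Climb the hierarchy: replace each city by its country and sum."""
    totals = defaultdict(int)
    for (quarter, item, city), sales in cube.items():
        totals[(quarter, item, hierarchy[city])] += sales
    return dict(totals)


def drop_location(cube):
    """Dimension reduction: remove location entirely and sum over it."""
    totals = defaultdict(int)
    for (quarter, item, _city), sales in cube.items():
        totals[(quarter, item)] += sales
    return dict(totals)


by_country = roll_up_location(cube, CITY_TO_COUNTRY)
without_location = drop_location(cube)
```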
Drill Down
• When the drill-down operation is performed on a
dimension, the data along that dimension is broken down
into a more granular form.
• In the figure below you can see the drill-down operation
on the time dimension, where quarters such as Q1 and Q2
are broken down into months.
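Drill-down is the reverse of roll-up, so it needs the finer-grained data to be available. In this minimal sketch the monthly figures and the month-to-quarter mapping are invented sample data; the quarterly view is derived from the months, and drilling down into a quarter returns the months behind it.

```python
from collections import defaultdict

# Assumed hierarchy on the time dimension: month -> quarter.
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Detailed (monthly) facts; the numbers are illustrative.
monthly_sales = {"Jan": 4, "Feb": 6, "Mar": 5, "Apr": 7, "May": 3, "Jun": 10}


def by_quarter(monthly):
    """Summarized view: total sales per quarter."""
    totals = defaultdict(int)
    for month, sales in monthly.items():
        totals[MONTH_TO_QUARTER[month]] += sales
    return dict(totals)


def drill_down(monthly, quarter):
    """Detailed view: the monthly figures that make up one quarter."""
    return {
        month: sales
        for month, sales in monthly.items()
        if MONTH_TO_QUARTER[month] == quarter
    }


quarterly = by_quarter(monthly_sales)
q1_months = drill_down(monthly_sales, "Q1")
```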
Slice
• The slice operation selects one
dimension of the data cube and forms a
subcube from it. The figure below represents the
slice operation on a data cube, where the data
cube is sliced based on time.
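A slice can be sketched as fixing one dimension to a single value and keeping the remaining dimensions. The cube contents below are made-up sample values, and slice_cube is a hypothetical helper name.

```python
# Hypothetical cube keyed by (quarter, item, location).
cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "laptop", "Chicago"): 5,
    ("Q2", "phone", "Toronto"): 15,
}


def slice_cube(cube, dim_index, value):
    """Keep cells where one dimension equals `value`, then drop that dimension."""
    return {
        key[:dim_index] + key[dim_index + 1:]: sales
        for key, sales in cube.items()
        if key[dim_index] == value
    }


# Slice on time: fix quarter (index 0) to "Q1"; the result is (item, location).
q1_slice = slice_cube(cube, 0, "Q1")
```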
Dice
• The dice operation selects on more than one
dimension to form a subcube. In the figure
below you can see that the subcube is formed by
selecting on the location, item, and time
dimensions.
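A dice can be sketched as a filter that restricts several dimensions at once, each to a set of allowed values. The cube contents and the dice helper below are illustrative assumptions.

```python
# Hypothetical cube keyed by (quarter, item, location).
cube = {
    ("Q1", "phone", "Toronto"): 10,
    ("Q1", "laptop", "Chicago"): 5,
    ("Q2", "phone", "Toronto"): 15,
    ("Q3", "phone", "Chicago"): 7,
}


def dice(cube, quarters, items, locations):
    """Keep cells whose coordinates all fall inside the selected value sets."""
    return {
        key: sales
        for key, sales in cube.items()
        if key[0] in quarters and key[1] in items and key[2] in locations
    }


subcube = dice(
    cube,
    quarters={"Q1", "Q2"},
    items={"phone"},
    locations={"Toronto", "Chicago"},
)
```

Unlike a slice, the result keeps all three dimensions; it is simply a smaller cube.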
Pivot
• Pivot is not a computational operation; it
rotates the data cube in order to view it
from different dimensions.
• The figure below shows the pivot operation
performed on the data cube.
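On a two-dimensional view of the cube, a pivot amounts to swapping the row and column axes without changing any value. The table below (items as rows, locations as columns) uses invented figures.

```python
# Hypothetical 2-D view: rows are items, columns are locations.
table = {
    "phone": {"Toronto": 10, "Chicago": 20},
    "laptop": {"Toronto": 5, "Chicago": 8},
}


def pivot(table):
    """Swap the axes: rows become columns and columns become rows."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


by_location = pivot(table)
```

Pivoting twice returns the original view, which reflects that no values are computed or lost.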
Data Cube advantages
• A data cube eases aggregating and summarizing
the data.
• A data cube provides better visualization of data.
• A data cube stores a huge amount of data in a very
simplified way.
• A data cube increases the overall efficiency of the
data warehouse.
• The aggregated data in a data cube helps in
analysing the data quickly, thereby reducing the
access time.
ANY QUESTIONS
