DWM 3


Samarth

Data Warehouse Designing and Online Analytical Processing - II


Data Warehouse design & usage-
-Data warehouse design and usage deals with how a data warehouse can be used for information
processing, analytical processing, and data mining.
-The goal of the design is to determine how records from multiple data sources should be
extracted, transformed, and loaded (ETL) so that they are organized in a database as the data warehouse.

There are two approaches

1. "top-down" approach
2. "bottom-up" approach

1. "top-down" approach-

-This approach was proposed by Bill Inmon.

-In this approach, the data warehouse is built first, and the data marts are then built on top of the
data warehouse.

Process-

-First, data is extracted from the various source systems.

-The extracted data is loaded into the staging area and validated.

-ETL tools are used to check the accuracy and correctness of the data.

-Techniques such as summarization and aggregation can be applied to the data before it is loaded
into the data warehouse.
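
The process above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the source records, the field names (city, item, amount), and the validation rule are hypothetical, and a plain dictionary stands in for the data warehouse.

```python
from collections import defaultdict

# Hypothetical source systems; the field names are illustrative assumptions.
SOURCE_A = [{"city": "Pune", "item": "TV", "amount": 300.0},
            {"city": "Pune", "item": "TV", "amount": 200.0}]
SOURCE_B = [{"city": "Mumbai", "item": "Phone", "amount": 150.0},
            {"city": "Mumbai", "item": "Phone", "amount": None}]  # invalid row


def extract(*sources):
    """Extract: copy rows from every source system into the staging area."""
    return [dict(row) for source in sources for row in source]


def validate(staged):
    """Validate: keep only rows whose amount is a non-negative number."""
    return [r for r in staged
            if isinstance(r["amount"], (int, float)) and r["amount"] >= 0]


def aggregate(rows):
    """Transform: summarize the amount per (city, item)."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["city"], r["item"])] += r["amount"]
    return dict(totals)


def load(warehouse, summary):
    """Load: write the summarized rows into the warehouse (a dict here)."""
    warehouse.update(summary)
    return warehouse


warehouse = load({}, aggregate(validate(extract(SOURCE_A, SOURCE_B))))
```

In a real project each of these functions would be replaced by an ETL tool or job; the point is only the order of the steps: extract, stage and validate, summarize, load.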

Advantages

1. It provides consistent dimensional views of the data.

2. It is a robust approach, meaning that a new data mart can be added easily.

3. Data marts are loaded from the data warehouse.

4. Developing a new data mart from the data warehouse is very easy.

Disadvantages

1. It is inflexible: changing departmental needs are hard to accommodate.

2. The cost of implementing the project is high.

2. "bottom-up" approach

-This approach was proposed by Ralph Kimball.

Process-

-In this approach, the data marts are created first. Data marts provide reporting and analytics
capability for a specific business process.

-Data is extracted from the various source systems into the staging area.

-The data marts are then refreshed with the current data.

Advantages

1. Reports can be generated quickly.

2. The data warehouse can be extended easily.

3. It contains consistent data marts.

4. Data marts are delivered quickly.

Disadvantages

-The positions of the data warehouse and the data marts are reversed in the bottom-up design:
the data marts come first, and the data warehouse is assembled from them afterwards.

The Business Analysis Framework-

-A data warehouse presents relevant information to the user.

-A data warehouse can enhance business productivity.

-It gathers information quickly, and that information describes the organization accurately.

-A data warehouse provides a consistent view to the user.

-It saves an organization time and money in its business analysis process.

The business analysis framework has the following views-

1)Top down view

2) Data Source view

3) Data Warehouse view

4) Business Query view

The top-down view − This view allows the selection of relevant information needed for a data
warehouse.

The data source view − This view presents the information being captured, stored, and
managed by the operational system.

The data warehouse view − This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.

The business query view − It is the view of the data from the viewpoint of the end-user.

Data Warehouse Design Process

-A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.

-The top-down approach starts with overall design and planning.

-This approach is useful where the technology is mature and well known, and the business
problems are clear and well understood.

-In the bottom-up approach, the process starts with experiments and prototypes. This is
useful in the early stage of business modeling and technology development.

-Steps in the design and construction of a data warehouse from a software engineering point of
view:

*Planning

*Requirements study

*Problem analysis

*Warehouse design

*Data integration and testing

*Deployment of the data warehouse

-Two common software development methodologies can be used: the waterfall model and the spiral model.

-The waterfall model performs a structured and systematic analysis at each step before proceeding to the next step.

-The spiral model involves the rapid generation of increasingly functional systems, with short intervals between successive releases.

General design and construction of a data warehouse

1. Choosing a business process to model (e.g. customer, accounts, loans etc.)

- If the business process is organizational and involves multiple complex objects,
then a data warehouse should be used.

-If the analysis concerns a single business process, a data mart should be used.

2. Choosing the business process grain

- The grain is the atomic level of data to be represented in the fact table.

3. Choosing the dimensions that apply to each fact table record, e.g. customer, account, item

4. Choosing the measures

-Measures are typically additive numeric quantities, e.g. dollars_sold, units_sold.

-After the data warehouse design and construction, data warehouse deployment starts, which
includes:
-Installation
-Training
-Roll-out planning

-Data warehouse administration includes:
  -Data refreshment
  -Managing data growth
  -Managing database performance
  -Managing access control & security

-Scope management includes:
  -Controlling the number and range of queries, dimensions, and reports
  -Limiting the data warehouse size
  -Limiting the schedule, budget, or resources
Data Warehouse Usage for Information Processing

Three kinds of warehouse applications

1. Information processing includes

-Querying

-Statistical analysis

-Reporting using cross tabs, tables, charts, graphs

2. Analytical processing includes

-OLAP operations (roll-up, drill-down, slice and dice)

-Operations on historical data in both summarized and detailed forms.

3. Data mining includes

-Finding hidden patterns and associations.

-Constructing analytical models.

-Classification and prediction.

-Presenting mining rules using visualization tools.
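
The OLAP operations named under analytical processing can be illustrated with a small sketch. It assumes a hypothetical cube stored as a dictionary from (city, item, year) to units sold; the helper names roll_up, slice_cube, and dice are illustrative and not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, item, year) -> units sold.
cube = {
    ("Pune", "TV", 2023): 10,
    ("Pune", "Phone", 2023): 20,
    ("Pune", "TV", 2024): 5,
    ("Mumbai", "TV", 2023): 7,
}


def roll_up(cells, keep):
    """Roll-up: aggregate away every dimension position not listed in keep."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def slice_cube(cells, position, value):
    """Slice: fix one dimension to a single value."""
    return {k: v for k, v in cells.items() if k[position] == value}


def dice(cells, allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {k: v for k, v in cells.items()
            if all(k[pos] in vals for pos, vals in allowed.items())}


by_city = roll_up(cube, keep=(0,))             # summarize to city level
only_2024 = slice_cube(cube, 2, 2024)          # year = 2024
pune_tv = dice(cube, {0: {"Pune"}, 1: {"TV"}})  # city in {Pune}, item in {TV}
```

Drill-down is the reverse of roll-up: moving from by_city back to the more detailed cells of cube.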

From Online Analytical Processing to Multidimensional Data Mining

-Multidimensional data mining is also known as exploratory multidimensional data mining or online
analytical mining (OLAM). It integrates OLAP with data mining to uncover knowledge in
multidimensional databases.

-Importance of multidimensional data mining



1. High quality of data in data warehouses: A data warehouse constructed with preprocessing
techniques serves as a valuable source of high-quality data for OLAP and for data mining.

2. Available information processing infrastructure surrounding data warehouses.

3. OLAP-based exploration of multidimensional data.

4. It provides the user with analysis of data at various levels of abstraction.

5. Drilling, pivoting, filtering, dicing, and slicing can be applied to intermediate data mining results.

6. Online selection of data mining functions: OLAP integrated with data mining gives users the
flexibility to select the desired data mining functions.

Data Warehouse Implementation

Efficient Data Cube Computation: An Overview

In multidimensional data analysis, aggregations across many sets of dimensions must be computed
efficiently. These aggregations can be expressed as group-by queries. The output of each group-by
query is represented by a cuboid, and the set of all such group-bys forms a lattice of cuboids,
which is called a data cube.

For example, the AllElectronics sales data cube contains city, item, year, and sales in dollars, as
shown in the figure below. Taking the three attributes city, item, and year as the dimensions of the
data cube and sales in dollars as the measure, the total number of cuboids, or group-bys, that can be
computed for this data cube is 2^3 = 8. The possible group-bys are

{(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}

where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-bys
form a lattice of cuboids for the data cube, as shown in the figure below.
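
As a quick check of the count 2^3 = 8, the group-bys can be enumerated as all subsets of the dimension list. This is a minimal sketch; the function name cuboids is an illustrative choice.

```python
from itertools import combinations

dimensions = ("city", "item", "year")


def cuboids(dims):
    """Return every subset of dims, from the base cuboid down to the apex ()."""
    return [combo
            for k in range(len(dims), -1, -1)
            for combo in combinations(dims, k)]


lattice = cuboids(dimensions)
for group_by in lattice:
    print(group_by)
```

The first tuple printed is the base cuboid (all three dimensions) and the last is the empty group-by, which corresponds to the apex cuboid.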

A data cube can be viewed as a lattice of cuboids.

The bottom-most cuboid is the base cuboid.

The top-most cuboid (apex) contains only one cell.
The number of cuboids in an n-dimensional cube whose dimensions have concept hierarchies can be calculated as
T = \prod_{i=1}^{n} (L_i + 1), i.e. T = (L_1 + 1) × (L_2 + 1) × ... × (L_n + 1)
where L_i is the number of levels associated with dimension i. One is added to L_i to include the virtual top level "all".
The figure shows the lattice of cuboids making up a 4-D data cube for the dimensions time,
item, location, and supplier. Each cuboid represents a different degree of summarization.
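
The cuboid count can be worked through with a small helper, taking the total as the product over all dimensions of (L_i + 1). The level counts used below are assumptions chosen only for illustration, for example a time dimension with the four levels day, month, quarter, and year.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Multiply (L_i + 1) over all dimensions; the +1 is the top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Three dimensions with one level each: matches the 2^3 = 8 group-bys.
flat_3d = total_cuboids([1, 1, 1])

# Four dimensions (time, item, location, supplier) with one level each.
flat_4d = total_cuboids([1, 1, 1, 1])

# Assumed hierarchy: time has 4 levels, the other three dimensions have 1.
with_time_hierarchy = total_cuboids([4, 1, 1, 1])
```

With no hierarchies the formula reduces to 2^n, which is why the 3-D example has 8 cuboids and a flat 4-D cube has 16.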

Partial Materialization: Selected Computation of Cuboids

Three types of data cube materialization are:


1. No Materialization:
-No nonbase cuboid is precomputed.
-Multidimensional aggregates must be computed on the fly, which is expensive.
-Query answering can be extremely slow.

2. Full Materialization:

-All of the cuboids are precomputed. The resulting lattice of computed cuboids is referred to as the full cube.
-A large amount of memory space is required to store all of the cuboids.

3. Partial Materialization:

-Only a subset of the cuboids, typically those that are used frequently, is precomputed.
-Alternatively, based on user-specified criteria, only a subset of the cells of a cuboid is computed.
-It represents an interesting trade-off between storage space and response time.
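
The trade-off can be sketched as follows. This is an illustration under stated assumptions, not a description of how any OLAP server chooses cuboids: the base cuboid contents and the choice of which cuboids to precompute are hypothetical.

```python
from collections import defaultdict

DIMS = ("city", "item", "year")

# Hypothetical base cuboid: (city, item, year) -> sales in dollars.
BASE = {
    ("Pune", "TV", 2023): 100,
    ("Pune", "Phone", 2023): 50,
    ("Mumbai", "TV", 2024): 70,
}


def aggregate(group_by):
    """Compute one cuboid from the base cuboid by summing over dropped dims."""
    positions = [DIMS.index(d) for d in group_by]
    out = defaultdict(int)
    for key, sales in BASE.items():
        out[tuple(key[i] for i in positions)] += sales
    return dict(out)


# Partial materialization: precompute only cuboids assumed to be queried often.
MATERIALIZED = {gb: aggregate(gb) for gb in [("city",), ("city", "year")]}


def query(group_by):
    """Answer from a precomputed cuboid if available, else compute on the fly."""
    if group_by in MATERIALIZED:
        return MATERIALIZED[group_by], "precomputed"
    return aggregate(group_by), "computed on the fly"
```

With an empty MATERIALIZED this behaves like no materialization, and precomputing every subset of DIMS would correspond to full materialization.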

Indexing OLAP Data: Bitmap Index and Join Index


-To give efficient access to the data, we can build index-based data storage. OLAP data
can be indexed using a bitmap index or a join index.

1. Bitmap Indexing

-It allows quick searching in data cubes.


-The bitmap index is an alternative representation of the record ID (RID) list.
-For a given attribute, there is a distinct bit vector for each value in the attribute's domain.
-If the attribute's domain consists of n values, then n bits are needed for each entry in the
bitmap index.
-If a row of the table holds a given attribute value, the bit representing that value is set to 1 in the
corresponding row of the bitmap index, and all other bits for that row are set to 0.
For example, a base table with the dimensions Region and Type maps to one bitmap index
table for each of the two dimensions.

Advantages
-It represents each value with a single bit.
-It reduces processing time, because comparisons, joins, and aggregations become bit operations.
-It reduces the storage space required.
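
A minimal sketch of building a bitmap index follows. The base table rows and the Region and Type values are hypothetical stand-ins for the table that the notes refer to, and build_bitmap_index is an illustrative helper name.

```python
def build_bitmap_index(rows, attribute):
    """Map each distinct value of attribute to a bit vector over the rows."""
    values = sorted({row[attribute] for row in rows})
    return {v: [1 if row[attribute] == v else 0 for row in rows]
            for v in values}


# Hypothetical base table with the dimensions region and type.
rows = [
    {"rid": "R1", "region": "Asia", "type": "Retail"},
    {"rid": "R2", "region": "Europe", "type": "Dealer"},
    {"rid": "R3", "region": "Asia", "type": "Dealer"},
    {"rid": "R4", "region": "America", "type": "Retail"},
]

region_index = build_bitmap_index(rows, "region")
type_index = build_bitmap_index(rows, "type")

# Selection "region = Asia AND type = Dealer" becomes a bitwise AND.
matches = [a & b for a, b in zip(region_index["Asia"], type_index["Dealer"])]
matching_rids = [row["rid"] for row, bit in zip(rows, matches) if bit]
```

Each row has exactly one bit set per attribute, and combining conditions needs only bit arithmetic rather than a scan over attribute values.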

2.Join Index Method

-It is useful for queries on relational (RDBMS) data.


-It registers the joinable rows of two relations from a relational database.
-It links foreign keys with their matching primary keys.
-It is useful for deriving subcubes from data cubes.

-It maintains relationships between the attribute values of a dimension and the corresponding rows
of the fact table.
-In data warehouses, a join index relates the values of the dimensions of a star schema to
rows in the fact table.
For example, given a star schema containing a fact table, sales, and two dimensions, city and
product, a join index on city maintains, for each distinct city, a list of RIDs of the
tuples recording the sales in that city, as shown in the figure below.
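
The join index on city can be sketched as a mapping from each dimension value to the RIDs of the matching fact rows. The RIDs, cities, and products below are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical sales fact table; rid identifies each fact tuple.
sales = [
    {"rid": "T57", "city": "Pune", "product": "TV", "dollars_sold": 300},
    {"rid": "T238", "city": "Pune", "product": "Phone", "dollars_sold": 120},
    {"rid": "T459", "city": "Mumbai", "product": "TV", "dollars_sold": 280},
    {"rid": "T884", "city": "Pune", "product": "TV", "dollars_sold": 310},
]


def build_join_index(fact_rows, dimension):
    """For each distinct dimension value, list the RIDs of matching fact rows."""
    index = defaultdict(list)
    for row in fact_rows:
        index[row[dimension]].append(row["rid"])
    return dict(index)


city_index = build_join_index(sales, "city")
product_index = build_join_index(sales, "product")

# Fact rows for city = Pune AND product = TV, found without scanning sales.
pune_tv = sorted(set(city_index["Pune"]) & set(product_index["TV"]))
```

Because the index already records which fact rows join with each city, a query for one city reads only the listed RIDs instead of performing the join at query time.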

Efficient Processing of OLAP Queries:


-Explore indexing structures and compressed versus dense array structures in MOLAP.
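
The difference between a dense array and a compressed (sparse) structure can be sketched as below. This is a simplified illustration with hypothetical data; real MOLAP engines use their own storage formats.

```python
# Hypothetical 2-D cube of units sold: 3 cities x 4 items, mostly empty.
cities = ["Pune", "Mumbai", "Delhi"]
items = ["TV", "Phone", "Laptop", "Camera"]
facts = {("Pune", "TV"): 10, ("Delhi", "Camera"): 3}

# Dense array: one slot per possible cell, addressed directly by position.
dense = [[0] * len(items) for _ in cities]
for (city, item), units in facts.items():
    dense[cities.index(city)][items.index(item)] = units

# Compressed (sparse) form: store only the non-empty cells.
sparse = {(cities.index(c), items.index(i)): u for (c, i), u in facts.items()}


def lookup_dense(city, item):
    """Direct array addressing; empty cells still occupy storage."""
    return dense[cities.index(city)][items.index(item)]


def lookup_sparse(city, item):
    """Lookup by coordinates; missing cells are treated as 0."""
    return sparse.get((cities.index(city), items.index(item)), 0)


dense_slots = sum(len(row) for row in dense)
sparse_slots = len(sparse)
```

Here the dense array reserves 12 slots for 2 non-empty cells, while the sparse form stores only those 2, at the cost of a lookup by key instead of direct addressing.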

OLAP Server Architecture:ROLAP Versus MOLAP Versus HOLAP

1.Relational OLAP Servers(ROLAP)

-It is an intermediate server.

-It sits between the relational back-end servers and the client front-end tools.

-ROLAP uses a relational or extended-relational database to store and manage the
data in the data warehouse.

-It uses OLAP middleware to support the missing pieces.

-It offers greater scalability than MOLAP.

ROLAP Architecture includes the following components

○ Database server.
○ ROLAP server.
○ Front-end tool.

Advantages-

-Can handle large amounts of data: the limit is the data size of the underlying relational
database; ROLAP itself imposes no limit on the amount of data.

-Can be used with data warehouse and OLTP systems.

-Performs better than MOLAP when the data is sparse.

-ROLAP servers can be used easily with an existing RDBMS.

Disadvantages

-Query performance can be poor, because each report is effectively one or more SQL queries against the relational database.

-Scalability has some limitations, depending on the technology architecture that is used.

2.Multidimensional OLAP Server(MOLAP)

-It supports multidimensional data views.

-It uses array based multidimensional storage engines to produce the views.

-It directly maps multidimensional views to data cube array structure.

-It uses data cubes for fast indexing to pre-computed summarized data.

MOLAP includes the following components −

○ Database server.
○ MOLAP server.
○ Front-end tool.

Advantages

Excellent performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.

Can perform complex calculations: All calculations are pre-generated when the cube is
created. Hence, complex calculations are not only possible, but they also return quickly.

Disadvantages

Limited in the amount of data it can handle: Because all calculations are performed when
the cube is built, it is not possible to include a large amount of data in the cube itself.

Requires additional investment: Cube technology is often proprietary and may not already
exist in the organization. Therefore, adopting MOLAP technology is likely to require additional
investment in human and capital resources.

3.Hybrid OLAP Server(HOLAP)

-It combines features of ROLAP and MOLAP for greater scalability and faster computation.

-It allows large volumes of detail data to be stored in a relational database, while aggregations
are kept in a separate MOLAP store.

-Microsoft SQL Server 2000 supports a hybrid OLAP server.

Advantages of HOLAP

1. HOLAP provides the benefits of both MOLAP and ROLAP.

2. It provides fast access at all levels of aggregation.

3. HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server and the detail record remains in the relational database. So no
duplicate copy of the detail record is maintained.

Disadvantages of HOLAP

1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.

Difference Between ROLAP, MOLAP and HOLAP

Basis for comparison  ROLAP                         MOLAP                           HOLAP
--------------------  ----------------------------  ------------------------------  ----------------------------
Acronym               Relational online analytical  Multidimensional online         Hybrid online analytical
                      processing                    analytical processing           processing

Storage method        Data is stored in the main    Data is stored in the           Data is stored in the
                      data warehouse                registered MDDB database        relational databases

Fetching method       Data is fetched from the      Data is fetched from the        Data is fetched from the
                      main repository               proprietary database            relational databases

Data arrangement      Data is arranged and saved    Data is arranged and stored     Data is arranged in
                      as tables with rows and       as data cubes                   multidimensional form
                      columns

Volume                Enormous volumes of data      Limited data, kept in the       Large volumes of data can
                      are processed                 proprietary database, is        be processed
                                                    processed

Technique             Works with SQL                Works with sparse matrix        Uses both sparse matrix
                                                    technology                      technology and SQL

Designed view         Dynamic access                Static access                   Dynamic access

Response time         Maximum response time         Minimum response time           Minimum response time
