Data Warehouse


DATA WAREHOUSE

Introduction:
Advances in IT and database technology have increased our ability to generate and collect data.

Contributing factors include:
the computerization of many businesses, the widespread use of bar codes for most commercial products, and advances in data collection tools.

The popularity of the World Wide Web as a global information system has flooded us with a tremendous amount of data and information.

This explosive growth in stored data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge.

Managing data is one of the important objectives of computer science. To be managed efficiently, data must be stored in a suitable architecture. Data warehousing helps in this respect by storing data in multiple dimensions.

A data warehouse is a database that is maintained separately from an organization's operational databases.

Data warehouse systems allow for the integration of a variety of application systems.

Data warehouses are designed to facilitate reporting and analysis.

Definition:
A Data Warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

Definition:
A Data Warehouse is a repository of subjectively selected and adapted operational data which can answer any ad hoc, complex, statistical or analytical queries.

Definition:
A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Features:

1. Subject Oriented:
A DW is organized around major subjects such as customer, supplier, product and sales.

2. Integrated:
A DW is constructed by integrating multiple, heterogeneous data sources such as relational databases, flat files, and on-line transaction records.

3. Time Variant:
The time horizon for the data warehouse is significantly longer than that of operational systems.

4. Nonvolatile:
Nonvolatile means that, once entered into the warehouse, data should not change.

5. Accessible:
The primary purpose of a data warehouse is to provide readily accessible information to end users.

Characteristics:

Smaller number of (concurrent) users. Instant response is less important (it matters only when composing reports interactively).

Read-only access by users. Most data access will be targeted at a small partition of the data: the last month or quarter.

Database access is less frequent but executes large and complicated queries that access many rows per table.

Irregular, primarily long-running and complex read-only transactions instead of a high, constant transaction rate.

Loads from the operational data store only insert new records; existing ones do not get changed (updated).

Bulk load from the operational data store (at most once daily), no single-record inserts.

Large storage capacity for historical data. May also contain aggregate data.

Benefits of data warehousing:
Some of the benefits that a data warehouse provides are as follows:

A common data model. A DW makes it easier to report and analyze information.

Prior to loading data into the data warehouse, inconsistencies are identified and resolved.

Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.

Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

Data Warehousing:
Data warehousing is the process of constructing and using data warehouses. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

Process in Warehousing:
Extract, Transform, and Load (ETL) is a process in data warehousing that involves:
extracting data from outside sources,
transforming it to fit business needs, and
loading it into the end target, i.e. the data warehouse.

1) Extract:
The first part of an ETL process is to extract the data from the source systems.
An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure.
If not, the data may be rejected entirely.
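
The parsing check described above can be sketched in a few lines. This is only an illustration: the three-field layout (`sale_id,quantity,unit_price`), the pattern, and the function name `extract` are assumptions made for the example, not something the slides specify.

```python
import re

# Hypothetical expected layout of a source line: sale_id,quantity,unit_price
ROW_PATTERN = re.compile(r"^\d+,\d+,\d+(\.\d+)?$")

def extract(raw_text):
    """Split raw source text into parsed rows and rejected lines."""
    accepted, rejected = [], []
    for line in raw_text.strip().splitlines():
        line = line.strip()
        if ROW_PATTERN.match(line):
            sale_id, qty, price = line.split(",")
            accepted.append(
                {"sale_id": int(sale_id), "qty": int(qty), "unit_price": float(price)}
            )
        else:
            # The line does not meet the expected structure, so it is rejected.
            rejected.append(line)
    return accepted, rejected

raw = """
1,3,9.50
2,two,4.00
3,1,12
"""
accepted, rejected = extract(raw)
print(accepted)
print(rejected)
```

Here rejection is applied per line; a stricter loader might reject the whole batch when any line fails.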

2) Transform:

The transform stage applies a series of rules or functions to the extracted data.
One or more of the following transformation types may be required to meet the business and technical needs of the end target:
Selecting only certain columns to load (or selecting null columns not to load).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female).
Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M").
Deriving a new calculated value (e.g., sale_amount = qty * unit_price).
Filtering.
Sorting.
Joining together data from multiple sources.
Aggregation.
Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns).
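
A few of the transformations listed above can be sketched as a single function. The row layout, the `note` column, and the rule "drop rows with a non-positive quantity" are assumptions made for illustration; only the 1/2 to M/F mapping and the `sale_amount = qty * unit_price` formula come from the list.

```python
# Source system codes -> warehouse codes (from the example in the list above)
GENDER_CODES = {"1": "M", "2": "F"}

def transform(rows):
    """Apply column selection, code translation, derivation, filtering and sorting."""
    out = []
    for row in rows:
        if row["qty"] <= 0:  # filtering (hypothetical business rule)
            continue
        out.append({
            "sale_id": row["sale_id"],                       # selecting certain columns
            "gender": GENDER_CODES.get(row["gender"]),       # translating coded values
            "sale_amount": row["qty"] * row["unit_price"],   # deriving a calculated value
        })
    return sorted(out, key=lambda r: r["sale_id"])           # sorting

rows = [
    {"sale_id": 2, "gender": "2", "qty": 1, "unit_price": 4.0, "note": "x"},
    {"sale_id": 1, "gender": "1", "qty": 3, "unit_price": 9.5, "note": "y"},
    {"sale_id": 3, "gender": "1", "qty": 0, "unit_price": 2.0, "note": "z"},
]
result = transform(rows)
print(result)
```

The `note` column is dropped simply by not copying it, which is the "selecting only certain columns" case.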

3) Load:
The load phase loads the data into the end target, usually the data warehouse.
Depending on the requirements of the organization, this process varies widely.
Some data warehouses might overwrite existing information weekly with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g. hourly.
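
The two load styles mentioned above, overwriting versus appending in historized form, might look as follows. This is a minimal sketch using Python's built-in `sqlite3` module; the table `sales`, its columns, and the batch labels are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, sale_amount REAL, batch TEXT)")

def load_overwrite(conn, rows, batch):
    """Replace the table contents with a cumulative, updated snapshot."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(sale_id, amount, batch) for sale_id, amount in rows],
        )

def load_append(conn, rows, batch):
    """Add new rows in historized form, keeping everything loaded earlier."""
    with conn:
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(sale_id, amount, batch) for sale_id, amount in rows],
        )

load_overwrite(conn, [(1, 28.5), (2, 4.0)], "week-1")
load_overwrite(conn, [(1, 28.5), (2, 4.0), (3, 7.0)], "week-2")
overwrite_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

load_append(conn, [(4, 1.0)], "hour-1")
append_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(overwrite_count, append_count)
```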

Need for a separate data warehouse:

Why not perform online analytical processing directly on the operational database?
Why spend additional time and resources to construct a separate data warehouse?

1) The major reason for such separation is to promote high performance of both systems.
2) OLAP operations on an operational database reduce the throughput of the OLTP system.
3) The separation is based on the different structures, contents and uses of the data in the two systems.

Since the two systems provide quite different functionality and require different kinds of data, it is necessary to maintain separate databases.

Data Warehouse Process Flow:


[Figure: data warehouse process flow. Labels: Operational & External Data; Extract; Transform; Distribute; Store; Find & Understand; Display, Analyze, Discover; Automate & Manage; Meta Data.]

Meta Data:
Meta data is information about the data.
It is the data that describes the structure and business meaning of the data stored in a data warehouse.

Meta data answers questions such as:
What data exists?
Where is it located?
What format is it in?
Who is responsible for it?
How is it related to other data?
When was it related to other data?
When was the data last updated?
What tools can I use to retrieve it?

Types of Metadata:

Technical Metadata:
Detailed picture of the warehouse.
Flow of data.
Extraction sources & processes.
Business rules.
Data transformation rules.
Load rules and procedures.
Physical data structures.

Business Metadata:
Connects business users to the data warehouse data that is relevant to them.
Data context.
Understandability.
Accessibility.
Validity.

OLTP:
Online transaction processing.
OLTP refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.

OLTP systems respond immediately to user requests.
The major task of OLTP is to perform online transaction and query processing.
OLTP systems cover the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration and accounting.

An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application.

Benefits:
Online Transaction Processing has two key benefits:
1. Simplicity and
2. Efficiency.

Disadvantages:
It is a great tool for any organization, but in using OLTP there are a few things to be wary of:
1. Security issues and
2. Economic costs.

OLAP:
Online Analytical Processing.
OLAP is an approach to quickly provide answers to analytical queries that are multi-dimensional in nature.

OLAP organizes and presents data in various formats in order to accommodate the diverse needs of different users.

It serves users or knowledge workers in the role of data analysis and decision making.

The typical applications of OLAP are in:
business reporting for sales,
marketing,
management reporting,
business process management (BPM),
budgeting and forecasting,
financial reporting and similar areas.

Distinguishing features between OLTP & OLAP are:

Users & System Orientation:
OLTP: is customer-oriented and is used for transaction processing.
OLAP: is market-oriented and is used for data analysis.

Data contents:
OLTP: manages current data that are typically too detailed to be easily used for decision making.
OLAP: manages large amounts of historical data and provides facilities for summarization & aggregation.

View:
OLTP: focuses mainly on the current data within an enterprise or department.
OLAP: focuses on historical data.

Access Patterns:
OLTP: consists of short, atomic transactions. Requires concurrency control and recovery mechanisms.
OLAP: consists mostly of read-only operations.

Feature            OLTP                             OLAP
Characteristics    Operational processing           Information processing
Orientation        transaction                      analysis
users              clerk, IT professional           knowledge worker
function           day to day operations            decision support
DB design          application-oriented, ER based   subject-oriented, Star/snowflake
data               current, up-to-date              historical
View               Detailed                         summarized
Focus              Data in                          Information out
usage              repetitive                       ad-hoc
access             read/write                       lots of scans
unit of work       short, simple transaction        complex query
records accessed   tens                             millions
number of users    thousands                        hundreds
DB size            100MB-GB                         100GB-TB

Multidimensional Data Model:


A data model is a way to describe data and to issue queries against it. DW & OLAP tools are based on a multi-dimensional data model. This model views data in the form of a data cube.

Data Cube:
A data cube allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts.

Dimensions:
Dimensions are perspectives or entities with respect to which an organization wants to keep records.

For example, a sales data warehouse may keep records of the store's sales with respect to the dimensions time, item, branch, and location.

Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.

For example, a dimension table for item may contain the attributes item_name, brand, type etc.

Facts:
A multidimensional data model is organized around a central theme, e.g. sales.
This theme is represented by a fact table.
Facts are numerical measures.
They are the quantities by which we want to analyze relationships between dimensions.
E.g., facts for a sales DW include dollars_sold, units_sold, amt_budgeted.
The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.

2 D View

                 item (type)
Time (quarter)   home ent.   computer   phone   security
Q1               605         825        14      400
Q2               680         925        31      512
Q3               812         1023       30      501
Q4               927         1038       38      580

3 D View

location = Chicago
                 item (type)
Time (quarter)   home ent.   computer   phone   security
Q1               605         825        14      400
Q2               680         925        31      512
Q3               812         1023       30      501
Q4               927         1038       38      580

location = New York
                 item (type)
Time (quarter)   home ent.   computer   phone   security
Q1               1087        968        38      872
Q2               1130        1024       41      925
Q3               1034        1048       45      1002
Q4               1142        1091       54      984

location = Toronto
                 item (type)
Time (quarter)   home ent.   computer   phone   security
Q1               818         746        43      591
Q2               894         769        52      682
Q3               940         795        58      728
Q4               978         864        59      784

location = Vancouver
                 item (type)
Time (quarter)   home ent.   computer   phone   security
Q1               854         822        89      623
Q2               680         925        31      512
Q3               812         1023       30      501
Q4               927         1038       38      580

A 3-D data cube representation of the data according to the dimensions time, item and location. The measure displayed is dollars_sold.

[Figure: 3-D data cube with axes Time (Q1 to Q4), items (Home Ent, Comp, Phone, Security) and location (Chicago, New York, Toronto, Vancouver).]

In data warehousing literature, a data cube such as each of the above is referred to as a cuboid.
The cuboid that holds the lowest level of summarization is called the base cuboid.
The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
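
To make the base and apex cuboids concrete, the sketch below stores a small base cuboid as a dictionary keyed by (time, item, location) and derives coarser cuboids by summing out dimensions. The helper name `cuboid` and the dictionary layout are assumptions; the dollar figures are a handful of cells borrowed from the views above purely for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("time", "item", "location")

# Base cuboid: (time, item, location) -> dollars_sold (illustrative cells only)
base_cuboid = {
    ("Q1", "home ent.", "New York"): 1087,
    ("Q2", "home ent.", "New York"): 1130,
    ("Q1", "computer", "New York"): 968,
    ("Q2", "computer", "New York"): 1024,
    ("Q1", "home ent.", "Toronto"): 818,
    ("Q2", "home ent.", "Toronto"): 894,
    ("Q1", "computer", "Toronto"): 746,
    ("Q2", "computer", "Toronto"): 769,
}

def cuboid(base, keep):
    """Sum the measure over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, dollars in base.items():
        totals[tuple(key[p] for p in positions)] += dollars
    return dict(totals)

by_item = cuboid(base_cuboid, ("item",))  # a 1-D cuboid
apex = cuboid(base_cuboid, ())            # the 0-D apex cuboid: one grand total
print(by_item)
print(apex)
```

Keeping all three dimensions returns the base cuboid unchanged, and keeping none collapses everything into the single apex total.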

OLAP Operations
- The multidimensional data model allows data to be stored in multiple dimensions.
- Each dimension contains multiple levels of abstraction defined by concept hierarchies.
- This lets users view data from different perspectives.
- There are a number of OLAP operations to materialize these different views.

[Figure: central sales data cube with axes location (Chicago, New York, Toronto, Vancouver), Time (Q1 to Q4) and items (Home Ent, Comp, Phone, Security).]

Roll Up:
Roll up is also called the drill-up operation.
It performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.

[Figure: roll-up on location from cities to countries (USA, Canada), with axes Time (Q1 to Q4) and items (Home Ent, Comp, Phone, Security).]

The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country.
Rather than grouping the data by city, the resulting cube groups the data by country.

When roll up is performed by dimension reduction, one or more dimensions are removed from the given cube.
E.g., consider a sales data cube containing two dimensions, location and time. Roll up may be performed by removing the time dimension.
This results in an aggregation of the total sales by location.
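
Both forms of roll up can be sketched on a tiny two-dimensional cube. The dictionary layout and function names are assumptions, and the figures are illustrative cells rather than a faithful copy of any one table above.

```python
from collections import defaultdict

# City -> country step of the location hierarchy
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "New York": "USA"}

# (time, location) -> dollars_sold (illustrative figures)
sales = {
    ("Q1", "Vancouver"): 605, ("Q2", "Vancouver"): 680,
    ("Q1", "Toronto"): 818,   ("Q2", "Toronto"): 894,
    ("Q1", "New York"): 1087, ("Q2", "New York"): 1130,
}

def roll_up_location(cube):
    """Climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city), dollars in cube.items():
        out[(quarter, CITY_TO_COUNTRY[city])] += dollars
    return dict(out)

def roll_up_drop_time(cube):
    """Dimension reduction: remove the time dimension entirely."""
    out = defaultdict(int)
    for (_quarter, city), dollars in cube.items():
        out[city] += dollars
    return dict(out)

by_country = roll_up_location(sales)
by_city = roll_up_drop_time(sales)
print(by_country)
print(by_city)
```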

Drill Down:
It is the reverse of roll up.
It navigates from less detailed data to more detailed data.
It can be done either by stepping down a concept hierarchy or by introducing additional dimensions.

[Figure: drill-down on time from quarters to months (Jan to Dec), with items (Home Ent, Comp, Phone, Security).]

Drill down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.
The resulting data cube details the total sales per month rather than summarized by quarter.

Since drill down adds more detail to the given data, it can also be performed by adding new dimensions to a cube.
E.g., an additional dimension customer_type can be introduced.
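
Drill down on time can be sketched by keeping the data at month level and treating the quarterly view as a summary of it. All monthly figures and names below are hypothetical and chosen only for the example.

```python
from collections import defaultdict

MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "June": "Q2",
}

# month -> total sales (hypothetical figures)
monthly_sales = {"Jan": 150, "Feb": 100, "Mar": 150, "Apr": 120, "May": 130, "June": 110}

def by_quarter(monthly):
    """Summarized view: total sales per quarter."""
    out = defaultdict(int)
    for month, dollars in monthly.items():
        out[MONTH_TO_QUARTER[month]] += dollars
    return dict(out)

def drill_down(monthly, quarter):
    """Step down the time hierarchy: show the months behind one quarter."""
    return {m: d for m, d in monthly.items() if MONTH_TO_QUARTER[m] == quarter}

quarterly = by_quarter(monthly_sales)
q1_detail = drill_down(monthly_sales, "Q1")
print(quarterly)
print(q1_detail)
```

Drill down is only possible because the detailed rows were kept; a summary alone cannot be expanded back into months.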

Slice:
The slice operation performs a selection on one dimension of the given cube, resulting in a sub cube.
The following figure shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = Q1.

[Figure: slice for time = Q1, a 2-D table over Location (Chicago, New York, Toronto, Vancouver).]

Dice:
The dice operation defines a sub cube by performing a selection on two or more dimensions.
The following figure shows a dice operation on the central cube based on a selection criterion that involves 3 dimensions: (location = Toronto or Vancouver) and (time = Q1 or Q2) and (item = home entertainment or computer).

[Figure: dice result, a sub cube over location, Time (Q1, Q2) and Item (type) (Home Ent, Comp).]

Pivot:
Pivot is also called rotate.
It is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
The following figure shows a pivot operation where the item and location axes in a 2-D slice are rotated.
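
Slice, dice and pivot can each be written as a small function over a cube stored as a dictionary keyed by (time, item, location). The function names and the handful of cells below are assumptions made for illustration.

```python
# (time, item, location) -> dollars_sold (illustrative cells only)
cube = {
    ("Q1", "home ent.", "Toronto"): 818,
    ("Q1", "computer", "Toronto"): 746,
    ("Q1", "phone", "Toronto"): 43,
    ("Q2", "home ent.", "Toronto"): 894,
    ("Q2", "computer", "Toronto"): 769,
    ("Q3", "home ent.", "Toronto"): 940,
    ("Q1", "home ent.", "New York"): 1087,
    ("Q1", "computer", "New York"): 968,
}

def slice_cube(cube, time):
    """Slice: fix one value on the time dimension, leaving an (item, location) table."""
    return {(item, loc): v for (t, item, loc), v in cube.items() if t == time}

def dice_cube(cube, times, items, locations):
    """Dice: select on two or more dimensions at once, keeping a smaller cube."""
    return {
        key: v for key, v in cube.items()
        if key[0] in times and key[1] in items and key[2] in locations
    }

def pivot(table):
    """Pivot: swap the two axes of a 2-D table."""
    return {(b, a): v for (a, b), v in table.items()}

q1 = slice_cube(cube, "Q1")
sub = dice_cube(cube, {"Q1", "Q2"}, {"home ent.", "computer"}, {"Toronto", "Vancouver"})
rotated = pivot(q1)
print(q1)
print(sub)
print(rotated)
```

Pivot changes only the presentation: `rotated` holds the same values as `q1` with the item and location axes exchanged.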

Cube: A Lattice of Cuboids

0-D (apex) cuboid:
all

1-D cuboids:
time
item
location
supplier

2-D cuboids:
time, item
time, location
time, supplier
item, location
item, supplier
location, supplier

3-D cuboids:
time, item, location
time, item, supplier
time, location, supplier
item, location, supplier

4-D (base) cuboid:
time, item, location, supplier
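
The lattice above can be enumerated mechanically: every subset of the dimensions is one cuboid, so n dimensions without hierarchies give 2^n cuboids. A short sketch using `itertools.combinations` (the function name `lattice` is hypothetical):

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

def lattice(dims):
    """Group every subset of the dimensions by its size (0-D up to n-D)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

levels = lattice(dimensions)
for k, cuboids in levels.items():
    print(f"{k}-D:", cuboids)

total_cuboids = sum(len(c) for c in levels.values())
print(total_cuboids)
```

For the four dimensions here this gives 1, 4, 6, 4 and 1 cuboids per level, matching the listing above.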

Schemas for Multidimensional Database:


A database schema consists of a set of entities and the relationships between them.
The entity-relationship data model is commonly used in the design of relational databases.
Such a data model is appropriate for on-line transaction processing.
A data warehouse requires a concise, subject-oriented schema that facilitates on-line data analysis.
The most popular data model for a data warehouse is a multidimensional model.
Such a model can exist in the form of a star schema, a snowflake schema or a fact constellation schema.

Star Schema:

It is the most common modeling paradigm.

In a star schema the DW contains:
A large central table (fact table) containing the bulk of the data with no redundancy. Facts are numerical measures.
A set of smaller attendant tables (dimension tables), one for each dimension. Dimensions are perspectives or entities with respect to which an organization wants to keep records.

Example of Star Schema

Sales Fact Table:
time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

time:
time_key, day, day_of_the_week, month, quarter, year

item:
item_key, item_name, brand, type, supplier_type

branch:
branch_key, branch_name, branch_type

location:
location_key, street, city, province_or_state, country
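
A cut-down version of this star schema can be tried out with Python's built-in `sqlite3` module. The table names with a `_dim` suffix, the reduced column sets, and all row values are assumptions made for the sketch; only the key and measure names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table: one key per dimension plus the numerical measures
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'TV', 'home ent.'), (2, 'Laptop', 'computer');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 10, 605.0),
    (1, 2, 1, 5, 825.0),
    (1, 1, 2, 12, 818.0),
    (2, 1, 1, 11, 680.0);
""")

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
print(rows)
```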

Snowflake schema:
It is a variant of the star schema model.
Here the dimension tables are normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such tables are easy to maintain.
The snowflake structure can reduce the effectiveness of browsing since more joins will be needed to execute a query. System performance may be adversely impacted.

Example of Snowflake Schema

Sales Fact Table:
time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

time:
time_key, day, day_of_the_week, month, quarter, year

item:
item_key, item_name, brand, type, supplier_key

supplier:
supplier_key, supplier_type

branch:
branch_key, branch_name, branch_type

location:
location_key, street, city_key

city:
city_key, city, province_or_state, country
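
The extra join that a snowflake schema introduces can be seen in a small `sqlite3` sketch. Only the location branch is modeled, and the streets and dollar amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized: city details live in their own table.
CREATE TABLE city (
    city_key INTEGER PRIMARY KEY, city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL
);

INSERT INTO city VALUES
    (1, 'Vancouver', 'British Columbia', 'Canada'),
    (2, 'Chicago', 'Illinois', 'USA');
INSERT INTO location VALUES (1, '1 Main St', 1), (2, '2 Oak Ave', 1), (3, '3 Lake Rd', 2);
INSERT INTO sales_fact VALUES (1, 605.0), (2, 825.0), (3, 854.0);
""")

# Reaching country now takes two joins (fact -> location -> city)
# instead of the single join a star schema would need.
rows = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location AS l ON l.location_key = f.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
```

Each city's province and country are stored once rather than on every street row, which is the redundancy saving traded against the additional join.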

Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
Consider a concept hierarchy for the dimension location.
City values for location include Vancouver, Toronto, New York, and Chicago.
Each city can be mapped to the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can in turn be mapped to the country to which they belong, such as Canada or the USA.
These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e. cities) to higher-level, more general concepts (i.e. countries).

[Figure: concept hierarchy for location]
all: all
Country: Canada, USA
Province or state: British Columbia, Ontario (Canada); New York, Illinois (USA)
City: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)

[Figure: Hierarchy for location: Street, City, Province or state, Country. A lattice for time: Day, Week, Month, Quarter, Year.]
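
The location hierarchy can be represented as two lookup tables, one per step. The function name `generalize` and the level names are assumptions; the mappings follow the cities, provinces or states, and countries named above.

```python
CITY_TO_STATE = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario", "Ottawa": "Ontario",
    "New York": "New York", "Buffalo": "New York",
    "Chicago": "Illinois",
}
STATE_TO_COUNTRY = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
}

def generalize(city, level):
    """Map a city up the hierarchy to the requested level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "province_or_state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    if level == "all":
        return "all"
    raise ValueError(f"unknown level: {level}")

print(generalize("Vancouver", "province_or_state"))
print(generalize("Chicago", "country"))
```

A roll up on location applies such a mapping to every cell's key and sums the cells that end up with the same key.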

Architecture of Data Warehouse

One Tier:
A one-tier DW is suitable for small to medium sized businesses.
The relevant source data is collected and transformed into the warehouse.
This is then accessed by all users.

Two Tier:
A two-tier DW is suitable for medium to large sized businesses.
There is a high degree of variance in types of users and analysis requirements.
All data is kept in the warehouse.
Specific subsets of the data are moved to a second level of data storage called marts.
This data is accessed by the users.

Virtual Warehouse:
In a virtual warehouse a user is given direct access to the source data instead of transforming it into a warehouse and then accessing the warehouse.

Three Tier:
In a three-tier data warehouse an OLAP server is placed between the users and the data warehouse.

[Figure: three-tier data warehouse architecture]
Data Sources: Operational Databases, External sources (Extract, Transform, Load, Refresh; Monitor & Integrator).
Bottom Tier, Data Storage: Data Warehouse, Data Marts, Metadata Repository.
Middle Tier, OLAP Engine: OLAP Server.
Top Tier, Front-End Tools: Analysis, Query, Reports, Data mining.

OLAP Server:

Relational OLAP (ROLAP):
Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services.
Greater scalability.

Multidimensional OLAP (MOLAP):
Array-based multidimensional storage engine (sparse matrix techniques).
Fast indexing to pre-computed summarized data.

Data Marts:
Definition: An implementation of a data warehouse with a restricted scope of content and support for analytical processing, serving a single department of an organization or a particular data analysis problem domain.

Data marts usually contain operational data, summarized data, spatial data, or meta data.

A data mart is a subset of a data warehouse that is stored in a smaller database and is designed to fulfill a specific purpose or data subject rather than enterprise-wide planning.

Data marts can exist without being connected to a data warehouse.
Data marts can be classified as:
Dependent data marts.
Independent data marts.

Dependent Data Marts:
Dependent data marts use the same extract and transformation processes and have the same content for all the shared data.

Independent Data Marts:
An independent data mart stands alone and has its own unique system for extracting and transforming the source data.

Data Warehouse For Business Intelligence:


Business intelligence is a term commonly associated with data warehousing.
Business intelligence usually refers to the information that is available for the enterprise to make decisions on.
A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence.
