Data Warehouse

Introduction:
Advances in IT and database technology have increased our capabilities for generating and collecting data.

Contributing factors include:
- the computerization of many businesses,
- the widespread use of bar codes for most commercial products,
- advances in data collection tools.
The popular use of the World Wide Web as a global information system has flooded us with a tremendous amount of data and information.
This explosive growth in stored data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge.
Management of data is one of the important objectives of computer science. Efficient management requires data to be stored in a suitable architecture. Data warehousing helps in this respect by storing data in multiple dimensions.
A Data Warehouse refers to a database that is maintained separately from an organization's operational databases.

Data Warehouse systems allow for the integration of a variety of
application systems.
Data Warehouses are designed to facilitate reporting & analysis.
Definition:
A Data Warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Definition:
A Data Warehouse is a repository of subjectively selected and adapted operational data which can answer any ad hoc, complex, statistical or analytical queries.
Definition:
A Data Warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.
Features:
1. Subject Oriented:
A DW is organized around major subjects such as customer, supplier, product and sales.
2. Integrated:
A DW is constructed by integrating multiple, heterogeneous data sources such as relational databases, flat files and on-line transaction records.
3. Time Variant:
The time horizon for the data warehouse is significantly longer than that of operational systems.
4. Nonvolatile:
Nonvolatile means that, once entered into the warehouse, data should not change.
5. Accessible:
The primary purpose of a data warehouse is to provide readily accessible information to end users.
Characteristics:

- Smaller number of (concurrent) users.
- Instant response is less important (only for interactively composing reports).
- Read-only access by users.
- Most data access is targeted at a small partition of the data: the last month or quarter.
- Database access is less frequent but executes large and complicated queries that access many rows per table.
- Inconsistent workload, with primarily long-running and complex read-only transactions instead of a high constant transaction rate.
- Loads from the operational data store only insert new records; existing ones are not changed (updated).
- Bulk load from the operational data store, no single-record inserts (at most once daily).
- Large storage capacity for historical data. May also contain aggregate data.
Benefits of data warehousing:
Some of the benefits that a data warehouse provides are as follows:
- Common data model. A DW makes it easier to report and analyze information.
- Prior to loading data into the data warehouse, inconsistencies are identified and resolved.
- Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
- Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
Data Warehousing:
Data warehousing is the process of constructing and using data warehouses. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
Process in Warehousing:
Extract, Transform, and Load (ETL) is a process in data warehousing that involves:
- extracting data from outside sources,
- transforming it to fit business needs,
- loading it into the end target, i.e. the data warehouse.
1) Extract:
The first part of an ETL process is to extract the data from the source
systems.
An intrinsic part of the extraction is the parsing of the extracted data, which checks whether the data meets an expected pattern or structure.
If not, the data may be rejected entirely.
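The parsing check described above can be sketched as follows. This is a minimal illustration, assuming a CSV source; the column names in EXPECTED_COLUMNS and the function name extract are hypothetical and not part of the original material.

```python
import csv
import io

# Hypothetical layout of the source file (an assumption for this sketch).
EXPECTED_COLUMNS = ["item", "qty", "unit_price"]

def extract(raw_text):
    """Parse CSV text and reject the whole extract if its structure is unexpected."""
    reader = csv.DictReader(io.StringIO(raw_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        try:
            rows.append({
                "item": row["item"],
                "qty": int(row["qty"]),
                "unit_price": float(row["unit_price"]),
            })
        except (TypeError, ValueError) as exc:
            # A missing or non-numeric field means the pattern is not met.
            raise ValueError(f"line {line_no} does not match the expected pattern") from exc
    return rows

good = "item,qty,unit_price\nphone,2,14.0\n"
bad = "item,qty\nphone,2\n"
print(extract(good))
```

Calling extract(bad) raises ValueError, which corresponds to rejecting the data entirely.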
2) Transform:
The transform stage applies a series of rules or functions to the extracted data.
One or more of the following transformation types may be required to meet the business and technical needs of the end target:
- Selecting only certain columns to load (or selecting null columns not to load).
- Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female).
- Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to M).
- Deriving a new calculated value (e.g., sale_amount = qty * unit_price).
- Filtering.
- Sorting.
- Joining together data from multiple sources.
- Aggregation.
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
- Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns).
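A few of these transformations can be sketched on a single row. This is only an illustration: the field names gender_code, tags and internal_note, and the output names tag_1, tag_2, are assumptions made up for the example.

```python
# Source code -> warehouse code, following the 1/2 to M/F example.
GENDER_CODES = {1: "M", 2: "F"}

def transform(row):
    """Apply column selection, code translation, derivation and splitting to one row."""
    out = {
        # Translating coded values.
        "gender": GENDER_CODES[row["gender_code"]],
        # Deriving a new calculated value.
        "sale_amount": row["qty"] * row["unit_price"],
    }
    # Splitting a comma-separated column into individual columns.
    for i, part in enumerate(row["tags"].split(","), start=1):
        out[f"tag_{i}"] = part.strip()
    # Columns not copied into `out` (here internal_note) are simply not loaded.
    return out

source_row = {
    "gender_code": 2,
    "qty": 3,
    "unit_price": 10.0,
    "tags": "new, promo",
    "internal_note": None,
}
print(transform(source_row))
```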
3) Load:
The load phase loads the data into the end target, usually the data warehouse.
Depending on the requirements of the organization, this process varies widely.
Some data warehouses might overwrite existing information weekly with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g. hourly.
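The two loading styles can be contrasted in a small sketch using Python's built-in sqlite3 module. The table names sales_summary and sales_history, the timestamps and the amounts are all hypothetical; a real warehouse would use its own bulk-load tooling.

```python
import sqlite3

def load_overwrite(conn, rows):
    """Replace the table contents with a fresh cumulative snapshot."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales_summary")
        conn.executemany(
            "INSERT INTO sales_summary (item, dollars_sold) VALUES (?, ?)", rows
        )

def load_append(conn, rows, loaded_at):
    """Add new rows with a load timestamp, keeping earlier loads as history."""
    with conn:
        conn.executemany(
            "INSERT INTO sales_history (item, dollars_sold, loaded_at) VALUES (?, ?, ?)",
            [(item, dollars, loaded_at) for item, dollars in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (item TEXT, dollars_sold REAL)")
conn.execute("CREATE TABLE sales_history (item TEXT, dollars_sold REAL, loaded_at TEXT)")

load_overwrite(conn, [("phone", 14.0)])
load_overwrite(conn, [("phone", 45.0)])                   # second load replaces the first
load_append(conn, [("phone", 14.0)], "2024-01-01T00:00")
load_append(conn, [("phone", 31.0)], "2024-01-01T01:00")  # second load is added

summary_count = conn.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]
history_count = conn.execute("SELECT COUNT(*) FROM sales_history").fetchone()[0]
print(summary_count, history_count)
```

After two loads each, the overwritten table holds one row while the historized table holds two.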
Need for a separate data warehouse:

Why not perform online analytical processing directly on operational databases?
Why spend additional time and resources to construct a separate data warehouse?
1) A major reason for such separation is to promote high performance of both systems.
2) OLAP operations on an operational database reduce the throughput of the OLTP system.
3) Separation is based on the different structures, contents and uses of the data in the two systems.
Since the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain separate databases.
Data Warehouse Process Flow:

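[Figure: data warehouse process flow. Operational and external data are extracted, transformed, stored and distributed. Meta data supports the "find and understand" step, "automate and manage" covers the whole flow, and end users display, analyze and discover.]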
Meta Data:
Metadata is information about the data. It is the data that describes the structure and business meaning of the data stored in a data warehouse.

Metadata answers questions such as:
- What data exists?
- Where is it located?
- What format is it in?
- Who is responsible for it?
- How is it related to other data?
- When was it related to other data?
- When was the data last updated?
- What tools can I use to retrieve it?
Types of Metadata:
Technical Metadata:
A detailed picture of the warehouse:
- Flow of data.
- Extraction sources and processes.
- Business rules.
- Data transformation rules.
- Load rules and procedures.
- Physical data structures.
Business Metadata:
Connects business users to the data warehouse data that is relevant to them:
- Data context.
- Understandability.
- Accessibility.
- Validity.
OLTP:
Online Transaction Processing.
OLTP refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
OLTP systems respond immediately to user requests.
The major task of OLTP is to perform online transaction and query processing.
OLTP systems cover the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration and accounting.

An automatic teller machine (ATM) for a bank is an example of a commercial
transaction processing application.
Benefits:
Online Transaction Processing has two key benefits:
1. Simplicity
2. Efficiency
Disadvantages:
OLTP is a great tool for any organization, but in using it there are a few things to be wary of:
1. Security issues
2. Economic costs
OLAP:
Online Analytical Processing.
OLAP is an approach to quickly provide answers to analytical queries that are multi-dimensional in nature.
OLAP organizes and presents data in various formats in order to accommodate the diverse needs of different users.
It serves users or knowledge workers in the role of data analysis and decision making.
The typical applications of OLAP are in:
- business reporting for sales,
- marketing,
- management reporting,
- business process management (BPM),
- budgeting and forecasting,
- financial reporting and similar areas.
Distinguishing features between OLTP & OLAP are:

Users & System Orientation:
OLTP: is customer-oriented and is used for transaction processing.
OLAP: is market-oriented and is used for data analysis.
Data contents:
OLTP: manages current data that are too detailed to be easily used for decision making.
OLAP: manages large amounts of historical data and provides facilities for summarization and aggregation.
View:
OLTP: focuses mainly on the current data within an enterprise or department.
OLAP: focuses on historical data.
Access Patterns:
OLTP: consists of short, atomic transactions and requires concurrency control and recovery mechanisms.
OLAP: consists mostly of read-only operations.
Feature            OLTP                             OLAP
Characteristics    operational processing           information processing
Orientation        transaction                      analysis
Users              clerk, IT professional           knowledge worker
Function           day-to-day operations            decision support
DB design          application-oriented, ER based   subject-oriented, star/snowflake
Data               current, up-to-date              historical
View               detailed                         summarized
Focus              data in                          information out
Usage              repetitive                       ad hoc
Access             read/write                       lots of scans
Unit of work       short, simple transaction        complex query
Records accessed   tens                             millions
Number of users    thousands                        hundreds
DB size            100 MB to GB                     100 GB to TB
Multidimensional Data Model:

A data model is a way to describe data and to issue queries against it.
DW and OLAP tools are based on a multi-dimensional data model.
This model views data in the form of a data cube.
Data Cube:
A data cube allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts.
Dimensions:
Dimensions are the perspectives or entities with respect to which an organization wants to keep records.

E.g., a sales data warehouse may keep records of the store's sales with respect to the dimensions time, item, branch, and location.

Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.

E.g., a dimension table for item may contain the attributes item_name, brand, type, etc.
Facts:
A multidimensional data model is organized around a central theme, e.g. sales.
This theme is represented by a fact table.
Facts are numerical measures. They are the quantities by which we want to analyze relationships between dimensions.
E.g., facts for a sales DW include dollars_sold, units_sold and amt_budgeted.
The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
2-D View:
Sales by time (quarter) and item (type):

Time (quarter)   home ent.   computer   phone   security
Q1               605         825        14      400
Q2               680         925        31      512
Q3               812         1023       30      501
Q4               927         1038       38      580
3-D View:
Sales by time (quarter) and item (type), one table per location:

location = Chicago
Time (quarter)   home ent.   computer   phone   security
Q1               605         825        14      400
Q2               680         925        31      512
Q3               812         1023       30      501
Q4               927         1038       38      580

location = New York
Time (quarter)   home ent.   computer   phone   security
Q1               1087        968        38      872
Q2               1130        1024       41      925
Q3               1034        1048       45      1002
Q4               1142        1091       54      984

location = Toronto
Time (quarter)   home ent.   computer   phone   security
Q1               818         746        43      591
Q2               894         769        52      682
Q3               940         795        58      728
Q4               978         864        59      784

location = Vancouver
Time (quarter)   home ent.   computer   phone   security
Q1               854         822        89      623
Q2               680         925        31      512
Q3               812         1023       30      501
Q4               927         1038       38      580
A 3-D data cube represents the data according to the dimensions time, item and location. The measure displayed is dollars_sold.
[Figure: 3-D data cube with axes time (Q1 to Q4), item (home entertainment, computer, phone, security) and location (Chicago, New York, Toronto, Vancouver).]
In data warehousing literature, a data cube such as each of the above is referred to as a cuboid.
The cuboid that holds the lowest level of summarization is called the base cuboid.
The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
OLAP Operations
- The multidimensional data model allows data to be stored in multiple dimensions.
- Each dimension contains multiple levels of abstraction defined by concept hierarchies.
- This allows users to view data from different perspectives.
- There are a number of OLAP operations to materialize these different views.
[Figure: the central sales data cube, with dimensions location (Chicago, New York, Toronto, Vancouver), time (Q1 to Q4) and item (home entertainment, computer, phone, security).]
Roll Up:
Also called the drill-up operation. It performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.
[Figure: roll up on location from cities to countries (USA, Canada), with time (Q1 to Q4) unchanged.]
The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country.

Rather than grouping the data by city, the resulting cube groups the data by country.
When roll up is performed by dimension reduction, one or more dimensions are removed from the given cube.

E.g., consider a sales data cube containing the two dimensions location and time. Roll up may be performed by removing the time dimension.

This results in an aggregation of the total sales by location.
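Both forms of roll up can be sketched on a small cube stored as a dictionary. The sales figures below are made-up illustrative values, and the function names are hypothetical; only the city-to-country mapping follows the example in the text.

```python
from collections import defaultdict

CITY_TO_COUNTRY = {
    "Chicago": "USA", "New York": "USA",
    "Toronto": "Canada", "Vancouver": "Canada",
}

# (city, quarter) -> dollars_sold; illustrative numbers only.
sales = {
    ("Chicago", "Q1"): 10, ("Chicago", "Q2"): 20,
    ("New York", "Q1"): 30, ("New York", "Q2"): 40,
    ("Toronto", "Q1"): 50, ("Toronto", "Q2"): 60,
    ("Vancouver", "Q1"): 70, ("Vancouver", "Q2"): 80,
}

def roll_up_location(cube):
    """Climb the location hierarchy: group by country instead of city."""
    out = defaultdict(int)
    for (city, quarter), amount in cube.items():
        out[(CITY_TO_COUNTRY[city], quarter)] += amount
    return dict(out)

def roll_up_drop_time(cube):
    """Dimension reduction: remove the time dimension, keeping totals per city."""
    out = defaultdict(int)
    for (city, _quarter), amount in cube.items():
        out[city] += amount
    return dict(out)

by_country = roll_up_location(sales)
by_city = roll_up_drop_time(sales)
print(by_country)
print(by_city)
```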
Drill Down:
It is the reverse of roll up. It navigates from less detailed data to more detailed data. It can be done either by stepping down a concept hierarchy or by introducing additional dimensions.
[Figure: drill down on time from quarters to months (January to December).]
Drill down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.

The resulting data cube details the total sales per month rather than summarizing them by quarter.
Since drill down adds more detail to the given data, it can also be performed by adding new dimensions to a cube.

E.g., an additional dimension customer_type can be introduced.
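Drill down needs the more detailed data to be available, since it cannot be derived from the summary alone. The sketch below assumes the monthly figures are kept as the base data; the numbers and function names are illustrative assumptions.

```python
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# month -> total sales; illustrative numbers only.
monthly = {"Jan": 150, "Feb": 100, "Mar": 150, "Apr": 120, "May": 80, "Jun": 100}

def by_quarter(monthly_sales):
    """Summarized view: total sales per quarter."""
    totals = {}
    for month, amount in monthly_sales.items():
        quarter = MONTH_TO_QUARTER[month]
        totals[quarter] = totals.get(quarter, 0) + amount
    return totals

def drill_down(monthly_sales, quarter):
    """Detailed view: the monthly figures behind one quarter."""
    return {m: a for m, a in monthly_sales.items() if MONTH_TO_QUARTER[m] == quarter}

print(by_quarter(monthly))
print(drill_down(monthly, "Q1"))
```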
Slice:
The slice operation performs a selection on one dimension of the given cube, resulting in a sub cube. For example, sales data can be selected from the central cube for the dimension time using the criterion time = Q1.
[Figure: slice for time = Q1, showing location (Chicago, New York, Toronto, Vancouver) against item (type).]
Dice:
The dice operation defines a sub cube by performing a selection on two or more dimensions. For example, a dice operation on the central cube can be based on the following selection criterion that involves 3 dimensions: (location = Toronto or Vancouver) and (time = Q1 or Q2) and (item = home entertainment or computer).
[Figure: dice for location (Toronto, Vancouver), time (Q1, Q2) and item (home entertainment, computer).]
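Slice and dice are both selections on the cube, which a short sketch can make concrete. The cube below holds a placeholder value of 1 in every cell so that the size of each sub cube is easy to check; the function name select is hypothetical.

```python
from itertools import product

ALL_LOCATIONS = ["Chicago", "New York", "Toronto", "Vancouver"]
ALL_QUARTERS = ["Q1", "Q2", "Q3", "Q4"]
ALL_ITEMS = ["home entertainment", "computer", "phone", "security"]

# (location, quarter, item) -> measure; every cell is 1 for illustration.
cube = {cell: 1 for cell in product(ALL_LOCATIONS, ALL_QUARTERS, ALL_ITEMS)}

def select(cube, locations=None, quarters=None, items=None):
    """Keep the cells whose coordinates are in the given sets (None means no restriction)."""
    return {
        (loc, quarter, item): value
        for (loc, quarter, item), value in cube.items()
        if (locations is None or loc in locations)
        and (quarters is None or quarter in quarters)
        and (items is None or item in items)
    }

# Slice: selection on one dimension (time = Q1).
slice_q1 = select(cube, quarters={"Q1"})

# Dice: selection on three dimensions.
dice = select(
    cube,
    locations={"Toronto", "Vancouver"},
    quarters={"Q1", "Q2"},
    items={"home entertainment", "computer"},
)
print(len(slice_q1), len(dice))
```

The slice keeps 4 x 4 cells and the dice keeps 2 x 2 x 2 cells of the original 4 x 4 x 4 cube.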
Pivot:
Pivot is also called rotate. It is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.

For example, a pivot operation can rotate the item and location axes in a 2-D slice.
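For a 2-D slice, pivoting amounts to swapping the row and column axes. A minimal sketch, assuming the slice is stored as a nested dictionary; the values are illustrative and the function name pivot is hypothetical.

```python
# item -> location -> dollars_sold; illustrative numbers only.
slice_2d = {
    "computer": {"Chicago": 5, "Toronto": 7},
    "phone": {"Chicago": 1, "Toronto": 2},
}

def pivot(table):
    """Rotate the axes: rows become columns and columns become rows."""
    out = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            out.setdefault(col_key, {})[row_key] = value
    return out

rotated = pivot(slice_2d)
print(rotated)
```

The data is unchanged; only the presentation differs, and pivoting twice returns the original layout.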
Cube: A Lattice of Cuboids
0-D (apex) cuboid:
all
1-D cuboids:
time; item; location; supplier
2-D cuboids:
(time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids:
(time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid:
(time, item, location, supplier)
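The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids when concept hierarchies are ignored. A short sketch, with cuboids as a hypothetical helper name:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboids(dims):
    """Group every subset of the dimensions by its number of dimensions."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboids(dimensions)
for k, group in lattice.items():
    print(f"{k}-D cuboids: {group}")

total = sum(len(group) for group in lattice.values())
print(total)
```

The empty tuple at level 0 stands for the apex cuboid ("all"), and the single 4-tuple at level 4 is the base cuboid.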
Schemas for Multidimensional Database:

A database schema consists of a set of entities and the relationships between them.
The entity-relationship data model is commonly used in the design of relational databases.
Such a data model is appropriate for on-line transaction processing.
A data warehouse requires a concise, subject-oriented schema that facilitates on-line data analysis.

The most popular data model for a data warehouse is a multidimensional model.
Such a model can exist in the form of a star schema, a snowflake schema or a fact constellation schema.
Star Schema:
It is the most common modeling paradigm.
In a star schema the DW contains:

- A large central table (fact table) containing the bulk of the data with no redundancy. Facts are numerical measures.
- A set of smaller attendant tables (dimension tables), one for each dimension. Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
Example of Star Schema

Sales Fact Table:
time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
(measures: units_sold, dollars_sold, avg_sales)
time (dimension table):
time_key, day, day_of_the_week, month, quarter, year
item (dimension table):
item_key, item_name, brand, type, supplier_type
branch (dimension table):
branch_key, branch_name, branch_type
location (dimension table):
location_key, street, city, province_or_state, country
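A reduced version of this star schema can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: only the item and location dimensions are included for brevity, the column types are guesses, and all inserted rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    province_or_state TEXT, country TEXT
);
-- Fact table: keys to the dimension tables plus the measures.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'TV', 'BrandA', 'home entertainment', 'wholesale');
INSERT INTO location VALUES
    (1, '1 Main St', 'Toronto', 'Ontario', 'Canada'),
    (2, '2 Lake Ave', 'Chicago', 'Illinois', 'USA');
INSERT INTO sales VALUES (1, 1, 2, 800.0), (1, 2, 1, 400.0), (1, 1, 1, 400.0);
""")

# One join from the fact table to a dimension table answers "sales by country".
rows = conn.execute("""
    SELECT l.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
print(rows)
```

Each dimension is reached with a single join from the fact table, which is what gives the schema its star shape.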
Snowflake schema:
It is a variant of the star schema model.
Here the dimension tables are normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such tables are easy to maintain.
However, the snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query. System performance may be adversely impacted.
Example of Snowflake Schema

Sales Fact Table:
time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
(measures: units_sold, dollars_sold, avg_sales)
time (dimension table):
time_key, day, day_of_the_week, month, quarter, year
item (dimension table):
item_key, item_name, brand, type, supplier_key
supplier (dimension table):
supplier_key, supplier_type
branch (dimension table):
branch_key, branch_name, branch_type
location (dimension table):
location_key, street, city_key
city (dimension table):
city_key, city, province_or_state, country
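The extra join introduced by normalizing a dimension can be seen in a small sqlite3 sketch. Only the location and city tables are modelled here, the column types are assumptions, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    city_key INTEGER PRIMARY KEY, city TEXT,
    province_or_state TEXT, country TEXT
);
-- The location dimension now refers to city instead of repeating its attributes.
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
CREATE TABLE sales (
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL
);
INSERT INTO city VALUES
    (1, 'Toronto', 'Ontario', 'Canada'),
    (2, 'Chicago', 'Illinois', 'USA');
INSERT INTO location VALUES (1, '1 Main St', 1), (2, '9 King St', 1), (3, '2 Lake Ave', 2);
INSERT INTO sales VALUES (1, 100.0), (2, 50.0), (3, 70.0);
""")

# "Sales by country" now needs two joins: fact -> location -> city.
rows = conn.execute("""
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    JOIN city AS c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
```

The city attributes are stored once per city rather than once per street address, at the cost of the second join.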
Concept Hierarchies:
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.

Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago.
Each city can be mapped to the province or state to which it belongs. E.g., Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can in turn be mapped to the country to which they belong, such as Canada or USA.
These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e. cities) to higher-level, more general concepts (i.e. countries).
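[Figure: a concept hierarchy for the dimension location.]
all
  Canada
    British Columbia: Vancouver, Victoria
    Ontario: Toronto, Ottawa
  USA
    New York: New York, Buffalo
    Illinois: Chicago
[Figure: hierarchy for location, from lowest to highest level: street, city, province or state, country.]
[Figure: a lattice for time: day rolls up to month, then quarter, then year; day also rolls up to week, then year.]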
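The location hierarchy can be written down as a chain of lookup tables. The mappings below use the cities, provinces, states and countries named in this section; the function name generalize and the level names are hypothetical.

```python
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario", "Ottawa": "Ontario",
    "New York": "New York", "Buffalo": "New York",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
}

def generalize(city, level):
    """Map a city to the requested level of the location hierarchy."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province_or_state":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    if level == "all":
        return "all"
    raise ValueError(f"unknown level: {level}")

print(generalize("Vancouver", "province_or_state"))
print(generalize("Chicago", "country"))
```

Roll up along location replaces each city by its value at a higher level and then aggregates rows that share that value.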
Architecture of Data Warehouse
One Tier:
A one-tier DW is suitable for small to medium sized businesses.
The relevant source data is collected and transformed into the warehouse.
This is then accessed by all users.
Two Tier:
A two-tier DW is suitable for medium to large sized businesses, where there is a high degree of variance in types of users and analysis requirements.
All data is kept in the warehouse. Specific subsets of the data are moved to a second level of data storage called data marts. This data is accessed by the users.
Virtual Warehouse:
In a virtual warehouse a user is given direct access to the source data instead of transforming it into a warehouse and then accessing the warehouse.
Three Tier:
In a three-tier data warehouse an OLAP server is placed between the users and the data warehouse.
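[Figure: three-tier data warehouse architecture.
Bottom tier (data storage): data sources (operational databases and external sources) are extracted, transformed, loaded and refreshed into the data warehouse and data marts, supported by a metadata repository and a monitor and integrator.
Middle tier (OLAP engine): the OLAP server, which serves data from the warehouse.
Top tier (front-end tools): analysis, query, reports and data mining.]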
OLAP Server:
Relational OLAP (ROLAP):
- Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
- Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services.
- Greater scalability.
Multidimensional OLAP (MOLAP):
- Array-based multidimensional storage engine (sparse matrix techniques).
- Fast indexing to pre-computed summarized data.
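The array-based idea behind MOLAP can be sketched with plain Python lists. This is a simplified dense illustration and does not show the sparse matrix techniques a real engine would use; the names and the two-dimensional layout are assumptions, and the figures are the Q1 and Q2 computer and phone values from the 2-D view shown earlier.

```python
QUARTERS = ["Q1", "Q2"]
ITEMS = ["computer", "phone"]

# Map each dimension value to an array position.
quarter_index = {q: i for i, q in enumerate(QUARTERS)}
item_index = {item: i for i, item in enumerate(ITEMS)}

# Dense 2-D array: cells[quarter][item] holds dollars_sold.
cells = [[0.0] * len(ITEMS) for _ in QUARTERS]

def store(quarter, item, value):
    cells[quarter_index[quarter]][item_index[item]] = value

def lookup(quarter, item):
    """Direct array addressing; no table scan or join is involved."""
    return cells[quarter_index[quarter]][item_index[item]]

store("Q1", "computer", 825.0)
store("Q1", "phone", 14.0)
store("Q2", "computer", 925.0)
store("Q2", "phone", 31.0)

# Pre-computed summaries, one total per quarter.
totals_by_quarter = [sum(row) for row in cells]

print(lookup("Q2", "phone"))
print(totals_by_quarter[quarter_index["Q1"]])
```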
Data Marts:
Definition:
A data mart is an implementation of a data warehouse with a restricted scope of content and support for analytical processing, serving a single department of an organization or a particular data analysis problem domain.
Data marts usually contain operational data, summarized data, spatial data, or metadata.

A data mart is a subset of a data warehouse that is stored in a smaller database and is designed to fulfill a specific purpose or data subject rather than enterprise-wide planning.
Data marts can exist without being connected to a data warehouse.
Data marts can be classified as:
- Dependent data marts.
- Independent data marts.
Dependent Data Marts:
These data marts use the same extract and transformation processes and have the same content for all the shared data.
Independent Data Marts:
A data mart that stands alone and has its own unique system of extracting and transforming the source data.
Data Warehouse For Business Intelligence:

Business intelligence is a term commonly associated with data warehousing.
Business intelligence usually refers to the information that is available for the enterprise to make decisions on.

A data warehousing (or data mart) system is the backend, or the infrastructural, component for achieving business intelligence.
