Download as pdf or txt
Download as pdf or txt
You are on page 1of 36

No.1 Vo Van Ngan Street, Thu Duc Dist.

, HCMC, VN
Tel: +84 8 37221223, Fax: +84 8 38960640

HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

DATA WAREHOUSE
(DAWH430784)
CONCEPTS

Lecturer: Msc. Nguyễn Văn Thành

DAWH430784 1/16/2024 1
OUTLINE
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Motivation for data warehouse

❑Data warehouse definition

❑OLTP vs. OLAP

❑Multidimensional Model

❑OLAP Operations

❑Data warehouse architectures


DAWH430784 2
Motivation for DWH
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Organizations face increasingly complex


challenges to achieve operational goals so need
analysis tools for decision support.
❑Traditional operational or transactional
databases do not satisfy the requirements for
data analysis
❑Designed/optimized to support daily business
operations; primary concern: concurrent access and
recovery techniques to guarantee data consistency.
❑Contain detailed data, do not include historical data,
and perform poorly for complex queries that involve
many tables or aggregate large volumes of data.

DAWH430784 3
Motivation for DWH
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑To analyze the behavior of an organization, data


from several operational systems must be
integrated.
❑Difficult to accomplish due to many differences in
data definition and content
❑Data warehouse address requirements of
decision-making
❑Populated from operational databases and external
data sources
❑Integrated and transformed data
❑Optimized for reporting and periodic integration

DAWH430784 4
Data Warehouse
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑A data warehouse is a copy of transaction data


specifically structured for query and analysis.
Ralph Kimball
❑A data warehouse is a collection of
❑subject-oriented,
❑integrated,
❑time-variant, and
❑nonvolatile
data to support data management decisions
Bill Inmon

DAWH430784 5
DWH
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Subject-oriented: Organized around business


entities (e.g., customers, products, and
employees) rather than business processes.
❑Integrated: many transformations to unify source
data from independent data sources (units of
measure, data formats, naming conventions)
❑Time-variant: historical data, snapshots of
business processes captured at different points
in time.
❑Nonvolatile: new data are appended periodically,
existing data is not changed.
DAWH430784 6
DWH in Knowledge Discovery (KDD)
Process HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Pattern Evaluation

Data Mining

Task-relevant Data

Data Warehouse Selection

Data Cleaning

Data Integration

Databases 7
DAWH430784
DWH in Business Intelligence
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Increasing potential
to support
business decisions End User
Decision
Making

Data Presentation Business


Analyst
Visualization Techniques
Data Mining Data
Information Discovery Analyst

Data Exploration
Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses


DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
8
DAWH430784
OLTP vs. OLAP
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Operational databases (online transaction


processing systems or OLTP), are not
suitable for data analysis.
❑Contain detailed data, do not include historical
data, perform poorly for complex queries due to
normalization.
❑Online analytical processing (OLAP): Allows
decision-making users to perform interactive
analysis of data.

DAWH430784 9
OLTP vs. OLAP
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Transaction
processing (OLTP)
• Primary data from
transactions
• Daily operations and
short term decisions

Business intelligence
processing (OLAP)
• Transformed secondary
data
• Medium and long-term
decisions

10
DAWH430784
OLTP vs. OLAP
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Characteristic Operational Data Warehouse (OLAP)


Database (OLTP)
Currency Current Historical
Details level Individual Individual and summary
Orientation Process Subject
Records per request Few Thousands
Normalization level Mostly normalized Normalization relaxed
Update level Highly volatile Mostly refreshed (non
volatile)
Data model Relational Relational (star schemas)
and multidimensional (data
cubes)
11

DAWH430784
OLTP vs. OLAP Schema Comparison
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Operational database Data warehouse


Manages Store
Item StoreId
ItemId StoreManager
Employee ItemName StoreStreet
ItemUnitPrice StoreCity
EmpNo StoreState
EmpFirstName ItemBrand StoreSales
ItemCategory StoreZip
EmpLastName StoreNation
... DivId
ItemSales Sales
DivName
SalesNo
DivManager
SalesUnits
SalesDollar
Takes Customer SalesCost
TimeDim
CustId TimeNo
Product CustName TimeSales TimeDay
Customer CustPhone
Order ProdNo TimeMonth
CustNo CustStreet CustSales TimeQuarter
ProdName
OrdNo CustCity TimeYear
CustFirstName Places Contains ProdQOH
OrdDate CustState TimeDayOfWeek
CustLastName ...
... CustZip TimeFiscalYear
...
CustNation
Qty

12
DAWH430784
Multidimensional Model
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑DWs and OLAP use a multidimensional view of data


❑Represented as a data cube or a hypercube
▪ Dimensions: Perspectives for analyzing data
▪ Cells (facts): Contain measures, and values that are to be
analyzed ity e Milan 24 18 28 14
(C tor

Rome 33 25 23 25
)
S

Nice 12 20 24 33
Paris

18
23
Q1 21 10 18 35 measure
Time (Quarter)

14

17
values

20
Q2 27 14 11 30

12

18
dimensions

33
Q3 26 12 35 32

10
Q4 14 20 47 31
games DVDs
books CDs
Product (Category) 13
DAWH430784
Hierarchies
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Data granularity: level of detail at which measures


are represented for each dimension of the cube
❑Data analyzed at different granularities (abstraction
levels)
❑Hierarchies relate low-level (detailed) concepts to
higher-level (general concepts)
❑Example: Store – City – Region/Province – Country
❑Given two related levels in a hierarchy, lower level is
called child, higher level is called parent
❑Instances of these levels are called members
14
DAWH430784
Hierarchies
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Example
▪ Hierarchies of
the Product,
Time, and
Customer
dimensions

15
DAWH430784
Hierarchies
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Members of the hierarchy Store → City →


Region/Province → Country
Store dimension All

Country level France ... Italy

Region/ Ile-de-France Provence-Alpes- Lazio Lombardy


Province level Côte d'Azur ... ...

City level Paris Nice Rome Milan

Store level Store 1 Store 2 Store 3 ... Store 10 Store 11 Store 12

16
DAWH430784
Measure Aggregation and
Summarizability
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION
❑ Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels
❑ Summarizability refers to the correct aggregation of
cube measures along dimension hierarchies
❑ Summarizability conditions:
❑Disjointness of instances: the grouping of instances in a level
with respect to the parent in the next level must result in
disjoint sets
❑Completeness: all instances are included in the hierarchy and
each instance is related to one parent in the next level
❑Correct use of aggregation functions (“measure type”
condition): Type of measures determine the kind of
aggregation functions that can be applied.
17
DAWH430784
Measure Classification
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑ Each measure is associated to an aggregation function


that combines several measure values into a single one
❑ Measures can be classified according to the way they
can be aggregated:
❑ Additive measures (flow or rate measures): Can be meaningfully
summarized using addition along all dimensions
❑E.g., sales amount can be summarized when the hierarchies in Store,
Time, and Product dimensions are traversed
❑ Semi-additive measures (stock or level measures): Can be meaningfully
summarized using addition along some (not all) dimensions
❑E.g., inventory quantities, can be aggregated in the Store dimension,
but cannot be aggregated in the Time dimension
❑ Non-additive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any dimension
❑E.g., item price, cost per unit, exchange rate
18
DAWH430784
Another Measure Classification
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑ Distributive measures: defined by an aggregation function


that can be computed in a distributed way
❑The result of applying the distributive function to the
whole data set is the same as the result of applying it
(not necessarily the same) to the n aggregated values
of n subsets.
❑E.g., sum, min, max, and count (distinct count is not)
❑ Algebraic measures: defined by an aggregation function
that can be expressed as a scalar function of distributive
ones
❑E.g., average (can be computed by sum and count)
❑ Holistic measures: Cannot be computed from other sub-
aggregates.
❑E.g., median, mode, and rank
19
DAWH430784
OLAP Operations: Roll up
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Transforms detailed measures into summarized


ones when one moves up in a hierarchy

Milan 24 18 28 14
ity e

Italy 57 43 51 39
(C tor

Rome 33 25 23 25
)

France
S

Nice 12 20 24 33
Paris
18

Q1 33 30 42 68

Time (Quarter)

41
23

Q1 21 10 18 35
Time (Quarter)

Roll-up to the Country level


14

17

Q2 27 14 11 30

37
20

Q2 27 14 11 30
12

18

Q3 26 12 35 32

51
33

Q3 26 12 35 32
10

Q4 14 20 47 31
Q4 14 20 47 31
games DVDs
games DVDs books CDs
books CDs Product (Category)
Product (Category)

20
DAWH430784
OLAP Operations: Drill down
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Opposite to the roll-up operation, i.e., it moves from


a more general level to a detailed level in a
hierarchy
Milan 24 18 28 14 Milan 8 6 9 5

ity e
(C tor
ity e

Rome 10 8 11 8

)
(C tor

Rome 33 25 23 25
)

S
Nice 4 7 8 10
S

Nice 12 20 24 33
Paris Paris

6
18

10
Jan 7 2 6 13
23

Q1 21 10 18 35 Drill-down to the

14
Time (Quarter)

3
14

17

Time (Quarter)
Month level

7
Feb 8 4 8 12
20

Q2 27 14 11 30

...
9
12

18

...
Mar 6 4 4 10
33

Q3 26 12 35 32

...

8
10

14
Q4 14 20 47 31 ... ... ... ... ...

5
games DVDs Dec 4 4 16 7
books CDs games DVDs
Product (Category) books CDs
Product (Category)
21
DAWH430784
OLAP Operations: Pivot or Rotate
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Rotates the axes of a cube to provide an alternative


presentation of the data

Milan 24 18 28 14

or t
ity e

DVDs 35 30 32 31

y)
eg c
(C tor

Rome 33 25 23 25
)

at du
CDs 18 11 35 47
S

(C Pro
Nice 12 20 24 33 games 10 14 12 20
Paris
18

books

10
23

Q1 21 10 18 35

21
Time (Quarter)

Paris 21 27 26 14
14

17

17

33
Store (City)
20

Q2 27 14 11 30

20
Pivot Nice 12 14 11 13
12

18

28

18
33

Q3 26 12 35 32

47
Rome 33 28 35 32
10

19
Q4 14 20 47 31
Milan 24 23 25 18
games DVDs
Q1 Q2 Q3 Q4
books CDs
Product (Category) Time (Quarter)

22
DAWH430784
OLAP Operations: Slice
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Performs a selection on one dimension of a cube,


resulting in a subcube

Milan 24 18 28 14
ity e
(C tor

Rome 33 25 23 25
)

Q1 21 10 18 35

Time (Quarter)
S

Nice 12 20 24 33
Paris Q2 27 14 11 30
18
23

Q1 21 10 18 35
Time (Quarter)

Q3 26 12 35 32
14

17

Slice on Store.City = ‘Paris’


20

Q2 27 14 11 30
Q4 14 20 47 31
12

18
33

Q3 26 12 35 32 games DVDs
10

Q4 14 20 47 31 books CDs
Product (Category)
games DVDs
books CDs
Product (Category)

23
DAWH430784
OLAP Operations: Dice
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

❑Defines a selection on two or more dimensions, thus


again defining a subcube

Milan 24 18 28 14
ity e
(C tor

Rome 33 25 23 25
)

ity e
(C tor
S

)
Nice 12 20 24 33 Nice 12 20 24 33

S
Paris
18

Paris
23

(Quarter)
Q1 21 10 18 35
Dice on Store.Country = ‘France’
Time (Quarter)

Q1 21 10 18 35

Time
14

17

14
and Time.Quarter= ‘Q1’ or ‘Q2’
20

Q2 27 14 11 30 Q2 27 14 11 30
12

18
33

Q3 26 12 35 32 games DVDs
10

Q4 14 20 47 31 books CDs
Product (Category)
games DVDs
books CDs
Product (Category)

24
DAWH430784
OLAP Operations – Summary
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Operation Purpose Description


Slice Focus attention on a Replace a dimension with a
subset of dimensions single member value or with a
summary of its measure values
Dice Focus attention on a Replace a dimension with a
subset of member values subset of members
Drill-down Obtain more detail about Navigate from a more general
a dimension level to a more specific level
Roll-up Summarize details about Navigate from a more specific
a dimension level to a more general level
Pivot Present data in a Rearrange the dimensions in a
different order data cube
25
DAWH430784
Data Warehouse Architectures
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Top Down
• Enterprise data warehouse
• Higher integration levels
• Logically centralized
• Larger project scope

Bottom Up
• Independent data marts
• Lower integration levels
• Logically decentralized
• Smaller project scope

26
DAWH430784
Bottom-up Architecture
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

User
Data mart tier
departments

Operational
database
Transformation
process

Data mart
Operational
database

External
data source

Data mart

27
DAWH430784
Top-Down Architecture
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

Data warehouse User


Data mart tier
server departments

Operational
database Staging Extraction
Area
process
Transformation
process

Operational Data mart


database

Detailed and
summarized data
EDM
External
data source Data warehouse
Data mart

28
DAWH430784
General Architecture
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

29
DAWH430784
General Architecture
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

✓ Data sources
▪ Operational databases
▪ Other internal or external sources of information (e.g. files)
✓ Back-end tier
▪ Extraction-Transformation-Loading (ETL) tools for manipulating data
from sources
▪ Data staging area: Intermediate database where manipulation is done
✓ OLAP tier
▪ OLAP Server: Supports multidimensional data and operations
✓ Front-end tier: Deals with data analysis and visualization
▪ Composed of OLAP tools, reporting tools, statistical tools, data-
mining tools, …

30
DAWH430784
Extraction-Transformation-Loading
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

✓ Extraction: Gathers data from multiple heterogeneous


data sources
▪ May be operational databases or files in various formats
▪ May be internal or external to the organization
▪ Uses APIs such as ODBC, JDBC, … for achieving
interoperability
✓ Transformation: Modifies data to conform to the data
warehouse format
▪ Cleaning: Removes errors, inconsistencies, format
transformation
▪ Integration: Reconciles data from different sources
▪ Aggregation: Summarizes data according to the granularity (level
of detail) of the DW
✓ Loading: Feeds the DW with transformed data 31
DAWH430784
▪ Also includes refreshing the data warehouse at a specified
Data Warehouse Tier
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

✓ Enterprise data warehouse: Centralized DW that


encompasses all areas in an organization
✓ Data mart: Specialized DW targeted to a particular functional
area or user group
▪ Their data can be derived from the enterprise DW or collected from
data sources
✓ Metadata repository: Describes the content of the DW
▪ Business metadata: Meaning (semantics) of data, organization rules,
policies, constraints, …
▪ Technical metadata: How data is structured/stored in the computer
• Data sources, data warehouse, and data marts: logical and physical
schemas, security information, monitoring information …
• ETL process: Data lineage (trace to sources), rules, defaults, refresh and
purging rules, algorithms for summarization, …
32
DAWH430784
OLAP Tier
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

✓ OLAP servers that provides multidimensional view


from DWs and data marts
▪ Can be ROLAP, MOLAP, or HOLAP
✓ Most database products provide OLAP extensions
and related tools for manipulating cubes
✓ However, no standardized language for querying data
cubes
▪ Oracle uses Java and query language OLAP DML
▪ SQL Server uses .NET and query language MDX
✓ XMLA (XML for Analysis) aims at providing a common
language for exchanging multidimensional data
33
DAWH430784
Front-End Tier
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

✓ OLAP tools: Allow interactive exploration and


manipulation of the warehouse data
▪ Facilitate formulation of ad hoc queries (no prior
knowledge of them)
✓ Reporting tools: Enable production, delivery and
management of reports (paper and web-based)
▪ Use predefined queries
✓ Statistical tools: Used to analyze and visualize
the cube data using statistical methods
✓ Data-mining tools: Allow users to analyze data to
discover patterns, trends, enable predictions 34
DAWH430784
Summary
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

✓ Motivation for data warehouse


✓ Data warehouse definition
✓ OLTP vs. OLAP
✓ Multidimensional Model
✓ OLAP Operations
✓ Data warehouse architectures

35
DAWH430784
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

DAWH430784 1/16/2024 37

You might also like