05 Data Warehousing Process Overview

Data Warehousing Process Overview

Business Intelligence

Bayu Setiaji, M.Kom

Magister Teknik Informatika


Data Warehouse Architectures

Three-tier Architecture

[Figure: three-tier data warehouse architecture. Operational databases and external sources feed the bottom tier (data warehouse server, with data marts, monitoring, and administration) through Extract, Clean, Transform, Load, and Refresh. The middle tier holds the OLAP servers, whose output goes to the top tier of front-end tools: query/report, analysis, and data mining.]


• The bottom tier is the data warehouse database server, which is typically a
relational database system. Back-end tools and utilities feed data into this
tier by performing the Extract, Clean, Transform, Load, and Refresh functions.
• The middle tier is an OLAP server that can be implemented in either of the
following ways:
  • By Relational OLAP (ROLAP), which is an extended relational database
  management system. ROLAP maps operations on multidimensional data to
  standard relational operations.
  • By Multidimensional OLAP (MOLAP), which directly implements
  multidimensional data and operations.
• The top tier is the front-end client layer. It holds the query and
reporting tools, analysis tools, and data mining tools.
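
To make the ROLAP idea concrete, here is a minimal sketch, assuming a hypothetical `sales` table in an in-memory SQLite database. The table, column names, and values are illustrative and do not come from the slides. It shows one way a multidimensional aggregation (total sales per city and quarter) can be expressed as an ordinary relational GROUP BY.

```python
import sqlite3

# Hypothetical fact table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Toronto", "Q1", "Mobile", 100),
        ("Toronto", "Q1", "Modem", 50),
        ("Toronto", "Q2", "Mobile", 70),
        ("Vancouver", "Q1", "Mobile", 30),
    ],
)

# A cube view over (city, quarter) becomes a plain GROUP BY in a relational engine.
rows = conn.execute(
    "SELECT city, quarter, SUM(amount) FROM sales "
    "GROUP BY city, quarter ORDER BY city, quarter"
).fetchall()

# Each result row corresponds to one cell of the two-dimensional view.
cube_cells = {(city, quarter): total for city, quarter, total in rows}
```

A MOLAP engine would instead hold these cells directly in a multidimensional structure rather than computing them from rows on request.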

Data Integration and ETL Processes

Who needs data integration?

Data integration isn’t just for large enterprises. At this point, just about every
organization can benefit from a data integration strategy, because every
single business needs to be able to use its data to compete effectively. Most
businesses use multiple applications to support their business, such as
CRMs, accounting applications, and asset management systems, and even
standard spreadsheet software. Data is locked in silos in each of these
applications, which can result in disconnects and miscommunications
between departments or processes. If important decisions are based on
misinformation resulting from these disconnects, the results will be less than
optimal or even detrimental to a business.

Types of Data Integration

• Data warehousing
• Middleware data integration
• Data consolidation
• Application-based integration
• Data virtualization

Data Warehousing
Data warehousing is a type of data integration that involves using a data
warehouse to cleanse, format, and store data. It is one of many integration
approaches used to deliver insight to an organization by allowing analysts to
compare data consolidated from multiple heterogeneous sources.

Middleware Data Integration
Middleware data integration is a data integration system that involves using
a middleware application as a go-between, moving data between source
systems and the central data repository. The middleware helps to format and
validate data before sending it to the repository, which might be a cloud data
warehouse or a database.

Data Consolidation
Data consolidation involves the combining of data from multiple systems to
create a single data source. ETL software is often used to support data
consolidation.

Application-based Integration
Application-based integration involves using software to find, extract, and
integrate data. During integration, the software processes the data so that
data sets from different source systems are compatible with one another and
with the destination system.

Data Virtualization
When using a virtualization approach, users can gain a near real-time,
consolidated view of data via a single interface, even though the data
remains in separate source systems.

How Data Integration Works

Data integration includes several key steps. The first step, referred to as data
ingestion, involves moving data out of each source system and into a central
location. Cloud data warehouses or data lakes are often used for this
purpose.
Many organizations use a data integration solution, such as an ETL tool, for
the ingestion phase. ETL stands for extract, transform, and load.

Extract
The ETL tool first extracts the data from the source system. It connects to
data sources using pre-built connectors or by querying the source API.

Transform
The tool then transforms the data to ensure consistency at its destination,
regardless of its origin. This transformation typically includes changing the
data’s format; standardizing values such as currencies, units of measurement,
and time zones; enriching and validating the data to eliminate missing values
and duplicates; and applying business rules.

Load
The tool then loads the data into the destination system, where it can be used
for analytics and reporting. A data loader can also be used in this phase.
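
The three phases above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical CSV extract, made-up conversion rates, and an in-memory SQLite database standing in for the destination. None of these names or values come from the slides, and a real ETL tool would use connectors rather than an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a connector or API.
SOURCE_CSV = """order_id,amount,currency
1,10.00,USD
2,20.00,EUR
2,20.00,EUR
3,,USD
"""

# Illustrative fixed rates, not real exchange rates.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.5}


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop missing amounts and duplicates, standardize the currency."""
    seen = set()
    clean = []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        usd = float(row["amount"]) * RATES_TO_USD[row["currency"]]
        clean.append((int(row["order_id"]), usd))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount_usd FROM orders ORDER BY order_id"
).fetchall()
```

Here the duplicate order 2 and the order with a missing amount never reach the destination, which is the kind of validation the Transform step describes.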

Many modern ETL tools take an ELT approach: extracting and loading data into
the cloud, and then transforming it there, taking advantage of the speed and
scalability of the cloud platform. The process must be repeated frequently to
keep the central source of data up to date.
Some data integration software is designed to capture streaming data and
integrate with data platforms to support real-time data pipelines, so that data
in the central location is constantly refreshed and so that analysts and data
scientists have access to the required data.

On-Line Analytical Processing (OLAP)

OLAP is based on the multidimensional data model. It gives managers and
analysts insight into information through fast, consistent, and interactive
access.

Types of OLAP Server

• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers

ROLAP servers are placed between a relational back-end server and client
front-end tools. To store and manage warehouse data, ROLAP uses a
relational or extended-relational DBMS.
ROLAP includes the following:
• Implementation of aggregation navigation logic.
• Optimization for each DBMS back end.
• Additional tools and services.

A MOLAP server uses array-based multidimensional storage engines for
multidimensional views of data. With multidimensional data stores, storage
utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.
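
The storage trade-off can be illustrated with a small sketch. It assumes a hypothetical two-dimensional cube (city by quarter) with made-up values, and it is not how any particular MOLAP product stores data. It only contrasts a dense array, which reserves a cell for every coordinate combination, with a sparse mapping that keeps just the populated cells.

```python
# Hypothetical dimensions and values, for illustration only.
cities = ["Toronto", "Vancouver", "Chicago"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# Dense representation: one cell per (city, quarter), populated or not.
dense = [[0 for _ in quarters] for _ in cities]
dense[0][0] = 150  # Toronto, Q1
dense[1][3] = 40   # Vancouver, Q4

# Sparse representation: keep only the non-empty cells, keyed by coordinates.
sparse = {
    (city, quarter): dense[i][j]
    for i, city in enumerate(cities)
    for j, quarter in enumerate(quarters)
    if dense[i][j] != 0
}

total_cells = len(cities) * len(quarters)
# Fraction of the dense array that actually holds data.
utilization = len(sparse) / total_cells
```

With only two of twelve cells populated, most of the dense array is wasted, which is why a second, sparse representation is useful for such regions.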

A hybrid OLAP (HOLAP) server combines ROLAP and MOLAP. It offers the
higher scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers can store large volumes of detailed data in a relational
store, while the aggregations are kept separately in a MOLAP store.

Specialized SQL servers provide advanced query language and query
processing support for SQL queries over star and snowflake schemas in a
read-only environment.

OLAP Operations

• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in either of the
following ways:
• By climbing up a concept hierarchy for a dimension
• By dimension reduction

1. Roll-up is performed by climbing up a concept hierarchy
for the dimension location.
2. Initially the concept hierarchy was "street < city <
province < country".
3. On rolling up, the data is aggregated by ascending the
location hierarchy from the level of city to the level of
country.
4. The data is now grouped into countries rather than cities.
5. When roll-up is performed by dimension reduction, one or
more dimensions are removed from the data cube.
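
Both forms of roll-up can be sketched as follows. This is a minimal illustration, assuming a hypothetical cube stored as a dictionary keyed by (city, quarter, item) and a made-up city-to-country mapping. The helper names are illustrative and not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical one-step mapping in the location hierarchy (city < country).
CITY_TO_COUNTRY = {
    "Toronto": "Canada",
    "Vancouver": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

# Cube cells keyed by (city, quarter, item); values are illustrative.
cube = {
    ("Toronto", "Q1", "Mobile"): 100,
    ("Vancouver", "Q1", "Mobile"): 30,
    ("Chicago", "Q1", "Mobile"): 80,
    ("New York", "Q1", "Modem"): 20,
    ("Toronto", "Q2", "Modem"): 50,
}


def roll_up_location(cells):
    """Climb the location hierarchy: aggregate cities into countries."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += value
    return dict(out)


def roll_up_drop_item(cells):
    """Dimension reduction: remove the item dimension by summing over it."""
    out = defaultdict(int)
    for (location, quarter, _item), value in cells.items():
        out[(location, quarter)] += value
    return dict(out)


by_country = roll_up_location(cube)
by_country_no_item = roll_up_drop_item(by_country)
```

After the first step the cells are grouped by country. After the second, the cube has two dimensions instead of three.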

Drill-down
Drill-down is the reverse operation of roll-up. It is
performed in either of the following ways:
• By stepping down a concept hierarchy for a
dimension
• By introducing a new dimension

1. Drill-down is performed by stepping down a concept
hierarchy for the dimension time.
2. Initially the concept hierarchy was "day < month <
quarter < year".
3. On drilling down, the time dimension is descended
from the level of quarter to the level of month.
4. When drill-down is performed by introducing a new
dimension, one or more dimensions are added to the data cube.
5. It navigates from less detailed data to more
detailed data.
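
Drill-down needs the more detailed data to be available underneath the summary. The sketch below assumes hypothetical month-level base data and a made-up month-to-quarter mapping. It builds the quarterly view first and then drills back down into one quarter. All names and values are illustrative.

```python
from collections import defaultdict

# Hypothetical one-step mapping in the time hierarchy (month < quarter).
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Base data at month level, keyed by (city, month); values are illustrative.
monthly = {
    ("Toronto", "Jan"): 40,
    ("Toronto", "Feb"): 35,
    ("Toronto", "Mar"): 25,
    ("Toronto", "Apr"): 70,
}


def aggregate_to_quarter(cells):
    """Summary view: months aggregated into quarters."""
    out = defaultdict(int)
    for (city, month), value in cells.items():
        out[(city, MONTH_TO_QUARTER[month])] += value
    return dict(out)


def drill_down(cells, city, quarter):
    """Step down from one quarterly cell to its month-level cells."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] == city and MONTH_TO_QUARTER[key[1]] == quarter
    }


quarterly = aggregate_to_quarter(monthly)
q1_detail = drill_down(monthly, "Toronto", "Q1")
```

The month-level cells returned for Q1 add up to the quarterly cell they were drilled down from.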

Slice
The slice operation performs a selection on one particular dimension of
a given cube and provides a new sub-cube. Consider the
following diagram that shows how slice works.

1. Here slice is performed for the dimension "time" using the
criterion time = "Q1".
2. It forms a new sub-cube containing only the cells that
satisfy that criterion.
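
A slice can be sketched as a filter on one dimension. This example assumes a hypothetical cube stored as a dictionary keyed by (city, quarter, item). The helper `slice_cube` is an illustrative name, and the values are made up.

```python
# Cube cells keyed by (city, quarter, item); values are illustrative.
cube = {
    ("Toronto", "Q1", "Mobile"): 100,
    ("Toronto", "Q2", "Mobile"): 70,
    ("Vancouver", "Q1", "Modem"): 20,
}


def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: cell
        for key, cell in cells.items()
        if key[axis] == value
    }


# Slice on the time dimension (axis 1) with the criterion time = "Q1".
q1 = slice_cube(cube, 1, "Q1")
```

The result has one dimension fewer than the original cube, because every remaining cell shares the same time value.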

Dice
Dice performs a selection on two or more dimensions of a given cube and
provides a new sub-cube. Consider the following diagram that
shows the dice operation.

The dice operation on the cube, based on the following selection
criteria, involves three dimensions:
1. (location = "Toronto" or "Vancouver")
2. (time = "Q1" or "Q2")
3. (item = "Mobile" or "Modem")
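
The three criteria above can be applied as a combined filter. This is a minimal sketch, assuming a hypothetical cube stored as a dictionary keyed by (location, time, item). The cell values and the helper name `dice` are illustrative.

```python
# Cube cells keyed by (location, time, item); values are illustrative.
cube = {
    ("Toronto", "Q1", "Mobile"): 100,
    ("Vancouver", "Q2", "Modem"): 20,
    ("Chicago", "Q1", "Mobile"): 80,   # excluded by location
    ("Toronto", "Q3", "Modem"): 10,    # excluded by time
    ("Toronto", "Q1", "Phone"): 5,     # excluded by item
}


def dice(cells, locations, times, items):
    """Keep cells whose coordinates fall inside all three value sets."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }


sub_cube = dice(
    cube,
    locations={"Toronto", "Vancouver"},
    times={"Q1", "Q2"},
    items={"Mobile", "Modem"},
)
```

Unlike slice, the sub-cube keeps all three dimensions. Each is simply restricted to a smaller set of values.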

Pivot
The pivot operation is also known as rotation. It rotates the
data axes in view in order to provide an alternative
presentation of data. Consider the following diagram that
shows the pivot operation.
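
For a two-dimensional view, rotating the axes amounts to swapping rows and columns. The sketch below assumes a hypothetical view with locations as rows and items as columns. The labels, values, and the helper name `pivot` are illustrative.

```python
# Hypothetical 2-D view: rows are locations, columns are items.
row_labels = ["Toronto", "Vancouver"]
col_labels = ["Mobile", "Modem", "Phone"]
view = [
    [100, 50, 10],
    [30, 20, 5],
]


def pivot(rows, cols, values):
    """Rotate the view so that the old columns become the new rows."""
    rotated = [list(column) for column in zip(*values)]
    return cols, rows, rotated


new_rows, new_cols, rotated = pivot(row_labels, col_labels, view)
```

The data is unchanged. Only the presentation differs, with items now listed down the side and locations across the top.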

OLAP vs. OLTP

No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information | Involves day-to-day processing
2 | Used by knowledge workers such as executives, managers, and analysts | Used by clerks, DBAs, or DB professionals
3 | Useful in analyzing the business | Useful in running the business
4 | Focuses on information out | Focuses on data in
5 | Based on star schema, snowflake schema, and fact constellation schema | Based on Entity Relationship Diagram
6 | Contains historical data | Contains current data
7 | Provides summarized and consolidated data | Provides primitive and highly detailed data
8 | Number of users is in hundreds | Number of users is in thousands
9 | Number of records accessed is in millions | Number of records accessed is in tens
10 | DB size is from 100 GB to 1 TB | DB size is from 100 MB to 1 GB
11 | Highly flexible | Provides high performance
