Cosc411 M7.1 2023

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

COSC 411: DATABASE SYSTEMS II 2022/2023 SESSION

DATA WAREHOUSING
(MODULE 7.1)
Data Warehousing
Data warehouse is a data management system that is designed to store large data, facilitate and
support business intelligence activities, especially analytics. The data stored in the warehouse is
uploaded from operational systems, such as marketing, sales, etc., which might be passed through
operational data store for cleaning in order to ensure data quality before used in the warehouse.
Some scholars characterized data warehouse as a subject-oriented, integrated, nonvolatile, time-
variant collection of data in support of management’s decisions. In comparison to traditional
databases, data warehouses generally contain very large amounts of data from multiple sources
that may include databases from different data models and sometimes files acquired from
independent systems and platforms.
Data warehousing can be described more generally as a collection of decision-support technologies
aimed at enabling the knowledge worker (executive, manager, or analyst) to make better and faster
decisions. Data warehouses provide access to data for complex analysis, knowledge discovery,
and decision making through ad hoc and canned queries. They support high-performance demands
on an organization’s data and information. Several types of applications—OLAP, DSS, and Data
Mining applications—are supported. We define each of these as follows:
1. Online Analytical Processing (OLAP): This is a term used to describe the analysis of
complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP
tools enable quick and straightforward querying of the analytical data stored in data
warehouses and data marts (analytical databases similar to data warehouses but with a
defined narrow scope). OLAP helps companies extract insights from their transaction data,
so that they can use it for making more informed decisions.
2. Decision Support System (DSS): Also known as Executive Information Systems (EIS) or
Management Information Systems (MIS), which support an organization’s leading decision
makers with higher-level (analytical) data for complex and important decisions.
3. Data Mining: This is used for knowledge discovery, the ad hoc process of searching data
for unanticipated new knowledge.

Page 1 of 6
COSC 411: DATABASE SYSTEMS II 2022/2023 SESSION

Traditional relational databases cannot be optimized for OLAP, DSS, or Data Mining. By contrast,
data warehouses are designed precisely to support efficient extraction, processing, and presentation
for analytic and decision-making purposes.

General Architecture of a Data Warehouse


There are basically two (2) approaches used in building a data warehouse system. They are as
follows:
 Extraction, Transformation, Load (ETL)-based data warehouse.
 Extraction, Load Transformation (ELT)-based data warehouse.

(1) ETL-Based Data Warehouse


Figure 7.1.1 gives an overview of the conceptual structure of an ETL-based data warehouse. It
shows the entire data warehousing process, which includes possible cleaning and reformatting of
data before loading it into the warehouse. This process is handled by tools known as ETL tools.
The figure also shows that data sources are many which include Flat Files, CRM, ERP and
Operational System.

Figure 7.1.1: General Architecture of an ETL-Based Data Warehouse.

(2) ELT-Based Data Warehouse


In the ELT-based data warehouse, there is no separate tool for data transformation before loading.
It has a staging area within the data warehouse itself. The data is uploaded from heterogeneous

Page 2 of 6
COSC 411: DATABASE SYSTEMS II 2022/2023 SESSION

sources and loaded directly into the data warehouse before any data transformation. Then, all
necessary data transformations are performed inside the data warehouse itself. This can be shown
in Figure 7.1.2 as follows:

Figure 7.1.2: General Architecture of an ELT-Based Data Warehouse.

Characteristics of Data Warehouses


i. In contrast to multi-databases, which provide access to disjoint and usually heterogeneous
databases, a data warehouse is frequently a store of integrated data from multiple sources,
processed for storage in a multidimensional model.
ii. Unlike most transactional databases, data warehouses typically support time series and
trend analyses along with predictive-type analyses.
iii. Compared with transactional databases, data warehouses are nonvolatile. This means that
information in the data warehouse is typically not subject to modification.
iv. In transactional systems, transactions are the unit and are the agent of change to the
database; by contrast, data warehouse information is much more coarse-grained and is
refreshed according to a careful choice of refresh policy, usually incremental.
v. Data warehouse insertions are handled by the warehouse’s ETL or ELT process.
vi. A data warehouse may be regarded as non–real-time with periodic insertions.

Page 3 of 6
COSC 411: DATABASE SYSTEMS II 2022/2023 SESSION

Types of Data Warehousing


1. Enterprise-Wide Data Warehouses: Enterprise data warehouse (EDW) is a central
repository of integrated data from one or more sources, used for reporting and data analysis.
It stores current and historical data in a single place that is used for creating analytical
reports for workers throughout the enterprise. These are huge projects requiring massive
investment of time and resources.
2. Virtual Data Warehouses: These are sets of separate databases which can be queried
together in which a user can access all the data as if it was stored in one data warehouse.
Data is aggregated from a range of sources relevant to a specific area.
3. Logical Data Warehouses: These are data management architectures in which an
architectural layer sits on top of the traditional data warehouse, that enables access to
multiple, diverse data sources while appearing as one logical data source to all users. They
use data federation, distribution, and virtualization techniques.
4. Data Marts: These generally are targeted to a subset of the organization, such as a
department, and are more tightly focused. It focuses on a single functional area of an
organization and contains a subset of the data stored in the data warehouse.
Other terms frequently encountered in the context of data warehousing are the following:
5. Operational Data Store (ODS): This term is commonly used for intermediate form of
databases before they are cleansed, aggregated, and transformed into a data warehouse.
6. Analytical Data Store (ADS): These are databases that are built for the purpose of
conducting data analysis. Typically, ODSs are reconfigured and repurposed into ADSs
through the processes of cleansing, aggregation, and transformation.

Data Modeling for Data Warehouses


Multidimensional models take advantage of inherent relationships in data to populate data in
multidimensional matrices called data cubes. These may be called hypercubes if they have more
than three dimensions. However, two and three dimensional models are briefly discussed as
follows:
1. Two-Dimensional Data Cube: A standard spreadsheet with data represented in rows and
columns is a two-dimensional matrix. Figure 7.1.3 shows this two-dimensional
organization.

Page 4 of 6
COSC 411: DATABASE SYSTEMS II 2022/2023 SESSION

Figure 7.1.3: Two-Dimensional Data Model


2. Three-Dimensional Data Cube: Figure 7.1.4 shows a three-dimensional data cube that
organizes products, stores and times. Each cell could contain data for a specific product,
specific store, and specific time.

Figure 7.1.4: Three-Dimensional Data Model


Changing from one-dimensional hierarchy (orientation) to another is easily accomplished in a data
cube with a technique called pivoting (also called rotation).

Building a Data Warehouse


In constructing a data warehouse, builders should take a broad view of the anticipated use of the
warehouse. Acquisition of data for the warehouse involves the following steps:
1. The data must be extracted from multiple, heterogeneous sources.

Page 5 of 6
COSC 411: DATABASE SYSTEMS II 2022/2023 SESSION

2. Data must be formatted for consistency within the warehouse. Names, meanings, and
domains of data from unrelated sources must be reconciled.
3. The data must be cleaned to ensure validity.
4. The data must be fitted into the data model of the warehouse. Data may have to be
converted from relational, object-oriented, etc. to a multidimensional model.
5. The data must be loaded into the warehouse.

Page 6 of 6

You might also like