03 Data Warehousing-AE
DATA WAREHOUSING
Ahmed Elragal
Professor of Information Systems, LTU
March 2021
Learning Objectives (LOs)
L U L E Å U N I V E R S I T Y O F T E C H N O L O G Y T E A C H I N G 2
AGENDA
Data Warehousing
ETL
Data Marts
Data Lakes
The Evolution of Data Warehousing: M sources
The Evolution of Data Warehousing:
Delayed decision making!
⏤ Organizations now focus on ways to use operational data to
support decision making, as a means of gaining competitive
advantage
The Evolution of Data Warehousing:
The Solution
⏤ Organizations need to turn their archives of data into a source of
knowledge, so that a single integrated view of an organization’s
data is presented to users
Information Silos
Islands of information:
⏤ Enterprise Systems
⏤ Human Capital
⏤ Accounting Systems
⏤ Customer Experience
⏤ E-Commerce
⏤ Warehouse MGT Sys.
⏤ Sourcing Systems
⏤ Call Center
Data Warehousing
Enterprise Data Warehouse at the center of:
⏤ Enterprise Systems
⏤ Human Capital
⏤ Accounting Systems
⏤ Customer Experience
⏤ E-Commerce
⏤ Warehouse MGT Sys.
⏤ Sourcing Systems
⏤ Call Center
Data Warehousing: a definition
⏤Inmon (1993):
⏤A subject-oriented, integrated, time-variant, and non-
volatile collection of data in support of management’s
decision making process
Characteristics of DW
⏤ Subject oriented
⏤ Integrated
⏤ Time-variant (time series)
Benefits of Data Warehousing
Higher ROI
Data Warehouse Access Tools
Data Mining*
⏤ Data homogeneity, i.e., unification of business terms
⏤ Data ownership
⏤ High demand for resources
⏤ High maintenance
A Data Warehouse Architecture
A Generic DW Framework
[Figure: a generic DW framework, read from left to right]
⏤ Data Sources: ERP, Legacy, POS, Other OLTP/Web, External Data
⏤ ETL Process: Select, Extract, Transform, Integrate, Load
⏤ Enterprise Data Warehouse, with Metadata and Replication
⏤ Data Marts: Marketing, Operations, Finance, ... (or the "no data marts" option)
⏤ API / Middleware
⏤ Applications (Visualization): Routine Business Reporting; Data/Text Mining; OLAP, Dashboard, Web; Custom-Built Applications
Components of a Data Warehouse*
* Source: Laudon & Laudon, Management Information Systems, 14th Global Ed., Pearson, 2016, pp.267.
Operational Data Sources
Operational Data Store (ODS)
⏤ May act as staging area for data to be moved into the warehouse
ETL
⏤ ETL stands for Extract, Transform, and Load
⏤ Data for DW must be extracted from one or more data sources,
transformed into a form that is easy to analyze and consistent with
data already in the warehouse, and then finally loaded into the DW
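The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory source lists, the `sales` table, and the function names `extract`, `transform`, and `load` are hypothetical, and SQLite stands in for the warehouse DBMS.

```python
import sqlite3


def extract(sources):
    """Extract: gather raw rows from every source (here, in-memory lists)."""
    for rows in sources:
        yield from rows


def transform(row):
    """Transform: bring a raw row into the format used by the warehouse."""
    return {
        "buyer_name": row["buyer_name"].strip(),
        "reg_id": int(row["reg_id"]),
        "total_sales": float(row["total_sales"]),
    }


def load(conn, rows):
    """Load: append the transformed rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(buyer_name TEXT, reg_id INTEGER, total_sales REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:buyer_name, :reg_id, :total_sales)", rows
    )
    conn.commit()


# Two hypothetical operational sources with slightly messy string data.
pos_rows = [{"buyer_name": " Barr, Adam ", "reg_id": "2", "total_sales": "17.60"}]
web_rows = [{"buyer_name": "Chai, Sean", "reg_id": "4", "total_sales": "52.80"}]

warehouse = sqlite3.connect(":memory:")
load(warehouse, [transform(r) for r in extract([pos_rows, web_rows])])
loaded = warehouse.execute(
    "SELECT buyer_name, reg_id, total_sales FROM sales ORDER BY reg_id"
).fetchall()
```

After the run, `loaded` holds one cleaned row per source row, all in the same format regardless of which source they came from.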
Data Transformation

Change
Before:
buyer_name      reg_id   total_sales
Barr, Adam      II       17.60
Chai, Sean      IV       52.80
O’Melia, Erin   VI       8.82
...             ...      ...
After:
buyer_name      reg_id   total_sales
Barr, Adam      2        17.60
Chai, Sean      4        52.80
O’Melia, Erin   6        8.82
...             ...      ...

Combine
Before:
buyer_first   buyer_last   reg_id   total_sales
Adam          Barr         2        17.60
Sean          Chai         4        52.80
Erin          O’Melia      6        8.82
...           ...          ...      ...
After:
buyer_name      reg_id   total_sales
Barr, Adam      2        17.60
Chai, Sean      4        52.80
O’Melia, Erin   6        8.82
...             ...      ...

Calculate
Before:
buyer_name      price   qty
Barr, Adam      .55     32
Chai, Sean      1.10    48
O’Melia, Erin   .99     9
...             ...     ...
After:
buyer_name      price   qty   total_sales
Barr, Adam      .55     32    17.60
Chai, Sean      1.10    48    52.80
O’Melia, Erin   .99     9     8.82
...             ...     ...   ...
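The three transformation types on this slide (Change, Combine, Calculate) can be sketched as small row-level functions. The function names and the dictionary row format are assumptions made for illustration; only the column names and sample values come from the slide. Note that 0.99 × 9 evaluates to 8.91, so the 8.82 shown for the third row may be a typo in the sample data.

```python
from decimal import Decimal

ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(numeral):
    """Convert a short Roman numeral (I, V, X only) to an integer."""
    values = [ROMAN[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def change(row):
    """Change: recode reg_id from Roman to Arabic numerals."""
    return {**row, "reg_id": roman_to_int(row["reg_id"])}


def combine(row):
    """Combine: merge buyer_first and buyer_last into buyer_name."""
    rest = {k: v for k, v in row.items() if k not in ("buyer_first", "buyer_last")}
    return {"buyer_name": f'{row["buyer_last"]}, {row["buyer_first"]}', **rest}


def calculate(row):
    """Calculate: derive total_sales from price and qty."""
    return {**row, "total_sales": Decimal(row["price"]) * row["qty"]}


changed = change({"buyer_name": "Chai, Sean", "reg_id": "IV", "total_sales": "52.80"})
combined = combine({"buyer_first": "Adam", "buyer_last": "Barr", "reg_id": 2})
calculated = calculate({"buyer_name": "Barr, Adam", "price": "0.55", "qty": 32})
```

Prices are kept as `Decimal` values built from strings so that the derived totals do not pick up binary floating-point rounding noise.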
Data Warehouse DBMS Requirements
⏤ Load performance [as part of the ETL]
⏤ Load processing e.g., conversion, filter, integrity, etc.
⏤ Data quality management
⏤ Query performance
⏤ Volume Scalability [TB or PB]
⏤ User scalability [10’s or 100’s or 1000’s]
⏤ Networked data warehouse e.g., cloud-based
⏤ Warehouse administration
⏤ Integrated dimensional analysis
⏤ Advanced query functionality
Data Mart
Reasons for Creating a Data Mart
⏤ To give users access to the data they need to analyze most often
⏤ To provide data in a form that matches the collective view of the
data by a group of users in a department or business application
area
⏤ To improve end-user response time due to the reduction in the
volume of data to be accessed
⏤ To provide appropriately structured data as dictated by the
requirements of the end-user access tools
Data Mart Examples*
⏤ Data Mart
⏤A subset of corporate-wide data that is of value to a specific group of
users. Its scope is confined to specific, selected groups, such as a marketing
data mart
⏤ Independent vs. dependent (directly from warehouse) data mart
⏤Virtual warehouse
⏤A set of views over operational databases. Only some of the possible
summary views may be materialized
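The difference between a dependent data mart and a virtual warehouse can be sketched with SQLite from Python. All table, view, and column names below are hypothetical: the dependent mart is a physical copy of one subject's slice of the warehouse, while the virtual warehouse is a view that is computed over an operational table at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical enterprise warehouse table covering several subjects.
    CREATE TABLE warehouse_facts (subject TEXT, region TEXT, amount REAL);
    INSERT INTO warehouse_facts VALUES
        ('marketing', 'north', 120.0),
        ('marketing', 'south', 80.0),
        ('finance',   'north', 300.0);

    -- Dependent data mart: a physical subset loaded directly from the warehouse.
    CREATE TABLE marketing_mart AS
        SELECT region, amount FROM warehouse_facts WHERE subject = 'marketing';

    -- Hypothetical operational table and a virtual-warehouse summary view over it.
    CREATE TABLE pos_orders (region TEXT, amount REAL);
    INSERT INTO pos_orders VALUES ('north', 20.0), ('north', 5.0), ('south', 10.0);
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM pos_orders GROUP BY region;
""")

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
summary = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

SQLite views are never materialized, so this sketch only shows the non-materialized case; a DBMS with materialized views would let some of the summary views be stored.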
Alternative DW Architectures
(a) Independent Data Marts Architecture
Flow: Source systems -> Staging area (ETL) -> Independent data marts (atomic/summarized data) -> End-user access and applications
Note: Least cost. Each DM operates independently, so data is inconsistent across the DMs and it is almost impossible to bring them into an EDW.

(b) Data Mart Bus Architecture with Linked Dimensional Data Marts
Flow: Source systems -> Staging area (ETL) -> Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) -> End-user access and applications
Note: DMs are linked to each other via middleware, which gives higher consistency and allows queries across the DMs.

(c) Hub and Spoke Architecture (Corporate Information Factory)
Flow: Source systems -> Staging area (ETL) -> Normalized relational warehouse (atomic data) -> Dependent data marts -> End-user access and applications
Note: Most famous. A central EDW and multiple dependent DMs.
Alternative DW Architectures
(d) Centralized Data Warehouse Architecture
Flow: Source systems -> Staging area (ETL) -> Normalized relational warehouse (atomic/some summarized data) -> End-user access and applications
Note: Similar to the hub-and-spoke, but there are no dependent data marts! Advocated mainly by Teradata.
Data {aka analytics} sandbox
⏤ The data sandbox platform provides the computing resources (e.g., massively
parallel processing units, high-end memory, and high-capacity, high-I/O storage)
that data scientists need to tackle typically complex analytical workloads
Why sandboxing?
⏤ This has prompted the search for a quicker alternative solution, which gave
rise to the concept of the Analytics Sandbox
Rationale
Data lake and the position of analytics Sandbox
⏤ Physical Sandbox
⏤ It uses a platform –e.g., an appliance, columnar database, MPP database – to create a
separate physical sandbox for the business analysts and data scientists
⏤ It offloads complex queries from the EDW
⏤ It enables analysts to upload personal or external data to those systems
⏤ Desktop Sandbox
⏤ It can download datasets from the EDW and other sources to explore the data at the
speed of thought
⏤ Analysts get a high degree of local control and fast performance but give up data
scalability compared to the other two approaches
⏤ The challenge is preventing analysts from publishing the results of their analyses in an ad
hoc manner that undermines information consistency for the enterprise
From the sandbox to the data lab
⏤ A sandbox, when treated as a data mart, may be perceived as lower cost and
faster to implement, but it has limitations!
⏤ It requires the acquisition of hardware & software
⏤ Every time analysts need data from the data warehouse, they must make a formal request
to IT staff
⏤ It is also costly, risky, and resource intensive
⏤ A better solution is a data lab {aka virtual sandbox}, which is a self-service analytics
sandbox that exists within a production data warehouse
⏤ Business users enjoy quick and easy access to required data through automated
processes and user-friendly tools
⏤ IT staff is assured of proper governance without the strain on resources created
by sandboxes and data marts
The Data Lab*
* Source: https://www.teradata.com
Representation of Data in the DW
⏤ Dimensional Modeling
⏤A retrieval-based system that supports high-volume query access
⏤ Star schema
⏤The most commonly used and the simplest style of dimensional
modeling
⏤Contains a fact table surrounded by and connected to several
dimension tables
⏤ Snowflake schema
⏤An extension of star schema where the diagram resembles a
snowflake in shape
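A star schema can be sketched with SQLite from Python: one fact table whose foreign keys point at the surrounding dimension tables. The table names, columns, and sample rows below are hypothetical and chosen only to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes used to slice the facts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);

    -- Fact table: one row per sale, keyed by the dimensions, holding the measure.
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        region_id   INTEGER REFERENCES dim_region(region_id),
        total_sales REAL
    );

    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 1, 5.0);
""")

# A typical dimensional query: join the fact table to a dimension and aggregate.
sales_by_product = conn.execute("""
    SELECT p.name, SUM(f.total_sales)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

A snowflake schema would normalize the dimensions further, for example by moving a product category out of `dim_product` into its own table that `dim_product` references.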
Multidimensionality
⏤Multidimensional presentation
⏤ Dimensions: products, salespeople, market segments, business units,
geographical locations, distribution channels, country, or industry
⏤ Measures: money, sales volume, head count, inventory profit, actual versus
forecast
⏤ Time: daily, weekly, monthly, quarterly, or yearly
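The interplay of dimensions, measures, and time can be sketched as a small roll-up function in Python. The sample rows, the `roll_up` and `time_bucket` names, and the choice of `amount` as the measure are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records: two dimensions, a date, and one measure.
sales = [
    {"product": "Widget", "region": "North", "day": date(2021, 1, 15), "amount": 100},
    {"product": "Widget", "region": "North", "day": date(2021, 2, 10), "amount": 50},
    {"product": "Widget", "region": "South", "day": date(2021, 4, 3), "amount": 70},
    {"product": "Gadget", "region": "North", "day": date(2021, 1, 20), "amount": 30},
]


def time_bucket(day, grain):
    """Map a date to a label at the requested time grain."""
    if grain == "yearly":
        return str(day.year)
    if grain == "quarterly":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if grain == "monthly":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unsupported grain: {grain}")


def roll_up(rows, dimension, grain, measure="amount"):
    """Sum a measure for each (dimension value, time bucket) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row[dimension], time_bucket(row["day"], grain))] += row[measure]
    return dict(totals)


by_product_quarter = roll_up(sales, "product", "quarterly")
by_region_year = roll_up(sales, "region", "yearly")
```

The same records answer different questions depending on which dimension and time grain are chosen, which is the core idea behind multidimensional presentation.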
Star Schema versus Snowflake Schema
Successful DW Implementation: Things to Avoid
01 Starting with the wrong sponsorship chain
02 Setting expectations that you cannot meet
03 Engaging in politically naive behavior
04 Loading the data warehouse with information just because it is available
05 Believing that data warehousing database design is the same as transactional database design
Q&A