
T E A C H I N G

DATA WAREHOUSING

Ahmed Elragal
Professor of Information Systems, LTU

March 2021
Learning Objectives (LO’s)

⏤ Understand the definitions and concepts of data warehousing
⏤ Understand data warehousing architectures
⏤ Explain data warehousing operations
⏤ Explain the role of data warehouses in decision support
⏤ Explain the ETL processes
⏤ Illustrate analytics sandboxes

L U L E Å U N I V E R S I T Y O F T E C H N O L O G Y T E A C H I N G 2
AGENDA

ETL

Data Marts

Data Warehousing

Data Lakes

The Evolution of Data Warehousing: M sources

⏤ Since the 1970s, organizations have gained competitive advantage through
systems that automate business processes to offer more efficient
and cost-effective services to the customer
⏤ This resulted in the accumulation of growing amounts of data in
operational databases

The Evolution of Data Warehousing:
Delayed decision making!
⏤ Organizations now focus on ways to use operational data to
support decision making, as a means of gaining competitive
advantage

⏤ However, operational systems were never designed to support such
business activities

⏤ Typically, businesses have many operational systems with
overlapping and sometimes contradictory definitions

The Evolution of Data Warehousing:
The Solution
⏤ Organizations need to turn their archives of data into a source of
knowledge, so that a single integrated view of an organization’s
data is presented to users

⏤ A data warehouse (DW) was supposed to be the solution to meet
the requirements of a system capable of supporting decision
making, receiving data from multiple operational data sources

Information Silos

Islands of information:
⏤ Enterprise Systems
⏤ Human Capital
⏤ Accounting Systems
⏤ Customer Experience
⏤ E-Commerce
⏤ Warehouse MGT Sys.
⏤ Sourcing Systems
⏤ Call Center

Data Warehousing

[Figure: the Enterprise Data Warehouse at the center, surrounded by the following systems]
⏤ Enterprise Systems
⏤ Human Capital
⏤ Accounting Systems
⏤ Customer Experience
⏤ E-Commerce
⏤ Warehouse MGT Sys.
⏤ Sourcing Systems
⏤ Call Center

Data Warehousing: a definition

⏤ Inmon (1993):
⏤ A subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s
decision-making process

Characteristics of DW

Subject oriented
Integrated
Time-variant (time series)
Nonvolatile
Summarized
Not normalized
Metadata
Web based, relational/multi-dimensional
Client/server, real-time/right-time/active...

Benefits of Data Warehousing

Unified business terms

Higher ROI

Renders competitive advantage

Increased productivity of corporate decision makers

Single version of the truth

Timely decision making

Data Warehouse Access Tools

⏤ The types of queries that a data warehouse is expected to answer
range from the relatively simple to the highly complex and
depend on the type of end-user access tools used.

⏤ End-user access tools include:
⏤ Traditional reporting and query
⏤ Online Analytical Processing [OLAP]
⏤ Data mining

Data Mining*

* Source: Kroenke, Using MIS, 3rd Ed., Pearson, 2011, pp.284.


Data Warehousing: Challenges

Underestimation of resources for data loading
Hidden problems with source systems
Required data cannot be captured
Increased end-user demands
Data homogeneity i.e., unification of business terms
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
Data Security!

A Data Warehouse Architecture

A Generic DW Framework

[Figure: generic DW framework, read left to right]
Data Sources: ERP; Legacy; POS; Other OLTP/Web; External Data
ETL Process: Select, Extract, Transform, Integrate, Load
Enterprise Data warehouse, with Metadata
Replication to Data Marts: Data mart (Marketing); Data mart (Operations); Data mart (Finance); Data mart (...)
No data marts option
API / Middleware
Applications (Visualization): Routine Business Reporting; Data/text mining; OLAP, Dashboard, Web; Custom built applications

Components of a Data Warehouse*

* Source: Kroenke, Using MIS, 3rd Ed., Pearson, 2011, pp.280.


Contemporary Business Intelligence Infrastructure*

* Source: Laudon & Laudon, Management Information Systems, 14th Global Ed., Pearson, 2016, pp.267.
Operational Data Sources

Main sources are called online transaction processing (OLTP)
databases

Also include sources such as personal databases and
spreadsheets, Enterprise systems, and web usage log files

Operational Data Store (ODS)

⏤ Holds current and integrated operational data for analysis

⏤ Often structured and supplied with data in the same way as the
data warehouse

⏤ May act as a staging area for data to be moved into the warehouse

⏤ Often created when legacy operational systems are found to be
incapable of achieving reporting requirements

ETL
⏤ ETL stands for Extract, Transform, and Load
⏤ Data for the DW must be extracted from one or more data sources,
transformed into a form that is easy to analyze and consistent with
data already in the warehouse, and then finally loaded into the DW

⏤ ETL tools automate the extraction, transformation, and
loading processes and offer additional facilities such as data
profiling, data quality control, and metadata management
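The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular ETL tool works: the two source lists (ERP_ORDERS, POS_SALES), their field names, the sales table, and the single data quality rule are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Two hypothetical operational sources with different field names and units.
ERP_ORDERS = [
    {"order_id": 1, "customer": "ACME", "amount_cents": 125000},
    {"order_id": 2, "customer": "Globex", "amount_cents": 80050},
]
POS_SALES = [
    {"ticket": "T-9", "cust_name": "acme", "total": "310.40"},
    {"ticket": "T-10", "cust_name": "", "total": "12.00"},  # fails quality check
]


def extract():
    """Extract: pull raw rows from every source, tagging where they came from."""
    for row in ERP_ORDERS:
        yield "erp", row
    for row in POS_SALES:
        yield "pos", row


def transform(source, row):
    """Transform: map each source's layout onto one warehouse layout."""
    if source == "erp":
        record = (f"erp-{row['order_id']}", row["customer"].upper(),
                  row["amount_cents"] / 100)
    else:
        record = (f"pos-{row['ticket']}", row["cust_name"].upper(),
                  float(row["total"]))
    # Minimal data quality rule (an assumption): a sale must name a customer.
    return record if record[1] else None


def load(conn, records):
    """Load: write the cleaned records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(sale_id TEXT PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn):
    """Run all three steps and report how many rows were loaded or rejected."""
    clean, rejected = [], []
    for source, row in extract():
        record = transform(source, row)
        if record is None:
            rejected.append(row)
        else:
            clean.append(record)
    load(conn, clean)
    return len(clean), len(rejected)


warehouse = sqlite3.connect(":memory:")
loaded, rejected = run_etl(warehouse)
```

After the run, "ACME" and "acme" from the two sources appear under one customer name in the warehouse, which is the kind of consistency the transform step is meant to provide.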

Data Transformation
Change

Before:
buyer_name      reg_id   total_sales
Barr, Adam      II       17.60
Chai, Sean      IV       52.80
O’Melia, Erin   VI       8.82
...             ...      ...

After:
buyer_name      reg_id   total_sales
Barr, Adam      2        17.60
Chai, Sean      4        52.80
O’Melia, Erin   6        8.82
...             ...      ...

Combine

Before:
buyer_first   buyer_last   reg_id   total_sales
Adam          Barr         2        17.60
Sean          Chai         4        52.80
Erin          O’Melia      6        8.82
...           ...          ...      ...

After:
buyer_name      reg_id   total_sales
Barr, Adam      2        17.60
Chai, Sean      4        52.80
O’Melia, Erin   6        8.82
...             ...      ...

Calculate

Before:
buyer_name      price   qty
Barr, Adam      .55     32
Chai, Sean      1.10    48
O’Melia, Erin   .99     9
...             ...     ...

After:
buyer_name      price   qty   total_sales
Barr, Adam      .55     32    17.60
Chai, Sean      1.10    48    52.80
O’Melia, Erin   .99     9     8.82
...             ...     ...   ...
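The three transformation types on this slide (Change, Combine, Calculate) can each be written as a small row-level function. The sketch below is illustrative only: the function names and the Roman numeral lookup table are assumptions, and rows are modeled as plain dictionaries rather than database records.

```python
# Roman numerals assumed to be used by the source system for region ids.
ROMAN_TO_INT = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}


def change(row):
    """Change: recode reg_id from a Roman numeral to an integer."""
    return {**row, "reg_id": ROMAN_TO_INT[row["reg_id"]]}


def combine(row):
    """Combine: merge buyer_first and buyer_last into one buyer_name field."""
    merged = {k: v for k, v in row.items()
              if k not in ("buyer_first", "buyer_last")}
    merged["buyer_name"] = f"{row['buyer_last']}, {row['buyer_first']}"
    return merged


def calculate(row):
    """Calculate: derive total_sales from price and qty."""
    return {**row, "total_sales": round(row["price"] * row["qty"], 2)}


changed = change({"buyer_name": "Barr, Adam", "reg_id": "II",
                  "total_sales": 17.60})
combined = combine({"buyer_first": "Sean", "buyer_last": "Chai",
                    "reg_id": 4, "total_sales": 52.80})
calculated = calculate({"buyer_name": "Barr, Adam", "price": 0.55, "qty": 32})
```

Applied to the first two rows of the Calculate table, this reproduces 17.60 and 52.80. For the third row, .99 times 9 evaluates to 8.91, so the sketch would not reproduce the 8.82 shown on the slide.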
Data Warehouse DBMS Requirements
⏤ Load performance [as part of the ETL]
⏤ Load processing e.g., conversion, filter, integrity, etc.
⏤ Data quality management
⏤ Query performance
⏤ Volume Scalability [TB or PB]
⏤ User scalability [10’s or 100’s or 1000’s]
⏤ Networked data warehouse e.g., cloud-based
⏤ Warehouse administration
⏤ Integrated dimensional analysis
⏤ Advanced query functionality

Data Mart

⏤ A data mart contains a subset of corporate data to support the
analytical requirements of a particular business unit (such as the
sales department) or to support users who share the same
requirements to analyze a particular business process (such as
property sales)

⏤ Building a data mart is simpler than establishing an
enterprise-wide DW (EDW)

⏤ The cost of implementing data marts is normally less than that
required to establish an EDW

Reasons for Creating a Data Mart

⏤ To give users access to the data they need to analyze most often
⏤ To provide data in a form that matches the collective view of the
data by a group of users in a department or business application
area
⏤ To improve end-user response time due to the reduction in the
volume of data to be accessed
⏤ To provide appropriately structured data as dictated by the
requirements of the end-user access tools
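As a rough illustration of the reasons listed above, a dependent data mart can be derived from the warehouse as a smaller, pre-shaped table. The sketch assumes a hypothetical edw_sales table and a marketing department that wants monthly totals by region; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical enterprise-wide table holding every business unit's facts.
    CREATE TABLE edw_sales (
        sale_date TEXT, business_unit TEXT, region TEXT,
        product TEXT, amount REAL
    );
    INSERT INTO edw_sales VALUES
        ('2021-01-05', 'marketing', 'north', 'campaign-a', 100.0),
        ('2021-01-06', 'marketing', 'south', 'campaign-a', 250.0),
        ('2021-01-06', 'finance',   'north', 'audit',      900.0),
        ('2021-02-01', 'marketing', 'north', 'campaign-b', 75.0);

    -- Dependent data mart: only marketing rows, pre-summarized by month
    -- and region, the shape the marketing users are assumed to want.
    CREATE TABLE mart_marketing AS
    SELECT substr(sale_date, 1, 7) AS month, region, SUM(amount) AS total
    FROM edw_sales
    WHERE business_unit = 'marketing'
    GROUP BY substr(sale_date, 1, 7), region;
""")

mart_rows = conn.execute(
    "SELECT month, region, total FROM mart_marketing ORDER BY month, region"
).fetchall()
```

The mart holds fewer rows than the warehouse table and already matches the department's view of the data, which is where the response-time benefit would come from.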

Data Mart Examples*

* Source: Kroenke, Using MIS, 3rd Ed., Pearson, 2011, pp.283.


Three Data Warehouse Models
⏤ Enterprise data warehouse [EDW]
⏤ Collects all the information about subjects spanning the entire
organization

⏤ Data Mart
⏤ A subset of corporate-wide data that is of value to a specific group of
users. Its scope is confined to specific, selected groups, such as a marketing
data mart
⏤ Independent vs. dependent (directly from warehouse) data mart

⏤ Virtual warehouse
⏤ A set of views over operational databases. Only some of the possible
summary views may be materialized
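The virtual warehouse model can be illustrated with an ordinary SQL view over operational tables. The tables and view names below are hypothetical. SQLite has no materialized views, so a snapshot table stands in for a materialized summary view here; that substitution is an assumption of the sketch.

```python
import sqlite3

ops = sqlite3.connect(":memory:")
ops.executescript("""
    -- Hypothetical operational (OLTP) tables.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'SE'), (2, 'FI');
    INSERT INTO orders VALUES (10, 1, '2021-03-01', 40.0),
                              (11, 1, '2021-03-02', 60.0),
                              (12, 2, '2021-03-02', 25.0);

    -- Virtual warehouse: a view, evaluated against live data on every query.
    CREATE VIEW v_sales_by_country AS
    SELECT c.country AS country, SUM(o.amount) AS total
    FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country;

    -- Stand-in for a materialized view: a snapshot taken at this moment.
    CREATE TABLE m_sales_by_country AS SELECT * FROM v_sales_by_country;
""")

# A new operational transaction arrives after the snapshot was taken.
ops.execute("INSERT INTO orders VALUES (13, 2, '2021-03-03', 5.0)")

live = dict(ops.execute("SELECT country, total FROM v_sales_by_country"))
snapshot = dict(ops.execute("SELECT country, total FROM m_sales_by_country"))
```

The view reflects the new order immediately, while the snapshot stays as it was until it is refreshed. That is the usual trade-off between a plain view and a materialized one.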

Alternative DW Architectures
(a) Independent Data Marts Architecture
Flow (via ETL): Source Systems → Staging Area → Independent data marts (atomic/summarized data) → End user access and applications
Note: Least cost. Each DM operates independently, so data is inconsistent across the DMs, and it is almost impossible to bring them into an EDW.

(b) Data Mart Bus Architecture with Linked Dimensional Data Marts
Flow (via ETL): Source Systems → Staging Area → Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) → End user access and applications
Note: DMs are linked to each other via middleware, giving higher consistency and allowing queries across the DMs.

(c) Hub and Spoke Architecture (Corporate Information Factory)
Flow (via ETL): Source Systems → Staging Area → Normalized relational warehouse (atomic data) → Dependent data marts (summarized/some atomic data) → End user access and applications
Note: Most famous. A central EDW and multiple dependent DMs.

Alternative DW Architectures
(d) Centralized Data Warehouse Architecture
Flow (via ETL): Source Systems → Staging Area → Normalized relational warehouse (atomic/some summarized data) → End user access and applications
Note: Similar to the hub-and-spoke, but there are no dependent data marts! Advocated mainly by Teradata.

(e) Federated Architecture
Flow: Existing data warehouses, data marts, and legacy systems → (data mapping / metadata) → Logical/physical integration of common data elements → End user access and applications
Note: Essentially, integrating disparate systems. Data from those systems is accessed when needed. The approach is supported by middleware vendors, whose products enable the federated systems to integrate.

Data {aka analytics} sandbox

⏤ A data sandbox, normally used in big data contexts, is a scalable
development platform used by data scientists to explore the
organization’s data assets

⏤ The sandbox permits enterprises to realize their investments in big data

⏤ The data sandbox platform provides the necessary and adequate computing
resources (e.g., massively parallel processing units, high-end memory, high-capacity
and high-I/O storage) required by data scientists to tackle typically
complex analytical workloads

Why sandboxing?

⏤ An enterprise data warehouse (EDW) normally takes substantial time to
implement and does not always meet the rapidly changing needs of the
business

⏤ This has prompted the search for quicker alternatives, which led to
the concept of the Analytics Sandbox

⏤ An analytics sandbox is a separate environment that is part of the overall
data lake architecture, meaning that it is a centralized environment meant
to be used by multiple users and is maintained with the support of IT

Rationale

⏤ The EDW is normally structured, subject to defined business rules, and
tightly governed by the enterprise
⏤ EDW data is frequently synchronized with the production environments
via regularly scheduled loads. To that end, it simply takes time for the
EDW to react to new data and analytic requests
⏤ Modern organizations need to be agile by quickly testing new ideas, new
hypotheses, new data sources, and new technologies
⏤ Used for Innovation & Exploration purposes
⏤ An analytic sandbox complements your dimensional data warehouse. It
is not intended to replace the data warehouse, but rather stand beside it
and provide an environment that can react more quickly to new
requirements
⏤ Disadvantages:
⏤ Costs; maintenance; and security!

Data lake and the position of analytics Sandbox

⏤ Analytics Sandbox could be providing data to the EDW. See above
⏤ It is also possible to feed key data from the EDW to the Analytics Sandbox, in addition to non-EDW data
* Source: https://www.blue-granite.com/blog/advantages-of-the-analytics-sandbox-for-data-lakes
Types of sandboxes

⏤ Physical Sandbox
⏤ It uses a platform (e.g., an appliance, columnar database, MPP database) to create a
separate physical sandbox for the business analysts and data scientists
⏤ It offloads complex queries from the EDW
⏤ It enables analysts to upload personal or external data to those systems

⏤ Virtual Sandbox aka data lab
⏤ It is implemented inside the EDW using workload management utilities
⏤ Business analysts can upload their own data to these virtual partitions, mix it with
corporate data, and run complex SQL queries
⏤ These virtual sandboxes require delicate handling to keep the two populations (casual
and power users) from encroaching on each other’s processing territories.
⏤ Compared to a physical sandbox, it avoids having to replicate and distribute corporate
data to a secondary environment that runs on a non-standard platform

⏤ Desktop Sandbox
⏤ It can download datasets from the EDW and other sources to explore the data at the
speed of thought
⏤ Analysts get a high degree of local control and fast performance but give up data
scalability compared to the other two approaches
⏤ The challenge is preventing analysts from publishing the results of their analyses in an ad
hoc manner that undermines information consistency for the enterprise

⏤ Software as a Service (SaaS) Sandbox
⏤ SaaS enables the easy and rapid creation of sandboxes for data exploration
⏤ SaaS vendors offer a complete and comprehensive solution that can satisfy IT and business
requirements for sandboxes
⏤ The entire ETL and data warehouse processes are automated with intelligent source
data analysis, data extraction, loading and transformation to automatically create
staging and warehouse tables
⏤ IT can quickly set up sandboxes based on data that IT can control. The sandboxes
themselves can also be centrally managed, so that end users have access to the
appropriate amount of information and IT is confident in the data’s integrity and
security
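Of the sandbox types above, the virtual sandbox is the easiest to sketch in code: analysts upload their own data next to the governed corporate data and query both with SQL. The example below only approximates the idea. It uses an attached SQLite schema named sandbox to stand in for a virtual partition inside the EDW, and the table names and figures are hypothetical. Real warehouses would rely on their own workload management and permission features, which are not modeled here.

```python
import sqlite3

edw = sqlite3.connect(":memory:")

# The sandbox is a separate schema attached to the same connection,
# standing in for a virtual partition inside the warehouse.
edw.execute("ATTACH DATABASE ':memory:' AS sandbox")

edw.executescript("""
    -- Governed corporate data (hypothetical table in the main schema).
    CREATE TABLE main.sales (region TEXT, amount REAL);
    INSERT INTO main.sales VALUES
        ('north', 120.0), ('south', 80.0), ('north', 30.0);

    -- Analyst-owned table that lives only in the sandbox schema.
    CREATE TABLE sandbox.targets (region TEXT, target REAL);
""")

# The analyst uploads personal data into the sandbox only.
edw.executemany("INSERT INTO sandbox.targets VALUES (?, ?)",
                [("north", 100.0), ("south", 100.0)])
edw.commit()

# Corporate and personal data are mixed in one SQL query.
gap = edw.execute("""
    SELECT s.region, SUM(s.amount) - t.target AS over_target
    FROM main.sales AS s
    JOIN sandbox.targets AS t ON t.region = s.region
    GROUP BY s.region, t.target
    ORDER BY s.region
""").fetchall()
```

Because the corporate table is queried in place, nothing is copied to a secondary platform, which is the advantage over a physical sandbox noted above.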

From the sandbox to the data lab

⏤ A sandbox, when treated as a data mart, may be perceived to be lower cost and
faster to implement, but it has limitations!
⏤ It requires the acquisition of hardware & software
⏤ Every time users need data from the data warehouse, they must make a formal request
to IT staff
⏤ It is also costly, risky, and resource intensive

⏤ A better solution is a data lab {aka virtual sandbox}, which is a self-service analytics
sandbox that exists within a production data warehouse
⏤ Business users enjoy quick and easy access to required data through automated
processes and user-friendly tools
⏤ IT staff are assured of proper governance without the strain on resources created
by sandboxes and data marts

The Data Lab*

* Source: https://www.teradata.com
Representation of Data in the DW

⏤ Dimensional Modeling
⏤ A retrieval-based system that supports high-volume query access

⏤ Star schema
⏤ The most commonly used and the simplest style of dimensional
modeling
⏤ Contains a fact table surrounded by and connected to several
dimension tables

⏤ Snowflake schema
⏤ An extension of the star schema where the diagram resembles a
snowflake in shape
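The two schema styles can be made concrete with a small example. In this sketch, fact_sales joined directly to dim_region and dim_product forms a star, and moving the product category into its own dim_category table is one snowflaked dimension. All table names, columns, and rows are hypothetical, and SQLite is used only for convenience.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    -- Dimension tables.
    CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, name TEXT);
    -- Snowflaking: product category is normalized out of dim_product.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );

    -- Fact table: one foreign key per dimension plus numeric measures.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        region_key  INTEGER REFERENCES dim_region(region_key),
        qty INTEGER, amount REAL
    );

    INSERT INTO dim_region VALUES (1, 'north'), (2, 'south');
    INSERT INTO dim_category VALUES (1, 'dairy'), (2, 'bakery');
    INSERT INTO dim_product VALUES (1, 'milk', 1), (2, 'cheese', 1),
                                   (3, 'bread', 2);
    INSERT INTO fact_sales VALUES (1, 1, 10, 12.0), (2, 1, 2, 9.0),
                                  (3, 2, 5, 10.0), (1, 2, 4, 5.0);
""")

# Star-style query: the fact table joins directly to a dimension.
by_region = dw.execute("""
    SELECT r.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_region AS r ON r.region_key = f.region_key
    GROUP BY r.name ORDER BY r.name
""").fetchall()

# Snowflake-style query: one extra join to reach the normalized category.
by_category = dw.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```

The snowflaked query needs one more join than the star query, which is the usual cost of normalizing a dimension.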

Multidimensionality

The ability to organize, present, and analyze data by several
dimensions, such as sales by region, by product, by salesperson,
and by time (four dimensions)

⏤Multidimensional presentation
⏤ Dimensions: products, salespeople, market segments, business units,
geographical locations, distribution channels, country, or industry
⏤ Measures: money, sales volume, head count, inventory profit, actual versus
forecast
⏤ Time: daily, weekly, monthly, quarterly, or yearly
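Analyzing one measure by several dimensions can be sketched with a generic roll-up function. The fact rows and the function name rollup are hypothetical. The four dimensions mirror the example on this slide (region, product, salesperson, time), with sales as the measure.

```python
from collections import defaultdict

# Hypothetical sales facts: four dimensions and one measure.
FACTS = [
    {"region": "north", "product": "milk", "salesperson": "ana",
     "quarter": "Q1", "sales": 100},
    {"region": "north", "product": "bread", "salesperson": "ana",
     "quarter": "Q2", "sales": 50},
    {"region": "south", "product": "milk", "salesperson": "ben",
     "quarter": "Q1", "sales": 70},
    {"region": "south", "product": "milk", "salesperson": "ben",
     "quarter": "Q2", "sales": 30},
]


def rollup(facts, dimensions, measure="sales"):
    """Total a measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


by_region = rollup(FACTS, ["region"])
by_region_quarter = rollup(FACTS, ["region", "quarter"])
grand_total = rollup(FACTS, [])
```

Passing a different list of dimensions gives a different view of the same facts, and an empty list collapses everything to a grand total. OLAP tools offer the same kind of operation interactively.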

Star Schema versus Snowflake Schema

Successful DW Implementation: Things to Avoid

01 Starting with the wrong sponsorship chain
02 Setting expectations that you cannot meet
03 Engaging in politically naive behavior
04 Loading the data warehouse with information just because it is available
05 Believing that data warehousing database design is the same as transactional database design

Q&A
