
DATA WAREHOUSING AND BUSINESS INTELLIGENCE
Outline
Decision Support Systems
Data Warehousing
Data Mart
OLAP
DMBOK Activities
Decision Support Systems
 Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems.
 Examples of business decisions:
◦ What items to stock?
◦ What insurance premium to charge?
◦ Who to send advertisements to?
 Examples of data used for making decisions:
◦ Retail sales transaction details
◦ Customer profiles (income, age, sex, etc.)
Decision-Support Systems: Overview
Data analysis tasks are simplified by specialized tools and SQL extensions.
Statistical analysis packages (e.g., S++) can be interfaced with databases.
 Statistical analysis is a large field and is not covered here.
Data mining seeks to discover knowledge automatically, in the form of statistical rules and patterns, from large databases.
A data warehouse archives information gathered from multiple sources and stores it under a unified schema, at a single site.
 Important for large businesses which generate data from multiple divisions, possibly at multiple sites
 Data may also be purchased externally
The Evolution of Data Warehousing
 Since the 1970s, organizations have gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective services to the customer.
 This has led to growing amounts of data in operational databases.
 Organizations now focus on ways to use operational data
to support decision-making, as a means of gaining
competitive advantage.
 However, operational systems were never designed to
support such business activities.
 Organizations need to turn their archives of data into a
source of knowledge, so that a single integrated /
consolidated view of the organization’s data is presented
to the user.
 A data warehouse was deemed the solution to meet the
requirements of a system capable of supporting decision-
making, receiving data from multiple operational data
sources.
Data Warehousing
A subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision-making
process (Inmon, 1993).
Subject Oriented:
◦ The warehouse is organized around the major subjects of the enterprise (e.g.
customers, products, and sales) rather than the major application areas (e.g.
customer invoicing, stock control, and product sales).
◦ Need to store decision support data rather than application-oriented data
Integrated:
◦ The data warehouse integrates corporate application-oriented data from different
source systems, which often includes data that is inconsistent.
Time-Variant:
◦ Data in the warehouse is only accurate and valid at some point in time or over
some time interval.
Non-volatile:
◦ Data in the warehouse is not updated in real-time but is refreshed from
operational systems on a regular basis.
Subject Oriented
Data is categorized and stored by business subject rather than
by application.

For example, to learn more about your company’s sales data, you might ask: "Who was our best customer for this item, in this region, last year?"
This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

[Diagram: operational systems (Customer, Financial, Product Information) feeding a data warehouse organized by subject area, e.g. Customer, with associated Region and Time data.]
Integrated

Data warehouses must put data from disparate sources into a consistent format.
Time Variant (time series)

Data is stored as a series of snapshots, each representing a period of time.

Time     Data
Jan/03   Data for January
Feb/03   Data for February
Mar/03   Data for March

(Each snapshot is loaded into the data warehouse.)
Non Volatile
• Typically, data in the data warehouse is not updated or deleted.
• Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
[Diagram: operational databases are modified with INSERT, UPDATE, and DELETE (and read); the warehouse database is loaded periodically and is otherwise read-only.]
Data Warehousing
Data Warehouse vs OLTP
Data Mart
A subset of a data warehouse that supports the requirements
of a particular department or business function.
Characteristics include
◦ Focuses only on the requirements of one department or business function.
◦ Does not normally contain detailed operational data, unlike a data warehouse.
◦ More easily understood and navigated.
Reasons for Creating Data Mart
◦ To give users access to the data they need to analyze most often.
◦ To improve end-user response time due to the reduction in the volume of data
to be accessed.
◦ Building a data mart is simpler compared with establishing a corporate data
warehouse.
◦ The cost of implementing data marts is normally less than that required to
establish a data warehouse.
Data Warehouse vs Data Mart
Data Warehouse Architectures
Generic Two-Level Architecture

[Diagram: Extract, Transform, and Load (E-T-L) processes feed a single, company-wide warehouse.]
Periodic extraction means the data is not completely current in the warehouse.
Data Warehouse Architectures
Data marts: Independent data marts are mini-warehouses, limited in scope.
[Diagram: each independent data mart has its own Extract, Transform, and Load processes.]
Separate ETL processes for each independent data mart lead to data access complexity due to the multiple data marts.
Data Warehouse Queries
The types of queries that a DW is expected to answer range from the relatively simple to the highly complex and depend on the type of end-user access tools used.
Examples:
◦ What was the total revenue for Scotland in the third quarter of
2004?
◦ What was the total revenue for property sales for each type of
property in Indonesia in 2003?
◦ What are the three most popular areas in each city for the
renting of property in 2004 and how does this compare with the
figures for the previous two years?
◦ What is the relationship between the total annual revenue
generated by each branch office and the total number of sales
staff assigned to each branch office?
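As a rough illustration, the last query above could be written in SQL against hypothetical Branch, PropertySale, and Staff tables (the table and column names here are assumptions for illustration, not part of any schema defined in these notes):

SELECT b.branchNo,
       -- total annual revenue generated by the branch office
       (SELECT SUM(s.saleAmount)
          FROM PropertySale s
         WHERE s.branchNo = b.branchNo
           AND s.yearMonth LIKE '2004-%') AS total_annual_revenue,
       -- total number of sales staff assigned to the branch office
       (SELECT COUNT(*)
          FROM Staff st
         WHERE st.branchNo = b.branchNo) AS sales_staff_count
  FROM Branch b;

Correlated subqueries are used so that the staff count does not inflate the revenue total, which would happen if both tables were simply joined to Branch.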
Modelling of Data Warehouses
Typically warehouse data is multidimensional, with very large fact tables
◦ Examples of dimensions: item-id, date/time of sale, store where sale was
made, customer identifier
◦ Examples of measures: number of items sold, price of items
Modeling data warehouses: dimensions & measures
◦ Star schema: A fact table in the middle connected to a set of dimension
tables
◦ Snowflake schema: A refinement of the star schema where some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
◦ Fact constellations: Multiple fact tables share dimension tables, viewed as
a collection of stars, therefore called galaxy schema or fact constellation
Illustration of Star Schema
[Diagram: a central fact table connected to several surrounding dimension tables.]
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Dimension tables:
◦ time: time_key, day, day_of_the_week, month, quarter, year
◦ item: item_key, item_name, brand, type, supplier_type
◦ branch: branch_key, branch_name, branch_type
◦ location: location_key, street, city, province_or_state, country
Measures: units_sold, dollars_sold, avg_sales
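As a minimal sketch, the star schema above could be declared in SQL roughly as follows; the table names and column data types are assumptions, not part of the original example:

-- Dimension tables: one row per dimension member
CREATE TABLE time_dim (
  time_key        INTEGER PRIMARY KEY,
  day             INTEGER,
  day_of_the_week VARCHAR(10),
  month           INTEGER,
  quarter         INTEGER,
  year            INTEGER
);

CREATE TABLE item_dim (
  item_key      INTEGER PRIMARY KEY,
  item_name     VARCHAR(50),
  brand         VARCHAR(50),
  type          VARCHAR(30),
  supplier_type VARCHAR(30)
);

CREATE TABLE branch_dim (
  branch_key  INTEGER PRIMARY KEY,
  branch_name VARCHAR(50),
  branch_type VARCHAR(30)
);

CREATE TABLE location_dim (
  location_key      INTEGER PRIMARY KEY,
  street            VARCHAR(50),
  city              VARCHAR(30),
  province_or_state VARCHAR(30),
  country           VARCHAR(30)
);

-- Fact table: a foreign key to every dimension plus the numeric measures
CREATE TABLE sales_fact (
  time_key     INTEGER REFERENCES time_dim,
  item_key     INTEGER REFERENCES item_dim,
  branch_key   INTEGER REFERENCES branch_dim,
  location_key INTEGER REFERENCES location_dim,
  units_sold   INTEGER,
  dollars_sold DECIMAL(12,2),
  avg_sales    DECIMAL(12,2)
);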
Illustration of Snowflake Schema
[Diagram: a central fact table connected to dimension tables, some of which are further normalized into additional tables.]
Example of Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Dimension tables:
◦ time: time_key, day, day_of_the_week, month, quarter, year
◦ item: item_key, item_name, brand, type, supplier_key
◦ supplier: supplier_key, supplier_type
◦ branch: branch_key, branch_name, branch_type
◦ location: location_key, street, city_key
◦ city: city_key, city, province_or_state, country
Measures: units_sold, dollars_sold, avg_sales
Here the item dimension is normalized into item and supplier tables, and the location dimension into location and city tables.
Illustration of Fact Constellations

[Diagram: multiple fact tables sharing a common set of dimension tables.]
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures: units_sold, dollars_sold, avg_sales)
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped (measures: dollars_cost, units_shipped)
Dimension tables:
◦ time: time_key, day, day_of_the_week, month, quarter, year
◦ item: item_key, item_name, brand, type, supplier_type
◦ branch: branch_key, branch_name, branch_type
◦ location: location_key, street, city, province_or_state, country
◦ shipper: shipper_key, shipper_name, location_key, shipper_type
The two fact tables share the time, item, and location dimensions.
Star Schema Example

Business Intelligence Technologies

 Accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities.
 There are two main types of access tools available to meet this demand, namely Online Analytical Processing (OLAP) and data mining.
 OLAP and data mining differ in what they offer the user, and because of this they are complementary technologies.
 An environment that includes a data warehouse (or, more commonly, one or more data marts) together with tools such as OLAP and/or data mining is collectively referred to as Business Intelligence (BI) technology.
Data Analysis and OLAP
Aggregate functions summarize large volumes of data
Online Analytical Processing (OLAP)
◦ Interactive analysis of data, allowing data to be summarized and viewed in
different ways in an online fashion (with negligible delay)
OLAP enables users to gain a deeper understanding and knowledge about
various aspects of their corporate data through fast, consistent, interactive
access to a wide variety of possible views of the data.
Types of analysis range from basic navigation and browsing (slicing and dicing), to calculations, to more complex analyses such as time series and complex modeling.

Representation of Multi-dimensional Data
Example of two-dimensional query.
◦ 'What is the total revenue generated by property sales in each city, in each quarter of 2004?'
Choice of representation is based on the types of queries the end-user may ask.
Compare the representations: a 3-field relational table versus a 2-dimensional matrix.
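A minimal SQL sketch of the two-dimensional query, assuming a simplified PropertySale table with city, quarter, year, and saleAmount columns (an assumption for illustration only):

-- Each result row (city, quarter, total_revenue) corresponds to one cell
-- of the two-dimensional city x quarter matrix.
SELECT city, quarter, SUM(saleAmount) AS total_revenue
  FROM PropertySale
 WHERE year = 2004
 GROUP BY city, quarter;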
Representation of Multi-dimensional Data
Example of three-dimensional query.
◦ ‘What is the total revenue generated by property sales for each type of
property (Flat or House) in each city, in each quarter of 2004?’
Compare the representations: a four-field relational table versus a three-dimensional cube.
Cross Tabulation of sales by item-name and color

The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
A cross-tab is a table where
◦ Values for one of the dimension attributes form the row
headers, values for another dimension attribute form the column
headers
◦ Other dimension attributes are listed on top
◦ Values in individual cells are (aggregates of) the values of the
dimension attributes that specify the cell.
Relational Representation of Crosstabs

 Crosstabs can be
represented as
relations
 The value all is used to
represent aggregates
 The SQL:1999 standard
actually uses null values in
place of all

Three-Dimensional Data Cube
 A data cube is a multidimensional generalization of a crosstab
 A three-dimensional object cannot be viewed in its entirety, but crosstabs can be used as views on a data cube

Online Analytical Processing (OLAP)

The operation of changing the dimensions used in a cross-tab is called pivoting.
Suppose an analyst wishes to see a cross-tab on item-name and
color for a fixed value of size, for example, large, instead of the
sum across all sizes.
◦ Such an operation is referred to as slicing.
◦ The operation is sometimes called dicing, particularly when values for multiple
dimensions are fixed.
The operation of moving from finer-granularity data to a coarser granularity is called a rollup.
The opposite operation, that of moving from coarser-granularity data to finer-granularity data, is called a drill down.
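As an illustrative sketch only, the slicing operation described above could be written against the sales relation used in the CUBE examples later in this section (hyphens in the attribute names are replaced with underscores to form valid identifiers):

-- Cross-tab of item_name and color restricted to the fixed size 'large'
SELECT item_name, color, SUM(number) AS total
  FROM sales
 WHERE size = 'large'
 GROUP BY item_name, color;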
Online Analytical Processing (OLAP)
Slicing a Data Cube

Online Analytical Processing (OLAP)
Example of drill down: starting with a summary report, users can obtain details for particular cells, for example a drill-down with color added.
Hierarchies on Dimensions
 Hierarchy on dimension attributes: allows dimensions to be viewed at different levels of detail
 E.g. the DateTime dimension can be used to aggregate by hour of day, date, day of week, month, quarter, or year
Extended Aggregation
Advantages of SQL include that it is easy to learn, non-
procedural, free-format, DBMS-independent, and that it is
a recognized international standard.
However, a major limitation of SQL (SQL-92) is its inability to answer routinely asked business queries, such as computing the percentage change in values between this month and a year ago, or computing moving averages, cumulative sums, and other statistical functions.
SQL:1999 OLAP extensions provide a variety of
aggregation functions to address some limitations
◦ Supported by several databases, including Oracle and IBM DB2
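As a hedged sketch of what these extensions look like, the following query uses SQL:1999-style window functions over the PropertySale table from the examples below (assuming yearMonth and saleAmount columns) to compute a cumulative sum and a three-month moving average of monthly sales:

SELECT yearMonth,
       SUM(saleAmount) AS monthly_sales,
       -- running (cumulative) total of sales, ordered by month
       SUM(SUM(saleAmount)) OVER (ORDER BY yearMonth) AS cumulative_sales,
       -- moving average over the current month and the two preceding months
       AVG(SUM(saleAmount)) OVER (ORDER BY yearMonth
             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3_months
  FROM PropertySale
 GROUP BY yearMonth;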
Extended Aggregation in SQL:1999 – CUBE
The cube operation computes the union of group bys on every subset of the specified attributes
E.g. consider the query
select item-name, color, size, sum(number)
from sales
group by cube(item-name, color, size)
This computes the union of eight different groupings of the sales relation:
{ (item-name, color, size), (item-name, color),
(item-name, size), (color, size),
(item-name), (color),
(size), ()}
where ( ) denotes an empty group by list.
For each grouping, the result contains the null value for attributes not
present in the grouping.
Extended Aggregation – CUBE (Cont.)
Show all possible subtotals for sales of properties by branch offices in Aberdeen, Edinburgh, and Glasgow for the months of August and September of 2004.
SELECT propertyType, yearMonth, city, SUM(saleAmount) AS sales
FROM Branch, PropertyForSale, PropertySale
WHERE Branch.branchNo = PropertySale.branchNo
AND PropertyForSale.propertyNo = PropertySale.propertyNo
AND PropertySale.yearMonth IN ('2004-08', '2004-09')
AND Branch.city IN ('Aberdeen', 'Edinburgh', 'Glasgow')
GROUP BY CUBE(propertyType, yearMonth, city);
Extended Aggregation – CUBE (Cont.)

Extended Aggregation – ROLLUP (Cont.)
The rollup construct generates a union on every prefix of the specified list of attributes
E.g.
select item-name, color, size, sum(number)
from sales
group by rollup(item-name, color, size)
◦ Generates the union of four groupings:
{ (item-name, color, size), (item-name, color), (item-name), ( ) }
ROLLUP supports calculations using aggregations such as
SUM, COUNT, MAX, MIN, and AVG at increasing levels of
aggregation, from the most detailed up to a grand total
Extended Aggregation – ROLLUP (Cont.)
Show the totals for sales of flats or houses by
branch offices located in Aberdeen, Edinburgh, or
Glasgow for the months of August and September
of 2004.
SELECT propertyType, yearMonth, city, SUM(saleAmount) AS sales
FROM Branch, PropertyForSale, PropertySale
WHERE Branch.branchNo = PropertySale.branchNo
AND PropertyForSale.propertyNo = PropertySale.propertyNo
AND PropertySale.yearMonth IN ('2004-08', '2004-09')
AND Branch.city IN ('Aberdeen', 'Edinburgh', 'Glasgow')
GROUP BY ROLLUP(propertyType, yearMonth, city);
Extended Aggregation – ROLLUP (Cont.)

Elementary OLAP Operators
SQL OLAP extensions also support a variety of operations such as rankings and window calculations.
Windowing allows the calculation of cumulative and moving aggregates using functions such as SUM, AVG, MIN, and COUNT.
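A small illustrative sketch of a ranking query, again assuming the PropertySale table used in the earlier examples (branchNo and saleAmount columns):

-- Rank branch offices by their total sales, highest first
SELECT branchNo,
       SUM(saleAmount) AS total_sales,
       RANK() OVER (ORDER BY SUM(saleAmount) DESC) AS sales_rank
  FROM PropertySale
 GROUP BY branchNo;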
DMBOK
Goals
 The primary driver: to support operational
functions, compliance requirements, and
Business Intelligence (BI) activities
 Organizations implement data warehouses in
order to:
 Support Business Intelligence activity
 Enable effective business analysis and decision-
making
 Find ways to innovate based on insights from
their data
Principles
 Focus on business goals
 Start with the end in mind
 Think and design globally; act and build
locally
 Summarize and optimize last, not first
 Promote transparency and self-service
 Build Metadata with the warehouse
 Collaborate
 One size does not fit all
Activities
1. Understand requirements
2. Define and maintain the DW/BI architecture
① Define DW/BI technical architecture
② Define DW/BI management process
3. Develop the data warehouse and data marts
① Map sources to targets
② Remediate and transform data
4. Populate the data warehouse
5. Implement the business intelligence portfolio
① Group users according to needs
② Match tools to user requirements
6. Maintain data products
① Release management
② Manage data product development lifecycle
③ Monitor and tune load processes
④ Monitor and tune BI activity and performance
(1) Understand Requirements
 Business goals and strategy
 Identify and scope the business areas, then identify
and interview the appropriate business people
 Document how they distinguish between and
categorize important aspects of the information
 Define and capture key performance metrics and
calculations
 Catalog requirements and prioritize them into those
necessary for production go-live and adoption of the
warehouse
Understand Business Intelligence
Information Needs
 All projects start with requirements
 Gathering requirements for DW-BIM projects has both similarities to and
differences from gathering requirements for other projects
 For DW-BIM projects, it is important to understand the broader business context of the business area targeted, as reporting is generalised and exploratory
 Capturing the actual business vocabulary and terminology is a key to success
 Document the business context, then explore the details of the actual source data
 Typically, the ETL portion can consume 60%-70% of a DW-BIM project’s budget
and time
 The DW is often the first place where the pain of poor quality data in source
systems and / or data entry functions becomes apparent
 Creating an executive summary of the identified business intelligence needs is a
best practice
 When starting a DW-BIM program, a good way to decide where to start is a simple assessment of business impact and technical feasibility
 Technical feasibility will take into consideration things like complexity,
availability and state of the data, and the availability of subject matter
experts
 Projects that have high business impact and high technical feasibility are
good candidates for starting.
Data Warehousing and Business
Intelligence Management
Together, Data Warehousing and Business Intelligence Management is the collection, integration, and presentation of data to knowledge workers for the purpose of business analysis and decision-making.
It is composed of activities supporting all phases of the decision support life cycle that:
◦ Provide context
◦ Move and transform data from sources to a common target data store
◦ Provide knowledge workers with various means of access and manipulation
◦ Report on the integrated target data
(2) Define and Maintain The DW-BI
Architecture
Successful DW-BIM architecture requires the identification and bringing
together of a number of key roles
◦ Technical Architect - hardware, operating systems, databases, and DW-BIM architecture
◦ Data Architect - data analysis, systems of record, data modeling, and data mapping
◦ ETL Architect / Design Lead - staging and transform, data marts, and schedules
◦ Metadata Specialist - metadata interfaces, metadata architecture and contents
◦ BI Application Architect / Design Lead - BI tool interfaces and report design, metadata delivery, data and report navigation and delivery
Technical requirements including performance, availability, and timing
needs are key drivers in developing the DW-BIM architecture
The design decisions and principles for what data detail the DW contains are a key design priority for the DW-BIM architecture.
It is important that the DW-BIM architecture integrates with the overall corporate reporting architecture.
(4) Populate the Data Warehouse
 The key factors to consider when defining a
population approach are required latency,
availability of sources, batch windows or upload
intervals, target databases, dimensional aspects, and
timeframe consistency of the data warehouse and
data mart.
 Address data quality processing, time to perform
transformations, and late-arriving dimensions and
data rejects
 Change data capture processes involve detecting changes in the source system, integrating those changes together, and aligning changes across time
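A minimal sketch of one way to apply captured changes during population, assuming a hypothetical staging table stg_customer holding changed source rows and a warehouse dimension table dim_customer (a simple upsert via the SQL MERGE statement, not a full slowly-changing-dimension load):

MERGE INTO dim_customer d
USING stg_customer s
   ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
  -- apply changes captured from the source system
  UPDATE SET d.name = s.name,
             d.city = s.city,
             d.last_updated = s.change_date
WHEN NOT MATCHED THEN
  -- insert rows that are new since the last load
  INSERT (customer_id, name, city, last_updated)
  VALUES (s.customer_id, s.name, s.city, s.change_date);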
Query and Reporting Tools Landscapes
Implementing Management
Dashboards and Scorecards
 Dashboards and scorecards are both ways of
efficiently presenting performance information
 Dashboards are oriented more toward dynamic
presentation of operational information while
scorecards are more static representations of
longer-term organizational, tactical, or strategic
goals
 Typically, scorecards are divided into 4 quadrants
or views of the organization such as Finance,
Customer, Environment, and Employees, each
with a number of metrics
Monitor and Tune Data Warehousing
Processes
 Processing should be monitored across the system
for bottlenecks and dependencies among processes
 Database tuning techniques should be employed
where and when needed, including partitioning,
tuned backup and recovery strategies
 Archiving is a difficult subject in data warehousing
 Users often consider the data warehouse to be an active archive due to the long histories that are built, and are unwilling to see the data warehouse engage in archiving, particularly if the OLTP source systems have dropped records
Monitor and Tune BI Activity and
Performance
 A best practice for BI monitoring and tuning is to define and display a set of customer-facing satisfaction metrics
 Average query response time and the number of users
per day / week / month, are examples of useful metrics
to display
 Regular review of usage statistics and patterns is essential
 Reports providing frequency and resource usage of data,
queries, and reports allow prudent enhancement
 Tuning BI activity is analogous to the principle of
profiling applications in order to know where the
bottlenecks are and where to apply optimization efforts
References
Chapter 22, Database System Concepts, 4th ed., Silberschatz, Korth, and Sudarshan, McGraw-Hill, 2002.
Chapter 11, Modern Database Management, 7th ed., Hoffer, Prescott, and McFadden, 2005.
Chapters 31 and 33, Database Systems, 5th ed., Connolly and Begg, 2004.
Structured and Comprehensive Approach to Data Management and the Data Management Body of Knowledge (DMBOK), Alan McSweeney.
