
Agile Data Warehouse Design with Big Data

John DiPietro & Jim Stagnitto


!1
Agenda
• Introduction / a2c Overview
• Modeling for End Users
• Role of Dimensional Models in Big Data
• Example: eCommerce
• Structured Data: Sales
• Semi-structured Data: Clickstream

• Agile Dimensional Modeling Overview


• Case Study Review
• Q&A

!2
Introduction
• a2c
• Boutique EDM (Enterprise Data Management)
consultancy firm:
• Data Warehousing
• Master Data Management
• Closed Loop Analytics and Visualization
• Data & Application Architecture
• John DiPietro
• Principal, Chief Technology Officer
• Jim Stagnitto
• Data Warehouse & MDM Architect

!3
a2c Corporate Overview
& Industry Experience

!4
Company Overview
• Technology Solution Consultancy headquartered in Philadelphia with
regional offices in New York and Boston
• Serving the Healthcare, Life Science, Telecom and Financial Services
industries; our recently obtained GSA schedule lets us pursue Federal
Government opportunities
• Consultant base of over 2500 proven IT professionals throughout the
Northeast region, with a recruiting network that provides national coverage
• Flexible approach to helping our clients with their initiatives
• Project-based Solutions
• Staff Augmentation
• Managed Service Offerings – “On-Shore QA , Development & Application Support”
• Executive & Professional Search

!5
Competitive Advantage
• Founders of a2c were part of the fastest growing privately held IT consulting and staff
augmentation firm in the US from 1994-2002. Our Executive Management Team has over
100 years of collective experience and has been responsible for delivering over a half-billion dollars
of IT consulting and staff augmentation revenue from 1994 through to the present day.

• a2c’s Recruiting Engine and Methodology is one of the best in the industry, capable of
producing quality results, on-demand for our clients

• Resource Managers continually “Silo” disciplines with available candidates who have
proven their abilities with us over the last 10 years

• Our solutions organization is instrumentally involved during the screening and selection
process to ensure that candidates submitted to our clients are an ideal match

• a2c’s Culture provides an ability to attract and retain the best talent in the industry and fosters
creativity, integrity, growth and teamwork

• a2c provides our clients with an alternative solution to a “Big 4” consultancy at substantial
savings for projects that are between $500K and $5M due to our flexibility, agility and focus

!6
Representative Clients

03/19/12
!7
a2c Solution Engagement Structures

• Technology Strategy & Roadmap Formulation


• Needs & Readiness Assessment
• Package & Platform Selections
• Proof of Concept Implementation
• Requirements Discovery & Specifications
• Program/Project Management
• Full Life Cycle & Application Development
• Infrastructure & Facilities Initiatives
• Managed Services & Maintenance Support

!8
a2c Solutions Capabilities
• Enterprise Data Management Practice helps clients manage their complete Information
Lifecycle from their On-line Transactional systems to their Data Warehousing, Enterprise
Reporting, Data Migration, Back-Up and Recovery Strategies (See Slide 7)
• Business Architecture & Optimization Practice utilizes “Six Sigma Lean” methodologies to
analyze, re-engineer and automate our clients’ business processes to leverage human
workflow and business rules engine technologies to create efficiencies and provide
business unit owners with the necessary metrics to continually improve performance
• Program Management Office oversees all aspects of solutions planning and delivery
across client engagement teams and provides the methodology and frameworks which
are based on PMI® industry standards
• Application Development & Managed Services Practice helps clients architect, implement
and deploy the latest Microsoft and Enterprise Java based applications which are built on
proven frameworks and architectures for the enterprise
• a2c's SDLC Delivery Model draws on over 20 years of collective best practices and
industry proven methodologies that allow our delivery teams to rapidly design, develop
and implement solutions. Our SDLC model has been designed to complement our project
management methodology, utilizing iterative development cycles that enable project
teams to provide consistently high quality, on-time deliverables, regardless of technology
platform

!9
Agile DW Design
Overview

!10
Modeling for End Users
• How to Design to Answer Business Questions?
• Think about how questions are articulated
• And how the answers should be delivered
• Identify a common question framework
• Design an architecture that embraces and leverages this common question framework
• Utilize the best designs and technologies to:
• (a) derive the answers
• (b) present them in compelling ways that lead to the next interesting question!

!11
How Do We Ask Questions?
When What Who

“How do this quarter’s sales by sales rep of electronic products that we promoted to retail
customers in the east compare with last year’s?”

What Who Where Why When

!12
How Do We Ask Questions?
• Events / Transactions
• e.g. Sale
• an immutable "fact" that occurs at a time and (typically) a place

• Interrogatives:
• Who, What, When, Where, Why
• Descriptive context that fully describes the event
• a set of "dimensions" that describe events
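The event-plus-interrogatives idea can be sketched in a few lines of Python. This is only an illustration: the `SaleEvent` class, its field names, and the sample rows are hypothetical and do not come from the presentation. Each field is prefixed with the interrogative it answers, and the quarterly sales question from the earlier slide becomes a filter on the dimensions plus a sum of the "how many" value.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: an event is an immutable fact
class SaleEvent:
    who_sales_rep: str
    who_customer_type: str
    what_product_category: str
    when_sale_date: date
    where_region: str
    why_promotion: str
    how_many_revenue: float


# Hypothetical example rows, invented for illustration only.
SALES = [
    SaleEvent("Ann", "Retail", "Electronics", date(2012, 2, 10), "East", "Spring Promo", 1200.0),
    SaleEvent("Ann", "Retail", "Electronics", date(2011, 2, 14), "East", "Spring Promo", 900.0),
    SaleEvent("Bob", "Retail", "Electronics", date(2012, 3, 1), "East", "Spring Promo", 700.0),
    SaleEvent("Bob", "Wholesale", "Electronics", date(2012, 3, 2), "East", "None", 5000.0),
    SaleEvent("Ann", "Retail", "Furniture", date(2012, 1, 5), "West", "Spring Promo", 300.0),
]


def quarter(d):
    """Return (year, quarter number) for a date."""
    return (d.year, (d.month - 1) // 3 + 1)


def revenue_by_rep(events, year_quarter):
    """Sum revenue per sales rep for promoted electronics sold to retail customers in the East."""
    totals = defaultdict(float)
    for e in events:
        if (e.what_product_category == "Electronics"
                and e.who_customer_type == "Retail"
                and e.where_region == "East"
                and e.why_promotion != "None"
                and quarter(e.when_sale_date) == year_quarter):
            totals[e.who_sales_rep] += e.how_many_revenue
    return dict(totals)


this_quarter = revenue_by_rep(SALES, (2012, 1))
last_year = revenue_by_rep(SALES, (2011, 1))
print(this_quarter, last_year)
```

Every clause of the business question maps onto one field, which is the point of presenting data in the same shape people use when asking questions.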

!13
Dimensional Value Proposition
• It makes sense to present answers to people using the same
taxonomy of events and interrogatives (aka: facts and dimensions
- dimensional structure) that they use when forming questions
• Events are instances of processes
• It’s best to present information to people who will ask the system
questions in dimensional form
• This is true regardless of the type of information being
interrogated, its source, or the IT behind it (such as the database
technologies used)
• It’s best to model this presentation layer based on the events (aka:
business processes) that underlie the questions

!14
[Diagram: Who, What, When, Where, Why and How arranged around How Many]

!15
Scenarios

• A brief discussion of how and where dimensional modeling and/or
databases fit within common and emerging “big data” data
warehousing architectures

!16
Kimball Dimensional DW
Dimensional BI Semantic Layer

Dimensional Data Warehouse

Data Movement / Integration

Source Data
(Structured)

!17
Kimball with Big Data
Dimensional BI Semantic Layer

Dimensional Data Warehouse

Big Data Capture (e.g. HDFS) | Big Data Discovery (e.g. MR)

Data Movement / Integration Tier | Data Movement / Integration Tier

Source Data Tier (Un/Semi-Structured) | Source Data Tier (Structured)

!18
Corporate Information Factory (CIF)

Dimensional BI Semantic Layer

Dimensional Tier
(Virtual or Physical)

Corporate Information Factory 3NF DW

Data Movement / Integration

Source Data
(Structured)

!19
CIF with Big Data
Dimensional BI Semantic Layer

Dimensional Tier
(Virtual or Physical)

Big Data Capture (e.g. HDFS) | Big Data Discovery (e.g. MR) | Corporate Information Factory 3NF DW

Data Movement / Integration Tier | Data Movement / Integration Tier

Source Data Tier (Un/Semi-Structured) | Source Data Tier (Structured)

!20
Data Vault
Dimensional BI Semantic Layer

Dimensional Tier
(Virtual or Physical)

Data Vault

Data Movement / Integration

Source Data
(Structured)

!21
Data Vault with Big Data
Dimensional BI Semantic Layer

Dimensional Tier
(Virtual or Physical)

Big Data Capture (e.g. HDFS) | Big Data Discovery (e.g. MR) | Data Vault

Data Movement / Integration Tier | Data Movement / Integration Tier

Source Data Tier (Un/Semi-Structured) | Source Data Tier (Structured)

!22
Etc.

!23
Common Framework
Dimensional BI Semantic Layer

Dimensional Tier
[Physical (Kimball) or Virtual (CIF or Data Vault)]

Persistent Un/Semi-Structured Staging Area | Unstructured -> Structured Data Discovery Processing | Persistent Structured Data Repository (not needed for Kimball) | Insight Generation / Data Mining

Un/Semi-Structured Data Movement | Structured Data Movement

Un/Semi-Structured Source Data | Structured Source Data
!24
Common Framework
Dining Room
• Readily Accessible to End Users (and BI Developers)
• Safe, Hospitable Environment
• Data Assets “Ready for Primetime”
• Dimensionally Structured
• Layers: Dimensional BI Semantic Layer; Dimensional Tier [Physical (Kimball) or Virtual (CIF or Data Vault)]

Kitchen
• Off Limits to End Users
• Data Professionals Only Please
• Dangerous / Inhospitable Environment
• Data Assets “Not Ready for Primetime”
• Structured Variably For Data Processing
• Layers: Persistent Un/Semi-Structured Staging Area; Unstructured -> Structured Data Discovery Processing; Persistent Structured Data Repository (not needed for Kimball); Un/Semi-Structured Data Movement; Structured Data Movement; Un/Semi-Structured Source Data; Structured Source Data

eCommerce Example: Clickstream Data, eCommerce Sale

!25
eCommerce Example: Clickstream
Raw Clickstream Data!
• Semi-Structured
• Recording of every page request made by a user
• Includes some structural elements – such as when the request was made and who the user is
• Requires significant prep work in order to fit into a traditional row-based relational database
• Apples and Oranges: Pre-Sessionized Page Visits, Detailed Product Views, Catalogue Requests, Shopping Cart Adds / Deletes / Abandons, etc.
• Needs to be converted into separate-but-relatable dimensional facts - with many shared (conformed) dimensions

Sample of the raw data (first rows):
25 52 164 240 274 328 368 448 538 561 630 687 730 775 825 834
39 120 124 205 401 581 704 814 825 834
35 249 674 712 733 759 854 950
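As a rough illustration of the prep work involved, the sketch below turns raw page requests into sessionized page-view rows. It is not the presenters' method: the `(visitor, timestamp, page)` input layout, the `PageView` fields, and the 30-minute inactivity timeout are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: a gap of more than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


@dataclass
class PageView:
    visitor: str                  # who
    session_no: int               # which visit by this visitor
    page: str                     # where (current page)
    previous_page: Optional[str]  # where (previous page in the same session)
    view_start: datetime          # when


def sessionize(requests):
    """Convert (visitor, timestamp, page) tuples into PageView rows."""
    views = []
    last_seen = {}  # visitor -> (timestamp, page, session_no)
    for visitor, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(visitor)
        if prev is None:
            session_no, previous_page = 1, None
        elif ts - prev[0] > SESSION_TIMEOUT:
            session_no, previous_page = prev[2] + 1, None
        else:
            session_no, previous_page = prev[2], prev[1]
        views.append(PageView(visitor, session_no, page, previous_page, ts))
        last_seen[visitor] = (ts, page, session_no)
    return views


# Hypothetical requests, invented for illustration only.
t0 = datetime(2012, 3, 19, 9, 0)
raw = [
    ("v1", t0, "/home"),
    ("v1", t0 + timedelta(minutes=5), "/product/42"),
    ("v1", t0 + timedelta(hours=2), "/home"),
    ("v2", t0 + timedelta(minutes=1), "/catalog"),
]
views = sessionize(raw)
for v in views:
    print(v)
```

The resulting rows already carry who, when and where columns, so they can share visitor, time and page dimensions with other facts.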
!26
Typical Clickstream “Page View” Dimensional
Model

When What

What

Why Who

!27
eCommerce Example: Web Sales

• Fully Structured
• The Sale Transaction typically carries all fundamental dimensions:
• Time
• Customer
• Referring URL / Search Phrase
• Product
• Purchase and/or Shipment (Geo or URL) Locations
• Promotion / Campaign
• Etc.
• And “How Many” Measures
• Unit and Price Quantities / Amounts
• Discount Amounts
• Etc.

!28
eCommerce Dimensionality
Facts (below) & Dimensions (right): Time (When), Customer (Who), Web Page (Where), Referring URL (Where), Product (What), Promotion / Campaign (Why), Activity Type (How)

Page Visit
• Time: View Start, View End, Session Start, Session End
• Customer: Visitor
• Web Page: Current, Previous, Next
• Other dimensions: ✔

Detailed Product View
• Time: View Start, View End, Session Start, Session End
• Customer: Prospect
• Web Page: Current, Previous, Next
• Other dimensions: ✔ ✔

Shopping Cart Activity
• Time: Activity Start, Activity End
• Customer: Prospect
• Other dimensions: ✔ ✔ ✔ ✔

Sale (Checkout)
• Time: Sale Start, Sale End
• Customer: Customer
• Other dimensions: ✔ ✔ ✔ ✔

Shipment / Delivery
• Time: Shipment, Delivery
• Customer: Customer, Delivery Recipient
• Other dimensions: ✔

!29
Agile DW Design
Overview

!30
The first dimensional modeler: R.K.

Ralph Kimball? Rudyard Kipling

!31
I keep six honest serving-men

(They taught me all I knew);

Their names are What and Why and When 

And How and Where and Who…

–Rudyard Kipling

!32
Who
!33
What
!34
When
!35
Where
!36
Why
!37
How
!38
How Many
!39
The 7Ws Framework
[Diagram: Who, What, When, Where, Why and How arranged around How Many]
How did we get here?
DW Architectures: A Brief History

• Corporate Information Factory: Data-Driven Analysis
• Undisciplined Dimensional: Report-Driven Analysis
• Dimensional Bus Architecture: Process-Driven Analysis
7Ws Dimensional Model
• When: Time, Day, Month, Fiscal Period
• Who: Customer, Employee, Third Party, Organization
• Where: Location, Geographic, Store, Ship To, Hospital
• What: Product, Service
• Why: Causal, Promotion, Reason, Weather, Competition
• ??: Transactions
• How (Much, Many, Often) – Facts: £$€
BEAM: Business Event Analysis & Modeling
[Diagram: Who, What, When, Where, Why and How arranged around How Many]
How do you design a data warehouse?
Tech Design Artifacts?
CALENDAR: Date Key, Date, Day, Day in Week, Day in Month, Day in Qtr, Day in Year, Month, Qtr, Year, Weekday Flag, Holiday Flag
PRODUCT: Product Key, Product Code, Product Description, Product Type, Brand, Subcategory, Category
SALES FACT: Date Key, Product Key, Store Key, Promotion Key, Quantity Sold, Revenue, Cost, Basket Count
STORE: Store Key, Store Code, Store Name, URL, Store Manager, Region, Country
PROMOTION: Promotion Key, Promotion Code, Promotion Name, Promotion Type, Discount Type, Ad Type
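For readers who want to see such a star schema in running form, here is a minimal sketch using Python's built-in `sqlite3` module. It keeps only a subset of the columns named on the slide, the snake_case column names are my own rendering, and every data row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (subset of the slide's columns)
CREATE TABLE calendar  (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE product   (product_key INTEGER PRIMARY KEY, product_code TEXT, brand TEXT, category TEXT);
CREATE TABLE store     (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT, country TEXT);
CREATE TABLE promotion (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT, promotion_type TEXT);

-- Fact table: one foreign key per dimension plus the measures
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES calendar(date_key),
    product_key   INTEGER REFERENCES product(product_key),
    store_key     INTEGER REFERENCES store(store_key),
    promotion_key INTEGER REFERENCES promotion(promotion_key),
    quantity_sold INTEGER,
    revenue       REAL,
    cost          REAL
);

-- Hypothetical rows, invented for illustration only
INSERT INTO calendar  VALUES (1, '2012-03-19', 'March', 2012), (2, '2012-04-02', 'April', 2012);
INSERT INTO product   VALUES (1, 'P-1', 'Acme', 'Electronics'), (2, 'P-2', 'Acme', 'Furniture');
INSERT INTO store     VALUES (1, 'Web Store', 'East', 'USA');
INSERT INTO promotion VALUES (1, 'Spring Promo', 'Email');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 2, 200.0, 120.0),
    (1, 2, 1, 1, 1,  50.0,  30.0),
    (2, 1, 1, 1, 3, 300.0, 180.0);
""")

# Revenue by product category and month: join the fact to two dimensions.
rows = conn.execute("""
    SELECT p.category, c.month, SUM(f.revenue)
    FROM sales_fact f
    JOIN product  p ON p.product_key = f.product_key
    JOIN calendar c ON c.date_key = f.date_key
    GROUP BY p.category, c.month
    ORDER BY p.category, c.month
""").fetchall()
print(rows)
```

A script like this is exactly the kind of technical artifact that is easy for developers to read and hard for business stakeholders to validate.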
OK, Now Validate with
Why Agile Data Warehousing?
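Waterfall BI/DW
• Limited Stakeholder interaction
• BDUF (Big Design Up Front)
• Phases, running from This Year into Next Year: Analysis, Design, Development, Test, Release
• Inputs: Stakeholder Input, DATA
• Deliverables in sequence: Requirements, Data Model, ETL, BI, then VALUE?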
Agile DW/BI Development
• Stakeholder interaction
• JEDUF
• Each iteration: ADM, ETL, BI, Prototyping, Review, Release
• Iteration 1, Iteration 2, Iteration 3, … Iteration n, running from This Year into Next Year
• VALUE? VALUE VALUE! VALUE! VALUE!
• DATA
State of The DW Field
Solid:
• Dimensional Data Warehouse Design is Mature
• Proven Design Patterns Exist for Common Requirements
Hit or Miss:
• Collecting Unambiguous and Thorough Requirements
• Slotting Requirements into Proven Design Patterns
• End-User Ownership and Validation
Too Often: Snatching Defeat from the Jaws of Victory

!52
Modelstorming
• Quick, Inclusive, Interactive, Fun
• Data Modeler and BI Stakeholders
BEAM✲ Methodology
Structured, non-technical, collaborative working conversation directly with BI Users

Participants: Data Modeler, BI Stakeholders

Inputs to BEAM✲:
• BI User’s Business Process, Organizational, Hierarchical, and Data Knowledge
• Focused Data Profiling

Outputs from BEAM✲:
• Logical and Physical (Kimball-esque) Dimensional Data Models
• Example data
• Detailed and Testable ETL Specification
• Instantiated DW Prototype

Requirements = Design
55
Collaboration at Every
Step
Agile Data Modeling Requirements

• Techniques for encouraging interaction


• Must use simple, inclusive notation and tools
• Must be quick: hours rather than days – modelstorming
• Balance ‘just in time’ (JIT) and ‘just enough design up
front’ (JEDUF) to reduce design rework
• DW designers must embrace data model change, allow models
to evolve, avoid generic data models; need design patterns they
can trust to represent tomorrow’s BI requirements tomorrow
• ETL and BI developers must embrace database change; need
tool support

!57
What kind of model?
[Diagram: star schema with SALES FACT (Date Key, Product Key, Store Key, Promotion Key, Quantity Sold, Revenue, Cost, Basket Count) surrounded by the CALENDAR, PRODUCT, STORE and PROMOTION dimension tables]
[Diagram: Sales Fact linked to Calendar (Month, Holiday Type), Customer (Country, Customer Type), Store (City, Store Type) and Product (Category, Product Type)]

Modeling by Abstraction
Modeling by Example
Agile DW Design
Process

64
Collaborative / Conversational Design

BEAM✲ Modeler: “Who does what?”

BI Users: “Customers buy products” (Subjects, Verb, Objects)
Design Using Natural Language

• Verbs – Events – Relationships – Fact Tables


• Nouns – Details – Entities – Dimensions
• Main Clause – Subject-Verb-Object
• Prepositions – connect additional details to the
main clause
• Interrogatives – The 7Ws – Dimension Types
• Business Vocabulary - no IT-Speak
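The mapping from sentence parts to design elements can be sketched as a small Python class. This is an illustrative toy, not part of BEAM✲ itself: the `Event` class and its methods are hypothetical, and treating the subject as a "who" and the object as a "what" is an assumption that happens to fit "Customers buy products".

```python
from dataclasses import dataclass, field

SEVEN_WS = ("who", "what", "when", "where", "why", "how", "how many")


@dataclass
class Event:
    subject: str   # noun, becomes a dimension
    verb: str      # the event itself, becomes the fact table
    obj: str       # noun, becomes a dimension
    details: list = field(default_factory=list)  # (preposition, noun, w_type)

    def add_detail(self, preposition, noun, w_type):
        """Attach a further detail to the main clause via a preposition."""
        if w_type not in SEVEN_WS:
            raise ValueError(f"unknown interrogative: {w_type}")
        self.details.append((preposition, noun, w_type))
        return self

    def sentence(self):
        """Read the event back in business vocabulary."""
        parts = [self.subject, self.verb, self.obj]
        parts += [f"{prep} {noun}" for prep, noun, _ in self.details]
        return " ".join(parts)

    def dimensions(self):
        # Assumption: subject is a "who" and object is a "what".
        dims = [(self.subject, "who"), (self.obj, "what")]
        dims += [(n, w) for _, n, w in self.details if w != "how many"]
        return dims

    def measures(self):
        return [n for _, n, w in self.details if w == "how many"]


event = (Event("Customers", "buy", "Products")
         .add_detail("on", "Order Date", "when")
         .add_detail("at", "Store", "where")
         .add_detail("for", "Revenue", "how many"))
print(event.sentence())
print(event.dimensions(), event.measures())
```

Reading the sentence back to BI users is the validation step; the dimension and measure lists are what the modeler carries into the star schema.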

!66
“Spreadsheet”-like Models
Event Table Name (filled in later)

Subject Column Name

Verb

Object Column Name

Interrogative

Details

Example Data (4-6 rows)
Straightforward Methodology
• Column types, in order: Who, What, When, Where, How (many), Why, How
• Steps and checks: Subject-Verb-Object; Declare Event Type; Initial Data Examples; Quantities - Facts; Sufficient Detail / Fact Granularity
Capture Example Data
SUBJECT [who] | verb | OBJECT [what] | on/at/every EVENT DATE [when] | [where] | [how many] | [why] | [how]

Typical Typical/Popular Typical Typical Typical/Average Typical/Normal Typical/Normal

Different Different Different Different Different Different Different


Repeat Repeat Repeat Repeat Repeat Repeat Repeat
Missing Missing Missing Missing Missing Missing Missing

Group Multiple/Bundle Multi-Level Multiple Values

Old, Low Old, Low Value Oldest needed Near Min, Negative, 0

New, High New, High Most Recent, Future Far Max, Precision Exceptional Exceptional

Engage business users

Clarify definitions / Conform Dimensions

Illustrate exceptions

Drive out uniqueness

“Show and tell”


Thoughtful Example Data

Detailed ETL Specification
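One way to see how example rows "drive out uniqueness" is to search them for the smallest set of columns that distinguishes every row, which is a candidate for the fact granularity. The sketch below is a simple brute-force illustration under that interpretation; the function name `candidate_grains` and the sample rows are hypothetical.

```python
from itertools import combinations


def candidate_grains(rows, columns):
    """Return the smallest column combinations whose values are unique per row."""
    for size in range(1, len(columns) + 1):
        found = [
            combo for combo in combinations(columns, size)
            if len({tuple(row[c] for c in combo) for row in rows}) == len(rows)
        ]
        if found:
            return found
    return []


# Hypothetical example data covering typical, different and repeat values.
rows = [
    {"customer": "Ann", "product": "TV",    "order_date": "2012-03-19"},  # typical
    {"customer": "Bob", "product": "Radio", "order_date": "2012-03-20"},  # different
    {"customer": "Ann", "product": "TV",    "order_date": "2012-03-21"},  # repeat customer and product
    {"customer": "Bob", "product": "TV",    "order_date": "2012-03-19"},  # repeat product and date
    {"customer": "Ann", "product": "Radio", "order_date": "2012-03-19"},  # repeat customer and date
]

grains = candidate_grains(rows, ["customer", "product", "order_date"])
print(grains)
```

With only "typical" rows a single column would appear unique; it is the deliberately chosen repeat rows that expose the real grain.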
Identify Event Type Early
Adjust Conversation Based on Event Type

• Discrete Event -> Transaction
• Instantaneous/short duration, irregularly occurring events or transactions
• Recurring Event -> Periodic Snapshot – measurement
• Regularly occurring events or ongoing processes, typically used to measure the cumulative effect of discrete events
• Evolving Event -> Accumulating Snapshot – timeline
• Non-instantaneous/longer duration, irregularly occurring events or transactions
• Represents current status - reflects adjustments
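The three event types can be contrasted with a small Python sketch. The data, the monthly period, and the milestone names (`ordered`, `shipped`, `delivered`) are assumptions made for the example rather than anything prescribed by the slide.

```python
from collections import defaultdict

# Discrete events -> transaction fact: one row per sale (hypothetical data).
transactions = [
    {"order_id": 1, "date": "2012-03-05", "product": "TV", "quantity": 1},
    {"order_id": 2, "date": "2012-03-19", "product": "TV", "quantity": 2},
    {"order_id": 3, "date": "2012-04-02", "product": "TV", "quantity": 1},
]


def periodic_snapshot(rows):
    """Recurring event: one row per (month, product), summing the discrete events."""
    totals = defaultdict(int)
    for r in rows:
        month = r["date"][:7]  # "YYYY-MM"
        totals[(month, r["product"])] += r["quantity"]
    return dict(totals)


def apply_milestone(snapshot, order_id, milestone, date):
    """Evolving event: one row per order, updated in place as milestones occur."""
    row = snapshot.setdefault(
        order_id, {"ordered": None, "shipped": None, "delivered": None}
    )
    row[milestone] = date
    return snapshot


monthly = periodic_snapshot(transactions)

orders = {}
apply_milestone(orders, 1, "ordered", "2012-03-05")
apply_milestone(orders, 1, "shipped", "2012-03-07")
print(monthly)
print(orders)
```

Transaction rows are only ever inserted, periodic snapshot rows are produced once per period, and accumulating snapshot rows are revisited, which is why the event type is worth identifying early.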

!72
Capture When Details

BEAM✲ Modeler: “When do Customers order Products?”

BI Users: “On the Order Date”
Any other Whens?
Any other Whos?
And so on...
Model How Many Measures
• Additive – can be summed over any combination of dimensions. No special rules
• Non-additive – cannot be summed over any dimension, e.g. unit price or temperature
• Must be aggregated in other ways, e.g. average, min, max
• Degenerate Dimensions – transaction #, timestamps, flags
• Semi-additive – cannot be summed across at least one dimension, e.g. balances cannot be summed
over time
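The difference between the three kinds of measure shows up as soon as you aggregate. The following Python sketch uses hypothetical rows; the helper names `total_balance_on` and `closing_balance` are illustrative only.

```python
from statistics import mean

# Hypothetical sales rows: quantity is additive, unit_price is non-additive.
sales = [
    {"day": 1, "product": "TV",    "quantity": 2, "unit_price": 500.0},
    {"day": 2, "product": "TV",    "quantity": 1, "unit_price": 450.0},
    {"day": 2, "product": "Radio", "quantity": 4, "unit_price": 100.0},
]

# Additive: summing across every dimension is meaningful.
total_quantity = sum(r["quantity"] for r in sales)

# Non-additive: a sum of prices means nothing, so use average, min or max.
average_price = mean(r["unit_price"] for r in sales)

# Hypothetical daily balances: semi-additive (sum across accounts, not across days).
balances = [
    {"day": 1, "account": "A", "balance": 100.0},
    {"day": 1, "account": "B", "balance": 50.0},
    {"day": 2, "account": "A", "balance": 120.0},
    {"day": 2, "account": "B", "balance": 50.0},
]


def total_balance_on(rows, day):
    """Summing across accounts for a single day is valid."""
    return sum(r["balance"] for r in rows if r["day"] == day)


def closing_balance(rows, account):
    """Across time, take the latest balance instead of summing."""
    latest = max((r for r in rows if r["account"] == account), key=lambda r: r["day"])
    return latest["balance"]


print(total_quantity, average_price)
print(total_balance_on(balances, 2), closing_balance(balances, "A"))
```

Summing account A's balances over both days would give 220.0, a figure that corresponds to nothing in the business, which is why semi-additive measures need to be flagged in the model.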

!77
Modeling Dimensions
Annotate with Targeted Data Profiling
Proceed Through the Business Process Value Chain
Collaborative Dimension Conformance

Sales

Campaigns

Plant Response Product Promotion Customer Shipper Time

Dimensions
Identify Hierarchy Types

Hierarchy types: Balanced, Ragged, Variable Depth (each either Simple or Complex)
Graphically Depict Hierarchies
Visualize The Hierarchies
Paint The Organization
Prototype! Not “Data Model Review”
Recap
• Collaborative and Agile
• Data Modeling
• Data Sourcing
• Data Conformance

• Requirements = Design
• Slots directly into proven and mature dimensional data warehousing
design patterns

• Validation through Prototyping


• Semi-automated build of dimensional data warehouse
• Perfect complement to Agile BI Tools and Methods (e.g. Pentaho)

!87
If you have been affected by any of the issues raised in this presentation
Agile Data Warehouse Design

Lawrence Corr, Jim Stagnitto, DecisionOne Press, November 2011



Questions / Comments
