Solution Path for Planning and Implementing the Logical Data Warehouse
Published: 17 May 2017 ID: G00320563

Analyst(s): Henry Cook

Technical professionals implementing the logical data warehouse can maximize their effectiveness and the business benefit delivered by explicitly integrating three analytics development styles: the classic data warehouse, agile with data virtualization, and the data lake.

Key Findings
■ Three broad development methodologies for analytics have evolved. They are the classic data
warehouse, agile development using marts and virtualization, and analytics for large volumes of
unstructured data held in data lakes.
■ These three styles of development are not alternatives, but facets of a wider analytical architecture that is best expressed as the logical data warehouse (LDW).
■ By architecting the LDW correctly, the seemingly conflicting requirements of these styles can be
reconciled, and organizations can fully realize the benefits of each.

Recommendations
For technical professionals focused on modernizing their data and analytics infrastructure:

■ Plan and build your analytical systems as components of an LDW from the start. Even if the
architecture is not fully realized immediately, this will make expansion smoother and easier.
■ To maximize results, utilize continuous cycles of development. For each cycle, choose a
development style, or have two or three styles running in parallel.
■ Explicitly manage and consider each style and stream. Use the transition of deliverables between streams as a predictable, methodical trigger for adding or removing controls.

Table of Contents

Problem Statement
Solution Path Diagram
What You Need to Know
Solution Path
Step 1: Initiate Phase or Project
Planning
Platform Sizing and Cost Estimation
Establish Architectural Principles
Step 2/DW: Data Warehouse Stream
Design the Architecture and Then the Physical System
Document "As Is" System
Design To-Be Architecture
Data Modelling and Metadata
Develop the Data Model
Assess Data Warehouse Automation
Evaluate a Cloud-First Approach
Evaluate Open-Source Software
Metadata: Business Glossary and Technical Metadata
Put Data Quality Measures in Place
Information Governance
Formalize MDM for Critical Business Data
Automate DW Testing
Data Ingest Design
Migration vs. Greenfield
Report Suite Design
Establish or Grow DW Platform
Develop and Deploy the DW
The Initial Load Strategy
Step 2/DV: Data Virtualization, Agile Development and Self-Service Stream
Agile Data Marts and Sandboxes
Data Virtualization
Forward Planning for Virtualization
Establish or Grow Agile and Self-Service Platforms
Step 2/DL: Data Lake Stream
Design the Data Lake Architecture
Data Acquisition
Discovery and Development of Insight
Data Lake Optimization and Governance
Data Lake Analytics Consumption
Provision the Data Lake
Secure the Data Lake
Cloud Adoption
Step 3: Wrap Up the Development Cycle
Change Control
Monitoring and Administration
Capacity Planning and Workload Management
Evaluate New Technology Options for the Next Cycle
Summary, Matching Requirements in Development Cycles
Detailed Descriptions of the LDW Components
Details on Requirements-Gathering and Business Case
Gather and Prioritize Requirements
Align Business Strategy
Form the IT and Business Team
Develop or Firm Up the Business Case
Project or Phase Initiation and Kick-Off Workshop
Gartner Recommended Reading

List of Tables

Table 1. Effort by Topic for the Three LDW Streams
Table 2. Change Control Actions When Moving Between LDW Paths

List of Figures

Figure 1. The LDW Solution Path and Its Three Streams
Figure 2. Three Types of User Service Levels, Developments and Modelling
Figure 3. The Compromise, Contender and Candidate Data Models
Figure 4. LDW Architecture
Figure 5. An Example LDW Plan
Figure 6. Example of a Simple Sizing of a DW/DMSA Platform
Figure 7. Example Data Model and Its Place in Development
Figure 8. Phased Expansion of the Data Model
Figure 9. Cross Referencing Requirements and Data Sources
Figure 10. Requirements Met by New Data and New Combinations of Data
Figure 11. Example: Data Warehouse Automation Approach
Figure 12. Traditional Tooling vs. Data Warehouse Automation
Figure 13. Example of DBMS Target-Specific Data Warehouse Automation
Figure 14. The Guidance Framework: Product Management Approach to Enterprise Information Management (EIM)
Figure 15. Four Points of Data Comparison Within the DW
Figure 16. Example Functionality of an Automation Tool
Figure 17. Agile Development With Data Marts and in DW
Figure 18. Different Styles of Data Virtualization
Figure 19. Example of Data Virtualization
Figure 20. Example Virtualization Reference Architecture
Figure 21. Functions on a Remote Platform
Figure 22. Virtualization History and Trends
Figure 23. Conceptual Architecture for the Data Lake
Figure 24. Summary of LDW Requirements Handling and Build Process
Figure 25. Gathering and Prioritizing Requirements
Figure 26. Aligning Business Strategy With Development
Figure 27. Example Team Structure
Figure 28. Relationship of LDW Personnel to Other Roles
Figure 29. Example Business Case Structure

Problem Statement
How do technical professionals modernize their information management foundation for data and
analytics by implementing an LDW?

Subsidiary questions answered by this document are:

■ What is the architecture of an LDW?


■ What does an LDW project plan look like?
■ How can we easily size the data stores in the LDW?
■ Who builds the LDW? What is the LDW's team structure?


Solution Path Diagram

Figure 1. The LDW Solution Path and Its Three Streams

Source: Gartner (May 2017)


The Solution Path, shown in Figure 1, is not a single linear path, but three parallel streams:

■ Classic data warehouse stream: This stream builds the traditional data warehouse
component, which is typically a massively parallel relational database management system
(RDBMS).
■ Agile development stream: This stream uses data virtualization and physical and virtual data
marts to do new things with existing (and some new) data.
■ The data lake stream: This stream allows organizations to work with very large-scale and/or unstructured data.

The three streams reflect three different contemporary analysis styles.

Note that, as shown by the arrow at the base of Figure 1, the overall process is iterative. Each
iteration executes one or more of the three streams. Each stream expands a different part of the
architecture of the LDW.

What You Need to Know


Technical professionals implement the LDW by using three simultaneous development streams, each with its own development style and analytics service level. The three styles are not alternatives; they are complementary. Planning for this from the start will save time, money and effort, as well as reduce risk. The three styles are illustrated in Figure 2.

Figure 2. Three Types of User Service Levels, Developments and Modelling

Source: Gartner (May 2017)


Much effort has been wasted arguing about which was the "best" approach. The breakthrough
insight is to recognize that all three are equally valid; they are not alternatives, but simply different
facets of a greater whole. A modern analytical system needs to do all three.

Previous Gartner research identified these three broad styles of analytics development, as
summarized in Figure 3. Before discussing the development styles themselves, it is worth noting
that each uses its data model in different ways:

■ The compromise data model: Most of the time, information consumers need a data model that
everyone can agree on. That is, they need a model that represents a collective view of the main
data items within the organization, what is there, its format and what it can be used for. In
Gartner parlance, this is known as a compromise data model because stakeholders have
arrived at a consensus about the model, and this may have involved a compromise being
reached between those stakeholders.
■ The contender model: Sometimes people want to use existing data in new ways by combining
or augmenting the model to service new requirements. This is particularly true in agile
development, where data may be combined in new ways or augmented by new data sources.
These new ways of looking at the data can deliver benefit and should not be impeded. Data
virtualization is a great way to enable this mixing and matching of data views. If these contender
models are developed and then proven, they are likely to be folded into the main compromise/
consensus view.
■ The candidate model: This is where new data is available and useful to analyze, but no data model has yet been worked out to describe it. In fact, it may be inherently unstructured and, therefore, impossible to describe using a single definitive format. Readers will recognize that this is the domain of unstructured data that is interpreted using "schema on read" (a brief sketch of schema on read follows this list).
■ Each development style also has its own unique service-level agreements (SLAs) for data availability, quality, provisioning, performance and ease of development.
■ Each style is required at some time, so they are shown as "and/or" rather than mutually
exclusive alternatives.
■ Eighty percent of users require data to be well-understood and trusted; they need to use a
compromise model that everyone can agree on. The users have reached a compromise (or
come to a consensus) on how the data is to be represented.
■ Ten percent of users wish to use existing models but combine them in new ways. They generate
and use the contender data models.
■ Five percent of users have no model at all, but generate candidate models from new data that
they can both obtain and manipulate.
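
To make the candidate model concrete, the following is a minimal schema-on-read sketch. It assumes PySpark is available; the landing path and the event_type field are hypothetical examples, not taken from this research. The point is that no model is defined before the data is read, and the inferred schema becomes the starting point for a candidate model.

```python
# Minimal schema-on-read sketch. The path and the event_type field are
# hypothetical; the point is that no schema is defined before reading.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("candidate-model-discovery").getOrCreate()

# The schema is inferred at read time ("schema on read"), not agreed in advance.
events = spark.read.json("/landing/clickstream/*.json")

events.printSchema()                           # the inferred, candidate structure
events.groupBy("event_type").count().show()    # quick profiling of the new data
```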


Figure 3. The Compromise, Contender and Candidate Data Models

DW = data warehouse; EDW = enterprise data warehouse; ODS = operational data store; DQ = data quality; MDM = master data
management; Gov. = governance

Source: Gartner (May 2017)

Recall how the industry arrived at its current state. Originally, data warehouses were single,
centralized RDBMS servers, into which data was loaded in order to provide a single enterprise view
using the compromise model shown on the left in Figure 3. The data was structured, originated from
operational applications and was multiple terabytes in size.

The original centralized DW solved two pressing problems:

1. How to ask questions that spanned the enterprise or required a depth of history. These were not part of the design aims of the operational systems.
2. How to design the server to handle the very different analytical workload, which needed to run
complex queries on very large amounts of data.

This scheme worked very well from the early 1990s. However, three trends became clear.


1. No organization ever achieved the goal of having a DW that satisfied all of its needs — the "single
version of the truth." Multiple data warehouses, tens of data marts or query tool point solutions,
and thousands of spreadsheets were the norm. With hindsight, this was because a single server
could not meet the increasing, and increasingly varied, needs that were evolving. Nor could it
keep pace with the need for agile development.
2. The types of data expanded to include text, audio, video, social media data, Internet of Things
(IoT) machine sensor data, weblogs and so on. This data was difficult to process using the
structured data types of the RDBMSs used by classic DWs.
3. Different types of users, with different needs, emerged, as illustrated in Figure 3.

This document focuses on how to build the LDW, but at each stage of the discussion, it will be
useful to refer to what is being built.

Figure 4 illustrates what an LDW is, including both its components and their relationships. This is
not prescriptive; these systems will vary by organization. Figure 4 shows a superset of the
components that we would expect to find.


Figure 4. LDW Architecture

Source: Gartner (May 2017)


Readers unfamiliar with the components will find them defined in the Detailed Descriptions of LDW
Components section.

This document speaks about both the DW and the LDW. LDW refers to the whole system as shown
in Figure 4, whereas the DW is simply the classic data warehouse — the centralized RDBMS
system that is one of the components of the overall LDW.

At other points, we will refer to Data Management Solutions for Analytics (DMSA). This is the term
coined by Gartner to describe analytical systems because they are now so varied. Data
warehouses, marts, Hadoop clusters and other nonrelational systems are all types of DMSA.

Solution Path
The LDW is never finished, just like the DW that preceded it. The Solution Path diagram (Figure 1)
for the LDW shows this continual enhancement by the looping arrow at its base, to illustrate how it
executes through multiple cycles in a continuous process. On each cycle, choose one or more of
the three streams. Each develops or expands part of the LDW architecture.

This document first runs through the first stream, the classic DW, from start to finish, and then
describes the second and third paths. Each cycle begins with an initiation phase and ends with a
phase that ensures the maintainability of the system and prepares for the next cycle.

This ordering is just for simplicity, although historically many organizations have begun with a
classic DW, then moved on to agile development, and then implemented a data lake. This ordering
is not prescriptive. Organizations can start with any of the three paths, and can choose subsequent
paths in any order, or run paths in parallel if sufficient staff is available.

Step 1: Initiate Phase or Project


This document does not duplicate standard information on project setup. It focuses on activities
that are particularly important, or different, for the LDW. Building an LDW differs from other projects in its high degree of collaboration between business and IT and in its constantly changing requirements.

Execute the following key activities to determine which of the three streams will apply to each
project phase, and which parts of the LDW will be put in place or expanded:

1. Gather and prioritize requirements


2. Align with the business strategy
3. Form the IT and business team
4. Establish architectural principles
5. Define the plan
6. Perform initial sizing
7. Firm up the business case


8. Project or phase initiation and kick-off meeting

It is essential that the LDW, like the central DW before it, be driven by business objectives. If this is not the case, it is a major "red flag": a project that is not driven by business objectives often runs over time and over budget, and fails to deliver meaningful business benefits.

The early project activities related to requirement gathering and prioritization, alignment to business
strategy, team formation, and project kick-offs are described in the Details on Requirements-Gathering and Business Case section of this document. This section continues with descriptions of planning,
sizing and choosing architectural principles to describe the overall build process.

The different streams exercise the standard disciplines of data management to different degrees.
Table 1 summarizes this, showing for each of the three streams the relative amount of effort needed
to implement each of a number of data topics. None of the three streams is optimal for
implementing all requirements.


Table 1. Effort by Topic for the Three LDW Streams

Data Model
■ Data Warehouse: Fully defined in advance.
■ Agile, Mart, DV: Existing definitions; add as necessary.
■ Data Lake: Generally not a prerequisite; some use cases may require one.

Business Glossary (BG)
■ Data Warehouse: Used to inform a wide user base; defined early on.
■ Agile, Mart, DV: Leverage other glossaries; some development.
■ Data Lake: Can use existing; work is normally exploratory, so not needed.

Technical Metadata
■ Data Warehouse: Extensive; required to ensure performance.
■ Agile, Mart, DV: Use existing, with some re-engineering.
■ Data Lake: Some needed to maintain the platform and access.

Data Quality
■ Data Warehouse: Defined; required by the bulk of DW users.
■ Agile, Mart, DV: Use existing; reassess some sources.
■ Data Lake: Depends on use case; for some, none; for others, some needs to be developed.

Provide Structured Data
■ Data Warehouse: The effort of structuring data is an essential prerequisite.
■ Agile, Mart, DV: Sources are usually already structured.
■ Data Lake: Not necessary; history or archive may be loaded; structure is provided by schema on read.

MDM Data
■ Data Warehouse: Use MDM data, or define standardized reference data as part of the DW.
■ Agile, Mart, DV: Plug into existing MDM, or spark an MDM initiative.
■ Data Lake: Used for some use cases; MDM data can be downloaded if necessary.

Information Governance (IG)
■ Data Warehouse: Required for the complete end-to-end process.
■ Agile, Mart, DV: Some governance added to sources.
■ Data Lake: For some use cases; less for exploratory work.

Automate Testing
■ Data Warehouse: Best practice; tools available.
■ Agile, Mart, DV: Testing already done on sources.
■ Data Lake: Can be useful for repeatable use cases, similar to the DW.

Data Ingest Design
■ Data Warehouse: Required in the original design, both for performance and for data quality.
■ Agile, Mart, DV: Minimal; sits on top of existing sources.
■ Data Lake: Need high-volume, well-understood ingest; fewer concurrency requirements.

Migration Skills Needed
■ Data Warehouse: A significant percentage of the system may need migration techniques.
■ Agile, Mart, DV: Virtualizes rather than migrates sources.
■ Data Lake: Unlikely; maybe archive or "cooler" data if a copy of existing data; mainly new data.

Report Suite Design
■ Data Warehouse: Large formal report design effort; augmented by self-service.
■ Agile, Mart, DV: Mainly ad hoc; can also include sets of standard reports.
■ Data Lake: Exploratory, special purpose or data science; less used for general reporting suites.

Establish or Grow Platform
■ Data Warehouse: Detailed design needed to support tens or hundreds of users.
■ Agile, Mart, DV: Expand DBMS capacity or install a separate server.
■ Data Lake: Install a nonrelational platform; effort depends on use-case SLAs.

Initial Load Strategy
■ Data Warehouse: Backloading of history is common; reliability and quality are necessary.
■ Agile, Mart, DV: Sources are already loaded; some limited loading for sandboxes.
■ Data Lake: Backloading usually not needed; near-raw data; some SLA performance and reliability work needed.

Source: Gartner (May 2017)

The second stream requires noticeably less effort than the first (data warehouse stream). Similarly,
there are areas where the data lake stream requires less effort too. This enables the second and
third streams to implement certain requirements better and to deliver earlier results. The trick is to
balance the strengths and weaknesses of each to provide an overall optimal solution.

Planning
It is essential to have an overall strategy and roadmap for the LDW project. Track progress against this living document, which provides context for the team and sets the expectations of other
stakeholders. This should show the short-, medium- and longer-term expectations of the project,
with each time frame being at an appropriate level of detail. Expect this plan to evolve over time.

Figure 5 shows an example of a plan to build an LDW. This is not prescriptive, but simply an
illustration of a plan's typical size and shape.


Figure 5. An Example LDW Plan

Source: Gartner (May 2017)

Plans will vary considerably between organizations. However, for readers who are unfamiliar with
typical LDW plans, it does show useful characteristics, such as:

■ The relative sizes or durations of tasks


■ Which tasks can typically be run in parallel
■ How the delivery subphases can be staggered to provide regular drops of benefit

Straightaway, it is noticeable that the DW stream is much more involved than either the agile or the
data lake streams. This is to be expected. Many more things need to be done upfront in the DW
stream. The other streams are simpler and deliver more quickly.


Figure 5 shows all three streams in parallel for the purposes of illustration. The streams are color-
coded green for DW, amber for agile and blue for data lake. Deliveries of functions providing benefit are shown in purple; common activities are shown in gray.

The DW stream may at first glance appear to be a waterfall. However:

1. The agile and data lake streams accompany the DW, in parallel. You can periodically fold their
newly developed functionality into the main project via change control.
2. The DW stream can be a series of smaller projects, such as four three-month projects rather
than a single year-long one. However, it is necessary to do the same activities in the same order
within these shortened DW project phases.
3. DW development can adopt agile sprint techniques. There are typically two types of sprints:
1. Code only: New functionality on already-defined data, typically two to three weeks in duration
2. Data model extension: These may take longer (four to 10 weeks would be typical)

Some observations about the plan are worth noting:

■ Each cycle typically has an early and a later phase. The first part is concerned with preparation
and the later phase with exploiting that preparation.
■ As the centralized DW server is developed, work in other streams can piggyback on the
definitions it provides. This speeds their delivery too. For example, the virtualization stream gets
ready-made data to virtualize, and the data lake gets readily available models to use as
templates for schema on read.
■ Relative sizes of tasks are shown, along with what is done serially and what is done in parallel.
■ Figure 5 does not show the size of the benefit delivered. The centralized DW stream has
periodic "drops" of large numbers of queries and reports to many parts of the business, but
takes longer to deliver them.
■ Agile activities deliver a smaller, more focused scope, but deliver them earlier and more
frequently.
■ A guiding principle of building the DW is "aim for an overall strategy interspersed with frequent tactical drops of value." Short-term value and long-term aims are compatible.
■ Aim for financial payback within a year. Judicious use of the agile and data lake streams can
help with this. Don't fall into the trap of thinking "we can't do anything until we've implemented
everything."
■ Assess benefits early and continuously. Use this assessment to update the business plan.
■ Change control occurs regularly and as soon as delivery starts. Look for opportunities to move
functionality between streams. Add or remove appropriate controls as the function moves
between streams. For example, if a function is copied from the DW to an agile environment to

modify it, relax the controls on development. When moving function from the data lake to the
centralized DW, add the necessary data quality and security, and so on.
■ Benefit in the centralized DW stream is delivered at the end of each of three subphases for the "data-out" team, which uses the data populated by the data-in team.
■ There is early delivery of benefit through self-service enablement, agile development and
sandboxes. As soon as the data-in team has loaded production data, these other modes can leverage it and start providing benefit early and often.
■ The agile and lake streams do not have to wait for the centralized DW team. They can establish
their own infrastructure and load their own data very quickly, especially if they are using cloud
infrastructure.

Instead of planning time/resources for building reports, technical professionals should, where
possible, create self-service functionality so that users can build their own reports. The provision of
the agile environment needs to be planned, but the individual results of using it do not. Enabling
self-service means more requirements can be delivered sooner.

As an aside, the provision of agile and self-service capabilities also helps to avoid large amounts of
technical debt. Technical debt is the code and data that needs modification as circumstances
change. Providing general-purpose self-service data structures, accessed by intuitive business intelligence (BI) query tools, means that much less fixed-function reporting needs to be implemented. As requirements change, users will in most cases make minor modifications to their existing reports, with much less overall time and effort spent than would be needed to put fixed reports through change control.

Requirements, and thus benefits, can also be achieved earlier by using the data lake. It may be
possible to do immediate analysis by loading data into the lake early on. This analysis may be a
one-off project or a recurring series. As the data becomes better understood, its structure formally defined and the analysis firmed up for repeated execution, the data can be folded into the main warehouse later if that makes sense.

A solid plan involves considering people, process and technology. Refer to "Embrace Sound Design
Principles to Architect a Successful Logical Data Warehouse" for further guidance.

Having provided context by looking at a typical plan, this section now reviews the other major
activities of Step 1.

Platform Sizing and Cost Estimation


It is almost never too early to begin platform sizing. Having identified the requirements, and by
implication the data necessary to meet those requirements, size estimation can begin. It is good
practice to undertake sizing early and often.

Quantifying the size of the various platforms provides clues as to the kind of technology that is
appropriate. The sooner the team reveals any scalability challenges, the better.


A useful rule of thumb is that large DWs usually hold most of their data in their top 10 tables by
volume. Each of these large-volume tables usually relates to a particular business entity. Use
business metrics to make a first-cut sizing of the most populous entities.

For example, as shown in Figure 6, for a telecommunications company, the biggest datasets are
those dealing with calls and with network events. (Figure 6 is downloadable as a spreadsheet.)

From business metrics, for example, from the annual report, we find the number of customers.
Industry averages allow us to factor in the average numbers of calls per customer per day, and
likewise the number of call detail records per call as customers travel between a series of mobile
masts.

The other large dataset focuses on network telemetry, where each piece of equipment emits
telemetry every few seconds. Calculate from these top-line business metrics the average number of
new records created per day, and then multiply by the number of days' history.
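
As a worked illustration of this calculation, here is a minimal sketch in Python. Every figure used (customer counts, call rates, record widths, history depth) is an illustrative assumption rather than an industry benchmark; substitute your own top-line business metrics.

```python
# First-cut volume estimate from top-line business metrics.
# Every figure below is an illustrative assumption, not an industry benchmark.
customers          = 10_000_000   # e.g., taken from the annual report
calls_per_cust_day = 4            # assumed average calls per customer per day
cdrs_per_call      = 3            # call detail records per call (cell handovers)
days_of_history    = 3 * 365      # three years of history to be kept
bytes_per_cdr      = 400          # assumed average record width in bytes

cdr_rows  = customers * calls_per_cust_day * cdrs_per_call * days_of_history
cdr_bytes = cdr_rows * bytes_per_cdr

print(f"CDR rows:     {cdr_rows:,}")
print(f"CDR raw size: {cdr_bytes / 1e12:.1f} TB")
```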


Figure 6. Example of a Simple Sizing of a DW/DMSA Platform

Source: Gartner (May 2017)


This provides a reasonably accurate estimate from very early on in the life of the project. It is also
very easy to update during the project. As assumptions change, or with the receipt of new
information or requirements, this can provide a valuable early warning of design challenges.

If these large sets of data account for, say, 95% by volume of the total data size, it is straightforward
to estimate what 100% of the data will be by calculating and adding the other 5%. The ratio of top
10 tables to total volume may vary; it may be 80%, 85%, 90% or 99% depending upon the
complexity of the model and metrics for a particular industry.

The spreadsheet then applies sizing factors to estimate the extra data such as aggregates and
indexes, expansion using redundant copies to provide resilience, and shrinkage due to
compression. This calculation often indicates the split of data between different platforms, in
particular between a DW and a data lake. Make separate estimates for the DW and the lake. This
will allow experimentation around the split of data between them.
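
The same spreadsheet logic can be sketched in a few lines of Python. The factors below (index and aggregate uplift, mirroring, compression ratio, and the DW/lake split) are assumptions chosen only to show the shape of the calculation.

```python
# Grossing up the largest tables to a total, then applying sizing factors.
# All factors are illustrative assumptions; replace them with your own.
raw_top_tables_tb = 52.6          # e.g., CDR and network-event tables combined
share_of_total    = 0.95          # top tables assumed to hold 95% of all data

total_raw_tb  = raw_top_tables_tb / share_of_total    # add the remaining 5%
with_extras   = total_raw_tb * 1.30                    # +30% for indexes and aggregates
mirrored      = with_extras * 2                        # redundant copies for resilience
compressed_tb = mirrored / 3                           # assume roughly 3:1 compression

dw_share = 0.40                                        # warm, structured data in the DW
print(f"Total platform estimate: {compressed_tb:.1f} TB")
print(f"  DW portion:   {compressed_tb * dw_share:.1f} TB")
print(f"  Lake portion: {compressed_tb * (1 - dw_share):.1f} TB")
```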

At the same time, document the working assumptions on the number of users, as well as the
number and mix of query types. These can set off alarm bells later in the design process, so
document them as early as possible.

In the spreadsheet, notes are denoted by "#n," where "n" is the note number. A list of notes
appears at the foot of the spreadsheet.

There is a need to size and cost the platforms in order to calculate return on investment (ROI) and
set budgets. Make estimates in advance, even if the platform will be cloud-based and paid for by
subscription rather than capital expenditure. This allows formal release of funds for the project.

Report to the sponsors at this point. By now, the project will have a very good idea of requirements,
benefits, timescales and costs.

Establish Architectural Principles


Establish architectural principles before design starts. The key question here is "what is this LDW
going to consist of?" Enterprise architecture is a wide subject and well-documented, so that
documentation will not be duplicated here. However, here are some of the important actions and big
decisions, specific to an LDW project, that need to be made early on:

■ Implement the LDW architecture from the start, even if it is not fully populated with all the
components. This is in line with contemporary best practice, where analytics is provided by a
collaborating ecosystem of different analytical engines.
■ The LDW provides a good template for "separation of concerns" with different parts of the
architecture servicing different needs and providing different capabilities. Document the
rationale for the choice of each LDW component.
■ Go through the requirements, and map them to the type of data store and processing that is best-suited to implement them (a simple illustrative mapping appears in the sketch after this list):
1. DW: High numbers of queries with strict service levels on well-understood data

2. Operational data store (ODS), hybrid transactional/analytical processing (HTAP) and MDM:
Very high volumes of queries with very low latencies
3. Data lake: Experimental analysis on large volumes of unstructured data
4. Agile development or data virtualization: Agile recombination of existing data sources, with
possibly some augmentation of data
■ Investigate Lambda and Kappa architecture techniques. These were developed to provide real-
time reporting performance and recoverability, especially in the context of using open-source
software. See "Harness Streaming Data for Real-Time Analytics" for a fuller discussion on this.
Remember that, despite their open-source heritage, these architectures can also be built using
non-open-source software.
■ Decide on the method for modelling data. Third normal form, dimensional modelling or a
mixture of both are commonly used. Specific techniques such as data vault modelling may be
used. Use schema on read for the data lake. Use third-party models if useful. This is discussed
in the Develop the Data Model section.
■ Use in-memory computing to simplify the data structures used by reducing or eliminating
preaggregated data and indexes. Also, look for opportunities to enhance virtualization of data
by allowing fast processing of intermediate results by the use of massively parallel processing
and/or in-memory techniques.
■ Choose your style of data virtualization. See the Agile Data Marts and Sandboxes section for a
discussion on this.
■ Decide how to manage physical marts throughout their life cycles to avoid unnecessary
proliferation. Question: Can they be made virtual?
■ There are products that automate DW development. These may or may not be suitable for your
organization, but now is a good time to consider them; see the Assess Data Warehouse
Automation section. This will affect the overall architecture and its components.
■ Decide how to handle metadata, sharing and tooling. This remains one of the challenges for
LDW because easily shareable universal metadata remains a way off. We do not expect
perfection or seamless integration. Some manual integration will be needed.
■ Determine how to secure data and how appropriate permissions are granted.
■ Develop support for backup and recovery, disaster recovery, and restart.
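
As a simple way to record the placement decisions described in the list above, the sketch below maps illustrative requirement profiles to LDW components. The profile keys and the component choices are assumptions for discussion, not fixed rules.

```python
# Illustrative mapping of requirement profiles to LDW components.
# Profile keys and the component choices are assumptions, not fixed rules.
COMPONENT_BY_PROFILE = {
    ("structured", "high query volume", "strict SLA"):  "Data warehouse (RDBMS)",
    ("structured", "very low latency", "operational"):  "ODS / HTAP / MDM",
    ("unstructured", "experimental", "large volume"):   "Data lake",
    ("structured", "recombine existing", "agile"):      "Data virtualization / agile mart",
}

def place_requirement(profile):
    """Return the suggested LDW component for a requirement profile."""
    return COMPONENT_BY_PROFILE.get(profile, "Review with the architecture team")

print(place_requirement(("unstructured", "experimental", "large volume")))
```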

These are decisions about the overall architectural approach. Decisions regarding particular
platforms, BI tools, extraction, transformation and loading (ETL) tools and so on can be deferred to
the "to be" design phase.

Step 2/DW: Data Warehouse Stream


Having made the initial preparations, the project is ready to proceed with development. For the
purposes of this document, the stream to build the classic centralized DW is described first, then
the agile development and virtualization stream, then the data lake stream. In practice, choose any
one, or more, of the three paths for each iteration.

One of the key components of the LDW is the classic centralized DW. It acts as a hub for all of the
standard, predefined and quality-assured data to be used.

Key steps in establishing or expanding the DW component are:

■ Document the "as is" and "to be" architectures:


■ Choose DW components
■ Assess DW automation tools
■ Consider cloud-first and open-source options
■ Put data quality measures in place
■ Determine the data model and handling of metadata, data glossary and so forth
■ Automate testing
■ Establish or grow the DW platform
■ Determine an initial loading strategy

This section on building the classic, central DW component is much longer than the others. As we
can see from the plan, this is simply because there is much more work to be done upfront. By
definition, the DW serves up data that is prestructured and quality-assured and comes with a
platform that provides good service for hundreds or thousands of users. Therefore, all of this work
has to be done before the DW can enter service. These constraints are relaxed in the other two
streams. The agile stream uses virtualized data sources that have already been set up. With the data
lake, data can be loaded and analysis begun without having to first model and structure the data.

An advantage of thinking about the development as three streams is that it makes it easier to
explicitly think about the implications of crossing stream boundaries. Moving from the data lake
path to the DW path reminds the team to put in place the checks and controls needed by the DW
for that new part of the system. Likewise, moving in the other direction from the DW stream to the
data lake stream reminds the team to remove unnecessary controls. This occurs, for example, when
prototyping new analysis in one part of the system and then moving it into production in another.

Design the Architecture and Then the Physical System


This activity lays the foundation for the server that will form the central data warehouse platform.
Having determined what data it is to hold and having assessed the requirements, a target to-be
architecture can be designed, configured and sized.


Document "As Is" System


Perform an architecture and document review, including the existing strategy and vision, if these are
available. Sadly, it is common for them not to be, in which case the existing systems themselves
become the specification.

Documentation should include:

■ Existing systems and their connections: Existing DWs, data marts, MDM systems, ETL tools
and BI tools, along with notes on the pros and cons of each as experienced by the organization
■ Current methods: For example, data modelling methods and tools, metadata, data quality
standards and governance
■ Existing constraints and prerequisites placed on the current system, local standards,
preferences, enterprise agreements, methodologies and so on
■ Potential risks, dependencies, impediments and constraints
■ Governance, organization, support structures, policies, and how all of these were created and
are maintained
■ An assessment of the suitability of current technologies; do not replace just for the sake of
replacing
■ An assessment of current projects already underway and any opportunities to make them more efficient
■ Current issues, limitations and gaps in capability and skills (both functional and nonfunctional)

As part of this, compile an inventory of data sources. List all potential data sources with an
assessment of their value and the difficulty (cost) of acquiring them.
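
One lightweight way to capture this inventory is a simple scored list, as sketched below. The source names, value scores, costs and quality labels are invented for illustration only.

```python
# Illustrative data source inventory; names, scores and costs are invented.
sources = [
    {"source": "Billing system",    "value": 9, "acquisition_cost": 4, "quality": "good"},
    {"source": "CRM",               "value": 8, "acquisition_cost": 3, "quality": "fair"},
    {"source": "Web logs",          "value": 5, "acquisition_cost": 2, "quality": "raw"},
    {"source": "Spreadsheet marts", "value": 3, "acquisition_cost": 5, "quality": "poor"},
]

# Rank sources by the value they deliver per unit of acquisition effort.
for s in sorted(sources, key=lambda x: x["value"] / x["acquisition_cost"], reverse=True):
    ratio = s["value"] / s["acquisition_cost"]
    print(f'{s["source"]:<18} value/cost = {ratio:.1f} (quality: {s["quality"]})')
```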

Data quality should be assessed, ideally using automated tools. This will provide an indication of the
overall quality of the data; the amount of redundancy; and whether sources are incomplete,
incorrect or inconsistent. Because there is a large body of guidance available regarding data quality,
its importance is noted here but not expanded further in this document.

Design To-Be Architecture


Design the DW as an LDW, because the LDW is becoming synonymous with the DW. Other key points:

■ As part of the preparation, take an inventory of the types of users. Is the work they present to
the system going to be of low, medium or high complexity? What kind of service levels will they
expect?
■ Which are the main components of the workload? Are they HTAP, ad hoc, analytical data
science/predictive, sandboxing or bimodal? What kinds of components are needed to run
them?


■ Use Figure 4, the summary of the LDW architecture, as a model, and consider the various
components in turn: DW, data marts (physical and virtual), MDM, ETL, NoSQL, data
virtualization, replication, reporting, predictive (data science), BI reporting and so on.
■ From the requirements, initial sizing and example architecture, it should be possible to identify
the components of the to-be architecture and their likely relative sizes.
■ Choose specific technologies:
■ Relational DBMS for the DW and ODS
■ Hadoop Distributed File System (HDFS), Hadoop and Spark for the data lake are common
choices, but also consider other technologies such as Graph and Document Databases if
you have special requirements that need them
■ Once you have identified the necessary components, examine each of the interfaces between
them, the nature of the data flows and the APIs that would be used. This should include security
and whether processing can automatically be "pushed down" between them.
■ Recommend standard enterprise tools and preferences. You should also document constraints,
risks and mitigations at this time.
■ Document adherence to regulation, including data protection, compliance issues, Freedom of
Information, industry regulatory compliance and so on. See "Use Gartner's Three Rings of
Information Governance to Prioritize and Classify Records."
■ It is also useful to conduct a project risk assessment once the to-be architecture has been
fleshed out. Keep these activities at a practical level, and guard against them becoming ends in
themselves.

For guidance on design principles, see "Embrace Sound Design Principles to Architect a Successful
Logical Data Warehouse."

When designed properly, data virtualization can speed up data integration, improve reuse and
mitigate data sprawl.

Data Modelling and Metadata


The LDW project is an opportunity to reinvigorate your data modelling. Data modelling continues to
be required. Parts of the LDW will be schema on read. However, even in those cases, it is necessary
to know what schema the data is being read into. This enables users to agree on the information
that has been obtained.

Agree on the data-modelling standard. Whether it is entity relationship, star schemas/dimensional, data vault, some other standard, or a mix is less important than that a standard is set. In practice, a mixture of entity relationship and dimensional modelling is the norm.

Choose tooling for metadata, the BG and data dictionary. In addition, decide how to derive
metadata and then share it between components.


Decide how to handle data lineage early on to save effort later. Metadata interchange between
incompatible tools later in the project can significantly slow things down. Do proof-of-concept trials
to check which combinations work well.

Develop the Data Model


The traditional centralized DW component of the LDW has a predefined data model. This is "schema on write" because the data model has to be agreed on before the data can be loaded into it. Gartner
refers to this kind of model as a compromise model, meaning that all of the participants in the
warehouse, including those loading data and those using it, have come to an agreement on how to
represent the data.

Data models can be purchased if an off-the-shelf solution is required. These can provide a
significant jump-start to the modelling effort, but they can also require significant customization
(10% to 30% is not unusual).

Different places in the system use different levels of the model. The conceptual model appears in
portals to inform users and developers, and for general reference. The logical model describes
overall data content in an abstract, technology-independent manner. The physical models define
actual physical data stores, together with their optimizations (e.g., indexes, data aggregates and
materialized joins).

It is worth noting that the gap between Layer 2 (the logical model) and Layer 3 (the physical model) is shrinking due to technology such as
massively parallel processing, large multicore clusters and in-memory computation. These
technologies lessen the need to represent data in a particular way. Check how much physical
performance work is needed before investing too much effort in the physical model.

Figure 7 shows an example from IBM's Financial Services Data Model. The left side of the figure
shows that the model is centered around nine high-level concepts; these are broken down into more
detailed data item descriptions.

The right side of Figure 7 shows the relationship of the model to the overall system. At the core are
three layers of modelling: The Business Terms, The Atomic Warehouse Model and the Dimensional
Warehouse Model. Figure 7 shows the data for analysis as being a separate physical layer, but in
practice, these may simply be views into the atomic layer, especially if an in-memory DBMS is being
used.


Figure 7. Example Data Model and Its Place in Development

Source: IBM


Another three-layer pattern can be found within the main data model:

1. A conceptual model consisting of high-level business descriptions described in a BG


2. A logical model that defines data but in a technology-neutral manner
3. The physical model that defines a specific implementation for particular systems, taking into account the need to design data in particular ways to provide performance (a minimal sketch of the three layers follows this list)
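
Below is a minimal sketch of the three layers for a single, assumed subject area. The entity, column names and dimensional design are illustrative only and are not taken from any packaged model.

```python
# Sketch of the three model layers for one assumed subject area ("Sales").
# Entity and column names are illustrative, not taken from any packaged model.
import sqlite3

# 1. Conceptual: a business glossary entry, readable by anyone.
conceptual = "Sales: the value of products sold to customers over time"

# 2. Logical: technology-neutral entity and attribute definitions.
logical = {"SalesFact": ["sale_date", "customer_id", "product_id", "quantity", "net_amount"]}

# 3. Physical: one possible implementation, including an optimization (an index).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        sale_date   TEXT    NOT NULL,
        customer_id INTEGER NOT NULL,
        product_id  INTEGER NOT NULL,
        quantity    INTEGER NOT NULL,
        net_amount  REAL    NOT NULL
    )""")
conn.execute("CREATE INDEX ix_sales_fact_date ON sales_fact (sale_date)")

print(conceptual)
print(logical)
```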

A fully developed data model may have hundreds of entities and thousands of attributes. It is not
necessary to define the entire model before any analysis can begin. Many DW projects have fallen
into this trap of not beginning any other development until the entire model is built.

It is better to expand the data model in phases, with each phase adding data related to data subject
areas, as illustrated in Figure 8. The Planning section notes that the centralized DW component of
the LDW might be built in stages. Building out the DW subject area by subject area is one very common,
and very practical, way to construct these phases.

Figure 8. Phased Expansion of the Data Model

Source: Gartner (May 2017)

Add the data in an order determined by those requirements. Think of this as adding data to
"release" the requirements that need it. High-value requirements are usually done first, so the data
they require will likewise be loaded early.

This supports the principle of having a long-term strategy, with a fully developed model, but
releasing frequent drops of value by meeting requirements quickly to provide benefit.


Arrange requirements in a matrix detailing their data sources, as in Figure 9. It is easy to prioritize
the introduction of data sources by looking at the effort and value of the requirements released by
each data source.
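
To make this prioritization concrete, the following minimal Python sketch ranks candidate data sources
by the business value of the requirements they would release. The requirement names, value scores and
source mappings are hypothetical and purely illustrative; in practice they would come from the
requirements-to-sources matrix shown in Figure 9.

# Hypothetical requirements matrix: each requirement lists the data
# sources it needs and an agreed business value score.
requirements = {
    "Sales analysis":         {"value": 8, "sources": {"sales"}},
    "Stock analysis":         {"value": 5, "sources": {"stock"}},
    "Sales forecasting":      {"value": 9, "sources": {"sales", "stock"}},
    "Profitability analysis": {"value": 7, "sources": {"sales", "cost"}},
}

loaded = set()                           # sources already in the DW
candidates = {"sales", "stock", "cost"}  # sources not yet loaded

def value_released(source, loaded, requirements):
    """Total value of requirements newly released by adding this source."""
    available = loaded | {source}
    return sum(r["value"] for r in requirements.values()
               if r["sources"] <= available and not r["sources"] <= loaded)

# Greedily pick the next source that releases the most business value.
while candidates:
    best = max(candidates, key=lambda s: value_released(s, loaded, requirements))
    print(f"Load '{best}' next; releases value {value_released(best, loaded, requirements)}")
    loaded.add(best)
    candidates.remove(best)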

Another beneficial effect of correctly phasing the DW build is the increasing reuse of existing data.
Aim to meet requirements not just from the new data, but also from new combinations of the old
and new data. It is not necessary to reload the old data; it is already there, ready for reuse.
Ultimately, when all the data is available, new requirements are simply new combinations of the data
that is already there. Development becomes more and more productive because, as time
progresses, much of the data needed for new requirements is already in place.

Figure 9. Cross Referencing Requirements and Data Sources

Source: Gartner (May 2017)

Figure 10 shows a simple example of a typical phasing and data reuse.

1. Sales data is loaded, and only sales requirements are met.


2. Stock data is added to meet stock analysis requirements. Requirements for sales forecasting
based on stock availability can now also be met, and more sales analysis can of course be done
because the sales data is still there.
3. When cost data is added, profitability analysis can be done because both sales and cost data
are present, and, of course, more sales and stock requirements can be met too.


Figure 10. Requirements Met by New Data and New Combinations of Data

Source: Gartner (May 2017)


Assess Data Warehouse Automation


Tools are now available to generate the data structures and code for a data warehouse. As part of
project initiation, it may be worthwhile assessing whether they would suit the project.

Data warehouse automation (DWA) tools hold out the promise of major gains in productivity for a
project. Up to a 500% productivity boost is not an unusual vendor claim. Project implementers have
to satisfy themselves that the tool is flexible enough to meet most of the project's requirements, and
that they will get meaningful productivity boosts in their environment. With most tools of this type, a
large proportion of the code and data definitions can be automatically generated, but, where
necessary, customization can be used to ensure that all the needed functions can be implemented.
The key question here is how much customization is necessary.

In this section, practical examples of current DWA tools are illustrated to show the concepts
involved. This document does not seek to evaluate or recommend particular products, and these
are for illustrative purposes, to show concrete examples of the capabilities available in this area.

Figure 11 shows a typical end-to-end process controlled by metadata, and from which all the
necessary data structures and ETL code can be generated. If the data structures or load processes
need amendment, the software will automatically regenerate the necessary code to effect the
change, such as to unload, transform and reload data to add columns.


Figure 11. Example: Data Warehouse Automation Approach

Source: Attunity


The item counts quoted in Figure 11 give an idea of the scale at which the tools operate. In
Figure 11, an "instruction" may be an individual load routine or a similar operation. The
tool can generate hundreds of entities and thousands of attributes.

The particular features available vary between products. Assess these as part of the project, and run
proofs of concept or pilots to ensure that the tool is suitable.

In the event of replatforming components of the LDW, the DWA tool can automatically generate all
the code for the new platform. This can include the translation of stored procedure logic.

For example, the original platform might be Microsoft SQL Server, which uses T-SQL for stored
procedures, and the DW is being moved to Oracle, which uses Procedural Language/Structured
Query Language (PL/SQL) in its stored procedures. Or it may be in the other direction. Some of
these tools have the capability to regenerate the stored procedures accordingly. As mentioned
above, it is not the intention of this document to perform a product evaluation, but rather to illustrate
the kinds of automated support that are available.

The tools usually work by implementing a layer of metadata over and above that used by the DW
itself. The metadata describes the data sources and targets, the mapping between them, translation
rules, data quality checks and so on.

The right side of Figure 12 illustrates the uniform metadata approach of one of the DWA tools. The
left side illustrates the mix of tools, and therefore metadata, that might be required in a traditional
approach.


Figure 12. Traditional Tooling vs. Data Warehouse Automation

Source: WhereScape


These tools typically maintain metadata that describes the data and the end-to-end process, and that
sits at a higher level than the metadata of the individual databases, tools and ETL. The tools then use
code generators to turn the platform-neutral metadata into the data definition language (DDL)
and ETL code necessary to implement the system on a particular platform. As requirements change,
the DWA tool works out the code changes needed, generates and places the appropriate
code on the platform, and then executes it. Where the changes require restructuring of existing data,
the tool generates the code to do that too, typically as jobs that unload,
restructure and reload the data as necessary.
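
As a simplified, hedged illustration of this metadata-driven principle (not a depiction of any specific
vendor's product), the Python sketch below turns a small, platform-neutral table description into
platform-specific DDL. The metadata structure, type mappings and table name are hypothetical; real
DWA tools operate at far greater scale and also generate the corresponding load and restructuring jobs.

# Hypothetical platform-neutral metadata describing one target table.
table_metadata = {
    "name": "customer_dim",
    "columns": [
        {"name": "customer_id", "type": "integer", "nullable": False},
        {"name": "full_name",   "type": "string",  "nullable": False},
        {"name": "segment",     "type": "string",  "nullable": True},
    ],
}

# Simple per-platform type mappings; a real tool would cover many more types.
TYPE_MAP = {
    "sqlserver": {"integer": "INT",    "string": "NVARCHAR(255)"},
    "oracle":    {"integer": "NUMBER", "string": "VARCHAR2(255)"},
}

def generate_ddl(meta, platform):
    """Turn the neutral metadata into platform-specific CREATE TABLE DDL."""
    columns = []
    for col in meta["columns"]:
        sql_type = TYPE_MAP[platform][col["type"]]
        null_clause = "" if col["nullable"] else " NOT NULL"
        columns.append(f"    {col['name']} {sql_type}{null_clause}")
    return f"CREATE TABLE {meta['name']} (\n" + ",\n".join(columns) + "\n);"

print(generate_ddl(table_metadata, "sqlserver"))
print(generate_ddl(table_metadata, "oracle"))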

A variety of tools and approaches are available for DWA. The previous examples show tools that sit
above multiple platforms, provide a uniform approach to the whole process and target a variety of
databases.

However, tools that focus on particular platforms also exist. With these, the aim is to leverage the
standard DBMS tooling that exists for the platform. Such tools add the automation capability
without replacing the existing tools. Figure 13 shows an example of this.
The DWA tool in this example is specific to Microsoft SQL Server as the target platform.

Figure 13 illustrates that, while effectively replacing SQL Server Integration Services (SSIS), the
DWA tool makes use of the SQL Server Management Studio and provides an end-to-end capability
while leveraging much of the existing familiar Microsoft tooling.


Figure 13. Example of DBMS Target-Specific Data Warehouse Automation

Source: BI Builders


Evaluate a Cloud-First Approach


Once you have designed your LDW architecture, you can deploy components on-premises, or in the
cloud as:

■ Platform as a service (PaaS)


■ Infrastructure as a service (IaaS)
■ Database PaaS (dbPaaS)
■ Integration PaaS (iPaaS)
■ Business analytics PaaS (baPaaS)
■ Business analytics software as a service (baSaaS)

When used appropriately, cloud computing can help you accelerate delivery, reduce cost, improve
ease of use and improve productivity. Several key drivers are pushing cloud as the first choice of
deployment. Recent technology innovations make cloud so attractive that some organizations —
especially small and midsize businesses (SMBs) — have decided to take a cloud-first approach.
Besides organizational size, data gravity is another key factor in determining the suitability of
cloud deployment. As more and more source applications move to the cloud, the "gravity" of the
data moves with them, making it more natural to host the DW in the cloud alongside them or to
virtualize access to them. Organizations at the beginner level can take full advantage of cloud computing
because they have less on-premises analytical and information infrastructure.
great for innovative projects, short-term sandboxes and marts, proofs of concepts, development
and testing.

To evaluate the cloud's implications to specific LDW components, refer to the following documents:

■ "Is the Cloud Right for Your Database?"


■ "In-Depth Assessment of Amazon Web Services"
■ "Evaluating Microsoft Azure's Cloud Database Services"
■ "In-Depth Assessment of Microsoft Azure IaaS"
■ "Migrating Enterprise Databases and Data to the Cloud"

Evaluate Open-Source Software


Today, good-quality open-source software is available for various LDW components such as
databases, data integration and business analytics. Examples are Apache Software Foundation
projects, Hortonworks, DataTorrent, Talend and Pentaho.

See "2017 Planning Guide for Data and Analytics" for further details.

When planning to use open-source software, ensure that you understand the provision and cost of
support.


Metadata: Business Glossary and Technical Metadata


The BG is the semantic foundation for the LDW and business analytics. Without such a shared
semantic foundation, the LDW produces suboptimal results and, even worse, causes confusion.
An agreed-on foundation avoids different people having to do the work of
producing their own definitions, and it reduces the risk that different definitions will conflict, which
could result in erroneous results and rework.

Although the role of IT is to facilitate and enable, the BG is used by the business. Therefore, build,
use and maintain the content of the BG in close collaboration with the business users.

See "EIM 1.0: Setting Up Enterprise Information Management and Governance" for more details.

Put Data Quality Measures in Place


The usefulness of an LDW largely depends on its data quality (aka information quality). Data quality
addresses data's fitness for its intended purposes. Different use cases have different data quality
requirements. For example, financial reports for regulatory compliance have higher data quality
requirements than big data exploration and discovery. These requirements span multiple
dimensions of data quality, such as accessibility, completeness and correctness.

Best practices include:

■ Directly tie data quality activities to key business objectives such as customer loyalty, risk
reduction and regulatory compliance. Identify high-value use cases, which are the essential
links between key business objectives and data quality metrics.
■ Assess the maturity of data quality processes in order to discover immediate opportunities for
improvement.
■ Analyze whether typical data quality issues fall into the business or the IT arena, and then align
roles, responsibilities and actions accordingly. Examples of IT-centric issues are lack of naming
standards and inappropriate database and application design. Examples of business-centric
issues are vague definitions of KPIs and immature business processes.
■ Make the invisible visible by exposing data quality's hidden costs and benefits.

Information Governance
To implement IG and to foster an analytical culture, both executive sponsorship (top-down) and
support from the trenches (bottom-up) are required.


Figure 14. The Guidance Framework: Product Management Approach to Enterprise Information Management
(EIM)

Source: Gartner (May 2017)

For example, one large oil and gas organization used to make decisions based on gut feeling. Its
analytical team predicted a high failure rate for a piece of oil-drilling equipment. However, the field
team discarded the prediction and went ahead with the drilling. The equipment broke within a few
hours, at huge expense.

The analytical team, with the support of the CIO, used this incident as a pivotal point to educate
business people and foster a strong analytical culture within the company. Today, analytics are fully
supported by business executives and field workers throughout the organization.

An EIM program, including IG, is critical for the LDW. Lack of EIM can lead to serious negative
business consequences such as failing to meet regulatory compliance, exposing business to
unnecessary risks or reducing value returned to the business.


"EIM 1.0: Setting Up Enterprise Information Management and Governance" from which Figure 14 is
taken, is highly recommended and describes how to deliver an EIM program in the same manner as
a product. The EIM "product" adopts the usual product life cycle stages: envision, strategize and
plan, define features, execute, measure and improve, and market and sell. This method can deliver
an EIM program that is highly usable, easy to understand and able to expand without drastic
changes.

The key to this is:

■ Manage EIM like a product.


■ Set up key roles in the IG council.

If your EIM program is not in place yet, you can still set up informal roles until the formal funding and
organizational structure are in place. Another step toward an EIM program is to form a business-IT
joint steering committee that can function as the temporary governance board until the official
board is formalized.

Formalize MDM for Critical Business Data


Master data is the subset of data that is mission-critical to the business (if MDM were shown on the
LDW architecture diagram, it would be positioned similarly to the ODS). MDM is not the focus of this
document, so it will not be dealt with in depth here. However, the LDW depends on reliable
master data, common examples of which are customers, products, suppliers, partners and citizens.
Either the LDW develops the master (or reference) data itself, or it loads it from an MDM
system.

MDM architectures consist of four styles: consolidation, registry, centralized and coexistence. They
differ in data authorship, data latency, persistence, the primary consumer and search complexity.

For more details, refer to "A Comparison of Master Data Management Implementation Styles."

Among the four MDM styles, the consolidation and registry styles are especially relevant to the
LDW. The consolidation style is primarily associated with data warehouses. The registry style offers
a lightweight approach to MDM and is suited for integrating diverse data sources.

Automate DW Testing
Automate testing where possible. DWs are large, complex systems with many interactions. Design
in testing methods in advance so that test runs are easily repeatable.

If you do not do this, then development is likely to be seriously impeded during the test phase
because large amounts of manual intervention will be needed to set up, run and resolve issues that
prevent the tests from running to completion.

One aspect of this is designing all complex batch-style processing for automatic recovery and
restart. Implement checkpoint/restart logic, and provide outputs that record progress. It
should be possible to resubmit a run and have it recognize whether it has run before. If it has, the
process should know how far it got and restart at the appropriate point. Alternatively, it
may simply reset the database to its starting position before the test executes. Tools are available
that can assist with this.
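
A minimal Python sketch of the checkpoint/restart idea is shown below. It records completed steps in a
checkpoint file and skips them when the run is resubmitted. The file name and step names are
hypothetical, and a production implementation would typically record progress in a control table rather
than a local file.

import json
import os

CHECKPOINT_FILE = "load_checkpoint.json"   # hypothetical checkpoint location

def load_checkpoint():
    """Return the set of steps completed by a previous run, if any."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(done):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(sorted(done), f)

def run_batch(steps):
    """Run each step once; on resubmission, completed steps are skipped."""
    done = load_checkpoint()
    for name, func in steps:
        if name in done:
            print(f"Skipping '{name}' (completed in an earlier run)")
            continue
        print(f"Running '{name}'")
        func()                     # may fail; the checkpoint preserves progress
        done.add(name)
        save_checkpoint(done)      # record progress after each step
    os.remove(CHECKPOINT_FILE)     # clean up once the whole run succeeds

# Hypothetical steps standing in for extract, transform and load jobs.
run_batch([
    ("extract_sales", lambda: None),
    ("transform_sales", lambda: None),
    ("load_sales", lambda: None),
])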

Figure 15 shows an example and the four main points of comparison within a DW system.

Figure 15. Four Points of Data Comparison Within the DW

Source: QuerySurge

Figure 15 shows an automation tool interacting with the system by placing testing "agents" at
different parts of the architecture. The sequence of execution is shown in Figure 16.


Figure 16. Example Functionality of an Automation Tool

Source: QuerySurge

The sequence is:

1. Client sends request to server to execute test.


2. Server assigns an agent to retrieve results from the source system for the test.
3. Server assigns another agent to retrieve the corresponding results from the target system.
4. The initial agent notifies the server that the source information is retrieved.
5. The other agent notifies the server that the destination result is retrieved (may complete before
Step 4).
6. Server assigns any available agent to compare the results and report any errors.
7. Agent returns to the server the outcome of the test case.
8. Server notifies all the clients of the results of the test.

Define tests, run them and capture comparison data from them. Then run them on a repeated basis
to test successive versions of the system.
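
The heart of such a test is the source-to-target comparison. The minimal Python sketch below retrieves
a result set from a source query and a target query and reports the differences; two in-memory SQLite
databases stand in for the real systems, and the table and column names are hypothetical.

import sqlite3

def fetch(conn, sql):
    """Retrieve a result set as a set of rows for order-independent comparison."""
    return set(conn.execute(sql).fetchall())

# In-memory databases stand in for the real source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.executescript("CREATE TABLE sales(id INT, amount REAL);"
                     "INSERT INTO sales VALUES (1, 10.0), (2, 20.0);")
target.executescript("CREATE TABLE dw_sales(id INT, amount REAL);"
                     "INSERT INTO dw_sales VALUES (1, 10.0), (2, 25.0);")

src_rows = fetch(source, "SELECT id, amount FROM sales")
tgt_rows = fetch(target, "SELECT id, amount FROM dw_sales")

missing_in_target = src_rows - tgt_rows
unexpected_in_target = tgt_rows - src_rows
if missing_in_target or unexpected_in_target:
    print("Test failed")
    print("  Missing in target:   ", missing_in_target)
    print("  Unexpected in target:", unexpected_in_target)
else:
    print("Test passed: source and target match")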

Data Ingest Design


Establish mappings between the sources and the DW model. Choose tools for ETL and data quality.
If an MDM system is being implemented as part of the solution, or if there is an interface to an existing
MDM system, design those interfaces now.

Design the major ETL processes, and incorporate the data quality work within them. Ensure that the
ETL runs are designed for restart and recovery and are rerunnable at will. Build in sufficient
instrumentation so that you can track progress and diagnose problems. Design into these runs the
ability to take snapshots of the data for use in automated testing.

Migration vs. Greenfield


Large elements of the DW may be migrated from existing systems rather than being new
developments.

Determine which parts of the system require a migration approach (data, ETL and reports) and
adopt the appropriate methodology. If a large proportion of the data feeds and reports are
migrations, then it will be worthwhile having a migration expert on the team.

Decide the methods and the priority with which the data and program objects are to be migrated. A
semiautomated method will work where scripts scan code and data and either automatically
convert or flag issues that need manual intervention. Develop estimation techniques to gauge the
level of effort. Typical metrics might be that 70% of the system converts with little effort, a further
25% requires a moderate amount of effort and the remaining 5% needs a complete rewrite. By
classifying each data and program "object" according to its difficulty and summing up the total
effort needed, a reasonable migration estimate can be arrived at. As the migration progresses, track
estimates versus actuals to get early warning of problems.
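
The estimation arithmetic itself is simple, as the hedged Python sketch below shows. The difficulty
categories, per-object effort figures and object counts are hypothetical; calibrate them against your own
early migration results.

# Hypothetical effort (person-days) per object for each difficulty class.
EFFORT_PER_OBJECT = {"easy": 0.5, "moderate": 3.0, "rewrite": 15.0}

# Hypothetical classification output from the scanning scripts (~70/25/5%).
inventory = {"easy": 700, "moderate": 250, "rewrite": 50}

total_objects = sum(inventory.values())
total_effort = sum(EFFORT_PER_OBJECT[c] * n for c, n in inventory.items())

print(f"{total_objects} objects, estimated effort {total_effort:.0f} person-days")
for category, count in inventory.items():
    effort = EFFORT_PER_OBJECT[category] * count
    print(f"  {category:<9} {count:>4} objects -> {effort:>7.1f} person-days")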

Also, consider hiring migration experts for particular combinations of source-to-target migrations.
With migration efforts, there is no substitute for experience.

See "Migrating Enterprise Databases and Data to the Cloud."

Report Suite Design


At this point in the DW development process, the business requirements, outputs and the target
data model are known, so report development can begin.

This should include large elements of self-service. That is, instead of implementing hundreds of
reports, meet some of the requirements by implementing general-purpose data structures, most
likely cubes, or star schemas that users can use to satisfy their own requests. This is a subtle but
important distinction. Self-service represents a higher-level requirement or "metarequirement."

Instead of meeting requirements directly, the design provides the means for users to meet their
own requirements. Technologies such as in-memory computing allow greater simplicity in
data design because data no longer needs to be preaggregated and stored, but can be rolled up on
demand.

If users require 10 different hierarchies for their rollups, each can be done dynamically with in-
memory computing. There is no overhead associated with the number of views needed because
each will cost the same as any other. None of these hierarchies requires special data structures to
be generated, nor ETL to feed them.
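
A minimal Python/pandas sketch of this idea follows. Once the detailed data is held in memory, each
hierarchy is simply a different rollup of the same rows, with no preaggregated structures or additional
ETL. The column names and hierarchies are hypothetical.

import pandas as pd

# Hypothetical detailed sales data held in memory.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "channel": ["Web", "Store", "Web", "Store"],
    "amount":  [100.0, 150.0, 120.0, 80.0],
})

# Each "hierarchy" is just a different rollup of the same detail rows;
# nothing is preaggregated and no extra data structures or ETL are needed.
hierarchies = {
    "geography": ["region", "product"],
    "channel":   ["channel", "region"],
}

for name, levels in hierarchies.items():
    print(f"Rollup for the {name} hierarchy:")
    print(sales.groupby(levels)["amount"].sum(), "\n")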


Nowadays, it is rare to develop DWs from scratch. There is normally an existing suite of legacy
reporting. In the design phase, explicitly identify new reports, modified reports and
obsolete reports. Each class may well account for one third of the requirements. You can save
considerable effort by disposing of obsolete reports.

Establish or Grow DW Platform


This document is primarily concerned with technical, not commercial, considerations. However, it is
necessary to purchase or subscribe to physical platforms. It is therefore incumbent upon the DW
design team to provide the necessary input to procurement, to ensure that the system
acquired is fit for purpose and represents good value for money. The system will either be new or an
expansion of existing platforms.

Refine the first-cut sizing produced earlier in the project to define the platform capacity needed, and
then develop this into a full-capacity plan.

It may be necessary to run pilots or proofs of concept to validate vendor claims. These are easier
to run in the cloud because physical platform provisioning is simpler. Also consider cloud-first as a
potential strategy. Once procured, the platform needs to be set up for the team to use.

Develop and Deploy the DW


Once the design phase is complete and the platform acquired, development can proceed, and it
should be a relatively mechanical process. Basic design of the data model, sources and reports will
be complete.

DW systems are all about the reuse of data, and modern techniques such as in-memory computing,
massively parallel processing, and schema on read lend themselves to being able to develop in an
agile and iterative manner.

The Initial Load Strategy


A particular concern is the initial load strategy. At the start of the life of the DW, there needs to be a
special exercise that has two parts.

1. Load the initial payload of data into the DW from an existing system or from an archive, or
simply start accruing data day by day.
2. The DW should then be able to add to the initial load in a way that is synchronized with the
source systems, with no gaps in data provision, thus ensuring continuity of the daily data.

Theoretically, only one run of the initial load is necessary. However, if only run once, it is difficult to
guarantee the reliability of this code. The project is often delayed while making repeated attempts to
get this one-off task to complete successfully.

Alternatively, make the daily DW load do double duty. It may be a daily batch job, a series of
microbatches or a continuous feed. Design and use it to also perform the initial load (a minimal
sketch follows the list below).


■ When loading historical data into the system, you can use a variant of the daily load.
■ After multiple runs, it will be debugged and become reliable.
■ Run this multiple times per day to back-load several days' data per day. It may even be
designed to be flexible, accepting any number of days' data to be loaded at a time. A single day's
data would be a special case. Catching up multiple days after an operations problem would be
another.
■ This has a number of advantages:
1. The project writes one set of code, not two (an initial load and a daily load).
2. When the project comes to rely on the daily load code, it will have been exercised many
hundreds of times by back-loading several hundred days of data. By the time it enters
normal production use, it will be bulletproof.
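
The following minimal Python sketch illustrates the double-duty pattern: one load routine is
parameterized by a date range, so the same code back-loads history, catches up after an outage and
performs the normal daily run. The function names are hypothetical and the real extract, transform and
load logic is omitted.

from datetime import date, timedelta

def load_day(business_date):
    """Hypothetical daily load for one day's data (extract, transform, load)."""
    print(f"Loading data for {business_date}")

def load_range(start, end):
    """The same routine handles back-loading, catch-up and the daily run."""
    day = start
    while day <= end:
        load_day(day)
        day += timedelta(days=1)

# Initial load: back-load two years of history with the same code...
load_range(date(2015, 1, 1), date(2016, 12, 31))
# ...then the normal daily run is just the single-day special case.
load_range(date.today(), date.today())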

Plan for continuous data ingestion, which is more flexible. Most modern technologies allow a
continuous trickle of data to feed into the data store. Use this to run a batch or microbatch of
updates. For some technologies, this can be a continuous process. This spreads the work of doing
loads more evenly across time, and can provide real-time or near-real-time updates. Microbatching
at 15-minute intervals is a common scheme.

If, for some reason, the DW gets out of synchronization with the sources or the DW becomes
corrupted by errors propagated from source systems, there will be a need to resynchronize. This is
usually handled by keeping multiple days' input in the DW staging area. Alternate designs such as
the Lambda architecture also cater to this by keeping data in its original, near-raw form so that it
can be reapplied to re-establish a known position.

Simultaneously with report development, provide self-service capability. It is desirable to have as
much of the workload as possible satisfied through self-service. This way of working is the main
subject of the second LDW development style and SLA.

In particular, look to technology such as in-memory computing to simplify data structures when
implementing virtual cubes and data marts.

Work with power users to design "cubes" that are a superset of the data required. Individual
"cubes" are actually logical views of subsets of this data. Instead of having to produce separate
physical cubes or data marts, which take time to generate and reorganize, it is possible to substitute
logical views. These are just metadata and can be changed much more easily.

Implement agile and sandboxing technology simultaneously with the regular reports; it enables the
early delivery of benefits even before the regular reporting suite has been developed. You may also
be able to do fast development of the reports this way, then move them across from the agile
stream using change control.

In parallel with agile working, develop the standard reporting suite. As data is loaded, it enables
the various sets of reports. Release the sets of reports in waves, as the underlying tables they need
are populated by the ETL team. Phasing the population of the tables enables regular delivery of
business benefit from the reporting functions. Benefit is realized when analysis is delivered into
users' hands, either by standard reports or via self-service.

Step 2/DV: Data Virtualization, Agile Development and Self-Service Stream


The second style of development and SLA for analytics is agile development using data
virtualization, marts and sandboxing. This style focuses on getting quick results from the existing
DW infrastructure, ideally without the need to move data or make copies. This style uses
"contender" data models and provides new uses for existing data.

The agile stream is about enabling two key things:

1. Making subsets of the data available quickly and easily. This is done using physical or virtual
data marts or through data virtualization.
2. Being able to quickly combine that data in new ways, again through the capabilities of the marts
or through data virtualization.

Unlike the DW, where everyone uses an agreed-on data model reached by compromise or
consensus, the agile mode analyst finds that the data may be there, but not in the form needed for
analysis. Use a contender model for agile, and fold it into the main model later if that makes sense.

It will be noticeable that this section, and the next on the data lake, are much shorter than that for
the classic DW. This is simply because a number of activities may not need to be done because
they have already been done by another stream, because they are done to a lesser degree or
because they can be deferred.

For example, data marts and virtualization usually make use of one or more data sources that have
already been defined. Some data re-engineering may be required, but typically not the amount of
effort that was needed to originally design them.

The agile and data lake streams can leverage work done previously. This would include data stores
that have been established, reference data and data definitions.

Agile Data Marts and Sandboxes


Agile development for analytics has been going on for at least two decades, originally with data
marts — systems that are like small DWs serving a particular function or department.


Figure 17. Agile Development With Data Marts and in DW

Source: Gartner (May 2017)


Figure 17 shows this in simplified form. In practice, there would be many more marts and
spreadsheets. It would be common to find tens of marts and hundreds or thousands of
spreadsheets in a large organization.

Data can be collected and loaded into a mart, and reports can be produced from the mart. Marts
may persist their data, performing the same function for months or years. Alternatively, reuse them
or create them for a particular purpose and break them down when the project is complete to free
up resource for the next requirement.

Data from an existing DW can be copied using ETL tools. Data might be loaded to a mart directly
from source systems, thus bypassing the DW, but this is not recommended.

This used to be the standard method of working, and, to be clear, it was indeed agile, at least in the
short to medium term. There was nothing stopping the developers doing whatever they wanted for
each mart, whenever they wanted to do it, and they could deliver results quickly. However, there
were a number of problems:

■ Marts were point solutions. The models used were often incompatible with each other.
(Although much of the work of Kimball and others sought to combat this with the concept of
"conformed dimensions" between collections of marts.)
■ They created many redundant copies of the data. This resulted in the occupation of much more
disk space than necessary. This also created more potential for data breaches.
■ Each mart had a limited amount of power. In aggregate, all the marts would represent a great
deal of compute power. However, all this aggregate power could not be brought to bear on the
same problem, because the marts were hosted on separate physical machines.

A good alternative is to create sandboxes or marts within the DW platform. As shown on the right in
Figure 17, sandboxes are ring-fenced within the platform, and users can load their own data into
these private spaces. Security privileges separate sandbox marts from the DW and from each other.
An important difference is that the sandboxes can join their data with that in the DW, so there is no
need to copy large amounts of DW data multiple times.

However, for this style to work, the underlying DW DBMS has to be good at running concurrent
workloads and have very good workload management so that the sandbox work cannot interfere
with the DW workload. If this is not true, then the sandbox work will negatively affect the DW SLAs.

IT provisions the sandboxes for repeated agile use. Data scientists would be a good example of the
kinds of users who would make use of these. Sandbox users have access to all the DW data, plus
any other data they need for their analysis, without having to copy and transform it. This principle
extends further when data virtualization is used.

Data Virtualization
The agile system components discussed thus far are physical: separate physical data marts or
marts/sandboxes sharing the same physical DW platform. However, an important enabler of agile
analytical development is data virtualization.


Figure 18. Different Styles of Data Virtualization

Source: Gartner (May 2017)


Data virtualization is an essential capability for the LDW. It is the glue that connects all the other
components. Seeing it simply as an aid to agile development, by making data sources easier to
obtain and analyze, is to underestimate it. Because data virtualization is so central to the success of
the LDW, a significant amount of space is dedicated to it in this section.

There are different styles of virtualization, as illustrated in Figure 18. On the left, there is a BI tool,
which can make calls to multiple back-end databases. The tool displays the data retrieved by those
calls on the user's screen. Combining the data is restricted to what the BI tool can do with its
programming and with the power available on its server. This is sometimes referred to as
"integration on the glass," since it is only on the screen that the data is combined.

The center diagram of Figure 18 shows data brought from multiple back-end systems consolidated
into physical data structures, usually some kind of special cube server. Once the data structure is
complete, the BI tool runs the queries against it.

The consolidation data structure may be limited in scalability. In addition, there may be restrictions
on update. Having to build a special data structure before any querying can occur makes it difficult
to implement real-time or near-real-time query requirements. Designers need to ensure their
requirements fit this set of capabilities if it is used.

The third option, shown on the right side of Figure 18, is virtualization in fully developed form, with a
virtualization engine accepting query requests and then dynamically decomposing them into
subqueries against multiple sources. The virtualization server fulfils all the functions needed to query
the multiple sources, as shown in the figure. Readers will recognize the capabilities shown as being
the same as those implemented by a DBMS. These include query planning, cost-based
optimization, parallel processing, caching, and the parallel transfer of data in and out of the server.

The virtualization server needs to be intelligent enough to delegate or "push down" processing to
the source systems where appropriate. It should transfer data in parallel between the source
systems and itself, using fast communication links and multiple sessions. Where it cannot push
down the processing, such as when joining 500 GB from one source with 500 GB from another, it
should be able to use multicore, multinode parallel processing of its own to do this. Sometimes, the
virtualization server is an independent processing layer and server. Alternatively, it may be an
extension of an underlying DBMS, so that one of the DBMSs in the LDW acts as a "hub." That hub
is aware of remote systems and the connections to them, and it takes the data virtualization
decisions. Microsoft's Polybase is a good example of this.
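
The principle of decomposition and pushdown can be illustrated with a minimal Python sketch. Two
in-memory SQLite databases stand in for remote LDW sources; filtering and aggregation are "pushed
down" to each source, and the small partial results are joined locally with pandas. A real virtualization
server adds cost-based optimization, parallel transfer, caching and unified security, so this is only a
conceptual illustration, and all table and column names are hypothetical.

import sqlite3
import pandas as pd

# Two in-memory databases stand in for remote LDW sources.
crm = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")
crm.executescript("CREATE TABLE customers(id INT, segment TEXT);"
                  "INSERT INTO customers VALUES (1, 'Retail'), (2, 'Corporate');")
dw.executescript("CREATE TABLE sales(customer_id INT, amount REAL);"
                 "INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (2, 200.0);")

# "Push down" filtering and aggregation to each source so that only small
# result sets travel to the virtualization layer...
customers = pd.read_sql_query(
    "SELECT id, segment FROM customers WHERE segment = 'Retail'", crm)
sales = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id", dw)

# ...then join the partial results locally and present one answer to the user.
result = customers.merge(sales, left_on="id", right_on="customer_id")
print(result[["id", "segment", "total"]])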

Thus, data virtualization can mean different things to different people, and there are
many degrees to which these capabilities can be implemented. Other terms are used too. For
example, "data federation" may mean simply that the data is remote and can be accessed.
Federated access is allowed, but more sophisticated features such as true decoupling of
consumers and sources, unified governance and security, and cost-based optimization are lacking.
Similarly, a "semantic tier" may be introduced: a layer of abstraction that
makes it look as if all the data sources are part of one system, with uniform access. But the more
sophisticated features just mentioned (which are needed to provide an acceptable user experience)
may be lacking.


Figure 19 shows a practical example of a data virtualization capability. The top row on the left side
shows the multiple entry points into the virtualized/federated structure.

The second row shows the different types of systems reached as targets. The third row on the left
shows the variety of systems reached using open-source Presto as part of a data virtualization
scheme. The right side of the figure illustrates a typical system architecture. The multiple parallel
lines between the nodes in the network denote fast and efficient parallel transfer. Multiple elements
of the architecture run on parallel processing compute elements, and cost-based optimization
determines where and how queries and subqueries run. Write capability as well as read is possible,
and links are bidirectional.


Figure 19. Example of Data Virtualization

Source: Teradata


If your organization has a database standard for the DW, such as Teradata, Oracle, IBM, SAP or
Microsoft, and if that database has virtualization capabilities, then it is natural to simply grow the
DW out into the LDW and keep a consistent interface. Adding another federated target system such
as Hadoop to your existing remote databases is straightforward.

Figure 20 shows an example reference architecture that shows virtualization in relation to other
components. In this case, it represents a product implementing a stand-alone virtualization engine
rather than one built into an existing database. Typically, these kinds of products can run on-
premises or in the cloud. The correspondence to Figure 4, the example LDW architecture, is clear.
However, it is useful to show another variation on the theme.


Figure 20. Example Virtualization Reference Architecture

Source: Denodo


Note within Figure 20 the addition of the "in-memory data fabric," a multicore parallel processing
engine that has sufficient power to join large amounts of data. If it is not possible to use the
underlying databases to perform a particularly large virtualization function, then, in this example, the
virtualization layer supplies its own engine. This feature, due to be released in this particular
product in July 2017, illustrates the demand for a parallel processing capability either within the
virtualization layer itself or accessible from it. When implementing LDW functions, it is important to
check that this kind of processing engine is available. Parallel processing should be available either in
the virtualized platforms themselves, in the "hub" database or in a separate virtualized platform. Also,
verify that the virtualization software knows how to use the parallelism wherever it may reside.


Figure 21. Functions on a Remote Platform

Source: Teradata


With some virtualization platforms, you can execute a function defined on one platform from
another. You can consider an LDW as a collection of data storage types and data operators. If you
can only request data using a single standard query language, typically standard SQL, then this can
deny access to functionality that is available on some of the federated platforms. If the purpose of
data virtualization is to provide a uniform interface to multiple data sources, then this is less
important. However, if users, particularly power users or data scientists, wish to have access to
special functionality in federated systems, this will be worth considering.

Figure 21 shows an example. Here you can define a custom Java function to shred a JavaScript
Object Notation (JSON) document in Hadoop. A user on a relational DW platform is invoking this
function to help return remote data from Hadoop.

Forward Planning for Virtualization


From an architectural and agile development point of view, it is useful to plan for increasing use of
virtualization. Virtualized and nonvirtualized access will likely become indistinguishable. Figure 22
shows this evolution. The green connections in the figure represent "data distance": the bandwidth
with which data can be transmitted between two points. Components that have low bandwidth
connections are "far apart," and components with high bandwidth are "close together," regardless
of their physical proximity.


Figure 22. Virtualization History and Trends

Source: Gartner (May 2017)


Figure 22 illustrates that:

1. In the mainframe era, there was relatively slow data transfer. Applications that coexisted were
close together within the same mainframe, but data was extracted from a target system that
was "far away." Multiple mainframes could feed a classic DW, but each was relatively remote,
typically fed once a day with batch jobs.
2. With client/server, there was sufficient bandwidth between application and database servers to
scale by separating out application servers. ERP systems used this.
3. Modern cloud systems have very high bandwidth connections and can treat compute and
storage independently. Multiple Hadoop clusters can connect to the same storage.
4. A modern in-memory system can consist of multiple processing engines communicating with
each other over high-speed networks or via shared memory. There may be multiple engines,
RDBMSs, online analytical processing (OLAP), text, graphs and so on working in close
collaboration.
5. The logical consequence is that multiple collaborating engines are effectively becoming close
together, whether they are inboard or outboard. You can increasingly view them as if they were
within the same system, provided that the components are sufficiently powerful, that they can
transfer data quickly and that there is intelligent overall query optimization. If all these
components are in the cloud, the line between a virtualized and a nonvirtualized system blurs.

Interconnecting components with fast, parallel connections, which are continually becoming faster,
blurs the distinction between virtualized and nonvirtualized systems.

The implication is that the virtualization component of the LDW can be chosen to enable agile
access now but contribute to establishing an overall "data fabric" as part of the long-term goal.

Establish or Grow Agile and Self-Service Platforms


The agile platforms include data virtualization software, as discussed above, to obtain and combine
data much more easily wherever it may reside, as well as physical and virtual marts. Physical on-premises
marts are becoming less popular given cloud-based alternatives. A virtual mart is a logical view of a
subset of DW data. Instead of giving 20 different user groups their own physical marts, and all the
ETL to connect them, give them 20 different views of the same underlying data. This will most likely
require in-memory database techniques to provide adequate performance.
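
A minimal sketch of the virtual mart idea is shown below: instead of building and feeding separate
physical marts, generate one logical view per user group over the shared DW tables. The group, table
and column names are hypothetical.

# Hypothetical user groups and the subset of DW columns each should see.
mart_definitions = {
    "sales_team":   {"table": "fact_sales", "columns": ["order_date", "region", "amount"]},
    "finance_team": {"table": "fact_sales", "columns": ["order_date", "amount", "cost"]},
}

def view_ddl(group, spec):
    """One logical view per group replaces a physical mart and its ETL."""
    column_list = ", ".join(spec["columns"])
    return f"CREATE VIEW mart_{group} AS SELECT {column_list} FROM {spec['table']};"

for group, spec in mart_definitions.items():
    print(view_ddl(group, spec))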

An extension of this idea is to provide for self-service with general-purpose data structures that are
easy to navigate. These allow users to meet their own analytical needs, either wholly or in part, as
part of agile mart or sandbox development. This removes the need for users to make information
requests of a development group to get analytical answers.

Suitable front-end BI tools allow users to answer their questions in an intuitive manner. Metadata
tools in turn support BI tools. Users can find out what data is available, and what it can be used for,
by using metadata and business glossaries.


Metadata support extends to data lineage. By right-clicking on a field in the BI tool, the user can
trace back where the value they are looking at came from and how it was calculated. This removes
a whole raft of documentation that would be labor-intensive to provide manually. To be able to do this,
care must be taken in the choice of the underlying data platform, ETL tool, metadata support and
front-end tool to ensure that the metadata can be connected from source to target.

Self-service data provisioning is now possible. With the increasing sophistication of ETL and
virtualization tools, it is now possible for analysts to source their own data. With this, analysts can
identify and request data from any number of sources, then transform and profile it for analysis. This
capability complements IT rather than replacing it. In fact, usually IT is essential in setting up the
metadata and tools needed. Users then request their data using this framework. This is particularly
relevant to working with sandboxes.

Use automated profiling of the data to understand its format and content. The tools use
sampling and statistical profiling to determine what the data is. They can go further and
suggest likely correspondences between data source files and target data models based on the
characteristics of the data, similarity in naming and so on. This can save very large amounts of time.
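
A very small Python sketch of the idea follows: it profiles the columns of a source file and suggests
likely target-model correspondences from name similarity alone. Real profiling tools go much further,
using sampling, statistical fingerprinting and machine learning; the column names here are hypothetical.

import difflib
import pandas as pd

# Hypothetical source file and target model column lists.
source = pd.DataFrame({"cust_nm": ["Ann", "Bob"], "ord_amt": [10.5, 20.0]})
target_columns = ["customer_name", "order_amount", "order_date"]

# Basic profile of each source column: inferred type and distinct count.
for col in source.columns:
    print(f"{col}: dtype={source[col].dtype}, distinct={source[col].nunique()}")

# Suggest likely source-to-target correspondences by name similarity.
for col in source.columns:
    match = difflib.get_close_matches(col, target_columns, n=1, cutoff=0.3)
    suggestion = match[0] if match else "no suggestion"
    print(f"Suggested mapping: {col} -> {suggestion}")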

Step 2/DL: Data Lake Stream


The third style of analytical system development and type of SLA is the data lake. This SLA uses
what has been termed "candidate" data models because raw data is often being loaded, and a new
model is being derived from it.

Gartner defines the data lake as:

"A collection of storage instances of various data assets. These assets are stored in a near-exact, or
even exact, copy of the source format and are in addition to the originating data stores."

Incorporate newly derived data models in the DW later on, if that is appropriate. Equally, the existing
models used by the other streams can be used as templates to define the schema on read models
used for the data lake.

The data lake stream can be broken down into its own set of implementation steps, as illustrated by
Figure 23. These are:

■ Design: Design the data lake architecture and platform.


■ Data acquisition: Use data acquisition to populate the lake.
■ Insight development and discovery: Process the data to derive insights.
■ Optimization and governance: The platform requires administration, security, maintenance and
performance tuning.
■ Analytics consumption: This makes the insights available to users.


Figure 23. Conceptual Architecture for the Data Lake

Source: Gartner (May 2017)

Design the Data Lake Architecture


Unlike the other two SLAs, in this case, there is not necessarily a pre-existing and predefined data
model for the data. Instead, the data structure can be interpreted as the data is read.

Using schema on read, users of the lake will infer the structure as they work with the data. This
avoids the time and effort needed to predefine an agreed-on data model and to check and to
structure the incoming data to ensure it fits in the model. Within the data lake stream, the data can
always be loaded, and the structure can be worked out later.

However, this is not to say that there is no need for a data model. The users and developers need to
know how they will represent extracted data and how it will fit with existing modelled data. Having
everyone make up his or her own definition of entities such as "Customer" makes no sense. Thus, it
is wrong to assume that data models have no relevance to the data lake. At some point, derived
information from the data lake connects to other enterprise data, and data consistency will be
necessary.

There is typically less work to do in advance than for the DW stream, as seen in Table 1; thus, this
section is shorter than the section on the central data warehouse component. If this were not so,
then something would be wrong, because one of the objectives of the data lake is to be able to
store and analyze data without so much dependence on upfront work. Because of this, and for the
right use cases, it can deliver results sooner. The agile stream and the data lake stream have these
characteristics in common.


Gartner's "Best Practices for Designing Your Data Lake" describes the need to separate out the
different disciplines, as illustrated in Figure 23. The organization is likely to have these in place
already for other platforms, such as the data warehouse and data marts. Expand these or mirror
them for the data lake.

As Figure 23 shows, the data lake divides into separate areas of interest. Each of these will have its
own subsidiary data lake SLAs to be met. In the rest of this section, the data lake SLAs being
referred to are these subsidiary SLAs needed specifically for the data lake. They are not the three
umbrella analytic development SLAs that characterize each of the three main development streams.

Data Acquisition
Load the data in large volumes and in raw or near-raw form.

Following the setting up of the data lake, put in place a well-defined data flow including sourcing,
metadata extraction and management. Separate this logically or physically.

■ Use change data capture (CDC) to capture and move only the data that has changed in the
sources each day, or use batch or microbatch loading (a minimal CDC sketch follows this list).
■ Make use of the separation of compute and storage. Multiple processing clusters can share a
single copy of the data in storage. These can vary in size.
■ If moving to cloud, then you can use an object store such as Amazon S3 as the staging area or
even the data lake itself. This approach is beginning to be widely adopted.
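
As a minimal illustration of the CDC idea, the Python sketch below captures only the rows changed
since the previous load's high-water mark and lands them in raw form. It assumes a simple
last-modified timestamp on the source rows, whereas commercial CDC tools more commonly read the
database transaction log; all file and field names are hypothetical.

import json
from datetime import datetime

# Hypothetical source rows carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "value": "a", "modified": "2017-05-01T10:00:00"},
    {"id": 2, "value": "b", "modified": "2017-05-02T09:30:00"},
    {"id": 3, "value": "c", "modified": "2017-05-02T11:45:00"},
]

last_watermark = datetime(2017, 5, 2, 0, 0, 0)   # high-water mark of the previous load

# Capture only the rows changed since the last load and land them in raw form.
changed = [row for row in source_rows
           if datetime.fromisoformat(row["modified"]) > last_watermark]

with open("landing_2017-05-02.json", "w") as f:   # hypothetical landing-zone file
    json.dump(changed, f)

new_watermark = max(datetime.fromisoformat(row["modified"]) for row in changed)
print(f"Captured {len(changed)} changed rows; new watermark {new_watermark}")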

Determine data acquisition by looking at popular business-centric use cases. These include 360-
degree customer view, operations, product marketing and management, compliance and risk
management, and support for new business models. See "Selecting Impactful Big Data Use
Cases."

Discovery and Development of Insight


Enable data interrogation, model discovery and the development of analytics.

■ Data profiling, including machine learning techniques, can be used to discover structure and
relationships with existing data.
■ Extract load transform (ELT) processing is increasingly moving to the data lake. Established
transformation tools, such as Informatica, can make use of the data lake platform as an engine
for transformation.

Data Lake Optimization and Governance


This discipline ensures that the lake performs as required for its different users. Ensure that users
understand the content and origin of the data. Users must also understand the working practices
and disciplines that help them get the most out of working with the data lake, such as
achieving reasonable performance. This area also includes refining and transforming data, semantic
consistency and data governance, and optimization of models.


Governance is not optional; there is likely to be an increase in regulations. For example, in Europe,
the General Data Protection Regulation (GDPR) will require organizations within the European Union
(including the U.K. before it leaves) to adhere to stricter rules. The GDPR regards Internet Protocol
(IP) addresses as personal data. Data lake users who collect data that may contain IP addresses
need to be aware of this.
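
As a small, hedged illustration of what being aware of this can mean in practice, the Python sketch
below scans raw landed records for IPv4 address patterns so that potentially personal data can be
flagged for governance. The regular expression and sample records are illustrative only; production
scanning would also cover IPv6 and use proper data classification tooling.

import re

# Simple IPv4 pattern; production scanning would be broader (IPv6, context checks).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical raw weblog-style records landed in the data lake.
records = [
    "2017-05-17 10:02:11 GET /home 192.168.0.12",
    "2017-05-17 10:02:15 GET /pricing 10.0.4.7",
    "2017-05-17 10:02:20 batch job completed",
]

flagged = [(i, IPV4.findall(rec)) for i, rec in enumerate(records) if IPV4.search(rec)]
for record_no, addresses in flagged:
    print(f"Record {record_no} may contain personal data (IP addresses): {addresses}")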

The data lake is increasingly becoming the main landing zone for the other data sources, so data
governance becomes more important because the data lake is feeding the entire LDW.

Data Lake Analytics Consumption


This area covers operationalizing the insights obtained, making the insights and information available to a
wider audience in the most suitable and most efficient manner. Direct this according to who is
consuming the analytics and what kind of tools they will use:

■ Casual user or BI analyst using BI tools such as Tableau Software or Power BI or simply using
portals. Those doing SQL query against the lake itself will most likely be using tools such as
Apache Hive or Spark.
■ Power user, as above, plus possibly R, SAS and more sophisticated SQL usage.
■ Data scientists using languages such as R, Python, Java and JavaScript, along with their
specialist libraries. These users will typically also want to use their own "sandboxes" within the
data lake and will have the skills to make use of them.
■ Specific areas, such as IT operations, will want to use the lake for specific analysis, such as
monitoring servers, analyzing weblogs and investigating cyberattacks.
■ Marketing users will most likely want direct access to detailed information, in particular social
media data. The new role of chief marketing technologist (CMT), as part of the digital marketing
department, can be the bridge between IT and marketing.
■ The chief data officer (CDO) will have an interest as a strategist and for governance. The CDO
will want to ensure that other users have data that is reliable and well-understood.
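
As an illustration of the SQL-style access to the lake mentioned in the first item above, the following minimal Spark SQL sketch registers curated lake data as a view and runs a typical analyst aggregation. The table, column and path names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake_query").getOrCreate()

# Register the curated data as a temporary view so BI-style SQL can be issued against the lake
spark.read.parquet("s3a://corp-lake/curated/transactions/").createOrReplaceTempView("transactions")

# A typical analyst query: monthly revenue by product line
monthly = spark.sql("""
    SELECT product_line,
           date_format(txn_date, 'yyyy-MM') AS month,
           SUM(amount)                      AS revenue
    FROM transactions
    GROUP BY product_line, date_format(txn_date, 'yyyy-MM')
    ORDER BY month, product_line
""")
monthly.show(20)
```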

In some cases, the analytics will be accessed directly, but in others, they will be passed to the DW
or to marts for serving to the users. That is, the data lake can act as a preprocessing layer to deliver
complete sets of analytics into another part of the LDW architecture for further distribution. This is
another example of where the different "engines" of the LDW are complementary. You can perform
analysis within the lake, but that may not be the best place to make the results available to
hundreds of users. Similarly, you can perform some very fast analysis using structured data and
optimization within the DW before performing further phases within the data lake.

Each of these data lake areas represents a separate discipline, and this work will not just happen by itself. Plan for it as you would for other parts of the LDW architecture. The teams responsible are not necessarily centralized, and they may differ from those covering other parts of the LDW architecture, although there are obvious advantages to having the different administration groups collaborate and perhaps colocate.

For the data to be useful, it must go through some process of interpretation. Many users of the data lake will not want to interpret their data every time they use it. Holding data in an agreed format thus saves time and effort and avoids confusion. However, the format will be much looser than those found elsewhere in the LDW, and it may be much closer to the raw data format. Where an agreed format is not possible, hold the data in raw form.

Do not ignore data design in all cases just because it is sometimes unnecessary; this is a common mistake in building data lakes. See Gartner's "Five Reasons Why Big Data Needs Metadata Management and How to Leverage It."

The separate areas above will be familiar; they deal with data coming in, data organization,
processing of data (platform and techniques) and making the information available. These are the
same broad areas seen for each of the LDW components, the DW, the ODS, marts and sandboxes,
and indeed the LDW itself. These same requirements exist for the data lake. They may be much less
onerous and more flexible here, but that does not mean they do not exist.

Provision the Data Lake


When establishing the data lake as part of your LDW, a suitable platform has to be procured and
implemented. When you are ready to select and implement big data technologies, the following
documents will be useful:

■ "Identifying and Selecting the Optimal Persistent Data Store for Big Data Initiatives"
■ "Framework for Assessing NoSQL Databases"
■ "Market Guide for Hadoop Distributions"
■ "Comparing Four Hadoop Integration Architectures"
■ "Toolkit: Answers to the FAQs on Hadoop Infrastructure"

Secure the Data Lake


Obtaining advanced analytics from big data using a data lake can be thought of as having three
main stages:

■ Upstream: Deals with the ingestion of data on demand, in batch and in real time, such as from
IoT sensors.
■ Midstream: Concerns itself with aggregation of data within the data lake platform using the
storage and processing elements there.
■ Downstream: Here, data scientists and other analysts discover insights, using predictive
analytics, machine learning, programming and query. Smaller result sets might be extracted into
analytics platforms, or those platforms may "push down" processing to delegate large-scale
processing back into the lake before the result sets are returned to the analytics tool.

When used as part of an LDW, this pipeline has to be properly secured. See "Securing the Big Data and Advanced Analytics Pipeline" for further details.

The concepts of data-centric audit and protection (DCAP) are also useful here. In designing the
LDW, the architecture diagram can be used to systematically work out how each data store and
data flow is secured. This is a very wide subject, and a great deal of information is available on it, so it
will not be duplicated here. However, it is essential to put in place end-to-end controls in a way that
secures the data but does not unduly constrain the agility of the agile development stream or work
on the data lake.

Some tools allow you to scramble your data through tokenization or encryption so that intruders cannot read data obtained illegally. The software provides interfaces that tokenize and untokenize the data so that legitimate users and programs see it in the clear.
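
As a minimal sketch of the tokenization idea, the following Python fragment uses keyed hashing to produce deterministic, non-reversible tokens, with a simple in-memory vault standing in for the lookup service that real tokenization products provide. The key handling and vault shown are illustrative assumptions.

```python
import hmac
import hashlib

# Keyed tokenization: the same input always yields the same token, so joins and
# group-bys still work, but the original value cannot be derived without the key.
SECRET_KEY = b"replace-with-a-managed-secret"   # assumption: supplied by a key management service

def tokenize(value: str) -> str:
    """Return a deterministic, non-reversible token for a sensitive field."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Legitimate applications that need the clear value keep a secured lookup table
# (or use reversible, format-preserving encryption instead of a keyed hash).
token_vault = {}

def tokenize_and_record(value: str) -> str:
    token = tokenize(value)
    token_vault[token] = value      # in practice this vault lives in a hardened store
    return token

def detokenize(token: str) -> str:
    return token_vault[token]

print(tokenize_and_record("192.168.10.42"))
```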

Cloud Adoption
To remain competitive in the cloud era, a well-planned cloud strategy for data management is critical. Evaluate and consider expanding adoption of dbPaaS, iPaaS, baPaaS and baSaaS. These options may be complementary to on-premises data warehouses. If your existing on-premises data warehouses are providing high business value, consider leaving them alone. Instead, focus on moving new and innovative use cases, such as sandboxes, to the cloud.

In the past few years, the database management system (DBMS) landscape has exploded with a
wide variety of options ranging from software to appliances, from cloud to on-premises and from
traditional relational to NoSQL databases. Decision criteria include:

■ Cloud or on-premises
■ Use of a database appliance
■ DBMS type, including Hadoop, key-value, graph, table-style, relational, columnar and multimodel

baSaaS is especially beneficial when data gravity is cloud-centric and your organization lacks skills in certain advanced analytics areas. baSaaS naturally pulls in other types of cloud adoption, such as iPaaS and dbPaaS.

Also, the distinction between NoSQL storage and cloud object storage is blurring. A service such as
Amazon's Athena, which allows serverless SQL querying of S3 object data, is an example.
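
As a hedged sketch of this pattern, the following Python fragment submits a serverless query over data already catalogued on S3 using the AWS SDK (boto3) for Athena. The region, database, table and bucket names are illustrative assumptions.

```python
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Start a serverless query against object data catalogued over S3
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status",
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://corp-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```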

This ends the description of the third of the three solution paths. In the body of the project, choose
one or more of these paths. At the end of each build phase, there will be a common wrap-up, which
completes the current phase and prepares for the next iteration.

Step 3: Wrap Up the Development Cycle


At the end of the design and build phase, and no matter which combination of the three paths was
chosen, there will be a common wrap-up phase.

There are two main objectives for this wrap-up phase:

1. Ensure that what has been built is sustainable


2. Prepare the overall process for the next design-build cycle

This paper will not duplicate the wealth of advice available on setting up systems to ensure
continued reliable service. Rather, this section focuses on aspects that are particular to the LDW.

Change Control
A major benefit of the "three path" model is that moving data or analysis between the paths
prompts action to preserve the integrity of the overall system without destroying the SLA for each
path. Table 2 lists what needs to be done when moving data or units of analysis from one path to
another.

Applying the appropriate controls (or freeing them up) as you cross the boundaries preserves the
integrity of the SLA for each style. Treat the disciplines differently to avoid getting into a mess. It is
easy to see how this might come about:

■ Allowing production reporting to infiltrate the agile environment introduces the need for rigid
controls that destroy that environment.
■ Moving data from the agile or lake environments without adequate quality checks destroys the
credibility of the DW data.
■ Moving DW data to the lake where anyone can access it can open data privacy exposures.

However, taking a structured approach and explicitly adding or removing controls as you cross the boundaries between the streams makes this manageable. It also makes it easier to quantify the effort involved.

Change control between paths is illustrated in Table 2. As seen in the example project plan (see Figure 5), it occurs regularly in every stream throughout the project.

Change control between the different styles of development is essential. There is little or no change
control needed within the agile development stream because the whole point is to enable fast
development.

Often, agile development will result in analysis that should become part of regular reporting and
analysis. Transfer it in a controlled fashion to the DW or the data lake. That is, as development
transfers from one development style to another, full change control kicks in. Assist this by having
the agile style use as much of the existing data models and tools as possible so that suggested
changes are adaptations to what is already there rather than completely new constructs. This avoids
"reinventing the wheel."

The DW team, in developing schemas for CUSTOMER, TRANSACTION and so on, has expended
effort doing this. It makes sense to reuse these definitions rather than invent new ones.

Submit requirements specifications and resulting development artifacts to a change control function for the LDW. A specific task then merges them into their proper place in the architecture. Developers other than the original creators of the code will most likely perform the change control and transfer of function. This ensures that the right skills and adequate resources are brought to bear on each task.

Table 2. Change Control Actions When Moving Between LDW Paths

Model used by each path: DW = compromise (consensus) model; agile = contender model; data lake = candidate model.

Moving from the DW:
■ To the DW: standard DW development.
■ To agile: data copying or sandboxing via views; share the DW data model/metadata; keep a record of copies and views made; optionally anonymize data; remove DW production change control.
■ To the data lake: physical copying and placement in a sandbox; the DW data model is the template for schema on read; anonymize data; remove DW change control; add security controls.

Moving from agile:
■ To the DW: apply security controls; align the contender model with, or adopt it into, the DW model; validate data sources and data quality; instigate backup/recovery for the data.
■ To agile: standard agile development.
■ To the data lake: apply access controls for lake users; provide the contender model for schema on read; physically copy data and log which data has been copied.

Moving from the data lake:
■ To the DW: apply data quality checks; apply role-based security; align the candidate model with the DW model and adopt candidate elements; if virtualizing, provide views and apply role-based security at the DW and lake level as appropriate; instigate backup/recovery for production; redirect data feeds.
■ To agile: provide a physical copy or virtualized access; apply data quality and transformation to allow joining of data; develop schema on read in line with the agile data model and/or modify the contender model to merge in candidate model elements.
■ To the data lake: standard data lake development.

Source: Gartner (May 2017)

Once transferred, delete the completed requirements from the agile environment. Repurpose the sandboxes for other projects. Using sandboxes for many months for the same purpose is a red flag: it can indicate that regular production reporting is slowing down agile development and eroding agile resources. Free up the resources after a defined time to return them to their real purpose.

Monitoring and Administration


Set up reliable monitoring and administration. Fit the system with rich instrumentation to monitor
performance and capacity, security, audit, and other similar information. Most LDW components
now provide extensive features in this area.

This applies to each of the major subsystems of the LDW. Each subsystem (DW, agile, data lake)
must be considered from the point of view of security, maintenance, backup, recovery and disaster
recovery, and the LDW should be considered as a complete system. For example, if a company
makes all of its internal information available to everyone, this means that all employees are subject
to Securities and Exchange Commission (SEC) trading regulations because they all have access to
potential inside information. This has a knock-on effect on HR practices. This is not necessarily a problem; it just has to be explicitly considered so that the appropriate steps can be taken.

Capacity Planning and Workload Management


LDW expansion must keep pace with demand, and it must do this on an economically sustainable
basis. LDW capacity can grow very fast, and in unexpected directions, especially in the early years.
Refine and expand the initial sizing performed at the start of the project into an ongoing capacity
plan. This should include both physical storage and compute capacity projections.

You need a capacity plan, even if some or all of the LDW is in the cloud. Projections of spend for
budgetary approval are still necessary, and these must be based on forecasts of future capacity
requirements. In addition, by anticipating how capacity is expanding and along which dimensions, it
is easier to design the system to deliver that extra capacity in the most cost-effective manner. For
example, expansion of the main DW component of the LDW may warrant offloading of some of its
data into the data lake component rather than simply expanding the DW itself.

Make sure that the individual components are scalable and that, equally importantly and a point
often missed, the interfaces between them are too. Also, check the economic implications of
sudden expansion in each part of the system.

You may wish to pre-position a particular technology in one part of the architecture so that it is
available, proven and ready to offload large amounts of data from another part of the architecture. If
the data held by the DW suddenly expands, then you can offload the "cooler" (i.e., less used) data
to your data lake. However, if this is not already a proven part of the architecture, that option may
not be open to you, and it will be more expensive to expand your DW. Budget constraints may limit
the expansion, how much data you can add, and thus, potentially the benefit derived.

Clearly, it is not possible for these capacity projections to be exact. A fair amount of experience and art goes into their construction. However, a reasonably estimated capacity plan is better than none. Create a capacity plan in line with the methods shown in Figure 6 by doing a simple sizing for each of the main data stores, calculated from the business metrics. As new data sources emerge, and as the business expands, this should naturally feed through into the metrics for the sizings. Review the capacity plan regularly, such as at one of the regular governance meetings. Include projections for users and compute capacity, not just storage space.
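
A minimal sketch of such a sizing calculation is shown below. The business metrics, retention periods, growth rate and overhead factors are illustrative assumptions and should be replaced with your own figures.

```python
# Rough sizing for the main LDW stores, driven by business metrics rather than
# technical guesses. All figures are illustrative assumptions.
BYTES_PER_GB = 1024 ** 3

def store_size_gb(rows_per_year, bytes_per_row, years_retained,
                  index_and_temp_overhead=0.3, annual_growth=0.15):
    """Project provisioned capacity for one data store over its retention period."""
    raw_bytes = 0.0
    yearly_rows = float(rows_per_year)
    for _ in range(years_retained):
        raw_bytes += yearly_rows * bytes_per_row
        yearly_rows *= (1 + annual_growth)          # business growth feeds the metric
    provisioned = raw_bytes * (1 + index_and_temp_overhead)
    return provisioned / BYTES_PER_GB

# Example: the DW holds three years of transactions; the lake holds seven years of weblogs
dw_gb = store_size_gb(rows_per_year=500_000_000, bytes_per_row=400, years_retained=3)
lake_gb = store_size_gb(rows_per_year=5_000_000_000, bytes_per_row=800, years_retained=7,
                        index_and_temp_overhead=0.1)
print(f"DW:   {dw_gb:,.0f} GB")
print(f"Lake: {lake_gb:,.0f} GB")
```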

Technologies can allow for compute capacity and storage space to be varied independently for both
cloud and on-premises resources. In addition, some technology allows for expansion of clusters
with different-sized processing nodes, or it can simply swap out nodes in a cluster for those of
different capacity to vary the compute-to-storage ratio. The capacity plan and architecture work will
consider these options. A capacity plan of one to five years is reasonable, with the caveat that the
later years are likely to be subject to change; again, an approximate forecast is better than none.

Evaluate New Technology Options for the Next Cycle


At the end of each design-build cycle, it is good practice to pause and consider new technology
options. These are easier to incorporate at the beginning of the next cycle than halfway through an
existing cycle.

Technologies that have recently appeared are:

■ In-memory computing to simplify physical data models by removing the need for preaggregated
data and indexes from the physical data model
■ Graphics processing unit (GPU)-based DBMS and accelerators for high speed processing
■ HDFS- and MapReduce-based systems that can be used as a store and processing
environment for large volumes of unstructured data, for data science sandboxing, and for
offloading of the DW
■ Spark to do transformations in memory (new data lake initiatives should prioritize Spark over
previous MapReduce software)
■ Cloud object stores, such as Amazon S3, that can be used as the persistence layer and
processed using software such as Spark, Amazon Web Services (AWS) Athena and Presto.
■ Data virtualization, which makes remote sources available, avoids unnecessary physical
consolidation and provides an abstraction layer for access to all data.
■ Graph databases for network-oriented problems

These are just examples, but each provides a potentially useful addition to the next version of the
LDW. Each has the opportunity to reduce the effort required and increase productivity, and thus
ROI, of the overall system.

ROI should be the main measure used to determine modifications to the LDW. Add the technology if
it improves overall ROI in meeting requirements and gaining benefit. If it simply adds complexity and
cost, then do not add it. Without the ROI test, there is a danger that the LDW becomes just a
smorgasbord of interesting technologies.

Summary, Matching Requirements in Development Cycles


The sections above have described the three service levels and development styles for the LDW.
These are complementary rather than being alternatives. Plan the modern data warehouse as an
LDW and provide all three styles.

In practice, any of the paths can be chosen in any order and in any combination, provided you have a clear idea of the goal.

Allocate requirements to repeating cycles of development as shown in Figure 24. Each of these
repeating cycles of development builds out another part of the LDW.

Figure 24. Summary of LDW Requirements Handling and Build Process

Source: Gartner (May 2017)

Drive the work by the use cases and business benefits. Map the use cases and requirements to the
service level that is most appropriate: data warehouse, agile platform or data lake. Then consider
the data, and then the platform components, to ensure maximum productivity. Combinations of
styles are sometimes best. Finally, evaluate the skills required for the use case, service level, data
and platform, and assign development (or self-service) to the correct team members.

Detailed Descriptions of the LDW Components


Figure 4 outlined the typical components in an LDW architecture. A brief description of each is
included here.

Use the operational data store (ODS) for operational reporting and for supporting short, high-
frequency queries with high availability. Historically, to provide the required performance, and to
isolate it from other operational concerns, it has been implemented as a separate server. The ODS
will have a strictly defined data model similar to that of the DW. It also can integrate data from
multiple systems, though typically not to the same extent as the DW. This processing can be done
by a DW if it has good workload management, or via HTAP.

Hybrid Transaction and Analytical Processing (HTAP) allows analytical queries to run within an
operational source system. These are usually higher-volume, high-frequency queries. Usually
enabled by in-memory computing, HTAP is likely to supplant the ODS over time because it does not
require a separate store to be implemented. The HTAP system will use a defined model, but the
analytical component may adapt the transactional model "on the fly" into a form suitable for
analytical queries.

The classic data warehouse (DW) is a central server for integrating a physical copy of data from
multiple sources. It provides a depth of history and supports a wide variety of standard and ad hoc
reporting. It can support tens, hundreds or even thousands of concurrent users using workload
management to provide service to each class of users. It is almost invariably an RDBMS. Some
companies call it an enterprise data warehouse (EDW).

Physical data marts are smaller analytical data stores that meet the needs of a particular
organization, department or business function. They may be permanent, or they may be set up at
short notice for agile development to meet an immediate need. They can be thought of as small-
scale DWs. Although useful for point solutions, they result in high complexity and costs and pose a
maintenance problem if allowed to proliferate in an undisciplined manner. If multiple marts are used,
then effort to ensure they use the same conformed data model is worthwhile.

Virtual data marts are logical views into DW or other data. That is, to the outside world, they look
like a data mart, but there is no separate physical copy of the data involved. Instead, the "mart" is
simply a dynamic view into more detailed data, usually within the DW. This has a number of
advantages. The virtual mart is just metadata; the mart is changed simply by changing its
description. No physical unloading or reloading of data is needed. In addition, the mart can easily make additional DW data appear as part of the mart simply by including it in the logical view; no copying is required. To
provide sufficient performance for virtual marts, you usually need massively parallel processing
and/or in-memory computation together with very good workload management.
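
As a minimal sketch of the idea, the following defines a virtual mart as nothing more than a view over detailed tables; in practice the same definition could live in the DW RDBMS or in the virtualization layer. The catalog, schema, table and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Assumes a metastore and a "dw" database are available to the Spark session
spark = SparkSession.builder.appName("virtual_mart").enableHiveSupport().getOrCreate()

# The "mart" is only metadata: a view over detailed DW tables
spark.sql("""
    CREATE OR REPLACE VIEW finance_mart_v AS
    SELECT c.region,
           t.product_line,
           date_format(t.txn_date, 'yyyy-MM') AS month,
           SUM(t.amount)                      AS revenue
    FROM dw.transactions t
    JOIN dw.customer     c ON c.customer_id = t.customer_id
    GROUP BY c.region, t.product_line, date_format(t.txn_date, 'yyyy-MM')
""")

# Changing the mart means changing its definition only; no data is unloaded or reloaded
spark.sql("SELECT * FROM finance_mart_v WHERE region = 'EMEA'").show()
```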

Sandboxes within the DW allow users with the right permissions to experiment with their own
data. They can be regarded as specialized marts. Users can load/transform their own data, or data
obtained externally. By implementing the sandbox within the DW, data can be joined to DW data
without having to make large-scale copies. This makes development much more agile. Some
degree of data obfuscation and masking may be required. Security measures need to be in place to
prevent unauthorized release of information. Likewise, the underlying DW platform needs very good
workload management to allow the sandboxes to coexist within the main DW.

The data lake (DL) holds large volumes of unstructured data, typically using Hadoop or Spark, but
possibly other nonrelational database alternatives. The data in the DL does not have to be
prestructured before loading. Data is held in raw or near-raw form. Data format can be inferred as it
is read using schema on read. It is possible to hold very large amounts of data economically.
Although a data model does not need to be prepared in advance, it is useful to use the models used
elsewhere in the LDW as models for schema on read. This makes it easier to join data between
different LDW subsystems (marts, DW and virtualization). Although the data lake is defined in terms
of storage, consideration must also be given to how compute is attached to allow processing of the
data.
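
As a minimal sketch of schema on read that reuses a DW-style model, the following PySpark fragment applies an explicit schema while reading raw JSON left untouched in the lake. The field names, types and path are illustrative assumptions borrowed from a notional DW CUSTOMER definition.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, TimestampType

spark = SparkSession.builder.appName("schema_on_read").getOrCreate()

# Reuse the DW's CUSTOMER definition as the schema applied at read time
customer_schema = StructType([
    StructField("customer_id",    StringType(), nullable=False),
    StructField("name",           StringType()),
    StructField("segment",        StringType()),
    StructField("lifetime_value", DecimalType(18, 2)),
    StructField("created_at",     TimestampType()),
])

# The raw JSON stays untouched in the lake; structure is imposed only when reading,
# which makes joins against other LDW subsystems straightforward
customers = spark.read.schema(customer_schema).json("s3a://corp-lake/raw/crm/customers/")
customers.createOrReplaceTempView("lake_customer")
```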

Sandboxes in the data lake are used for experimentation by skilled analysts. Some DLs may
actually be simply large sandboxes for experimentation. Their use is similar to DW sandboxes in
that they allow data scientists or others to safely experiment on data. However, the types of data
and the techniques used to experiment on them are likely to be different than those used in the DW.

Extract transform and load (ETL/ingest) gets data into the system, into the staging area for the
DW, or into the data layer for the DL platform. Data may be loaded into the hot or cold areas of the
DW or the DL. Data is physically extracted from the source systems, moved and then transformed
before being loaded into the target system. Data quality checks are usually embedded within ETL.
ETL can also be used internally within the DW to physically move data, for instance from the DW to
physical marts, or to move data between the DW and the data lake. Data lakes are often used to do
ETL, or more precisely extraction, loading and transforming (ELT), on behalf of the LDW — loading
the data and transforming it, in parallel, using the processing power of the lake.

Business rules are pieces of logic that may be distributed within other components too, such as
ETL, views and metadata. They are used to enforce company policies, such as the formatting of
data or the relationships between data.

Platform infrastructure is the underlying storage or compute technology. This can be on-premises,
in the cloud or spread across both. This applies to the DW, the DL or both.

Data for DWs is usually arranged in tiers for hot and cold data (also known as multitemperature
data management), recognizing that access to data is not evenly distributed. This means that less-
frequently processed data can be kept on lower cost media. The same concept is now starting to
be applied to the DL. Cloud data is usually plain object storage and very inexpensive. Features are
now beginning to appear that allow querying directly against object storage.

The query and reporting layer consists of the following components:

■ Operational reporting for reporting in mainstream business processes: Usually simple reports,
although some may be embedded within a business process
■ Query and reporting for standard reports and ad hoc querying: Serves a wide range of users,
typically using BI query and reporting tools
■ Self-service reporting and data sourcing: Enable users to meet requirements directly without
IT development; IT provides the environments, but users generate the reports
■ Application programming interfaces (APIs): Interfaces that can be called from within
programs to provide analytical information from the LDW to incorporate into other applications;
often used to provide operational information and to trigger business actions
■ Data science and statistical analysis using specialist tools and skills

Data virtualization allows common access to all data within the architecture:

■ It is the key enabler of the LDW, placing a query abstraction layer above each of the components.
■ It combines data from multiple LDW subsystems and acts as a universal access mechanism, making finding and obtaining data much simpler.

The data model defines the data held in the system. There are usually three layers of abstraction:

■ A business glossary (BG) or conceptual model used by business people
■ A logical data model that is a technology-independent representation of the data
■ A physical data model that defines the actual physical data stores

Metadata and governance tools are used to describe the data in the LDW and to ensure that its
ingestion, use and provision are correctly and transparently managed.

The company business strategy is documented and is cross-referenced to the business requirements and the LDW components that will satisfy them.

Mainstream business processes are the ultimate consumers of information provided by the LDW, whether directly, via APIs or through human intermediaries. At each stage, it should be
clear how information uniquely provided by the LDW serves and improves mainstream business
processes.

Details on Requirements-Gathering and Business Case


A key difference for LDW projects is the constant gathering and reprioritization of requirements and
the close and constant collaboration between the business and IT.

Gather and Prioritize Requirements


During the initiation phase, requirements are gathered and prioritized as shown in Figure 25.

Each requirement has two estimates: the benefit delivered and the effort, in person-days, of meeting
it. If a particular requirement needs additional investment, such as system expansion costs, then
use monetary figures, including estimated staff costs per day, to combine and normalize costs.
Prioritize requirements by likely returns versus effort to implement. Easy and high-value
requirements will appear in the top right of the chart.
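
A minimal sketch of this prioritization arithmetic is shown below. The requirements, benefit figures and day rate are illustrative assumptions; in practice the numbers come from the requirements-gathering workshops.

```python
# Rank candidate requirements by value delivered versus effort to implement.
requirements = [
    {"name": "Customer churn report",       "benefit_k": 400, "effort_days": 20},
    {"name": "Regulatory liquidity report", "benefit_k": 250, "effort_days": 60},
    {"name": "Campaign response dashboard", "benefit_k": 120, "effort_days": 10},
]

DAY_RATE_K = 1.0   # assumed fully loaded staff cost per person-day, in thousands

for r in requirements:
    r["cost_k"] = r["effort_days"] * DAY_RATE_K
    r["ratio"] = r["benefit_k"] / r["cost_k"]      # higher ratio = top right of the chart

for r in sorted(requirements, key=lambda item: item["ratio"], reverse=True):
    print(f'{r["name"]:<28} benefit {r["benefit_k"]:>5}k  cost {r["cost_k"]:>5.0f}k  ratio {r["ratio"]:.1f}')
```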

Document any constraints or regulations. They are just as much a part of the requirements as the
functional requirements themselves.

Follow the money. Look at the main business processes to see where the revenue and costs flow.
Ask "what would happen if this flow was increased or decreased? Could better information make
that happen?" Ask about the style of delivery: simple reporting, predictive or prescriptive. It is useful
to note this in the requirement-gathering phase.

By being fully transparent, the justification process should switch from "can the LDW be justified?"
to "can you afford not to have the LDW?" This engenders thinking in terms of ROI.

Figure 25. Gathering and Prioritizing Requirements

Source: Gartner (May 2017)

At the same time, document the data access and security requirements. These will accompany the
reporting requirements through the process.

Taking a structured approach like this has several benefits:

■ Being clear about costs and benefits


■ Avoiding "scope creep," which is adding requirements as if they cost nothing
■ Making it easier to prioritize new, important and urgent requirements above existing ones
■ Helping you decide the most appropriate implementation method for each requirement
■ Making it easier to cost-justify the LDW project at each stage
■ Avoiding the biggest risk for an LDW: Long development with no return and declining support

See also "Information Requirements Gathering for Analytic Projects."

Align Business Strategy


One of the principal causes of failure for LDW projects is basing them on the philosophy of "build it and they will come." This can occasionally work, through sheer luck, but it usually results in a system nobody subsequently uses.

If explained correctly, senior business people should be motivated to engage with the DW project.
Failing to obtain their active support is another potential red flag for the project.

At the start of the project, senior managers and board members should explain the vision and
strategy of the company. This is most likely through initial meetings, interviews and company
documentation such as the annual report. This ensures that the team understands how information
needs relate to high-priority items in company strategy. It also allows that prioritization to balance
the longer-term strategic goals that may have very high value, and be essential to company growth
and survival, alongside other immediate and urgent requirements.

Figure 26, taken from the highly recommended Gartner paper "Key Recommendations for
Implementing Enterprise Metadata Management Across the Organization," shows the relationship
between business and technical metrics.

Figure 26. Aligning Business Strategy With Development

Source: Gartner (May 2017)

Form the IT and Business Team


The team should be multidisciplinary, with contributors from IT and business, and can expand over
a phase or a project. Figure 27 shows an example team structure, but this is not prescriptive. The
structure will vary from organization to organization.

The structure of the team mirrors the architecture. The largest team is usually the one building the
classic DW component because it simply has more to do. A high degree of preparation,
predefinition and structured development is necessary. Within the centralized DW team, there are
subteams for getting the data in; designing, deploying and maintaining the central servers; and
getting information out. Usually people with different skills staff these; each role is a specialist
domain in its own right.

Choose partners to bring skills into the organization and for knowledge transfer. It is a bad idea to have a third party build a DW and then disappear. Building a DW is a continuous process, and continuity is essential.

For smaller projects, there will be staff who can fulfill more than one role and superstars who can
turn their hand to anything, but for larger projects, the teams are usually specialized.

Figure 27. Example Team Structure

Source: Gartner (May 2017)

The "data in" team members are predominantly ETL and data ingestion specialists familiar with ETL
tools and familiar with the typical problems of gathering and integrating data. The central server
team understands database sizing and configuration.

The DW data-out team consists of:

■ Specialists in query tools and reporting


■ Data analysts
■ Agile developers
■ Data scientists
■ Migration specialists, if some of the project involves migrating existing data and reports

The data lake team will typically have overlapping but distinct skills compared with the DW team, and it will focus on nonrelational solutions.

The data lake team may be within the DW team and regarded as an extension of it — just as a data
lake platform can be regarded as an extension of the DW platform. Alternatively, it can be alongside
the DW team, but if so, it is essential that these teams are coordinated. If there is no DW, then the
data lake team stands alone. However, data lakes still need architecting, building, securing,
administering and monitoring, so these roles also appear in the DL subteam.

There tends not to be a separate subteam for agile development and virtualization. These roles tend
to be found within one of the other teams or under a business unit. This is because virtualization
cannot stand on its own. By definition, it must be virtualizing a number of other sources. However, it
is necessary to properly design for the impact of virtualization on those other sources, and the
interfacing to them. This is akin to database performance work. Therefore, this is more likely to be
subsumed into another team, usually the DW team, because the access is very "database like."

Another way to think about this is that agile developers and users make use of virtualization to do
their work. However, IT professionals need to put in place the agile environment, physical or virtual
marts, sandboxes, or data virtualization. They also need to maintain and performance tune the
virtualization software. In this case, they are not meeting business requirements themselves, but putting in place data and computation structures that make it easy for others to meet requirements, to the benefit of productivity.

The business units will have their own analytical staff: the traditional power users, the newer "citizen
analysts" and their own data scientists. Note that the data scientist may simply be an outgrowth of
an existing business function that makes heavy use of analytics, such as actuaries in insurance or
merchandising or pricing in retail. Other Gartner papers have identified a typical mix of user types
spread across different groups.

■ Casual users: 1000
■ Business analysts/power users: 90
■ Data engineers: 5
■ Data scientists: 1

Whether the various groups report directly to the CIO or CDO will also vary. There will be governance functions as well: an overall governance council and steering committee is normal, together with governance representatives and data stewards within the various groups. Figure 28 provides another view of the relationship between the parties concerned with analytics.

The business intelligence competency center (BICC) is alive and well in many organizations. It
functions as the center of excellence for analytics, though now it may simply be called "the analytics
group" with the "head of analytics" directing it, or it may go under some other name. The BICC may
have evolved from the original development team for the DW or DL and then incorporated other
personnel from other parts of the business.

Figure 28. Relationship of LDW Personnel to Other Roles

Source: Gartner (May 2017)

If self-service enablement is needed, then this capability is likely to sit with the central server team, because making general-purpose data structures that enable self-service available within the central server is largely a database configuration issue. However, it could also be hosted in the data-out team.

The DW project is an opportunity to re-examine the skills required, and for individuals to skill up.
This is particularly true for the "soft skills" necessary to enable better collaboration between IT and
the business.

Develop or Firm Up the Business Case


Guide your LDW development with a firm business case. This is likely to be more complex and more
volatile than that of other types of system because the requirements constantly change.

Over time, development is reprioritized, and the business case must be revisited regularly. Add
benefits and costs as line items when releasing requirements for development. Keep track of the
returns obtained. This will help justify continued development.

Ideally, develop a multiyear benefit and cost case. This is usually estimated over three to five years
(five is better so as to capture amortization of equipment and replacement costs at the four- to five-
year mark for on-premises equipment or to allow for multiyear cloud subscription renewal).

For each year, there is a set of line items for costs and a set of line items for forecast benefits. From these, the total expenditure and net benefit for each year, and the cumulative costs and benefits of the LDW, can be calculated.

Figure 29 is clearly a highly simplified business case, with numbers representing costs and benefits in millions of euros (€). The numbers are simply illustrative, so do not spend much time looking at the actual values. What is important is the general shape of the business case. At a high level, it should look something like this.

There should be line items that identify the major costs, and the project should aim at solving major business problems. Gross estimates of benefits should also be possible. If the project cannot produce a chart that looks something like this, then that is a red flag.

Figure 29. Example Business Case Structure

Source: Gartner (May 2017)

The calculations can be extended to include more sophisticated financial measures such as ROI, internal rate of return (IRR) and net present value (NPV), as is often required by organizational standards for presenting these types of cases. This allows the organization to prioritize the development of the LDW against other potential investments, such as new plants and machinery or opening new branches. This kind of financial analysis is outside the scope of this document, but it is useful to be aware of it. Both technical and architectural inputs are an essential part of it.
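
As a minimal sketch of the arithmetic, the following computes the yearly and cumulative net benefit described above, plus a simple NPV. The cost, benefit and discount figures are illustrative assumptions only.

```python
# Year-by-year rollup of costs and benefits, plus a simple NPV.
# All numbers are illustrative assumptions (in millions of euros).
costs    = [3.0, 2.0, 1.5, 1.5, 1.5]     # years 1-5
benefits = [0.5, 3.0, 5.0, 6.0, 6.5]

DISCOUNT_RATE = 0.10

cumulative = 0.0
npv = 0.0
for year, (cost, benefit) in enumerate(zip(costs, benefits), start=1):
    net = benefit - cost
    cumulative += net
    npv += net / (1 + DISCOUNT_RATE) ** year   # discount each year's net benefit
    print(f"Year {year}: net {net:+.1f}M, cumulative {cumulative:+.1f}M")

print(f"NPV at {DISCOUNT_RATE:.0%}: {npv:.1f}M")
```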

If the system cannot be justified, it does not matter how well it is architected. One of the
responsibilities of the development team is to ensure that the linkage between the benefits, costs
and architecture is not lost.

Tie the business case to an important business problem. For example, one project provided
profitability information for an insurance company, which had grown by acquisition and been
aggressively pursuing market share. This resulted in the underwriting of a large amount of
unprofitable business. The first iteration of the DW simply brought together revenue and cost
information to provide sales personnel with immediate feedback on the profitability of each potential
sale. The initial justification did not have to be any more complex than that. Benefits were measured
in the tens of millions of dollars and were clear to all participants.

Here, the main justification was a strategic imperative, which was also easy to put into numbers.
The organization had a very urgent problem, and the DW would easily fix it. The benefits case was
obvious.

Beware the "apportionment trap." An LDW may be the first time that multiple departments have had to collaborate in developing and using a single system. Working out how costs are shared will be key; otherwise, individual departments may not collaborate for fear of being stuck with the whole bill. Solve this by treating the LDW initially as a corporate asset, made available to all departments; then, after a year or two, begin chargeback based on a fair assessment of usage.

Project or Phase Initiation and Kick-Off Workshop


It is a good idea to have a formal project kick-off to align all the stakeholders.

The agenda includes:

■ Business goal summary, fit with company strategy: Ideally, senior executives deliver this.
■ Run through of the requirements process and list: Initial requirements are most likely already
known at project initiation. Review the prioritization process.
■ Introduce the team, expectations, time scale, goals, roles and responsibilities.
■ Outline project organization and management.
■ Identify stakeholders, influencers and decision makers.
■ Design and development methods: Waterfall, agile, bimodal and hybrid.
■ Change management: Specify how to pass deliverables between the data warehouse, agile and
data lake streams without slowing development or damaging data integrity.
■ Steering committee: Identify who they are and discuss meeting schedules and input
mechanisms.
■ Risk assessment/register: Discuss what it is and identify who maintains it.
■ The communication plan: Discuss how the project will share aims, progress and successes
(newsletters, meetings, portals).

At the end of this workshop, the team members will know each other and their roles, goals and
strategy for the project.

This detailed section has outlined the activities related to business requirements gathering and prioritization, business alignment, business case production, and the kick-off workshop.

Gartner Recommended Reading


Some documents may not be available as part of your current Gartner subscription.

"EIM 1.0: Setting Up Enterprise Information Management and Governance"

"2017 Planning Guide for Data and Analytics"

"Migrating Enterprise Databases and Data to the Cloud"

"Avoid a Big Data Warehouse Mistake by Evolving to the Logical Data Warehouse Now"

"Magic Quadrant for Metadata Management Solutions"

"Magic Quadrant for Data Management Solutions for Analytics"

"A Comparison of Master Data Management Implementation Styles"

"Solution Path: Implementing Big Data for Analytics"

"Solution Path for Evolving Your Business Analytics Program"

"Eight Steps to Picking the Best Self-Service BI and Data Discovery Tool"

"Comparing Cloud Data Warehouses: Amazon Redshift and Microsoft Azure SQL Data Warehouse"

"Best Practices for Designing Your Data Lake"

"Rethink and Extend Data Security Policies to Include Hadoop"

"Three Architecture Styles for a Useful Data Lake"

Evidence
Gartner research including numerous client conversations. The body of Gartner publications on
LDW and ancillary topic areas.

GARTNER HEADQUARTERS

Corporate Headquarters
56 Top Gallant Road
Stamford, CT 06902-7700
USA
+1 203 964 0096

Regional Headquarters
AUSTRALIA
BRAZIL
JAPAN
UNITED KINGDOM

For a complete list of worldwide locations,


visit http://www.gartner.com/technology/about.jsp

© 2017 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its affiliates. This
publication may not be reproduced or distributed in any form without Gartner’s prior written permission. If you are authorized to access
this publication, your use of it is subject to the Gartner Usage Policy posted on gartner.com. The information contained in this publication
has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of
such information and shall have no liability for errors, omissions or inadequacies in such information. This publication consists of the
opinions of Gartner’s research organization and should not be construed as statements of fact. The opinions expressed herein are subject
to change without notice. Although Gartner research may include a discussion of related legal issues, Gartner does not provide legal
advice or services and its research should not be construed or used as such. Gartner is a public company, and its shareholders may
include firms and funds that have financial interests in entities covered in Gartner research. Gartner’s Board of Directors may include
senior managers of these firms or funds. Gartner research is produced independently by its research organization without input or
influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner research, see
“Guiding Principles on Independence and Objectivity.”
