Five Steps For Accelerating Data Readiness
By David Stodder
Sponsored by:
OCTOBER 2020
Developing a technology strategy to drive better analytics and governance across the enterprise

FOREWORD

NUMBER ONE
Advance data readiness by developing knowledge graphs of data relationships

NUMBER TWO
Establish a data catalog to improve location, analysis, and governance of data

NUMBER THREE
Address weaknesses in governance that impact readiness and reduce data trust

NUMBER FOUR
Improve readiness through efficient, flexible, and integrated data preparation

NUMBER FIVE
Modernize pipelines for analytics and AI/ML with catalogs and workflow automation

A FINAL WORD
© 2020 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management. Sponsorship of a publication should not be construed as an endorsement of the sponsor organization or validation of its claims.

T 425.277.9126
F 425.687.2842
E info@tdwi.org
tdwi.org
FOREWORD
Data is everywhere, generated by an array of applications, digitally transformed business processes, mobile devices, and Internet of Things (IoT) sensors. Unfortunately, even as organizations ingest this data into enormous data lakes and cloud storage systems, much of it remains unknown. Only a fraction of it moves through the long data quality and transformation processes that structure it into “known” data for target systems such as data warehouses, where users are able to access it for business intelligence (BI) reports, dashboards, and managed analytics. Too much remains untapped; many organizations are unable to prepare and process data in time to handle analytics workloads for solving business questions, improving operational processes, and building more valuable customer relationships. Instead, it sits in data “swamps” and becomes an expensive management headache.

This TDWI Checklist discusses how organizations can modernize technology strategies to create a faster path to readiness with all their data, not just the portion currently defined, structured, and formatted for BI reporting and analysis.

Known data is generated by traditional business applications primarily for financial and performance management. Through experience, organizations understand how to locate, access, and govern this data.

To be sure, structured data can be unknown; it can come from diverse applications and platforms. It is growing in volume and is often spread across disparate data silos, making it complicated to access and prepare. Data quality and consistency issues can render structured data almost as “unknown” as semi- and unstructured data. Users struggle to gain complete and trusted views of it, and organizations struggle to govern it.

Growing semi- and unstructured data, however, presents new challenges. Semistructured data could be JavaScript Object Notation (JSON) data used in web applications and forms, and it can require specialized access.

Unstructured data has a format that is not known or a format the data management system is not set up to search, query, and access. Common examples include sensor data, video, image files, social media posts, and log files. Many organizations would like to use this data to provide fuller, more contextual understanding of structured data such as that generated by transactions.

When semi- or unstructured data is collected from non-transactional customer activity in different channels, it often falls into the category of “behavioral” data. If its relationship to structured transaction data could be mapped and visualized, it would be highly valuable in understanding customer preferences, buying patterns, and how customers influence each other.

Data relationships between collections of structured, semistructured, and unstructured data could also be valuable for many other types of use cases including fraud and abuse detection, equipment maintenance, situation awareness, manufacturing process control, and operations improvement.

To be ready for analytics and actionable for decisions, organizations need to make all their data easier to discover and prepare. In addition, to adhere to data privacy and other regulations,
FOREWORD CONTINUED
Data readiness needs to be about more than giving users access to quantities of data. Decision makers are often frustrated by standard “data dump” reports that do little to increase their knowledge of customer behavior, potential fraud and governance risks, how to optimize manufacturing supply chains, or what factors to consider in determining pricing. They need insights into relationships between different data entities and events in context. Too often, decision makers are left depending (and waiting) on developers to write complex programs to unearth data relationships. These programs tend to be brittle and hard to change when users have unanticipated questions.

Also driving demand for faster insights into data relationships is growth in artificial intelligence (AI) for analytics and the embedding of AI-driven automation in systems such as chatbots for customer engagement. AI-infused systems would benefit from data management that facilitates easier mapping of relationships between “implicit” behavioral data such as that generated by customer activity and “explicit” data that results from intentional actions such as a customer purchase or feedback using a rating system that registers stars or thumbs up or down.

Because there may be no explicit action resulting from behavior such as looking at a selection of products, reading other customers’ comments, or even filling (but then abandoning) a shopping cart, the person’s intentions are unclear.

AI and analytics can learn partially about customer preferences from purchases and rating systems; however, if underlying data management can help unearth relationships between explicit data and implicit behavioral data, the knowledge base for AI and analytics to work with would be more complete. AI-based recommendations could be more personalized and timely.

Most organizations rely on traditional relational data management to support BI and data warehousing systems as well as applications such as customer relationship management (CRM). These systems excel at recording explicit data and are geared to managing data that conforms to known definitions, formats, and structures. Data relationships in these systems are simple and smaller in number.

To store and manage large amounts of semi- and unstructured data as well as structured data from less familiar data sources and applications, many organizations set up data lakes using NoSQL. However, the petabytes of implicit and explicit data can quickly become vast and unknown. Simply stored in the data lake, the data is far from ready for fast visualization or analysis of complex relationships.

KNOWLEDGE GRAPHS AND GRAPH DATABASES

To search for and analyze data relationships more effectively, organizations need models and systems that can store data relationships and make them discoverable. Knowledge graphs, network-based representations, and graph databases specialize in these capabilities, enabling business users, analysts, and AI applications to navigate relationships found in implicit and explicit data. They enable organizations to examine the significance of relationships between multiple data elements.

The use of knowledge graphs, graph models, and network-based representations is not new; Google and other search engine providers use knowledge graphs to increase the accuracy, speed, and completeness of search results.
Knowledge graphs can capture the semantic richness of data relationships beyond what traditional data catalogs provide with metadata. The richness comes from building network-based representations that are not limited to structured data; the graphs can include both explicit data and implicit data that is semi- or unstructured.

The relationships are visibly represented in graph models by “edges,” the lines or arcs that connect any two “nodes” (also called “vertices”) representing data objects. The edges establish semantically relevant connections (potentially many of them) between the nodes. Graph databases can then manage and retrieve these data relationships.

Adherence to standards allows users to run semantic queries against any graph database to either retrieve specific data relationship information or seek answers to exploratory questions.

Organizations should evaluate graph databases, knowledge graphs, and graph models for how they can enable faster discovery and analysis of data relationships. Organizations should also examine how knowledge graphs and graph databases could expand the breadth and power of data catalogs by making it easier to locate implicit as well as explicit data and bring knowledge of data relationships into the catalog. Data catalogs are discussed in the next section.
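
To make the node-and-edge model described above concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how any particular graph database works: the `KnowledgeGraph` class, the node names, and the relation labels ("purchased" as an explicit action, "viewed" as an implicit one) are all hypothetical. Each edge is stored with a label, and a breadth-first search finds how two data objects are connected.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Toy labeled graph: nodes are strings, edges carry a relation label."""

    def __init__(self):
        # node -> list of (relation, neighbor) pairs
        self._edges = defaultdict(list)

    def add_edge(self, source, relation, target):
        # Store the edge in both directions so relationships can be
        # navigated from either node.
        self._edges[source].append((relation, target))
        self._edges[target].append((f"inverse:{relation}", source))

    def find_path(self, start, goal):
        """Breadth-first search. Returns a list of (node, relation, node)
        hops from start to goal, or None if the nodes are not connected."""
        if start == goal:
            return []
        queue = deque([start])
        came_from = {start: None}
        while queue:
            node = queue.popleft()
            for relation, neighbor in self._edges.get(node, []):
                if neighbor in came_from:
                    continue
                came_from[neighbor] = (node, relation)
                if neighbor == goal:
                    # Walk back from the goal to rebuild the path.
                    hops = []
                    current = goal
                    while came_from[current] is not None:
                        previous, rel = came_from[current]
                        hops.append((previous, rel, current))
                        current = previous
                    return hops[::-1]
                queue.append(neighbor)
        return None


graph = KnowledgeGraph()
graph.add_edge("customer:42", "purchased", "product:laptop")  # explicit data
graph.add_edge("customer:42", "viewed", "product:dock")       # implicit data
graph.add_edge("customer:77", "reviewed", "product:dock")     # explicit data

path = graph.find_path("product:laptop", "customer:77")
for hop in path:
    print(hop)
```

The path connects a purchased product to another customer through a product that was only viewed, which is the kind of relationship between explicit and implicit data that the text describes. A production system would rely on a graph database and its query language instead of hand-written traversal.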
NUMBER TWO
Establish a data catalog to improve location, analysis, and governance of data

Organizations struggle to achieve satisfactory data readiness if users have difficulty locating relevant data. Users need confidence in the consistency of how data is defined. Analysis is faster and more complete if users can easily learn how different data elements are related within the context of a topic such as financial performance or marketing campaign effectiveness.

Administrators need a resource for locating and inventorying data so they can improve data quality, track data lineage, and fix data errors and irregularities. Administrators also need the ability to monitor all sensitive data use to reduce the risks of regulatory and security exposures.

These requirements are driving interest in establishing and modernizing data catalogs. A data catalog provides a centralized and trusted collection of metadata (i.e., data about the data) gathered from different data sources to enable users, developers, and automated applications to learn where data is located and how it is defined and structured.

About half of organizations surveyed recently by TDWI (51 percent) say that establishing a data catalog is one of the most important steps they could take to improve users’ success with BI and analytics.[1] Three-quarters of these organizations (75 percent) want a data catalog to make it easier for users to search for and find data; over half want to use it to improve data governance, security, and regulatory adherence (57 percent).

[1] All research references in this report are from the 2020 TDWI Best Practices Report: Evolving from Traditional Business Intelligence to Modern Business Analytics, online at tdwi.org/bpreports.

Not all data catalogs are enterprise-level catalogs run by IT. Individual departments such as marketing, sales, or finance might establish a data catalog to centralize metadata and other information about data sets specific to their needs. For example, a data catalog could help insurance sales managers and representatives see what data elements are relevant to certain policies, making it easier to gain a complete view of all the data and understand how data elements are organized within topic hierarchies. Some data catalog systems can centralize metadata information from semi- and unstructured data to make it easier to find and relate documents using XML or JSON formats or log files in a data lake.

As a second option, some organizations choose to establish a data catalog to support a specific analytics project so team members can move faster to locate data in multiple systems and resolve discrepancies in data definitions. Team members could use a feature in some modern data catalogs that enables users to add their “tribal wisdom” about different data sets through comments and annotations; this crowdsourced information can be valuable to project team members, alerting them to data quality or consistency issues. Teams can also use crowdsourcing functions to curate data for other stakeholders.

A third option some organizations choose is to set up a data catalog and apply crowdsourcing functionality for all their data stored and managed on a specific cloud platform such as Amazon AWS, Microsoft Azure, or Google Cloud Platform.

Although department-, project-, platform-, and data source-specific catalogs are valuable, an enterprise data catalog that transcends disparate silos can help organizations address data definition conflicts and quality problems that impact users across the organization. As noted earlier, TDWI research finds that governance and regulatory adherence are key drivers in establishing a data catalog. An enterprise-level data catalog can aid governance by documenting and tracking data lineage information about the data’s origin, who was responsible for its sourcing, and how it has been collected, transformed, copied, replicated, and shared. Governance audits and regulatory reporting typically need this information.

Only 23 percent of organizations surveyed by TDWI are satisfied with how well users can employ a data catalog to find data, understand data relationships, and tap it as a knowledge asset about data and its lineage. This indicates that whether organizations have departmental, project, platform, or enterprise data catalogs, there is room for improvement. Here are three recommendations:

• REDUCE MANUAL WORK THROUGH AUTOMATION. Many organizations are held back in their development and maintenance of data catalogs by the amount of manual work traditionally involved. Administrators have trouble keeping up with new and voluminous data sources as well as proliferation of data silos. However, AI techniques such as machine learning (ML) plus software automation can relieve organizations of manual work in populating data catalogs, surfacing anomalies and discrepancies, and keeping catalogs up to date. AI can also drive user recommendations for data sets based on relevance, trusted quality, and department- or project-specific contexts, enabling organizations to curate data rather than just provide access.

• IMPROVE EASE OF USE. To make it easier for users to search for and find data, organizations should upgrade data catalogs with more advanced search and natural language processing (NLP) capabilities. AI and NLP can help users discover how data elements are related within their subject matter context. Search that uses NLP capabilities helps users refine exploration so they are locating only relevant data and are not swamped by unneeded results. Ease of use, smarter search, and crowdsourced annotation can give users a faster path to relevant data.

• ENABLE MORE RAPID DISCOVERY OF DATA RELATIONSHIPS. Data catalogs that make use of knowledge graphs and graph databases can reduce the complexity involved in discovering data relationships. The combination can also speed inventorying larger volumes of unfamiliar semi- and unstructured data. Along with helping users analyze data relationships, integrating graph functionality into data catalogs enables organizations to govern data more effectively. Organizations can use graph models to visualize whether analysis and sharing of particular data relationships might violate regulations regarding the privacy of personally identifiable information (PII).
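
The catalog capabilities discussed in this section (centralized metadata, search, and crowdsourced annotations) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `CatalogEntry` fields, the `DataCatalog` class, and the sample data sets are hypothetical, and the plain substring search merely stands in for the smarter NLP-based search the text recommends.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data set (data about the data)."""
    name: str
    location: str
    description: str
    tags: set = field(default_factory=set)
    annotations: list = field(default_factory=list)  # (author, note) pairs


class DataCatalog:
    """Toy centralized catalog with keyword search and user annotations."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def annotate(self, name, author, note):
        # Crowdsourced "tribal wisdom" attached to a data set.
        self._entries[name].annotations.append((author, note))

    def search(self, term):
        """Return names of entries whose name, description, tags, or
        annotations contain the term (case-insensitive)."""
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, entry.description, *entry.tags]
            haystack += [note for _, note in entry.annotations]
            if any(term in text.lower() for text in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "policies", "warehouse.sales.policies",
    "Active insurance policies by region", {"insurance", "sales"}))
catalog.register(CatalogEntry(
    "web_logs", "lake/raw/web/",
    "Raw JSON clickstream log files", {"behavioral", "json"}))
catalog.annotate("policies", "sales_ops",
                 "Region codes changed in 2019; older rows use legacy codes.")

print(catalog.search("json"))
print(catalog.search("legacy"))
```

Because annotations are part of the searchable text, a user who searches for "legacy" finds the policies data set through a colleague's note even though the formal description never mentions it.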
NUMBER THREE
Address weaknesses in governance that impact readiness and reduce data trust

Data governance is vital to data readiness. The primary focus of data governance today is to define rules and policies for protecting and securing sensitive data such as PII as it is defined by common data privacy regulations. Governance practices and technologies must monitor data use to make sure rules are followed when data is collected, moved, copied, analyzed, and shared.

TDWI research finds that many organizations lack confidence in their ability to govern and secure all their data, especially as they move data into the cloud. Nearly half of organizations surveyed by TDWI (45 percent) cite protection of sensitive data as one of their biggest challenges when augmenting or replacing existing on-premises systems with cloud-based data platforms and services.

More than half of organizations surveyed (52 percent) regard growth in self-service BI and analytics as one of their biggest governance challenges. Organizations will often restrict data access for self-service BI and analytics if they are not confident in how well they can govern data use and sharing, an act that can frustrate users.

For users, as important as protecting sensitive data is the second mission of governance: to improve trust in the data. Our research shows the importance of this objective: 43 percent of organizations surveyed are using governance to increase users’ confidence in the data and 45 percent are training users in governance responsibilities.

Central to trust is data quality, which is why governance and data quality should be closely aligned. Data quality processes include profiling, validation, cleansing, and addressing consistency and redundancy problems. Nearly two-thirds of organizations surveyed (63 percent) say they are monitoring data quality to improve trust in the data. Organizations need to set up metrics to evaluate quality over time and compare the quality of different data sources so administrators know where to focus remediation.

Expansion in self-service BI and analytics plus growth and distribution of data put pressure on organizations to do more than just acquire technology to improve data quality.

Organizations should designate data stewards from both business (to contribute subject matter expertise) and IT. Data stewards need to know the organization’s governance rules and policies as well as the business contexts within which they must be applied. Data stewards can be helpful in guiding users of self-service technologies to apply governance and data quality standards. Stewards can mentor users in following governance policies as they work with, share, and use data in analytics models and visualizations.

ROLE OF DATA CATALOGS AND MDM

Modern data catalogs, discussed in the previous section, can play a critical role in improving governance and trust. Catalogs can document data lineage so that organizations know the sources of the data, what has happened to it during its life cycle within the organization, and who has been responsible at each step. An enterprise data catalog can help organizations assemble an accurate inventory of data for governance reporting and auditing even if it is located physically on distributed on-premises and cloud-based platforms.
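
This section's call for metrics that compare the quality of different data sources can be sketched in a few lines of Python. The two metrics below (completeness and duplicate rate) and their definitions are assumptions chosen for illustration; the report does not prescribe specific metrics. The source names and records are hypothetical.

```python
def quality_metrics(rows, key, required):
    """Compute simple quality metrics for a list of record dicts.

    completeness   = share of rows with every required field populated
    duplicate_rate = share of rows whose key repeats an earlier row's key
    """
    total = len(rows)
    if total == 0:
        return {"rows": 0, "completeness": 0.0, "duplicate_rate": 0.0}
    complete = sum(
        all(row.get(col) not in (None, "") for col in required)
        for row in rows
    )
    distinct_keys = len({row.get(key) for row in rows})
    return {
        "rows": total,
        "completeness": complete / total,
        "duplicate_rate": (total - distinct_keys) / total,
    }


# Hypothetical sources with a shared customer schema.
sources = {
    "crm": [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": ""},
        {"id": 2, "email": "b@example.com"},
        {"id": 3, "email": "c@example.com"},
    ],
    "web_forms": [
        {"id": 10, "email": "x@example.com"},
        {"id": 11, "email": None},
    ],
}

report = {
    name: quality_metrics(rows, key="id", required=["id", "email"])
    for name, rows in sources.items()
}

# Point remediation at the source with the lowest completeness.
worst = min(report, key=lambda name: report[name]["completeness"])
print(report)
print("focus remediation on:", worst)
```

Running the same function on a schedule and storing the results would give the quality-over-time view the text mentions; this sketch computes only a single snapshot.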
NUMBER FOUR
Improve readiness through efficient, flexible, and integrated data preparation

To accelerate data readiness for more types of users and use cases, organizations need to address problems in data preparation. Data preparation comprises the sequence of processes that take data from its original source through profiling, transformation, integration, cleansing, and enrichment until it is in a ready state. Although they are dependent on each other, data preparation processes in practice are frequently disconnected, creating bottlenecks causing inefficiency and delay. To address problems, organizations need technologies and practices that knit together the whole sequence while not restricting user flexibility.

Many organizations want data catalogs to play an important role in improving cohesion between different stages of data preparation; 40 percent of those surveyed by TDWI want to improve data preparation by setting up a centralized data catalog. This resource can address users’ difficulties in locating all data appropriate for their projects and resolving inconsistencies. Data catalogs provide a shared resource that enables organizations to overcome problems that arise when users prepare data separately with personal spreadsheets and desktop databases. These can contain inconsistent data definitions and produce data sets that include quality errors and redundancy.

Modern data catalogs can apply AI and automation to find data faster, surface anomalies, and resolve inconsistencies, especially as data volumes become larger and include semi- and unstructured data. Organizations should evaluate data catalog systems that provide user-friendly search and NLP capabilities so nontechnical users can tap metadata and discover data more easily during data preparation processes.

With self-service BI and analytics continuing to expand, organizations need to balance users’ pursuit of flexibility and self-reliance with enterprise needs for governance, efficiency, and ensuring that compute-intensive preparation processes such as data transformation have appropriate resources. Enterprise IT can use data catalogs’ lineage information to improve governance monitoring of data preparation processes. Enterprise IT can also analyze which data sources and sets are used most often and apply that information to resource allocation.

The combination of data catalogs plus MDM information integration models and process planning can help organizations gain a broader, end-to-end view of users’ data preparation processes and how they fit (or should fit) together. For example, organizations could integrate data preparation with MDM to enable faster access to golden record data drawn from on-premises and cloud-based systems.

Finally, organizations implementing DataOps frameworks to improve stakeholder collaboration for large-scale data pipeline development should make sure teams are making full use of data catalogs and MDM. Data pipelines often include data preparation processes for profiling, blending, cleansing, transformation, and enrichment. Data catalogs and MDM can improve visibility into where users’ preparation processes are running into problems locating, accessing, and integrating data. The visibility is also important to data governance monitoring in data pipelines.
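
The idea of knitting preparation steps into one connected sequence, with visibility into what each step did, can be sketched as follows. This is a simplified illustration under stated assumptions: the three steps, the order records, and the 100-unit threshold are hypothetical, and the "trail" of row counts is only a stand-in for the lineage information a real catalog would record.

```python
def cleanse(rows):
    """Drop records with a missing amount."""
    return [r for r in rows if r.get("amount") not in (None, "")]


def transform(rows):
    """Convert amount strings to floats."""
    return [{**r, "amount": float(r["amount"])} for r in rows]


def enrich(rows):
    """Add a derived order-size band (the threshold is arbitrary)."""
    return [
        {**r, "band": "large" if r["amount"] >= 100 else "small"}
        for r in rows
    ]


def run_preparation(rows, steps):
    """Apply the steps in order and record row counts before and after
    each one, so a disconnected or lossy step is easy to spot."""
    trail = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        trail.append((step.__name__, before, len(rows)))
    return rows, trail


raw = [
    {"order": 1, "amount": "250.00"},
    {"order": 2, "amount": ""},
    {"order": 3, "amount": "19.5"},
]

ready, trail = run_preparation(raw, [cleanse, transform, enrich])
print(ready)
print(trail)
```

Keeping the steps as small functions preserves user flexibility (steps can be reordered or swapped), while the single runner gives administrators one place to observe the whole sequence.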
NUMBER FIVE
Modernize pipelines for analytics and AI/ML with catalogs and workflow automation

Business-critical analytics and AI/ML workloads put stress on organizations to scale up and manage numerous data pipelines for provisioning workloads with ready data. Data pipelines ingest data, often through streaming, from sources to target locations such as a data lake, data warehouse, or BI/analytics platform.

Some data pipelines simply stream raw, unstructured data into a data lake; others involve complex data preparation workflows that include data cleansing, transformation, and enrichment before data sets can be operationalized. Organizations also need to apply governance rules and policies to data pipelines.

Thus, organizations that may have started out with a “Wild West” of unmanaged data pipelines developed as needed by data scientists and engineers for specific projects are finding that they need a systematic approach to support workload growth and govern data properly. Among organizations surveyed by TDWI research, 24 percent say they need a major upgrade to their data pipeline development and operationalization and 46 percent say they need at least some improvement.

Data catalogs can improve how rapidly and comprehensively data pipelines locate data. Catalogs can also guide pipeline developers to use trusted and governed data where possible and to tap crowdsourced knowledge about less-well-known data sets. Organizations should evaluate how they can use modern data catalogs augmented with AI/ML and automation to improve the consistency and reusability of data pipelines for larger volumes and varieties of data. AI-driven automation enables some data catalog solutions to identify data quality issues across multiple workloads so they can either be solved automatically or brought to pipeline developers’ attention sooner.

Workflow automation can help organizations orchestrate not only a higher number of workloads but also complex and interdependent processes for data cleansing, transformation, and enrichment within data pipelines. Workflow automation technologies enable organizations to develop data pipelines for provisioning numerous analytics and AI/ML workloads in an organized and repeatable way.

Administrators can use workflow automation to find and fix errors more rapidly in pipeline preparation processes. Workflow automation will be increasingly critical as decision makers use AI-driven recommendations in BI systems and applications. Recommendation functions depend on quality data pipelines, raising the risk to organizations if there are delays and errors.

If organizations are implementing MDM process mapping and DataOps frameworks, they should integrate workflow automation with these initiatives. Workflow automation can help organizations implement DataOps more effectively to integrate disparate pipeline development and reduce delays, increase reuse, and promote continuous improvement cycles. Organizations can use the combination of MDM, DataOps, and workflow automation to improve stakeholder collaboration, which is key to directing the orchestration of workflows to achieve priority objectives.
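
Orchestrating interdependent pipeline processes in a repeatable way comes down to running each task only after its prerequisites have finished. The sketch below shows that idea using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The task names and the dependency map are hypothetical, and the placeholder tasks only record that they ran; real workflow automation tools add scheduling, retries, and monitoring on top of this ordering logic.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run every task after all of its prerequisites.

    tasks:        {task name: zero-argument callable}
    dependencies: {task name: set of prerequisite task names}
    Returns the task names in the order they were executed.
    """
    sorter = TopologicalSorter(dependencies)
    # static_order() raises graphlib.CycleError on circular dependencies.
    order = list(sorter.static_order())
    completed = []
    for name in order:
        tasks[name]()
        completed.append(name)
    return completed


# Hypothetical pipeline: each placeholder task just logs its own name.
log = []
task_names = ["ingest", "load_reference", "cleanse",
              "transform", "enrich", "publish"]
tasks = {name: (lambda n=name: log.append(n)) for name in task_names}

dependencies = {
    "cleanse": {"ingest"},
    "transform": {"cleanse"},
    "enrich": {"transform", "load_reference"},
    "publish": {"enrich"},
}

completed = run_workflow(tasks, dependencies)
print(completed)
```

Declaring dependencies as data, separate from the task code, is what makes the pipeline repeatable: the same map can be reviewed, versioned, and reused across workloads.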
A FINAL WORD
For more information, visit http://www.boomi.com.

You can reach David Stodder by email (dstodder@tdwi.org), on Twitter (@dbstodder), and on LinkedIn (linkedin.com/in/davidstodder).