Data Mesh: A Business-Oriented Framework for Quicker Insights
WHITEPAPER
TABLE OF CONTENTS
About Centralization of Data Platforms and their Evolution
Organizational Setup – Major Organizations with Central Data Platform Setups
Conclusion
Author Info
There are subtle differences and many similarities in the way these platforms have evolved. ETL (Extract, Transform, and Load) came into being with the advent of data warehousing, while ELT (Extract, Load, Transform) became popular with MPP (Massively Parallel Processing) systems and data lakes. A data warehouse, in conjunction with data marts, aims at supporting data analysts with operational reporting and data mining needs, whereas data lakes are platforms for data scientists who aim to discover patterns in the data and put them to use. There are differences in how the data is accessed as well: in a data warehouse, a SQL (Structured Query Language) interface is standard, whereas in data lakes, raw files and APIs (Application Programming Interfaces) are the preferred choices.
Amid these differences lies a similarity: everything is built around a central data store, with a common development paradigm of extracting data from many sources, then transforming and loading it into the central store. Only the order of the transformation and loading steps differs between data warehouse and data lake implementations.
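The ordering difference described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the extract, transform, and load helpers are hypothetical stand-ins for real connectors.

```python
def extract(source):
    """Pull raw records from a source system (stubbed with sample data)."""
    return [{"id": 1, "amount": "100"}, {"id": 2, "amount": "250"}]

def transform(records):
    """Apply cleansing/type conversion; here, cast amounts to integers."""
    return [{**r, "amount": int(r["amount"])} for r in records]

def load(records, store):
    """Append records to the target store (a list stands in for a warehouse or lake)."""
    store.extend(records)

def etl(source, warehouse):
    # ETL: transform in flight, then load cleaned data (data warehouse style).
    load(transform(extract(source)), warehouse)

def elt(source, lake):
    # ELT: load raw data first, transform later inside the platform (MPP / data lake style).
    load(extract(source), lake)
    lake[:] = transform(lake)
```

Both paths end with the same cleaned data; what differs is where the transformation work happens, which is exactly why MPP engines made ELT attractive.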
Each of these groups under Data Services focuses on its respective business area. Each further has a hierarchy under a business relationship manager and an engineering manager. A Business Relationship Manager (BRM) is a liaison between business users and the development team, and is supported by a team of business analysts who may be tagged to one or many business users. On the engineering side, the technical leads and the development team report to the engineering manager, who works in parallel with the BRM.
During the implementation of centralized data platforms, the focus is mainly on the data management pipeline and the type of storage. There is always the challenge of deciding on an efficient data strategy, i.e., one that minimizes data movement and copying.
With centralized development and management of these platforms, there is a lack of focus on domain-driven design and product thinking. The emphasis falls instead on cloud adoption, data science, data lakes, and other current trends, which leads to a lack of focus on building self-sufficient, product-minded teams.
Given these challenges, there is also an opportunity to leap towards another evolution: moving away from a central data platform to an arrangement that allows for distributed ownership, i.e., Data Mesh. In a Data Mesh setup, the business case decides the location of the data, rather than technology options. Moreover, it enables stitching datasets together rather than duplicating them, and it directs distributed, domain-driven teams to focus on creating datasets that are rich for consumption.
The data domain owners are responsible for supplying their data as products. The creation of these data products is supported by a self-serve data platform that abstracts the technical complexity of serving them. This requires the adoption of federated governance through automation to enable the interoperability of domain-oriented data products.
• Domain data set as a product: Datasets are produced by these domains and hence are termed data products. Within a domain, various pipelines could be generating these datasets. The pipelines can be polyglot, i.e., they could be dealing with different formats and modes, such as streams or batch. The products these pipelines generate are discoverable, i.e., there is metadata that helps consumers find the datasets being created. Data products must be addressable, meaning they can be uniquely identified within the enterprise. Data products are trustworthy, i.e., they are defined and checked against certain Service Level Objectives (SLOs). Data products are self-describing: the available metadata is something consumers can refer to before making use of the dataset. Data products are interoperable, i.e., they are governed by global standards, which makes them easier to use with other domains in the enterprise. Data products are secure, i.e., they are governed by global access controls rather than federated access.
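The properties above can be made concrete with a metadata descriptor and a simple catalog. The field names and the catalog API below are illustrative assumptions, not a standard; they show how a product becomes addressable (unique id), self-describing (schema), trustworthy (SLOs), and discoverable (registered for lookup).

```python
# Hypothetical metadata descriptor for one domain data product.
ORDERS_DATA_PRODUCT = {
    "id": "sales.orders.daily",          # addressable: unique within the enterprise
    "owner": "sales-domain-team",
    "mode": "batch",                      # pipelines can be polyglot: "batch" or "stream"
    "format": "parquet",
    "schema": {"order_id": "string", "amount": "decimal", "placed_at": "timestamp"},
    "slos": {"freshness_hours": 24, "completeness_pct": 99.5},  # trustworthy
    "access": "global",                   # governed by global access control
}

# A stand-in for an enterprise-wide catalog that makes products discoverable.
CATALOG = {}

def register(product):
    """Register a data product so consumers can discover it by id."""
    if product["id"] in CATALOG:
        raise ValueError(f"id {product['id']} already taken")
    CATALOG[product["id"]] = product

register(ORDERS_DATA_PRODUCT)
```

Rejecting duplicate ids is what keeps the products addressable: two domains can never publish under the same identifier.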
• Data infrastructure as a platform: It supplies the tools, processing frameworks, and storage solutions to the various domain owners so that they can create their data pipelines and data products with them. Data infrastructure as a platform is scalable, with polyglot storage on demand. The platform enables encryption for data at rest and in motion, and supplies unified data access control across all storage systems. It makes data products discoverable, and supports metrics collection and sharing with various stakeholders. The platform supplies self-service tools or templates and is domain agnostic.
As shown in figure 3, the organization under the CIO (Chief Information Officer) has shrunk. The Business Relationship Managers (BRMs) have been tasked with performing the role of product owners. The business analysts, who were earlier part of the CIO organization, have moved to work alongside the product owners within the domain. The engineering manager's role is largely diminished or merged with that of the product owner. The erstwhile technical leads and the development team under the engineering manager are now part of the business group itself.
This means that, within a domain, product owners now have direct control over both the cross-functional and the specialized teams, reducing the hierarchical complexity and latency that were present earlier. The organization under the CIO focuses on building the frameworks and tools required by the domain teams, including self-service tools, accelerators, templates, and solutions, thereby bringing all the domain teams to the same level. This leads to standardization, automated policy enforcement, and support for interoperability and discoverability.
A domain consists of various data products, such as data pipelines or datasets. Each data pipeline, in turn, follows an extraction, transformation, and loading flow. Instead of integrating directly with the sources and targets, it makes use of standard ports. Ports come in different types, such as files, streams, and events, and provide an abstract, standardized way to integrate. Besides these, there are two more ports: one is used to share metrics and enable auditing, and the other shares the metadata that helps consumers discover the datasets.
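The port abstraction described above can be sketched as follows. This is a simplified model under assumed names: a pipeline publishes through typed output, metrics, and metadata ports instead of writing directly to targets.

```python
class Port:
    """An abstract integration point; kind could be 'file', 'stream', or 'event'."""
    def __init__(self, kind):
        self.kind = kind
        self.messages = []

    def publish(self, payload):
        self.messages.append(payload)

class DataProductPipeline:
    """A domain pipeline that exposes standard ports rather than direct integrations."""
    def __init__(self, name):
        self.name = name
        self.output = Port("file")       # the data product itself
        self.metrics = Port("event")     # shares metrics and enables auditing
        self.metadata = Port("file")     # shares metadata that aids discovery

    def run(self, raw_records):
        # A trivial transformation step: drop records without an id.
        cleaned = [r for r in raw_records if r.get("id") is not None]
        self.output.publish(cleaned)
        self.metrics.publish({"rows_in": len(raw_records), "rows_out": len(cleaned)})
        self.metadata.publish({"product": self.name, "fields": ["id"]})
```

Because consumers bind to ports rather than to concrete storage, the domain team can swap the underlying target without breaking its consumers.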
To enable these data pipelines and data products, the domain team makes use of developer tools exposed to it by the underlying platform. The data infrastructure as a platform provides common components for compute, database and storage, processing, and management. For compute, we can look at dedicated, virtualized, or serverless computing engines. On the database and storage side, we can have block storage, relational databases, and other specialized storage mediums. For processing, the engines can be dedicated or even serverless, to help process data at scale. For management, a set of tools provides information about the platform components so that domain owners, or product owners and their teams, can understand their data products and the platform.
On top of this, there is a layer of federated and global ecosystem governance, which deals with access and policy management, interoperability, and discoverability for the whole implementation by enforcing controls over access and policies. Authentication services like IAM (Identity & Access Management) from AWS (Amazon Web Services) act as a security blanket around the services and tools. The architecture depicted in figure 4 makes use of various AWS components and is for reference only; an actual implementation may have components from other cloud providers, open source, or a hybrid of these.
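At its core, the governance layer above amounts to a globally enforced policy check in front of every data product. The sketch below is a deliberately minimal, vendor-neutral illustration of that idea; the policy table and role names are assumptions, not an IAM API.

```python
# Central policy table: product id -> roles allowed to read it.
# In practice this would be backed by a managed service such as IAM.
POLICIES = {
    "sales.orders.daily": {"allowed_roles": {"analyst", "recall-mgmt"}},
}

def authorize(product_id, role):
    """Return True only if a policy exists for the product and permits the role."""
    policy = POLICIES.get(product_id)
    return policy is not None and role in policy["allowed_roles"]
```

The deny-by-default behavior (no policy means no access) is what lets governance stay global while the data itself remains distributed across domains.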
For an implementation involving a central data store, the ideal approach would have been to build a data warehouse, wherein a relational data model connects the data across the various segments, or one could even use a more performant graph-based data store to create the knowledge base. Keeping the organizational structure in perspective, to build such a system the BRM (Business Relationship Manager) would have to liaise with the business owners of the various segments or with other BRMs supporting the required business segments. The BRM is usually part of the IT organization, works in tandem with the business owners of the segments, and tries to bridge the gap between the business unit and the development team.
This means that to acquire data for manufacturing operations, supply chain, and other groups, a large exercise is needed to learn what data can be acquired, along with its formats, its schema, and its frequency and volume. What makes this complex is that these attributes, including format, schema, and volume, keep evolving as business requirements change. With every change in requirements and data attributes, the pipelines and processes around them need to be changed as well, leading to a vicious cycle of capturing the changes, analyzing the impact, planning incremental changes, and deployment and maintenance. Finding the root cause of a problem is difficult, as each incident has to be traced back through this chain. This ongoing exercise is required to make sure the systems are correctly ingesting the data and analyzing and aggregating it in the central data store.
Apart from this, ensuring data quality is another challenge. For example, in a manufacturing enterprise such as an automotive manufacturer, a data discrepancy may cause the engine number, the chassis number, and sometimes the customer ID to not be reflected in the system. To figure out where the issue occurred, each of the pipelines has to be traced back and validated.
In an alternate implementation involving data mesh, the ideal setup would tag each business segment to a domain. For example, Supply Chain, Manufacturing Operations, CRM, and Recall Management would be individual domains. Each domain is responsible for the data products (datasets) it provides. Each data product has metadata associated with it, including a data product identifier (unique across the enterprise), schema, format, frequency, and volume information. The data products are created by a team that understands the functional aspects of the segment and works closely to design, build, and publish them. During this process, the team may subscribe to datasets from other domains, then enrich and transform them, and validate them for data quality before sharing them with consumers. Consumers are the actual users of the data product, and can be individuals or other domains in the enterprise.
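The subscribe, enrich, and validate steps above can be sketched as follows. This is a hedged illustration; the catalog structure, field names, and the region-lookup enrichment are assumptions chosen to echo the automotive example.

```python
def subscribe(catalog, product_id):
    """Fetch a published dataset from a (stubbed) enterprise catalog."""
    return catalog[product_id]

def enrich(records, region_lookup):
    """Enrichment step: join each build record to a region via its plant id."""
    return [{**r, "region": region_lookup.get(r["plant_id"], "unknown")}
            for r in records]

def validate(records, required_fields):
    """Quality gate before sharing: flag records missing any required field."""
    missing = [r for r in records
               if any(r.get(f) is None for f in required_fields)]
    return len(missing) == 0, missing
```

Because the quality gate runs inside the producing domain, a missing engine or chassis number is caught at the source instead of being traced back through a chain of central pipelines.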
Issues related to data quality, such as missing data or lack of precision, or those related to a change in the schema, are usually not prevalent in a data mesh implementation. The consumer domain in this example is Recall Management, which can recreate the knowledge base from the raw data in case of a major data discrepancy, if one happens at all.
The way business and IT teams are structured today may change drastically. The top three technical challenges are unified data access across various data storage systems, linking various domains and products at scale, and data infrastructure as a self-serve platform. There are also edge cases that could be encountered in any enterprise implementation of the data mesh; for example, what if a data domain has multiple owners? Also, when implementing the data mesh on the cloud, there is an extremely high probability of settling for the PaaS (Platform as a Service) offerings of the cloud service provider. These work well as long as they are the sole constituents of the infrastructure platform; things get complex as other storage and data processing systems, such as Oracle, Teradata, and Snowflake, are introduced, each with its own closed, proprietary access and management controls. On the governance side, global interoperability, access control, and the like are a challenge today even in on-prem and single-cloud implementations. Given that future implementations will move towards hybrid and multi-cloud setups, it is going to take a lot of coordination between enterprise IT, cloud service providers, and incumbent vendors.
Data mesh certainly has many advantages over a central data store implementation, and those advantages can endure if there is the right balance and acceptance of changes in organizational structure, processes, and technology.
www.hcltech.com
As a leading global technology company, HCL takes pride in its diversity, social responsibility, sustainability, and education initiatives. For the 12 months ending March 31, 2020, HCL had a consolidated revenue of US$ 10 billion, and its 159,000 ideapreneurs operate out of 50 countries.
For more details contact: ers.info@hcl.com
Follow us on twitter: http://twitter.com/hclers and our blog http://ers.hclblogs.com/
Visit our website: http://www.hcltech.com/engineering-services/