Modern Data Architecture Concepts

2/29/24, 8:35 AM Modern Data Architecture Concepts
Articles People Learning Jobs
Modern Data Architecture Concepts

Irfan Azim Saherwardi + Follow
Chief Technologist (CTO) @DXC Technology | Managed IT Services |
Digital Transformation | Operating Model | Technology Innovations | IT
Roadmap
Published Jun 7, 2023
Modern data architecture refers to a contemporary approach to designing and

implementing data systems that can effectively handle the challenges and
opportunities presented by the ever-increasing volume, variety, and velocity of
data.
Key characteristics and components of modern data architecture include:
1. Data integration and ingestion: Modern data architecture incorporates robust data
integration capabilities to collect data from diverse sources, including structured
databases, unstructured data, streaming data, and external data feeds. It enables
efficient and real-time data ingestion, ensuring that data is captured and made available
for analysis in a timely manner.
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 1/21
2. Data storage and management: Modern data architecture leverages a combination of

data storage technologies, such as data lakes, data warehouses, and NoSQL databases,
to accommodate different data types and usage patterns. It allows for storing raw and
processed data in a cost-effective manner while enabling fast and efficient data access.
3. Data processing and analytics: Modern data architecture supports various data
processing and analytics techniques, such as batch processing, stream processing, and
real-time analytics. It incorporates technologies like Apache Spark, Apache Flink, and
cloud-based analytics services to handle large-scale data processing and enable
advanced analytics, including machine learning and artificial intelligence.
Like Comment Share 25 · 4 Comments
4. Cloud-native and hybrid environments: Modern data architecture embraces cloud
computing and takes advantage of the scalability, flexibility, and cost efficiency offered
by cloud platforms. It also accommodates hybrid environments, where data and
applications span across on-premises infrastructure and multiple cloud providers.
5. Data governance and security: Modern data architecture prioritizes data governance
and security practices to ensure data privacy, regulatory compliance, and protection
against unauthorized access or breaches. It includes mechanisms for data cataloging,
metadata management, data lineage tracking, and access control policies.
6. Self-service data access and analytics: Modern data architecture empowers business
users and data analysts with self-service capabilities to access, analyze, and visualize
data. It provides user-friendly tools, such as data discovery platforms, data preparation
tools, and interactive dashboards, to enable data exploration and insights generation.
7. Scalability and agility: Modern data architecture is designed to scale seamlessly as data
volumes and processing needs grow. It incorporates technologies and practices that
enable horizontal scalability and easy integration of new data sources and analytics
capabilities.
8. DataOps and DevOps principles: Modern data architecture embraces DataOps and
DevOps principles, fostering collaboration and agility in data-related projects. It involves
close collaboration between data engineers, data scientists, and operations teams to
ensure smooth data pipeline deployment, monitoring, and maintenance.
The benefits of modern data architecture include improved data accessibility, faster
time-to-insights, enhanced decision-making, and the ability to leverage advanced
analytics techniques. It enables organizations to leverage the full potential of their
data assets to drive innovation, gain a competitive edge, and respond rapidly to
changing business needs.
However, it's important to note that modern data architecture is not a one-size-fits-
all approach. Organizations should carefully assess their specific requirements,
data landscape, and business objectives to design and implement a tailored
modern data architecture that aligns with their needs and resources.
Topics Covered in this document:
DATA WAREHOUSE, DATA MART, DATA LAKE, DATA PIPELINE, DATA MESH,
DATA LAKE HOUSE, DATA SWAMP, DATA FABRIC
Details of each topic as follows:

DATA WAREHOUSE
A data warehouse is a centralized repository that stores large amounts of

structured and semi-structured data from various sources within an organization. It
is designed to support business intelligence (BI) activities such as data analysis,
reporting, and decision-making.
The main purpose of a data warehouse is to provide a unified view of an
organization's data, integrating data from multiple sources and transforming it into
a consistent and reliable format. This allows users to easily access and analyze the
data for gaining insights, making informed business decisions, and supporting
strategic planning.
Data warehouses typically employ a process known as Extract, Transform, Load
(ETL) to acquire data from different operational systems, cleanse and transform it
to meet specific requirements, and then load it into the data warehouse. ETL
processes often involve data integration, data quality checks, data transformation,
and data aggregation.
The structure of a data warehouse is optimized for analytical processing rather than
transactional processing. It typically follows a dimensional model, using a star or
snowflake schema, which organizes data into fact tables (containing numerical or
quantitative data) and dimension tables (containing descriptive attributes). This
schema design facilitates efficient querying and analysis of data.
Data warehouses also support various data storage and retrieval technologies, such
as online analytical processing (OLAP), which allows for multidimensional analysis
and reporting, and data mining techniques for discovering patterns and trends in
the data.
Benefits of using a data warehouse include:
1. Data integration: Consolidating data from disparate sources into a single, unified view.
2. Data quality: Improving the quality and consistency of data through cleansing and
transformation processes.
3. Decision-making: Enabling better and faster decision-making by providing timely and
accurate information.
4. Historical analysis: Storing historical data for trend analysis, forecasting, and identifying
patterns.
5. Performance optimization: Providing optimized data structures and query processing for
efficient data retrieval.
6. Scalability: Handling large volumes of data and supporting the growth of data over time.
7. Security and access control: Implementing appropriate security measures to protect
sensitive data and controlling user access to the data.
Overall, a data warehouse serves as a valuable tool for organizations to leverage

their data assets and gain actionable insights to drive business growth and improve
operational efficiency.
DATA MART
A data mart is a subset or a specialized version of a data warehouse. It is focused

on a specific business function or department within an organization, providing a
more tailored and targeted view of the data. While a data warehouse contains a
comprehensive collection of data from various sources across the organization, a
data mart is designed to meet the specific needs of a particular user group or
business unit.
Data marts are created by selecting and aggregating relevant data from the data
warehouse and transforming it to align with the requirements of the specific
business area. This process is known as data marting or data mart creation.
The main characteristics and benefits of data marts include:
1. Focus: Data marts are designed to address the specific analytical needs of a particular
business function or department, such as sales, marketing, finance, or human
resources. They provide a more granular and targeted view of the data, enabling users
to access and analyze the information most relevant to their needs.
2. Simplified access: Data marts offer a simplified and user-friendly interface for accessing
and querying data. They often employ pre-defined reports, dashboards, and data
models tailored to the requirements of the specific user group, making it easier for users
to retrieve the necessary information without complex queries or technical expertise.
3. Improved performance: By containing a subset of data, data marts can be optimized for
faster query performance. Aggregating and summarizing the data at the data mart level
reduces the complexity and volume of data, resulting in quicker response times for
analytical queries.
4. Enhanced data governance: Data marts can be managed and governed independently,
allowing business units to have more control over their data. This can facilitate data
ownership, accountability, and compliance with regulatory requirements specific to a
particular department.
5. Flexibility and agility: Data marts offer flexibility in terms of customization and
adaptability. They can be quickly modified or expanded to meet evolving business
needs, providing agility in responding to changing requirements without impacting the
overall data warehouse architecture.
6. Cost-effectiveness: Since data marts are focused on specific business areas, they
require fewer resources and infrastructure compared to a comprehensive data
warehouse. This can lead to cost savings in terms of storage, processing, and
maintenance.
It's worth noting that data marts can be built directly from operational systems
without using a data warehouse. In such cases, they are referred to as independent
data marts. However, integrating data marts with a data warehouse architecture
provides the advantage of a centralized and consistent data repository, enabling
data sharing and integration across different business areas.
Overall, data marts are valuable tools for delivering relevant and actionable
information to specific user groups, promoting self-service analytics, and
supporting decision-making at a departmental level.
DATA LAKE
A data lake is a centralized repository that stores large volumes of structured, semi-
structured, and unstructured data in its raw or native format. Unlike a data
warehouse or data mart, which follows a predefined schema and structure, a data
lake allows for the storage of data in its original form, without the need for upfront
transformation or normalization.
In a data lake, data is stored in a flat architecture, typically using a distributed file
system such as Apache Hadoop or cloud-based storage systems like Amazon S3 or
Azure Data Lake Storage. The data lake can accommodate various types of data,
including text files, documents, sensor data, log files, social media feeds, and more.
The key characteristics and features of a data lake include:
1. Storage of diverse data: A data lake can store structured, semi-structured, and
unstructured data from different sources, without enforcing a specific schema or data
model. This flexibility allows organizations to capture and store vast amounts of data in
its raw format, preserving the data's integrity and enabling future analysis.
2. Scalability: Data lakes are designed to handle massive volumes of data, making them
highly scalable. They can scale horizontally by adding more storage nodes or by
leveraging cloud-based storage resources, ensuring the data lake can accommodate
the growing needs of the organization.
3. Schema-on-read: In a data lake, the structure and schema of the data are applied when
the data is accessed or queried, rather than during the data ingestion process. This
"schema-on-read" approach provides greater flexibility for data exploration and
analysis, as users can interpret and transform the data based on their specific needs at
the time of analysis.
4. Data exploration and discovery: Data lakes encourage data exploration and discovery by
offering a wide range of data sources and formats. Analysts and data scientists can
explore and experiment with different datasets, combine and correlate data from various
sources, and uncover hidden insights or patterns that may not have been apparent
before.
5. Integration with analytical tools: Data lakes integrate with various analytical and
processing tools, such as Apache Spark, Apache Hive, and machine learning
frameworks, allowing users to perform complex analytics and derive meaningful insights
from the data stored in the lake.
6. Data governance and security: Data lakes require robust governance and security
measures to ensure data privacy, compliance, and access control. Data governance
practices, such as data cataloging, metadata management, and data lineage, help
maintain data quality, track data provenance, and establish data usage policies.
7. Data democratization: Data lakes promote data democratization by providing self-
service access to data for different user groups within the organization. Business users,
data analysts, and data scientists can access the data lake, discover relevant datasets,
and derive insights without heavy reliance on IT or data engineering teams.
It's important to note that while a data lake provides a flexible and scalable storage
solution, it requires proper data management and governance practices to ensure
data quality, data integration, and appropriate access controls. Without proper
management, a data lake can become a data swamp, where the data becomes
disorganized, difficult to find, and lacking in meaningful insights.
In summary, a data lake is a centralized repository that stores diverse and raw data,
enabling organizations to capture, store, and analyze vast amounts of information in
a flexible and scalable manner, promoting data exploration and discovery for
various analytics and decision-making purposes.
DATA PIPELINE
A data pipeline is a set of processes and technologies used to extract, transform,

and load (ETL) data from various sources, transform it into a desired format, and
deliver it to a target destination for further analysis or storage. It involves a series of
steps that move data through the pipeline, ensuring its quality, reliability, and
accessibility.
The main components of a data pipeline include:
1. Data ingestion: Data is collected from different sources such as databases, applications,
APIs, files, or streaming platforms. It can be real-time or batch data.
2. Data transformation: The ingested data is processed and transformed according to
specific business rules, cleaning, aggregating, filtering, and enriching the data as
needed. This step ensures data consistency, quality, and conformity to the desired
format.
3. Data storage: Transformed data is stored in a suitable data storage system or repository.
This can be a data warehouse, a data lake, a cloud-based storage service, or any other
storage solution that fits the requirements of the pipeline.
4. Data integration: In some cases, data from multiple sources needs to be integrated and
consolidated. This involves merging data from different systems or databases to create
a unified view of the data.
5. Data validation and quality control: Data pipeline processes often include data validation
checks to ensure data accuracy, completeness, and adherence to defined business
rules. Quality control mechanisms may involve data profiling, anomaly detection, and
data cleansing techniques.
6. Data movement and orchestration: Data movement involves transferring data from one
stage of the pipeline to another. Orchestration involves managing the flow of data,
scheduling tasks, and coordinating the execution of different pipeline components.
7. Data delivery and consumption: Once the data is transformed and stored, it is made
available for consumption by downstream applications, analytics tools, reporting
systems, or other data consumers. This can be through APIs, dashboards, data access
layers, or direct integration with target systems.
Data pipelines can be implemented using various technologies, depending on the
specific requirements and the complexity of the data processing tasks. Some
popular tools and frameworks for building data pipelines include Apache Kafka,
Apache Airflow, Apache Spark, Apache NiFi, and commercial solutions like
Informatica PowerCenter or Talend.
The benefits of using data pipelines include:
1. Data reliability and consistency: Data pipelines ensure that data is processed
consistently, adhering to defined standards and rules, resulting in reliable and
consistent data across the organization.
2. Data freshness: Real-time or near-real-time data pipelines enable organizations to have
access to the most up-to-date information, supporting timely decision-making and
insights.
3. Automation and efficiency: Data pipelines automate the extraction, transformation, and
loading processes, reducing manual effort and improving efficiency in data integration
and analysis.
4. Scalability: Data pipelines can handle large volumes of data and are designed to scale
horizontally to accommodate increasing data volumes and processing demands.
5. Data governance and compliance: Data pipelines facilitate the implementation of data
governance policies, data lineage tracking, and compliance with regulatory
requirements by ensuring data quality and traceability.
In summary, a data pipeline is a systematic and automated process for moving,
transforming, and delivering data from multiple sources to a target destination. It
plays a crucial role in data integration, data management, and ensuring reliable and
actionable data for analytics, reporting, and decision-making purposes.
DATA MESH
Data Mesh is a relatively new architectural paradigm and organizational approach

for managing data within an organization. It was introduced by Zhamak Dehghani, a
software architect at ThoughtWorks, to address the challenges of data integration
and scalability in large-scale data environments.
In a Data Mesh approach, the focus is shifted from a centralized data infrastructure
to a decentralized model, where data ownership and governance are distributed
across individual domains or teams within the organization. The core principles of
Data Mesh include:
1. Domain-oriented decentralized teams: Instead of having a centralized data team
responsible for all data-related tasks, the Data Mesh approach advocates for domain-
oriented teams with end-to-end ownership of specific business domains. These teams
are responsible for the data within their domain, including data collection, storage,
processing, and analytics.
2. Self-serve data infrastructure: Each domain team is empowered to build and manage its
own data infrastructure using modern technologies and tools. This self-serve
infrastructure provides flexibility and autonomy to domain teams to choose the most
suitable technologies and approaches for their specific needs.
3. Federated data governance: Data governance responsibilities are distributed across

domain teams. Each team is accountable for defining and implementing data quality
standards, data privacy policies, and regulatory compliance within their domain.
Collaboration and coordination between domain teams are facilitated through federated
governance practices.
4. Product thinking and data productization: Data is treated as a product, and domain
teams are encouraged to create data products that provide value to other teams or
stakeholders within the organization. Data products are well-defined, self-contained
units of data that are discoverable, reusable, and scalable.
5. Data as a first-class citizen: Data is considered a key asset of the organization and is
treated with the same importance as other software products. This means applying
engineering practices, such as version control, testing, monitoring, and documentation,
to data assets.
The Data Mesh approach aims to address some of the common challenges of
traditional centralized data architectures, such as bottlenecks, lack of agility, and
difficulties in scaling. By distributing data ownership and promoting autonomy and
accountability within domain teams, Data Mesh enables organizations to scale their
data capabilities and leverage the expertise of different teams for better data
management and utilization.
It's important to note that implementing a Data Mesh approach requires careful
consideration of organizational culture, collaboration practices, and technology
infrastructure. It may not be suitable for every organization or every data use case.
However, for large and complex data environments, Data Mesh provides a
framework for scaling data capabilities, improving data quality, and fostering a
culture of data ownership and collaboration.
DATA LAKE HOUSE
The term "Data Lakehouse" refers to a data architecture that combines the best
aspects of a data lake and a data warehouse. It aims to provide a unified and
flexible data platform that can handle both traditional structured data and newer
forms of semi-structured and unstructured data.
In a Data Lakehouse architecture, data is stored in a centralized repository like a
data lake, which allows for the storage of raw and diverse data in its native format.
This enables organizations to capture and store vast amounts of data without
upfront data transformation.
However, unlike a traditional data lake, a Data Lakehouse incorporates some key
elements of a data warehouse to improve data organization, processing, and query
performance. These elements include schema enforcement, indexing, and
optimized query engines.
Here are some key features and benefits of a Data Lakehouse:
1. Schema enforcement: A Data Lakehouse introduces schema enforcement, which means
applying a schema or structure to the data stored in the data lake. This helps ensure
data consistency, enforce data quality controls, and improve data governance.
2. Optimized query engines: Data Lakehouse architectures often employ modern query
engines, such as Apache Spark or Presto, which provide optimized performance for
querying and analyzing data stored in the data lake. These engines support both
structured and semi-structured data, enabling efficient data processing and exploration.
3. Indexing and metadata management: Data Lakehouse architectures incorporate

indexing mechanisms and metadata management to improve data discoverability and
query performance. Indexing allows for faster data retrieval, while metadata
management helps catalog and organize data assets for easier data exploration and
analysis.
4. Hybrid workload support: A Data Lakehouse accommodates both batch and real-time
processing workloads. It enables organizations to perform complex analytics on large
historical datasets while also supporting near-real-time data ingestion and processing
for time-sensitive use cases.
5. Data governance and security: Data Lakehouse architectures incorporate data
governance and security practices, including access controls, data lineage tracking, and
compliance mechanisms. This helps ensure data privacy, regulatory compliance, and
adherence to data governance policies.
By combining the flexibility and scalability of a data lake with the structured query
capabilities of a data warehouse, a Data Lakehouse aims to provide a unified and
powerful data platform that can meet the diverse needs of modern data-driven
organizations. It allows for efficient data exploration, analytics, and reporting,
enabling faster and more accurate decision-making based on a wide variety of data
sources.
DATA SWAMP
A data swamp is a term used to describe a situation where a data lake becomes
disorganized, unmanageable, and difficult to extract value from. Unlike a well-
organized and governed data lake, a data swamp lacks proper data governance,
structure, and quality control, resulting in a chaotic and messy data environment.
The characteristics of a data swamp include:
1. Lack of organization: Data in a data swamp is often unstructured, untagged, and poorly
categorized. There is no consistent naming convention, metadata, or data catalog,
making it challenging to find and understand the data.
2. Poor data quality: A data swamp typically contains low-quality and unreliable data. It
may include duplicated, incomplete, outdated, or incorrect information, leading to
erroneous insights and analysis.
3. Limited discoverability: Due to the absence of proper data indexing and metadata
management, finding relevant data within a data swamp becomes a time-consuming
and frustrating task. Users may struggle to identify the right data sources, understand
their contents, and determine their relevance to their analysis or business needs.
4. Inadequate data governance: Data governance practices, such as data lineage tracking,
data security controls, and compliance measures, are often absent or poorly
implemented in a data swamp. This poses risks to data privacy, regulatory compliance,
and overall data integrity.
5. Lack of collaboration: A data swamp hinders collaboration among users and teams.
Without clear data organization and governance, it becomes challenging for different
stakeholders to share and collaborate on data-related projects.
The presence of a data swamp can have several negative consequences for an
organization:
1. Reduced productivity: Users spend significant time searching for data and cleaning it up
before analysis, leading to decreased productivity and delayed insights.
2. Inaccurate decision-making: Data swamps can lead to unreliable and inconsistent data,
which can result in inaccurate decision-making and incorrect insights.
3. Increased costs: The lack of organization and quality control in a data swamp can result
in wasted resources, as data professionals spend excessive time on data cleanup and
rectification efforts.
4. Compliance and security risks: The absence of proper data governance practices in a
data swamp can expose sensitive data, violate regulatory requirements, and
compromise data security.
To avoid or address a data swamp, organizations should focus on implementing
proper data governance practices, including data quality controls, metadata
management, data cataloging, and data stewardship. Establishing clear data
ownership and responsibility, implementing data validation processes, and defining
data governance policies are crucial steps in preventing or mitigating the formation
of a data swamp.
Regular data maintenance and clean-up activities, along with the use of data
management tools and technologies, can help ensure that the data lake remains
organized, useful, and valuable to the organization.
DATA FABRIC
Data fabric refers to an architectural approach and set of technologies that enable
organizations to seamlessly integrate and manage data across distributed systems,
including on-premises and cloud environments. It provides a unified and consistent
view of data assets, regardless of their location or format, and facilitates data
access, integration, and analysis.
The key characteristics and components of a data fabric include:
1. Data integration and interoperability: A data fabric enables data integration by

connecting and harmonizing data from various sources, systems, and formats. It
provides interoperability between different data platforms, databases, applications, and
services, allowing data to flow seamlessly across the organization.
2. Distributed data management: A data fabric enables organizations to manage and
process data across distributed environments, such as on-premises data centers, public
clouds, and edge devices. It provides capabilities to access and work with data
regardless of its location, ensuring data availability and consistency.
3. Data abstraction and virtualization: Data fabric abstracts the underlying data
infrastructure and provides a virtualized layer that enables users and applications to
interact with data without needing to know the specific details of where and how the
data is stored. This abstraction layer simplifies data access and eliminates the need for
direct integration with underlying systems.
4. Data governance and security: A data fabric incorporates data governance and security
measures to ensure data privacy, compliance, and access control. It includes
mechanisms for data classification, encryption, authentication, authorization, and
auditing, providing a secure and governed data environment.
5. Metadata management and cataloging: Data fabric solutions typically include metadata
management capabilities, allowing organizations to capture and manage metadata
about their data assets. This includes information about data sources, data lineage, data
quality, and data semantics. Metadata cataloging enables users to discover and
understand available data assets, promoting data reuse and self-service data access.
6. Data orchestration and processing: Data fabric provides capabilities for data
orchestration, including data movement, transformation, and processing. It allows
organizations to define data pipelines, workflows, and transformations to extract,
transform, and load data across different systems and environments.
7. Scalability and elasticity: Data fabric is designed to scale horizontally and vertically to
accommodate growing data volumes and processing demands. It leverages cloud-
native technologies and elastic computing resources to ensure scalability and
performance.
By implementing a data fabric approach, organizations can achieve several
benefits, including:
1. Improved data agility and accessibility: Data fabric provides a unified and consistent
view of data assets, making it easier for users and applications to access and leverage
data for analysis, reporting, and decision-making.
2. Accelerated data integration: Data fabric simplifies and accelerates data integration by
providing a standardized approach to connect and harmonize data from diverse
sources.
3. Enhanced data governance and compliance: Data fabric incorporates data governance
practices and security controls, ensuring data privacy, regulatory compliance, and
proper access control.
4. Flexibility and scalability: Data fabric enables organizations to scale their data
infrastructure and capabilities by leveraging distributed and cloud-based technologies.
5. Reduced data silos and fragmentation: Data fabric helps break down data silos and
enables data sharing and collaboration across the organization, promoting a unified data
culture.
In summary, a data fabric is an architectural approach and technology framework
that enables seamless integration, management, and analysis of data across
distributed systems. It provides a unified and consistent view of data assets,
enhances data agility, and promotes data governance and accessibility.
Regards,
Irfan Saherwardi
Account Technical Leader@DXC
Michal Pilarski 2w
GIS, Python, ETL, FME, Spatial Data and Testing Specialist w Boeing Company
Very helpful article. Thanks. So my proposals of naming convention for poor quality of data
architecture:
Data Lakehouse --> Data Swamphouse
Data Warehouse --> Data Warhouse
Data Mesh --> Data Mess
Data Mart --> Data Mort
Data Pipeline --> Data PeeLine
Data Fabric --> Data Macabric
Make sense?
Like · Reply
Franck Christien 4mo

Modern Data Architecture Concepts

Uploaded by

Copyright:

Available Formats

You might also like

Modern Data Architecture Concepts

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Modern Data Architecture Concepts

Uploaded by

Copyright:

Available Formats

2/29/24, 8:35 AM Modern Data Architecture Concepts

Articles People Learning Jobs

Modern Data Architecture Concepts

Modern data architecture refers to a contemporary approach to designing and

2. Data storage and management: Modern data architecture leverages a combination of

Details of each topic as follows:

A data warehouse is a centralized repository that stores large amounts of

Overall, a data warehouse serves as a valuable tool for organizations to leverage

A data mart is a subset or a specialized version of a data warehouse. It is focused

A data pipeline is a set of processes and technologies used to extract, transform,

Data Mesh is a relatively new architectural paradigm and organizational approach

3. Federated data governance: Data governance responsibilities are distributed across

3. Indexing and metadata management: Data Lakehouse architectures incorporate

1. Data integration and interoperability: A data fabric enables data integration by

You might also like