Professional Documents
Culture Documents
Modern Data Architecture Concepts
Modern Data Architecture Concepts
Modern Data Architecture Concepts
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 1/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 2/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
However, it's important to note that modern data architecture is not a one-size-fits-
all approach. Organizations should carefully assess their specific requirements,
data landscape, and business objectives to design and implement a tailored
modern data architecture that aligns with their needs and resources.
Topics Covered in this document:
DATA WAREHOUSE, DATA MART, DATA LAKE, DATA PIPELINE, DATA MESH,
DATA LAKE HOUSE, DATA SWAMP, DATA FABRIC
data for gaining insights, making informed business decisions, and supporting
strategic planning.
Data warehouses typically employ a process known as Extract, Transform, Load
(ETL) to acquire data from different operational systems, cleanse and transform it
to meet specific requirements, and then load it into the data warehouse. ETL
processes often involve data integration, data quality checks, data transformation,
and data aggregation.
The structure of a data warehouse is optimized for analytical processing rather than
transactional processing. It typically follows a dimensional model, using a star or
snowflake schema, which organizes data into fact tables (containing numerical or
quantitative data) and dimension tables (containing descriptive attributes). This
schema design facilitates efficient querying and analysis of data.
Data warehouses also support various data storage and retrieval technologies, such
as online analytical processing (OLAP), which allows for multidimensional analysis
and reporting, and data mining techniques for discovering patterns and trends in
the data.
Benefits of using a data warehouse include:
1. Data integration: Consolidating data from disparate sources into a single, unified view.
2. Data quality: Improving the quality and consistency of data through cleansing and
transformation processes.
3. Decision-making: Enabling better and faster decision-making by providing timely and
accurate information.
4. Historical analysis: Storing historical data for trend analysis, forecasting, and identifying
patterns.
5. Performance optimization: Providing optimized data structures and query processing for
efficient data retrieval.
6. Scalability: Handling large volumes of data and supporting the growth of data over time.
7. Security and access control: Implementing appropriate security measures to protect
sensitive data and controlling user access to the data.
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 4/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
models tailored to the requirements of the specific user group, making it easier for users
to retrieve the necessary information without complex queries or technical expertise.
3. Improved performance: By containing a subset of data, data marts can be optimized for
faster query performance. Aggregating and summarizing the data at the data mart level
reduces the complexity and volume of data, resulting in quicker response times for
analytical queries.
4. Enhanced data governance: Data marts can be managed and governed independently,
allowing business units to have more control over their data. This can facilitate data
ownership, accountability, and compliance with regulatory requirements specific to a
particular department.
5. Flexibility and agility: Data marts offer flexibility in terms of customization and
adaptability. They can be quickly modified or expanded to meet evolving business
needs, providing agility in responding to changing requirements without impacting the
overall data warehouse architecture.
6. Cost-effectiveness: Since data marts are focused on specific business areas, they
require fewer resources and infrastructure compared to a comprehensive data
warehouse. This can lead to cost savings in terms of storage, processing, and
maintenance.
It's worth noting that data marts can be built directly from operational systems
without using a data warehouse. In such cases, they are referred to as independent
data marts. However, integrating data marts with a data warehouse architecture
provides the advantage of a centralized and consistent data repository, enabling
data sharing and integration across different business areas.
Overall, data marts are valuable tools for delivering relevant and actionable
information to specific user groups, promoting self-service analytics, and
supporting decision-making at a departmental level.
DATA LAKE
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 6/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
A data lake is a centralized repository that stores large volumes of structured, semi-
structured, and unstructured data in its raw or native format. Unlike a data
warehouse or data mart, which follows a predefined schema and structure, a data
lake allows for the storage of data in its original form, without the need for upfront
transformation or normalization.
In a data lake, data is stored in a flat architecture, typically using a distributed file
system such as Apache Hadoop or cloud-based storage systems like Amazon S3 or
Azure Data Lake Storage. The data lake can accommodate various types of data,
including text files, documents, sensor data, log files, social media feeds, and more.
The key characteristics and features of a data lake include:
1. Storage of diverse data: A data lake can store structured, semi-structured, and
unstructured data from different sources, without enforcing a specific schema or data
model. This flexibility allows organizations to capture and store vast amounts of data in
its raw format, preserving the data's integrity and enabling future analysis.
2. Scalability: Data lakes are designed to handle massive volumes of data, making them
highly scalable. They can scale horizontally by adding more storage nodes or by
leveraging cloud-based storage resources, ensuring the data lake can accommodate
the growing needs of the organization.
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 7/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
3. Schema-on-read: In a data lake, the structure and schema of the data are applied when
the data is accessed or queried, rather than during the data ingestion process. This
"schema-on-read" approach provides greater flexibility for data exploration and
analysis, as users can interpret and transform the data based on their specific needs at
the time of analysis.
4. Data exploration and discovery: Data lakes encourage data exploration and discovery by
offering a wide range of data sources and formats. Analysts and data scientists can
explore and experiment with different datasets, combine and correlate data from various
sources, and uncover hidden insights or patterns that may not have been apparent
before.
5. Integration with analytical tools: Data lakes integrate with various analytical and
processing tools, such as Apache Spark, Apache Hive, and machine learning
frameworks, allowing users to perform complex analytics and derive meaningful insights
from the data stored in the lake.
6. Data governance and security: Data lakes require robust governance and security
measures to ensure data privacy, compliance, and access control. Data governance
practices, such as data cataloging, metadata management, and data lineage, help
maintain data quality, track data provenance, and establish data usage policies.
7. Data democratization: Data lakes promote data democratization by providing self-
service access to data for different user groups within the organization. Business users,
data analysts, and data scientists can access the data lake, discover relevant datasets,
and derive insights without heavy reliance on IT or data engineering teams.
It's important to note that while a data lake provides a flexible and scalable storage
solution, it requires proper data management and governance practices to ensure
data quality, data integration, and appropriate access controls. Without proper
management, a data lake can become a data swamp, where the data becomes
disorganized, difficult to find, and lacking in meaningful insights.
In summary, a data lake is a centralized repository that stores diverse and raw data,
enabling organizations to capture, store, and analyze vast amounts of information in
a flexible and scalable manner, promoting data exploration and discovery for
various analytics and decision-making purposes.
DATA PIPELINE
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 8/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
rules. Quality control mechanisms may involve data profiling, anomaly detection, and
data cleansing techniques.
6. Data movement and orchestration: Data movement involves transferring data from one
stage of the pipeline to another. Orchestration involves managing the flow of data,
scheduling tasks, and coordinating the execution of different pipeline components.
7. Data delivery and consumption: Once the data is transformed and stored, it is made
available for consumption by downstream applications, analytics tools, reporting
systems, or other data consumers. This can be through APIs, dashboards, data access
layers, or direct integration with target systems.
Data pipelines can be implemented using various technologies, depending on the
specific requirements and the complexity of the data processing tasks. Some
popular tools and frameworks for building data pipelines include Apache Kafka,
Apache Airflow, Apache Spark, Apache NiFi, and commercial solutions like
Informatica PowerCenter or Talend.
The benefits of using data pipelines include:
1. Data reliability and consistency: Data pipelines ensure that data is processed
consistently, adhering to defined standards and rules, resulting in reliable and
consistent data across the organization.
2. Data freshness: Real-time or near-real-time data pipelines enable organizations to have
access to the most up-to-date information, supporting timely decision-making and
insights.
3. Automation and efficiency: Data pipelines automate the extraction, transformation, and
loading processes, reducing manual effort and improving efficiency in data integration
and analysis.
4. Scalability: Data pipelines can handle large volumes of data and are designed to scale
horizontally to accommodate increasing data volumes and processing demands.
5. Data governance and compliance: Data pipelines facilitate the implementation of data
governance policies, data lineage tracking, and compliance with regulatory
requirements by ensuring data quality and traceability.
In summary, a data pipeline is a systematic and automated process for moving,
transforming, and delivering data from multiple sources to a target destination. It
plays a crucial role in data integration, data management, and ensuring reliable and
actionable data for analytics, reporting, and decision-making purposes.
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 10/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
DATA MESH
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 11/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 12/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
The term "Data Lakehouse" refers to a data architecture that combines the best
aspects of a data lake and a data warehouse. It aims to provide a unified and
flexible data platform that can handle both traditional structured data and newer
forms of semi-structured and unstructured data.
In a Data Lakehouse architecture, data is stored in a centralized repository like a
data lake, which allows for the storage of raw and diverse data in its native format.
This enables organizations to capture and store vast amounts of data without
upfront data transformation.
However, unlike a traditional data lake, a Data Lakehouse incorporates some key
elements of a data warehouse to improve data organization, processing, and query
performance. These elements include schema enforcement, indexing, and
optimized query engines.
Here are some key features and benefits of a Data Lakehouse:
1. Schema enforcement: A Data Lakehouse introduces schema enforcement, which means
applying a schema or structure to the data stored in the data lake. This helps ensure
data consistency, enforce data quality controls, and improve data governance.
2. Optimized query engines: Data Lakehouse architectures often employ modern query
engines, such as Apache Spark or Presto, which provide optimized performance for
querying and analyzing data stored in the data lake. These engines support both
structured and semi-structured data, enabling efficient data processing and exploration.
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 13/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 14/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
A data swamp is a term used to describe a situation where a data lake becomes
disorganized, unmanageable, and difficult to extract value from. Unlike a well-
organized and governed data lake, a data swamp lacks proper data governance,
structure, and quality control, resulting in a chaotic and messy data environment.
The characteristics of a data swamp include:
1. Lack of organization: Data in a data swamp is often unstructured, untagged, and poorly
categorized. There is no consistent naming convention, metadata, or data catalog,
making it challenging to find and understand the data.
2. Poor data quality: A data swamp typically contains low-quality and unreliable data. It
may include duplicated, incomplete, outdated, or incorrect information, leading to
erroneous insights and analysis.
3. Limited discoverability: Due to the absence of proper data indexing and metadata
management, finding relevant data within a data swamp becomes a time-consuming
and frustrating task. Users may struggle to identify the right data sources, understand
their contents, and determine their relevance to their analysis or business needs.
4. Inadequate data governance: Data governance practices, such as data lineage tracking,
data security controls, and compliance measures, are often absent or poorly
implemented in a data swamp. This poses risks to data privacy, regulatory compliance,
and overall data integrity.
5. Lack of collaboration: A data swamp hinders collaboration among users and teams.
Without clear data organization and governance, it becomes challenging for different
stakeholders to share and collaborate on data-related projects.
The presence of a data swamp can have several negative consequences for an
organization:
1. Reduced productivity: Users spend significant time searching for data and cleaning it up
before analysis, leading to decreased productivity and delayed insights.
2. Inaccurate decision-making: Data swamps can lead to unreliable and inconsistent data,
which can result in inaccurate decision-making and incorrect insights.
3. Increased costs: The lack of organization and quality control in a data swamp can result
in wasted resources, as data professionals spend excessive time on data cleanup and
rectification efforts.
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 15/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
4. Compliance and security risks: The absence of proper data governance practices in a
data swamp can expose sensitive data, violate regulatory requirements, and
compromise data security.
To avoid or address a data swamp, organizations should focus on implementing
proper data governance practices, including data quality controls, metadata
management, data cataloging, and data stewardship. Establishing clear data
ownership and responsibility, implementing data validation processes, and defining
data governance policies are crucial steps in preventing or mitigating the formation
of a data swamp.
Regular data maintenance and clean-up activities, along with the use of data
management tools and technologies, can help ensure that the data lake remains
organized, useful, and valuable to the organization.
DATA FABRIC
Data fabric refers to an architectural approach and set of technologies that enable
organizations to seamlessly integrate and manage data across distributed systems,
including on-premises and cloud environments. It provides a unified and consistent
view of data assets, regardless of their location or format, and facilitates data
access, integration, and analysis.
The key characteristics and components of a data fabric include:
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 16/21
2/29/24, 8:35 AM Modern Data Architecture Concepts
2. Accelerated data integration: Data fabric simplifies and accelerates data integration by
providing a standardized approach to connect and harmonize data from diverse
sources.
3. Enhanced data governance and compliance: Data fabric incorporates data governance
practices and security controls, ensuring data privacy, regulatory compliance, and
proper access control.
4. Flexibility and scalability: Data fabric enables organizations to scale their data
infrastructure and capabilities by leveraging distributed and cloud-based technologies.
5. Reduced data silos and fragmentation: Data fabric helps break down data silos and
enables data sharing and collaboration across the organization, promoting a unified data
culture.
In summary, a data fabric is an architectural approach and technology framework
that enables seamless integration, management, and analysis of data across
distributed systems. It provides a unified and consistent view of data assets,
enhances data agility, and promotes data governance and accessibility.
Regards,
Irfan Saherwardi
Account Technical Leader@DXC
Michal Pilarski 2w
GIS, Python, ETL, FME, Spatial Data and Testing Specialist w Boeing Company
Very helpful article. Thanks. So my proposals of naming convention for poor quality of data
architecture:
Data Lakehouse --> Data Swamphouse
Data Warehouse --> Data Warhouse
Data Mesh --> Data Mess
Data Mart --> Data Mort
Data Pipeline --> Data PeeLine
Data Fabric --> Data Macabric
Make sense?
Like · Reply
Franck Christien 4mo
https://www.linkedin.com/pulse/modern-data-architecture-concepts-irfan-azim-saherwardi/ 18/21