CCD CH 3 & 4 Notes


What is cloud storage?

Cloud storage is a cloud computing model that enables storing data and files on the internet
through a cloud computing provider that you access either through the public internet or a
dedicated private network connection. The provider securely stores, manages, and maintains the
storage servers, infrastructure, and network to ensure you have access to the data when you need
it at virtually unlimited scale, and with elastic capacity. Cloud storage removes the need to buy
and manage your own data storage infrastructure, giving you agility, scalability, and durability,
with any time, anywhere data access.
Why is cloud storage important?

• Cost effectiveness
• Increased agility
• Faster deployment
• Efficient data management
• Virtually unlimited scalability
• Business continuity

Types of Cloud Storage


Based on Storage Type (CSP Offering)

There are three main cloud storage types: object storage, file storage, and block storage. Each
offers its own advantages and has its own use cases.

Object storage

Organizations have to store a massive and growing amount of unstructured data, such as photos,
videos, machine learning (ML), sensor data, audio files, and other types of web content, and
finding scalable, efficient, and affordable ways to store them can be a challenge. Object storage
is a data storage architecture for large stores of unstructured data. Object storage keeps data in the format
it arrives in and makes it possible to customize metadata in ways that make the data easier to
access and analyze. Instead of being organized in files or folder hierarchies, objects are kept in
secure buckets that deliver virtually unlimited scalability. Object storage is also less costly for storing large data
volumes.

Applications developed in the cloud often take advantage of the vast scalability and metadata
characteristics of object storage. Object storage solutions are ideal for building modern
applications from scratch that require scale and flexibility, and can also be used to import
existing data stores for analytics, backup, or archive.
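
As a hedged illustration of the custom-metadata idea, the sketch below uses boto3 to upload an object to an assumed S3 bucket with metadata attached; the bucket name, key, and metadata fields are placeholders, not part of the original notes.

```python
import boto3

# Minimal sketch: store an object with custom metadata in S3 (names are placeholders).
s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-media-bucket",          # assumed bucket name
    Key="videos/2024/clip-001.mp4",
    Body=open("clip-001.mp4", "rb"),
    ContentType="video/mp4",
    Metadata={                              # custom metadata stored with the object
        "camera-id": "cam-17",
        "recorded-at": "2024-05-01T10:00:00Z",
    },
)

# The metadata comes back later with a head_object (or get_object) call.
meta = s3.head_object(Bucket="example-media-bucket", Key="videos/2024/clip-001.mp4")["Metadata"]
print(meta)
```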

File storage
File-based storage or file storage is widely used among applications and stores data in a
hierarchical folder and file format. This type of storage is typically served from a network-attached
storage (NAS) server over common file-level protocols such as Server Message Block (SMB) on
Windows instances and Network File System (NFS) on Linux.

Block storage

Enterprise applications like databases or enterprise resource planning (ERP) systems often
require dedicated, low-latency storage for each host. This is analogous to direct-attached storage
(DAS) or a storage area network (SAN). In this case, you can use a cloud storage service that
stores data in the form of blocks. Each block has its own unique identifier for quick storage and
retrieval.

Based on User Perspective


1. Private cloud storage

Private cloud storage is also known as enterprise or internal cloud storage. Data is stored on the
company or organization’s intranet in this case. This data is protected by the company’s own
firewall. Private cloud storage is a good option for companies that operate their own data centers and
can manage data privacy in-house. A major advantage of saving data on a private cloud is that it
offers complete control to the user. On the other hand, one of the major drawbacks of private
cloud storage is the cost and effort of maintenance and updates. The responsibility of managing
private cloud storage lies with the host company.

2. Public cloud storage

Public cloud storage requires few administrative controls and can be accessed online by the user
and anyone else who the user authorizes. With public cloud storage, the user/company doesn’t
need to maintain the system. Public cloud storage is hosted by different solution providers, so
there’s very little opportunity for customizing the security fields, as they are common for all
users. Amazon Web Services (AWS), IBM Cloud, Google Cloud, and Microsoft Azure are a few
popular public cloud storage solution providers. Public cloud storage is easily scalable,
affordable, reliable and offers seamless monitoring and zero maintenance.

3. Hybrid cloud storage

Hybrid cloud storage is a combination of private and public cloud storage. As the name suggests,
hybrid cloud storage offers the best of both worlds to the user – the security and control of a private cloud
and the scalability of a public cloud. In a hybrid cloud, data can be stored on the private
cloud, and information processing tasks can be assigned to the public cloud as well, with the help
of cloud computing services. Hybrid cloud storage is affordable and offers easy customization
and greater user control.

4. Community cloud storage

Community cloud storage is a variation of the private cloud storage model, which offers cloud
solutions for specific businesses or communities. In this model, cloud storage providers offer
their cloud architecture, software and other development tools to meet the community’s
requirements. Any data is stored on the community-owned private cloud storage to manage the
community’s security and compliance needs. Community cloud storage is a great option for
health, financial or legal companies with strict compliance policies.

What is Data Governance?


Data governance is a principled approach to managing data during its life cycle, from acquisition
to use to disposal.

Every organization needs data governance. As businesses throughout all industries proceed on
their digital-transformation journeys, data has quickly become the most valuable asset they
possess.

Senior managers need accurate and timely data to make strategic business decisions. Marketing
and sales professionals need trustworthy data to understand what customers want. Procurement
and supply-chain-management personnel need accurate data to keep inventories stocked and to
minimize manufacturing costs. Compliance officers need to prove that data is being handled
according to both internal and external mandates. And so on.

Data governance defined


Data governance is everything you do to ensure data is secure, private, accurate, available, and
usable. It includes the actions people must take, the processes they must follow, and the
technology that supports them throughout the data life cycle.

What are the benefits of data governance?


• Make better, more timely decisions
• Improve cost controls
• Enhance regulatory compliance
• Earn greater trust from customers and suppliers
• Manage risk more easily
• Allow more personnel access to more data

What is data governance used for?


Data stewardship
Data governance often means giving accountability and responsibility for both the data itself
and the processes that ensure its proper use to “data stewards.”
Data quality
Data governance is also used to ensure data quality, which refers to any activities or techniques
designed to make sure data is suitable to be used. Data quality is generally judged on six
dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness.
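
As a hedged illustration, the snippet below uses pandas to spot-check a few of these dimensions (completeness, uniqueness, validity) on a hypothetical customer table; the column names and rules are assumptions made for the example.

```python
import pandas as pd

# Hypothetical customer records used only for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "signup_date": ["2024-01-03", "2024-02-10", "2024-02-10", "2024-03-01"],
})

# Completeness: share of non-null values per column.
completeness = customers.notna().mean()

# Uniqueness: duplicated identifiers violate the uniqueness dimension.
duplicate_ids = customers["customer_id"].duplicated().sum()

# Validity: a simple format rule for email addresses.
valid_email = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

print(completeness, duplicate_ids, valid_email.mean(), sep="\n")
```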
Data management
This is a broad concept encompassing all aspects of managing data as an enterprise asset, from
collection and storage to usage and oversight, making sure it’s being leveraged securely,
efficiently, and cost-effectively before it’s disposed of.

The key-value database defined

A key-value database is a type of nonrelational database that uses a simple key-value method to
store data. A key-value database stores data as a collection of key-value pairs in which a key
serves as a unique identifier. Both keys and values can be anything, ranging from simple objects
to complex compound objects. Key-value databases are highly partitionable and allow horizontal
scaling at scales that other types of databases cannot achieve. For example, Amazon
DynamoDB allocates additional partitions to a table if an existing partition fills to capacity and
more storage space is required.
In DynamoDB, for example, each item is identified by a unique key and stored together with a flexible set of attribute values (the key-value pairs).
Use cases

Session store

A session-oriented application such as a web application starts a session when a user logs in and
is active until the user logs out or the session times out. During this period, the application stores
all session-related data either in the main memory or in a database. Session data may include
user profile information, messages, personalized data and themes, recommendations, targeted
promotions, and discounts. Each user session has a unique identifier. Session data is never
queried by anything other than a primary key, so a fast key-value store is a better fit for session
data. In general, key-value databases may provide smaller per-page overhead than relational
databases.

Shopping cart

During the holiday shopping season, an e-commerce website may receive billions of orders in
seconds. Key-value databases can handle the scaling of large amounts of data and extremely high
volumes of state changes while servicing millions of simultaneous users through distributed
processing and storage. Key-value databases also have built-in redundancy, which can handle the
loss of storage nodes.

Popular key-value databases

Amazon DynamoDB

Amazon DynamoDB is a nonrelational database that delivers reliable performance at any scale.
It's a fully managed, multi-region, multi-master database that provides consistent single-digit
millisecond latency, and offers built-in security, backup and restore, and in-memory caching. In
DynamoDB, an item is composed of a primary key (simple or composite) and a flexible number of
attributes. There is no explicit limit on the number of attributes associated with an
individual item, but the aggregate size of an item, including all the attribute names and attribute
values, cannot exceed 400 KB. A table is a collection of data items, just as a table in a relational
database is a collection of rows, and each table can hold a virtually unlimited number of items.
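
A hedged boto3 sketch of these concepts follows; the table name, key schema, and attributes are assumptions for illustration (the table is assumed to already exist with "session_id" as its partition key).

```python
import boto3

# Minimal sketch: write and read one item in an assumed DynamoDB table.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-sessions")          # assumed, pre-created table

# The item is just a key plus a flexible set of attributes.
table.put_item(Item={
    "session_id": "sess-42",                     # primary key (unique identifier)
    "user_id": "u-1001",
    "theme": "dark",
    "recommendations": ["movie-7", "movie-12"],
})

# Retrieval is by primary key only, which is what makes key-value stores fast.
item = table.get_item(Key={"session_id": "sess-42"})["Item"]
print(item)
```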

What is the difference between batch data and streaming data?

Batch processing is the method computers use to periodically complete high-volume, repetitive
data jobs. You can use it to compute arbitrary queries over different sets of data. It usually
derives computational results from all the data it encompasses and allows for deep analysis of
big data sets. MapReduce-based systems, like Amazon EMR, are examples of platforms that
support batch jobs.

In contrast, stream processing requires ingesting a data sequence and incrementally updating
metrics, reports, and summary statistics in response to each arriving data record. It is better
suited for real-time analytics and response functions.
|              | Batch processing | Stream processing |
|--------------|------------------|-------------------|
| Data scope   | Queries or processing over all or most of the data in the dataset. | Queries or processing over data within a rolling time window, or on just the most recent data record. |
| Data size    | Large batches of data. | Individual records or micro batches consisting of a few records. |
| Performance  | Latencies in minutes to hours. | Requires latency in the order of seconds or milliseconds. |
| Analysis     | Complex analytics. | Simple response functions, aggregates, and rolling metrics. |

Many organizations are building a hybrid model by combining the two approaches to maintain a
real-time layer and a batch layer. For example, you can first process data in a streaming data
platform such as Amazon Kinesis to extract real-time insights. Then, you can persist it into a
store like Amazon Simple Storage Service (Amazon S3). There, it can be transformed and loaded
for various batch processing use cases.
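
A hedged sketch of that hybrid pattern is below: a producer pushes records into an assumed Kinesis stream for the real-time layer, while a separate batch job later reads the data back from an assumed S3 bucket once it has been persisted there (for example via Kinesis Data Firehose). Stream, bucket, and key names are placeholders.

```python
import json
import boto3

# Real-time layer: push an event into an assumed Kinesis data stream.
kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream-events",            # assumed stream name
    Data=json.dumps({"user": "u-1001", "action": "play", "title_id": 7}).encode(),
    PartitionKey="u-1001",
)

# Batch layer: once the stream has been persisted to S3, a periodic job
# can read the archived files back for deeper analysis.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="clickstream-archive", Key="2024/05/01/part-0000.json")  # assumed
events = [json.loads(line) for line in obj["Body"].read().splitlines()]
print(len(events), "events in this batch file")
```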

Amazon Redshift Streaming Ingestion allows users to ingest data directly from Amazon Kinesis
Data Streams without having to stage it in Amazon S3. The service can also ingest data from
Amazon Managed Streaming for Apache Kafka (Amazon MSK) into Amazon Redshift.

What is Batch Processing?

Herman Hollerith, the American inventor who built the first tabulating machine, used
the Batch Processing method for the first time in the 19th century. This device, which was
capable of counting and sorting data organized on punched cards, became the forerunner of the
modern computer. The cards, as well as the information on them, could then be collected and
processed in batches.

Large amounts of data could be processed more quickly and accurately with this innovation than
with manual entry methods. Batch Processing is a technique for consistently processing large
amounts of data. The batch method allows users to process data with little or no user interaction
when computing resources are available.
Users collect and store data for Batch Processing, which is then processed during a “batch
window.” Batch Processing boosts productivity by prioritizing processing and completing data
jobs when it’s most convenient.

Batch Processing has become popular due to its numerous benefits for enterprise data
management. It has several advantages for businesses:

• Efficiency: When computing or other resources are readily available, Batch Processing
allows a company to process jobs. Companies can schedule batch processes for jobs that
aren’t as urgent and prioritize time-sensitive jobs. Batch systems can also run in the
background to reduce processor stress.
• Simplicity: Batch Processing, in comparison to Stream Processing, is a less complex
system that does not require special hardware or system support. For data input, it
requires less maintenance.
• Improved Data Quality: Batch Processing reduces the chances of errors by automating
most or all components of a processing job and minimizing user interaction. As a result,
precision and accuracy improve, yielding a higher level of data quality.
• Faster Business Intelligence: Batch Processing allows companies to process large
volumes of data quickly, resulting in faster Business Intelligence. Batch Processing
reduces processing time and ensures that data is delivered on time because many records
can be processed at once. And, because multiple jobs can be handled at the same time,
business intelligence is available faster than ever before.

What is Stream Processing?

Stream Processing is the act of taking action on a set of data as it is being generated.
Historically, data professionals used the term “real-time processing” to refer to data that was
processed as frequently as was required for a specific use case. However, with the introduction
and adoption of Stream Processing technologies and frameworks, as well as lower RAM prices,
“Stream Processing” has become a more specific term.
Stream Processing is concerned with real-time (or near-real-time) data streams that must be
processed with minimal latency to generate real-time (or near-real-time) reports or automated
responses. Sensor data, for example, could be used by a real-time traffic monitoring solution to
detect high traffic volumes. This information could be used to automatically initiate high-
occupancy lanes or other traffic management systems or to dynamically update a map to show
congestion.

Some benefits of Stream Processing are:

• The amount of time it takes to process data is minimal.
• The information is current and can be used right away.
• You would need fewer resources to sync systems with Stream Processing.
• Stream Processing also allows you to improve your uptime.
• It aids in the identification of problems so that immediate action can be taken.

What is a Cloud Data Warehouse?

A data warehouse is a repository of the current and historical information that has been collected.
The data warehouse is an information system that forms the core of an organization’s business
intelligence infrastructure. It is a Relational Database Management System (RDBMS) that allows
for SQL-like queries to be run on the information it contains.

Unlike a database, a data warehouse is optimized to run analytical queries on large data sets. A
database is more often used as a transaction processing system.

A Cloud Data Warehouse is a database that is delivered as a managed service in the public
cloud and is optimized for analytics, scale, and usability. Cloud-based data warehouses allow
businesses to focus on running their businesses rather than managing a server room, and they
enable business intelligence teams to deliver faster and better insights due to improved access,
scalability, and performance.

Key features of Cloud Data Warehouse

Some of the key features of a Data Warehouse in the Cloud are as follows:

• Massive Parallel Processing (MPP): MPP architectures are used in cloud-based data
warehouses that support big data projects to provide high-performance queries on large
data volumes. MPP architectures are made up of multiple servers that run in parallel to
distribute processing and input/output (I/O) loads.
• Columnar data stores: MPP data warehouses are typically columnar stores, which are
the most adaptable and cost-effective for analytics. Columnar databases store and process
data in columns rather than rows, allowing aggregate queries, which are commonly used
for reporting, to run much faster.
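
As a rough illustration (not tied to any specific warehouse), the sketch below contrasts a row-oriented layout with a column-oriented one: an aggregate over a single field only needs to touch one column when data is stored by column.

```python
# Row-oriented layout: every record carries all fields, so an aggregate
# over one field still walks through whole rows.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 45.5},
]
total_rows = sum(r["amount"] for r in rows)

# Column-oriented layout: each field is stored contiguously, so the same
# aggregate reads only the "amount" column (and it compresses better, too).
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 45.5],
}
total_columns = sum(columns["amount"])

assert total_rows == total_columns
```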

What are the capabilities of the Cloud Data Warehouse?

For all the Cloud based Data Warehouse services, the cloud vendor or data warehouse provider
provides the following “out-of-the-box” capabilities.

• Data storage and management: data is stored in a file system hosted in the cloud (e.g., Amazon S3).
• Automatic Upgrades: There is no such thing as a “version” or a software upgrade.
• Capacity management: You can easily expand (or contract) your data footprint.

What are the Benefits of a Cloud Data Warehouse?

Amazon Redshift
Best price-performance for cloud data warehousing
How it works

Amazon Redshift uses SQL to analyze structured and semi-structured data across data
warehouses, operational databases, and data lakes, using AWS-designed hardware and machine
learning to deliver the best price performance at any scale.
What is BigQuery in Google Cloud Platform?

BigQuery is a petabyte-scale, low-cost, and serverless data warehouse in Google Cloud
Platform. It is a fully managed service, so we don’t have to worry about underlying computing,
network, or storage resources. Therefore, we can use SQL to answer our biggest questions
without any maintenance.

BigQuery has three primary built-in features:

1. Machine Learning using BigQuery ML
2. Business Intelligence driven by the BI Engine
3. Geospatial Analysis

Apart from this, BigQuery provides us with the flexibility of storage and computing. We can
analyze and store the data in BigQuery, or use it to analyze data from external sources, such as
Google Cloud Storage, Google Drive, etc.

BigQuery also provides automatic backups with a seven-day history of changes, and fine-grained
governance and security. It ensures data encryption at rest and in transit.

Data Ingestion

We can ingest data into BigQuery in two ways.

Real-time data

Real-time data is generated continuously, such as the data from sensors. It can be ingested in
BigQuery using Pub/Sub. We can also use Dataflow to receive events from Pub/Sub, perform
operations on them, and ingest them into a BigQuery table.
Batch data

Batch data is the data at rest, such as CSV files residing in Google Cloud storage. Loading such
data is also known as “bulk load,” as we load multiple files in one shot. The straightforward way
to bulk load data is to access files residing in cloud storage, process them with Dataflow, and
push them to BigQuery.
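
A hedged sketch of a bulk load with the google-cloud-bigquery client library is below; the project, dataset, table, and bucket URI are placeholders, and the streaming path is shown only as a one-line insert for comparison.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.watch_events"   # assumed project.dataset.table

# Batch ("bulk load"): load CSV files that already sit in Cloud Storage.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                              # infer the schema from the files
)
load_job = client.load_table_from_uri(
    "gs://example-bucket/events/2024-05-01/*.csv",  # assumed bucket/path
    table_id,
    job_config=job_config,
)
load_job.result()                                 # wait for the load to finish

# Real-time: individual records can instead be streamed in as they arrive
# (in practice often via Pub/Sub and Dataflow rather than directly).
errors = client.insert_rows_json(table_id, [{"user_id": "u-1001", "title_id": 7}])
print(errors or "streamed 1 row")
```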

Using BigQuery data

The data from BigQuery can be used in multiple ways:

• We can create dashboards using external tools such as Tableau, or Google tools such as Data Studio.
• Other GCP tools, such as Dataflow, can use the data from BigQuery.
• We can share the data tables with other teams or members for querying.
• We can export the data to a local disk, Google Sheets, Cloud Storage, etc.
• We can use it in Datalab, which has an interface similar to Jupyter notebooks, where we can run Python commands and analyze the data.
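
For instance, here is a hedged sketch of querying the table loaded earlier into a pandas DataFrame (the table name is still an assumption):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Run a standard SQL query and pull the result into pandas for analysis.
sql = """
    SELECT title_id, COUNT(*) AS views
    FROM `my-project.analytics.watch_events`    -- assumed table
    GROUP BY title_id
    ORDER BY views DESC
    LIMIT 10
"""
df = client.query(sql).to_dataframe()
print(df.head())
```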
What is a Data Pipeline?

A data pipeline architecture refers to the design of tools and processes that help transport data
between locations for easy access and application to various use cases. These use cases may be
for business intelligence, machine learning purposes, or for producing application visualizations
and dashboards. Data pipeline architecture aims to make the data pipeline process seamless and
efficient by designing an architecture that helps improve the functionality and flow of data
between various sources and along the pipeline.

A data pipeline is a process that moves data from one system or format to another. The data
pipeline typically includes a series of steps. This is for extracting data from a source,
transforming and cleaning it, and loading it into a destination system, such as a database or a data
warehouse. Data pipelines can be used for a variety of purposes, including data integration, data
warehousing, automating data migration, and analytics.
Batch Architecture

A batch pipeline is designed typically for high volume data workloads where data is batched
together in a specific time frame. This time frame can be hourly, daily, or monthly depending on
the use case.

Streaming Architecture

A streaming pipeline is designed for data that gets generated in real time or near real time. This
data is crucial in making instantaneous decisions and can be used for different IoT devices, fraud
detection, and log analysis.

The Importance of Data Pipeline Architecture


• Saves time with reusability: A single pipeline architecture can be replicated and applied
for similar business data needs, leaving more time for engineers to focus on other tasks.
• Data consolidation: Data from various sources are combined and consolidated in a single
destination through pipeline architecture.
• Improved functionality: Data pipeline architecture helps establish a seamless workflow of
data, which prevents data silos and enables teams to have access to data they need,
which helps improve day-to-day business functions.
• Allows easy data sharing between groups: Most data processes undergo similar
transformation processes before usage. For example, data cleaning is an essential part
of transformation and must occur before use in most cases. Establishing a clear pipeline
workflow automates this process and enables easy data sharing between different
teams.
• Accelerated data lifecycle processes: Data pipeline architecture involves automation in an
organized manner with minimal human effort. Hence, data processes occur faster with a
reduced risk of human-prone errors.
• Standardization of workflows: Data architecture helps define each pipeline activity,
making monitoring and management more effortless. Also, because each step follows a
well-defined process, it helps monitor and identify substandard steps in the pipeline.

Challenges of Data Pipeline Design

Designing data pipelines can be challenging because data processes involve numerous stop
points. As data travels between locations and touchpoints, it opens the system to various
vulnerabilities and increases the risk of errors. Here are some challenges facing efficient pipeline
design:

1. Increasing and varied data sources: The primary aim of data pipelines is to collect data
from various data sources and make it available via a single access point. However, as
organizations grow, their data sources tend to increase, hence the need for a seamless
design to integrate new data sources while maintaining scalability and fast business
operations. Integration of new data may be complex due to the following reasons:

o The latest data source may differ from the existing data sources.
o Introducing a new data source may result in an unforeseen effect on the data
handling capacity of the existing pipeline.
2. Scalability: It’s a common challenge for pipeline nodes to break when data sources keep
increasing, and data volume increases, resulting in data loss. Data engineers usually find
it challenging to build a scalable architecture while keeping costs low.
3. The complexity resulting from many system components: Data pipelines consist of
processors and connectors, which help in the transport and easy accessibility of data
between locations. However, as the number of processors and connectors increases
from various data integrations, it introduces design complexity and reduces the ease of
implementation. This complexity, in turn, makes pipeline management difficult.
4. The choice between data robustness and pipeline complexity: A fully robust data
pipeline architecture integrates fault detection components at several critical points along
the pipeline and mitigation strategies to combat such faults. However, adding these two
to pipeline design adds complexity to the design, which results in complex management.
Pipeline engineers may become tempted to design against every likely vulnerability, but
this adds complexity quickly and doesn’t guarantee protection from all vulnerabilities.
5. Dependency on other vital factors: Data integration may depend on action from another
organization. For example, Company A cannot integrate data from Source B without
input from Company B. This dependency may cause deadlocks without proper
communication strategies.
6. Missing data: This problem significantly reduces data quality. Along the data pipeline,
files can get lost, which may cause a significant dent in data quality. Adding monitoring
to pipeline architecture to help detect potential risk points can help mitigate this risk.
Characteristics of a Modern Data Pipeline
Only robust end-to-end data pipelines will properly equip organizations to source, collect,
manage, analyze, and effectively use crucial data to generate new market opportunities and
deliver cost-saving business processes. Traditional data pipelines are rigid, difficult to change,
and they do not support the constantly evolving data needs of today’s organizations. Modern data
pipelines, on the other hand, make it faster and easier to extract information from the data you
collect.

Here are some characteristics to look for when considering a data pipeline:

• Continuous and extensible data processing
• The elasticity and agility of the cloud
• Isolated and independent resources for data processing
• Democratized data access and self-service management
• High availability and disaster recovery
Data Collection vs Data Ingestion

Data Collection: Data collection is the process of gathering raw data from various sources and compiling it into a central location for analysis. It is typically the first step in the data analysis process.

Key Differences:

1. Data collection involves gathering raw data from various sources, while data ingestion involves processing and preparing data for analysis.
2. Data collection is typically a one-time process, while data ingestion can be an ongoing process.
3. Data collection can involve manual entry of data, while data ingestion is typically an automated process.
4. Data collection can be a time-consuming and resource-intensive process, while data ingestion can be faster and more efficient.
5. Data collection is often done in a decentralized manner, while data ingestion is typically centralized.

Data Ingestion: Data ingestion is the process of taking data from various sources and preparing it for analysis. This can involve transforming the data, cleaning it, and structuring it so that it can be easily analyzed.

Key Differences:

1. Data ingestion involves processing and preparing data for analysis, while data collection involves gathering raw data.
2. Data ingestion is typically an ongoing process, while data collection is typically a one-time process.
3. Data ingestion is typically automated, while data collection can involve manual entry of data.
4. Data ingestion is typically faster and more efficient than data collection.
5. Data ingestion is typically centralized, while data collection can be done in a decentralized manner.
Data Collection Tools:

1. Surveys: Surveys are a common tool for collecting data from individuals or groups of people. Surveys can be conducted in person, over the phone, or online, and can be designed to collect quantitative or qualitative data.

2. Interviews: Interviews are a more in-depth way of collecting data from individuals. They can be conducted in person, over the phone, or online, and can be structured or unstructured.

3. Observations: Observations involve watching and recording the behavior of individuals or groups in real time. Observations can be done in person or through video recordings.

4. Sensors: Sensors can be used to collect data automatically, such as temperature, humidity, or motion data. Sensors can be installed in various locations and can collect data over long periods of time.

5. Web scraping: Web scraping involves automatically extracting data from websites. It can be used to collect data on prices, reviews, or other information that may be relevant to a particular research question.

Data Ingestion Tools:

1. ETL Tools: ETL (Extract, Transform, Load) tools are used to extract data from various sources, transform the data into a structured format, and load it into a data warehouse or other data storage system. Examples of ETL tools include Talend, Informatica, and Apache NiFi.

2. API Integrations: API (Application Programming Interface) integrations allow data to be collected from various sources automatically. APIs can be used to extract data from social media platforms, marketing automation tools, or other third-party applications.

3. Log File Analysis Tools: Log file analysis tools can be used to ingest and analyze log files from web servers, applications, or other systems. These tools can help identify errors or performance issues and provide insights into user behavior.

4. Data Preparation Tools: Data preparation tools can be used to clean, transform, and prepare data for analysis. These tools can be used to remove duplicates, fill missing values, or convert data to a standardized format. Examples of data preparation tools include Trifacta, OpenRefine, and Google Data Prep.

5. Data Integration Platforms: Data integration platforms allow data to be collected from multiple sources and integrated into a single data store. These platforms can be used to create a unified view of data from various sources, such as CRM systems, marketing automation tools, or social media platforms. Examples of data integration platforms include MuleSoft, Dell Boomi, and Informatica Cloud.

What is Data Transformation?

Data transformation is the process of converting, cleansing, and structuring data into a usable
format that can be analyzed to support decision making processes, and to propel the growth of an
organization.

Data transformation is used when data needs to be converted to match that of the destination
system. This can occur at two places in the data pipeline. First, organizations with on-site data
storage typically use an extract, transform, load (ETL) process, with the data transformation taking place during the
middle ‘transform’ step.

Organizations today mostly use cloud-based data warehouses because they can scale their
computing and storage resources in seconds. Cloud based organizations, with this huge
scalability available, can skip the ETL process. Instead, they load the raw data first and transform
it after it is uploaded, a process called extract, load, and transform (ELT). The
process of data transformation can be handled manually, automated or a combination of both.
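
To make the ETL flavor concrete, here is a minimal, hedged sketch: extract a CSV, transform it with pandas, and load it into a destination table. The file name, column names, and the SQLite destination are placeholders standing in for a real warehouse connection.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw data from a source file (placeholder path).
raw = pd.read_csv("orders_raw.csv")

# Transform: clean and reshape before loading (the "T" in ETL).
clean = (
    raw.drop_duplicates(subset="order_id")
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
       .query("amount > 0")
)

# Load: write the transformed data into the destination system.
# SQLite stands in here for a real data warehouse connection string.
engine = create_engine("sqlite:///warehouse.db")
clean.to_sql("orders", engine, if_exists="replace", index=False)
```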

Transformation is an essential step in many processes, such as data integration, migration,
warehousing and wrangling. The process of data transformation can be:

• Constructive, where data is added, copied or replicated
• Destructive, where records and fields are deleted
• Aesthetic, where certain values are standardized, or
• Structural, which includes columns being renamed, moved and combined

The data transformation process is carried out in five stages.

1. Discovery

The first step is to identify and understand the data in its original source format with the help of data
profiling tools, finding all the sources and data types that need to be transformed. This step helps
in understanding how the data needs to be transformed to fit into the desired format.

2. Mapping

The transformation is planned during the data mapping phase. This includes determining the
current structure and the transformation that is required, then mapping the data to
understand, at a basic level, how individual fields will be modified, joined, or aggregated.

3. Code Generation

The code, which is required to run the transformation process, is created in this step using a data
transformation platform or tool.

4. Execution

The data is finally converted into the selected format with the help of the code. The data is
extracted from the source(s), which can vary from structured to streaming, telemetry to log files.
Next, transformations are carried out on data, such as aggregation, format conversion or
merging, as planned in the mapping stage. The transformed data is then sent to the destination
system which could be a dataset or a data warehouse.

Some of the transformation types, depending on the data involved, include the following (a brief sketch follows the list):

• Filtering, which selects only the columns or rows that require transformation
• Enriching, which fills out basic gaps in the data set
• Splitting, where a single column is split into multiple columns, or vice versa
• Removal of duplicate data, and
• Joining data from different sources
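
A hedged pandas sketch of these execution-stage transformations, using made-up column names:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "title": ["Movie A|2021", "Movie A|2021", "Movie B|2019"],
    "watch_minutes": [95, 95, 60],
})
users = pd.DataFrame({"user_id": ["u1", "u2"], "country": ["DE", "US"]})

deduped = events.drop_duplicates()                          # removal of duplicates
deduped[["title_name", "year"]] = deduped["title"].str.split("|", expand=True)  # splitting
enriched = deduped.merge(users, on="user_id", how="left")   # joining / enriching
long_watches = enriched[enriched["watch_minutes"] > 90]     # filtering
print(long_watches)
```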

5. Review

The transformed data is evaluated to ensure the conversion has had the desired results in terms of
the format of the data.

It must also be noted that not all data will need transformation; at times it can be used as is.

Data Transformation Techniques

There are several data transformation techniques that are used to clean data and structure it
before it is stored in a data warehouse or analyzed for business intelligence. Not all of these
techniques work with all types of data, and sometimes more than one technique may be applied.
Nine of the most common techniques are:

1. Revising

Revising ensures the data supports its intended use by organizing it in the required and correct
way. It does this in a range of ways.

• Dataset normalization revises data by eliminating redundancies in the data set. The data
model becomes more precise and legible while also occupying less space. This process,
however, does involve a lot of critical thinking, investigation and reverse engineering.
• Data cleansing removes errors and inconsistencies so that the data can be formatted and used reliably.
• Format conversion changes the data types to ensure compatibility.
• Key structuring converts values with built-in meanings to generic identifiers to be used as
unique keys.
• Deduplication identifies and removes duplicates.
• Data validation validates records and removes the ones that are incomplete.
• Repeated and unused columns can be removed to improve overall performance and
legibility of the data set.

2. Manipulation

This involves creation of new values from existing ones or changing current data through
computation. Manipulation is also used to convert unstructured data into structured data that can
be used by machine learning algorithms.

• Derivation, which involves cross-column calculations
• Summarization, which aggregates values
• Pivoting, which involves converting column values into rows and vice versa
• Sorting, ordering and indexing of data to enhance search performance
• Scaling, normalization and standardization, which help in comparing dissimilar numbers by putting them on a consistent scale
• Vectorization, which helps convert non-numerical data into number arrays that are often used for machine learning applications
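
A small, hedged sketch of a few of these manipulations (derivation, summarization, pivoting, and min-max scaling) with pandas, using invented columns:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "units": [10, 12, 8, 15],
    "unit_price": [9.0, 9.0, 11.0, 11.0],
})

sales["revenue"] = sales["units"] * sales["unit_price"]                    # derivation
summary = sales.groupby("region")["revenue"].sum()                        # summarization
pivoted = sales.pivot(index="region", columns="month", values="revenue")  # pivoting

# Min-max scaling puts dissimilar numbers on a common 0-1 scale.
scaled = (sales["revenue"] - sales["revenue"].min()) / (
    sales["revenue"].max() - sales["revenue"].min()
)
print(summary, pivoted, scaled, sep="\n\n")
```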

3. Separating

This involves dividing data values into their component parts for granular analysis. Splitting involves
dividing a single column with several values into separate columns, one for each of those values.
This allows for filtering on the basis of certain values.

4. Combining/ Integrating

Records from across tables and sources are combined to acquire a more holistic view of activities
and functions of an organization. It couples data from multiple tables and datasets and combines
records from multiple tables.

5. Data Smoothing

This process removes meaningless, noisy, or distorted data from the data set. With outliers
removed, trends are more easily identified.

6. Data Aggregation

This technique gathers raw data from multiple sources and turns it into a summary form that
can be used for analysis. For example, raw data can be summarized into statistics such as averages and
sums.

7. Discretization

With the help of this technique, interval labels are created for continuous data to make it more
efficient and easier to analyze. Decision tree algorithms can be used in this
process to transform large datasets into categorical data.

8. Generalization

Low level data attributes are transformed into high level attributes by using the concept of
hierarchies and creating layers of successive summary data. This helps in creating clear data
snapshots.

9. Attribute Construction

In this technique, a new set of attributes is created from an existing set to facilitate the mining
process.
Benefits of Data Transformation
Data Utilization

If the data being collected isn’t in an appropriate format, it often ends up not being utilized at all.
With the help of data transformation tools, organizations can finally realize the true potential of
the data they have amassed since the transformation process standardizes the data and improves
its usability and accessibility.

Data Consistency

Data is continuously being collected from a range of sources, which increases inconsistencies
in metadata. This makes organizing and understanding data a huge challenge. Data
transformation helps by making data sets simpler to understand and organize.

Better Quality Data

Transformation process also enhances the quality of data which can then be utilized to acquire
business intelligence.

Compatibility Across Platforms

Data transformation also supports compatibility between types of data, applications and systems.

Faster Data Access

It is quicker and easier to retrieve data that has been transformed into a standardized format.

Challenges of Data Transformation


• High cost of implementation
• Resource intensive
• Errors and inconsistency


Data sources

The first component of the modern data pipeline is where the data originates. Any system that
generates data your business uses could be your data source, including:

• Analytics data (user behavior data)
• Transactional data (data from sales and product records)
• 3rd party data (data your company doesn’t collect directly but uses)

Data collection/ingestion

The next component in the data pipeline is the ingestion layer responsible for bringing data into
the data pipeline. This layer leverages data ingestion tools such as Striim to connect to various
data sources (internal and external) over a variety of protocols. For example, this layer can ingest
batch (data at rest) and streaming (data in motion) data and deliver it to big data storage targets.

Data processing

The processing layer is in charge of transforming data into a consumable state through data
validation, clean-up, normalization, transformation, and enrichment. Depending on the
company’s specific architecture, ETL (Extract Transform Load) vs. ELT (Extract Load
Transform), the data pipeline can do this processing component before or after data is stored in
the data store.
In an ETL-based processing architecture, the data is extracted, transformed, then loaded into the
data stores; this is mainly used when the data storage is a data warehouse. In ELT-based
architectures, data is first loaded into data lakes and then transformed to a consumable state for
various business use cases.

Data storage

This component is responsible for providing durable, scalable, and secure storage for the data
pipeline. It usually consists of large data stores like data warehouses (for structured data) and
data lakes (for structured, semi-structured, or unstructured data).

Data consumption

The consumption layer delivers and integrates scalable and performant tools for consuming from
the data stores. In addition, the data consumption layer provides analytics across the business for
all users through purpose-built analytics tools that support analysis methodologies such as SQL,
batch analytics, reporting dashboards, and machine learning.

Data governance

The security and governance layer safeguards the data in the storage layer and all other layers’
processing resources. This layer includes access control, encryption, network security, usage
monitoring, and auditing mechanisms. The security layer also keeps track of all other layers’
operations and creates a complete audit trail. In addition, the other data pipeline components are
natively integrated with the security and governance layer.

Step 1: Determine the goal

When designing a data pipeline, the priority is to identify the outcome or value the data pipeline
will bring to your company or product. At this stage, we ask relevant questions such as:

• What are our objectives for this data pipeline?
• How do we measure the success of the data pipeline?
• What use cases will the data pipeline serve (reporting, analytics, machine learning)?

Strimmer: For our Strimmer application, the data pipeline will provide data for the ML
recommendation engine, which will help Strimmer determine the best movies and series to
recommend to users.

Step 2: Choose the data sources

We then consider the possible data sources that’ll enter the data pipeline. At this stage, it’s
critical to ask questions such as:
• What are all the potential sources of data?
• In what format will the data come in (flat files, JSON, XML)?
• How will we connect to the data sources?

Strimmer: For our Strimmer data pipeline, our data sources would include:

• User historical data, such as previously watched movies and search behaviors stored in
operational databases like SQL, NoSQL
• User behavior data/analytics, such as when a user clicks a movie detail
• 3rd party data from social media applications and movie rating sites like IMDB

Step 3: Determine the data ingestion strategy

With the pipeline goal and data sources understood, we need to ask how the
pipeline will collect the data. At this point, we ask questions such as:

• What communication layer will we be using to collect data ( HTTP, MQTT, gRPC)?
• Would we be utilizing third-party integration tools to ingest the data?
• Are we going to be using intermediate data stores to store data as it flows to the
destination?
• Are we collecting data from the origin in predefined batches or in real time?

Strimmer: For our Strimmer data pipeline, we’ll be using Striim, a unified real-time data
integration and streaming tool, to ingest both batch and real-time data from the various data
sources.

Step 4: Design the data processing plan

Once data has been ingested, it has to be processed and transformed for it to be valuable to
downstream systems. At this stage, it’s necessary to ask questions such as:

• What data processing strategies are we utilizing on the data (ETL, ELT, cleaning,
formatting)?
• Are we going to be enriching the data with specific attributes?
• Are we using all the data or just a subset?
• How do we remove redundant data?

Strimmer: To build the data pipeline for our Strimmer service, we’ll use Striim’s streaming
ETL data processing capabilities, allowing us to clean and format the data before it’s stored in
the data store. Striim provides an intuitive interface to write streaming SQL queries to correct
deficiencies in data quality, remove redundant data, and build a consistent data schema to enable
consumption by the analytics service.

Step 5: Set up storage for the output of the pipeline


Once the data has been processed, we must determine the final storage destination for our data to
serve various business use cases. At this step, we ask questions such as:

• Are we going to be using big data stores like data warehouses or data lakes?
• Would the data be stored in the cloud or on-premises?
• Which of the data stores will serve our top use cases?
• In what format will the final data be stored?

Strimmer: Because we’ll be handling structured data sources in our Strimmer data pipeline, we
could opt for a cloud-based data warehouse like Snowflake as our big data store.

Step 6: Plan the data workflow

We then need to design the sequencing of processes in the data pipeline. At this stage, we ask
questions such as:

• What downstream jobs are dependent on the completion of an upstream job?
• Are there jobs that can run in parallel?
• How do we handle failed jobs?

Strimmer: In our Strimmer pipeline, we can utilize a third-party workflow scheduler
like Apache Airflow to help schedule and simplify the complex workflows between the different
processes in our data pipeline via Striim’s REST API. For example, we can define a workflow
that independently reads data from our sources, joins the data using a specific key, and writes the
transformation output to our data warehouse.
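
A hedged sketch of such a workflow as an Airflow DAG is shown below; the task functions are hypothetical placeholders rather than real Striim or warehouse integrations.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task functions; in a real pipeline these would call Striim's
# REST API, run the join, and load the result into the warehouse.
def read_sources():
    print("read user history, behavior events, and 3rd party data")

def join_on_key():
    print("join the extracted datasets on user_id")

def write_to_warehouse():
    print("write the joined output to the data warehouse")

with DAG(
    dag_id="strimmer_recommendation_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # run the batch portion once a day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="read_sources", python_callable=read_sources)
    join = PythonOperator(task_id="join_on_key", python_callable=join_on_key)
    load = PythonOperator(task_id="write_to_warehouse", python_callable=write_to_warehouse)

    extract >> join >> load   # downstream jobs wait for upstream jobs to finish
```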

Step 7: Implement a data monitoring and governance framework

In this step, we establish a data monitoring and governance framework, which helps us observe
the data pipeline to ensure a healthy and efficient channel that’s reliable, secure, and performs as
required. In this step, we determine:

• What needs to be monitored?
• How do we ensure data security?
• How do we mitigate data attacks?
• Is the data intake meeting the estimated thresholds?
• Who is in charge of data monitoring?

Strimmer: We need to ensure proper security and monitoring in our Strimmer data pipeline. We
can do this by utilizing fine-grained permission-based access control from the cloud providers
we use, encrypting data in the data warehouse using customer-managed encryption keys, storing
detailed logs, and monitoring metrics for thresholds using tools like Datadog.

Step 8: Plan the data consumption layer


This final step determines the various services that’ll consume the processed data from our data
pipeline. At the data consumption layer, we ask questions such as:

• What’s the best way to harness and utilize our data?
• Do we have all the data we need for our intended use case?
• How do our consumption tools connect to our data stores?

Strimmer: The consumption layer in our Strimmer data pipeline can consist of an analytics
service like Databricks that feeds from data in the warehouse to build, train, and deploy ML
models using TensorFlow. The algorithm from this service then powers the recommendation
engine to improve movie and series recommendations for all users.

Data has also evolved; we process not only structured data but also unstructured and semi-
structured data. With the introduction of cloud data lakes and warehouses (aka data
lakehouse) that can act as both data lake and data warehouse, the need for improvements in ETL
processing has significantly increased.

Evolution of ELT
In a typical ETL process:

• Data is pulled from the multiple data sources to staging and then into the warehouse.
• Transformations are completed before the data is loaded into the warehouse.
• Most of the time, the data extraction from source systems happens during an off-peak time of the day and in batches.
• Only the relevant fields or tables that are needed for analytics are fetched to staging and processed.
• Changes in the source structure mandate changes in the coding as ETL expects the columns and data to be in the same sequence and order.
• Primarily, a single tool was used to perform “E,” “T,” and “L” operations in the ETL; it was like one size fits all.


In a typical ELT process:

• Data is pushed from multiple data sources to the data lake using data integration tools.
• Data is then transformed into a data warehouse/mart using data transformation tools.
• Data load doesn’t happen in batches but is pushed from the source systems, so real-time analytics is possible.
• Transformation happens within the cloud lakehouse, which separates storage and compute resources, allowing processing of data exponentially faster than reading and loading the data using an ETL process.
• The data lake (aka RAW DB) within the cloud lakehouse solution acts as the source for the warehouse. Almost all data from the source systems is pushed to the data lake.
• Any changes or additions to the fields in the warehouse are also easy, as the data exists in the data lake.
• Many lakehouses support both semi-structured and unstructured data, so the data lakehouse is indeed a single source of truth.


Factors that led to the switch to ELT

• ELT is a compute-intensive process, but it happens in a highly robust, powerful, and scalable cloud lakehouse.
• Most cloud lakehouses, like Azure Synapse, Snowflake, and Google BigQuery, are columnar databases, so index and record search operations are much faster.
• Almost all the cloud lakehouses do massively parallel processing, so the queried transformations are carried out in parallel and not successively, with multiple nodes running multiple transformations simultaneously.

What is Data Sharing?

Data sharing is the process of making the same data resources available to multiple applications,
users, or organizations. It includes technologies, practices, legal frameworks, and cultural
elements that facilitate secure data access for multiple entities without compromising data
integrity. Data sharing improves efficiency within an organization and fosters collaboration with
vendors and partners. Awareness of the risks and opportunities of shared data is integral to the
process.

Data delivery

Data delivery is the process of transferring campaign data out of the Oracle Data Cloud platform
and into your cookie or profile store or to a partner. After your campaign data has been
delivered, you can target, model, and optimize your users on your site or on display, mobile,
social, search, and other media execution platforms.
