
UNIT-III:

PROCESSING BIG DATA: Integrating disparate data stores.


Data integration is a core component of the broader data management process, serving as the backbone
for almost all data-driven initiatives. It ensures businesses can harness the full potential of their data
assets effectively and efficiently. It empowers them to remain competitive and innovative in an
increasingly data-centric landscape by streamlining data analytics, business intelligence (BI), and,
eventually, decision-making.

There are several different ways of integrating data, and depending on your business requirements
you may have to use a combination of two or more data integration techniques. These
include:

Extract, Transform, Load (ETL)

ETL has long been the standard way of integrating data.

This data integration strategy involves extracting data from multiple sources, transforming
the data sets into a consistent format, and loading them into the target system.

Consider using automated ETL tools to accelerate data integration and unlock faster time-to-
insight.
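
As a rough illustration, here is a minimal ETL sketch in Python. The file name, column names, and the SQLite target are assumptions made for the example, not part of any particular tool.

    # Minimal ETL sketch: extract from a CSV file, transform, load into SQLite.
    # "sales.csv" and its "name"/"amount" columns are illustrative assumptions.
    import csv, sqlite3

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))          # pull raw rows from the source

    def transform(rows):
        # normalize to a consistent format: trim names, cast amounts to float
        return [(r["name"].strip().title(), float(r["amount"])) for r in rows]

    def load(records, db="warehouse.db"):
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (?, ?)", records)
        con.commit()
        con.close()

    load(transform(extract("sales.csv")))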

Extract, Load, Transform (ELT)

ELT is a fairly recent data integration technique. It is similar to ETL except for the order of
the remaining steps: data extraction still comes first, but loading happens before transformation.

Instead of transforming the data before loading it into, say, a data warehouse, the data is
directly loaded into the target system as soon as it’s extracted.

The transformation takes place inside the data warehouse, utilizing the processing power of
the storage system.
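
A minimal ELT sketch follows, again using SQLite to stand in for the warehouse: the raw rows are loaded as-is, and the transformation runs inside the target system as SQL. The file, table, and column names are assumptions.

    # Minimal ELT sketch: load raw rows first, then transform inside the "warehouse".
    import csv, sqlite3

    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS raw_sales (name TEXT, amount TEXT)")

    with open("sales.csv", newline="") as f:                       # extract
        rows = [(r["name"], r["amount"]) for r in csv.DictReader(f)]
    con.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)   # load raw data

    # the transform step is expressed as SQL and runs in the storage system itself
    con.execute("""
        CREATE TABLE IF NOT EXISTS clean_sales AS
        SELECT TRIM(name) AS name, CAST(amount AS REAL) AS amount
        FROM raw_sales
    """)
    con.commit()
    con.close()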

Change Data Capture (CDC)

Change data capture is a way to integrate data by identifying and capturing only the changes
made to a database.
It enables real-time or near-real-time updates to be efficiently and selectively replicated across
systems, ensuring that downstream applications stay synchronized with the latest changes in
the source data.
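
A simplified sketch of the idea, assuming the source keeps a hypothetical change-log table named changes; a real CDC tool would normally read the database's transaction log instead of polling a table.

    # Simplified CDC sketch: poll a hypothetical change-log table and replay only
    # the new changes into a target system that has a prices(product_id, price)
    # table with product_id as its primary key.
    import sqlite3

    def capture_changes(source_db, last_seen_id):
        con = sqlite3.connect(source_db)
        cur = con.execute(
            "SELECT id, op, product_id, price FROM changes WHERE id > ? ORDER BY id",
            (last_seen_id,),
        )
        return cur.fetchall()

    def apply_changes(target_db, changes):
        con = sqlite3.connect(target_db)
        for change_id, op, product_id, price in changes:
            if op == "DELETE":
                con.execute("DELETE FROM prices WHERE product_id = ?", (product_id,))
            else:  # INSERT and UPDATE are both handled as an upsert here
                con.execute(
                    "INSERT OR REPLACE INTO prices (product_id, price) VALUES (?, ?)",
                    (product_id, price),
                )
        con.commit()
        return changes[-1][0] if changes else None  # new high-water mark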

Enterprise Data Integration

When it comes to integrating data across an organization, it doesn’t get any broader than this.

Enterprise data integration is a holistic strategy that provides a unified view of data to improve
data-driven decision-making and enhance operational efficiency at the enterprise level.

It is typically supported by a range of technologies, such as ETL tools, APIs, etc. The choice
of technology depends on the enterprise’s specific data integration needs, existing IT
infrastructure, and business objectives.

Data Federation

Data federation, also known as federated data access or federated data integration, is an
approach that allows users and applications to access and query data from multiple disparate
sources as if they were a single, unified data source system.

It provides a way to integrate and access data from various systems without physically
centralizing or copying it into a single repository.

Instead, data remains in its original location, which users can access and query using a unified
interface.

However, data federation can introduce some performance challenges. For example, it often
relies on real-time data retrieval from multiple sources, which can impact query response
times.
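
A toy federation sketch: two independent sources stay where they are, and a single query function combines their results on the fly. The sources, files, and field names are invented for illustration.

    # Toy data-federation sketch: data stays in two separate sources and is
    # joined at query time through one unified interface, without copying it.
    import csv, sqlite3

    def query_customers(region):
        # source 1: a relational database of customers
        con = sqlite3.connect("crm.db")
        customers = con.execute(
            "SELECT id, name FROM customers WHERE region = ?", (region,)
        ).fetchall()

        # source 2: a CSV export of orders from another system
        with open("orders.csv", newline="") as f:
            orders = list(csv.DictReader(f))

        # federate: combine both sources in the response
        totals = {}
        for o in orders:
            totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + float(o["total"])
        return [(name, totals.get(str(cid), 0.0)) for cid, name in customers]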

Data Virtualization

Data virtualization allows organizations to access and manipulate data from disparate sources
without physically moving it.

It provides a unified and virtual view of data across databases, applications, and systems.

Think of it as a layer that abstracts these underlying data sources, enabling users to query and
analyze data in real-time.
Data virtualization is a valuable data integration technique for organizations seeking to
improve data agility without the complexities of traditional ETL processes.

Middleware Integration

In simple terms, middleware integration is a data integration strategy that focuses on enabling
communication and data transfer between systems, often involving data transformation,
mapping, and routing.

Think of it as a mediator that sits in the middle and connects different software applications,
allowing them to perform together as a cohesive unit.

For example, you can connect your old on-premises database with a modern cloud data
warehouse using middleware integration and securely move data to the cloud.
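
A minimal sketch of the mediator idea: a small middleware function reads from a legacy source, maps and routes the records, and hands them to a cloud target. Both connector classes are placeholders, not the API of any real middleware product.

    # Minimal middleware sketch: a mediator that maps and routes records between
    # a legacy source and a cloud target. Both connectors are stand-ins.
    class LegacySource:
        def fetch(self):
            return [{"CUST_NM": "Acme", "BAL": "120.50"}]   # stand-in for an on-prem query

    class CloudWarehouse:
        def insert(self, table, row):
            print("loading into", table, row)               # stand-in for a cloud API call

    def middleware(source, target):
        for row in source.fetch():
            mapped = {                                      # transformation / mapping step
                "customer_name": row["CUST_NM"],
                "balance": float(row["BAL"]),
            }
            target.insert("customers", mapped)              # routing to the destination

    middleware(LegacySource(), CloudWarehouse())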

Data Propagation

Data propagation is when information or updates are distributed automatically from one source
to another, ensuring that all relevant parties have access to the most current data.

For example, let us say you have a database of product prices, and you make changes to these
prices in one central location. Now, suppose you want to automatically update these new prices
across all the places where this data is needed, such as your website, mobile app, and internal
sales tools. In this case, data propagation can be a viable solution.
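
A small sketch of that price example: one central update is pushed out to every downstream consumer. The target functions are hypothetical stand-ins for the website, mobile app, and sales tools.

    # Data-propagation sketch: one central price change is pushed automatically
    # to every downstream system that needs it. The targets are hypothetical.
    def update_website(product_id, price):
        print("website:", product_id, price)

    def update_mobile_app(product_id, price):
        print("mobile app:", product_id, price)

    def update_sales_tools(product_id, price):
        print("sales tools:", product_id, price)

    TARGETS = [update_website, update_mobile_app, update_sales_tools]

    def propagate_price_change(product_id, price):
        for push in TARGETS:          # every registered consumer receives the update
            push(product_id, price)

    propagate_price_change("SKU-42", 19.99)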

Mapping data to the programming framework.

Data mapping is the process of connecting a data field from one source to a data field in another source.
This reduces the potential for errors, helps standardize your data, and makes it easier to
understand.
Data mapping helps us visualize and connect data fields much like maps can help us visualize
the best way to get from point A to point B. And just like taking the wrong turn can mean
trouble when you are travelling, data mapping errors can negatively impact your mission-
critical data management initiatives.

Why Does Data Mapping Matter?

With a backdrop of exploding data volume and variety in the modern enterprise, it’s important
to decrease the potential for data errors while increasing the ability to deliver actionable data
insights. The data mapping process integrates multiple data sources into data models so
you can simplify and combine dispersed data sources.

How Does Data Mapping Work?


Data mapping begins with knowing exactly what data you have about any given subject. A data
mapping instruction set identifies data sources and targets and their relationships. In today’s
complex and growing data enterprises, it’s important for your data mapping capabilities to be
part of an intelligent data management platform. That way, you can easily integrate capabilities
including data mapping, data integration, data quality and data governance across all enterprise
workloads at scale.
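
As a small illustration of a data mapping instruction set, the dictionary below connects source fields to target fields and is applied to each record; the field names are assumptions made for the example.

    # Data-mapping sketch: a mapping instruction set connects source fields to
    # target fields and is applied record by record. Field names are illustrative.
    FIELD_MAP = {
        "cust_name": "customer_name",
        "dob":       "date_of_birth",
        "phone_no":  "phone_number",
    }

    def map_record(source_record):
        return {target: source_record[source] for source, target in FIELD_MAP.items()}

    print(map_record({"cust_name": "Jane Doe", "dob": "1990-04-01", "phone_no": "555-0100"}))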

Data Mapping Challenges

Here is a list of four primary challenges that companies may face in implementing data
mapping initiatives:

Complex, Manual Data Mapping Processes

In today’s complex world, companies struggle to keep up with the scale of their data
environments. It is critical for data stewards to define data maps that are both strategic and
systematic.
IT teams need careful planning, the right tools, and a clear data mapping roadmap.
Without these capabilities, the data mapping process can overwhelm your team. Automated
software solutions are part of the modern approach that is needed.

They help you break through the clutter and create data maps with agility and accuracy.

Data Diversity
It is not likely that you will be able to convert data into the actionable insights you need without
the ability to handle the four “Vs” of modern data:

• Volume – Amount of data being generated
• Velocity – Speed at which data is being generated
• Variety – Various types of data being generated, which can largely be grouped into
  three categories: structured data, semi-structured data, and unstructured data
• Veracity – Trustworthiness of the data

Once you can map your data addressing the four “Vs,” you can unleash the fifth “V” — data
value. With data value, you can drive deep analysis, accurate reporting and confident decision-
making.

Poor Performance

If your data mapping is incorrect, your processing time can increase and so can your costs.
Data integration can help you better understand a use case and its business outcome, but you
need an intelligent recommendation engine to avoid unnecessary steps and tasks.

Your systems and staff may not be able to keep up with data mapping challenges and errors
can occur. This slows down processing time and causes issues during data transfer. It also
impacts data testing and implementation.

You can even lose data, costing more time and resources. Intelligent, automated data mapping
can help you get the results you want.
Lack of Trust

You need visibility into end-to-end data movements and changes. Without it, your team may
not be able to trust your data or build a data mapping plan. Applications generate their own
data formats, and their definitions may differ, which means you will not be able to trust your
data transformations.

Connecting and extracting data from storage.

Big data is a big deal. Spotting trends in data enables business leaders and entrepreneurs to
make better decisions, improve team performance and increase revenue.

Sales, customer and operations data can make a night-and-day difference for your business.
But with the bevy of analytics tools available, the challenge isn’t analyzing but extracting
data.

The most efficient method for extracting data is a process called ETL. Short for “extract,
transform, load,” ETL tools pull data from the various platforms you use and prepare it for
analysis.

The only alternative to ETL is manual data entry — which can take literal months, even with
an enterprise amount of manpower. Save yourself the trouble by getting a grip on the ETL
process.

How ETL Tools Work

Knowing how an ETL tool works is part and parcel of understanding its value. Let’s take
each term of that acronym in turn:

Extract

The first step is getting data out of one program and into another. It’s the part of the process
most leaders already know they need, but it’s only useful when paired with the other two.

Transform

The data extracted by your ETL tool is raw. Most programs do not “talk” to each other well,
so their data must be converted to a common language before you can work with it.

Load

Finally, your processed data is loaded into your storage vessel of choice. Many people use a
data warehouse: a program built specifically for the storing and viewing of processed data.
Once there, it’s finally ready to be analyzed.
What You Need to Do

The ETL process seems simple enough, right? Now that you understand it, let’s get your data
pipeline flowing:

1. Set Up Triggers

To get your extraction started, you’ll need to set up an automation system. This involves
designating triggers for your ETL process.

One trigger might be a sale made on your company website. When a customer checks out, it
initiates an action: the purchase data is delivered to your ETL program. Information like the
amount spent and the time of purchase is then compiled, converted and piped into your data
storage system.
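
One way to picture such a trigger, assuming a hypothetical checkout event handler that kicks off the ETL run for that purchase:

    # Trigger sketch: a checkout event starts the ETL run for that purchase.
    # The event structure and run_etl function are hypothetical.
    def run_etl(record):
        print("extract -> transform -> load:", record)

    def on_checkout(event):
        purchase = {
            "amount": event["amount"],          # amount spent
            "timestamp": event["timestamp"],    # time of purchase
        }
        run_etl(purchase)                       # the trigger hands the data to the ETL job

    on_checkout({"amount": 49.99, "timestamp": "2024-05-01T10:15:00Z"})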

2. Choose a Storage Platform

As mentioned previously, extracted data needs to be stored somewhere. There are a few
options to choose from, each with their own strengths and weaknesses:

Data warehouses are used for storing data from one or multiple sources that has been
processed for a specific function.

Data lakes store raw data that has yet to be manipulated to fit an exact purpose.

Data marts are similar to warehouses but on a much smaller scale, typically devoted to a
single team or department.

Databases are used to store information from a single source, often to support a system that
uses a lot of data.

Data warehouses are the default choice for most big data initiatives, but the other three
options have their merits. When in doubt, ask your IT team to weigh in.

3. Audit Your Data

Don’t blindly trust every byte sitting in your data warehouse. Addresses change. People switch
companies. The point is, data gets outdated faster than you might think.

Regularly audit your data, ensuring everything is in place and in the correct context. Without
a data audit, you may be getting an incomplete picture. When you work with limited
information, you cannot make informed decisions.

Transforming data for processing.

Data transformation is the process of converting, cleansing, and structuring data into a usable
format that can be analyzed to support decision making processes, and to propel the growth of
an organization.
Data transformation is used when data needs to be converted to match the format of the destination
system. This can occur at two places in the data pipeline. First, organizations with on-site data
storage use an extract, transform, load (ETL) process, with the data transformation taking place
during the middle ‘transform’ step. Second, organizations that load data directly into a cloud data
warehouse can transform it after loading, inside the warehouse (the ELT approach described earlier).

Data transformation works on the simple objective of extracting data from a source, converting
it into a usable format and then delivering the converted data to the destination system.

The extraction phase involves data being pulled into a central repository from different sources
or locations; at this point the data is usually in its raw, original form, which is not yet usable.

To ensure the usability of the extracted data it must be transformed into the desired format by
taking it through a number of steps. In certain cases, the data also needs to be cleaned before
the transformation takes place.

This step resolves the issues of missing values and inconsistencies that exist in the dataset. The
data transformation process is carried out in five stages.

1. Discovery
The first step is to identify and understand the data in its original source format with the help
of data profiling tools, finding all the sources and data types that need to be transformed. This
step helps in understanding how the data needs to be transformed to fit into the desired format.

2. Mapping
The transformation is planned during the data mapping phase. This includes determining the
current structure, and the consequent transformation that is required, then mapping the data to
understand at a basic level, the way individual fields would be modified, joined or aggregated.
3. Code generation
The code, which is required to run the transformation process, is created in this step using a
data transformation platform or tool.

4. Execution
The data is finally converted into the selected format with the help of the code. The data is
extracted from the source(s), which can vary from structured to streaming, telemetry to log
files.
Next, transformations are carried out on data, such as aggregation, format conversion or
merging, as planned in the mapping stage.

The transformed data is then sent to the destination system which could be a dataset or a data
warehouse.
Some of the transformation types, depending on the data involved, include:
• Filtering which helps in selecting certain columns that require transformation
• Enriching which fills out the basic gaps in the data set
• Splitting where a single column is split into multiple or vice versa
• Removal of duplicate data, and
• Joining data from different sources

5. Review
The transformed data is evaluated to ensure the conversion has had the desired results in terms
of the format of the data.
It must also be noted that not all data will need transformation; at times it can be used as is.
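
To make the execution stage concrete, here is a minimal sketch (assuming a small list of in-memory records) of a few of the transformation types listed above: removing duplicates, splitting a column, filtering fields, and format conversion.

    # Sketch of a few common transformation types from the execution stage:
    # de-duplication, splitting one column into two, filtering, and type conversion.
    records = [
        {"full_name": "Ada Lovelace", "country": "UK", "age": "36"},
        {"full_name": "Ada Lovelace", "country": "UK", "age": "36"},   # duplicate
        {"full_name": "Alan Turing",  "country": "UK", "age": "41"},
    ]

    def transform(rows):
        seen, out = set(), []
        for r in rows:
            key = tuple(r.items())
            if key in seen:                                  # removal of duplicate data
                continue
            seen.add(key)
            first, last = r["full_name"].split(" ", 1)       # splitting one column into two
            out.append({"first_name": first,                 # filtering: keep only needed fields
                        "last_name": last,
                        "age": int(r["age"])})               # format conversion
        return out

    print(transform(records))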

Subdividing data in preparation for Hadoop Map Reduce.

Hadoop MapReduce is a programming model and processing framework for processing large
amounts of data in parallel across a distributed cluster. To prepare data for Hadoop
MapReduce, you typically need to subdivide the data into smaller chunks or partitions that can
be processed in parallel across multiple nodes in the cluster. Here is how you can approach
this:

1. Data Splitting: Divide your input data into smaller chunks or blocks. Hadoop’s HDFS
(Hadoop Distributed File System) typically does this automatically when you upload
your data, splitting it into 128MB or 256MB blocks. This is important for distributing
the data efficiently across the cluster.

2. Mapper Tasks: In the context of MapReduce, each chunk of data is processed by a
   separate mapper task. Design your mapper function to process a single record at a time.
   This function is applied to each data block independently and in parallel (see the
   word-count sketch after this list).

3. Key-Value Pairing: MapReduce processes data as key-value pairs. Your mapper
   function should read each record from the input data and emit intermediate key-value
   pairs. These intermediate key-value pairs are then grouped by key and passed to the
   reducer tasks.

4. Partitioning and Shuffling: Hadoop groups and shuffles the intermediate key-value
pairs based on their keys. The shuffling phase ensures that all values for a given key
are sent to the same reducer, allowing you to process the grouped data efficiently.

5. Reducer Tasks: Reducer tasks process the grouped and shuffled data. Your reducer
function should take a key and a list of values associated with that key, and then perform
the required aggregation or computation. Like mappers, reducers work in parallel.

6. Output: The output of reducer tasks is typically written to HDFS or another storage
system. Ensure that your output format is compatible with your data processing needs.
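
Below is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer read from standard input and emit tab-separated key-value pairs. Streaming is only one way to run MapReduce jobs (the classic API is Java), and the input data here is assumed to be plain text.

    # mapper.py -- emits one (word, 1) pair per word as a tab-separated line
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- the framework delivers input sorted by key, so counts for
    # each word arrive together and can be summed in a single pass
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

Such scripts are typically submitted with the Hadoop Streaming jar, passing -input, -output, -mapper and -reducer options; the exact jar path depends on your installation.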
