Data Integration

Fundamentals of Data Science


UNIT III
Dr. C. Sivaraj
What is Data Integration?

 Data integration refers to the process of bringing together data from multiple sources across an organization to provide a complete, accurate, and up-to-date dataset for BI, data analysis, and other applications and business processes.

 It includes data replication, ingestion, and transformation to combine different types of data into standardized formats to be stored in a target repository such as a data warehouse, data lake, or data lakehouse.

2
| | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Storage Data Type | Works well with structured data | Works well with semi-structured and unstructured data | Can handle structured, semi-structured, and unstructured data |
| Purpose | Optimal for data analytics and business intelligence (BI) use cases | Suitable for machine learning (ML) and artificial intelligence (AI) workloads | Suitable for both data analytics and machine learning workloads |
| Cost | Storage is costly and time-consuming | Storage is cost-effective, fast, and flexible | Storage is cost-effective, fast, and flexible |
| ACID Compliance | Records data in an ACID-compliant manner to ensure the highest levels of integrity | Non-ACID compliant: updates and deletes are complex operations | ACID-compliant to ensure consistency as multiple parties concurrently read or write data |

3
Five Approaches to Data Integration
 To implement these processes, data engineers, architects, and developers can either manually code an architecture using SQL or, more often, set up and manage a data integration tool, which streamlines development and automates the system.

4
1. ETL (Extract, Transform, and Load)
 Extract: the process of pulling data from a source such as a SQL or NoSQL database, an XML file, or a cloud platform holding data for systems such as marketing tools, CRM systems, or transactional systems.

 Transform: the process of converting the format or structure of the data set to match the target system.

 Load: the process of placing the data set into the target system, which can be a database, a data warehouse, an application such as a CRM platform, or a cloud data warehouse, data lake, or data lakehouse from providers such as Snowflake, Amazon Redshift, and Google BigQuery. (A minimal sketch of the full pipeline follows below.)
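To make the three steps concrete, here is a minimal ETL sketch in Python with pandas. The sales.csv file, its columns (order_date, quantity, unit_price), and the SQLite target database are hypothetical stand-ins for a real source and warehouse:

```python
# A minimal ETL sketch. The source file, its columns, and the
# SQLite target are hypothetical stand-ins for real systems.
import sqlite3
import pandas as pd

# Extract: pull raw data from the source file.
raw = pd.read_csv("sales.csv")  # hypothetical source

# Transform: reshape the data to match the target schema
# before it is loaded.
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["total"] = raw["quantity"] * raw["unit_price"]

# Load: place the transformed data set into the target system.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="replace", index=False)
```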

5
1. What is ETL?

6
2. ELT (“Extract, Load, and Transform”)
 The ELT process is broken out as follows:
 Extract. A data extraction tool pulls data from a source or sources such as SQL or NoSQL databases, cloud platforms, or XML files. The extracted data is often stored temporarily in a staging area in a database to confirm data integrity and to apply any necessary business rules.
 Load. The second step involves placing the data into the target
system, typically a cloud data warehouse, where it is ready to be
analyzed by BI tools or data analytics tools.
 Transform. Data transformation refers to converting the structure or format of a data set to match that of the target system. Examples of transformations include data mapping, replacing codes with values, and applying concatenations or calculations. (A sketch of the load-then-transform pattern follows below.)
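For contrast with the ETL sketch above, here is a minimal ELT sketch under the same hypothetical sales.csv source: the raw data is loaded into the target first, and the transformation then runs inside the target engine as SQL:

```python
# A minimal ELT sketch, assuming the same hypothetical sales.csv
# source and a local SQLite database as the target.
import sqlite3
import pandas as pd

# Extract: pull the raw data from the source.
raw = pd.read_csv("sales.csv")  # hypothetical source

with sqlite3.connect("warehouse.db") as conn:
    # Load: place the raw, untransformed data into the target first.
    raw.to_sql("sales_raw", conn, if_exists="replace", index=False)

    # Transform: apply the calculation inside the target system,
    # letting the warehouse engine do the work.
    conn.execute("DROP TABLE IF EXISTS sales")
    conn.execute("""
        CREATE TABLE sales AS
        SELECT order_date,
               quantity * unit_price AS total
        FROM sales_raw
    """)
```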
7
2. ELT

8
2. ELT (“Extract, Load, and Transform”)

 ELT is an alternative to the traditional ETL pipeline.


 ELT and cloud-based repositories are more scalable,
more flexible, and allow you to move faster.
 The ELT process is most appropriate for larger,
nonrelational, and unstructured data sets and when
timeliness is important.

9
3. Data Streaming
 Instead of loading data into a new repository in batches, streaming data integration moves data continuously in real time from source to target (see the sketch below).
 Modern data integration (DI) platforms can deliver analytics-
ready data into streaming and cloud platforms, data warehouses,
and data lakes.
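A minimal sketch of the idea, with a simulated event source standing in for a real stream (such as Kafka) and a plain list standing in for the target repository:

```python
# Streaming integration sketch: records move one at a time from
# source to target as they arrive, rather than in periodic batches.
import time
from datetime import datetime, timezone

def event_source():
    """Simulate a source that continuously emits records."""
    for order_id in range(1, 4):
        yield {"order_id": order_id,
               "ts": datetime.now(timezone.utc).isoformat()}
        time.sleep(0.1)  # records arrive over time, not as one batch

target = []  # stand-in for a data lake / warehouse sink

# Deliver each record to the target as soon as it is produced.
for record in event_source():
    target.append(record)
    print("delivered:", record)
```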

10
4. Application Integration
 Application integration allows separate applications to work together by moving and syncing data between them.
 The most typical use case is to support operational needs such
as ensuring that your HR system has the same data as your
finance system.
 Application integration must provide consistency between the
data sets.
 Also, these various applications usually have unique APIs for giving and taking data, so SaaS application automation tools can help you create and maintain native API integrations efficiently and at scale. (A sketch of a simple sync follows below.)
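A minimal sketch of such a sync using the requests library; both base URLs and endpoint paths are hypothetical placeholders, not a real vendor API:

```python
# Application integration sketch: sync employee records from an HR
# system into a finance system over REST APIs. The URLs below are
# hypothetical; a real integration would use each vendor's
# documented API.
import requests

HR_API = "https://hr.example.com/api/employees"        # hypothetical
FINANCE_API = "https://finance.example.com/api/staff"  # hypothetical

def sync_employees():
    # Pull the current records from the HR system.
    employees = requests.get(HR_API, timeout=10).json()

    # Push each record into the finance system so the two
    # applications hold consistent data.
    for emp in employees:
        requests.put(f"{FINANCE_API}/{emp['id']}", json=emp, timeout=10)

if __name__ == "__main__":
    sync_employees()
```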

11
4. Application Integration

 Here is an example of a B2B marketing integration flow:

12


5. Data Virtualization
 Like streaming, data virtualization also delivers data in real time, but only when it is requested by a user or application.
 This approach can create a unified view of data and make data available on demand by virtually combining data from different systems.
 Virtualization and streaming are well suited for transactional systems built for high-performance queries. (A minimal sketch follows below.)
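A minimal sketch of on-demand combination: nothing is copied into a central repository, and the two sources (a hypothetical CRM export in CSV and a hypothetical orders table in SQLite) are queried only when the view is requested:

```python
# Data virtualization sketch: data stays in its source systems and
# is combined virtually, on request, rather than being replicated
# into a new repository. Both sources are hypothetical stand-ins.
import sqlite3
import pandas as pd

def unified_customer_view(customer_id):
    """Build a unified view of one customer, fetched on request."""
    # Source 1: a CRM export (hypothetical file).
    crm = pd.read_csv("crm_customers.csv")

    # Source 2: an orders table in an operational database.
    with sqlite3.connect("orders.db") as conn:
        orders = pd.read_sql_query(
            "SELECT * FROM orders WHERE customer_id = ?",
            conn, params=(customer_id,))

    # Combine virtually: nothing is persisted to a new repository.
    profile = crm[crm["customer_id"] == customer_id]
    return profile.merge(orders, on="customer_id", how="left")
```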

13
5. Data Virtualization

14
Data Integration Benefits

Three key benefits of data integration:


 Increased accuracy and trust:
You and other stakeholders can stop wondering which KPI (key performance indicator) from which tool is correct, or whether certain data has been included.
You will also have far fewer errors and less rework. Data integration provides a reliable, single source of accurate, governed data you can trust: “one source of truth”.

15
Data Integration Benefits

Three key benefits of data integration:


 More data-driven & collaborative decision-making:
 Users from across your organization are far more likely to engage in
analysis once raw data and data silos have been transformed into
accessible, analytics-ready information.
 They’re also more likely to collaborate across departments since the
data from every part of the enterprise is combined and they can
clearly see how their activities impact each other.
 Increased efficiency:
 Analyst, development, and IT teams can focus on more strategic initiatives when their time isn't taken up with manually gathering and preparing data or building one-off connections and custom reports.

16
Integrating Data using Python

Exercise:
 In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv.
 The student.csv dataset contains columns such as Age, Gender, Grade, and Employed.
 The marks.csv dataset contains columns such as Mark and City.
 The Student_id column is common to both datasets.

17
Integrating Data using Python
 Load the student.csv and marks.csv datasets into the stud_data and mark_data pandas DataFrames:
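The slide shows this step as a screenshot; a minimal equivalent, assuming both CSV files are in the current working directory, would be:

```python
import pandas as pd

# Load the two datasets into pandas DataFrames. The file paths
# assume student.csv and marks.csv sit in the working directory.
stud_data = pd.read_csv("student.csv")
mark_data = pd.read_csv("marks.csv")
```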

18
Integrating Data using Python

 To print the first five rows of each DataFrame:
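A minimal version of the code the slide shows; head() returns the first five rows by default:

```python
# Inspect the first five rows of each DataFrame.
print(stud_data.head())
print(mark_data.head())
```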

19
 Student_id is common to both datasets. Perform data integration on both DataFrames with respect to the Student_id column using the pd.merge() function, and then print the first 10 rows of the new DataFrame:
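A minimal sketch of the merge step shown on the slide; an inner join on Student_id is pandas' default behavior when on= is given:

```python
# Merge the two DataFrames on the common Student_id column.
df = pd.merge(stud_data, mark_data, on="Student_id")

# Print the first 10 rows of the integrated DataFrame.
print(df.head(10))
```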

20
