L4 Data Integration
2
Data Integration
Databases are great: they let us manage huge amounts of
data
Assuming you’ve put it all into your schema.
In reality, data sets are often created independently,
only for their owners to discover later that they need to combine their data!
At that point, they’re using different systems, different
schemata and have limited interfaces to their data.
The goal of data integration: tie together different sources,
controlled by many people, under a common schema.
https://www.youtube.com/watch?v=Fh17h6c3sNw
3
Introduction
Many databases and sources of data that need to be
integrated to work together
Almost all applications have many sources of data
Data integration is the process of combining data from
multiple sources, possibly providing a single view over all
these sources
And answering queries using the combined information
Integration can be physical or virtual
Physical: Copying the data to a warehouse
Virtual: Keeping the data only at the sources
4
Data Integration
Data integration is also valid within a single organization
Integrating data from different departments or sectors
5
Goals of Data Integration
Provide
Uniform (same query interface to all sources)
Access to (queries; eventually updates too)
Multiple (we want many, but 2 is hard too)
Autonomous (DBA doesn’t report to you)
Heterogeneous (data models are different)
Distributed (over LAN, WAN, Internet)
Data Sources (not only databases).
6
Heterogeneity Problems
The main problem is the heterogeneity among the data
sources.
Source Type Heterogeneity : Systems storing the data can
be different
7
Heterogeneity Problems (cont.)
Communication Heterogeneity
Some systems have a web interface, others do not
Some systems allow direct access through a query language, others offer only APIs
Schema Heterogeneity
The structure of the tables storing the data can be different (even
if storing the same data)
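A minimal sketch of how schema heterogeneity might be bridged, assuming two hypothetical sources that store the same person data under different structures. The field names and the common schema (name, phone) are illustrative assumptions, not taken from a real system.

```python
# Two hypothetical sources storing the same kind of data with different structures.
source_a = [{"full_name": "Ada Lovelace", "phone": "555-0100"}]
source_b = [{"first": "Alan", "last": "Turing", "tel": 5550101}]

def from_a(row):
    """Map a row of source A to the common schema (name, phone)."""
    return {"name": row["full_name"], "phone": row["phone"].replace("-", "")}

def from_b(row):
    """Map a row of source B to the common schema (name, phone)."""
    return {"name": f"{row['first']} {row['last']}", "phone": str(row["tel"])}

# One mapping function per source produces rows in the common schema.
integrated = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
for person in integrated:
    print(person)
```

Each source keeps its own structure; only the mapping functions know how to translate it.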
8
Heterogeneity Problems (cont.)
Data Type Heterogeneity
Storing the same data (and values) but with different data types
E.g., Storing the phone number as String or as Number
E.g., Storing the name as fixed length or variable length
Value Heterogeneity
Same logical values stored in different ways
E.g., ‘Prof’, ‘Prof.’, ‘Professor’
E.g., ‘Right’, ‘R’, ‘1’ versus ‘Left’, ‘L’, ‘-1’
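One common way to handle value heterogeneity is a lookup table that maps every known spelling to a canonical value. The sketch below uses the examples from this slide; the table names and the choice of canonical forms are assumptions made for illustration.

```python
# Hypothetical lookup tables mapping each source's spelling to one canonical value.
TITLE_CANONICAL = {"prof": "Professor", "prof.": "Professor", "professor": "Professor"}
SIDE_CANONICAL = {
    "right": "Right", "r": "Right", "1": "Right",
    "left": "Left", "l": "Left", "-1": "Left",
}

def normalize(value, table):
    """Return the canonical form of value, or None if the value is unknown."""
    return table.get(str(value).strip().lower())

print(normalize("Prof.", TITLE_CANONICAL))
print(normalize(-1, SIDE_CANONICAL))
print(normalize("Dr", TITLE_CANONICAL))  # unknown values are flagged as None
```

Converting the input with `str` also absorbs one form of data type heterogeneity: the number -1 and the string '-1' resolve to the same value.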
9
Heterogeneity Problems (cont.)
Semantic Heterogeneity
Same values in different sources can mean different things
E.g., Column ‘Title’ in one database means ‘Job Title’ while
in another database it means ‘Person Title’
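Semantic heterogeneity cannot be detected from the values alone; the meaning of each column has to be recorded per source. A minimal sketch, assuming two hypothetical sources named `hr_db` and `crm_db` and illustrative target column names:

```python
# Hypothetical per-source metadata: what the ambiguous column 'Title' means in each source.
COLUMN_MEANING = {
    "hr_db": {"Title": "job_title"},
    "crm_db": {"Title": "person_title"},
}

def to_common(source, row):
    """Rename source columns to unambiguous names in the common schema."""
    mapping = COLUMN_MEANING[source]
    return {mapping.get(col, col): val for col, val in row.items()}

hr_row = to_common("hr_db", {"Name": "Kim", "Title": "Engineer"})
crm_row = to_common("crm_db", {"Name": "Kim", "Title": "Dr."})
print(hr_row)
print(crm_row)
```

After renaming, the two 'Title' columns no longer collide when the rows are combined.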
10
Reasons for Heterogeneity
11
Top 10 Data Integration Issues
https://tdwi.org/articles/2006/05/09/data-integration-using-etl-eai-and-eii-tools-to-create-an-integrated-enterprise-report-excerpt.aspx
12
Motivation
WWW
Website construction
Comparison shopping
Portals integrating data from multiple sources
B2B, electronic marketplaces
https://www.youtube.com/watch?v=MaNjsbdSDZ4
13
Data Integration: A Higher-level Abstraction
15
Application Area 1: Business
[Figure: EII apps (CRM, ERP, portals, …) access enterprise databases, legacy databases, and services and applications through a single mediated view.]
16
50% of all IT $$$ spent here!
Application Area 2: Science
[Figure: entities (Sequenceable Entity, Structured Vocabulary, Phenotype, Gene, Experiment, Nucleotide Sequence, Microarray Experiment, Protein) and the sources that hold them (Swiss-Prot, OMIM, HUGO, GO, GeneClinics, LocusLink, Entrez, GEO).]
18
The Deep Web
Millions of high quality HTML forms out there
Each form has its own special interface
Hard to explore data across sites.
Goal (for some domains):
A single interface into a multitude of deep-web sources
19
https://www.deepweb-sites.com/
http://idke.ruc.edu.cn/projects/web.htm
20
Other Reasons to Integrate Data
Create a (useful) web site for tracking services
Collaborate with third parties
E.g., create branded services
Comply with government regulations
Find “risky” employees
Business intelligence
What’s really wrong with our products?
21
Goal of Data Integration
Uniform query access to a set of data sources
Handle:
Scale of sources: from tens to millions
Heterogeneity
Autonomy
Semi-structure
22
Why is it Hard?
Systems-level reasons:
Managing different platforms
SQL across multiple systems is not so simple
Distributed query processing
Logical reasons:
Schema (and data) heterogeneity
‘Social’ reasons:
Locating and capturing relevant data in the enterprise.
Convincing people to share (data fiefdoms)
Security, privacy and performance implications.
23
Setting Expectations
Data integration is AI-Complete.
Completely automated solutions unlikely.
Goal 1:
Reduce the effort needed to set up an integration application.
Goal 2:
Enable the system to perform gracefully with uncertainty (e.g.,
on the web)
24
Data Integration Smorgasbord
Something for everyone:
Theory of modeling data sources
Systems aspects of data integration
Architectural issues: e.g., P2P data sharing
AI @ work: automated schema matching
Web: latest on data integration & web
Commercial products: BEA, IBM
Semantic Web: what does it have to offer?
New trends in DBMS: uncertainty, dataspaces
25
Types of Data Integration
Data Consolidation
Data consolidation physically brings data together from several separate
systems, creating a version of the consolidated data in one data store.
Often the goal of data consolidation is to reduce the number of data
storage locations. Extract, transform, and load (ETL) technology
supports data consolidation.
Data Propagation
Data propagation is the use of applications to copy data from one
location to another. It is event-driven and can be done synchronously or
asynchronously. Most synchronous data propagation supports a two-way
data exchange between the source and the target. Enterprise application
integration (EAI) and enterprise data replication (EDR) technologies
support data propagation.
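The extract, transform, and load (ETL) steps behind data consolidation can be sketched as follows. This is an illustration under assumptions: the two source extracts, their date and number formats, and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the consolidated data store.

```python
import sqlite3

# Extract: hypothetical rows pulled from two separate systems.
sales_eu = [("2024-01-05", "1.200,50")]   # ISO date, European number format
sales_us = [("01/06/2024", "980.25")]     # US date format, plain decimal

def transform_eu(row):
    """Convert a European-formatted amount to a float."""
    date, amount = row
    return date, float(amount.replace(".", "").replace(",", "."))

def transform_us(row):
    """Convert a US-formatted date (MM/DD/YYYY) to ISO format."""
    date, amount = row
    month, day, year = date.split("/")
    return f"{year}-{month}-{day}", float(amount)

# Transform: bring both extracts into one format.
rows = [transform_eu(r) for r in sales_eu] + [transform_us(r) for r in sales_us]

# Load: store the consolidated rows in a single data store.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
store.executemany("INSERT INTO sales VALUES (?, ?)", rows)
store.commit()

total = store.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

Once loaded, queries such as the sum above run against one store instead of two systems.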
26
Types of Data Integration (cont.)
Data Virtualization
Virtualization uses an interface to provide a near real-time, unified view of
data from disparate sources with different data models. Data can be viewed
in one location, but is not stored in that single location. Data virtualization
retrieves and interprets data, but does not require uniform formatting or a
single point of access.
https://www.youtube.com/watch?v=6Ws-3dOGasE
Data Federation
Federation is technically a form of data virtualization. It uses a virtual database and creates a
common data model for heterogeneous data from different systems. Data is brought together and
viewable from a single point of access. Enterprise information integration (EII) is a technology
that supports data federation. It uses data abstraction to provide a unified view of data from
different sources. That data can then be presented or analyzed in new ways through applications.
Virtualization and federation are good workarounds for situations where data consolidation is
cost prohibitive or would cause too many security and compliance issues.
https://www.coursera.org/learn/data-analytics-business/lecture/SzzGY/3-virtualization-federation-and-in-memory-computing
27
Types of Data Integration (cont.)
Data Warehousing
Warehousing is included in this list because it is a commonly used term.
However, its meaning is more generic than the other methods previously
mentioned. Data warehouses are storage repositories for data. However,
when the term “data warehousing” is used, it implies the cleansing,
reformatting, and storage of data, which is basically data integration.
Source: https://www.globalscape.com/blog/5-types-data-integration
28
Models of Data Integration
Federated Databases
Data Warehousing
Mediation
29
Federated Databases
Simplest architecture
Every pair of sources can build their own mapping and
transformation
If source X needs to communicate with source Y, build a
mapping between X and Y
Does not have to be between all sources (on demand)
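The pairwise, on-demand mappings of a federated architecture can be sketched as a registry keyed by (source, target). The registry, the function names, and the record fields below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical registry of pairwise mappings, keyed by (from_source, to_source).
mappings = {}

def register(src, dst, fn):
    """Build a mapping from src to dst only when the pair needs to communicate."""
    mappings[(src, dst)] = fn

def send(src, dst, record):
    """Translate a record from src's format to dst's; fail if no mapping was built."""
    fn = mappings.get((src, dst))
    if fn is None:
        raise LookupError(f"no mapping from {src} to {dst}")
    return fn(record)

# Only X -> Y is needed so far, so only that mapping exists.
register("X", "Y", lambda r: {"customer": r["cust_name"], "zip": r["postcode"]})
translated = send("X", "Y", {"cust_name": "Ada", "postcode": "12345"})
print(translated)
```

Pairs that never communicate, such as Y to X here, never require a mapping.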
30
Data Warehousing
Very common approach
Data from multiple sources are copied and stored in a
warehouse
Data is materialized in the warehouse
Users can then query the warehouse database only
31
Data Warehousing: Synchronization
How to synchronize the data between the sources and the
warehouse?
Two approaches:
Complete rebuild
Periodically rebuild the warehouse from the sources
(e.g., every night or every week)
(+) The procedure is easy
(-) Expensive and time consuming
Incremental update
Periodically update the warehouse based on the changes in the sources
(+) Less expensive and more efficient
(-) More complex to perform
(-) Requires sources to keep track of their updates
In both approaches, the warehouse is not up to date at all times
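The two synchronization approaches can be contrasted in a small sketch. It assumes a hypothetical source held in a dictionary that also keeps a change log, which is the "keep track of their updates" requirement; real sources would expose changes through logs or timestamps.

```python
# Hypothetical source: current rows plus a change log that the source maintains.
source_rows = {1: "alice", 2: "bob"}
change_log = []  # entries: (operation, key, value)

def source_upsert(key, value):
    source_rows[key] = value
    change_log.append(("upsert", key, value))

def source_delete(key):
    del source_rows[key]
    change_log.append(("delete", key, None))

def complete_rebuild():
    """Easy but expensive: copy everything from the source again."""
    return dict(source_rows)

def incremental_update(warehouse, log):
    """Cheaper but more complex: replay only the logged changes, then clear the log."""
    for op, key, value in log:
        if op == "upsert":
            warehouse[key] = value
        else:
            warehouse.pop(key, None)
    log.clear()

warehouse = complete_rebuild()
source_upsert(3, "carol")
source_delete(1)

# Between refreshes the warehouse lags behind the source.
stale = warehouse != source_rows
incremental_update(warehouse, change_log)
print(stale, warehouse)
```

The incremental path touches two rows instead of recopying the whole source, at the price of the extra bookkeeping in `change_log`.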
32
Data Warehousing
33
Traditional DW Architecture
34
Mediation
The mediator is a virtual view over
the data (it does not store any data)
Data is stored only at the sources
The mediator has a virtual schema that
combines all schemas from the sources
Usually wrappers are the
35
Mediation : Example
Mediator Schema
Source 1 Schema
Source 2 Schema
36
Mediation: Example
37
Mediation: Example
38
Virtual, Warehousing and in Between
Data warehousing: integrate by bringing the data into a
single physical warehouse
Virtual data integration: leave the data at the sources and
access it at query time.
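Virtual integration through a mediator can be sketched as follows. The two sources, their field names, and the mediated schema (isbn, title, price) are assumptions made for illustration; the wrappers translate each source into the mediated schema, and nothing is copied ahead of time.

```python
# Hypothetical sources with different schemas; the data stays where it is.
source1 = [{"isbn": "111", "title": "Databases", "price_usd": 40.0}]
source2 = [{"id": "222", "name": "Data Integration", "cost": 55.0}]

def wrapper1():
    """Expose source 1 in the mediated schema (isbn, title, price)."""
    for r in source1:
        yield {"isbn": r["isbn"], "title": r["title"], "price": r["price_usd"]}

def wrapper2():
    """Expose source 2 in the mediated schema (isbn, title, price)."""
    for r in source2:
        yield {"isbn": r["id"], "title": r["name"], "price": r["cost"]}

def mediator_query(predicate):
    """Answer a query over the mediated schema by reading every source at query time."""
    return [row for wrapper in (wrapper1, wrapper2) for row in wrapper() if predicate(row)]

cheap = mediator_query(lambda row: row["price"] < 50)

# A change at a source is visible to the very next query; no refresh step is needed.
source2.append({"id": "333", "name": "SQL Basics", "cost": 20.0})
cheap_now = mediator_query(lambda row: row["price"] < 50)
print([row["title"] for row in cheap_now])
```

This is the trade-off against warehousing: answers are always current, but every query pays the cost of contacting the sources.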
39
Virtual Data Integration Architecture
[Figure: a query is posed over the mediated schema (or warehouse). Query reformulation (or, for a warehouse, a query over materialized data) uses source descriptions/transforms to reach the sources RDBMS1, RDBMS2, HTML1, and XML1.]
40
Entity Resolution
Data coming from different sources may be different even
if representing the same objects
Entity resolution is the process of:
Figuring out which records represent the same thing
Linking relevant records together
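A minimal sketch of entity resolution by string similarity, using `difflib.SequenceMatcher` from the Python standard library. The greedy grouping strategy, the 0.85 threshold, and the sample records are assumptions chosen for illustration; production systems use more robust matching.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def resolve(records, threshold=0.85):
    """Greedily group records whose name is similar to a group's first record."""
    groups = []
    for rec in records:
        for group in groups:
            if similarity(rec["name"], group[0]["name"]) >= threshold:
                group.append(rec)  # link the record to an existing entity
                break
        else:
            groups.append([rec])   # no match found: start a new entity
    return groups

records = [
    {"source": "A", "name": "Jonathan Smith"},
    {"source": "B", "name": "Jonathon Smith"},
    {"source": "B", "name": "Maria Garcia"},
]
groups = resolve(records)
for group in groups:
    print([r["name"] for r in group])
```

The two spellings of the first name end up in one group, while the unrelated record forms its own.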
41
Merging Similar Records
How to merge similar records?
In some cases, e.g., misspellings or synonyms, it is possible to
merge the results
In other cases, e.g., conflicts, there is no easy way to find the
correct values
Report all the results we have
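The "report all the results" policy can be sketched as a merge that keeps a single value where the sources agree and a list of all reported values where they conflict. The record fields and sample data are hypothetical.

```python
def merge(records):
    """Merge records of one entity; conflicting fields keep all reported values."""
    collected = {}
    for rec in records:
        for field, value in rec.items():
            values = collected.setdefault(field, [])
            if value not in values:
                values.append(value)
    # Fields on which all sources agree collapse to a single value.
    return {f: v[0] if len(v) == 1 else v for f, v in collected.items()}

result = merge([
    {"name": "Jonathan Smith", "city": "Boston"},
    {"name": "Jonathan Smith", "city": "Chicago"},
])
print(result)
```

The conflict on `city` is surfaced to the user instead of being resolved by guessing.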
42
Automated Integration
Data integration requires a lot of manual effort
Data warehouse: designing and implementing the ETL module
Mediators: designing and implementing the wrappers
Federated database: designing and implementing the mapping modules
(wrappers)
43
Recent Research
44
Summary
Data integration: abstract away the fact that data comes
from multiple sources in varying schemata.
Problem occurs everywhere: it’s key to business, science,
Web and government.
Goal: reduce the effort involved in integrating.
Regardless of the architecture, heterogeneity is a key
issue.
Architectures range from warehousing to virtual
integration.
45