Information Integration: Existing Methods and Solutions

W2
Integration Architecture
Schema Matching methods
Schema Mapping Methods
Acknowledgement: This lecture includes contents from open sources.

Steps Involved: Data Source Selection, Data Acquisition, Understanding, Cleansing, Transforming
Data Source Selection
• Once you have the set of questions for querying integrated data, the next step is to identify the data sources
• Must evaluate the data sources to know which to select
  – Is it complete? In terms of coverage? In terms of missing data?
  – Is it correct?
  – Is it clean?
  – Is it structured? If not, can you extract structured data out of it easily and accurately, or expose the text document as an attribute?
  – Is it up to date? How quickly is it changing?
  – Hard to answer some of these questions until you have acquired some data from the sources
Data Acquisition
• Then need to acquire the data from the sources
  – This is highly non-trivial
  – Types of data sources:
    ◦ Structured data sources: relational databases, XML files, …
    ◦ Textual data: emails, memos, documents, etc.
    ◦ Web data: need to crawl, maybe can get a dump, API may exist
    ◦ Other types of data: Excel, PDF, images, etc.
  – Being able to extract data from such sources is non-trivial and time-consuming
  – Build connectors, wrappers, …

Data Acquisition (continued)
• Then need to acquire the data from the sources
  – Some of the data come from within the company
    ◦ Need to talk with the data owner, convince him/her, and get help
    ◦ Acquisition can take months due to legal and compliance reasons
  – Some of the data come from outside the company
    ◦ Public data, open-source data, paid data
    ◦ Pros: can be clean and quick to obtain. Cons: may be of uncertain trustworthiness, noisy, or expensive
Understanding, Cleaning, & Transformation
Do These for Each Source, then Integrate
• For data from each source
  – Data problems
    ◦ missing values
    ◦ incorrect values, illegal values, outliers
    ◦ synonyms
    ◦ misspellings
    ◦ conflicting data (e.g., age and birth year)
    ◦ wrong value formats
    ◦ variations of values
    ◦ duplicate tuples
  – Understand current vs. ideal schema/data
    ◦ Attribute value profiling, relationships between attributes, integrity constraints, …
    ◦ Tools exist for data profiling, relationship discovery, …
  – Compare the two and identify possible problems
    ◦ violations of constraints for the ideal schema/data
  – Clean and transform
  – Possibly enrich/enhance
• Integrate data from the multiple sources
  – schema matching/merging, data matching/merging
  – misspelt names
  – violating constraints (key, uniqueness, foreign key, etc.)
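
Some of the data problems listed above can be checked mechanically once a sample has been acquired. The following is a minimal sketch, assuming rows arrive as Python dictionaries; the field names, the sample values, the reference year, and the one-year tolerance are illustrative assumptions rather than part of the lecture.

```python
from collections import Counter

# Hypothetical sample tuples; field names and values are illustrative only.
ROWS = [
    {"id": 1, "name": "Person A", "age": 30, "birth_year": 1991},
    {"id": 2, "name": "Person B", "age": None, "birth_year": 1985},
    {"id": 3, "name": "Person A", "age": 30, "birth_year": 1991},
    {"id": 4, "name": "Person C", "age": 25, "birth_year": 1970},
]


def missing_counts(rows):
    """Count missing (None) values per attribute."""
    counts = Counter()
    for row in rows:
        for attr, value in row.items():
            if value is None:
                counts[attr] += 1
    return dict(counts)


def duplicate_ids(rows, attrs):
    """Return ids of rows that repeat an earlier row on the given attributes."""
    seen, duplicates = set(), []
    for row in rows:
        signature = tuple(row[a] for a in attrs)
        if signature in seen:
            duplicates.append(row["id"])
        else:
            seen.add(signature)
    return duplicates


def age_conflicts(rows, as_of_year, tolerance=1):
    """Return ids where age and birth_year disagree by more than tolerance."""
    conflicts = []
    for row in rows:
        if row["age"] is None or row["birth_year"] is None:
            continue  # cannot compare when either value is missing
        implied_age = as_of_year - row["birth_year"]
        if abs(implied_age - row["age"]) > tolerance:
            conflicts.append(row["id"])
    return conflicts


print(missing_counts(ROWS))
print(duplicate_ids(ROWS, ("name", "age", "birth_year")))
print(age_conflicts(ROWS, as_of_year=2021))
```

Checks like these only flag candidates; deciding whether a flagged tuple is really wrong still needs the ideal schema and its constraints.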
Information Integration Architecture
[Figure: layered integration architecture. Top layer: mediated schema or data warehouse, answering queries by query reformulation or by querying materialized data. Middle layer: source descriptions / transforms. Below: one wrapper/extractor per source. Bottom layer: the sources RDBMS 1, RDBMS 2, HTML1, XML1.]
II Architecture: A Data Warehousing Approach
II Architecture: Virtualization Layer Approach
The Schema Matching Problem
Given two input schemas in any data model and, optionally, auxiliary information and an input mapping, compute a mapping between schema elements of the two input schemas that passes user validation.
[Figure: a query is posed over a global schema, which sits above Source-1 and Source-2.]
Source-1
  Books
    ISBN char(15) key
    Title varchar(100)
    Author varchar(50)
    MarkedPrice float
Source-2
  BookInfo
    ID char(15) Primary Key
    AuthorID integer references AuthorInfo
    BookTitle varchar(150)
    ListPrice float
    DiscountPrice float
  AuthorInfo
    AuthorID integer key
    LastName varchar(25)
    FirstName varchar(25)
Inputs to Matching Technique
• Schema structure
• Constraints: data type, keys, nullability
• Attribute names
  – Synonyms
    ◦ Code = Id = Num = No
    ◦ Zip = PIN = Postal [code]
    ◦ Node = Server
  – Acronyms
    ◦ PO = Purchase Order
    ◦ UOM = Unit of Measure
    ◦ SS# = Social Security Number
• Data instances
  – Attributes match if they have similar instances or value distributions
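
The instance-based input can be illustrated with a simple value-overlap score. This is a minimal sketch, assuming each attribute is available as a list of sample values; the Jaccard measure, the function names, and the sample data are assumptions chosen for illustration, and comparing value distributions would need a different measure.

```python
def jaccard(values_a, values_b):
    """Jaccard overlap of the distinct values in two columns."""
    a, b = set(values_a), set(values_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_instance_matches(source_1, source_2):
    """For each source_1 attribute, pick the source_2 attribute with the highest overlap."""
    matches = {}
    for attr_1, values_1 in source_1.items():
        scored = [(jaccard(values_1, values_2), attr_2)
                  for attr_2, values_2 in source_2.items()]
        score, attr_2 = max(scored)  # ties fall back to attribute-name order
        matches[attr_1] = (attr_2, score)
    return matches


# Hypothetical sample instances for two book sources.
BOOKS = {"ISBN": ["111", "222", "333"], "Title": ["Book A", "Book B", "Book C"]}
BOOK_INFO = {"ID": ["222", "333", "444"], "BookTitle": ["Book B", "Book C", "Book D"]}

print(best_instance_matches(BOOKS, BOOK_INFO))
```

Here the differently named attributes ISBN and ID are paired purely because their sample values overlap, which is the case name-based matching alone would miss.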
Schema-based hybrid matching algorithm
Based on combining multiple approaches that use only the schema (no instances)
Input: Two schema graphs
Output: Similarity matrix and candidate mapping
• Linguistic matching: compare attributes based on names, data types, etc.
  – Use a thesaurus to help match names by identifying short forms (Qty for Quantity), acronyms (UoM for UnitOfMeasure) and synonyms (Bill and Invoice). The result is a linguistic similarity coefficient, Lsim, between each pair of elements.
• Structure matching: compare elements based on the similarity of their contexts or vicinities. The result is a structural similarity coefficient, Ssim, for each pair of elements.
• Compute the weighted similarity: Wsim = w * Lsim + (1 - w) * Ssim
• Mapping generation: a mapping is created by choosing pairs of schema elements with maximal weighted similarity.
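
The last two steps can be sketched as follows. This is a minimal illustration, assuming Lsim and Ssim are already available as nested dictionaries; the similarity values, the default weight, and the acceptance threshold are made-up assumptions, not values from the lecture.

```python
def weighted_similarity(lsim, ssim, w=0.5):
    """Wsim = w * Lsim + (1 - w) * Ssim, with w between 0 and 1."""
    return w * lsim + (1 - w) * ssim


def generate_mapping(lsim, ssim, w=0.5, threshold=0.5):
    """Map each source element to the target with maximal Wsim, if it reaches the threshold."""
    mapping = {}
    for source_elem, row in lsim.items():
        scores = {
            target: weighted_similarity(l, ssim[source_elem][target], w)
            for target, l in row.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            mapping[source_elem] = best
    return mapping


# Hypothetical similarity coefficients for three source elements.
LSIM = {
    "Qty": {"Quantity": 0.9, "UnitOfMeasure": 0.1},
    "UoM": {"Quantity": 0.1, "UnitOfMeasure": 0.9},
    "Line": {"Quantity": 0.1, "UnitOfMeasure": 0.1},
}
SSIM = {
    "Qty": {"Quantity": 0.8, "UnitOfMeasure": 0.8},
    "UoM": {"Quantity": 0.8, "UnitOfMeasure": 0.8},
    "Line": {"Quantity": 0.2, "UnitOfMeasure": 0.2},
}

print(generate_mapping(LSIM, SSIM))
```

The threshold is an added design choice: without it, every source element would be mapped to something, even when its best candidate is a poor one.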
Example
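[Figure: two purchase-order schema trees to be matched. Left tree elements: PO, POLines, POShipTo, Item, Line, Qty, UoM. Right tree elements: PurchaseOrder, Items, DeliverTo, Item, ItemNumber, Quantity, UnitOfMeasure. The address-related elements Name, Address, Street and City also appear in the trees.]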
Linguistic Matching
• Tokenization of names
  – PurchaseOrder → purchase + order
• Expansion of acronyms
  – UOM → unit + of + measure
• Clustering based on keywords and data types
  – Street, City, POAddress → Address
• Linguistic similarity
  – Pair-wise comparison of elements that belong to the same cluster
  – Token similarity = f(string matching, synonyms score)
  – Token set similarity = average (best matching token similarity)
• Thesaurus: acronyms, synonyms, stop words and categories
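
The tokenization, expansion, and similarity steps above can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written thesaurus; the dictionary contents, the synonym score of 0.9, and the use of `difflib.SequenceMatcher` for string matching are assumptions, and the clustering step is omitted.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-thesaurus; a real one would be far larger.
ACRONYMS = {"uom": ["unit", "of", "measure"], "po": ["purchase", "order"], "qty": ["quantity"]}
SYNONYMS = [{"bill", "invoice"}, {"ship", "deliver"}]
STOP_WORDS = {"of", "to", "the"}


def tokenize(name):
    """Split a name into lower-case tokens, expand short forms, drop stop words."""
    if name.lower() in ACRONYMS:  # the whole name is a known short form, e.g. UoM
        parts = [name]
    else:  # split camelCase and runs of capitals, e.g. POShipTo -> PO, Ship, To
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    tokens = []
    for part in parts:
        tokens.extend(ACRONYMS.get(part.lower(), [part.lower()]))
    return [t for t in tokens if t not in STOP_WORDS]


def token_similarity(a, b):
    """Combine exact match, synonym lookup, and string matching."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in SYNONYMS):
        return 0.9  # assumed synonym score
    return SequenceMatcher(None, a, b).ratio()


def token_set_similarity(tokens_a, tokens_b):
    """Average, over all tokens on both sides, of the best-matching token similarity."""
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(token_similarity(a, b) for b in tokens_b) for a in tokens_a]
    best += [max(token_similarity(a, b) for a in tokens_a) for b in tokens_b]
    return sum(best) / len(best)


def name_similarity(name_a, name_b):
    return token_set_similarity(tokenize(name_a), tokenize(name_b))


print(tokenize("PurchaseOrder"), name_similarity("UoM", "UnitOfMeasure"))
```

Dropping stop words after expansion means UoM and UnitOfMeasure both reduce to the tokens unit and measure, so they compare as identical.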

Structure Matching
[Figure: the same two purchase-order schema trees as on the Example slide, PO on the left and PurchaseOrder on the right.]
Tree Match Algorithm (Bottom-up Traversal)

• Atomic elements (leaves) are similar if
  – Linguistically and data-type similar
  – Their contexts, i.e., ancestors, are similar
• Compound elements (non-leaves) are similar if
  – Linguistically similar
  – Elements in their context, i.e., subtrees rooted at the elements, are similar
• Mutually dependent formulation
  – Leaves determine internal node similarity
  – Similarity of internal nodes leads to increase in leaf similarity
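
One way to read "leaves determine internal node similarity" is to score two internal nodes by the fraction of their leaves that have a strong match on the other side. The sketch below illustrates only that half of the mutual dependency. It assumes trees are nested dictionaries; the grouping of the leaves under Item, the leaf similarity values, and the 0.5 threshold are all illustrative assumptions.

```python
def leaves(tree):
    """Return the leaf names of a nested-dict tree (a leaf maps to an empty dict)."""
    found = []
    for name, children in tree.items():
        found.extend(leaves(children) if children else [name])
    return found


def structural_similarity(subtree_a, subtree_b, leaf_sim, threshold=0.5):
    """Fraction of leaves in the two subtrees that have a strong link to the other side."""
    leaves_a, leaves_b = leaves(subtree_a), leaves(subtree_b)
    total = len(leaves_a) + len(leaves_b)
    if total == 0:
        return 0.0

    def strong(a, b):
        return leaf_sim.get((a, b), 0.0) >= threshold

    linked_a = sum(any(strong(a, b) for b in leaves_b) for a in leaves_a)
    linked_b = sum(any(strong(a, b) for a in leaves_a) for b in leaves_b)
    return (linked_a + linked_b) / total


# Hypothetical subtrees under the two Item elements, with assumed leaf similarities.
ITEM_A = {"Line": {}, "Qty": {}, "UoM": {}}
ITEM_B = {"ItemNumber": {}, "Quantity": {}, "UnitOfMeasure": {}}
LEAF_SIM = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "UnitOfMeasure"): 0.9,
    ("Line", "ItemNumber"): 0.3,
}

print(structural_similarity(ITEM_A, ITEM_B, LEAF_SIM))
```

A fuller bottom-up traversal would then feed a high score for the two Item nodes back into the similarity of weakly linked leaves such as Line and ItemNumber.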
Collective Schema Matching
• [He+, SIGMOD’03]: Build mediated schema for a domain by clustering elements in multiple schemas
• Learn to map between new schemas based on other schemas and mappings in the same domain
[Figure: example source schemas from allcars.com and craigslist auto.]
Clio: Schema Discovery and Mapping for Integration
Find it: Discovery
• Use ontologies and graph algorithms to find similar objects for mapping.
Connect it: Mapping algorithms
• Using mapping composition to handle schema evolution
• Inverse mapping
• Advanced features in mapping semantics
  – Conditional mapping, “nested” mapping, ETL-like procedural constructs
• Round trip support between mappings and generated queries
• Mapping-based data lineage in the context of query execution
Generate it: Transformations
• XML transformation engine
• Schema integration
Paper reading/discussion on Sept 2, 2021:
Clio Grows Up: From Research Prototype to Industrial Tool

Schema Mapping
• Global schema defined in terms of sources (global-schema-centric, or Global-As-View (GAV))
  – Query reformulation is easier
  – Any change in the sources needs a change in the global schema
  – Global relations cannot model any information not present in at least one source.
• Sources defined in terms of global schema (source-centric, or Local-As-View (LAV))
  – High modularity and extensibility (if the global schema is well designed, when a source changes, only its definition is affected)
  – Query reformulation is complex
  – A source can be added to the system independently of other sources.
[Figure: a query is posed over the global schema, which sits above Source-1 and Source-2.]
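
To make the GAV side concrete, the sketch below defines a global relation as a view over two sources, so that answering a query amounts to unfolding that view. It is a minimal illustration only: the two source relations, their contents, and the function names are hypothetical, and Movie(title, dir, year, genre) is borrowed from the reformulation example later in these slides.

```python
# Hypothetical source relations; the titles and the split across sources are made up.
S1 = [  # (title, dir, year)
    ("Film A", "SKapoor", 1955),
    ("Film B", "SKapoor", 1972),
    ("Film C", "Other", 1980),
]
S2 = [  # (title, genre)
    ("Film A", "Drama"),
    ("Film B", "Comedy"),
    ("Film C", "Drama"),
]


def movie():
    """GAV definition: Movie(title, dir, year, genre) as a join of S1 and S2 on title."""
    genre_by_title = dict(S2)
    return [(t, d, y, genre_by_title[t]) for (t, d, y) in S1 if t in genre_by_title]


def years_for_director(director):
    """Query over the global relation; evaluating it unfolds movie() into S1 and S2."""
    return sorted({y for (_t, d, y, _g) in movie() if d == director})


print(years_for_director("SKapoor"))
```

The cost of this simplicity is visible in `movie()`: adding or changing a source means rewriting the definition of the global relation itself.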
Example
Example taken from Dr. Subbarao Kambhampati’s lecture notes.

Global-as-View (GAV)
Local-as-View (LAV)
Reformulation in LAV: The issues
Mediated schema:
  Movie(title, dir, year, genre),
  Schedule(cinema, title, time).

Sources are “materialized views” of the virtual schema:
  Create Source S1 AS
    select * from Movie
  Create Source S3 AS
    select title, dir from Movie
  Create Source S5 AS
    select title, dir, year from Movie
    where year > 1960 AND genre = 'Comedy'

Query: Find all the years in which SKapoor released movies.
  Select year from Movie M
  where M.dir = 'SKapoor';

  Q(y) :- movie(T,D,Y,G), D=SKapoor

Which is the better plan?
  Q(y) :- S1(T,D,Y,G), D=SKapoor   (1)
  Q(y) :- S5(T,D,Y), D=SKapoor     (2)

What are we looking for?
  -- equivalence?
  -- containment?
  -- smallest plan?
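
The difference between the two plans can be checked on a small instance. This is a minimal sketch, assuming a made-up instance of the virtual Movie relation and assuming each source holds exactly the tuples its view definition selects; in practice LAV sources are often incomplete, in which case plan (1) would also only be contained in the query rather than equivalent to it.

```python
# Hypothetical instance of the virtual Movie(title, dir, year, genre) relation.
MOVIE = [
    ("Film A", "SKapoor", 1955, "Drama"),
    ("Film B", "SKapoor", 1972, "Comedy"),
    ("Film C", "SKapoor", 1980, "Drama"),
    ("Film D", "Other", 1975, "Comedy"),
]

# Sources materialized from their view definitions (assumed complete here).
S1 = list(MOVIE)
S5 = [(t, d, y) for (t, d, y, g) in MOVIE if y > 1960 and g == "Comedy"]


def query_over_movie():
    """Q(y) :- movie(T,D,Y,G), D=SKapoor"""
    return {y for (_t, d, y, _g) in MOVIE if d == "SKapoor"}


def plan_1():
    """Q(y) :- S1(T,D,Y,G), D=SKapoor"""
    return {y for (_t, d, y, _g) in S1 if d == "SKapoor"}


def plan_2():
    """Q(y) :- S5(T,D,Y), D=SKapoor"""
    return {y for (_t, d, y) in S5 if d == "SKapoor"}


print(sorted(query_over_movie()), sorted(plan_1()), sorted(plan_2()))
```

On this instance plan (1) returns every answer to the query, while plan (2) returns only the years of post-1960 comedies, so it is contained in the query but not equivalent to it.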
GAV vs. LAV
GAV
• Not modular
  – Addition of new sources changes the mediated schema
• Can be awkward to write mediated schema without loss of information
• Query reformulation easy
  – reduces to view unfolding (polynomial)
  – Can build hierarchies of mediated schemas
• Best when
  – Few, stable data sources
  – well-known to the mediator (e.g. corporate integration)
• Garlic, TSIMMIS, HERMES

LAV
• Modular: adding new sources is easy
• Very flexible: power of the entire query language available to describe sources
• Reformulation is hard
  – Involves answering queries only using views
• Best when
  – Many, relatively unknown data sources
  – possibility of addition/deletion of sources
• Information Manifold, InfoMaster, Emerac, Havasu
