Information Integration: Existing Methods and Solutions
Integration Architecture
[Figure: integration architecture. Labels: Schema Matching Methods; Schema Mapping Methods; Mediated Schema or Data Warehouse; Query reformulation / Query over materialized data; Source descriptions / Transforms; sources RDBMS 1, RDBMS 2, HTML1, XML1]
II Architecture: A Data Warehousing Approach
II Architecture: Virtualization Layer Approach
[Figure labels: Query; Global Schema]
The Schema Matching Problem
Given two input schemas in any data model and, optionally, auxiliary
information and an input-mapping, compute a mapping between
schema elements of the two input schemas that passes user validation.
Source-1
Books
  ISBN char(15) key
  Title varchar(100)
  Author varchar(50)
  MarkedPrice float

Source-2
BookInfo
  ID char(15) Primary Key
  AuthorID integer references AuthorInfo
  BookTitle varchar(150)
  ListPrice float
  DiscountPrice float
AuthorInfo
  AuthorID integer key
  LastName varchar(25)
  FirstName varchar(25)
Inputs to Matching Technique
• Attribute names
  ◦ Synonyms: Code = Id = Num = No
  ◦ Acronyms: PO = Purchase Order
• Data instances
  ◦ Attributes match if they have similar instances or value distributions
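The instance-based input can be illustrated with a minimal sketch that scores attribute pairs by the overlap of their values. The column values, the Jaccard measure, and the 0.5 threshold are assumptions made for illustration; the slides do not prescribe a particular measure.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_by_instances(source1, source2, threshold=0.5):
    """Return (attr1, attr2, score) for attribute pairs with enough value overlap."""
    matches = []
    for name1, values1 in source1.items():
        for name2, values2 in source2.items():
            score = jaccard(values1, values2)
            if score >= threshold:
                matches.append((name1, name2, score))
    # Highest-scoring candidates first; ties keep their discovery order.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical sample instances (made up for this sketch).
books = {
    "ISBN": ["B001", "B002", "B003"],
    "Title": ["Book A", "Book B", "Book C"],
}
book_info = {
    "ID": ["B001", "B002", "B004"],
    "BookTitle": ["Book A", "Book B", "Book D"],
}

candidates = match_by_instances(books, book_info)
for attr1, attr2, score in candidates:
    print(attr1, "~", attr2, score)
```

Value overlap only works when the sources share instances; comparing value distributions (lengths, types, ranges) is the usual fallback when they do not.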
Schema-based hybrid matching algorithm
Combines multiple matching approaches that use only schema information (no instances)
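[Figure: two purchase-order schemas to be matched; elements listed as positioned side by side in the figure]
PO | PurchaseOrder
POLines | Items
POShipTo | DeliverTo
Item | Item
Line | ItemNumber
Qty | Quantity
UoM | UnitOfMeasure
Address-related elements: Name, Address, City, Street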
Linguistic Matching
• Tokenization of names
  ◦ PurchaseOrder → purchase + order
• Expansion of acronyms
  ◦ UOM → unit + of + measure
• Linguistic similarity
  ◦ Pair-wise comparison of elements that belong to the same cluster
  ◦ Token similarity = f(string matching, synonyms score)
  ◦ Token set similarity = average (best matching token similarity)
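The linguistic matching steps can be sketched as follows. This is a minimal illustration, not the original algorithm's implementation: the abbreviation and synonym tables are hypothetical, f is assumed to be the maximum of a string-matching score and a synonym score, and the clustering step is omitted, so every pair of names can be compared.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables, assumed for this sketch only.
ABBREVIATIONS = {
    "po": ["purchase", "order"],
    "uom": ["unit", "of", "measure"],
    "qty": ["quantity"],
}
SYNONYMS = [{"ship", "deliver"}, {"code", "id", "num", "no"}]

# Splits CamelCase names: runs of capitals, capitalized words, digits.
TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(name):
    """Split an element name into lowercase tokens and expand abbreviations."""
    if name.lower() in ABBREVIATIONS:  # the whole name is an abbreviation, e.g. UoM
        return list(ABBREVIATIONS[name.lower()])
    tokens = []
    for raw in TOKEN_RE.findall(name):
        token = raw.lower()
        tokens.extend(ABBREVIATIONS.get(token, [token]))
    return tokens


def token_similarity(t1, t2):
    """f(string matching, synonyms score); here the larger of the two scores."""
    string_score = SequenceMatcher(None, t1, t2).ratio()
    synonym_score = 1.0 if any(t1 in s and t2 in s for s in SYNONYMS) else 0.0
    return max(string_score, synonym_score)


def token_set_similarity(tokens1, tokens2):
    """Average, over the tokens of both names, of the best match on the other side."""
    if not tokens1 or not tokens2:
        return 0.0
    best = [max(token_similarity(a, b) for b in tokens2) for a in tokens1]
    best += [max(token_similarity(b, a) for a in tokens1) for b in tokens2]
    return sum(best) / len(best)


def name_similarity(name1, name2):
    """Linguistic similarity of two element names."""
    return token_set_similarity(tokenize(name1), tokenize(name2))


print(tokenize("PurchaseOrder"))                  # ['purchase', 'order']
print(name_similarity("UoM", "UnitOfMeasure"))    # 1.0
print(name_similarity("POShipTo", "DeliverTo"))   # between 0 and 1
```

Averaging the best matches in both directions keeps the score symmetric and penalizes names that have extra, unmatched tokens, such as POShipTo against DeliverTo.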
[Figure: example sources allcars.com and craigslist auto]
[He+, SIGMOD’03]: Build mediated schema for a domain by clustering elements in
multiple schemas
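The idea of clustering elements from several source schemas can be sketched with a simple greedy, single-pass clustering. This is not the algorithm of [He+, SIGMOD’03]; the attribute lists, the synonym table, the token-overlap similarity, and the threshold are all hypothetical choices made for illustration.

```python
# Hypothetical attribute lists; the slides do not give the sources' actual schemas.
SOURCES = {
    "allcars.com": ["make", "model", "price", "year"],
    "craigslist auto": ["manufacturer", "model", "asking price", "year"],
}
# Hypothetical synonym table mapping a token to a canonical token.
CANONICAL = {"manufacturer": "make"}


def tokens(attr):
    """Lowercase, whitespace-split tokens with synonyms canonicalized."""
    return {CANONICAL.get(t, t) for t in attr.lower().split()}


def similarity(attr1, attr2):
    """Jaccard overlap of the two attributes' token sets."""
    a, b = tokens(attr1), tokens(attr2)
    return len(a & b) / len(a | b)


def cluster_attributes(sources, threshold=0.5):
    """Add each attribute to the first cluster holding a similar member, else start a new one."""
    clusters = []  # each cluster is a list of (source, attribute) pairs
    for source, attrs in sources.items():
        for attr in attrs:
            for cluster in clusters:
                if any(similarity(attr, member) >= threshold for _, member in cluster):
                    cluster.append((source, attr))
                    break
            else:
                clusters.append([(source, attr)])
    return clusters


clusters = cluster_attributes(SOURCES)
# One mediated-schema attribute per cluster, named after its first member.
mediated_schema = [cluster[0][1] for cluster in clusters]
print(mediated_schema)  # ['make', 'model', 'price', 'year']
```

Each cluster also records which source attribute maps to the mediated attribute, which is the information a later mapping step would need.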
where M.dir="SKapoor";
Create Source S5 AS
select title, dir, year from Movie
where year > 1960 AND genre="Comedy"
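One way to read the S5 description is as a view over a mediated Movie relation: the source holds the title, director, and year of comedies made after 1960. The sketch below emulates that reading with an in-memory SQLite database. The Movie rows are invented for illustration, and a real integration system would use the description to reformulate queries rather than materialize a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (title TEXT, dir TEXT, year INTEGER, genre TEXT)")

# Hypothetical rows, invented for this sketch.
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?, ?)",
    [
        ("Film A", "SKapoor", 1975, "Comedy"),
        ("Film B", "SKapoor", 1955, "Comedy"),  # too old for S5
        ("Film C", "SKapoor", 1980, "Drama"),   # wrong genre for S5
        ("Film D", "Other", 1990, "Comedy"),
    ],
)

# The source description, expressed as a view over Movie.
conn.execute(
    """
    CREATE VIEW S5 AS
    SELECT title, dir, year FROM Movie
    WHERE year > 1960 AND genre = 'Comedy'
    """
)

# A query for one director, answered using only what S5 exposes.
rows = conn.execute(
    "SELECT title FROM S5 WHERE dir = ? ORDER BY title", ("SKapoor",)
).fetchall()
print(rows)  # [('Film A',)]
```

The answer from S5 alone is incomplete with respect to Movie: Film B and Film C are by the same director but fall outside the source's description, which is why a mediator typically combines several sources.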