Semantic Mapping in Data Integration


Baladevi C

Dept. of Computer Science & Engineering

Amrita School of Engineering
Amrita Vishwa Vidyapeetham

28 Feb. 2017

1 Introduction to Data Integration Systems

2 Background
3 Existing Data Integration Methods
4 Algorithms
5 Conclusion

Top IT Spending Priorities

Real World Applications
Business The Web


Web: Hundreds of millions of
high quality tables on the
Pretty much everywhere

Integration in data management: Evolution

Centralized system with three-tier architecture

“Implicit” integration: integration supported by the Database
Management System (DBMS), i.e., the data manager

Integration in data management: Evolution

Centralized system with three-tier architecture and multiple

Application-hidden integration: integration “embedded” within

Problems in integrating DBs

Many different types of heterogeneity arise among the DBs to be used together:
Different platforms: technological heterogeneity
Different query languages: language heterogeneity
Different data schemas: schema (semantic) heterogeneity
Errors in data that result in different values for the same information: instance heterogeneity

Integration in data management: Evolution

Centralized system with four-tier architecture and multiple, distributed stores
(Centralized) data integration: the global schema is mapped
to the different data sources, which are heterogeneous,
distributed and autonomous
Data Integration[5]

What is Data Integration?

Data integration is the problem of providing a unified and transparent
view of a collection of data stored in multiple, autonomous, and
heterogeneous data sources.
What is the importance of data integration?
Data integration becomes increasingly important in cases of
merging the systems of two companies
consolidating applications within one company to provide a unified
view of the company’s data

Data Integration System

Examples of Heterogeneity

Title              Author  Year
DBMS               Alice   ’70
Computer Networks  Bob     ’75
Source 1: Details of books published before 1980

Name               BAuthor  PYear
DBMS               Alice    1970
Computer Networks  Bob      1975
Source 2: Details of books published before 1980

Formal framework for data integration

A data integration system I is a triple ⟨G, S, M⟩, where
G is the global(mediated) schema
S is the source schema
M is the mapping between S and G

Data integration Architecture[2]



A simple Example


Mediated Schema
Movie: Title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

select title, startTime
from Movie, Plays
where ... AND
location = 'New York' AND ...

Challenges in DIS

1 Design of the mediated schema
Data sources might have different schemas and might export data in different formats.
2 Translation of queries over the mediated schema into queries over the source schemas
3 Query optimization
No or limited statistics are available about the data sources
4 Incomplete data sources
Data at any source might be partial, overlap with other sources, or even conflict
Do we query all the data sources? Or just a few? How many? In what order?
5 ...

A logical view definition provides the mapping between the global (mediated) schema and the local schemas.
Two basic approaches
GAV (Global As View)
LAV (Local As View)
Mapping means determining which real data (in the data sources)
correspond to which virtual data (in the mediated schema)

Global As View

Mediated schema as a view over the local schema.

Source Schema
DB1(id, title, actor, year)
DB2(id, title, actor, year)
DB3(id, review)
Mediated Schema
MovieActor(title, actor)
MovieReview(title, review)
View that provides the mapping between the global schema and the sources
create view MovieActor as
select title, actor from S1.DB1
union
select title, actor from S2.DB2;
Difficult to add new sources. All existing view definitions might be

Local as View

Local schema as a view over the mediated schema.

Source Schema
DB1(id, title, actor, year)
DB2(id, title, actor, year)
DB3(id, review)
Mediated Schema
MovieActor(title, actor)
MovieReview(title, review)
Views that provide the mapping between the global schema and the sources
create view S1.DB1 as
select * from MovieActor;
create view S2.DB2 as
select * from MovieActor;
create view S3.DB3 as
select * from MovieReview;
Query reformulation is harder in LAV.

Query Answering/Rewriting in GAV

Find reviews for movies starring Bob
Query over Mediated Schema
q(title, review) :- MovieActor(title, 'Bob'), MovieReview(title, review)
Reformulated Query
q(title, review) :- DB1(id, title, 'Bob', year), DB3(id, review)
q(title, review) :- DB1(id, title, 'Bob', year), DB2(id, title, 'Bob', year),
DB3(id, review)
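A minimal sketch of GAV query answering, using Python's built-in sqlite3 module: the mediated relations are declared as views over the sources, so a query on the mediated schema is answered by unfolding those views. The sample rows are hypothetical. The definition of MovieReview as a join of DB1 and DB3 on id is an assumption suggested by the reformulated query above, since the slides do not spell it out. Table names drop the S1/S2/S3 prefixes for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Source relations with hypothetical sample rows.
create table DB1(id integer, title text, actor text, year integer);
create table DB2(id integer, title text, actor text, year integer);
create table DB3(id integer, review text);

insert into DB1 values (1, 'Movie A', 'Bob', 1999);
insert into DB1 values (2, 'Movie B', 'Alice', 2001);
insert into DB2 values (7, 'Movie C', 'Bob', 2005);
insert into DB3 values (1, 'good');
insert into DB3 values (2, 'average');

-- GAV: each mediated relation is a view over the source relations.
create view MovieActor as
    select title, actor from DB1
    union
    select title, actor from DB2;

-- Assumed definition: DB3 reviews are keyed by DB1 ids.
create view MovieReview as
    select DB1.title as title, DB3.review as review
    from DB1 join DB3 on DB1.id = DB3.id;
""")

# Query posed on the mediated schema: reviews for movies starring Bob.
rows = cur.execute("""
    select a.title, r.review
    from MovieActor a join MovieReview r on a.title = r.title
    where a.actor = 'Bob'
    order by a.title
""").fetchall()
print(rows)
```

The database engine performs the unfolding here; a mediator would do the same substitution itself before shipping subqueries to the sources.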

Bucket Algorithm[2]

The goal of the bucket algorithm is to reformulate a user query posed on a
mediated (virtual) schema into a query that refers directly to the available data sources.
The bucket algorithm returns the maximally contained
rewriting of the query using the views.
Bucket Algorithm
1 The algorithm creates a bucket for each subgoal in the query Q; the
bucket contains the views (data sources) that are relevant to
answering that particular subgoal.
2 The algorithm then combines one view from each bucket to produce a
maximally contained rewriting of the query using the views,
not an equivalent rewriting.
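The first step can be sketched in Python as follows. This is a simplified illustration, not the full algorithm from [2]: a view is placed in a subgoal's bucket if it has a subgoal over the same relation, every head variable of the query maps to a head variable of the view, and the interval constraints on the mapped variables do not contradict each other. The views and query are those of the course-registration example in [2]; writing V2 with an Enrolled subgoal and V4 with a Course subgoal is an assumption needed to bind their head variables. The dictionary layout and function names are hypothetical.

```python
import math

INF = math.inf

# Each rule: head variables, body atoms (predicate, args), interval bounds.
views = {
    "V1": {"head": ("student", "number", "year"),
           "body": [("Registered", ("student", "course", "year")),
                    ("Course", ("course", "number"))],
           "bounds": {"number": (500, INF), "year": (1992, INF)}},
    "V2": {"head": ("student", "dept", "course"),
           "body": [("Registered", ("student", "course", "year")),
                    ("Enrolled", ("student", "dept"))],   # assumed subgoal
           "bounds": {}},
    "V3": {"head": ("student", "course"),
           "body": [("Registered", ("student", "course", "year"))],
           "bounds": {"year": (-INF, 1990)}},
    "V4": {"head": ("student", "course", "number"),
           "body": [("Registered", ("student", "course", "year")),
                    ("Course", ("course", "number")),     # assumed subgoal
                    ("Enrolled", ("student", "dept"))],
           "bounds": {"number": (-INF, 100)}},
}

query = {"head": ("S", "D"),
         "body": [("Enrolled", ("S", "D")),
                  ("Registered", ("S", "C", "Y")),
                  ("Course", ("C", "N"))],
         "bounds": {"N": (300, INF), "Y": (1995, INF)}}


def compatible(a, b):
    """True if the closed intervals a and b intersect."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def relevant(subgoal, view, query):
    """Simplified relevance test of one view for one query subgoal."""
    pred, q_args = subgoal
    everything = (-INF, INF)
    for v_pred, v_args in view["body"]:
        if v_pred != pred:
            continue
        pairs = list(zip(q_args, v_args))
        # Query head variables must be exported by the view.
        heads_ok = all(v in view["head"]
                       for q, v in pairs if q in query["head"])
        # Constraints on mapped variables must be jointly satisfiable.
        bounds_ok = all(compatible(query["bounds"].get(q, everything),
                                   view["bounds"].get(v, everything))
                        for q, v in pairs)
        if heads_ok and bounds_ok:
            return True
    return False


def make_buckets(query, views):
    return {subgoal: [name for name, view in views.items()
                      if relevant(subgoal, view, query)]
            for subgoal in query["body"]}


buckets = make_buckets(query, views)
for (pred, args), names in buckets.items():
    print(f"{pred}({', '.join(args)}): {names}")
```

Under this simplified test, V3 is excluded from the Registered bucket because year ≤ 1990 contradicts Y ≥ 1995, and V4 is excluded from the Course bucket because number ≤ 100 contradicts N ≥ 300.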

An Example for Bucket algorithm[2]

Mediated Schema:
Enrolled(student, dept)
Registered(student, course, year)
Course(course, number)
View of Data Sources:
V1(student,number,year) :- Registered(student,course,year),
Course(course,number), number ≥ 500, year ≥ 1992.
V2(student,dept,course) :- Registered(student,course,year),
Enrolled(student,dept).
V3(student,course) :- Registered(student,course,year), year ≤ 1990.
V4(student,course,number) :- Registered(student,course,year),
Course(course,number), Enrolled(student,dept), number ≤ 100.

An Example for Bucket algorithm[2]
S → student, D → dept, Y → year, C → course, N → number

Query is:
q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N), N ≥ 300, Y ≥ 1995.
Bucket Formed:

QR Decomposition[4]

QR decomposition is an efficient method for finding the independent attributes
that can represent the data in its proper substructure
form without losing its semantics.
It decomposes a matrix A into a
product A = QR of an orthogonal matrix Q and an upper
triangular matrix R.
It is essential to remove the redundant data and bring
only the significant data to the forefront.

A Simple Example

The house location attribute may or may not determine living rooms.
This can be established by checking whether one vector (house location)
is orthogonal to the other vector (living rooms).
QR decomposition is a technique to establish this fact.
QR decomposition with column pivoting is used to distinguish between the
independent and dependent attributes.
The next objective is to provide an integrated view of the
heterogeneous data sources with the help of a knowledge base.
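The idea of separating independent from dependent attributes can be sketched in pure Python. The code below is a stand-in, not the method of [4]: it uses a greedy, column-pivoted Gram-Schmidt orthogonalization rather than the Householder-based pivoted QR found in numerical libraries, and the attribute names and values are hypothetical. A column whose residual norm drops to (numerically) zero lies in the span of the columns already chosen and is reported as dependent.

```python
import math


def split_attributes(columns, tol=1e-9):
    """Greedy column-pivoted Gram-Schmidt.

    columns: dict mapping attribute name -> list of numeric values.
    Returns (independent, dependent) lists of attribute names.
    """
    residual = {name: [float(x) for x in col] for name, col in columns.items()}
    scale = max(math.sqrt(sum(x * x for x in col)) for col in residual.values())
    independent = []
    while residual:
        # Pivot: choose the column with the largest remaining norm.
        name = max(residual, key=lambda n: sum(x * x for x in residual[n]))
        col = residual.pop(name)
        norm = math.sqrt(sum(x * x for x in col))
        if norm <= tol * scale:
            # All remaining columns lie in the span of the chosen ones.
            return independent, [name] + list(residual)
        q = [x / norm for x in col]
        independent.append(name)
        # Remove the component along q from every remaining column.
        for other in list(residual):
            vec = residual[other]
            r = sum(a * b for a, b in zip(q, vec))
            residual[other] = [b - r * a for a, b in zip(q, vec)]
    return independent, []


# Hypothetical housing data: total_rooms = living_rooms + bedrooms.
houses = {
    "location_code": [1, 2, 3, 1, 2],
    "living_rooms":  [1, 2, 1, 3, 2],
    "bedrooms":      [2, 3, 3, 4, 1],
    "total_rooms":   [3, 5, 4, 7, 3],
}

independent, dependent = split_attributes(houses)
print("independent:", independent)
print("dependent:", dependent)
```

Because total_rooms is an exact linear combination of two other columns, one attribute of that group ends up in the dependent list; which one depends on the pivot order.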

Frequency based Coverage Statistics Mining[3]

StatMiner, a statistics mining module, estimates the coverage and overlap statistics.
Ranking of Sources
Ranks all sources in descending order of P(S|Q).
P(S|Q) is the coverage of source S with respect to the given query Q.
Queries are grouped into query classes using the classificatory attributes and
their corresponding values.
The query list keeps the frequency of each class.

Frequency based Coverage Statistics Mining[3]

Probability of a query Q posed to the mediator:
$P(Q) = FR_Q / FR$, where
$FR_Q$ is the access frequency of query Q,
$FR$ is the total frequency of all queries in Qlist.
Probability that a random query posed to the mediator is subsumed
by the class C:
$P_{map}(C) = \sum_{Q \text{ is mapped to } C} P(Q)$
Probability that a random query belonging to the class C is present
in a set of sources $S'$:
$P(S' \mid C) = \frac{\sum_{Q \in C} P(S' \mid Q) \, P(Q)}{P(C)}$
This is the overlap statistics w.r.t. a query class C.
If a query overlaps multiple sources, then the class-source
association rule C → S gives the rank of the sources.
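A small numeric sketch of these statistics in Python. All query names, class assignments, frequencies, and per-query coverage values are hypothetical, and each source is treated individually rather than as a set of sources. P(C) is taken to be the sum of P(Q) over the queries in C, which is an assumption, since the slides do not define it separately.

```python
from collections import defaultdict

# Hypothetical query list: query -> (class it maps to, access frequency).
qlist = {
    "q1": ("C1", 6),
    "q2": ("C1", 2),
    "q3": ("C2", 2),
}

# Hypothetical per-query coverage P(S|Q) of each source.
coverage = {
    "q1": {"S1": 0.9, "S2": 0.3},
    "q2": {"S1": 0.1, "S2": 0.7},
    "q3": {"S1": 0.5, "S2": 0.5},
}

# P(Q) = FR_Q / FR
total = sum(freq for _, freq in qlist.values())
p_q = {q: freq / total for q, (_, freq) in qlist.items()}

# Assumed: P(C) = sum of P(Q) over the queries mapped to class C.
p_class = defaultdict(float)
for q, (cls, _) in qlist.items():
    p_class[cls] += p_q[q]


def class_coverage(cls):
    """P(S|C) = sum over Q in C of P(S|Q) * P(Q) / P(C), per source S."""
    result = defaultdict(float)
    for q, (c, _) in qlist.items():
        if c == cls:
            for source, cov in coverage[q].items():
                result[source] += cov * p_q[q] / p_class[cls]
    return dict(result)


def rank_sources(cls):
    """Sources in descending order of their coverage for the class."""
    stats = class_coverage(cls)
    return sorted(stats, key=stats.get, reverse=True)


stats_c1 = class_coverage("C1")
ranking_c1 = rank_sources("C1")
print(stats_c1, ranking_c1)
```

Frequent queries dominate the class statistics: q1 is asked three times as often as q2, so its coverage values weigh three times as much in the estimate for class C1.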

Problem Definition

Our objective is to form an approximate view of all the data sources
at the global level, in order to reduce the storage requirement at
that level and to retrieve data efficiently.

A first Approximation

Same data model

Adoption of a global schema
The global schema will provide a reconciled
view of the data sources

