
A UML Profile for Modeling Schema Mappings

Stefan Kurz, Michael Guppenberger, and Burkhard Freitag

Institute for Information Systems and Software Technology (IFIS)


University of Passau, Germany
Stefan.Kurz@uni-passau.de, Michael.Guppenberger@uni-passau.de,
Burkhard.Freitag@uni-passau.de

Abstract. When trying to obtain semantical interoperability between different information systems, the integration of heterogeneous information sources is a fundamental task. An important step within this process is the formulation of an integration mapping which specifies how to select, integrate and transform the data stored in the heterogeneous local information sources into a global data store. This integration mapping can then be used to perform the data integration itself.
In this paper, we present a UML-based approach to define integration mappings. To this end, we introduce a UML profile which can be used to map local information schemata onto one global schema, thus eliminating schema conflicts. We claim that this is the first time that the integration mapping can be specified within the UML model of the application and that this model can be used to generate a working implementation of the schema mappings using MDA transformations.

Key words: Data Integration, Schema Mapping, Model Driven Architecture (MDA), UML Profiles

1 Introduction
The integration of heterogeneous information sources is an important task towards the achievement of semantical interoperability. To perform data integration, it has to be determined how to select, integrate and transform the information stored in local data sources. During the formulation of the integration mapping, possible integration conflicts have to be recognized and eliminated, such as data-level conflicts (e.g. inconsistent attribute ranges) or schema-level conflicts (e.g. different data models). Our approach addresses schema-level conflicts concerning semantical and structural heterogeneity.
The integration of legacy information systems is usually done in four phases:

1. The local data sources to be integrated and especially their data schemata
are analysed in detail. The goal of this first step is to determine the semantics
of the data to be integrated as completely as possible.
2. The heterogeneous representations of local data (e.g. Entity-Relationship models or XML schemata) are transformed into a common global data model to overcome conflicts resulting from varying modeling concepts.
3. Further structural and semantical schema-level conflicts have to be uncov-
ered. Whilst structural conflicts can be detected directly by analyzing the
schemata, the detection of semantical conflicts is more complicated since the
discovery of the model’s semantics on the basis of a schema is only possible to
a limited extent. Basically, the various causes for schema diversity arise from
different perspectives, equivalences among constructs, and incompatible de-
sign specifications. To solve schema-level conflicts, a schema integration has
to be performed. In brief, schema integration is the activity of first find-
ing correspondences between the elements of the local schemata and next
integrating these source schemata into a global, unified schema [1].
4. Finally, the results of the third phase are used to consolidate the data stored
in the local sources in a way that the integrated data is accessible based on
the global data schema.

In this paper, we focus on the third phase. As we assume that the global
schema and the local schemata are given, we have to specify a schema mapping.
To avoid the problem of handling different modeling formalisms, we also assume that both the global and the local schemata are specified as UML models [2]. This, in fact, is no real restriction, since tools like AndroMDA's Schema2XMI [3] are able to generate UML representations from, e.g., relational data sources.
We present a newly developed UML profile providing a set of different con-
structs (as explained in section 4), which can be used to specify the integration
mapping between source and target schema. Our approach helps keeping the
model consistent and readable. Even more important, it also allows us to use new
MDA techniques to automatically generate a fully functional implementation of
the mapping, using only the UML model(s) and a set of generic transformations.
The remainder of the paper is organized as follows: section 2 gives an overview
of existing approaches to schema-level integration. Section 3 gives an outline to
the fundamental structure of a schema mapping. In section 4 we show how
these ideas have been transferred into the UML profile. To indicate the practical
applicability of the proposed profile, section 5 very briefly describes an example
we used to evaluate the profile. The paper ends with a conclusion in section 6.

2 Related Work
There exist various approaches to schema-level data integration. Most of them
use a mediator-based architecture to access and integrate data stored in het-
erogeneous information sources [4]. From the perspective of the applications, a
mediator can be seen as an overall database system which accepts queries, selects
necessary data from the different sources and, to process the queries, combines
and integrates the selected data. A mediator does not access the data sources
directly, but via a so-called wrapper [5].
An early representative of this architectural concept is the TSIMMIS project
[6]. Heterogeneous data from several sources is translated into a common object
model (Object Exchange Model, OEM) and combined to allow browsing the
information stored in the different sources. Unlike our objective, TSIMMIS only
allows the execution of a predefined set of queries (so-called query-templates).
To overcome this obvious handicap, the Garlic project [7] introduced a common global schema and allows general queries to be processed against this unified schema. Both the local schemata and the global schema are represented in a data definition language similar to ODL [8].
The MOMIS project [9] uses a schema matching approach to build a global
schema in a semi-automatic way. Again an object-oriented language derived from
ODL is used to describe heterogeneous data sources.
The Clio project [10] of IBM Research also tries to support data integration
by facilitating a semi-automatic schema mapping. However, Clio only allows the
integration of relational data sources and XML documents and therefore gener-
ates data transformation statements in SQL and XSLT/XQuery, respectively.
Apart from research projects, there are also many commercial software sys-
tems and tools available that support data integration. Most of them allow a
graphical, but proprietary definition of a mapping between data schemata. We
name Oracle’s Warehouse Builder [11] and Altova’s MapForce [12] as examples.
Obviously, it would be desirable to combine the strengths of commercial software solutions, which mainly address users' needs, with those of research projects, which pursue advanced technical developments. Our approach allows user-friendly graphical modeling of the schema mapping using the de-facto standard UML
(in contrast to Garlic and MOMIS which use other object-oriented description
languages). Furthermore, our method can be integrated into a mediator-based ar-
chitecture serving as a platform for the model-driven implementation of a schema
mapping modeled according to our proposal. With our approach, various target
platforms can be supported. In contrast to approaches tied to a specific data
model, any format of data transformation statements like SQL or Java can be
generated by our method. Also, various kinds of data sources (e.g. relational
databases, semi-structured information sources or flat files) can be integrated.
Finally, our architecture offers interfaces to external schema matching tools thus
supporting a semi-automatic schema mapping.

3 Modeling Schema Mappings

An overview of our approach is shown in fig. 1. Assume that some local, possibly
heterogeneously represented data schemata (bottom of fig. 1) are to be integrated
into a common global schema (top of fig. 1).
Assume further that the global schema already exists and is represented in
UML. First, the local schemata are re-modeled using UML (middle of fig. 1).
Afterwards, for each UML representation of a local schema a mapping onto the
global schema is defined. Based on these mappings, the necessary data access and
data integration procedures can be generated using transformation techniques
from model-driven software technology.
In the following, the local schemata are denoted as source schemata and the
global schema as target schema. In general, the objective of schema mapping is to find the correspondences between the target schema and the source schemata.

Fig. 1. Overview of our approach

Fig. 2. Sample structural schema-level conflicts

During this phase, conflicts similar to those shown in fig. 2 have to be resolved
[1]. The left part of fig. 2 illustrates a frequent structural conflict: a project as-
sociated with an employee is modeled as a designated class in one schema (1a)
and by an attribute in the other schema (1b). As another example, a structural
conflict arises because in one schema (2b) a generalization hierarchy is intro-
duced whereas the other schema (2a) simply uses different attribute values to
distinguish between different departments. The right part of fig. 2 finally shows
a structural conflict that is caused by different representations of an associa-
tion: in one schema two classes are associated directly (3a) whereas in the other
they are associated indirectly via another class (3b). Of course, semantical schema-level conflicts also have to be detected and resolved.
According to [13], we consider a mapping between two schemata as a set
of mapping elements. Each mapping element correlates specific artefacts of the
source schema to the corresponding artefacts of the target schema.
In general, a mapping element may refer to element-level schema components
like attributes or to structure-level artefacts like classes and associations. At
structure-level, we define which classes of the source schema and the target
schema correspond to each other. At element-level, we define how the structure-
level specifications work in detail, i.e., how target elements are computed from
their corresponding source elements.
Consider fig. 3 for an example: there are two schemata each basically model-
ing an employee associated with a project. We are interested in merging the two
source classes (left hand side) into the one target class (right hand side) to solve
the structural conflict as shown in fig. 2 (parts 1a and 1b). To achieve this we will
define an n:1 structure-level mapping. At element-level, source.Employee.name maps onto target.Employee.lastName and source.Project.name onto target.Employee.project, thus defining an n:m element-level mapping.
Of course, the semantics of a mapping must be defined in more detail. To this
end, a mapping element can be associated with a so-called mapping expression.
In our example above, the mapping expression could be defined by a SQL-query
like target.Employee.lastName, target.Employee.project = SELECT e.name, p.name
FROM source.Employee e, source.Project p WHERE e.project = p.id.
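The effect of such a mapping expression can be sketched as follows. This is a minimal illustration only (Python and the sample data are ours, not part of the approach); in our framework the actual data access code would be generated from the model by MDA transformations.

```python
# Hypothetical illustration of the mapping expression above: source.Employee
# and source.Project records are joined on e.project = p.id and merged into
# records of the single target class target.Employee.

source_employees = [{"id": 1, "name": "Smith", "project": 10}]
source_projects = [{"id": 10, "name": "DataIntegration"}]

def map_employees(employees, projects):
    """n:1 structure-level mapping: two source classes -> one target class."""
    projects_by_id = {p["id"]: p for p in projects}
    target = []
    for e in employees:
        p = projects_by_id[e["project"]]  # WHERE e.project = p.id
        # element-level mapping: source attributes -> target attributes
        target.append({"lastName": e["name"], "project": p["name"]})
    return target

print(map_employees(source_employees, source_projects))
# -> [{'lastName': 'Smith', 'project': 'DataIntegration'}]
```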

Fig. 3. Sample source and target schema

At first glance it seems obvious that a mapping can have any cardinality. However, for simplification we allow only 1:1 and n:1 mapping cardinalities at structure-level. This restriction guarantees that each mapping element can be related to exactly one target class, which is important for the implementation of the mapping: the functional correlation of mapping elements with target classes allows the implementation of each mapping element to be coded as an implementation of its associated target class. Furthermore, 1:n and n:m structure-level mappings can be replaced by appropriate 1:1 and n:1 mappings if needed.
In the following, we assume that a given target class originates from one or more source classes. Consequently, we call the target class that a mapping element refers to the "originated" target class. We further assume that from the set of source classes associated with a target class, one is selected as the main "originating" source class; the other source classes are regarded as "dependent" source classes. In fig. 3, for example, the class target.Employee originates from the classes source.Employee and source.Project, whereby the latter can be seen as dependent.
In the following section we introduce a profile which extends the UML meta-
model and allows for the representation of schema mappings using UML model-
ing primitives. The core constructs of our extension are the MappingElement and
MappingOperator stereotypes used to define mapping elements and associated
mapping expressions. We will use these concepts in conjunction with UML de-
pendencies to graphically specify which source schema artefacts are mapped onto
which target schema artefacts and how this schema mapping is to be performed.

4 The Profile
An overview of the stereotypes introduced by the profile can be found in fig. 4. To clarify the practical aspects of the proposed profile, we also give some simple examples. These examples are only meant to illustrate the descriptions of the profile, not to provide a detailed survey of how to model the elimination of arbitrary schema-level conflicts; all of them are based on the two simple schemata introduced in fig. 3.
Fig. 4. Overview of stereotypes introduced by the profile

Fig. 5 shows the elimination of the semantical conflict between two elements
having different names but modeling the same concept (problem of synonyms),
here the attributes name and lastName. We will give a step-by-step illustration
of how this conflict can be solved using our UML profile.

Fig. 5. Sample 1:1 element-level mapping

First, we describe the stereotypes which have been introduced to tag the
source schema and the target schema.

MappingParticipant The (abstract) stereotype MappingParticipant (see fig. 4) is used to tag classes which participate in the mapping. As a mapping is always defined between classes which are tagged with the stereotypes DataSource or DataTarget, MappingParticipant is only used implicitly as a generalization of the stereotypes DataSource and DataTarget.

DataSource and DataTarget The stereotypes DataSource and DataTarget are used to tag the source and target classes participating in the mapping. In our running example, we tag the source class source.Employee as DataSource and the target class target.Employee as DataTarget (cf. fig. 5).

DataDefinition According to the principles of object-oriented software design, a class is commonly implemented against interfaces. Especially in the case of a DataTarget, we assume that among these interfaces one is available which specifies the methods needed to access the DataTarget. The DataDefinition stereotype is used to tag this particular interface. When implementing the mapping, this means that the tagged interface of a target class remains unchanged, whereas the implementation of the target class can be replaced according to the specified mapping definitions.
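The role of such a tagged interface can be sketched as follows. The interface and method names below are our own hypothetical illustration (in practice the implementing class would be generated from the modeled mapping):

```python
from abc import ABC, abstractmethod

# Hypothetical DataDefinition interface for the DataTarget target.Employee:
# it fixes the access methods, while the implementing class may be replaced
# by a generated one that realizes the modeled schema mapping.
class EmployeeDefinition(ABC):
    @abstractmethod
    def get_last_name(self) -> str: ...

    @abstractmethod
    def get_project(self) -> str: ...

# A sketch of a generated implementation backed by the source schema.
class MappedEmployee(EmployeeDefinition):
    def __init__(self, source_employee, source_project):
        self._e, self._p = source_employee, source_project

    def get_last_name(self) -> str:
        return self._e["name"]  # source.Employee.name -> lastName

    def get_project(self) -> str:
        return self._p["name"]  # source.Project.name -> project

emp = MappedEmployee({"name": "Smith"}, {"name": "DataIntegration"})
print(emp.get_last_name(), emp.get_project())  # Smith DataIntegration
```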
We will now explain how to specify a mapping between these two schemata.

MappingElement This stereotype is used to tag a class defining the association of originating DataSources with an originated DataTarget. In our running example, we introduce the MappingElement EmployeeMapping to relate the DataSource source.Employee to the DataTarget target.Employee (cf. fig. 5).

Originate To specify the structure-level associations of a MappingElement, we use dependencies tagged with the stereotype originate. The already mentioned restriction, i.e., that at structure-level we allow only 1:1 and n:1 mapping cardinalities, is checked by appropriate OCL constraints.

Map The stereotype map is used to tag dependencies which specify the element-level relationships of a MappingElement. In our example (cf. fig. 5), two linked map-dependencies define the attribute name of the DataSource source.Employee to be mapped onto the attribute lastName of the DataTarget target.Employee. (The stereotype link is used to tag the attributes of a MappingElement which act as "connectors" between map-dependencies and additionally relate map-dependencies to a MappingElement.)
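Conceptually, a set of map-dependencies amounts to a list of attribute correspondences applied to source records. The following is a rough sketch under our own hypothetical representation (dotted attribute paths as strings), not the code generated by the framework:

```python
# Hypothetical flat representation of the 1:1 element-level mapping of
# fig. 5: each map-dependency pairs a source attribute with a target
# attribute.
element_mappings = [("source.Employee.name", "target.Employee.lastName")]

def apply_element_mappings(source_record, mappings):
    """Copy source attribute values to their mapped target attributes."""
    target_record = {}
    for src, tgt in mappings:
        src_attr = src.rsplit(".", 1)[1]  # e.g. "name"
        tgt_attr = tgt.rsplit(".", 1)[1]  # e.g. "lastName"
        target_record[tgt_attr] = source_record[src_attr]
    return target_record

print(apply_element_mappings({"name": "Smith"}, element_mappings))
# -> {'lastName': 'Smith'}
```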

MappingOperator The stereotype MappingOperator tags classes defining functions that can be used to specify the mapping expression of a MappingElement. This way it is possible to define more complex relationships between a DataTarget and its corresponding DataSources. Note that a class tagged MappingOperator merely defines a function; the mapping itself must be modeled using instances of a MappingOperator class.

Fig. 6. Sample 1:n element-level mapping

Fig. 6 illustrates how even more complicated semantical conflicts can be resolved. As an example, consider the conflict of relating the attribute name to the attributes lastName and firstName. We use the instance splitName of the MappingOperator StringSplitOperator that associates the attribute name of the
DataSource with the attributes lastName and firstName of the DataTarget using
a blank as separator. These input/output parameters of the MappingOperator
are defined by map-dependencies in conjunction with appropriate tagged values.
For example, the source data "John Smith" could be transformed into the target data "John" and "Smith".
The implementation of the MappingOperator, here the class StringSplitOperator, can be provided by the user. This offers a very flexible and simple method to define complex mappings by introducing new mapping operators.
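A user-provided implementation of such an operator might look as follows. The class name StringSplitOperator is taken from our example; the apply method and its signature are our own assumption, since the paper does not fix the operator's call interface:

```python
# Sketch of a user-provided mapping operator: a StringSplitOperator
# instance splits one source attribute value into several target attribute
# values using a configurable separator (1:n element-level mapping).
class StringSplitOperator:
    def __init__(self, separator=" "):
        self.separator = separator

    def apply(self, value):
        """Split one input value into a list of output values."""
        return value.split(self.separator)

# The instance 'splitName' of fig. 6, configured with a blank as separator:
split_name = StringSplitOperator(separator=" ")
first, last = split_name.apply("John Smith")
print(first, last)  # John Smith
```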

Respect The stereotype respect is mainly used to tag dependencies relating dependent DataSources to their originating DataSource. Fig. 7 shows a respect-dependency indicating how to navigate from the originating DataSource source.Employee to the dependent DataSource source.Project (see also fig. 3).

Fig. 7. Sample n:m element-level mapping (n:1 structure-level mapping)

Fig. 8 illustrates how to resolve the structural conflict of fig. 2 (parts 2a,b).
A respect-dependency with an appropriate tagged value specifies that each time the value of the source attribute description is "development", the DataTarget DevelopmentDepartment is instantiated.

Fig. 8. Mapping concerning generalization hierarchy
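The selection logic expressed by such a tagged respect-dependency can be sketched as follows. Apart from DevelopmentDepartment (fig. 8), the class names and the dispatch table are our own hypothetical illustration:

```python
# Sketch of the conditional instantiation expressed by a tagged
# respect-dependency: the value of the source attribute 'description'
# selects which DataTarget subclass of the generalization hierarchy
# is created, resolving the conflict of fig. 2 (parts 2a/2b).
class Department: ...
class DevelopmentDepartment(Department): ...
class MarketingDepartment(Department): ...

# Hypothetical discriminator table derived from the tagged values.
TARGET_BY_DESCRIPTION = {
    "development": DevelopmentDepartment,
    "marketing": MarketingDepartment,
}

def instantiate_target(source_record):
    """Instantiate the DataTarget subclass selected by the attribute value."""
    cls = TARGET_BY_DESCRIPTION.get(source_record["description"], Department)
    return cls()

obj = instantiate_target({"description": "development"})
print(type(obj).__name__)  # DevelopmentDepartment
```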

The structural conflict shown in fig. 2 (parts 3a,b) can be resolved similarly.
5 The Profile in Practice
To evaluate the practical applicability of the proposed profile, we defined a map-
ping between two realistic heterogeneous schemata. The sample schemata cover
almost all of the structural and semantical schema-level conflicts listed in [1] and
[13], in particular the structural conflicts of fig. 2.
To define the mapping between the source schema (consisting of four classes)
and the target schema (containing four classes with a generalization hierarchy),
four MappingElements and one MappingOperator had to be used. Although
the profile proved to be suitable for real-life integration scenarios, it became obvious that a complete integration mapping easily becomes complex (which is also the reason why we do not show the complete mapping here). However, such an integration mapping can be decomposed into several smaller parts, which considerably increases readability and understandability.

6 Conclusion and Summary


The results of our work make it possible to specify mappings for the integration of
heterogeneous data sources directly within the UML model(s) of the application
in a user-friendly graphical and standardized way.
By using UML, we are able to apply the MDA approach [14] to generate code
from our models implementing the modeled schema mappings. So it is possible
to generate code which defines data transformation statements that allow us to
access integrated local data according to a global schema. Furthermore, as our models are independent of any implementation details (which is one of the core concepts of MDA), we are also able to generate code that satisfies the needs of any target platform. To support this claim, we also developed a mediator-based
architecture which can be seen as a framework to execute generated code from
UML models built according to our profile, thus allowing homogeneous access
to the integrated data sources [15]. The code generation itself is done by an
AndroMDA cartridge [16]. A number of problems (e.g. the handling of associations and generalization hierarchies), whose explanation is beyond the scope of this paper, are also solved by our framework, proving that the UML profile we proposed in this paper is applicable.
However, there are still several open issues: Currently, we are working on
extending our framework by integrating a schema matching tool which proposes
initial mapping elements. This would help the user to understand the schemata
which have to be mapped and would support modeling (larger) mappings. Fur-
thermore, we intend to transform the mapping specification into the native query
languages of the integrated data sources (SQL, XQuery, ...) to gain efficiency.
Finally, one of the most important advantages of our approach is that it is
not limited to the integration of data sources, but can also be used to specify
operations on the data in a unified way. This can, for instance, be used to
specify semantical notification information as proposed in [17] already on the
high level of the integrated schema instead of having to use different rules for
every integrated source.
Altogether, the bottom line is that our approach provides a standardized and
adequate means to not only integrate data sources, but to specify integration
mappings that can be used for a variety of requirements whenever information
systems have to deal with several legacy data sources.

References
1. Batini, C., Lenzerini, M., Navathe, S.B.: A Comparative Analysis of Methodologies
for Database Schema Integration. ACM Computing Surveys 18(4) (1986) 323–364
2. The Object Management Group: UML 1.4.2 Specification. http://www.omg.org/
cgi-bin/doc?formal/04-07-02 (last access: 05/2006)
3. AndroMDA: Schema2XMI Generator. http://team.andromda.org/docs/
andromda-schema2xmi/ (last access: 05/2006)
4. Wiederhold, G.: Mediators in the Architecture of Future Information Systems.
Computer, IEEE Computer Society Press 25(3) (1992) 38–49
5. Roth, M.T., Schwarz, P.M.: Don’t Scrap It, Wrap It! A Wrapper Architecture for
Legacy Data Sources. Proceedings of the 23rd International Conference on Very
Large Data Bases (1997) 266–275
6. Chawathe, S., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J.D.,
Widom, J., García-Molina, H.: The TSIMMIS Project: Integration of Heteroge-
neous Information Sources. 16th Meeting of the Information Processing Society of
Japan (1994) 7–18
7. Haas, L.M., Miller, R.J., Niswonger, B., Roth, M.T., Schwarz, P.M., Wimmers,
E.L.: Transforming Heterogeneous Data with Database Middleware: Beyond Inte-
gration. IEEE Data Engineering Bulletin 22(1) (1999) 31–36
8. Berler, M., Eastman, J., Jordan, D., Russell, C., Schadow, O., Stanienda, T., Velez,
F.: The Object Data Standard: ODMG 3.0. Morgan Kaufmann (2000)
9. Bergamaschi, S., Castano, S., Vincini, M., Beneventano, D.: Semantic Integration
of Heterogeneous Information Sources. Data & Knowl. Eng. 36(3) (2001) 215–249
10. Miller, R.J., Hernández, M.A., Haas, L.M., Yan, L., Ho, C.T.H., Fagin, R., Popa,
L.: The Clio Project: Managing Heterogeneity. SIGMOD Record (ACM Special
Interest Group on Management of Data) 30(1) (2001) 78–83
11. Oracle: Integrated ETL and Modeling. White Paper, http://www.oracle.com/
technology/products/warehouse/pdf/OWB_WhitePaper.pdf (2003)
12. Altova: Data Integration: Opportunities, challenges, and MapForce. White Paper,
http://www.altova.com/whitepapers/mapforce.pdf (last access: 05/2006)
13. Rahm, E., Bernstein, P.A.: A Survey of Approaches to Automatic Schema Match-
ing. VLDB Journal: Very Large Data Bases 10(4) (2001) 334–350
14. Kleppe, A., Warmer, J., Bast, W.: MDA Explained. The Model Driven Architec-
ture: Practice and Promise. Addison-Wesley Longman (2003)
15. Kurz, S.: Entwicklung einer Architektur zur Integration heterogener Datenbe-
stände. Diploma thesis, University of Passau; in German (2006)
16. AndroMDA: Model Driven Architecture Framework. http://www.andromda.org/
(last access: 05/2006)
17. Guppenberger, M., Freitag, B.: Intelligent Creation of Notification Events in In-
formation Systems - Concept, Implementation and Evaluation. In A. Chowdhury
et al., ed.: Proceedings of the 14th ACM International Conference on Information
and Knowledge Management (CIKM), ACM, ACM Press (2005) 52–59
