Inferring XML Schema Definitions From XML Data

Geert Jan Bex

Frank Neven
Stijn Vansummeren
Paper presentation
How can we automatically generate schemas from XML documents?
Overview
●Problem
●Solution
●Related work
●Background
●Contributions
➔ iLocal algorithm
➔ Reduce algorithm
➔ iXSD
●Experimental evaluation
Problem
XML
<library>
  <borrowed>
    <person>
      <name/><tel/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <stock>
    <book>
      <id/> <author/>
      <nbBooks/> <bookshelf/>
    </book>
  </stock>
</library>
DTD
<!ELEMENT library (borrowed*,stock+)>
<!ELEMENT borrowed (person,book+)>
<!ELEMENT stock (book)+>
<!ELEMENT person (name,tel+,email?)>
<!ELEMENT book (id,author,nbBooks?,(bookshelf|time)?)>
Solution
XML
<library>
  <borrowed>
    <person>
      <name/><tel/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <stock>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
    </book>
  </stock>
</library>
XSD
root -> library[library]
library -> borrowed[borrowed]*, stock[stock]
borrowed -> person[person], book[book1]+
stock -> book[book2]+
person -> name[emp], tel[emp]?, email[emp]+
book1 -> id[emp], author[emp], time[emp]
book2 -> id[emp], author[emp], nbBooks[emp], bookshelf[emp]
emp -> #PCDATA
Related Work
● Schema inference (SSD)
○ Restricting algorithms to trees
○ No order considered between the children of a node
-> XSD schemas can't be derived
● DTD inference
● XSD inference
○ Trang
○ Xstruct
-> The expressiveness of the generated schema does not go beyond that of a DTD.
● Learning of tree automata
○ Inferring queries, not XSD
Background
Given an XML document:
● An XML fragment is a sequence of elements <a1>f1</a1> … <an>fn</an>, where a1, …, an are element names and f1, …, fn are XML fragments.
f = <library>
      <borrowed>
        <person><name/><tel/><email/> </person>
        <book> <id/> <author/> <time/> </book>
      </borrowed>
    </library>
● Paths(f) is the set of all labeled paths starting at a root element in the XML fragment f.
Paths(f) = { λ, library, library borrowed, library borrowed book, library borrowed person, library borrowed book id, etc. }
● Strings(f, p) is the set of all strings of element names occurring below an occurrence of path p in fragment f.
Strings(f, library borrowed) = { person book }
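The two definitions above can be made concrete with a short Python sketch. It is only an illustration: the function name `paths_and_strings` is hypothetical, a path is represented as a tuple of element names with the empty tuple standing for λ, and recording the root name as the string below λ is an assumption made here, not something the slides state.

```python
import xml.etree.ElementTree as ET

def paths_and_strings(xml_text):
    """Return (paths, strings) for a fragment with a single root element.

    paths   : set of tuples of element names; () plays the role of λ.
    strings : dict mapping each path p to the set of child-name tuples
              seen below an occurrence of p.
    """
    root = ET.fromstring(xml_text)
    paths = {()}
    # Assumption: the string below the empty path is the root name.
    strings = {(): {(root.tag,)}}

    def visit(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        children = tuple(child.tag for child in node)
        strings.setdefault(path, set()).add(children)
        for child in node:
            visit(child, path)

    visit(root, ())
    return paths, strings

FRAGMENT = """
<library>
  <borrowed>
    <person><name/><tel/><email/></person>
    <book><id/><author/><time/></book>
  </borrowed>
</library>
"""

paths, strings = paths_and_strings(FRAGMENT)
print(sorted(" ".join(p) for p in paths))
print(strings[("library", "borrowed")])  # {('person', 'book')}
```

Leaf elements such as `name` get the empty string `()` as their only child string.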
Background
Definition 1:
An XSD is a triple D = (T, ρ, τ), consisting of a finite set of types T, a mapping ρ from T to regular expressions, and a mapping τ that assigns a type to each pair (t, a) where the element name a occurs in ρ(t).
T = { root, library, borrowed, stock, person, book1, book2, emp }
ρ(root) = library
ρ(library) = borrowed*, stock
ρ(person) = name, tel?, email+
τ(root, library) = library
τ(library, borrowed) = borrowed
τ(library, stock) = stock
● The W3C specification requires regular expressions to be deterministic.
● An XSD is k-local if its content models depend only on labels up to the k-th ancestor.
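To see how the triple (T, ρ, τ) drives validation, here is a minimal Python sketch that encodes the example XSD as two dictionaries and checks a document against it. The names `RHO`, `TAU`, `to_regex`, `conforms` and `is_valid` are hypothetical, the entries of ρ and τ not listed on the slide are filled in from the XSD shown in the Solution section, and text content (#PCDATA) is not modelled: the type emp is simply treated as "no child elements".

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")

def to_regex(content_model):
    """Translate 'a*, b' notation into a Python regex over 'name ' tokens."""
    expr = content_model.replace(" ", "")
    expr = NAME.sub(lambda m: "(?:" + m.group(0) + " )", expr)
    return re.compile(expr.replace(",", ""))

# rho: type -> content model (emp is simplified to "no children").
RHO = {
    "root": "library",
    "library": "borrowed*, stock",
    "borrowed": "person, book+",
    "stock": "book+",
    "person": "name, tel?, email+",
    "book1": "id, author, time",
    "book2": "id, author, nbBooks, bookshelf",
    "emp": "",
}

# tau: (type, element name) -> type of that child.
TAU = {
    ("root", "library"): "library",
    ("library", "borrowed"): "borrowed",
    ("library", "stock"): "stock",
    ("borrowed", "person"): "person",
    ("borrowed", "book"): "book1",
    ("stock", "book"): "book2",
}
for parent, names in {
    "person": ["name", "tel", "email"],
    "book1": ["id", "author", "time"],
    "book2": ["id", "author", "nbBooks", "bookshelf"],
}.items():
    for name in names:
        TAU[(parent, name)] = "emp"

def conforms(children, type_name):
    """Check that the child sequence matches rho(type) and recurse via tau."""
    word = "".join(child.tag + " " for child in children)
    if not to_regex(RHO[type_name]).fullmatch(word):
        return False
    return all(
        conforms(list(child), TAU[(type_name, child.tag)]) for child in children
    )

def is_valid(xml_text):
    return conforms([ET.fromstring(xml_text)], "root")

VALID = (
    "<library><borrowed><person><name/><tel/><email/></person>"
    "<book><id/><author/><time/></book></borrowed>"
    "<stock><book><id/><author/><nbBooks/><bookshelf/></book></stock></library>"
)
# A book with <time/> is only allowed below <borrowed>, not below <stock>.
INVALID = "<library><stock><book><id/><author/><time/></book></stock></library>"

print(is_valid(VALID), is_valid(INVALID))
```

The element name `book` maps to `book1` or `book2` depending on its parent type, which is exactly the context dependence a DTD cannot express.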
Background
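Definition 2:
SORE: A regular expression r is single occurrence if every element name occurs at most once in it. An XSD is single occurrence (SOXSD) if it contains only SOREs.
Example: borrowed*, stock is a SORE; borrowed*, stock, stock is not, because stock occurs twice.
Definition 3:
An SOA (single occurrence automaton) is a graph A = (V, E) where all states in V - {in, out} are element names, and E ⊆ (V - {out}) x (V - {in}) is the edge relation, so no edge leaves out and no edge enters in.
● L(A) is the set of all strings accepted by A, that is, all strings a1 … an such that in, a1, …, an, out is a path in A.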
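Both definitions are easy to experiment with. The sketch below checks the single-occurrence property on the comma notation used in these slides, and builds an SOA from sample strings by adding one edge per pair of adjacent symbols. That adjacent-pair construction is known in the literature as 2T-INF; the slides do not say which learner is used, so treat that choice, and the helper names `is_sore`, `learn_soa` and `accepts`, as assumptions.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.-]*")
IN, OUT = "<in>", "<out>"  # the two special states of an SOA

def is_sore(expression):
    """True if every element name occurs at most once in the expression."""
    names = NAME.findall(expression)
    return len(names) == len(set(names))

def learn_soa(strings):
    """Return the edge set of an SOA whose states are the element names.

    Every string contributes the edges in -> a1 -> ... -> an -> out.
    """
    edges = set()
    for s in strings:
        word = [IN, *s, OUT]
        edges.update(zip(word, word[1:]))
    return edges

def accepts(edges, s):
    """A string is accepted if in, a1, ..., an, out is a path in the graph."""
    word = [IN, *s, OUT]
    return all(pair in edges for pair in zip(word, word[1:]))

print(is_sore("borrowed*, stock"))         # True
print(is_sore("borrowed*, stock, stock"))  # False

soa = learn_soa([("person", "book"), ("person", "book", "book")])
print(accepts(soa, ("person", "book", "book", "book")))  # True
print(accepts(soa, ("book",)))                           # False
```

The learned SOA generalises beyond the sample: two strings are enough to accept any number of `book` elements after `person`, which matches the SORE person, book+.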

Contribution
The goal is to infer a k-local single occurrence XSD (D', t') equivalent to a target k-local SOXSD (D, t), given only a finite corpus of XML documents.
● Let C be a corpus consisting of two XML fragments that are valid with respect to the XSD presented earlier, and let k be a natural number.
iLocal algorithm:
➔ T = { p/k | p ∈ Paths(C) }, where p/k denotes the last k element names of path p
➔ ρ ← Ø ; τ ← Ø
➔ Construct the content model for each type p/k:
◆ learn the SOA for the set k-strings(C, p/k) of all strings occurring in C below a path q that is k-equivalent to p (that is, q/k = p/k)
◆ transform this SOA into a SORE
◆ add the mapping p/k → SORE to ρ
➔ For each path pa in Paths(C), add (p/k, a) → (pa)/k to τ
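The steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions: types are tuples holding the last k element names of a path (with k ≥ 1), the root type is the empty tuple, the SOA is learned with the adjacent-pair construction, and the SOA-to-SORE conversion step is left out, so ρ maps each type to an SOA edge set rather than to a SORE. The function names and the two small documents are hypothetical stand-ins, not the corpus from the slides.

```python
import xml.etree.ElementTree as ET

IN, OUT = "<in>", "<out>"

def k_suffix(path, k):
    """Last k element names of a path (the whole path if it is shorter)."""
    return path[-k:]

def learn_soa(strings):
    """One edge per pair of adjacent symbols, framed by in and out."""
    edges = set()
    for s in strings:
        word = [IN, *s, OUT]
        edges.update(zip(word, word[1:]))
    return edges

def ilocal(documents, k):
    """Return (types, rho, tau) for a corpus of XML strings, with k >= 1."""
    k_strings = {}  # type -> set of child-name tuples seen below it
    tau = {}        # (type, element name) -> type of that child

    def visit(node, prefix):
        path = prefix + (node.tag,)
        t = k_suffix(path, k)
        k_strings.setdefault(t, set()).add(tuple(c.tag for c in node))
        for child in node:
            tau[(t, child.tag)] = k_suffix(path + (child.tag,), k)
            visit(child, path)

    for text in documents:
        root = ET.fromstring(text)
        k_strings.setdefault((), set()).add((root.tag,))
        tau[((), root.tag)] = k_suffix((root.tag,), k)
        visit(root, ())

    rho = {t: learn_soa(words) for t, words in k_strings.items()}
    return set(k_strings), rho, tau

DOC1 = (
    "<library><borrowed><person><name/><tel/><email/></person>"
    "<book><id/><author/><time/></book></borrowed>"
    "<stock><book><id/><author/><nbBooks/><bookshelf/></book></stock></library>"
)
DOC2 = (
    "<library><borrowed><person><name/><email/></person>"
    "<book><id/><author/><time/></book></borrowed>"
    "<stock><book><id/><author/><nbBooks/><bookshelf/></book>"
    "<book><id/><author/><nbBooks/><bookshelf/></book></stock></library>"
)

types, rho, tau = ilocal([DOC1, DOC2], k=2)
types1, rho1, tau1 = ilocal([DOC1, DOC2], k=1)
print(sorted(types))
print(tau[(("library", "borrowed"), "book")])  # ('borrowed', 'book')
```

With k = 2 the two kinds of book get separate types, ('borrowed', 'book') and ('stock', 'book'); with k = 1 they collapse into the single type ('book',), which is the DTD-like behaviour the Problem section describes.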
Let C be the corpus of these two XML documents:
Document 1
<library>
  <borrowed>
    <person>
      <name/><tel/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <stock>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
    </book>
  </stock>
</library>
Document 2
<library>
  <borrowed>
    <person>
      <name/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <borrowed> ... </borrowed>
  <stock>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
    </book>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
      <book/>
    </book>
  </stock>
</library>
Running iLocal:
k = 2
p/k = borrowed book
k-strings(C, p/k) = { id name nbBooks author bookshelf, id name book book }
SOA example
SORE example
Does not contain the timeBorrowed element
We have inferred the content models for all types (example)
Determine the type associated with the element names in these content models, for k = 2 (example)
Final result (example)
Next problem: minimisation (iLocal produces more types than necessary)

Minimise:
Minimisation is a problem, so the authors created their own algorithm, Reduce
Reduce of iLocal:
Experimental Evaluation
Personal Opinion
Conclusion
❖ Problem: inferring a Document Type Definition (DTD) is not enough. In a DTD, the content model of an element can depend only on the element name, not on the context in which it is used.
❖ Solution: infer an XML Schema Definition (XSD). An XSD allows the content model of an element to depend on the context in which it is used.
❖ Background:
➢ Definition of XSD
➢ SORE
➢ SOA
❖ Contribution
➢ iLocal
➢ Reduce
➢ iXSD = iLocal + Reduce
❖ Experimental Evaluation
