Inferring XML Schema Definitions From XML Data


Geert Jan Bex
Frank Neven
Stijn Vansummeren

Paper presentation

How can we automatically generate schemas from XML documents?
Overview

●Problem
●Solution
●Related work
●Background
●Contributions
➔ iLocal algorithm
➔ Reduce algorithm
➔ iXSD
●Experimental evaluation
Problem
XML
<library>
  <borrowed>
    <person>
      <name/><tel/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <stock>
    <book>
      <id/> <author/>
      <nbBooks/> <bookshelf/>
    </book>
  </stock>
</library>

DTD
<!ELEMENT library (borrowed*,stock+)>
<!ELEMENT borrowed (person,book+)>
<!ELEMENT stock (book)+>
<!ELEMENT person (name,tel+,email?)>
<!ELEMENT book (id,author,nbBooks?,(bookshelf|time)?)>
Solution
XML
<library>
  <borrowed>
    <person>
      <name/><tel/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <stock>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
    </book>
  </stock>
</library>

XSD
root -> library[library]
library -> borrowed[borrowed]*, stock[stock]
borrowed -> person[person], book[book1]+
stock -> book[book2]+
person -> name[emp], tel[emp]?, email[emp]+
book1 -> id[emp], author[emp], time[emp]
book2 -> id[emp], author[emp], nbBooks[emp], bookshelf[emp]
emp -> #PCDATA
Related Work

● Schema inference for semi-structured data (SSD)
○ Restricting algorithms to trees
○ No order considered between the children of a node
-> XSD schemas can't be derived
● DTD inference
● XSD inference
○ Trang
○ Xstruct
-> The expressiveness of the generated schema does not go beyond that of a DTD.
● Learning of tree automata
○ Inferring queries, not XSD
Background
Given an XML document:

● An XML fragment is a sequence of elements <a1>f1</a1> … <an>fn</an>, where a1, …, an are
element names and f1, …, fn are XML fragments.
f = <library>
      <borrowed>
        <person><name/><tel/><email/> </person>
        <book> <id/> <author/> <time/> </book>
      </borrowed>
    </library>
● Paths(f) is the set of all labeled paths starting at a root element in the XML fragment f.
Paths(f) = { λ, library, library borrowed, library borrowed book, library borrowed
person, library borrowed book id, etc. }
● Strings(f,p) is the set of all strings of element names occurring below an occurrence of
path p in fragment f.
Strings(f, library borrowed) = { person book }
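
The two definitions above can be made concrete with a small Python sketch. It assumes the fragment has a single root element and is parsed with the standard library; paths are represented as tuples of element names, with the empty tuple standing for λ. The function names `paths` and `strings` are illustrative, not taken from the paper.

```python
import xml.etree.ElementTree as ET

def paths(root):
    """Paths(f): every labeled path starting at the root; () stands for λ."""
    result = {()}
    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        result.add(path)
        for child in elem:
            walk(child, path)
    walk(root, ())
    return result

def strings(root, p):
    """Strings(f, p): the child-name sequences found below each occurrence of p."""
    result = set()
    if p == ():
        # Below the empty path sits the root element itself.
        result.add((root.tag,))
    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        if path == p:
            result.add(tuple(child.tag for child in elem))
        for child in elem:
            walk(child, path)
    walk(root, ())
    return result

f = ET.fromstring(
    "<library><borrowed>"
    "<person><name/><tel/><email/></person>"
    "<book><id/><author/><time/></book>"
    "</borrowed></library>"
)
print(strings(f, ("library", "borrowed")))  # {('person', 'book')}
```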
Background
Definition 1:
An XSD is a triple D = (T, ρ, τ), consisting of a finite set of types T, a mapping ρ from T to
regular expressions, and a mapping τ that assigns a type to each pair (t, a) where the element
name a occurs in ρ(t).
T = { root, library, borrowed, stock, person, book1, book2, emp }
ρ(root) = library
ρ(library) = borrowed*, stock
ρ(person) = name, tel?, email+
τ(root, library) = library
τ(library, borrowed) = borrowed
τ(library, stock) = stock

● The W3C specification requires regular expressions to be deterministic.

● An XSD is k-local if its content models depend only on labels up to the k-th ancestor.
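
To illustrate how the triple (T, ρ, τ) of Definition 1 drives validation, here is a minimal Python sketch using the library XSD shown on the Solution slide. The encoding is an assumption made for illustration only: each content model is written by hand as a Python regex over space-terminated element names, and `valid` and `valid_document` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# rho: type -> content model, as a regex over space-terminated element names.
RHO = {
    "root": r"library ",
    "library": r"(borrowed )*stock ",
    "borrowed": r"person (book )+",
    "stock": r"(book )+",
    "person": r"name (tel )?(email )+",
    "book1": r"id author time ",
    "book2": r"id author nbBooks bookshelf ",
    "emp": r"",  # emp holds text only, so it allows no child elements
}
# tau: (type, element name) -> type assigned to that child.
TAU = {
    ("root", "library"): "library",
    ("library", "borrowed"): "borrowed",
    ("library", "stock"): "stock",
    ("borrowed", "person"): "person",
    ("borrowed", "book"): "book1",
    ("stock", "book"): "book2",
}
LEAVES = {"person": "name tel email", "book1": "id author time",
          "book2": "id author nbBooks bookshelf"}
for t, names in LEAVES.items():
    for name in names.split():
        TAU[(t, name)] = "emp"

def valid(elem, t):
    """Check elem's children against rho(t), then recurse with the types from tau."""
    word = "".join(child.tag + " " for child in elem)
    if re.fullmatch(RHO[t], word) is None:
        return False
    return all(valid(child, TAU[(t, child.tag)]) for child in elem)

def valid_document(root):
    if re.fullmatch(RHO["root"], root.tag + " ") is None:
        return False
    return valid(root, TAU[("root", root.tag)])

good = ET.fromstring(
    "<library><borrowed><person><name/><tel/><email/></person>"
    "<book><id/><author/><time/></book></borrowed>"
    "<stock><book><id/><author/><nbBooks/><bookshelf/></book></stock></library>"
)
bad = ET.fromstring(
    "<library><stock><book><id/><author/><time/></book></stock></library>"
)
print(valid_document(good))  # True
print(valid_document(bad))   # False: a book below stock has type book2
```

The second document is rejected only because τ gives the book element a different type below stock than below borrowed, which is the context dependence a DTD cannot express.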
Background
Definition 2:
SORE: A regular expression r is single occurrence if every element name occurs at
most once in it. An XSD is single occurrence (SOXSD) if it contains only SOREs.

borrowed*, stock is a SORE.
borrowed*, stock, stock is not a SORE, because stock occurs twice.
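
The single occurrence condition of Definition 2 is purely syntactic, so it can be checked by counting names. The sketch below assumes content models are given as plain strings in the notation used on these slides; `is_single_occurrence` is an illustrative name.

```python
import re
from collections import Counter

def is_single_occurrence(expr):
    """True if every element name occurs at most once in the expression."""
    # Element names are the word-like tokens; operators such as * , ? + | are skipped.
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return all(count == 1 for count in Counter(names).values())

print(is_single_occurrence("borrowed*, stock"))         # True
print(is_single_occurrence("borrowed*, stock, stock"))  # False
```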

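Definition 3:
SOA: A single occurrence automaton (SOA) is a graph A = (V,E) where all states in V-{in,out}
are element names, and E ⊆ (V-{out}) x (V-{in}) is the edge relation, so no edge leaves out
and no edge enters in.

● L(A) is the set of all strings accepted by A. A string a1 … an is accepted if
in, a1, …, an, out is a path in A.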
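
A short Python sketch may help to see an SOA as a plain edge set. The slides do not say how the automaton is learned from sample strings; the construction below simply records every pair of consecutive symbols, which is an assumption made for illustration. It also assumes no element is actually named "in" or "out".

```python
def learn_soa(strings):
    """Build the edge set of an SOA from sample strings (tuples of element names)."""
    edges = set()
    for word in strings:
        states = ["in", *word, "out"]
        # Each pair of neighbouring states becomes one edge of the graph.
        edges.update(zip(states, states[1:]))
    return edges

def accepts(edges, word):
    """A word a1...an is accepted if in, a1, ..., an, out is a path in the graph."""
    states = ["in", *word, "out"]
    return all(step in edges for step in zip(states, states[1:]))

soa = learn_soa([("person", "book"), ("person", "book", "book")])
print(accepts(soa, ("person", "book", "book", "book")))  # True
print(accepts(soa, ("book", "person")))                  # False
```

The first call shows the generalisation step: three books never appear in the samples, but the loop on book learned from the second sample accepts them.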


Contribution
The goal is to infer a k-local single occurrence XSD (D', t') equivalent to a target k-local
SOXSD (D, t), given only a finite corpus of XML documents.
● Let C be a corpus consisting of 2 XML fragments that are valid with respect to the XSD
presented before, and let k be a natural number.

iLocal Algorithm:
➔ T = { p/k | p ∈ Paths(C) }, where p/k denotes the last k element names of path p
➔ ρ ← Ø ; τ ← Ø
➔ construct the content model for each type p/k:
◆ learn the SOA for the set k-strings(C, p/k) of all strings occurring in C below a
path q that is k-equivalent to p (that is, q/k = p/k)
◆ transform this SOA into a SORE
◆ add the mapping from p/k to this SORE to ρ
➔ for each path pa in Paths(C), add (p/k, a) -> (pa)/k to τ
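
The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: p/k is read as the last k element names of p, content models are left as SOA edge sets (the SOA-to-SORE conversion is omitted), and the SOA is learned by recording consecutive symbol pairs. The name `ilocal` and the helper `suffix` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def suffix(path, k):
    """p/k: the last k element names of a path (the whole path if it is shorter)."""
    return path[max(len(path) - k, 0):]

def ilocal(corpus, k):
    """Return (T, rho, tau); rho maps each type to SOA edges, not yet to a SORE."""
    samples = defaultdict(set)  # type p/k -> k-strings(C, p/k)
    tau = {}                    # (p/k, a) -> (pa)/k

    def walk(elem, parent_path):
        path = parent_path + (elem.tag,)
        tau[(suffix(parent_path, k), elem.tag)] = suffix(path, k)
        samples[suffix(path, k)].add(tuple(child.tag for child in elem))
        for child in elem:
            walk(child, path)

    for root in corpus:
        samples[()].add((root.tag,))  # strings below the empty path λ
        walk(root, ())

    rho = {}
    for t, words in samples.items():
        edges = set()
        for word in words:
            states = ["in", *word, "out"]
            edges.update(zip(states, states[1:]))
        rho[t] = edges
    return set(samples), rho, tau

doc1 = ET.fromstring(
    "<library><borrowed><person><name/><tel/><email/></person>"
    "<book><id/><author/><time/></book></borrowed>"
    "<stock><book><id/><author/><nbBooks/><bookshelf/></book></stock></library>"
)
doc2 = ET.fromstring(
    "<library><borrowed><person><name/><email/></person>"
    "<book><id/><author/><time/></book></borrowed>"
    "<stock><book><id/><author/><nbBooks/><bookshelf/></book></stock></library>"
)
types, rho, tau = ilocal([doc1, doc2], k=2)
print(tau[(("library", "borrowed"), "book")])  # ('borrowed', 'book')
print(tau[(("library", "stock"), "book")])     # ('stock', 'book')
```

With k = 2 in this sketch, books below borrowed and books below stock receive separate types; calling `ilocal` with k = 1 merges them into the single type `('book',)`, so the content model again depends on the element name alone.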
Let C be the corpus of these two XML documents:

Document 1
<library>
  <borrowed>
    <person>
      <name/><tel/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <stock>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
    </book>
  </stock>
</library>

Document 2
<library>
  <borrowed>
    <person>
      <name/><email/>
    </person>
    <book>
      <id/> <author/> <time/>
    </book>
  </borrowed>
  <borrowed> ... </borrowed>
  <stock>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
    </book>
    <book>
      <id/> <author/> <nbBooks/>
      <bookshelf/>
      <book/>
    </book>
  </stock>
</library>
Running iLocal:
k = 2

p/k = borrowed book

k-strings(C, p/k) = { id name nbBooks author bookshelf,
id name book book }

SOA example

SORE example

does not contain the timeBorrowed element

we have now inferred the content models for all types (example)

determine the type associated with the element names in these content models, for k = 2 (example)

final result (example)

next problem: minimisation, because there are more types than necessary


Minimise:

this is a problem, so the authors created their own algorithm: Reduce

Reduce of iLocal:
Experimental Evaluation
Personal Opinion
Conclusion
❖ Problem: a Document Type Definition (DTD) is not expressive enough as an inference target
for XML: the content model of an element can depend only on the element name, not on the
context in which it is used.
❖ Solution: infer an XML Schema Definition (XSD). An XSD allows the content model of an
element to depend on the context in which it is used.
❖ Background:
➢ Definition of XSD
➢ SORE
➢ SOA
❖ Contribution
➢ iLocal
➢ Reduce
➢ iXSD = iLocal + Reduce
❖ Experimental Evaluation
