
XML BASICS FOR DATA EXCHANGE

John Beresniewicz, Precise Software Solutions

WHAT IS XML?
I suppose the best place to start is the official home of XML on the web, the W3C (World Wide Web
Consortium) web site. There we can find the following definition (www.w3.org/XML/):

“XML is the universal format for structured documents and data on the Web.”

That short sentence has a lot of ambition in it. In particular, XML intends to be universal for the
Web. XML is also about documents and data, a distinction that points at the two main branches of
XML usage: for rendering (publishing) content on the Web, and for exchanging structured data on
the Web or elsewhere.
This paper is based on my experiences and learning gained from implementing a specific use of
XML. That use is the exchange of simple, well-defined structured data between networked software
components using a request-response protocol between client and supplier. My expertise in the
wider and more sophisticated usages of XML is yet to be developed.
XML has a wide variety of uses and probably abuses also. Recently, there has been much hype and
hoopla about XML and it can be difficult to sift out the tunes from the noise. Hopefully, this short
article will help clarify some fundamental aspects of XML for newcomers to this new and somewhat
strange corner of technology-world.

XML TECHNOLOGIES
There is a whole raft of technologies in XML-world that are actively being developed, specified and
standardized. Some of the more important technologies are:
• XML 1.0 and namespaces
• XML information set
• Document Object Model (DOM)
• Simple API for XML (SAX)
• XML Path Language (XPath)
• XSL Transformations (XSLT)
• XML Schema
• Simple Object Access Protocol (SOAP)
The first of these, XML 1.0 and namespaces, refers to the specification for XML itself. The others
are extensions or related technologies that matter for the use of XML generally and for data
exchange in particular. They are covered in varying depth below.

THE XML INFOSET


Most people probably think of XML as text files filled with those funny angle-bracket tags, similar
to but seemingly more complicated than HTML tags and files. In fact, these text files are merely
representations (called serializations) of specific occurrences of an abstract thing called the XML
infoset. The infoset of an XML document is really a kind of abstract model of data with the
following characteristics:
• Structured information items that can have named properties
• Hierarchically tree-oriented with a single root node
The text XML documents that we read are actually based on representational conventions for
making visible this abstract structured data object called the infoset. Every XML document
represents an instantiation of some infoset.
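
To make the distinction between an infoset and its serializations concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The two sample strings and the helper name summarize are assumptions made for illustration, and ElementTree exposes only part of the infoset, so this is an approximation of the idea rather than a formal comparison.

```python
import xml.etree.ElementTree as ET

# Two different serializations (hypothetical sample text) of the same data:
# quoting style, attribute order, and empty-element syntax all differ.
text_a = "<doc name='my.doc' type=\"MS Word\"/>"
text_b = '<doc   type="MS Word"   name="my.doc"></doc>'

def summarize(xml_text):
    """Reduce a parsed element to (name, attributes, number of children)."""
    elem = ET.fromstring(xml_text)
    return elem.tag, dict(elem.attrib), len(elem)

# The text differs, but the parsed structure is the same.
same_structure = summarize(text_a) == summarize(text_b)
print(same_structure)
```

The surface differences between the two strings disappear once the text has been parsed, which is the sense in which both are serializations of one abstract object.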

XML DOCUMENT STRUCTURE


NODES AND THE ROOT NODE
The hierarchical structure of an XML document can be thought of as consisting of a set of nodes
connected by parent-child relationships and with a single ultimate parent, which is called the root
node. For our purposes the root node is the document's outermost element (strictly speaking, the
infoset and XPath place a document node above it), and all child elements are also nodes.
Element nodes can have associated attributes and content, which themselves can be considered leaf
nodes, as they cannot have child nodes.
The node structure of XML is particularly important when traversing or selecting document nodes
using XML Path Language and also when processing XML documents with XSLT.
ELEMENTS AND CONTENT
Element information items are the primary building blocks of XML documents. Elements can
contain other elements as children, and so are structurally important. All documents have the root
element as their outermost container. Elements can also have named attribute values.
Between the start and end tags of an element is the element content. Element content can be almost
anything imaginable in some application or other, but for data exchange applications element content
(if we use it) will contain data values for its parent element. We avoid mixed content (explained
below) by allowing our document elements to have content only if they have no child elements. That
is, only leaf element nodes will carry data.
ATTRIBUTES
Attributes are named property values for elements. They are specified in the element’s start tag using
the syntax: name="value". The value must be enclosed in matching quotes, either single or
double.
Attributes are attached to their parent element node (XPath, for one, does not count them among the
element’s children) and cannot themselves have child information items.
ENTITIES
Entities are like substitution variables in XML documents. They are references to content that is
somewhere else, which may mean elsewhere in the document itself or somewhere out in the Internet.

REPRESENTING STRUCTURED DATA


There is a lot of flexibility in how data and structure can be represented in XML documents.
Perhaps there is too much flexibility. Modeling data structures is not always obvious, and the rush to
implementation can be a liability. The following are a couple of guidelines that have emerged that
help frame the options.
MIXED CONTENT
Mixed content refers to elements that have both child elements and character data as subnodes. For
example, the following shows a mixed content element:

<MixedUp>This content is <Child>Woops</Child>mixed up</MixedUp>

It is generally regarded as bad practice to use mixed content, especially for data exchange
applications. Perhaps there are valid reasons for using it in the document formatting area.
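
A data exchange application can detect mixed content mechanically. The following is a minimal Python sketch, assuming the standard library's xml.etree.ElementTree; the function name has_mixed_content is hypothetical. ElementTree stores character data that precedes the first child in the parent's text attribute and character data that follows a child in that child's tail attribute, so the check looks at both.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(elem):
    """True if elem has child elements and also non-whitespace text of its own."""
    if len(elem) == 0:
        return False  # a leaf element may carry data; that is not mixed content
    own_text = [elem.text or ""] + [child.tail or "" for child in elem]
    return any(piece.strip() for piece in own_text)

mixed = ET.fromstring(
    "<MixedUp>This content is <Child>Woops</Child>mixed up</MixedUp>"
)
clean = ET.fromstring("<doc>\n  <name>my.doc</name>\n</doc>")

print(has_mixed_content(mixed), has_mixed_content(clean))
```

Whitespace used only for indentation between child elements is ignored by this check, which is a design choice of the sketch rather than a rule of XML.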
ELEMENT NORMAL FORM
Element-normal form refers to structuring documents in such a way that all information is carried in
element content rather than attribute values.
Consider the following simple element.

<doc>
<name>my.doc</name>
<size>25,565</size>
<type>MS Word</type>
</doc>

Each property of interest about the doc element is represented by a child element whose content is
the value of that property for the parent doc element. The document has a total of four elements,
three of which contain content.
Oracle’s canonical representation of SQL result sets uses an element normal form.
ATTRIBUTE NORMAL FORM
We can represent exactly the same information about the doc element using attribute values only, as
below:
<doc name='my.doc'
     size="25,565"
     type='MS Word'/>

In this representation, there is only the single doc element and it is empty of content. Attribute-
normal form is a somewhat terser representation, as many element tags have been eliminated.
Some authors [BOX] seem to prefer element-normal form over attribute-normal form, as attributes
are less flexible than elements (terminus in parent-child chain). I have preferred attribute-normal
form to express a simple data exchange API convention. In the end there may be specific
applications where one form is preferred over the other but in many cases either will do the job fine.
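
Because the two forms carry the same information in this simple case, one can be derived mechanically from the other. The sketch below is illustrative only: it assumes Python's standard xml.etree.ElementTree, and the function name to_attribute_normal is hypothetical.

```python
import xml.etree.ElementTree as ET

def to_attribute_normal(elem):
    """Return a new empty element whose attributes carry the leaf children's text."""
    result = ET.Element(elem.tag, dict(elem.attrib))
    for child in elem:
        if len(child) > 0:
            # Only leaf children can be folded into an attribute value.
            raise ValueError("child %r is not a leaf element" % child.tag)
        result.set(child.tag, (child.text or "").strip())
    return result

element_form = ET.fromstring(
    "<doc><name>my.doc</name><size>25,565</size><type>MS Word</type></doc>"
)
attribute_form = to_attribute_normal(element_form)
print(ET.tostring(attribute_form, encoding="unicode"))
```

The conversion also shows where attributes are less flexible: a repeated child element name would overwrite the earlier attribute value, and a child with children of its own cannot be converted at all.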

DOCUMENT TYPE DEFINITION (DTD)


ELEMENT CONTENT MODELS
Rules about the node structure of documents can be specified using element declarations. These
rules are sometimes called the content model of the element. A simplified regular expression syntax
is used to specify element content models. Consider the following element declarations:

<!ELEMENT foo (bar,bar) >
<!ELEMENT emplist (emp)* >
<!ELEMENT choice ( A | B ) >
<!ELEMENT attributesonly EMPTY >

• The foo element is required to have a bar child element followed by another bar child
element.
• The emplist element contains zero or more emp elements.
• The choice element must contain either an A element or a B element.
• The attributesonly element cannot have child elements or content, but may have
attribute values
These examples give some idea of the kinds of structural rules that can be imposed on documents.
Bear in mind that child elements in the content model will have their own content models. The
structural complexity theoretically possible is essentially unlimited.
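
The kinship between content models and regular expressions can be illustrated with a small Python sketch. This is not how a validating parser works internally; the regular expressions are translated by hand from the four declarations above, and the names CONTENT_MODELS and children_match are assumptions of the sketch. It checks child element sequences only and ignores character content.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated regular expressions for the four declarations above.
# Each child name is followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "foo": r"bar bar ",         # (bar,bar)
    "emplist": r"(emp )*",      # (emp)*
    "choice": r"(A |B )",       # ( A | B )
    "attributesonly": r"",      # EMPTY: no child elements at all
}

def children_match(elem):
    """Check an element's sequence of child names against its model."""
    sequence = "".join(child.tag + " " for child in elem)
    return re.fullmatch(CONTENT_MODELS[elem.tag], sequence) is not None

print(children_match(ET.fromstring("<foo><bar/><bar/></foo>")))
print(children_match(ET.fromstring("<choice><A/><B/></choice>")))
```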
ATTRIBUTES
Attributes have names and hang onto elements. DTDs can require that attributes be specified for
each occurrence of the element or they can be optional. They can specify a fixed or default value for
the attribute or an enumerated list of valid values. The following are some basic attribute
declarations for the element emp:

<!ELEMENT emp EMPTY >
<!ATTLIST emp empid ID #REQUIRED >
<!ATTLIST emp name CDATA #IMPLIED >
<!ATTLIST emp sex ( M | F ) #REQUIRED >
<!ATTLIST emp job ( DBA | Developer | Manager ) #REQUIRED >
<!ATTLIST emp mgrid IDREF #IMPLIED >

• The empid attribute is a required unique identifier for the element within the document.
• The name attribute is optional and composed of character data.
• The sex attribute is required and limited to the choices M and F.
• The job attribute is required and limited to the choices DBA, Developer and Manager.
• The mgrid attribute is optional but, if specified, must match the value of an ID attribute
(here, an empid) somewhere in the document.
Attributes can be used to implement complex structural aspects through IDREF pointers, as well as
to carry specific data values similar to element content. The real disadvantage of DTDs is that the
data typing of attribute values as well as element content is much too weak. There is no built-in
mechanism to require numeric attribute values, for instance.
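
A validating parser enforces the ID and IDREF rules automatically. To show what those rules amount to, here is a hand-written Python sketch of the same checks for the emp declarations above. It assumes the standard library's xml.etree.ElementTree, which does not validate against a DTD; the function name check_emp_references and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def check_emp_references(root):
    """Return a list of problems with empid uniqueness and mgrid references."""
    problems = []
    seen = set()
    for emp in root.iter("emp"):
        empid = emp.get("empid")
        if empid in seen:
            problems.append("duplicate empid: %s" % empid)
        seen.add(empid)
    # Second pass, so that an mgrid may refer to an emp appearing later.
    for emp in root.iter("emp"):
        mgrid = emp.get("mgrid")
        if mgrid is not None and mgrid not in seen:
            problems.append("mgrid %s matches no empid" % mgrid)
    return problems

good = ET.fromstring(
    '<emplist>'
    '<emp empid="e1" sex="F" job="Manager"/>'
    '<emp empid="e2" sex="M" job="DBA" mgrid="e1"/>'
    '</emplist>'
)
bad = ET.fromstring(
    '<emplist><emp empid="e1" sex="M" job="DBA" mgrid="e9"/></emplist>'
)
print(check_emp_references(good), check_emp_references(bad))
```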
ENTITIES
Entities are the trickiest and most powerful aspect of DTDs to learn. In general they are pointers or
references to something that has been defined or exists elsewhere whose contents are substituted for
the entity reference. They have several types and uses, including:
• Parameter entities are defined in the DTD and replaced within the DTD itself
• General entities are defined in the DTD and replaced within XML documents using the
DTD
• External entities are URI references to XML content that is merged into documents
• Unparsed entities are references to non-XML content, e.g., graphics files
The following simple example shows a parameter entity used to standardize declarations of Boolean
attribute enumeration lists:

<!ENTITY % bool "( true | false | TRUE | FALSE )" >
<!ELEMENT emp EMPTY >
<!ATTLIST emp isMarried %bool; #REQUIRED >
<!ATTLIST emp isWealthy %bool; #REQUIRED >

The entity bool is expanded in the DTD wherever the %bool; reference is encountered. This allows
a degree of notational efficiency as well as standardization of attribute declaration.
DTD EXAMPLE: POINTS AND LINES
DTDs are reasonably good at restricting (typing) the structural organization of an XML document’s
nodes; however, they are quite limited in imposing type constraints on actual data values.
Take for example the following DTD in which the line element is composed of two point elements,
each of which is composed of an x, y element pair:

<!ELEMENT line (point,point) >
<!ELEMENT point (x,y) >
<!ELEMENT x EMPTY >
<!ATTLIST x val CDATA #REQUIRED>
<!ELEMENT y EMPTY >
<!ATTLIST y val CDATA #REQUIRED>

Structurally, the model of a line consisting of two points specified by x and y coordinates works well.
The content model is in attribute-normal form. That is, the leaf elements are declared EMPTY and
data payload is carried in attribute values, specifically the val attribute of elements x and y.
Let’s look at a specific XML document conforming to the DTD above:

<?xml version="1.0" ?>
<!DOCTYPE line SYSTEM "foo4.dtd" >
<line>
  <point>
    <x val="12345"/>
    <y val="45321"/>
  </point>
  <point>
    <x val="abcdef"/>
    <y val="ghijkl"/>
  </point>
</line>

Everything seems OK until we get to the x and y values for the second point, which contain
alphabetic characters where we want digits representing an integer value. Here is the big problem
with DTDs: they offer no direct way to enforce a rule as simple as requiring an attribute’s value
to be an integer. We will see further down why enforcing very strict rules on XML
type and structure can be highly advantageous in the development of trustworthy and stable
distributed systems.
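
Since the DTD cannot express the integer rule, an application relying on DTDs alone has to check it in its own code. The following Python sketch shows one way this might look, assuming the standard library's xml.etree.ElementTree; the names LINE_DOC, INTEGER and non_integer_coordinates are hypothetical, and the DOCTYPE line is left out because this parser does not validate.

```python
import re
import xml.etree.ElementTree as ET

# The example document from the text, without the DOCTYPE declaration.
LINE_DOC = """<line>
  <point><x val="12345"/><y val="45321"/></point>
  <point><x val="abcdef"/><y val="ghijkl"/></point>
</line>"""

# An optional sign followed by ASCII digits only.
INTEGER = re.compile(r"[+-]?[0-9]+")

def non_integer_coordinates(xml_text):
    """Return (element name, value) pairs whose val attribute is not an integer."""
    offenders = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in ("x", "y"):
            value = elem.get("val", "")
            if INTEGER.fullmatch(value) is None:
                offenders.append((elem.tag, value))
    return offenders

print(non_integer_coordinates(LINE_DOC))
```

Every consumer of the document would need an equivalent check, which is exactly the duplication that stronger typing in the document definition is meant to avoid.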

XML SCHEMA
XML Schema is a relatively new standard designed to provide robust mechanisms for stronger typing
of XML document objects and their content. It has much better mechanisms than the DTD
language for specifying data and structural validation rules. In addition, XML Schema is
implemented in XML itself, so abstract type metadata and actual document instances are
expressed using the same syntax.
The newness and increased complexity of XML Schema may limit the number or quality of parsers
for a while. However, it is still probably the best bet for new projects to define their data
structures using XML Schema as it seems clearly destined to succeed DTDs.

PROCESSING XML DOCUMENTS


There are a number of technologies that developers will need to use to programmatically process
XML documents. The following is a very brief guide to some of these technologies.
PARSING AND VALIDATION
The first requirement when processing XML is to make sure it is really XML and not just random
tags. There are two basic parsing levels for XML: well-formed and valid. XML is considered well-
formed if it conforms to basic syntax and structural requirements like matching of element begin and
end tags, conformance to rules about element and attribute names, and strict nesting of parent-child
element content. XML that is associated with a DTD is valid if it conforms to all the rules specified
by the DTD.
There are numerous validating parsers freely available as source code or libraries for use by
developers, so there is generally no need to write a validating parser from scratch. Applications that
use XML without validating sacrifice a tremendous opportunity to control software errors, as
discussed further below.
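
The lower of the two levels, well-formedness, is easy to demonstrate. The sketch below assumes Python's standard xml.etree.ElementTree, which is a non-validating parser: it reports well-formedness errors but does not check a document against a DTD. The function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML; says nothing about DTD validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))      # properly nested
print(is_well_formed("<a><b></a></b>"))   # overlapping tags
print(is_well_formed("<a x=1/>"))         # unquoted attribute value
```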
DOCUMENT OBJECT MODEL (DOM)
The Document Object Model is exactly as the name suggests: a set of object-oriented abstract
interface specifications for XML infosets. Implementations of DOM parsers in specific languages
like Java or C++ allow XML infosets to be parsed and loaded into an object structure that allows
infoset items to be extracted or manipulated programmatically in these languages.
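
As a brief illustration of the DOM style of programming, the following sketch uses xml.dom.minidom from the Python standard library, a lightweight DOM implementation. The sample emplist document is an assumption made for the example.

```python
from xml.dom.minidom import parseString

# The whole document is parsed into an in-memory object tree.
document = parseString(
    '<emplist><emp empid="e1" name="Ann"/><emp empid="e2" name="Bob"/></emplist>'
)
root = document.documentElement

# Extract items from the tree...
names = [emp.getAttribute("name") for emp in root.getElementsByTagName("emp")]

# ...or manipulate it by creating and attaching new nodes.
new_emp = document.createElement("emp")
new_emp.setAttribute("empid", "e3")
root.appendChild(new_emp)
emp_count = len(root.getElementsByTagName("emp"))

print(names, emp_count)
```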
SIMPLE API FOR XML (SAX)
SAX provides a stream-oriented event-based interface to XML document parsing. One of the
problems with DOM is that an entire XML document is parsed as a whole and therefore held entirely
in memory as a DOM object. Since some documents are quite large or contain many thousands of
elements, the overhead of DOM can be considerable in
these cases. In SAX the document is read from the beginning and parsing proceeds sequentially.
Specific events are raised when specific information set items are encountered and recognized by the
parser. Programs using SAX can sift through large XML documents and extract only those
information set items of interest to them, greatly reducing memory demands in some cases.
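
The event-driven style looks like this in practice. The sketch uses the xml.sax module from the Python standard library; the handler class EmpIdCollector and the sample document are hypothetical. The handler keeps only the values it cares about and never builds a tree of the whole document.

```python
import xml.sax

class EmpIdCollector(xml.sax.ContentHandler):
    """Collect the empid attribute of every emp element as it streams past."""

    def __init__(self):
        super().__init__()
        self.empids = []

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is recognized.
        if name == "emp":
            self.empids.append(attrs.get("empid"))

handler = EmpIdCollector()
xml.sax.parseString(
    b'<emplist><emp empid="e1"/><other/><emp empid="e2"/></emplist>', handler
)
print(handler.empids)
```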
XSL TRANSFORMATIONS (XSLT)
XSLT stands for XSL Transformations, where XSL is the Extensible Stylesheet Language. The XSLT
technology serves two purposes:
• Transform XML input into an output format appropriate for a target device
• Transform XML documents into other XML documents
The first of these is less relevant to the use of XML as a data exchange medium. Rather, it is mostly
used when a single content source (document of some type) needs to be rendered in a number of
different formats, perhaps using different markup conventions. Rather than create multiple versions
of the content (one for each target format), XSLT allows us to create different transforms of the
content into each of the target rendering formats.
The second usage of XSLT is much more relevant to data exchange applications. We can use XSLT
to extract interesting subsets of XML documents and make them into XML documents. Or we can
convert a document conforming to certain encoding rules into one that is similar but conforms to
slightly different rules. In this case XSLT becomes a kind of adapter for converting between
representational models, perhaps useful for knitting systems together.
SOAP
The Simple Object Access Protocol, or SOAP, is a specification that includes a protocol and
representational rules for general XML messaging between abstract processing locations called SOAP
nodes. SOAP messages are composed of an outer element called the SOAP envelope, which contains an
optional SOAP header element followed by a SOAP body element. The contents of the SOAP header
and body elements are SOAP blocks, which contain application-specific XML
datagrams. SOAP messages can be targeted at SOAP nodes, and SOAP nodes are responsible for
processing the SOAP messages targeted to them using the rules of SOAP. The SOAP specification
includes conventions for representing remote procedure calls and responses in SOAP messages.
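
The envelope structure can be sketched in a few lines of Python using xml.etree.ElementTree. Several things here are assumptions: the namespace URI is the one used by the later SOAP 1.2 Recommendation rather than the 2001 working draft cited in the references, and the getRows payload and the function name build_envelope are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the SOAP 1.2 Recommendation, not of the cited draft.
SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"

def build_envelope(payload, header_blocks=()):
    """Wrap an application payload in an envelope with an optional header."""
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    if header_blocks:
        header = ET.SubElement(envelope, "{%s}Header" % SOAP_NS)
        header.extend(header_blocks)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    body.append(payload)
    return envelope

request = ET.Element("getRows", {"table": "emp"})  # hypothetical payload
envelope = build_envelope(request)
print(ET.tostring(envelope, encoding="unicode"))
```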

SOFTWARE CONTRACTS AND XML


Strongly typed XML datagrams passed between distributed software components can be thought of
as complex parameters or even data objects. These data objects can be thought of as having to
conform to software contracts, in the sense of Design by Contract popularized by Bertrand Meyer
[MEYE]. With this understanding, the sender of the datagram must ensure that the contract
preconditions are met before sending. The receiver assumes the preconditions and guarantees the
postconditions will be met. In a client-supplier relationship (discussed below), this would generally
mean returning a response to the original sender expressing the postconditions.
The datagrams are XML and their content model can be controlled using DTD and/or XML
Schema specifications. This control over structure and content of XML datagrams can be used to
help maintain contract enforcement in distributed applications.
CLIENTS AND SUPPLIERS
A fundamental software abstraction is that of the client-supplier relationship. The abstraction applies
at many levels in computing:
• Database engines store and retrieve data for client users.
• Software modules call other modules, passing parameters and receiving data values in
return.
• Operating systems are clients of disk hardware systems, requesting I/O operations and
getting notified that they have completed.
Fundamental to the client-supplier relationship is the idea of a conversation between the two initiated
by the client as a request for some service by the supplier. The supplier responds with the requested
service or perhaps some error message. This conversational communication mode between client
and supplier is sometimes called a request-response protocol.

REQUEST-RESPONSE XML PROTOCOL


We can implement a simple request-response protocol for exchanging data between a supplier and its
clients. The following DTD is a basic skeleton of such a protocol:

<!ENTITY % version "1.0" >
<!ELEMENT API (req | resp ) >
<!ATTLIST API version (%version;) #REQUIRED >
<!ELEMENT req EMPTY >
<!ATTLIST req reqID ID #REQUIRED >
<!ELEMENT resp ( rowset | err ) >
<!ELEMENT rowset (row)* >
<!ELEMENT err EMPTY>
<!ELEMENT row EMPTY>

XML documents using the protocol will declare the API element as their root element. The DTD
specifies a required attribute for the API element called version, whose only permitted value is the
version number of the DTD. This mechanism allows clients and
suppliers to quickly find out at parse time whether they are using the same version of the DTD. The
content model of the API element specifies that a document conforming to the protocol contains
either a req (request) element or a resp (response) element. Response elements contain either a
rowset element or an err element. A rowset element is composed of zero or more row elements.
Once the basic protocol structure has been established, the details of what constitutes valid request,
response, and error elements need to be worked out. This will constrain the specific set of
conversations that are allowed under the protocol.
The protocol helps make explicit the contract obligations of clients and suppliers. Clients are
responsible for generating and sending valid request documents as well as receiving and processing
valid responses or errors. Suppliers are responsible for receiving and understanding valid request
documents and generating and returning valid response documents to clients.
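
The conversation defined by the skeleton DTD can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation described in this paper: it uses the non-validating xml.etree.ElementTree, so the supplier checks the request by hand where a validating parser would rely on the DTD, and the function names make_request and handle_request are hypothetical.

```python
import xml.etree.ElementTree as ET

VERSION = "1.0"  # matches the version entity in the skeleton DTD

def make_request(req_id):
    """Client side: build a request document with API as the root element."""
    api = ET.Element("API", {"version": VERSION})
    ET.SubElement(api, "req", {"reqID": req_id})
    return ET.tostring(api, encoding="unicode")

def handle_request(request_text, row_count):
    """Supplier side: answer with a rowset, or with err if the request is bad."""
    response = ET.Element("API", {"version": VERSION})
    resp = ET.SubElement(response, "resp")
    try:
        api = ET.fromstring(request_text)
        ok = bool(
            api.tag == "API"
            and api.get("version") == VERSION
            and len(api) == 1
            and api[0].tag == "req"
            and api[0].get("reqID")
        )
    except ET.ParseError:
        ok = False
    if ok:
        rowset = ET.SubElement(resp, "rowset")
        for _ in range(row_count):
            ET.SubElement(rowset, "row")
    else:
        ET.SubElement(resp, "err")
    return ET.tostring(response, encoding="unicode")

print(handle_request(make_request("r1"), 2))
```

In either outcome the supplier returns a document that itself conforms to the protocol, so the client can process errors with the same parsing code it uses for results.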
AN XML SUCCESS STORY
Imagine the following software development scenario and requirements:
• Two products with very different architectures but similar problem domains.
• Two development teams separated by thousands of miles and time zones.
• Fundamental cultural and communications challenges between teams.
• Need to provide integrated product functionality.
Coordination of design efforts was hardly an option under the circumstances. However, the desired
integration was essentially a data level interface between the products in classic client-supplier roles.
The decision was made to use XML to capture the structure of the data involved in the client-
supplier conversation. Data returned by the supplier software was specified in a response element.
The various supplier options that determined the response were organized into a request element.
Requests and responses were bundled together into an API as illustrated above. A single DTD file
controlled all conversational documents through use of a validating parser.
Once the teams understood the DTD and their obligations under the protocol they could proceed
independently to develop their respective components. The DTD was quite rigid and explicit, so the
need for clarification and communication between teams was minimal. The DTD was also linguistically
and culturally neutral, so it represented an unambiguous communication of functional requirements.
In the end, the client and supplier modules were connected through a TCP/IP port and understood
each other perfectly on the first test.

CONCLUSION
Hopefully this look at some of the basics of XML has helped clarify this technology somewhat.
There is a confusing sea of technologies that is difficult to wade into at first. For distributed
applications involving data exchange, the ability of XML to capture complex typed data in text
document format is invaluable. Further, the ability to use strongly typed XML datagrams to assist in
(design by) contract enforcement between distributed modules promotes safer interoperability.
Finally, the use of XML to specify interfaces can assist in distributed software development by
explicitly exposing data and conversational requirements in a culturally neutral fashion.

REFERENCES
[BOX] Box, Skonnard and Lam: Essential XML. Addison-Wesley (2000).
[ECKS] Eckstein: XML Pocket Reference. O’Reilly & Associates (1999).
[FUNG] Fung: XSLT Working with XML and HTML. Addison-Wesley (2001).
[HOLZ] Holzner: Inside XSLT. New Riders Publishing (2001).
[MEYE] Meyer: Object-Oriented Software Construction, 2nd Ed. Prentice-Hall PTR (1997).
[W3C] W3C: SOAP Version 1.2 Part 1: Messaging Framework. www.w3.org/TR/2001/WD-soap-part1-20011002/

ABOUT THE AUTHOR


John Beresniewicz is a technology manager with Precise Software Solutions. He was a member of
the original design and development team for the Precise/Savant product and is currently working
on extending and integrating Precise’s performance management solutions. John has worked with
Oracle technologies for over 13 years. He is a frequent and popular speaker at Oracle user groups
and conferences on the topics of database performance and server-side PL/SQL development and a
member of the IOUG Masters Classes faculty.
