XML Basics
WHAT IS XML?
I suppose the best place to start is the official home of XML on the web, the W3C (World Wide Web
Consortium) web site. There we can find the following definition (www.w3.org/XML/):
“XML is the universal format for structured documents and data on the Web.”
That short sentence has a lot of ambition in it. In particular, XML intends to be universal for the
Web. XML is also about documents and data, a distinction that points at the two main branches of
XML usage: for rendering (publishing) content on the Web, and for exchanging structured data on
the Web or elsewhere.
This paper is based on my experiences and learning gained from implementing a specific use of
XML. That use is the exchange of simple, well-defined structured data between networked software
components using a request-response protocol between client and supplier. My expertise in the
wider and more sophisticated usages of XML is yet to be developed.
XML has a wide variety of uses and probably abuses also. Recently, there has been much hype and
hoopla about XML and it can be difficult to sift out the tunes from the noise. Hopefully, this short
article will help clarify some fundamental aspects of XML for newcomers to this new and somewhat
strange corner of technology-world.
XML TECHNOLOGIES
There is a whole raft of technologies in XML-world that are actively being developed, specified and
standardized. Some of the more important technologies are:
• XML 1.0 and namespaces
• XML information set
• Document Object Model (DOM)
• Simple API for XML (SAX)
• XML Path Language (XPath)
• XSL Transformations (XSLT)
• XML Schema
• Simple Object Access Protocol (SOAP)
The first of these, XML 1.0 and namespaces, refers to the specification for XML itself. The others
are extensions or related technologies that matter for the use of XML generally and
for data exchange in particular. They are mentioned or covered in varying depth below.
It is generally regarded as bad practice to use mixed content, especially for data exchange
applications. Perhaps there are valid reasons for using it in the document formatting area.
ELEMENT NORMAL FORM
Element-normal form refers to structuring documents in such a way that all information is carried in
element content rather than attribute values.
Consider the following simple element.
<doc>
<name>my.doc</name>
<size>25,565</size>
<type>MS Word</type>
</doc>
Each property of interest about the doc element is represented by a child element whose content is
the value of that property for the parent doc element. The document has a total of four elements,
three of which contain character data.
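As a minimal sketch of how a consumer might read a document in element-normal form, the following uses Python's standard xml.etree.ElementTree module. The function name read_element_normal is my own label for illustration and is not part of any API discussed here.

```python
import xml.etree.ElementTree as ET

# The doc element from the text, in element-normal form.
DOC_XML = """<doc>
  <name>my.doc</name>
  <size>25,565</size>
  <type>MS Word</type>
</doc>"""


def read_element_normal(xml_text):
    """Return {child tag: child text} for the root element's children."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


props = read_element_normal(DOC_XML)
print(props)
```

Each property comes back as a string; interpreting "25,565" as a number is left to the application.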
Oracle’s canonical representation of SQL result sets uses element-normal form.
ATTRIBUTE NORMAL FORM
We can represent exactly the same information about the doc element using attribute values only, as
below:
<doc name="my.doc"
size="25,565"
type="MS Word"/>
In this representation, there is only the single doc element and it is empty of content. Attribute-
normal form is a somewhat terser representation, as many element tags have been eliminated.
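A comparable sketch for attribute-normal form, again using xml.etree.ElementTree, shows that the same three properties can be recovered from the attributes of the single empty element. The function name read_attribute_normal is hypothetical.

```python
import xml.etree.ElementTree as ET

# The doc element from the text, in attribute-normal form.
ATTR_XML = '<doc name="my.doc" size="25,565" type="MS Word"/>'


def read_attribute_normal(xml_text):
    """Return the root element's attributes as a plain dict."""
    root = ET.fromstring(xml_text)
    return dict(root.attrib)


attr_props = read_attribute_normal(ATTR_XML)
print(attr_props)
```

The resulting dictionary holds the same name, size and type values that the element-normal document carries in child elements.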
Some authors [BOX] seem to prefer element-normal form over attribute-normal form, as attributes
are less flexible than elements: an attribute is a terminus in the parent-child chain and cannot have children of its own. I have preferred attribute-normal
form to express a simple data exchange API convention. In the end there may be specific
applications where one form is preferred over the other but in many cases either will do the job fine.
• The foo element is required to have a bar child element followed by another bar child
element.
• The emplist element contains zero or more emp elements.
• The choice element must contain either an A element or a B element.
• The attributesonly element cannot have child elements or content, but may have
attribute values.
These examples give some idea of the kinds of structural rules that can be imposed on documents.
Bear in mind that child elements in the content model will have their own content models. The
structural complexity theoretically possible is essentially unlimited.
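One way to get a feel for the four rules listed above is to treat each content model as a regular expression over the names of an element's children. The sketch below does this by hand in Python; it is not a validating parser, and the DTD declarations in the comments are my assumption of how such rules would be spelled in standard DTD syntax. It checks only the sequence of child elements, not text content or attributes.

```python
import re
import xml.etree.ElementTree as ET

# Each content model is approximated by a regular expression over the
# space-separated names of an element's children. The comment beside each
# entry is the assumed DTD declaration it stands in for.
CONTENT_MODELS = {
    "foo": r"bar bar",            # <!ELEMENT foo (bar, bar)>
    "emplist": r"(emp( emp)*)?",  # <!ELEMENT emplist (emp*)>
    "choice": r"A|B",             # <!ELEMENT choice (A | B)>
    "attributesonly": r"",        # <!ELEMENT attributesonly EMPTY>
}


def children_conform(xml_text):
    """True if the root element's child sequence matches its content model."""
    root = ET.fromstring(xml_text)
    child_names = " ".join(child.tag for child in root)
    return re.fullmatch(CONTENT_MODELS[root.tag], child_names) is not None


print(children_conform("<foo><bar/><bar/></foo>"))      # two bar children
print(children_conform("<choice><A/><B/></choice>"))    # both A and B present
```

The first call reports a conforming document and the second a violation, since choice allows exactly one of A or B.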
ATTRIBUTES
Attributes have names and hang onto elements. DTDs can require that attributes be specified for
each occurrence of the element, or they can be optional. They can specify a fixed or default value for
the attribute or an enumerated list of valid values. The following are some basic attribute
declarations for the element emp:
• The empid attribute is a required unique identifier for the element within the document.
• The name attribute is optional and composed of character data.
• The sex attribute is required and limited to the choices M and F.
• The mgrid attribute is optional but must match the value of an empid attribute in the
document if specified.
Attributes can be used to implement complex structural aspects through IDREF pointers, as well as
to carry specific data values similar to element content. The real disadvantage of DTDs is that the
data typing of attribute values as well as element content is much too weak. There is no built-in
mechanism to require numeric attribute values, for instance.
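To make the four attribute rules for emp concrete, here is a hand-written Python check of the same constraints. A validating parser would enforce them from the DTD; this sketch only imitates that behaviour. The ATTLIST in the comment is my assumption of a declaration matching the rules, and the sample employees are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed declaration matching the four rules in the text:
# <!ATTLIST emp empid ID      #REQUIRED
#               name  CDATA   #IMPLIED
#               sex   (M | F) #REQUIRED
#               mgrid IDREF   #IMPLIED>

# Invented sample data; the third emp breaks two of the rules.
EMPLIST_XML = """<emplist>
  <emp empid="e1" name="Ann" sex="F"/>
  <emp empid="e2" sex="M" mgrid="e1"/>
  <emp empid="e3" name="Raj" sex="X" mgrid="e9"/>
</emplist>"""


def attribute_errors(xml_text):
    """Return a list of rule violations found on emp elements."""
    emps = ET.fromstring(xml_text).findall("emp")
    errors = []
    seen_ids = set()
    for emp in emps:
        empid = emp.get("empid")
        if empid is None:
            errors.append("emp is missing required empid")
        elif empid in seen_ids:
            errors.append(f"duplicate empid {empid}")
        else:
            seen_ids.add(empid)
        # sex is required and must be one of the enumerated values.
        if emp.get("sex") not in ("M", "F"):
            errors.append(f"emp {empid}: sex must be M or F")
    # IDREF check: every mgrid must point at some empid in the document.
    for emp in emps:
        mgrid = emp.get("mgrid")
        if mgrid is not None and mgrid not in seen_ids:
            errors.append(f"emp {emp.get('empid')}: mgrid {mgrid} matches no empid")
    return errors


errors = attribute_errors(EMPLIST_XML)
print(errors)
```

Note that nothing in these rules could insist that an attribute value be numeric, which is the weakness described above.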
ENTITIES
Entities are the trickiest and most powerful aspect of DTDs to learn. In general they are pointers or
references to something that has been defined or exists elsewhere whose contents are substituted for
the entity reference. They have several types and uses, including:
• Parameter entities are defined in the DTD and replaced within the DTD itself.
• General entities are defined in the DTD and replaced within XML documents using the
DTD.
• External entities are URI references to XML content that is merged into documents.
• Unparsed entities are references to non-XML content, e.g. graphics files.
The following simple example shows a parameter entity used to standardize declarations of Boolean
attribute enumeration lists:
The entity bool is expanded in the DTD wherever the %bool; reference is encountered. This allows
a degree of notational efficiency as well as standardization of attribute declaration.
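The substitution described above can be imitated with a few lines of Python. This is only a textual sketch of the idea: the entity value (true | false) and the attribute names active and exempt are assumptions of mine, and a real parser applies additional rules when it expands parameter entities.

```python
import re

# Hypothetical DTD fragment: one parameter entity reused by two declarations.
DTD_TEXT = """<!ENTITY % bool "(true | false)">
<!ATTLIST emp active %bool; #REQUIRED>
<!ATTLIST emp exempt %bool; "false">"""


def expand_parameter_entities(dtd_text):
    """Replace each %name; reference with the text declared for that entity."""
    # Collect declarations of the form <!ENTITY % name "replacement">.
    entities = dict(
        re.findall(r'<!ENTITY\s+%\s+(\w+)\s+"([^"]*)"\s*>', dtd_text)
    )
    # A reference has no space between % and the name, so the declaration
    # line itself is left untouched.
    return re.sub(r"%(\w+);", lambda m: entities[m.group(1)], dtd_text)


expanded = expand_parameter_entities(DTD_TEXT)
print(expanded)
```

After expansion both ATTLIST declarations carry the same enumeration, which is the standardization benefit mentioned above.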
DTD EXAMPLE: POINTS AND LINES
DTDs are reasonably good at restricting (typing) the structural organization of an XML document’s
nodes; however, they are quite limited in imposing type constraints on actual data values.
Take for example the following DTD in which the line element is composed of two point elements,
each of which is composed of an x, y element pair:
Structurally, the model of a line consisting of two points specified by x and y coordinates works well.
The content model is in attribute-normal form. That is, the leaf elements are declared EMPTY and
data payload is carried in attribute values, specifically the val attribute of elements x and y.
Let’s look at a specific XML document conforming to the DTD above:
Everything seems OK until we get to the x and y values for the second point, which contains
alphabetic characters where we want numbers representing an integer value. Here is the big problem
with DTDs. It is not obvious and in fact quite difficult to enforce a rule as simple as requiring an
attribute’s value to be an integer. We will see further down why enforcing very strict rules on XML
type and structure can be highly advantageous in the development of trustworthy and stable
distributed systems.
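The gap can be illustrated with a small Python sketch. The DTD in the comment and the sample document are my reconstruction from the description above, not the original listings. The document is structurally what such a DTD asks for, yet the integer check has to be written by hand in application code.

```python
import xml.etree.ElementTree as ET

# Assumed DTD, reconstructed from the prose description:
# <!ELEMENT line  (point, point)>
# <!ELEMENT point (x, y)>
# <!ELEMENT x EMPTY>
# <!ELEMENT y EMPTY>
# <!ATTLIST x val CDATA #REQUIRED>
# <!ATTLIST y val CDATA #REQUIRED>

# Structurally valid sample; the second point carries non-numeric values.
LINE_XML = """<line>
  <point><x val="10"/><y val="20"/></point>
  <point><x val="abc"/><y val="2o"/></point>
</line>"""


def non_integer_vals(xml_text):
    """Return (point number, axis, value) for every val that is not an integer.

    Assumes the structure has already been validated, so every point
    has both an x and a y child.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for index, point in enumerate(root.findall("point"), start=1):
        for axis in ("x", "y"):
            val = point.find(axis).get("val")
            try:
                int(val)
            except (TypeError, ValueError):
                bad.append((index, axis, val))
    return bad


bad_values = non_integer_vals(LINE_XML)
print(bad_values)
```

Both coordinates of the second point are reported, even though CDATA typing in the DTD accepts them without complaint.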
XML SCHEMA
XML Schema is a relatively new standard designed to provide robust mechanisms for stronger typing
of XML document objects and their content. It has much better mechanisms than the DTD
language for specifying data and structural validation rules. In addition, XML Schema is
implemented in XML itself, so abstract type metadata and actual document type occurrences are
expressed using the same syntax.
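Because a schema is itself an XML document, it can be read with the same tools as the data it describes. The following Python sketch parses a small schema fragment with xml.etree.ElementTree and lists the declared attribute types. The fragment is an assumption of mine showing how a val attribute might be typed as an integer; the code does not perform schema validation, which the Python standard library does not offer.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment: an empty x element whose val attribute
# must be an integer.
SCHEMA_XML = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="x">
    <xs:complexType>
      <xs:attribute name="val" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def declared_attribute_types(schema_text):
    """Return {attribute name: declared type} for every xs:attribute."""
    root = ET.fromstring(schema_text)
    return {
        attr.get("name"): attr.get("type")
        for attr in root.iter(f"{{{XS}}}attribute")
    }


attribute_types = declared_attribute_types(SCHEMA_XML)
print(attribute_types)
```

The type xs:integer is exactly the kind of constraint that a DTD attribute declaration cannot express.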
The newness and increased complexity of XML Schema may limit the number or quality of parsers
for some time. However, it is still probably the best bet for new projects to define their data
structures using XML Schema as it seems clearly destined to succeed DTDs.
XML documents using the protocol will declare the API element as their root element. The DTD
specifies a fixed attribute for the API element called version. This mechanism allows clients and
suppliers to quickly find out at parse time whether they are using the same version of the DTD. The
API element specifies that a document conforming to the protocol is either a req (request) element
or a resp (response) element. Response elements contain either a rowset element or an err
element. A rowset element is composed of zero or more row elements.
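The structure just described can be sketched as a hand-written check in Python. In the project a validating parser enforced these rules from the DTD; the function below only imitates them for illustration. The function name classify_message, the version value "1.0" and the assumption that API and resp each hold exactly one child are mine.

```python
import xml.etree.ElementTree as ET


def classify_message(xml_text, expected_version):
    """Check the basic protocol structure and describe the message."""
    root = ET.fromstring(xml_text)
    if root.tag != "API":
        raise ValueError("root element must be API")
    # The fixed version attribute lets both sides detect a DTD mismatch early.
    if root.get("version") != expected_version:
        raise ValueError("DTD version mismatch")
    children = list(root)
    if len(children) != 1 or children[0].tag not in ("req", "resp"):
        raise ValueError("API must contain exactly one req or resp")
    message = children[0]
    if message.tag == "req":
        return "request"
    parts = list(message)
    if len(parts) != 1 or parts[0].tag not in ("rowset", "err"):
        raise ValueError("resp must contain exactly one rowset or err")
    payload = parts[0]
    if payload.tag == "err":
        return "error"
    if any(child.tag != "row" for child in payload):
        raise ValueError("rowset may contain only row elements")
    return f"rowset with {len(payload)} row(s)"


reply = '<API version="1.0"><resp><rowset><row/><row/></rowset></resp></API>'
print(classify_message(reply, "1.0"))
```

A client could run a check like this on every reply before touching the payload, so that a malformed or mismatched document is rejected at the boundary.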
Once the basic protocol structure has been established, the details of what constitutes valid request,
response, and error elements need to be worked out. This will constrain the specific set of
conversations that are allowed under the protocol.
The protocol helps make explicit the contract obligations of clients and suppliers. Clients are
responsible for generating and sending valid request documents as well as receiving and processing
valid responses or errors. Suppliers are responsible for receiving and understanding valid request
documents and generating and returning valid response documents to clients.
AN XML SUCCESS STORY
Imagine the following software development scenario and requirements:
• Two products with very different architectures but similar problem domains.
• Two development teams separated by thousands of miles and time zones.
• Fundamental cultural and communications challenges between teams.
• Need to provide integrated product functionality.
Coordination of design efforts was hardly an option under the circumstances. However, the desired
integration was essentially a data level interface between the products in classic client-supplier roles.
The decision was made to use XML to capture the structure of the data involved in the client-
supplier conversation. Data returned by the supplier software was specified in a response element.
The various supplier options that determined the response were organized into a request element.
Requests and responses were bundled together into an API as illustrated above. A single DTD file
controlled all conversational documents through use of a validating parser.
Once the teams understood the DTD and their obligations under the protocol they could proceed
independently to develop their respective components. The DTD was quite rigid and explicit, so the
need for clarification and communication between teams was minimal. The DTD was also language
and culturally neutral, so it represented an unambiguous communication of functional requirements.
In the end, the client and supplier modules were connected through a TCP/IP port and understood
each other perfectly on the first test.
CONCLUSION
Hopefully this look at some of the basics of XML has helped clarify this technology somewhat.
There is a confusing sea of technologies that is difficult to wade into at first. For distributed
applications involving data exchange, the ability of XML to capture complex typed data in text
document format is invaluable. Further, the ability to use strongly typed XML datagrams to assist in
(design by) contract enforcement between distributed modules promotes safer interoperability.
Finally, the use of XML to specify interfaces can assist in distributed software development by
explicitly exposing data and conversational requirements in a culturally neutral fashion.
REFERENCES
[BOX] Box, Skonnard and Lam: Essential XML. Addison-Wesley (2000).
[ECKS] Eckstein: XML Pocket Reference. O’Reilly & Associates (1999).
[FUNG] Fung: XSLT Working with XML and HTML. Addison-Wesley (2001).
[HOLZ] Holzner: Inside XSLT. New Riders Publishing (2001).
[MEYE] Meyer: Object-Oriented Software Construction, 2nd Ed. Prentice-Hall PTR (1997).
[W3C] W3C: SOAP Version 1.2 Part 1: Messaging Framework. www.w3.org/TR/2001/WD-soap-part1-20011002/