
Advanced Java Programming

(J2EE LC)
XML Parsers - Day 3
Course Objectives
Overview of XML
XML Document Type Definitions (DTDs)
XML Schemas
To understand the need for parsing XML documents
To understand types of XML Parsers
– Validating vs. Non-Validating Parsers
To understand different XML Parser Interfaces
– Tree Based Interface Standard : DOM
– Event Based Interface Standard : SAX
Evaluating Parsers
– Which parser to use?

ER/CORP/CRS/LA22/003
Copyright © 2005, Infosys Technologies Ltd
Version 1.00
Recap on XML
What is XML?
– eXtensible Markup Language (XML)

Uses of XML
– XML Data Buffers : Used to store data
– Config Files : Describe the configuration of servers
– Example : The user configuration file for the Tomcat web server (tomcat-users.xml)

How are these files read?

– Parsers are used to read XML documents programmatically
– Types of Parsers
• DOM: Reads the entire XML document, converts it into in-memory objects and keeps the data ready
• SAX: Incremental parser, parses chunk by chunk, used for huge XML data
• Validating and Non-Validating Parsers

XML structure and data validating technologies


– XML Document Type Definition (DTD)
– XML Schema

Recap on XML
XML Document (address.xml)

<?xml version="1.0"?>                                  XML Declaration
<address>                                              Root Element: address
  <name>                                               Nested Elements: first, middle, last
    <first>John</first>
    <middle>Fitzgerald Johansen</middle>
    <last>Doe</last>
  </name>
  <doornumber>2345</doornumber>
  <street>Kalidasa Road</street>
  <city>Mysore</city>
  <pin>570 002</pin>
  <telephone type="work">91-821-2404000</telephone>     Attribute: type
  <telephone type="home">91-821-2404001</telephone>
  <telephone type="mobile">91-93424-
Namespaces in XML
Namespaces help to differentiate two objects (XML data) that have the same name

In the example below, a 'table' can be a visual element or a piece of furniture

Without namespaces, the two <table> elements clash:

<table>
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

<table>
  <name>Coffee Table</name>
  <width>80</width>
  <length>120</length>
</table>

With the prefixes web: and wood:, the two elements are distinct:

<web:table>
  <web:tr>
    <web:td>Apples</web:td>
    <web:td>Bananas</web:td>
  </web:tr>
</web:table>

<wood:table>
  <wood:name>Coffee Table</wood:name>
  <wood:width>80</wood:width>
  <wood:length>120</wood:length>
</wood:table>
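
How a namespace-aware parser tells the two kinds of table apart can be sketched in Java as follows. This is a minimal illustration, not part of the original slides: the namespace URIs, the wrapper element root and the class name NamespaceSketch are assumptions made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceSketch {
    // Hypothetical namespace URIs, chosen only for this illustration.
    static final String WEB_NS = "http://www.example.com/web";
    static final String WOOD_NS = "http://www.example.com/wood";

    // Both tables in one document; the prefixes are bound on a hypothetical root element.
    static final String XML =
        "<root xmlns:web='" + WEB_NS + "' xmlns:wood='" + WOOD_NS + "'>"
      + "<web:table><web:tr><web:td>Apples</web:td></web:tr></web:table>"
      + "<wood:table><wood:name>Coffee Table</wood:name></wood:table>"
      + "</root>";

    /** Counts the elements with local name "table" in the given namespace. */
    static int countTables(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, prefixes are not resolved
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, "table").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("web tables:  " + countTables(WEB_NS));
        System.out.println("wood tables: " + countTables(WOOD_NS));
    }
}
```

The lookup is by namespace URI and local name, so the prefix text itself (web, wood) is only a shorthand.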

Document Type Definitions (DTDs)
Describes syntax that explains
– which elements may appear in the XML document
– what the contents and attributes of each element are

Need for DTD

– A validating parser (a program) can be used to check whether XML data adheres to
the rules in the DTD
– The parser can do appropriate error handling if there is any violation
– A validity error is not necessarily a fatal error, but some applications may treat it as
a fatal error

Document Type Declarations

– A valid XML document must include a reference to the DTD against which it is validated
– Types of DTD
• Internal DTD: DTD can be embedded into the XML document
• External DTD: DTD can be in a separate file

Internal DTD
DTD embedded in the XML document
– The declarations appear between [ and ]
– E.g. AddressBook.xml

<?xml version='1.0' encoding='utf-8'?>               XML Declaration
<!-- DTD for a AddressBook.xml -->
<!DOCTYPE AddressBook [                              Document Name (Root Element)
<!ELEMENT AddressBook (Address+)>                    Internal DTD: defining the elements
<!ELEMENT Address (Name, Street, City)>              AddressBook, Address, Name, Street, City
<!ELEMENT Name (#PCDATA)>
<!ATTLIST Name salutation CDATA #REQUIRED>           Defining the attribute salutation
<!ELEMENT Street (#PCDATA)>
<!ELEMENT City (#PCDATA)>
]>
<AddressBook>
  <Address>
    <Name salutation="Mr.">Ram</Name>
    <Street>M G Road</Street>
    <City>Bangalore</City>
  </Address>
</AddressBook>
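
A validating parser can check a document against this internal DTD. The sketch below is one possible way to do so with JAXP; the class name DtdValidationSketch, the method validityErrors and the test documents are assumptions introduced for illustration, while the DTD itself is the one from AddressBook.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // The internal DTD of AddressBook.xml.
    static final String DTD =
        "<!DOCTYPE AddressBook [\n"
      + "<!ELEMENT AddressBook (Address+)>\n"
      + "<!ELEMENT Address (Name, Street, City)>\n"
      + "<!ELEMENT Name (#PCDATA)>\n"
      + "<!ATTLIST Name salutation CDATA #REQUIRED>\n"
      + "<!ELEMENT Street (#PCDATA)>\n"
      + "<!ELEMENT City (#PCDATA)>\n"
      + "]>\n";

    static final String VALID_BODY =
        "<AddressBook><Address><Name salutation='Mr.'>Ram</Name>"
      + "<Street>M G Road</Street><City>Bangalore</City></Address></AddressBook>";

    // Well-formed, but the required attribute salutation and the City element are missing.
    static final String INVALID_BODY =
        "<AddressBook><Address><Name>Ram</Name>"
      + "<Street>M G Road</Street></Address></AddressBook>";

    /** Parses DTD + body with validation switched on and returns the validity errors. */
    static List<String> validityErrors(String body) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validate against the DTD named in the DOCTYPE
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document:   " + validityErrors(VALID_BODY));
        System.out.println("invalid document: " + validityErrors(INVALID_BODY));
    }
}
```

Validity errors arrive through error() and parsing continues, which matches the point that a validity error is not necessarily fatal.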

External DTD

DTD is present in separate file


Example
– The DTD for AddressBook.xml is contained in a file AddressBook.dtd
– AddressBook.xml contains only XML data with a reference to the DTD file

AddressBook.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE AddressBook SYSTEM "file:///c:/XML/AddressBook.dtd">     Reference to external DTD
<AddressBook>
  <Address>
    <Name salutation="Mr.">Ram</Name>
    <Street>M G Road</Street>
    <City>Bangalore</City>
  </Address>
</AddressBook>

Anatomy of DTD – Defining new XML tags
(Elements)
<!ELEMENT element_name content_specification>
– element_name: Specifies the name of the XML tag

– content_specification: Specifies the contents of the element

• (#PCDATA): Parsed character data; the parser scans the text for markup and entity references, and passes white space on to the application

• CDATA: Unparsed character data; this keyword is valid only as an attribute type in <!ATTLIST>, not as element content

• Nested elements

Example:
– <!ELEMENT Street (#PCDATA)>
• Element Street contains parsed character data

– <!ELEMENT Address (Name, Street, City)>

• Element Address contains the three nested elements Name, Street and City, in that order

– <!ELEMENT AddressBook (Address+)>

• Element AddressBook contains one or more occurrences of element Address

Anatomy of DTD – Attribute Declarations

Specifies the allowable attributes of each element

<!ATTLIST Tag-name Attr-Name Attr-Type Restriction>

– Tag-name : Element name

– Attr-Name : Name of the attribute; the attribute is defined for element Tag-name

– Attr-Type : Type of the attribute value (e.g. CDATA)

– Restriction : Whether the attribute must be present (#REQUIRED), is optional (#IMPLIED), etc

Example
– <!ATTLIST Name salutation CDATA #REQUIRED>

– The element Name has an attribute salutation of type CDATA

– The attribute salutation must be specified in every Name tag

Anatomy of DTD – Entity Declarations (1 of 2)

A way to escape special characters

The special characters < and & cannot appear literally in #PCDATA (and > is usually escaped as well)
Such an escape is called an "entity reference"
The following kinds of references are used in XML documents
– Built-in entities : &amp;, &lt;, &gt;, &apos;, &quot;
– Character references : &#243; representing ó
– General entities : &source-text;

Example (AddressBook1.xml)
– <State>Jammu &amp; Kashmir</State>

Anatomy of DTD – Entity Declarations(2 of 2)

Data that is frequently used can be declared as a general entity

– <!ENTITY entity_name entity_contents>

• entity_name : Name of the new entity

• entity_contents : Contents of the new entity

Example
– <!ENTITY MyCountry "India">

• Defines an entity called MyCountry

• "India" is the content of the entity MyCountry

Usage in the XML Document


– <Country>&MyCountry;</Country>
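
What the application sees after the parser has expanded these references can be sketched as below. The wrapper element Address, the class name EntitySketch and the helper textOf are assumptions for this illustration; the entity declaration and the two elements come from the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntitySketch {
    // Internal DTD declaring the general entity MyCountry, followed by a small document.
    static final String XML =
        "<!DOCTYPE Address [\n"
      + "<!ENTITY MyCountry \"India\">\n"
      + "]>\n"
      + "<Address>"
      + "<State>Jammu &amp; Kashmir</State>"
      + "<Country>&MyCountry;</Country>"
      + "</Address>";

    /** Returns the text content of the first element with the given tag name. */
    static String textOf(String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("State   = " + textOf("State"));
        System.out.println("Country = " + textOf("Country"));
    }
}
```

The application never sees &amp; or &MyCountry; themselves, only the replacement text.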

XML Schema
What is XML Schema?
– An XML vocabulary for expressing your data's structure and business rules

– Validating parsers can use a Schema to check whether XML data adheres to the rules in the schema

– More robust and extensive than DTD; it can even validate data types

– E.g. : Consider the following XML Document

<Result>
  <EmpNo>45609</EmpNo>
  <Name>Kiran</Name>
  <Subject>
    <Name>CHSSC</Name>
    <Marks>80</Marks>
    <Grade>A</Grade>
  </Subject>
</Result>

Is this data valid?

To be valid, it must meet the following business rules (constraints)

– Each Subject must comprise a Name, Marks and a Grade, in the order shown

– The subject Name must be a valid subject from the list (PF, CHSSC, RDBMS, IWT, AOA)

– The Marks must be between 0 and 100, and the Grade must be A, B or C
XML Schema : Validating the XML Document
Validating your Data (XML Document)
XML Document (Instance Document)

<Result>
  <Name>Kiran</Name>
  <EmpNo>45609</EmpNo>
  <Subject>
    <Name>CHSSC</Name>
    <Marks>80</Marks>
    <Grade>A</Grade>
  </Subject>
</Result>

Constraints on XML Document (Schema)

– Subject, Marks and Grade must appear in that order
– The Subject must be one of the following: CHSSC, PF, RDBMS, AOA, IWT
– The Marks must be between 0 and 100 only
– The Grade can be either A, B or C

[Figure: the instance document and the schema constraints are fed to an XML Schema validating parser, which reports "Data is Ok!"]

How can XML schema help to accomplish this?
Answer
– It creates an XML vocabulary : Defines the following set of elements
• <Result>, <Subject>, <Marks>, <Grade>

– It specifies the contents of each element and the restrictions on each element

• <Result> element must contain <Subject>, <Marks>, <Grade> in that order

• <Subject> must be one of the valid subjects (CHSSC, PF, RDBMS, AOA, IWT)

• The Marks must be between 0 and 100 only

• Grade can be either A, B or C

– XML Schema specifies the namespace to which the created vocabulary belongs

– The namespace name need not be a resolvable URL; it uses URL syntax and should be a unique string

– Example: The namespace http://www.Results.com defines the vocabulary <Result>, <Subject>, <Marks>, <Grade>

Example of referring to Schema : Result.xml

<?xml version="1.0" encoding="UTF-8"?>
<res:Result xmlns:res="http://www.Results.com"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.Results.com Result.xsd">
  <res:Name>Kiran</res:Name>
  <res:EmpNo>45609</res:EmpNo>
  <res:Subject>
    <res:Name>CHSSC</res:Name>
    <res:Marks>80.70</res:Marks>
    <res:Grade>A</res:Grade>
  </res:Subject>
  <res:Subject>
    <res:Name>PF</res:Name>
    <res:Marks>78.30</res:Marks>
    <res:Grade>B+</res:Grade>
  </res:Subject>
</res:Result>
Schema example : Result.xsd

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.Results.com"
            xmlns="http://www.Results.com" elementFormDefault="qualified">
  <!-- Root Element Declaration -->
  <xsd:element name="Result">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Name" type="xsd:string"/>
        <xsd:element name="EmpNo" type="xsd:int"/>
        <xsd:element name="Subject" type="SubjectType" maxOccurs="5"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:simpleType name="NameType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="CHSSC|PF|RDBMS|IWT|AOA"/>
    </xsd:restriction>
  </xsd:simpleType>
[ Continued ……]
Schema example : Result.xsd (Continued ……)
  <xsd:complexType name="SubjectType">
    <xsd:sequence>
      <xsd:element name="Name" type="NameType"/>
      <!-- Reference to the element Marks -->
      <xsd:element ref="Marks"/>
      <xsd:element name="Grade">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <!-- The + is escaped: an unescaped + is a regex quantifier, so the grade "B+" would not match -->
            <xsd:pattern value="A|B\+|B|C|D"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="Marks">
    <xsd:simpleType>
      <xsd:restriction base="xsd:float">
        <xsd:minInclusive value="0.0"/>
        <xsd:maxInclusive value="100.0"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
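
Checking a document against such restrictions from Java can be sketched with the javax.xml.validation API. To stay short, the sketch uses a cut-down schema that keeps only the Marks element and its 0.0 to 100.0 range, without a target namespace; that reduced schema, the class name SchemaValidationSketch and the method isValid are assumptions for illustration, not the full Result.xsd.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaValidationSketch {
    // Cut-down schema: only the Marks element with its range restriction.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='Marks'>"
      + "<xsd:simpleType>"
      + "<xsd:restriction base='xsd:float'>"
      + "<xsd:minInclusive value='0.0'/>"
      + "<xsd:maxInclusive value='100.0'/>"
      + "</xsd:restriction>"
      + "</xsd:simpleType>"
      + "</xsd:element>"
      + "</xsd:schema>";

    /** Returns true if the document satisfies the schema, false if it violates it. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no error handler set, the validator throws on the first error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // schema violation or malformed XML
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("80.70 -> " + isValid("<Marks>80.70</Marks>"));
        System.out.println("150   -> " + isValid("<Marks>150</Marks>"));
        System.out.println("abc   -> " + isValid("<Marks>abc</Marks>"));
    }
}
```

The second and third documents fail for different reasons: one is outside the range, the other is not a float at all, which is the kind of data type check a DTD cannot express.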
Result.xml : Understanding XML Declaration

<?xml version="1.0" encoding="UTF-8"?>                              XML Declaration
<res:Result xmlns:res="http://www.Results.com"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.Results.com Result.xsd">
  <res:Name>Kiran</res:Name>
  <res:EmpNo>1000</res:EmpNo>
  <res:Subject>
    ...
    ...
  </res:Subject>
</res:Result>                                                       XML Data

– All elements prefixed with res: are defined in the www.Results.com namespace
– The namespace www.Results.com is defined in Result.xsd

Result.xml : Understanding Structure of XML Data

<?xml version="1.0" encoding="UTF-8"?>                              XML Declaration and
<res:Result xmlns:res="http://www.Results.com"                      Reference to Schema
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.Results.com Result.xsd">
  <res:Name>Kiran</res:Name>
  <res:EmpNo>1000</res:EmpNo>
  <res:Subject>                                                     CHSSC Result
    <res:Name>CHSSC</res:Name>
    <res:Marks>80.90</res:Marks>
    <res:Grade>A</res:Grade>
  </res:Subject>
  <res:Subject>                                                     PF Result
    <res:Name>PF</res:Name>
    <res:Marks>45.30</res:Marks>
    <res:Grade>D</res:Grade>
  </res:Subject>
</res:Result>

– Attributes prefixed with xsi: are defined in the www.w3.org/.../XMLSchema-instance namespace
– All elements prefixed with res: are defined in the www.Results.com namespace

Understanding XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.Results.com"
            xmlns="http://www.Results.com" elementFormDefault="qualified">

  <xsd:element name="Result">                                       Define Element Result
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Name" type="xsd:string"/>
        <xsd:element name="EmpNo" type="xsd:int"/>
        <xsd:element name="Subject" type="SubjectType" maxOccurs="5"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="SubjectType">
    ...
    ...
  </xsd:complexType>
</xsd:schema>

– All the elements prefixed with xsd are defined in the www.w3.org/../... namespace
– All the elements defined here are part of this "targetNamespace"

DTD vs Schema

XML documents and DTDs use different syntax : Inconsistency

– Schema uses XML syntax

Limited data type capability

– DTDs support only a very limited capability for specifying data types.

– DTDs do not support field-level validations and complex types

• E.g. : In a DTD you cannot express "I want the <Marks> element to hold an integer in the range 0 to 100"

Schema describes a set of data types compatible with those found in databases
– E.g.: Databases support data types such as integer and string

– Schema supports integer, string, etc., while DTD does not

Element Declarations: Simple Element
Syntax :
<xsd:element name="Element_name" type="Element_type" Occurrence/>

Element_name : Any valid XML name
Element_type : Built-in simple type
Occurrence : Number of occurrences of that element, optional

Example :
– <xsd:element name="Name" type="xsd:string"/>
• Defines the element Name of type string
– <xsd:element name="Marks" type="xsd:float" maxOccurs="5"/>
• Defines the element Marks of simple type float
• Marks may appear at most 5 times
• And, by default, at least once

Element Declarations
Syntax :
<xsd:element name="Element_name">
  <xsd:complexType>
    <!-- Element Specification -->
  </xsd:complexType>
</xsd:element>
– Example
<xsd:element name="Subject">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="Name" type="xsd:string"/>
      <xsd:element name="Marks" type="xsd:float"/>
      <xsd:element name="Grade" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
• Defines a non-reusable complex element called 'Subject'
• Each element appears in that sequence because the <xsd:sequence> tag is used
Element Declarations: Reusable Simple Type
Syntax :
<xsd:simpleType name="Element_type_name">
  <xsd:restriction base="Base_Data_type">
    <!-- Restriction specification -->
  </xsd:restriction>
</xsd:simpleType>

Element_type_name : Name of the data type
Base_Data_type : Any of the built-in simple data types (integer, float etc)
Restriction specification : Specifies the restriction on the element, if any
Example :
<xsd:simpleType name="MarksType">
  <xsd:restriction base="xsd:float">
    <xsd:minInclusive value="0.0"/>
    <xsd:maxInclusive value="100.0"/>
  </xsd:restriction>
</xsd:simpleType>
– Defines the reusable element type MarksType
– An element declared as MarksType may take a minimum value of 0.0 and a maximum value of 100.0
– <xsd:element name="Marks" type="MarksType"/>

Element Declarations: Reusable Complex Type
Syntax
<xsd:complexType name="Type_name">
– Defines the reusable type Type_name
Example
<xsd:complexType name="SubjectType">
  <xsd:sequence>
    <xsd:element name="Name" type="xsd:string"/>
    <xsd:element name="Marks" type="xsd:int"/>
    <xsd:element name="Grade" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>
– Defines the reusable complex element type SubjectType
– Comprises the following elements in the sequence specified (<xsd:sequence> tag)
• Name
• Marks
• Grade
This type can be used to define elements in your XML
<xsd:element name="Subject" type="SubjectType"/>
Defining the Attributes
Syntax : <xsd:attribute name="Attr_Name" type="Attr_Type"/>
– Example
<xsd:attribute name="Project" type="xsd:string"/>

– All attributes are declared as simple types.

– Only complex elements can have attributes
– Example
<xsd:complexType name="EmpNo">
  <xsd:sequence>
    <!-- Module elements -->
  </xsd:sequence>
  <xsd:attribute name="Project" type="xsd:string"/>
</xsd:complexType>
• Defines the attribute Project of type string

– Attribute Project being used

<res:Name>Kiran</res:Name>
<res:EmpNo Project="Training">45609</res:EmpNo>
<res:Subject>CHSSC</res:Subject>

Anatomy of XML Schema : Constraints specification
Controls the occurrence of an individual element or a group of elements
Types of constraints
• <choice> : allows only one of the listed elements to appear
• <sequence> : elements must appear in the same order as they are declared
• <all> : elements can occur in any order, each at most once
<choice> constraint
– E.g.:
<xsd:choice>
  <xsd:element name="first"/>
  <xsd:element name="last"/>
</xsd:choice>
• Allows either first or last name to be used in the instance XML Document
<sequence> constraints
– E.g.:
<xsd:sequence>
  <xsd:element name="Name" type="xsd:string"/>
  <xsd:element name="EmpNo" type="xsd:int"/>
  <xsd:element name="Subject" type="SubjectType" maxOccurs="5"/>
</xsd:sequence>
• All elements must appear in the defined order only
Anatomy of XML Schema : Constraints specification
<all> constraints
– E.g. :
<xsd:all>
  <xsd:element name="invoice"/>
  <xsd:element name="purchaseOrder"/>
  <xsd:element name="mailingLabel"/>
</xsd:all>
• Each element appears at most once; it may be omitted only if it is declared with minOccurs="0"
• Elements may appear in any order

XML Parsers
XML Parser : The Big Picture

Why use a parser?

– Applications typically use a pre-built XML parser (e.g. JAXP, Apache Xerces etc)

– This enables you to build your application much more quickly

[Figure: the XML parser reads the XML document together with its DTD / Schema and makes the parsed data available to the client application through XML APIs]
Fig. 1 : Usage of the XML Parser

Need for Parser

Defining the Parser’s Responsibilities

– Ensure that the document adheres to specific standards


• Does the document match the DTD or Schema?

• Is the document well-formed?

– Make the document contents available to your application


• The parser will parse the XML document, and make this data available to your
application

• An application using parser can access data in XML by going through the hierarchy
or using tag names

Types of XML Parsers

Validating Parser
– a parser that verifies that the XML document adheres to the DTD or Schema

Non-Validating Parser
– a parser that does not verify the XML document against the DTD or Schema

Most parsers provide an option to turn validation on or off

All parsers check the well-formedness of the XML document at all times
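
The difference between the two modes can be sketched with a single JAXP factory flag. The sample documents, the class name WellFormednessSketch and the method parses are assumptions made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class WellFormednessSketch {
    static final String DTD = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>";

    // Well-formed, but <extra/> is not allowed by the DTD.
    static final String INVALID = DTD + "<note><extra/></note>";

    // Not well-formed: <extra> is never closed.
    static final String MALFORMED = DTD + "<note><extra></note>";

    /** Returns true if the document is accepted with validation switched on or off. */
    static boolean parses(String xml, boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Strict handler: treat validity errors and fatal errors alike as rejection.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("invalid,   non-validating: " + parses(INVALID, false));
        System.out.println("invalid,   validating:     " + parses(INVALID, true));
        System.out.println("malformed, non-validating: " + parses(MALFORMED, false));
    }
}
```

Only the validating run rejects the invalid document, while the malformed one is rejected in both modes.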

XML Parser Interfaces

Two types of Interfaces provided by XML Parsers


– SAX : An event-based interface

– DOM : A tree-based interface

JAXP
– "Java API for XML Processing"

– JAXP is part of the JDK

– Provides parsers which can be used in any Java application

It supports both
– Tree Based Parser : DOM

– Event Based Parser : SAX

DOM Parser

Tree Based Parser


– Definition: Parser reads the XML document, and creates an in-memory “tree”
representation of XML Document

– For example: Given a sample XML document below

– What kind of tree would be produced?

<Result>
<Name>Kiran</Name>
<EmpNo>45609</EmpNo>
<Subject>
<Name>CHSSC</Name>
<Marks>80</Marks>
<Grade>A</Grade>
</Subject>
</Result>

DOM Parser

In memory tree created by Tree Based Parser


– Tree represents the hierarchy of XML document

Result                 (element node)
├─ Name                (element node)
│  └─ Kiran            (text node)
└─ EmpNo               (element node)
   └─ 45609            (text node)

DOM Parser

Tree-based APIs present an in-memory model of the entire document to an application once parsing has concluded

No need to use extra data structures to maintain the information during parsing

An application can navigate through the tree to find the desired pieces of the document

Document Object Model (DOM) is the standard for tree-based parsing of XML documents

Document Object Model (DOM)

The Document Object Model (DOM) is a set of interfaces defined by the W3C
DOM Working Group

DOM is the tree based interface used by the programmers to manipulate the
XML document

DOM Parser can be Validating or Non Validating

DOM Parser represents the logical Model of the XML document in the memory

All entity references are expanded before the DOM tree is constructed

DOM Structure representing XML

[Figure: generic XML document structure. The Document root has a document element node, whose descendants are Element, Attribute, Text and Comment nodes]

Document structure representing Result.xml

Document
└─ Result
   ├─ Name
   │  └─ Kiran
   ├─ EmpNo
   │  └─ 45609
   └─ Subject
      ├─ Name
      │  └─ CHSSC
      ├─ Marks
      │  └─ 80.0
      └─ Grade
         └─ A

Document Object Model (DOM) : Overview

The root of the DOM hierarchy is called the Document node; its single element child is the document (root) element

– Example : Result is the document element

The child nodes of the document element are Element nodes, Comment nodes etc
– Example : Name, Subject, EmpNo, etc are all child nodes of Result

All the nodes in the XML Document are derived from the interface
org.w3c.dom.Node

The Big picture : Parsing the XML Document
The document builder factory creates an instance of the parser with the required characteristics
– Whether the parser should be a validating parser or not
– Whether namespace support is required or not, and whether to ignore the white space between elements or not

The factory hides the implementation details of the parser and gives a standard DOM interface for
parsing XML
– (Analogous to a JDBC driver)

Java Application using DOM Parser (JAXP)

DomApp.java : Parsing XML Document using DOM Parser
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// DomApp.java
// MyErrorHandler is the application's own org.xml.sax.ErrorHandler implementation (not shown here)
public class DomApp {
    public static void main(String[] argv) {
        MyErrorHandler hErr;
        Document hDocument;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        factory.setNamespaceAware(true);
        try {
            hErr = new MyErrorHandler();
            DocumentBuilder hBuilder = factory.newDocumentBuilder();
            // Set the error handler
            hBuilder.setErrorHandler(hErr);
            hDocument = hBuilder.parse(new File("Result.xml"));
        }
        catch (Exception e) {
            // Handle exception if generated during parsing
            e.printStackTrace();
        }
    } // End of function main
}

Parsing the XML Document using DOM Parser
Step 1: Get the instance of the document-builder factory.
This will be used to produce the DOM parser (called DocumentBuilder)
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
Step 2: Set the properties of the DOM parser to be produced
a. It should validate the XML Document against the Schema / DTD
b. It should be namespace aware
factory.setValidating(true);
factory.setNamespaceAware(true);
Note: in JAXP, setValidating(true) on its own requests DTD validation; validating against an XML Schema requires further factory configuration (for example DocumentBuilderFactory.setSchema)
Step 3: Obtain the instance of the MyErrorHandler class
This instance handles the errors generated during parsing, in an application-specific way
hErr = new MyErrorHandler();
Step 4: Obtain the instance of the DOM parser, and register the error handler
This will be used to parse the XML Document and create the in-memory tree
representation of the XML Document
DocumentBuilder hBuilder = factory.newDocumentBuilder();
hBuilder.setErrorHandler(hErr);
Step 5: Parse the XML Document (Result.xml) using the parser created above
hDocument = hBuilder.parse(new File("Result.xml"));

DOM : Exploring the org.w3c.dom.Node Interface
The Node interface is the root of the DOM Core interface hierarchy

This interface can be used to extract information from any DOM object without
knowing the actual type of the underlying node (e.g. Element node, Text node, Attr node etc)

i.e. It is possible to access a document's complete structure and content using
only the methods and properties exposed by the Node interface

The hierarchy rooted at org.w3c.dom.Node

Node
├─ Element
├─ Document
├─ Entity
├─ Attr
├─ Text
└─ Comment

DOM : Important Methods of Node interface

Methods to retrieve the various information from the XML DOM Tree
• Node getFirstChild() : Returns the first child of the current node

• Node getLastChild() : Returns the last child of the current node

• String getNodeName() : The name of this node

• String getNodeValue() : The value of this node, depending on its type

• short getNodeType() : A code representing the type of the underlying object

Methods to alter the elements of XML DOM Tree


• Node insertBefore (Node newChild, Node refChild)

• Node appendChild (Node newChild)

• Node removeChild (Node oldChild)

• Node replaceChild (Node newChild, Node oldChild )
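
The four tree-altering methods can be tried on a small document built in memory. This is a minimal sketch: the class name NodeEditSketch, the helper childNames and the element name EmployeeId are hypothetical, and getNextSibling() is a further Node method not listed above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeEditSketch {
    /** Lists the names of the children of a node, separated by commas. */
    static String childNames(Node parent) {
        StringBuilder names = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    /** Builds a Result element, edits its children and returns the final child list. */
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element result = doc.createElement("Result");
            doc.appendChild(result);

            Element empNo = doc.createElement("EmpNo");
            result.appendChild(empNo);                    // children: EmpNo

            Element name = doc.createElement("Name");
            result.insertBefore(name, empNo);             // children: Name, EmpNo

            Element subject = doc.createElement("Subject");
            result.appendChild(subject);                  // children: Name, EmpNo, Subject

            Element employeeId = doc.createElement("EmployeeId"); // hypothetical replacement
            result.replaceChild(employeeId, empNo);       // children: Name, EmployeeId, Subject

            result.removeChild(subject);                  // children: Name, EmployeeId
            return childNames(result);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note the argument order of replaceChild: the new child comes first, the child being replaced second.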

Using Node Interface

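Node hNode = hDocument.getDocumentElement();       // hNode refers to Result
Node hFirstChild = hNode.getFirstChild();          // hFirstChild refers to Name
String sName = hFirstChild.getNodeName();          // sName = "Name"
Node hLastChild = hNode.getLastChild();            // hLastChild refers to Subject
hFirstChild = hFirstChild.getFirstChild();         // hFirstChild now refers to the text node "Kiran"
String sVal = hFirstChild.getNodeValue();          // sVal = "Kiran"

[Figure: DOM tree of Result with children Name ("Kiran"), EmpNo ("45609") and Subject (first child Name), annotated with hNode, hFirstChild and hLastChild]

Note: this walk assumes that the white space between elements has not been kept as text nodes; otherwise getFirstChild() on Result returns a white-space text node.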
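
The same walk can be run against an in-memory copy of the Result document. In this sketch the class name NodeNavigationSketch, its helper methods and the two sample strings are assumptions; the second string is indented on purpose to show how white space between elements changes the first child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeNavigationSketch {
    // Result document written without white space between the elements.
    static final String COMPACT =
        "<Result><Name>Kiran</Name><EmpNo>45609</EmpNo>"
      + "<Subject><Name>CHSSC</Name><Marks>80</Marks><Grade>A</Grade></Subject></Result>";

    // The same kind of document, indented: the line breaks become text nodes.
    static final String INDENTED = "<Result>\n  <Name>Kiran</Name>\n</Result>";

    static Node root(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String firstChildName(String xml) {
        return root(xml).getFirstChild().getNodeName();
    }

    static String lastChildName(String xml) {
        return root(xml).getLastChild().getNodeName();
    }

    /** Value of the first child of the root's first child (a text node in COMPACT). */
    static String firstChildText(String xml) {
        return root(xml).getFirstChild().getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        System.out.println(firstChildName(COMPACT));   // name of the first child of Result
        System.out.println(firstChildText(COMPACT));   // text inside that first child
        System.out.println(lastChildName(COMPACT));    // name of the last child of Result
        System.out.println(firstChildName(INDENTED));  // first child when white space is kept
    }
}
```

For the indented document the first child is a text node, so code that walks children usually checks getNodeType() before treating a node as an element.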

XML Parser Interfaces : Event Based Interface

Event Based Interface


– Definition : Parser reads the XML document and generates events for each parsing
step

– Some common parsing events


• Element start-tag read

• Element content read

• Element end- tag read

– Example
<Result>
  <Name>Kiran</Name>
  <EmpNo>45609</EmpNo>
  <Subject>
    <Name>CHSSC</Name>
    <Marks>80</Marks>
    <Grade>A</Grade>
  </Subject>
</Result>
XML Parser Interfaces : Event Generated
– startElement : Result

– startElement : Name

– contents : Kiran

– endElement : Name

– startElement : EmpNo

– contents : 45609

– endElement : EmpNo

– (Subject and its child elements generate events in the same way)

– endElement : Result
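
An event trace of this kind can be produced with a small SAX handler. The sketch below is illustrative only: the class name SaxEventTraceSketch and the method events are assumptions, and it relies on the parser delivering each short text run in a single characters() call, which SAX does not guarantee in general.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventTraceSketch {
    /** Parses the document and returns one log line per parsing event. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("startElement : " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {          // skip white space between elements
                    log.add("contents : " + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("endElement : " + qName);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<Result><Name>Kiran</Name><EmpNo>45609</EmpNo></Result>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

The handler receives each event as soon as the parser reaches it, before the rest of the document has been read.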

XML Parser Interfaces : Event Based Interface

For each of these events, your application implements “event handlers”

Each time an event occurs, the corresponding event handler is called

Your application intercepts these events, and handles them in any way you
want

Application does not wait till the entire document gets parsed

Application has to maintain the information from XML document within local
data-structures till it is processed completely

Simple API for XML (SAX) is the standard for Event Based parsing of XML
document

SAXApp.java : Parsing XML Document using SAX Parser
// SAXApp.java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXApp {
    public static void main(String[] argv) {
        // Get the instance of the parser event handling class
        DefaultHandler handler = new Handler();
        // Get the instance of SAXParserFactory
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Set the properties of the parser to be obtained
            factory.setValidating(true);
            factory.setNamespaceAware(true);
            // Get the new SAX Parser
            SAXParser saxParser = factory.newSAXParser();
            // Parse the file
            // handler : processes events generated during parsing
            saxParser.parse(new File("Result.xml"), handler);
        }
        // Handle any exceptions if generated during parsing
        catch (Throwable t) {
            t.printStackTrace();
        }
    } // End of function main
}
SAXApp.java : Parsing XML Document using SAX Parser
// Event handling class used by SAXApp (same file, so it shares the imports of SAXApp.java)
class Handler extends DefaultHandler {

    // Process any recoverable errors (e.g. validity errors) in the XML document
    public void error(SAXParseException e) throws SAXException {
        System.out.println("Error At Line:" + e.getLineNumber());
        System.out.print("Column: " + e.getColumnNumber());
        // Print the error message
        System.out.print(e.getMessage());
    }

    // Process any fatal errors in the XML document
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println("Fatal Error At Line:" + e.getLineNumber());
        System.out.print("Column: " + e.getColumnNumber());
        // Print the error message
        System.out.print(e.getMessage());
    }
} // End of class Handler

Understanding The Simple API for XML (SAX)
Step 1: Get the instance of SAXParserFactory
This instance is used to obtain the SAX Parser
SAXParserFactory factory = SAXParserFactory.newInstance();
Step 2:Get the instance of the event handler class
This class handles all the events generated by parser
DefaultHandler handler = new Handler();
Step 3:Set the properties of the parser to be obtained
a. It should validate the XML Document against the Schema / DTD
b. It should be namespace aware
factory.setValidating(true);
factory.setNamespaceAware(true);
Step 4 : Obtain the instance of the SAX Parser using the factory just obtained
SAXParser saxParser = factory.newSAXParser();
Step 5: Parse the Result.xml file using the SAX Parser obtained as above
Events generated during parsing will be handled by object handler
saxParser.parse(new File("Result.xml"), handler);

The Big picture : Parsing the XML Document using SAX

[Figure: the SAX parser factory creates the SAX parser; the parser reads the XML document and sends events to DefaultHandler / MyHandler, which implements the org.xml.sax interfaces ContentHandler, ErrorHandler and EntityResolver]

org.xml.sax Interfaces

org.xml.sax.helpers.DefaultHandler Class
– Provides a default implementation of all the event callbacks

– DefaultHandler implements the ContentHandler, ErrorHandler, DTDHandler, and
EntityResolver interfaces (with empty, do-nothing methods).

– Only the methods which are required are overridden

org.xml.sax.ContentHandler Interface
– Receives notification of the logical content of a document

– Defines methods like startDocument(), endDocument(), startElement(), and
endElement()

– These are invoked when the corresponding XML tags are recognized

– Also defines the method characters(), which is invoked when the parser encounters
the text in an XML element
org.xml.sax Interfaces

org.xml.sax.ErrorHandler Interface
– Allows SAX application to do customized error handling

– The parser will then report all errors and warnings through this interface

– Important Methods
• void error() : receives the notification of recoverable error

• void fatalError() : receives the notification of non-recoverable error

• void warning() : receives the notification of a warning
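
Which of the three callbacks fires for which kind of problem can be sketched as follows. The sample documents, the class name SaxErrorKindsSketch and the method classify are assumptions for illustration; the callbacks themselves are the ErrorHandler methods listed above, overridden on DefaultHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SaxErrorKindsSketch {
    static final String DTD = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>";
    static final String VALID = DTD + "<note>hello</note>";
    static final String INVALID = DTD + "<note><extra/></note>";    // breaks the DTD
    static final String MALFORMED = DTD + "<note><extra></note>";   // not well-formed

    /** Returns "ok", "warning", "error" or "fatalError": the most severe callback seen. */
    static String classify(String xml) {
        final String[] kind = {"ok"};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void warning(SAXParseException e) {
                if (kind[0].equals("ok")) {
                    kind[0] = "warning";
                }
            }

            @Override
            public void error(SAXParseException e) {
                kind[0] = "error";       // recoverable: parsing continues
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                kind[0] = "fatalError";  // non-recoverable: stop parsing
                throw e;
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Parsing was aborted by fatalError(); kind[0] already records it.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return kind[0];
    }

    public static void main(String[] args) {
        System.out.println("valid:     " + classify(VALID));
        System.out.println("invalid:   " + classify(INVALID));
        System.out.println("malformed: " + classify(MALFORMED));
    }
}
```

A validity problem reaches error() and parsing goes on, whereas a well-formedness problem reaches fatalError().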

Evaluating Parsers : SAX vs. DOM
SAX
– Advantage
• It is good when serial processing of the document is required and the document is very large

• i.e. when the size of the XML document is measured in GBs.

– Disadvantage
• The application needs its own data structures to hold parts of the XML document until processing is
complete, which makes SAX less convenient for small XML documents.

DOM
– Advantage
• Supports DOM tree traversal methods

• Allows modification of the XML Document

• Good when random access to a document is required

– Disadvantage
• For large XML documents (size in GBs), it requires much more memory than parsing the same
document with a SAX parser.
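
The trade-off can be made concrete with a SAX handler that adds up all Marks values while keeping only a running total and a small text buffer, so no tree is built. This is a sketch under assumptions: the class name SaxMarksTotalSketch, the method totalMarks and the sample marks are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxMarksTotalSketch {
    /** Sums the content of every Marks element in a single pass over the document. */
    static double totalMarks(String xml) {
        final double[] total = {0.0};
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inMarks;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("Marks".equals(qName)) {
                    inMarks = true;
                    text.setLength(0);      // start collecting a new value
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inMarks) {
                    text.append(ch, start, length); // text may arrive in several chunks
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("Marks".equals(qName)) {
                    total[0] += Double.parseDouble(text.toString().trim());
                    inMarks = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return total[0];
    }

    public static void main(String[] args) {
        String xml = "<Result><Subject><Marks>80.5</Marks></Subject>"
                   + "<Subject><Marks>45.5</Marks></Subject></Result>";
        System.out.println("Total marks: " + totalMarks(xml));
    }
}
```

The inMarks flag and the buffer are exactly the kind of local state a SAX application has to manage itself; a DOM version would simply look the Marks elements up in the tree, at the cost of holding the whole document in memory.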

