How To Parallelly Process Large XML Files in Java - by Anup Jawanjal - Clairvoyant Blog
An XML File
Dealing with XML files is always challenging. Many file formats exist, but
XML still leads the list. We at Clairvoyant understand that working with it in a
time- and memory-efficient fashion is not easy. This blog post documents what we
have learned on the topic.
The Problem
https://blog.clairvoyantsoft.com/how-to-parallelly-process-large-xml-files-in-java-9ff3a2d32e90 1/8
6/9/2021 How To Parallelly Process Large XML Files In Java | by Anup Jawanjal | Clairvoyant Blog
We were dealing with hundreds of XML files delivered as a zip archive. Each file
contained thousands of records, and after unzipping, the folder size was up to a GB.
The job can be broken down as below:
Read the zip file and unzip it into a folder for processing
Identify data, validate using external service, change if required, and write the result
file to a different folder
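The first step of the job, unzipping the archive into a working folder, could be sketched as follows. This is a minimal illustration using only the JDK's java.util.zip classes; the class name ZipExtractor and the method unzip are hypothetical and do not come from the original project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original project.
class ZipExtractor {

    /** Extracts every regular file in zipFile into targetDir and returns the extracted paths. */
    static List<Path> unzip(Path zipFile, Path targetDir) throws IOException {
        List<Path> extracted = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (InputStream in = Files.newInputStream(zipFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../../x" that would land outside the target folder.
                if (!out.startsWith(root)) {
                    throw new IOException("Zip entry escapes target folder: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies only the current entry; the zip stream stays open for the next one.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(out);
                }
            }
        }
        return extracted;
    }
}
```

The returned list of paths is a convenient input for whatever schedules the per-file processing later on.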
The XML schema was not too nested and contained data that looked something like this:
Also, we stored all the processing information in the database so that we could
backtrack later on. Although more data and processing were involved, for ease of
understanding we can simplify the core problem as follows: for each <Request>, if
<ID> validation fails, remove the <ACKNOWLEDGEMENT> element.
The solution
Tapping into Clairvoyant’s experience of dealing with similar challenges, we arrived at
the following solution:
It was preferable to map each <Request> element to a corresponding Java object
called Request. We mapped the zip file to an InputFile object and each XML file to a DataFile.
The code snippets shown in the following sections are simplified to show the core
problem as mentioned earlier, skipping a few details. This solution can be extended to
similar problems by grouping XMLEvents generated by the StAX parser in different ways.
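As a rough illustration of the mapping described above, the three classes might look like the sketch below. Only the class names and the setID/setAck setters appear in the article; every other field, constructor, and accessor here is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** One <Request> element; only the two values used in this article are modelled. */
class Request {
    private String id;
    private String ack;
    private boolean valid = true; // hypothetical flag that Phase 2 could set

    public String getID() { return id; }
    public void setID(String id) { this.id = id; }
    public String getAck() { return ack; }
    public void setAck(String ack) { this.ack = ack; }
    public boolean isValid() { return valid; }
    public void setValid(boolean valid) { this.valid = valid; }
}

/** One XML file from the archive, with the requests parsed out of it (fields are assumptions). */
class DataFile {
    private final String fileName;
    private final List<Request> requests = new ArrayList<>();

    DataFile(String fileName) { this.fileName = fileName; }
    String getFileName() { return fileName; }
    List<Request> getRequests() { return requests; }
}

/** The zip archive, holding one DataFile per XML file (fields are assumptions). */
class InputFile {
    private final String zipName;
    private final List<DataFile> dataFiles = new ArrayList<>();

    InputFile(String zipName) { this.zipName = zipName; }
    String getZipName() { return zipName; }
    List<DataFile> getDataFiles() { return dataFiles; }
}
```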
To process an individual XML file, we chose the following approach:
Phase-1: Parse the XML file and populate the Java objects, i.e. Request, InputFile, and DataFile, with
data
Phase-2: Validate each Request with the help of an external REST API
Phase-3: Parse the XML a second time, remove data if required, and write to the new file
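Phase-2 sits between the two parsing passes and only needs to produce the set of IDs that failed validation. The sketch below assumes the external REST call can be hidden behind a java.util.function.Predicate; the class name Phase2Validation, the method findInvalidIds, and the stub validator are all hypothetical, and no real service is contacted.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper: the real project validates against an external REST API.
class Phase2Validation {

    /** Returns the IDs rejected by the validator, in the order they were encountered. */
    static Set<String> findInvalidIds(List<String> ids, Predicate<String> isValidId) {
        Set<String> invalid = new LinkedHashSet<>();
        for (String id : ids) {
            if (!isValidId.test(id)) {
                invalid.add(id);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Stub standing in for the REST call: only IDs starting with "A" count as valid.
        Predicate<String> stubValidator = id -> id != null && id.startsWith("A");
        System.out.println(findInvalidIds(Arrays.asList("A1", "B2", "A3"), stubValidator)); // prints [B2]
    }
}
```

Keeping the validator behind an interface like this also makes the second parsing pass easy to test without network access.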
For parsing and changing XML data, we can use either a DOM-based approach or a
streaming-based approach. The DOM-based approach requires the entire document to be in
memory, which immediately rules it out for our problem. In the streaming
approach, XML infosets are transmitted and parsed serially, resulting in a smaller memory
footprint.
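To make the streaming idea concrete, here is a small sketch that counts elements while holding only the current parse event in memory. It uses the standard javax.xml.stream API that ships with the JDK; the class name StreamingCount and the sample document (including the <Requests> root element) are assumptions made for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the original project.
class StreamingCount {

    /** Counts start tags with the given local name without building a document tree. */
    static int countElements(Reader xml, String localName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances the cursor by one event; earlier events are not retained.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Requests><Request><ID>1</ID></Request><Request><ID>2</ID></Request></Requests>";
        System.out.println(countElements(new StringReader(xml), "Request")); // prints 2
    }
}
```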
Stream parsing comes in two types, push-based and pull-based, which differ in
whether the XML parser sends events to the application or the application code requests
events from the parser. With pull parsing, the client controls the application thread and
can call methods on the parser when needed, which works well for our case. The other
advantages of the pull-based approach, and comparisons between the available parsers,
can be found in the documents linked below.
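The difference between the two styles can be seen side by side in the sketch below, which uses the JDK's built-in SAX (push) and StAX (pull) APIs. The class and method names are hypothetical. In the push version the parser calls the handler for every element, whereas the pull version drives the loop itself and simply returns once it has found the first <ID>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original project.
class PushVsPull {

    /** Push (SAX): the parser owns the loop and calls back into the handler for each event. */
    static List<String> elementNamesWithPush(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    /** Pull (StAX): the application owns the loop and can stop as soon as it has what it needs. */
    static String firstIdWithPull(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "ID".equals(reader.getLocalName())) {
                    return reader.getElementText(); // the rest of the document is never read
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```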
Why StAX?
The StAX project was spearheaded by BEA with support from Sun Microsystems, and the
JSR 173 specification passed the…
docs.oracle.com
FasterXML/woodstox
The gold standard Stax XML API implementation. Now at Github. The
most common way is to use Maven (or Ivy) to access it…
github.com
Woodstox is fast and easy to use, and it also provides a way to validate an XML document
against a DTD. Here are some code snippets:
Note: The 2 suffix on class names used in the code, e.g. XMLInputFactory2, refers to
StAX2, the extended StAX API provided by Woodstox.
2. Parsing XML document and saving the results in the Request object.
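// Imports used by this fragment (Woodstox with its StAX2 API must be on the classpath):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import org.codehaus.stax2.XMLInputFactory2;

// xmlFile and requestList (a List<Request>) are assumed to be declared by the enclosing code.
try (InputStream in = new FileInputStream(xmlFile)) {
    XMLInputFactory2 inputFactory = (XMLInputFactory2) XMLInputFactory2.newInstance();
    // Passing bytes rather than a FileReader lets the parser honour the encoding
    // declared in the XML file instead of the platform default charset.
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    Request request = null;
    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        if (event.isStartElement()) {
            String name = event.asStartElement().getName().getLocalPart();
            if ("REQUEST".equalsIgnoreCase(name)) {
                // A new record starts.
                request = new Request();
            } else if (request != null && "ID".equalsIgnoreCase(name)) {
                // The text of <ID> is expected to be the very next event.
                XMLEvent xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    request.setID(xmlEvent.asCharacters().getData());
                }
            } else if (request != null && "ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                XMLEvent xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    request.setAck(xmlEvent.asCharacters().getData());
                }
            }
        } else if (event.isEndElement()) {
            String name = event.asEndElement().getName().getLocalPart();
            if ("REQUEST".equalsIgnoreCase(name)) {
                // The record is complete; keep it for Phase-2 validation.
                requestList.add(request);
                request = null;
            }
        }
    }
    eventReader.close();
} catch (IOException | XMLStreamException e) {
    // Placeholder: the original snippet does not show its error handling.
    throw new IllegalStateException("Failed to parse " + xmlFile, e);
}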
Here we have created a separate list, ackElementEvents, to store the XML events that we
conditionally want to add to the result XML file.
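The buffering idea can be sketched as follows. This is not the article's own Phase-3 code; it is a minimal illustration built on the standard javax.xml.stream event API, and the names AckFilter, removeAckForInvalidIds, and invalidIds are hypothetical. Events belonging to <ACKNOWLEDGEMENT> are held back in ackElementEvents and written just before </Request> only when the request's ID passed validation, so in this sketch a retained acknowledgement always ends up as the last child of its <Request>.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class, not the article's original Phase-3 code.
class AckFilter {

    static String removeAckForInvalidIds(String xml, Set<String> invalidIds)
            throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Coalescing guarantees that the text of <ID> arrives as one Characters event.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> ackElementEvents = new ArrayList<>();
        boolean insideAck = false;
        String currentId = null;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if ("Request".equalsIgnoreCase(name)) {
                    currentId = null;
                    ackElementEvents.clear();
                } else if ("ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                    insideAck = true;
                } else if ("ID".equalsIgnoreCase(name) && reader.peek().isCharacters()) {
                    // peek() does not consume the event, so the text is still written below.
                    currentId = reader.peek().asCharacters().getData().trim();
                }
            }
            if (insideAck) {
                // Hold the acknowledgement back until we know whether to keep it.
                ackElementEvents.add(event);
                if (event.isEndElement() && "ACKNOWLEDGEMENT".equalsIgnoreCase(
                        event.asEndElement().getName().getLocalPart())) {
                    insideAck = false;
                }
                continue;
            }
            if (event.isEndElement() && "Request".equalsIgnoreCase(
                    event.asEndElement().getName().getLocalPart())) {
                if (currentId == null || !invalidIds.contains(currentId)) {
                    for (XMLEvent ackEvent : ackElementEvents) {
                        writer.add(ackEvent);
                    }
                }
                ackElementEvents.clear();
            }
            writer.add(event);
        }
        writer.close();
        reader.close();
        return out.toString();
    }
}
```

For real files, the StringReader and StringWriter would be replaced by file streams so that neither the input nor the output is ever held in memory as a whole.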
Here, XMLTask is the main entity responsible for processing an individual XML file. Once all
the processing is done in XMLTask, we call latch.countDown(). With latch.await(), we
wait until all the files are processed, after which dataFilesFutures contains all the
processing information that can be saved to the DB.
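A minimal sketch of this coordination, using java.util.concurrent from the JDK, is shown below. The names XMLTask, latch, and dataFilesFutures follow the article; the enclosing class, the pool size parameter, and the use of a String result in place of the real processing information are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not part of the original project.
class ParallelXmlProcessing {

    /** Stand-in for the article's XMLTask: processes one file and reports a result. */
    static class XMLTask implements Callable<String> {
        private final String fileName;
        private final CountDownLatch latch;

        XMLTask(String fileName, CountDownLatch latch) {
            this.fileName = fileName;
            this.latch = latch;
        }

        @Override
        public String call() {
            try {
                // Phases 1 to 3 for a single XML file would run here.
                return "processed " + fileName;
            } finally {
                // Count down even on failure so that await() cannot block forever.
                latch.countDown();
            }
        }
    }

    static List<String> processAll(List<String> fileNames, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch latch = new CountDownLatch(fileNames.size());
        List<Future<String>> dataFilesFutures = new ArrayList<>();
        try {
            for (String fileName : fileNames) {
                dataFilesFutures.add(pool.submit(new XMLTask(fileName, latch)));
            }
            latch.await(); // blocks until every task has called countDown()
            List<String> results = new ArrayList<>();
            for (Future<String> future : dataFilesFutures) {
                results.add(future.get()); // already complete, so this returns immediately
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the futures are collected in submission order, the results line up with the input file list regardless of which task finishes first.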
Summary
The key takeaway here is the use of the StAX parser for manipulating XML. The
approach described in this document can also be extended to many similar
problems.