
6/9/2021 How To Parallelly Process Large XML Files In Java | by Anup Jawanjal | Clairvoyant Blog

How To Parallelly Process Large XML Files In Java
An efficient approach to parse, change and write large XML files parallelly in Java
using StAX parser and concurrent programming

Anup Jawanjal


Nov 30, 2020 · 5 min read

An XML File

Dealing with XML files is always challenging. Many data formats exist, but XML still
leads the list. We at Clairvoyant understand that working with it in a time- and
memory-efficient fashion is not easy. This blog is our effort at documenting our
learning around this topic.

The Problem
https://blog.clairvoyantsoft.com/how-to-parallelly-process-large-xml-files-in-java-9ff3a2d32e90 1/8

We were dealing with hundreds of XML files delivered as a zip archive. Each file
contained thousands of records, and the unzipped folder could be up to 1 GB in size.
The job can be broken down as follows:

Read the zip file and unzip it into a folder for processing

Go over each file and parse it

Identify data, validate using external service, change if required, and write the result
file to a different folder

Zip the result folder and copy it to the destination location
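
The first step of the job above could be sketched as follows. This is a minimal illustration using only java.util.zip from the JDK; the class name ZipExtractor and the method unzip are hypothetical and not taken from the original project. Each entry is streamed straight to disk, so the archive never has to fit in memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original project.
final class ZipExtractor {

    /** Extracts every regular file of the archive into targetDir and returns the extracted paths. */
    static List<Path> unzip(Path zipFile, Path targetDir) {
        List<Path> extracted = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(zipFile))) {
            Files.createDirectories(root);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../evil.xml" that would escape targetDir.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry outside target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    // Copies the current entry only; the zip stream stays open for the next one.
                    Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(target);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extracted;
    }
}
```

The returned list can then be handed to whatever code processes the individual XML files.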

The XML schema was not too nested and contained data that looked something like this:

Sample XML Input File

Also, we stored all the processing information in the database to be able to backtrack
later on. Though more data and processing were involved, for ease of understanding we
can simplify the core problem as follows: for each <Request>, if validation of its <ID>
fails, remove the <ACKNOWLEDGEMENT> element.

The Solution
Tapping into Clairvoyant’s experience of dealing with similar challenges, we arrived at
the following solution:

It was preferable to map each <Request> element to a corresponding Java object
called Request . We mapped the zip file to an InputFile object and each XML file to a
DataFile . The code snippets shown in the following sections are simplified to show the
core problem as mentioned earlier, skipping a few details. This solution can be extended
to similar problems by grouping the XMLEvents generated by the StAX parser in different
ways.

Processing an Individual XML File

To process an individual XML file, we selected the following approach:

Phase-1: Parse the XML file and populate the Java objects, i.e. Request , InputFile ,
DataFile , with data
Phase-2: Validate each Request with the help of an external REST API
Phase-3: Parse the XML for the 2nd time, remove data if required, and write to the new file
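
To make the hand-off between the phases concrete, here is a minimal sketch of the Request object and of Phase-2. The field set is an assumption inferred from the accessors used in the snippets of this post, and the Predicate passed to validateAll is a hypothetical stand-in for the external REST call, which the post does not show.

```java
import java.util.List;
import java.util.function.Predicate;

// Plain data holder for one <Request> element. The fields are assumptions
// inferred from the accessors used elsewhere in this post.
final class Request {
    private String id;
    private String ack;
    private boolean valid;

    String getId() { return id; }
    void setId(String id) { this.id = id; }
    String getAck() { return ack; }
    void setAck(String ack) { this.ack = ack; }
    boolean isValid() { return valid; }
    void setValid(boolean valid) { this.valid = valid; }
}

final class RequestValidation {

    /**
     * Phase-2: marks each request as valid or invalid.
     * idChecker is a hypothetical stand-in for the external REST API.
     */
    static void validateAll(List<Request> requests, Predicate<String> idChecker) {
        for (Request request : requests) {
            // A request without an ID can never be valid.
            request.setValid(request.getId() != null && idChecker.test(request.getId()));
        }
    }
}
```

Phase-3 then only needs to look at isValid() to decide whether an element is kept.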

For parsing and changing XML data, we can use either a DOM-based approach or a
streaming-based approach. The DOM-based approach requires the entire document to be in
memory, which immediately rules it out for our problem. In the streaming approach, XML
infosets are transmitted and parsed serially, resulting in a smaller memory footprint.

Stream parsing comes in two types, push-based and pull-based, which differ in whether
the XML parser sends events to the application or the application code requests events
from the parser. With pull parsing, the client controls the application thread and can
call methods on the parser when needed, which works perfectly for our case. The other
advantages of the pull-based approach, and comparisons between the available parsers,
can be found in the document linked below.

Why StAX?

The StAX project was spearheaded by BEA with support from Sun Microsystems, and the
JSR 173 specification passed the…
docs.oracle.com
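
As a small illustration of the pull model, the sketch below counts elements with the cursor-style XMLStreamReader that ships with the JDK. The class and method names are hypothetical. The point is that the application owns the loop: the parser only advances when next() is called.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original project.
final class PullParsingDemo {

    /** Counts start tags with the given local name by pulling events one at a time. */
    static int countElements(String xml, String localName) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            // The application drives the loop; the parser advances only on next().
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }
}
```

Because the caller decides when to ask for the next event, it can stop early, skip ahead, or interleave parsing with other work.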

For our case, we have used Woodstox’s implementation of StAX.

FasterXML/woodstox
The gold standard Stax XML API implementation. Now at Github. The
most common way is to use Maven (or Ivy) to access it…
github.com

It is fast and easy to use, and it also provides a way to validate our XML document
against a DTD. Here are some code snippets:

1. Setting DTD validator for XML file.

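import javax.xml.stream.XMLResolver;
import org.codehaus.stax2.XMLInputFactory2;

XMLInputFactory2 inputFactory =
        (XMLInputFactory2) XMLInputFactory2.newInstance();
// Serve the DTD from the classpath instead of resolving its system ID externally.
inputFactory.setXMLResolver(
    new XMLResolver() {
        @Override
        public Object resolveEntity(String publicID, String systemID,
                                    String baseURI, String namespace) {
            // systemID may be null, so guard before matching on it.
            if (systemID != null && systemID.contains("test.dtd")) {
                return getClass().getClassLoader()
                        .getResourceAsStream("schema/test.dtd");
            }
            return null; // fall back to the parser's default resolution
        }
    }
);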

Note: The suffix 2 in names such as XMLInputFactory2 refers to Stax2, the extension of
the StAX API provided by Woodstox.

2. Parsing XML document and saving the results in the Request object.


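import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import org.codehaus.stax2.XMLInputFactory2;

List<Request> requestList = new ArrayList<>();
XMLInputFactory2 inputFactory =
        (XMLInputFactory2) XMLInputFactory2.newInstance();
// Reading from an InputStream lets the parser honor the encoding declared
// in the XML prolog (a FileReader would ignore it).
try (InputStream in = new FileInputStream(xmlFile)) {
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    Request request = null;

    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        if (event.isStartElement()) {
            String name = event.asStartElement().getName().getLocalPart();
            if ("REQUEST".equalsIgnoreCase(name)) {
                request = new Request();
            } else if (request != null && "ID".equalsIgnoreCase(name)) {
                // The event right after the start tag carries the element text.
                XMLEvent xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    request.setId(xmlEvent.asCharacters().getData());
                }
            } else if (request != null && "ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                XMLEvent xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    request.setAck(xmlEvent.asCharacters().getData());
                }
            }
        } else if (event.isEndElement()) {
            String name = event.asEndElement().getName().getLocalPart();
            if ("REQUEST".equalsIgnoreCase(name)) {
                // The request is complete once its end tag is reached.
                requestList.add(request);
                request = null;
            }
        }
    }
    eventReader.close();
} catch (IOException | XMLStreamException e) {
    throw new IllegalStateException("Could not parse " + xmlFile, e);
}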

3. Changing data in XML file and writing to the new file.

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// With Woodstox on the classpath, this returns its output factory.
XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();

// Write to newXmlFile itself; newXmlFile.getName() would drop the target folder.
try (InputStream in = new FileInputStream(xmlFile);
     FileWriter fileWriter = new FileWriter(newXmlFile)) {
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    XMLEventWriter writer = outputFactory.createXMLEventWriter(fileWriter);

    String id = null;
    boolean insideAck = false; // true between <ACKNOWLEDGEMENT> and its end tag
    boolean ackNeeded = false; // true if the buffered element must be kept
    List<XMLEvent> ackElementEvents = new ArrayList<>();

    while (eventReader.hasNext()) {
        XMLEvent xmlEvent = eventReader.nextEvent();
        if (xmlEvent.isStartElement()) {
            String name = xmlEvent.asStartElement().getName().getLocalPart();
            if ("ID".equalsIgnoreCase(name)) {
                writer.add(xmlEvent);
                xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    id = xmlEvent.asCharacters().getData();
                }
            } else if ("ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                insideAck = true;
                String finalId = id;
                ackNeeded = requestList.stream()
                        .filter(r -> r.getId() != null
                                && r.getId().equalsIgnoreCase(finalId))
                        .findFirst()
                        .map(Request::isValid)
                        .orElse(false);
            }
        }

        if (!insideAck) {
            writer.add(xmlEvent);
            continue;
        }
        // Buffer every event of the element, including its text, so that
        // nothing leaks into the output when the element is dropped.
        ackElementEvents.add(xmlEvent);
        if (xmlEvent.isEndElement() && "ACKNOWLEDGEMENT".equalsIgnoreCase(
                xmlEvent.asEndElement().getName().getLocalPart())) {
            if (ackNeeded) {
                for (XMLEvent event : ackElementEvents) {
                    writer.add(event);
                }
            }
            ackElementEvents.clear();
            insideAck = false;
            ackNeeded = false;
        }
    }
    writer.close(); // flushes buffered output before the FileWriter is closed
    eventReader.close();
}


Here we have created a separate list, ackElementEvents , to buffer the XML events that
we conditionally want to add to the result XML file.
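
To try the buffering idea in isolation, the following self-contained sketch runs the same kind of filter on in-memory strings with the StAX implementation bundled in the JDK. It is an illustration under assumptions, not the project's code: the class AckFilter, the set of valid IDs (standing in for the validated Request list), and the exact element names are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of the original project.
final class AckFilter {

    /** Copies xml, dropping <ACKNOWLEDGEMENT> for every request whose ID is not in validIds. */
    static String filter(String xml, Set<String> validIds) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Deliver adjacent text as a single Characters event.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        StringWriter out = new StringWriter();
        try {
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            List<XMLEvent> buffered = new ArrayList<>();
            String id = null;
            boolean insideAck = false;

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if ("ID".equals(name)) {
                        writer.add(event);
                        event = reader.nextEvent();
                        if (event.isCharacters()) {
                            id = event.asCharacters().getData().trim();
                        }
                    } else if ("ACKNOWLEDGEMENT".equals(name)) {
                        insideAck = true;
                    }
                }
                if (!insideAck) {
                    writer.add(event); // everything outside the element is copied as-is
                    continue;
                }
                buffered.add(event); // hold back until the end tag decides the outcome
                if (event.isEndElement() && "ACKNOWLEDGEMENT".equals(
                        event.asEndElement().getName().getLocalPart())) {
                    if (id != null && validIds.contains(id)) {
                        for (XMLEvent kept : buffered) {
                            writer.add(kept);
                        }
                    }
                    buffered.clear();
                    insideAck = false;
                }
            }
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not filter XML", e);
        }
        return out.toString();
    }
}
```

Only the events of one element are ever held in the buffer, so memory use stays independent of the file size.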

Processing All Files in Parallel

For parallel execution, we used the ExecutorService provided by Java, with a
configurable number of threads. We iterated over all the XML files in a single zip file
and created a Callable task for processing each one. For synchronization, we used a
CountDownLatch . A sample implementation is provided in the code snippet below:

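// listFiles() returns null if dataFileFolder is not a readable directory.
File[] allXmlFiles = requireNonNull(dataFileFolder.listFiles());
List<Future<DataFile>> dataFileFutures = new ArrayList<>();
// One count per file; each XMLTask counts down when it finishes.
CountDownLatch latch = new CountDownLatch(allXmlFiles.length);
for (File xmlFile : allXmlFiles) {
    XMLTask parseXMLTask = new XMLTask(xmlFile, latch);
    Future<DataFile> xmlFileProcessingDetailsFuture =
            executorService.submit(parseXMLTask);
    dataFileFutures.add(xmlFileProcessingDetailsFuture);
}
latch.await(); // blocks until every task has called countDown()
processDataFiles(dataFileFutures);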

Here, XMLTask is the unit of work responsible for processing an individual XML file.
Once its processing is done, XMLTask calls latch.countDown() , ideally in a finally
block so that a failed task cannot leave the latch waiting forever. With latch.await()
we wait until all the files are processed; dataFileFutures then holds all the
processing information, which can be saved to the DB.
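
The post does not show XMLTask itself, so the following is only a sketch of how such a task might look. DataFile is reduced to a hypothetical stand-in that records the file name, the actual parsing is omitted, and ParallelRunner is an invented wrapper that makes the example runnable on its own.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the post's DataFile; only the file name is recorded.
final class DataFile {
    final String fileName;
    DataFile(String fileName) { this.fileName = fileName; }
}

final class XMLTask implements Callable<DataFile> {
    private final File xmlFile;
    private final CountDownLatch latch;

    XMLTask(File xmlFile, CountDownLatch latch) {
        this.xmlFile = xmlFile;
        this.latch = latch;
    }

    @Override
    public DataFile call() {
        try {
            // Parse, validate and rewrite xmlFile here (omitted in this sketch).
            return new DataFile(xmlFile.getName());
        } finally {
            // Runs even when processing throws, so latch.await() cannot hang.
            latch.countDown();
        }
    }
}

// Hypothetical wrapper that wires the pieces together.
final class ParallelRunner {

    /** Processes all files on a fixed-size pool and returns one result per file, in order. */
    static List<DataFile> processAll(List<File> files, int threads) {
        ExecutorService executorService = Executors.newFixedThreadPool(threads);
        try {
            CountDownLatch latch = new CountDownLatch(files.size());
            List<Future<DataFile>> futures = new ArrayList<>();
            for (File file : files) {
                futures.add(executorService.submit(new XMLTask(file, latch)));
            }
            latch.await();
            List<DataFile> results = new ArrayList<>();
            for (Future<DataFile> future : futures) {
                results.add(future.get()); // every task has already counted down
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            executorService.shutdown();
        }
    }
}
```

Since Future.get() also blocks until a task completes, the latch is not strictly required in this sketch; it is kept to mirror the approach described above.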

Summary
The key takeaway here is the use of the StAX parser for manipulating XML. Also, the
approach described in this document can be extended to many similar problems.
