
Converting STM Content to XML

Introduction
Content is being read on mobile devices, ebook readers, PDAs, notepads, etc. The final electronic
format of the content can vary depending on the device that is being used to read it. The flexibility
to accommodate any such device is one of the main reasons why many content owners elect to convert
their content first to XML (Extensible Markup Language). XML is a set of rules for encoding content
electronically that is device, application, and product neutral. XML’s design goals emphasize application
independence, simplicity, generality, and usability. The XML specifications are managed by the World
Wide Web Consortium (W3C) and are widely adopted and recognized throughout most of the
information industry. Hundreds of tools and languages have been developed to support XML, and the
format is supported even by Microsoft® and Apple®, industry leaders in office productivity
tools. Besides interoperability, XML produces componentized content for reuse and repurposing. Once
the content is in XML, specific tools (also with their own open standards) can be leveraged to produce
the hot information products of today (epub, mobipocket, XHTML, HTML5) as well as the hot products
of tomorrow.

This whitepaper is based on experience with a large conversion initiative for an STM publisher (a
common way to refer to a publisher of Scientific, Technical or Medical content). The project goal was to
convert 70,000+ technical papers to XML. This paper therefore focuses on the issues and challenges
of working with STM content, and in particular on the following points:

• Selecting and managing a conversion vendor
• STM content characteristics to carefully consider
• Costs and resources associated with a conversion

Where to begin?
Before we jump in, let’s review the essential project management items needed to plan, execute and
successfully complete such a project. Any project, whether a content conversion or a
kitchen remodeling, should follow a structured methodology that addresses:

• Identifying the purpose, goals, objectives, sponsor and stakeholders 1
• Creating and maintaining the project’s activities/plan throughout its life cycle
• Identifying and managing the risks, resources, assumptions, constraints, requirements and
deliverables throughout the life cycle
• Communicating frequently and reporting as needed
• Ensuring completion and closure

1
The sponsor is the project champion and is typically responsible for the financial decisions while the stakeholders
can include a variety of roles including senior management, the customer, functional managers, etc. Stakeholders
can hold the keys to unlock the available resources within the company who can work on the project.

© 2010 Nancy A Clarke Page 1 of 10


Specifically, when designing a project to address electronic publishing, take the following into
consideration:

• What are the actual products to result from this initiative?
• What are the major business goals and objectives of the initiative?
• At a high level, what are the requirements of the initiative?
• What are the major subprojects that will address the IT infrastructure, the business procedures
and policies, and the digital conversion and long term new content strategy?
• Are there additional subprojects needed to address particular business needs (for example,
improving your company’s taxonomy and integrating it with the creation of the digital content)?

A conversion of content from one format to another is generally not the main long term business goal; it
is merely one step or activity within an overall initiative. So before beginning a conversion, step
back to understand the goals, objectives and requirements. Doing so should help ensure success.

Converting to XML
The following is a list of items to take into consideration when doing the planning for converting content
to XML:

• Who will do the conversion? STM publishers typically have A LOT of content to convert. How
much content you decide to convert will depend on what you intend to do with the content
once it’s converted. There are many vendors available to perform content conversions. This
paper makes no recommendations except to follow a standard request for proposal (RFP)
process. Each company has its own unique business requirements for how it wishes to deal
with vendors. By asking questions, gathering information and talking to others who have
completed a conversion, the best decisions can be made. The following list contains information
you’ll need to supply your vendor(s) in order to obtain an adequate estimate:

o What is the target level of accuracy for text, tags, tables, graphics, etc.? 98%, 99.95%,
99.995%, 99.9995%?
o What is the timeframe/turnaround time expected to meet your organization’s business
goals?
o How will complex content such as math equations and tables be captured?
o What is the estimated total number of pages? What is the estimated number of pages for
each source format?
o How will the content be sent to and from the vendor? If hardcopy, is it expected to be
sent back when finished? If yes, is it expected that it will be sent back exactly as
received (e.g. bound, paper clipped or stapled)?
o What is the source material? PDF Normal, PDF Image, Hardcopy, proprietary content
processing tool, some other source…2
o Are the conversion specifications complete or is this a service you need from the
vendor? The project can be more accurately quoted when the upfront analysis and
conversion specifications are complete.
o What result products are expected? XML, PDF, epub, etc.
o If PDF Image is an output, will it include clean or dirty hidden text? In other words, will
the OCR text be reviewed and cleaned by the vendor to correct errors? 3
o Who pays shipping and travel costs (if any)?
o How is the product to be delivered? FTP? DVD?
o What language is the content? All English?

2
Text for the XML file can also be extracted from proprietary formats like Xyvision, FrameMaker, etc. For image
files or hardcopy files the text usually will be processed using OCR technologies and is subject to further manual
review, thereby increasing the cost. In rare cases of particularly challenging source material, actual manual
keyboarding may be needed, also a potential cost factor. Also, if hardcopy, an additional step is needed to scan the
content, although those added costs are typically minor. You'll want to break down the number of pages by type of
document format if there is more than one format.
3
Ideally the conversion specifications should outline the details of your output/deliverable.

In addition to the items above, a sample set should be included with the request. This set
should be a good sampling of the content. If the content is in many formats, include all formats
in the sample. If content is old, include enough samples from each period of time, for example
each decade4. If some documents are more complex than others, provide an estimate for the
varying degrees of complexity. Before making a final selection, ask the vendor to convert your
sample set to XML and verify the results.

4
Older documents that were typeset can be more difficult for an OCR engine to successfully process.

• Managing your Conversion Vendor.

o Encourage queries from your vendor not only regarding the comprehension of your
design specifications but also during the conversion process. If a given anomaly is found
during the processing, encourage them to validate their assumption on how to handle
the item. It’s better to address questions and issues in real time than to go back later to
resolve them, because by then an issue may exist in many documents or may be too
inconsistent to capture and correct easily. The best approach is to share an “issue log”
with your vendor. The log should track all issues/questions from the vendor as well as all
issues/questions you’ve identified as a result of your QA process.
o Schedule routine status meetings with the vendor to address issues and questions, and to
review status and progress. Be wary of slips in deliverables; they can quickly compound,
especially if your vendor is ramping up to include more volume over shorter periods of
time as the vendor becomes proficient with your needs.
o Track costs and request specific details from the vendor so that your calculation of costs
does not become overly burdensome. Some vendors will quote the project calculating
pennies or dimes for a type of graphic and another cost for whether they are capturing
it in color/grayscale or black and white. Validating accuracy of invoices can be rather
time consuming when dealing with costs such as these.

o Trust but Verify. A conversion project typically includes a contract contingency that the
output produced by the vendor is 99.95%, 99.995% or 99.9995% accurate; however, that
is no guarantee. While you are paying for the vendor to perform that level of quality
assurance, you will need to decide how much quality assurance your company will
perform internally to verify that the deliverables are meeting the requirements. The QA
requirements may vary depending on the use of the document (a technical
paper versus, perhaps, the more rigorous quality goals of a technical standard). A few
other things related to quality to address:

 It is important to understand how the vendor measures accuracy. For
example, 99.995% equates to 1 character error for every 20,000 characters.
What if your document does not have 20,000 characters? Typically this is
addressed by performing QA on a batch of documents. Make sure you
understand how the vendor performs this.

 Take the time to understand the vendor’s quality assurance technique and
process. This is important when selecting a vendor.

 The QA process should be verifying that the XML specifications were followed,
and that only an acceptable level of character errors were introduced. If the
vendor is responsible for only submitting the resulting XML and associated files
(graphics) then consider how you will be reviewing the content for accuracy.
You will need to have some tool that enables a QA tester to review the text of
the document without needing to understand XML. Whatever tool you select,
share it with the vendor. That will help ensure that they are seeing the same
results as you.

 Begin your QA process (with all necessary tools) as soon as you begin the XML
conversion. Capturing issues in real time will be less painful than finding a
problem after several files have been converted to XML.
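The accuracy arithmetic discussed under “Trust but Verify” can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the batch sizes are assumptions made for this example, not any vendor's actual QA procedure. It converts an accuracy target into “one error per N characters” and into the number of character errors a QA batch may contain.

```python
from fractions import Fraction


def chars_per_error(accuracy_percent: str) -> Fraction:
    """Return N such that the target allows one character error per N characters."""
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return 1 / error_rate


def allowed_errors(accuracy_percent: str, batch_chars: int) -> int:
    """Whole number of character errors a batch may contain and still meet the target."""
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return int(batch_chars * error_rate)  # truncates, i.e. rounds down


if __name__ == "__main__":
    for target in ("99.95", "99.995", "99.9995"):
        print(target, "-> 1 error per", chars_per_error(target), "characters")
    # A single short paper (assumed here to be 6,000 characters) has an error
    # budget of zero, which is why QA is usually performed on a batch instead.
    print(allowed_errors("99.995", 6_000))
    # A hypothetical batch of 50 such papers (300,000 characters).
    print(allowed_errors("99.995", 300_000))
```

Exact fractions are used instead of floating point so that a target such as 99.995% maps to exactly 1 in 20,000 rather than to a rounded approximation.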

• Normalizing content across an organization. Once the content is electronic, how does it
compare to the content stored in the organization’s business databases? It can be rather
interesting to find how many times, and in how many ways, a title or author name is stored,
and how that information relates to or is referenced by other corporate systems. It is possible
that the metadata on the content and the metadata captured by the business systems are
different. This is an opportunity to capture the data at the source, but that may or may not end
up being the best solution overall.

o Customizing the common (master) DTD/XSD 5. The NLM DTD has been adapted by
several STM implementations and has evolved through 3 major releases. The many
revisions and extensions to the core NLM DTD provide a basis for virtually all content
elements – and then some. It would be a challenge to make the case to not deploy the
NLM DTD framework for typical STM publisher purposes. It’s possible to modify the
DTD/XSD to meet specific business requirements; however, by keeping things simple
and keeping with the standard structure, further customizations and issues with
upgrades can be avoided. The NLM DTD comes with thorough online documentation
including tagging preferences - http://dtd.nlm.nih.gov/publishing/tag-library/n-qk32.html.
Also follow the preferences and industry standards recommended by the
conversion vendor.

5
A document type definition (DTD) or XML Schema Definition (XSD) is a set of declarations that define the
structure and requirements for a given document type. Either format (DTD or XSD) can be used. Its primary
purpose, during a conversion, is to define the document structure and provide a mechanism to validate the XML.

• Creating the Conversion Specifications. Although the NLM tagging preferences and the
conversion vendor are useful resources when defining the conversion specifications, there are
many conversion decisions to be made. With every XML element, try to include varying
examples of content which will help define “how to” properly capture the content for that
element. For example, consider the title of a Technical Paper. How consistent has your
organization been in defining the title format and presentation in the past, in your print/PDF
versions? Do the authors break the title down into subtitles? Are there characters used within
the title that would suggest a break between title and subtitle? Be sure your design
specification takes this into consideration and defines a consistent method for capturing this
information. If this level of detail is not covered in your design specification, the conversion
vendor may or may not consistently capture the title and subtitle. The more details you can
provide on how you want your content converted, the more consistent the results will be, and
the easier it will be to convert from XML to other formats for electronic publishing.
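To illustrate the level of detail a title rule needs, here is a minimal sketch of one possible title/subtitle convention. Everything specific in it is an assumption made for the example: the separator rule (a colon, or a dash with spaces on both sides), the function name, and the sample titles, which are made up. The element names follow the NLM `<title-group>` structure.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule for this sketch: a colon, or a dash surrounded by spaces,
# separates title from subtitle. A real specification would list the separators
# actually found in the legacy content.
SEPARATOR = re.compile(r"\s*:\s+|\s+[-\u2013\u2014]\s+")


def build_title_group(raw_title: str) -> ET.Element:
    """Split a raw title once and tag it as <article-title> plus optional <subtitle>."""
    parts = SEPARATOR.split(raw_title.strip(), maxsplit=1)
    group = ET.Element("title-group")
    ET.SubElement(group, "article-title").text = parts[0]
    if len(parts) == 2 and parts[1]:
        ET.SubElement(group, "subtitle").text = parts[1]
    return group


if __name__ == "__main__":
    # Made-up titles, for illustration only.
    for raw in ("Fatigue of Welded Joints: A Review of Test Methods",
                "Low-Cycle Fatigue in Steel"):
        print(ET.tostring(build_title_group(raw), encoding="unicode"))
```

Note that the hyphen in “Low-Cycle” is not treated as a separator because it has no surrounding spaces. Writing the rule down this precisely is what lets a vendor apply it consistently.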

• Conversion specification decisions. It may not be financially beneficial to tag legacy content to
the same degree as current content. It depends on how the content will be used. Tagging
appropriately leaves opportunity to go back and further enhance legacy content when/if
necessary. What is the right balance? What is important to take into consideration when
creating conversion specifications? Consider these suggestions:

o For legacy content, consider capturing tables and equations as graphics. Capture the
content-type as “table” or “equation” for future conversion possibilities. Costs will rise
significantly when converting tables and equations to XML especially with large complex
tables. The amount of time dedicated by the vendor and the organization’s QA process
must increase. Keep in mind, however, the vendor will charge for the graphics. Some
charge a fixed price for each graphic.
o Capture reference citations as mixed-citation to preserve the variety of ways the
citations were written while tagging the individual items of the citation for future
reformatting of the content. By capturing the reference “as is” (mixed-citation) the
application code to deliver your final output for your customer may be simplified. Also
capture the individual parts of the citation so that it can be loaded into tools such as
CrossRef – Cited-by Linking®, a reference search database, etc. Also consider capturing
the type of publication as an attribute. That may add value later when needing to
display the references in a particular way depending on the type of publication. It is
challenging to determine publication types for some references when there is not
enough information or the reference does not follow a common format. Well defined
specifications will be important, as well as including a default type.
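As a sketch of what “as is” plus tagged parts can look like, the example below builds a `<mixed-citation>` in which punctuation and spacing are kept exactly as printed while the individual parts are wrapped in elements. The helper name, the default type of "other", and the reference itself are assumptions made for illustration; your specification should state which child elements and publication types to use.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional, Tuple


def mixed_citation(pieces: Iterable[Tuple[Optional[str], str]],
                   publication_type: str = "other") -> ET.Element:
    """Build a <mixed-citation> from (tag, text) pieces.

    A tag of None marks literal text (punctuation, spacing) that is kept as printed.
    The default publication type "other" is a hypothetical fallback value.
    """
    cite = ET.Element("mixed-citation", {"publication-type": publication_type})
    last = None  # most recently added child element
    for tag, text in pieces:
        if tag is not None:
            last = ET.SubElement(cite, tag)
            last.text = text
        elif last is None:
            cite.text = (cite.text or "") + text
        else:
            last.tail = (last.tail or "") + text
    return cite


# A made-up journal reference, split into tagged parts and literal punctuation.
EXAMPLE = [
    ("string-name", "Doe J."), (None, " "),
    ("article-title", "A made-up article title"), (None, ". "),
    ("source", "Journal of Examples"), (None, " "),
    ("year", "1987"), (None, ";"),
    ("volume", "12"), (None, ":"),
    ("fpage", "34"), (None, "-"),
    ("lpage", "56"), (None, "."),
]

if __name__ == "__main__":
    cite = mixed_citation(EXAMPLE, publication_type="journal")
    print(ET.tostring(cite, encoding="unicode"))
    # Reading the text back reproduces the citation as it appeared in print.
    print("".join(cite.itertext()))
```

Because the printed form can be recovered from the element's text, the display code can simply emit the citation unchanged, while a linking or search tool reads only the tagged parts it needs.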
o For legacy content, capture table and figure titles, figure captions, and legends as part of
the graphic, not as text. This ensures that key data is not missed and is kept with the
applicable content. This is also beneficial when dealing with a large volume of content
where your QA resources are limited. Lastly, figures created with symbols or varying
hyphens will be captured accurately. If the intent is to provide an archive of all graphics
and to reference them by figure caption or title then this approach is not feasible.
Remember though that if you ever wanted to go back and make that feature a
possibility you can, since all of the information is captured within the graphic and XML
attributes.
o Be specific on the graphics that are needed. This topic can be a whitepaper in itself.
There are many types of graphics that are valuable in different ways. Some standard
formats to consider are TIFF, JPG, GIF and SVG. Review the pros and cons of each of
these. If you have multiple needs, for example if you need 1) JPG for normal
black/white, grayscale and color figures; 2) GIF for thumbnails; 3) SVG for engineering
drawings and vector images; and 4) TIFF to preserve the graphic in its original form; then ask
for all four! The effort to convert from TIFF to JPG and GIF can be automated and
therefore should be no additional cost from the conversion vendor. SVG however may
be a different story depending on your vendor’s capabilities. Also consider the size of
the graphics. Keep the TIFF at the original size of the image, down-sample the GIF to
100x100 for a thumbnail, and resize the JPG if it is larger than the screen of most
handheld reader devices (600x800 pixels).
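The size rules for the JPG and GIF derivatives come down to simple arithmetic. The sketch below computes target dimensions only; the actual pixel conversion would be done by whatever imaging tool you or your vendor use, which is not shown. The function name and the sample scan size are assumptions made for the example.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_width: int, max_height: int) -> Tuple[int, int]:
    """Scale (width, height) down to fit the box, keeping the aspect ratio.

    Images that already fit are returned unchanged (never scaled up).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    original = (2400, 3000)                  # hypothetical TIFF scan, kept at original size
    print(fit_within(*original, 600, 800))   # JPG sized for a handheld reader screen
    print(fit_within(*original, 100, 100))   # GIF thumbnail
```

Keeping the aspect ratio means a 100x100 thumbnail box usually yields an image that touches the box on only one side. State in the specification whether that is acceptable or whether thumbnails should be padded or cropped.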
o Consider requesting an exact replica of the hardcopy in PDF and have it delivered
immediately in order to get immediate returns. If the archive is in hardcopy form, the
first thing the vendor must do is scan it into an electronic form that can be converted to
characters (using optical character recognition software). Request the scanned images
back as a PDF Image with dirty hidden text (OCR text prior to cleanup). You can put this
online immediately for access by your customers!
o Capture section, list and other labels as labels. Especially with legacy content, there is
probably no consistent definition for how to label the content. One paper may have
numbered sections while another may include letters. By placing the label within its
own element this allows for flexibility in how the content is published going forward.
o STM papers are heavily peppered with a wide range of character codes/symbols. There
are several things to consider when capturing these items into your XML.
 Standardize on Unicode for capturing your characters even when entity codes or
other options are available. Unicode is platform and language independent. It
is widely and strongly adopted. For more information, see Unicode.org.

 While there is a Unicode code point for every character, the fonts available to display
it in your reader or print format may be limited. While web browsers and
any print/publishing system where you control the fonts can produce great
results, some of the readers (Adobe Digital Editions, Kindle 2) are very limited in
the fonts they support, and more complex characters will appear as question
marks, blank boxes or another symbol. (Similar to what you’d see in Word if
you’re opening up a document and you do not have the appropriate font on
your desktop.) Some suggestions to address this situation include providing a
format of the document that captures all symbols accurately and/or capturing any
symbols/characters that are not within your standard fonts as a graphic instead
of the Unicode value.
 The hyphen is a pretty common character and can be represented in many forms
(em dash, en dash, etc.). Capturing them all as a hyphen (-) ensures that a very
common character is viewable in the limited readers mentioned earlier. Take
care, however, when the dash has meaning, for example when used in a figure.
This may not be an issue, however, if the captions/legends are captured as part
of the graphic.
 Be aware also that DTDs include predefined character sets. Some have found
the NLM DTD to have outdated codes for the & and < characters. Sticking with
Unicode will resolve this issue.
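The character-handling suggestions above can be sketched in a few lines. This is an illustration under stated assumptions: Python's built-in HTML entity table stands in for whatever entity sets your DTD defines, and the list of dash characters to fold is a starting point rather than a complete inventory. The minus sign is deliberately left out of the fold, since in STM content it usually carries meaning.

```python
import html
import unicodedata

# Dash-like characters folded into the plain hyphen. This set is an assumption
# for the sketch; the specification should list the characters actually found.
DASHES = {
    0x2010: "-",  # HYPHEN
    0x2011: "-",  # NON-BREAKING HYPHEN
    0x2012: "-",  # FIGURE DASH
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
}


def normalize_text(raw: str, fold_dashes: bool = True) -> str:
    """Resolve character references to Unicode, then optionally fold dashes."""
    text = html.unescape(raw)
    if fold_dashes:
        text = text.translate(DASHES)
    return text


def non_ascii_report(text: str) -> dict:
    """Map each non-ASCII character to its Unicode name, to check reader font coverage."""
    return {ch: unicodedata.name(ch, "UNNAMED") for ch in set(text) if ord(ch) > 127}


if __name__ == "__main__":
    sample = "pages 10&ndash;12, angle &alpha; &#x2264; 30&deg;"
    clean = normalize_text(sample)
    print(clean)
    for ch, name in sorted(non_ascii_report(clean).items()):
        print(f"U+{ord(ch):04X} {name}")
```

The report is the useful part for QA. It gives you the list of characters to test on each target reader, and therefore the list of candidates for capture as a graphic when a reader's fonts fall short.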
o Analyze a large sampling of documents including documents from every decade and
specify how the conversion vendor should deal with them. Look for anomalies and
document how they should be processed. Some anomalies which can be found include:
 Multipart figures – This may just be a common occurrence for your documents.
A multipart figure consists of multiple figures that share a common
figure caption.
 Documents with responses or discussions – Older technical papers commonly
have responses or discussions appended to the end of the paper. Luckily, the
NLM DTD accounts for this with the element <response>!
 Documents with advertisements included – Yep, advertisements. You probably
will choose not to convert this content.
 Documents with abstracts only
 Journals with pages that unfold into large-size documents or diagrams
 Sections of a paper with a title and no content
 Nested reference lists – This occurs not just in older content, so be clear in the
specifications how to address this.
 Logos on the cover page
 An article within a journal being continued on another page of the journal
(continued on page 493…)
 Articles in a journal that start on the same page as where another article ends

o When dealing with older content it is possible for parts of text to be missing or rubbed
off from the page. Decide how to handle this. Is the paper still worth converting? If so,
indicate that the content is [ILLEGIBLE].

o Carefully study the authors and affiliations on papers. Multiple authors can be grouped
with multiple affiliations. Authors can represent this in a variety of ways, including
placing numbers or letters next to the authors who are affiliated with the respective
organizations. This can get quite complex. The NLM DTD provides a great deal of
flexibility in how the authors and affiliations can be tagged within a <contrib> or
<contrib-group>. It’s very important to provide detailed instructions on how variations
should be tagged so that the content is tagged consistently for later reproduction.
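One of the tagging patterns the NLM DTD allows links each author to one or more affiliations through `<xref>` elements. The sketch below shows that single pattern as something a specification could prescribe. The function name, the `aff1`, `aff2` id scheme and the author data are assumptions made for the example, and other arrangements are equally valid under the DTD.

```python
import xml.etree.ElementTree as ET


def build_contrib_group(authors, affiliations):
    """Build a <contrib-group> with cross-referenced affiliations.

    authors: list of (surname, given_names, [affiliation keys as printed])
    affiliations: dict mapping printed key (e.g. "a", "1") to affiliation text
    """
    group = ET.Element("contrib-group")
    # Hypothetical id scheme: aff1, aff2, ... in the order the affiliations are listed.
    aff_ids = {key: f"aff{n}" for n, key in enumerate(affiliations, start=1)}
    for surname, given, keys in authors:
        contrib = ET.SubElement(group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
        for key in keys:
            ET.SubElement(contrib, "xref", {"ref-type": "aff", "rid": aff_ids[key]})
    for key, text in affiliations.items():
        ET.SubElement(group, "aff", {"id": aff_ids[key]}).text = text
    return group


if __name__ == "__main__":
    # Made-up authors; the second one carries two affiliation markers.
    group = build_contrib_group(
        [("Doe", "Jane", ["a"]), ("Roe", "Richard", ["a", "b"])],
        {"a": "Example University", "b": "Example Institute"},
    )
    print(ET.tostring(group, encoding="unicode"))
```

Whatever pattern you choose, the point is that the printed markers (letters, numbers, symbols) are translated into one consistent linking mechanism, so the pairing of authors and organizations can be reproduced later in any layout.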

o Be clear on what should and should not be a definition list. The basic rendering of a
definition list is as follows:
    Wikipedia
        Insert a definition here…..

If, however, your conversion vendor captures a variable definition such as the following
as a definition list:
    Y = the length in centimeters of a piece of gum.
you may not like how it’s rendered by default:
    Y
        The length in centimeters of a piece of gum

There is an attribute for definition lists that helps you capture the different types of lists,
thereby permitting rendering that is unique to each type.

o Include examples in your design specification/conversion specifications.

o Be aware of directional text in figure captions. I’ve seen this more in books than papers.
A caption may refer to more than one figure and distinguish them with a phrase such as “the
figure on the right” or with a clock position. Also, especially in photo figures, references
are made to the contents of the picture, such as mentioning people’s names in a photo
from left to right.

o If an “id” attribute exists for a given element, it’s best to use it. It may be helpful when
rendering the XML to a new format. If an “id” attribute exists, it’s there for a reason.
Someone found a need for it. So go ahead and include it. For example, sections can be
numbered in a variety of ways, but having a standard way to capture sections and
subsections will be helpful when rendering the content. For example, use “s_n.n.n”
where each “n” refers to a level deeper within the section. This will be valuable for
building a table of contents, and for choosing fonts and font sizes when rendering section
titles.
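The “s_n.n.n” convention is easy to apply or to check mechanically. The sketch below assigns ids by position, one number per nesting level, to every `<sec>` element under a parent. The function name is an assumption, and ids here are derived from document order rather than from the printed section labels, which may differ in legacy content.

```python
import xml.etree.ElementTree as ET


def assign_section_ids(parent: ET.Element, prefix: str = "s_") -> None:
    """Give every nested <sec> an id such as s_2.1.3, one number per nesting level."""
    def walk(node: ET.Element, path: list) -> None:
        position = 0
        for child in node:
            if child.tag == "sec":
                position += 1
                child_path = path + [str(position)]
                child.set("id", prefix + ".".join(child_path))
                walk(child, child_path)

    walk(parent, [])


if __name__ == "__main__":
    body = ET.fromstring(
        "<body>"
        "<sec><title>Introduction</title>"
        "<sec><title>Scope</title></sec>"
        "<sec><title>Method</title><sec><title>Setup</title></sec></sec>"
        "</sec>"
        "<sec><title>Results</title></sec>"
        "</body>"
    )
    assign_section_ids(body)
    for sec in body.iter("sec"):
        depth = sec.get("id").count(".") + 1  # nesting level, usable for heading size
        print(sec.get("id"), depth, sec.findtext("title"))
```

The number of dots in the id gives the nesting depth directly, which is the property that makes this scheme convenient for a table of contents and for choosing heading styles.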

o Remember to “adjust as you go”. You certainly don’t want to make a decision to
capture the XML one way and then change midstream; however, you may encounter an
anomaly that was missed during the analysis. The vendor and organization should have
a change control process that can be followed to analyze the anomaly and
adjust the necessary documentation and processes to address it.

• Paying the bills. An obvious cost of performing the conversion is the cost of the contracted
services (e.g. conversion vendor, expert consultants, QA analysts, etc.). But also plan for the
following, which may have a tangible or intangible cost to the organization:

o Account for disk space and method for transferring large amounts of data
o Plan for unexpected costs/buffers - overruns, rework, scope changes
o Include costs for shipping, copying, software licenses, readers
o Do you need to access an offsite facility/mine for your documents? Are there costs
associated with that?

Converting From XML


There are many formats to convert to once content is in XML. There are also many new products that
can be produced by combining or parsing out components from the XML. The options and ideas are
endless. It is increasingly commonplace to convert XML to epub and PDF. Both transformations are
possible with a variety of tools. I believe there is real value-add for an organization that has its own
information technology department in developing the content-to-product transformations internally. It’s
not rocket science, and it creates a repeatable process to be used over and over again on the
content. A per page charge from conversion vendors to perform this process seems unjustifiable when
this is a task that most organizations can fully automate internally. Keep in mind that in order
to fully automate the transformation from XML to epub or PDF (or some other form), decisions will need
to be made on how to handle the XML, and those plans must be well-developed and proven in live
production processes. The question to ask yourself is whether resources are needed to put custom
finishing touches on the content.
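To give a sense of what such an internal transformation involves, here is a deliberately small sketch that turns a section of XML into XHTML-style markup. It is not a production transform: Python stands in for whatever tool your IT department prefers, only `<title>`, `<p>` and nested `<sec>` are handled, and inline formatting is flattened to plain text. The function name and the sample content are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def sec_to_html(sec: ET.Element, level: int = 2) -> ET.Element:
    """Convert a <sec> (with <title>, <p> and nested <sec>) into an HTML <div>."""
    div = ET.Element("div")
    if sec.get("id"):
        div.set("id", sec.get("id"))
    for child in sec:
        if child.tag == "title":
            heading = ET.SubElement(div, f"h{min(level, 6)}")
            heading.text = "".join(child.itertext())
        elif child.tag == "p":
            ET.SubElement(div, "p").text = "".join(child.itertext())
        elif child.tag == "sec":
            div.append(sec_to_html(child, level + 1))
        # Every other element is skipped in this sketch; a real transform needs
        # an explicit decision for each element the DTD allows here.
    return div


if __name__ == "__main__":
    source = ET.fromstring(
        '<sec id="s_1"><title>Introduction</title>'
        "<p>Text with <italic>emphasis</italic>.</p>"
        "<sec><title>Scope</title><p>Nested content.</p></sec></sec>"
    )
    print(ET.tostring(sec_to_html(source), encoding="unicode"))
```

Even this toy version shows where the real effort lies: each element that falls through to the skipped branch is one of the handling decisions that has to be made, documented and proven before the process can run unattended.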

Summary
Time to jump in! If you are not convinced that XML is the way to go, compare its costs and
benefits with those of converting to other formats. Also, learn about XML and why it is such an
important standard.6 Lastly, do not skip the upfront analysis and exercises (see “Where to begin?”)
as these are essential to successfully completing the project!

Acknowledgements
Sperling Martin, Information Industry Consultant

6
W3C Extensible Markup Language (XML) http://www.w3.org/XML/


About the Author


Nancy Clarke, PMP, Business and Technology Consultant
Nancy Clarke is an accomplished business and technical consultant with over 15 years of
experience helping businesses improve their efficiency and effectiveness through the
implementation of content management applications and practices. Her most recent
assignment is assisting a large STM publisher in transforming their processes and
technologies to face the demands of today’s electronic publishing. Nancy has also held leadership
positions at large companies where she has deployed global web content management and Internet
publishing solutions. Her involvement begins with concept and follows through to implementation,
providing project management, technology architecture, and hands-on support wherever needed.
412-508-8299, nancy.clarke@gmail.com
LinkedIn Public Profile: http://www.linkedin.com/in/nancyaclarke
