
Open XML Deep Dive

Doug Mahugh
Technical Evangelist, Microsoft
http://blogs.msdn.com/dmahugh

Satisfy Your Technical Curiosity


Application type: Document Assembly
Server environment: Linux, Java, Apache, MySQL
Desktop environment: Office 2007



Session Objectives
Satisfy your curiosity about Open XML:

Architecture
The three main Open XML schemas
Development options
Custom XML support
Development scenarios



Today is the tip of the iceberg
Comprehensive 2-day Open XML Developer workshop scheduled for Belgium on May 21
Contact Imma Verheyen, Partner Development Manager: immav@microsoft.com



Diverse Environments
All you need is ZIP and XML support

Platform              ZIP library                                   XML library
Linux                 Minizip, zLib                                 Apache Xerces
Java                  J2SE java.util.zip                            JAXP
Microsoft COM         Xceed ActiveX controls                        MSXML
.NET Framework 3.0    System.IO.Packaging *, Xceed .NET controls    System.Xml

* Also includes abstractions for OPC (Open Packaging Convention) concepts



Development Scenarios
Document Assembly
Server-based or user-assisted construction of documents from archived content or database content.
Example: Create sales reports from financial and forecast data stored in a CRM system.

Integration & Content Reuse
Much easier to move content between documents, including different document types.
Example: Quickly and efficiently apply content stored in Word documents to Web pages.

Document Sanitization
Remove unwanted content like comments, embedded code or potentially sensitive items from your document when appropriate.
Example: Remove all tracked changes and comments from a Word document before it is published.

Document Interrogation
Query document repositories based on custom data, content types or document metadata.
Example: Search for all documents containing a specific company name or sales contact.

Content Tagging
Adding a tagging schema to content can dramatically improve content searches and the value of the data stored in documents.
Example: Organizations can create their own smart tags then use them as the basis for searches.

Document Archival
Ensuring document formats can be consumed long into the future without vendor-specific clients or applications.
Example: XML-based document archives include the data and presentation information.

XML in Office: the last 10 years
Office 97
Existing binary file formats, designed in 1994 and launched in Office 97

Office 2000: Early Innovation
XML Document Properties

Office XP: First XML Formats
Spreadsheet XML

Office 2003: Breakthrough XML Support
WordProcessingML, SpreadsheetML
Custom-defined schema

2007 Office system: New XML-based Formats
XML File format Default
XML PowerPoint Format

Open XML Architecture
Markup Languages: WordprocessingML, SpreadsheetML, PresentationML

Shared Vocabularies: DrawingML, VML (legacy), Custom XML, Metadata, Bibliography, Equations

Open Packaging Convention: Relationships, Content Types, Digital Signatures

Core Technologies: ZIP, XML + Unicode



Open Packaging Convention
Low-level conventions that define the structure of an Office Open XML document

Also used by XPS, and some third-party implementations are under development

Key concepts: package, parts, relationships, and content types



Parts
Stored inside the package in a specific location
Reachable via a URI
Associated with a specific content type

Often XML, but can be of any defined content type (including custom types)



Content Types

Every part must have a content type
Most Open XML parts have an XML content type
Consumers support a specific set of content types
You can define custom content types, and consumers will preserve them; this is a key area of opportunity for developer innovation

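A rough sketch of how a consumer might look up the content type of a part. It assumes the [Content_Types].xml layout from the published OPC specification (Default entries keyed by file extension, Override entries keyed by part name); that layout and the namespace URI are not shown in these slides, and the sample entries and the helper name content_type_of are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml (from the OPC spec, not from the slides).
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hypothetical sample content: three extension defaults and one override.
SAMPLE_TYPES = f"""<Types xmlns="{CT_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_of(part_name, types_xml):
    """Return the content type for a part name, or None if nothing matches."""
    root = ET.fromstring(types_xml)
    # An Override for the exact part name wins over any extension default.
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to the Default entry for the file extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_of("/word/document.xml", SAMPLE_TYPES))
print(content_type_of("/word/media/image1.jpg", SAMPLE_TYPES))
```

A custom content type would simply appear as another Default or Override entry in the same file.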
Relationships

Tie elements inside the package to each other
Allow you to step through the document without parsing parts
Are required: a part without a relationship is not part of the package, and may be discarded



OPC is a Logical Structure
Files and folders? NO! These details may vary.
Parts should be referenced by their relationship type.
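A minimal sketch of locating the main document part by relationship type instead of by a hard-coded path. The package-level relationships file (_rels/.rels), its namespace, and the officeDocument relationship type are taken from the published specification rather than from these slides. The in-memory sample package is deliberately stripped down (it omits [Content_Types].xml) and uses an unusual part name to show that the path itself does not matter; the helper names are hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = ("http://schemas.openxmlformats.org/officeDocument/"
                   "2006/relationships/officeDocument")


def build_sample_package(main_part="content/main.xml"):
    """Build a tiny ZIP with only a root relationship and one part.

    Not a complete Open XML package: [Content_Types].xml is omitted for brevity.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "_rels/.rels",
            f'<Relationships xmlns="{REL_NS}">'
            f'<Relationship Id="rId1" Type="{OFFICE_DOCUMENT}" Target="{main_part}"/>'
            "</Relationships>",
        )
        package.writestr(main_part, "<document/>")
    return buffer.getvalue()


def find_main_part(package_bytes):
    """Follow the officeDocument relationship from the package root."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        rels = ET.fromstring(package.read("_rels/.rels"))
        for rel in rels.findall(f"{{{REL_NS}}}Relationship"):
            if rel.get("Type") == OFFICE_DOCUMENT:
                # Targets in the root .rels are relative to the package root.
                return posixpath.normpath(rel.get("Target").lstrip("/"))
    return None


print(find_main_part(build_sample_package()))
```

The same lookup works unchanged whether the producer stored the part as word/document.xml or anywhere else in the package.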



Types of Interoperability
Reference Schemas: Display-oriented. Enables technical interoperability.
Custom-defined Schemas: Data-oriented. Enables semantic interoperability.
(Brian Jones, ODC2006)

WordprocessingML
Document architecture

Document
Parts: properties, body, comments, images, footnotes/endnotes, numberingDefinitions, headers/footers, styles, fontTable, customXML



Paragraphs, Runs and Text
How text is stored in WordprocessingML
The document element
• Contains a body element
  • Contains paragraphs
    • Contains runs
      • Contains text elements
<document>
  <body>
    <p>
      <r>
        <t>HELLO!</t>
      </r>
    </p>
  </body>
</document>

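A small sketch that generates the document, body, paragraph, run, text nesting programmatically. The slide omits namespace prefixes; real markup puts these elements in the WordprocessingML namespace, whose URI below comes from the published specification. The function name make_document is just an illustrative choice.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (from the spec; the slide omits it).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def make_document(paragraphs):
    """Return main-document XML with one run and one text element per paragraph."""
    document = ET.Element(f"{{{W}}}document")
    body = ET.SubElement(document, f"{{{W}}}body")
    for text in paragraphs:
        p = ET.SubElement(body, f"{{{W}}}p")   # paragraph
        r = ET.SubElement(p, f"{{{W}}}r")      # run
        t = ET.SubElement(r, f"{{{W}}}t")      # text
        t.text = text
    return ET.tostring(document, encoding="unicode")


print(make_document(["HELLO!"]))
```

The resulting string is only the main document part; it still has to be placed in a package with content types and relationships before a consumer will open it.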
Direct Formatting Example
Simple formatting at paragraph/run levels:

<w:p>
  <!-- Paragraph properties specify bold (default for the entire paragraph) -->
  <w:pPr>
    <w:b/>
  </w:pPr>
  <w:r>
    <w:t>The quick</w:t>
  </w:r>
  <!-- Run properties specify italics (override for this run) -->
  <w:r>
    <w:rPr>
      <w:i/>
    </w:rPr>
    <w:t>brown</w:t>
  </w:r>
  <w:r>
    <w:t>fox.</w:t>
  </w:r>
</w:p>



Paragraph Properties
Can be set directly or in a paragraph style
24 total property settings

<w:p>
  <w:pPr>
    <w:widowControl w:val="on" />
    <w:keepNext/>
    <w:keepLines/>
    <w:pageBreakBefore/>
    <w:suppressLineNumbers />
    <w:suppressAutoHyphens />
    <w:textBoxTightWrap />
  </w:pPr>
  … runs, paragraph content …
</w:p>



Run Properties
Define formatting for individual characters
Font attributes, size/position, other settings
24 total properties

<w:r>
  <w:rPr>
    <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial" />
    <w:b/>
    <w:i/>
    <w:sz w:val="11" />
    <w:dstrike w:val="true" />
  </w:rPr>
  … run content …
</w:r>



Text <w:t>
The only element in the main story that can
contain text – all other text is in attributes
Three other types of text are allowed in runs:
Deleted text <w:delText>
Field code <w:instrText>
Deleted field codes <w:delInstrText>
By looking only at <w:t> nodes, you can be sure you're seeing only displayed text
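As a hedged illustration of the point about displayed text, the sketch below collects only w:t content per paragraph, so the deleted text in the sample is skipped. The sample markup, including the tracked-deletion wrapper w:del and its attribute values, is invented for this example, and the namespace URI comes from the published specification.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: the word "red " has been deleted with change tracking on.
SAMPLE_DOCUMENT = f"""<w:document xmlns:w="{W}"><w:body>
<w:p>
  <w:r><w:t>The quick </w:t></w:r>
  <w:del w:id="1" w:author="someone"><w:r><w:delText>red </w:delText></w:r></w:del>
  <w:r><w:t>brown fox.</w:t></w:r>
</w:p>
<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
</w:body></w:document>"""


def displayed_text(document_xml):
    """Return one string per paragraph, built only from w:t elements."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        # w:delText, w:instrText and w:delInstrText have different tag names,
        # so iterating over w:t alone leaves them out.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return paragraphs


print(displayed_text(SAMPLE_DOCUMENT))
```

This simple version would count text twice if paragraphs were nested (for example inside table cells within a textbox); a fuller implementation would walk the tree level by level.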



Revision IDs (RSIDs)
RSID values are used to identify a set of
changes that were made during the same
editing session
Found in many elements:
Paragraphs, runs, sections, styles
Table rows, table properties, charts, diagrams
Allows for merging revisions, without the
privacy and security issues involved in tracking
who changed what
Optional, but recommended for applications
that modify existing documents

Images
An image is a w:pict element inside a run <w:r>
The v:imagedata element is defined in VML:
xmlns:v="urn:schemas-microsoft-com:vml"

The actual image is referenced via a relationship:

<w:pict>
  <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:250; height:200">
    <v:imagedata r:id="rId4"/>
  </v:shape>
</w:pict>

The relationship points to an image part in the package:

<Relationship Id="rId4"
  Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
  Target="image1.jpg"/>
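The lookup from r:id to the image part can be sketched as follows. The two XML samples reuse the markup from this slide, wrapped in enough surrounding elements to parse; the wrapper elements, the namespace URIs for w, r and the relationships file, and the helper name image_targets are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
V = "urn:schemas-microsoft-com:vml"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

DOCUMENT_XML = f"""<w:document xmlns:w="{W}" xmlns:v="{V}" xmlns:r="{R}">
  <w:body><w:p><w:r>
    <w:pict>
      <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:250; height:200">
        <v:imagedata r:id="rId4"/>
      </v:shape>
    </w:pict>
  </w:r></w:p></w:body>
</w:document>"""

RELS_XML = f"""<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId4" Type="{R}/image" Target="image1.jpg"/>
</Relationships>"""


def image_targets(document_xml, rels_xml):
    """Resolve every v:imagedata r:id to the Target of its relationship."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL}}}Relationship")
    }
    document = ET.fromstring(document_xml)
    return [
        targets[image.get(f"{{{R}}}id")]
        for image in document.iter(f"{{{V}}}imagedata")
    ]


print(image_targets(DOCUMENT_XML, RELS_XML))
```

In a real package the relationships would be read from the .rels file that belongs to the main document part, and each Target is resolved relative to that part.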

Tables
Tables are a set of paragraphs which are arranged into rows and columns
In WordprocessingML, tables are block-level content, and are specified using the table element <w:tbl>
Analogous to the HTML <table> element



What’s in a table?
<w:tbl>
  <!-- Properties -->
  <w:tblPr>
    <w:tblStyle w:val="TableGrid"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="01E0"/>
  </w:tblPr>
  <!-- Grid -->
  <w:tblGrid>
    <w:gridCol w:w="2952"/>
    <w:gridCol w:w="2952"/>
    <w:gridCol w:w="2952"/>
  </w:tblGrid>
  <!-- Rows -->
  <w:tr>
    <!-- Cells -->
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2952" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>1,1</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2952" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>1,2</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
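A short sketch of reading such a table back into rows and cells. The sample is a trimmed copy of the slide markup with the namespace declaration added (the URI is from the published specification); the helper name table_rows is illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

TABLE_XML = f"""<w:tbl xmlns:w="{W}">
  <w:tblGrid><w:gridCol w:w="2952"/><w:gridCol w:w="2952"/></w:tblGrid>
  <w:tr>
    <w:tc><w:p><w:r><w:t>1,1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>1,2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""


def table_rows(tbl_xml):
    """Return the table as a list of rows, each a list of cell text strings."""
    tbl = ET.fromstring(tbl_xml)
    rows = []
    for tr in tbl.findall(f"{{{W}}}tr"):          # rows are direct children of tbl
        cells = []
        for tc in tr.findall(f"{{{W}}}tc"):       # cells are direct children of tr
            cells.append("".join(t.text or "" for t in tc.iter(f"{{{W}}}t")))
        rows.append(cells)
    return rows


print(table_rows(TABLE_XML))
```

Because a cell holds ordinary paragraphs, the text extraction inside each cell is the same as for body text.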



Styles
A style defines a specific set of values for formatting properties that may be applied as a single logical unit

For example, the Normal style in Word 2007 defines these formatting properties:
Font = Calibri (body)
Font Size = 11 point
Font Language = Word default (as configured by user)
Justification = Left
Line Spacing = Single
Widow/Orphan control



Style Types
WordprocessingML supports six style types:

Paragraph styles
Character styles
Linked styles
Table styles
List styles
Default style (linked type, but applies when no style specified)

Paragraph Styles Example
Step 1: define a paragraph style
Styles are defined in the style part:

<w:style w:type="paragraph" w:styleId="TestParagraphStyle">
  <!-- Common properties -->
  <w:name w:val="Test Paragraph Style"/>
  <w:qFormat/>
  <w:rsid w:val="009E253E"/>
  <!-- Paragraph properties -->
  <w:pPr>
    <w:pStyle w:val="TestParagraphStyle"/>
    <w:spacing w:line="480" w:lineRule="auto"/>
    <w:ind w:firstLine="1440"/>
  </w:pPr>
  <!-- Character (run) properties -->
  <w:rPr>
    <w:rFonts w:ascii="Algerian" w:hAnsi="Algerian"/>
    <w:b/>
    <w:color w:val="ED1C24"/>
    <w:sz w:val="40"/>
  </w:rPr>
</w:style>



Paragraph Styles Example
Step 2: apply the style to a paragraph

The pStyle element associates a style with a paragraph:

<w:p>
  <w:pPr>
    <w:pStyle w:val="TestParagraphStyle"/>
  </w:pPr>
  <w:r>
    <w:t>Text</w:t>
  </w:r>
</w:p>

The paragraph is displayed with the style applied.



Numbering Styles
Flexible hierarchical definition

Numbering styles are styles which define the structure of a multi-level numbering format
Numbering definition instances are based on an abstract numbering definition
Abstract numbering definitions define paragraph properties for up to 9 hierarchical levels
NOTE: items in a list are simply paragraphs. There is no list "container" as in HTML.



Table Styles
A table style is associated with a table via the tblStyle element in the table properties:

<w:tbl>
  <w:tblPr>
    <!-- Table style Style20 is applied to the table -->
    <w:tblStyle w:val="Style20"/>
    <w:tblW w:w="5000" w:type="pct"/>
    <w:tblLook w:val="0220"/>
  </w:tblPr>
  … tblGrid, table rows and cells …
</w:tbl>



Style Application Hierarchy
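Formatting is applied in this order, each level overriding the ones before it, so direct formatting overrides style settings:

1. Document Defaults
2. Table
3. Numbering
4. Paragraph
5. Character
6. Direct Formatting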
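The layering can be sketched as a series of dictionary updates, applied from document defaults up to direct formatting. This is a simplification for illustration only: the property names and values are made up, and it ignores the special cases the specification defines (for example toggle properties such as bold).

```python
# Layers in the order they are applied; later layers override earlier ones.
LAYERS = ["document_defaults", "table", "numbering",
          "paragraph", "character", "direct"]


def effective_formatting(**layers):
    """Merge formatting dictionaries in hierarchy order and return the result."""
    result = {}
    for name in LAYERS:
        # A missing layer contributes nothing.
        result.update(layers.get(name, {}))
    return result


formatting = effective_formatting(
    document_defaults={"font": "Calibri", "size": 11},
    paragraph={"size": 20, "bold": True},
    direct={"bold": False},
)
print(formatting)
```

Here the font comes from the document defaults, the size from the paragraph style, and the bold setting from direct formatting.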

Subdocuments
Mechanism for “rolling up” documents
Subdocuments are well-formed Open XML
documents and can be edited independently
Subdocuments don’t know they’re part of
something bigger – they’re just stand-alone
documents



Subdocuments
Implementation details
Main document part contains subDoc elements that indicate where to insert subdocuments
The subdocument's location is stored in a relationship

Main document part:

<w:body>
  <w:subDoc r:id="rId1"/>
  <w:subDoc r:id="rId2"/>
  <w:subDoc r:id="rId3"/>
  …
</w:body>

Relationships:

<Relationship Id="rId1" Type="…/subDocument" Target="Part1.docx" TargetMode="External"/>
<Relationship Id="rId2" Type="…/subDocument" Target="Part2.docx" TargetMode="External"/>
<Relationship Id="rId3" Type="…/subDocument" Target="Part3.docx" TargetMode="External"/>



Document Sections
A document may be divided into sections
Allows formatting at a higher level than
paragraphs:
Landscape/portrait orientation
Page margins, etc.
Section properties are defined in sectPr:
<w:sectPr>
  <w:pgSz w:w="12240" w:h="15840"/>
  <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"
           w:header="720" w:footer="720" w:gutter="0"/>
  <w:cols w:space="720"/>
  <w:docGrid w:linePitch="360"/>
</w:sectPr>



Section Properties
Example
In Word, section properties are
specified in the Page Setup dialog

<w:sectPr>
  <w:pgSz w:w="12240" w:h="15840" />
  <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"
w:header="720" w:footer="720" w:gutter="0" />
  <w:cols w:space="720" />
  <w:docGrid w:linePitch="360" />
  </w:sectPr>
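The page size and margin values are measured in twentieths of a point, so 1440 units make one inch. A small sketch that converts the sectPr above to inches; the namespace declaration is added so the sample parses, and the helper name page_setup_in_inches is illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_INCH = 1440  # 20 twentieths-of-a-point per point, 72 points per inch

SECT_PR_XML = f"""<w:sectPr xmlns:w="{W}">
  <w:pgSz w:w="12240" w:h="15840" />
  <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"
           w:header="720" w:footer="720" w:gutter="0" />
  <w:cols w:space="720" />
  <w:docGrid w:linePitch="360" />
</w:sectPr>"""


def page_setup_in_inches(sect_pr_xml):
    """Return page width, height and margins converted to inches."""
    sect = ET.fromstring(sect_pr_xml)
    size = sect.find(f"{{{W}}}pgSz")
    margins = sect.find(f"{{{W}}}pgMar")

    def inches(element, attribute):
        return int(element.get(f"{{{W}}}{attribute}")) / TWIPS_PER_INCH

    return {
        "width": inches(size, "w"),
        "height": inches(size, "h"),
        "top": inches(margins, "top"),
        "right": inches(margins, "right"),
        "bottom": inches(margins, "bottom"),
        "left": inches(margins, "left"),
    }


print(page_setup_in_inches(SECT_PR_XML))
```

For this sample the result is a US Letter page (8.5 by 11 inches) with one-inch margins.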



Custom XML Support

Merging the worlds of documents and data



Why Custom XML?
Enables semantic interoperability
Documents can provide a rich view of back-end data
Documents can update back-end data sources

Exposes business data within documents to heterogeneous systems
Business-specific semantics can be applied to
document data
Separates presentation and data
Custom XML schema support was a key design objective for Open XML: any schema can be used in Open XML documents.

Custom XML
Developer options for custom XML support



Custom-defined XML is stored in its own discrete part
Any XML can be stored, with or without a schema
Only one requirement: must be well-formed XML
External applications (client/server) can process the store or populate the store

[Diagram: a document template containing visual document parts and XML data, connected to an external system]

Custom XML Properties
Information about a custom XML part is stored
in a custom XML properties part
Stored via an implicit customXmlProps
relationship from the custom XML part
Contains two types of information:
Part ID
Uniquely identifies a part within a document
Maintained through editing sessions
XML Schema references



Structured Document Tags
Known as "content controls" in MS-Office

Smart tags and custom XML markup add semantics, but do not have any effect on presentation
Sometimes you want to affect presentation
Data-entry restrictions, multi-select, etc.
Solution: the structured document tag <sdt>



Types of Content Controls
Plain text
Combobox
Dropdown list
Document building block
Date picker
Rich text
Picture



Data Binding
2-way synchronization between:
Content controls (structured document tags)
Custom XML nodes (data in your schema)

Data Binding Basics
How to bind XML nodes to structured document tags

Add a <dataBinding> element to the structured document tag properties <sdtPr>
<dataBinding> specifies a custom XML part (by Custom XML Data Identifier) and an XPath to a specific node within that part

Custom XML Data Identifier? What's that?
The custom XML part has a properties part
Implicit relationship in customXmlPart.xml.rels
The properties part specifies a Custom XML Data Identifier

Content Control Toolkit
Open-source developer tool
http://www.codeplex.com/Wiki/View.aspx?ProjectName=dbe
Automatically generates parts, relationships, and markup to bind custom XML parts to content controls



Custom XML Markup
Tagging document content with custom semantics

Allows embedding the structure from any XML schema into a WordprocessingML document
Schema not required
XML doesn’t have to validate against your schema
Custom XML elements may have custom attributes
Consumers/producers preserve your attributes



Custom XML Markup
Example



XML Mapping in SpreadsheetML

XML elements and attributes may be mapped to cells and tables
Store a copy of the schema in the workbook
Data is in an external XML file

SpreadsheetML
Document architecture
Workbook
Parts: properties, styles, sharedStrings, calcChain, sheet1..N, table, chart, drawing

SpreadsheetML
Performance optimizations

SpreadsheetML has been optimized based on analysis of typical spreadsheet usage patterns:

Small tag size (often a single character)
Shared strings
Shared formulas
Sparse table markup allowed
Optional r="A1" attribute for faster loading
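With sparse markup, a reader relies on the r attribute to know where each cell sits. A minimal sketch of turning an "A1"-style reference into 1-based row and column numbers; the function name parse_cell_ref is an illustrative choice, and absolute references with $ signs are not handled.

```python
import re


def parse_cell_ref(ref):
    """Convert a reference such as "B4" into a (row, column) pair, both 1-based."""
    match = re.fullmatch(r"([A-Z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not a simple cell reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters:
        # Column letters behave like a base-26 number with digits A=1 .. Z=26.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return int(digits), column


print(parse_cell_ref("A1"))
print(parse_cell_ref("AA10"))
```

Knowing the position of every cell up front lets a consumer place values directly instead of counting rows and cells while it reads.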



SpreadsheetML Strings
Two alternatives for storing text strings

1. Inline strings
• Provided for ease of translation/conversion
• Useful in XSLT scenarios
• Excel and other consumers may convert to shared
strings on document save
2. An entry in the shared-strings table
• May be either a simple string or formatted text

These approaches may be mixed/combined


Shared Strings
Repetitive strings are common in typical spreadsheets

Strings are stored in a shared-strings part:
Each unique string is stored once
Cells store the index (0-based) of the string

Benefits:
Users: reduced file size, improved performance
Developers: all strings are in one part, simplifying search, localization, and other common string-handling tasks



Shared Strings
Sample shared-strings table
6 string references, 4 unique strings

<sst xmlns="..." count="6" uniqueCount="4">
  <si>
    <t>Paris</t>      <!-- Paris = string 0 -->
  </si>
  <si>
    <t>Seattle</t>
  </si>
  <si>
    <t>London</t>
  </si>
  <si>
    <t>Copenhagen</t>
  </si>
</sst>

A worksheet cell refers to a string by its index:

<row r="1" spans="1:1">
  <c r="A1" t="s">
    <v>0</v>
  </c>
</row>
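A sketch of how a producer might build the shared-strings table: each unique string is stored once, and every cell keeps a 0-based index into the table. The six city names used as input are an assumed set of cell values chosen to match the counts on this slide; the namespace URI is from the published specification and the function name is hypothetical.

```python
from xml.sax.saxutils import escape

SS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def build_shared_strings(cell_texts):
    """Return (sst_xml, indexes): the table markup and one index per cell."""
    unique = []       # strings in first-seen order; position = index
    index_of = {}     # string -> index
    indexes = []      # one entry per cell
    for text in cell_texts:
        if text not in index_of:
            index_of[text] = len(unique)
            unique.append(text)
        indexes.append(index_of[text])
    items = "".join(f"<si><t>{escape(s)}</t></si>" for s in unique)
    sst_xml = (f'<sst xmlns="{SS}" count="{len(indexes)}" '
               f'uniqueCount="{len(unique)}">{items}</sst>')
    return sst_xml, indexes


sst_xml, indexes = build_shared_strings(
    ["Paris", "Seattle", "London", "Copenhagen", "Paris", "London"]
)
print(indexes)
print(sst_xml)
```

The count attribute is the number of string references in the workbook, and uniqueCount is the number of si entries.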



Inline Strings
No shared-strings part required
Especially useful in XSLT scenarios
If you’re consuming Open XML documents, you must
handle both cases: inline strings and/or shared strings
Excel 2007 converts to shared strings on save

<sheetData>
<row><c t="inlineStr"><is><t>Paris</t></is></c></row>
<row><c t="inlineStr"><is><t>Seattle</t></is></c></row>
<row><c t="inlineStr"><is><t>London</t></is></c></row>
<row><c t="inlineStr"><is><t>Copenhagen</t></is></c></row>
<row><c t="inlineStr"><is><t>Paris</t></is></c></row>
<row><c t="inlineStr"><is><t>London</t></is></c></row>
</sheetData>
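A consumer that handles both cases might look like the sketch below. It reads a cell's text from the shared-strings table when t="s", from the inline is element when t="inlineStr", and otherwise returns the raw value. The sample sheet and table are made up for the example, the namespace URI is from the published specification, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

SS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

SST_XML = f'<sst xmlns="{SS}" count="1" uniqueCount="1"><si><t>Paris</t></si></sst>'
SHEET_XML = f"""<sheetData xmlns="{SS}">
  <row><c r="A1" t="s"><v>0</v></c></row>
  <row><c r="A2" t="inlineStr"><is><t>Seattle</t></is></c></row>
  <row><c r="A3"><v>42</v></c></row>
</sheetData>"""


def load_shared_strings(sst_xml):
    """Return the shared strings as a list, indexed like the sst table."""
    root = ET.fromstring(sst_xml)
    return ["".join(t.text or "" for t in si.iter(f"{{{SS}}}t"))
            for si in root.findall(f"{{{SS}}}si")]


def cell_text(cell, shared):
    """Return the text of one c element, whichever string storage it uses."""
    cell_type = cell.get("t")
    if cell_type == "s":
        # Shared string: v holds a 0-based index into the table.
        return shared[int(cell.find(f"{{{SS}}}v").text)]
    if cell_type == "inlineStr":
        # Inline string: the text lives inside the cell itself.
        return "".join(t.text or "" for t in cell.iter(f"{{{SS}}}t"))
    value = cell.find(f"{{{SS}}}v")
    return value.text if value is not None else ""


def all_cell_texts(sheet_xml, sst_xml):
    shared = load_shared_strings(sst_xml)
    return [cell_text(c, shared) for c in ET.fromstring(sheet_xml).iter(f"{{{SS}}}c")]


print(all_cell_texts(SHEET_XML, SST_XML))
```

Numbers and other non-string cells fall through to the last branch, which returns the stored value as text.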

SpreadsheetML Tables
Design goals for SpreadsheetML tables:
1. Separate presentation and data
Data stays in the worksheet
Table definition is in a separate part (referenced via a relationship)
2. Cell definition lightweight but extensible
Complex type with future storage capabilities
Named ranges written in their own collection instead of on each cell

Open XML has different types of tables for each document type, optimized for different scenarios:
WordprocessingML has its tbl element
SpreadsheetML has its table element
PresentationML uses DrawingML tables (tbl inside graphicData)



SpreadsheetML Table Example
Worksheet part:

<sheetData>
  <!-- Headings = shared strings -->
  <row r="1" spans="1:2">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
  </row>
  <row r="2" spans="1:2">
    <c r="A2"><v>1</v></c>
    <c r="B2"><v>4</v></c>
  </row>
  <row r="3" spans="1:2">
    <c r="A3"><v>2</v></c>
    <c r="B3"><v>5</v></c>
  </row>
  <row r="4" spans="1:2">
    <c r="A4"><v>3</v></c>
    <c r="B4"><v>6</v></c>
  </row>
</sheetData>
...
<tableParts count="1">
  <tablePart r:id="rId2"/>
</tableParts>

Table-definition part:

<table … ref="A1:B4" …>
  <autoFilter ref="A1:B4"/>
  <tableColumns count="2">
    <tableColumn id="1" name="Column1" />
    <tableColumn id="2" name="Column2" />
  </tableColumns>
  <tableStyleInfo …/>
</table>



AutoFilter Example



Formulas
Stored as plain text
Documented in the spec to provide for predictable interoperability

<row>
  <c>
    <v>1</v>
  </c>
</row>
<row>
  <c>
    <v>2</v>
  </c>
</row>
<row>
  <c>
    <v>3</v>
  </c>
</row>
<row>
  <c>
    <f>SUM(A1:A3)</f>
  </c>
</row>

DrawingML



DrawingML vs. VML
Per the Ecma spec: “VML should be considered
a deprecated format included in Office Open
XML for legacy reasons only.”
VML was not entirely replaced by DrawingML
before submission to Ecma

Main remaining uses of VML:
WordprocessingML: OfficeArt shapes, textboxes
SpreadsheetML/PresentationML: comments, embedded OLE objects

3-D Effects
[Figure: a shape shown at each step. Panel labels: Before, Apply 3-D Scene, Apply 3-D Bevels, Adjust Material types, 3-D Scene Definition]



DrawingML
Implementation varies for each document type

Location varies (main body, drawing part, slide)
Packaging (“shim”) varies

[Figure: sample drawings in WordprocessingML (in Word), SpreadsheetML (in Excel), and PresentationML (in PowerPoint)]



WordprocessingML
DrawingML is stored in the document body

Shim defines graphic frame and locked canvas

Shape definition is DrawingML



SpreadsheetML
Drawing is in a separate drawing part

Shim defines anchor position and type

Shape definition uses spreadsheetDrawing namespace for non-visual properties



PresentationML
DrawingML is stored in the slide part

No shim: the shape is in the shape tree

Shape definition is DrawingML



PresentationML
Document architecture
 
Presentation
Parts: Presentation Properties, View Properties, Themes, Slide Masters, Slide Layouts, Fonts, Slides, Notes Slides, Notes Masters, Handout Masters, Code



Sample Slide
Typical presentationML content
Shape Textbox Chart



Slide Part
Shape tree contains slide content definitions
<p:sld xmlns:p="…/presentationml/2006/main"
       xmlns:a="…/drawingml/2006/main" …>
  <p:cSld>
    <p:spTree>
      <!-- Shape -->
      <p:sp>
        <p:nvSpPr>
          <p:cNvPr id="2" name="7-Point Star 1" />
          …
      <!-- Textbox -->
      <p:sp>
        <p:nvSpPr>
          <p:cNvPr id="3" name="TextBox 2" />
          …
      <!-- Chart -->
      <p:graphicFrame>
        <p:nvGraphicFramePr>
          <p:cNvPr id="4" name="Chart 3" />
          …
    </p:spTree>
  </p:cSld>
  <p:clrMapOvr>
    <a:masterClrMapping />
  </p:clrMapOvr>
</p:sld>

Chart Part (chart1.xml)

Data source

Shape Textbox Chart



PresentationML Tables
Slide part contains table definition
In a graphicFrame element
All DrawingML is in the slide – no separate “table part”

Header-row formatting

Banded-row formatting
Table position
TableStyleID = GUID

Table definition

OpenXmlDeveloper.org
Formed by 40 companies to share developer information about
the Office Open XML file formats
Articles with source code for C#, VB, Java, PHP, XSLT
Forums for posting technical questions



The Ecma Spec
1. Fundamentals
2. Open Packaging Convention
3. Primer (start here)
4. Markup Language Reference (huge!)
5. Markup Compatibility and Extensibility
Reference Schemas (XSD, RelaxNG)

Tips:
• Start with part 3, Primer
• Use the PDF version of part 4 to look up elements/attributes
Open XML Blogs

Brian Jones: http://blogs.msdn.com/brian_jones
Doug Mahugh: http://blogs.msdn.com/dmahugh
Kevin Boske: http://blogs.msdn.com/kboske
Wouter Van Vugt: http://blogs.infosupport.com/wouterv
Erika Ehrli: http://blogs.msdn.com/erikaehrli

See complete list on www.OpenXmlDeveloper.org
