Open XML Deep Dive
Doug Mahugh
Technical Evangelist, Microsoft
http://blogs.msdn.com/dmahugh
Architecture
The three main Open XML schemas
Development options
Custom XML support
Development scenarios
Document Sanitization
Remove unwanted content like comments, embedded code or potentially sensitive items from your document when appropriate.
Example: Remove all tracked changes and comments from a Word document before it is published.
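The sanitization scenario can be sketched in a few lines of Python. This is a minimal illustration, not a complete sanitizer: it assumes that accepting a tracked change means keeping the content of w:ins and dropping w:del, it only looks at the main document body, and it ignores the separate comments part and other revision markup that a real document may contain. The sample markup and the sanitize helper are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

def q(tag):
    """Return the namespace-qualified name ElementTree uses for a w: element."""
    return f"{{{W}}}{tag}"

# Elements dropped entirely: deleted text and comment anchors.
DROP = {q("del"), q("commentRangeStart"), q("commentRangeEnd"), q("commentReference")}

def sanitize(elem):
    """Accept tracked changes and strip comment anchors, in place."""
    cleaned = []
    for child in list(elem):
        if child.tag in DROP:
            continue                      # discard the whole subtree
        sanitize(child)
        if child.tag == q("ins"):
            cleaned.extend(list(child))   # keep inserted runs, drop the wrapper
        else:
            cleaned.append(child)
    elem[:] = cleaned
    return elem

# Hypothetical body fragment with one comment, one deletion and one insertion.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:commentRangeStart w:id="0"/>
    <w:r><w:t>Kept </w:t></w:r>
    <w:del w:id="1" w:author="a"><w:r><w:delText>removed </w:delText></w:r></w:del>
    <w:ins w:id="2" w:author="a"><w:r><w:t>inserted</w:t></w:r></w:ins>
    <w:commentRangeEnd w:id="0"/>
    <w:r><w:commentReference w:id="0"/></w:r>
  </w:p>
</w:body>"""

body = sanitize(ET.fromstring(SAMPLE))
text = "".join(t.text for t in body.iter(q("t")))
print(text)  # Kept inserted
```

A production tool would also need to delete the comments part and its relationship from the package, which this sketch does not attempt.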
Document Interrogation
Query document repositories based on custom data, content types or document metadata.
Example: Search for all documents containing a specific company name or sales contact.
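As a rough sketch of the interrogation scenario, the following Python searches a small in-memory "repository" for a company name. It assumes the main document part lives at word/document.xml (the location Word conventionally uses; a robust tool would follow the package relationships instead), and the make_package helper builds stripped-down demo packages that are not complete, openable documents. All file names and text are invented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def make_package(*paragraphs):
    """Build a demo package holding only word/document.xml (not a full document)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()

def paragraphs_of(package_bytes):
    """Yield the plain text of each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for p in root.iter(f"{{{W}}}p"):
        # Text may be split across several runs, so join all w:t elements.
        yield "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))

def find_documents(repository, needle):
    """Return the names of packages whose text contains needle (case-insensitive)."""
    needle = needle.lower()
    return [name for name, data in repository.items()
            if any(needle in para.lower() for para in paragraphs_of(data))]

repository = {
    "proposal.docx": make_package("Prepared for Contoso Ltd.", "Contact: Jane Doe"),
    "memo.docx": make_package("Internal memo", "No customers mentioned"),
}
matches = find_documents(repository, "contoso")
print(matches)  # ['proposal.docx']
```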
Content Tagging
Adding a tagging schema to content can dramatically improve content searches and the value of the data stored in documents.
Example: Organizations can create their own smart tags then use them as the basis for searches.
Document Archival
Ensuring document formats can be consumed long into the future without vendor-specific clients or applications.
Example: XML-based document archives include the data and presentation information.
Satisfy Your Technical Curiosity
XML in Office: the last 10 years
Office 2003
Breakthrough XML Support
WordprocessingML, SpreadsheetML
Custom-defined schema
Office XP
First XML Formats
Spreadsheet XML
Office 97
Existing binary file formats designed in
1994, launched in Office 97
Open XML Architecture
Markup Languages
Shared Vocabularies
Core Technologies
Often XML, but can be of any defined content type (including custom types)
Document
properties
comments
footnotes/endnotes
headers/footers
fontTable
body
images
numberingDefinitions
styles
customXML
<w:p>
<w:pPr>
<w:widowControl w:val="on" />
<w:keepNext/>
<w:keepLines/>
<w:pageBreakBefore/>
<w:suppressLineNumbers />
<w:suppressAutoHyphens />
<w:textBoxTightWrap />
</w:pPr>
… runs, paragraph content …
</w:p>
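The on/off properties in a paragraph's w:pPr can be read with a short script. The sketch below assumes the usual on/off convention: an element with no w:val attribute is on, and the values "0", "false" and "off" turn it off. For simplicity it treats every child of w:pPr as an on/off property, which holds for this sample but not for w:pPr in general. The sample paragraph and the toggle_properties helper are illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OFF_VALUES = {"0", "false", "off"}

# Hypothetical paragraph using three of the properties shown above.
PARAGRAPH = f"""<w:p xmlns:w="{W}">
  <w:pPr>
    <w:widowControl w:val="on"/>
    <w:keepNext/>
    <w:keepLines w:val="off"/>
  </w:pPr>
</w:p>"""

def toggle_properties(paragraph):
    """Map each child of w:pPr to True (on) or False (off)."""
    ppr = paragraph.find(f"{{{W}}}pPr")
    result = {}
    if ppr is None:
        return result
    for prop in ppr:
        name = prop.tag.split("}")[1]      # strip the namespace part
        val = prop.get(f"{{{W}}}val")      # None when w:val is absent
        result[name] = val not in OFF_VALUES
    return result

props = toggle_properties(ET.fromstring(PARAGRAPH))
print(props)  # {'widowControl': True, 'keepNext': True, 'keepLines': False}
```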
<w:r>
<w:rPr>
<w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial" />
<w:b/>
<w:i/>
<w:sz w:val="11" />
<w:dstrike w:val="true" />
</w:rPr>
… run content …
</w:r>
<Relationship Id="…"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
Target="image1.jpg"/>
<w:tbl>
<!-- Properties -->
<w:tblPr>
<w:tblStyle w:val="TableGrid"/>
<w:tblW w:w="0" w:type="auto"/>
<w:tblLook w:val="01E0"/>
</w:tblPr>
<!-- Grid -->
<w:tblGrid>
<w:gridCol w:w="2952"/>
<w:gridCol w:w="2952"/>
<w:gridCol w:w="2952"/>
</w:tblGrid>
<!-- Rows -->
<w:tr>
<!-- Cells -->
<w:tc>
<w:tcPr>
<w:tcW w:w="2952" w:type="dxa"/>
</w:tcPr>
<w:p>
<w:r>
<w:t>1,1</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="2952" w:type="dxa"/>
</w:tcPr>
<w:p>
<w:r>
<w:t>1,2</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>
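To show how the table structure maps to data, here is a small Python sketch that reads cell text row by row and converts the grid column widths to inches. It relies on widths of type dxa being twentieths of a point, so 1440 units make one inch. The sample is a trimmed version of the table above, and table_to_rows and grid_widths_inches are illustrative helper names.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
TWIPS_PER_INCH = 1440  # dxa units are twentieths of a point

# Trimmed version of the table fragment above, with the namespace declared.
TABLE = f"""<w:tbl xmlns:w="{W}">
  <w:tblGrid><w:gridCol w:w="2952"/><w:gridCol w:w="2952"/></w:tblGrid>
  <w:tr>
    <w:tc><w:p><w:r><w:t>1,1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>1,2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""

def table_to_rows(tbl):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in tbl.findall("w:tr", NS):
        cells = []
        for tc in tr.findall("w:tc", NS):
            # A cell holds one or more paragraphs; join them with newlines.
            paragraphs = ["".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                          for p in tc.findall("w:p", NS)]
            cells.append("\n".join(paragraphs))
        rows.append(cells)
    return rows

def grid_widths_inches(tbl):
    """Return the width of each grid column in inches."""
    return [int(col.get(f"{{{W}}}w")) / TWIPS_PER_INCH
            for col in tbl.findall("w:tblGrid/w:gridCol", NS)]

tbl = ET.fromstring(TABLE)
rows = table_to_rows(tbl)
widths = grid_widths_inches(tbl)
print(rows)  # [['1,1', '1,2']]
```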
For example, the Normal style in Word 2007 defines these formatting properties:
Font = Calibri (body)
Font Size = 11 point
Font Language = Word default (as configured by user)
Justification = Left
Line Spacing = Single
Widow/Orphan control
Paragraph styles
Character styles
Linked styles
Table styles
List styles
Default style (linked type, but applies when no style is specified)
Paragraph Styles Example
Step 1: define a paragraph style
Styles are defined in the styles part:
<w:style …>
…
<!-- Character (run) properties -->
<w:rPr>
<w:rFonts w:ascii="Algerian" w:hAnsi="Algerian"/>
<w:b/>
<w:color w:val="ED1C24"/>
<w:sz w:val="40"/>
</w:rPr>
</w:style>
Table style Style20 is applied to the table:
<w:tbl>
<w:tblPr>
<w:tblStyle w:val="Style20"/>
<w:tblW w:w="5000" w:type="pct"/>
<w:tblLook w:val="0220"/>
</w:tblPr>
… tblGrid, table rows and cells …
</w:tbl>
Document Defaults
Table
Numbering
Paragraph
Character
Direct Formatting
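Assuming the list above is the order in which formatting is applied, from document defaults first to direct formatting last, the effective formatting of a run can be sketched as a series of overrides. The property names and values below are invented for illustration, and the sketch deliberately ignores special cases such as on/off properties like bold, which may combine across style levels rather than simply override.

```python
# Order of application: each later level overrides the earlier ones.
LEVELS = [
    "document defaults",
    "table style",
    "numbering",
    "paragraph style",
    "character style",
    "direct formatting",
]

def resolve(formatting_by_level):
    """Merge per-level formatting dicts; later levels win on conflicts."""
    effective = {}
    for level in LEVELS:
        effective.update(formatting_by_level.get(level, {}))
    return effective

# Hypothetical formatting for one run of text.
example = {
    "document defaults": {"font": "Calibri", "size_pt": 11, "color": "000000"},
    "paragraph style": {"size_pt": 20, "color": "ED1C24"},
    "direct formatting": {"size_pt": 14},
}

effective = resolve(example)
print(effective)  # {'font': 'Calibri', 'size_pt': 14, 'color': 'ED1C24'}
```

The font comes from the document defaults because nothing overrides it, the color from the paragraph style, and the size from direct formatting, which is applied last.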
Relationships:
<Relationship Id="rId1" Type="…/subDocument" Target="Part1.docx" TargetMode="External"/>
<Relationship Id="rId2" Type="…/subDocument" Target="Part2.docx" TargetMode="External"/>
<Relationship Id="rId3" Type="…/subDocument" Target="Part3.docx" TargetMode="External"/>
<w:sectPr>
<w:pgSz w:w="12240" w:h="15840" />
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"
w:header="720" w:footer="720" w:gutter="0" />
<w:cols w:space="720" />
<w:docGrid w:linePitch="360" />
</w:sectPr>
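The page size and margin values above are expressed in twentieths of a point, so 1440 units equal one inch. The following sketch converts them to inches; the attrs_in_inches helper is an illustrative name, and the sample repeats only the pgSz and pgMar elements from the section properties shown above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_INCH = 1440  # 20 twips per point, 72 points per inch

SECT_PR = f"""<w:sectPr xmlns:w="{W}">
  <w:pgSz w:w="12240" w:h="15840"/>
  <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"
           w:header="720" w:footer="720" w:gutter="0"/>
</w:sectPr>"""

def attrs_in_inches(element):
    """Convert every attribute of the element from twips to inches."""
    return {name.split("}")[1]: int(value) / TWIPS_PER_INCH
            for name, value in element.attrib.items()}

sect = ET.fromstring(SECT_PR)
page = attrs_in_inches(sect.find(f"{{{W}}}pgSz"))
margins = attrs_in_inches(sect.find(f"{{{W}}}pgMar"))

# Usable text width is the page width minus the left and right margins.
text_width = page["w"] - margins["left"] - margins["right"]

print(page)        # {'w': 8.5, 'h': 11.0}
print(text_width)  # 6.5
```

So this section describes a US Letter page (8.5 by 11 inches) with one-inch margins and headers and footers placed half an inch from the page edge.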
Custom XML Properties
Information about a custom XML part is stored in a custom XML properties part
Stored via an implicit customXmlProps relationship from the custom XML part
Contains two types of information:
• Part ID
  Uniquely identifies a part within a document
  Maintained through editing sessions
• XML Schema references
Allows embedding the structure from any XML schema into a WordprocessingML document
[Diagram of package parts: styles, sharedStrings, calcChain, sheet1..N, table, chart, drawing]
SpreadsheetML
Performance optimizations
1. Inline strings
• Provided for ease of translation/conversion
• Useful in XSLT scenarios
• Excel and other consumers may convert to shared
strings on document save
2. An entry in the shared-strings table
• May be either a simple string or formatted text
Benefits of the shared-strings table:
Users: reduced file size, improved performance
Developers: all strings are in one part, simplifying search, localization, and other common string-handling tasks
<sheetData>
<row><c t="inlineStr"><is><t>Paris</t></is></c></row>
<row><c t="inlineStr"><is><t>Seattle</t></is></c></row>
<row><c t="inlineStr"><is><t>London</t></is></c></row>
<row><c t="inlineStr"><is><t>Copenhagen</t></is></c></row>
<row><c t="inlineStr"><is><t>Paris</t></is></c></row>
<row><c t="inlineStr"><is><t>London</t></is></c></row>
</sheetData>
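The conversion from inline strings to shared strings mentioned above can be sketched as follows. This is one possible approach, not a description of how Excel does it: each distinct string is stored once in a table, and the cell is rewritten with t="s" and a v element holding the string's zero-based index. Like the fragment above, the sample omits the SpreadsheetML namespace, and any rich-text formatting inside an inline string would be flattened to plain text. The to_shared_strings helper is an illustrative name.

```python
import xml.etree.ElementTree as ET

SHEET_DATA = """<sheetData>
  <row><c t="inlineStr"><is><t>Paris</t></is></c></row>
  <row><c t="inlineStr"><is><t>Seattle</t></is></c></row>
  <row><c t="inlineStr"><is><t>London</t></is></c></row>
  <row><c t="inlineStr"><is><t>Copenhagen</t></is></c></row>
  <row><c t="inlineStr"><is><t>Paris</t></is></c></row>
  <row><c t="inlineStr"><is><t>London</t></is></c></row>
</sheetData>"""

def to_shared_strings(sheet_data):
    """Rewrite inline-string cells in place; return the shared-strings list."""
    table = []   # unique strings in first-seen order
    index = {}   # string -> position in table
    for cell in sheet_data.iter("c"):
        if cell.get("t") != "inlineStr":
            continue
        inline = cell.find("is")
        text = "".join(t.text or "" for t in inline.iter("t"))
        if text not in index:
            index[text] = len(table)
            table.append(text)
        cell.remove(inline)
        cell.set("t", "s")                              # cell now references the table
        ET.SubElement(cell, "v").text = str(index[text])
    return table

sheet = ET.fromstring(SHEET_DATA)
shared = to_shared_strings(sheet)
cell_refs = [c.find("v").text for c in sheet.iter("c")]

print(shared)     # ['Paris', 'Seattle', 'London', 'Copenhagen']
print(cell_refs)  # ['0', '1', '2', '3', '0', '2']
```

Six cells now share four stored strings, which is where the file-size benefit comes from when values repeat.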
[Diagram of package parts: Fonts, Slides, Presentation, View Properties, Notes Slides, Notes Masters, Presentation Properties, Handout Masters]
Code
Data source
Header-row formatting
Banded-row formatting
Table position
TableStyleID = GUID
Table definition
Tips:
• Start with part 3, Primer
• Use the PDF version of part 4 to look up elements/attributes
Open XML Blogs