Parsing a PDF File with PowerCenter
2010 Informatica
Abstract
You can parse data from a PDF file with a PowerCenter mapping. Define the PDF file as a Data Transformation source. This
article describes how to configure the Data Transformation source to interface with a Data Transformation service.
Supported Versions
PowerCenter 9.0.1
Table of Contents
Overview
Mapping Overview
PDF File Structure
Create the Data Transformation Source
Export the XML File Structure
Create the Target Definition
Data Transformation
Create the Data Transformation Project
Deploy the Data Transformation Project
Define the Service Name
Configure the Workflow
Run the Workflow
Overview
PDF is a common file format for documents such as invoices and account statements. You can configure a PowerCenter mapping to
extract the data from a PDF file when the page layout is the same for each invoice. To do so, configure a Data Transformation
source in the PowerCenter Designer.
This article explains how to configure a Data Transformation source that represents a multiple-page PDF file. The article
shows how to configure the PowerCenter source with a Data Transformation service to extract the data from the PDF file.
The target is a set of relational tables.
To parse the data from a PDF file, complete the following tasks:
1. Create the Data Transformation source in the Designer.
2. Export the structure as an XML schema from the Designer.
3. Create a Data Transformation Parser project in the Data Transformation Studio. Import the XML schema to the project.
   Deploy the project as a service to the Data Transformation repository local to the PowerCenter Client. Deploy another
   copy of the service to the Data Transformation repository local to the PowerCenter Integration Service.
4. Define the Data Transformation service name in the Data Transformation source.
5. Create and run the PowerCenter workflow.
Mapping Overview
Create a PowerCenter mapping to parse the data from the PDF file and pass the data to relational targets.
The following figure shows the mapping in the Designer:
To pass row data to the relational tables, configure output ports on the Output Hierarchy tab. Create a hierarchy of groups in
the left pane of the Output Hierarchy tab. All groups are under the root group. Each group can contain ports and other
groups. The group structure represents the relationship between target tables. When you define a group within a group, you
define a parent-child relationship between the groups. The Designer defines a primary key-foreign key relationship between
the groups with a generated key.
The following figure shows the Output Hierarchy tab:
Define the following groups of ports to represent the invoice database tables:
Group1 Invoice Header
Account. Customer account number.
Period Ending. Date of current charges.
Current Total. Total amount of purchases for the period.
Balance Due. Total amount due including past due charges.
Data Transformation
Data Transformation is the application that transforms file formats such as Excel spreadsheets or PDF documents.
Create Data Transformation projects in the Data Transformation Studio. Deploy the projects from the Data Transformation
Studio to the Data Transformation repository. The Designer accesses the services in the Data Transformation repository
when you create a Data Transformation source. The PowerCenter Integration Service accesses a Data Transformation
service when it runs a workflow that has a Data Transformation source, target, or Unstructured Data transformation.
You can import the parser to the Data Transformation Studio from the Results directory. The parser project is
PDFInvoiceParser.cmw.
The tutorial describes how to create the parser. To interface the project with PowerCenter, use the .xsd file that you
exported from the Designer instead of the OrshavaInvoice.xsd file.
The parser runs a document processor to convert the data from a binary PDF format to text. The parser project uses
positional formatting to determine the location of the data in the PDF. You configure the anchors that define the text location
and the content. Define a repeating group for the buyer and a nested repeating group for each buyer transaction. Define a
CalculateValue action to add product prices for each buyer and a total for the invoice.
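The arithmetic that the CalculateValue action performs can be checked outside Data Transformation. The following is a minimal Python sketch, not a Data Transformation configuration: it sums the transaction amounts that appear in the sample output later in this article and reproduces the buyer totals and the invoice total. The names `transactions` and `buyer_totals` are illustrative assumptions and do not come from the parser project.

```python
from decimal import Decimal

# Transaction totals per buyer, copied from the sample XML output in this article.
transactions = {
    "Molly": ["29.07", "58.14", "43.61", "26.98", "59.85"],
    "Jack": ["29.93", "59.85", "43.61"],
}


def buyer_totals(amounts_by_buyer):
    """Sum each buyer's transaction amounts with exact decimal arithmetic."""
    return {
        buyer: sum((Decimal(a) for a in amounts), Decimal("0"))
        for buyer, amounts in amounts_by_buyer.items()
    }


totals = buyer_totals(transactions)

# The invoice total is the sum of the buyer totals.
invoice_total = sum(totals.values(), Decimal("0"))

print(totals["Molly"])  # 217.65
print(totals["Jack"])   # 133.39
print(invoice_total)    # 351.04
```

The results agree with the `total` attribute of each Buyer element and with the Current_Total element in the sample output. Decimal is used instead of float so that currency amounts add without binary rounding error.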
You can run the project in the Data Transformation Studio. View results from the sample data. When you call a Data
Transformation service from PowerCenter, the Data Transformation Engine passes the XML back to the PowerCenter
Integration Service.
When you run the project, Data Transformation returns the following XML:
<?xml version="1.0" encoding="windows-1252" ?>
<Invoice account="12345">
  <Period_Ending>April 30, 2003</Period_Ending>
  <Current_Total>351.04</Current_Total>
  <Balance_Due>475.07</Balance_Due>
  <Buyer name="Molly" total="217.65">
    <Transaction date="Apr 02" ref="22498">
      <Product>large eggs</Product>
      <Total>29.07</Total>
    </Transaction>
    <Transaction date="Apr 08" ref="22536">
      <Product>large eggs</Product>
      <Total>58.14</Total>
    </Transaction>
    <Transaction date="Apr 08" ref="22536">
      <Product>cheddar cheese</Product>
      <Total>43.61</Total>
    </Transaction>
    <Transaction date="Apr 21" ref="22798">
      <Product>cream cheese</Product>
      <Total>26.98</Total>
    </Transaction>
    <Transaction date="Apr 29" ref="22903">
      <Product>large eggs</Product>
      <Total>59.85</Total>
    </Transaction>
  </Buyer>
  <Buyer name="Jack" total="133.39">
    <Transaction date="Apr 12" ref="22570">
      <Product>large eggs</Product>
      <Total>29.93</Total>
    </Transaction>
    <Transaction date="Apr 18" ref="22734">
      <Product>large eggs</Product>
      <Total>59.85</Total>
    </Transaction>
    <Transaction date="Apr 25" ref="22841">
      <Product>cheddar cheese</Product>
      <Total>43.61</Total>
    </Transaction>
  </Buyer>
</Invoice>
The source file name is *Invoice*.pdf. The session is configured to use wildcards.
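To see which files a pattern such as *Invoice*.pdf would select, the following Python sketch applies shell-style wildcard matching to a few file names. It assumes that `*` matches any run of characters and that matching is case-sensitive; the exact wildcard rules of the PowerCenter session are not described here, so treat this as an approximation. The file names are hypothetical.

```python
from fnmatch import fnmatchcase

PATTERN = "*Invoice*.pdf"

# Hypothetical file names, used only to illustrate the pattern.
candidates = [
    "Invoice_2003_04.pdf",
    "AcmeInvoice.pdf",
    "AcmeInvoice.PDF",
    "statement.pdf",
]

# fnmatchcase compares case-sensitively on every platform.
matches = [name for name in candidates if fnmatchcase(name, PATTERN)]
print(matches)
```

Under these assumptions, only the first two names match: the third differs in the case of the extension and the fourth does not contain "Invoice".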
Account  Period_Ending  Current_Total  Balance_Due
12345                   351.04         475.07
The Integration Service writes the following rows to the Buyer table:
XPK_Buyer  FK_Invoice  Buyer_Name  Total
                       Molly       217.65
                       Jack        133.39
The Integration Service writes the following rows to the Transaction_Detail table:
XPK_Transaction  FK_Buyer  Date    Ref    Product         Total
                           Apr 02  22498  large eggs      29.07
                           Apr 08  22536  large eggs      58.14
                           Apr 08  22536  cheddar cheese  43.61
                           Apr 21  22798  cream cheese    26.98
                           Apr 29  22903  large eggs      59.85
                           Apr 12  22570  large eggs      29.93
                           Apr 18  22734  large eggs      59.85
                           Apr 25  22841  cheddar cheese  43.61
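The relationship between the returned XML and the target rows can be illustrated with a short Python sketch. It is not how the Integration Service is implemented. It walks an abbreviated copy of the sample XML and assigns a key at each level so that each child row carries the key of its parent. The key values are simple counters, which is an assumption: the actual generated key values are not shown in this article. The function name `flatten` is illustrative.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Abbreviated copy of the sample XML (three of the eight transactions).
SAMPLE = """
<Invoice account="12345">
  <Period_Ending>April 30, 2003</Period_Ending>
  <Current_Total>351.04</Current_Total>
  <Balance_Due>475.07</Balance_Due>
  <Buyer name="Molly" total="217.65">
    <Transaction date="Apr 02" ref="22498">
      <Product>large eggs</Product>
      <Total>29.07</Total>
    </Transaction>
    <Transaction date="Apr 08" ref="22536">
      <Product>large eggs</Product>
      <Total>58.14</Total>
    </Transaction>
  </Buyer>
  <Buyer name="Jack" total="133.39">
    <Transaction date="Apr 12" ref="22570">
      <Product>large eggs</Product>
      <Total>29.93</Total>
    </Transaction>
  </Buyer>
</Invoice>
"""


def flatten(xml_text):
    """Return (invoice_row, buyer_rows, transaction_rows) linked by counter keys."""
    invoice = ET.fromstring(xml_text.strip())
    buyer_keys, txn_keys = count(1), count(1)
    invoice_key = 1  # assumption: one invoice per document in this sketch

    invoice_row = {
        "Account": invoice.get("account"),
        "Period_Ending": invoice.findtext("Period_Ending"),
        "Current_Total": invoice.findtext("Current_Total"),
        "Balance_Due": invoice.findtext("Balance_Due"),
    }

    buyer_rows, txn_rows = [], []
    for buyer in invoice.findall("Buyer"):
        buyer_key = next(buyer_keys)
        buyer_rows.append({
            "XPK_Buyer": buyer_key,
            "FK_Invoice": invoice_key,  # links the buyer to its invoice
            "Buyer_Name": buyer.get("name"),
            "Total": buyer.get("total"),
        })
        for txn in buyer.findall("Transaction"):
            txn_rows.append({
                "XPK_Transaction": next(txn_keys),
                "FK_Buyer": buyer_key,  # links the transaction to its buyer
                "Date": txn.get("date"),
                "Ref": txn.get("ref"),
                "Product": txn.findtext("Product"),
                "Total": txn.findtext("Total"),
            })
    return invoice_row, buyer_rows, txn_rows


invoice_row, buyer_rows, txn_rows = flatten(SAMPLE)
for row in buyer_rows + txn_rows:
    print(row)
```

Each nested element becomes a row in a child table, and the foreign key column records which parent row it belongs to. This mirrors the parent-child group structure defined on the Output Hierarchy tab.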
Author
Ellen Chandler
Principal Technical Writer