SRI-Utility-PDF v1.0 Technical Documentation


SRI – Utility - PDF

BLUE PRISM VBO TECHNICAL DOCUMENTATION


Version: 1.0

automation@sriinfotech.com | PL: +48 58 742 53 56 | US: (888) 513-0114


www.sriinfotech.com
Contents
1. Introduction
1.1. Installation
2. Version History
2.1. Version 1.0
3. Functional Overview of Current Version
3.1. Check if Page Exists
3.2. Decrypt
3.3. Encrypt
3.4. Extract Page Range
3.5. Extract Single Page
3.6. Get Images from Page Range
3.7. Get Images from Single Page
3.8. Get Images from Whole PDF
3.9. Get Number of Pages
3.10. Get Page Coordinates
3.11. Get Text from Page Range
3.12. Get Text from Page Range (Area)
3.13. Get Text from Page Range to Collection
3.14. Get Text from Page Range to Collection (Area)
3.15. Get Text from Single Page
3.16. Get Text from Single Page (Area)
3.17. Get Text from Whole PDF
3.18. Get Text from Whole PDF (Area)
3.19. Get Text from Whole PDF to Collection
3.20. Get Text from Whole PDF to Collection (Area)
3.21. Merge All PDFs in a Directory
3.22. Merge Selected PDFs
3.23. Split Page Range to Single Pages
3.24. Split Whole PDF to Single Pages

© SRI Infotech Inc., 1500 Providence Highway Unit 32, Norwood MA 02062 Page 2 of 29
1. Introduction

SRI – Utility – PDF is an easy-to-use Blue Prism VBO that lets the user work with PDF documents without opening
them. It can extract text, images and pages from PDF documents, and split, merge, encrypt and decrypt them.

1.1. Installation

This VBO requires itextsharp.dll to be stored in the Blue Prism Automate directory (default location:
C:\Program Files\Blue Prism Limited\Blue Prism Automate). The DLL file is provided inside the asset package.

Once the DLL is in place, you can import the object and use it without any modification.

2. Version History

2.1. Version 1.0


This is the initial version of this VBO. A change history will be added to this document in the future.

3. Functional Overview of Current Version

This VBO uses the open-source library iTextSharp 5.5.13.1.

The run mode of this business object is "background".

3.1. Check if Page Exists

Determines whether a page with the given number exists within the supplied PDF document.

Returns true if the page exists, false if it doesn't.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Page Number | In | Number | Index of PDF document page
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Page Exists | Out | Flag | True if page exists

3.2. Decrypt

Removes password protection from the supplied PDF document. It does not alter the security properties.

If Output File Path is left blank or contains the path to the parent document, no new document is created
and the action is performed on the parent document.
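
The Output File Path rule can be sketched in a few lines of Python. This only illustrates the documented behaviour; the function name and the path comparison used here are assumptions, not the VBO's actual code.

```python
import os
from typing import Tuple


def resolve_output_path(file_path: str, output_file_path: str = "") -> Tuple[str, bool]:
    """Return (target_path, in_place) for the Output File Path rule.

    Hypothetical helper: how the VBO compares the two paths is not documented,
    so a normalised absolute-path comparison is assumed here.
    """
    if not output_file_path.strip():
        # Blank output path: work on the parent document itself.
        return file_path, True
    same_file = (
        os.path.normcase(os.path.abspath(output_file_path))
        == os.path.normcase(os.path.abspath(file_path))
    )
    if same_file:
        # Output path points at the parent document: also in place.
        return file_path, True
    return output_file_path, False
```

The same rule applies to the Encrypt action described in section 3.3.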

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Output File Path | In | Text | OPTIONAL: Full path to the new document destination (If left blank, then parent document will get decrypted)
Owner Password | In | Password | Password to manage properties of the document; UTF-8 encoding

3.3. Encrypt

Sets security properties of PDF document and encrypts it with given passwords.

Encryption Standard: AES 256.

If Output File Path is left blank or contains the path to the parent document, no new document is created
and the action is performed on the parent document.

User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening the document.

Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.

At least one of the new passwords (New User Password, New Owner Password) must be provided.
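
The password rules above can be summarised with a minimal Python sketch. The function name is hypothetical, and `secrets.token_urlsafe` only stands in for whatever random-string generator the VBO actually uses, which is not documented.

```python
import secrets
from typing import Tuple


def resolve_passwords(new_user: str = "", new_owner: str = "") -> Tuple[str, str]:
    """Apply the documented password rules and return (user, owner).

    Assumption: the random owner password is modelled with secrets.token_urlsafe.
    """
    if not new_user and not new_owner:
        raise ValueError("At least one of the new passwords must be provided")
    # A blank user password means the document opens without a prompt.
    # A blank owner password is replaced with a random string.
    owner = new_owner or secrets.token_urlsafe(16)
    return new_user, owner
```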

IMPORTANT NOTE: Permissions given during encryption never allow the user to do more than they could do with the
unencrypted document. These permissions only decide how much less a regular document user (i.e. a person
opening the PDF with the user password) is allowed to do compared with the document owner (i.e. a person
opening the PDF with the owner password). When an unencrypted document is opened, full owner permissions
are always assumed.

It follows that, in a PDF viewer that does not allow certain operations even to a document owner, setting the
matching Allow* flags during encryption will not make the PDF viewer allow those operations to any user. The
document restrictions summary in the security section of PDF reading software doesn't merely reflect the
permissions set on the document during encryption. Instead, it is a summary based on several inputs, not all of
which depend on the document itself:

1. The operations the program variant allows by default.

2. Additional operations allowed via a usage rights signature in the document.

3. Restrictions introduced by permissions not given during encryption.
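
The three inputs above can be modelled as simple set operations. The following Python sketch is a simplified illustration, not how any particular PDF viewer is implemented; the function and operation names are made up for the example.

```python
def effective_operations(viewer_defaults, usage_rights, granted, opened_as_owner):
    """Simplified model of the restrictions summary described above.

    viewer_defaults: operations the program variant allows by default
    usage_rights:    extra operations allowed by a usage rights signature
    granted:         operations permitted by the Allow* flags at encryption time
    """
    allowed_by_viewer = set(viewer_defaults) | set(usage_rights)
    if opened_as_owner:
        # The owner (or anyone opening an unencrypted document) gets
        # everything the viewer itself is able to do.
        return allowed_by_viewer
    # A regular user can only lose operations, never gain new ones.
    return allowed_by_viewer & set(granted)
```

For example, with viewer defaults {"print", "copy"} and granted permissions {"print", "modify"}, a regular user ends up with {"print"} only: granting "modify" does not enable an operation the viewer does not offer.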

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Output File Path | In | Text | OPTIONAL: Full path to the new document destination (If left blank, then parent document will get encrypted)
Password | In | Password | Password to manage properties of the document; UTF-8 encoding
New User Password | In | Password | OPTIONAL: Password used to open the document; at least one of the passwords (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; at least one of the passwords (user, owner) must be provided
Allow Printing | In | Flag | OPTIONAL: (True is default value); The user is permitted to print the document
Allow Modify Contents | In | Flag | OPTIONAL: (True is default value); The user is permitted to modify the contents, for example, to change the content of a page, or insert or remove a page
Allow Copy | In | Flag | OPTIONAL: (True is default value); The user is permitted to copy or otherwise extract text and graphics from the document, including using assistive technologies such as screen readers or other accessibility devices
Allow Modify Annotations | In | Flag | OPTIONAL: (True is default value); The user is permitted to add or modify text annotations and interactive form fields
Allow Fill In | In | Flag | OPTIONAL: (True is default value); The user is permitted to fill form fields
Allow Screenreaders | In | Flag | OPTIONAL: (True is default value); The user is permitted to extract text and graphics for use by accessibility devices
Allow Assembly | In | Flag | OPTIONAL: (True is default value); The user is permitted to insert, remove, and rotate pages and add bookmarks. The content of a page can’t be changed unless the permission allow modifying contents is granted too
Allow Degraded Printing | In | Flag | OPTIONAL: (True is default value); The user is permitted to print the document, but not with the quality offered by allow printing
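
As an illustration of how such flags are typically combined, the Python sketch below folds the Allow* parameters into a single permission bitmask. The bit values are the ones iTextSharp 5 exposes as PdfWriter.ALLOW_* constants, which follow the permission bits of the PDF specification; whether the VBO combines the flags exactly this way is an assumption.

```python
# Bit values as exposed by iTextSharp 5 (PdfWriter.ALLOW_*).
ALLOW_BITS = {
    "Allow Printing": 4 + 2048,        # low-quality bit plus high-quality bit
    "Allow Modify Contents": 8,
    "Allow Copy": 16,
    "Allow Modify Annotations": 32,
    "Allow Fill In": 256,
    "Allow Screenreaders": 512,
    "Allow Assembly": 1024,
    "Allow Degraded Printing": 4,
}


def permission_mask(flags: dict) -> int:
    """Combine Allow* flags into one integer; a missing flag defaults to True."""
    mask = 0
    for name, bits in ALLOW_BITS.items():
        if flags.get(name, True):
            mask |= bits
    return mask
```

Note that Allow Printing already contains the Allow Degraded Printing bit, so switching off only Allow Degraded Printing has no effect while Allow Printing stays enabled.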

3.4. Extract Page Range

Extracts the given range of pages from the supplied PDF document and saves it as a separate PDF document.

If Encrypt is set to true, this action copies the security properties of the parent document to the extracted one
(at least one new password must be provided in this case).

User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening the document.

Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.

Encryption Standard: AES 256.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Start Page | In | Number | Index of PDF document's page from which the extraction will begin
End Page | In | Number | Index of PDF document's page at which the extraction will end
Output File Path | In | Text | Full path to the extraction destination
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Encrypt | In | Flag | OPTIONAL: (False is default value); True for copying security properties of parent document
New User Password | In | Password | OPTIONAL: Password used to open the document; if encrypt is set to true, then at least one password (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; if encrypt is set to true, then at least one password (user, owner) must be provided

3.5. Extract Single Page

Extracts the given page from the supplied PDF document and saves it as a separate PDF document.

If Encrypt is set to true, this action copies the security properties of the parent document to the extracted one
(at least one new password must be provided in this case).

User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening the document.

Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.

Encryption Standard: AES 256.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Page Number | In | Number | Index of PDF document page
Output File Path | In | Text | Full path to the extraction destination
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Encrypt | In | Flag | OPTIONAL: (False is default value); True for copying security properties of parent document
New User Password | In | Password | OPTIONAL: Password used to open the document; if encrypt is set to true, then at least one password (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; if encrypt is set to true, then at least one password (user, owner) must be provided

3.6. Get Images from Page Range

Gets images from each page from given range of pages of supplied PDF document.

Returns 5-column collection:

Page Number <Text>: Index of page from which image was extracted.
Image <Image>: Actual image.
Extension <Text>: Image file format.
Status <Text>: "Success" if image was processed successfully, "Error" if not processed successfully.
Error Message <Text>: Error description if error occurred.

If Ignore Errors is set to true, processing continues with the next PDF pages when an error occurs; otherwise the
action terminates.
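
The interplay between the Status / Error Message columns and Ignore Errors can be sketched as a simple loop. This Python example is illustrative only: `extract` is a hypothetical callback standing in for the per-page image extraction, and "terminate" is assumed to mean that the error is raised to the caller.

```python
def collect_images(pages, extract, ignore_errors=False):
    """Build rows shaped like the 5-column collection described above.

    extract(page) is a hypothetical callable returning (image, extension)
    pairs for one page, or raising an exception on failure.
    """
    rows = []
    for page in pages:
        try:
            for image, extension in extract(page):
                rows.append({
                    "Page Number": str(page),
                    "Image": image,
                    "Extension": extension,
                    "Status": "Success",
                    "Error Message": "",
                })
        except Exception as exc:
            if not ignore_errors:
                raise  # assumed meaning of "terminate"
            rows.append({
                "Page Number": str(page),
                "Image": None,
                "Extension": "",
                "Status": "Error",
                "Error Message": str(exc),
            })
    return rows
```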

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Start Page | In | Number | Index of PDF document's page from which the images extraction will begin
End Page | In | Number | Index of PDF document's page at which the images extraction will end
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Ignore Errors | In | Flag | OPTIONAL: (False is default value); True to continue processing next PDF pages in case of an error
PDF Images | Out | Collection | Pdf Images extracted from the document (Page Number; Image; Extension, Status, Error Message)

3.7. Get Images from Single Page

Gets images from given page of supplied PDF document.

Returns 5-column collection:

Page Number <Text>: Index of page from which image was extracted.
Image <Image>: Actual image.
Extension <Text>: Image file format.
Status <Text>: "Success" if image was processed successfully, "Error" if not processed successfully.
Error Message <Text>: Error description if error occurred.

If Ignore Errors is set to true, processing continues with the next PDF pages when an error occurs; otherwise the
action terminates.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Page Number | In | Number | Index of PDF document page
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Ignore Errors | In | Flag | OPTIONAL: (False is default value); True to continue processing next PDF pages in case of an error
PDF Images | Out | Collection | Pdf Images extracted from the document (Page Number; Image; Extension, Status, Error Message)

3.8. Get Images from Whole PDF

Gets images from each page of supplied PDF document.

Returns 5-column collection:

Page Number <Text>: Index of page from which image was extracted.
Image <Image>: Actual image.
Extension <Text>: Image file format.
Status <Text>: "Success" if image was processed successfully, "Error" if not processed successfully.
Error Message <Text>: Error description if error occurred.

If Ignore Errors is set to true, processing continues with the next PDF pages when an error occurs; otherwise the
action terminates.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Ignore Errors | In | Flag | OPTIONAL: (False is default value); True to continue processing next PDF pages in case of an error
PDF Images | Out | Collection | Pdf Images extracted from the document (Page Number; Image; Extension, Status, Error Message)

3.9. Get Number of Pages

Gets number of pages in supplied PDF document.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Number of Pages | Out | Number | Count of all pages within given document

3.10. Get Page Coordinates

Gets coordinates of given page from supplied PDF document.

IMPORTANT NOTE: PDF document coordinates start at the bottom-left corner (0,0); x increases to the right and y
increases upward.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Page Number | In | Number | Index of PDF document page
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
Left | Out | Number | Start x value of PDF page
Bottom | Out | Number | Start y value of PDF page
Width | Out | Number | Width of PDF page (left to right)
Height | Out | Number | Height of PDF page (bottom to top)
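
Because the origin is at the bottom-left corner, a rectangle measured from the top-left corner (as most image tools report it) has to be flipped vertically before it can be used with PDF coordinates. The Python sketch below shows that conversion using the Left, Bottom and Height outputs of this action; the function name and the example page size are illustrative assumptions.

```python
def area_from_top_left(x, y_from_top, width, height,
                       page_left, page_bottom, page_height):
    """Convert a top-left based rectangle into PDF coordinates.

    Returns (start_x, start_y, end_x, end_y) where start_y is the bottom
    edge and end_y is the top edge of the area.
    """
    start_x = page_left + x
    end_x = start_x + width
    end_y = page_bottom + page_height - y_from_top  # top edge
    start_y = end_y - height                        # bottom edge
    return start_x, start_y, end_x, end_y


# Example: a 200 x 40 box placed 50 from the left and 100 from the top
# of a page with Left = 0, Bottom = 0, Height = 842.
print(area_from_top_left(50, 100, 200, 40, 0, 0, 842))
```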

3.11. Get Text from Page Range

Gets text from given range of pages of supplied PDF document.

EXTRACTION STRATEGIES:

0: no strategy at all; characters are read from left to right, top to bottom (default option).

1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.

2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.
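
The difference between strategies 1 and 2 is easiest to see on a tiny example. The Python sketch below is a deliberately simplified model, not the iTextSharp implementation: text chunks are plain (x, y, text) tuples, orientation is ignored, and no font-metric based spacing is applied.

```python
def simple_strategy(chunks):
    """Keep the order in which chunks are drawn; break the line when y changes."""
    out, last_y = [], None
    for _x, y, text in chunks:
        if last_y is not None and y != last_y:
            out.append("\n")
        out.append(text)
        last_y = y
    return "".join(out)


def location_strategy(chunks):
    """Order chunks by position: top to bottom, then left to right."""
    ordered = sorted(chunks, key=lambda c: (-c[1], c[0]))
    lines, last_y = [], None
    for _x, y, text in ordered:
        if y != last_y:
            lines.append([])
            last_y = y
        lines[-1].append(text)
    return "\n".join("".join(parts) for parts in lines)


# Chunks in the order the PDF draws them (not top to bottom).
chunks = [(72, 700, "Total: "), (72, 720, "Invoice"), (200, 700, "42")]
print(repr(simple_strategy(chunks)))    # 'Total: \nInvoice\n42'
print(repr(location_strategy(chunks)))  # 'Invoice\nTotal: 42'
```

The simple strategy reproduces the drawing order, so the two chunks of the bottom line end up separated, while the location strategy reassembles them into one line.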

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
Extraction Strategy | In | Number | OPTIONAL: (0 is default value); 0 - no extraction strategy; 1 - simple strategy; 2 - location strategy
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Text | Text read from PDF

3.12. Get Text from Page Range (Area)

Gets text from specified area from each page of given range of pages of supplied PDF document.

This action uses Location Extraction Strategy.

IMPORTANT NOTE 1: This action reads any block of text that lies within the specified area; however, it always reads
the whole block, even if the block extends beyond the specified area.

IMPORTANT NOTE 2: PDF document coordinates start at the bottom-left corner (0,0); x increases to the right and y
increases upward.
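
The behaviour described in IMPORTANT NOTE 1 can be pictured as an overlap test on whole text blocks. The Python sketch below is a simplified model under the assumption that a block is selected as soon as its bounding box overlaps the area; the tuple layout and function name are made up for the example.

```python
def blocks_in_area(blocks, start_x, start_y, end_x, end_y):
    """Return the full text of every block whose box overlaps the area.

    Each block is (x0, y0, x1, y1, text) in PDF coordinates, where (x0, y0)
    is the bottom-left corner and (x1, y1) the top-right corner.
    """
    selected = []
    for x0, y0, x1, y1, text in blocks:
        overlaps = (x0 <= end_x and x1 >= start_x and
                    y0 <= end_y and y1 >= start_y)
        if overlaps:
            # The whole block is returned, even the part outside the area.
            selected.append(text)
    return selected


blocks = [
    (50, 700, 150, 712, "Invoice number 123"),
    (300, 700, 350, 712, "Date"),
]
# The area ends at x = 100, yet the first block is returned in full.
print(blocks_in_area(blocks, 40, 690, 100, 720))  # ['Invoice number 123']
```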

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
Start x | In | Number | Leftmost position of the area
Start y | In | Number | Bottom position of the area
End x | In | Number | Rightmost position of the area
End y | In | Number | Top position of the area
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Text | Text read from PDF

3.13. Get Text from Page Range to Collection

Gets text from the given range of pages of the supplied PDF document and returns it as a 2-column collection (Page
Number<Number>, Text<Text>), where each row contains the text of the corresponding PDF page.
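
The shape of the returned collection can be illustrated with a short Python sketch. Here `get_page_text` is a hypothetical callback standing in for the per-page extraction, and the page range is assumed to be inclusive at both ends.

```python
def text_to_collection(start_page, end_page, get_page_text):
    """Return one row per page, mirroring the 2-column collection above."""
    return [
        {"Page Number": page, "Text": get_page_text(page)}
        for page in range(start_page, end_page + 1)  # inclusive range (assumed)
    ]


# Usage with a stand-in extractor:
rows = text_to_collection(2, 3, lambda page: f"text of page {page}")
print(rows)
```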

EXTRACTION STRATEGIES:

0: no strategy at all; characters are read from left to right, top to bottom (default option).

1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.

2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
Extraction Strategy | In | Number | OPTIONAL: (0 is default value); 0 - no extraction strategy; 1 - simple strategy; 2 - location strategy
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Collection | Collection of texts from PDF pages

3.14. Get Text from Page Range to Collection (Area)

Gets text from the specified area on each page of the given range of pages of the supplied PDF document and returns it as a
2-column collection (Page Number<Number>, Text<Text>), where each row contains the text of the corresponding PDF page.

This action uses Location Extraction Strategy.

IMPORTANT NOTE 1: This action reads any block of text that lies within the specified area; however, it always reads
the whole block, even if the block extends beyond the specified area.

IMPORTANT NOTE 2: PDF document coordinates start at the bottom-left corner (0,0); x increases to the right and y
increases upward.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
Start x | In | Number | Leftmost position of the area
Start y | In | Number | Bottom position of the area
End x | In | Number | Rightmost position of the area
End y | In | Number | Top position of the area
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Collection | Collection of texts from PDF pages

3.15. Get Text from Single Page

Gets text from given page of supplied PDF document.

EXTRACTION STRATEGIES:

0: no strategy at all; characters are read from left to right, top to bottom (default option).

1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.

2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Page Number | In | Number | Index of PDF document's page
Extraction Strategy | In | Number | OPTIONAL: (0 is default value); 0 - no extraction strategy; 1 - simple strategy; 2 - location strategy
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Text | Text read from PDF

3.16. Get Text from Single Page (Area)

Gets Text from specified area of given page of supplied PDF document.

This action uses Location Extraction Strategy.

IMPORTANT NOTE 1: This action reads any block of text that lies within the specified area; however, it always reads
the whole block, even if the block extends beyond the specified area.

IMPORTANT NOTE 2: PDF document coordinates start at the bottom-left corner (0,0); x increases to the right and y
increases upward.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Page Number | In | Number | Index of PDF document page
Start x | In | Number | Leftmost position of the area
Start y | In | Number | Bottom position of the area
End x | In | Number | Rightmost position of the area
End y | In | Number | Top position of the area
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Text | Text read from PDF

3.17. Get Text from Whole PDF

Gets text from whole supplied PDF document.

EXTRACTION STRATEGIES:

0: no strategy at all; characters are read from left to right, top to bottom (default option).

1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.

2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.

Parameter | Direction | Data Type | Description
File Path | In | Text | Full path to PDF document
Extraction Strategy | In | Number | OPTIONAL: (0 is default value); 0 - no extraction strategy; 1 - simple strategy; 2 - location strategy
Password | In | Password | OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding
PDF Text | Out | Text | Text read from PDF

3.18. Get Text from Whole PDF (Area)

Gets text from specified area from each page of supplied PDF document.

This action uses Location Extraction Strategy.

IMPORTANT NOTE 1: This action reads each block of text that lies within the specified area; however, it always reads
the whole block, even if the block ends outside the specified area.

IMPORTANT NOTE 2: PDF document coordinates start at the bottom-left corner (0,0) and increase to the right on the x
axis and upwards on the y axis.
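
Many tools report positions from the top-left corner of the page, so the values usually have to be converted before they are passed as Start x, Start y, End x and End y. The Python sketch below shows that conversion and a simple overlap test that mirrors NOTE 1. The function names are hypothetical, the units are assumed to be the same for the page height and the rectangle, and treating any overlapping block as "read" is an assumption about the behaviour described above, not a statement about the VBO's internals.

```python
def top_left_to_pdf_area(left, top, width, height, page_height):
    """Convert a rectangle measured from the top-left corner of the page
    into (start_x, start_y, end_x, end_y) measured from the bottom-left."""
    start_x = left
    end_x = left + width
    end_y = page_height - top      # top edge of the area, measured upwards
    start_y = end_y - height       # bottom edge of the area
    return start_x, start_y, end_x, end_y


def block_overlaps_area(block, area):
    """Return True if two (start_x, start_y, end_x, end_y) rectangles overlap.

    Assumption: a text block that overlaps the area is returned whole,
    even where it extends past the area's edges.
    """
    b_x1, b_y1, b_x2, b_y2 = block
    a_x1, a_y1, a_x2, a_y2 = area
    return b_x1 < a_x2 and b_x2 > a_x1 and b_y1 < a_y2 and b_y2 > a_y1


if __name__ == "__main__":
    # A 200 x 100 box placed 72 units from the left and top of a 792-unit-high page.
    area = top_left_to_pdf_area(72, 72, 200, 100, page_height=792)
    print(area)
    print(block_overlaps_area((250, 700, 400, 712), area))
```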

Parameter Direction Data Type Description

File Path In Text Full path to PDF document

Start x In Number Leftmost position of the area

Start y In Number Bottom position of the area

End x In Number Rightmost position of the area

End y In Number Top position of the area

Password In Password OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding

PDF Text Out Text Text read from PDF

3.19. Get Text from Whole PDF to Collection

Gets text from the whole supplied PDF document and returns it as a 2-column collection (Page Number<Number>,
Text<Text>), where each row contains the text from the corresponding PDF page.

EXTRACTION STRATEGIES:

0: no strategy at all, characters are read from left to right, top to bottom (default option).

1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.

2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the
page, so the resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
This renderer keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the
orientation. Text is ordered by orientation, then perpendicular distance, then parallel distance. Text with the same
perpendicular distance but a different parallel distance is treated as being on the same line.
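
The behaviour described for the simple strategy (option 1) can be sketched in a few lines of Python. This is an illustration only: simple_extract and space_threshold are hypothetical names, and the fixed gap threshold stands in for the font-metric check mentioned in the description.

```python
def simple_extract(strings, space_threshold=1.0):
    """Join strings in the order the PDF renders them.

    Each item is (text, x_start, x_end, y). A change in y inserts a line
    break; a horizontal gap wider than space_threshold inserts a space.
    """
    output = []
    last_y = None
    last_x_end = None
    for text, x_start, x_end, y in strings:
        if last_y is not None:
            if y != last_y:
                output.append("\n")
            elif x_start - last_x_end > space_threshold:
                output.append(" ")
        output.append(text)
        last_y, last_x_end = y, x_end
    return "".join(output)


if __name__ == "__main__":
    # The footer is drawn before the title, so it also comes out first.
    rendered = [
        ("Hello", 0, 30, 700),
        ("world", 34, 60, 700),
        ("Footer", 0, 40, 50),
        ("Title", 0, 30, 750),
    ]
    print(simple_extract(rendered))
```

The sample shows the limitation stated above: text is never reordered, so a page that is not drawn from top to bottom yields output in drawing order.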

Parameter Direction Data Type Description

File Path In Text Full path to PDF document

Extraction Strategy In Number OPTIONAL: (0 is default value); 0 - no extraction strategy; 1 - simple strategy; 2 - location strategy

Password In Password OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding

PDF Text Out Collection Collection of texts from PDF pages

3.20. Get Text from Whole PDF to Collection (Area)

Gets text from the specified area of each page of the supplied PDF document and returns it as a 2-column collection
(Page Number<Number>, Text<Text>), where each row contains the text from the corresponding PDF page.

This action uses Location Extraction Strategy.

IMPORTANT NOTE 1: This action reads each block of text that lies within the specified area; however, it always reads
the whole block, even if the block ends outside the specified area.

IMPORTANT NOTE 2: PDF document coordinates start at the bottom-left corner (0,0) and increase to the right on the x
axis and upwards on the y axis.

Parameter Direction Data Type Description

File Path In Text Full path to PDF document

Start x In Number Leftmost position of the area

Start y In Number Bottom position of the area

End x In Number Rightmost position of the area

End y In Number Top position of the area

Password In Password OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding

PDF Text Out Collection Collection of texts from PDF pages

3.21. Merge All PDFs in a Directory

Merges all PDF documents found in given directory into one document.

The provided password is used to open each file that is password protected.

If Ignore Merge Errors is set to true, the action continues with the next document when an error occurs; otherwise it
terminates.

This action returns a collection with the merge status (File Path<Text>; Status<Text>; Error Message<Text>).

1st column - File Path: contains full paths to the PDF files that were merged.

2nd column - Status: returns “Success” or “Error” for each document.

3rd column - Error Message: will contain error description for all documents with “Error” status.

Output of this action is useful only when Ignore Merge Errors is set to true.
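
The relationship between Ignore Merge Errors and the Merge Status collection can be sketched in Python. The sketch is not the VBO's code: merge_with_status and append_pdf are hypothetical names, the collection is modelled as a list of dictionaries, and it assumes that "terminate" means the error is raised to the caller, which would explain why the output is only useful when errors are ignored.

```python
def merge_with_status(paths, append_pdf, ignore_merge_errors=False):
    """Append each PDF via append_pdf(path) and record one status row per file.

    append_pdf is a hypothetical callable that raises an exception
    when a document cannot be merged.
    """
    status_rows = []
    for path in paths:
        try:
            append_pdf(path)
        except Exception as error:
            if not ignore_merge_errors:
                raise  # assumption: the action stops and reports the error
            status_rows.append(
                {"File Path": path, "Status": "Error", "Error Message": str(error)}
            )
        else:
            status_rows.append(
                {"File Path": path, "Status": "Success", "Error Message": ""}
            )
    return status_rows


if __name__ == "__main__":
    def fake_append(path):
        # Stand-in for a real merge step; fails for one file.
        if path == "b.pdf":
            raise ValueError("bad password")

    for row in merge_with_status(["a.pdf", "b.pdf", "c.pdf"], fake_append, True):
        print(row)
```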

Parameter Direction Data Type Description

Directory Path In Text Path to directory with PDF files to be merged

Output File Path In Text Full path to the merge destination

Password In Password OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding

Ignore Merge Errors In Flag OPTIONAL: (False is default value); True, to continue merging next documents in case of an error

Merge Status Out Collection Status of merge operation (File Path; Status; Error Message)

3.22. Merge Selected PDFs

Merges supplied PDF documents into one document and saves it at given path.

The 1st column of the File Paths collection is treated as the file paths of the PDF documents to be merged. It must be
of type <Text>.

The 2nd column (if it exists) is treated as the corresponding passwords for the files. It must be of type <Password>
or <Text>.

Any other column will be disregarded.
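
The column rules above depend on column position, not column name. A minimal Python sketch, assuming the collection is modelled as a list of dictionaries whose key order is the column order (read_file_paths is a hypothetical helper, not part of the VBO):

```python
def read_file_paths(collection):
    """Return (path, password) pairs from a File Paths collection.

    Column 1 is the file path, column 2 (if present) is the password,
    and any further columns are ignored. Column names are not inspected.
    """
    jobs = []
    for row in collection:
        values = list(row.values())   # dicts keep insertion (column) order
        path = values[0]
        password = values[1] if len(values) > 1 else None
        jobs.append((path, password))
    return jobs


if __name__ == "__main__":
    file_paths = [
        {"Path": r"C:\in\a.pdf", "Pwd": "secret", "Comment": "ignored"},
        {"Path": r"C:\in\b.pdf", "Pwd": "", "Comment": "no password"},
    ]
    for path, password in read_file_paths(file_paths):
        print(path, repr(password))
```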

If Ignore Merge Errors is set to true, the action continues with the next document when an error occurs; otherwise it
terminates.

This action returns a collection with the merge status (File Path<Text>; Status<Text>; Error Message<Text>).

1st column - File Path: contains full paths to the PDF files that were merged.

2nd column - Status: returns “Success” or “Error” for each document.

3rd column - Error Message: will contain error description for all documents with “Error” status.

Output of this action is useful only when Ignore Merge Errors is set to true.

Parameter Direction Data Type Description

File Paths In Collection (1st column for File Paths; 2nd column for passwords - UTF-8 encoding; any other columns will be disregarded)

Output File Path In Text Full path to the merge destination

Ignore Merge Errors In Flag OPTIONAL: (False is default value); True, to continue merging next documents in case of an error

Merge Status Out Collection Status of merge operation (File Path; Status; Error Message)

3.23. Split Page Range to Single Pages

Extracts each page from given range of pages of supplied PDF document and saves it as separate PDF document.

Naming convention of extracted documents: 'Parent Document File Name' & '_Page Number' & '.pdf'.
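
As an illustration of the naming convention, the Python sketch below builds the expected output paths for a page range. It rests on assumptions that this document does not confirm: that 'Parent Document File Name' means the file name without its extension, that End Page is inclusive, and that paths are Windows paths. split_output_paths is a hypothetical helper.

```python
from pathlib import PureWindowsPath


def split_output_paths(file_path, start_page, end_page, output_directory):
    """Build one output path per page: '<parent name>_<page number>.pdf'."""
    parent_name = PureWindowsPath(file_path).stem      # name without '.pdf' (assumption)
    directory = PureWindowsPath(output_directory)
    return [
        str(directory / f"{parent_name}_{page}.pdf")
        for page in range(start_page, end_page + 1)    # End Page treated as inclusive
    ]


if __name__ == "__main__":
    for path in split_output_paths(r"C:\in\invoice.pdf", 2, 4, r"C:\out"):
        print(path)
```

A list like this can be useful for checking that every expected file exists after the action has run.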

If Encrypt is set to true, this action copies the security properties of the parent document to each extracted
document (at least one new password must be provided in this case).

User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt for a
password when opening.

Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank, a
random string is generated and used to encrypt the file.

Encryption Standard: AES 256.

Parameter Direction Data Type Description

File Path In Text Full path to PDF document

Start Page In Number Index of PDF document's page from which the extraction will begin

End Page In Number Index of PDF document's page at which the extraction will end

Output Directory In Text Path to directory where extracted documents will be saved

Password In Password OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding

Encrypt In Flag OPTIONAL: (False is default value); True for copying security properties of parent document

New User Password In Password OPTIONAL: Password used to open the document; if encrypt is set to true, then at least one password (user, owner) must be provided

New Owner Password In Password OPTIONAL: Password to manage properties of the document; if encrypt is set to true, then at least one password (user, owner) must be provided

3.24. Split Whole PDF to Single Pages

Extracts each page from supplied PDF document and saves it as separate PDF document. Naming convention of
extracted documents: 'Parent Document File Name' & '_Page Number' & '.pdf'.

If Encrypt is set to true, this action copies the security properties of the parent document to each extracted
document (at least one new password must be provided in this case).

User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt for a
password when opening.

Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank, a
random string is generated and used to encrypt the file.

Encryption Standard: AES 256.
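
The password rules above can be summarised in a short Python sketch. It is an illustration under assumptions, not the VBO's implementation: resolve_passwords is a hypothetical helper, blank parameters are modelled as empty strings, and the length and format of the generated owner password are arbitrary choices.

```python
import secrets


def resolve_passwords(encrypt, new_user_password="", new_owner_password=""):
    """Return the (user, owner) passwords to encrypt with, or None if Encrypt is false."""
    if not encrypt:
        return None
    if not new_user_password and not new_owner_password:
        raise ValueError(
            "Encrypt is true, so a new user or owner password must be provided"
        )
    # A blank owner password is replaced by a random string (format is an assumption).
    owner = new_owner_password or secrets.token_urlsafe(16)
    # A blank user password stays blank: the file opens without a prompt.
    return new_user_password, owner


if __name__ == "__main__":
    print(resolve_passwords(False))
    print(resolve_passwords(True, new_owner_password="manage-me"))
```

Checking the inputs this way before calling the action makes the "at least one password" requirement explicit in the calling process.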

Parameter Direction Data Type Description

File Path In Text Full path to PDF document

Output Directory In Text Path to directory where extracted documents will be saved

Password In Password OPTIONAL: User or Owner password (necessary if the document is protected; user password will not work if owner of the document restricted this kind of activity in security properties); UTF-8 encoding

Encrypt In Flag OPTIONAL: (False is default value); True for copying security properties of parent document

New User Password In Password OPTIONAL: Password used to open the document; if encrypt is set to true, then at least one password (user, owner) must be provided

New Owner Password In Password OPTIONAL: Password to manage properties of the document; if encrypt is set to true, then at least one password (user, owner) must be provided

