SRI-Utility-PDF v1.0 Technical Documentation
© SRI Infotech Inc., 1500 Providence Highway Unit 32, Norwood MA 02062 Page 2 of 29
1. Introduction
SRI – Utility – PDF is an easy-to-use Blue Prism VBO that lets the user interact with PDF documents without the
need to open them. It can extract text, images and pages from PDF documents, and split, merge,
encrypt and decrypt them.
1.1. Installation
This VBO requires itextsharp.dll to be stored in the Blue Prism Automate directory (default location: C:\Program
Files\Blue Prism Limited\Blue Prism Automate). The DLL file is provided inside the asset package.
Once the DLL is in place, you can import the object and use it without any modification.
2. Version History
3. Functional Overview of Current Version
Determines if page with given number exists within supplied PDF document.
Page Number | In | Number | Index of PDF document page
Page Exists | Out | Flag | True if page exists
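As an illustration only, the check can be read as a bounds test against the page count. The sketch below is Python rather than the VBO's own code, and it assumes that page indexes start at 1, which this document does not state explicitly; the function name is hypothetical.

```python
def page_exists(page_number: int, page_count: int) -> bool:
    """Return True if page_number refers to a page of a document with page_count pages.

    Assumption: pages are numbered from 1 (not confirmed by this document).
    """
    return 1 <= page_number <= page_count
```

With a three-page document, page numbers 1 to 3 pass the check and 0 or 4 do not.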
3.2. Decrypt
Removes password protection from the supplied PDF document; it does not alter security properties.
If Output File Path is left blank or contains the path to the parent document, no new document is created
and the action is performed on the parent document.
Output File Path | In | Text | OPTIONAL: Full path to the new document destination (if left blank, the parent document will get decrypted)
Owner Password | In | Password | Password to manage properties of the document; UTF-8 encoding
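The Output File Path rule above can be sketched as a small path-resolution step. This is a minimal Python illustration, not the VBO's implementation; the function name, the `input_path` parameter and the way two paths are compared for equality are assumptions.

```python
import os
from typing import Tuple


def resolve_output_path(input_path: str, output_path: str = "") -> Tuple[str, bool]:
    """Return (target_path, in_place) for an action that may overwrite its input.

    in_place is True when the output path is blank or points at the parent
    document, in which case the parent document itself is the target.
    """
    if not output_path.strip():
        return input_path, True
    # Assumption: paths are compared after normalisation, ignoring case where
    # the operating system does.
    same_file = os.path.normcase(os.path.abspath(output_path)) == os.path.normcase(
        os.path.abspath(input_path)
    )
    if same_file:
        return input_path, True
    return output_path, False
```

The same rule is described for the Encrypt action, so a helper like this could serve both.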
3.3. Encrypt
Sets security properties of PDF document and encrypts it with given passwords.
If Output File Path is left blank or contains the path to the parent document, no new document is created
and the action is performed on the parent document.
User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening.
Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.
If Encrypt is set to true, then at least one of the new passwords must be provided.
IMPORTANT NOTE: Permissions given in the context of encryption never allow the user to do more than they
could do with the unencrypted document. These permissions only decide how much less a regular document user
(i.e. a person opening the PDF with the user password) is allowed to do compared with the document owner (i.e. a
person opening the PDF with the owner password). When an unencrypted document is opened, full owner
permissions are always assumed.
It follows that, in a PDF viewer that does not allow certain operations even to a document owner, setting the matching
Allow* flags during encryption will not make the PDF viewer allow those operations to any user. The document
restrictions summary in the security section of PDF reading software doesn't merely reflect the permissions
set on the document during encryption. Instead, it is a summary based on numerous inputs, not all of which
depend on the document itself.
Output File Path | In | Text | OPTIONAL: Full path to the new document destination (if left blank, the parent document will get encrypted)
New User Password | In | Password | OPTIONAL: Password used to open the document; at least one of the passwords (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; at least one of the passwords (user, owner) must be provided
Allow Printing | In | Flag | OPTIONAL: (True is default value); The user is permitted to print the document
Allow Modify Annotations | In | Flag | OPTIONAL: (True is default value); The user is permitted to add or modify text annotations and interactive form fields
Allow Fill In | In | Flag | OPTIONAL: (True is default value); The user is permitted to fill form fields
Allow Screenreaders | In | Flag | OPTIONAL: (True is default value); The user is permitted to extract text and graphics for use by accessibility devices
Allow Degraded Printing | In | Flag | OPTIONAL: (True is default value); The user is permitted to print the document, but not with the quality offered by Allow Printing
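The password rules and the permissions note for this action can be sketched as follows. This is a hedged Python illustration, not the VBO's code: the function names, the length of the random owner password and the use of plain sets of permission names are all assumptions made for the example.

```python
import secrets
from typing import Set, Tuple


def resolve_passwords(user_password: str = "", owner_password: str = "") -> Tuple[str, str]:
    """Apply the documented rules: at least one password is required, and a
    blank owner password is replaced by a random string."""
    if not user_password and not owner_password:
        raise ValueError("At least one of the passwords (user, owner) must be provided")
    if not owner_password:
        # Assumption: any unguessable string will do; the real length is unknown.
        owner_password = secrets.token_urlsafe(16)
    return user_password, owner_password


def effective_permissions(document_allows: Set[str], viewer_supports: Set[str]) -> Set[str]:
    """A viewer only offers operations that both the document permits and the
    viewer itself supports; Allow* flags cannot add viewer capabilities."""
    return document_allows & viewer_supports
```

For example, a document encrypted with printing and form filling allowed still cannot be printed in a viewer that has no printing support at all, which is the situation the note above describes.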
3.4. Extract Page Range
Extracts given range of pages from supplied PDF document and saves it as separate PDF document.
If Encrypt is set to true, this action copies the security properties of the parent document to the extracted one (at least
one new password must be provided in this case).
User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening.
Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.
Start Page | In | Number | Index of PDF document's page from which the extraction will begin
End Page | In | Number | Index of PDF document's page at which the extraction will end
Output File Path | In | Text | Full path to the extraction destination
New User Password | In | Password | OPTIONAL: Password used to open the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
3.5. Extract Single Page
Extracts given page from supplied PDF document and saves it as separate PDF document.
If Encrypt is set to true, this action copies the security properties of the parent document to the extracted one (at least
one new password must be provided in this case).
User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening.
Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.
Output File Path | In | Text | Full path to the extraction destination
New User Password | In | Password | OPTIONAL: Password used to open the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
3.6. Get Images from Page Range
Gets images from each page from given range of pages of supplied PDF document.
Page Number <Text>: Index of page from which image was extracted.
Image <Image>: Actual image.
Extension <Text>: Image file format.
Status <Text>: "Success" if image was processed successfully, "Error" if not processed successfully.
Error Message <Text>: Error description if error occurred.
If Ignore Errors is set to true, then in case of an error the action continues processing the next PDF pages; otherwise it
terminates.
Start Page | In | Number | Index of PDF document's page from which the image extraction will begin
End Page | In | Number | Index of PDF document's page at which the image extraction will end
Ignore Errors | In | Flag | OPTIONAL: (False is default value); True to continue processing next PDF pages in case of an error
PDF Images | Out | Collection | PDF images extracted from the document (Page Number; Image; Extension; Status; Error Message)
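The Ignore Errors behaviour and the shape of the output collection can be sketched as a loop that records one row per image. This is a Python illustration only: the `extract` callable is a hypothetical stand-in for the real image extraction, and re-raising the error is one reading of "terminate" that this document does not confirm.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical extractor: takes a page number, returns (image_bytes, extension) pairs.
Extractor = Callable[[int], List[Tuple[bytes, str]]]


def collect_images(
    pages: Iterable[int], extract: Extractor, ignore_errors: bool = False
) -> List[Dict[str, object]]:
    """Build rows with the documented columns for every page in `pages`."""
    rows: List[Dict[str, object]] = []
    for page in pages:
        try:
            for image, extension in extract(page):
                rows.append({
                    "Page Number": str(page),  # the column is documented as <Text>
                    "Image": image,
                    "Extension": extension,
                    "Status": "Success",
                    "Error Message": "",
                })
        except Exception as exc:
            if not ignore_errors:
                raise  # assumption: terminating means the error is propagated
            rows.append({
                "Page Number": str(page),
                "Image": None,
                "Extension": "",
                "Status": "Error",
                "Error Message": str(exc),
            })
    return rows
```

With Ignore Errors set to true, a failing page produces a single "Error" row and processing moves on to the next page.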
3.7. Get Images from Single Page
Page Number <Text>: Index of page from which image was extracted.
Image <Image>: Actual image.
Extension <Text>: Image file format.
Status <Text>: "Success" if image was processed successfully, "Error" if not processed successfully.
Error Message <Text>: Error description if error occurred.
If Ignore Errors is set to true, then in case of an error the action continues processing the next PDF pages; otherwise it
terminates.
Page Number | In | Number | Index of PDF document page
Ignore Errors | In | Flag | OPTIONAL: (False is default value); True to continue processing next PDF pages in case of an error
PDF Images | Out | Collection | PDF images extracted from the document (Page Number; Image; Extension; Status; Error Message)
3.8. Get Images from Whole PDF
Page Number <Text>: Index of page from which image was extracted.
Image <Image>: Actual image.
Extension <Text>: Image file format.
Status <Text>: "Success" if image was processed successfully, "Error" if not processed successfully.
Error Message <Text>: Error description if error occurred.
If Ignore Errors is set to true, then in case of an error the action continues processing the next PDF pages; otherwise it
terminates.
Ignore Errors | In | Flag | OPTIONAL: (False is default value); True to continue processing next PDF pages in case of an error
PDF Images | Out | Collection | PDF images extracted from the document (Page Number; Image; Extension; Status; Error Message)
3.9. Get Number of Pages
Number of Pages | Out | Number | Count of all pages within given document
3.10. Get Page Coordinates
IMPORTANT NOTE: PDF document coordinates start from the bottom-left corner (0,0) and go right on the x axis and up on the y
axis.
Page Number | In | Number | Index of PDF document page
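Because many tools measure from the top-left corner instead, it can help to convert such measurements into the bottom-left system described in the note. The Python sketch below is illustrative only; the function names are hypothetical, and it assumes the page's lower-left corner sits exactly at (0,0) and that all values use the same unit.

```python
from typing import Tuple


def top_left_to_pdf(x: float, y_from_top: float, page_height: float) -> Tuple[float, float]:
    """Convert a point measured from the top-left corner to PDF coordinates."""
    return (x, page_height - y_from_top)


def rect_top_left_to_pdf(
    left: float, top: float, width: float, height: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Convert a top-left based rectangle to (lower-left x, lower-left y,
    upper-right x, upper-right y) in PDF coordinates."""
    lower_left_y = page_height - (top + height)
    upper_right_y = page_height - top
    return (left, lower_left_y, left + width, upper_right_y)
```

On a page 792 units high, a point 100 units below the top edge has a PDF y coordinate of 692.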
3.11. Get Text from Page Range
EXTRACTION STRATEGIES:
0: no strategy at all, characters are being read from left to right, top to bottom (default option).
1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.
2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
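The difference between strategies 1 and 2 can be sketched on a toy list of text chunks. This Python example is a simplification and not the library's implementation: it ignores text orientation and font metrics, treats chunks with exactly equal y values as one line, and joins chunks on a line with a single space in the location variant. The chunk layout `(x, y, text)` is an assumption for illustration.

```python
from typing import List, Tuple

Chunk = Tuple[float, float, str]  # (x, y, text) in PDF coordinates, y grows upward


def simple_strategy(chunks: List[Chunk]) -> str:
    """Keep the incoming order and insert a line break whenever y changes."""
    out: List[str] = []
    last_y = None
    for _x, y, text in chunks:
        if last_y is not None and y != last_y:
            out.append("\n")
        out.append(text)
        last_y = y
    return "".join(out)


def location_strategy(chunks: List[Chunk]) -> str:
    """Order chunks by position: top to bottom, then left to right on each line."""
    ordered = sorted(chunks, key=lambda c: (-c[1], c[0]))
    lines: List[List[str]] = []
    current_y = None
    for _x, y, text in ordered:
        if y != current_y:
            lines.append([])
            current_y = y
        lines[-1].append(text)
    return "\n".join(" ".join(line) for line in lines)


# Chunks stored out of visual order, as can happen inside a PDF.
sample = [(300.0, 700.0, "World"), (72.0, 700.0, "Hello"), (72.0, 680.0, "Footer")]
print(simple_strategy(sample))    # WorldHello / Footer on two lines
print(location_strategy(sample))  # Hello World / Footer on two lines
```

The simple variant reproduces whatever order the chunks were stored in, while the location variant recovers the on-screen reading order.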
3.12. Get Text from Page Range (Area)
Gets text from specified area from each page of given range of pages of supplied PDF document.
IMPORTANT NOTE 1: This action reads any block of text that is within the specified area; however, it always reads the
whole block, even if the block ends outside the specified area.
IMPORTANT NOTE 2: PDF document coordinates start from the bottom-left corner (0,0) and go right on the x axis and up on
the y axis.
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
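One way to picture the whole-block behaviour from the first note is an overlap test between the requested area and each text block's bounding box. The Python sketch below is an interpretation, not the VBO's code: it assumes a block is selected as soon as its box overlaps the area, and the `Rect` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share any interior area."""
    return a.left < b.right and b.left < a.right and a.bottom < b.top and b.bottom < a.top


def text_in_area(blocks: List[Tuple[Rect, str]], area: Rect) -> List[str]:
    """Return the full text of every block whose box overlaps the area.

    The text is not clipped, so a block that extends past the area is still
    returned whole.
    """
    return [text for box, text in blocks if overlaps(box, area)]
```

In practice this means the area may need to be drawn tighter than expected when neighbouring text should be excluded.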
3.13. Get Text from Page Range to Collection
Gets text from given range of pages of supplied PDF document and returns it as a 2-column collection (Page
Number<Number>, Text<Text>), where each row contains text from appropriate PDF page.
EXTRACTION STRATEGIES:
0: no strategy at all, characters are being read from left to right, top to bottom (default option).
1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.
2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
3.14. Get Text from Page Range to Collection (Area)
Gets text from specified area from each page of given range of pages of supplied PDF document and returns it as a
2-column collection (Page Number<Number>, Text<Text>), where each row contains text from appropriate PDF page.
IMPORTANT NOTE 1: This action reads any block of text that is within the specified area; however, it always reads the
whole block, even if the block ends outside the specified area.
IMPORTANT NOTE 2: PDF document coordinates start from the bottom-left corner (0,0) and go right on the x axis and up on
the y axis.
Start Page | In | Number | Index of PDF document's page from which the text extraction will begin
End Page | In | Number | Index of PDF document's page at which the text extraction will end
3.15. Get Text from Single Page
EXTRACTION STRATEGIES:
0: no strategy at all, characters are being read from left to right, top to bottom (default option).
1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.
2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.
Page Number | In | Number | Index of PDF document's page
3.16. Get Text from Single Page (Area)
Gets Text from specified area of given page of supplied PDF document.
IMPORTANT NOTE 1: This action reads any block of text that is within the specified area; however, it always reads the
whole block, even if the block ends outside the specified area.
IMPORTANT NOTE 2: PDF document coordinates start from the bottom-left corner (0,0) and go right on the x axis and up on
the y axis.
Page Number | In | Number | Index of PDF document page
3.17. Get Text from Whole PDF
EXTRACTION STRATEGIES:
0: no strategy at all, characters are being read from left to right, top to bottom (default option).
1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.
2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.
3.18. Get Text from Whole PDF (Area)
Gets text from specified area from each page of supplied PDF document.
IMPORTANT NOTE 1: This action reads any block of text that is within the specified area; however, it always reads the
whole block, even if the block ends outside the specified area.
IMPORTANT NOTE 2: PDF document coordinates start from the bottom-left corner (0,0) and go right on the x axis and up on
the y axis.
3.19. Get Text from Whole PDF to Collection
Gets text from whole supplied PDF document and returns it as a 2-column collection (Page Number<Number>,
Text<Text>), where each row contains text from appropriate PDF page.
EXTRACTION STRATEGIES:
0: no strategy at all, characters are being read from left to right, top to bottom (default option).
1: simple extraction strategy - A simple text extraction renderer. This renderer keeps track of the current Y position
of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders
text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in
the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be
inserted into the output.
2: location extraction strategy - A text extraction renderer that keeps track of the relative position of text on the page. The
resultant text will be relatively consistent with the physical layout that most PDF files have on screen. This renderer
keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation.
Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance,
but different parallel distance is treated as being on the same line.
3.20. Get Text from Whole PDF to Collection (Area)
Gets text from specified area from each page of supplied PDF document and returns it as a 2-column collection (Page
Number<Number>, Text<Text>), where each row contains text from appropriate PDF page.
IMPORTANT NOTE 1: This action reads any block of text that is within the specified area; however, it always reads the
whole block, even if the block ends outside the specified area.
IMPORTANT NOTE 2: PDF document coordinates start from the bottom-left corner (0,0) and go right on the x axis and up on
the y axis.
3.21. Merge All PDFs in a Directory
Merges all PDF documents found in given directory into one document.
The supplied password is used to open each file that is password protected.
If Ignore Merge Errors is set to true, then in case of an error the action continues merging the next documents; otherwise it
terminates.
This action returns a collection with merge status (File Path<Text>; Status<Text>; Error Message<Text>).
1st column - File Path: contains full paths to the PDF files that were merged.
3rd column - Error Message: contains the error description for all documents with “Error” status.
Output of this action is useful only when Ignore Merge Errors is set to true.
Directory Path | In | Text | Path to directory with PDF files to be merged
Output File Path | In | Text | Full path to the merge destination
Ignore Merge Errors | In | Flag | OPTIONAL: (False is default value); True to continue merging next documents in case of an error
Merge Status | Out | Collection | Status of merge operation (File Path; Status; Error Message)
3.22. Merge Selected PDFs
Merges supplied PDF documents into one document and saves it at given path.
The 1st column of the File Paths collection is treated as the file paths of the PDF documents to be merged. It needs to be of
type <Text>.
The 2nd column (if it exists) is treated as the corresponding passwords to the files. It needs to be of type <Password> or
<Text>.
If Ignore Merge Errors is set to true, then in case of an error the action continues merging the next documents; otherwise it
terminates.
This action returns a collection with merge status (File Path<Text>; Status<Text>; Error Message<Text>).
1st column - File Path: contains full paths to the PDF files that were merged.
3rd column - Error Message: contains the error description for all documents with “Error” status.
Output of this action is useful only when Ignore Merge Errors is set to true.
File Paths | In | Collection | (1st column for file paths; 2nd column for passwords - UTF-8 encoding; any other columns will be disregarded)
Output File Path | In | Text | Full path to the merge destination
Ignore Merge Errors | In | Flag | OPTIONAL: (False is default value); True to continue merging next documents in case of an error
Merge Status | Out | Collection | Status of merge operation (File Path; Status; Error Message)
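The column handling and the merge status output can be sketched as below. This is a Python illustration under stated assumptions, not the VBO's code: `append` is a hypothetical callable that adds one file to the merged document, the "Success" status value is borrowed from the image actions rather than confirmed for merging, and re-raising is one reading of "terminate".

```python
from typing import Callable, Dict, List, Sequence


def merge_selected(
    file_paths: Sequence[Sequence[str]],
    append: Callable[[str, str], None],
    ignore_merge_errors: bool = False,
) -> List[Dict[str, str]]:
    """Merge the listed files in order and report one status row per file."""
    status: List[Dict[str, str]] = []
    for row in file_paths:
        path = row[0]                              # 1st column: file path
        password = row[1] if len(row) > 1 else ""  # 2nd column: optional password
        # Any further columns are disregarded.
        try:
            append(path, password)
            status.append({"File Path": path, "Status": "Success", "Error Message": ""})
        except Exception as exc:
            if not ignore_merge_errors:
                raise  # assumption: terminating means the error is propagated
            status.append({"File Path": path, "Status": "Error", "Error Message": str(exc)})
    return status
```

This also shows why the status output matters mainly when Ignore Merge Errors is true: only then can the result contain a mix of merged and failed files.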
3.23. Split Page Range to Single Pages
Extracts each page from given range of pages of supplied PDF document and saves it as separate PDF document.
Naming convention of extracted documents: 'Parent Document File Name' & '_Page Number' & '.pdf'.
If Encrypt is set to true, this action copies the security properties of the parent document to each extracted one (at least
one new password must be provided in this case).
User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening.
Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.
Start Page | In | Number | Index of PDF document's page from which the extraction will begin
End Page | In | Number | Index of PDF document's page at which the extraction will end
Output Directory | In | Text | Path to directory where extracted documents will be saved
New User Password | In | Password | OPTIONAL: Password used to open the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
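The naming convention for the extracted documents can be sketched as follows. This Python example is illustrative only: it assumes that 'Parent Document File Name' means the file name without its extension, that the page number is not zero-padded, and that End Page is inclusive; none of these details is confirmed here, and the function name is hypothetical.

```python
from pathlib import PureWindowsPath
from typing import List


def split_file_names(parent_path: str, start_page: int, end_page: int) -> List[str]:
    """Build output file names as '<parent name>_<page number>.pdf'."""
    stem = PureWindowsPath(parent_path).stem  # file name without extension
    return [f"{stem}_{page}.pdf" for page in range(start_page, end_page + 1)]


print(split_file_names(r"C:\docs\invoice.pdf", 2, 3))
# ['invoice_2.pdf', 'invoice_3.pdf']
```

The resulting files would be written to the Output Directory.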
3.24. Split Whole PDF to Single Pages
Extracts each page from supplied PDF document and saves it as separate PDF document. Naming convention of
extracted documents: 'Parent Document File Name' & '_Page Number' & '.pdf'.
If Encrypt is set to true, this action copies the security properties of the parent document to each extracted one (at least
one new password must be provided in this case).
User Password is used for opening the document. If this parameter is left blank, PDF software won’t prompt
for a password when opening.
Owner Password is used for managing the document (changing properties, etc.). If this parameter is left blank,
a random string is generated and used to encrypt the file.
Output Directory | In | Text | Path to directory where extracted documents will be saved
New User Password | In | Password | OPTIONAL: Password used to open the document; if Encrypt is set to true, then at least one password (user, owner) must be provided
New Owner Password | In | Password | OPTIONAL: Password to manage properties of the document; if Encrypt is set to true, then at least one password (user, owner) must be provided