
An Empirical Complexity Estimation for RIP Prediction

Fabio Giannetti
HP Laboratories
HPL-2012-116

Keyword(s):


PDF; Document Complexity; Raster Image Processing; Printing; Document Analysis; Document Characteristics; Document Processing Time Estimation; Cloud Computing.

Abstract:
Documents are increasingly generated, manipulated and sent for print production in the cloud. Cloud-based solutions bring a new set of capabilities that simplify storage, availability and ultimately collaboration and review, and it was expected that the printing industry would leverage that platform too. In contrast, documents have become smarter and easier to produce, while printing workflows are still bound to old paradigms. Printers, known as Print Service Providers (PSPs), use substantial information technology infrastructure to perform Raster Image Processing (RIP) and send ready-to-print data to Digital Presses. In this paper, we discuss how to compute a PDF file's complexity based on its intrinsic characteristics and hence predict, a priori, the time it will take to RIP. To achieve this, we have identified PDF characteristics and their correlation with RIP performance, and developed a testing platform that uses the RIP times as feedback to dynamically adjust the prediction function. Results have shown that we can successfully classify PDF files (jobs) and predict RIP times within 20-25% accuracy. This is a very good result considering that the RIP infrastructure itself shows processing-time variations of 6% on average, with peaks of up to 12%. This research opens up the possibility of moving the majority of the RIP infrastructure into the cloud, since it is now possible to assert which files will require more resources and time. This can help strike the balance between RIPping locally and RIPping in a cloud farm that streams the raster content to the PSP.

Internal Posting Date: May 21, 2012 [Fulltext] - HP Restricted

Copyright 2012 Hewlett-Packard Development Company, L.P.


An Empirical Complexity Estimation for RIP Prediction


Fabio Giannetti
Hewlett-Packard Laboratories
1501 Page Mill Road, M/S 1161
Palo Alto, CA 94304
+1 650 857 5085

fabio.giannetti@hp.com

ABSTRACT
Documents are increasingly generated, manipulated and sent for print production in the cloud. Cloud-based solutions bring a new set of capabilities that simplify storage, availability and ultimately collaboration and review, and it was expected that the printing industry would leverage that platform too. In contrast, documents have become smarter and easier to produce, while printing workflows are still bound to old paradigms. Printers, known as Print Service Providers (PSPs), use substantial information technology infrastructure to perform Raster Image Processing (RIP) and send ready-to-print data to Digital Presses. In this paper, we discuss how to compute a PDF file's complexity based on its intrinsic characteristics and hence predict, a priori, the time it will take to RIP. To achieve this, we have identified PDF characteristics and their correlation with RIP performance, and developed a testing platform that uses the RIP times as feedback to dynamically adjust the prediction function. Results have shown that we can successfully classify PDF files (jobs) and predict RIP times within 20-25% accuracy. This is a very good result considering that the RIP infrastructure itself shows processing-time variations of 6% on average, with peaks of up to 12%. This research opens up the possibility of moving the majority of the RIP infrastructure into the cloud, since it is now possible to assert which files will require more resources and time. This can help strike the balance between RIPping locally and RIPping in a cloud farm that streams the raster content to the PSP.

1. MOTIVATION
The Digital Press era is now well established, and adoption of such presses vs. conventional presses is rising by double digits every year. The flexibility introduced by the Digital Press has a domino effect on the pre-press (artwork preparation) and post-press (finishing equipment). The former is becoming more automated thanks to automated job submission and preparation tools as well as web-to-print systems. The latter moves toward tools that are more flexible and not inline with the press but rather nearline, so they can be re-used for a variety of job types instead of being specialized for a single production line. In this paper we discuss how the most complex of the pre-press steps, Raster Image Processing (RIP) [1], is re-directed to the cloud and hence becomes web based. Having the RIP process running in the cloud allows dynamic processing power allocation on a per-job basis. This reduces the need for IT infrastructure to be present at the Print Service Provider (PSP), and makes it easier to respond to spikes in production or to handle seasonality. In order to perform dynamic RIP allocation on a per-job basis, it is necessary to estimate the complexity of a job by computing and evaluating the relevant aspects and how these affect the RIP time. Extensive work has already been done in identifying useful job parameters from a PDF file, and a specialized PDF Profiler [2] has even been produced. To give an easy-to-grasp comparison: it is like having an application which, based on a few PDF characteristics, is capable of predicting how long Adobe Acrobat will take to present the PDF document to the user. Once this is achieved, it is possible to implement a system that computes the document complexity, estimates the RIP time and subsequently adjusts its estimation based on measured RIP times from the cloud-based RIP farm. In the following sections, we give an introduction to the various pre-press steps with a more in-depth view of the RIP task.
We will illustrate the various aspects of a document and how these influence RIP performance. This leads to our estimation function, the supporting data and experiments, and ultimately a description of the planned system with the feedback loop.

Categories and Subject Descriptors


I.7 [DOCUMENT AND TEXT PROCESSING]: I.7.2 Document Preparation - Format and notation, markup; I.7.4 Electronic Publishing - Print publishing for variable data driven templates.

General Terms
Documentation, Design, Standardization, Languages, Theory.

Keywords
PDF, Document Complexity, Raster Image Processing, Printing, Document Analysis, Document Characteristics, Document Processing Time Estimation, Cloud Computing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DocEng'12, September 4-7, 2012, Paris, France. Copyright 2012 ACM 978-1-4503-0863-2/11/09...$10.00.

2. BACKGROUND: DIGITAL PRINTING PRE-PRESS STEPS


In a non-cloud based printing workflow, the artwork is received by the pre-press department. In this department, there are typically a few steps performed to prepare the document for printing. These are the most common:


1. Pre-flight/Validation/Normalization
2. Imposition
3. Raster Image Processing (RIP)

The first step is to evaluate the quality and readiness of the received document to be printed. The pre-flight usually ensures that all the necessary resources are available in the Page Description Language (PDL). Typical checks validate that the fonts are embedded and that images have enough dpi to be printed at the required quality. Depending on the type of document, some color management can be performed at this stage too (e.g., photo enhancing). The second step ensures that the finishing requirements (e.g., folded booklet, stapled flyer, etc.) are met and that the printing area of the press is fully utilized; this reduces both printing time and paper waste. The last step transforms (RIPs) the PDL into a set of raster images which represent, for each sheet, the various color separations. A color separation dictates the location and amount of one of the primary colors, for instance Cyan, Magenta, Yellow and blacK (CMYK). This process is computationally expensive and heavily dependent on the intrinsic characteristics of the job, in our case a PDF file. There are obvious elements, like the number of pages, page size and file size, that influence the RIP time, but also less obvious elements like transparency and re-use (also called optimization in PDF jargon).

3. TEST OBJECTIVE
We tested our approach using real customer examples that were produced in commercial settings and rendered with a commercial RIP on production-ready hardware and software. We had in excess of 100 PDF files with different characteristics and complexity. The primary objective is to create test cases of documents that are similar in characteristics, so that it is possible to understand the accuracy of the proposed approach. The various PDF files (jobs) have been analyzed using a PDF Profiler to extract the following characteristics:

1. FS - File Size (MB): the size of the file in the file system. The file size reflects the amount of images and complex data representing the PDF and its re-use. Larger PDFs are usually more complex to RIP, while smaller PDFs are simpler or leverage re-use.
2. PS - Page Size (millipoint): the overall area of the page (or sheet) to be produced. This affects RIPping in two ways: the amount of data to process increases with the page size, and so does the amount of data produced as output.
3. NP - Number of Pages (integer): how many pages (or sheets) the PDF is composed of. There is always a correlation between the number of pages and the RIP time; sometimes it is directly proportional, sometimes it correlates with other aspects.
4. OA - Overall Area without Reuse or Transparency (millipoint): the amount of objects, across all pages, that have to be rastered by the RIP. These objects, once rastered, are placed in the output and disposed of.
5. AR - Overall Area with Reuse (millipoint): the amount of objects, across all pages, that once rastered can be re-used throughout the document. A re-used object can be cached and simply placed on the new page without any further computation, significantly speeding up the RIPping operation.
6. TP - Transparency Pervasiveness (percentage): the percentage of pages with transparency. These pages typically reduce RIP performance because they prevent the RIP from rastering their content in parallel: the interactions between objects and their transparency can fundamentally change the appearance of the objects underneath.

Using these parameters, it is possible to characterize the PDF documents into separate classes. These classes represent the fundamental characteristics of the PDF and should allow tailoring the correlation with RIP performance more accurately:

1. Plain Documents Class: PDF files that do not present any transparency or reuse.
2. Transparent Documents Class: PDF files where portions of the content in some pages have transparent settings.
3. Reuse Documents Class: PDF files where a portion of the content is effectively reused in some pages. The reusable content is stored in the PDF as an XObject.
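As a minimal illustration of the characterization above, the six characteristics and the class assignment can be sketched as follows (the type and function names are ours, not the PDF Profiler's, and the precedence of Transparent over Reuse is our assumption):

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    """The six characteristics extracted from a PDF job."""
    fs_mb: float   # FS - file size (MB)
    ps: float      # PS - page size (millipoint)
    np: int        # NP - number of pages
    oa: float      # OA - overall area without reuse/transparency
    ar: float      # AR - overall area with reuse
    tp: float      # TP - transparency pervasiveness (0-100 %)

def classify(job: JobProfile) -> str:
    # Any transparent pages put the job in the Transparent class;
    # otherwise reusable area makes it a Reuse job; the rest are Plain.
    if job.tp > 0:
        return "Transparent"
    if job.ar > 0:
        return "Reuse"
    return "Plain"
```

For example, a brochure with no XObjects and no transparency groups would profile with AR = 0 and TP = 0 and classify as Plain.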

In order to get enough data for the correlation process, each PDF was RIPped at least ten times. All RIP times have been recorded and averaged. The executions were singled out, and every PDF had the entire set of 12 RIPs at its disposal, for an apples-to-apples comparison. During this process we identified that the RIP has, on average, a run-to-run difference in execution times of between 6% and 8%, with spikes up to 12%. This is normal, given the inter-process/machine communication involved and the resource allocation performed by the OS. All the PDFs have been RIPped using HP SmartStream Production Pro version 4.5 [3]. This product is based on optimized software running on an HP ProLiant Generation 5 (G5) rack. The rack contains the following servers: four HP ProLiant DL360 [4] (two dual core Intel Xeon X5460 at 3.16 GHz with 4 GB of RAM); three HP ProLiant DL380 [4] (two dual core Intel Xeon X5460 at 3.16 GHz with 8 GB of RAM).
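The averaging and run-to-run variation measurement over the repeated executions can be sketched as below; the timing values are illustrative, not measured data from the paper:

```python
from statistics import mean

def rip_time_stats(timings):
    """Average repeated RIP timings (seconds) and report the worst
    deviation of an individual run from the mean, in percent."""
    avg = mean(timings)
    worst = max(abs(t - avg) / avg * 100 for t in timings)
    return avg, worst

# Ten hypothetical runs of the same job on the 12-RIP farm.
runs = [50.1, 48.7, 52.3, 49.9, 50.5, 47.8, 51.2, 50.0, 49.4, 50.1]
avg, worst = rip_time_stats(runs)  # mean 50.0 s, worst run off by 4.6%
```

In the experiments described above this spread stayed between 6% and 8% on average, with spikes up to 12%, which bounds how accurate any predictor can be.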


All the servers run Windows Server 2008 R2 Standard 64-bit. The RIPs run on the HP ProLiant DL360s, with three instances per server for a total of 12 RIPs. The HP ProLiant DL380s run the job ingestion and management software. In the next section, we describe the results obtained using an empirical formula based on the previously identified characterization aspects.


4. EMPIRICAL RESULTS
Once the PDF files have been classified, two sets have been generated: a training set and a testing set.
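As described in the next paragraphs, the training set is chosen so that, for each characteristic, the documents carrying its minimum, maximum and average values are represented. A rough sketch of that selection (the field names and the closest-to-mean rule are our assumptions):

```python
from statistics import mean

def pick_training_set(jobs):
    """jobs: list of (name, {characteristic: value}) pairs.
    For every characteristic, keep the documents holding the min and
    max values plus the one closest to the average, so the training
    set spans the observed range of each characteristic."""
    picked = set()
    characteristics = jobs[0][1].keys()
    for c in characteristics:
        values = [(profile[c], name) for name, profile in jobs]
        picked.add(min(values)[1])                      # min holder
        picked.add(max(values)[1])                      # max holder
        avg = mean(v for v, _ in values)
        picked.add(min(values, key=lambda p: abs(p[0] - avg))[1])
    return sorted(picked)

# Three hypothetical profiled jobs with two characteristics each.
jobs = [
    ("job_a", {"FS": 1.0, "NP": 10}),
    ("job_b", {"FS": 5.0, "NP": 40}),
    ("job_c", {"FS": 9.0, "NP": 20}),
]
```

Everything not selected as a representative is left for the test set.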


The training set has been RIPped several times, and the PDF characteristics highlighted in the previous section have been used to model a formula. The model has been built by finding the highest correlation values between the PDF characteristics and the training set RIP execution times. The training set must be as representative as possible for each class; hence the max, min and average values of each individual characteristic are identified and the corresponding PDF documents selected. This strategy allowed us to create an empirical function to compute a PDF complexity, as illustrated by Equation 1:

C = w0*FS + w1*PS + w2*NP + w3*OA + w4*TP - w5*AR

Equation 1: Empirical PDF Complexity Polynomial Function

Among all the PDF characteristics, only the Area of Reuse reduces complexity; this is because re-use accelerates the raster process by avoiding duplication of work. Obviously, the different characteristics influence the raster process differently and hence carry different weights. The conducted experiments highlight that the weights, in order to maintain a high correlation value, differ across the PDF classes previously identified. A further capability that has been implemented is the ability, for the testing engine, to use the raster time results as feedback and iteratively adjust the formula. A two-iteration result is illustrated in Figure 1, demonstrating that the adjustments correctly react to the measurements.

Figure 1: Subsequent Iterations using Adaptive Strategy

In the following sections the various classes are analyzed and the weights defined to maximize the correlation values.

4.1 Plain Documents Class
The Plain Documents Class identifies the PDF documents that have neither transparency nor re-use. In this case the AR and TP characteristics are empty and the corresponding weights can be zeroed. Using the following weights it is possible to obtain a correlation of 95.8% between the PDF complexity and the RIP time:

w0 = 0.25, w1 = 0.5, w2 = 0.25, w3 = 0, w4 = 0, w5 = 0

The tests are conducted with a balanced set of training and test documents. The same training set is executed at least three times, until the adaptive function is able to predict the results within a five percent error. Once the system is trained, test documents are submitted one by one. The prediction is performed, and the real RIP timing is monitored and used to further adjust the prediction of subsequent documents. Figure 2 illustrates the complete results for the documents in the training set and in the test set. The samples can be represented using a linear function.

Figure 2: Plain Documents Estimated vs. Measured Results

The proposed complexity formula manages to approximate RIP times with an average error of 23% (see Table 1), adjusted to 15-18% considering the RIP time fluctuations.

Table 1: Plain Test Document Set Prediction and Audit

PDF Document/Job   Complexity   Estimation (s)   Audit (s)   Deviation (%)
Plain_TestFile_1   0.70251989   48.79            50          2%
Plain_TestFile_2   0.63454632   39.22            32          18%
Plain_TestFile_3   0.30050372   16.61            14          16%
Plain_TestFile_4   0.33276647   16.95            12          29%
Plain_TestFile_5   0.38493941   17.49            28          60%
Plain_TestFile_6   0.27695733   16.37            14          14%
Plain_TestFile_7   0.32610461   16.88            14          17%
Plain_TestFile_8   0.28125221   16.41            12          27%
Average                                                      23%

4.2 Transparent Documents Class
The Transparent Documents Class identifies all the PDFs that have some transparent content. In this case TP is relevant but AR is not necessary, since no re-use is applicable. Using this weight combination we could obtain a correlation of 92.5%:

w0 = 0.125, w1 = 0.44, w2 = 0.125, w3 = 0.01, w4 = 0.3, w5 = 0

Figure 3 illustrates the complete results with both the training set and the test set. The results show an exponential-like behavior. This is in line with expectations: the more transparency there is in the document, the longer the RIP takes to produce the raster image. Usually the time is not linear because it depends on the size (in square points) as well as the number of overlapping objects and layers.

Figure 3: Transparent Documents Estimated vs. Measured Results

Table 2: Transparent Test Document Set Prediction and Audit

PDF Document/Job    Complexity   Estimation (s)   Audit (s)   Deviation (%)
Transp_TestFile_1   0.37906632   101.54           66          35%
Transp_TestFile_2   0.65742034   314.80           270         14%
Transp_TestFile_3   0.18439701   17.85            25          40%
Transp_TestFile_4   0.36458004   96.36            90          7%
Transp_TestFile_5   0.38454160   100.88           78          23%
Transp_TestFile_6   0.38073963   101.34           84          17%
Transp_TestFile_7   0.17354802   16.80            28          67%
Transp_TestFile_8   0.69975488   485.35           381         21%
Average                                                       28%

4.3 Reuse Documents Class
The Reuse Documents Class identifies the PDF documents that have been optimized. The optimization is based on the fact that some content can be stored once and used several times throughout the document. This has two effects at the PDF level: first, the file size is smaller due to the single-instance storage; second, the RIP rasters the content once and simply places it whenever a re-use instance occurs. Using the following weights we can obtain a correlation of 96%, and, as illustrated in Figure 4, it is possible to predict the raster process with an average deviation of 38%:

w0 = 0.01, w1 = 0.2, w2 = 0.5, w3 = 0, w4 = 0.15, w5 = 0.25

Among the document classes, re-use has been the hardest to predict, especially in low complexity cases. The samples can be approximated by a logarithmic-like function, as illustrated in Figure 4. This behavior can be explained by the fact that, as re-use increases, it mitigates the RIP work.

Figure 4: Reuse Documents Estimated vs. Measured Results

Table 3: Reuse Test Document Set Prediction and Audit

PDF Document/Job   Complexity   Estimation (s)   Audit (s)   Deviation (%)
Reuse_TestFile_1   0.971037     3927.92          2851        27%
Reuse_TestFile_2   0.639895     3430.03          4388        28%
Reuse_TestFile_3   0.531234     3266.59          3155        3%
Reuse_TestFile_4   0.318074     1398.90          1169        16%
Reuse_TestFile_5   0.242352     618.27           210         66%
Reuse_TestFile_6   0.192302     505.80           718         42%
Reuse_TestFile_7   0.189387     493.35           97          80%
Average                                                      38%

5. NEXT STEPS
The work presented here is at an embryonic phase and requires further examination and work. It is encouraging, though, that the six vectors of complexity are directly responsible for the raster estimation. Changing the weights greatly influences both the correlation values and the prediction results. For this reason it is our belief that good predictions can only be achieved by studying in advance the usual set of PDF documents treated by the PSP. Indeed, every PSP works with a limited set of products, and hence it is possible to classify the PSP's offering and produce ad-hoc weights for it. The next step is to create a tool that automatically analyses a representative set of PDFs, classifies them into Plain, Transparent and Reuse, and creates a customized set of weights for the three classes. These are then loaded into the production workflow tool, which uses the production data as a feedback channel to dynamically adjust the prediction. Further along, we are planning to use this data in conjunction with download estimates to create a complexity evaluator that can move the RIP stage for PDF documents to the cloud or to the PSP, depending on the PDF complexity and the available resources.
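Putting the pieces together, the per-class weighted complexity function and an iterative feedback adjustment can be sketched as follows. The normalization of the characteristics to [0, 1], the initial seconds-per-complexity-unit scale and the correction gain are our assumptions; the paper does not spell out these details:

```python
WEIGHTS = {  # (w0..w5) applied to FS, PS, NP, OA, TP, AR respectively
    "Plain":       (0.25, 0.5, 0.25, 0.0, 0.0, 0.0),
    "Transparent": (0.125, 0.44, 0.125, 0.01, 0.3, 0.0),
    "Reuse":       (0.01, 0.2, 0.5, 0.0, 0.15, 0.25),
}

def complexity(cls, fs, ps, np, oa, tp, ar):
    """Equation 1: C = w0*FS + w1*PS + w2*NP + w3*OA + w4*TP - w5*AR.
    Characteristics are assumed pre-normalized to [0, 1]; only the
    area of reuse (AR) reduces complexity."""
    w = WEIGHTS[cls]
    return w[0]*fs + w[1]*ps + w[2]*np + w[3]*oa + w[4]*tp - w[5]*ar

class RipPredictor:
    """Map complexity to a RIP-time estimate, then correct the scale
    factor from measured times (the feedback loop of the testing
    engine). A simple proportional correction is our stand-in for
    the unspecified adjustment rule."""
    def __init__(self, seconds_per_unit=60.0, gain=0.5):
        self.scale = seconds_per_unit  # hypothetical initial scale
        self.gain = gain               # how aggressively to correct

    def predict(self, c):
        return self.scale * c

    def feedback(self, c, measured_seconds):
        # Nudge the scale toward the value that would have made the
        # prediction exact for this job.
        exact = measured_seconds / c
        self.scale += self.gain * (exact - self.scale)
```

Each audited job thus tightens the prediction for the jobs that follow it, mirroring the one-by-one submission procedure of Section 4.1.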

6. REFERENCES
[1] PrintWiki, Raster Image Processor. http://printwiki.org/Raster_Image_Processor
[2] Thiago Nunes, Fabio Giannetti, Mariana Luderitz Kolberg, Rafael Nemetz, Alexis Cabeda, Luiz Gustavo Fernandes. Job profiling in high performance printing. ACM Symposium on Document Engineering 2009: 109-118.
[3] HP SmartStream Production Pro 4.x Graphic Arts Solution. http://h10088.www1.hp.com/gap/download/4AA17733ENUS_SmartStreamProductionPro_Hi%20Res_Mar2008-New.pdf
[4] HP ProLiant G5 Servers Overview. http://h10010.www1.hp.com/wwpc/ca/en/sm/WF05a/15351-15351-3328412-241475-241475-1121516.html?dnr=1
