Computation of Storage Requirements for Multi-Dimensional Signal Processing Applications


Florin Balasa, Member, IEEE, Hongwei Zhu, and Ilie I. Luican, Student Member, IEEE
Abstract: Many integrated circuit systems, particularly in the multimedia and telecom domains, are inherently data dominant. For this class of systems, a large part of the power consumption is due to the data storage and data transfer. Moreover, a significant part of the chip area is occupied by memory. The computation of the memory size is an important step in the system-level exploration, in the early stage of designing an optimized (for area and/or power) memory architecture for this class of systems. This paper presents a novel nonscalar approach for computing exactly the minimum size of the data memory for high-level procedural specifications of multidimensional signal processing applications. In contrast with all the previous works, which are estimation methods, this approach can perform exact memory computations even for applications with numerous and complex array references, and also with large numbers of scalars.

Index Terms: Lattice, memory allocation, memory size computation, multidimensional signal processing, polytope.

I. INTRODUCTION

IN TELECOM and real-time multimedia applications, including video and image processing, medical imaging, artificial vision, real-time 3-D rendering, and advanced audio and speech coding, a very large part of the power consumption is due to the data storage and data transfer. A typical system architecture includes custom hardware (application-specific accelerator datapaths and logic), programmable hardware (DSP core and controller), and a distributed memory organization which is usually expensive in terms of power and area cost. For instance, data transfer and memory access operations typically consume more power than a datapath operation [1]. In deriving an optimized memory architecture, memory size estimation/computation is an important step in the early phase of the design, the system-level exploration. This problem has been tackled in the past both for register-transfer level (RTL) programs at scalar level and for behavioral specifications at nonscalar level.

A. Scalar Estimation Techniques

The problem of optimizing the register allocation/assignment in programs at register-transfer level was initially formulated in the field of software compilers [2], aiming at a high-quality code generation phase. The problem of deciding which values in a program should reside in registers (allocation) and in which register each value should reside (assignment) has been solved by a

graph coloring approach. As in the code generation the register structure of the targeted computer is known, the k-coloring of a register-interference graph (where k is the number of computer registers) can be systematically employed for assigning values to registers and managing register spills.1 Although the problem of determining whether a graph is k-colorable is NP-complete, effective heuristic techniques were proposed. In the field of synthesis of digital systems, starting from a behavioral specification, the register allocation was first modeled as a clique partitioning problem [3]. The register allocation/assignment problem has been optimally solved for nonrepetitive schedules, when the lifetime of all scalars is fully determined [4]. The similarity with the problem of routing channels without vertical constraints [6] has been exploited in order to determine the minimum register requirements (similar to the number of tracks in a channel), and to optimally assign the scalars to registers (similar to the assignment of one-segment wires to tracks) in polynomial time by using the left-edge algorithm [5]. A nonoptimal extension for repetitive and conditional schedules has been proposed in [7]. A lower bound on the register cost can be found at any stage of the scheduling process using force-directed scheduling [8]. Integer linear programming techniques are used in [9] to find the optimal number of memory locations during a simultaneous scheduling and allocation of functional units, registers, and busses. Employing circular graphs, [10] and [11] proposed optimal register allocation/assignment solutions for repetitive schedules. A lower bound for the register count is found in [12] without fixing the schedule, through the use of ASAP and ALAP constraints on the operations. Good overviews of these techniques can be found in [1] and [13].

Common to all scalar-based storage estimation techniques is that the number of scalars (also called signals or variables) is drastically limited. When multidimensional arrays are present in the algorithmic specification of the targeted applications, the computation time of these techniques increases dramatically if the arrays are flattened and each array element is treated like a separate scalar.

B. Nonscalar Estimation Techniques

To overcome the shortcomings of the scalar estimation techniques for high-level algorithmic specifications, where the code has an organization based on loop nests and multidimensional arrays are present, several research teams proposed different techniques exploiting the fact that, due to the loop structure of the code, large parts of an array can be produced or consumed
1When a register is needed for a computation, but all available registers are in use, the content of one of the used registers must be stored (spilled) into a memory location, in order to free a register.

Manuscript received May 20, 2006; revised October 25, 2006. This work was supported by the National Science Foundation under Grant 0133318. The authors are with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607 USA (e-mail: fbalasa@cs.uic.edu). Digital Object Identifier 10.1109/TVLSI.2007.895246


within a single array reference. These estimation approaches can be basically split into two categories: those requiring a fully fixed execution ordering and those assuming a nonprocedural specification where the execution ordering is still not (completely) fixed. The techniques falling in the first category will be addressed first.

Verbauwhede et al. consider a production axis for each array to model the relative production and consumption time (or date) of the individual array accesses [14]. The difference between these two dates equals the number of array elements produced between them, while the maximum difference gives the storage requirement for the considered array. The time differences are computed based on an integer linear programming model using an ILP tool. Since each array is treated separately, only the in-place mapping internal to an array is considered, the possibility of mapping arrays in-place of each other being not considered.

Grun et al. compute the storage requirements between the different loop nests in the code [15]. Estimations of the occupied memory within the loop nests are based on the computation of upper and lower bounds. An upper bound is computed by producing as many array elements as possible before any consumption occurs, and the lower bound is found with the opposite reasoning. From these bounds, a memory trace of bounding rectangles as a function of time is built.

Zhao and Malik developed a technique based on live variable analysis and integer point counting for intersection/union of mappings of parametrized polytopes [16]. They prove that, in order to get the minimum memory size estimate, it is sufficient to find the number of live variables for one statement in each innermost loop nest.

Ramanujam et al. use for each array a reference window containing, at any moment during execution, the array elements alive (that have already been referenced and will also be referenced in the future) [17]. The maximal window size gives the storage requirement for the corresponding array. Treating the arrays separately, the technique does not consider the possibility of inter-array in-place mapping [1].

In contrast to the nonscalar methods described so far, the memory estimation technique presented in [18] does not take execution ordering into account, allowing any ordering not prohibited by data dependencies. The method performs a data dependence analysis which decomposes the array references, building a polyhedral dependence graph whose nodes are the array partitions. The storage requirement is estimated through a greedy traversal of this graph.

The estimation technique described in [19] assumes a partially fixed execution ordering. The authors employ a data dependence analysis similar to [18], their major improvement being to add the capability of taking into account available execution ordering information (based mainly on loop interchanges), thus avoiding the possible overestimates due to the total ordering freedom (less the data dependence constraints). Good overviews of some of these techniques can be found in [1], [20], and [21].

This paper presents a nonscalar method for computing exactly the minimum size of the data memory (or the storage requirements) in multidimensional signal processing algorithms

Fig. 1. Illustrative example of affine specification with two delayed array references.

where the code is procedural, that is, where the loop structure and sequence of instructions induce the (fixed) execution ordering. Even when the specifications are nonprocedural, the amount of storage can be estimated, computing accurate upper-bounds. This approach uses both algebraic techniques specific to the data-flow analysis used in modern compilers [22] and elements of the theory of polyhedra. Since the mathematical model is very general, the proposed method is able to handle the entire class of affine specifications (see Section II), therefore, practically the entire class of real-time multidimensional signal processing applications. In addition, this approach can process significantly complex specifications in terms of number of scalars (orders of magnitude larger than all the previous works), array references, and lines of code.

The rest of this paper is organized as follows. Section II explains the memory size computation problem in the context of our research. Section III describes the global flow of our memory computation algorithm, while Section IV details some more significant algorithms used by this framework. Section V addresses complexity topics. Section VI presents basic implementation aspects and discusses the experimental results. Section VII summarizes the main conclusions of this work.

II. MEMORY SIZE COMPUTATION PROBLEM

The algorithms for telecom and (real-time) multimedia applications are typically specified in a high-level programming language, where the code is organized in sequences of loop nests having as boundaries linear functions of the outer loop iterators, conditional instructions whose conditions may be either data-dependent or data-independent (relational and/or logical operators of linear functions of loop iterators), and multidimensional signals whose array references have (possibly complex) linear indices. This class of specifications is often referred to as affine [1].
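As a minimal illustration (hypothetical code, not the example of Fig. 1), an affine specification is a loop nest in which the loop bounds, the conditions, and the array indices are all linear functions of the loop iterators:

    int A[200][100], B[200][200];
    for (int i = 0; i < 64; i++)
      for (int j = i; j < 2*i + 16; j++)
        if (i + j <= 90)                          /* data-independent affine condition */
          B[i + j][2*j - i] = A[3*i][j + 1] + 1;  /* affine array indices */

Nonlinear indices such as A[i*j] fall outside this class.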


Fig. 2. The effect of loop interchange on the storage requirement: assuming that all of M's elements are created and consumed in the nested loops, the data memory size needed is: (a) 3104 locations and (b) 64 locations.

These algorithms describe the processing of streams of data samples, so their source codes can be imagined as surrounded by an implicit loop having the time as iterator. This is why the code often contains delayed signals, i.e., signals produced in previous data-sample processings, which are consumed during the current execution. An illustrative example is shown in Fig. 1. The delayed signals are the ones followed by the delay operator @, the next argument signifying the number of time iterations in the past when those signals were produced. The delayed signals must be kept alive during several time iterations, i.e., they must be stored in the background memory during one or several data-sample processings.

The problem is to determine the minimum amount of memory locations necessary to store the signals of a given multimedia algorithm during its execution or, equivalently, the maximum storage occupancy, assuming any scalar signal needs to be stored only during its lifetime. The total number of scalars in the algorithm in Fig. 1 is 463 314 (including the delayed scalars of signal A). But due to the fact that scalars having disjoint lifetimes can share the same memory location, the amount of storage can be much smaller than the total number of scalar signals. Actually, only 166 108 memory locations are necessary for this example (as explained in Section III-C and confirmed by our tool presented in Section VI).

Most of the previous nonscalar techniques on the evaluation of storage requirements (see Section I-B) considered a fixed execution ordering prior to the memory estimation [14]-[17]. The exceptions are [18], which allows any ordering not prohibited by data dependencies, and [19], where the designer can specify partial ordering constraints. In this paper, the algorithmic specifications are considered to be procedural; therefore, the execution ordering is induced by the loop structure and it is thus fixed, like in most of the previous works. This assumption is based on the fact that in present industrial design, the design entry usually includes a full fixation of the execution ordering. Even if this is not the case, the designer can still explore different algorithmic specifications which are functionally equivalent.2 Fundamentally different from these previous works [14]-[17], which do only a memory size estimation, the algorithm presented in this paper is able to perform an exact computation of the minimum data storage even for very complex specifications.

2 The memory size computation tool has also an estimation operating mode for nonprocedural specifications, when the execution ordering is only partially fixed. This will be explained in Section III-B.

One of the possible applications of the exact memory size computation is the evaluation of the impact of different code (and, in particular, loop) transformations on the data storage, during the early design phase of system-level exploration. For instance, the minimum memory size needed by the array in the code from Fig. 2(a) is 3104 locations. Interchanging the loops, a possible transformation in this case, shown in Fig. 2(b), drastically decreases the storage requirement, by 98%, to only 64 locations. It should be noticed that functionally equivalent codes, even apparently similar, may require very different amounts of data storage.

Another possible application is the evaluation of signal-to-memory mapping techniques,3 which are used to compute the physical addresses in the data memory for the array elements in the application code. These mapping models actually trade off an excess of storage for a less complex address generation hardware. None of these past research works dealing with the mapping techniques was able to provide consistent information on how good their models are, that is, how large the oversize of their resulting storage amount after mapping is in comparison to the minimum data storage. The effectiveness of their mapping model is assessed only relatively, that is, they compare the storage resulting after applying their mapping model either to the total number of array elements in the code, or to the storage results when applying a different mapping model. The exact computation of the memory size and, also, of the minimum storage windows for each array allows a better evaluation of the performance of any signal-to-memory mapping model [24].

In order to solve exactly the memory size computation problem, the simulated execution of the code may seem, at a first glance, an appealing strategy. In principle, it could indeed solve the problem by brute-force computation. However, from our past experience with such an approach, the simulated execution exhibits a poor scalability: while being fast for small and even medium size applications, the computation times increase steeply, becoming ineffectual when processing examples with millions of scalars (array elements) and deep loop nests with iterators having large ranges, like many image and video processing applications. Enumerative techniques or RTL scalar approaches (see Section I-A) based, e.g., on the left-edge algorithm [5] are too computationally expensive in such cases, often prohibitive to use.

The algorithm presented in this paper is a nonscalar technique for computing the maximum number of array elements simultaneously alive. This number represents the minimum data storage necessary for the code execution because the simultaneously alive scalar signals must be stored in distinct memory locations. The basic reasons for the performance of our approach are: 1) an efficient decomposition of the array references of the multidimensional signals into disjoint linearly bounded lattices [25] and 2) an efficient mechanism of pruning the code of the algorithmic specification, concentrating the analysis on the parts of the code where the larger storage requirements are likely to happen.

It must be noted that a lot of research involving parametric polytopes has been done in the compiler community like, for instance, affine loop nest analysis and transformation [26], [27],
3An overview of signal-to-memory mapping models is given, for instance, in [23].

the improvement of data locality of nested loops [28], counting lattice points, their images, and their projections [29], computing the number of distinct memory locations accessed by a loop nest [30], and parametrized integer programming [31].

Many specifications of multimedia algorithms are often parametrized for reasons of generality. These specifications may contain more than one parameter, and the array indexes, although linear functions of the loop iterators, may be nonlinear functions of the parameters, like in one of our benchmarks, the motion detection (see Section VI). Since the values of the parameters are known anyway before the implementation of the VLSI system, we adopted a more pragmatic (rather than theoretical) view when addressing the memory size computation problem, considering the values of the parameters to be known.4

A limitation of the current implementation is that the input programs are in single-assignment form, that is, each array element is written at most once (but it can be read an arbitrary number of times). At the end of Section IV-B, it will be explained how to process specifications which are not in the single-assignment form.

4 Like other past works [14]-[19], as well. In spite of considering parametrized polytopes, Zhao and Malik still consider the parameter values to be known [16], since their estimation results are numeric, not parametric. On the other hand, researchers from the compiler community did compute the memory size for parametrized examples, but these were restricted in size and complexity (for instance, only one, usually perfect, nest of loops, with one single parameter), which is not sufficient for typical multimedia applications.

Fig. 3. Mapping of the iterator space of the array reference A[2i+3j][5i+j] into its index space.

III. EXACT MEMORY SIZE COMPUTATION BASED ON DATA-DEPENDENCE ANALYSIS

A. Definitions and Concepts

Definitions: A polyhedron is a set of points satisfying a finite set of linear inequalities, P = { x | A.x >= b }, where A is an integer matrix and b an integer vector. If P is a bounded set, then P is called a polytope. If P is a subset of Z^n, then P is called a Z-polyhedron/Z-polytope. The set L = { T.x + u | x in Z^n } is called the lattice generated by the columns of the matrix T.

Each array reference of an m-dimensional signal, in the scope of a nest of n loops, is characterized by an iterator space and an index (or array) space. The iterator space signifies the set of all iterator vectors i in Z^n in the scope of the array reference. The index space is the set of all index vectors x in Z^m of the array reference. When the indices of an array reference are linear mappings with integer coefficients of the loop iterators, the index space consists of one or several linearly bounded lattices (LBL) [25], that is, images of an affine vector function over the iterator polytope

    { x = T.i + u | A.i >= b }    (1)

where x in Z^m is the index vector of the m-dimensional signal and i in Z^n is an n-dimensional iterator vector. In our context, the elements of the matrices T and A, and of the vectors u and b, are considered integers.

Example: Consider the array reference A[2i+3j][5i+j] of a 2-D signal A, placed in the scope of a two-level loop nest with iterators i and j and of a conditional instruction. The iterator space (see Fig. 3) of the array reference is the Z-polytope defined by the loop boundaries and by the condition of the if statement, written either in the matrix form A.i >= b or, in nonmatrix format, as a set of linear inequalities in i and j. (Note that one of these inequalities is redundant and can be eliminated.) The index space of the array reference can be expressed in this case as the LBL

    { x = 2i + 3j, y = 5i + j | (i, j) in the iterator space }

where x and y are the indices of the array reference. The points of the index space lie inside the Z-polytope which is the image of the boundary of the iterator space. In this example, each point in the iterator space is mapped to a distinct point of the index space, which is not always the case.
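To make the relation between the two spaces concrete, the following small C++ sketch enumerates both spaces of an LBL of form (1) by brute force. The mapping x = 2i + 3j, y = 5i + j is the one of the array reference above; the loop bounds and the condition are hypothetical, chosen only for illustration, since the bounds of the paper's example are not reproduced here.

    #include <iostream>
    #include <set>
    #include <utility>

    int main() {
        // Iterator polytope written directly as loop bounds plus a condition
        // (hypothetical values, standing in for A.i >= b).
        std::set<std::pair<int, int>> indexSpace;  // distinct points of the index space
        int iteratorPoints = 0;
        for (int i = 0; i <= 4; ++i)
            for (int j = 0; j <= 4; ++j)
                if (i + j <= 6) {
                    ++iteratorPoints;              // one point of the iterator space
                    int x = 2 * i + 3 * j;         // affine mapping T.i + u (here u = 0)
                    int y = 5 * i + 1 * j;
                    indexSpace.insert({x, y});     // one point of the index space
                }
        std::cout << "iterator-space points: " << iteratorPoints << "\n"
                  << "index-space points:    " << indexSpace.size() << "\n";
        return 0;
    }

When the two printed counts are equal, the mapping is one-to-one on this LBL; this size comparison is the same bijectivity test used later, at Step 3.1.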


Fig. 4. (a) Decomposition of the index space of signal A from the code example in Fig. 1 into disjoint LBLs; the arcs in the inclusion graph show the inclusion relations between lattices. (b) The partitioning of A's index space according to the decomposition (a).

B. Flow of the Algorithm

The main steps of the memory size computation algorithm will be discussed below, using a more intuitive (rather than formal) presentation. Subsequently, Section IV will formally address the essential techniques used in the algorithm.

Step 1: Extract the array references from the given algorithmic specification and decompose the array references of every indexed signal into disjoint linearly bounded lattices.

Fig. 4(a) shows the result of this decomposition for the 2-D signal A in the illustrative example from Fig. 1. The graph displays the inclusion relations (arcs) between the LBLs of A (nodes). The four bold nodes are the four array references of signal A in the code. The nodes are also labeled with the size of the corresponding LBL, that is, the number of lattice points (i.e., points having integer coordinates) in those sets. The inclusion graph is gradually constructed by partitioning analytically the initial (four) array references using LBL intersections and differences. While the intersection of two nondisjoint LBLs is an LBL as well [25], the difference is not necessarily an LBL, and this latter operation makes the decomposition difficult. In this example, some of the differences between array references are also LBLs [denoted as separate nodes in Fig. 4(a)]; however, the difference between two of the array references is not an LBL, due to the nonconvexity of this set. At the end of the decomposition, the nodes without any incident arc represent nonoverlapping LBLs [they are displayed in Fig. 4(b)]. Every array reference in the code is now either a disjoint LBL itself, or a union of disjoint LBLs. Section IV-A will present in more detail the LBL decomposition algorithm.

Step 2: Determine the memory size at the boundaries between the blocks of code.

The algorithmic specification is, basically, a sequence of nested loops. (Single instructions outside nested loops are actually nests of depth zero.) We refer to these loop nests as blocks of code. After the decomposition of the array references in the specification code, it is determined in which block each of the disjoint LBLs is created (i.e., produced), and in which block it is used as an operand for the last time (i.e., consumed). Based on this information, the memory size between the blocks can be determined exactly, by adding the sizes of every live lattice [32].
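As a sketch of Step 2, suppose each disjoint LBL has been summarized by its size, the index of the block that produces it, and the index of the block that consumes it last (the numbers below are made up for illustration; they are not the data of Fig. 1). The storage at every block boundary is then a plain sum over the lattices still alive there:

    #include <iostream>
    #include <vector>

    struct LblInfo {
        long size;        // number of lattice points (array elements) in the LBL
        int producedIn;   // index of the block creating the LBL
        int lastUsedIn;   // index of the block reading it for the last time
    };

    // Memory occupied at the boundary after block b: every LBL already produced
    // (producedIn <= b) and still needed later (lastUsedIn > b) is alive.
    long memoryAfterBlock(const std::vector<LblInfo>& lbls, int b) {
        long total = 0;
        for (const LblInfo& l : lbls)
            if (l.producedIn <= b && l.lastUsedIn > b)
                total += l.size;
        return total;
    }

    int main() {
        std::vector<LblInfo> lbls = {
            {100, 0, 1},   // produced by block 0, last used in block 1
            { 40, 0, 2},
            {  1, 1, 2},
            {200, 2, 3},   // block 3 stands here for the outputs of the code
        };
        for (int b = 0; b <= 2; ++b)
            std::cout << "memory after block " << b << ": "
                      << memoryAfterBlock(lbls, b) << " locations\n";
        return 0;
    }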

Fig. 5. Illustrative example similar to the code in Fig. 1, but without delayed array references. The total number of scalars is 302 498; the storage requirement is 5292.

Steps 2 and 3 will be illustrated using the example in Fig. 5, which is identical to the code in Fig. 1 except for the delay operators, addressed in Section III-C. The inclusion graph and the decomposition of the array space of signal A are the same as in Fig. 4(a) and (b). The memory sizes at the beginning/end of the specification code are the storage requirements of the inputs and, respectively, of the outputs. Block 1 produces a single LBL (just one scalar); hence, the memory size after Block 1 is 5291. Block 2 consumes only one LBL from signal A; it both produces and consumes three LBLs of signal S; it also produces two more LBLs while consuming two others (see Fig. 6). The memory size after Block 2 is 4762. (The LBLs both produced and consumed cancel each other.) Similarly, Block 3 consumes all of signal A still alive and both produces and consumes three LBLs of signal T. Since only the output is still alive afterwards, the memory size after Block 3 is the storage requirement of the outputs.

Step 3: Compute the storage requirement inside each block.

Step 3.1: Determine the characteristic memory variation for each assignment instruction in the current block.

Take, for instance, the first loop nest from the illustrative example in Fig. 5. The assignment (1) produces at each iteration a new element of the array S. We say that the characteristic memory variation of this assignment is +1 since each time the instruction is executed the memory size will increase by one location. Similarly, the assignment (3) has a characteristic memory variation of -1 (i.e., 1 - 2) since at each iteration one scalar signal is produced and two other scalars are consumed (used for the last time). In general, the characteristic memory variation of an assignment is easy to compute: a produced array reference having a bijective vector function of its LBL(s) has a contribution of +1; an entirely consumed array reference having a one-to-one mapping has a contribution of -1; an array reference having no component LBL consumed in the block has a zero contribution, independent of its mapping. The rest of the array references in the assignment are ignored for the time being; they are dealt with at Step 3.3.
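A minimal sketch of this rule, assuming each array reference of an assignment has already been classified during Steps 1 and 2 (the flags and the operand list below are illustrative, not the tool's actual data structures):

    #include <iostream>
    #include <vector>

    // Classification of one array reference inside an assignment.
    struct ArrayRef {
        bool produced;           // appears on the left-hand side
        bool fullyConsumedHere;  // all of its LBLs are read for the last time in this block
        bool oneToOne;           // its iterator-to-index mapping is bijective
    };

    // Characteristic variation: +1 for a produced bijective reference, -1 for an
    // entirely consumed one-to-one reference, 0 otherwise (partly consumed or
    // non-bijective references are handled separately, at Step 3.3).
    int characteristicVariation(const std::vector<ArrayRef>& refs) {
        int variation = 0;
        for (const ArrayRef& r : refs) {
            if (r.produced && r.oneToOne)               variation += 1;
            else if (r.fullyConsumedHere && r.oneToOne) variation -= 1;
        }
        return variation;
    }

    int main() {
        // Assignment (3) of the example: one produced scalar, two consumed scalars.
        std::vector<ArrayRef> assignment3 = {
            {true,  false, true},   // produced element
            {false, true,  true},   // consumed operand
            {false, true,  true},   // consumed operand
        };
        std::cout << "characteristic variation: "
                  << characteristicVariation(assignment3) << "\n";  // prints -1
        return 0;
    }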

Fig. 6. Inclusion graphs of the signals S and T (Figs. 1 and 5).

For instance, at assignment (2), two of the array references bring a contribution of +1 and -1, respectively; another array reference brings a zero contribution since it contains only an LBL [see Fig. 4(a)] which is not consumed in this block (that LBL is also covered by an array reference from the second loop nest, so it will be consumed in Block 3). Only part of the last array reference is consumed in this block, that is, one of the LBLs in Fig. 4(a). Therefore, the characteristic memory variation of assignment (2) is zero, meaning that the typical variation is of zero locations; however, for some iterations the memory variation may be -1 due to the consumption of the elements in the partly consumed LBL. Note that testing whether the vector function of an array reference is a one-to-one mapping or not is very simple at this phase: it is sufficient to compare the size of its LBL(s), computed at Step 1, with the size of its iterator space; if they are equal, then the vector function is bijective.

Step 3.2: Check whether the maximum storage requirement could occur in the current block or not.

The maximum possible memory increase in the block is the number of executions of the assignment instructions with positive characteristic memory variations. If this maximum possible increase, added to the amount of memory at the beginning of the block, is not larger than the maximum storage at the block boundaries (known from Step 2), then the maximum storage requirement cannot occur in this block, which hence can be safely skipped from further analysis. In particular, the blocks in which no signals are consumed (used for the last time) can be skipped since the memory will only increase to the amount at the end of the block (known from Step 2).
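The pruning test itself is a one-line comparison; the sketch below assumes the boundary sizes from Step 2 and the per-block maximum increase are already available (the value used for the second nest's increase is illustrative, since it is not given in the text):

    #include <iostream>

    // A block can be skipped when even the most pessimistic growth inside it
    // cannot exceed the largest storage already known at the block boundaries.
    bool blockCanBeSkipped(long memAtBlockStart,
                           long maxPossibleIncrease,  // executions of assignments with positive variation
                           long maxBoundaryMemory) {  // largest value computed at Step 2
        return memAtBlockStart + maxPossibleIncrease <= maxBoundaryMemory;
    }

    int main() {
        // First loop nest of the example: 5291 + 497 could exceed 5291, so it is analyzed.
        std::cout << "skip first nest?  "
                  << (blockCanBeSkipped(5291, 497, 5291) ? "yes" : "no") << "\n";
        // Second loop nest: it starts at 4762 and cannot reach 5291, so it is skipped
        // (the increase of 400 is an illustrative value).
        std::cout << "skip second nest? "
                  << (blockCanBeSkipped(4762, 400, 5291) ? "yes" : "no") << "\n";
        return 0;
    }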

The memory size at the beginning of the first loop nest is 5291 (and this is the largest value among the memory sizes at the block boundaries). The maximum possible memory increase is due only to assignment (1), having a characteristic memory variation of +1, executed 497 times. So, theoretically, the memory size could reach (but not exceed) the value 5291 + 497 = 5788 (although it will not). The maximum storage requirement could occur in this block, so its analysis should continue.

On the other hand, the memory size at the beginning of the second loop nest is 4762. The memory size within that block could reach at most a value which is already smaller than 5291. It follows that the maximum storage requirement cannot occur in the second loop nest, so this block can be skipped from further analysis (unless the tool is in the trace mode, as shown in Section VI). In such a situation, Step 3 is resumed from the beginning for the next block of code. This code pruning enhances the running times, concentrating the analysis on those portions of code where the memory increase is likely to happen.

Remark: The memory computation tool can also be used to estimate the storage requirements for nonprocedural specifications by finding an upper-bound of the memory size large enough for any possible execution ordering inside the blocks of code (the block organization being considered fixed, though). Still assuming that any scalar must be stored only during its lifetime, it was previously shown that the execution of Block 2 (and, actually, of the whole code) cannot necessitate more than 5788 locations even if the computation order is changed. This upper-bound is only 9.4% larger than the exact minimum storage (5292) for the procedural code, and it is a reasonably good estimation. When the tool runs in the estimation mode, processing nonprocedural specifications, it resumes Step 3 from the beginning for the next block of code.

Step 3.3: Determine and sort the iterator vectors of the consumed elements covered by array references which are: 1) only partly consumed or 2) whose mappings (between iterators and indexes) are not one-to-one.

Case 1: The mapping of the partly consumed array reference is bijective. Each consumed LBL is intersected with the LBL of the array reference (in which it is included). By doing this, the general expression of the iterator vectors addressing only the elements of the consumed LBL is obtained. At each such iterator vector, one array element will be consumed (since the mapping of the array reference is bijective).

Case 2: The mapping of the array reference is not bijective. This case is more difficult since distinct iterator vectors can access a single array element, whereas we are interested in only that unique iterator vector accessing the array element for the last time. This is what we call the maximum (lexicographic) iterator vector.


Definition: Let i = (i1, ..., in) and j = (j1, ..., jn) be two iterator vectors in the scope of n nested loops, which may be assumed normalized (i.e., all the iterators are increasing with a step size of 1). The iterator vector i is larger lexicographically than j if i1 > j1, or i1 = j1 and i2 > j2, or, in general, i1 = j1, ..., ik-1 = jk-1, and ik > jk for some k. The minimum/maximum iterator vector from a set of such vectors is the smallest/largest vector in the set relative to the lexicographic order.

Example: In a two-level loop nest with normalized iterators, the maximum iterator vector such that a given array element is accessed is the last iteration, in lexicographic order, whose indices map to that element; the minimum iterator vector is, correspondingly, the first such iteration. Section IV-B will present the algorithm computing the maximum iterator vector.

Only one LBL consumed in the first loop nest (see Fig. 5) satisfies the conditions of Step 3.3. It is part of an array reference whose mapping is not one-to-one, and the maximum iterator vectors of its elements indicate the iterations when these elements are consumed.

Remark: If the consumed LBL is included in several array references within the block, the maximum iterator vectors are taken over all these array references: for each element, we obviously take the largest of these vectors, since the element is accessed for the last time in that iteration.

Step 3.4: Compute the storage requirement in the current block.

At this moment, there is enough information to compute the memory size after any assignment: it is the sum of the initial memory at the beginning of the block (here, 5291), the characteristic memory variations of the executed assignments (here, +1, 0, and -1 for the three assignments), and the number of other consumed elements, given by the maximum iterator vectors (determined at Step 3.3) that are lexicographically smaller than the current iterator vector.

The algorithm computes the memory size only after the assignments increasing the memory (with positive characteristic memory variation) and before the assignments decreasing the memory, in order to identify the local maxima of the memory variation. The maximum storage requirement (5292) is reached after assignment (1) in the first iteration. It is the maximum value for the entire code. Therefore, the minimum data memory necessary to ensure the code execution is 5292 locations.

The main ideas of the algorithm presented and illustrated in this section are summarized in the following. As already mentioned, the algorithm is actually computing the maximum number of scalars (array elements) simultaneously alive since this number corresponds to the minimum data storage necessary for the code execution. First (at Steps 1 and 2), the algorithm computes the number of simultaneously alive scalars at the borderline between the blocks of code (nests of loops). The lattices of array elements created in blocks before a certain borderline and, also, included in operands from blocks after the borderline contain the live array elements. (The rest of the array elements are either already dead, or still unborn.) The total size of these lattices is the exact storage requirement at the respective borderline. Afterwards (at Step 3), the computation focuses on each block of code, one by one. Some assignment instructions inside the block have a constant memory variation for each execution, since each time they create and consume the same amount of scalars. For the other assignments, the time of death (in terms of executed datapath instructions) of the array elements covered by the dying lattices is precisely determined, using maximum (lexicographic) iterator vectors. Based on these data, all the local maxima of the memory variation are exactly determined. In particular, the global maximum at the level of the entire code is accurately computed, as well.
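The scan over one block can be sketched as follows: once the memory variation of every executed assignment is known (from the characteristic variations of Step 3.1 and the sorted maximum iterator vectors of Step 3.3), a single pass recovers the local maxima and the global one. The starting value and the variation sequence below are illustrative only.

    #include <iostream>
    #include <vector>

    int main() {
        long memory = 100;        // storage at the beginning of the block (from Step 2)
        long globalMax = memory;
        // Memory variation of each executed assignment: +1 when an element is
        // produced, -1 when an element is consumed for the last time.
        std::vector<int> variations = {+1, 0, -1, +1, +1, -1, -1, 0, +1, -1};
        for (int v : variations) {
            memory += v;
            if (memory > globalMax)
                globalMax = memory;   // a new local (and so far global) maximum
        }
        std::cout << "maximum storage inside the block: " << globalMax
                  << " locations\n";
        return 0;
    }

The tool itself evaluates the memory size only after memory-increasing assignments and before memory-decreasing ones, which is equivalent to this full scan for the purpose of finding the maxima, but cheaper.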


C. Handling Array References With Delays

Our current implementation supports constant-value delays, the typical situation in most of the multimedia applications. The presence of the delay operators does not affect the partitioning of the signals' index spaces (Step 1 of the algorithm). The LBLs only get attached an additional piece of information: their maximum delays in the specification code. These data can be easily determined with a top-down traversal of the inclusion graphs, a lower-level node getting the maximum delay from the nodes of the sets containing it. For instance, in the inclusion graph in Fig. 4(a), the maximum delay is 16 for two of the nodes, and it is 32 for all the other nodes.

The memory size at the beginning of the code in Fig. 1 is the total storage requirement of the current input and of the previous 32 inputs, but with the older 16 inputs (delays 17-32) counted without the LBLs whose maximum delay is 16, since those elements are no longer referenced after 16 sample processings. Here, the delay notation denotes the input signal and, respectively, its lattices from the indicated number of sample processings in the past. (Obviously, their sizes are the same as in the current code execution.) The memory size at the end of the code is 160 817 since two lattices (that is, 5290 scalars in total) were consumed during the execution. The maximum storage requirement is 166 108; it is reached in the first iteration of the first loop nest, after the first assignment.

In order to handle delays, Steps 2 and 3 of the algorithm are only slightly modified: they only have to take into account that an LBL is consumed when it belongs to an array reference whose delay is equal to the LBL's maximum delay.

IV. ALGEBRAIC TECHNIQUES USED IN THE MEMORY SIZE COMPUTATION ALGORITHM

A. Full Decomposition of Array References Into Disjoint LBLs

It must be noted that the idea of decomposing the array references in the code into disjoint LBLs (Step 1 of the algorithm, as explained in Section III) was first proposed in [18]. However, their model was only a partial decomposition since it relied only on the LBL intersection operation (shown as follows). Their basic sets, the term used to refer to the disjoint components of the decomposition, were not all LBLs: some of them were differences of LBLs (which are not necessarily LBLs, as already mentioned in Section III, at Step 1). We have reasons to believe that in [19] Kjeldsberg et al. used a partial decomposition as well since they employed a similar data dependence analysis as in [18]. The fact that [18] and [19] carried out only a partial LBL decomposition should not be construed as a limitation of those models: in the context of their research, aiming to achieve a memory size estimation, a partial LBL decomposition was sufficient for their goal. However, in our context, where the goal is to perform an exact memory size computation, a full LBL decomposition, where each of the partitions is an LBL, is significantly more convenient, especially at Step 3.3 (see Section III).

The basic ideas of the LBL decomposition are presented in the following, followed by an illustrative example. The analytical partitioning of the array references of every signal into disjoint LBLs can be performed by a recursive intersection, starting from the array references in the code. Let M be a multidimensional signal in the algorithmic specification. A high-level pseudo-code of the LBL decomposition is as follows:

    for all the array references of signal M
        select an array reference and let lbl_new be its representation;
        for all the current disjoint LBLs of the signal M
            select an LBL, let it be called lbl;
            compute lbl_intersect = lbl_new intersected with lbl;
            if the intersection is not empty
                then compute lbl_new minus lbl_intersect and lbl minus lbl_intersect;
                     update the LBL collection of M and their inclusion graph;
            repeat the operations in the loop till no new LBL is created;
        end for;
    end for.
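The partitioning logic of this pseudo-code can be illustrated with explicit point sets standing in for the symbolic lattices; the tool performs these operations analytically on LBLs, whereas here std::set makes intersection and difference trivial so that only the decomposition scheme is visible (the references in main() are invented 1-D examples):

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <vector>

    using PointSet = std::set<int>;   // stand-in for an LBL (1-D points for brevity)

    // Insert a new array reference into a collection of pairwise disjoint parts.
    void addReference(std::vector<PointSet>& parts, PointSet ref) {
        std::vector<PointSet> next;
        for (const PointSet& p : parts) {
            PointSet common, rest;
            std::set_intersection(p.begin(), p.end(), ref.begin(), ref.end(),
                                  std::inserter(common, common.begin()));
            std::set_difference(p.begin(), p.end(), ref.begin(), ref.end(),
                                std::inserter(rest, rest.begin()));
            if (!common.empty()) next.push_back(common);
            if (!rest.empty())   next.push_back(rest);
            // remove the part already covered from the incoming reference
            PointSet remaining;
            std::set_difference(ref.begin(), ref.end(), p.begin(), p.end(),
                                std::inserter(remaining, remaining.begin()));
            ref = std::move(remaining);
        }
        if (!ref.empty()) next.push_back(ref);
        parts = std::move(next);
    }

    int main() {
        std::vector<PointSet> parts;                // disjoint parts of one signal
        addReference(parts, {1, 2, 3, 4, 5, 6});    // first array reference
        addReference(parts, {4, 5, 6, 7, 8});       // overlapping reference
        addReference(parts, {2, 3});                // reference included in the first
        std::cout << "number of disjoint parts: " << parts.size() << "\n";
        return 0;
    }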

Two operations are relevant in our context: the intersection and the difference of two LBLs. While the intersection of two LBLs was addressed also by other works (in different contexts, though) as, for instance, [25], the difference operation is far more difficult.

1) Intersection of Two LBLs: Let { x = T1.i1 + u1 | A1.i1 >= b1 } and { x = T2.i2 + u2 | A2.i2 >= b2 } be two LBLs derived from the same indexed signal, where T1 and T2 have obviously the same number of rows, the signal dimension. Intersecting the two linearly bounded lattices means, first of all, solving the linear Diophantine system5 T1.i1 + u1 = T2.i2 + u2, having the elements of i1 and i2 as unknowns. If the system has no solution, the intersection is empty. Otherwise, let

    i1 = V1.i + v1 ,   i2 = V2.i + v2    (2)

be the solution of the Diophantine system, where i is a new iterator vector. If the set of coalesced constraints of the two LBLs

    A1.(V1.i + v1) >= b1 ,   A2.(V2.i + v2) >= b2    (3)

has integer solutions, then the intersection is a new LBL s.t.

    { x = (T1.V1).i + (T1.v1 + u1) | constraints (3) }    (4)

5 Finding the integer solutions of the system. Solving a linear Diophantine system was proven to be of polynomial complexity, all the known methods being based on bringing the system matrix to the Hermite Normal Form (HNF) [33].
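The Diophantine step can be illustrated on the simplest possible case, the intersection of two one-dimensional lattices {a1.k + b1 | k in Z} and {a2.k + b2 | k in Z}. The solvability test and the step of the resulting lattice come from the extended Euclidean algorithm; the tool solves the analogous multidimensional system through the HNF, so the sketch below is only this 1-D special case.

    #include <iostream>
    #include <optional>
    #include <tuple>
    #include <utility>

    // Extended Euclid: returns (g, x, y) with a*x + b*y == g == gcd(a, b).
    std::tuple<long long, long long, long long> extGcd(long long a, long long b) {
        if (b == 0) return {a, 1, 0};
        auto [g, x, y] = extGcd(b, a % b);
        return {g, y, x - (a / b) * y};
    }

    long long posMod(long long a, long long m) { return ((a % m) + m) % m; }

    // Intersection of {a1*k + b1} and {a2*k + b2} (a1, a2 > 0): either empty or
    // again a one-dimensional lattice {step*k + offset | k in Z}.
    std::optional<std::pair<long long, long long>>
    intersect1D(long long a1, long long b1, long long a2, long long b2) {
        auto [g, x, y] = extGcd(a1, a2);
        if ((b2 - b1) % g != 0) return std::nullopt;  // Diophantine equation a1*k1 - a2*k2 = b2 - b1 has no solution
        long long step = a1 / g * a2;                 // lcm(a1, a2)
        long long k1 = posMod(x * ((b2 - b1) / g), a2 / g);
        long long offset = posMod(a1 * k1 + b1, step);
        return std::make_pair(step, offset);
    }

    int main() {
        // {4k + 1} = {1, 5, 9, ...} and {6k + 3} = {3, 9, 15, ...} intersect in {12k + 9}.
        if (auto r = intersect1D(4, 1, 6, 3))
            std::cout << "intersection: {" << r->first << "k + " << r->second << "}\n";
        // {4k + 1} and {6k + 2} have no common element.
        if (!intersect1D(4, 1, 6, 2))
            std::cout << "second pair: empty intersection\n";
        return 0;
    }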

2) Difference of Two LBLs: The difference of two LBLs (and, similarly, the difference taken in the opposite order) can be decomposed by generating a set of LBLs, not necessarily disjoint, covering the difference. These LBLs will be further intersected recursively as well, till all the components are disjoint. The LBLs covering the difference are generated as follows. Suppose the minuend and subtrahend LBLs have the general form (1), and the iterator vectors of the subtrahend have the general form derived from the solution (2) of a Diophantine linear system, as previously explained: an integer matrix with as many rows as the dimension of the iterator space, at most that many columns, and an integer offset vector. (When the matrix has no columns, the Diophantine system has a unique solution.)

Case 1: Assume for the time being that the solution matrix has at least one column and that it is transformed into a Hermite reduced form6 [34]. Then it is easy to detect those rows of the matrix that are linear combinations of the rows above them. For such a row, one can find integers, not all zero, expressing it as a linear combination of the rows above it; on the subtrahend LBL, the corresponding linear equality between iterators is satisfied. Since an LBL in the difference cover must be disjoint from the subtrahend LBL, the two LBLs obtained from the minuend by replacing this equality with a strict inequality, in each of the two directions, are included in the minuend LBL but disjoint from the subtrahend LBL. Hence, they are included in the difference. Note that this case also covers the situations when the solution is unique, or when the solution matrix has null rows: the added inequalities then simply bound the corresponding iterator strictly below and strictly above its value in the subtrahend.

6 A Hermite reduced form of a matrix is a matrix having all the nonzero elements on or below the main diagonal. It can be obtained by post-multiplying the matrix with a sequence of unimodular matrices and premultiplying it with a row permutation matrix.

Fig. 7. (a) Example: Lbl1, Lbl2, and their intersection (containing holes). (b) The LBLs covering the difference Lbl1 minus the intersection, together with the LBL covering the holes of the intersection, make the difference Lbl1 minus Lbl2.

Case 2: Other LBLs covering the difference are derived by negating, one by one, the constraints (3) that define the iterator polytope of the intersection. For instance, for the difference Lbl1 minus Lbl2, the constraints derived from Lbl2 are negated and subsequently translated into inequalities between the iterators of Lbl1. In this way, the resulting LBL (which is, basically, Lbl1 with an additional constraint between its iterators) will be included in Lbl1, but will be disjoint from the intersection. This operation is described as follows: let an inequality with integer coefficients be the one considered for negation; since the solution matrix is now a reduced Hermite form, this inequality can be re-expressed,

after a possible reordering, in terms of the iterators of the minuend LBL; the LBL considered for the difference cover will contain this inequality negated. Cases 1 and 2 generate LBLs which are outside the intersection boundary, but still inside the minuend LBL. The next case will build LBLs inside the boundary of the intersection when this intersection is not dense (that is, when it does not contain all the lattice points inside or on its boundary).

Case 3: Consider the vector function of the intersection (4). Assume that its matrix has linearly independent rows (otherwise, the linearly dependent ones will be eliminated), and bring this matrix to the HNF [33]. Assuming that its rank is equal to its number of columns, if all the diagonal coefficients are equal to 1, then the intersection LBL is dense [35] and we are done with this case. But if some diagonal coefficient is larger than 1, to cover the holes inside the boundary of the intersection, we build LBLs having the same mapping as the intersection except for a shift of the corresponding index.

Example: Consider two array references of the same 2-D signal, each placed in the scope of its own loop nest, one of them guarded by a conditional instruction. The lattices of the two array references are two LBLs whose iterator spaces can be written in nonmatrix format for economy of space. The elements of the array covered by the two LBLs are shown with black dots in Fig. 7(a). Finding whether the two lattices have common elements starts with solving the linear Diophantine system obtained by equalizing the indices of the two array references. The general solution of this system (5) is a parametric expression of the iterators of both references in terms of new free parameters. Substituting (5) in the iterator spaces of the two references, the constraints (3) are obtained. Since the constraints have integer solutions, the two array references do intersect [see Fig. 7(a)], and their intersection is the LBL given by (4).


Let us focus on the decomposition of the difference of the two lattices. Since, from the solution of the Diophantine system, the rows of the solution matrix are linearly independent, there are no LBLs in the difference generated according to Case 1. Now, according to Case 2, we take the inequalities of the intersection and we select one inequality at a time; these inequalities are expressed in the iterators of the minuend reference, and the selected one is negated. If the resulting Z-polytope is not empty, a new LBL included in the difference is created. For instance, one of the inequalities, once translated and negated, yields an LBL, shown in Fig. 7(b), which belongs to the difference. Two other LBLs with the same mapping but different iterator spaces are found in a similar way; but since they are both included in the first one, they are disregarded, in order to keep the number of LBLs in the difference as small as possible.

Since the index space of the intersection is not dense (its mapping matrix, already in the HNF, has the first diagonal element larger than 1 [35]), whereas the index space of the minuend is dense [see Fig. 7(a)], we proceed to Case 3. First, the boundary of the intersection is computed. Afterwards, an LBL having the mapping shifted one unit at the first index is built, its parameters being constrained so that the shifted points still lie inside the boundary. This LBL, covering the holes of the intersection and shown in Fig. 7(b), belongs to the difference as well. Therefore, the difference of the two array references is the union of these LBLs. With a similar analysis, the difference taken in the opposite order is obtained; the LBL resulting from Case 2 is shown in Fig. 7(b).

B. Computation of the Maximum Iterator Vector of an Array Element

Given an array reference in the scope of the iterator polytope derived from the loop boundaries and conditional instructions, find the maximum iterator vector (the loops being normalized) accessing a given array element.

Step 1) Solve the Diophantine system [33] obtained by setting the indices of the array reference equal to those of the given element. If the system has no solution, there is no iterator vector accessing the element; if it has a unique solution, then there is a unique iterator vector accessing the element, provided that it satisfies the iterator polytope. If the solution of the system is not unique, it has a general parametric form; assume that the Z-polytope obtained by substituting this form into the iterator polytope is not empty (otherwise, there is no solution either).

Step 2) Bring the solution matrix to a reduced Hermite form [34] by post-multiplying it with a unimodular matrix (less a possible row permutation); the resulting matrix is lower-triangular, with positive diagonal elements.

Step 3) Project the new Z-polytope on the first coordinate axis and find the maximum coordinate in the exact shadow [36]. Replace its value in the Z-polytope and repeat this operation, finding the maximum of each following coordinate in turn. The maximum iterator vector is then obtained by substituting the vector of these values into the parametric solution.

Example: Consider a two-level loop nest with normalized iterators and an array reference whose index is a linear function of the two iterators, and assume we need to compute the maximum iterator vector such that a given element is accessed. The solution of the Diophantine equation has a general parametric form [33]. Post-multiplying its matrix with a unimodular matrix, we obtain a reduced Hermite form [34]. It is quite obvious that, choosing the first parameter as large as possible, the first iterator will be maximized; afterwards, choosing the second parameter as large as possible, the second iterator will be maximized as well. Replacing the iterators from the previous matrix equation in the iterator polytope derived from the loop boundaries, a Z-polytope in the two parameters is obtained. Projecting this Z-polytope on the first axis, the maximum lattice point of the projection (exact shadow [36]) is found; for this value, the maximum of the second parameter is then determined. Replacing these values in the previous matrix equation, the maximum iterator vector is obtained.

Remark: Taking the minimum values of the two parameters instead, the minimum iterator vector is obtained.
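The result of this procedure can be checked on small instances by a brute-force scan of the iterator polytope in lexicographic order, keeping the last iterator vector mapped to the wanted element. The bounds, the index function i + j, and the target element below are invented for illustration; the tool obtains the same vector analytically, without any enumeration.

    #include <iostream>

    int main() {
        // Hypothetical normalized loop nest:
        //   for (i = 0; i <= 5; i++)
        //     for (j = 0; j <= 5; j++)  ... A[i + j] ...
        const int target = 7;            // we look for the last access to A[7]
        int bestI = -1, bestJ = -1;
        bool found = false;
        for (int i = 0; i <= 5; ++i)
            for (int j = 0; j <= 5; ++j)
                if (i + j == target) {   // this iteration accesses A[target]
                    bestI = i;           // the scan is already in lexicographic
                    bestJ = j;           // order, so the last hit is the maximum
                    found = true;
                }
        if (found)
            std::cout << "A[" << target << "] is accessed for the last time at (i, j) = ("
                      << bestI << ", " << bestJ << ")\n";   // prints (5, 2)
        return 0;
    }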


As already mentioned, the current implementation works only for single-assignment specifications.7 However, it would be possible to process also programs which are not in the single-assignment form: using minimum iterator vectors in Step 3.3, one can find out the iterations when the array elements are written for the first time, exactly the opposite of using the maximum iterator vectors to find out when the array elements are read for the last time. Here, we assumed that an array element is alive from the first time it is written to the last time it is read, even if it is dead for some parts of this period and is written again.

7 Checking whether the code is in the single-assignment form is quite easy in our framework: each LBL must be written at most once and each produced array reference must have a bijective mapping.

C. Computation of the Size of the Lattices

When the mapping between the iterator space and the index space is bijective, the problem reduces to the computation of the number of lattice points (i.e., points having integer coordinates) in a polytope; otherwise, it reduces to counting the points in a projection of a polytope [29], [36]. Counting the lattice points in a polytope can be done in several ways: there are methods based on Ehrhart polynomials [37] like, for instance, [26] and [38], or even much simpler ones, adapting the Fourier-Motzkin technique [39], [40]. For reasons of scalability, we use a computation technique based on the decomposition of a simplicial cone into unimodular cones [41], using an implementation proposed by [42]. In the rather numerous cases of 2-D integral polytopes (i.e., with vertices having integer coordinates), Pick's theorem (1899) [43] is used instead: if P is an integral polytope, the number of lattice points in P is

    Area(P) + B(P)/2 + 1

where B(P) is the number of lattice points on the boundary of P. If P has vertices (x1, y1), ..., (xv, yv) with integer coordinates, labeled counterclockwise, then

    Area(P) = (1/2) [ (x1.y2 - x2.y1) + (x2.y3 - x3.y2) + ... + (xv.y1 - x1.yv) ]

and

    B(P) = gcd(|x2 - x1|, |y2 - y1|) + gcd(|x3 - x2|, |y3 - y2|) + ... + gcd(|x1 - xv|, |y1 - yv|)

where gcd(a, b) denotes the greatest common divisor of a and b. The polytope vertices are determined with the reverse search algorithm [44].
V. COMPLEXITY ASPECTS

The complexity of the memory size computation algorithm depends on the complexity of the significant operations with Z-polytopes and lattices. The array references of each indexed signal are recursively intersected. Each (LBL) intersection implies solving first a Diophantine linear system (see Section IV-A), the complexity of this operation being dominated by the computation of the HNF of the system matrix. The more recent algorithms computing the HNF have polynomial complexity,8 being equipped with different mechanisms of preventing the significant increase of the numbers involved in the intermediate computations. If the Diophantine system is compatible, the operation continues with the intersection of their iterator polytopes, which is simply given by the concatenation of their constraints. It follows that the LBL intersection has polynomial complexity. The difference of two LBLs (see Section IV-A) is polynomial as well: the first two cases are quite straightforward, and the number of LBLs generated by Case 3 depends on the diagonal values of an HNF, but these values are polynomial in the input size [46].

8 For instance, the Storjohann-Labahn algorithm [45] requires O~(n^(theta-1).m.B(n log ||A||)) bit operations, where B(t) is the number of bit operations needed to multiply two t-bit integers, and theta is the exponent for matrix multiplication over rings (theta = 3 for standard multiplication, but theta = 2.38 for the best known multiplication algorithm). Here, O~ is the soft-Oh notation: f = O~(g) if and only if f = O(g.log^c g) for some constant c > 0.

The complexity of counting the number of lattice points in a Z-polytope is dominated by the computation of its generating function [41]. If the dimension of the polytope is fixed, this operation takes time polynomial in the input size [47], the input size being the number of bits needed to write down the polytope inequalities in the binary system.

It must be noted that, in the worst case, the number of LBLs of an indexed signal can grow exponentially with the number of its array references. However, this theoretical upper-bound is overly pessimistic since in typical programs there are array references having the same LBL and, also, many array references are disjoint from each other. Typically, the total number of LBLs grows linearly with the number of array references, in all our experiments the factor being less than four. In the illustrative example from Fig. 1, signal A has four array references and the total number of LBLs is 11, out of which eight are disjoint [see Fig. 4(a)].


Fig. 8. Memory trace for the SVD updating (N = 25) algorithm [49]. The abscissae are the numbers of datapath instructions in the code, the ordinates are memory locations. (a) The entire memory trace. The global maximum is at the point (x = 48848; y = 2175). (b) Detailed trace in the interval [2200, 9100], which corresponds to the end of the second loop nest (the QR update) and the start of the third one (the SVD diagonalization). (c) Detailed trace in the interval [5595, 7215] covering the end of the second and start of the third iterations in the SVD diagonalization block.

Fig. 9. Memory trace for a 2-D Gaussian blur lter (N = 100, M = 50) algorithm from a medical image processing application which extracts contours from tomograph images in order to detect brain tumors. (a) The entire trace in the interval [0, 48025]. The global maximum is at the point (x = 5; y = 5005). (b) Detailed trace in the interval [23800, 24900], which corresponds to the end of the horizontal Gaussian blur and the start of the vertical Gaussian blur. (c) Detailed trace in the interval [25100, 25550] covering the fourth inner-loop iteration in the part of the code performing the vertical Gaussian blur. TABLE I EXPERIMENTAL RESULTS: COLUMNS MEMORY SIZE AND CPU REFER TO THE EXACT MEMORY SIZE COMPUTATIONS FOR PROCEDURAL SPECIFICATIONS; COLUMNS RELATIVE MEM. UPPER-BOUNDS AND CPU REFER TO THE ESTIMATION MODE FOR NONPROCEDURAL SPECIFICATIONS

nication receivers, in beamforming, and Kalman ltering; and 6) the kernel of a voice coding applicationcomponent of a mobile radio terminal. Columns # Array Refs. and # Scalars display the numbers of array references and, respectively, scalar signals (array elements) in the specications. Column Memory Size displays the exact storage requirements (numbers of memory loca displays the corresponding running tions) and column times. Column Relative Mem. Upper-Bounds shows how much larger are the storage upper-bounds9 when the tool operates in the estimation mode for nonprocedural specications;
9This data should not be mistaken for estimation errors. Since in nonprocedural specications, the computation order of instructions is not (fully) xed, each possible order yields a different amount of storage. These data are upperbounds over all possible orderings (while keeping the same loop structure). In general, the estimation errors are due to different approximations during the computation ow; but this approach does not do any approximation, it is exact.

these relative upper-bounds are computed relative to the exact memory size given in column Memory Size (as explained and illustrated at Step 3.2 in Section III-B). The corresponding running times for this second set of experiments are given in the corresponding CPU column.
This tool can process large specifications in terms of number of loop nests, lines of code, and number of array references. For instance, the voice coding application contains 232 array references organized in 40 loop nests. In one of our experiments, the tool processed a difficult example [48] of about 900 lines of code, with 113 loop nests three levels deep and a total of 906 array references (many having complex indices), yielding a total of 3159 LBLs, in less than 6 min. The illustrative example in Fig. 1, which is quite complex though not very large, was processed in less than 2 s.
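As a purely numerical illustration of how column Relative Mem. Upper-Bounds can be read (the percentage formula below is our assumption of the natural relative-difference definition, and the upper-bound value 2880 is invented rather than taken from Table I): if the exact procedural computation yields M = 2740 locations and the estimation mode reports an upper-bound of 2880 locations for the same specification, the reported relative upper-bound would be

\[
\frac{2880 - 2740}{2740} \times 100\% \approx 5.1\%,
\]

which is consistent with the observation below that these relative upper-bounds usually stay under 10%.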



Doing a comparative overview with respect to previous nonscalar works, the following facts are worth emphasizing.
1) Part of the previous works impose important constraints on the properties of the algorithmic specifications they can process. For instance, Ramanujam et al. address only specifications with perfectly nested loops (i.e., in which every statement is inside the innermost loop) [17]. Verbauwhede et al. consider loops with nonconstant boundaries, but in such cases the varying boundaries are internally replaced with upper- and lower-bounds in order to fit their computational model [14]. In addition, the presence of delays adds, in their model, an extra dimension to the signals (which, in general, does affect the computation time). In comparison, our model handles the entire class of affine specifications, as described in Section II. Moreover, our approach processes the delayed array references without adding any extra dimension to the respective signals.
2) All the previous works except [19] use benchmarks with a relatively small number (up to tens of thousands) of scalars. These works do not offer any information on the scalability of their techniques. From our past experience, even if a memory size estimation technique behaves reasonably well when dealing with examples containing thousands of scalars, the computation time can increase sharply until it becomes ineffectual for examples where the number of scalars is larger by one to two orders of magnitude. In comparison, our approach can handle applications with millions of scalars in acceptable times. The running times of our approach also increase significantly with the size of the problem, but our tool is still effective when processing examples with at least one (and often two to three) order(s) of magnitude more scalars than in previous works.
3) Part of the previous works do not report running times [17], [19]. Although [19] is the only previous work, to the best of our knowledge, reporting results on a complex application (the MPEG-4 motion estimation kernel) with a significant amount of scalars (262,400), the running time information is missing.
4) The previous memory estimation techniques may yield storage results that are sometimes very inaccurate. Verbauwhede et al. state that their determinations are exact when the loop boundaries are constant, but that overestimates occur when they are not [14]. However, they do not report any concrete results on the amount of the overestimates they experienced. Ramanujam et al. report exact determinations for all their benchmarks, except one exhibiting an estimation error of 13% [17]. Zhao and Malik obtained an estimation of 1372 memory locations for the motion detection kernel, for a given set of parameters [16]. This is a rather poor estimation since the correct result for the same set of parameters is 2740 storage locations (see Table I). Conversely, our tool performs exact determinations for procedural codes.
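Some quick arithmetic on the two figures just quoted puts this inaccuracy in perspective; the values 1372 and 2740 are those stated above, and the percentage is simply their relative difference:

\[
\frac{2740 - 1372}{2740} \times 100\% \approx 49.9\%,
\]

i.e., the reported estimate covers only about half of the storage actually needed.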

Even in the estimation mode, when the specifications are nonprocedural, the relative memory upper-bounds are usually under 10%. One could argue that it may be better to obtain fast estimations, even if not very accurate, rather than exact determinations requiring a significantly higher computation effort. This argument is fair enough, but it does not apply to the present situation. The aforementioned estimation result was obtained in 21 s on a Sun Enterprise 4000 machine with four (336-MHz UltraSparc) processors and 4-GB memory [16], whereas our computation time was only 2 s on the Athlon XP PC and 7 s on a Sun Ultra 20. Therefore, not only did our approach find the exact result, but it also did so faster than the estimation technique from [16].

VII. CONCLUSION

This paper has presented a nonscalar approach for computing the memory size for data-dominant telecom and multimedia applications, where the storage of large multidimensional signals causes a significant cost in terms of both area and power consumption. This method uses modern elements in the theory of polyhedra and algebraic techniques specific to the data-flow analysis used nowadays in compilers. Different from past works, which perform memory size estimations, this approach does exact evaluations for high-level procedural specifications, as well as accurate estimations for nonprocedural ones.

REFERENCES
[1] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle, Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Boston, MA: Kluwer, 1998.
[2] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986.
[3] C. J. Tseng and D. Siewiorek, "Automated synthesis of data paths in digital systems," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. CAD-5, no. 3, pp. 379-395, Jul. 1986.
[4] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man, "An efficient microcode compiler for custom DSP processors," in Proc. IEEE Int. Conf. Comput.-Aided Design, 1987, pp. 24-27.
[5] F. J. Kurdahi and A. C. Parker, "REAL: A program for register allocation," in Proc. 24th ACM/IEEE Design Autom. Conf., 1987, pp. 210-215.
[6] A. Hashimoto and J. Stevens, "Wire routing by optimizing channel assignment within large apertures," in Proc. 8th Design Autom. Workshop, 1971, pp. 155-169.
[7] G. Goossens, "Optimization techniques for automated synthesis of application-specific signal-processing architectures," Ph.D. dissertation, Dept. ESAT, K.U. Leuven, Leuven, Belgium, 1989.
[8] P. G. Paulin and J. P. Knight, "Force-directed scheduling for the behavioral synthesis of ASICs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 8, no. 6, pp. 661-679, Jun. 1989.
[9] C. H. Gebotys and M. I. Elmasry, Optimal VLSI Architectural Synthesis. Boston, MA: Kluwer, 1992.
[10] L. Stok and J. Jess, "Foreground memory management in data path synthesis," Int. J. Circuit Theory Appl., vol. 20, pp. 235-255, 1992.
[11] K. K. Parhi, "Calculation of minimum number of registers in arbitrary life time chart," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 41, no. 6, pp. 434-436, Jun. 1994.
[12] S. Y. Ohm, F. J. Kurdahi, and N. Dutt, "Comprehensive lower bound estimation from behavioral descriptions," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, 1994, pp. 182-187.
[13] D. Gajski, F. Vahid, S. Narayan, and J. Gong, Specification and Design of Embedded Systems. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[14] I. Verbauwhede, C. Scheers, and J. M. Rabaey, "Memory estimation for high level synthesis," in Proc. 31st ACM/IEEE Design Autom. Conf., 1994, pp. 143-148.
[15] P. Grun, F. Balasa, and N. Dutt, "Memory size estimation for multimedia applications," in Proc. 6th Int. Workshop Hardw./Softw. Co-Design, 1998, pp. 145-149.



[16] Y. Zhao and S. Malik, "Exact memory size estimation for array computations," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 5, pp. 517-521, Oct. 2000.
[17] J. Ramanujam, J. Hong, M. Kandemir, and A. Narayan, "Reducing memory requirements of nested loops for embedded systems," in Proc. 38th ACM/IEEE Design Autom. Conf., 2001, pp. 359-364.
[18] F. Balasa, F. Catthoor, and H. De Man, "Background memory area estimation for multi-dimensional signal processing systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 2, pp. 157-172, Jun. 1995.
[19] P. G. Kjeldsberg, F. Catthoor, and E. J. Aas, "Data dependency size estimation for use in memory optimization," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 22, no. 7, pp. 908-921, Jul. 2003.
[20] F. Catthoor, E. Brockmeyer, K. Danckaert, C. Kulkarni, L. Nachtergaele, and A. Vandecappelle, "Custom memory organization and data transfer: Architecture issues and exploration methods," in VLSI Section of Electrical and Electronics Engineering Handbook. New York: Academic, 2000.
[21] P. R. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, and P. G. Kjeldsberg, "Data and memory optimization techniques for embedded systems," ACM Trans. Design Autom. Electron. Syst., vol. 6, no. 2, pp. 149-206, Apr. 2001.
[22] S. S. Muchnick, Advanced Compiler Design and Implementation. San Mateo, CA: Morgan Kaufmann, 1997.
[23] A. Darte, R. Schreiber, and G. Villard, "Lattice-based memory allocation," IEEE Trans. Comput., vol. 54, no. 11, pp. 1242-1257, Oct. 2005.
[24] I. I. Luican, H. Zhu, and F. Balasa, "Signal-to-memory mapping analysis for multimedia signal processing," in Proc. Asia South-Pacific Design Autom. Conf., 2007, pp. 486-491.
[25] L. Thiele, "Compiler techniques for massive parallel architectures," in State-of-the-Art in Computer Science. Norwell, MA: Kluwer, 1992.
[26] P. Clauss and V. Loechner, "Parametric analysis of polyhedral iteration spaces," J. VLSI Signal Process., vol. 19, no. 2, pp. 179-194, 1998.
[27] P. D'Alberto, A. Veidembaum, A. Nicolau, and R. Gupta, "Static analysis of parametrized loop nests for energy efficient use of data caches," presented at the Workshop Compilers Operat. Syst. Low Power, Barcelona, Spain, 2001.
[28] V. Loechner, B. Meister, and P. Clauss, "Precise data locality optimization of nested loops," J. Supercomput., vol. 21, no. 1, pp. 37-76, 2002.
[29] S. Verdoolaege, K. Beyls, M. Bruynooghe, and F. Catthoor, "Experiences with enumeration of integer projections of parametric polytopes," in Proc. Compiler Construction: 14th Int. Conf., 2005, pp. 91-105.
[30] W. Pugh, "Counting solutions to Presburger formulas: How and why," in Proc. SIGPLAN Conf. Programming Language Design Implementation, 1994, pp. 121-134.
[31] P. Feautrier, "Parametric integer programming," Operat. Res., vol. 22, no. 3, pp. 243-268, 1988.
[32] H. Zhu, I. I. Luican, and F. Balasa, "Memory size computation for multimedia processing applications," in Proc. Asia South-Pacific Design Autom. Conf., 2006, pp. 802-807.
[33] A. Schrijver, Theory of Linear and Integer Programming. New York: Wiley, 1986.
[34] M. Minoux, Mathematical Programming: Theory and Algorithms. New York: Wiley, 1986.
[35] W. Li and K. Pingali, "A singular loop transformation framework based on non-singular matrices," in Proc. 5th Annu. Workshop Lang. Compilers Parallelism, 1992, pp. 249-260.
[36] W. Pugh and D. Wonnacott, "Experiences with constraint-based array dependence analysis," Principles Practice Constraint Program., vol. 874, pp. 312-325, 1994.
[37] E. Ehrhart, "Polynômes arithmétiques et méthode des polyèdres en combinatoire," in International Series of Numerical Mathematics. Stuttgart, Germany: Birkhäuser-Verlag, 1977, p. 35.
[38] S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe, "Analytical computation of Ehrhart polynomials: Enabling more compiler analyses and optimizations," in Proc. Int. Conf. Compilers, Arch., Synthesis Embed. Syst., 2004, pp. 248-258.
[39] G. B. Dantzig and B. C. Eaves, "Fourier-Motzkin elimination and its dual," J. Combinatorial Theory (A), vol. 14, pp. 288-297, 1973.
[40] W. Pugh, "A practical algorithm for exact array dependence analysis," Commun. ACM, vol. 35, no. 8, pp. 102-114, Aug. 1992.
[41] A. I. Barvinok, "A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed," Math. Operat. Res., vol. 19, no. 4, pp. 769-779, Nov. 1994.

[42] J. A. De Loera, R. Hemmecke, J. Tauzer, and R. Yoshida, "Effective lattice point counting in rational convex polytopes," J. Symbolic Comput., vol. 38, no. 4, pp. 1273-1302, 2004.
[43] J. C. Lagarias, "Point lattices," in Handbook of Combinatorics. Amsterdam, The Netherlands: Elsevier, 1995.
[44] D. Avis, "LRS: A revised implementation of the reverse search vertex enumeration algorithm," in Polytopes: Combinatorics and Computation. Basel/Stuttgart: Birkhäuser-Verlag, 2000, pp. 177-198.
[45] A. Storjohann and G. Labahn, "Asymptotically fast computation of the Hermite Normal Form of an integer matrix," in Proc. Int. Symp. Symbolic Algebraic Comput., 1996, pp. 259-266.
[46] D. Micciancio and B. Warinschi, "A linear space algorithm for computing the Hermite Normal Form," in Proc. Int. Symp. Symbolic Algebraic Comput., 2001, pp. 231-236.
[47] A. I. Barvinok and J. Pommersheim, "An algorithmic theory of lattice points in polyhedra," in New Perspectives in Algebraic Combinatorics. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[48] Benchmarks, Univ. Illinois, Chicago, 2007 [Online]. Available: http://www.cs.uic.edu/~iluican/benchmarks.html
[49] M. Moonen, P. Van Dooren, and J. Vandewalle, "SVD updating for tracking slowly time-varying systems," in Proc. SPIE Adv. Algorithms Arch. Signal Process., 1989, vol. 1152.

Florin Balasa (M'95) received the M.S. and Ph.D. degrees in computer science from the Polytechnical University of Bucharest, Bucharest, Romania, in 1981 and 1994, respectively, the M.S. degree in mathematics from the University of Bucharest, Bucharest, Romania, in 1990, and the Ph.D. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1995.
He is currently an Assistant Professor of Computer Science at the University of Illinois at Chicago (UIC), Chicago. From 1990 to 1995, he was with the Interuniversity Microelectronics Center (IMEC), Leuven, Belgium. From 1995 to 1999, he was a Senior Design Automation Engineer at the Advanced Technology Division of Conexant Systems (formerly Rockwell Semiconductor Systems), Newport Beach, CA. He holds two U.S. patents and he coauthored the book Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design (Kluwer, 1998). His research interests are mainly focused on high-level synthesis and physical design automation.
Dr. Balasa was a recipient of the National Science Foundation CAREER Award.

Hongwei Zhu received the B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, P.R. China, in 1996, and the M.S. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2001. He is currently pursuing the Ph.D. degree in the Department of Computer Science, University of Illinois at Chicago (UIC), Chicago.
He is currently a Research Assistant at UIC. Before joining UIC, he worked as a Senior R&D Engineer at JVC Asia Pte. Ltd., Singapore. His research interests are mainly focused on memory management for real-time multidimensional signal processing and combinatorial optimization in VLSI CAD.

Ilie I. Luican (S'06) received the B.S. and M.S. degrees in computer science from the Polytechnical University of Bucharest (PUB), Bucharest, Romania, in 2002 and 2003, respectively. He is currently pursuing the Ph.D. degree in the Department of Computer Science, University of Illinois at Chicago (UIC), Chicago.
He is currently a Research Assistant at UIC. From 2003 to 2004, he was with the Automatic Control Department, PUB.
His research interests are mainly focused on high-level synthesis and memory management for real-time multidimensional signal processing.
