
LOOP UNROLLING

INTRODUCTION: Loop unrolling rewrites a loop so that one pass through the new loop body performs the work of several iterations of the original loop. It works by replicating the body of a loop some (machine- and code-dependent) number of times and scheduling the resulting code as a single basic block. Replicating the loop body has two performance advantages:

* Producing a larger loop body provides a larger block of instructions for the scheduler to work with, which gives the scheduler more options when positioning operations.
* Combining multiple iterations allows induction variable computations to be combined.

These performance improvements are traded against the potential penalty caused by increased I-cache misses on the larger loop body.

Loop unrolling is used to minimize the stalls encountered inside loops and to remove the overhead of redundant loop-control branches. To keep the pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a stall, a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. The basic MIPS integer pipeline is assumed, with branches having a delay of one clock cycle. For floating-point operations the following latencies are assumed:

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store Double               2
Load Double                    FP ALU op                  1
Load Double                    Store Double               0

Table 1

Before performing loop unrolling, the following assumptions are made about the pipeline:

* Functional units are fully pipelined or replicated (as many times as the pipeline depth).
* An operation of any type can be issued on every clock cycle, and there are no structural hazards.
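As a rough illustration of how Table 1 is used, the sketch below encodes the four latencies and computes how many stall cycles a dependent instruction needs when some number of independent instructions already separate it from its producer. The names Op, latency and stalls_needed are hypothetical and chosen only for this example; pairs that Table 1 does not list are assumed to have zero latency.

```cpp
#include <cassert>

// Instruction classes that appear in Table 1.
enum class Op { FpAlu, LoadDouble, StoreDouble };

// Latency in clock cycles between a producer and a dependent consumer,
// taken from Table 1. Pairs that Table 1 does not list return 0 here
// (an assumption of this sketch).
int latency(Op producer, Op consumer) {
    if (producer == Op::FpAlu && consumer == Op::FpAlu) return 3;
    if (producer == Op::FpAlu && consumer == Op::StoreDouble) return 2;
    if (producer == Op::LoadDouble && consumer == Op::FpAlu) return 1;
    if (producer == Op::LoadDouble && consumer == Op::StoreDouble) return 0;
    return 0;
}

// Stall cycles needed when `between` independent instructions already sit
// between the producer and the consumer.
int stalls_needed(Op producer, Op consumer, int between) {
    assert(between >= 0);
    int gap = latency(producer, consumer) - between;
    return gap > 0 ? gap : 0;
}
```

For example, an FP ALU op placed directly after the load that feeds it needs one stall, and a store placed directly after the FP ALU op that feeds it needs two.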

Procedure of loop unrolling:


Let us consider the following C++ loop, which adds a scalar value to an array in memory:

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

STEP (1): Convert the above segment of code to DLX assembly language:

        Op code   Operands        Operation
Loop:   L.D       F0, 0(R1)     ; F0 = array element
        ADD.D     F4, F0, F2    ; add scalar in F2
        S.D       0(R1), F4     ; store result
        SUBI      R1, R1, #8    ; decrement pointer (8 bytes)
        BNEZ      R1, Loop      ; branch if R1 != zero
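The assembly walks the array from its highest element down to the lowest and stops when R1 reaches zero, which is not how the C++ loop is written. The sketch below puts the two forms side by side. It assumes, for illustration only, that the array base sits at address 0 so that R1 equals 8 times the index; the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// The loop as written in the text: elements x[1]..x[n] (x[0] is unused).
void add_scalar(std::vector<double>& x, int n, double s) {
    assert(n >= 0 && static_cast<int>(x.size()) > n);
    for (int i = 1; i <= n; i++)
        x[i] = x[i] + s;
}

// The same work in the order the assembly performs it: the index starts at
// the last element, steps down by one element (8 bytes in the assembly),
// and the loop ends when it reaches zero.
void add_scalar_countdown(std::vector<double>& x, int n, double s) {
    assert(n >= 1 && static_cast<int>(x.size()) > n);  // body runs at least once
    int i = n;               // plays the role of R1 / 8 (assumed base address 0)
    do {
        double f0 = x[i];    // L.D   F0, 0(R1)
        double f4 = f0 + s;  // ADD.D F4, F0, F2
        x[i] = f4;           // S.D   0(R1), F4
        i = i - 1;           // SUBI  R1, R1, #8
    } while (i != 0);        // BNEZ  R1, Loop
}
```

Both functions leave the array in the same state; only the traversal order and the loop test differ.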

STEP (2): Arrange the code generated in Step (1), inserting the stalls implied by Table 1 for the floating-point operations:

        Op code   Operands          Issued cycle
Loop:   L.D       F0, 0(R1)         1
        stall                       2
        ADD.D     F4, F0, F2        3
        stall                       4
        stall                       5
        S.D       0(R1), F4         6
        SUBI      R1, R1, #8        7
        BNEZ      R1, Loop          8
        stall                       9

Clock cycles/element: 9
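The cycle count of Step (2) can be reproduced with a small model of the pipeline. The sketch below is a simplified, hypothetical model and not a real simulator: it assumes single issue in program order, the Table 1 latencies, zero latency for every other dependence (including SUBI to BNEZ), and one extra cycle when the branch delay slot is left empty. All type and function names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Kind { Load, FpAlu, Store, Int, Branch };

struct Instr {
    Kind kind;
    std::string dest;               // register written ("" if none)
    std::vector<std::string> srcs;  // registers read
};

struct Producer {
    int cycle;  // cycle in which the producing instruction issued
    Kind kind;
};

// Table 1 latencies; every other producer/consumer pair is assumed to be 0.
int latency(Kind producer, Kind consumer) {
    if (producer == Kind::FpAlu && consumer == Kind::FpAlu) return 3;
    if (producer == Kind::FpAlu && consumer == Kind::Store) return 2;
    if (producer == Kind::Load && consumer == Kind::FpAlu) return 1;
    return 0;
}

// Cycles for one pass through a loop body on a single-issue, in-order pipeline.
int cycles_per_iteration(const std::vector<Instr>& body) {
    std::map<std::string, Producer> last_write;
    int cycle = 0;
    for (const Instr& in : body) {
        int issue = cycle + 1;  // earliest slot after the previous instruction
        for (const std::string& r : in.srcs) {
            auto it = last_write.find(r);
            if (it != last_write.end()) {
                int ready = it->second.cycle + 1 + latency(it->second.kind, in.kind);
                issue = std::max(issue, ready);  // wait out the latency
            }
        }
        if (!in.dest.empty()) last_write[in.dest] = Producer{issue, in.kind};
        cycle = issue;
    }
    // An unfilled branch delay slot costs one more cycle.
    if (!body.empty() && body.back().kind == Kind::Branch) cycle += 1;
    return cycle;
}

// The loop of Step (1), in program order.
std::vector<Instr> step1_loop() {
    return {
        {Kind::Load,   "F0", {"R1"}},        // L.D   F0, 0(R1)
        {Kind::FpAlu,  "F4", {"F0", "F2"}},  // ADD.D F4, F0, F2
        {Kind::Store,  "",   {"R1", "F4"}},  // S.D   0(R1), F4
        {Kind::Int,    "R1", {"R1"}},        // SUBI  R1, R1, #8
        {Kind::Branch, "",   {"R1"}},        // BNEZ  R1, Loop
    };
}
```

Under these assumptions, calling cycles_per_iteration(step1_loop()) yields the 9 cycles per element listed for Step (2). Passing the instructions in a different order models a rescheduled loop.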

STEP (3): Schedule the code generated in Step (2) to minimize the stalls (idle clock cycles) caused by the floating-point operations and the delayed branch:

        Op code   Operands          Issued cycle
Loop:   L.D       F0, 0(R1)         1
        stall                       2
        ADD.D     F4, F0, F2        3
        SUBI      R1, R1, #8        4
        BNEZ      R1, Loop          5    (delayed branch)
        S.D       8(R1), F4         6    (interchanged with SUBI, so the offset becomes 8)

Clock cycles/element: 6

STEP (4):

Now unroll the loop as many times as desired (here it is unrolled 4 times) to avoid stalls, for both the unscheduled and the scheduled code. After unrolling there are 4 copies of the loop body. R1 is assumed to be initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate the name dependencies by renaming the registers, and remove the redundant computations present in the unrolled code (for both the scheduled and the unscheduled code).

Unscheduled Code:

        Op code   Operands          Issued cycle
Loop:   L.D       F0, 0(R1)         1
        stall                       2
        ADD.D     F4, F0, F2        3
        stall                       4
        stall                       5
        S.D       0(R1), F4         6
        L.D       F6, -8(R1)        7
        stall                       8
        ADD.D     F8, F6, F2        9
        stall                       10
        stall                       11
        S.D       -8(R1), F8        12
        L.D       F10, -16(R1)      13
        stall                       14
        ADD.D     F12, F10, F2      15
        stall                       16
        stall                       17
        S.D       -16(R1), F12      18
        L.D       F14, -24(R1)      19
        stall                       20
        ADD.D     F16, F14, F2      21
        stall                       22
        stall                       23
        S.D       -24(R1), F16      24
        SUBI      R1, R1, #32       25
        BNEZ      R1, Loop          26
        stall                       27

Clock cycles/iteration: 27
Clock cycles/element: 27/4 ≈ 6.8

Scheduled Code:

        Op code   Operands          Issued cycle
Loop:   L.D       F0, 0(R1)         1
        L.D       F6, -8(R1)        2
        L.D       F10, -16(R1)      3
        L.D       F14, -24(R1)      4
        ADD.D     F4, F0, F2        5
        ADD.D     F8, F6, F2        6
        ADD.D     F12, F10, F2      7
        ADD.D     F16, F14, F2      8
        S.D       0(R1), F4         9
        S.D       -8(R1), F8        10
        S.D       -16(R1), F12      11
        SUBI      R1, R1, #32       12
        BNEZ      R1, Loop          13
        S.D       8(R1), F16        14   (8 - 32 = -24)

Clock cycles/iteration: 14
Clock cycles/element: 14/4 = 3.5
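At the source level, the scheduled unrolled loop corresponds roughly to the sketch below: four loads into distinct temporaries, four independent adds, four stores, and a single index update and test per four elements. The function name is hypothetical, the temporaries are named after the registers used in the listing, and, like the text, the sketch assumes the element count is a multiple of 4. A real compiler is free to reorder these statements, so this shows only the intent.

```cpp
#include <cassert>
#include <vector>

// x[1]..x[n] are updated; n must be a positive multiple of 4, mirroring the
// text's assumption that R1 starts as a multiple of 32.
void add_scalar_unrolled4(std::vector<double>& x, int n, double s) {
    assert(n > 0 && n % 4 == 0 && static_cast<int>(x.size()) > n);
    for (int i = n; i != 0; i -= 4) {  // one SUBI/BNEZ pair per 4 elements
        // Four loads into distinct temporaries (F0, F6, F10, F14).
        double f0  = x[i];
        double f6  = x[i - 1];
        double f10 = x[i - 2];
        double f14 = x[i - 3];
        // Four independent adds (F4, F8, F12, F16).
        double f4  = f0  + s;
        double f8  = f6  + s;
        double f12 = f10 + s;
        double f16 = f14 + s;
        // Four stores.
        x[i]     = f4;
        x[i - 1] = f8;
        x[i - 2] = f12;
        x[i - 3] = f16;
    }
}
```

Using a separate temporary for each copy plays the same role as register renaming in the assembly: no copy has to wait for another copy to finish with a shared name.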

Precautions to be taken while loop unrolling:

* Determine that it is legal to move the S.D instruction after the SUBI and BNEZ, and find the amount by which to adjust the S.D offset.
* Confirm that unrolling the loop is useful by establishing that the loop iterations are independent, except for the loop maintenance code.
* Use different registers for the different copies of the loop body to avoid name dependencies.
* Eliminate the extra tests and branches and adjust the loop maintenance code.
* Analyze the memory addresses to confirm that loads and stores from different iterations do not conflict.
* Schedule the code, preserving any dependencies needed to yield the same result as the original code.
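The offset adjustment in the first precaution is simple arithmetic: once the store sits below the SUBI, R1 has already been decremented, so the offset must grow by the same amount to reach the original address. A minimal sketch, with hypothetical function names:

```cpp
#include <cassert>

// Offset a store must use after it is moved below "SUBI R1, R1, #decrement".
// Before the move the address is R1_old + offset; afterwards R1 holds
// R1_old - decrement, so the offset has to grow by the decrement.
int offset_after_subi(int offset, int decrement) {
    return offset + decrement;
}

// Check that the moved store still hits the same address for a given R1.
bool same_address(long r1_old, int offset, int decrement) {
    long before = r1_old + offset;
    long after = (r1_old - decrement) + offset_after_subi(offset, decrement);
    return before == after;
}
```

This gives 0 + 8 = 8 for the single-iteration loop of Step (3) and -24 + 32 = 8 for the last store of the unrolled loop, which is the "8 - 32 = -24" note read in the other direction.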

Limitations of Loop Unrolling:


* Decreasing benefit: each additional unroll amortizes a smaller amount of loop overhead.
* Code size limitations: memory is at a premium, and a larger loop body can lower the instruction cache hit rate.
* Shortfall in registers (register pressure): increasing ILP increases the number of live values, and it may not be possible to allocate all of them to registers.
* Compiler limitations: significant increase in complexity.
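The decreasing benefit can be made concrete with a small calculation. The sketch below extrapolates the scheduled pattern of Step (4) (k loads, k adds, k stores, one SUBI, one BNEZ, no stalls) to other unroll factors k, giving 3k + 2 cycles per iteration. The text only works out k = 4, so the general formula and the function names are assumptions of this example.

```cpp
#include <cassert>

// Cycles per element for the scheduled loop unrolled k times, assuming the
// Step (4) pattern (k loads, k adds, k stores, one SUBI, one BNEZ, no
// stalls) carries over to other unroll factors. That extrapolation is an
// assumption of this sketch; the text only works out k = 4.
double cycles_per_element(int k) {
    // With fewer than 3 copies this fixed ordering would need stalls under
    // the Table 1 latencies, so the formula is not applied there.
    assert(k >= 3);
    return (3.0 * k + 2.0) / k;
}

// Cycles per element saved by doubling the unroll factor from k to 2k.
double gain_from_doubling(int k) {
    return cycles_per_element(k) - cycles_per_element(2 * k);
}
```

Under this model, k = 4 gives 3.5 cycles per element, doubling to 8 saves only 0.25 cycles per element, and doubling again to 16 saves 0.125, while code size and register demand keep growing with k.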
