MN Loop Unrolling
INTRODUCTION: Loop unrolling rewrites a loop so that a single pass through the new loop body performs more than one iteration of the original loop. It works by replicating the body of a loop some (machine- and code-dependent) number of times and scheduling the resulting code as a single basic block.

Replicating the loop body has two performance advantages. First, a larger loop body gives the scheduler a larger block of instructions to work with, and therefore more options when positioning operations. Second, combining multiple iterations allows induction variable computations to be combined. These improvements are traded against the potential penalty of increased I-cache misses caused by the larger loop body.

Loop unrolling is used to minimize the stalls encountered inside loops and to reduce the overhead of executing the loop-control branch on every iteration. To keep the pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

The basic MIPS integer pipeline is assumed, with branches having a delay of one clock cycle. For floating-point operations the following latencies are assumed:

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Table 1

Before performing loop unrolling, the following assumptions are made about the pipeline: functional units are fully pipelined or replicated (as many times as the pipeline depth), so an operation of any type can be issued on every clock cycle and there are no structural hazards.
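At source level, the transformation can be sketched as follows. This is a minimal illustration in Python, not part of the original material. It assumes the loop adds a scalar s to every element of an array x (which is what the MIPS code in the steps below appears to compute) and that the trip count is a multiple of 4, so no cleanup loop is needed. The function names are hypothetical.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element per iteration, counting down like the MIPS code."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1          # index update and loop test paid once per element
    return x


def add_scalar_unrolled4(x, s):
    """Same loop unrolled 4 times: four elements per iteration."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes a trip count that is a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4          # one index update and one loop test per four elements
    return x


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
assert add_scalar_rolled(list(data), 0.5) == add_scalar_unrolled4(list(data), 0.5)
```

The four index updates of the rolled loop collapse into a single `i -= 4`, which is the combining of induction variable computations described above.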
STEP (2): Arrange the code generated from Step (1), inserting the stalls required by Table 1 for the floating-point operations:

        OP Codes   Operands        Issued Cycle
Loop:   L.D        F0, 0(R1)       1
        stall                      2
        ADD.D      F4, F0, F2      3
        stall                      4
        stall                      5
        S.D        0(R1), F4       6
        SUBI       R1, R1, #8      7
        BNEZ       R1, Loop        8
        stall                      9

Clock cycles/element: 9
STEP (3): Schedule the code generated by Step (2) to minimize the stalls (idle clock cycles) caused by the floating-point operations and the delayed branch:

        OP Codes   Operands        Issued Cycle
Loop:   L.D        F0, 0(R1)       1
        stall                      2
        ADD.D      F4, F0, F2      3
        SUBI       R1, R1, #8      4
        BNEZ       R1, Loop        5   (delayed branch)
        S.D        8(R1), F4       6   (interchanged with SUBI, so the offset becomes 8)

Clock cycles/element: 6
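The cycle counts in Steps (2) and (3) can be reproduced mechanically. The following is a minimal sketch in Python, not part of the original material. It models in-order issue of at most one instruction per clock, the stall counts of Table 1, and a one-cycle branch delay. The names LATENCY, KIND and issue_cycles are hypothetical, and integer operations are assumed to need no stall before a dependent instruction, which is what the listings imply.

```python
# Stall cycles required between a producer and a dependent consumer (Table 1).
# Pairs not listed (integer operations here) are assumed to need no stall.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

KIND = {"L.D": "load", "ADD.D": "fp_alu", "S.D": "store",
        "SUBI": "int", "BNEZ": "branch"}


def issue_cycles(program):
    """Return (issue cycle of each instruction, total cycles for one pass).

    program is a list of (opcode, destination register or None, source registers).
    """
    produced = {}   # register -> (kind of producing instruction, its issue cycle)
    cycles = []
    clock = 0
    for op, dest, srcs in program:
        kind = KIND[op]
        earliest = clock + 1                      # in-order, one issue per cycle
        for reg in srcs:
            if reg in produced:
                prod_kind, prod_cycle = produced[reg]
                stall = LATENCY.get((prod_kind, kind), 0)
                earliest = max(earliest, prod_cycle + stall + 1)
        clock = earliest
        cycles.append(clock)
        if dest is not None:
            produced[dest] = (kind, clock)
    # A branch with nothing scheduled after it leaves its delay slot empty.
    ends_in_branch = bool(program) and KIND[program[-1][0]] == "branch"
    return cycles, clock + 1 if ends_in_branch else clock


STEP2 = [
    ("L.D", "F0", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
]
STEP3 = [STEP2[0], STEP2[1], STEP2[3], STEP2[4], STEP2[2]]  # S.D moved to the delay slot

assert issue_cycles(STEP2) == ([1, 3, 6, 7, 8], 9)
assert issue_cycles(STEP3) == ([1, 3, 4, 5, 6], 6)
```

The sketch only counts cycles; it does not check that a reordering is legal. In Step (3) the store reads R1 after SUBI has changed it, which is why its offset has to change from 0 to 8.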
Now unroll the loop as many times as desired (here it is unrolled 4 times) to reduce the stalls and the loop overhead. After unrolling there are 4 copies of the loop body. R1 is assumed to be initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Renaming the registers eliminates the name dependencies, and the redundant computations are removed from the unrolled code. This applies to both the unscheduled and the scheduled code.

Unscheduled Code:

        OP Codes   Operands          Issued Cycle
Loop:   L.D        F0, 0(R1)         1
        stall                        2
        ADD.D      F4, F0, F2        3
        stall                        4
        stall                        5
        S.D        0(R1), F4         6
        L.D        F6, -8(R1)        7
        stall                        8
        ADD.D      F8, F6, F2        9
        stall                        10
        stall                        11
        S.D        -8(R1), F8        12
        L.D        F10, -16(R1)      13
        stall                        14
        ADD.D      F12, F10, F2      15
        stall                        16
        stall                        17
        S.D        -16(R1), F12      18
        L.D        F14, -24(R1)      19
        stall                        20
        ADD.D      F16, F14, F2      21
        stall                        22
        stall                        23
        S.D        -24(R1), F16      24
        SUBI       R1, R1, #32       25
        BNEZ       R1, Loop          26
        stall                        27

Clock cycles/iteration: 27
Clock cycles/element: 27/4 = 6.75 (about 6.8)
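The unscheduled count follows a simple pattern. Each copy of the body costs 6 cycles (L.D, one stall, ADD.D, two stalls, S.D) and the loop overhead costs 3 (SUBI, BNEZ, branch-delay stall), so unrolling k times without scheduling takes 6k + 3 cycles per iteration. A small Python check of that arithmetic, under the same latency assumptions and with a hypothetical function name:

```python
from fractions import Fraction


def unscheduled_cycles(k):
    """Cycles per iteration and per element for the unscheduled loop unrolled k times."""
    if k < 1:
        raise ValueError("unroll factor must be at least 1")
    per_copy = 6      # L.D, stall, ADD.D, stall, stall, S.D
    overhead = 3      # SUBI, BNEZ, branch-delay stall
    per_iteration = per_copy * k + overhead
    return per_iteration, Fraction(per_iteration, k)


assert unscheduled_cycles(1) == (9, Fraction(9, 1))     # the loop of Step (2)
assert unscheduled_cycles(4) == (27, Fraction(27, 4))   # the listing above, 6.75 per element
```

Without scheduling, the cost per element approaches 6 cycles as k grows: unrolling alone removes loop overhead but leaves the stalls inside each copy untouched.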
Scheduled Code:

        OP Codes   Operands          Issued Cycle
Loop:   L.D        F0, 0(R1)         1
        L.D        F6, -8(R1)        2
        L.D        F10, -16(R1)      3
        L.D        F14, -24(R1)      4
        ADD.D      F4, F0, F2        5
        ADD.D      F8, F6, F2        6
        ADD.D      F12, F10, F2      7
        ADD.D      F16, F14, F2      8
        S.D        0(R1), F4         9
        S.D        -8(R1), F8        10
        S.D        -16(R1), F12      11
        SUBI       R1, R1, #32       12
        BNEZ       R1, Loop          13
        S.D        8(R1), F16        14   (8 - 32 = -24)

Clock cycles/iteration: 14
Clock cycles/element: 14/4 = 3.5
Schedule the code, preserving any dependencies needed to yield the same result as the original code.
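One way to check that a schedule preserves the result is to execute both versions and compare memory afterwards. The sketch below is an illustration in Python, not part of the original material: a hypothetical interpreter, run, for just the five instructions used here, with the instruction after BNEZ always executed (the branch delay slot). The memory layout (elements at byte addresses 8, 16, ..., with R1 starting at the highest one) and the NOP standing in for the branch-delay stall of the unscheduled loop are assumptions.

```python
def run(program, memory, r1, s):
    """Execute the program; instruction 0 is the label Loop, F2 holds the scalar s."""
    regs = {"R1": r1, "F2": s}
    pc = 0
    target = None            # branch target waiting for the delay slot to finish
    while pc < len(program):
        op, *args = program[pc]
        next_pc = pc + 1
        if target is not None:           # this instruction sits in the delay slot
            next_pc, target = target, None
        if op == "L.D":
            dest, offset = args
            regs[dest] = memory[regs["R1"] + offset]
        elif op == "ADD.D":
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
        elif op == "S.D":
            offset, src = args
            memory[regs["R1"] + offset] = regs[src]
        elif op == "SUBI":
            regs["R1"] -= args[0]
        elif op == "BNEZ" and regs["R1"] != 0:
            target = 0                   # taken: jump to Loop after the delay slot
        pc = next_pc                     # NOP and untaken BNEZ simply fall through
    return memory


ORIGINAL = [
    ("L.D", "F0", 0), ("ADD.D", "F4", "F0", "F2"), ("S.D", 0, "F4"),
    ("SUBI", 8), ("BNEZ",), ("NOP",),
]
SCHEDULED_UNROLLED = [
    ("L.D", "F0", 0), ("L.D", "F6", -8), ("L.D", "F10", -16), ("L.D", "F14", -24),
    ("ADD.D", "F4", "F0", "F2"), ("ADD.D", "F8", "F6", "F2"),
    ("ADD.D", "F12", "F10", "F2"), ("ADD.D", "F16", "F14", "F2"),
    ("S.D", 0, "F4"), ("S.D", -8, "F8"), ("S.D", -16, "F12"),
    ("SUBI", 32), ("BNEZ",), ("S.D", 8, "F16"),   # 8 - 32 = -24
]


def fresh_memory(n):
    """n elements at byte addresses 8, 16, ..., 8 * n."""
    return {8 * i: float(i) for i in range(1, n + 1)}


n = 8   # a multiple of 4, so the initial R1 = 8 * n is a multiple of 32
reference = run(ORIGINAL, fresh_memory(n), 8 * n, 0.5)
unrolled = run(SCHEDULED_UNROLLED, fresh_memory(n), 8 * n, 0.5)
assert reference == unrolled
```

If the final store keeps its pre-scheduling offset of -24, the comparison fails, because SUBI has already subtracted 32 from R1 by the time the store in the delay slot executes.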