Chapter 4 Advanced Pipelining and Instruction-Level Parallelism

the structural hazards and waiting for the absence of a data hazard. We can still check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue. However, we want the instructions to begin execution as soon as their data operands are available. Thus, the pipeline will do out-of-order execution, which implies out-of-order completion.

Out-of-order completion creates major complications in handling exceptions. In the dynamically scheduled processors addressed in this section, exceptions are imprecise, since instructions may complete before an instruction issued earlier raises an exception. Thus, it is difficult to restart after an interrupt. Rather than address these problems in this section, we will discuss a solution for precise exceptions in the context of a processor with speculation in section 4.6. The approach discussed in section 4.6 can be used to solve the simpler problem that arises in these dynamically scheduled processors. For floating-point exceptions other solutions may be possible, as discussed in Appendix A.

In introducing out-of-order execution, we have essentially split the ID pipe stage into two stages:

1. Issue: Decode instructions, check for structural hazards.
2. Read operands: Wait until no data hazards, then read operands.

An instruction fetch stage precedes the issue stage and may fetch either into a single-entry latch or into a queue; instructions are then issued from the latch or queue. The EX stage follows the read operands stage, just as in the DLX pipeline. As in the DLX floating-point pipeline, execution may take multiple cycles, depending on the operation. Thus, we may need to distinguish when an instruction begins execution and when it completes execution; between the two times, the instruction is in execution. This allows multiple instructions to be in execution at the same time.
In addition to these changes to the pipeline structure, we will also change the functional unit design by varying the number of units, the latency of operations, and the functional unit pipelining, so as to better explore these more advanced pipelining techniques.

4.2 Overcoming Data Hazards with Dynamic Scheduling

Dynamic Scheduling with a Scoreboard

In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences; it is named after the CDC 6600 scoreboard, which developed this capability.

Before we see how scoreboarding could be used in the DLX pipeline, it is important to observe that WAR hazards, which did not exist in the DLX floating-point or integer pipelines, may arise when instructions execute out of order. Suppose in the earlier example, the SUBD destination is F8, so that the code sequence is

    DIVD  F0,F2,F4
    ADDD  F10,F0,F8
    SUBD  F8,F8,F14

Now there is an antidependence between the ADDD and the SUBD: If the pipeline executes the SUBD before the ADDD, it will violate the antidependence, yielding incorrect execution. Likewise, to avoid violating output dependences, WAW hazards (e.g., as would occur if the destination of the SUBD were F10) must also be detected. As we will see, both these hazards are avoided in a scoreboard by stalling the later instruction involved in the antidependence.

The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle (when there are no structural hazards) by executing an instruction as early as possible. Thus, when the next instruction to execute is stalled, other instructions can be issued and executed if they do not depend on any active or stalled instruction.
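The antidependence in the DIVD/ADDD/SUBD sequence above can be found mechanically by comparing register names. The following is a minimal sketch, not part of the original text: the `Instr` record and the `hazards` function are hypothetical names introduced only for illustration, and the check looks at a single pair of instructions in program order.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical record for one register-register instruction."""
    op: str
    dest: str
    srcs: tuple


def hazards(earlier, later):
    """Classify the dependences of `later` on `earlier` (program order)."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # true data dependence: later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # antidependence: later overwrites what earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")  # output dependence: both write the same register
    return found


divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F8", ("F8", "F14"))

print(hazards(divd, addd))  # {'RAW'}
print(hazards(addd, subd))  # {'WAR'}
```

Changing the SUBD destination to F10, as the text suggests, makes the same check report a WAW hazard against the ADDD instead.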
The scoreboard takes full responsibility for instruction issue and execution, including all hazard detection. Taking advantage of out-of-order execution requires multiple instructions to be in their EX stage simultaneously. This can be achieved with multiple functional units, with pipelined functional units, or with both. Since these two capabilities (pipelined functional units and multiple functional units) are essentially equivalent for the purposes of pipeline control, we will assume the processor has multiple functional units.

The CDC 6600 had 16 separate functional units, including 4 floating-point units, 5 units for memory references, and 7 units for integer operations. On DLX, scoreboards make sense primarily on the floating-point unit since the latency of the other functional units is very small. Let's assume that there are two multipliers, one adder, one divide unit, and a single integer unit for all memory references, branches, and integer operations. Although this example is simpler than the CDC 6600, it is sufficiently powerful to demonstrate the principles without having a mass of detail or needing very long examples. Because both DLX and the CDC 6600 are load-store architectures, the techniques are nearly identical for the two processors. Figure 4.3 shows what the processor looks like.

Every instruction goes through the scoreboard, where a record of the data dependences is constructed; this step corresponds to instruction issue and replaces part of the ID step in the DLX pipeline. The scoreboard then determines when the instruction can read its operands and begin execution. If the scoreboard decides the instruction cannot execute immediately, it monitors every change in the hardware and decides when the instruction can execute. The scoreboard also controls when an instruction can write its result into the destination register. Thus, all hazard detection and resolution is centralized in the scoreboard.
We will see a picture of the scoreboard later (Figure 4.4 on page 247), but first we need to understand the steps in the issue and execution segment of the pipeline.

FIGURE 4.3 The basic structure of a DLX processor with a scoreboard. The scoreboard's function is to control instruction execution (vertical control lines). All data flows between the register file and the functional units over the buses (the horizontal lines, called trunks in the CDC 6600). There are two FP multipliers, an FP divider, an FP adder, and an integer unit. One set of buses (two inputs and one output) serves a group of functional units. The details of the scoreboard are shown in Figures 4.4-4.7.

Each instruction undergoes four steps in executing. (Since we are concentrating on the FP operations, we will not consider a step for memory access.) Let's first examine the steps informally and then look in detail at how the scoreboard keeps the necessary information that determines when to progress from one step to the next. The four steps, which replace the ID, EX, and WB steps in the standard DLX pipeline, are as follows:

1. Issue: If a functional unit for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the functional unit and updates its internal data structure. This step replaces a portion of the ID step in the DLX pipeline. By ensuring that no other active functional unit wants to write its result into the destination register, we guarantee that WAW hazards cannot be present. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared. When the issue stage stalls, it causes the buffer between instruction fetch and issue to fill; if the buffer is a single entry, instruction fetch stalls immediately. If the buffer is a queue with multiple instructions, it stalls when the queue fills; later we will see how a queue is used in the PowerPC 620 to connect fetch and issue.

2. Read operands: The scoreboard monitors the availability of the source operands. A source operand is available if no earlier issued active instruction is going to write it or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order. This step, together with issue, completes the function of the ID step in the simple DLX pipeline.

3. Execution: The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This step replaces the EX step in the DLX pipeline and takes multiple cycles in the DLX FP pipeline.

4. Write result: Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards and stalls the completing instruction, if necessary.

   A WAR hazard exists if there is a code sequence like our earlier example with ADDD and SUBD that both use F8. In that example we had the code

       DIVD  F0,F2,F4
       ADDD  F10,F0,F8
       SUBD  F8,F8,F14

   ADDD has a source operand F8, which is the same register as the destination of SUBD. But ADDD actually depends on an earlier instruction. The scoreboard will still stall the SUBD in its write result stage until ADDD reads its operands. In general, then, a completing instruction cannot be allowed to write its results when

   - there is an instruction that has not read its operands that precedes (i.e., in order of issue) the completing instruction, and
   - one of the operands is the same register as the result of the completing instruction.

   If this WAR hazard does not exist, or when it clears, the scoreboard tells the functional unit to store its result to the destination register. This step replaces the WB step in the simple DLX pipeline.

At first glance, it might appear that the scoreboard will have difficulty separating RAW and WAR hazards. Exercise 4.6 will help you understand how the scoreboard distinguishes these two cases and thus knows when to prevent a WAR hazard by stalling an instruction that is ready to write its results.

Because the operands for an instruction are read only when both operands are available in the register file, this scoreboard does not take advantage of forwarding. Instead registers are only read when they are both available. This is not as large a penalty as you might initially think. Unlike our simple pipeline of Chapter 3, instructions will write their result into the register file as soon as they complete execution (assuming no WAR hazards), rather than wait for a statically assigned write slot that may be several cycles away. The effect is reduced pipeline latency and benefits of forwarding.
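The checks made in steps 1, 2, and 4 above can be sketched as three predicates over a small table of functional units. This is an illustrative simplification rather than the book's scoreboard: the `Unit` and `Scoreboard` classes and their method names are hypothetical, execution latency (step 3) and bus limits are not modeled, and the demo assumes two add units (`Add1`, `Add2`) so that ADDD and SUBD can be active at the same time.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Simplified state of one functional unit (hypothetical record)."""
    busy: bool = False
    dest: str = ""           # destination register of the instruction in the unit
    srcs: tuple = ()         # source registers
    read_done: bool = False  # True once the operands have been read
    seq: int = -1            # position in issue order


class Scoreboard:
    def __init__(self, unit_names):
        self.units = {name: Unit() for name in unit_names}
        self.issued = 0

    def _earlier(self, me):
        """Active instructions that issued before `me`."""
        return [u for u in self.units.values() if u.busy and u.seq < me.seq]

    def can_issue(self, unit, dest):
        # Step 1: the unit must be free (no structural hazard) and no active
        # instruction may have the same destination register (no WAW hazard).
        free = not self.units[unit].busy
        waw = any(u.busy and u.dest == dest for u in self.units.values())
        return free and not waw

    def issue(self, unit, dest, srcs):
        assert self.can_issue(unit, dest)
        self.units[unit] = Unit(True, dest, tuple(srcs), False, self.issued)
        self.issued += 1

    def can_read_operands(self, unit):
        # Step 2: RAW check. Wait while an earlier active instruction
        # has yet to write one of our source registers.
        me = self.units[unit]
        return not any(u.dest in me.srcs for u in self._earlier(me))

    def read_operands(self, unit):
        assert self.can_read_operands(unit)
        self.units[unit].read_done = True

    def can_write_result(self, unit):
        # Step 4: WAR check. Wait while an earlier instruction that has not
        # yet read its operands uses our destination register as a source.
        me = self.units[unit]
        return not any(not u.read_done and me.dest in u.srcs
                       for u in self._earlier(me))

    def write_result(self, unit):
        assert self.can_write_result(unit)
        self.units[unit] = Unit()  # the unit becomes free again


sb = Scoreboard(["Divide", "Add1", "Add2"])
sb.issue("Divide", "F0", ("F2", "F4"))   # DIVD F0,F2,F4
sb.read_operands("Divide")
sb.issue("Add1", "F10", ("F0", "F8"))    # ADDD F10,F0,F8
sb.issue("Add2", "F8", ("F8", "F14"))    # SUBD F8,F8,F14
sb.read_operands("Add2")

print(sb.can_read_operands("Add1"))  # False: RAW on F0, DIVD has not written
print(sb.can_write_result("Add2"))   # False: WAR on F8, ADDD has not read
sb.write_result("Divide")
sb.read_operands("Add1")
print(sb.can_write_result("Add2"))   # True: ADDD has read F8
```

The sketch stalls the SUBD at write result exactly as described for the WAR case, and it would refuse to issue any instruction whose destination is F10 while the ADDD is still active.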
There is still one additional cycle of latency that arises since the write result and read operand stages cannot overlap. We would need additional buffering to eliminate this overhead.

Based on its own data structure, the scoreboard controls the instruction progression from one step to the next by communicating with the functional units. There is a small complication, however. There are only a limited number of source operand buses and result buses to the register file, which represents a structural hazard. The scoreboard must guarantee that the number of functional units allowed to proceed into steps 2 and 4 does not exceed the number of buses available. We will not go into further detail on this, other than to mention that the CDC 6600 solved this problem by grouping the 16 functional units together into four groups and supplying a set of buses, called data trunks, for each group. Only one unit in a group could read its operands or write its result during a clock.

Now let's look at the detailed data structure maintained by a DLX scoreboard with five functional units. Figure 4.4 shows what the scoreboard's information looks like partway through the execution of this simple sequence of instructions:

    LD     F6,34(R2)
    LD     F2,45(R3)
    MULTD  F0,F2,F4
    SUBD   F8,F6,F2
    DIVD   F10,F0,F6
    ADDD   F6,F8,F2

There are three parts to the scoreboard:

1. Instruction status: Indicates which of the four steps the instruction is in.

2. Functional unit status: Indicates the state of the functional unit (FU). There are nine fields for each functional unit:

   Busy: Indicates whether the unit is busy or not
   Op: Operation to perform in the unit (e.g., add or subtract)
   Fi: Destination register
   Fj, Fk: Source-register numbers
   Qj, Qk: Functional units producing source registers Fj, Fk
   Rj, Rk: Flags indicating when Fj, Fk are ready. Set to No after operands are read.

3. Register result status: Indicates which functional unit will write each register, if an active instruction has the register as its destination. This field is set to blank whenever there are no pending instructions that will write that register.

[Figure 4.4: scoreboard tables. Instruction status (columns: Issue, Read operands, Execution complete, Write result); Functional unit status (columns: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk); Register result status (one entry per register).]

FIGURE 4.4 Components of the scoreboard. Each instruction that has issued or is pending issue has an entry in the instruction status table. There is one entry in the functional-unit status table for each functional unit. Once an instruction issues, the record of its operands is kept in the functional-unit status table. Finally, the register-result table indicates which unit will produce each pending result; the number of entries is equal to the number of registers. The instruction status table says that (1) the first LD has completed and written its result, and (2) the second LD has completed execution but has not yet written its result. The MULTD, SUBD, and DIVD have all issued but are stalled, waiting for their operands. The functional-unit status says that the first multiply unit is waiting for the integer unit, the add unit is waiting for the integer unit, and the divide unit is waiting for the first multiply unit. The ADDD instruction is stalled because of a structural hazard; it will clear when the SUBD completes. If an entry in one of these scoreboard tables is not being used, it is left blank. For example, the Rk field is not used on a load and the Mult2 unit is unused, hence their fields have no meaning. Also, once an operand has been read, the Rj and Rk fields are set to No. [...] why this last step is crucial.

[...] does not extend beyond a branch, so the window (and the scoreboard) always contains straight-line code from a single basic block. Section 4.6 shows how the window can be extended beyond a branch.

3. The number and types of functional units: This determines the importance of structural hazards, which can increase when dynamic scheduling is used.

4. The presence of antidependences and output dependences: These lead to WAR and WAW stalls.

This entire chapter focuses on techniques that attack the problem of exposing and better utilizing available ILP.
The second and third factors can be attacked by increasing the size of the scoreboard and the number of functional units; however, these changes have cost implications and may also affect cycle time. WAW and WAR hazards become more important in dynamically scheduled processors, because the pipeline exposes more name dependences. WAW hazards also become more important if we use dynamic scheduling with a branch prediction scheme that allows multiple iterations of a loop to overlap.

The next subsection looks at a technique called register renaming that dynamically eliminates name dependences so as to avoid WAR and WAW hazards. Register renaming does this by replacing the register names (such as those kept in the scoreboard) with the names of a larger set of virtual registers. The register renaming scheme also is the basis for implementing forwarding.

Another Dynamic Scheduling Approach: The Tomasulo Approach

Another approach to allow execution to proceed in the presence of hazards was used by the IBM 360/91 floating-point unit. This scheme was invented by Robert Tomasulo and is named after him. Tomasulo's scheme combines key elements of the scoreboarding scheme with the introduction of register renaming. There are many variations on this scheme, though the key concept of renaming registers to avoid WAR and WAW hazards is the most common characteristic.

The IBM 360/91 was completed about three years after the CDC 6600, just before caches appeared in commercial processors. IBM's goal was to achieve high floating-point performance from an instruction set and from compilers designed for the entire 360 computer family, rather than from specialized compilers for the high-end processors. The 360 architecture had only four double-precision floating-point registers, which limits the effectiveness of compiler scheduling; this fact was another motivation for the Tomasulo approach. In addition, the IBM 360/91 had long memory accesses and long floating-point delays, which Tomasulo's algorithm was designed to overcome. At the end of the section, we will see that Tomasulo's algorithm can also support the overlapped execution of multiple iterations of a loop.

We explain the algorithm, which focuses on the floating-point unit, in the context of a pipelined, floating-point unit for DLX. The primary difference between DLX and the 360 is the presence of register-memory instructions in the latter processor. Because Tomasulo's algorithm uses a load functional unit, no significant changes are needed to add register-memory addressing modes. The primary addition is another bus. The IBM 360/91 also had pipelined functional units, rather than multiple functional units. The only difference between these is that a pipelined unit can start at most one operation per clock cycle. Since there are really no fundamental differences, we describe the algorithm as if there were multiple functional units. The IBM 360/91 could accommodate three operations for the floating-point adder and two for the floating-point multiplier. In addition, up to six floating-point loads, or memory references, and up to three floating-point stores could be outstanding. Load data buffers and store data buffers are used for this function. Although we will not discuss the load and store units, we do need to include the buffers for operands.

Tomasulo's scheme shares many ideas with the scoreboard scheme, so we assume that you understand the scoreboard thoroughly. In the last section, we saw how a compiler could rename registers to avoid WAW and WAR hazards. In Tomasulo's scheme this functionality is provided by the reservation stations, which buffer the operands of instructions waiting to issue, and by the issue logic. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register appear, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station in a process called register renaming.
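The renaming idea just described can be illustrated with a small sketch. It is a deliberate simplification and not the IBM 360/91 design: it assumes one reservation station per instruction (tags `RS0`, `RS1`, ...), that no instruction completes while the sequence is being issued, and that every register read before being written has a value in `regs`. The function name `issue_with_renaming` and the register values are hypothetical.

```python
def issue_with_renaming(program, regs):
    """Issue (op, dest, src1, src2) tuples in order, renaming as we go.

    Each operand becomes either a value copied from the register file
    (the register has no pending writer) or the tag of the reservation
    station that will produce it.
    """
    producer = {}  # register -> tag of the station that will write it last
    stations = {}  # tag -> (op, operands)
    for n, (op, dest, src1, src2) in enumerate(program):
        tag = f"RS{n}"
        operands = tuple(
            producer[s] if s in producer else regs[s]  # tag, else copied value
            for s in (src1, src2)
        )
        stations[tag] = (op, operands)
        producer[dest] = tag  # later readers of dest now wait on this station
    return stations, producer


regs = {"F2": 2.0, "F4": 4.0, "F8": 8.0, "F14": 14.0}  # illustrative values
program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F10", "F0", "F8"),
    ("SUBD", "F8", "F8", "F14"),
]
stations, producer = issue_with_renaming(program, regs)

print(stations["RS1"])  # ('ADDD', ('RS0', 8.0))
print(producer["F8"])   # RS2
```

Because the ADDD station holds a copy of F8 taken at issue, the later SUBD write to F8 cannot disturb it, so the WAR hazard disappears. If two instructions in the sequence wrote the same register, `producer` would keep only the later tag, which corresponds to only the last write updating the register.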
This combination of issue logic and reservation stations provides renaming and eliminates WAW and WAR hazards. This additional capability is the major conceptual difference between scoreboarding and Tomasulo's algorithm. Since there can be more reservation stations than real registers, the technique can eliminate hazards that could not be eliminated by a compiler. As we explore the components of Tomasulo's scheme, we will return to the topic of register renaming and see exactly how the renaming occurs and how it eliminates hazards.

In addition to the use of register renaming, there are two other significant differences in the organization of Tomasulo's scheme and scoreboarding. First, hazard detection and execution control are distributed: The reservation stations at each functional unit control when an instruction can begin execution at that unit. This function is centralized in the scoreboard. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This is done with a common result bus that allows all units waiting for an operand to be loaded simultaneously (on the 360/91 this is called the common data bus, or CDB). In comparison, the scoreboard writes results into registers, where waiting functional units may have to contend for them. The number of result buses in either the scoreboard or Tomasulo's scheme can be varied. In the actual implementations, the CDC 6600 had multiple comple
