Intel P6

Intel's P6 Uses Decoupled Superscalar Design
Next Generation of x86 Integrates L2 Cache in Package with CPU
by Linley Gwennap

Intel's forthcoming P6 processor (see cover story) is designed to outperform all other x86 CPUs by a significant margin. Although it shares some design techniques with competitors such as AMD's K5, NexGen's Nx586, and Cyrix's M1, the new Intel chip has several important advantages over those competitors.

The P6's deep pipeline eliminates the cache-access bottlenecks that restrict its competitors to clock speeds of about 100 MHz. The new CPU is designed to run at 133 MHz in its initial 0.5-micron BiCMOS implementation; a 0.35-micron version, due next year, could push the speed as high as 200 MHz. In addition, the Intel design uses a closely coupled secondary cache to speed memory accesses, a critical issue for high-frequency CPUs. Intel will combine the P6 CPU and a 256K cache chip in a single PGA package, reducing the time needed for data to move from the cache to the processor.

Like some of its competitors, the P6 translates x86 instructions into simple, fixed-length instructions that Intel calls micro-operations or uops (pronounced "you-ops"). These uops are then executed in a decoupled superscalar core capable of register renaming and out-of-order execution. Intel has given the name "dynamic execution" to this particular combination of features, which is neither new nor unique but is highly effective in increasing x86 performance.

The P6 also implements a new system bus with increased bandwidth compared with the Pentium bus. The new bus is capable of supporting up to four P6 processors with no glue logic, reducing the cost of developing and building multiprocessor systems. This feature set makes the new processor particularly attractive for servers; it will also be used in high-end desktop PCs and, eventually, in mainstream PC products.

Not Your Grandfather's Pentium

While Pentium's microarchitecture carries a distinct legacy from the 486, it is hard to find a trace of Pentium in the P6. The P6 team threw out most of the design techniques used by the 486 and Pentium and started from a blank piece of paper to build a high-performance x86-compatible processor.

The result is a microarchitecture that is quite radical compared with Intel's previous x86 designs, but one that draws from the same bag of tricks as competitors' x86 chips. To this mix, the P6 adds high-performance cache and bus designs that allow even large programs to make good use of the superscalar CPU core.

As Figure 1 shows, the P6 can be divided into two portions: the in-order and out-of-order sections. Instructions start in order but can be executed out of order. Results flow to the reorder buffer (ROB), which puts them back into the correct order. Like AMD's K5 (see MPR 10/24/94, p. 1), the P6 uses the ROB to hold results that are generated by speculative and out-of-order instructions. If it turns out that these instructions should not have been executed, their results can be flushed from the ROB before they are committed.
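To make the reorder-buffer mechanism concrete, the following C sketch models results arriving out of order but being committed strictly in program order, with uncommitted entries flushed after a misprediction or fault. It is only an illustration under simple assumptions (the entry layout, the helper names, and the omission of capacity checks are inventions here); it is not Intel's implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ROB_SIZE 40                   /* the P6 ROB holds 40 entries   */

typedef struct {
    bool    valid;                    /* allocated in program order    */
    bool    done;                     /* result has been written back  */
    int32_t result;                   /* room for a load/ALU result    */
} RobEntry;

static RobEntry rob[ROB_SIZE];
static int head = 0, tail = 0;        /* head = oldest, next to retire */

/* Allocate the next entry in program order (capacity checks omitted). */
static int rob_alloc(void) {
    int idx = tail;
    rob[idx] = (RobEntry){ .valid = true, .done = false, .result = 0 };
    tail = (tail + 1) % ROB_SIZE;
    return idx;
}

/* Execution units write results whenever they finish, out of order.   */
static void rob_writeback(int idx, int32_t value) {
    rob[idx].done = true;
    rob[idx].result = value;
}

/* Retire up to three completed uops per cycle, strictly in program order. */
static void rob_retire_cycle(void) {
    for (int n = 0; n < 3 && head != tail && rob[head].done; n++) {
        printf("commit ROB[%d] = %d\n", head, (int)rob[head].result);
        rob[head].valid = false;
        head = (head + 1) % ROB_SIZE;
    }
}

/* A misprediction or fault discards every uncommitted entry.          */
static void rob_flush(void) {
    while (head != tail) {
        rob[head].valid = false;
        head = (head + 1) % ROB_SIZE;
    }
}

int main(void) {
    int a = rob_alloc(), b = rob_alloc(), c = rob_alloc();
    rob_writeback(c, 30);             /* youngest finishes first...       */
    rob_writeback(a, 10);
    rob_retire_cycle();               /* ...but only 'a' can retire       */
    rob_writeback(b, 20);
    rob_retire_cycle();               /* now 'b' and 'c' retire in order  */
    rob_flush();                      /* nothing speculative remains      */
    return 0;
}

The 40-entry size and the three-per-cycle retirement limit are the figures Intel quotes for the P6; everything else in the sketch is generic.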
The performance increase over Pentium comes largely from the out-of-order execution engine. In Pentium, if an instruction takes several cycles to execute, due to a cache miss or other long-latency operation, the entire processor stalls until that instruction can proceed. In the same situation, the P6 will continue to execute subsequent instructions, coming back to the stalled instruction once it is ready to execute. Intel estimates that the P6, by avoiding stalls, delivers 1.5 SPECint92 per MHz, about 40% better than Pentium.

x86 Instructions Translate to Micro-ops

The P6 CPU includes an 8K instruction cache that is similar in structure to Pentium's. On each cycle, it can deliver 16 aligned bytes into the instruction byte queue. Unlike Pentium, the P6 cache cannot fetch an unaligned cache line, throttling the decode process when poorly aligned branch targets are encountered. Any hiccups in the fetch stream, however, are generally hidden by the uop queues in the execution engine.

The instruction bytes are fed into three instruction decoders. The first decoder, at the front of the queue, can handle any x86 instruction; the others are restricted to only simple (e.g., register-to-register) instructions. Instructions are always decoded in program order, so if an instruction cannot be handled by a restricted decoder, neither that instruction nor any subsequent ones can be decoded on that cycle; the complex instruction will eventually reach the front of the queue and be decoded by the general decoder.

Assuming that instruction bytes are available, at least one x86 instruction will be decoded per cycle, but more than one will be decoded only if the second (and third) instructions fall into the "restricted" category. Intel refused to list these instructions, but they do not include any that operate on memory. Thus, the P6's ability to execute more than one x86 instruction per cycle relies on avoiding long sequences of complex instructions or instructions that operate on memory.

[Figure 1. The P6 combines an in-order front end with a decoupled superscalar execution engine that can process RISC-like micro-ops speculatively and out of order.]

The decoders translate x86 instructions into uops. P6 uops have a fixed length of 118 bits, using a regular structure to encode an operation, two sources, and a destination. The source and destination fields are each wide enough to contain a 32-bit operand. Like RISC instructions, uops use a load/store model; x86 instructions that operate on memory must be broken into a load uop, an ALU uop, and possibly a store uop.

The restricted decoders can produce only one uop per cycle (and thus accept only instructions that translate into a single uop). The general decoder is capable of generating up to four uops per cycle. Instructions that require more than four uops are handled by a uop sequencer that generates the requisite series of uops over two or more cycles. Because of x86 constructs such as the string instructions, a single instruction can produce a very long sequence of uops. Many x86 instructions, on the other hand, translate into a single uop, and the average is 1.5 to 2.0 uops per instruction, according to Intel.
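As an illustration of this load/store split, the short C sketch below shows how a memory-operand add might decompose into uops next to a register-to-register add that needs only one. The record layout and field names are invented for clarity; Intel has not published the actual 118-bit uop encoding.

#include <stdio.h>

/* Hypothetical uop record: one operation, two sources, one destination. */
typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } UopOp;

typedef struct {
    UopOp       op;
    const char *dst;      /* destination register (or memory for a store) */
    const char *src1;
    const char *src2;
} Uop;

static void print_uops(const char *insn, const Uop *u, int n) {
    printf("%-16s -> %d uop(s)\n", insn, n);
    for (int i = 0; i < n; i++)
        printf("    op=%d  %s <- %s, %s\n", (int)u[i].op, u[i].dst, u[i].src1, u[i].src2);
}

int main(void) {
    /* A memory-operand instruction is broken into a load uop plus an ALU
     * uop (a store form would add a third, store uop).                   */
    Uop add_mem[] = {
        { UOP_LOAD, "tmp", "ebx", "+8"  },   /* tmp = MEM[ebx + 8] */
        { UOP_ADD,  "eax", "eax", "tmp" },   /* eax = eax + tmp    */
    };
    /* A register-to-register instruction maps to a single uop and can be
     * handled by any of the three decoders.                              */
    Uop add_reg[] = {
        { UOP_ADD,  "eax", "eax", "ecx" },
    };

    print_uops("add eax,[ebx+8]", add_mem, 2);
    print_uops("add eax,ecx",     add_reg, 1);
    return 0;
}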
The uops then pass through the reorder buffer. The ROB must log each uop so it can later be retired in program order. Each of the 40 ROB entries also has room to store the result of a load or calculation uop, along with the condition codes that could be changed by that uop. As uops execute, they write their results to the ROB.

Closely associated with the ROB is the register alias table (RAT). As uops are logged in the ROB, the RAT determines whether their source operands should be taken from the real register file (RRF) or from the ROB. The latter case occurs if the destination register of a previous instruction in the ROB matches the source register; if so, that source register number is replaced by a pointer to the appropriate ROB entry. The RAT is also updated with the destination register of each uop.

In this way, the P6 implements register renaming. When uops are executed, they read their data from either the register file or the ROB, as needed. Renaming, a technique discussed in detail when Cyrix introduced its M1 design (see MPR 10/25/93, p. 1), helps break one of the major bottlenecks of the x86 instruction set: the small number of general-purpose registers. The ROB provides the P6 with 40 registers that can hold the contents of any integer or FP register, reducing the number of stalls due to register conflicts.
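A minimal sketch of that renaming lookup appears below, assuming a flat table of eight integer registers. How the real RAT encodes its entries, or how it handles partial registers and condition codes, has not been disclosed; this is only the general idea.

#include <stdio.h>

#define NUM_REGS 8          /* the small x86 integer register set            */
#define IN_RRF   (-1)       /* value is current in the real register file    */

/* rat[r] = ROB entry that will produce register r, or IN_RRF if the
 * committed value in the real register file is the latest one.              */
static int rat[NUM_REGS] = { IN_RRF, IN_RRF, IN_RRF, IN_RRF,
                             IN_RRF, IN_RRF, IN_RRF, IN_RRF };

/* Rename one uop as it is logged in ROB entry 'rob_idx': each source is
 * redirected to the ROB when an in-flight uop will write it, and the
 * destination is claimed so later uops read the speculative value.          */
static void rename_uop(int rob_idx, int src1, int src2, int dst) {
    printf("ROB[%d]: src r%d %s, src r%d %s, dst r%d -> ROB[%d]\n",
           rob_idx,
           src1, rat[src1] == IN_RRF ? "from RRF" : "from ROB",
           src2, rat[src2] == IN_RRF ? "from RRF" : "from ROB",
           dst, rob_idx);
    rat[dst] = rob_idx;
}

int main(void) {
    rename_uop(0, 1, 2, 1);   /* r1 = r1 + r2 : both sources from the RRF   */
    rename_uop(1, 1, 3, 4);   /* r4 = r1 + r3 : r1 now comes from ROB[0]    */
    return 0;
}

When ROB entry 0 retires, its value is written to the register file and any RAT entry still pointing at it would revert to the RRF; that bookkeeping is omitted here.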
Out-of-Order Engine Drives Performance

Up to three uops can be renamed and logged in the ROB on each cycle; these three uops then flow into the reservation station (RS). This section of the chip holds up to 20 uops in a single structure. (The RS can be smaller than the ROB because the ROB must track uops that have executed but are not retired.) The uops wait in the reservation station until their source operands are all available. Due to the register renaming, only a single ROB entry (per source) must be checked to determine if the needed value is available. Any uops with all operands available are marked as ready.

Each cycle, up to five uops can be dispatched: two calculations, a load, a store address, and a store data. A store requires a second uop to carry the store data, which, in the case of floating-point data, must be converted before writing it to memory.

There are some restrictions on pairing calculation uops due to the arrangement of the read ports of the reservation station. As Figure 1 shows, only one uop per cycle can be dispatched to the main arithmetic unit, which consists of the floating-point unit and a complete integer unit with an ALU, shifter, multiplier, and divider. The second calculation uop goes to the secondary ALU and must be either an integer ALU operation (no shifts, multiplies, or divides) or a branch-target-address calculation. Simplifying the second calculation unit reduces the die area with little impact on performance.

In many situations, more uops will be ready to execute than there are function units and read ports. When this happens, the dispatch logic prioritizes the available uops according to a complex set of rules. Intel declines to discuss these rules but notes that older uops are given priority over newer ones, speeding the resolution of chains of dependent operations.

The integer units handle most calculations in a single cycle, but integer multiply and divide take longer. The P6 FPU executes adds in three cycles and requires five cycles for multiply operations. FP add and multiply are pipelined and can be executed in parallel with long-latency operations. Table 1 shows the cycle times for long-latency integer and floating-point operations.

[Table 1. P6 floating-point latencies are similar to Pentium's, but its integer arithmetic is much faster. Latencies are the same for single, double, and extended precision except where ranges are shown.]

The FXCH (floating-point exchange) instruction is not handled by the FPU. This instruction swaps the top of the FP register stack with another register in the stack and is frequently used in x86 floating-point applications, since many x86 FP instructions access only the top of the stack. The P6 handles FXCH entirely in the reorder buffer by treating it as a renaming of two registers. Thus, FXCH uops enter the ROB but not the reservation station and never enter the function units.

The dual address-generation units each contain a four-input adder that combines all possible x86 address components (segment, base, index, and immediate) in a single cycle. As in Pentium, each unit also contains a second four-input adder to perform a segment-limit check in parallel. The resulting address (along with, for a store, the source data) is then placed in the memory reorder buffer (MOB, in P6-ese) to await availability of the data cache.

The reorder buffer can accept three results per cycle: one from the main arithmetic unit, one from the secondary ALU, and one from a load uop. Because of long-latency operations, two or more function units within the main arithmetic unit can generate results in a single cycle. In this case, the function units must arbitrate to write to the ROB.

After a uop writes its result to the ROB (or, in the case of a store, to the MOB), it is eligible to be retired. The ROB will retire up to three uops per cycle, always in program order, by writing their results to the register file. Uops are never retired until all previous uops have completed successfully. If any exception or error occurs, uncommitted results in the ROB can be flushed, resetting the CPU to its proper state, as if the instructions had been executed in order up to the point of the error.

Nonblocking Caches Reduce Stalls

A key to the P6's performance is its cache subsystem. The primary data cache, at 8K, is relatively small for a processor of this generation, but the fast level-two cache helps alleviate the lower hit rate of the primary cache. If an access misses the data cache, the cache can continue servicing requests while waiting for the miss data to be returned. This technique, called a nonblocking cache or hit-under-miss, has been used for years by PA-RISC processors to avoid stalls.

The P6 takes advantage of the nonblocking cache with its memory reorder buffer. If the access at the front of the MOB misses, subsequent accesses will continue to execute. Since these accesses can execute out of order, the P6 must take care to avoid incorrect program execution. For example, loads can arbitrarily pass loads, but stores must always be executed in order. Loads can pass stores only if the two addresses are verified to be different. Intel would not reveal the size of the MOB or other details of its function.
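Those ordering rules can be captured in a few lines of C: loads pass older loads freely, but a load may pass an older store only when that store's address is already known and different. The structure below is purely illustrative, since Intel has not revealed the MOB's size or lookup mechanism.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool     is_store;
    bool     addr_known;     /* has the address been generated yet?          */
    uint32_t addr;
} MobEntry;

/* May the load at position 'load_idx' issue ahead of the older entries?
 * Loads pass loads freely; a load passes an older store only if that
 * store's address is known and differs from the load's address.             */
static bool load_may_issue(const MobEntry *mob, int load_idx) {
    for (int i = 0; i < load_idx; i++) {
        if (!mob[i].is_store)
            continue;                                    /* older load: ignore */
        if (!mob[i].addr_known || mob[i].addr == mob[load_idx].addr)
            return false;                                /* possible conflict  */
    }
    return true;
}

int main(void) {
    MobEntry mob[] = {
        { true,  true, 0x1000 },    /* older store to 0x1000                */
        { false, true, 0x2000 },    /* load from 0x2000: may pass the store */
        { false, true, 0x1000 },    /* load from 0x1000: must wait          */
    };
    printf("load from 0x2000 may issue early: %s\n", load_may_issue(mob, 1) ? "yes" : "no");
    printf("load from 0x1000 may issue early: %s\n", load_may_issue(mob, 2) ? "yes" : "no");
    return 0;
}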
The P6 data cache can process one load and one store per cycle as long as they access different banks. The cache is divided into four interleaved banks, half as many as Pentium's data cache. AMD's K5 also implements a four-bank data cache, and that company says there is little benefit to an eight-bank design.

The P6 cache cannot handle two loads at once, in stark contrast to processors such as the K5, M1, and even Pentium. Typical x86 code generates a large number of memory references due to the limited register set, and more of these references will be loads than stores. Intel points out that the entire processor is designed for one load per cycle (even the decoders cannot produce more than one load uop per cycle) and that its simulations show this capability is adequate to attain the desired performance level.

The data cache has a latency of three cycles (including address generation) but is fully pipelined, producing one result per cycle. The unified level-two (L2) cache has a latency of three cycles and, like the data cache, is pipelined and nonblocking. The fast L2 latency is achieved by implementing the L2 cache as a single chip and combining it with the CPU in a single package, as Figure 2 shows.

Address translation occurs in parallel with the data-cache access. If the access misses the data cache, the translated (real) address is sent to the L2 cache. The latency of a load that misses the data cache but hits in the L2 is six cycles, assuming that the L2 cache is not busy returning data from an instruction-cache miss or an earlier data-cache miss.

The P6 CPU contains a complete L2 cache controller and is Intel's first x86 processor with a dedicated L2 cache bus. Both NexGen's Nx586 and the R4000 use this design style to increase cache-bus bandwidth while allowing the separate system bus to operate at a lower, more manageable speed.

[Figure 2. The P6 CPU and L2 cache are combined in a single 387-pin PGA package that measures 6.76 × 6.25 cm (2.66" × 2.46").]

The L2 cache uses the same 32-byte line size as the on-chip caches. It returns 64 bits of data at a time, taking four cycles to refill a cache line. The cache always returns the requested word within the first transfer, getting the critical data back to the processor as quickly as possible. Table 2 shows other cache parameters.

[Table 2. The fast, nonblocking L2 cache is fully pipelined and helps make up for the higher miss rate of the small primary caches.]

This combination of nonblocking caches with a fast L2 cache gives the P6 better performance on memory accesses than its x86 competitors. The processor stalls less often and has relatively quick access to 256K of memory. The K5, by contrast, has 24K of on-chip cache but will take longer to access its secondary cache.
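Using just the latencies quoted above (three cycles for a data-cache hit, six for a miss that hits the in-package L2), a load's best-case latency can be modeled as below. The article does not give the penalty for going all the way to main memory, so that number is an explicit assumption, and the model ignores the queuing delays a busy cache can add.

#include <stdbool.h>
#include <stdio.h>

#define L1_HIT_CYCLES 3     /* data cache, including address generation      */
#define L2_HIT_CYCLES 6     /* data-cache miss that hits the 256K L2         */
#define MEM_CYCLES    40    /* assumed: main-memory latency is not published */

/* Best-case load latency; assumes neither cache is busy with earlier
 * misses, which the article notes can add further delay.                    */
static int load_latency(bool l1_hit, bool l2_hit) {
    if (l1_hit) return L1_HIT_CYCLES;
    if (l2_hit) return L2_HIT_CYCLES;
    return MEM_CYCLES;
}

int main(void) {
    printf("data-cache hit : %d cycles\n", load_latency(true,  false));
    printf("L2 hit         : %d cycles\n", load_latency(false, true));
    printf("memory access  : %d cycles (assumed)\n", load_latency(false, false));
    return 0;
}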
Deep Pipeline Speeds Clock Rate

Another advantage that the P6 has over its x86 competitors is a higher clock rate. Intel achieves this feat by deeply pipelining the chip. Figure 3 shows the P6 pipeline, which consists of 12 stages. This pipeline represents the case of an instruction that flows through the CPU as quickly as possible. It is more likely that the instruction will be stalled in the reservation station for some number of cycles, a delay represented by the first thick black band in the figure. The second black band represents another potential delay: completed instructions can spend several cycles waiting in the ROB before retirement.

[Figure 3. In the best case, instructions can flow through the P6 in 12 cycles, but the average is several cycles longer due to delays in the reservation station (RS) or the reorder buffer (ROB). Load instructions take longer and can also be delayed in the memory reorder buffer (MOB).]

In the first stage, the next fetch address is calculated by accessing the branch target buffer (BTB). If there is a hit in the BTB, the fetch stream is redirected to the indicated location. Otherwise, the processor continues to the next sequential address.

The instruction-cache access is spread across two and one-half cycles. The K5, in contrast, must calculate the next address and read from the instruction cache in a single cycle. By allowing multiple pipeline stages for the cache access, Intel removes this task from the critical timing path and allows the P6 clock to run faster.

Instructions are then fed to the decoders. The complex problem of decoding variable-length x86 instructions is allocated two and one-half cycles as well. Part of the problem in a superscalar x86 processor is identifying the starting point of the second and subsequent instructions in a group. The K5 includes predecode information in its instruction cache to hasten this process, but the P6 does not, to avoid both instruction-cache bloat and the bottleneck of predecoding instructions as they are read from the L2 cache.

The decoding task is easier for the P6 because the second and third decoders handle only simple instructions. The restricted decoders can wait for the general decoder to identify the length of the first instruction and still have time to handle the remaining instructions, assuming that they are simple ones. If they are not simple, the restricted decoders must pass them on to the general decoder the next cycle.

At the end of stage 6, the x86 instructions have been fully decoded and translated into uops. These uops have their registers renamed in stage 7. In stage 8, the renamed uops are written to the reservation station. If the operands for a particular uop are available, and if no other uops have priority for the needed function unit, that uop is dispatched. Otherwise, the uop will wait for its operands to become available.

It takes one cycle (stage 9) for the reservation station to decide which uops can be dispatched. The P6 implements operand bypassing, so a result can be used on the immediately following cycle. The reservation station will attempt to have the corresponding uop arrive at the function unit at the same time as the necessary data.

Simple integer uops can execute in a single cycle (stage 10). Some integer uops and all FP uops take several cycles at this point. Load and store uops generate their address in one cycle and are written to the MOB; if the MOB is empty, a load will go directly to the data cache, but it will take an additional three cycles (assuming a cache hit) before the loaded data is available for use. The address-generation unit is probably a critical timing path in the P6, limiting the processor to 133 MHz; RISC processors can typically cycle their simpler ALUs at 200 MHz or better in 0.5-micron technology.

Once the uop executes, it writes its result to the ROB. If all previous uops have been retired, it takes one cycle to retire a uop. If previous instructions are still pending, however, it may take several cycles before the uop is retired. This delay does not impact performance.
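The decode constraints described earlier (the general decoder takes any instruction and can emit up to four uops, the two restricted decoders take only single-uop instructions, and a complex instruction in a restricted slot ends the group for that cycle) reduce to the simple grouping rule sketched here. The uop counts assigned to the instruction stream are example values, not measurements.

#include <stdio.h>

/* Uops needed by each pending x86 instruction (example values);
 * anything over one uop can be handled only by the general decoder.         */
static const int uops_needed[] = { 1, 1, 2, 1, 1, 1 };
#define N_INSNS ((int)(sizeof uops_needed / sizeof uops_needed[0]))

/* Decode one cycle starting at instruction 'pos': slot 0 is the general
 * decoder, slots 1 and 2 are the restricted decoders.  Returns how many
 * x86 instructions were consumed this cycle.                                */
static int decode_cycle(int pos) {
    int consumed = 0;
    for (int slot = 0; slot < 3 && pos + consumed < N_INSNS; slot++) {
        if (slot > 0 && uops_needed[pos + consumed] > 1)
            break;          /* complex: must wait for the general decoder    */
        consumed++;
    }
    return consumed;
}

int main(void) {
    int pos = 0, cycle = 0;
    while (pos < N_INSNS) {
        int n = decode_cycle(pos);
        printf("cycle %d: decoded %d instruction(s)\n", ++cycle, n);
        pos += n;
    }
    return 0;
}

With the example stream above, the complex third instruction ends the first decode group after two instructions, then leads the next group from the general decoder.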
The outcome of a conditional branch is not known until stage 10. The minimum penalty for a mispredicted branch is 11 cycles. Intel estimates that, on average, a uop spends four cycles in the reservation station, so a mispredicted branch will typically cause a 15-cycle penalty. If the branch spends an unusually long time in the reservation station, the penalty could be even worse.

Thus, the P6 designers spent a lot of effort to reduce the number of mispredicted branches. Like Pentium, the P6 uses a branch target buffer that retains both branch-history information and the predicted target of the branch. This table is accessed by the current program counter (PC). If the PC hits in the BTB, a branch is predicted to the target address indicated by the BTB entry; there is no delay for correctly predicted taken branches.

The BTB has 512 entries organized as a four-way set-associative cache. This size is twice that of Pentium's, improving the hit rate. The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method [1]. This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken-taken-not-taken. We estimate that the P6 BTB will deliver close to 90% accuracy on programs such as the SPECint92 suite.

A second type of misprediction occurs if a branch misses in the BTB. This situation is not detected until the instruction is fully decoded in stage 6. The branch is predicted to be taken if the offset is negative (indicating a likely loop), and the target address, if available, is used to redirect the fetch stream. At that point, however, seven cycles have been wasted fetching and decoding instructions that are unlikely to be needed. In this case, the long fetch-and-decode pipeline saps performance. Forward branches that miss the BTB are predicted to be not taken, so the sequential path continues to be fetched with no delay. Branches that miss the BTB are mispredicted more often than those that hit and are subject to the same mispredicted-branch penalties.
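A two-level predictor in the Yeh style can be sketched as below: four bits of per-branch history select one of sixteen two-bit saturating counters, which lets the predictor lock onto a repeating pattern such as taken-taken-not-taken. Intel has not disclosed the P6's actual table organization, so this arrangement is only one plausible illustration of the method, not the real design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Per-branch predictor state: four bits of recent outcomes plus a small
 * table of 2-bit saturating counters indexed by that history pattern.       */
typedef struct {
    uint8_t history;          /* last four outcomes, newest in bit 0 */
    uint8_t counters[16];     /* one 2-bit counter per history value */
} BranchPredictor;

static bool predict(const BranchPredictor *bp) {
    return bp->counters[bp->history & 0xF] >= 2;     /* weakly/strongly taken */
}

static void update(BranchPredictor *bp, bool taken) {
    uint8_t *c = &bp->counters[bp->history & 0xF];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    bp->history = (uint8_t)(((bp->history << 1) | (taken ? 1 : 0)) & 0xF);
}

int main(void) {
    BranchPredictor bp = { 0, { 0 } };
    int late_misses = 0;
    /* A branch that repeats taken, taken, not-taken: a single 2-bit counter
     * would keep mispredicting the not-taken case, but the history-indexed
     * counters learn the pattern completely after a short warm-up.          */
    for (int i = 0; i < 60; i++) {
        bool actual = (i % 3) != 2;
        if (i >= 30 && predict(&bp) != actual) late_misses++;
        update(&bp, actual);
    }
    printf("mispredictions after warm-up: %d\n", late_misses);
    return 0;
}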
Conditional Move Added

The P6 instruction set is nearly identical to Pentium's and, in fact, to that of the 386. The most significant addition is a conditional move (CMOV) instruction. This instruction, which has been added to several RISC instruction sets recently, helps avoid costly mispredictions by eliminating branches. The instruction copies the contents of one register into another only if a particular condition flag is set, replacing a test-and-branch sequence.

Like all current Intel processors, the P6 implements system-management mode. The new CPU also supports all the features in Pentium's secret Appendix H. Intel says that these features, including CPU ID, large page sizes, and virtual-8086 extensions, will be fully documented when the P6 is released. The P6 also implements performance counters similar to those described in Appendix H, but it uses a new instruction that allows them to be accessed more easily.

It has been widely speculated that the P6 would include new instructions to speed NSP (native signal processing) applications. While the P6's improved integer multiplier will assist NSP, multimedia extensions such as those in Sun's UltraSparc (see MPR 12/5/94, p. 18) would have an even bigger effect. Intel, however, denies that the P6 has any such extensions. This oversight could give RISC processors like UltraSparc a significant performance advantage when executing increasingly popular multimedia software.

P6 Bus Allows Glueless MP

The P6 system bus is completely redesigned from the Pentium bus. Both use 64 bits of data and operate at a maximum of 66 MHz, but the P6 can sustain much higher bandwidth because it uses a split-transaction protocol. When Pentium reads from its bus, it sends the address, waits for the result, and reads the returned data. This sequence ties up the bus for essentially the entire transaction. The P6 bus, in contrast, continues to conduct transactions while waiting for results. This overlapping of transactions greatly improves overall bus utilization.

All arbitration is conducted on separate signals on the P6 bus, in parallel with data transmission. Addresses are sent on their own 36-bit bus, so the bus is capable of sustaining the full 528 Mbytes/s of bandwidth for an indefinite period. Since the P6 has a private cache bus, the entire system-bus bandwidth can be devoted to memory and I/O accesses. (Intel will disclose the details of the P6 system bus at a later date.)
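The peak figure above follows directly from the bus width and clock, and the value of the split-transaction protocol can be shown with a simple utilization model. The memory wait and transfer times used below are assumed round numbers for illustration, not Intel specifications.

#include <stdio.h>

int main(void) {
    const double clock_mhz  = 66.0;    /* P6 (and Pentium) bus clock */
    const double bytes_xfer = 8.0;     /* 64-bit data path           */

    /* Peak bandwidth: 8 bytes per clock at 66 MHz = 528 Mbytes/s.   */
    printf("peak bandwidth: %.0f Mbytes/s\n", bytes_xfer * clock_mhz);

    /* Assumed example: a read returns a 32-byte line in 4 data clocks
     * after 8 clocks of memory wait.  A Pentium-style blocking bus is
     * held idle during the wait; a split-transaction bus releases it
     * so other transactions can use those clocks.                    */
    const double wait_clks = 8.0, data_clks = 4.0;
    printf("blocking bus: data on the wires %.0f%% of the time\n",
           100.0 * data_clks / (wait_clks + data_clks));
    printf("split-transaction bus: the %.0f wait clocks can carry other transfers\n",
           wait_clks);
    return 0;
}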
A split-transaction bus is ideal for a multiprocessor (MP) system. Intel has designed the P6 bus to support up to four processors without any glue logic; that is, the processor pins can be wired directly to each other, with only a single chip set needed to support several CPUs. Like Pentium, the P6 includes Intel's advanced priority interrupt controller (APIC), simplifying multiprocessor designs. The company believes that the system bus has enough bandwidth to support four P6 processors with little performance degradation. Larger MP systems can be built from clusters of four processors each.

To operate at 66 MHz with up to eight devices (four processors along with two memory controllers and two I/O bridges), the P6 bus uses modified GTL (Gunning transceiver logic) signal levels. This lower signal voltage (1.5 V) reduces settling time in a complex electrical environment. Intel will roll out the first P6 chip sets along with the processor, but other vendors are expected to produce compatible chip sets as well.

With its integrated cache and APIC, the P6 module offers an easy way to upgrade an MP system by plugging in a new processor. This feature will be most useful in servers, which often sell in MP configurations today. Ultimately, the P6 will be used in multiprocessor PCs, a market being seeded today by the APIC-enabled P54C.

Another Big, Power-Hungry CPU

As always, there is no free lunch: high performance comes at a price. The P6 CPU requires 5.5 million transistors, of which about 4.5 million are for logic and 1.0 million are in the 16K of cache. Even in a 0.5-micron, four-layer-metal BiCMOS process, this circuitry requires 306 mm² of silicon, as Figure 4 shows. To keep this in perspective, however, the die is only about 4% bigger than the original Pentium's. As with Pentium, a shrink to the next process will make the P6 much smaller and easier to manufacture.

[Figure 4. The P6 CPU measures 17.6 × 17.8 mm when fabricated in a 0.5-micron, four-layer-metal BiCMOS process.]

The cache chip consumes 202 mm² and is built in the same process as the CPU, as it must operate at the same clock rate. It contains tags and data storage for 256K of cache and requires 15 million transistors.

Although the P6 uses the same nominal IC process as the P54C Pentium, it operates from a core voltage of 2.9 V rather than Pentium's 3.3 V. This change could indicate some minor process tweaks to reduce transistor size or gate-oxide thickness; such tweaks can improve performance but reduce the allowable supply voltage.

The lower voltage has the side effect of reducing power consumption; even so, the P6 CPU has a preliminary power rating of 15 W maximum and 12 W typical. With the 256K L2 cache chip, the total power is expected to be 20 W maximum and 15 W typical. This power rating is slightly greater than that of the original P5 Pentium and is quite reasonable compared with next-generation RISC processors, which start at 30 W. The greater surface area of the P6 package allows it to use shorter heat sinks than the notoriously hot P5.

Intel investigated many types of multichip modules (MCMs) before settling on a simple two-cavity PGA, eschewing more complicated (and costly) flip-chip options. The package is similar to a standard PGA, with extra ceramic layers to route the 64-bit cache bus between the CPU and cache chips. The die are attached using standard wire bonding; no unusual substrates are required. The pin arrangement, shown previously in Figure 2, is asymmetric: interstitial pins surround the CPU, which drives all the signals that leave the module, but not the cache chip. We estimate that this 387-pin dual-cavity package costs Intel about $50 in volume. The MPR Cost Model computes the overall manufacturing cost of the P6 module to be roughly $250. This cost is greater than that of competitive chips, but the entire L2 cache is included. In a 0.35-micron process, the cost of the P6 module should drop considerably.

Improving System Performance

Intel designed the P6 to achieve high performance at the system level, not just within the CPU core. Thus, the team placed significant emphasis on the cache subsystem and the system bus as well as the CPU. The nonblocking caches, closely coupled L2 cache, and split-transaction bus exemplify this emphasis. These features put the P6 a step beyond competitive x86 processors.

[Table comparing the P6 with the K5, M1, Nx586, and Pentium on clock rate, function units, cache sizes, dedicated L2 cache bus, glueless MP support, IC process, die size, transistor count, package type, and estimated manufacturing cost.]

In contrast, AMD's K5 appears to be a P6-class CPU core trapped in a Pentium pinout. The business decision to retain the Pentium interface leaves the K5 ...
