Intel’s P6 Uses Decoupled Superscalar Design
Next Generation of x86 Integrates L2 Cache in Package with CPU
by Linley Gwennap
Intel's forthcoming P6 processor (see cover story) is designed to outperform all other x86 CPUs by a significant margin. Although it shares some design techniques with competitors such as AMD's K5, NexGen's Nx586, and Cyrix's M1, the new Intel chip has several important advantages over those competitors.
The P6's deep pipeline eliminates the cache-access bottlenecks that restrict its competitors to clock speeds of about 100 MHz. The new CPU is designed to run at 133 MHz in its initial 0.5-micron BiCMOS implementation; a 0.35-micron version, due next year, could push the speed as high as 200 MHz.
In addition, the Intel design uses a closely coupled secondary cache to speed memory accesses, a critical issue for high-frequency CPUs. Intel will combine the P6 CPU and a 256K cache chip in a single PGA package, reducing the time needed for data to move from the cache to the processor.
Like some of its competitors, the P6 translates x86 instructions into simple, fixed-length instructions that Intel calls micro-operations or uops (pronounced "you-ops"). These uops are then executed in a decoupled superscalar core capable of register renaming and out-of-order execution. Intel has given the name "dynamic execution" to this particular combination of features, which is neither new nor unique, but highly effective in increasing x86 performance.
The P6 also implements a new system bus with increased bandwidth compared to the Pentium bus. The new bus is capable of supporting up to four P6 processors with no glue logic, reducing the cost of developing and building multiprocessor systems. This feature set makes the new processor particularly attractive for servers; it will also be used in high-end desktop PCs and, eventually, in mainstream PC products.
Not Your Grandfather's Pentium
While Pentium's microarchitecture carries a distinct legacy from the 486, it is hard to find a trace of Pentium in the P6. The P6 team threw out most of the design techniques used by the 486 and Pentium and started from a blank piece of paper to build a high-performance x86-compatible processor.
The result is a microarchitecture that is quite radical compared with Intel's previous x86 designs, but one that draws from the same bag of tricks as competitors' x86 chips. To this mix, the P6 adds high-performance cache and bus designs that allow even large programs to make good use of the superscalar CPU core.
As Figure 1 (see page 10) shows, the P6 can be divided into two portions: the in-order and out-of-order sections. Instructions start in order but can be executed out of order. Results flow to the reorder buffer (ROB), which puts them back into the correct order. Like AMD's K5 (see MPR 10/24/94, p. 1), the P6 uses the ROB to hold results that are generated by speculative and out-of-order instructions. If it turns out that these instructions should not have been executed, their results can be flushed from the ROB before they are committed.
The performance increase over Pentium comes largely from the out-of-order execution engine. In Pentium, if an instruction takes several cycles to execute, due to a cache miss or other long-latency operation, the entire processor stalls until that instruction can proceed. In the same situation, the P6 will continue to execute subsequent instructions, coming back to the stalled instruction once it is ready to execute. Intel estimates that the P6, by avoiding stalls, delivers 1.5 SPECint92 per MHz, about 40% better than Pentium.
x86 Instructions Translate to Micro-ops
The P6 CPU includes an 8K instruction cache that is similar in structure to Pentium's. On each cycle, it can deliver 16 aligned bytes into the instruction byte queue. Unlike Pentium, the P6 cache cannot fetch an unaligned cache line, throttling the decode process when poorly aligned branch targets are encountered. Any hiccups in the fetch stream, however, are generally hidden by the uop queues in the execution engine.
The instruction bytes are fed into three instruction decoders. The first decoder, at the front of the queue, can handle any x86 instruction; the others are restricted to only simple (e.g., register-to-register) instructions. Instructions are always decoded in program order, so if an instruction cannot be handled by a restricted decoder, neither that instruction nor any subsequent ones can be decoded on that cycle; the complex instruction will eventually reach the front of the queue and be decoded by the general decoder.
Assuming that instruction bytes are available, at least one x86 instruction will be decoded per cycle, but more than one will be decoded only if the second (and third) instructions fall into the "restricted" category. Intel refused to list these instructions, but they do not include any that operate on memory. Thus, the P6's ability to execute more than one x86 instruction per cycle relies on avoiding long sequences of complex instructions or instructions that operate on memory.

FEBRUARY 16, 1995

Figure 1. The P6 combines an in-order front end with a decoupled superscalar execution engine that can process RISC-like micro-ops speculatively and out of order.
The decoders translate x86 instructions into uops. P6 uops have a fixed length of 118 bits, using a regular structure to encode an operation, two sources, and a destination. The source and destination fields are each wide enough to contain a 32-bit operand. Like RISC instructions, uops use a load/store model; x86 instructions that operate on memory must be broken into a load uop, an ALU uop, and possibly a store uop.
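The load/ALU/store split can be sketched as follows. The mnemonic handling and the (operation, source, source, destination) tuples are illustrative stand-ins for the format described above, not Intel's actual uop encoding.

```python
# Illustrative sketch of how a decoder might split an x86
# register/memory instruction into load, ALU, and store uops.
# The (op, src1, src2, dst) tuples mimic the "operation, two
# sources, destination" format; names are hypothetical.

def decode(mnemonic, dst, src):
    """Translate one two-operand x86 instruction into uops."""
    uops = []
    if src.startswith("["):           # e.g. ADD EAX, [EBX+4]
        uops.append(("load", src, None, "tmp"))
        src = "tmp"
    if dst.startswith("["):           # e.g. ADD [EBX+4], EAX
        uops.append(("load", dst, None, "tmp"))
        uops.append((mnemonic, "tmp", src, "tmp"))
        uops.append(("store", "tmp", None, dst))
    else:
        uops.append((mnemonic, dst, src, dst))
    return uops

# A register-register ADD needs one uop, a load-op ADD needs two,
# and a read-modify-write ADD needs load + ALU + store.
```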
The restricted decoders can produce only one uop per cycle (and thus accept only instructions that translate into a single uop). The generalized decoder is capable of generating up to four uops per cycle. Instructions that require more than four uops are handled by a uop sequencer that generates the requisite series of uops over two or more cycles. Because of x86 constructs such as the string instructions, a single instruction can produce a very long sequence of uops. Many x86 instructions, on the other hand, translate into a single uop, and the average is 1.5 to 2.0 uops per instruction, according to Intel.
The uops then pass through the reorder buffer. The ROB must log each uop so it can later be retired in program order. Each of the 40 ROB entries also has room to store the result of a load or calculation uop along with the condition codes that could be changed by that uop. As uops execute, they write their results to the ROB.
Closely associated with the ROB is the register alias table (RAT). As uops are logged in the ROB, the RAT determines if their source operands should be taken from the real register file (RRF) or from the ROB. The latter case occurs if the destination register of a previous instruction in the ROB matches the source register; if so, that source register number is replaced by a pointer to the appropriate ROB entry. The RAT is also updated with the destination register of each uop.
In this way, the P6 implements register renaming. When uops are executed, they read their data from either the register file or the ROB, as needed. Renaming, a technique discussed in detail when Cyrix introduced its M1 design (see MPR 10/25/93, p. 1), helps break one of the major bottlenecks of the x86 instruction set: the small number of general-purpose registers. The ROB provides the P6 with 40 registers that can hold the contents of any integer or FP register, reducing the number of stalls due to register conflicts.
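The RAT lookup described above can be sketched in a few lines. The dictionary-based table and register names are illustrative, not the actual hardware structure.

```python
# Minimal register-alias-table sketch: each architectural register
# maps either to the real register file ("RRF") or to the ROB entry
# of the most recent in-flight uop that writes it.

class RenameSketch:
    def __init__(self):
        self.rat = {}      # arch register -> ROB index of pending writer

    def rename(self, rob_index, srcs, dst):
        # Sources read from the ROB if a previous in-flight uop writes
        # them, otherwise from the real register file.
        renamed = [("ROB", self.rat[s]) if s in self.rat else ("RRF", s)
                   for s in srcs]
        self.rat[dst] = rob_index    # record this uop as dst's writer
        return renamed

r = RenameSketch()
r.rename(0, ["eax"], "ebx")          # first writer of EBX -> ROB entry 0
tags = r.rename(1, ["ebx", "ecx"], "eax")
# EBX is now read from ROB entry 0; ECX still comes from the RRF.
```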
Out-of-Order Engine Drives Performance
Up to three uops can be renamed and logged in the ROB on each cycle; these three uops then flow into the reservation station (RS). This section of the chip holds up to 20 uops in a single structure. (The RS can be smaller than the ROB because the ROB must track uops that have executed but are not retired.) The uops wait in the reservation station until their source operands are all available. Due to the register renaming, only a single ROB entry (per source) must be checked to determine if the needed values are available. Any uops with all operands available are marked as ready.
Each cycle, up to five uops can be dispatched: two calculations, a load, a store address, and a store data. A store requires a second uop to carry the store data, which, in the case of floating-point data, must be converted before writing it to memory.
There are some restrictions on pairing calculation uops due to the arrangement of the read ports of the reservation station. As Figure 1 shows, only one uop per cycle can be dispatched to the main arithmetic unit, which consists of the floating-point unit and a complete integer unit with an ALU, shifter, multiplier, and divider. The second calculation uop goes to the secondary ALU and must be either an integer-ALU operation (no shifts, multiplies, or divides) or a branch-target-address calculation. Simplifying the second calculation unit reduces the die area with little impact on performance.
In many situations, more uops will be ready to execute than there are function units and read ports. When this happens, the dispatch logic prioritizes the available uops according to a complex set of rules. Intel declines to discuss these rules but notes that older uops are given priority over newer ones, speeding the resolution of chains of dependent operations.
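The oldest-first priority rule might look roughly like this. The unit names and the one-uop-per-unit restriction are simplifications of the dispatch ports described above, not Intel's disclosed logic.

```python
# Sketch of oldest-first dispatch from a reservation station.
# Each entry is (age, unit_needed, ready); per cycle at most one
# uop goes to each hypothetical unit, preferring older uops.

def dispatch(station, units=("main", "secondary", "load", "store")):
    chosen, free = [], set(units)
    for uop in sorted(station, key=lambda u: u[0]):   # oldest first
        age, unit, ready = uop
        if ready and unit in free:
            chosen.append(uop)
            free.discard(unit)       # one uop per unit per cycle
    return chosen

rs = [(3, "main", True), (1, "main", True), (2, "load", True),
      (0, "main", False)]           # oldest uop still waits on operands
picked = dispatch(rs)
# The age-1 uop wins the main unit over the age-3 uop; the unready
# age-0 uop is skipped entirely.
```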
The integer units handle most calculations in a single cycle, but integer multiply and divide take longer. The P6 FPU executes adds in three cycles and requires five cycles for multiply operations. FP add and multiply are pipelined and can be executed in parallel with long-latency operations. Table 1 shows the cycle times for long-latency integer and floating-point operations.
The FXCH (floating-point exchange) instruction is not handled by the FPU. This instruction swaps the top of the FP register stack with another register in the stack and is frequently used in x86 floating-point applications, since many x86 FP instructions access only the top of the stack. The P6 handles FXCH entirely in the reorder buffer by treating it as a renaming of two registers. Thus, FXCH uops enter the ROB but not the reservation station and never enter the function units.
The dual address-generation units each contain a four-input adder that combines all possible x86 address components (segment, base, index, and immediate) in a single cycle. As in Pentium, each unit also contains a second four-input adder to perform a segment-limit check in parallel. The resulting address (along with, for a store, the source data) is then placed in the memory reorder buffer (MOB, in P6-ese) to await availability of the data cache.
The reorder buffer can accept three results per cycle: one from the main arithmetic unit, one from the secondary ALU, and one from a load uop. Because of long-latency operations, two or more function units within the main arithmetic unit can generate results in a single cycle. In this case, the function units must arbitrate to write to the ROB.
After a uop writes its result to the ROB (or, in the case of a store, to the MOB), it is eligible to be retired. The ROB will retire up to three uops per cycle, always in program order, by writing their results to the register file. Uops are never retired until all previous uops have completed successfully. If any exception or error occurs, uncommitted results in the ROB can be flushed, resetting the CPU to its proper state, as if the instructions had been executed in order up to the point of the error.
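In-order retirement with flush-on-fault can be sketched as follows. The three-per-cycle limit comes from the article; the (done, faulted) entry format is invented for illustration.

```python
# Sketch of in-order retirement: up to three completed uops leave
# the head of the ROB per cycle; a faulting uop discards everything
# younger than itself. Entries are (done, faulted) pairs.

def retire_cycle(rob):
    """Retire up to 3 uops from the head; flush on a fault."""
    retired = 0
    while rob and retired < 3:
        done, faulted = rob[0]
        if not done:
            break                 # head not complete: retirement stalls
        if faulted:
            rob.clear()           # flush all uncommitted results
            return retired, True
        rob.pop(0)
        retired += 1
    return retired, False

rob = [(True, False), (True, False), (False, False), (True, False)]
# Only the first two retire; the third is incomplete, so the
# already-completed fourth must wait, preserving program order.
```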
Nonblocking Caches Reduce Stalls
A key to the P6's performance is its cache subsystem. The primary data cache, at 8K, is relatively small for a processor of this generation, but the fast level-two cache helps alleviate the lower hit rate of the primary cache. If an access misses the data cache, the cache can continue servicing requests while waiting for the miss data to be returned. This technique, called a nonblocking cache or hit-under-miss, has been used for years by PA-RISC processors to avoid stalls.
The P6 takes advantage of the nonblocking cache with its memory reorder buffer. If the access at the front of the MOB misses, subsequent accesses will continue to execute. Since these accesses can execute out of order, the P6 must take care to avoid incorrect program execution. For example, loads can arbitrarily pass loads, but stores must always be executed in order. Loads can pass stores only if the two addresses are verified to be different. Intel would not reveal the size of the MOB or other details of its function.

Table 1. P6 floating-point latencies are similar to Pentium's, but integer arithmetic is much faster. Latencies are the same for single, double, and extended precision except where ranges are shown.
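The ordering rules above can be expressed as a small predicate. The queue format is hypothetical, and the rule that a store waits for all earlier operations is a conservative reading of "stores must always be executed in order."

```python
# Sketch of the memory-ordering rules: loads may pass loads, stores
# execute strictly in order (simplified here as passing nothing),
# and a load may pass a store only if their addresses differ.

def may_execute(index, queue):
    """May the memory op at `index` run before earlier queued ops?"""
    kind, addr = queue[index]
    for earlier_kind, earlier_addr in queue[:index]:
        if kind == "store":
            return False          # stores wait for all earlier ops
        if earlier_kind == "store" and earlier_addr == addr:
            return False          # load may not pass a matching store
    return True

q = [("store", 0x100), ("load", 0x200), ("load", 0x100), ("store", 0x300)]
# The load from 0x200 may pass the pending store; the load from 0x100
# may not; the trailing store must wait its turn.
```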
The P6 data cache can process one load and one store per cycle as long as they access different banks. The cache is divided into four interleaved banks, half as many as Pentium's data cache. AMD's K5 also implements a four-bank data cache, and that company says that there is little benefit to an eight-bank design.
The P6 cache cannot handle two loads at once, in stark contrast to processors such as the K5, M1, and even Pentium. Typical x86 code generates large numbers of memory references due to the limited register set, and more of these references will be loads than stores. Intel points out that the entire processor is designed for one load per cycle (even the decoders cannot produce more than one load uop per cycle) and that its simulations show this capability is adequate to attain the desired performance level.
The data cache has a latency of three cycles (including address generation) but is fully pipelined, producing one result per cycle. The unified level-two (L2) cache has a latency of three cycles and, like the data cache, is pipelined and nonblocking. The fast L2 latency is achieved by implementing the L2 cache as a single chip and combining it with the CPU in a single package, as Figure 2 (see page 12) shows.
Address translation occurs in parallel with the data cache access. If the access misses the data cache, the translated (real) address is sent to the L2 cache. The latency of a load that misses the data cache but hits in the L2 is six cycles, assuming that the L2 cache is not busy returning data from an instruction cache miss or an earlier data cache miss.
The P6 CPU contains a complete L2 cache controller and is Intel's first x86 processor with a dedicated L2 cache bus. Both NexGen's Nx586 and the R10000 use this design style to increase cache-bus bandwidth while allowing the separate system bus to operate at a lower, more manageable speed.

Figure 2. The P6 CPU and L2 cache are combined in a single 387-pin PGA package that measures 7.75 × 6.25 cm (3.05" × 2.46").
The L2 cache uses the same 32-byte line size as the on-chip caches. It returns 64 bits of data at a time, taking four cycles to refill a cache line. The cache always returns the requested word within the first transfer, getting the critical data back to the processor as quickly as possible. Table 2 shows other cache parameters.
This combination of nonblocking caches with a fast L2 cache provides the P6 with better performance on memory accesses than its x86 competitors. The processor stalls less often and has relatively quick access to 256K of memory. The K5, by contrast, has 24K of on-chip cache but will take longer to access its secondary cache.
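The benefit of a fast L2 can be roughed out with the standard average-access-time formula. Only the 3- and 6-cycle latencies come from the article; the hit rates and 30-cycle memory penalty below are invented for illustration.

```python
# Back-of-envelope average load latency using the article's 3-cycle
# data-cache and 6-cycle L2 latencies. The hit rates and the 30-cycle
# main-memory penalty are assumed values, not Intel figures.

def avg_latency(l1_hit, l2_hit, l1=3, l2=6, mem=30):
    miss1 = 1.0 - l1_hit
    return l1_hit * l1 + miss1 * (l2_hit * l2 + (1.0 - l2_hit) * mem)

fast_l2 = avg_latency(0.90, 0.95)            # in-package 6-cycle L2
slow_l2 = avg_latency(0.90, 0.95, l2=15)     # slower board-level L2
# A small 8K primary cache misses often, so a 6-cycle L2 keeps the
# average latency well below what a slow off-package cache allows.
```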
Deep Pipeline Speeds Clock Rate
Another advantage that the P6 has over its x86 competitors is a higher clock rate. Intel achieves this feat by deeply pipelining the chip. Figure 3 shows the P6 pipeline, which consists of 12 stages. This pipeline represents the case of an instruction that flows through the CPU as quickly as possible. It is more likely that the instruction will be stalled in the reservation station for some number of cycles, a delay represented by the thick black band. The second black band represents another potential delay: completed instructions can spend several cycles waiting in the ROB before retirement.

Table 2. The fast nonblocking L2 cache is fully pipelined and helps make up for the higher miss rate of the small primary caches.
In the first stage, the next fetch address is calculated by accessing the branch target buffer (BTB). If there is a hit in the BTB, the fetch stream is redirected to the indicated location. Otherwise, the processor continues to the next sequential address.
The instruction cache access is spread across two and one-half cycles. The K5, in contrast, must calculate the next address and read from the instruction cache in a single cycle. By allowing multiple pipeline stages for the cache access, Intel removes this task from the critical timing path and allows the P6 clock to run faster.
Instructions are then fed to the decoders. The complex problem of decoding variable-length x86 instructions is allocated two and one-half cycles as well. Part of the problem in a superscalar x86 processor is identifying the starting point of the second and subsequent instructions in a group. The K5 includes predecode information in its instruction cache to hasten this process, but the P6 does not, to avoid both instruction-cache bloat and the bottleneck of predecoding instructions as they are read from the L2 cache.
The decoding task is easier for the P6 because the second and third decoders handle only simple instructions. The restricted decoders can wait for the general decoder to identify the length of the first instruction and still have time to handle the remaining instructions, assuming that they are simple ones. If they are not simple, the restricted decoders must pass them on to the general decoder the next cycle.
At the end of stage 6, the x86 instructions have been fully decoded and translated into uops. These uops have their registers renamed in stage 7. In stage 8, the renamed uops are written to the reservation station. If the operands for a particular uop are available, and if no other uops have priority for the needed function unit, that uop is dispatched. Otherwise, the uop will wait for its operands to become available.
It takes one cycle (stage 9) for the reservation station to decide which uops can be dispatched. The P6 implements operand bypassing, so a result can be used on the immediately following cycle. The reservation station will attempt to have the corresponding uop arrive at the function unit at the same time as the necessary data.
Simple integer uops can execute in a single cycle (stage 10). Some integer uops and all FP uops take several cycles at this point. Load and store uops generate their address in one cycle and are written to the MOB; if the MOB is empty, a load will go directly to the data cache, but it will take an additional three cycles (assuming a cache hit) before the loaded data is available for use. The address-generation unit is probably a critical timing path in the P6, limiting the processor to 133 MHz; RISC processors can typically cycle their simpler ALUs at 200 MHz or better in 0.5-micron technology.

Figure 3. In the best case, instructions can flow through the P6 in 12 cycles, but the average is 18 cycles due to delays in the reservation station (RS) or the reorder buffer (ROB). Load instructions take longer and can also be stalled in the memory reorder buffer (MOB).
Once the uop executes, it writes its result to the ROB. If all previous uops have been retired, it takes one cycle to retire a uop. If previous instructions are still pending, however, it may take several cycles before the uop is retired. This delay does not impact performance.
Branch Prediction Accuracy Is Critical
The deep pipeline creates extraordinary branch penalties. The outcome of a conditional branch is not known until stage 10. The minimum penalty for a mispredicted branch is 11 cycles. Intel estimates that, on average, a uop spends four cycles in the reservation station, so a mispredicted branch will typically cause a 15-cycle penalty. If the branch spends an unusually long time in the reservation station, the penalty could be even worse.
Thus, the P6 designers spent a lot of effort to reduce the number of mispredicted branches. Like Pentium, the P6 uses a branch target buffer that retains both branch-history information and the predicted target of the branch. This table is accessed by the current program counter (PC). If the PC hits in the BTB, a branch is predicted to be taken to the target address indicated by the BTB entry; there is no delay for correctly predicted taken branches.
The BTB has 512 entries organized as a four-way set-associative cache. This size is twice that of Pentium's, improving the hit rate. The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method.[1] This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken-taken-not-taken. We estimate that the P6 BTB will deliver close to 90% accuracy on programs such as the SPECint92 suite.
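A two-level predictor in the spirit of the Yeh method can be sketched as follows. The table size, counter initialization, and update policy are textbook choices for illustration; Intel has not disclosed the exact design.

```python
# Sketch of a Yeh-style two-level adaptive predictor: 4 bits of
# per-branch history index a table of 2-bit saturating counters.

class AdaptivePredictor:
    def __init__(self):
        self.history = 0                 # last 4 outcomes, as bits
        self.counters = [1] * 16         # 2-bit counters, weakly not-taken

    def predict(self):
        return self.counters[self.history] >= 2      # True = taken

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & 0xF

p = AdaptivePredictor()
for outcome in [True, True, False] * 20:   # taken-taken-not-taken loop
    p.update(outcome)
# After a short warm-up, each distinct 4-bit history uniquely
# identifies the phase of the repeating pattern, so the predictor
# locks on and predicts the sequence perfectly.
```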
A second type of misprediction occurs if a branch misses in the BTB. This situation is not detected until the instruction is fully decoded in stage 6. The branch is predicted to be taken if the offset is negative (indicating a likely loop), and the target address, if available, is used to redirect the fetch stream. At that point, however, seven cycles have been wasted fetching and decoding instructions that are unlikely to be needed. In this case, the long fetch-and-decode pipeline saps performance.
Forward branches that miss the BTB are predicted to be not taken, so the sequential path continues to be fetched with no delay. Branches that miss the BTB are mispredicted more often than those that hit and are subject to the same mispredicted-branch penalties.
Conditional Move Added
The P6 instruction set is nearly identical to Pentium's and, in fact, to that of the 386. The most significant addition is a conditional move (CMOV) instruction. This instruction, which has been added to several RISC instruction sets recently, helps avoid costly mispredictions by eliminating branches. The instruction copies the contents of one register into another only if a particular condition flag is set, replacing a test-and-branch sequence.
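The effect of CMOV is to replace a control dependence with a data dependence, so there is no branch to mispredict. This Python stand-in mimics the register semantics, not actual machine code.

```python
# Two ways to compute max(a, b): a compare-and-branch sequence that
# the branch predictor must guess, and a CMOV-style select that is
# pure data flow. The arithmetic select below stands in for the
# conditional register copy described in the article.

def max_with_branch(a, b):
    if a > b:              # compiles to compare + conditional branch
        return a
    return b

def max_with_cmov(a, b):
    flag = int(a > b)      # compare sets the condition flag
    # CMOV-style select: no branch, just a flag-controlled move
    return flag * a + (1 - flag) * b
```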
Like all current Intel processors, the P6 implements system-management mode. The new CPU also supports all the features in Pentium's secret Appendix H. Intel says that these features (including CPU ID, large page sizes, and virtual-8086 extensions) will be fully documented when the P6 is released. The P6 also implements performance counters similar to those described in Appendix H, but it uses a new instruction that allows them to be accessed more easily.
It has been widely speculated that the P6 would include new instructions to speed NSP (native signal processing) applications. While the P6's improved integer multiplier will assist NSP, multimedia extensions such as those in Sun's UltraSparc (see MPR 12/5/94, p. 18) would have an even bigger effect. Intel, however, denies that the P6 has any such extensions. This oversight could give RISC processors like UltraSparc a significant performance advantage when executing increasingly popular multimedia software.
P6 Bus Allows Glueless MP
The P6 system bus is completely redesigned from the Pentium bus. Both use 64 bits of data and operate at a maximum of 66 MHz, but the P6 can sustain much higher bandwidth because it uses a split-transaction protocol. When Pentium reads from its bus, it sends the address, waits for the result, and reads the returned data. This sequence ties up the bus for essentially the entire transaction. The P6 bus, in contrast, continues to conduct transactions while waiting for results. This overlapping of transactions greatly improves overall bus utilization.

Figure 4. The P6 CPU measures 17.6 × 17.8 mm when fabricated in a 0.5-micron four-layer-metal BiCMOS process.
All arbitration is conducted on separate signals on the P6 bus, in parallel with data transmission. Addresses are sent on their own 36-bit bus, so the bus is capable of sustaining the full 528-Mbyte/s bandwidth for an indefinite period. Since the P6 has a private cache bus, the entire system-bus bandwidth can be devoted to memory and I/O accesses. (Intel will disclose the details of the P6 system bus at a later date.)
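The 528-Mbyte/s figure follows directly from the bus width and clock, as a quick check shows.

```python
# Peak P6 system-bus bandwidth: a 64-bit data path moves 8 bytes
# per bus clock, and the bus runs at up to 66 MHz.

data_width_bytes = 64 // 8
bus_mhz = 66
peak_mbytes_per_s = data_width_bytes * bus_mhz   # 528 Mbyte/s
```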
A split-transaction bus is ideal for a multiprocessor (MP) system. Intel has designed the P6 bus to support up to four processors without any glue logic; that is, the processor pins can be wired directly to each other, with only a single chip set to support several CPUs. Like Pentium, the P6 includes Intel's advanced programmable interrupt controller (APIC), simplifying multiprocessor designs. The company believes that the system bus has enough bandwidth to support four P6 processors with little performance degradation. Larger MP systems can be built from clusters of four processors each.
To operate at 66 MHz with up to eight devices (four processors along with two memory controllers and two I/O bridges), the P6 bus uses modified GTL (Gunning transceiver logic) signal levels. This lower signal voltage (1.5 V) reduces settling time in a complex electrical environment. Intel will roll out the first P6 chip sets along with the processor, but other vendors are expected to produce compatible chip sets as well.
With its integrated cache and APIC, the P6 module offers an easy way to upgrade an MP system by plugging in a new processor. This feature will be most useful in servers, which often sell in MP configurations today. Ultimately, the P6 will be used in multiprocessor PCs, a market being seeded today by the APIC-enabled P54C.
Another Big, Power-Hungry CPU
As always, there is no free lunch: high performance comes at a price. The P6 CPU requires 5.5 million transistors, of which about 4.5 million are for logic and 1.0 million are in the 16K of cache. Even in a 0.5-micron four-layer-metal BiCMOS process, this circuitry requires 306 mm² of silicon, as Figure 4 shows. To keep this in perspective, however, the die size is only 4% bigger than that of the original Pentium. As with Pentium, a shrink to the next process will make the P6 much smaller and easier to manufacture.
The cache chip consumes 202 mm² and is built in the same process as the CPU, as it must operate at the same clock rate. It contains tags and data storage for 256K of cache and requires 15 million transistors.
Although the P6 uses the same nominal IC process as the P54C Pentium, it operates from a core voltage of 2.9 V rather than Pentium's 3.3 V. This change could indicate some minor process tweaks to reduce transistor size or gate-oxide thickness; such tweaks can improve performance but reduce the allowable supply voltage.
The lower voltage has the side effect of reducing the power consumption; even so, the P6 CPU has a preliminary power rating of 15 W maximum and 12 W typical. With the 256K L2 cache chip, the total power is expected to be 20 W maximum and 15 W typical. This power rating is slightly greater than that of the original P5 Pentium and is quite reasonable compared with next-generation RISC processors, which start at 30 W. The greater surface area of the P6 package allows it to use shorter heat sinks than the notoriously hot P5.
Intel investigated many types of multichip modules (MCMs) before settling on a simple two-cavity PGA, eschewing more complicated (and costly) flip-chip options. The package is similar to a standard PGA, with extra ceramic layers to route the 64-bit cache bus between the CPU and cache chips. The die are attached using standard wire bonding; no unusual substrates are required.
The pin arrangement, shown previously in Figure 2, is asymmetric: interstitial pins surround the CPU, which drives all the signals that leave the module, but not the cache chip. We estimate that this 387-pin dual-cavity package costs Intel about $50 in volume. The MPR Cost Model computes the overall manufacturing cost of the P6
module to be roughly $250. This cost is greater than that of competitive chips, but the entire L2 cache is included. In a 0.35-micron process, the cost of the P6 will drop substantially.

Improving System Performance
Intel designed the P6 to achieve high performance at the system level, not just within the CPU core. Thus, the team placed significant emphasis on the cache subsystem and the system bus as well as the CPU. The nonblocking caches, closely coupled L2 cache, and split-transaction bus exemplify this emphasis. These features put the P6 a step beyond competitive x86 processors.

In contrast, AMD's K5 appears to be a P6-class CPU core trapped in a Pentium pinout. The business decision to target the Pentium interface leaves the K5