"Bobcat": AMD's New Low Power x86 Core Architecture


Bobcat

AMD's New Low Power x86 Core Architecture

Brad Burgess, AMD Fellow
Chief Architect / Bobcat Core
August 24, 2010

1 | Bobcat | Hot Chips 2010

Two x86 Cores Tuned for Target Markets

Bulldozer
Performance & Scalability
Mainstream Client and Server Markets

Bobcat
Flexible, Low Power & Small
Low Power Markets
Small Die Area
Cloud Clients Optimized

Bobcat Design Goals

A small, efficient, low power x86 core
Excellent performance
Synthesizable with small number of custom arrays
Easily portable across process technologies

Feature Set

64-bit AMD64 x86 ISA
SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A
Virtualization
Support for misaligned 128-bit data types
Instruction Based Sampling (for dynamic optimization)
C6 (with integrated power gating)

Micro-architecture Overview

Dual x86 instruction decode
Out-of-Order instruction execution
Dual COP retirement
Complex microOPs
State of the art branch prediction
Aggressive OOO load/store engine w/ hazard prediction
Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration
Low power C6 state w/ core level power gating and state save acceleration

Bobcat Micro-Architecture

[Block diagram. Labeled blocks: Branch Predictor (Branch Locator, Condition Predictor, Dynamic Target, Return Stack), 32KB ICACHE, ITLB, Fetch Queue, Dual x86 Decoder, uCode, Instr Queue, ROB, Int Rename, Scheduler (two), Int PRF, ALU (two), LAGU, SAGU, Mul, FP Decode, FP Rename, FP Sched, FP PRF, MMX Alu (two), FP Logical (two), IntMul, St Conv, FPAdd, FPMul, Table Walker, DTLB, 32KB DCACHE, LdSt Unit, Prefetch, 512KB L2CACHE, BU (to/from Northbridge).]

Icache:
32 Kbyte
2-way set associative
64-byte line
Parity Protected
512/8 entry ITLB (4k/2m)
Fetch up to 32 bytes/cycle

Branch Predictor:
Predicts up to two branches per cycle
Remembers branch instruction locations
Indirect Dynamic Address Predictor
Only necessary structures are clocked
Return Stack Address Predictor
State of the Art condition Predictor

Dual x86 Decoder:
Scans up to 22 bytes
Decodes up to two x86 instructions per cycle
The decoder can directly map 89% of x86 instructions to a single microOp, an additional 10% to a pair of microOps, and more complicated x86 instructions (< 1%) are microcoded. (Dynamic Instruction Counts)
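The decode mix quoted for the Dual x86 Decoder implies an average microOp count per dynamic x86 instruction. A minimal sketch of that arithmetic follows. It assumes the microcoded share is exactly 1% (the slide only says "< 1%"), and `uops_per_microcoded` is a hypothetical value, since the slide does not say how long microcode routines are.

```python
def uops_per_instruction(single=0.89, double=0.10, microcoded=0.01,
                         uops_per_microcoded=4.0):
    """Expected microOps per dynamic x86 instruction.

    single, double: fractions quoted on the slide (89% and 10%).
    microcoded: the slide says "< 1%"; 0.01 is used here as an upper bound.
    uops_per_microcoded: hypothetical average microcode routine length.
    """
    return single * 1 + double * 2 + microcoded * uops_per_microcoded


# Contribution of the directly decoded instructions only (microcode ignored).
directly_decoded = uops_per_instruction(uops_per_microcoded=0.0)

# Same mix with the hypothetical 4-microOp microcode routines included.
with_microcode = uops_per_instruction()

print(f"{directly_decoded:.2f} {with_microcode:.2f}")
```

Under these assumptions the directly decoded instructions contribute about 1.09 microOps per instruction, so the average stays close to one microOp per instruction unless microcode routines are very long.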

Integer Execution:
A dual port integer scheduler feeds two ALUs
A dual port address scheduler feeds a load address unit and a store address unit
Physical Register File uses maps and pointers to reduce power by minimizing data copying/movement
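The idea of a physical register file addressed through maps and pointers can be illustrated with a toy renamer. This is a generic sketch of the technique, not the Bobcat implementation: the class name `RenameMap`, the register counts, and the immediate freeing of the old register are all assumptions made for illustration.

```python
class RenameMap:
    """Toy register renamer: architectural registers point into a
    physical register file (PRF) through a map table."""

    def __init__(self, num_arch, num_phys):
        self.prf = [0] * num_phys                     # physical register values
        self.map = list(range(num_arch))              # arch reg -> PRF index
        self.free = list(range(num_arch, num_phys))   # unallocated PRF entries

    def read(self, arch):
        """Read an architectural register by following its pointer."""
        return self.prf[self.map[arch]]

    def write(self, arch, value):
        """Write a result into a fresh PRF entry and repoint the map.

        The value is written once; making it architecturally visible is a
        pointer update, not a copy into a separate architectural file.
        """
        new = self.free.pop(0)     # IndexError if empty; hardware would stall
        old = self.map[arch]
        self.prf[new] = value
        self.map[arch] = new
        self.free.append(old)      # simplification: real designs free at retire
        return new


rm = RenameMap(num_arch=4, num_phys=8)
rm.write(0, 42)
print(rm.read(0), rm.map[0])
```

The point of the sketch is that the data value moves once, into the PRF, and everything afterwards is bookkeeping on small indices.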

Floating Point Unit:
A centralized FP scheduler feeds two 64-bit FP execution stacks
MMX and Logical units are replicated in both stacks
A physical register file is used to reduce power
The FP Mul Unit can perform two SP multiplies per cycle
The FP Add Unit can perform two SP additions per cycle
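The per-cycle figures for the FP Mul and FP Add units give a peak single-precision rate. A small sketch of that arithmetic follows. The clock frequency is a hypothetical parameter (the slides do not state one), and the cycle count for a 128-bit packed operation assumes it is simply split into 64-bit halves, which the slides imply but do not state.

```python
def peak_sp_flops_per_cycle(mul_per_cycle=2, add_per_cycle=2):
    """Peak single-precision FLOPs per cycle when the multiplier and the
    adder are both busy (2 multiplies + 2 additions, per the slide)."""
    return mul_per_cycle + add_per_cycle


def passes_for_packed_op(vector_bits=128, stack_bits=64):
    """Passes a packed operation needs through one 64-bit stack, assuming
    it is split into stack-width pieces (ceiling division)."""
    return -(-vector_bits // stack_bits)


def peak_sp_gflops(clock_ghz):
    """Peak SP GFLOP/s at a hypothetical clock frequency in GHz."""
    return peak_sp_flops_per_cycle() * clock_ghz


print(peak_sp_flops_per_cycle(), passes_for_packed_op(), peak_sp_gflops(1.0))
```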

Data Cache:
32 Kbyte
8-way set associative
64-byte line
Parity Protected
Copyback
Advanced 8-stream prefetcher
40/8 entry L1DTLB (4k/2m)
512/64 entry L2DTLB (4k/2m)
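The DTLB entry counts translate directly into "reach", the amount of memory that can be mapped without a table walk. A minimal sketch, assuming the "N/M (4k/2m)" notation means N entries for 4 KB pages and M entries for 2 MB pages:

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries_4k, entries_2m):
    """Return (reach with 4 KB pages, reach with 2 MB pages) in bytes."""
    return entries_4k * 4 * KB, entries_2m * 2 * MB


l1_4k, l1_2m = tlb_reach(40, 8)      # L1 DTLB: 40/8 entries
l2_4k, l2_2m = tlb_reach(512, 64)    # L2 DTLB: 512/64 entries

print(l1_4k // KB, "KB,", l1_2m // MB, "MB")   # L1 DTLB reach
print(l2_4k // MB, "MB,", l2_2m // MB, "MB")   # L2 DTLB reach
```

With 4 KB pages the L1 DTLB covers 160 KB and the L2 DTLB covers 2 MB, which is four times the size of the 512 KB L2 cache.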

Out-of-Order Load Store Unit:
Loads bypassing loads
Loads bypassing stores
Stores bypassing loads
Bypass tracking and dependency correction
Fast store forwarding
Hazard predictor
Fast critical word fill forwarding
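Store forwarding, one of the load/store features listed, can be illustrated with a toy store queue. This is a generic sketch of the technique and not the Bobcat design: the `StoreQueue` class is hypothetical, all accesses are assumed to be the same size and aligned, and memory is modeled as a plain dictionary.

```python
class StoreQueue:
    """Toy store queue: stores wait here before being written to memory,
    and younger loads can take their data directly from the queue."""

    def __init__(self):
        self.entries = []                 # (address, value), oldest first

    def store(self, addr, value):
        self.entries.append((addr, value))

    def load(self, addr, memory):
        """Return (value, forwarded). The youngest matching store wins."""
        for a, v in reversed(self.entries):
            if a == addr:
                return v, True            # forwarded from the store queue
        return memory.get(addr, 0), False # no match: read memory instead

    def drain(self, memory):
        """Commit all queued stores to memory in program order."""
        for a, v in self.entries:
            memory[a] = v
        self.entries.clear()


memory = {0x100: 1}
sq = StoreQueue()
sq.store(0x100, 7)
print(sq.load(0x100, memory))   # served from the queue, memory still holds 1
```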

L2 Cache:
512 Kbyte
16-way set associative
64-byte lines
ECC Protected
Half speed clocking for power reduction
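The size, associativity, and line size quoted for the instruction cache, data cache, and L2 determine the number of sets and the address bits used for indexing. A short sketch of that arithmetic, assuming power-of-two geometry and conventional set indexing (the slides do not describe the actual indexing scheme):

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways, and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


KB = 1024
caches = {
    "Icache": cache_geometry(32 * KB, 2, 64),
    "Dcache": cache_geometry(32 * KB, 8, 64),
    "L2": cache_geometry(512 * KB, 16, 64),
}
for name, (sets, offset_bits, index_bits) in caches.items():
    print(name, sets, offset_bits, index_bits)
```

For the data cache this gives 64 sets, so offset plus index is 12 bits, the same width as the offset within a 4 KB page.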

Bus Unit:
8 outstanding data accesses
2 outstanding fetch accesses
Eviction Buffers
Fill Buffers
Write combining buffers
Coherency management
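A fixed number of outstanding data accesses bounds the sustained miss bandwidth through Little's law (bandwidth = bytes in flight / latency). A rough sketch follows. It assumes each outstanding data access is a full 64-byte line fill, and `latency_ns` is a hypothetical memory latency, since the slides give no latency to memory.

```python
def max_miss_bandwidth_gbps(outstanding=8, line_bytes=64, latency_ns=100.0):
    """Upper bound on sustained fill bandwidth in GB/s (bytes per ns).

    outstanding: data accesses in flight (8, per the Bus Unit slide).
    line_bytes: bytes per access, assumed to be one 64-byte cache line.
    latency_ns: hypothetical round-trip latency of one access.
    """
    return outstanding * line_bytes / latency_ns


print(max_miss_bandwidth_gbps())                  # 100 ns latency assumed
print(max_miss_bandwidth_gbps(latency_ns=50.0))   # halving latency doubles it
```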

Bobcat Pipeline

[Pipeline diagram. Stage labels recovered from the figure: Fetch0, Fetch1, Fetch2, Fetch3, Fetch4, Fetch5, Dec0, Dec1, Dec2, Pack, Dispatch, Schedule, RegRead, EXE, Writeback, ALU; uCode ROM, MDec, FDec; Transit, FpDec, RegRen; AGU, DC1, DC2; L2Tag, L2Data.]

Branch Mispredict Latency: 13 cycles
Load Use Latency, L1 hit: 3 cycles
Load Use Latency, L2 hit: 17 cycles
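The pipeline latencies (13-cycle branch mispredict, 3-cycle L1 load-use, 17-cycle L2 load-use) can be plugged into two standard back-of-the-envelope formulas. In this sketch only those three cycle counts come from the slide. The branch fraction, mispredict rate, hit rates, and memory latency are hypothetical inputs chosen for illustration.

```python
def mispredict_cpi(branch_fraction, mispredict_rate, penalty_cycles=13):
    """Cycles per instruction lost to branch mispredicts."""
    return branch_fraction * mispredict_rate * penalty_cycles


def avg_load_latency(l1_hit_rate, l2_hit_rate, mem_cycles,
                     l1_cycles=3, l2_cycles=17):
    """Average load-use latency in cycles.

    l1_cycles and l2_cycles are treated as total load-use latencies,
    as quoted on the slide. l2_hit_rate is the hit rate among L1 misses.
    mem_cycles is a hypothetical latency for loads that miss the L2.
    """
    l1_miss = 1.0 - l1_hit_rate
    return (l1_hit_rate * l1_cycles
            + l1_miss * l2_hit_rate * l2_cycles
            + l1_miss * (1.0 - l2_hit_rate) * mem_cycles)


# Hypothetical workload: 20% branches, 5% mispredicted.
print(round(mispredict_cpi(0.20, 0.05), 3))
# Hypothetical hit rates: 95% L1, 80% of L1 misses hit L2, 150-cycle memory.
print(round(avg_load_latency(0.95, 0.80, 150), 3))
```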

Core Floor Plan

[Die floor plan. Labeled blocks: Floating Point Unit, Test/Debug, Data L2 TLB, X86 Decode, Bus Unit, Instruction Cache, L2 Sub Array, Inst TLB/Tag, L2 TAG, Branch Predict, Ucode ROM, ROB, Data Cache, Integer Unit, Data Tag/TLB, Load Store Unit.]

Power Reduction

Use of physical Register files
Extensive use of non-shifting queues with pointers
Fine grain clock gating
Integrated Core Power Gating
Only needed arrays are clocked
  i.e. Dtag hit before Dcache read
  Predicting the type of branch then clocking the appropriate predictor(s)
Elimination of instruction marker bits in the Icache
Finding the knee of the curve (scrutinize performance gains against power costs)
Polishing speed paths to raise the Vt mix and reduce leakage

Bobcat Core Overview

Advanced Micro-architecture
  Dual x86 Decode
  Advanced Branch Predictor
  Full OOO instruction execution
  Full OOO load/store engine
  High Performance Floating Point
  AMD64 64-bit ISA
  SSE1,2,3, SSSE3 ISA
  Secure Virtualization
  32KB L1s, 512KB L2

Low Power Design
  Power Optimized Execution
  Micro-architecture that minimizes data movement and unnecessary reads
  Clock gating, Power gating
  System Low Power States

Small Core
  Area efficient balance of high performance and low power

[Block diagram: Bobcat Low Power Core. Labeled blocks: ICACHE, Fetch, Decode, Integer Scheduler, I Pipe (two), Address Scheduler, Load Pipe, Store Pipe, FP Scheduler, A Pipe, M Pipe, DCACHE, L2, BU.]

Summary

Estimated 90% of the performance of today's mainstream notebook CPU in half the area*
Sub-one watt capable
Highly portable across designs and manufacturing technologies

*Based on internal AMD modeling using benchmark simulations
