Bus - 7 - 25 - Morning - P
[Figure: generated bus RTL code for Bus 1 and Bus 2, each attached to a memory (Mem)]
Split Transaction
• Address, data, and response are handled separately.
[Figure: AXI channels with their ready signals: Write Address/Control (AWREADY), Write data (WREADY), Read Address/Control (ARREADY), Read data (RREADY)]
Split Transaction: Write (1/3)
• Master issues the address on the Write Address/Control channel
Split Transaction: Write (2/3)
• Master gives the data on the Write data channel
Split Transaction: Write (3/3)
• Slave acknowledges the write
Split Transaction: Read (1/2)
Split Transaction: Read (2/2)
• Slave returns the data on the Read data channel
Wire Counts
• 32b address, 32b data bus: 184~204 wires in total
– AW: 52~56, W: 39~43, B: 4~8, AR: 52~56, R: 37~41
• AW signals are similar to AR signals
• R signals are similar to W signals
– RLAST is controlled by the slave
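The per-channel ranges above add up to the quoted total. A quick arithmetic check follows; the dictionary and function names are only illustrative.

```python
# Per-channel wire-count ranges (min, max) quoted above for a
# 32b address / 32b data bus.
CHANNEL_WIRES = {
    "AW": (52, 56),
    "W": (39, 43),
    "B": (4, 8),
    "AR": (52, 56),
    "R": (37, 41),
}

def total_wires(channels):
    """Sum the lower and upper bounds over all channels."""
    low = sum(lo for lo, _ in channels.values())
    high = sum(hi for _, hi in channels.values())
    return low, high

print(total_wires(CHANNEL_WIRES))  # (184, 204)
```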
One Address for Burst
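[Figure: timing diagram; the ADDRESS line carries A11, A21, A31, one address per burst]
• Narrow transfer
* The delay of D21, D22, and D23 is reduced
[Plot: memory utilization vs. number of masters (1, 2, 4, 6)]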
• Analysis
• A single master w/o outstanding requests can achieve only about 30%
utilization.
• The effects of bank interleaving by different masters are significant, up to 74%.
• The 6-master case may suffer from bank conflicts.
Effect of Multiple Outstanding Requests
• Setup
• Multiple outstanding requests by each master
[Figure: Masters #1 to #6 connected through PL300, with register slices (RS), and PL340 to a 32b SDR memory with Banks #1 to #4]
[Plot: memory utilization (0 to 80) vs. number of masters (1, 2, 4, 6); read case, RS 0/2]
• Analysis
• A single master w/ multiple outstanding requests can achieve >50% utilization.
• The effects of bank interleaving by different masters are significant, up to 74%.
• Register slices do not degrade overall performance (i.e., utilization), since
multiple outstanding requests hide their latency.
Read Burst Operation
• Read request is accepted
• Note: data is transferred only when valid = ready = 1
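The valid = ready = 1 rule can be checked on a cycle-by-cycle trace. The sketch below is a minimal model, not RTL; the trace values are hypothetical.

```python
def transferred_beats(valid, ready, data):
    """Return the data beats that actually transfer.

    valid, ready: per-cycle 0/1 samples of the two handshake signals.
    data: per-cycle value driven by the source (meaningful only when valid is 1).
    """
    return [d for v, r, d in zip(valid, ready, data) if v == 1 and r == 1]

# Hypothetical 6-cycle trace of an R channel: the slave drives RVALID/RDATA,
# the master drives RREADY. In cycle 2 the master is not ready, so D1 is
# held and transfers one cycle later.
rvalid = [0, 1, 1, 1, 1, 0]
rready = [1, 1, 0, 1, 1, 1]
rdata = [None, "D0", "D1", "D1", "D2", None]

print(transferred_beats(rvalid, rready, rdata))  # ['D0', 'D1', 'D2']
```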
Overlapping Read Bursts
• Read request A is accepted
• Read request B is accepted via the AR channel while data A(0) is transferred via the R channel
Write Burst Operation
AXI3
AXI4
Ordering #2: Multiple read requests in slave and interconnect
[Figure: (left) a master IP connected to a slave IP, annotated with the read data reordering depth; (right) a master IP connected through an interconnect to slave IP 1 and slave IP 2]
Ordering #3: Write
• Write data with different AWIDs follows the order of the write addresses
• Interleaving rule
– Data with different IDs can be interleaved
– The order within a single burst is maintained
– The first data item of each transaction must appear in the same order as the requests
– Write interleaving capability: the maximum number of transactions that the master can interleave
[AXI4]
[Figure: crossbar in front of a memory controller]
ID: 4 + ceil(log2(7)) = 7 bits
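The ID width arithmetic can be reproduced as follows. This sketch assumes the expression means a 4-bit master ID extended by enough bits to tell 7 masters apart; the function name is hypothetical.

```python
def extended_id_bits(master_id_bits, num_masters):
    """ID width after appending bits that identify the source master.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1 and avoids
    floating-point rounding.
    """
    return master_id_bits + (num_masters - 1).bit_length()

print(extended_id_bits(4, 7))  # 7
```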
Cache Support (AXI3)
ARCACHE[3:0] / AWCACHE[3:0]
• Bufferable bit (B): AWCACHE[0]
– The write can be delayed for an arbitrary number of cycles
• Cacheable bit (C): AR(W)CACHE[1]
– Read: prefetching or reading from a cache is possible
– Write: write merging is possible
• Read Allocate bit (RA): ARCACHE[2]
– On a read miss, fetch the data into the cache
– If C is low, RA must be low
• Write Allocate bit (WA): AWCACHE[3]
– On a write miss, fetch the data into the cache, and then write
to the cache (and through to memory)
– If C is low, WA must be low
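The bit assignments above can be decoded mechanically. The sketch below treats ARCACHE and AWCACHE uniformly as a 4-bit value; the function name and the "legal" flag are illustrative additions.

```python
def decode_axcache(value):
    """Split a 4-bit AxCACHE value into the AXI3 attribute bits listed above."""
    if not 0 <= value <= 0b1111:
        raise ValueError("AxCACHE is a 4-bit field")
    bits = {
        "B": bool(value & 0b0001),   # bufferable
        "C": bool(value & 0b0010),   # cacheable
        "RA": bool(value & 0b0100),  # read allocate
        "WA": bool(value & 0b1000),  # write allocate
    }
    # Rule from the list above: if C is low, RA and WA must also be low.
    bits["legal"] = bits["C"] or not (bits["RA"] or bits["WA"])
    return bits

print(decode_axcache(0b0111))
```

For example, 0b0111 decodes to bufferable, cacheable, and read allocate, while 0b0100 is flagged as not legal because RA is set with C low.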
[AXI4]
[Figure: memory hierarchy: CPU with register file (RF), L1 instruction cache and L1 data cache, unified L2 cache, and memory]
Write Response from Intermediate Point
• Basically, the memory gives the write response
• An intermediate point, e.g., a cache, can also give the write
response. It is then responsible for delivering the
data to memory
Write Through and Write Back
• Write through
– If L1 is updated, all the corresponding data in L2 and
memory are updated
• Write back
– Data update is delayed until the data is evicted
[Source: K. Asanovic, 2008]
Write Buffer
[Figure: write buffer fed by write misses or write backs; entries are combined by write merging]
• If buffer contains modified blocks, the addresses can be
checked to see if address of new data matches the address of
a valid write buffer entry
• If so, new data are combined with that entry
• The Sun T1 (Niagara) processor, among many others, uses
write merging
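A toy model of the address check and merge described above. This is a sketch only: the block size, class name, and data layout are assumptions, and real write buffers also track valid bits and drain entries to memory.

```python
class WriteBuffer:
    """Toy write buffer: one entry per block address, holding per-word data."""

    def __init__(self, words_per_block=4):
        self.words_per_block = words_per_block
        self.entries = {}  # block address -> {word offset: data}

    def write(self, word_addr, data):
        """Insert a word; return True if it merged into an existing entry."""
        block, offset = divmod(word_addr, self.words_per_block)
        merged = block in self.entries
        self.entries.setdefault(block, {})[offset] = data
        return merged

buf = WriteBuffer()
print(buf.write(100, "a"))  # False: new entry for block 25
print(buf.write(101, "b"))  # True: merged into the same entry
print(buf.write(108, "c"))  # False: block 27 gets its own entry
print(len(buf.entries))     # 2
```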
[AXI4]
Memory Type
[AXI4]
Normal Non-cacheable Non-bufferable / Bufferable
• Normal Non-cacheable Non-bufferable: 0010
– Write merging is possible (modifiable)
• Normal Non-cacheable Bufferable: 0011
– Write response from an intermediate point is
possible (bufferable or allocate)
– Write merging is possible (modifiable)
– Read data comes from the destination or from a
write transaction that is on its way to the
destination (modifiable and bufferable, for reads)
[AXI4]
Write Through 1
• Write Through No Allocate: 1010/0110 (ARCACHE/AWCACHE)
– Why write through? Not bufferable
– Why no allocate? Only the other-allocate bit is set
– Write response from an intermediate point is possible
(bufferable or (other) allocate)
– Write merging is possible (modifiable)
– Read data from an intermediate cached copy
(bufferable or (other) allocate)
– Read pre-fetch is possible (modifiable)
– Allocation of either reads or writes is not
recommended for performance reasons, but is not
prohibited
[AXI4]
Write Through 2
• Write Through Read Allocate: 1110 (0110) / 0110
– Same as Write Through No Allocate, except:
– Allocation of reads is recommended for performance reasons
– Allocation of writes is not recommended for performance
reasons
• Write Through Write Allocate: 1010 / 1110 (1010)
– The reverse of the above
– Allocation of writes is recommended for performance reasons,
while allocation of reads is not recommended
• Write Through Read and Write Allocate: 1110/1110
– Same as Write Through No Allocate, except:
– Allocation of both reads and writes is recommended
[AXI4]
Write Back 1
• Write Back No Allocate: 1011/0111 (ARCACHE/AWCACHE)
– Why write back? Bufferable
– Writes are not required to reach the destination (bufferable)
– Write response from an intermediate point is possible (bufferable
or (other) allocate)
– Write merging is possible (modifiable)
– Read data from an intermediate cached copy (bufferable and
(other) allocate)
– Read pre-fetch is possible (modifiable)
– Allocation of either reads or writes is not recommended for
performance reasons, but is not prohibited
[AXI4]
Write Back 2
• Write Back Read Allocate: 1111 (0111) / 0111
– Same as Write Back No Allocate, except:
– Allocation of reads is recommended for performance reasons
– Allocation of writes is not recommended for performance
reasons
• Write Back Write Allocate: 1011 / 1111 (1011)
– The reverse of the above
– Allocation of writes is recommended for performance reasons,
while allocation of reads is not recommended
• Write Back Read and Write Allocate: 1111/1111
– Same as Write Back No Allocate, except:
– Allocation of both reads and writes is recommended
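The encodings from the Memory Type slides can be collected into one lookup table, which makes the pattern visible: bit 0 (bufferable) separates Write Back from Write Through, and bit 1 (modifiable) is set for every Normal type. This is a sketch; the table keeps only the primary encodings and drops the alternatives shown in parentheses, and the helper names are illustrative.

```python
# (ARCACHE, AWCACHE) per memory type, as listed on the slides above.
MEMORY_TYPES = {
    "Normal Non-cacheable Non-bufferable": (0b0010, 0b0010),
    "Normal Non-cacheable Bufferable": (0b0011, 0b0011),
    "Write Through No Allocate": (0b1010, 0b0110),
    "Write Through Read Allocate": (0b1110, 0b0110),
    "Write Through Write Allocate": (0b1010, 0b1110),
    "Write Through Read and Write Allocate": (0b1110, 0b1110),
    "Write Back No Allocate": (0b1011, 0b0111),
    "Write Back Read Allocate": (0b1111, 0b0111),
    "Write Back Write Allocate": (0b1011, 0b1111),
    "Write Back Read and Write Allocate": (0b1111, 0b1111),
}

def is_bufferable(axcache):
    """Bit 0 of AxCACHE."""
    return bool(axcache & 0b0001)

def is_modifiable(axcache):
    """Bit 1 of AxCACHE."""
    return bool(axcache & 0b0010)

for name, (ar, aw) in MEMORY_TYPES.items():
    print(f"{name}: ARCACHE={ar:04b} AWCACHE={aw:04b} "
          f"bufferable={is_bufferable(aw)} modifiable={is_modifiable(aw)}")
```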
[AXI4]
Transaction Buffering
• When does the write reach the destination?
• The write reaches the destination in a
timely manner
– Bufferable (Device and Normal Non-Cacheable) and Write Through cases
• The write is not required to reach the destination (but
the data must not be lost in any case)
– All Write Back cases
[AXI4]
Device memory
Atomic Access
• Normal access: AR(W)LOCK[1:0] = b00
• Exclusive access: b01
– Exclusive read (load-linked) … exclusive write (store-conditional)
– If there is no intervening write to the address region, the exclusive write gets an EXOKAY
response; otherwise it gets an OKAY response
– Usually used for read-modify-write
[Timeline: Master 1 issues E.RD 0x100, Master 2 issues WR 0x100, then Master 1 issues E.WR 0x100; Slave 1 responds OKAY]
• Locked access: b10
– Starts with b10 and ends with b00
– During this period, only the master that initiated the lock can
access the address region (not the slave!)
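The exclusive-access timeline above (E.RD by Master 1, an intervening WR by Master 2, then E.WR by Master 1 answered with OKAY) can be replayed on a toy monitor. This is a sketch under simplifying assumptions: one reservation per master, exact-address matching instead of an address region, and hypothetical class and method names.

```python
class ExclusiveMonitor:
    """Toy exclusive-access monitor inside a slave."""

    def __init__(self):
        self.reservations = {}  # master id -> reserved address

    def exclusive_read(self, master, addr):
        self.reservations[master] = addr
        return "EXOKAY"

    def write(self, master, addr):
        # A normal write clears every reservation on that address.
        self._clear(addr)
        return "OKAY"

    def exclusive_write(self, master, addr):
        if self.reservations.get(master) == addr:
            self._clear(addr)  # success: the store happens, reservations drop
            return "EXOKAY"
        return "OKAY"          # failure: the store is not performed

    def _clear(self, addr):
        self.reservations = {
            m: a for m, a in self.reservations.items() if a != addr
        }

mon = ExclusiveMonitor()
mon.exclusive_read("M1", 0x100)
mon.write("M2", 0x100)
print(mon.exclusive_write("M1", 0x100))  # OKAY
```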
[AXI4]
Structurally
Configured
PL301 Features
• Configurable number of SIs and MIs
• Sparse connection options to reduce gate count and improve
security
• Configurable AXI address/data widths
• Decoded address register that you can configure for each SI
• Flexible register stages to aid timing closure
• An arbitration mechanism that you can configure for each MI,
implementing:
– a fixed Round-Robin (RR) scheme
– a programmable RR scheme
– a programmable scheme that provides prioritized groups of Least
Recently Granted (LRG) arbitration
• A programmable Quality of Service (QoS) scheme
• Support for multiple clock domains: synchronous and asynchronous.
• Configurable cyclic dependency schemes to enable a master to have
outstanding transactions to more than one slave
Arbitration Scheme: Fixed Priority
[Figure: PL301 control registers assign Priority 0 (highest) to M0 on SlaveInterface0, Priority 1 to M1 on SlaveInterface1, Priority 2 to M2 on SlaveInterface2, and Priority 3 (lowest) to M3 on SlaveInterface3]
[Source: PL301 TS]
Arbitration Scheme: Round Robin
Arbitration Scheme: Hybrid
• Combination of round robin and fixed priority
[Figure: PL301 control registers; M0 (Priority 0, highest, SlaveInterface0) and M1 (Priority 1, SlaveInterface1) are in fixed mode; M2 (SlaveInterface2) and M3 (SlaveInterface3) share Priority 2 (lowest) in round-robin mode and take turns]
[Source: PL301 TS]
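A behavioral sketch of the hybrid scheme: a lower priority number wins, and masters that share a priority level take turns. This is an illustrative model, not the PL301 implementation; the class name and interface are assumptions.

```python
class HybridArbiter:
    """Fixed priority across levels, round robin within a level."""

    def __init__(self, priorities):
        self.priorities = dict(priorities)  # master -> priority (0 is highest)
        self.last_grant = {}                # priority level -> last winner

    def grant(self, requesters):
        if not requesters:
            return None
        best = min(self.priorities[m] for m in requesters)
        group = sorted(m for m in self.priorities if self.priorities[m] == best)
        # Start the search just after the master granted last at this level.
        last = self.last_grant.get(best)
        if last in group:
            i = group.index(last) + 1
            group = group[i:] + group[:i]
        winner = next(m for m in group if m in requesters)
        self.last_grant[best] = winner
        return winner

arb = HybridArbiter({"M0": 0, "M1": 1, "M2": 2, "M3": 2})
print(arb.grant({"M2", "M3"}))  # M2
print(arb.grant({"M2", "M3"}))  # M3
print(arb.grant({"M0", "M2"}))  # M0
```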
Arbitration Scheme
• LRG (least recently granted) scheme
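A minimal sketch of least-recently-granted arbitration, assuming a single group of masters (the programmable PL301 scheme additionally supports prioritized groups, which are not modeled here). Names are hypothetical.

```python
class LrgArbiter:
    """Grant the requester that was granted least recently."""

    def __init__(self, masters):
        # Front of the list = least recently granted = highest priority.
        self.order = list(masters)

    def grant(self, requesters):
        for m in self.order:
            if m in requesters:
                self.order.remove(m)
                self.order.append(m)  # winner becomes most recently granted
                return m
        return None

arb = LrgArbiter(["M0", "M1", "M2"])
print(arb.grant({"M1", "M2"}))        # M1
print(arb.grant({"M1", "M2"}))        # M2
print(arb.grant({"M0", "M1", "M2"}))  # M0
```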
Arbitration Scheme: Fixed Round Robin
• A weighted round robin in a fixed order
[Figure: fixed slot orderings of masters m0 to m5 requesting slave s0]
An Example of Crossbar Bus Design
40% of total BW
Crossbar
Programmable QoS
• Maximum number of requests allowed for best-effort traffic
• Example: assume Tidemark = 4 and ID match = M0. If there are 4 outstanding
requests from M1, then S0 accepts only requests from M0 until one of
M1's requests is served
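The tidemark example can be written as a simple acceptance check. This sketch assumes a single counter of outstanding requests at S0 and a set of masters that remain accepted once the tidemark is reached; the function and parameter names are illustrative, not PL301 register names.

```python
def accepts(master, outstanding, tidemark=4, id_match=("M0",)):
    """Decide whether slave S0 accepts a new request from `master`.

    outstanding: number of requests currently outstanding at S0.
    tidemark:    outstanding count at which best-effort traffic is shut out.
    id_match:    masters that are still accepted once the tidemark is reached.
    """
    if outstanding < tidemark:
        return True
    return master in id_match

print(accepts("M1", 3))  # True: below the tidemark
print(accepts("M1", 4))  # False: only M0 is accepted now
print(accepts("M0", 4))  # True
```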
Bus Arbitration: A Generic Arbiter
• Assumptions
– Each request has its own performance requirement (e.g., bandwidth budget
and/or latency)
• E.g., low latency access from CPU
• E.g., bandwidth guarantee for LCD / Camera controllers
– In order to avoid starvation, a global time out is applied
• Priority order: time-out > bandwidth > best effort
– Time-out (TO) request: TO = 0, and BW budget > 0
– Bandwidth (BW) request: TO > 0, and BW budget > 0
– Best effort (BE) request: BW budget = 0
• Priority promotion/demotion
– Demotion: if BW budget is exhausted, demotion to BE
– Promotion: if BW budget becomes positive, promotion to BW or TO request
• Time-out counters
– One type of counter for QoS access w/ time out
– The other for old request
• When a normal request (w/ unspecified TO) arrives at the bus, a timer is
assigned and starts to be decremented each cycle.
Bus Arbitration: A Generic Arbiter
• If there is any TO request, it is served (highest priority)
– QoS request w/ time out
– Old request
• If there is any BW request
– Give the bus to the BW requests based on BW budget
• Based on fixed time slot allocation or statistical slot allocation
• To the other BE requests, apply the same priority order as for
BW requests
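The priority order time-out > bandwidth > best effort can be sketched as a classification followed by a selection. This is a simplified model: slot allocation and the separate old-request timer are not modeled, and breaking ties by the larger remaining budget is an assumption made here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    master: str
    timeout: int    # cycles left before the request times out
    bw_budget: int  # remaining bandwidth budget

def classify(req):
    """Map a request to TO, BW, or BE using the rules listed above."""
    if req.bw_budget <= 0:
        return "BE"  # budget exhausted: demoted to best effort
    return "TO" if req.timeout <= 0 else "BW"

RANK = {"TO": 0, "BW": 1, "BE": 2}

def pick(requests):
    """Serve TO first, then BW, then BE; ties go to the larger budget (assumed)."""
    if not requests:
        return None
    return min(requests, key=lambda r: (RANK[classify(r)], -r.bw_budget))

reqs = [Request("cpu", 5, 2), Request("lcd", 0, 1), Request("dma", 9, 0)]
print([classify(r) for r in reqs])  # ['BW', 'TO', 'BE']
print(pick(reqs).master)            # lcd
```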
[Figure: two masters and two memories behind a memory controller; timing diagram over cycles 1 to 17 for Memory 1, Memory 2, Master 1, and Master 2 with transactions A, B, C, and D]
Single Slave Scheme
• Allow multiple outstanding transactions
only to the same slave
Unique ID Scheme
• Accept only requests that may complete out of order, i.e.,
requests with different transaction IDs
Single Slave per ID
• Combination of both single slave and
unique ID schemes
• Allow multiple outstanding requests to a
single slave per transaction ID
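The three schemes can be compared as acceptance rules on a master's in-flight transactions. This sketch follows the one-line descriptions on the slides rather than the PL301 documentation, so the exact conditions are assumptions; the function and scheme names are hypothetical.

```python
def may_issue(scheme, outstanding, new_id, new_slave):
    """Check whether one master may issue a new transaction.

    outstanding: list of (transaction id, slave) pairs still in flight.
    """
    if scheme == "single_slave":
        # All in-flight transactions must target the same slave.
        return all(slave == new_slave for _, slave in outstanding)
    if scheme == "unique_id":
        # No in-flight transaction may share the new transaction's ID.
        return all(tid != new_id for tid, _ in outstanding)
    if scheme == "single_slave_per_id":
        # In-flight transactions with the same ID must target the same slave.
        return all(
            slave == new_slave for tid, slave in outstanding if tid == new_id
        )
    raise ValueError(f"unknown scheme: {scheme}")

inflight = [(1, "S0")]
print(may_issue("single_slave", inflight, 2, "S1"))         # False
print(may_issue("unique_id", inflight, 2, "S1"))            # True
print(may_issue("single_slave_per_id", inflight, 1, "S1"))  # False
```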
[Source: J. Yoo, 2008]
[Figure: Case A: single big crossbar (14x4) vs. Case B: cascaded crossbar (9x4 and 7x2); paths toward S2 are annotated with delays of 3ns, 4ns, and 5ns]
Register Slice for Timing Isolation
• A register slice incurs one cycle of latency per insertion
[Figure: register slices on the Write Address/Control (AWREADY), Write data (WREADY), Read Address/Control (ARREADY), and Read data (RREADY) channels]
[Source: PL301 TS]
An Academic Approach
[Source: J. Yoo, 2008]
[Figure: masters M2 to M13 and slaves S1 and S2 connected by a single 14x4 crossbar (left) or by cascaded 9x4 and 7x2 crossbars (right)]
Topology/Floorplan/Pipeline Co-design of Cascaded XB Bus
[Figure: example design with codec, conn, input, display, and DSP20 to DSP33 blocks, next to a flow that repeats until the End? check answers Yes]
Topology/Floorplan/Pipeline Co-design of Cascaded XB Bus
• Design input: communication graph and floorplan info
• Technology input: pre-characterized library of bus components
• Flow
– Initial population
– Pipeline stage insertion with timing analysis
– Cost evaluation and tournament selection
– End? If no, repeat; if yes, return results
[Figure: three example topologies over nodes A to F with crossbars X1 and X2, encoded as (B C A XB1 D E F, C B F XB1 E D A), (B C A XB1 XB2 D E F, C B F XB1 XB2 E D A), and (B C A XB1 XB2 D E F, C B XB1 F XB2 E D A)]
[Source: J. Yoo, 2008]
Topology/Floorplan/Pipeline Co-design of Cascaded XB Bus
[Figure: co-design flow (design input and technology input, initial population, topology and floorplan generation, pipeline stage insertion with timing analysis, cost evaluation and tournament selection, End? check, return results), illustrated with trees (a), (b), and (c) over masters M0 to M3, crossbars X0 to X2, and slaves S0 and S1; edges carry numeric labels from 0.5 to 4.0, and Tree #1 and Tree #2 are marked in (a)]
[Source: J. Yoo, 2008]
Experiments
[Figure: communication graph of the test design with codec, graphics, proc, lcd, security, conn, bridge, ddr1, nand0, nand1, and peri; edges are labeled with pairs such as (115, 40), (600, 40), (100, 25), (50, 25), and (50, 1000)]
[Plot: area (square mm) and power (mW), broken down into wire, crossbar, and pipeline, for case 1 and case 2 under aspdac07, aspdac07ga, and proposed]
[Figure: resulting topology with two XBARs and a bridge connecting graphics, proc, storage, dma, display, security, and lcd to ddr1, peri, and nand1]
• Topology/floorplan/pipeline co-design gives
lower area cost in cascaded crossbar bus design
Summary
• AMBA AXI (AXI3 and AXI4) protocol
– Specification
– Focus on ordering
• PL301: a crossbar bus
– Arbitration, QoS, cyclic dependency schemes
– Crossbar-based bus design flow