Cpe626 ARMorganization
ARM Organization and Implementation

Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka

- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications
- Barrel shifter – shifts or rotates one operand by any number of bits
- ALU – performs the arithmetic and logic functions required
- Memory address register + incrementer
- Memory data registers
- Instruction decoder and associated control logic

[Figure: datapath organization; labels: A bus, B bus, ALU bus, multiply register, barrel shifter, ALU, control, data out register, data in register, D[31:0]]

- Decode
  - the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute
  - the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
ARM single-cycle instruction pipeline
Control stalls: due to branches
- Branches often introduce stalls (branch penalty)
  - Stall time may depend on whether the branch is taken
- May have to squash instructions that already started executing
- Don’t know what to fetch until the condition is evaluated

ARM pipelined branch
- Decision not made until the third clock cycle
- Two cycles of work thrown away if the bne is taken

time ->
     bne foo          fetch  decode  ex bne  ex bne  ex bne
     sub r2,r3,r6            fetch   decode
foo  add r0,r1,r2                            fetch   decode  ex add
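The branch penalty described above can be put into numbers with a rough cycle-count model. This is only a sketch under simplifying assumptions that are not in the slides: every instruction otherwise completes in one cycle, the pipeline needs `stages - 1` cycles to fill, and each taken branch throws away exactly two cycles. The function names are hypothetical.

```python
def pipeline_cycles(n_instructions, n_taken_branches, stages=3, branch_penalty=2):
    """Estimate total cycles for a simple in-order pipeline.

    Assumption (not from the slides): one cycle per instruction once the
    pipeline is full, plus a fixed penalty for every taken branch.
    """
    fill_cycles = stages - 1
    return fill_cycles + n_instructions + branch_penalty * n_taken_branches


def cycles_per_instruction(n_instructions, n_taken_branches):
    """Average CPI under the same simplified model."""
    return pipeline_cycles(n_instructions, n_taken_branches) / n_instructions


if __name__ == "__main__":
    print(pipeline_cycles(10, 0))
    print(pipeline_cycles(10, 3))
    print(cycles_per_instruction(100, 20))
```

Under this model, the cost of a taken branch grows with the number of stages that sit in front of the point where the branch decision is made.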
Pipeline: how it works
- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle

ARM9TDMI 5-stage pipeline
- Fetch
- Decode
  - instruction is decoded
  - register operands read (3 read ports)
- Execute
- Buffer/data
  - data memory is accessed (load, store)
- Write-back

[Figure: ARM9TDMI 5-stage pipeline organization; labels: next pc, +4, I-cache, fetch, pc + 4, r15, instruction decode, register read, immediate fields, mul, MOV pc, SUBS pc, byte repl., load/store address, D-cache, buffer/data, rot/sgn ex]
- 5-stage pipeline

    LD  r3, [r2]       ; r3 := mem[r2]
    ADD r1, r2, r3     ; r1 := r2 + r3

- [...] stage and 5-stage [...] unacceptable
- To avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

[Figure: 5-stage pipeline organization, shown on two slides; labels: immediate fields, B, BL, mux paths, SUBS pc, byte repl., load/store address, D-cache, buffer/data, rot/sgn ex, LDR pc]
[Figure: datapath activity; (a) register – register operations, (b) register – immediate operations; labels: ALU output = A / A + B / A - B, immediate field [7:0], as instruction, data out, data in, i. pipe]

- AR = AR + 4
- If autoindexing => Rn = Rn +/- 4

[Figure: datapath activity; (a) 1st cycle – compute address, (b) 2nd cycle – store data & auto-index; labels: ALU output = A + B / A - B, offset field [11:0], shifter lsl #0, as instruction, byte?, data out, data in, i. pipe]
The first two (of three) cycles of a branch instruction
- Compute target address
  - AR = PC + Disp, lsl #2
- Save return address
  - r14 = PC
  - AR = AR + 4

[Figure: datapath activity; (a) 1st cycle – compute branch target, (b) 2nd cycle – save return address; labels: address register, increment, mult, shifter, lsl #2]

ARM Implementation
- Datapath
- Control unit (FSM)
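The first-cycle computation, AR = PC + Disp, lsl #2, can be sketched in a few lines. The sketch assumes the standard ARM branch encoding (a 24-bit signed displacement) and 32-bit wrap-around addresses; neither detail is spelled out on the slide, and the function names are hypothetical. `pc` is the PC value as the datapath sees it, which on ARM is the branch instruction's address plus 8.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def branch_target(pc, disp24):
    """AR = PC + (Disp lsl #2), wrapped to 32 bits.

    Assumption: disp24 is the raw 24-bit displacement field of the branch.
    """
    return (pc + (sign_extend(disp24, 24) << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    # Forward branch by two words, then backward branch by two words.
    print(hex(branch_target(0x8008, 0x000002)))
    print(hex(branch_target(0x8008, 0xFFFFFE)))
```

Shifting the displacement left by two turns a word offset into a byte offset, which is why the shifter is set to lsl #2 in the first cycle.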
ARM datapath timing (cont’d)

Minimum Datapath Delay =
    Register read time
  + Shifter Delay
  + ALU Delay
  + Register write set-up time
  + Phase 2 to phase 1 non-overlap time

[Figure: datapath timing across phase 1 and phase 2; labels: ALU operands latched, register read time, read bus valid, precharge invalidates buses, shift time, shift out valid, ALU time, ALU out, register write time]

The original ARM1 ripple-carry adder
- Carry logic: use CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown below
- Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
- Worst case path: 32 gates long

[Figure: one bit of the ripple-carry adder; inputs A, B, Cin; outputs sum, Cout]
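A bit-level model shows why the worst-case path of a ripple-carry adder is as long as the word: each bit's carry-out depends on the carry-out of the bit below it. The sketch below models only the logical function (sum = A xor B xor Cin, carry = majority of A, B, Cin). It does not reproduce the AOI gate structure or the even/odd bit inversion described on the slide, and the function name is hypothetical.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit values one bit at a time, rippling the carry upward.

    Returns (sum, carry_out). Functional model only; gate-level details
    such as alternating inverted stages are not represented.
    """
    result = 0
    carry = carry_in
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        # Carry out is 1 when at least two of the three inputs are 1.
        carry = (a_bit & b_bit) | (carry & (a_bit | b_bit))
        result |= sum_bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(5, 7))
    print(ripple_carry_add(0xFFFFFFFF, 1))
```

The second call is the worst case: the carry generated at bit 0 has to pass through all 32 bit positions before the carry-out is known.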
ARM2 4-bit carry look-ahead scheme
- Carry Generate (G), Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Use AOI and alternate AND/OR gates

[Figure: 4-bit adder logic; inputs A[3:0], B[3:0], Cin[0]; outputs sum[3:0], G, P, Cout[3]]

The ARM2 ALU logic for one result bit
- ALU functions
  - data operations (add, sub, ...)
  - address computations for memory accesses
  - branch target computations
  - bit-wise logical operations

[Figure: ALU logic for one result bit; labels: fs 0 to fs 5, carry, NA bus, ALU bus]
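The relation Cout[3] = Cin[0].P + G can be checked with a small model of one 4-bit group. The sketch assumes the usual definitions, which the slide does not spell out: a bit generates a carry when both inputs are 1, propagates one when at least one input is 1, and the group signals are built up from the per-bit signals. Function names are hypothetical.

```python
def group_propagate_generate(a, b, width=4):
    """Return (P, G) for a width-bit group.

    P: the group passes an incoming carry through every bit position.
    G: the group produces a carry-out on its own, regardless of carry-in.
    """
    group_p = 1
    group_g = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        bit_g = a_bit & b_bit
        bit_p = a_bit | b_bit
        # A carry leaves bit i if it is generated here or arrives and is propagated.
        group_g = bit_g | (bit_p & group_g)
        group_p &= bit_p
    return group_p, group_g


def group_carry_out(a, b, carry_in):
    """Cout[3] = Cin[0].P + G for a 4-bit group."""
    group_p, group_g = group_propagate_generate(a, b)
    return group_g | (group_p & carry_in)


if __name__ == "__main__":
    print(group_propagate_generate(0b1010, 0b0101))
    print(group_carry_out(0b1010, 0b0101, 1))
```

Because P and G depend only on the A and B inputs of the group, they are available before the incoming carry arrives, so the carry crosses four bits in a single gate step.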
ARM2 ALU function codes

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B

The ARM6 carry-select adder scheme
- Compute sums of various fields of the word for carry-in of zero and carry-in of one
- Final result is selected by using a multiplexor

[Figure: carry-select adder; inputs a,b[3:0] ... a,b[31:28]; adders +, +1; signals c, s, s+1; mux; outputs sum[3:0], sum[7:4], sum[15:8], sum[31:16]]
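The carry-select idea can be sketched as follows: each field's sum is computed for both possible carry-ins, and the real carry from the field below selects one of them. The field boundaries are an assumption read off the figure's output labels (sum[3:0], sum[7:4], sum[15:8], sum[31:16]). The hardware computes both candidate sums in parallel, which this sequential model does not capture, and the names are hypothetical.

```python
# (lowest bit, width) of each field, as suggested by the figure's output labels.
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]


def carry_select_add(a, b, carry_in=0):
    """32-bit addition in the carry-select style. Returns (sum, carry_out)."""
    result = 0
    carry = carry_in
    for low, width in FIELDS:
        mask = (1 << width) - 1
        field_a = (a >> low) & mask
        field_b = (b >> low) & mask
        sum_if_carry0 = field_a + field_b        # "s"
        sum_if_carry1 = field_a + field_b + 1    # "s+1"
        # The multiplexor: the incoming carry picks one precomputed sum.
        chosen = sum_if_carry1 if carry else sum_if_carry0
        result |= (chosen & mask) << low
        carry = chosen >> width
    return result, carry


if __name__ == "__main__":
    total, carry_out = carry_select_add(0x12345678, 0x0FEDCBA8)
    print(hex(total), carry_out)
```

The design trades area for speed: every field needs two adders, but the carry only has to pass through one multiplexor per field instead of rippling through every bit.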
[Figure: ALU result path; labels: logic/arithmetic result mux, zero detect, N, Z, result]
The cross-bar switch barrel shifter
- Shifter delay is critical since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
- Principle for 4x4 matrix

[Figure: 4x4 cross-bar switch matrix; inputs in[3], in[2], in[1], in[0]; diagonals: right 3, right 2, right 1, no shift, left 1, left 2, left 3]

The cross-bar switch barrel shifter (cont’d)
- Precharged logic is used => each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
- Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
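The cross-bar principle can be modeled on the 4x4 case: enabling a diagonal connects each input to the output a fixed distance away, outputs that nothing drives keep their precharged 0, and a rotate enables two complementary diagonals at once. This is a behavioral sketch only; the bit-list representation and the function names are illustrative assumptions, not part of the slides.

```python
def crossbar(inputs, enabled_shifts):
    """inputs[i] is in[i]; enabled_shifts holds signed diagonal distances
    (positive = shift right, negative = shift left, 0 = no shift)."""
    n = len(inputs)
    outputs = [0] * n  # precharge: undriven outputs stay at logic 0
    for shift in enabled_shifts:
        for i in range(n):
            j = i - shift  # 'right k' connects in[i] to out[i - k]
            if 0 <= j < n:
                outputs[j] |= inputs[i]
    return outputs


def logical_shift_right(inputs, k):
    return crossbar(inputs, {k})


def rotate_right(inputs, k):
    # 'right k' plus the complementary 'left (n - k)' diagonal.
    return crossbar(inputs, {k, -(len(inputs) - k)})


def arithmetic_shift_right(inputs, k):
    # Separate logic fills the vacated top k outputs with the sign bit.
    outputs = crossbar(inputs, {k})
    for j in range(len(inputs) - k, len(inputs)):
        outputs[j] = inputs[-1]
    return outputs


def to_bits(value, n=4):
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    print(from_bits(rotate_right(to_bits(0b1101), 1)))
```

For a 4-bit rotate right by 1, the 'right 1' diagonal drives out[0] to out[2] and the 'left 3' diagonal drives out[3] from in[0], so no output is driven twice.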
ARM high-speed multiplier organization

[Figure: high-speed multiplier; labels: registers, initialization for MLA, Rs >> 8 bits/cycle, Rm, carry-save adders, rotate sum and carry 8 bits/cycle, partial sum, partial carry]

ARM2 register cell circuit

[Figure: register cell; labels: write, read A, read B, ALU bus, A bus, B bus]
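The figure labels mention carry-save adders with separate partial sum and partial carry values. The general carry-save principle can be sketched as below: three operands are reduced to a sum word and a carry word without propagating carries, and a single ordinary addition at the end combines them. This is not a model of the ARM multiplier itself. It handles one multiplier bit per step rather than 8 bits per cycle, it ignores the MLA initialization, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(x, y, z):
    """Reduce three 32-bit operands to (partial_sum, partial_carry)
    such that partial_sum + partial_carry == x + y + z (mod 2**32)."""
    partial_sum = (x ^ y ^ z) & MASK32
    partial_carry = (((x & y) | (x & z) | (y & z)) << 1) & MASK32
    return partial_sum, partial_carry


def multiply_carry_save(rm, rs):
    """32-bit product of rm and rs, accumulated in carry-save form."""
    partial_sum, partial_carry = 0, 0
    for i in range(32):
        if (rs >> i) & 1:
            partial_sum, partial_carry = carry_save_add(
                partial_sum, partial_carry, (rm << i) & MASK32
            )
    # Only this final step needs a full carry-propagating addition.
    return (partial_sum + partial_carry) & MASK32


if __name__ == "__main__":
    print(multiply_carry_save(12345, 6789) == (12345 * 6789) & MASK32)
```

Keeping the running total as two words means no accumulation step waits for a carry chain, which is what makes several partial products per cycle feasible.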
[Figure: labels: INC, data in, B bus, instruction bus, instruction pipe, data out, Din]
ARM control logic structure

[Figure: control logic structure; labels: instruction, decode PLA, cycle count, multiply control, load/store multiple, coprocessor]