Cpe626 ARMorganization

ARM Organization and Implementation
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka

Outline
Ø ARM Architecture
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Architectural Support for High-level Languages
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM organization
Ø Register file – 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
Ø Barrel shifter – shift or rotate one operand for any number of bits
Ø ALU – performs the arithmetic and logic functions required
Ø Memory address register + incrementer
Ø Memory data registers
Ø Instruction decoder and associated control logic
[Figure: ARM organization datapath: address register and incrementer (A[31:0]), register bank with PC, multiply register, A and B buses, barrel shifter, ALU and ALU bus, data out and data in registers (D[31:0]), instruction decode and control]

Three-stage pipeline
Ø Fetch
§ the instruction is fetched from memory and placed in the instruction pipeline
Ø Decode
§ the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Ø Execute
§ the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register

ARM single-cycle instruction pipeline

instruction   time (one column per clock cycle) →
1             fetch   decode   execute
2                     fetch    decode    execute
3                              fetch     decode    execute

ARM single-cycle instruction pipeline (example)

instruction    time (one column per clock cycle) →
add r0,r1,#5   fetch   decode   execute add
sub r2,r3,r6           fetch    decode        execute sub
cmp r2,#3                       fetch         decode        execute cmp
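
The overlap shown in the single-cycle pipeline diagrams can be sketched in a few lines of Python. This is only an illustrative model: the function name `pipeline_schedule` is hypothetical, and it assumes every instruction spends exactly one cycle in each of the three stages.

```python
STAGES = ("fetch", "decode", "execute")

def pipeline_schedule(instructions):
    """Map each instruction to the clock cycle in which it occupies each stage.

    Assumption: instruction i (counting from 0) is fetched in cycle i + 1
    and moves to the next stage on every cycle (ideal three-stage pipeline).
    """
    schedule = {}
    for i, instr in enumerate(instructions):
        schedule[instr] = {stage: i + 1 + s for s, stage in enumerate(STAGES)}
    return schedule

program = ["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"]
sched = pipeline_schedule(program)
for instr, stages in sched.items():
    print(f"{instr:14s}", stages)
```

Under these assumptions one instruction completes per cycle once the pipeline is full, even though each instruction takes three cycles from fetch to execute.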

ARM multi-cycle instruction pipeline
Decode logic is always generating the control signals for the datapath to use in the next cycle

instruction   stages in time order
1             fetch ADD, decode, execute
2             fetch STR, decode, calc. addr., data xfer
4             fetch ADD, decode, execute

ARM multi-cycle LDMIA (load multiple) instruction

instruction        time (one column per clock cycle) →
ldmia r0,{r2,r3}   fetch   decode   ex ld r2   ex ld r3
sub r2,r3,r6               fetch               decode     ex sub
cmp r2,#3                           fetch                 decode   ex cmp

Decode stage occupied since ldmia must continue to remember decoded instruction
Instruction delayed: sub fetched at normal time but not decoded until LDMIA is finishing
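
The LDMIA timing can be reproduced with a small scheduling sketch. It assumes two rules stated in these slides (an instruction occupies the decode logic in the cycle before it first uses the datapath, and it issues the fetch of the next-but-one instruction in its first datapath cycle). The function name `schedule` and the tuple layout are hypothetical.

```python
def schedule(program):
    """program: list of (text, datapath_cycles) tuples in program order."""
    rows = []
    for i, (text, cycles) in enumerate(program):
        # The first two instructions fill the pipeline; later ones are fetched
        # during the first datapath cycle of the instruction two places back.
        fetch = i + 1 if i < 2 else rows[i - 2]["execute"][0]
        earliest = fetch + 2
        if rows:
            # The datapath is busy until the previous instruction finishes.
            earliest = max(earliest, rows[-1]["execute"][-1] + 1)
        rows.append({
            "instr": text,
            "fetch": fetch,
            "decode": earliest - 1,  # first decode cycle only
            "execute": list(range(earliest, earliest + cycles)),
        })
    return rows

rows = schedule([("ldmia r0,{r2,r3}", 2), ("sub r2,r3,r6", 1), ("cmp r2,#3", 1)])
for row in rows:
    print(row)
```

In this model `sub` is fetched in cycle 2 but cannot be decoded until cycle 4, because the two-register load keeps the datapath busy in cycles 3 and 4.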

Control stalls: due to branches
Ø Branches often introduce stalls (branch penalty)
§ Stall time may depend on whether branch is taken
Ø May have to squash instructions that already started executing
Ø Don’t know what to fetch until condition is evaluated

ARM pipelined branch
Decision not made until the third clock cycle

instruction           time (one column per clock cycle) →
bne foo               fetch   decode   ex bne   ex bne   ex bne
sub r2,r3,r6                  fetch    decode
foo add r0,r1,r2                                fetch    decode   ex add

Two cycles of work thrown away if bne takes place
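
As a rough illustration of the branch penalty, the sketch below counts total cycles for a straight run of instructions with some taken branches. It assumes every instruction otherwise executes in one cycle, that filling the three-stage pipeline costs two cycles, and that each taken branch throws away two cycles of work as in the diagram. The function and parameter names are hypothetical.

```python
def total_cycles(n_instructions, taken_branches, penalty=2, fill=2):
    """Cycle count for a simple three-stage pipeline model.

    Assumptions: single-cycle execution for every instruction, `fill`
    cycles to load the pipeline, and `penalty` squashed cycles per
    taken branch.
    """
    return fill + n_instructions + penalty * taken_branches

print(total_cycles(100, 0))   # no taken branches
print(total_cycles(100, 10))  # ten taken branches
```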

Pipeline: how it works
Ø All instructions occupy the datapath for one or more adjacent cycles
Ø For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
Ø During the first datapath cycle each instruction issues a fetch for the next instruction but one
Ø Branch instructions flush and refill the instruction pipeline

ARM9TDMI 5-stage pipeline
Ø Fetch
Ø Decode
§ instruction is decoded
§ register operands read (3 read ports)
Ø Execute
§ an operand is shifted and the ALU result generated, or
§ address is computed
Ø Buffer/data
§ data memory is accessed (load, store)
Ø Write-back
§ write to register file
[Figure: ARM9TDMI 5-stage pipeline organization: fetch (next pc, +4, I-cache), decode (instruction decode, register read, immediate fields), execute (mul, shift, ALU, forwarding paths, LDM/STM, pre-index and post-index), buffer/data (load/store address, D-cache, byte repl., rot/sgn ex), write-back (register write)]

ARM9TDMI Data Forwarding

Data forwarding:
ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3

ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r8, r9, r10           r8 := r9 + r10
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3

Stall?
LDR r3, [r2]              r3 := mem[r2]
ADD r1, r2, r3            r1 := r2 + r3

[Figure: ARM9TDMI 5-stage pipeline organization]

ARM9TDMI PC generation
Ø 3-stage pipeline
§ PC behavior: operands are read in execution stage
§ r15 = PC + 8
Ø 5-stage pipeline
§ operands are read in decode stage and r15 = PC + 4?
§ incompatibilities between 3-stage and 5-stage implementations => unacceptable
§ to avoid this 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

[Figure: ARM9TDMI 5-stage pipeline organization]
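
The forwarding and stall cases on the data forwarding slide can be classified with a short sketch. It assumes a 5-stage pipeline in which an ALU result can be forwarded to the very next instruction, while a value loaded from memory is not available in time for an immediately following dependent instruction, which therefore stalls. The `Instr` tuple and `classify` function are hypothetical names used only for illustration.

```python
from collections import namedtuple

# dest: destination register; sources: registers read; is_load: memory load?
Instr = namedtuple("Instr", "text dest sources is_load")

def classify(producer, consumer):
    """Relationship between two back-to-back instructions (simplified model)."""
    if producer.dest not in consumer.sources:
        return "independent"
    # Assumption: ALU results are forwarded, load data arrives one cycle late.
    return "stall" if producer.is_load else "forward"

add1 = Instr("ADD r3, r2, r1, LSL #3", "r3", ("r2", "r1"), False)
add2 = Instr("ADD r5, r5, r3, LSL r2", "r5", ("r5", "r3", "r2"), False)
load = Instr("LDR r3, [r2]", "r3", ("r2",), True)
add3 = Instr("ADD r1, r2, r3", "r1", ("r2", "r3"), False)

print(classify(add1, add2))
print(classify(load, add3))
```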

Data processing instruction datapath activity
ØReg-Reg
§ Rd = Rn op Rm
§ r15 = AR + 4, AR = AR + 4
ØReg-Imm
§ Rd = Rn op Imm
§ r15 = AR + 4, AR = AR + 4
[Figure: datapath activity (address register, increment, PC, registers, mult, shifter, ALU, data out, data in, i. pipe): (a) register – register operations, (b) register – immediate operations, immediate taken from instruction bits [7:0]]

STR (store register) datapath activity
ØCompute address
§ AR = Rn op Disp
§ r15 = AR + 4
ØStore data
§ AR = PC
§ mem[AR] = Rd<x:y>
§ If autoindexing => Rn = Rn +/- 4
[Figure: datapath activity (address register, increment, PC, registers, mult, shifter, ALU, data out, data in, i. pipe): (a) 1st cycle – compute address, displacement taken from instruction bits [11:0], (b) 2nd cycle – store data & auto-index]

The first two (of three) cycles of a branch instruction
ØCompute target address
§ AR = PC + Disp, lsl #2
ØSave return address (if required)
§ r14 = PC
§ AR = AR + 4
Third cycle: do a small correction to the value stored in the link register in order that it points directly at the instruction which follows the branch?
[Figure: datapath activity (address register, increment, PC, R14, registers, mult, shifter lsl #2, ALU, data out, data in, i. pipe): (a) 1st cycle – compute branch target, displacement taken from instruction bits [23:0], (b) 2nd cycle – save return address]

ARM Implementation
Ø Datapath
Ø Control unit (FSM)
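
The branch target calculation AR = PC + Disp, lsl #2 can be sketched as follows. The sketch assumes the standard ARM encoding of a 24-bit signed word offset and the 3-stage behaviour r15 = instruction address + 8; the function names are hypothetical.

```python
def branch_target(instr_addr, offset24):
    """Target of a branch located at instr_addr with a 24-bit offset field."""
    # Sign-extend the 24-bit field taken from instruction bits [23:0].
    disp = offset24 - (1 << 24) if offset24 & 0x800000 else offset24
    pc = instr_addr + 8  # value of r15 seen by the executing instruction
    return (pc + (disp << 2)) & 0xFFFFFFFF

def return_address(instr_addr):
    """Link register value after the correction: the instruction after the branch."""
    return instr_addr + 4

print(hex(branch_target(0x8000, 0x000000)))  # falls two instructions ahead
print(hex(branch_target(0x8000, 0xFFFFFE)))  # branch to itself
```

The second call uses the offset of a branch-to-self: the displacement of -2 words cancels the +8 pipeline offset.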

2-phase non-overlapping clock scheme
Ø Most ARMs do not operate on edge-sensitive registers
Ø Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Ø Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2
[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]

ARM datapath timing
Ø Register read
§ Register read buses – dynamic, precharged during phase 2
§ During phase 1 selected registers discharge the read buses which become valid early in phase 1
Ø Shift operation
§ second operand passes through barrel shifter
Ø ALU operation
§ ALU has input latches which are open in phase 1, allowing the operands to begin combining in ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
§ ALU processes the operands during phase 2, producing the valid output towards the end of the phase
§ the result is latched in the destination register at the end of phase 2

ARM datapath timing (cont’d)
[Figure: datapath timing across phase 1 and phase 2: register read time (read bus valid), shift time (shift out valid), ALU operands latched, ALU time (ALU out), register write time; precharge invalidates buses]
Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder
Ø Carry logic: use CMOS AOI (And-Or-Invert) gate
Ø Even bits use the circuit shown below
Ø Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
Ø Worst case path: 32 gates long
[Figure: one-bit carry circuit with inputs A, B, Cin and outputs sum, Cout]
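
A bit-level sketch makes the ripple-carry worst case concrete: each bit position needs the carry from the position below it, so a carry generated at bit 0 may have to pass through all 32 stages. This is a behavioural model only (it does not model the AOI gates or the even/odd inversion), and the function name is hypothetical.

```python
def ripple_add(a, b, cin=0, width=32):
    """Add two width-bit values one bit at a time; returns (sum, carry_out)."""
    total = 0
    carry = cin
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        # Majority function: carry out of this bit feeds the next bit.
        carry = (x & y) | (carry & (x | y))
    return total, carry

print(ripple_add(5, 7))
print(ripple_add(0xFFFFFFFF, 1))  # carry ripples through every bit
```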

ARM2 4-bit carry look-ahead scheme
Ø Carry Generate (G), Carry Propagate (P)
Ø Cout[3] = Cin[0].P + G
Ø Use AOI and alternate AND/OR gates
Ø Worst case: 8 gates long
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], G, P, Cout[3]]

The ARM2 ALU logic for one result bit
Ø ALU functions
§ data operations (add, sub, ...)
§ address computations for memory accesses
§ branch target computations
§ bit-wise logical operations
§ ...
[Figure: ALU logic for one result bit: function select inputs fs 5 to fs 0, NA bus, NB bus, carry logic, G, P, ALU bus]
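
The relation Cout[3] = Cin[0].P + G can be checked with a small sketch. It assumes the usual definitions of per-bit generate (a AND b) and propagate (a OR b), combined into group signals for a 4-bit block; the function names are hypothetical.

```python
def group_pg(a, b):
    """Group generate G and propagate P for two 4-bit operands."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # bit generates a carry
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]  # bit passes a carry on
    P = p[0] & p[1] & p[2] & p[3]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return G, P

def carry_out(a, b, cin):
    """Cout[3] = Cin[0].P + G, without rippling through the four bits."""
    G, P = group_pg(a, b)
    return G | (P & cin)

print(carry_out(0b1111, 0b0000, 1))  # propagated carry
print(carry_out(0b1000, 0b1000, 0))  # generated carry
```

With eight such 4-bit blocks, the carry crosses each block in one step instead of four, which is where the shorter worst-case path comes from.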

ARM2 ALU function codes

fs 5  fs 4  fs 3  fs 2  fs 1  fs 0   ALU output
0     0     0     1     0     0      A and B
0     0     1     0     0     0      A and not B
0     0     1     0     0     1      A xor B
0     1     1     0     0     1      A plus not B plus carry
0     1     0     1     1     0      A plus B plus carry
1     1     0     1     1     0      not A plus B plus carry
0     0     0     0     0     0      A
0     0     0     0     0     1      A or B
0     0     0     1     0     1      B
0     0     1     0     1     0      not B
0     0     1     1     0     0      zero

The ARM6 carry-select adder scheme
Ø Compute sums of various fields of the word for carry-in of zero and carry-in of one
Ø Final result is selected by using the correct carry-in value to control a multiplexor
Worst case: O(log2[word width]) gates long
Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.
[Figure: carry-select adder: fields a,b[3:0] through a,b[31:28], adders producing s and s+1 (+, +1), muxes controlled by carry c, outputs sum[3:0], sum[7:4], sum[15:8], sum[31:16]]
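
The carry-select idea can be sketched behaviourally: each field computes its sum for a carry-in of zero and for a carry-in of one, and the real carry then only has to pick one of the two. The field boundaries below follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] outputs in the figure; the function name and the serial selection loop are simplifications made for this example.

```python
# (least significant bit, width) of each field
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]

def carry_select_add(a, b, cin=0):
    """32-bit add where each field precomputes both candidate sums."""
    result, carry = 0, cin
    for lsb, width in FIELDS:
        mask = (1 << width) - 1
        x, y = (a >> lsb) & mask, (b >> lsb) & mask
        s0, s1 = x + y, x + y + 1       # candidates for carry-in 0 and 1
        chosen = s1 if carry else s0    # the multiplexor
        result |= (chosen & mask) << lsb
        carry = chosen >> width         # carry into the next field
    return result, carry

print(carry_select_add(0x0000FFFF, 1))
```

In hardware all candidate sums are produced in parallel, so only the chain of multiplexor selections lies on the critical path.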

The ARM6 ALU organization
Ø Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and multiplexor selects the output
[Figure: ARM6 ALU: A operand latch and B operand latch, invert A and invert B controls on XOR gates, C in, function, logic functions and adder in parallel, logic/arithmetic result mux, zero detect, flags C, V, N, Z, result]

ARM9 carry arbitration encoding
Ø Carry arbitration adder

A   B   C         u   v
0   0   0         0   0
0   1   unknown   1   0
1   0   unknown   1   0
1   1   1         1   1
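
The (u, v) encoding in the table can be exercised with a short sketch: (0, 0) means the carry is known to be 0, (1, 1) means it is known to be 1, and (1, 0) means it is unknown and depends on the less significant bits. The sketch resolves the carry serially from bit 0 upward for clarity; this is an assumption of the example, and a real carry arbitration adder would combine the pairs in a tree. Function names are hypothetical.

```python
UNKNOWN = (1, 0)

def encode(a_bit, b_bit):
    """(u, v) for one bit position, matching the table: u = A or B, v = A and B."""
    return (a_bit | b_bit, a_bit & b_bit)

def arbitrate(high, low):
    """An unknown carry is decided by the less significant side."""
    return low if high == UNKNOWN else high

def carry_out(a, b, cin, width=32):
    state = (cin, cin)  # the carry-in is always a known value
    for i in range(width):
        state = arbitrate(encode((a >> i) & 1, (b >> i) & 1), state)
    return state[1]

print(carry_out(0xFFFFFFFF, 1, 0))
```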

The cross-bar switch barrel shifter
Ø Shifter delay is critical since it contributes directly to the datapath cycle time
Ø Cross-bar switch matrix (32 x 32)
Ø Principle for 4x4 matrix
[Figure: 4x4 cross-bar switch matrix with inputs in[0] to in[3], outputs out[0] to out[3], and diagonals labelled right 3, right 2, right 1, no shift, left 1, left 2, left 3]

The cross-bar switch barrel shifter (cont’d)
Ø Precharged logic is used => each switch is a single NMOS transistor
Ø Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
Ø For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
Ø Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
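
The 4x4 cross-bar principle can be modelled with lists of bits. In this sketch `inp[0]` is the least significant bit, every output starts at 0 (the precharged state described above), and enabling a diagonal connects each output to the input that lies the given distance away. The function names are hypothetical.

```python
def crossbar(inp, right=None, left=None):
    """Connect outputs to inputs along the enabled diagonals."""
    n = len(inp)
    out = [0] * n  # precharge: unconnected outputs stay at 0
    for j in range(n):        # output index
        for i in range(n):    # input index
            on_right = right is not None and i - j == right
            on_left = left is not None and j - i == left
            if on_right or on_left:
                out[j] |= inp[i]
    return out

def lsr(inp, k):
    return crossbar(inp, right=k)

def lsl(inp, k):
    return crossbar(inp, left=k)

def ror(inp, k):
    n = len(inp)
    k %= n
    # Rotate right = right-shift diagonal plus the complementary left diagonal.
    return crossbar(inp, right=k) if k == 0 else crossbar(inp, right=k, left=n - k)

bits = [1, 0, 1, 1]  # value 0b1101, least significant bit first
print(lsr(bits, 1), lsl(bits, 1), ror(bits, 1))
```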

The 2-bit multiplication algorithm, Nth cycle

Carry-in   Multiplier   Shift            ALU     Carry-out
0          x0           LSL #2N          A + 0   0
0          x1           LSL #2N          A + B   0
0          x2           LSL #(2N + 1)    A – B   1
0          x3           LSL #2N          A – B   1
1          x0           LSL #2N          A + B   0
1          x1           LSL #(2N + 1)    A + B   0
1          x2           LSL #2N          A – B   1
1          x3           LSL #2N          A + 0   1

Carry-propagate (a) and carry-save (b) adder structures
[Figure: (a) four adder cells with inputs A, B, Cin and outputs Cout, S, with each Cout feeding the next cell; (b) four adder cells with inputs A, B, Cin and outputs Cout, S, with carries saved rather than propagated]
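
The table for the 2-bit multiplication algorithm can be run directly as a lookup. In this sketch A is the accumulator, B is the multiplicand shifted by the amount in the Shift column, and two multiplier bits are consumed per cycle. It assumes a 32-bit result (low word only, so a carry left over after the last cycle is dropped); the names `TABLE` and `mul32` are hypothetical.

```python
MASK = 0xFFFFFFFF

# (carry_in, multiplier bits) -> (extra shift beyond 2N, ALU operation, carry_out)
TABLE = {
    (0, 0): (0, "nop", 0), (0, 1): (0, "add", 0),
    (0, 2): (1, "sub", 1), (0, 3): (0, "sub", 1),
    (1, 0): (0, "add", 0), (1, 1): (1, "add", 0),
    (1, 2): (0, "sub", 1), (1, 3): (0, "nop", 1),
}

def mul32(b, multiplier, acc=0):
    """Low 32 bits of acc + b * multiplier, two multiplier bits per cycle."""
    carry = 0
    for n in range(16):
        digit = (multiplier >> (2 * n)) & 3
        extra, op, carry_out = TABLE[(carry, digit)]
        operand = (b << (2 * n + extra)) & MASK
        if op == "add":
            acc = (acc + operand) & MASK
        elif op == "sub":
            acc = (acc - operand) & MASK
        carry = carry_out
    return acc

print(mul32(7, 9))
```

The subtract-and-carry rows work because multiplying by 2 or 3 in one digit position equals subtracting 2 or 1 times B there and adding 1 to the next digit, which is worth 4 times as much.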

ARM high-speed multiplier organization
[Figure: registers supplying Rs (shifted right 8 bits/cycle) and Rm to carry-save adders; partial sum and partial carry registers with initialization for MLA; rotate sum and carry 8 bits/cycle; ALU (add partials)]

ARM2 register cell circuit
[Figure: register cell with write, read A and read B control lines, connected to the ALU bus, A bus and B bus]

ARM register bank floorplan
[Figure: register bank floorplan: A bus read decoders, B bus read decoders, write decoders, Vdd and Vss rails, register cells, PC register with PC bus and INC bus, ALU bus, A bus, B bus]

ARM core datapath buses
[Figure: core datapath buses: address register and incrementer (Ad, PC, inc), register bank, multiplier, shifter (shift out), ALU, A bus, B bus, ALU bus (W), instruction pipe and instruction bus, data in and data out (Din)]

ARM control logic structure
[Figure: control logic structure: instruction input to decode PLA; cycle count; multiply control; load/store multiple; coprocessor; outputs are address control, register control, ALU control and shifter control]
