CPE 626 ARM Organization


ARM Organization and Implementation

Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka

Outline

Ø ARM Architecture
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Architectural Support for High-level Languages
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM organization

Ø Register file
§ 2 read ports, 1 write port +
1 read, 1 write port reserved for r15 (pc)
Ø Barrel shifter – shift or rotate one operand for any number of bits
Ø ALU – performs the arithmetic and logic functions required
Ø Memory address register + incrementer
Ø Memory data registers
Ø Instruction decoder and associated control logic

[Figure: ARM datapath. Address register and incrementer drive A[31:0]; register bank (with r15/pc), multiply register, barrel shifter, and ALU sit on the A, B, and ALU buses; data out and data in registers connect to D[31:0]; instruction decode & control.]

Three-stage pipeline

Ø Fetch
§ the instruction is fetched from memory and placed in the instruction pipeline
Ø Decode
§ the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Ø Execute
§ the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
ARM single-cycle instruction pipeline

[Figure: three single-cycle instructions in the 3-stage pipeline, each one cycle behind the previous; axes: instruction (vertical), time (horizontal).]

1  add r0,r1,#5   fetch | decode | execute
2  sub r2,r3,r6   fetch | decode | execute
3  cmp r2,#3      fetch | decode | execute

ARM multi-cycle instruction pipeline

Decode logic is always generating the control signals for the datapath to use in the next cycle.

[Figure: five instructions in the 3-stage pipeline; axes: instruction (vertical), time (horizontal).]

1  fetch ADD | decode | execute
2  fetch STR | decode | calc. addr. | data xfer
3  fetch ADD | decode | execute
4  fetch ADD | decode | execute
5  fetch ADD | decode | execute

ARM multi-cycle LDMIA (load multiple) instruction

ldmia r0,{r2,r3}   fetch | decode | ex ld r2 | ex ld r3
sub r2,r3,r6       fetch | decode | ex sub
cmp r2,#3          fetch | decode | ex cmp

Ø Decode stage occupied since ldmia must continue to remember decoded instruction
Ø Instruction delayed: sub fetched at normal time but not decoded until LDMIA is finishing
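The occupancy rules on these slides can be sketched as a small scheduler. This is a minimal model, assuming each instruction is described only by how many datapath cycles it needs; the function name `schedule` and the dictionary keys are illustrative, not taken from the slides.

```python
def schedule(instrs):
    """Return per-instruction cycle numbers for a 3-stage pipeline.

    instrs is a list of (name, datapath_cycles) pairs; cycles are 1-based.
    Assumption: an instruction differs from others only in datapath cycles.
    """
    rows = []
    datapath_free = 3  # fetch in cycle 1, decode in cycle 2, execute from 3
    for i, (name, ex_cycles) in enumerate(instrs):
        first_ex = datapath_free
        last_ex = first_ex + ex_cycles - 1
        # The decode logic is used in the cycle just before the datapath.
        decode = first_ex - 1
        # Each instruction issues the fetch for the next instruction but one
        # in its first datapath cycle; the first two fetches start the pipe.
        fetch = i + 1 if i < 2 else rows[i - 2]["first_ex"]
        rows.append({"name": name, "fetch": fetch, "decode": decode,
                     "first_ex": first_ex, "last_ex": last_ex})
        datapath_free = last_ex + 1
    return rows


if __name__ == "__main__":
    program = [("ldmia r0,{r2,r3}", 2), ("sub r2,r3,r6", 1), ("cmp r2,#3", 1)]
    for row in schedule(program):
        print(row)
```

With the LDMIA example, `sub` is fetched in cycle 2 but its decode slips to cycle 4, the last datapath cycle of the `ldmia`, which matches the "instruction delayed" note.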
Control stalls: due to branches

Ø Branches often introduce stalls (branch penalty)
§ Stall time may depend on whether branch is taken
Ø May have to squash instructions that already started executing
Ø Don’t know what to fetch until condition is evaluated

ARM pipelined branch

Decision not made until the third clock cycle.

bne foo            fetch | decode | ex bne | ex bne | ex bne
sub r2,r3,r6       fetch | decode
foo add r0,r1,r2   fetch | decode | ex add

Two cycles of work thrown away if the bne is taken.

[Figure axis: time (horizontal).]
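The cost of the branch penalty can be estimated with a short calculation. This sketch assumes every non-branch instruction takes one datapath cycle and that each taken branch throws away exactly two cycles, as in the `bne` example; `total_cycles` is an illustrative name.

```python
def total_cycles(executed_instructions, taken_branches, penalty=2):
    """Cycles to finish a run on a 3-stage pipeline.

    Assumptions: single-cycle execute for every instruction, a fixed
    penalty per taken branch, and two cycles to fill the pipeline
    (fetch and decode of the first instruction).
    """
    fill = 2
    return fill + executed_instructions + penalty * taken_branches


if __name__ == "__main__":
    # bne foo (taken) followed by the add at foo: add executes in cycle 6.
    print(total_cycles(executed_instructions=2, taken_branches=1))
```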

Pipeline: how it works

Ø All instructions occupy the datapath for one or more adjacent cycles
Ø For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
Ø During the first datapath cycle each instruction issues a fetch for the next instruction but one
Ø Branch instructions flush and refill the instruction pipeline

ARM9TDMI 5-stage pipeline

Ø Fetch
Ø Decode
§ instruction is decoded
§ register operands read (3 read ports)
Ø Execute
§ an operand is shifted and the ALU result generated, or
§ address is computed
Ø Buffer/data
§ data memory is accessed (load, store)
Ø Write-back
§ write to register file

[Figure: ARM9TDMI pipeline organization. Fetch: I-cache, next pc, +4. Decode: instruction decode, register read, immediate fields, r15 = pc + 8. Execute: mul, shift, ALU, forwarding paths, LDM/STM post-index and pre-index, B/BL, MOV pc, SUBS pc. Buffer/data: load/store address, D-cache, byte repl., rot/sgn ex, LDR pc. Write-back: register write.]
ARM9TDMI Data Forwarding

ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3

ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r8, r9, r10           r8 := r9 + r10
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3

Stall?
LDR r3, [r2]              r3 := mem[r2]
ADD r1, r2, r3            r1 := r2 + r3

ARM9TDMI PC generation

Ø 3-stage pipeline
§ PC behavior: operands are read in execution stage, r15 = PC + 8
Ø 5-stage pipeline
§ operands are read in decode stage and r15 = PC + 4?
§ incompatibilities between 3-stage and 5-stage implementations => unacceptable
§ to avoid this 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

[Figure: ARM9TDMI pipeline organization, as on the 5-stage pipeline slide.]
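The dependences in the data forwarding examples can be classified with a few lines of code. This is a sketch under stated assumptions: an instruction is reduced to an opcode, a destination, and its source registers, and a value loaded from memory is assumed to be unavailable to the very next instruction (hence the one-cycle stall), while an ALU result can be forwarded. The names `Instr` and `classify` are hypothetical.

```python
from collections import namedtuple

# Illustrative instruction record: opcode, destination, source registers.
Instr = namedtuple("Instr", "op dest srcs")


def classify(prev, cur):
    """Classify the dependence of cur on the immediately preceding prev.

    Assumption: ALU results are forwarded from the execute stage, while
    loaded data arrives one stage later, so a load-use pair must stall.
    """
    if prev.dest not in cur.srcs:
        return "none"
    return "stall" if prev.op == "LDR" else "forward"


if __name__ == "__main__":
    add1 = Instr("ADD", "r3", ("r2", "r1"))
    add2 = Instr("ADD", "r5", ("r5", "r3", "r2"))
    load = Instr("LDR", "r3", ("r2",))
    use = Instr("ADD", "r1", ("r2", "r3"))
    print(classify(add1, add2))  # ALU result needed by the next instruction
    print(classify(load, use))   # loaded value needed by the next instruction
```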

Data processing instruction datapath activity

Ø Reg-Reg
§ Rd = Rn op Rm
§ r15 = AR + 4
AR = AR + 4
Ø Reg-Imm
§ Rd = Rn op Imm
§ r15 = AR + 4
AR = AR + 4

[Figure: datapath activity for (a) register – register operations and (b) register – immediate operations (immediate taken from instruction bits [7:0]).]

STR (store register) datapath activity

Ø Compute address
§ AR = Rn op Disp
§ r15 = AR + 4
Ø Store data
§ AR = PC
§ mem[AR] = Rd<x:y>
§ If autoindexing => Rn = Rn +/- 4

[Figure: datapath activity for (a) 1st cycle – compute address (displacement taken from instruction bits [11:0]) and (b) 2nd cycle – store data & auto-index.]
The first two (of three) cycles of a branch instruction

Ø Compute target address
§ AR = PC + Disp, lsl #2
Ø Save return address (if required)
§ r14 = PC
§ AR = AR + 4
Ø Third cycle: do a small correction to the value stored in the link register in order that it points directly at the instruction which follows the branch

[Figure: datapath activity for (a) 1st cycle – compute branch target (displacement taken from instruction bits [23:0], shifted lsl #2) and (b) 2nd cycle – save return address.]

ARM Implementation

Ø Datapath
Ø Control unit (FSM)
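The branch target computation on the branch slide (AR = PC + Disp, lsl #2) can be worked through in code. This is a sketch assuming the classic 32-bit ARM branch encoding, where the displacement is a signed 24-bit field and the PC read by the instruction is its own address plus 8; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def branch_target(instr_addr, imm24):
    """Target of a branch at instr_addr with a 24-bit displacement field."""
    offset = imm24 & 0xFFFFFF
    if offset & 0x800000:          # sign-extend the 24-bit field
        offset -= 1 << 24
    pc = instr_addr + 8            # r15 as seen by the executing instruction
    return (pc + (offset << 2)) & MASK32


def link_value(instr_addr):
    """Return address after the third-cycle correction of r14."""
    pc = instr_addr + 8            # value first copied into r14
    return (pc - 4) & MASK32       # corrected to the instruction after the branch


if __name__ == "__main__":
    print(hex(branch_target(0x8000, 0xFFFFFE)))  # a branch to itself
    print(hex(link_value(0x8000)))
```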

2-phase non-overlapping clock scheme

Ø Most ARMs do not operate on edge-sensitive registers
Ø Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Ø Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle.]

ARM datapath timing

Ø Register read
§ Register read buses – dynamic, precharged during phase 2
§ During phase 1 selected registers discharge the read buses which become valid early in phase 1
Ø Shift operation
§ second operand passes through barrel shifter
Ø ALU operation
§ ALU has input latches which are open in phase 1, allowing the operands to begin combining in ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
§ ALU processes the operands during phase 2, producing the valid output towards the end of the phase
§ the result is latched in the destination register at the end of phase 2
ARM datapath timing (cont’d)

[Figure: timing diagram over phase 1 and phase 2 showing register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the precharge that invalidates the buses.]

Minimum Datapath Delay =
Register read time + Shifter Delay + ALU Delay +
Register write set-up time + Phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder

Ø Carry logic: use CMOS AOI (And-Or-Invert) gate
Ø Even bits use circuit shown below
Ø Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
Ø Worst case path: 32 gates long

[Figure: ripple-carry adder bit cell with inputs A, B, Cin and outputs sum, Cout.]

ARM2 4-bit carry look-ahead scheme

Ø Carry Generate (G), Carry Propagate (P)
Ø Cout[3] = Cin[0].P + G
Ø Use AOI and alternate AND/OR gates
Ø Worst case: 8 gates long

[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], G, P, Cout[3].]

The ARM2 ALU logic for one result bit

Ø ALU functions
§ data operations (add, sub, ...)
§ address computations for memory accesses
§ branch target computations
§ bit-wise logical operations
§ ...

[Figure: one-bit ALU slice with function select lines fs0 to fs5, NA and NB operand buses, carry logic, G and P signals, and ALU bus output.]
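The relation Cout[3] = Cin[0].P + G from the carry look-ahead slide can be checked with a short model. This is a sketch, assuming the usual definitions of per-bit generate (a AND b) and propagate (a OR b); the sum bits are computed with a plain ripple for brevity, and the name `cla4` is illustrative.

```python
def cla4(a, b, cin):
    """4-bit add returning (sum, cout, G, P) with look-ahead carry-out."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(abits, bbits)]   # bit generates a carry
    p = [x | y for x, y in zip(abits, bbits)]   # bit propagates a carry

    # Group generate and propagate for the whole 4-bit block.
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    cout = G | (P & cin)                         # Cout[3] = Cin[0].P + G

    # Sum bits (rippled here only to keep the sketch short).
    total, carry = 0, cin
    for i in range(4):
        total |= (abits[i] ^ bbits[i] ^ carry) << i
        carry = g[i] | (p[i] & carry)
    return total, cout, G, P


if __name__ == "__main__":
    print(cla4(0b1111, 0b0001, 0))
```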
ARM2 ALU function codes

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
0    0    0    1    0    0     A and B
0    0    1    0    0    0     A and not B
0    0    1    0    0    1     A xor B
0    1    1    0    0    1     A plus not B plus carry
0    1    0    1    1    0     A plus B plus carry
1    1    0    1    1    0     not A plus B plus carry
0    0    0    0    0    0     A
0    0    0    0    0    1     A or B
0    0    0    1    0    1     B
0    0    1    0    1    0     not B
0    0    1    1    0    0     zero

The ARM6 carry-select adder scheme

Ø Compute sums of various fields of the word for carry-in of zero and carry-in of one
Ø Final result is selected by using the correct carry-in value to control a multiplexor

[Figure: carry-select adder. Fields a,b[3:0] through a,b[31:28] each produce s and s+1 (adders marked + and +, +1); carry-controlled muxes select sum[3:0], sum[7:4], sum[15:8], sum[31:16].]

Worst case: O(log2[word width]) gates long

Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.
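The carry-select idea can be sketched in a few lines: each field is added twice, once for each possible carry-in, and the real carry picks the result. This is a simplified model, assuming the field boundaries shown on the sum outputs of the figure ([3:0], [7:4], [15:8], [31:16]); it selects the fields one after another rather than through the logarithmic mux tree of the real adder, and `carry_select_add` is an illustrative name.

```python
# (low bit, width) of each field; boundaries assumed from the figure labels.
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]


def carry_select_add(a, b, cin=0):
    """32-bit add returning (sum, carry_out) using per-field selection."""
    result, carry = 0, cin
    for low, width in FIELDS:
        mask = (1 << width) - 1
        fa, fb = (a >> low) & mask, (b >> low) & mask
        s0 = fa + fb          # field sum assuming carry-in of zero
        s1 = fa + fb + 1      # field sum assuming carry-in of one
        s = s1 if carry else s0   # the mux, controlled by the real carry-in
        result |= (s & mask) << low
        carry = s >> width
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0xFFFFFFFF, 1))
```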

The ARM6 ALU organization

Ø Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and multiplexor selects the output

[Figure: A and B operand latches feed XOR gates (invert A, invert B); the logic functions unit and the adder (with C in) run in parallel, controlled by function; the logic/arithmetic result mux selects the result; flags C, V, N, and Z (zero detect).]

ARM9 carry arbitration encoding

Ø Carry arbitration adder

A   B   C         u   v
0   0   0         0   0
0   1   unknown   1   0
1   0   unknown   1   0
1   1   1         1   1
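The carry arbitration table can be reproduced with two gates per bit. This sketch reads the table as u = A OR B and v = A AND B, and interprets u as the carry-out when the carry-in is one and v as the carry-out when the carry-in is zero; that interpretation is an assumption drawn from the table, and the function names are illustrative.

```python
def encode(a, b):
    """Return the (u, v) pair for one bit position."""
    return a | b, a & b


def carry_out(a, b, cin):
    """Resolve the carry once the carry-in is known (assumed reading of u, v)."""
    u, v = encode(a, b)
    return u if cin else v


def describe(a, b):
    """Carry value before the carry-in is known: 0, 1, or 'unknown'."""
    u, v = encode(a, b)
    return v if u == v else "unknown"


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, describe(a, b), encode(a, b))
```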
The cross-bar switch barrel shifter

Ø Shifter delay is critical since it contributes directly to the datapath cycle time
Ø Cross-bar switch matrix (32 x 32)
Ø Principle for 4x4 matrix

[Figure: 4x4 cross-bar switch matrix with inputs in[3] to in[0], outputs out[0] to out[3], and switch diagonals labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3.]

The cross-bar switch barrel shifter (cont’d)

Ø Precharged logic is used => each switch is a single NMOS transistor
Ø Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0 giving the zero filling required by the shift semantics
Ø For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
Ø Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
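The 4x4 cross-bar principle can be modelled directly: enabling a diagonal connects each output to the input a fixed distance away, and unconnected outputs keep their precharged 0. This is a behavioural sketch only (no transistor-level detail); diagonals are encoded as signed integers, positive for "right k" and negative for "left k", which is a convention chosen here for illustration.

```python
def crossbar(inputs, diagonals):
    """Drive outputs from inputs through the enabled switch diagonals."""
    n = len(inputs)
    outputs = [0] * n                 # precharged: unconnected outputs stay 0
    for d in diagonals:
        for j in range(n):
            i = j + d                 # 'right d' connects in[j + d] to out[j]
            if 0 <= i < n:
                outputs[j] |= inputs[i]
    return outputs


def to_bits(value, n=4):
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def lsr(value, k, n=4):
    return from_bits(crossbar(to_bits(value, n), {k}))


def lsl(value, k, n=4):
    return from_bits(crossbar(to_bits(value, n), {-k}))


def ror(value, k, n=4):
    # Rotate right k = 'right k' plus the complementary 'left n - k' diagonal.
    return from_bits(crossbar(to_bits(value, n), {k, k - n}))


if __name__ == "__main__":
    print(bin(ror(0b1101, 1)))  # 'right 1' + 'left 3'
```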

The 2-bit multiplication algorithm, Nth cycle

Carry-in   Multiplier   Shift           ALU     Carry-out
0          x0           LSL #2N         A + 0   0
           x1           LSL #2N         A + B   0
           x2           LSL #(2N + 1)   A – B   1
           x3           LSL #2N         A – B   1
1          x0           LSL #2N         A + B   0
           x1           LSL #(2N + 1)   A + B   0
           x2           LSL #2N         A – B   1
           x3           LSL #2N         A + 0   1

Carry-propagate (a) and carry-save (b) adder structures

[Figure: (a) four full adders (inputs A, B, Cin; outputs Cout, S) with each Cout chained to the next Cin; (b) four full adders with independent Cin inputs and separate Cout and S outputs.]
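The 2-bit multiplication table can be exercised with a small model. This is a sketch, assuming unsigned operands and Python integers for the accumulator A; the final step that adds B shifted by the word width when a carry is left over is an assumption made here so that the unsigned product comes out exactly, and the names `TABLE` and `multiply` are illustrative.

```python
# (carry_in, multiplier digit) -> (extra shift, ALU op, carry_out)
# ALU op: +1 means A + B, -1 means A - B, 0 means A + 0.
TABLE = {
    (0, 0): (0, 0, 0), (0, 1): (0, +1, 0), (0, 2): (1, -1, 1), (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0), (1, 1): (1, +1, 0), (1, 2): (0, -1, 1), (1, 3): (0, 0, 1),
}


def multiply(b, m, bits=32):
    """Multiply b by the unsigned multiplier m, two multiplier bits per cycle."""
    a, carry = 0, 0
    for n in range(bits // 2):
        digit = (m >> (2 * n)) & 3
        extra, op, carry = TABLE[(carry, digit)]
        a += op * (b << (2 * n + extra))   # B shifted by LSL #2N or #(2N + 1)
    if carry:
        a += b << bits                     # assumed clean-up of a leftover carry
    return a


if __name__ == "__main__":
    print(multiply(7, 13))
```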
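ARM high-speed multiplier organization

[Figure: Rs is shifted right 8 bits/cycle and Rm feeds the carry-save adders; the partial sum and partial carry registers (with initialization for MLA) are rotated 8 bits/cycle; the ALU adds the partials; results go to the registers.]

ARM2 register cell circuit

[Figure: register cell with a write line from the ALU bus and read A / read B lines onto the A bus and B bus.]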
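The role of the carry-save adders in the high-speed multiplier can be illustrated with a bit-parallel model: partial products are accumulated as a separate partial sum and partial carry, and a single carry-propagate addition at the end combines them. This sketch processes one multiplier bit per step rather than 8 bits per cycle as in the figure, and the function names are illustrative.

```python
def carry_save(a, b, c):
    """Add three numbers without propagating carries: returns (sum, carry)."""
    s = a ^ b ^ c                                 # per-bit sum
    cy = ((a & b) | (a & c) | (b & c)) << 1       # per-bit carry, moved up one place
    return s, cy


def multiply_csa(rm, rs, bits=32):
    """Unsigned multiply accumulating partial products in carry-save form."""
    psum, pcarry = 0, 0
    for i in range(bits):
        if (rs >> i) & 1:
            psum, pcarry = carry_save(psum, pcarry, rm << i)
    return psum + pcarry                          # final add of the partials


if __name__ == "__main__":
    print(multiply_csa(1234, 5678))
```

The only carry-propagate addition is the last line, which corresponds to the "ALU (add partials)" step in the figure.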

ARM register bank floorplan

[Figure: A bus read decoders, B bus read decoders, and write decoders above the register cells and PC; Vdd and Vss rails; ALU bus, PC bus, INC bus, A bus, and B bus run through the cells.]

ARM core datapath buses

[Figure: address register and incrementer (Ad, PC, inc), register bank, multiplier, shifter, ALU, data in, instruction pipe, and data out; buses A, B, W (ALU), shift out, instruction, and Din.]
ARM control logic structure

[Figure: the instruction feeds the decode PLA, together with the cycle count, multiply control, load/store multiple, and coprocessor blocks; the outputs are address control, register control, ALU control, and shifter control.]
