
ARM PROCESSOR FUNDAMENTALS

MODULE 3
STEVE FURBER, ARM SYSTEM-ON-CHIP ARCHITECTURE (2ND EDITION)
Contents

◻ ARM Processor Architecture (Chapter 2)
🞑 Acorn RISC Machine
🞑 Architectural Inheritance
🞑 ARM Programmer’s Model
🞑 ARM Development Tools
◻ ARM Assembly Language Programming (Chapter 3)
🞑 Data Processing instructions
🞑 Data transfer instructions
🞑 Control flow instructions
🞑 Assembly Language Programs* (Module V)
◻ ARM Organization and Implementation (Chapter 4)
🞑 3-Stage Pipeline & 5-Stage ARM Organization
🞑 ARM Instruction Execution & Implementation
Why ARM?
◻ Over 90% of the embedded market is based on the ARM
architecture
◻ Industry’s leading provider of 16/32-bit embedded RISC
microprocessor solutions
◻ One of the most licensed and thus widespread processor cores
in the world
◻ Used especially in portable devices due to its low power
consumption and reasonable performance.
◻ www.arm.com
◻ The ARM core uses a RISC architecture.
◻ RISC is a design philosophy aimed at delivering simple but
powerful instructions that execute within a single cycle at a
high clock speed.
◻ Because it is a RISC machine, the design is simple, which means it
can be built with fewer transistors.
◻ It is easy to program, and its high code density and low power
consumption make it suitable for embedded applications.

◻ RISC vs CISC
◻ The RISC philosophy concentrates on reducing the complexity of
instructions performed by the hardware because it is easier to
provide greater flexibility and intelligence in software rather than
hardware. As a result, a RISC design places greater demands on
the compiler.
◻ The traditional complex instruction set computer (CISC) relies
more on the hardware for instruction functionality, and
consequently the CISC instructions are more complicated.
ARM – Advanced RISC Machines
◻ The first RISC microprocessor developed for commercial use (the Acorn RISC Machine)
◻ Developed by Acorn Computers Limited of Cambridge, England
between 1983 and 1985. In 1990, Advanced RISC Machines
Limited (ARM Ltd.) was formed
◻ Used in PDAs, cell phones, multimedia players, handheld game
consoles, digital TVs and cameras
◻ ARM7: GBA, iPod
◻ ARM9: NDS, PSP, Sony Ericsson, BenQ
◻ ARM11: Apple iPhone, Nokia N93, N800
◻ 75% of 32-bit embedded processors
Architectural inheritance
ARM: From Berkeley RISC

◻ Features used:
🞑 Load – store architecture : where instructions that process data operate only on
registers and are separate from instructions that access memory
🞑 Fixed-length 32-bit instructions
🞑 3-address instruction formats
◻ Features rejected:
🞑 Register windows - ARM has 16 visible registers, down from the 32 visible at a time in Berkeley RISC
🞑 Delayed Branches - Prediction of branches and delays
🞑 Single-cycle execution of all instructions
ARM Architecture
◻ Based on Berkeley RISC design.
◻ Large uniform register file.
◻ Load/store architecture.
◻ The only operations on memory are copying memory values into
registers (load) and copying register values into memory (store).
◻ ARM does not support memory-to-memory operations
◻ Simple addressing modes
◻ Uniform and fixed-length instruction fields (32 bits)
◻ 3-address instruction formats.

◻ Load-store architecture: Memory is accessed only through load and store
instructions. The load instruction copies data from memory to the register file,
whereas the store instruction writes data from the registers to memory. All
the arithmetic and logical instructions access only the register file, which keeps
operand access fast and simple.
◻ Data Size and Instruction Set: The data processing capability of an instruction
depends on the register width. The group of registers called the register
file typically holds signed or unsigned 32-bit or 64-bit data, depending on the ARM core
family. The ARM sign-extend hardware converts a byte (8 bits) or halfword (16
bits, or two bytes) into a word (32 bits, or four bytes) when storing it in the register file.
◻ ARM implements three categories of instruction sets:
🞑 32-bit ARM instruction set
🞑 16-bit Thumb instruction set
🞑 8-bit Jazelle instruction set – Jazelle cores can also execute Java
bytecode.
◻ Bits in the current program status register (CPSR) select which
instruction set is executed.
◻ The instruction set typically uses a three-address format with
one destination operand Rd and two source operands Rn and
Rm.
ARM Programmer’s Model

ARM: Visible Registers
User-level programs: 15 general-purpose 32-bit registers (r0 to r14)
🞑 Program counter, PC (r15)
🞑 Current Program Status Register (CPSR)
Remaining registers
🞑 System-level programming
🞑 Handling exceptions
ARM11 Registers
Current Program Status Register (CPSR)
◻ Condition code flags
🞑 N = Negative result from ALU
🞑 Z = Zero result from ALU
🞑 C = ALU operation Carried out
🞑 V = ALU operation oVerflowed
◻ Interrupt disable bits
🞑 I = 1: Disables the IRQ.
🞑 F = 1: Disables the FIQ.
◻ T bit
🞑 T = 0: Processor in ARM state
🞑 T = 1: Processor in Thumb state
◻ Mode bits
🞑 Specify the processor mode
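The CPSR fields listed on this slide can be decoded with a few shifts and masks. The sketch below is a minimal illustration in Python, not part of the original slides; it assumes the classic ARM layout (N, Z, C, V in bits 31 to 28, I, F, T in bits 7 to 5, mode in bits 4 to 0), and the function name `decode_cpsr` is hypothetical.

```python
# Assumed bit positions of the CPSR fields (classic ARM layout).
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "I": 7, "F": 6, "T": 5}

# Assumed encodings of the five mode bits.
MODES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "Supervisor",
    0b10111: "Abort", 0b11011: "Undefined", 0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into named flag bits, mode and state."""
    fields = {name: (cpsr >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["mode"] = MODES.get(cpsr & 0x1F, "reserved")
    fields["state"] = "Thumb" if fields["T"] else "ARM"
    return fields

# 0xD3: Supervisor mode with IRQ and FIQ disabled, ARM state.
print(decode_cpsr(0x000000D3))
```

The example value 0xD3 was chosen only because it exercises the I, F and mode fields at once.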
Processor Modes

◻ The ARM has seven basic operating modes: six privileged modes and one non-privileged mode (user). The
privileged modes are used to service interrupts and exceptions and to access protected resources.
◻ User mode: an unprivileged mode under which most tasks run. This mode is used for executing application
programs.
◻ System mode: a privileged mode for running user and system programs. Code running in this mode has all the access
permissions. This mode uses the same set of registers as the non-privileged user mode.
◻ Supervisor mode: the mode in which the OS kernel operates. This mode is entered on reset and when a
Software Interrupt instruction is executed.
◻ Two of the privileged modes are allocated for interrupt handling. In general, ARM has two levels of interrupts.
◻ Fast Interrupt Request (FIQ): FIQ mode is entered when a high-priority (fast) interrupt is raised. FIQ supports
channel communication for data transfer.
◻ Interrupt Request (IRQ): IRQ mode is entered when a low-priority (normal) interrupt is raised. This is a privileged
mode for general-purpose interrupt handling.
◻ Abort: used to handle memory access violations. Abort mode handles data aborts and prefetch aborts.
◻ Undefined: used to handle undefined instructions that are not supported by the implementation.
The Memory System - Byte Ordering

◻ Refers to the order in which the bytes of multi-byte values are stored by the
hardware
◻ Based on whether the least significant byte is stored at a higher
or lower address than the next most significant byte
◻ Two types - Big-endian & Little-endian

Represent the data F5AB13CD H in the two byte ordering
schemes with a starting address of 1004H.

Little vs Big Endian

• Little-endian 🡪 the least significant byte (the "little end") is stored first, at the lowest address
• Big-endian 🡪 the most significant byte (the "big end") is stored first, at the lowest address
• If data has N bytes, it is stored at addresses A to A+N-1 and we say that the data is
at address A
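The exercise above (F5AB13CD H stored from address 1004H) can be worked through mechanically. This is a small Python sketch added for illustration; the helper name `byte_layout` is hypothetical.

```python
def byte_layout(value, start, byteorder):
    """Return {address: byte} for a 32-bit value stored starting at `start`."""
    data = value.to_bytes(4, byteorder)  # byteorder is "little" or "big"
    return {start + i: data[i] for i in range(4)}

for order in ("little", "big"):
    layout = byte_layout(0xF5AB13CD, 0x1004, order)
    print(order, {hex(addr): f"{byte:02X}" for addr, byte in layout.items()})
```

In the little-endian layout the byte CD lands at 1004H and F5 at 1007H; in the big-endian layout the order is reversed.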
Word Alignment
◻ Memory is a linear array of bytes numbered from 0 to 2^32 - 1
◻ Data: byte (8-bit), half-word (16-bit), word (32-bit)
◻ To make the hardware simpler (and sometimes because the
ISA defines it that way), words are stored at word-aligned
addresses
🞑 A word is stored at an address divisible by 4
🞑 The last two bits of the address will be "00"

Quantity                 Address divisible by    Binary address ends in
Byte                     1                       Anything
Half-word (16 bits)      2                       0
Word (32 bits)           4                       00
Double-word (64 bits)    8                       000
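The alignment rule in the table amounts to checking that the low-order address bits are zero. A minimal Python sketch, with a hypothetical helper name `is_aligned`:

```python
def is_aligned(address, size_bytes):
    """True when `address` is a multiple of `size_bytes` (a power of two)."""
    return address & (size_bytes - 1) == 0

# 0x1004 ends in binary ...100: word-aligned but not double-word-aligned.
print(is_aligned(0x1004, 4), is_aligned(0x1004, 8))
```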


Load-Store Architecture
◻ Instruction set will process values only in registers and
will always place the results into a register.
◻ The only operations which apply to memory are copy
instructions.
◻ CISC allows memory-memory operations.
◻ Instructions fall into three categories
🞑 Data Processing Instructions
🞑 Data Transfer Instructions
🞑 Control Flow Instructions
Supervisor Mode

◻ User code cannot gain supervisor privileges (protection
mechanism).
◻ System-level functions (access to hardware peripheral
registers, character input and output) can only be accessed
through specified supervisor calls.
◻ User-level programmers operate on the data 'owned' by their
programs and rely on the OS to handle all transactions with the
world outside their programs.
The ARM Instruction Set

◻ All ARM instructions are 32 bits wide and are aligned on 4-byte boundaries in
memory.
◻ The load-store architecture.
◻ 3-address data processing instructions (that is, the two source operand registers and
the result register are all independently specified).
◻ Conditional execution of every instruction.
◻ Load and store multiple register instructions.
◻ The ability to perform a general shift operation and a general ALU operation in a
single instruction that executes in a single clock cycle.
◻ Open instruction set extension through the coprocessor instruction set, including
adding new registers and data types to the programmer's model.
◻ 16-bit compressed representation of the instruction set in the Thumb architecture.
The I/O System
◻ Memory-mapped devices with interrupt support (disk controllers, network
interfaces and so on).
◻ Internal registers of these devices appear as addressable locations within the ARM memory
map.
◻ Peripherals may attract the processor's attention by making an interrupt
request using either the normal interrupt (IRQ) or the fast interrupt (FIQ)
input.
🞑 Both interrupt inputs are level-sensitive and maskable.
🞑 Normally most interrupt sources share the IRQ input, with just one or two time-
critical sources connected to the higher-priority FIQ input.
◻ Systems may include DMA hardware external to processor
ARM Exceptions
◻ Interrupts are a form of exception.
◻ The ARM architecture supports a range of interrupts, traps and
supervisor calls, all grouped under the general heading of exceptions.
◻ Step 1: The current state is saved by copying the PC into r14_exc and
the CPSR into SPSR_exc (where exc stands for the exception type).
◻ Step 2: The processor operating mode is changed to the appropriate
exception mode.
◻ Step 3: The PC is forced to a value between 00H and 1CH, the
particular value depending on the type of exception.
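The three exception-entry steps can be sketched as a small state update. This Python model is an illustration only: the CPU is a plain dictionary, the vector table and mode suffixes are the standard ARM values as I understand them (not taken from the slides), and details such as return-address adjustment and interrupt masking on entry are left out.

```python
# Assumed vector addresses and the mode each exception enters.
VECTORS = {
    "reset": 0x00, "undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
    "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C,
}
EXC_MODE = {
    "reset": "svc", "undefined": "und", "swi": "svc", "prefetch_abort": "abt",
    "data_abort": "abt", "irq": "irq", "fiq": "fiq",
}

def enter_exception(cpu, exc):
    """Apply the three entry steps to a dict-based CPU model."""
    mode = EXC_MODE[exc]
    cpu["r14_" + mode] = cpu["pc"]     # step 1: save PC in r14_exc
    cpu["spsr_" + mode] = cpu["cpsr"]  # step 1: save CPSR in SPSR_exc
    cpu["mode"] = mode                 # step 2: switch to the exception mode
    cpu["pc"] = VECTORS[exc]           # step 3: force PC to the vector address
    return cpu

cpu = {"pc": 0x8000, "cpsr": 0x10, "mode": "usr"}
print(enter_exception(cpu, "irq"))
```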
ARM Development Tools – Self Study
ARM Assembly Language Programming

◻ Data Processing instructions
◻ Data transfer instructions
◻ Control flow instructions
◻ Assembly Language Programs
Data Processing Instructions
◻ ARM data processing instructions enable the programmer to
perform arithmetic and logical operations on data values in
registers.
◻ Data processing instructions are the only instructions which modify
data values.
🞑 Require two operands and produce a single result.
🞑 All operands are 32 bits wide and come from registers or are specified as
literals in the instruction itself.
🞑 The result, if there is one, is 32 bits wide and is placed in a register.
🞑 Each of the operand registers and the result register are independently
specified in the instruction.
Data Processing Instructions - Arithmetic
◻ These instructions perform binary arithmetic on two 32-bit operands.
◻ The operands may be unsigned or 2's-complement signed integers.
◻ 'ADD' is simple addition, 'ADC' is add with carry, 'SUB' is subtract, 'SBC' is
subtract with carry, 'RSB' is reverse subtraction and 'RSC' is reverse subtract with carry.
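The six arithmetic operations can be modelled on 32-bit values as follows. This is a Python sketch for illustration, assuming the usual ARM convention that the carry flag C acts as "not borrow" in subtraction (so SBC computes Rn - Op2 + C - 1); the function names simply mirror the mnemonics.

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits

def add(rn, op2):
    return (rn + op2) & MASK

def adc(rn, op2, c):
    return (rn + op2 + c) & MASK

def sub(rn, op2):
    return (rn - op2) & MASK

def sbc(rn, op2, c):
    return (rn - op2 + c - 1) & MASK  # C = 1 means "no borrow"

def rsb(rn, op2):
    return (op2 - rn) & MASK          # operands reversed

def rsc(rn, op2, c):
    return (op2 - rn + c - 1) & MASK

print(hex(sub(1, 2)), hex(rsb(1, 2)))
```

RSB is handy when the first operand is the one that needs to be a shifted register or an immediate, since only the second operand position supports those forms.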
Data Processing Instructions – Bit-wise Logical & Register Movement
Data Processing Instructions – Comparison

◻ These instructions do not produce a result (which is therefore omitted from the
assembly language format) but just set the condition code bits (N, Z, C and V) in the
CPSR according to the selected operation.

◻ CMP r1, r2 ; set cc on r1 - r2
◻ CMN r1, r2 ; set cc on r1 + r2
◻ TST r1, r2 ; set cc on r1 and r2
◻ TEQ r1, r2 ; set cc on r1 xor r2
◻ The mnemonics stand for 'compare' (CMP), 'compare negated' (CMN), '(bit) test'
(TST) and 'test equal' (TEQ).
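To make "set cc on r1 - r2" concrete, the sketch below computes the four flags that a CMP would produce. It is a Python illustration with a hypothetical function name, assuming both inputs are unsigned 32-bit values and that C means "no borrow" after a subtraction.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags produced by CMP a, b, i.e. by computing a - b on 32 bits."""
    result = (a - b) & MASK
    return {
        "N": result >> 31,                       # sign bit of the result
        "Z": int(result == 0),                   # result is zero
        "C": int(a >= b),                        # no borrow occurred
        "V": ((a ^ b) & (a ^ result)) >> 31,     # signed overflow
    }

print(cmp_flags(5, 5))
print(cmp_flags(3, 5))
```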
Data Processing Instructions – Immediate operands & Shifted register operands

◻ ADD r3, r3, #1 ; r3 := r3 + 1
◻ AND r8, r7, #&ff ; r8 := r7[7:0]
The immediate value may be specified in hexadecimal (base 16) notation by putting '&' after the '#'.
◻ ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 x r1
◻ MUL r4, r3, r2 ; r4 := (r3 x r2)[31:0]
Multiplying two 32-bit integers gives a 64-bit result, the least significant 32 bits of which are placed in the
result register and the rest are ignored.
◻ MLA r4, r3, r2, r1 ; r4 := (r3 x r2 + r1)[31:0]
Multiply-accumulate instruction: adds the product to a running total.
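The shifted-operand add and the truncating multiplies above can be checked numerically. A small Python sketch, with hypothetical helper names, assuming 32-bit unsigned register values:

```python
MASK = 0xFFFFFFFF

def add_lsl(rn, rm, shift):
    """ADD rd, rn, rm, LSL #shift: rn + (rm shifted left), kept to 32 bits."""
    return (rn + ((rm << shift) & MASK)) & MASK

def mul(rm, rs):
    """MUL: only the low 32 bits of the 64-bit product are kept."""
    return (rm * rs) & MASK

def mla(rm, rs, rn):
    """MLA: product plus accumulator, low 32 bits."""
    return (rm * rs + rn) & MASK

print(add_lsl(10, 3, 3))            # 10 + 8 * 3
print(hex(mul(0x10000, 0x10000)))   # upper half of the product is discarded
```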
◻ Note: The available shift operations are:
🞑 LSL: logical shift left by 0 to 31 places; fill the vacated bits at the least significant end of the word
with zeros.
🞑 LSR: logical shift right by 0 to 32 places; fill the vacated bits at the most significant end of the word
with zeros.
🞑 ASL: arithmetic shift left; this is a synonym for LSL.
🞑 ASR: arithmetic shift right by 0 to 32 places; fill the vacated bits at the most significant end of the
word with zeros if the source operand was positive, or with ones if the source operand was negative.
🞑 ROR: rotate right by 0 to 32 places; the bits which fall off the least significant end of the word are
used, in order, to fill the vacated bits at the most significant end of the word.
🞑 RRX: rotate right extended by 1 place; the vacated bit (bit 31) is filled with the old value of the C
flag and the operand is shifted one place to the right. With appropriate use of the condition codes (see
below) a 33-bit rotate of the operand and the C flag is performed.
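The shift operations listed above can be modelled on 32-bit values as below. This Python sketch is illustrative only; inputs are assumed to be unsigned values in the range 0 to 2^32 - 1, and `rrx` returns both the result and the bit shifted out (which would become the new C flag if the condition codes are set).

```python
MASK = 0xFFFFFFFF

def lsl(x, n):
    return (x << n) & MASK

def lsr(x, n):
    return (x & MASK) >> n

def asr(x, n):
    """Arithmetic shift right: vacated bits copy the sign bit."""
    negative = bool(x & 0x80000000)
    if n >= 32:
        return MASK if negative else 0
    result = x >> n
    if negative:
        result |= MASK & ~(MASK >> n)  # fill the top n bits with ones
    return result

def ror(x, n):
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, c):
    """Rotate right by one through the carry: returns (result, carry_out)."""
    return ((c << 31) | (x >> 1)) & MASK, x & 1

print(hex(asr(0x80000000, 4)), hex(ror(0x12345678, 8)))
```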

◻ Any data processing instruction can set the condition codes (N, Z, C and V) if the
programmer wishes it to.
◻ The comparison operations only set the condition codes, so there is no option with them,
but for all other data processing instructions a specific request must be made. At the
assembly language level this request is indicated by adding an 's' to the opcode, standing
for 'Set condition codes'.

◻ The code performs a 64-bit addition of two numbers held in r0-r1 and r2-r3, using the C
condition code flag to store the intermediate carry:
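The carry-propagating 64-bit addition described here can be modelled as an add that sets C (like ADDS) on the low words followed by an add-with-carry (like ADC) on the high words. The Python sketch below is an illustration under assumptions: r0 and r2 are taken to hold the low words, r1 and r3 the high words, and the function names are hypothetical.

```python
MASK = 0xFFFFFFFF

def adds(a, b):
    """32-bit add that also reports the carry out (as ADDS sets C)."""
    total = a + b
    return total & MASK, total >> 32

def adc(a, b, carry):
    """32-bit add including an incoming carry."""
    return (a + b + carry) & MASK

def add64(r0, r1, r2, r3):
    """Add the 64-bit values (r1:r0) and (r3:r2); returns (low, high)."""
    low, carry = adds(r2, r0)   # low words first, carry goes to C
    high = adc(r3, r1, carry)   # high words plus the intermediate carry
    return low, high

# 0x1_FFFFFFFF + 0x2_00000001 = 0x4_00000000
print([hex(w) for w in add64(0xFFFFFFFF, 0x1, 0x00000001, 0x2)])
```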
Data Transfer Instructions

◻ Data transfer instructions move data between ARM registers and memory. There
are three basic forms of data transfer instruction in the ARM instruction set:
◻ Single register load and store instructions.
🞑 These instructions provide the most flexible way to transfer single data items between an
ARM register and memory. The data item may be a byte, a 32-bit word, or a 16-bit half-
word.
◻ Multiple register load and store instructions.
🞑 These allow large quantities of data to be transferred more efficiently. They are used for procedure entry
and exit, to save and restore workspace registers, and to copy blocks of data around
memory.
◻ Single register swap instructions.
🞑 These allow a value in a register to be exchanged with a value in memory.

◻ Register-indirect addressing
🞑 The ARM data transfer instructions are all based around register-indirect
addressing, with modes that include base-plus-offset and base-plus-index
addressing.
🞑 Register-indirect addressing uses a value in one register (the base register) as a
memory address and either loads the value from that address into another
register or stores the value from another register into that memory address.
🞑 LDR r0, [r1] ; r0 := mem32[r1]
🞑 STR r0, [r1] ; mem32[r1] := r0

🞑 Base-plus-offset and base-plus-index


Data Transfer Instructions – Address Pointer Initialization
◻ To load or store from or to a particular memory location, an ARM register must be
initialized to contain the address of that location
◻ The assembler provides a built-in pseudo instruction for this, ADR (it does not correspond directly to a
particular ARM instruction).
🞑 Consider a program which must copy data from TABLE1 to TABLE2, both of which are near to the code:
Data Transfer Instructions – Single Register load and store
Data Transfer Instructions – Base Plus offset addressing
◻ Pre-indexed LDR r0, [r1,#4] ; r0 := mem32[r1+ 4]
◻ Auto-indexing LDR r0,[r1,#4]! ; r0 := mem32[r1+ 4]
; r1 := r1+ 4
◻ Post–indexed LDR r0, [r1], #4 ; r0 := mem32 [r1]
; r1 := r1 + 4
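The difference between the three forms is only in which address is used and whether the base register is updated. The following Python sketch models them with a dictionary for memory and another for registers; both helper names and the memory contents are hypothetical.

```python
def ldr_pre(mem, regs, rd, rn, offset, writeback=False):
    """LDR rd, [rn, #offset] and, with writeback, LDR rd, [rn, #offset]!"""
    address = regs[rn] + offset
    regs[rd] = mem[address]
    if writeback:              # auto-indexing also updates the base register
        regs[rn] = address

def ldr_post(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset: use the base first, then add the offset."""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset

mem = {0x100: 11, 0x104: 22}
regs = {"r0": 0, "r1": 0x100}
ldr_post(mem, regs, "r0", "r1", 4)
print(regs)  # r0 holds the word at 0x100, r1 has advanced to 0x104
```

Post-indexing suits walking through an array, since each access leaves the base pointing at the next element.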
Data Transfer Instructions – Multiple Register
Data Transfer Instructions – Stack Addressing

◻ A stack is a form of last-in-first-out store which supports simple dynamic
memory allocation, that is, memory allocation where the address to be
used to store a data value is not known at the time the program is
compiled or assembled.
◻ A linear structure which grows up (ascending stack) or down
(descending stack)
◻ A stack pointer holds the address of the current top of the stack, either by
pointing to the last valid data item pushed onto the stack (a full stack), or by
pointing to the vacant slot where the next data item will be placed (an
empty stack).
◻ Four combinations: full ascending, empty ascending, full descending, empty descending
Data Transfer Instructions – Block Copy Addressing

STMFD r13!, {r2-r9} ; save regs onto stack
LDMIA r0!, {r2-r9} ; load 8 words starting at [r0], then r0 := r0 + 32
STMIA r1, {r2-r9} ; store the 8 words starting at [r1]
LDMFD r13!, {r2-r9} ; restore from stack
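The save and restore at either end of this sequence use a full descending stack. The Python sketch below models that pair of instructions under simplifying assumptions: registers are a 16-element list, memory is a dictionary of word addresses, and the lowest-numbered register goes to the lowest address. The function names mirror the mnemonics but are otherwise hypothetical.

```python
def stmfd(mem, regs, reglist):
    """STMFD r13!, {reglist}: push registers onto a full descending stack."""
    sp = regs[13] - 4 * len(reglist)   # stack grows towards lower addresses
    for i, r in enumerate(sorted(reglist)):
        mem[sp + 4 * i] = regs[r]
    regs[13] = sp                      # sp points at the last item pushed

def ldmfd(mem, regs, reglist):
    """LDMFD r13!, {reglist}: pop the same registers back."""
    sp = regs[13]
    for i, r in enumerate(sorted(reglist)):
        regs[r] = mem[sp + 4 * i]
    regs[13] = sp + 4 * len(reglist)

regs = [0] * 16
regs[13] = 0x8000
for r in range(2, 10):
    regs[r] = r * 10
mem = {}
stmfd(mem, regs, range(2, 10))
print(hex(regs[13]))  # eight words below the starting stack pointer
```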
Control Flow Instructions – Branch instructions, Conditional Branches

LABEL
        MOV r0, #0      ; initialize counter
LOOP    …….
        ADD r0, r0, #1  ; increment loop counter
        CMP r0, #10     ; compare with limit
        BNE LOOP        ; repeat if not equal
        .…              ; else fall through
Control Flow Instructions – Branch Conditions
Control Flow Instructions – Conditional Execution
An unusual feature of the ARM instruction set is that conditional execution applies
not only to branches but to all ARM instructions. A branch which is used to skip a
small number of following instructions may be omitted altogether by giving those
instructions the opposite condition.

This may be replaced by:


Control Flow Instructions – Branch & Link Instructions

⮚ Nested subroutine

BL SUB1

SUB1 BL SUB2

….
SUB2 …….
Control Flow Instructions – Subroutine return instructions
⮚ Nested subroutine
BL SUB1
……..

SUB1 STMFD r13!, {r0-r2,r14} ; save work & link regs
BL SUB2
….
…..
LDMFD r13!, {r0-r2,pc} ; restore work regs & return

SUB2 …….
MOV pc, r14 ; copy r14 into r15 to return
Control Flow Instructions – Supervisor Calls

◻ The supervisor is a program which operates at a
privileged level, which means that it can do things that a
user-level program cannot do directly.
◻ The instruction set includes a special instruction, SWI, to
call these functions. (SWI stands for 'Software Interrupt',
but is usually pronounced 'Supervisor Call'.)
◻ SWI SWI_WriteC ; output r0[7:0]
; sends the character in the bottom byte of r0 to
; the user display device
Assembly Language Programs – Hello World
Assembly Language Programs – Block Copy
3-Stage pipeline ARM Organization

◻ The register bank
◻ The barrel shifter
◻ The ALU
◻ The address register and incrementer
◻ The data registers
◻ The instruction decoder and associated control logic

3-Stage pipeline ARM Organization - Components
◻ The register bank
🞑 Stores the processor state.
🞑 It has two read ports and one write port which can each be used to
access any register, plus an additional read port and an additional
write port that give special access to r15, the program counter.
◻ The barrel shifter
🞑 Can shift or rotate one operand by any number of bits.
◻ The ALU
🞑 Performs the arithmetic and logic functions required by
the instruction set.
3-Stage pipeline ARM Organization - Components
◻ The address register and incrementer
🞑 Select and hold all memory addresses and generate sequential
addresses when required.
◻ The data registers
🞑 Hold data passing to and from memory.
◻ In a single-cycle data processing instruction, two register operands are
accessed, the value on the B bus is shifted and combined with the value on
the A bus in the ALU, then the result is written back into the register bank.
◻ The program counter value is in the address register, from where it is fed into
the incrementer, then the incremented value is copied back into r15 in the
register bank and also into the address register to be used as the address for
the next instruction fetch.
Stages of Pipeline
◻ Fetch
🞑 Instruction is fetched from memory and placed in instruction pipeline
◻ Decode
🞑 Instruction is decoded and the datapath control signals are prepared for the next cycle; the
instruction 'owns' the decode logic but not the datapath
◻ Execute
🞑 The instruction 'owns' the datapath
🞑 Register bank is read, an operand is shifted, result generated is written back to
destination register
◻ At any one time, three different instructions may occupy each of these stages,
so the hardware in each stage has to be capable of independent operation.
Pipelining

◻ Pipelining is the technique of starting the next instruction
before the current one has finished.
◻ Simple data processing instructions -- the pipeline enables
one instruction to be completed every clock cycle.
◻ An individual instruction takes three clock cycles to complete,
so it has a three-cycle latency, but the throughput is one
instruction per cycle.
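The latency and throughput figures can be tied together with a one-line count. This Python sketch assumes an ideal pipeline with only single-cycle instructions and no stalls or branches; the function name is hypothetical.

```python
def pipeline_cycles(n_instructions, stages=3):
    """Cycles for n single-cycle instructions in an ideal pipeline.

    The first instruction needs `stages` cycles (the latency); each later
    instruction completes one cycle after its predecessor (the throughput).
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)

for n in (1, 10, 1000):
    print(n, pipeline_cycles(n))
```

For long instruction sequences the total approaches one cycle per instruction, which is why the three-cycle latency matters little in steady state.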

Single Cycle Instruction

Multi-Cycle Instruction

Breaks in Pipeline

◻ All instructions occupy the datapath for one or more adjacent
cycles.
◻ For each cycle that an instruction occupies the datapath, it
occupies the decode logic in the immediately preceding cycle.
◻ During the first datapath cycle each instruction issues a fetch for
the next instruction.
◻ Branch instructions clear and refill the instruction pipeline.

PC Generation
◻ Pipeline reads instruction operands one stage earlier in the
pipeline.
◻ Incremented PC value is fed directly into decode stage,
bypassing pipeline register between 2 stages.

ARM Instruction Execution – Data processing Instructions
◻ The instruction is taken to the instruction decoder unit.
◻ The instruction is decoded and, based on the opcode/operand,
🞑 control signals are generated that select which registers are to be accessed.
◻ Data is read from the memory and stored in the register
◻ Source operands (Rn, Rm)
🞑 are read from the register file using buses A and B respectively, and the result Rd is written back
◻ The ALU (Arithmetic and Logic Unit) takes the register values Rn and Rm from buses
A and B and computes a result
◻ Data processing instructions write the result in Rd to the register file
ARM Instruction Execution – Data Transfer Instructions
◻ Load-store instructions use the ALU to generate an address to be held in the address
register and broadcast on the address bus
◻ Barrel shifter
🞑 Register Rm can optionally be preprocessed in the barrel shifter before it enters the ALU
🞑 This allows a wide range of expressions and addresses to be generated in the same cycle
◻ The PC value in the address register is fed into the incrementer and the incremented value is
written back to r15
◻ E.g. the PC accesses an instruction at address 1000
🞑 The instruction at location 1000 is read
🞑 The address is incremented to 1004 and fed back to the PC.
◻ The incremented address is also written into the address register
🞑 To be used as the address for the next fetch
ARM Instruction Execution – Branching Instructions
◻ Branch instructions compute the target address in the first cycle.
◻ A 24-bit immediate field is extracted from the instruction and then shifted
left two bit positions to give a word-aligned offset which is added to the
PC.
◻ The result is issued as an instruction fetch address, and while the
instruction pipeline refills the return address is copied into the link
register (r14) if this is required (that is, if the instruction is a 'branch with
link').
◻ The third cycle, which is required to complete the pipeline refilling, is also
used to make a small correction to the value stored in the link register in
order that it points directly at the instruction which follows the branch.
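The target calculation can be sketched numerically. This Python illustration adds two details that the slide leaves implicit and that should be read as assumptions here: the 24-bit field is sign-extended before shifting, and in ARM state the PC value seen by the instruction is its own address plus 8. The function names are hypothetical.

```python
def branch_target(instr_addr, instruction):
    """Target address of an ARM B/BL instruction located at instr_addr."""
    offset = instruction & 0x00FFFFFF      # 24-bit immediate field
    if offset & 0x00800000:                # sign-extend a negative offset
        offset -= 0x01000000
    pc = instr_addr + 8                    # PC as seen by the instruction
    return (pc + (offset << 2)) & 0xFFFFFFFF

def link_value(instr_addr):
    """Corrected r14 for BL: the instruction that follows the branch."""
    return instr_addr + 4

# 0xEAFFFFFE encodes a branch to itself (offset of -2 words).
print(hex(branch_target(0x1000, 0xEAFFFFFE)), hex(link_value(0x1000)))
```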
The ARM Coprocessor Interface

◻ The ARM supports a general-purpose extension of its
instruction set through the addition of hardware
coprocessors
◻ It also supports the software emulation of these
coprocessors through the undefined instruction trap.
Coprocessor Architecture
◻ Support for up to 16 logical coprocessors.
◻ Each coprocessor can have up to 16 private registers of
any reasonable size; they are not limited to 32 bits.
◻ Coprocessors use a load-store architecture, with
instructions to perform internal operations on registers,
instructions to load and save registers from and to
memory, and instructions to move data to or from an ARM
register.
ARM7TDMI coprocessor interface
◻ The coprocessor interface is based on 'bus watching'.
◻ ARM instruction stream flows into the ARM, and the
coprocessor copies the instructions into an internal
pipeline that mimics the behaviour of the ARM
instruction pipeline.
◻ As each coprocessor instruction begins execution there is
a 'hand-shake' between the ARM and the coprocessor to
confirm that they are both ready to execute it.
ARM7TDMI coprocessor handshake signals
◻ cpi (from ARM to all coprocessors).
🞑 'Coprocessor Instruction' indicates that the ARM has identified a
coprocessor instruction and wishes to execute it.
◻ cpa (from the coprocessors to ARM).
🞑 'Coprocessor Absent' tells the ARM that there is no coprocessor
present that is able to execute the current instruction.
◻ cpb (from the coprocessors to ARM).
🞑 'Coprocessor Busy' tells the ARM that the coprocessor cannot
begin executing the instruction yet.
ARM7TDMI coprocessor handshake outcomes
◻ The ARM may decide not to execute the instruction, either because it falls in a branch
shadow or because it fails its condition code test. ARM will not assert cpi,
and the instruction will be discarded by all parties.
◻ The ARM may decide to execute it but no present coprocessor can take it, so
cpa stays active. ARM will take the undefined instruction trap and use
software to recover, possibly by emulating the trapped instruction.
◻ ARM decides to execute the instruction and a coprocessor accepts it, but
cannot execute it yet.
◻ ARM decides to execute the instruction and a coprocessor accepts it for
immediate execution; cpi, cpa and cpb are all taken low and both sides
commit to complete the instruction.
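The four outcomes form a simple decision chain. The Python sketch below expresses it with booleans rather than signal levels; `arm_executes` stands for the ARM deciding to execute the instruction, while `cpa` and `cpb` are True when 'absent' or 'busy' is being signalled. The function name and the outcome strings are hypothetical labels for the four cases.

```python
def handshake_outcome(arm_executes, cpa, cpb):
    """Classify the handshake for one coprocessor instruction."""
    if not arm_executes:
        return "discarded"              # branch shadow or failed condition
    if cpa:
        return "undefined instruction trap"  # no coprocessor accepts it
    if cpb:
        return "wait"                   # accepted, but not ready yet
    return "committed"                  # both sides complete the instruction

print(handshake_outcome(True, cpa=False, cpb=True))
```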
ARM Coprocessor Data Transfers & Pre-emptive execution

◻ If the instruction is a coprocessor data transfer instruction the ARM is
responsible for generating an initial memory address.
◻ The coprocessor determines the length of the transfer.
◻ ARM will continue incrementing the address until the coprocessor signals
completion.
◻ Coprocessors should limit the maximum transfer length to 16 words so as not to
compromise the ARM's interrupt response.
◻ A coprocessor may begin executing an instruction as soon as it enters its pipeline
so long as it can recover its state if the handshake does not ultimately complete.
◻ All activity must be idempotent (repeatable with identical results) up to the point
of commitment.

Assignment Topics
5-Stage Pipeline ARM Organization- Assignment Topics

◻ Tprog = Ninst × CPI / fclk
◻ Tprog is the time required to execute a program
◻ CPI is the average number of clock cycles per instruction
◻ fclk is the processor's clock frequency
◻ Ninst is the number of ARM instructions executed in the course of the
program and is constant for a given program
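The formula is easy to evaluate for what-if comparisons. In the Python sketch below the instruction count, CPI values and clock frequencies are made-up figures chosen only to show the arithmetic; they are not measurements of any ARM core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Tprog = Ninst x CPI / fclk, in seconds."""
    return n_inst * cpi / f_clk_hz

# Hypothetical figures: the same program on two imaginary configurations.
base = program_time(1_000_000, cpi=1.9, f_clk_hz=20e6)
improved = program_time(1_000_000, cpi=1.5, f_clk_hz=40e6)
print(base, improved)
```

Because Ninst is fixed for a given program, only a lower CPI or a higher fclk can shorten the execution time.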

Performance Improvement- Assignment Topics

◻ Increase the clock rate, fclk
🞑 Logic in each pipeline stage has to be simplified
🞑 The number of pipeline stages has to be increased
🞑 Can't be increased much, since it depends on hardware
◻ Reduce the average number of clock cycles per instruction, CPI
🞑 Instructions which occupy more than one pipeline slot in the 3-stage
pipeline have to be re-implemented to occupy fewer slots
🞑 Pipeline stalls caused by dependencies between instructions have to be
reduced
🞑 A combination of both
Memory Bottleneck- Assignment Topics

◻ To get a better CPI
🞑 The memory system must deliver more than one value in each clock
cycle
■ By delivering more than 32 bits per cycle from a single memory
■ Or by having separate memories for instruction and data access

How to handle the issues? - Assignment Topics

◻ 5-stage pipeline
🞑 Breaking instruction execution into 5 stages reduces the maximum work to be
completed in each clock cycle
◻ Separate instruction and data memories
🞑 Can be separate caches connected to a unified instruction and data
main memory
🞑 Significant reduction in the core's CPI

Stages of 5-Stage Pipeline- Assignment Topics

◻ Fetch
🞑The instruction is fetched from memory and placed in the
instruction pipeline.
◻ Decode
🞑The instruction is decoded and register operands read from the
register file.
🞑There are three operand read ports in the register file, so most
ARM instructions can source all their operands in one cycle.
Stages of 5-Stage Pipeline- Assignment Topics

◻ Execute
🞑 An operand is shifted and the ALU result generated.
🞑 If the instruction is a load or store the memory address is computed in the
ALU.
◻ Buffer/data
🞑 Data memory is accessed if required. Otherwise the ALU result is simply
buffered for one clock cycle to give the same pipeline flow for all instructions.
◻ Write-back
🞑 The results generated by the instruction are written back to the register
file, including any data loaded from memory.
5-Stage Pipeline – Data Forwarding - Assignment Topics
◻ Because instruction execution is spread across three pipeline stages, the only
way to resolve data dependencies without stalling the pipeline is to
introduce forwarding paths.
◻ Data dependencies arise when an instruction needs to use the result of
one of its predecessors before that result has returned to the register file.
◻ Forwarding paths allow results to be passed between stages as soon as
they are available, and the 5-stage ARM pipeline requires each of the
three source operands to be forwarded from any of three
intermediate result registers.
ARM Implementation – Clocking Scheme - Assignment Topics
ARM Implementation – Datapath Timing - Assignment Topics
ARM1 ripple carry adder - Assignment Topics
ARM2 4-bit carry look-ahead adder - Assignment Topics
ARM2 ALU Logic - Assignment Topics
ARM6 Carry-Select Adder - Assignment Topics
ARM6 ALU Organization - Assignment Topics
ARM9 Carry arbitration encoding - Assignment Topics

$(u, v) \cdot (u', v') = (v + u \cdot u',\ v + u \cdot v')$
The Barrel Shifter - Assignment Topics
Multiplier Design - Assignment Topics
High-Speed multiplier Organization - Assignment Topics
ARM6 Register Cell Circuit - Assignment Topics
ARM Register Bank Floorplan - Assignment Topics
Datapath Layout - Assignment Topics
Control Structures - Assignment Topics
