Download as pdf or txt
Download as pdf or txt
You are on page 1of 30

Computer Organization & ARM Microcontrolers-21EC52

Module - 5
Introduction to the THUMB instruction set
Topics:

1. Introduction
2. THUMB register usage
3. ARM – THUMB interworking
4. Other branch instructions
5. Data processing instructions
6. Single-Register Load-Store Instructions
7. Multiple-Register Load-Store Instructions
8. Stack instructions
9. Software interrupt instructions.

Efficient C Programming
Topics:

1. Overview of C Compilers and optimization


2. Basic C Data types
- Local Variable Types
- Function Argument Types
- Signed versus Unsigned Types
3. C looping structures

- Loops with a Fixed Number of Iterations

- Loops Using a Variable Number of Iterations

- Loop Unrolling

Dept. of ECE Page 40


Computer Organization & ARM Microcontrolers-21EC52

1. Introduction

Thumb encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set space. Since
Thumb has higher performance than ARM on a processor with a 16-bit data bus, but lower
performance than ARM on a 32-bit data bus, use Thumb for memory-constrained systems.

Thumb has higher code density, the space taken up in memory by an executable program than
ARM. For memory-constrained embedded systems, for example, mobile phones and PDAs, code
density is very important. Cost pressures also limit memory size, width, and speed.

On average, a Thumb implementation of the same code takes up around 30% less memory than
the equivalent ARM implementation. As an example, Figure 1 shows the same divide code
routine implemented in ARM and Thumb assembly code.

Fig 1: Code density

Even though the Thumb implementation uses more instructions, the overall memory footprint is
reduced. Code density was the main driving force for the Thumb instruction set. Because it was
also designed as a compiler target, rather than for hand-written assembly code, we recommend
that you write Thumb-targeted code in a high-level language like C or C++.

Each Thumb instruction is related to a 32-bit ARM instruction. Figure 2 shows a simple Thumb
ADD instruction being decoded into an equivalent ARM ADD instruction.

Dept. of ECE Page 41


Computer Organization & ARM Microcontrolers-21EC52

Fig 2: Thumb instruction decoding

Table 1 provides a complete list of Thumb instructions available in the THUMBv2 architecture
used in the ARMv5TE architecture. Only the branch relative instruction can be conditionally
executed. The limited space available in 16 bits causes the barrel shift operations ASR, LSL,
LSR, and ROR to be separate instructions in the Thumb ISA.

Dept. of ECE Page 42


Computer Organization & ARM Microcontrolers-21EC52

Table 1: Thumb instruction set

2. Thumb Register Usage

In Thumb state, you do not have direct access to all registers. Only the low registers r0 to r7 are
fully accessible, as shown in Table 2. The higher registers r8 to r12 are only accessible with

Dept. of ECE Page 43


Computer Organization & ARM Microcontrolers-21EC52

MOV, ADD, or CMP instructions. CMP and all the data processing instructions that operate on
low registers update the condition flags in the cpsr.

Table 2: Summary of Thumb register usage.

Note that from the Thumb register usage table that there is no direct access to the cpsr or spsr. In
other words, there are no MSR – and MRS -equivalent Thumb instructions. To alter the cpsr or
spsr, you must switch into ARM state to use MSR and MRS. Similarly; there are no coprocessor
instructions in Thumb state. You need to be in ARM state to access the coprocessor for
configuring cache and memory management.

3. ARM-Thumb Interworking

ARM-Thumb interworking is the name given to the method of linking ARM and Thumb code
together for both assembly and C/C++. It handles the transition between the two states. Extra
code, called a veneer, is sometimes needed to carry out the transition.

ATPCS defines the ARM and Thumb procedure call standards. To call a Thumb routine from an
ARM routine, the core has to change state. This state change is shown in the T bit of the cpsr.
The BX and BLX branch instructions cause a switch between ARM and Thumb state while
branching to a routine. The BX lr instruction returns from a routine, also with a state switch if
necessary.

The BLX instruction was introduced in ARMv5T. On ARMv4T cores the linker uses a veneer to
switch state on a subroutine call. Instead of calling the routine directly, the linker calls the
veneer, which switches to Thumb state using the BX instruction.

Dept. of ECE Page 44


Computer Organization & ARM Microcontrolers-21EC52

There are two versions of the BX or BLX instructions: an ARM instruction and a Thumb
equivalent. The ARM BX instruction enters Thumb state only if bit 0 of the address in Rn is set
to binary 1; otherwise it enters ARM state. The Thumb BX instruction does the same.

Syntax: BX Rm

BLX Rm | label

Unlike the ARM version, the Thumb BX instruction cannot be conditionally executed.

Example 1: This example shows a small code fragment that uses both the ARM and Thumb
versions of the BX instruction. This sets the T bit in the cpsr to Thumb state. The return address
is not automatically preserved by the BX instruction. Rather the code sets the return address
explicitly using a MOV instruction prior to the branch:

; ARM code
CODE32 ; word aligned
LDR r0, =thumbCode+1 ; +1 to enter Thumb state
MOV lr, pc ; set the return address
BX r0 ; branch to Thumb code & mode
; continue here
; Thumb code
CODE16 ; halfword aligned
thumbCode
ADD r1, #1
BX lr ; return to ARM code & state

A branch exchange instruction can also be used as an absolute branch providing bit 0 isn’t used
to force a state change:
; address(thumbCode) = 0x00010000
; cpsr = nzcvqIFt_SVC
; r0 = 0x00000000
0x00009000 LDR r0, =thumbCode+1

Dept. of ECE Page 45


Computer Organization & ARM Microcontrolers-21EC52

; cpsr = nzcvqIFt_SVC
; r0 = 0x00010001
0x00009008 BX r0
; cpsr = nzcvqIFT_SVC
; r0 = 0x00010001
; pc = 0x00010000

You can see that the least significant bit of register r0 is used to set the T bit of the cpsr. The cpsr
changes from IFt, prior to the execution of the BX, to IFT, after execution. The pc is then set to
point to the start address of the Thumb routine.

Example 2: Replacing the BX instruction with BLX simplifies the calling of a Thumb routine
since it sets the return address in the link register lr:

;ARM Code

CODE32
LDR r0, =thumbRoutine+1 ; enter Thumb state
BLX r0 ; jump to Thumb code
; continue here
CODE16
thumbRoutine
ADD r1, #1
BX r14 ; return to ARM code and state

4. Other Branch Instructions

There are two variations of the standard branch instruction, or B. The first is similar to the ARM
version and is conditionally executed; the branch range is limited to a signed 8-bit immediate, or
-256 to +254 bytes. The second version removes the conditional part of the instruction and
expands the effective branch range to a signed 11-bit immediate, or – 2048 to +2046 bytes.

The conditional branch instruction is the only conditionally executed instruction in Thumb state.

Syntax: B<cond> label

B label

BL label

Dept. of ECE Page 46


Computer Organization & ARM Microcontrolers-21EC52

The BL instruction is not conditionally executed and has an approximate range of +/− 4 MB.

This range is possible because BL (and BLX) instructions are translated into a pair of 16-bit
Thumb instructions. The first instruction in the pair holds the high part of the branch offset, and
the second the low part. These instructions must be used as a pair.

The code here shows the various instructions used to return from a BL subroutine call:

To return, we set the pc to the value in lr.

5. Data Processing Instructions

The data processing instructions manipulate data within registers. They include move
instructions, arithmetic instructions, shifts, logical instructions, comparison instructions, and
multiply instructions. The Thumb data processing instructions are a subset of the ARM data
processing instructions.

Syntax:

<ADC|ADD|AND|BIC|EOR|MOV|MUL|MVN|NEG|ORR|SBC|SUB> Rd,
Rm
<ADD|ASR|LSL|LSR|ROR|SUB> Rd, Rn #immediate
<ADD|MOV|SUB> Rd,#immediate
<ADD|SUB> Rd,Rn,Rm
ADD Rd,pc,#immediate

Dept. of ECE Page 47


Computer Organization & ARM Microcontrolers-21EC52

ADD Rd,sp,#immediate
<ADD|SUB> sp, #immediate
<ASR|LSL|LSR|ROR> Rd,Rs
<CMN|CMP|TST> Rn,Rm
CMP Rn,#immediate
MOV Rd,Rn

Dept. of ECE Page 48


Computer Organization & ARM Microcontrolers-21EC52

These instructions follow the same style as the equivalent ARM instructions. Most Thumb data
processing instructions operate on low registers and update the cpsr. The exceptions are

MOV Rd,Rn
ADD Rd,Rm
CMP Rn,Rm
ADD sp, #immediate
SUB sp, #immediate
ADD Rd,sp,#immediate
ADD Rd,pc,#immediate

which can operate on the higher registers r8–r14 and the pc. These instructions, except for CMP,
do not update the condition flags in the cpsr when using the higher registers. The CMP
instruction, however, always updates the cpsr.
Example 1: his example shows a simple Thumb ADD instruction. It takes two low registers r1
and r2 and adds them together. The result is then placed into register r0, overwriting the original
contents. The cpsr is also updated.

ADD r0, r1, r2

Dept. of ECE Page 49


Computer Organization & ARM Microcontrolers-21EC52

PRE: cpsr = nzcvIFT_SVC, r1 = 0x80000000, r2 = 0x10000000

POST: r0 = 0x90000000, cpsr = NzcvIFT_SVC

Example 2: Thumb deviates from the ARM style in that the barrel shift operations (ASR, LSL,
LSR, and ROR) are separate instructions. This example shows the logical left shift (LSL)
instruction to multiply register r2 by 2.

LSL r2, r4

PRE: r2 = 0x00000002, r4 = 0x00000001

POST: r2 = 0x00000004, r4 = 0x00000001

6. Single-Register Load-Store Instructions

The Thumb instruction set supports load and storing registers, or LDR and STR. These
instructions use two pre indexed addressing modes: offset by register and offset by immediate.

Syntax: <LDR|STR>{<B|H>} Rd, [Rn,#immediate]

LDR{<H|SB|SH>} Rd,[Rn,Rm]

STR{<B|H>} Rd,[Rn,Rm]

LDR Rd,[pc,#immediate]

<LDR|STR> Rd,[sp,#immediate]

Dept. of ECE Page 50


Computer Organization & ARM Microcontrolers-21EC52

The different addressing modes are shown in Table 3. The offset by register uses a base register
Rn plus the register offset Rm. The second uses the same base register Rn plus a 5-bit immediate
or a value dependent on the data size. The 5-bit offset encoded in the instruction is multiplied by
one for byte accesses, two for 16-bit accesses, and four for 32-bit accesses.

Table 3: Addressing modes

Example: This example shows two Thumb instructions that use a preindex addressing mode.
Both use the same pre-condition.

LDR r0, [r1, r4] ; register

PRE: mem32[0x90000] = 0x00000001, mem32[0x90004] = 0x00000002, mem32[0x90008] =


0x00000003, r0 = 0x00000000, r1 = 0x00090000, r4 = 0x00000004

POST: r0 = 0x00000002, r1 = 0x00090000, r4 = 0x00000004

LDR r0, [r1, #0x4] ; immediate

POST: r0 = 0x00000002

Dept. of ECE Page 51


Computer Organization & ARM Microcontrolers-21EC52

Both instructions carry out the same operation. The only difference is the second LDR uses a
fixed offset, whereas the first one depends on the value in register r4.

7. Multiple-Register Load-Store Instructions

The Thumb versions of the load-store multiple instructions are reduced forms of the ARM load-
store multiple instructions. They only support the increment after (IA) addressing mode.

Syntax : <LDM|STM>IA Rn!, {low Register list}

Here N is the number of registers in the list of registers. You can see that these instructions
always update the base register Rn after execution. The base register and list of registers are
limited to the low registers r0 to r7.

Example: This example saves registers r1 to r3 to memory addresses 0x9000 to 0x900c. It also
updates base register r4. Note that the update character! is not an option, unlike with the ARM
instruction set.

STMIA r4!,{r1,r2,r3}

PRE: r1 = 0x00000001, r2 = 0x00000002, r3 = 0x00000003, r4 = 0x9000

POST: mem32[0x9000] = 0x00000001, mem32[0x9004] = 0x00000002, mem32[0x9008] =


0x00000003, r4 = 0x900c

8. Stack Instructions

The Thumb stack operations are different from the equivalent ARM instructions because they
use the more traditional POP and PUSH concept.

Syntax: POP {low_register_list{, pc}}

PUSH {low_register_list{, lr}}

Dept. of ECE Page 52


Computer Organization & ARM Microcontrolers-21EC52

The interesting point to note is that there is no stack pointer in the instruction. This is because the
stack pointer is fixed as register r13 in Thumb operations and sp is automatically updated. The
list of registers is limited to the low registers r0 to r7.

The PUSH register list also can include the link register lr ; similarly the POP register list can
include the pc. This provides support for subroutine entry and exit. The stack instructions only
support full descending stack operations.

Example: In this example we use the POP and PUSH instructions. The subroutine
ThumbRoutine is called using a branch with link (BL) instruction.

The link register lr is pushed onto the stack with register r1. Upon return, register r1 is popped
off the stack, as well as the return address being loaded into the pc. This returns from the
subroutine.

9. Software Interrupt Instruction

Similar to the ARM equivalent, the Thumb software interrupt (SWI) instruction causes a
software interrupt exception. If any interrupt or exception flag is raised in Thumb state, the
processor automatically reverts back to ARM state to handle the exception.

Syntax: SWI immediate

Dept. of ECE Page 53


Computer Organization & ARM Microcontrolers-21EC52

Example: This example shows the execution of a Thumb SWI instruction. Note that the
processor goes from Thumb state to ARM state after execution.

0x00008000 SWI 0x45

PRE: cpsr = nzcVqifT_USER, pc = 0x00008000, lr = 0x003fffff ; lr = r14, r0 = 0x12,

POST: cpsr = nzcVqIft_SVC, spsr = nzcVqifT_USER, pc = 0x00000008, lr = 0x00008002, r0 =


0x12

Dept. of ECE Page 54


Computer Organization & ARM Microcontrolers-21EC52

Efficient C Programming

1. Overview of C Compilers and Optimization

Optimizing code takes time and reduces source code readability. Usually, it’s only worth
optimizing functions that are frequently executed and important for performance. To find the
frequently executed functions a performance profiling tool, found in most ARM simulators is
used.

C compilers have to translate C function literally into assembler so that it works for all possible
inputs. In practice, many of the input combinations are not possible or won’t occur. Let’s start by
looking at an example of the problems the compiler faces. The memclr function clears N bytes of
memory at address data.

No matter how advanced the compiler, it does not know whether N can be 0 on input or not.
Therefore the compiler needs to test for this case explicitly before the first iteration of the loop.

The compiler doesn’t know whether the data array pointer is four-byte aligned or not. If it is
four-byte aligned, then the compiler can clear four bytes at a time using an int store rather than a

Char store. Nor does it know whether N is a multiple of four or not. If N is a multiple of four,
then the compiler can repeat the loop body four times or store four bytes at a time using an int
store.

The compiler must be conservative and assume all possible values for N and all possible
alignments for data.

Dept. of ECE Page 55


Computer Organization & ARM Microcontrolers-21EC52

2. Basic C Data Types

Compilers armcc and gcc use the datatype mappings in Table 4 for an ARM target. The
exceptional case for type char is worth noting as it can cause problems when you are porting
code from another processor architecture. A common example is using a char type variable i as a
loop counter, with loop continuation condition i ≥ 0. As i is unsigned for the ARM compilers, the
loop will never terminate. Fortunately armcc produces a warning in this situation: unsigned
comparison with 0. Compilers also provide an override switch to make char signed. For example,
the command line option -fsigned-char will make char signed on gcc. The command line option -
zc will have the same effect with armcc.

Table 4: C compiler datatype mappings

2.1 Local Variable Types

ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However, most
ARM data processing operations are 32-bit only.

For this reason, you should use a 32-bit datatype, int or long , for local variables wherever
possible.

Avoid using char and short as local variable types, even if you are manipulating an 8- or 16-bit
value. The one exception is when you want wrap-around to occur. If you require modulo
arithmetic of the form 255 + 1 = 0, then use the char type.

The following code checksums a data packet containing 64 words. It shows why you should
avoid using char for local variables.

Dept. of ECE Page 56


Computer Organization & ARM Microcontrolers-21EC52

Consider the compiler output for this function by making i as char.

At first sight it looks as though declaring i as a char is efficient. All ARM registers are 32-bit and
all stack entries are at least 32-bit. Furthermore, to implement the i++ exactly, the compiler must
account for the case when i = 255. Any attempt to increment 255 should produce the answer 0.

Now compare this to the compiler output where instead we declare i as an unsigned int.

In the first case, the compiler inserts an extra AND instruction to reduce i to the range 0 to 255
before the comparison with 64. This instruction disappears in the second case.

With armcc this code will produce a warning if you enable implicit narrowing cast warnings
using the compiler switch -W+n. The expression sum + data[i] is an integer and so can only be

Dept. of ECE Page 57


Computer Organization & ARM Microcontrolers-21EC52

assigned to a short using an (implicit or explicit) narrowing cast. The following assembly output,
the compiler must insert extra instructions to implement the narrowing cast:

The loop is now three instructions longer than the loop for example checksum_v2 earlier! There
are two reasons for the extra instructions:

 The LDRH instruction does not allow for a shifted address offset as the LDR
instruction did in checksum_v2. Therefore the first ADD in the loop calculates the
address of item i in the array. The LDRH loads from an address with no offset.
 The cast reducing total + array[i] to a short requires two MOV instructions. The
compiler shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The
shift right is a sign-extending shift so it replicates the sign bit to fill the upper 16 bits.

Second problem can be avoided by using an int type variable to hold the partial sum. This
reduces the sum to a short type at the function exit.

However, the first problem is a new issue. We can solve it by accessing the array by
incrementing the pointer data rather than using an index as in data[i]. This is efficient regardless
of array type size or element size. All ARM load and store instructions have a postincrement
addressing mode.

Dept. of ECE Page 58


Computer Organization & ARM Microcontrolers-21EC52

Example: The checksum_v4 code fixes all the problems. It uses int type local variables to avoid
unnecessary casts. It increments the pointer data instead of using an index offset data[i].

The *(data++) operation translates to a single ARM instruction that loads the data and
increments the data pointer. Of course you could write sum += *data; data++ ; or even *data++
instead if you prefer. The compiler produces the following output. Three instructions have been
removed from the inside loop, saving three cycles per loop compared to checksum_v3.

2.2 Function Argument Types

Converting local variables from types char or short to type int increases performance and reduces
code size. The same holds for function arguments. Consider the following simple function,
which adds two 16-bit values, halving the second, and returns a 16-bit sum:

This function is a little artificial, but it is a useful test case to illustrate the problems faced by the
compiler. The input values a, b, and the return value will be passed in 32-bit ARM registers.
Should the compiler assume that these 32-bit values are in the range of a short type, that is,
−32,768 to +32,767? Or should the compiler force values to be in this range by sign-extending

Dept. of ECE Page 59


Computer Organization & ARM Microcontrolers-21EC52

the lowest 16 bits to fill the 32-bit register? The compiler must make compatible decisions for
the function caller and callee. Either the caller or callee must perform the cast to a short type.

We say that function arguments are passed wide if they are not reduced to the range of the type
and narrow if they are. You can tell which decision the compiler has made by looking at the
assembly output for add_v1. If the compiler passes arguments wide, then the callee must reduce
function arguments to the correct range. If the compiler passes arguments narrow, then the caller
must reduce the range. If the compiler returns values wide, then the caller must reduce the return
value to the correct range. If the compiler returns values narrow, then the callee must reduce the
range before returning the value.

For armcc n ADS, function arguments are passed narrow and values returned narrow. In other
words, the caller casts argument values and the callee casts return values. The compiler uses the
ANSI prototype of the function to determine the datatypes of the function arguments.

The armcc output for add_v1 shows that the compiler casts the return value to a short type, but
does not cast the input values. It assumes that the caller has already ensured that the 32-bit
valuesr0 and r1 are in the range of the short type. This shows narrow passing of arguments and
return value.

The gcc compiler we used is more cautious and makes no assumptions about the range of
argument value. This version of the compiler reduces the input arguments to the range of a short
in both the caller and the callee. It also casts the return value to a short type.

Here is the compiled code for add_v1:

Dept. of ECE Page 60


Computer Organization & ARM Microcontrolers-21EC52

Whatever the merits of different narrow and wide calling protocols, you can see that char or
short type function arguments and return values introduce extra casts. These increase code size
and decrease performance. It is more efficient to use the int type for function arguments and
return values, even if you are only passing an 8-bit value.

2.3 Signed versus Unsigned Types

This section compares the efficiencies of signed int and unsigned int.

If your code uses addition, subtraction, and multiplication, then there is no performance
difference between signed and unsigned operations. However, there is a difference when it
comes to division. Consider the following short example that averages two integers:

Notice that the compiler adds one to the sum before shifting by right if the sum is negative. In
other words it replaces x/2 by the statement:

Dept. of ECE Page 61


Computer Organization & ARM Microcontrolers-21EC52

It must do this because x is signed. In C on an ARM target, a divide by two is not a right shift if
x is negative. For example, −3»1 = -2 but − 3/2 = −1. Division rounds towards zero, but
arithmetic right shift rounds towards −∞.

It is more efficient to use unsigned types for divisions. The compiler converts unsigned power of
two divisions directly to right shifts. For general divisions, the divide routine in the C library is
faster for unsigned types.

The Efficient Use of C Types

 For local variables held in registers, don’t use a char or short type unless 8-bit or 16-bit
modular arithmetic is necessary. Use the signed or unsigned int types instead. Unsigned
types are faster when you use divisions.
 For array entries and global variables held in main memory, use the type with the
smallest size possible to hold the required data. This saves memory footprint. The
ARMv4 architecture is efficient at loading and storing all data widths provided you
traverse arrays by incrementing the array pointer. Avoid using offsets from the base of
the array with short type arrays, as LDRH does not support this.
 Use explicit casts when reading array entries or global variables into local variables, or
writing local variables out to array entries. The casts make it clear that for fast operation
you are taking a narrow width type stored in memory and expanding it to a wider type in
the registers. Switch on implicit narrowing cast warnings in the compiler to detect
implicit casts.
 Avoid implicit or explicit narrowing casts in expressions because they usually cost extra
cycles. Casts on loads or stores are usually free because the load or store instruction
performs the cast for you.
 Avoid char and short types for function arguments or return values. Instead use the int
type even if the range of the parameter is smaller. This prevents the compiler performing
unnecessary casts.

3. C Looping Structures

 Loops with a Fixed Number of Iterations

Dept. of ECE Page 62


Computer Organization & ARM Microcontrolers-21EC52

 Loops Using a Variable Number of Iterations


 Loop unrolling

3.1 Loops with a Fixed Number of Iterations

Consider the checksum example: This shows how the compiler treats a loop with incrementing
count i++.

It takes three instructions to implement the for loop structure:

 An ADD to increment i
 A compare to check if i is less than 64
 A conditional branch to continue the loop if i <64

This is not efficient. On the ARM, a loop should only use two instructions:

 A subtract to decrement the loop counter, which also sets the condition code flags on the
result
 A conditional branch instruction

The key point is that the loop counter should count down to zero rather than counting up to some
arbitrary limit. Then the comparison with zero is free since the result is stored in the condition
flags. Since we are no longer using i as an array index, there is no problem in counting down
rather than up.

Dept. of ECE Page 63


Computer Organization & ARM Microcontrolers-21EC52

Example: This example shows the improvement if we switch to a decrementing loop rather than
an incrementing loop.

The SUBS and BNE instructions implement the loop. Our checksum example now has the
minimum number of four instructions per loop. This is much better than six for checksum_v1
and eight for checksum_v3.

For an unsigned loop counter i we can use either of the loop continuation conditions i!=0 or i>0.
As i can’t be negative, they are the same condition. For a signed loop counter, it is tempting to
use the condition i>0 to continue the loop. You might expect the compiler to generate the
following two instructions to implement the loop:

The compiler is not being inefficient. It must be careful about the case when i = -0x80000000
because the two sections of code generate different answers in this case.

For the first piece of code the SUBS instruction compares i with 1 and then decrements i.

Dept. of ECE Page 64


Computer Organization & ARM Microcontrolers-21EC52

Since -0x80000000 < 1, the loop terminates. For the second piece of code, we decrement i and
then compare with 0. Modulo arithmetic means that i now has the value +0x7fffffff, which is
greater than zero. Thus the loop continues for many iterations.

Of course, in practice, i rarely takes the value -0x80000000. The compiler can’t usually
determine this, especially if the loop starts with a variable number of iterations. Therefore you
should use the termination condition i!=0 for signed or unsigned loop counters. It saves one
instruction over the condition i>0 for signed i.

3.2 Loops Using a Variable Number of Iterations

The checksum_v7 example shows how the compiler handles a for loop with a variable number of
iterations N.

Dept. of ECE Page 65


Computer Organization & ARM Microcontrolers-21EC52

Notice that the compiler checks that N is nonzero on entry to the function. Often this check is
unnecessary since you know that the array won’t be empty. In this case a do-while loop gives
better performance and code density than a for loop.

Example: This example shows how to use a do-while loop to remove the test for N being zero
that occurs in a for loop.

3.3 Loop Unrolling

In loops using fixed number of iterations each loop iteration costs two instructions in addition to
the body of the loop: a subtract to decrement the loop count and a conditional branch.

These instructions are called as the loop overhead. On ARM7 or ARM9 processors the subtract
takes one cycle and the branch three cycles, giving an overhead of four cycles per loop.

Some of these cycles by unrolling a loop—repeating the loop body several times, and reducing
the number of loop iterations by the same proportion.

Dept. of ECE Page 66


Computer Organization & ARM Microcontrolers-21EC52

Example 1: The following code unrolls our packet checksum loop by four times. We assume
that the number of words in the packet N is a multiple of four.

Here the loop overhead is reduced from 4N cycles to (4N)/4=N cycles. On the ARM7TDMI, this
accelerates the loop from 8 cycles per accumulate to 20/4 = 5 cycles per accumulate, nearly
doubling the speed! For the ARM9TDMI, which has a faster load instruction, the benefit is even
higher.

Dept. of ECE Page 67


Computer Organization & ARM Microcontrolers-21EC52

Example 2: This example handles the checksum of any size of data packet using a loop that has
been unrolled four times.

The second for loop handles the remaining cases when N is not a multiple of four. Note that both
N/4 and N&3 can be zero, so we can’t use do-while loops.

Writing Loops Efficiently

Use loops that count down to zero. Then the compiler does not need to allocate a register to hold
the termination value, and the comparison with zero is free.

Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This
will ensure that the loop overhead is only two instructions.

Use do-while loops rather than for loops when you know the loop will iterate at least once. This
saves the compiler checking to see if the loop count is zero.

Unroll important loops to reduce the loop overhead. Do not over unroll. If the loop overhead is
small as a proportion of the total, then unrolling will increase code size and hurt the performance
of the cache.

Dept. of ECE Page 68


Computer Organization & ARM Microcontrolers-21EC52

Try to arrange that the number of elements in arrays are multiples of four or eight. You can then
unroll loops easily by two, four, or eight times without worrying about the leftover array
elements.

Question bank for Module 5


1. Demonstrate how thumb instruction is decoded?
2. Illustrate the working of BL instruction.
3. What are data processing instructions? Give an example for each.
4. Illustrate the following instructions.
i) ASR ii) LSL iii) MVN iv) TST
5. Write a program to add two 32 bit data present in register R1 and R2 and place the result
in R0.
6. Illustrate the single register load store instruction with an example.
7. Explain the operation of following instructions.
i) LDR ii) STRH iii) LDRSB iv) STR
8. What is pre index addressing mode? Give an example.
9. Illustrate the multiple register load store instructions with an example.
10. Illustrate PUSH and POP instructions.
11. What are software interrupt instructions?
12. Illustrate hoe thw processor goes from thumb state to ARM state.
13. What are basic C data types.
14. Illustrate the various looping structures in C.
15. Explain the working of loops with a fixed number of iterations.
16. Discuss the operation of loops using a variable number of iterations.
17. What is loop unrolling? Give an example.

Dept. of ECE Page 69

You might also like