
Branch Prediction — Everything you need to know.

Harshal Parekh · Published in The Startup
9 min read · Jun 23, 2019

In computer science, predication is an architectural feature that provides an alternative to conditional transfer of control, implemented by machine instructions such as conditional branch, conditional call, conditional return, and branch tables. Predication works by executing instructions from both paths of the branch and only permitting those instructions from the taken path to modify architectural state. [1]

To better understand the concept, let’s take an example:


“Processing a sorted array is faster than processing an unsorted array”
import java.util.Arrays;
import java.util.Random;

public class Main {
    public static void main(String[] args) {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            // Primary loop
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum: " + sum);
    }
}

The code above takes ~12 seconds to run. But if you comment out the Arrays.sort(data) call, leaving everything else untouched, the same code takes ~33 seconds.
(Running time may vary across machines, but the proportion will stay roughly the same.)

This is peculiar yet intriguing: why would processing an unsorted array take almost three times as long as processing a sorted array?

Let's travel back to the 1800s and consider a real-life scenario: you are the operator of a railroad junction and you hear a train coming. You don't know where the train is headed, so you stop the train, ask the driver, and switch the tracks accordingly. But stopping and restarting a train takes a lot of time.

What if you could predict where the train wants to go and switch the tracks beforehand? If you guessed right, the train continues; if not, the driver backs up and you switch the tracks.

If you guessed right all the time, the train would not have to stop.

This is what happens at the processor level when it encounters an "if" statement. Modern processors are complicated and have long pipelines, so they take forever to "warm up" and "slow down".

In essence, this is branch prediction: if the processor guessed right, it continues executing; otherwise it backs up and restarts. Now, looking back at the program above, the culprit is the "if" statement:

if (data[c] >= 128) {
    sum += data[c];
}

A quick visualization for the sorted array:

This is easy for the processor to predict, with just one switch.

And now, a visualization for the unsorted array:

This is difficult for the processor to predict, resulting in 3x more time to run.

Now that you have probably understood the problem, how would you strategically guess to minimize the number of times the processor must back up and go down the other path? You try to identify a pattern and follow it. This is more or less how branch predictors work. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless. [2]

Is "(a*b != 0)" faster than "(a != 0 && b != 0)" in Java?

long time;
final int len = 50000000;
int arbitrary = 0;
int[][] nums = new int[2][len];

for (double fraction = 0; fraction <= 0.9; fraction += 0.1) {
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < len; j++) {
            double random = Math.random();
            if (random < fraction) {
                nums[i][j] = 0;
            } else {
                nums[i][j] = (int) (random * 15 + 1);
            }
        }
    }

    time = System.currentTimeMillis();

    for (int i = 0; i < len; i++) {
        if (nums[0][i] * nums[1][i] != 0) {
            arbitrary++;
        }
        /*
        if (nums[0][i] != 0 && nums[1][i] != 0) {
            arbitrary++;
        }
        */
    }
    System.out.println(System.currentTimeMillis() - time);
}
The code above is another example of branch prediction. Running the commented-out version takes longer than the current version, for the simple reason that the commented code compiles to two memory loads and two conditional branches, compared to a multiply and one conditional branch.

The multiply is likely to be faster than the second conditional branch if hardware-level branch prediction is ineffective. This raises the question: does this happen at the compiler level or at the hardware level? The latter, I believe.
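To make the "identify a pattern and follow it" idea concrete, here is a minimal sketch of a 2-bit saturating-counter predictor, a classic textbook scheme (an illustrative model of mine, not the design of any particular CPU):

```java
// Toy 2-bit saturating-counter branch predictor (illustrative only).
// States 0-1 predict "not taken", states 2-3 predict "taken"; each
// observed outcome nudges the counter one step toward that outcome.
public class TwoBitPredictor {
    private int state = 1; // start in "weakly not taken"

    public boolean predict() {
        return state >= 2;
    }

    public void update(boolean taken) {
        state = taken ? Math.min(3, state + 1) : Math.max(0, state - 1);
    }

    public static int countCorrect(TwoBitPredictor p, boolean[] outcomes) {
        int correct = 0;
        for (boolean taken : outcomes) {
            if (p.predict() == taken) correct++;
            p.update(taken);
        }
        return correct;
    }

    public static void main(String[] args) {
        // A perfectly regular stream, like the body of a sorted-array loop:
        boolean[] regular = new boolean[100];
        java.util.Arrays.fill(regular, true);
        System.out.println(countCorrect(new TwoBitPredictor(), regular) + "/100");
    }
}
```

On the regular all-taken stream this model mispredicts only once, at the start; feed it an unpredictable stream and it is right only about half the time, which is exactly the situation the unsorted array creates.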

Does ordering of if…else if… else affect probability?

if (highly_likely)
    // do something
else if (somewhat_likely)
    // do something
else if (unlikely)
    // do something

Should the conditions be ordered this way, or the other way round?

It seems obvious that the version sorted by likelihood would be faster; however, for readability, or because of side effects, we might want to order them non-optimally. It's also hard to tell how well the CPU will do with branch prediction until you actually run the code. So, how should you order the if… else if statements?

As a general rule, most if not all Intel CPUs assume forward branches are not
taken the first time they see them. See Godbolt’s work.

After that, the branch goes into a branch prediction cache, and past behavior is
used to inform future branch prediction.

So in a tight loop, the effect of misordering is going to be relatively small. The branch predictor is going to learn which set of branches is most likely, and if you have a non-trivial amount of work in the loop, the small differences won't add up to much.

In general code, most compilers by default (lacking another reason) will order the produced machine code roughly the way you ordered it in your source. Thus, if statements compile to forward branches that are taken when the condition fails.

So you should order your branches in order of decreasing likelihood to get the best branch prediction on a "first encounter".
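As a hedged illustration of ordering by likelihood (the status codes and their assumed frequencies here are invented for the example):

```java
public class BranchOrdering {
    // Hypothetical hot-path classifier: the common case is tested first,
    // so on a "first encounter" the taken forward branches are the rare ones.
    public static String classify(int statusCode) {
        if (statusCode == 200) {          // assumed: vast majority of responses
            return "ok";
        } else if (statusCode == 404) {   // assumed: occasional
            return "missing";
        } else {                          // assumed: rare
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(200) + " " + classify(404) + " " + classify(503));
    }
}
```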

This raises a question: is "if" expensive?


At the hardware (lowest) level, yes, “if” statements are expensive. In order to
understand why, you have to understand how pipelines work.

For most RISC architectures, instructions are all a constant length, so the
Program Counter (PC) can be incremented by a constant amount. For most
instructions, the PC of the next instruction is just the current PC plus the length
of the current instruction.

For branch instructions, however, the next instruction to be executed is not the
next location after the current instruction. Branches are gotos — they tell the
processor where the next instruction is. Branches can either be conditional or
unconditional, and the target location can be either fixed or computed.

Conditional vs. unconditional is easy to understand — a conditional branch is


only taken if a certain condition holds (such as whether one number equals
another); if the branch is not taken, control proceeds to the next instruction after
the branch like normal.

The branch target is another important issue. Most branches have a fixed branch
target — they go to a specific location in code that is fixed at compile time. This
includes if statements, loops of all sorts, regular function calls, and many more.
Computed branches compute the target of the branch at runtime. This includes
switch statements (sometimes), returning from a function, virtual function calls,
and function pointer calls.
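The next-instruction logic described above can be sketched as a toy model for a fixed-length (4-byte) RISC encoding (the method and parameter names are mine, not any real ISA's):

```java
public class NextPc {
    // For a non-branch, the next PC is just the current PC plus the
    // instruction length. For a taken branch, fetch is redirected to
    // the branch target instead.
    public static long nextPc(long pc, boolean isBranch, boolean taken, long target) {
        if (isBranch && taken) {
            return target;     // branch redirects instruction fetch
        }
        return pc + 4;         // sequential fetch, fixed 4-byte instructions
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(nextPc(0x1000, false, false, 0)));
        System.out.println(Long.toHexString(nextPc(0x1000, true, true, 0x2000)));
    }
}
```

The processor's problem is that for a conditional branch it cannot know `taken` (and for a computed branch, `target`) until the branch executes, yet it must keep fetching something in the meantime.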

When the processor sees a branch instruction appear in its pipeline, it needs to figure out how to continue filling the pipeline, that is, which instructions come after the branch in the program stream. When it guesses wrong, the work fetched from the wrong path must be thrown away and the pipeline refilled.
Thus, the reason why if statements are expensive is branch mispredictions.

Is & faster than &&?


To put this more generally: are bitwise operators faster than logical operators?

In Java, && lazily evaluates the right-hand-side expression (i.e. only if the LHS is true), which implies a conditional, whereas & in this context guarantees strict evaluation of both (boolean) sub-expressions. The resulting value is the same either way.
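The evaluation difference is easy to observe with a side-effecting helper (a contrived example of mine, purely to show the semantics):

```java
public class EagerVsLazy {
    static int calls = 0;

    static boolean touched() {
        calls++;
        return true;
    }

    public static void main(String[] args) {
        calls = 0;
        boolean lazy = false && touched();  // RHS skipped: the LHS already decides
        int afterLazy = calls;
        boolean eager = false & touched();  // RHS evaluated unconditionally
        System.out.println(afterLazy + " " + calls + " " + lazy + " " + eager);
        // prints "0 1 false false"
    }
}
```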

public boolean withAndAnd(int x, int value, int y) {
    return value >= x && value <= y;
}

public boolean withAnd(int x, int value, int y) {
    return value >= x & value <= y;
}

When value >= x and value <= y are as likely to be true as false, with no particular pattern, would using the & operator be faster than using &&?

Let’s look at the bytecode:

public withANDAND(III)Z
L0
LINENUMBER 8 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ILOAD 2
ILOAD 3
IF_ICMPGT L1
L2
LINENUMBER 9 L2
ICONST_1
IRETURN
L1
LINENUMBER 11 L1
FRAME SAME
ICONST_0
IRETURN
L3
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
LOCALVARIABLE x I L0 L3 1
LOCALVARIABLE value I L0 L3 2
LOCALVARIABLE y I L0 L3 3
MAXSTACK = 2
MAXLOCALS = 4

// access flags 0x1


public withAND(III)Z
L0
LINENUMBER 15 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ICONST_1
GOTO L2
L1
FRAME SAME
ICONST_0
L2
FRAME SAME1 I
ILOAD 2
ILOAD 3
IF_ICMPGT L3
ICONST_1
GOTO L4
L3
FRAME SAME1 I
ICONST_0
L4
FRAME FULL [test/lsoto/AndTest I I I] [I I]
IAND
IFEQ L5
L6
LINENUMBER 16 L6
ICONST_1
IRETURN
L5
LINENUMBER 18 L5
FRAME SAME
ICONST_0
IRETURN
L7
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
LOCALVARIABLE x I L0 L7 1
LOCALVARIABLE value I L0 L7 2
LOCALVARIABLE y I L0 L7 3
MAXSTACK = 3
MAXLOCALS = 4
As seen above, withANDAND() generates two conditional jumps, whereas withAND() generates three.

Though I'm not very experienced with Java bytecode and may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly mispredict. [3]

What got me thinking is Google Guava's power-of-two method:

// original method
public static boolean isPowerOfTwoAND(long x) {
    return x > 0 & (x & (x - 1)) == 0;
}

// modified method to compare performance
public static boolean isPowerOfTwoANDAND(long x) {
    return x > 0 && (x & (x - 1)) == 0;
}
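Both variants compute the same answer for every input; here is a quick sanity check (my own harness, not Guava's test suite):

```java
public class PowerOfTwoCheck {
    static boolean withAnd(long x) {
        return x > 0 & (x & (x - 1)) == 0;
    }

    static boolean withAndAnd(long x) {
        return x > 0 && (x & (x - 1)) == 0;
    }

    public static void main(String[] args) {
        // Brute-force the two variants against each other over a range.
        for (long x = -8; x <= (1 << 20); x++) {
            if (withAnd(x) != withAndAnd(x)) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println(withAnd(1024) + " " + withAnd(1000));
        // prints "true false"
    }
}
```

So any difference between the two really is about branches, not results.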

Is this use of & (where && would be more normal) a real optimization?

Intel ASM code for the original method:

# {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in


'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103bbe: movabs rax,0x0
0x0000000003103bc8: cmp rax,r8
0x0000000003103bcb: movabs rax,0x175811f0 ;
{metadata(method data for {method} {0x0000000017580af0}
'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103bd5: movabs rsi,0x108
0x0000000003103bdf: jge 0x0000000003103bef
0x0000000003103be5: movabs rsi,0x118
0x0000000003103bef: mov rdi,QWORD PTR [rax+rsi*1]
0x0000000003103bf3: lea rdi,[rdi+0x1]
0x0000000003103bf7: mov QWORD PTR [rax+rsi*1],rdi
0x0000000003103bfb: jge 0x0000000003103c1b ;*lcmp
0x0000000003103c01: movabs rax,0x175811f0 ;
{metadata(method data for {method} {0x0000000017580af0}
'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c0b: inc DWORD PTR [rax+0x128]
0x0000000003103c11: mov eax,0x1
0x0000000003103c16: jmp 0x0000000003103c20 ;*goto
0x0000000003103c1b: mov eax,0x0 ;*lload_1
0x0000000003103c20: mov rsi,r8
0x0000000003103c23: movabs r10,0x1
0x0000000003103c2d: sub rsi,r10
0x0000000003103c30: and rsi,r8
0x0000000003103c33: movabs rdi,0x0
0x0000000003103c3d: cmp rsi,rdi
0x0000000003103c40: movabs rsi,0x175811f0 ;
{metadata(method data for {method} {0x0000000017580af0}
'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c4a: movabs rdi,0x140
0x0000000003103c54: jne 0x0000000003103c64
0x0000000003103c5a: movabs rdi,0x150
0x0000000003103c64: mov rbx,QWORD PTR [rsi+rdi*1]
0x0000000003103c68: lea rbx,[rbx+0x1]
0x0000000003103c6c: mov QWORD PTR [rsi+rdi*1],rbx
0x0000000003103c70: jne 0x0000000003103c90 ;*lcmp
0x0000000003103c76: movabs rsi,0x175811f0 ;
{metadata(method data for {method} {0x0000000017580af0}
'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c80: inc DWORD PTR [rsi+0x160]
0x0000000003103c86: mov esi,0x1
0x0000000003103c8b: jmp 0x0000000003103c95 ;*goto
0x0000000003103c90: mov esi,0x0 ;*iand
0x0000000003103c95: and rsi,rax
0x0000000003103c98: and esi,0x1
0x0000000003103c9b: mov rax,rsi
0x0000000003103c9e: add rsp,0x50
0x0000000003103ca2: pop rbp
0x0000000003103ca3: test DWORD PTR
[rip+0xfffffffffe44c457],eax # 0x0000000001550100
0x0000000003103ca9: ret

Intel ASM code for the modified method:

# {method} {0x0000000017580bd0} 'isPowerOfTwoANDAND' '(J)Z' in


'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103438: movabs rax,0x0
0x0000000003103442: cmp rax,r8
0x0000000003103445: jge 0x0000000003103471 ;*lcmp
0x000000000310344b: mov rax,r8
0x000000000310344e: movabs r10,0x1
0x0000000003103458: sub rax,r10
0x000000000310345b: and rax,r8
0x000000000310345e: movabs rsi,0x0
0x0000000003103468: cmp rax,rsi
0x000000000310346b: je 0x000000000310347b ;*lcmp
0x0000000003103471: mov eax,0x0
0x0000000003103476: jmp 0x0000000003103480 ;*ireturn
0x000000000310347b: mov eax,0x1 ;*goto
0x0000000003103480: and eax,0x1
0x0000000003103483: add rsp,0x40
0x0000000003103487: pop rbp
0x0000000003103488: test DWORD PTR
[rip+0xfffffffffe44cc72],eax # 0x0000000001550100
0x000000000310348e: ret

The JIT compiler generates far less assembly code for the && version than for
Guava's & version. So everything points to Guava’s & method being less efficient
than the more “natural” && version.

Or is it?

On compiling the same methods with Java 7, the assembly code the JIT compiler generates for the & method has only one conditional jump, and is way shorter!

So, is this use of & (where && would be more normal) a real optimization?

The answer is the same, even for this (very!) specific scenario: it depends on
your JVM implementation, your compiler, your CPU and your input data.

Now, let's go back to the first piece of code, where all this started, and look at how to fix it, i.e. make it "branchless".

The "if" statement can be replaced with some bitwise operators:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
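The trick works because (data[c] - 128) >> 31 is all ones exactly when data[c] < 128 (for values that fit comfortably in an int), so ~t & data[c] yields either data[c] or 0 with no branch at all. A self-contained check that the branchless form matches the branchy one:

```java
public class BranchlessSum {
    static long branchySum(int[] data) {
        long sum = 0;
        for (int v : data) {
            if (v >= 128) sum += v;   // the branch the predictor struggles with
        }
        return sum;
    }

    static long branchlessSum(int[] data) {
        long sum = 0;
        for (int v : data) {
            int t = (v - 128) >> 31;  // -1 if v < 128, else 0
            sum += ~t & v;            // adds v only when v >= 128
        }
        return sum;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(0);
        int[] data = new int[32768];
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(256);
        }
        System.out.println(branchySum(data) == branchlessSum(data));
        // prints "true"
    }
}
```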

Comparing the results (AWS t2.micro, Java 8):


// Branch - Random
seconds = 10.93293813

// Branch - Sorted
seconds = 5.643797077

// Branchless - Random
seconds = 3.113581453

// Branchless - Sorted
seconds = 3.186068823

If you give the Intel compiler the branchless code, it outright vectorizes it, and it is just as fast as with the branch (with the loop interchange).

To conclude, what is branch-aware programming?

Is it possible to avoid branch mispredictions using some high-level programming technique (i.e. no assembly)?

Highly unlikely, at least not for a single branch. What you can do is minimize the depth of your dependency chains so that branch misprediction would not have much effect.

What should I keep in mind to produce branch-friendly code in a high-level programming language?

Primarily, keep in mind that branch misprediction is only going to affect you in the most performance-critical parts of your program, so don't worry about it until you've measured and found a problem.

I’m not a micro-optimization wizard. I don’t know exactly how the hardware
branch predictor works.

1. Profiling:
That's the very first thing you should look into if you don't know how to find these problems. Most of these micro-level hotspots are best discovered in hindsight, with a profiler in hand.

2. Branch Elimination:
There is an easier, high-level way to mitigate branch mis-prediction, and
that’s to avoid branching completely. Read more.
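In the same spirit as the sign-bit trick earlier, common branchy primitives like min can be rewritten without a branch. A sketch (valid only when a - b cannot overflow, and not a drop-in replacement for Math.min):

```java
public class BranchlessMin {
    // If a < b, diff is negative, so diff >> 31 is all ones and the mask
    // keeps diff: b + (a - b) == a. Otherwise the mask is zero and we keep b.
    static int min(int a, int b) {
        int diff = a - b;
        return b + (diff & (diff >> 31));
    }

    public static void main(String[] args) {
        System.out.println(min(3, 7) + " " + min(7, 3) + " " + min(-2, 5));
        // prints "3 3 -2"
    }
}
```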

3. Zero-Cost Exception Handling:
In the case that an exception is thrown, exceptions are now slower than ever before. Yet in the case where an exception isn't thrown, they're now faster than ever before, and often faster in successful scenarios than code like:

if (!try_something())
    return error;
if (!try_something_else())
    return error;
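For contrast, a hedged sketch of the exception-based shape of the same control flow (tryThing and tryOtherThing are hypothetical helpers invented for this example):

```java
public class ZeroCostStyle {
    // Hypothetical operations that throw on failure; here they trivially succeed.
    static void tryThing() { }
    static void tryOtherThing() { }

    // The happy path runs straight through with no per-call success check;
    // the cost is paid only if something actually throws.
    static boolean run() {
        try {
            tryThing();
            tryOtherThing();
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
        // prints "true"
    }
}
```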

There are many more ways to avoid failed branch predictions; those are for you to explore!

I hope you enjoyed the article! Find more in my profile. Visit my portfolio here:
https://harshal.one.
