Branch Prediction - Everything You Need To Know - The Startup Medium
Harshal Parekh
Published in The Startup
9 min read · Jun 23, 2019
The code above takes ~12 seconds to run. But with line 15 commented out and the rest untouched, the same code takes ~33 seconds to run.
(The running time may vary on different machines, but the proportion should stay roughly the same.)
This is peculiar yet intriguing: why would processing an unsorted array take almost 3 times as long as processing a sorted array?
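A minimal sketch of this kind of experiment, for readers who want to try it. The class name `SortedVsUnsorted`, the array size, the seed, the repetition count and the threshold of 128 are all assumptions chosen for illustration, not values taken from the article's listing.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch: every constant below is an assumption.
class SortedVsUnsorted {

    // Sums every element that is >= threshold; the 'if' is the branch being studied.
    static long sumAtLeast(int[] data, int threshold) {
        long sum = 0;
        for (int value : data) {
            if (value >= threshold) {
                sum += value;
            }
        }
        return sum;
    }

    // Repeats the summation and returns the elapsed time in milliseconds.
    static long timeMillis(int[] data, int repetitions) {
        long start = System.nanoTime();
        long sink = 0;
        for (int r = 0; r < repetitions; r++) {
            sink += sumAtLeast(data, 128);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (sink == 42) System.out.println(); // uses the result so the work is not discarded
        return elapsed;
    }

    public static void main(String[] args) {
        int[] unsorted = new Random(0).ints(32_768, 0, 256).toArray();
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted); // skipping this sort is what makes the branch unpredictable

        System.out.println("unsorted: " + timeMillis(unsorted, 2_000) + " ms");
        System.out.println("sorted:   " + timeMillis(sorted, 2_000) + " ms");
    }
}
```

Both arrays hold the same values, so the sums are identical and only the order differs. How large the gap is, or whether it shows up at all, depends on the JVM and CPU: some JIT compilers turn this particular branch into a conditional move, which hides the effect.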
Let’s travel back to the 1800s and consider a real-life scenario: you are the operator of a railroad junction and you hear a train coming. You don’t know where the train is headed, so you stop the train, ask the driver and switch tracks.
But stopping and restarting a train takes a lot of time.
What if you were able to predict where the train wants to go and switch beforehand? If you’re right, the train continues; if not, the driver backs up and you switch.
If you guessed right all the time, the train would never have to stop.
This is what happens at the processor level when it encounters an “if” statement. Modern processors are complicated and have long pipelines, so they take a long time to “warm up” and “slow down”.
For the sorted array, the branch is easy for the processor to predict, with just one switch.
For the unsorted array, the branch is difficult for the processor to predict, resulting in about 3x more time to run.
Now that you have probably understood the problem, how would you strategically guess to minimize the number of times that the processor must back up and go down the other path? You try to identify a pattern and follow it. This is more or less how branch predictors work. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless. [2]

    public class MultiplyVsAnd {
        public static void main(String[] args) {
            long time;
            final int len = 50000000;
            int arbitrary = 0;
            int[][] nums = new int[2][len];

            // Each pass fills the arrays with a larger share of zeros.
            for (double fraction = 0 ; fraction <= 0.9 ; fraction += 0.1) {
                for (int i = 0 ; i < 2 ; i++) {
                    for (int j = 0 ; j < len ; j++) {
                        double random = Math.random();

                        if (random < fraction) {
                            nums[i][j] = 0;
                        } else {
                            nums[i][j] = (int) (random * 15 + 1);
                        }
                    }
                }

                time = System.currentTimeMillis();

                for (int i = 0 ; i < len ; i++) {
                    // Current code: one multiply, one conditional branch.
                    if (nums[0][i] * nums[1][i] != 0) {
                        arbitrary++;
                    }

                    /* Commented code: two conditional branches.
                    if (nums[0][i] != 0 && nums[1][i] != 0) {
                        arbitrary++;
                    }
                    */
                }
                System.out.println(System.currentTimeMillis() - time);
            }
        }
    }
The above code is another example of branch prediction. Running the commented-out code takes longer than running the current code. The simple reason is that the commented-out code compiles to two memory loads and two conditional branches, as compared to two memory loads, a multiply and one conditional branch.
The multiply is likely to be faster than the second conditional branch if the hardware-level branch prediction is ineffective. This raises the question of whether this happens at the compiler level or at the hardware level. The latter, I believe.
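One caveat worth checking before swapping `&&` for a multiply: the two conditions agree only when the product cannot overflow. The sketch below uses a hypothetical helper class `ProductVsAnd` (not from the article) to show that they match for small values like the 0 to 15 range used in the listing, and differ once `int` overflow wraps the product to zero.

```java
// Illustrative sketch: the class and method names are hypothetical.
class ProductVsAnd {

    // One multiply and one comparison.
    static boolean viaProduct(int a, int b) {
        return a * b != 0;
    }

    // Two comparisons joined by a short-circuit AND.
    static boolean viaAndAnd(int a, int b) {
        return a != 0 && b != 0;
    }

    public static void main(String[] args) {
        // For small values the two forms always agree.
        boolean allAgree = true;
        for (int a = 0; a <= 15; a++) {
            for (int b = 0; b <= 15; b++) {
                allAgree &= viaProduct(a, b) == viaAndAnd(a, b);
            }
        }
        System.out.println(allAgree);                   // prints true

        // 65536 * 65536 is 2^32, which wraps to 0 in 32-bit int arithmetic.
        System.out.println(viaProduct(65_536, 65_536)); // prints false
        System.out.println(viaAndAnd(65_536, 65_536));  // prints true
    }
}
```

So the multiply form is a safe substitute only when the value range is known to be small, as it is in the listing.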
    if (highly_likely) {
        // do something
    } else if (somewhat_likely) {
        // do something
    } else if (unlikely) {
        // do something
    }
It seems obvious that the sorted version would be faster; however, for readability or because of side effects, we might want to order them non-optimally. It’s also hard to tell how well the CPU will do with branch prediction until you actually run the code. So, how should you order the if… else if statements?
As a general rule, most if not all Intel CPUs assume forward branches are not
taken the first time they see them. See Godbolt’s work.
After that, the branch goes into a branch prediction cache, and past behavior is
used to inform future branch prediction.
In general code, most compilers by default (lacking another reason) will order
the produced machine code roughly the way you ordered it in your code. Thus if
statements are forward branches when they fail.
So you should order your branches in the order of decreasing likelihood to get the
best branch prediction from a “first encounter”.
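To make "past behavior informs future prediction" concrete, here is a minimal sketch of a two-bit saturating counter, a textbook prediction scheme. It is a teaching model, not a description of what any particular Intel CPU does; the class name `TwoBitPredictor` and the starting state are assumptions.

```java
// Illustrative sketch of a textbook two-bit saturating counter.
class TwoBitPredictor {

    // States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    // Starting at 1 (weakly not taken) is an arbitrary choice for this sketch.
    private int counter = 1;

    boolean predict() {
        return counter >= 2;
    }

    // Moves one step toward the observed outcome, saturating at 0 and 3.
    void update(boolean taken) {
        counter = taken ? Math.min(3, counter + 1) : Math.max(0, counter - 1);
    }

    // Counts how often the prediction disagrees with the actual outcome.
    static int mispredictions(boolean[] outcomes) {
        TwoBitPredictor predictor = new TwoBitPredictor();
        int misses = 0;
        for (boolean taken : outcomes) {
            if (predictor.predict() != taken) {
                misses++;
            }
            predictor.update(taken);
        }
        return misses;
    }

    public static void main(String[] args) {
        // Like a sorted array: a run of "not taken" followed by a run of "taken".
        boolean[] sortedLike = new boolean[16];
        for (int i = 8; i < 16; i++) sortedLike[i] = true;

        // Outcome flips on every iteration.
        boolean[] alternating = new boolean[16];
        for (int i = 0; i < 16; i += 2) alternating[i] = true;

        System.out.println(mispredictions(sortedLike));  // prints 2
        System.out.println(mispredictions(alternating)); // prints 16
    }
}
```

The run-based sequence costs two mispredictions at the single switch point, while the flipping sequence defeats this simple counter every time. Real predictors also track branch history, so they handle strict alternation well; it is outcomes with no pattern at all that they cannot learn.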
For most RISC architectures, instructions are all a constant length, so the
Program Counter (PC) can be incremented by a constant amount. For most
instructions, the PC of the next instruction is just the current PC plus the length
of the current instruction.
For branch instructions, however, the next instruction to be executed is not the
next location after the current instruction. Branches are gotos — they tell the
processor where the next instruction is. Branches can either be conditional or
unconditional, and the target location can be either fixed or computed.
The branch target is another important issue. Most branches have a fixed branch
target — they go to a specific location in code that is fixed at compile time. This
includes if statements, loops of all sorts, regular function calls, and many more.
Computed branches compute the target of the branch at runtime. This includes
switch statements (sometimes), returning from a function, virtual function calls,
and function pointer calls.
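A rough Java illustration of the two kinds of branch target. The class and method names are hypothetical, and the comments describe the source-level shape only: a JIT compiler may inline or devirtualize any of these calls, so the machine code can differ.

```java
import java.util.function.IntUnaryOperator;

// Illustrative sketch: names are hypothetical.
class BranchTargets {

    static int twice(int x) {
        return 2 * x;
    }

    // Every branch here has a target that is known before the program runs.
    static int fixedTargets(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {   // loop: conditional branch, fixed target
            if (i % 2 == 0) {           // if: conditional branch, fixed target
                total += twice(i);      // static call: unconditional, fixed target
            }
        }
        return total;
    }

    // The code that runs for each call depends on the object found at runtime.
    static int computedTargets(IntUnaryOperator[] ops, int x) {
        int total = 0;
        for (IntUnaryOperator op : ops) {
            total += op.applyAsInt(x);  // interface call: target computed at runtime
        }
        return total;
    }

    public static void main(String[] args) {
        IntUnaryOperator[] ops = { v -> v + 1, v -> v * v };
        System.out.println(fixedTargets(5));         // prints 12
        System.out.println(computedTargets(ops, 3)); // prints 13
    }
}
```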
When the processor sees a branch instruction appear in its pipeline, it needs to figure out how to continue to fill up its pipeline. In order to figure out what instructions come after the branch in the program stream, it needs to know whether the branch will be taken and what its target is.
Thus, the reason why if statements are expensive is due to branch
mispredictions.
When value >= x and value <= y are each as likely true as false, with no particular pattern, would using the & operator be faster than using &&?
public withANDAND(III)Z
L0
LINENUMBER 8 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ILOAD 2
ILOAD 3
IF_ICMPGT L1
L2
LINENUMBER 9 L2
ICONST_1
IRETURN
L1
LINENUMBER 11 L1
FRAME SAME
ICONST_0
IRETURN
L3
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
LOCALVARIABLE x I L0 L3 1
LOCALVARIABLE value I L0 L3 2
LOCALVARIABLE y I L0 L3 3
MAXSTACK = 2
MAXLOCALS = 4
Though I’m not very experienced with Java bytecode and I may have overlooked something, it seems to me that & will actually perform worse than &&.
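Read back into Java, the bytecode above corresponds roughly to the first method below. This is a reconstruction from the descriptor `(III)Z` and the local variable table, so treat it as a sketch; the `withAND` method is a hypothetical counterpart added for comparison, and the package declaration is omitted.

```java
// Sketch reconstructed from the bytecode listing; withAND is hypothetical.
class AndTest {

    // Short-circuit AND: matches the two compare-and-branch instructions
    // (IF_ICMPLT, IF_ICMPGT) in the listing.
    public boolean withANDAND(int x, int value, int y) {
        if (value >= x && value <= y) {
            return true;
        } else {
            return false;
        }
    }

    // Non-short-circuit AND: both comparisons are always evaluated.
    public boolean withAND(int x, int value, int y) {
        if (value >= x & value <= y) {
            return true;
        } else {
            return false;
        }
    }

    public static void main(String[] args) {
        AndTest test = new AndTest();
        System.out.println(test.withANDAND(1, 5, 10)); // prints true
        System.out.println(test.withAND(1, 11, 10));   // prints false
    }
}
```

Both methods return the same result for every input. Which one runs faster is decided by the machine code the JIT eventually produces, not by the bytecode alone.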
    // original method
    public static boolean isPowerOfTwoAND(long x) {
        return x > 0 & (x & (x - 1)) == 0;
    }

    // modified method to compare performance
    public boolean isPowerOfTwoANDAND(long x) {
        return x > 0 && (x & (x - 1)) == 0;
    }
Is this use of & (where && would be more normal) a real optimization?
The JIT compiler generates far less assembly code for the && version than for
Guava's & version. So everything points to Guava’s & method being less efficient
than the more “natural” && version.
Or is it?
On compiling the same methods with Java 7, the assembly code that the JIT compiler generates for the & method now has only one conditional jump and is far shorter!
So, is this use of & (where && would be more normal) a real optimization?
The answer is the same, even for this (very!) specific scenario: it depends on
your JVM implementation, your compiler, your CPU and your input data.
Now, let’s go back to the first piece of code where this all started and look at how to fix it, i.e. how to make it “branchless”.
    // Branch - Sorted
    seconds = 5.643797077

    // Branchless - Random
    seconds = 3.113581453

    // Branchless - Sorted
    seconds = 3.186068823
If you give the Intel compiler the branchless code, it just out-right vectorizes and
is just as fast as with the branch (with the loop interchange).
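A minimal sketch of what "branchless" can look like in Java. It assumes the loop sums the elements that are at least 128 and that every value lies in the range 0 to 255; both are assumptions for illustration, as is the class name `BranchlessSum`.

```java
// Illustrative sketch: assumes all values are in [0, 255].
class BranchlessSum {

    // Straightforward version with a data-dependent branch.
    static long sumBranchy(int[] data) {
        long sum = 0;
        for (int value : data) {
            if (value >= 128) {
                sum += value;
            }
        }
        return sum;
    }

    // Same result, with the 'if' replaced by bit operations.
    static long sumBranchless(int[] data) {
        long sum = 0;
        for (int value : data) {
            // Arithmetic shift copies the sign bit: -1 if value < 128, else 0.
            int mask = (value - 128) >> 31;
            // ~mask is 0 for small values and all ones for large ones.
            sum += ~mask & value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {0, 127, 128, 255, 64, 200};
        System.out.println(sumBranchy(data));    // prints 583
        System.out.println(sumBranchless(data)); // prints 583
    }
}
```

The branchless version does the same amount of work for every element, which is why its running time barely changes between sorted and random input. Whether it beats the branchy version on a given JVM is something to measure rather than assume.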
Highly unlikely, at least not for a single branch. What you can do is minimize the
depth of your dependency chains so that branch mis-prediction would not have
any effect.
Primarily, you should keep in mind that branch mis-prediction is only going to affect you in the most performance-critical part of your program, and that you should not worry about it until you’ve measured and found a problem.
I’m not a micro-optimization wizard. I don’t know exactly how the hardware
branch predictor works.
1. Profiling:
That’s the very first thing you should be looking into if you don’t know how to
do this (profiling). Most of these micro-level hotspots are best discovered in
hindsight with a profiler in hand.
2. Branch Elimination:
There is an easier, high-level way to mitigate branch mis-prediction, and that’s to avoid branching completely.
    if (!try_something())
        return error;
    if (!try_something_else())
        return error;
There are many more ways to avoid failed branch predictions; those are for you to explore!
I hope you enjoyed the article! Find more in my profile. Visit my portfolio here:
https://harshal.one.