Your explanation is absurd: you claim that the 1% of the time the condition isn't AL will slow things down tremendously, when the overhead in that case would be less than 2x. What the actual values are in ARM code is irrelevant, because YOUR example said 1% of the time (although claiming less than 30% AL is insane; it's probably closer to 80% or higher). Either way, the indirect branch in the switch is pretty much guaranteed to be slower (it's more of a pipeline hazard) than the conditional branch in a preemptive test, if one is warranted.
Anyway, you're assuming the compiler will make optimizations it won't. Look over the following carefully: for a value with an explicit & 0x0F in the switch expression itself, and with all 16 switch cases accounted for, GCC with -O3 will still emit the following:
Code:
andl $15, %edx
cmpl $15, %edx
ja L1
jmp *L8(,%edx,4)
That's the AND, a test to see if the value is in range, a jump if it isn't, an indirect jump, and a memory access.
Compare it to what it does for the following escape clause:
Code:
if((x & 0x0F) == 15)
{
return 10;
}
Code:
andl $15, %edx
cmpl $15, %edx
je L1
That's the same minus the indirect jump and the memory dereference. Per your example, let's say the second one occurs 99% of the time and the other 1% of the time both must occur. Call the time for the second S and the time for the first F, so the overall time is:
(0.01 * (S + F)) + (0.99 * S) = S + 0.01F
What you're saying is that the first alone, F, is smaller than (S + 0.01F). In other words, you think S is less than about 1% faster than F, when in reality it's probably a solid 2x as fast.
So now we are on about pipelines. That's nice. How does it work again? Let's take an average 1.0-2.0 GHz Intel processor. Its clock speed is irrelevant, but running that fast necessarily implies that the processor has a 10-20 stage pipeline (i.e. you might already see what I am trying to do). Let's just say it has 10, with the following stages:
[read instruction identifier][select appropriate path][operand load][operand decode][precalculation logic][memory fetch][calculation logic][state update][store result][memory write]
Of course, these are made up, but they make sense =/ earlier stages are about fetching the operation, the middle ones about executing it, and towards the end you store results if needed. For the sake of simplicity, let's say there won't be any cache misses, and that moving data from cache to the appropriate internal register takes less than 2 clock cycles.
Let's just compare the time it takes from the point when the first instruction is identified to the time when the appropriate target instruction is read into the pipeline.
1. ShizZy's jump statement, which I said his code translates to (and it does, with the compiler I compiled with):
- jmp [switch_condition___0_f+reg*4]
The branch is "unpredictable", so we will assume no branch prediction is done for it; i.e. the entire pipeline is discarded at the [state update] stage when it modifies PC, and on the next cycle the next instruction identifier is read as pointed to by PC. The time here is 8 clock cycles, plus the 7 cycles it took for the jmp instruction to reach the state update stage, plus 1 extra for the memory read: 16 cycles.
2. What you said his jmp statement was:
- cmp reg, 15
- ja NOTINRANGE
- jmp [switch_condition___0_f+reg*4]
As far as Intel's two-level shift-register branch predictor is concerned, with a 99-to-1 ratio the probability of it predicting the branch as not taken is not exactly 99/100 but a bit less. It won't be taken anyway, though, so we don't have to worry about it. So the time is just the previous figure plus 2 cycles, with no extra memory reads.
3. Now for your conditional AL-aided code. The setup will be something like:
Code:
...
cmp reg, 15
jne NOT_AL
;code for al
...
NOT_AL:
cmp reg, 14
ja NOTINRANGE
jmp [switch_condition___0_f+reg*4]
;;codes for different non-al switches
...
NOTINRANGE:
...
* If it's ARM_COND_AL:
- cmp reg, 15
- jne NOT_AL
Assuming the branch has gone untaken for a long stretch (which is highly unlikely for random code), that's just one extra cycle; more probably 2 or 3.
* If it's not ARM_COND_AL:
- cmp reg, 15
- jne NOT_AL
NOT_AL:
- cmp reg, 14
- jx NOTINRANGE
- jmp [switch_condition___0_f+reg*4]
For this, it's safe to assume the first branch is a prediction failure, because most instructions are AL. For obvious reasons the second one also fails, and the third is just the plain jump statement timed at 16 cycles. Each branch failure costs 8 cycles, plus the 2 extra cmps, giving a total of 16 + 16 + 2 = 34 clock cycles. Of course, this assumes everything is ideal, and the chance of things going wrong in the pipeline is higher for bigger code than for smaller code.
Well, let's find the break-even ratio:
16 vs. (ratio * 3) + ((1 - ratio) * 34)
Simply solve it and you get a ratio of AL > 0.5806, i.e. the AL fraction must exceed about 58% for the conditional test to be effective.
Wait...lemme double check.
Anyways, it seems that I am always right. Moving on from here: 20 * 36 + 80 * 2 = 880, which is not less than or equal to 1/2 * (16 * 100) = 800. So even with an 80/20 ratio it is not 2x or more faster than the version without the conditional, and I love exaggerating. There are other things to consider as well, and some things are slower than I assumed. Your AL shortcut does almost nothing compared to the memory accesses and extra comparisons on the other conditionals. Even then, I would not be surprised if the ratio of AL is more like 60% or less, rather than something like 80%.

With ARM, most people would use standard C/C++ compilers anyway, and most of them won't care that much about optimisation with their clock running at 300 MHz and plenty of memory. Especially game developers: they like wrapping everything in conditionals and breaking things up into small logical pieces, and compilers will most likely produce single conditionally-executed instructions for things like global values and parameters loaded into registers, because it makes little to no difference running it on the hardware. That's one of the other minor reasons people like ARM.
And the code the latest version of the free Visual C++ produces for that function is exactly what I put up there. It seems GCC does not have that optimisation yet. Well, even though it's widely used for most non-commercial purposes, there are still areas where it can improve.