Working for fast code - C++/asm optimizations

cufunha

New member
I'm here for a very old issue: optimization. The idea is to put here every trick you have for optimizing for fast code (C level or assembler level; processor-blended code or not). I'm sure it'll be a long discussion... My best C-level tricks (I think everyone knows them) are:

1) Totally unroll little loops. This way you avoid jump instructions.
2) Never use the bt (bit test) opcode. It's slow on some processors. Use the and opcode instead.
3) Partially unroll large or indefinite-size loops, picking a step that divides the iteration count. For example:
//bad code
unsigned int acc = 0;
for(unsigned int i=0;i<998;i++) {
    acc += arrayofnumbers[i];
}
//faster code
unsigned int acc = 0;
for(unsigned int i=0;i<998;i+=2) {
    acc += arrayofnumbers[i];
    acc += arrayofnumbers[i+1];
}

This trick reduces the number of jumps.
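As a sketch of the same trick that also handles counts not divisible by the unroll factor (the function name and the factor of four are just for illustration):

```cpp
#include <cstddef>

// Sum an array with the loop body unrolled four times.
// The scalar tail loop picks up the 0-3 leftover elements,
// so any length works, not just multiples of four.
unsigned int sum_unrolled(const unsigned int* a, std::size_t n) {
    unsigned int acc = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += a[i];
        acc += a[i + 1];
        acc += a[i + 2];
        acc += a[i + 3];
    }
    for (; i < n; ++i)   // remainder elements
        acc += a[i];
    return acc;
}
```

The tail loop keeps the function correct for any length, so you don't have to pick a count that divides evenly.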
4) Never use signed integers unless you really need them. They are slower because the compiled code for them is bigger. Take a look at this:

int i;
i=i/4;

is compiled as:

MOV EAX, i
CDQ
AND EDX, 3
ADD EAX, EDX
SAR EAX, 2
MOV i, EAX

now look at this:

unsigned int i;
i=i/4;

is compiled as:

SHR i, 2

5) Use your processor cache to the maximum by prefetching memory blocks in load/store routines. This is specific to each processor, so it's better to get the processor documentation for it. For example, for the Athlon XP you can find it at: http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,00.html

See the code optimization guide.
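As a rough illustration of the idea, GCC and Clang expose prefetching through the __builtin_prefetch extension. The prefetch distance below (16 floats ahead) is only a guess; the right value has to come from the processor's optimization guide:

```cpp
#include <cstddef>

// Sketch of software prefetching on GCC/Clang: request the cache line
// a fixed distance ahead while summing. Semantically a no-op, so the
// result is identical with or without it.
float sum_with_prefetch(const float* a, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
#endif
        acc += a[i];
    }
    return acc;
}
```

On other compilers the guard simply compiles the prefetch away.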


That's it. Write down your optimizations here too!
 

cufunha

New member
I forgot one thing: use signed integers for int-to-float conversion. Look at this:

(avoid)
double x;
unsigned int i;
x=i;

compiled:


MOV [temp+4], 0
MOV EAX, i
MOV [temp], EAX
FILD QWORD PTR [temp]
FSTP QWORD PTR [x]

(preferred)

double x;
int i;
x=i;

compiled:

FILD DWORD PTR [i]
FSTP QWORD PTR [x]
 

BGNG

New member
In my experience, the number one overlooked and generally effective optimization is expression reduction; the stuff you do on college assignments:

X = 4 * Y * 3 + 8 * 7

is the same as

X = 12 * Y + 56... Obvious when you look at it, but be careful how you use stuff like that.
__________

Another thing is to simplify loops whenever possible... Not just unwind them a bit as you said, but see if you can rephrase them. For example:

X = 0
For I = 1 To Y
X = X + I
Next

is the same as

X = (1 + Y) * (Y / 2)
__________

One other thing is to use logical operations in the place of condition statements.

For example:

If X > 127 Then X = X - 128

is the same as

X = X - (X And 128)

(assuming X is in the range 0-255, so that "X > 127" and "bit 7 set" are the same condition)
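Both tricks can be sketched in C++ (function names are made up; note the closed-form sum multiplies before dividing, since y*(y+1) is always even, which avoids the truncation an integer Y / 2 would cause for odd Y):

```cpp
// Closed-form sum of 1..y, replacing the loop above.
// y*(y+1) is always even, so dividing after multiplying is exact.
unsigned int sum_1_to_y(unsigned int y) {
    return y * (y + 1) / 2;
}

// Branchless form of "If X > 127 Then X = X - 128".
// Assumes x is in the range 0-255, where "x > 127" and
// "bit 7 set" coincide.
unsigned int wrap128(unsigned int x) {
    return x - (x & 128);
}
```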
 

aprentice

Moderator
Pointers can be very powerful tools when it comes to emulation, and if you know what you're doing, you can do some amazing stuff with them.
 

aprentice

Moderator
before:
PHP:
unsigned char read_mem(unsigned short addr)
{
  if(addr<0x2000)
  {
     return rom[addr&0x1FFF];
  }
  else if(addr<0x3000)
  {
    return ram[addr&0x0FFF];
  }
  else if(addr<0x4000)
  {
     return video[addr&0x0FFF];
  }
  return 0xFF; /* unmapped: avoid falling off the end */
}

after:

PHP:
unsigned char *map[16];

void initialize_map()
{
  map[0]=&rom[0];
  map[1]=&rom[0];      /* rom covers 0x0000-0x1FFF, so addr indexes it directly */
  map[2]=&ram[0]-0x2000;
  map[3]=&video[0]-0x3000;
}

unsigned char read_mem(unsigned short addr)
{
  return map[addr>>12][addr];
}

There's a ton of things you can do with pointers; I'm just scratching the surface.
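A self-contained C++ sketch of the same page-table idea, with made-up region sizes and 4 KB pages. This version masks the address instead of pre-biasing the pointers, which avoids forming out-of-range pointers (technically undefined behaviour in C/C++):

```cpp
#include <cstdint>

// Hypothetical memory regions (sizes made up for illustration).
static std::uint8_t rom[0x2000];    // pages 0-1: 0x0000-0x1FFF
static std::uint8_t ram[0x1000];    // page  2:   0x2000-0x2FFF
static std::uint8_t video[0x1000];  // page  3:   0x3000-0x3FFF

// One entry per 4 KB page of the 16-bit address space.
static std::uint8_t* page_map[16];

void initialize_map() {
    page_map[0] = &rom[0x0000];
    page_map[1] = &rom[0x1000];
    page_map[2] = &ram[0];
    page_map[3] = &video[0];
    // pages 4-15 left unmapped (null) in this sketch
}

// One shift, one mask, two loads - no branch chain.
std::uint8_t read_mem(std::uint16_t addr) {
    return page_map[addr >> 12][addr & 0x0FFF];
}
```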
 

cufunha

New member
BGNG: these are good but obvious optimizations. I think every experienced programmer should do this naturally.
Reading some articles about optimization techniques and the MMX, 3DNow! and SSE technologies, I was amazed how little things can be heavily optimized without any change to the algorithm, just by changing the way you implement it. I'll give some examples in the next post.
 

cufunha

New member
I'm proposing here a simple piece of code (4x4 matrix multiplication), present in everything related to 3D. The matrices are two-dimensional arrays of floats.
Take a look:
-----------
float m1[4][4] = {1.0f, 2.26f, 4.01f, -2.1f, -1.0f, 8.2f, 55.478f, 10.0f, 1.2f, 5.0f, 9.6f, 5.3f, -99.31f, -8.2f, -1.7f, 0.0f};

float m2[4][4] = {1.7f, 27.26f, -14.8f, -2.1f, 1.0f, 8.2f, 5.47f, 1.0f, 1.2f, 15.0f, 9.6f, 5.3f, -9.31f, 8.2f, -1.7f, 10.0f};

void mmult(float *m1, float *m2, float *dest) { //dest = m1*m2

for(unsigned int i=0;i<3;i++) {
for(unsigned int j=0;j<3;j++) {
dest[j] = m1[0]*m2[0][j] + m1[1]*m2[1][j] + m1[2]*m2[2][j] + m1[3]*m2[3][j];
}
}

}

----

It's the normal code. Now give your optimizations for it, without changing the algorithm, just the way you implement it. You can use inline assembler if you want.
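As one possible starting point (a sketch, not a definitive answer to the challenge), hoisting the current m1 row into locals lets the compiler keep it in registers instead of reloading it for every column:

```cpp
// 4x4 matrix multiply, row-major, flat float[16] arrays.
// Each m1 row is loaded into locals once per row instead of
// once per output element; the loop bounds cover all 4x4 entries.
void mmult_unrolled(const float* m1, const float* m2, float* dest) {
    for (int i = 0; i < 4; ++i) {
        const float a0 = m1[i * 4 + 0], a1 = m1[i * 4 + 1];
        const float a2 = m1[i * 4 + 2], a3 = m1[i * 4 + 3];
        for (int j = 0; j < 4; ++j) {
            dest[i * 4 + j] = a0 * m2[0 * 4 + j] + a1 * m2[1 * 4 + j]
                            + a2 * m2[2 * 4 + j] + a3 * m2[3 * 4 + j];
        }
    }
}
```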
 

HyperHacker

Raving Lunatic
I do a lot of ASM programming on the old Game Boy, where it's necessary to not only optimize for speed but for size as well. A lot of this is kinda GB-specific, but can be applied to other systems (especially when you have such limited space) as well.

-"ld a,0" takes two bytes, 2 cycles. The same can be accomplished in one byte and one cycle by doing "xor a". (When the other register isn't specified, A is assumed, so it's really "xor a,a".) A xor A = 0 no matter what A is.

-Instead of multiple shifts/rotates, use the Swap instruction to swap the low and high half of a register.

-Bit-testing instructions are slow. Instead, use a logical AND. Ex: "bit 7,a" can be replaced with "and $80".

-When doing very small delays, you may not need a loop at all. For example, you can delay 4 cycles just by calling a "ret" instruction. Even longer delays can often be made smaller just by using slower code, such as "cp 0; jr z, whatever" instead of "jr nz, whatever".

-If you don't have the room or skills to write a value-to-decimal-string function, consider storing copies of the variables in decimal instead. For example, say someone starts with 99 health - instead of writing a function to convert 0x63 to "99", you can just store 0x99 somewhere and write a function to convert that (which is much easier), then simply modify both variables simultaneously.

-Always keep in mind that SRAM need not be used for save files exclusively. If you need more RAM, you can store things here as well. If you're willing to risk breaking compatibility with future systems (obviously no longer an issue with old GB games), you can use unused/undocumented I/O regs to hold small values as well.

-Relocateable interrupts are very useful in some situations, especially if implemented well. A good method is to simply have the interrupt jump to a jump instruction in RAM, which can be modified as needed.

-This:
call somecode
ret

can be replaced with this:
jp somecode
This way, somecode will return to wherever the previous subroutine was called from, instead of the one that actually called it.

-Use the high-RAM area ($FF80-$FFFE), and the optimized ldh instructions that go with it! This can greatly speed up access and save a few bytes. Since I/O is at $FF00-$FF7F, you should never need to use a standard ld for it.

-If you REALLY need optimization, you can abuse placement of functions in ROM banks. Suppose function A is near the start of bank 1, and function B is near the start of bank 2, and you need function A to jump to function B. Problem being, only one bank is accessible at a time! If absolutely necessary, you can pad bank 2, so that function B starts at the address function A ends at. This way, rather than require some function in the fixed-bank area (which, being only 16K, you don't want to waste), function A can simply terminate by switching to bank 2, putting execution right at the start of function B.
Kinda complicated, so here's a pseudo-code example:
;Function A starts here, bank 1, address $4000
dostuff ;16 bytes
changerombank 2 ;2 bytes
;Execution is now at address $4012, bank 2

;Function B starts here, bank 2, address $4000
nop ;Repeat 18 times (18 bytes)
domorestuff ;Now located at $4012

-Quick way to swap 16-bit registers (example, HL and DE):
push hl
push de
pop hl
pop de

You can also push af, which gives you free access to F. Since the low 4 bits of F are unused, you could store your own data in there, hidden away from any RAM-searching programs (simple anti-hack method); you might even be able to use this for debugging (especially if you could somehow display the contents of F externally), since I've never seen this method used in ANY app.
Similarly, you can copy HL to DE:
push hl
pop de

Although in this particular case, ld d,h and ld e,l is faster; and since those ld instructions are one byte each, just like push and pop, both sequences come to two bytes either way.

-Emulator bugs can be abused to see if your program is running on an emulator or not. For example, most emulators initialize memory to all zeroes, while the real system would have mostly random contents. Some quick memory tests to determine which is the case can often identify if the system is real or not. Actual emulation bugs, of course, can be used as well.

-Jump tables are your best friend!

-Optimize your graphics as much as possible. For example, I recently designed an app whose GUI consisted of several 16x16 buttons, with various numbers and letters on them. Normally, these would never fit into the space I had left, but simple optimization fixed that very well. For example, most of the '6' button can be made from parts of the '5' button, since the numbers are so similar. I've even been planning to do a simple 'blending' via xor (simply draw the numbers and one blank button, and merge them together), which could cut several more tiles out.

-Remove unneeded array elements. If you have a game where each level can have a time limit that's a multiple of 100, you needn't store the full time, only the 100s digit. So instead of 600, which is 2 bytes, you need only store 6, which is one.

Also, here's a less ASM-specific one, in VB-style for easy readability:
Code:
i = 1
BitCount = 0
Do
Bit(BitCount) = Number And i
BitCount = BitCount + 1
i = i * 2
Loop Until i = 256

A faster and safer version:
Code:
for i = 0 to 7
Bit(i) = Number And (2 ^ i) '2 to the power of i
next i
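For comparison, the same bit extraction sketched in C++ with a shift-and-mask instead of computing a power each pass (names are made up; this version also reduces each entry to 0 or 1 rather than the raw masked value):

```cpp
// Extract the 8 bits of an 8-bit value into an array,
// using a shift and mask instead of computing 2^i each pass.
void extract_bits(unsigned int number, int bits[8]) {
    for (int i = 0; i < 8; ++i)
        bits[i] = (number >> i) & 1;
}
```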

As you can see, a LOT of optimization can be made with simple math tricks.

Have fun with that. :p
 

Hacktarux

Emulator Developer
Moderator
cufunha said:
BGNG: these are good but obvious optimizations. I think every experienced programmer should do this naturally.
Reading some articles about optimization techniques and the MMX, 3DNow! and SSE technologies, I was amazed how little things can be heavily optimized without any change to the algorithm, just by changing the way you implement it. I'll give some examples in the next post.

While it's true that you can optimize an algorithm with various processor tricks, it's much more important to decrease the algorithm's complexity as much as possible. Nobody should even start to think about asm or C language optimizations before being absolutely sure that the complexity is as low as possible. BGNG's example is maybe obvious, but there are many cases where similar things can't be found that easily. Sometimes it takes hours of analysis to discover that something can be changed in an algorithm, but when you're able to rewrite an O(n^10) algorithm into an O(n^3) algorithm after hours of analysis on paper, your code will be infinitely more optimized than with any processor optimization.

A few months ago I heard about a project that was trying to compute something; the first algorithm took 5 days to give a result. After one year of analysis, they rewrote the program and it gave the result in 1 hour (on the same machine). They worked another year on the algorithm and the result was computed in 3 seconds.

Well... I only wanted to say that in real programs (not just a simple matrix multiplication), algorithm optimizations should have higher priority than processor optimizations ;)
 

cufunha

New member
sorry, wrong code!

void mmult(float m1[][4], float m2[][4], float dest[][4]) { //dest = m1*m2

    for(unsigned int i=0;i<4;i++) {
        for(unsigned int j=0;j<4;j++) {
            dest[i][j] = m1[i][0]*m2[0][j] + m1[i][1]*m2[1][j] +
                         m1[i][2]*m2[2][j] + m1[i][3]*m2[3][j];
        }
    }

}
 

cufunha

New member
Of course, but when you have the simplest algorithm possible you must think about processor optimizations. And when we talk about emulation, we do have the best algorithms. Now we have to look at processor optimizations.
 

Hacktarux

Emulator Developer
Moderator
Well... considering that every emulator that needs high optimization uses a dynarec, of course there are processor optimizations, as the code produced by the compiler has to be as efficient as possible. And no, we don't have the best algorithms... you have no idea how much a dynarec can be optimized... it seems like no dynarec will ever be fully optimized; there's always a possibility to improve them ;)
 

WildNinji

New member
BGNG said:
In my experience, the number one overlooked and generally effective optimization is expression reduction; the stuff you do on college assignments:

X = 4 * Y * 3 + 8 * 7

is the same as

X = 12 * Y + 56... Obvious when you look at it, but be careful how you use stuff like that.
I strongly suggest: DON'T DO IT. It makes code unreadable, and every compiler out there does this optimization for you.

My suggestions are:
  • Know your language(s). You must know when to use asm, when C and when C++. Knowledge of how they compile to machine code will guide you through the selection. I suggest you write most of your code in C++, some core parts in C (function pointers in C are four times faster than C++ on x86 CPUs and twice as fast on PPC CPUs), and direct hw manipulation (dynarec nowadays) in asm.
  • Avoid if, especially in loops. Branch misprediction is the doomed beast of many architectures. Rewrite if/else blocks as straight-line code, using function pointers and some more math.
  • Avoid global variables and pointers if you can. Local variables can be optimized by current compilers, can live in hw registers for longer, and are simpler to deal with. Global variables in C-like languages are subject to reloads under many circumstances = more memory reads = slower routine.
  • Use C99 compilers and the restrict keyword. The compiler will have more opportunities to optimize.
  • Pass parameters via registers, not via the stack. Passing params via the stack induces a memory write/read for each parameter. __fastcall, or regparm(3), is faster.
  • Use a profiler like valgrind and discover the most used functions. Then write an optimized version of each.
  • Lower the complexity of your algorithms. Hacktarux is correct; nothing can do more than an O(n^2) algo transformed into an O(n) one :p
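To illustrate the restrict point, here is a minimal sketch using the __restrict spelling that GCC, Clang and MSVC accept in C++ (plain restrict is the C99 keyword):

```cpp
#include <cstddef>

// The __restrict qualifiers promise dst and src don't overlap,
// so the compiler may keep values in registers instead of
// reloading after every store.
void scale(float* __restrict dst, const float* __restrict src,
           std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```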
 

ector

Emulator Developer
WildNinji said:
function pointers in C are four time faster than C++ on x86 cpus and twice faster than C++ in ppc cpus
Huh? No way. A virtual function call in C++ is exactly the same cost as using a function pointer in a table in C. And if you only need a plain C-style function pointer, just use that in C++, it's there. There's not really any reason to choose C anymore for applications, when you can simply choose C++ and use any mix of C and C++ constructs you want or need. Besides, if your performance is really limited by function call overhead, you're often better off restructuring your program a little.

WildNinji said:
avoid if, especially in loops. Branch misprediction is the doomed beast of many architectures. Rewrite if/else blocks as straight-line code, using function pointers and some more math.
This one is somewhat true: avoid conditionals such as if as much as possible when the distribution of taken/not-taken is around 50/50. If the distribution is more like 90/10, the branch prediction unit of the CPU will be happy and your conditional isn't so costly after all. Also, switches are much faster than some people think, since they are in most cases transformed into jump tables by the compiler.
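A small sketch of the branch-free style this is weighing against an if; whether it actually wins depends on how predictable the data is, and compilers often emit a conditional move for the plain ternary anyway:

```cpp
// Branch-free select: pick the smaller of a and b without an if.
// A mask of all-ones or all-zeros is built from the comparison,
// then used to blend the two values.
unsigned int select_min(unsigned int a, unsigned int b) {
    unsigned int mask = 0u - (unsigned int)(a < b); // all 1s if a < b
    return (a & mask) | (b & ~mask);
}
```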

WildNinji said:
use C99 compilers and the restrict keyword. The compiler will have more opportunities to optimize.
Or simply use the "Assume no aliasing" option of your compiler (MSVC supports this). Same thing.

WildNinji said:
pass parameters via registers, not via the stack. Passing params via the stack induces a memory write/read for each parameter.
Passing parameters on the stack isn't that costly really, since modern x86 CPUs contain one hell of a lot of special logic to make it fast. That said, passing as much as possible in registers doesn't hurt.

The rest of WildNinji's advice is spot on.
 

BGNG

New member
My purpose for the expression simplification, WildNinji, is that it can reduce the required number of calculations. Take this random, made-up problem I just invented:

(Sqrt(-63) / Sqrt(-2)) * (Pi / 4)

...could be a simple expression in a trig application. It makes use of two square roots and two imaginary numbers which are eliminated after simplification...

3 * Pi * Sqrt(14) / 8

... is the simplified equivalent of the first expression, and uses only one square root and no imaginary numbers.

I am not aware of any compilers which can do that kind of optimization for you.
 

WildNinji

New member
ector said:
Huh? No way. A virtual function call in C++ is exactly the same cost as using a function pointer in a table in C.
This is true. But I was talking about f-pointers in C and f-pointers in C++, not vfuncs.
Code:
/* C function pointer */
table[0] = one_func;
table[1] = one_func;
table[2] = another_func;
(*table[x])();

/* C++ member function pointer */
table[0] = &CPU::one_func;
table[1] = &CPU::one_func;
table[2] = &CPU::another_func;
(cpu.*table[x])();
The C++ (cpu.*table[x])(); will be at least 2x slower because of how &CPU::one_func is encoded (as an offset into a calling table; one load, one addition and one comparison are needed). The net should have some info about this. You can also try looking for the book Inside the C++ Object Model.
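For reference, a minimal compilable sketch of such a member-pointer table (class name and opcodes made up); note that the call has to go through an object with .* or ->*:

```cpp
// A tiny CPU with two "instructions", dispatched through a table
// of pointers to member functions.
class CPU {
public:
    int a = 0;
    void one_func()     { a += 1; }
    void another_func() { a += 10; }
};

typedef void (CPU::*OpFn)();

static const OpFn table[3] = {
    &CPU::one_func, &CPU::one_func, &CPU::another_func
};

void dispatch(CPU& cpu, int x) {
    (cpu.*table[x])();   // the object supplies 'this'
}
```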

ector said:
There's not really any reason to choose C anymore for applications, when you can simply choose C++ and use any mix of C and C++ constructs you want or need. Besides, if your performance is really limited by function call overhead, you're often better off restructuring your program a little.
You're absolutely correct. But there are still some little things (like functions that deal with a lot of pointers) that need C, because C++ compilers can't make as many assumptions about memory usage and will produce slower code.
 

Slougi

New member
While I've not emulated anything more complex than CHIP-8, I have worked on a few projects where performance was important. What I've noticed is that very often just a few functions take up most of the CPU time; of course, optimizing those will give the highest speed boost. So use a nice profiler.
Kinda obvious :p but I know too many people who never use a profiler and instead guess at the functions and algorithms that suck up most of the CPU time.

And just as hacktarux said, usually algorithm optimisation yields a much higher speed boost than something like unrolling loops.

A few generic things (not related to algorithms as such):

- You can optimize linked lists in many ways. Google is your friend.
- Tail recursion is bad. Use iteration instead.
- You can speed up for loops with a known number of iterations by counting down instead of up
- A switch is faster than a big if/else if block
- Often you can trade memory for processing speed by caching stuff. For example if you need 10000 random numbers, calculate them at program startup.
- On *nix systems (dunno about windows etc.) alloca() is very fast
- Order of variables in structs can make a difference to the struct size. It's better to write
Code:
struct foo {
        char one;
        char two;
        int foo;
};
than
Code:
struct foo {
        char one;
        int foo;
        char two;
};
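A quick way to check this on your own platform is to compare the two layouts with sizeof (exact sizes are implementation-defined; 8 vs 12 bytes is what you typically see with a 4-byte int):

```cpp
// With a 4-byte int and 4-byte alignment (typical, not guaranteed),
// grouping the chars needs one 2-byte pad before the int, while
// interleaving them pads after each char.
struct packed_foo {
    char one;
    char two;
    int  value;
};   // typically 8 bytes

struct padded_foo {
    char one;
    int  value;
    char two;
};   // typically 12 bytes
```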

BTW, tabs are 8 chars; NOT 4, 3, or 2! :p
 

BGNG

New member
I do not doubt what you have said, Slougi, but I am curious as to the reasoning behind that loop counting down thing. Why is counting down faster than up for loops? Is it because knowing the maximum value eliminates the need to check for data size, overflow, etc?

Also... Your mentioning of link lists reminded me of a very important thing:

To optimize for small code is not necessarily to optimize for fast code. In fact, the fastest code tends to be larger and uses more memory than the smallest code.

In the event of a linked list, it is ideally best to allocate all the memory in one block and reference it with an indexed pointer, in similar fashion to an array, rather than allocating the next available memory block and pointing to it at the end of the previous one. The only thing that gets in the way here is that you may not know how many elements the list will have; dynamic sizing is the purpose of the linked-list concept, after all.
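A minimal sketch of that idea: all nodes live in one preallocated array and link by index, at the cost of a fixed capacity (sizes and names here are made up):

```cpp
// Pool-backed singly linked list: one contiguous block, index links.
struct Node {
    int value;
    int next;            // index of next node, -1 = end of list
};

struct PoolList {
    Node pool[64];       // capacity chosen arbitrarily for the sketch
    int  used = 0;
    int  head = -1;

    // Push to front; returns false if the pool is full.
    bool push_front(int v) {
        if (used == 64) return false;
        pool[used].value = v;
        pool[used].next = head;
        head = used++;
        return true;
    }

    int sum() const {
        int s = 0;
        for (int i = head; i != -1; i = pool[i].next)
            s += pool[i].value;
        return s;
    }
};
```

Traversal walks the same contiguous block the whole time, which is far friendlier to the cache than chasing malloc'd nodes scattered around the heap.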
 

zenogais

New member
WildNinji: There are other ways to encode the instruction table in a C++ program. I don't know if this is necessarily faster (in fact, my guess is it's encoded the same), but hey, it's a different way of doing the same thing.

Code:
class Cpu {
  ...
  typedef void (Cpu::*functionPtr)();
  static const functionPtr cpuInstructions[...];
};

Also, if you know your C++ compiler well enough, you'll be able to produce C++ programs that execute at equal or greater speed compared to their C counterparts.
 
