or9 said:
I think the advantage of GCEmu is that it's using a lot of SSE optimizations in paired single operations, which beat the usual FPU implementation.
Actually, the funny thing is that the SSE2 optimizations didn't make any noticeable difference, because of the need to load/store the SSE2 registers around every basic block. It could make more of a difference if the GPU plugin were on its own thread, so it wouldn't be stepping over the SSE2 registers at random. By default (even if the CPU supports SSE2!) SSE2 is DISABLED.
As your statistics already show, a 3000+ CPU is more than fast enough to do those matrix calculations but it still hits only 20% of speed in-game.
GCEmu gets its speed from not even trying to work on 'low end' systems (meaning it requires a DX9 graphics card) and from a lot of optimizations in the recompiler to remove branch prediction problems with memory load/store access and with jumping from one basic block to another.
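To illustrate the "jumping from one basic block to another" part: one common recompiler technique is block linking, where a block remembers its successor so the dispatcher does not have to look it up again. The sketch below is only a toy model of that idea in Python. The names `Block`, `BlockCache` and `run` are made up for illustration, and nothing here is taken from the GCEmu source.

```python
class Block:
    """Hypothetical compiled basic block (not GCEmu's data structure)."""

    def __init__(self, pc, next_pc):
        self.pc = pc            # guest address where the block starts
        self.next_pc = next_pc  # guest address the block jumps to when done
        self.linked = None      # direct reference to the successor, once resolved


class BlockCache:
    """Maps guest addresses to blocks and counts the slow lookups."""

    def __init__(self):
        self.blocks = {}
        self.lookups = 0

    def add(self, block):
        self.blocks[block.pc] = block

    def lookup(self, pc):
        self.lookups += 1       # stands in for the expensive dispatcher path
        return self.blocks[pc]


def run(cache, start_pc, steps):
    """Execute `steps` blocks, linking each block to its successor on first use."""
    block = cache.lookup(start_pc)
    trace = []
    for _ in range(steps):
        trace.append(block.pc)
        if block.linked is None:
            # First time through: resolve the successor and remember it.
            block.linked = cache.lookup(block.next_pc)
        block = block.linked    # later passes skip the lookup entirely
    return trace
```

With two blocks that jump to each other, running six steps needs only three lookups (the start block plus one per link). In a real recompiler the link would be a patched jump in generated code rather than a Python reference.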
Oh, and of course the idle loop detection helps as well. I have no idea how many instructions per frame the CPU should really do 'in the real world', assuming cache. So right now it is doing 15000 * 50 * 256 = 192 MIPS and assumes the idle detection will remove the wasted time.
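As a rough sketch of what that means: the figure is plain arithmetic, and idle loop detection can be modelled as "if a block only polls memory and jumps back to itself, throw away the rest of the time slice". The heuristic below is an assumption made for illustration. `Instr`, `is_idle_loop` and `run_slice` are hypothetical names, and the real detection in GCEmu may work quite differently.

```python
from dataclasses import dataclass
from typing import Sequence

# The figure from the post, reproduced as plain arithmetic.
INSTRUCTIONS_PER_SECOND = 15000 * 50 * 256  # 192_000_000, i.e. 192 MIPS


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction (not GCEmu's representation)."""
    op: str          # e.g. "load", "cmp", "store", "add", "branch"
    target: int = 0  # branch target address, only meaningful for "branch"


def is_idle_loop(block_start: int, block: Sequence[Instr]) -> bool:
    """Assumed heuristic: only loads/compares, then a branch back to the block start."""
    if not block or block[-1].op != "branch":
        return False
    if block[-1].target != block_start:
        return False
    return all(instr.op in ("load", "cmp") for instr in block[:-1])


def run_slice(block_start: int, block: Sequence[Instr], budget: int) -> int:
    """Return the instruction budget left after executing one block."""
    if is_idle_loop(block_start, block):
        return 0  # the guest is just waiting: skip the rest of the slice
    return max(budget - len(block), 0)
```

A block such as `[Instr("load"), Instr("cmp"), Instr("branch", 0x100)]` starting at `0x100` counts as idle here, while the same block with a store in it does not, since a store could change what the loop is waiting on.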
I had plans to do an AMD64 version where you can statically assign registers and make total use of SSE2 (assuming the GPU is on a thread), but I simply don't like emulating things other than the CPU, and this is what is lacking in GCEmu. If the memory card and DSP emulation were improved, a lot more games would boot and probably even work in-game. All the green screens right now are probably waiting for either the DSP or the memory card (they are related anyway).
There are no 'HLE' hacks in GCEmu except for the idle loop detection. If compatibility improves, speed should not be affected too much. It might even improve, because when all the emulation falls into place it's much clearer how to do overall optimizations.
The source is released, so feel free to take a look yourself. If someone is interested in improving compatibility then there is still work that can be done on the CPU.