November 7th, 2014, 13:30
sebi707, nice work. I'm not sure whether you've already done these things to optimize your display, but I hope I can help a bit. I would consider removing any sprite sorting algorithm (if used). Since you use a pixel-per-pixel approach (yeah, dat Prehistorik Man...), you should look into getting the SCX and SCY values when the MMIO fires, and calculating the Y addresses in VRAM once after LY increases. Alternatively, you could render the whole line in one go and not give a **** about Prehistorik Man. After all, it's just the intro.
A friend of mine and I tried something to speed up the core. We are porting my gb emulator to ARM Android (since the Java version runs sluggishly, 20fps max), and I had the idea of placing the ARM instructions for each Game Boy opcode at a fixed address. So we've coded blocks of ARM instructions like a jump table, but without the table:
0x00 : NOP, not much to say here
0x01 : LD BC,## Shifting the opcode left by 8 (0x01 << 8) gives 0x100. So in our fixed-address memory, 0x100 holds the instructions for LD BC,## and the code runs up to 0x1A0, where it ends with a jump. We pad with ARM NOPs up to 0x1FF, and next comes gb instruction 0x02, LD (BC),A, at 0x200 of our memory.
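To make the layout concrete, here is a rough C sketch of just the address arithmetic, not our actual ASM core. The names `handler_offset` and `padding_bytes` are made up for illustration, and the 256-byte slot size is simply what the 0x100 spacing above implies.

```c
#include <stdint.h>

/* Each Game Boy opcode owns one fixed-size slot of ARM code.
 * 256 bytes per slot is an assumption taken from the 0x100 spacing above. */
#define SLOT_SHIFT 8u
#define SLOT_SIZE  (1u << SLOT_SHIFT)

/* Offset of an opcode's handler block from the start of the handler region.
 * No lookup table is needed: the opcode itself is the index. */
static inline uint32_t handler_offset(uint8_t opcode)
{
    return (uint32_t)opcode << SLOT_SHIFT;
}

/* Bytes of padding (ARM NOPs in the scheme above) that follow a handler
 * whose real code occupies code_len bytes of its slot. */
static inline uint32_t padding_bytes(uint32_t code_len)
{
    return SLOT_SIZE - code_len;
}
```

With the LD BC,## example, the handler starts at offset 0x100, its code takes 0xA0 bytes, and the remaining 0x60 bytes up to 0x1FF are padding.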
I cannot give you fps results yet since we have a lot of things to do in the ASM core, but I think it should give a considerable amount of speed.
Also, it's not cheating to frameskip a bit. If your core runs at a decent speed and emulation is stalled by the display, you can skip a frame or two. I assume the LCD "afterimage" fade will mostly smooth out a frameskip of 1.
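A minimal sketch of what I mean by frameskip, assuming you keep a running frame counter. `should_render` is a hypothetical helper, not something from either emulator: the core still emulates every frame, only the drawing is skipped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when this frame should actually be drawn to the LCD.
 * frameskip = 0 draws every frame, 1 draws every second frame, and so on. */
static inline bool should_render(uint32_t frame_number, uint32_t frameskip)
{
    return frame_number % (frameskip + 1u) == 0u;
}
```

You would check this once per emulated frame and skip the pixel rendering when it returns false, while still stepping LY and firing the interrupts as usual.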
edit: forgot to mention. You could precalculate a table of 256 entries (0-255) holding each value mod 8. That avoids storming the CPU with mods, as you need one for the BG and one for the Window to get the correct VRAM bit.
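Something like the following, as a sketch only. The names `mod8_table`, `init_mod8_table` and `pixel_bit` are invented for this example; the one hardware fact it relies on is that bit 7 of a tile byte is the leftmost pixel.

```c
#include <stdint.h>

/* mod8_table[x] == x % 8 for every possible 8-bit x coordinate. */
static uint8_t mod8_table[256];

/* Fill the table once at startup. */
static inline void init_mod8_table(void)
{
    for (int i = 0; i < 256; i++) {
        mod8_table[i] = (uint8_t)(i % 8);
    }
}

/* Bit index inside a tile data byte for pixel column x.
 * Bit 7 is the leftmost pixel of a tile row, so the index is mirrored. */
static inline uint8_t pixel_bit(uint8_t x)
{
    return (uint8_t)(7u - mod8_table[x]);
}
```

For unsigned values `x % 8` is the same as `x & 7`, and compilers generally emit the mask on their own, so it is worth measuring whether the table actually wins on your target.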
Last edited by venge; November 10th, 2014 at 14:46.
November 11th, 2014, 02:10
I'm using gcc with O2 or O3, which is a tad faster. I also tried lto, but it causes the emulator to crash somewhere inside the initialization routine from the 3rd party LCD library. To make things worse, gcc does not emit debug information when using lto...
Originally Posted by sebi707
I managed to compile the library without lto and everything else with, but there does not seem to be a measurable performance difference. At least the binary is about 30% smaller.
After some small optimizations, I'm at about 12fps, so no breakthrough there either.