I do a lot of ASM programming on the old Game Boy, where it's necessary to not only optimize for speed but for size as well. A lot of this is kinda GB-specific, but can be applied to other systems (especially when you have such limited space) as well.
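-"ld a,0" takes two bytes, 2 cycles. The same can be accomplished in one byte and one cycle by doing "xor a". (When the other register isn't specified, A is assumed, so it's really "xor a,a".) A xor A = 0 no matter what A is. Just remember that "xor a" also changes the flags (Z gets set), while "ld" leaves them alone.
-Instead of multiple shifts/rotates, use the swap instruction to swap the low and high halves of a register. For example, "swap a" then "and $0F" pulls out the high nibble in 4 bytes and 4 cycles, where four "srl a" would take 8 of each.
-Bit-testing instructions aren't always the best choice; a logical AND can often do the job. Ex: "bit 7,a" can be replaced with "and $80". On the GB both of those are 2 bytes and 2 cycles, so the real win comes when one mask tests several bits at once (e.g. "and $C0"). Keep in mind that AND overwrites A, while BIT leaves it alone.
-When doing very small delays, you may not need a loop at all. For example, you can burn 10 cycles (6 for the call, 4 for the ret) with a single 3-byte call to any "ret" instruction already in your code. Even longer delay loops can often be made smaller just by using slower code, such as "cp 0; jr z, whatever" instead of "jr nz, whatever".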
-If you don't have the room or skills to write a value-to-decimal-string function, consider storing copies of the variables in decimal instead. For example, say someone starts with 99 health - instead of writing a function to convert 0x63 to "99", you can just store 0x99 somewhere and write a function to convert that (which is much easier), then simply modify both variables simultaneously.
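To make the decimal-copy idea concrete, here's a rough sketch of what the "much easier" conversion could look like. It assumes RGBDS-style syntax, and the labels wHealthBCD, wTextBuf, BcdToText and DecHealthBcd are made up for the example. The decimal copy is kept valid with the daa instruction after a subtraction.

```asm
; Sketch only: RGBDS syntax assumed, all labels are hypothetical.
SECTION "BCD demo vars", WRAM0
wHealthBCD: ds 1          ; decimal copy, e.g. $99 means 99
wTextBuf:   ds 2          ; receives two ASCII digits

SECTION "BCD demo code", ROM0
; Turn the decimal copy into two ASCII digits in wTextBuf.
BcdToText:
    ld a, [wHealthBCD]
    ld b, a               ; keep a copy for the low digit
    swap a                ; high digit is now in the low nibble
    and $0F
    add a, $30            ; $30 is ASCII "0"
    ld hl, wTextBuf
    ld [hli], a
    ld a, b
    and $0F               ; low digit
    add a, $30
    ld [hl], a
    ret

; Take one point off the decimal copy (do the same to the binary one).
DecHealthBcd:
    ld a, [wHealthBCD]
    sub 1                 ; sub (not dec) so the carry flag is valid for daa
    daa                   ; fix up the result, e.g. $10 - 1 becomes $09
    ld [wHealthBCD], a
    ret
```

If your font doesn't put the digits at $30-$39, swap the $30 for whatever tile number your "0" lives at.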
-Always keep in mind that SRAM need not be used for save files exclusively. If you need more RAM, you can store things here as well. If you're willing to risk breaking compatibility with future systems (obviously no longer an issue with old GB games), you can use unused/undocumented I/O regs to hold small values as well.
-Relocatable interrupts are very useful in some situations, especially if implemented well. A good method is to simply have the interrupt jump to a jump instruction in RAM, which can be modified as needed.
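A minimal sketch of that RAM jump, using the VBlank interrupt as the example. RGBDS-style syntax is assumed, and wVBlankJump and SetVBlankHandler are names invented here, not anything standard.

```asm
; Sketch only: RGBDS syntax assumed, labels are hypothetical.
SECTION "VBlank vector", ROM0[$0040]
    jp wVBlankJump            ; the fixed vector just bounces into RAM

SECTION "VBlank trampoline", WRAM0
wVBlankJump: ds 3             ; will hold "jp nn": $C3, low byte, high byte

SECTION "Handler setup", ROM0
; In: HL = address of the handler to install.
; Assumes the caller is fine with interrupts being enabled on return.
SetVBlankHandler:
    di                        ; don't let the interrupt see a half-written jump
    ld a, $C3                 ; opcode for "jp nn"
    ld [wVBlankJump], a
    ld a, l
    ld [wVBlankJump + 1], a   ; low byte of target
    ld a, h
    ld [wVBlankJump + 2], a   ; high byte of target
    reti                      ; return and re-enable interrupts
```

Call SetVBlankHandler once before enabling the interrupt, so the trampoline never contains garbage when it fires.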
-This:
call somecode
ret
can be replaced with this:
jp somecode
This way, somecode returns straight to whoever called the current routine, instead of coming back here first. That saves a byte (3 instead of 4) and 6 cycles.
-Use the high-RAM area ($FF80-$FFFE), and the optimized ldh instructions that go with it! This can greatly speed up access and save a few bytes. Since I/O is at $FF00-$FF7F, you should never need to use a standard ld for it.
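For example, a counter kept in high RAM might be sketched like this (RGBDS-style syntax assumed; hFrameCount and BumpFrameCount are made-up names):

```asm
; Sketch only: RGBDS syntax assumed, labels are hypothetical.
SECTION "HRAM vars", HRAM
hFrameCount: ds 1

SECTION "ldh demo", ROM0
BumpFrameCount:
    ldh a, [hFrameCount]      ; 2 bytes, 3 cycles
    inc a
    ldh [hFrameCount], a      ; 2 bytes, 3 cycles
    ret
; The same variable in normal RAM would need "ld a, [nn]" and
; "ld [nn], a", which are 3 bytes and 4 cycles each.
```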
-If you REALLY need optimization, you can abuse placement of functions in ROM banks. Suppose function A is near the start of bank 1, and function B is near the start of bank 2, and you need function A to jump to function B. Problem being, only one bank is accessible at a time! If absolutely necessary, you can pad bank 2, so that function B starts at the address function A ends at. This way, rather than require some function in the fixed-bank area (which, being only 16K, you don't want to waste), function A can simply terminate by switching to bank 2, putting execution right at the start of function B.
Kinda complicated, so here's a pseudo-code example:
;Function A starts here, bank 1, address $4000
dostuff ;16 bytes
changerombank 2 ;2 bytes
;Execution is now at address $4012, bank 2
;Function B starts here, bank 2, address $4000
nop ;Repeat 18 times (18 bytes)
domorestuff ;Now located at $4012
-Quick way to swap 16-bit registers (example, HL and DE):
push hl
push de
pop hl
pop de
You can also push af and pop it into another register pair, which gives you free access to F. The low 4 bits of F are unused, but on real hardware they always read back as zero, so they can't actually hold your own data. What you do get is a cheap way to save or inspect the flags, which you might be able to use for debugging (especially if you could somehow display the contents of F externally), since I've never seen this method used in ANY app.
Similarly, you can copy HL to DE:
push hl
pop de
Although in this particular case, just ld d,h and ld e,l is the better choice: each is 1 byte and 1 cycle, so the pair is the same 2 bytes as push/pop but takes 2 cycles instead of 7.
-Emulator bugs can be abused to see if your program is running on an emulator or not. For example, most emulators initialize memory to all zeroes, while the real system would have mostly random contents. Some quick memory tests to determine which is the case can often identify if the system is real or not. Actual emulation bugs, of course, can be used as well.
-Jump tables are your best friend!
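In case it helps, here's a bare-bones sketch of one. RGBDS-style syntax is assumed, and Dispatch, JumpTable and the three handlers are placeholder names.

```asm
; Sketch only: RGBDS syntax assumed, labels are hypothetical.
SECTION "Jump table demo", ROM0
; In: A = entry number (0-2 here). No range check is done.
Dispatch:
    add a, a                  ; entries are 2 bytes each
    ld e, a
    ld d, 0
    ld hl, JumpTable
    add hl, de                ; HL points at the chosen entry
    ld a, [hli]               ; low byte of target
    ld h, [hl]                ; high byte of target
    ld l, a
    jp hl                     ; the handler's ret returns to Dispatch's caller

JumpTable:
    dw HandlerA
    dw HandlerB
    dw HandlerC

HandlerA:
    ret
HandlerB:
    ret
HandlerC:
    ret
```

Adding a new case is then just one more dw line, instead of another compare-and-branch.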
-Optimize your graphics as much as possible. For example, I recently designed an app whose GUI consisted of several 16x16 buttons, with various numbers and letters on them. Normally, these would never fit into the space I had left, but simple optimization fixed that very well. For example, most of the '6' button can be made from parts of the '5' button, since the numbers are so similar. I've even been planning to do a simple 'blending' via xor (simply draw the numbers and one blank button, and merge them together), which could cut several more tiles out.
-Remove unneeded array elements. If you have a game where each level can have a time limit that's a multiple of 100, you needn't store the full time, only the 100s digit. So instead of 600, which is 2 bytes, you need only store 6, which is one.
Also, here's a less ASM-specific one, in VB-style for easy readability:
Code:
i = 1
BitCount = 0
Do
Bit(BitCount) = Number And i
BitCount = BitCount + 1
i = i * 2
Loop Until i = 256
A shorter and safer version (the loop can't run past bit 7, though the 2 ^ i power may well be slower than doubling i):
Code:
for i = 0 to 7
Bit(i) = Number And (2 ^ i) '2 to the power of i
next i
As you can see, a LOT of optimization can be made with simple math tricks.
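For comparison, the same bit-splitting on the GB itself could be sketched with a simple shift loop. RGBDS-style syntax is assumed and wBits/SplitBits are made-up names. Unlike the VB versions, this one stores a plain 0 or 1 for each bit rather than the masked value.

```asm
; Sketch only: RGBDS syntax assumed, labels are hypothetical.
SECTION "Bit split vars", WRAM0
wBits: ds 8                   ; wBits[0] = bit 0 ... wBits[7] = bit 7

SECTION "Bit split code", ROM0
; In: A = number to split.
SplitBits:
    ld hl, wBits
    ld c, a                   ; working copy of the number
    ld b, 8                   ; 8 bits to go
.loop:
    ld a, c
    and 1                     ; isolate the current lowest bit
    ld [hli], a               ; store 0 or 1, move to the next slot
    srl c                     ; shift the next bit down into place
    dec b
    jr nz, .loop
    ret
```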
Have fun with that.
