Space Invaders Recompiler

MelvoPR

New member
Hey all, I've made an x86-based dynamic recompiler for Space Invaders (the 8080 CPU one).

This was made to give a basic example of how dynarecs work, for those who want to take the next step in emu programming. (Yes, a dynarec is overkill for a simple 8080 CPU.)

Anyway, for the experts in the field, I'd appreciate your input on this small project (generated asm code, optimizations, etc.).

I would've made a GB emulator, but since the GB involves a fair amount of code beyond the CPU, I kept it simple so you can focus on the dynarec, which is the whole point; Space Invaders is pretty simple.
 

Exophase

Emulator Developer
Hey MelvoPR, it's always nice to see another recompiler.

I had a quick glance over the code, and these strike me as the two areas where you could most easily apply performance improvements:

1) Flags
You update the flags in 8080 format every time a flag-setting instruction is executed. This is much more expensive than the ALU operations themselves. To mitigate this you can employ three techniques (I'm using all of them in an x86-targeting ARM recompiler I have):

a) Dead flag elimination: This is a subset of the more general "dead assignment elimination" that falls out of liveness analysis. It means you don't update the flags if a later instruction in the block is going to overwrite them anyway. It does require scanning forward and then backwards over the block before emitting code, to work out the liveness status of the flags as you go.

b) Host flags caching: This is where you use the x86 flags to directly hold the state of the 8080 flags for as long as possible. When the x86 flags have to be used for other things, you write back the ones that are still needed. (A rough sketch of (a) and (b) together follows after item (c).)

c) Changing representation: The actual flags state doesn't have to look like the 8080 state until the 8080 code loads the flags into a register or memory (which is relatively uncommon). Until then you can store flags in a format that's quicker to write to.
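
To make (a) and (b) a bit more concrete, here's a rough sketch, not taken from the project, of what two back-to-back 8080 adds could compile to once dead flags are skipped and the x86 flags are left cached in EFLAGS. The register mapping is purely illustrative (A in al, the two source registers in cl and dl):

Code:
; 8080: ADD B / ADD C - the second add overwrites every flag,
; so the flags produced by the first one are dead
add al, cl         ; ADD B: no lahf, no merging - its flags are dead
add al, dl         ; ADD C: EFLAGS now mirrors the final 8080 flags
lahf               ; materialize into ah only if something later in the
                   ; block (or a block exit) actually reads the flag byte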

2) Branches
a) You end branches by exiting the block and indirectly loading the address of the next block. If a block is known to be compiled you can branch directly to it. Branch targets going to already-compiled blocks can be linked directly. For targets going to not-yet-compiled blocks, you can either compile the block right then and link it, or you can link to a stub that will compile the block and patch the link to go to it. The latter has the advantage of not compiling blocks that don't yet have valid data. This does mean you have to put explicit cycle tests in your code, but you might find it's worthwhile to position them so that they only occur around certain types of branches, like backward ones. I personally put cycle tests at the beginning of blocks instead of at the end, and put cycle decrements within blocks. You'd probably be better off having this counter operated on inline instead of storing a cycle total per block in an array. (A sketch of direct linking follows after (b).)

b) You always exit blocks when a branch is hit. A conditional branch has the option of conditionally exiting instead. This does have the downside of creating more recompiled code, but it can be mitigated by allowing blocks to branch within themselves.
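
As a sketch of (a) - the stub, the label names, and the target address are all made up for illustration - the end of a block could go from an indirect dispatch to a patchable direct jump:

Code:
; first time through: a near jmp to a stub that compiles the 8080
; target (0x1234 here) and then patches this very jump in place
jmp compile_stub_1234

; once the stub has run, the same 5-byte near jmp (E9 + rel32) is rewritten to
jmp block_1234            ; direct branch, no lookup, no indirection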

Beyond this I'm sure there are some complex optimizations you can make with enough analysis, like propagation to fold operations into the more advanced 386+ addressing or higher bit-width operations. But this would start getting pretty difficult. You have a really huge advantage in being able to address memory as a simple array like this.
 

MelvoPR

New member

About the Flags:
Right now, it always uses the flags generated by the host processor, due to the similarities between the x86 and the 8080.

I can get away with storing them exactly like the 8080 does, so I can just "LAHF" and get the flags calculated with no additional computation.

As a result, the flags are calculated pretty quickly, and I don't think they're a problem here, since the host computes them implicitly.

Dead Flag Elimination would still generate faster code, but, in this case, is it really worthwhile?

Not to mention that dead flag elimination requires an extra pass too, slowing down recompilation.


About Branches:
Those are nice suggestions; I'll take a look and improve it!


Thanks Exophase, your input is much appreciated; as always, very helpful!
 

Exophase

Emulator Developer
I've looked a little closer at your code, and admittedly a lot of what looks like weakness in design is more just a matter of your implementation. For instance, this is what'll be emitted for an "inc" instruction:

Code:
push bx            ; save bx, used as scratch below
mov bl, ah         ; bl = previous 8080 flag byte (kept in ah)
lahf               ; ah = fresh x86 flags from the inc
and ah, 0xD4       ; keep the bits this instruction updates
and bl, 0x2B       ; keep the preserved bits (carry, etc.) from the old byte
or bl, ah          ; merge old and new
mov ah, bl         ; ah = merged 8080 flag byte
pop bx             ; restore bx

The ands wouldn't happen if it were, say, an add, but everything else would. This sequence is relatively expensive (btw, there's no reason to do push bx instead of push ebx - you're better off avoiding 16-bit instructions where possible, and you don't really want to misalign the stack). When you're updating all of the flags, an lahf by itself should actually suffice.
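
For instance (just a sketch, with an illustrative register mapping), an 8080 ADD, which updates every flag, could come out as nothing more than:

Code:
add al, cl    ; the ADD itself (A in al, the source register in cl)
lahf          ; S/Z/AC/P/CY all come from this add, and the LAHF byte
              ; layout already matches the 8080 flag byte, so no masking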

Nonetheless, the lahf isn't free (for instance, 2 cycles on Atom), and there's another advantage to keeping things in the x86 flags where possible: you can do branches without any code to extract the flags. Here's how you're doing branches:

Code:
push ax                   ; save ax (the emulated flag byte lives in ah)
and ah, (condition)       ; mask the condition bit(s); sets ZF, clobbers ah
pop ax                    ; restore ax (the flags from the and survive)

j(n)z skip                ; skip the target load if the branch isn't taken
mov di, address           ; di = 8080 branch target (the next PC)

skip:

Now, you really don't need the push/pop, since you can just do a test ah, (condition) instead. But if the flags were already set you could simply do the branch. And if it's a direct target you can link straight to the recompiled block. You can actually compile cycle checks at the beginning of blocks so you don't have to worry about a potential exit here.
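
A sketch of both versions (the target address, 0x2A10, and the block label are invented for illustration; 0x40 is the Z bit in the 8080 flag layout):

Code:
; with the flag byte in ah, as now - no push/pop needed:
test ah, 0x40             ; e.g. JZ: probe the Z bit of the emulated flags
jnz  block_2A10           ; taken: linked straight to the compiled target

; with the flags still cached in EFLAGS, even the test disappears:
jz   block_2A10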

But the problem is getting x86 to not clobber flags when you want to retain them. You can use lea to perform constant additions and subtracts without setting flags, like when subtracting cycles. A more subtle benefit of performing liveness analysis (not just on flags, but on registers) is that by knowing when the registers aren't needed you know when you're free to use them for something else without worrying about saving (and possibly restoring) them. So if you know the flags are dead, you can freely overwrite them without committing them elsewhere.

Fortunately, you don't have to do conditional branches for memory accesses in your emulator, so you really only need them for emulating 8080 conditional branches and to check cycles (if you inline this). Putting a cycle check at the beginning of a block with flags liveness analysis performed is advantageous because at the beginning of the block flags are often dead, especially on this architecture.
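
For the lea trick specifically, a sketch (the register choice, the cycle count, and the epilogue label are all made up; the point is only that lea leaves EFLAGS alone):

Code:
; cycle counter hypothetically kept in esi; this block costs 27 cycles
lea esi, [esi - 27]   ; subtract without touching EFLAGS
                      ; (sub esi, 27 would clobber the cached 8080 flags)

; at the start of the next block, where the 8080 flags are usually dead:
cmp esi, 0
jle epilogue          ; out of cycles - hand control back to the dispatcher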

Looking further at your output stream revealed a major weakness I didn't catch the first time - you're updating the PC after every instruction.

The PC can be considered a constant that is known at compile time, so there's no need to maintain it at runtime. The only time you need to output it is when a return address needs to be stored (either for a call or an interrupt), or when a branch target needs to be made available. In your case, you'd probably want to update the PC before an instruction that will end the block. But with a redesign this wouldn't be necessary.
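
In other words (a sketch; the 8080 address is invented), nothing is emitted for the PC between instructions, and a concrete value only appears where it's actually needed:

Code:
; ...no PC bookkeeping inside the block...
mov di, 0x1A42       ; only at the exit: next 8080 PC, a compile-time constant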

It's true all of this extra analysis increases recompilation time. But unless it's a recurring expense, for instance due to self-modifying code, you should never put it on equal footing with run time. Compilation happens only once, and a recompiler of this nature is only performing O(n) work. It takes a lot of code for it to even become noticeable, and the impact is asymptotically zero, since it's a one-time cost versus ongoing execution. Basically, I would only go back and optimize/simplify the recompilation if you can demonstrate it has become a problem.

Finally, this isn't a suggestion per se, but one thing I like with recompilers is the ability to look at before and after streams. Maybe if you can produce some examples, some other ideas for improvement will become clearer.
 

MelvoPR

New member
You're right about and-ing the flags when it's not necessary, that is, when all of them are going to get updated; it was correct with the previous flag map, but I changed the flag map and missed it. Thanks!

Using a test instead of the and in branches can't be done, since it sets all the flags, and I don't want that.
However, you're right that the PC update after every instruction isn't necessary; silly of me...

BTW, I already knew about jumping to recompiled blocks directly, but admittedly I got bored and decided not to keep optimizing it, mainly because the effort would be wasted: there isn't really any practical use for an 8080 dynarec in the first place, and this was primarily meant as a basic example of one.

I even planned to handle self-modifying code at first, but dropped that idea too, for the same reasons.

Anyway, there are a lot of optimizations it could have apart from the ones already mentioned, like closely analyzing the Space Invaders asm to selectively check for cycles, further reducing code.
There are some small optimizations too, like checking if an immediate is 1 so that instead of an ADD we could emit an INC, and the same for DEC.

Const Prop would be nice.

I appreciate you taking the time to help, and I promise I'll do those optimizations for a real emu. Maybe NES, or something harder? What do you think?

Again, thanks for the info!
 

Exophase

Emulator Developer
MelvoPR said:
Using a test instead of the and in branches can't be done, since it sets all the flags, and I don't want that.

I don't understand the problem... your method - push/and/pop - is functionally the same as test. The impact on flags is exactly the same for both.

MelvoPR said:
BTW, I already knew about jumping to recompiled blocks directly, but admittedly I got bored and decided not to keep optimizing it, mainly because the effort would be wasted: there isn't really any practical use for an 8080 dynarec in the first place, and this was primarily meant as a basic example of one.

I even planned to handle self-modifying code at first, but dropped that idea too, for the same reasons.

Anyway, there are a lot of optimizations it could have apart from the ones already mentioned, like closely analyzing the Space Invaders asm to selectively check for cycles, further reducing code.

Yeah, you're naturally only going to want to go so far with this.

MelvoPR said:
There are some small optimizations too, like checking if an immediate is 1 so that instead of an ADD we could emit an INC, and the same for DEC.

Nitpicking, but this would actually be a really minor optimization, especially given how infrequently the original code would present the opportunity (and on most x86 CPUs this wouldn't end up giving you enough fetch benefit to matter).

MelvoPR said:
Const Prop would be nice.

Const prop is usually found to be useful when a machine has smaller immediates than registers (like most RISCs)... in this case it could be useful for instructions that don't have immediate forms at all on the 8080 but do on x86, but I don't see much of that either. Maybe good for loading register pairs.

The thing with constant propagation (like any propagation) is you need some form of liveness analysis for it to really be useful, so you can eliminate the instructions where intermediate values are loaded but then overwritten.

For the 8080, move propagation could be more useful. Since you can only perform ALU operations with register A as a source and destination, there are probably a lot of instructions that move to/from A and the other registers. Some of these moves could possibly be folded into the two-operand-form instructions, although the fact that they're destructive is limiting (you'd probably have a lot more luck with three-address instructions).
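
A sketch of the idea (register mapping purely illustrative: A in al, B in ch, C in cl), for the 8080 sequence MOV A,B / ADD C / MOV B,A:

Code:
; naive, one 8080 instruction at a time (flag handling omitted):
mov al, ch      ; MOV A,B
add al, cl      ; ADD C
mov ch, al      ; MOV B,A

; with move propagation - and only if liveness shows A's value is dead
; afterwards - the whole thing folds into one destructive x86 add:
add ch, cl      ; B += C; the flags produced are the same add's flags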

MelvoPR said:
I appreciate you taking the time to help, and I promise I'll do those optimizations for a real emu. Maybe NES, or something harder? What do you think?

I do know of one NES emulator with a recompiler. It's tricky because so many NES games require very precise timing, although I think it largely manages pretty decently while still having atomic blocks.

The NES can be a lot of work (mapper hell), and so little of that is related to the CPU emulation... you might get more out of doing something with a fast CPU and not a lot else. And it'd be great if it were something that hasn't had a lot done for it already. Nothing immediately comes to mind, but maybe there are some more recent arcade boards?

MelvoPR said:
Again, thanks for the info!

No problem.
 

MelvoPR

New member
Yes, they're functionally equivalent in this case; I didn't notice that, since I wasn't concentrating on the small things. There are probably other minor things I forgot to optimize...


It's true that there are a lot of NES emulators out there, as with most systems that are doable within a reasonable time span, at least x86-based ones. I was thinking of another platform, mostly portable ones, or maybe a video game console...

Anyway, time will tell for sure!
 

Exophase

Emulator Developer
How about the Atari Jaguar? It has a couple of RISC processors that would probably lend themselves well to recompilation, and it could still make the difference between being usable and not usable on lower-end mobile hardware.

Another good choice is the GP32; it's really dominated by a relatively fast (for its time) ARM9 and a total lack of other computationally intensive peripherals. It could benefit from recompilation without requiring a lot of other optimization work. Even lower-end PCs (particularly Atoms) would have a difficult time emulating it at full speed with an interpreter, at least when it's clocked at 133MHz.

I'd actually like to do a GP32 emulator myself.
 

MelvoPR

New member
The GP32 sounds nice to me; it's possible I'll give it a try in the near future. However, since that would be a serious dynarec, I'll at least optimize it to a moderate extent, for the sake of it.

However, if you decide to do it, no doubt yours would generate much better code than mine. :)

Then again, it's all about the effort you put in...
 

Exophase

Emulator Developer
It's my prerogative to have a good ARM recompiler, although not necessarily as good when targeting x86. Could end up as an apples to apples comparison. Would be fun to compare code output, though ;D
 
