For a long time I've thought that recompilers for CISC platforms wouldn't be as big of a win as they are for RISC ones, especially for old 8bit platforms. After writing some HuC6280 interpretation I have a somewhat different perspective.
My main argument before was that CISC is a lot simpler to decode than RISC, and can be more directly mapped to real registers in an interpreter (for most platforms). For 6502 CPUs the opcode is a single byte, so it's realistic to have a single 256 entry switch that dispatches instructions. Furthermore, there are few CPU registers and each instruction is tied to the registers it uses, so you can map the registers to variables. Hopefully these variables will be turned into real registers by the compiler (definitely if you're doing the interpreter in assembly language). Compare this to a RISC CPU: you usually have 32bit instructions (or at least 16bit instructions), so you can't switch over all of them. Operands are packed into arbitrarily long bit-fields instead of being nice bytes, and they can be complex to decode with lots of options (especially on ARM). What's more, since there are usually 16+ registers you have to go to an array to access them. For ARM there are lots of other gotchas, like being able to access the PC. The absence of these problems made me think that CISC wasn't really worth recompiling.
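As a concrete illustration, here's a minimal sketch (all names hypothetical, not from any real emulator) of what such a single-switch interpreter core looks like in C: one opcode byte indexes the switch directly, and the handful of 6502 registers live in locals that the compiler can hopefully keep in host registers.

```c
#include <stdint.h>

typedef struct {
    uint8_t a, x, y;     /* accumulator and index registers */
    uint16_t pc;         /* program counter */
    const uint8_t *mem;  /* flat memory, for this sketch only */
} Cpu6502;

/* Execute one instruction; returns 0 on success, -1 if unimplemented. */
int step(Cpu6502 *cpu) {
    uint8_t a = cpu->a, x = cpu->x;      /* locals -> host registers */
    uint8_t opcode = cpu->mem[cpu->pc++];
    switch (opcode) {                    /* one entry per opcode byte */
    case 0xA9:                           /* LDA #imm */
        a = cpu->mem[cpu->pc++];
        break;
    case 0xE8:                           /* INX */
        x++;
        break;
    default:                             /* rest omitted in this sketch */
        return -1;
    }
    cpu->a = a; cpu->x = x;
    return 0;
}
```

A full interpreter fills in all 256 cases (and flag updates), but the shape stays this simple because the opcode is always exactly one byte.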
The reality is that CISC has its own weaknesses when it comes to interpretation, and there's a lot of room for extra performance when doing a recompiler. While RISC has complex decode, CISC has complex fetch. You have to fetch a CISC instruction one byte at a time because you don't know its length until you fetch the first byte, and beyond that a lot of host CPUs won't let you do proper unaligned loads, so you'd have to go a byte at a time anyway. This fetch can be expensive, depending on the platform. On the HuC6280 an MMU is in place, so every time you cross an 8KB boundary you could potentially be skipping into a non-contiguous memory region. Because of this you either need to check for this at every fetch, or at least at every instruction (doing it this way means you need to "slow down" and check at every fetch when you get near a boundary, which is complex to code). You also have to check for this when doing branches. If you have a recompiler then all of this fetching is done at compile time.
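To show why this fetch is costly, here's a sketch (hypothetical structure, loosely modeled on the HuC6280's MPR-style 8KB banking) of what an interpreter pays per fetched byte - every byte goes through the bank table because the next byte may live in a non-contiguous physical region:

```c
#include <stdint.h>

enum { BANK_SHIFT = 13, BANK_MASK = 0x1FFF };  /* 8KB banks */

typedef struct {
    const uint8_t *bank[8];  /* logical 8KB bank -> host pointer */
    uint16_t pc;
} FetchCtx;

uint8_t fetch_byte(FetchCtx *c) {
    /* The bank lookup happens on every single byte here; a recompiler
       pays it once at translate time instead of on every execution. */
    uint8_t b = c->bank[c->pc >> BANK_SHIFT][c->pc & BANK_MASK];
    c->pc++;
    return b;
}
```

Note that the boundary case "just works" here because the lookup is unconditional - which is exactly the per-byte cost the recompiler eliminates.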
Another advantage a 6502 recompiler has over a RISC one is that a lot of the memory accesses will be to/from fixed locations. For these, any checks to determine what kind of access it is can be done at compile time. That isn't strictly true for the HuC6280; however, the only "complex" kind of access is to I/O, and usually a program won't change which segments are mapped to I/O, so you can manage by flushing and recompiling when this happens. Actually, everything I've seen for PC-Engine just uses segment 0 for I/O, so assuming that should be fine.
Even accesses that aren't from fixed locations are highly range limited. Any zero page or stack accesses can be translated very efficiently, and for the most part you'll be able to tell if a direct + x/y access can't cross a segment boundary since x/y are only 8bit (so the base needs to be near the edge of a segment). This also applies to 16bit wraparound. Indirect accesses are still rather bad, but for direct indirect or post increment indirect if the direct address is word aligned then you can use a 16bit load (so long as the target platform can do this little endian).
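The range check described above is cheap enough to do at translate time. Here's a sketch (hypothetical helper) of the test for an absolute + x/y access: since the index is only 8bit, the access can only cross an 8KB segment boundary when the base sits within 255 bytes of the segment's end, and the same comparison also catches 16bit wraparound:

```c
#include <stdint.h>
#include <stdbool.h>

/* Can a base + 8-bit-index access cross an 8KB segment boundary?
   Checked once at translate time; if false, the recompiler can emit a
   direct host load with no runtime mapping check. */
bool abs_indexed_may_cross(uint16_t base) {
    uint16_t hi = base + 0xFF;           /* highest reachable address */
    /* uint16_t arithmetic wraps mod 65536, so 16-bit wraparound also
       lands in a "different segment" and takes the slow path. */
    return (base >> 13) != (hi >> 13);
}
```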
Flags can be handled more efficiently thanks to dead flag elimination, as usual with recompilers. You can also use native flags more effectively, since you don't need to constantly trash them with an interpretive loop. Thanks to the memory shortcuts above you usually won't have to trash them for memory accesses either (which is good, because memory accesses happen all the time on these machines, especially the 6502).
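Dead flag elimination itself is just a backward liveness pass over the decoded block. A tiny sketch (hypothetical mini-IR): an instruction only needs to compute a flag if some later instruction reads it before it's overwritten, or it's still live at block exit.

```c
#include <stdint.h>

typedef struct {
    uint8_t reads;   /* flag bits this op reads */
    uint8_t writes;  /* flag bits this op writes */
    uint8_t needed;  /* filled in: flags it must actually compute */
} Op;

/* Walk backward; live_out = flags assumed live when the block exits. */
void eliminate_dead_flags(Op *ops, int n, uint8_t live_out) {
    uint8_t live = live_out;
    for (int i = n - 1; i >= 0; i--) {
        ops[i].needed = ops[i].writes & live;   /* only compute live ones */
        live = (uint8_t)((live & ~ops[i].writes) | ops[i].reads);
    }
}
```

On the 6502 nearly every instruction writes N and Z, so in practice most of those updates turn out dead and can be skipped entirely.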
There are also some peephole style optimizations that can be done to convert primitive 6502 code into something more sophisticated. For instance, having constant propagation for the carry flag means that you can convert clc/sec + adc/sbc instruction pairs into plain add/sub instructions. You could also look for shift loops and replace them with single shift instructions, although this is of course more complex. Even multiply and divide sequences could potentially be located - you'd probably look for specific matches for this. There's some decent potential for idle loop detection - the idle loops I've run across on PC-Engine are a lot simpler than the ones I've seen on GBA. They're usually of the type { loop: check memory location; conditional branch to loop }, which is very easy to detect in a recompiler (I guess this is the beauty of not dealing with compiled code).
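The carry constant propagation is simple to track in the translator: a sketch (hypothetical names) where seeing clc or sec pins the carry to a known value, and anything else that clobbers carry resets it to unknown. When an adc arrives with carry known to be 0, it can be emitted as a plain add (and symmetrically, sec + sbc becomes a plain sub).

```c
enum { CARRY_UNKNOWN = -1 };

typedef struct { int known_carry; } TState;  /* -1, 0, or 1 */

void on_clc(TState *st)           { st->known_carry = 0; }
void on_sec(TState *st)           { st->known_carry = 1; }
void on_carry_clobber(TState *st) { st->known_carry = CARRY_UNKNOWN; }

/* Returns 1 if an ADC at this point can be lowered to a plain add. */
int adc_lowers_to_add(TState *st) {
    return st->known_carry == 0;
}
```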
One problem recompilers have is dealing with self-modifying code. How much of an issue this actually is depends on how games use it (if at all). For a platform like PC-Engine the cartridge space is as fast as RAM, and there isn't a lot of RAM to begin with, so self-modifying code will typically only be used for legitimate cases, like the block transfer instructions (so you can more easily modify the operands). For cases like this the modify-to-use ratio is going to be near 1:1, at which point you're better off just switching out to an interpreter to handle it. This of course means writing an interpreter, but it's good to have one anyway (to get things working). If the platform actually does have a lot of code in RAM that changes, you'll have to check for modifications to it and recompile when they happen. This isn't very bad unless the game is constantly changing the code, in which case you can try segmenting the changed and unchanged parts (and possibly interpreting the heavily changing parts).
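One simple way to do the modification check is dirty tracking per RAM page - a sketch (hypothetical sizes and names): writes into pages known to hold translated code set a dirty bit, and before entering a block translated from RAM you check its page and retranslate (or fall back to the interpreter) if it's dirty.

```c
#include <stdint.h>
#include <stdbool.h>

#define RAM_SIZE  0x2000       /* 8KB of work RAM, as on PC-Engine */
#define PAGE_BITS 8            /* track dirtiness per 256-byte page */

static uint8_t ram[RAM_SIZE];
static bool page_dirty[RAM_SIZE >> PAGE_BITS];
static bool page_has_code[RAM_SIZE >> PAGE_BITS];

void ram_write(uint16_t addr, uint8_t val) {
    ram[addr] = val;
    if (page_has_code[addr >> PAGE_BITS])     /* only code pages matter */
        page_dirty[addr >> PAGE_BITS] = true;
}

/* Called before running a block that was translated from `addr`. */
bool block_still_valid(uint16_t addr) {
    return !page_dirty[addr >> PAGE_BITS];
}
```

The extra compare on every RAM write is cheap, and for the common case (code that lives in ROM or never changes) the dirty bit never gets set.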
Since these platforms have limited ROM space you can do fast flat tables for linking blocks (if you have the RAM to match it). This is simpler and faster than using a hash table.
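A flat block-linking table is about as simple as it sounds - a sketch with hypothetical sizes: one host pointer per possible block start address, so a lookup is a single indexed load with no hashing or probing.

```c
#include <stdint.h>
#include <stddef.h>

#define CODE_SPACE (1 << 17)   /* e.g. 128KB of translatable addresses */

/* One slot per possible block start; NULL means not translated yet.
   With 8-byte pointers this is 1MB - affordable given the small
   guest address space, and far cheaper to probe than a hash table. */
static void *block_table[CODE_SPACE];

void *lookup_block(uint32_t addr) {
    return block_table[addr];
}

void register_block(uint32_t addr, void *host_code) {
    block_table[addr] = host_code;
}
```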
The other caution point is reduced timing accuracy with a recompiler. The only place where timing accuracy is actually reduced is event latency, where an event is the changing of some system parameter and/or an IRQ firing. With a dynarec it usually isn't a good idea to check for these things after every instruction, so the checks are only done before branches (or only some branches), which guarantees they eventually happen. Some 8bit games are very timing sensitive (probably not on PC-Engine), so this could be a problem. However, there is a way around it: instead of taking away cycles at the end of a block, you take them away at the beginning, and if you've taken too many you jump to an interpreter to finish up the remaining amount. I haven't tried this technique before, but I think it should work in theory. Very few games should actually require it, because otherwise the timing is just off by a relatively stable number of cycles, nothing that scales with the number of instructions.
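The up-front accounting idea can be sketched in a few lines (hypothetical names - again, I haven't tried this, so treat it as the theory only): charge the whole block's cycles on entry, and if that would overshoot the next event, hand the block to the interpreter so the event fires on the exact instruction.

```c
#include <stdint.h>

typedef struct {
    int32_t cycles_to_event;   /* cycles remaining until the next event */
} Timing;

/* Returns 1 if the translated block may run natively, 0 if the caller
   should interpret this block instead to hit the event precisely. */
int enter_block(Timing *t, int block_cycles) {
    t->cycles_to_event -= block_cycles;     /* charge up front */
    if (t->cycles_to_event < 0) {
        /* Would overshoot the event: undo the charge and let the
           interpreter run the block cycle by cycle instead. */
        t->cycles_to_event += block_cycles;
        return 0;
    }
    return 1;
}
```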
So the big question is, why bother with all of this? Isn't an interpreter enough for such slow old CPUs? Usually yes. But even today there are some pretty slow embedded devices where you not only have to get things quite fast but it's in your best interest to make things as fast as possible to save battery life. For platforms like PC-Engine that actually have a pretty powerful CPU and pretty simple video (relatively speaking), a C based interpreter can sometimes not cut it. And doing a recompiler is not necessarily more difficult than doing an assembly based interpreter, especially if you already have code to emit instructions. You can still do a lot of the infrastructure in C and retain more portability this way as well.
You can see I'm rather tempted to give it a shot >_>