For a long time I've thought that recompilers for CISC platforms wouldn't be as big of a win as they are for RISC ones, especially for old 8bit platforms. After writing some HuC6280 interpretation I have a somewhat different perspective.
My main argument before was that CISC is a lot simpler to decode than RISC, and can be more directly mapped to real registers in an interpreter (for most platforms). For 6502 CPUs the opcode is a single byte, so it's realistic to have a single 256 entry switch that dispatches instructions. Furthermore, there are few CPU registers and each instruction is tied to the registers it uses, so you can map the registers to variables. Hopefully these variables will be turned into real registers by the compiler (definitely if you're doing the interpreter in assembly language). Compare this to a RISC CPU: you usually have 32bit instructions (or at least 16bit instructions), so you can't switch over all of them. Operands are packed into arbitrarily long bit-fields instead of being nice bytes, and they can be complex to decode with lots of options (especially on ARM). What's more, since there are usually 16+ registers you have to go to an array to access them. For ARM there are lots of other gotchas, like being able to access the PC. The absence of these problems made me think that CISC wasn't really worth recompiling.
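As a concrete illustration, here's a minimal sketch (all names hypothetical, not from any real emulator) of what such a single-switch interpreter core looks like in C: one opcode byte indexes the switch directly, and the handful of 6502 registers live in locals that the compiler can hopefully keep in host registers.

```c
#include <stdint.h>

typedef struct {
    uint8_t a, x, y;     /* accumulator and index registers */
    uint16_t pc;         /* program counter */
    const uint8_t *mem;  /* flat memory, for this sketch only */
} Cpu6502;

/* Execute one instruction; returns 0 on success, -1 if unimplemented. */
int step(Cpu6502 *cpu) {
    uint8_t a = cpu->a, x = cpu->x;      /* locals -> host registers */
    uint8_t opcode = cpu->mem[cpu->pc++];
    switch (opcode) {                    /* one entry per opcode byte */
    case 0xA9:                           /* LDA #imm */
        a = cpu->mem[cpu->pc++];
        break;
    case 0xE8:                           /* INX */
        x++;
        break;
    default:                             /* rest omitted in this sketch */
        return -1;
    }
    cpu->a = a; cpu->x = x;
    return 0;
}
```

A full interpreter fills in all 256 cases (and flag updates), but the shape stays this simple because the opcode is always exactly one byte.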
The reality is that CISC has its own weaknesses when it comes to interpretation, and there's a lot of room for extra performance when doing a recompiler. While RISC has complex decode, CISC has complex fetch. You have to fetch a CISC instruction one byte at a time because you don't know its length until you fetch the first byte, and beyond that a lot of host CPUs won't let you do proper unaligned loads, so you'd have to go a byte at a time anyway. This fetch can be expensive, depending on the platform. On the HuC6280 an MMU is in place, so every time you cross an 8KB boundary you could potentially be skipping into a non-contiguous memory region. Because of this you either need to check for this at every fetch, or at least at every instruction (doing it this way means you need to "slow down" and check at every fetch when you get near a boundary, which is complex to code). You also have to check for this when doing branches. If you have a recompiler then all of this fetching is done at compile time.
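To show why this fetch is costly, here's a sketch (hypothetical structure, loosely modeled on the HuC6280's MPR-style 8KB banking) of what an interpreter pays per fetched byte - every byte goes through the bank table because the next byte may live in a non-contiguous physical region:

```c
#include <stdint.h>

enum { BANK_SHIFT = 13, BANK_MASK = 0x1FFF };  /* 8KB banks */

typedef struct {
    const uint8_t *bank[8];  /* logical 8KB bank -> host pointer */
    uint16_t pc;
} FetchCtx;

uint8_t fetch_byte(FetchCtx *c) {
    /* The bank lookup happens on every single byte here; a recompiler
       pays it once at translate time instead of on every execution. */
    uint8_t b = c->bank[c->pc >> BANK_SHIFT][c->pc & BANK_MASK];
    c->pc++;
    return b;
}
```

Note that the boundary case "just works" here because the lookup is unconditional - which is exactly the per-byte cost the recompiler eliminates.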
Another advantage a 6502 recompiler has over a RISC one is that a lot of the memory accesses will be to/from fixed locations. For these, any checks to determine what kind of access it is can be done at compile time. That isn't strictly true for the HuC6280; however, the only "complex" kind of access is to I/O, and usually a program won't change which segments are mapped to I/O, so you can manage by flushing and recompiling when this happens. Actually, everything I've seen for PC-Engine just uses segment 0 for I/O, so assuming that should be fine.
Even accesses that aren't from fixed locations are highly range limited. Any zero page or stack accesses can be translated very efficiently, and for the most part you'll be able to tell if a direct + x/y access can't cross a segment boundary since x/y are only 8bit (so the base needs to be near the edge of a segment). This also applies to 16bit wraparound. Indirect accesses are still rather bad, but for direct indirect or post increment indirect if the direct address is word aligned then you can use a 16bit load (so long as the target platform can do this little endian).
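The range check described above is cheap enough to do at translate time. Here's a sketch (hypothetical helper) of the test for an absolute + x/y access: since the index is only 8bit, the access can only cross an 8KB segment boundary when the base sits within 255 bytes of the segment's end, and the same comparison also catches 16bit wraparound:

```c
#include <stdint.h>
#include <stdbool.h>

/* Can a base + 8-bit-index access cross an 8KB segment boundary?
   Checked once at translate time; if false, the recompiler can emit a
   direct host load with no runtime mapping check. */
bool abs_indexed_may_cross(uint16_t base) {
    uint16_t hi = base + 0xFF;           /* highest reachable address */
    /* uint16_t arithmetic wraps mod 65536, so 16-bit wraparound also
       lands in a "different segment" and takes the slow path. */
    return (base >> 13) != (hi >> 13);
}
```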
Flags can be handled more efficiently thanks to dead flag elimination, as usual with recompilers. You can also use native flags more effectively, since you don't need to constantly trash them with an interpretive loop. Thanks to the memory shortcuts above you usually won't have to trash them for memory accesses either (which is good, because memory accesses happen all the time on these machines, especially the 6502).
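Dead flag elimination itself is just a backward liveness pass over the decoded block. A tiny sketch (hypothetical mini-IR): an instruction only needs to compute a flag if some later instruction reads it before it's overwritten, or it's still live at block exit.

```c
#include <stdint.h>

typedef struct {
    uint8_t reads;   /* flag bits this op reads */
    uint8_t writes;  /* flag bits this op writes */
    uint8_t needed;  /* filled in: flags it must actually compute */
} Op;

/* Walk backward; live_out = flags assumed live when the block exits. */
void eliminate_dead_flags(Op *ops, int n, uint8_t live_out) {
    uint8_t live = live_out;
    for (int i = n - 1; i >= 0; i--) {
        ops[i].needed = ops[i].writes & live;   /* only compute live ones */
        live = (uint8_t)((live & ~ops[i].writes) | ops[i].reads);
    }
}
```

On the 6502 nearly every instruction writes N and Z, so in practice most of those updates turn out dead and can be skipped entirely.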
There are also some peephole style optimizations that can be done to convert primitive 6502 code into something more sophisticated. For instance, having constant propagation for the carry flag means that you can convert clc/sec + adc/sbc instruction pairs into plain add/sub instructions. You could also look for shift loops and replace them with single shift instructions, although this is of course more complex. Even multiply and divide sequences could potentially be located - you'd probably look for specific matches for this. There's some decent potential for idle loop detection - the idle loops I've run across on PC-Engine are a lot simpler than the ones I've seen on GBA. They're usually of the type { loop: check memory location; conditional branch to loop }, which is very easy to detect in a recompiler (I guess this is the beauty of not dealing with compiled code).
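The carry constant propagation is simple to track in the translator: a sketch (hypothetical names) where seeing clc or sec pins the carry to a known value, and anything else that clobbers carry resets it to unknown. When an adc arrives with carry known to be 0, it can be emitted as a plain add (and symmetrically, sec + sbc becomes a plain sub).

```c
enum { CARRY_UNKNOWN = -1 };

typedef struct { int known_carry; } TState;  /* -1, 0, or 1 */

void on_clc(TState *st)           { st->known_carry = 0; }
void on_sec(TState *st)           { st->known_carry = 1; }
void on_carry_clobber(TState *st) { st->known_carry = CARRY_UNKNOWN; }

/* Returns 1 if an ADC at this point can be lowered to a plain add. */
int adc_lowers_to_add(TState *st) {
    return st->known_carry == 0;
}
```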
One problem recompilers have is dealing with self-modifying code. How much of an issue this actually is depends on how games use it (if at all). For a platform like PC-Engine the cartridge space is as fast as RAM, and there isn't a lot of RAM to begin with, so self-modifying code will typically only be used for legitimate cases, like the block transfer instructions (so you can more easily modify the operands). For cases like this the modify-to-use ratio is going to be near 1:1, at which point you're better off just switching out to an interpreter to handle it. This of course means writing an interpreter, but it's good to have one anyway (to get things working). If the platform actually does have a lot of code in RAM that changes, you'll have to check for modifications to it and recompile when they happen. This isn't very bad unless the game is constantly changing the code, in which case you can try segmenting the changed and unchanged parts (and possibly interpreting the heavily changing parts).
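One simple way to do the modification check is dirty tracking per RAM page - a sketch (hypothetical sizes and names): writes into pages known to hold translated code set a dirty bit, and before entering a block translated from RAM you check its page and retranslate (or fall back to the interpreter) if it's dirty.

```c
#include <stdint.h>
#include <stdbool.h>

#define RAM_SIZE  0x2000       /* 8KB of work RAM, as on PC-Engine */
#define PAGE_BITS 8            /* track dirtiness per 256-byte page */

static uint8_t ram[RAM_SIZE];
static bool page_dirty[RAM_SIZE >> PAGE_BITS];
static bool page_has_code[RAM_SIZE >> PAGE_BITS];

void ram_write(uint16_t addr, uint8_t val) {
    ram[addr] = val;
    if (page_has_code[addr >> PAGE_BITS])     /* only code pages matter */
        page_dirty[addr >> PAGE_BITS] = true;
}

/* Called before running a block that was translated from `addr`. */
bool block_still_valid(uint16_t addr) {
    return !page_dirty[addr >> PAGE_BITS];
}
```

The extra compare on every RAM write is cheap, and for the common case (code that lives in ROM or never changes) the dirty bit never gets set.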
Since these platforms have limited ROM space you can do fast flat tables for linking blocks (if you have the RAM to match it). This is simpler and faster than using a hash table.
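A flat block-linking table is about as simple as it sounds - a sketch with hypothetical sizes: one host pointer per possible block start address, so a lookup is a single indexed load with no hashing or probing.

```c
#include <stdint.h>
#include <stddef.h>

#define CODE_SPACE (1 << 17)   /* e.g. 128KB of translatable addresses */

/* One slot per possible block start; NULL means not translated yet.
   With 8-byte pointers this is 1MB - affordable given the small
   guest address space, and far cheaper to probe than a hash table. */
static void *block_table[CODE_SPACE];

void *lookup_block(uint32_t addr) {
    return block_table[addr];
}

void register_block(uint32_t addr, void *host_code) {
    block_table[addr] = host_code;
}
```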
The other caution point is reduced timing accuracy with a recompiler. The only place where timing accuracy is actually reduced is event latency, where an event is the changing of some system parameter and/or an IRQ firing. With a dynarec it usually isn't a good idea to check for these things after every instruction, so the checks are only done before branches (or only some branches), which guarantees they eventually happen. Some 8bit games are very timing sensitive (probably not on PC-Engine), so this could be a problem. However, there is a way around it: instead of taking away cycles at the end of a block, you take them away at the beginning, and if you've taken too many you jump to an interpreter to finish up the remaining amount. I haven't tried this technique before, but I think it should work in theory. Very few games should actually require it, because otherwise the timing is just off by a relatively stable number of cycles, nothing that scales with the number of instructions.
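The up-front accounting idea can be sketched in a few lines (hypothetical names - again, I haven't tried this, so treat it as the theory only): charge the whole block's cycles on entry, and if that would overshoot the next event, hand the block to the interpreter so the event fires on the exact instruction.

```c
#include <stdint.h>

typedef struct {
    int32_t cycles_to_event;   /* cycles remaining until the next event */
} Timing;

/* Returns 1 if the translated block may run natively, 0 if the caller
   should interpret this block instead to hit the event precisely. */
int enter_block(Timing *t, int block_cycles) {
    t->cycles_to_event -= block_cycles;     /* charge up front */
    if (t->cycles_to_event < 0) {
        /* Would overshoot the event: undo the charge and let the
           interpreter run the block cycle by cycle instead. */
        t->cycles_to_event += block_cycles;
        return 0;
    }
    return 1;
}
```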
So the big question is, why bother with all of this? Isn't an interpreter enough for such slow old CPUs? Usually yes. But even today there are some pretty slow embedded devices where you not only have to get things quite fast but it's in your best interest to make things as fast as possible to save battery life. For platforms like PC-Engine that actually have a pretty powerful CPU and pretty simple video (relatively speaking), a C based interpreter can sometimes not cut it. And doing a recompiler is not necessarily more difficult than doing an assembly based interpreter, especially if you already have code to emit instructions. You can still do a lot of the infrastructure in C and retain more portability this way as well.
You can see I'm rather tempted to give it a shot >_>