I've looked a little closer at your code, and admittedly a lot of what looks like weakness in design is more just a matter of your implementation. For instance, this is what'll be emitted for an "inc" instruction:
Code:
push bx
mov bl, ah
lahf
and ah, 0xD4
and bl, 0x2B
or bl, ah
mov ah, bl
pop bx
The and wouldn't happen if it were, say, an add, but everything else would. This sequence is relatively expensive (btw, there's no reason to do push bx instead of push ebx - you're better off avoiding 16-bit instructions where possible and you don't really want to misalign the stack). When you're updating all of the flags, an lahf by itself should actually suffice.
Nonetheless, the lahf isn't free (for instance, 2 cycles on Atom) and there's another advantage to keeping things in x86 flags where possible: you can do branches without code to extract the flags. Here's how you're doing branches:
Code:
push ax
and ah, (condition)
pop ax
j(n)z skip
mov di, address
skip:
Now, you really don't need the push/pop, since you can just do a test ah, (condition) instead. But if the flags were already set you could simply do the branch. And if it's a direct target you can link straight to the recompiled block. You can actually compile cycle checks at the beginning of blocks so you don't have to worry about a potential exit here.
But the problem is getting x86 to not clobber flags when you want to retain them. You can use lea to perform constant additions and subtractions without setting flags, like when subtracting cycles. A more subtle benefit of performing liveness analysis (not just on flags, but on registers) is that by knowing when the registers aren't needed you know when you're free to use them for something else without worrying about saving (and possibly restoring) them. So if you know the flags are dead, you can freely overwrite them without committing them elsewhere. Fortunately, you don't have to do conditional branches for memory accesses in your emulator, so you really only need them for emulating 8080 conditional branches and to check cycles (if you inline this). Putting a cycle check at the beginning of a block with flags liveness analysis performed is advantageous because at the beginning of the block flags are often dead, especially on this architecture.
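To make the flags liveness idea a bit more concrete, here's a rough sketch of a backward pass over one block, written in Python just for illustration. The FLAG_EFFECTS table and the flags_needed function are my own made-up names, the table only covers a handful of 8080 instructions, and it assumes all flags are live at the end of the block unless you tell it otherwise.
Code:
```python
# Hypothetical sketch: backward flags liveness over one block of 8080 code.
# Flag letters: S(ign), Z(ero), A(ux carry), P(arity), C(arry).
ALL_FLAGS = frozenset("SZAPC")

# Partial, illustrative table: mnemonic -> (flags read, flags written).
FLAG_EFFECTS = {
    "INR": (frozenset(), frozenset("SZAP")),      # inc: carry untouched
    "ADD": (frozenset(), frozenset("SZAPC")),
    "ADC": (frozenset("C"), frozenset("SZAPC")),
    "MOV": (frozenset(), frozenset()),
    "JNZ": (frozenset("Z"), frozenset()),
    "JC": (frozenset("C"), frozenset()),
}


def flags_needed(block, live_out=ALL_FLAGS):
    """Return (needed, live_in).

    needed[i] is the set of flags instruction i really has to produce;
    live_in is the set of flags that are live on entry to the block.
    """
    live = set(live_out)  # conservative default: everything live at exit
    needed = [frozenset()] * len(block)
    for i in range(len(block) - 1, -1, -1):
        reads, writes = FLAG_EFFECTS[block[i]]
        # Only flags somebody reads later need to be computed here.
        needed[i] = frozenset(writes & live)
        # Written flags are dead above this point; read flags become live.
        live = (live - writes) | reads
    return needed, frozenset(live)


if __name__ == "__main__":
    needed, live_in = flags_needed(["INR", "ADD", "MOV", "JNZ"])
    for op, flags in zip(["INR", "ADD", "MOV", "JNZ"], needed):
        print(op, "".join(sorted(flags)) or "(none)")
    print("live on entry:", "".join(sorted(live_in)) or "(none)")
```
In that example the INR doesn't need to save any flags at all, because the ADD right after it overwrites everything, and nothing is live on entry to the block, so a cycle check placed there could clobber the host flags freely.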
Looking further at your output stream revealed a major weakness I didn't catch the first time - you're updating the PC after every instruction.
The PC can be considered a constant that is known at compile time, so there's no need to maintain it at runtime. The only time you need to output it is when a return address needs to be stored (either for a call or an interrupt), or when a branch target needs to be made available. In your case, you'd probably want to update the PC before an instruction that will end the block. But with a redesign this wouldn't be necessary.
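Here's roughly what I mean by treating the PC as a compile-time constant, again as a Python sketch rather than anything resembling your actual emitter. compile_block and the pseudo-ops it outputs (emit, push_guest_word, goto_block) are hypothetical names I made up; the point is only that the PC lives in a compiler variable and shows up in the output solely as an immediate where a return address or target is needed.
Code:
```python
# Hypothetical sketch: the guest PC is tracked while compiling, not at runtime.
def compile_block(start_pc, instrs):
    """instrs: list of (mnemonic, length_in_bytes, target_or_None)."""
    out = []
    pc = start_pc  # known at compile time; never stored per instruction
    for mnemonic, length, target in instrs:
        next_pc = (pc + length) & 0xFFFF  # 8080 addresses are 16 bits wide
        if mnemonic == "CALL":
            # The return address is just a constant baked into the output.
            out.append(f"push_guest_word 0x{next_pc:04X}")
            out.append(f"goto_block 0x{target:04X}")
        elif mnemonic == "JMP":
            out.append(f"goto_block 0x{target:04X}")
        else:
            # Ordinary instruction: no PC update emitted at all.
            out.append(f"emit {mnemonic}")
        pc = next_pc
    return out


if __name__ == "__main__":
    block = [("INR", 1, None), ("MVI", 2, None), ("CALL", 3, 0x0200)]
    for line in compile_block(0x0100, block):
        print(line)
```
Only the CALL ends up referencing a PC value (0x0106 here), and it does so as an immediate; the INR and MVI before it produce no PC bookkeeping.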
It's true all of this extra analysis increases recompilation time. But unless this is a recurring expense, for instance due to self-modifying code, you should never look at it on equal footing with run time. Compilation is something that happens only once, and a recompiler of this nature is only performing O(n) operations. It takes a lot of code to really even become noticeable, and the impact is asymptotically zero since it's a one-time cost vs ongoing execution. Basically, I would say go back and optimize/simplify the recompilation only if you can demonstrate it has become a problem.
Finally, this isn't a suggestion per se, but one thing I like with recompilers is the ability to look at before + after streams. Maybe if you can produce some examples some other ideas for improvement will become clearer.