
Threaded interpretation vs Dynamic Binary Translation

Disonance

New member
I'd like to hear people's experiences and opinions on the performance benefits of threaded interpreters vs traditional interpreters vs dynamic binary translation/JIT. For those who are not aware of what threaded interpreters are I shall provide a brief description:

As each instruction is fetched and decoded by the standard interpreter, a function pointer to the final native code is added to an array indexed by the program counter. Each time an instruction is fetched, the function pointer array is checked for a valid entry. If one is present, the code is called through the function pointer and the instruction executed, bypassing the decode entirely. The aim is to create threaded code (i.e. a contiguous sequence of unconditional subroutine calls) which executes much faster than the sequence of conditional calls/branches that makes up an interpreter's decode. On the other hand, there is still some call/return overhead compared to the cached basic blocks used in dynamic binary translation.
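In C, the scheme might be sketched roughly like this: a toy CPU with a handful of made-up opcodes, where the decode result (a function pointer) is cached in an array indexed by the PC. Every identifier here is invented for illustration, not taken from any real emulator:

```c
#include <stdint.h>
#include <stddef.h>

#define MEM_SIZE 256

typedef struct Cpu {
    uint8_t  mem[MEM_SIZE];   /* code/data memory   */
    uint16_t pc;              /* program counter    */
    int32_t  acc;             /* single accumulator */
    int      halted;
} Cpu;

typedef void (*OpFn)(Cpu *);

static void op_nop (Cpu *c) { c->pc += 1; }
static void op_inc (Cpu *c) { c->acc += 1; c->pc += 1; }
static void op_halt(Cpu *c) { c->halted = 1; }

/* Decode cache: one slot per possible PC value. NULL = not decoded yet. */
static OpFn decode_cache[MEM_SIZE];

/* Slow path: conditional decode, as in a plain interpreter. */
static OpFn decode(uint8_t opcode)
{
    switch (opcode) {
    case 0x00: return op_nop;
    case 0x01: return op_inc;
    default:   return op_halt;
    }
}

static void run(Cpu *c)
{
    while (!c->halted) {
        OpFn fn = decode_cache[c->pc];
        if (fn == NULL) {                 /* first visit: full decode */
            fn = decode(c->mem[c->pc]);
            decode_cache[c->pc] = fn;     /* cache for next time      */
        }
        fn(c);                            /* dispatch, decode skipped */
    }
}
```

The first pass through a given PC pays the full decode cost; every pass after that is a single indirect call.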

Why would I be interested in this method? As it happens, platform X prevents execution of data by application code, meaning dynamic binary translation cannot be used under any circumstances. It is, however, permissible to call static code through function pointers. It is widely agreed to be impractical to emulate target Y on platform X using basic linear interpretation. A proof of concept of threaded interpretation, emulating targets similar in complexity/speed to target Y, has been published, but it has suffered from neglect due to IRL issues facing its lone developer. Other developers continue to disparage platform X as "useless" for emulation because of the inability to use a dynarec.

My question, therefore, is: how widespread is use/knowledge of threaded interpretation techniques in emulation? Or is there a legitimate reason behind developers' apparent ignorance of this method of optimisation?
 

Shonumi

EmuTalk Member
You know, I was thinking about implementing something like this myself a while ago; I had no idea there was a name for the concept, though.

I guess the benefits will depend on how the emulator actually needs to decode instructions. If it's something like the GB, where most opcodes are 8 bits, a switch statement on the opcode byte will basically act like a jump table and be fast enough that function pointers wouldn't show any noticeable speed increase. That's my guess, at least.
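To illustrate what I mean, here's a rough sketch of the two dispatch styles for an 8-bit opcode space (invented names, only two opcodes wired up). With a dense 0..255 range, most compilers lower the switch to a jump table anyway, so the explicit function-pointer table buys little:

```c
typedef struct { int a; unsigned pc; unsigned char mem[256]; } Gb;
typedef void (*Handler)(Gb *);

static void h_inc(Gb *g) { g->a++; g->pc++; }
static void h_nop(Gb *g) { g->pc++; }

/* Style 1: switch dispatch (usually compiled to a jump table). */
static void step_switch(Gb *g)
{
    switch (g->mem[g->pc]) {
    case 0x3C: h_inc(g); break;   /* INC A */
    default:   h_nop(g); break;
    }
}

/* Style 2: explicit 256-entry handler table, one indirect call. */
static Handler table[256];

static void init_table(void)
{
    for (int i = 0; i < 256; i++) table[i] = h_nop;
    table[0x3C] = h_inc;
}

static void step_table(Gb *g)
{
    table[g->mem[g->pc]](g);
}
```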

Now for more complex CPU architectures, I'll bet that's a different story. Again though, it really depends on how the emulator decodes instructions and if that actually takes up a substantial amount of processing time. If the decoding phase takes a notable length of time (relatively speaking) then yes, this method could be used to overcome that performance issue.

The drawbacks I see would be cases where the function pointers have to be invalidated for the current instruction at the PC. Think self-modifying code. It's more common than many imagine in commercial games. Many games transfer instructions dynamically to RAM and execute code from there, especially on systems where RAM execution was faster than ROM execution. Basically, any portion of memory that can be written to and then used as a program counter target has the potential to invalidate the instruction previously represented by the array's function pointer. It may not be a serious deal-breaker, but it's one of the pitfalls of the method you should be aware of. The performance penalties for validation checks on the PC/current instruction are worth investigating if you really want to pursue this method.
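A sketch of the invalidation idea, assuming all guest writes are funnelled through one accessor (identifiers invented; a real emulator would need this on every write path, including DMA):

```c
#include <stdint.h>
#include <stddef.h>

#define MEM_SIZE 256

typedef void (*OpFn)(void *cpu);

static uint8_t mem[MEM_SIZE];
static OpFn    decode_cache[MEM_SIZE];

/* Placeholder handler, standing in for a real cached decode result. */
static void dummy_op(void *cpu) { (void)cpu; }

/* All guest writes go through this accessor, so the cache entry for
 * the written address is dropped and re-decoded on next execution.
 * Without this, self-modifying code would keep running the stale
 * handler cached for the old instruction. */
static void guest_write8(uint16_t addr, uint8_t value)
{
    mem[addr]          = value;
    decode_cache[addr] = NULL;   /* invalidate: force re-decode */
}
```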
 

Disonance

New member
IIRC threaded interpretation was first introduced in the 70s to accelerate Forth interpreters. So it's an established technique in computer science; unfortunately, it seems to have been eclipsed by dynamic binary translation.

Regarding CPU architectures, even basic 8-bit opcodes are often best decoded in a cascade of switches/jump tables. For example, a move instruction will usually have a register-to-register option or a memory-to-register version, even in RISC. If you go to the extreme and separate these down to one interpreter leaf function per register combination, then when ported to a threaded interpretation you eliminate, in the case of a move instruction, all branches except for updating condition code flags. This has benefits for the host architecture's pipeline, which can be a real boost if the host lacks complex branch prediction. You may end up with an enormous number of subroutines and lots of duplicate code, but think of it as trading code space for speed, in the same way the threaded interpreter's pointer cache or a dynarec's translation cache trades data space for speed.
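As a rough sketch of the leaf-per-register-combination idea (a two-register toy with invented names): the one-time decoder picks a specialised handler in which source and destination are baked in, so the handler body itself is branch-free.

```c
#include <stdint.h>

typedef struct { uint32_t r[8]; uint32_t pc; } Cpu;
typedef void (*OpFn)(Cpu *);

/* Generate one leaf function per (dst, src) pair. */
#define DEF_MOV(D, S) \
    static void mov_r##D##_r##S(Cpu *c) { c->r[D] = c->r[S]; c->pc++; }

DEF_MOV(0, 1)
DEF_MOV(1, 0)

/* One-time decode: map the operand bits to the specialised leaf.
 * The per-execution path contains no operand decoding at all. */
static OpFn decode_mov(unsigned dst, unsigned src)
{
    static const OpFn leaves[2][2] = {
        { 0,         mov_r0_r1 },
        { mov_r1_r0, 0         },
    };
    return leaves[dst][src];
}
```

The macro makes the code-space cost explicit: a real register file would multiply the leaf count out to dst × src combinations, which is exactly the space-for-speed trade described above.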

Now, on the subject of how long a decode takes: if you look at a slightly more complicated CPU, a 68000 for example, then the instruction decode really has to happen in the cascaded fashion I described, because the CPU's state machine is actually doing just that! It's possible to structure cascaded instruction decodes so the thread path taken is proportional to the time the actual CPU takes to decode the instruction. So (taking the 68k as an example again) your decode path for ADDQ should be much shorter than the one used for ADDI. The same can apply to 8-bit CPUs when you consider addressing modes and the number of operands required.
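To make the ADDQ/ADDI point concrete, here's a sketch of a cascaded decode where the cheap form resolves in one switch level and the long form needs a second level plus an extra operand fetch. The encoding is 68k-flavoured but invented for illustration; don't mistake it for the real instruction format:

```c
#include <stdint.h>

typedef struct { int32_t d[8]; unsigned pc; const uint16_t *code; } Cpu;

/* Short path: the immediate is packed into the opcode word itself. */
static void exec_addq(Cpu *c, unsigned imm, unsigned reg)
{
    c->d[reg] += (int32_t)imm;
    c->pc += 1;
}

/* Long path: the immediate lives in an extension word (extra fetch). */
static void exec_addi(Cpu *c, unsigned reg)
{
    c->d[reg] += (int16_t)c->code[c->pc + 1];
    c->pc += 2;
}

static void step(Cpu *c)
{
    uint16_t op = c->code[c->pc];
    switch (op >> 12) {                 /* first decode level         */
    case 0x5:                           /* ADDQ-style: done already   */
        exec_addq(c, (op >> 9) & 7, op & 7);
        break;
    case 0x0:                           /* longer path: second level  */
        switch ((op >> 8) & 0xF) {
        case 0x6:                       /* ADDI-style                 */
            exec_addi(c, op & 7);
            break;
        }
        break;
    }
}
```

The ADDQ-style path goes through one switch; the ADDI-style path goes through two plus a memory read, mirroring the way the real CPU spends more decode time on the longer form.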

Regarding self-modifying code, this is as much a problem for a dynarec approach as it is for a threaded interpreter, and it can be detected and dealt with in a number of ways. Actually, as far as I know, a dynarec CPU emulation can be ported back to a threaded interpreter just by modifying the code that manages the translation cache; the profiler that decides what to cache can remain the same.

Now, as it happens, there are platforms in widespread use that have considerable processing power available, but whose OS, for security reasons, prevents data execution under pretty much any circumstances. While an 8-bit or well-optimised 16-bit CPU emulation may be possible purely with a simple interpreter on these platforms (and there is a lot of open source code out there enabling this to be done), a 32-bit emulation using a pure interpreter is not practical. So those willing to port a 32-bit dynarec emulator over to a threaded interpreter actually have a captive audience on these platforms.
 

Shonumi

EmuTalk Member
I don't know a great deal about dynarecs aside from some of the JIT recompilation stuff the Dolphin emulator does. Perhaps if you asked some of the developers on the dolphin-dev IRC channel, one of them (Fiora, perhaps) would be able to tell you more accurately than I can why dynarecs are more popular than threaded interpreters. I've only made pure interpreters for the GB's Z80 variant and the GBA's ARM7TDMI.

Btw, what host platform are you considering or have in mind? iOS or something of that sort?
 

Disonance

New member
iOS is a non-starter, considering Apple's refusal to allow any emulators in the App Store and the fact that a jailbroken iPhone can run a dynarec. Those who feel they must have an emulator on their iPhone can jailbreak their phone.

The platform I am referring to is Windows 8.1's Metro environment, together with its RT and Windows Phone siblings. Microsoft has so far had no qualms about allowing pure-interpreter-based emulators onto their store, even as paid-for apps, but due to the obvious performance limitations these are limited to 4th-gen consoles and below. The proof-of-concept app I mentioned is EmiPSX on Windows Phone, which will happily run games even on a Lumia 520. Unfortunately it has compatibility issues with many games (as most dynarec-based emulators do early in development), and the lone developer is struggling to find time to maintain development. The result is that discussion of EmiPSX has been brushed aside or ignored by those who find it convenient to do so.

I avoided mentioning Windows 8/RT/Phone to start with because I have seen particularly negative reactions to it from developers, based on some rather irrational prejudices. But I have gotten sick and tired of seeing well-meaning lay users politely asking on various forums whether a port could be considered, only to be shut down/flamed by community members who dismiss it as completely impossible. It may well be that there is no commercial case for porting to a threaded interpreter just for Windows 8's benefit, given the market share and the scope of work. But it is technically feasible, at least in my opinion, for the PSX, N64, Amiga, Nintendo DS and possibly the PSP.
 

Shonumi

EmuTalk Member
Well, I'm not one to make assumptions about people or their intentions. Who knows, maybe you were one of those people who like to play the "racing" game with Apple :p

Anyway, I'm also not one to tell others what platforms/tools they should use or target. As someone who doesn't appreciate being told what to do with my free time, I don't tell others what to do in theirs, so I have no problem with your interest in Windows 8/RT/Phone. Getting back to the topic: since I don't know a lot, I asked some Dolphin associates if they knew about this sort of stuff. Here's the brief IRC discussion I had (self-referencing this thread, infinite loop incoming...)

Code:
16:43 < shonumi> Anyone know much/heard about threaded interpreters?
16:43 < shonumi> Recently found out about them
16:43 < shonumi> curious about opinions
16:43 < NsGk_> opinions on performance?
16:44 <@HdkR> What part of it would be threaded?
16:44 < shonumi> I think it might be a misnomer, not actually using thread/multithreading
16:44 < NsGk_> you don't mean multithreaded, right? you mean linking the blocks togther
16:44 < NsGk_> yeh
16:44 < shonumi> This is where someone and I were talking about it: 
http://www.emutalk.net/threads/55275-Threaded-interpretation-vs-Dynamic-Binary-Translation?goto=newpost
16:44 <@HdkR> oh, I didn't know it was called that
16:45 < shonumi> Wondering if any here had some knowledge about that kinda stuff
16:45 < NsGk_> interpreters in general are just going to give you worse performance
16:45 < NsGk_> threaded > central dispatch, sure
16:45 < NsGk_> but binary translation still blows them both out of the water
16:46 < shonumi> low level optimizations right?
16:46 < shonumi> and better low level code, right?
16:47 <@HdkR> More like you don't have to do instruction parsing over and over, you compile the 
              block once and you win the game :P
16:47 <@HdkR> or compile the block multiple times in the case of having a tiered recompiler
16:48 < NsGk_> could give this a read 
               http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter2.pdf gives a solid 
               comparison of a number of methods
16:48 < shonumi> NsGk_: Excellent, just what I need to look at :D
16:48 < shonumi> time to learn me stuff
16:49 < NsGk_> when it comes down to it, dynarec will be superior but harder to implement
16:49 <@HdkR> s/harder/funner

So I guess have a look at the link NsGk gave me: http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter2.pdf. On the hierarchy of performance, I suppose it goes Dynarec > Threaded Interpreter > Pure Interpreter. Depending on how efficiently you emulate a system, and how many hardware resources it generally takes to accomplish that emulation, threaded interpretation could easily be a performance solution where dynarecs are not possible and pure interpreters are too slow. That is all assuming CPU emulation is the bottleneck, of course. Remember, this isn't exactly within my domain to answer, but I hope that sheds some light for you.
 

Disonance

New member
Thanks for that. That's an interesting exchange you had; it's pretty similar to the responses I have had before from other development teams. At least you didn't get the classic straw man "If it's so great/easy, why don't you do it yourself?". But the other staple, "dynarec wins every time, so why bother thinking about doing anything else", was trotted out despite you linking to a discussion clearly explaining why someone just might need to consider taking a different approach! But I digress...

The document NsGk_ provided is interesting, since I hadn't thought of using predecoding before, but it doesn't offer any example metrics for the performance gain from threaded interpretation. Incidentally, "threaded" is not a misnomer; it's just become associated with multi-threaded programming. A threaded interpreter is named as such because it uses threaded code (see the Wikipedia article of the same name). Also, I dug up the document where I first heard of threaded interpretation. For some reason I can't post URLs to this forum, but google "Study of the techniques for emulation programming" and you'll find a very interesting paper by one Victor Moya del Barrio. While the author's English isn't great, it provides a very balanced overview of various techniques, including threaded interpretation. I would go as far as to call it required reading for anyone interested in emulators.

I myself am not intending to port an emulator. While I'm confident I could handle any CPU/memory emulation (having a background in developing HDL soft-core processors helps), I have little knowledge of graphics and sound programming, especially within the limited DirectX API offered by the WP8/RT platforms. I would, however, like to correct those people who persist in discouraging others from exploring these promising alternatives, or who refuse to accept that they are valid and practical methods for optimising an emulator.
 

fleroviux

Member
One interesting point could also be timing accuracy, because I think it's harder to achieve with dynamic recompilation. I mean, many systems like the NES (and sometimes Game Boy games too) require accurate instruction/memory timing.
For threaded interpretation we could simply do timing between executions of the current block's members/function pointers. But what about dynamic recompilation? We could limit blocks to one source instruction, but that would heavily decrease performance, and not only that, it would increase memory usage because block tables would grow as high as skyscrapers...

EDIT: Of course we could also emit something like this after each source instruction:
Code:
jmp .cpuDoTiming
Don't know how much speed it would eat, though...
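For the threaded-interpreter side, the per-instruction timing check could look something like this sketch (all names invented): each handler returns its cycle cost, and the dispatch loop services scheduled events whenever the counter crosses the next deadline.

```c
#include <stdint.h>

typedef struct Cpu Cpu;
typedef int (*OpFn)(Cpu *);     /* handlers return cycles consumed */

struct Cpu {
    uint64_t cycles;
    uint64_t next_event;        /* cycle count of the next timed event */
    int      halted;
    OpFn     trace[16];         /* already-decoded handler sequence    */
    unsigned pc;
};

static void service_events(Cpu *c)
{
    /* In a real emulator: raise a scanline IRQ, reload a timer, etc.
     * Here we just stop the loop so the sketch terminates. */
    c->halted = 1;
}

/* Example handler: a NOP costing 4 cycles. */
static int op_nop4(Cpu *c) { c->pc++; return 4; }

static void run(Cpu *c)
{
    while (!c->halted) {
        c->cycles += c->trace[c->pc](c);    /* dispatch + cycle cost   */
        if (c->cycles >= c->next_event)     /* timing check per insn   */
            service_events(c);
    }
}
```

This keeps instruction-granularity timing without shrinking translation blocks, which is the advantage being suggested above.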
 

hwave

New member
I have been researching interpreters recently, and now this thread interestingly popped up. I won't comment on interpreters vs dynarecs, as I have no experience with dynarec stuff. A common theme here is helping the CPU's branch predictor/target predictor. I haven't had time to test these out myself yet, however, so I cannot relay any personal experience in using them.

Here's an interesting paper that among other things concludes that threaded interpreters can yield up to double performance over regular switch dispatch.
The Structure and Performance of Efficient Interpreters

There's also this one, comparing several techniques. They claim speedups of 4.55 over threaded code interpreters if you use what they describe in this paper.
Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters

Then there's this guy's pseudo-blog: NO EXECUTE! September 3, 2008
He has a lot of interesting stuff on emulators; he regularly uses Bochs as an example and shows tests before and after his modifications, demonstrating interesting speedups. The link is to a post where he starts off talking about threaded interpreters and eventually ends up at something he has named "The Nostradamus Distributor", a funky-looking dispatch method that is apparently very efficient.

As I said I haven't tried any of these myself yet, but they were all fascinating reads if nothing else, and since there was talk about threaded interpreters here you may be interested in this.
 

Disonance

New member
fleroviux: Yes, since dynarec revolves around basic blocks, the emulated platform must have loose enough timing tolerances to allow timing to be calculated after the execution of a code block. I know dynarecs have been successfully created for the Mega Drive, which has a line-based interrupt from the VDP; there's a paper on it at www.squish.net/generator/gendoc.pdf. One-instruction blocks, however, don't make much sense when you consider the considerable overhead of looking them up in the translation cache for every single instruction; I'd expect a threaded interpreter to win in that case.

hwave: Thanks for posting those; I agree they make fascinating reading. The blog post in particular made me smile; I'll have to dust off my x86 assembly engrams before I can fully digest it. It also raised some interesting questions for me: how many emulator developers can read x86 or ARM assembly? Do developers ever check the output of their compiler's release builds, or do they just trust it? Is the performance gain provided by dynarec techniques camouflaging severe bottlenecks in the interpreted/translated code? How much thought are people porting from Intel to ARM putting into the architectural differences between these platforms (e.g. different pipelining and branch prediction)?

Perhaps developers should consider these questions before rushing to declare a platform "useless" for emulation purposes.
 
