What's new

Optimizing in the dynamic recompilation!

Norlin

New member
Hello Hacktarux and others,

My name is Markus and I'm a programmer from Sweden, I don't know a lot about emulationprogramming, but my interest in the subject began when I started to use N64 emus. Unfortunatly I've not been able to play a lot as I use Linux and have a slow (500Mhz) computer, so I'm happy to see the Mupen64 project evolve the way it does...

During the past couple of weeks I've thought more and more about starting to study emulationprogramming, I downloaded and printed the book "Study of the Techniques of Emulationprogramming" by Victor Moya del Barrio, I've read the first couple of chapters and I'm starting to get a slight idea of what emulationprogramming is all about.

Here comes the reason for why I'm writing this mail, after I had read the chapters about dynamic recompilation I found there weren't any advanced optimization techniques mentioned in this book, I thought a lot about optimization in the dynarec process that day and I came up with an idea that I found quite interesting, during the next couple of days I kept thinking about this idea and different ways to implement it...

As I don't know 'anything' at all about emulation, I don't know

1. If this optimization perhaps does exist and is being used...
2. If this optimization perhaps isn't possible to implement for some reason...

I think this forum is a good place to ask about this, you seem to be great emulationprogrammers and Hacktarux's work is impressive. I'll introduce you to my ideas and it'd be very fun to hear thoughts about how something like this perhaps would work if implemented...


My optimization thoughts:


Emulating a CPU with about a ton of 64 bit registers as that in N64 on a CPU like Intel's x86 series with only 4 32bit registers (I don't know if the MMX registers can be used during emulation?) makes it impossible to have more than very few of the registers being emulated, in real CPU registers, I'd guess an N64 emu makes use of RAM for emulating the morepart of the emulated registers? If this is correct it'd mean, those that can be mapped into the Intel CPU's regs would work quite fast, but the ones stored in RAM would be quite slow. If one could know when and where the different emulated registers would be used, it would be possible to insert a small piece of code for flipping the contents of the regs around and most of the time, have the most used registers, in real regs while having less used ones, in RAM, flips like this could be done several times in a codesegment each time the behavior of the code changes. To be able to know when and where different registers are used, one would have to make an analysis while first interpreting the codesegment, I would call this system RUAS (Register Utilization Analysis System):

1. A number of structs are alloc'ed, these are to be used to hold the data from the first optimization stage, structs would have to contain items able to hold what register(s) are used, jumpts to positions and jumps from positions, there would have to be such structs for each detected instructionset in the codesegment to optimize...

2. The emulationprocess is executed for the segment in a special interpreter, this interpreter does what a normal interpreter would to, except for that it also fills in what's going on in the alloc'ed structs for each instruction, there it increase the item holding how many times this label was jumped to, or how many times a jump was made away from this instruction, it also stores (increases) an item measuring how many times a register perhaps was accessed here...

3. The structs now contain a lot of data about what happend in the codesegment. Now there has to be a function implemented that goes through this data and analyse it carefully... By checking a lot of possible areas (according to some optimization-level-formula) inside the data, and what registers were used inside that area, one will come up with a second stage set of information placed in a number of second stage structs. This information will be able to place all the emulated registers in a row from the least used one, to the most used one. It would be possible to tune the accuracy of these measurements to get more or less number of theoretical segments, each segment with its own register-prioritization row.

4. Now (the second time the codesegment is to be ran) the real dynarec is used, this takes the 'second stage' optimization data and use this to know how to emulate the registers. A small piece of code is used to flip references to registers accordingly to the 'second stage' information. Each time a new 'theoretical optimization area' is entered, such a flip is done, just prior to entering this area.


To give an example of what this would result in as I think about it:

The target cpu has regs target1 and target2, emulated CPU has emu1 to emu4...

[codesegment starts]

-> [references flips so emu1/emu2 exist in target1/target2, emu3/target4 in RAM]
-> [data in registers are moved to their new places]

[theoretical optimization area 1 starts, uses lots of emu1 and emu2 regs]
[... code ...]
[theoretical optimization area 1 stops]

-> [references flips so emu3/emu4 exist in target1/target2, emu1/target2 in RAM]
-> [data in registers are moved to their new places]

[theoretical optimization area 2 starts, uses lots of emu3 and emu4 regs]
[... code ...]
[theoretical optimization area 2 stops]

[codesegment stops]


That's the whole idea I have ^, I hope I explained my thoughts in a way so they makes sense. As I said previously, I don't have a clue whether this does exist, if it's even possible to implement or if it would give a performance increase enough to justify the time it would take to implement something like this, perhaps the optimizationprocess would take so much time it'd not make up for the performanceincrease, or perhaps is the dynarec'ed code could be stored in a cache and thus make this quite usefull only slowing down the emulationprocess once the first time something new is being executed? I don't know but It'd be quite fun to hear thoughts on this, I really hope this can be constructive in some way...

Best regards, Markus
 

ShadowPrince

Moderator
Norlin said:
Hello Hacktarux and others,


That's the whole idea I have ^, I hope I explained my thoughts in a way so they makes sense. As I said previously, I don't have a clue whether this does exist, if it's even possible to implement or if it would give a performance increase enough to justify the time it would take to implement something like this, perhaps the optimizationprocess would take so much time it'd not make up for the performanceincrease, or perhaps is the dynarec'ed code could be stored in a cache and thus make this quite usefull only slowing down the emulationprocess once the first time something new is being executed? I don't know but It'd be quite fun to hear thoughts on this, I really hope this can be constructive in some way...

Best regards, Markus

Good reading.Dynarec exists of course in many n64 emulators for quite some time already,book is slightly outdated now.
For best working dynarec engine you should check sources of 1964.It's open sourced emu for winx platforms and can be downloaded from 1964.emulation64.com .
Welcome to the world of emulationprogramming :)
 

Hacktarux

Emulator Developer
Moderator
Norlin, you're missing an important detail, you need the registers to executes opcodes... I'll take your emu target example, imagine you have emu1 and 2 in target1 and 2, you want to multiply emu2 and 3, on a x86 multiplication put the result in eax:edx, you don't have choice, so it'll necessarily erase some part of your cache.
You need something more flexible and perhaps less complex in the pre optimization stage. It has to be dynamic. I suggest you to look at the document written by schibo (author of 1964). I guess it's included in the source code but i'm not sure, otherwise i have no clue on where you can find it.

Fortunately most of the times, games only use 32bits data, so there's no need to always use two target registers to cache. The MMX idea has two problems, firstly you can't do all operations on mmx registers. Secondly and it's the main problem, you can't use fpu and mmx at the same time. And the n64 games are mainly 3d games that use fpu a lot, so i think you'll lose much time by this approach.

Mupen64 has a preliminary dynarec but it's far from being functionnal yet. I'll work again on speed after the next mupen64 release that should happen really soon ;)
 

Top