What's new

Running into some walls, here

phytoporg

New member
So this summer I thought I'd take it upon myself to begin my very first emulation project-- I am writing a dynamic recompiling SNES emulator for the PSP.

I didn't quite realize when I started how ambitious this project is. A few quarters ago I took a computer organization course that dealt with MIPS, so I thought I'd be relatively prepared to jump into this. I was particularly confident since I choose to (initially, at least) take a dictionary-type translation approach. No real profiling or optimization at first. I just want something that works for now.

I now see that even without some crazy profiler/optimization utility, this is not a trivial undertaking and I'm at the point where I could really use some advice from people who have a good understanding of how this all works.

So, my first problem I guess. I spent the past two or so weeks writing "emit" functions (for example, emit_lda() ) that spit out MIPS-encoded instructions that perform 65c816-equivalent operations according to the address mode of interest. This weekend I wrote a MIPS disassembler/interpreter and they seem to do what they're supposed to do. However, these functions more or less assume that all ROM and and RAM data are stored contiguously in memory. After reading a little more about the SNES memory map (I'm only implementing LoROM support at the moment), I realize that this approach won't really work.

The way I see it, I have two options:

1) Allocate a huge wad of memory. Cut the ROM data up into 32k chunks and distribute them according to their proper bank locations (the first 32k chunk would be located at 00.8000, the second at 01.8000, etc. ).

2) Go the typical route of writing read/writemem() functions.

The obvious problem with #1 is that I'd be using a ton of memory, and then any writes to mirrored memory would have to be duplicated. #2 seems like it'd be the clear alternative, but the 65c816->MIPS translation already yields an approximate 1:20 instruction ratio. I want to avoid any unecessary function calls or additional code generation, and I figured the method might lend itself to an elegant solution of some kind. But maybe not?

Suggestions are greatly appreciated.
 
Last edited:

Exophase

Emulator Developer
I haven't done anything with SNES before but I looked at some memory maps and it has 256 64k banks (16mb total space), right? And of these banks they can be split into 4k regions corresponding to different hardware locations. Some hardware registers have to be trapped, so you can't map those directly to memory. What I would recommend doing is to allocate a 256 wide array of points to 16 wide arrays, that holds the current bank map.

When you change banks change this, and when you access memory use the upper 4 bits of the 16bit offset to determine which sub-area it's part of. In these tables either store a pointer into the memory array it's part of, or a null pointer to indicate that it has to be handled with more logic than just a memory read/write.

Check to see if the pointer is null, if it isn't offset it by the address with the lower 12bits masked and reference this location. If it is call a function to handle it.

Since this uses a kind of page table you don't have to change the way memory areas are represented in the PSP's memory and you don't have to duplicate mirrored writes or anything like that. It also only uses a relatively small amount of memory, 256 * 16 * 4 bytes (16kb)

So, 1:20 ratios? Are you sure it's even worth recompiling anymore at that point? Could you give me an example of what kind of 65c816 instructions are taking those amounts of MIPS instructions?

EDIT: I'm currently working on an emulator with the intention of porting it to PSP one day (that's why I started it) and doing a dynarec inevitably is very much planned as well.. I'd love to share ideas, if you have AIM could you IM me? Screenname's Exophase.
 
Last edited:
OP
P

phytoporg

New member
Oh, I think I saw something like that going on in the sneq source, but I couldn't completely make out what he was doing. That's a neat idea, I'll give it a shot.

The 1:20 ratio is only that way 'cause there isn't any profiling going on. For the moment, all 65c816 registers are statically mapped to the s-registers, and the translation includes encoded MIPS instructions to calculate the m and x flags, with appropriate branches to perform the necessary masking. Not to mention the fact that, with these statically mapped registers, I'm not taking any steps to avoid the clutter caused by the accumulator-based architecture.

So, for example, here's my current emit_lda function (which will need some fixing to take into consideration proper memory mapping):

Code:
void emit_lda( unsigned int **location, ADDRESS_MODE am ) {
	/**
	 * --$t3 = effective address--
	 * --$t0 = *(effective address)--
	 * 0  andi $t1, S, MFLAG
	 * 4  bne  $t1, $zero, 16
	 * 8  add  A, $zero, $t0
	 * 12 j    24
	 * 16 andi A, A, 0xFF00
	 * 20 or   A, A, $t0
	 * 24 nop
	 **/

	emit_load_op( memPtr, location, T3, am );
	emit_deref_mflag( location, T0, T3 );

	unsigned int initLoc = (unsigned int)*location;

	AND_I( location, T1, S, MFLAG );
	BNE( location, T1, ZERO, 2 );
	ADD( location, A, ZERO, T0 );
	J( location, initLoc + 24 );
	AND_I( location, A, A, 0xFF00 );
	OR( location, A, A, T0 );
	NOP( location );
}

The instruction macros look something like:
#define ADD_I( location, rt, rs, immediate ) \
**location = ITYPE( ADD_I_OP, rs, rt, immediate ); \
++(*location);

Once this all works, even if it's super slow, the emit functions are going to see a pretty big overhaul so I can work them into some kind of profiling scheme for on-the-fly instruction reordering, dynamic register allocation and general basic block optimizations. Until then, things will look a little messy.

And sure, I'll shoot you an IM when I get the chance. Right now it looks like you're idle, though. D;
 

Exophase

Emulator Developer
I'm not sure what's wrong with static allocation, assuming that you end up using more than just the s registers. You should have an entire 29 registers at your disposal; you have to preserve them appropriately across calls to C code but your recompiled could should ideally stay within either its own code or safe ASM you created yourself for as long as it can. 65c816 only has A, X, Y, S, DB, D, PB, and P, right? The first three of the last four not being accessed explicitely as often in comparison. As for the flags, if you're just going to check if they're zero or non-zero most of the time you'd may as well keep them in a P register, and only keep the ones that you calculate often (N, V, C, Z) in their own registers. You need maybe two or three others as temporaries and you'll want to keep things like a cycle counter, translated bank address, etc in registers, but you should have more than enough registers to manage all this.

I'm not sure that instruction reordering or dynamic register allocation is really going to get you much of anything and the former especially is a rather expensive optimization. Like I said, you have plenty of registers so you can map each 65c816 register to a fixed MIPS register (IE, A -> $s0, X -> $s1, etc), which is what you seem to be doing. Most likely the best optimizations you'll be able to perform is redundant flag elimination. The 8bit vs 16bit stuff looks rather expensive, you'd be well off to try to determine it at runtime as much as possible. If SNES games are usually in 16bit modes (I don't know either way) then it'd be best to have the modes be global compiletime known values and flush the entire recompilation cache when it changes. However, barring that, you can keep registers that have either an 0xFF or 0xFFFF mask in it and one with its complement and have it change depending on the M flag, and do this:

lda t0, memory_base, offset
and t0, t0, mask
and a, a, not_mask
or a, a, t0

Something I'm confused about, when you grab the MFLAG do you mean to do it from P and not S?
 

Cyberman

Moderator
Moderator
I suggest instead of forging on sitting back and look at what is going on within a typical SNES game. Quite a bit. You might want to look at SNES9X source code for some ideas as well.

1) Other things, the PSP is too small for NOT doing instruction optimization.
2) I suggest usingintermediate data instead of emitting immediately. This allows you to do some analytical clean up later on before emiting actual code. IE bank switching routines can be converted to direct code leaps etc.
3) DATA DATA DATA! Think about how the game accesses data. Why? Because you can save yourself some banking headaches this way. Banking data can be simplified the same way code can be. It's just a matter of 'seeing' what you are looking at.

Think HLE a bit too, not everything is straight forward in a bank switched or segmented beast like the SNES but you can certainly make things a bit faster. Remember most of these games were coded modularly likely using a compilor.

First pick a simple game IE one that doesn't use a lot of banking :D (yeah like FF4 (smirk).

Tales of Phantasia is a good game to test your system with. It abuses the heck out of the SNES and is the largest game apart from Star Ocean for the SNES (6Mega bytes). It's a japanese only game but there is a patch for an english translation.
 
Last edited:
OP
P

phytoporg

New member
I guess my naming convention is a little off-- I call it S for "status register" (I use SP for the stack pointer). I guess I should change that, heh.

I'm currently only mapping the s-registers to the 65c816 regs you named in addition to the program counter (I'm actually storing the PB as the high eight bits in a larger, "full" PC register). I wanted to keep a lot of t-registers around for use elsewhere, but it's becoming more apparent that I don't need very many for general use. I'm going to be redoing my emit functions to take into account the memory map anyway (I implemented your idea yesterday, by the way-- thanks a lot!), mapping more registers might help. I'm not completely sure, though, 'cause switching from the emitted code to compiled C-code would then mean more stack manipulation on a very frequent basis, and I want to keep memory traffic to a minimum.

And yeah, the 8-bit/16-bit mode switching is a pain, but unfortunately it happens very often since the switch needs to be done to access individual bytes rather than words and vice-versa. Once a profiler comes into play, it'll be possible to determine during run-time the certain states of the relevant flags in a code block to avoid the extra branching.

EDIT:
1) Other things, the PSP is too small for NOT doing instruction optimization.
2) I suggest using data instead of emitting immediately. This allows you to do some analytical clean up. IE bank switching functions can be trimmed out, and direct code leaps instead etc.
3) DATA DATA DATA! Think about how the game accesses data. Why? Because you can save yourself some banking headaches this way. Banking data can be simplified the same way code can be. It's just a matter of 'seeing' what you are looking at.

Whoah, hello. Sorry, was still writing the above when you got your post in.

1) I do intend to do instruction optimization, just later in the game. There are probably a lot of little things I can even do initially, like omitting flag calculations where they don't need to be done and that kind of thing.
2 & 3) Yeah, I put together something similar to Exophase's suggestion in his first reply, which practically eliminates the need to even acknowledge the existance of banks and fragmented data. My only issue now is dealing with the hardware registers and other locations that can't just directly be mapped somewhere.

I'm actually shooting to get super mario world working first, and I use "working" lightly, here. It's more or less the goal for the end of this summer. I'm not totally sure how realistic that is, but I'm crossing my fingers.
 
Last edited:

huarifaifa

New member
1) Other things, the PSP is too small for NOT doing instruction optimization.

Are you sure? I don't know too much about SNES nor PSP, but a 333 MHz MIPS CPU should be enough to emulate a 65c816 at 3.58 MHz (and the video, audio, etc. hardware).

I'm working on a GameBoy emulator (Z80 4.194 MHz) for my Palm Tungsten T2 (ARM 144 MHz). I don't do dynamic recompilation, just a simple interpreter written in C++ and I get 1250 fps on my P4 1.4 GHz machine (according to some benchmarks ---XviD decoding and memory speed--- the P4 1.40 GHz CPU is only 3 or 4 times faster than the ARM 144 MHz in my PDA).
 

Cyberman

Moderator
Moderator
huarifaifa said:
Are you sure? I don't know too much about SNES nor PSP, but a 333 MHz MIPS CPU should be enough to emulate a 65c816 at 3.58 MHz (and the video, audio, etc. hardware).
Unlike a PC it doesn't have 256 to 4096 megs of memory to spew data into. It is limited. Although it's memory is near to the largest of SNES ROMs in size. You still need to fit the emulator and other things within it. Think EMBEDED system.
huarifaifa said:
I'm working on a GameBoy emulator (Z80 4.194 MHz) for my Palm Tungsten T2 (ARM 144 MHz). I don't do dynamic recompilation, just a simple interpreter written in C++ and I get 1250 fps on my P4 1.4 GHz machine (according to some benchmarks ---XviD decoding and memory speed--- the P4 1.40 GHz CPU is only 3 or 4 times faster than the ARM 144 MHz in my PDA).
A game boy is a much smaller system in comparison. It's not too hard to load the contents of an entire ROM and run an interpretor. Dynamic recompilation generations code on the fly, this code may be 2 to 3 instructions for every instruction. In addition the instruction sizes are 32 bit. Your code size will double instruction for instruction almost immediately. So you may have more memory but you also have more memory concerns.

Cyb
 

Exophase

Emulator Developer
For systems with limited memory especially (like PSP) it's a good idea to only recompile frequently used blocks. StrmnNrmn takes this approach for Daedalus and the translation cache ends up being pretty small compared to the overall ROM size. This could improve speed as well, depending on how fast your recompilation is (if a lot of optimizations are thrown in you really don't want to waste time recompiling things that are only executed once or a few times, like initialization code).

Cyberman; PSP's available RAM is much larger than the largest SNES ROMs, isn't it? Plus, with bank switching required for > 4MB ROMs it's easier to load the banks on/off memstick, although I don't know what kind of speed hit this would incur. Sadly, phytoporg is currently looking at far more than 2-3 MIPS instructions per 65c816 instruction. I guess the other question is how much of an SNES ROM consists of code vs. data (my guess would be usually not much for most games, they don't get awfully complex).

huarifaifa; This is off topic, but I'm skeptical that an ARM9 based CPU at 144MHz is really only 1/4th to 1/3rd of a P4 1.4GHz, generally speaking. It sounds like you were running very memory bandwidth limited tests, which are not necessarily representative of how an interpretative emulator will behave. Most of the working data set is highly localized (registers, flags) and the locality of the emulated program itself lends itself to this as well. The ARM CPU will surely perform the same thing in significantly fewer instructions, but probably at higher cycles per instruction counts as well. I am interested to see what the comparison is like in the end though...
 
Last edited:

gladius

New member
I'd advise against a recompiler (unless you are doing it for fun, then go for it! :)). The benefits for 65c816 are small enough (8/16 mode switch playing a big part here), and the overhead for going out to memory large enough that it's not really worth it. Plus with the fun of keeping the APU synced (not to mention specialty chips) to the main core, you want to be pretty close to cycle accurate.

On the DS (a 66Mhz arm9) both snesds and snezzids use a hand written arm asm core, and it runs at 60fps on everything out there. Mind you, that is using the DS's graphics hardware to draw the tiled backgrounds and sprites, but with a 333Mhz processor a hand coded asm core should leave plenty of horsepower to spare for sound&gfx.

Edit: Just to mention why I recommend against it, debugging a recompiler is a *pain* unless you have some nice single-stepping tools. I wrote a dynarec for the SPC-700 cpu on the GBA (256kb ram, 16Mhz arm7) and it was a substantial amount of effort to get it debugged and running well.
 
Last edited:
OP
P

phytoporg

New member
At this point it's more of a research project, actually. Wanted to try something a little more challenging than an interpreter, although I'm still convinced in the end that I can squeeze a few more cycles out of this core than I would otherwise. We'll see what happens there, though. D;

I have some ideas for setting up a decent debugging environment, but I need something to debug first, heh. I don't think I'll be able to start the actual dynarec logic until early July.

Oh, that was you that wrote the SPC-700 dynarec? I've come across the project a few times googling around for dynarec references. Very impressive, I've heard that thing is a mess.
 
Last edited:

gladius

New member
Research is fun :). That's exactly why I tried doing it that way too, learned some interesting stuff.

Hehe, you've heard it is a mess huh? Whoever told you that wasn't lying. It generates some decent code, but for overhead reasons my opcode generation is a bunch of macros. This means it's pretty fast (to generate an opcode is usually 2, maybe 3 instructions) but it's hard as heck to read and code.
 

Cyberman

Moderator
Moderator
phytoporg said:
At this point it's more of a research project, actually. Wanted to try something a little more challenging than an interpreter, although I'm still convinced in the end that I can squeeze a few more cycles out of this core than I would otherwise. We'll see what happens there, though. D;

I have some ideas for setting up a decent debugging environment, but I need something to debug first, heh. I don't think I'll be able to start the actual dynarec logic until early July.

Oh, that was you that wrote the SPC-700 dynarec? I've come across the project a few times googling around for dynarec references. Very impressive, I've heard that thing is a mess.
Oh I'm sure you have had lots of fun with all our advise LOL!
I have put a few thoughts together (recently they got together and had a party :D AHEM).

Anyhow, perhaps tokenization, IE issue a stream of tokens that can be 'retranslated' to something more interesting/more compact/completely different. Sort of combining HLE with DynamicRec.
I would have to examine the instruction stream of the 65816 to see what likely combinations of instruction streams one would see. I am much more familiar with the x86 segmented archetecture than the 65816, ( ironically I wrote the backend for a pascal compilor for the x86 archetecture quite fun really).

A few ROMs would have to be examined I suppose as well. Off the top of my head the SNES system dynamic rec could be a real Pain :D, not that you haven't already noticed that.

Well anyhow have fun! This is the future of a lot of systems, beyond that of emulating game systems. (I kid you not on that reguard).

Cyb
 
OP
P

phytoporg

New member
^I ultimately do want to eliminate the extra steps caused by the differences in architecture, although I was thinking to try something more generic than tokenization. I don't know what that something could be just yet, though, heh.

By the way, anyone have thoughts on basic block identification? I was thinking of making two passes: the first pass to gather all the "labels" by finding all jump/branch instructions and storing their target addresses. The second pass uses the "labels" to partition the ROM up into label/branch pairs, which signify a basic block (obviously I'm going to limit the size of each block to a relatively small number of instructions). Everything else is assumed to execute consecutively and is split up into blocks accordingly.

[edit]Hrm, I just realized that a lot of labels can't be determined this way until the appropriate jump instruction is executed, since JMP and JSR support indirect addressing. Any ideas? :/
 
Last edited:

gladius

New member
The problem with that is code copying to fast ram, and self-modifying code, both of which are exceedingly popular techniques on the SNES. This causes your block analysis to have to be invalidated for the area which the modification happened (which you will need to do anyhow, but in your method it sounds like you'd need to reanalyze the entire program).

I took the approach of having the code generator cache blocks based on start address, and compiling a block until a branch was made, then compiling at the new PC (using the cached block if one existed). That worked out pretty well even though you have duplicated code sitting around. I ran some stats on a few games and there is a branch roughly every 5-6 instructions or so, so optimizing for small block sizes is a good idea.
 

Cyberman

Moderator
Moderator
The 65816 is sort of a CISC RISC hybrid. Why is this important or relevant?

Things that kill recompilation are:
Self modifying code (A bloody nono but many programmers haven't read the classic 'goto considered dangerous' and 'your code shouldn't refer to itself in the plural' <this is a joke by the way>).
'clever' uses of the instruction set such as twiddling the jump location of a long jump for 'fast' execution. The reason this are common is because the 6502 parentage of the 65816. There isn't a nice JMP [X + 0xFF00] type instruction available (Indexed jump to the address contained in a location). Instead some programmers did this
JMP 0 ; jump absolute location of 0000
ORG $_ - 2
CHANGE_LOCATION DW

OR
DB <JMP opcode>
CHANGE_LOCATION DW


And then modify CHANGE_LOCATION for a quick and dirty way to be sure it jumped to the right location in execution. Great for switching between FIELD and BATTLE systems for example (IE bank in BATTLE code run battle then bank in FIELD code).

Sometimes there is no clean way to do these things sadly. :/

Most decompilors take several passes first finding all jump instructions and there end locations then it begins labeling things based on that.


Tokens have the advantage of being easily filtered with a bit of program logic for end translation. Though I suppose you could do the same with the instruction stream itself. It's all a matter of perspective. The AMD x86 processors of the athalon variety do dynamic recompilation internally by grabing the instruction stream and grabing which branches are likely to be taken this prevents execution stalls and allows for recoding to the micro engine the instruction stream. Most processors of the x86 variety these days do this.

Cyb
 
OP
P

phytoporg

New member
Most decompilors take several passes first finding all jump instructions and there end locations then it begins labeling things based on that.

That's good to hear. At least the idea wasn't a bad one, heh. Again, though, jumps can have indirect addressing,

gladius: I like that suggestion, I think I'll go with it. D;

I can't believe I didn't realize this earlier, but I noticed today that many immediate instructions vary in length based on the value of the m-flag (some indexed stuff also changes based on the x-flag). That kind of trashes the direct translation approach I was taking. I'm going to have to rethink the structure of everything I've written, and start on an interpreter for some kind of hybrid/profiler action.

I spoke with exophase a bit and he suggested marking each code block with the state of the m-flag, and ending the block with either the usual branching business or at a change in the relevant flags. Since the flag state determines the length of an instruction, I think it's fair to assume that these instructions are in places where the state can be predetermined. Since these flag changes are pretty explicit, that shouldn't be too bad.

I also think I'll be taking measures to compile code only after it's been executed a certain number of times. That should cut back considerably on the translation cache size.

The big obstacle that's left is self-modifying code. For now I'm just going to let the interpreter handle anything outside of ROM address space, I guess. It's nothing I've given much thought to yet. For another time.
 

Top