Your explanation is absurd: you claim that the 1% of the time the condition isn't AL will slow things down tremendously, when the overhead in that case would be less than 2x. What the actual values are in ARM code is irrelevant, because YOUR example said 1% of the time (although claiming less than 30% AL is insane; it's probably closer to 80% or higher). Either way, the indirect branch in the switch is pretty much guaranteed to be slower (it's more of a pipeline hazard) than the conditional branch in a preemptive test, if one is warranted.
Anyway, you're assuming the compiler will make optimizations it won't. Look over the following carefully: for a value with an explicit & 0x0F in the switch expression itself, and with all 16 switch cases accounted for, GCC with -O3 will still emit the following:
Code:
andl $15, %edx
cmpl $15, %edx
ja L1
jmp *L8(,%edx,4)
That's the AND, a test to see if the value is in range, a jump if it isn't, an indirect jump, and a memory access.
Compare it to what it does for the following escape clause:
Code:
if((x & 0x0F) == 15)
{
return 10;
}
Code:
andl $15, %edx
cmpl $15, %edx
je L1
That's the same minus the indirect jump and the memory dereference. Per your example, let's say the second one occurs 99% of the time and the other 1% of the time both must occur. Call the time for the second S and the time for the first F, so the overall time is:
(0.01 * (S + F)) + (0.99 * S) = S + 0.01F
What you're saying is that the first alone, F, is smaller than (S + 0.01F). In other words, you think S is less than about 1% faster than F, when in reality it's probably a solid 2x as fast.
So now we are on about pipelines. That's nice. How does it work again? Let's take an average 1.0-2.0 GHz Intel processor. Its clock speed is irrelevant, but running that fast necessarily implies that the processor has a 10-20 stage pipeline (i.e. you might already see what I am trying to do). Let's just say it has 10, with the following stages:
[read instruction identifier][select appropriate path][operand load][operand decode][precalculation logic][memory fetch][calculation logic][state update][store result][memory write]
Of course, these are made up, but they make sense =/ earlier stages are about fetching the operation, the middle ones about executing it, and towards the end you store results if needed. For the sake of simplicity, let's say there won't be any cache misses, and that moving data from cache to the appropriate internal register takes less than 2 clock cycles.
Let's just compare the time it takes from the point when the first instruction is identified to the time when the appropriate target instruction is read into the pipeline.
1. ShizZy's jump statement, which I said his code translates to (and it does, with the compiler I compiled with):
- jmp [switch_condition___0_f+reg*4]
The branch is "unpredictable", so we will assume no branch prediction is done for it; i.e. the entire pipeline is discarded at the [state update] stage when it modifies PC, and on the next cycle the next instruction identifier is read as pointed to by PC. The time here is 8 clock cycles, plus the 7 cycles it took for the jmp instruction to reach the state update stage, plus 1 extra for the memory read: 16 cycles.
2. What you said his jmp statement was:
- cmp reg, 15
- ja NOTINRANGE
- jmp [switch_condition___0_f+reg*4]
As far as Intel's two-level shift-register branch predictor is concerned, with a 99-to-1 ratio the probability of it predicting the branch as not taken is not exactly 99/100 but a bit less. It won't be taken anyway, though, so we don't have to worry about it. So the time is just the previous figure plus 2 cycles, with no extra memory reads.
3. Now for your conditional AL-aided code. The setup will be something like:
Code:
...
cmp reg, 15
jne NOT_AL
;code for al
...
NOT_AL:
cmp reg, 14
ja NOTINRANGE
jmp [switch_condition___0_f+reg*4]
;;codes for different non-al switches
...
NOTINRANGE:
...
* If it's ARM_COND_AL:
- cmp reg, 15
- jne NOT_AL
Assuming the branch has gone untaken for a long stretch (which is highly unlikely for random code), that's just one extra cycle; more probably 2 or 3.
* If it's not ARM_COND_AL:
- cmp reg, 15
- jne NOT_AL
NOT_AL:
- cmp reg, 14
- jx NOTINRANGE
- jmp [switch_condition___0_f+reg*4]
For this, it's safe to assume the first branch is a prediction failure, because most instructions are AL. For obvious reasons the second one also fails, and the third is just the plain jump statement timed at 16 cycles. Each branch failure costs 8 cycles, plus the 2 extra cmps, giving a total of 16 + 16 + 2 = 34 clock cycles. Of course, this assumes everything is ideal, and the chance of things going wrong in the pipeline is higher for bigger code than for smaller code.
Well, let's find the break-even ratio:
16 vs. (ratio * 3) + ((1 - ratio) * 34)
Simply solve it and you get a ratio of AL > 0.5806, i.e. the AL fraction must exceed about 58% for the conditional test to be effective.
Wait...lemme double check.
Anyways, it seems that I am always right. Moving on from here: 20 * 36 + 80 * 2 = 880, which is not less than or equal to 1/2 * (16 * 100) = 800. So even with an 80/20 ratio it is not 2x or more faster than the version without the conditional, and I love exaggerating. There are other things to consider as well, and some things are slower than I assumed. Your AL shortcut does almost nothing compared to the memory accesses and extra comparisons on the other conditionals. Even then, I would not be surprised if the ratio of AL is more like 60% or less, rather than something like 80%.

With ARM, most people would use standard C/C++ compilers anyway, and most of them won't care that much about optimisation with their clock running at 300 MHz and plenty of memory. Especially game developers: they like wrapping everything in conditionals and breaking things up into small logical pieces, and compilers will most likely produce single conditionally-executed instructions for things like global values and parameters loaded into registers, because it makes little to no difference running it on the hardware. That's one of the other minor reasons people like ARM.
And the code the latest version of the free Visual C++ produces for that function is exactly what I put up there. It seems GCC does not have that optimisation yet. Well, even though it's widely used for most non-commercial purposes, there are still areas where it can improve.