|\/|-a-\/
September 7th, 2006, 16:49
it would be great to know, which docs were used by the author of desmume..... where can i find some good informations about nds like memory map, used graphics chip, nds rom file header ... ... ...
thanks, uli
Exophase
September 8th, 2006, 21:48
it would be great to know, which docs were used by the author of desmume..... where can i find some good informations about nds like memory map, used graphics chip, nds rom file header ... ... ...
thanks, uli
http://nocash.emubase.de/gbatek.htm
The best consolidated source of GBA AND DS information.
ShizZy
September 9th, 2006, 00:45
That's a niiiccccee looking doc.
|\/|-a-\/
September 9th, 2006, 11:58
WOOOOOOOOW!!!! thanks a lot!!!!!!!
i wonder, where you guys get those informations... i googled two days and found nothing!
bcrew1375
September 9th, 2006, 17:19
That's from Martin Korth. The same guy that did NO$GMB and NO$GBA.
Exophase
September 10th, 2006, 04:39
No$ has tons of docs on tons of systems, like GB, NES, C64... I have no idea where he gets all the time/energy to do these. But they're very helpful. The GBA doc is just about all you need to do a GBA emulator, although it has a few errors/missing things.
|\/|-a-\/
September 10th, 2006, 14:34
Yeah, i know why he has the time / energy... He wrote it's his main job, he earns enough money with his NO$GBA debugger.....
i don't want to write a gba emulator, but a nds emu ^^
there is an emu for each nintendo system which does its job very well: virtual gameboy, Project64/1964, zsnes ...... but there are not many very good emus for the newer systems like ngc and nds. i think it's a bit harder today, the newer systems are harder to emulate. additionally the progress of development of pc cpus has slowed down...
synch
September 10th, 2006, 15:54
Some nds emus are getting better :P Oh, and that doc from Martin Korth is simply impressive :)
|\/|-a-\/
September 10th, 2006, 17:13
yes, that describes it exactly ^^
ShizZy
September 10th, 2006, 22:34
If there was more time in a day I'd like to write an NDS emu :P
synch
September 11th, 2006, 05:24
Shizzy: if you want to contribute, the source is on the web :P And you'd no excuse (if not developing gekko): I work 8-10h a day (programming), and I even get enough motivation to still spend 4-6h (more programming) on desmume :) Anyway, kudos for gekko :)
ShizZy
September 11th, 2006, 23:31
:D Thanks
Shizzy: if you want to contribute, the source is on the web :P I'm waiting for you to release the source for your updates :) But, in the meantime, I've started my own DS/GBA emulator, not sure how far I get.
*ShizZy waits for synch to make his video core in plugin form so he can use it too :bouncy:
synch
September 12th, 2006, 00:22
:D Thanks
I'm waiting for you to release the source for your updates :) But, in the meantime, I've started my own DS/GBA emulator, not sure how far I get.
Nice, but remember to get some work on gekko too :D
*ShizZy waits for synch to make his video core in plugin form so he can use it too :bouncy:
Emh... First I think I'll try to get stuff working better (and faster) than now :P
ShizZy
September 12th, 2006, 00:26
Hehe..
Anyone have a backup of the latest GBATek? NoCash site seems to be down for a few days now :(
synch
September 12th, 2006, 00:39
I've one copy, of course.
I guess Martin Korth won't get annoyed for me putting this as an attachment. If not I'll just delete the attachment :)
ShizZy
September 12th, 2006, 00:55
Merci
|\/|-a-\/
September 12th, 2006, 17:09
Hey synch, are the docs by Martin everything you use?
synch
September 12th, 2006, 18:49
Hey synch, are the docs by Martin everything you use?
Mostly, but I also look at other gba documentation from time to time. Also, the unofficial devkits are really helpful. And coding your own examples, to test/debug single features :P
|\/|-a-\/
September 14th, 2006, 15:51
hmmm.... where are the nds entries??? martin has removed them...
http://work.de/nocash/gbatek.htm
synch
September 14th, 2006, 19:19
Next time use google a bit more, the first link is the old page, the new is at:
http://nocash.emubase.de/gbatek.htm
|\/|-a-\/
September 14th, 2006, 20:20
oh... sure.. i thought i used the work.de link everytime... *sry*
ShizZy
September 15th, 2006, 01:36
synch: that link has been down, hasn't it?
synch
September 15th, 2006, 02:48
shizzy: yes, for a few days, but as I had it cached (and a copy on my hd), I didn't notice. Anyway, having a local copy doesn't hurt that much (for reading it while the internet connection is down or so).
|\/|-a-\/
September 16th, 2006, 11:40
hi, i've two questions:
1. AND{cond}{S}, what is the S good for?
2. i can't find the "main cpu loop" - you know, fetching opcodes, decoding, executing - in the source of desmume released by yopyop... i think, synch can help me ^^
synch
September 16th, 2006, 18:47
1. Afaik, if {S}, the opcode modifies flags, but my knowledge of arm cpus is vague.
2. It's in armcpu.cpp, I think. It's quite clear where :P
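To make point 1 a bit more concrete, here is a rough sketch of what the S bit (bit 20 of a data-processing opcode) means for an interpreter. The names cpu_t, op_and and set_flags are made up for illustration and are not taken from desmume. It also ignores the special case where Rd is R15.

```c
#include <stdint.h>

/* Hypothetical CPU state, for illustration only (not desmume's layout). */
typedef struct {
    uint32_t r[16];
    uint32_t cpsr;
} cpu_t;

#define PSR_N 0x80000000u  /* CPSR bit 31: negative */
#define PSR_Z 0x40000000u  /* CPSR bit 30: zero */

/* AND{S} Rd, Rn, op2. set_flags is the S bit taken from the opcode. */
static void op_and(cpu_t *cpu, int rd, int rn, uint32_t op2, int set_flags)
{
    uint32_t result = cpu->r[rn] & op2;
    cpu->r[rd] = result;

    if (set_flags) {
        /* ANDS: N and Z follow the result. */
        cpu->cpsr &= ~(PSR_N | PSR_Z);
        if (result & 0x80000000u)
            cpu->cpsr |= PSR_N;
        if (result == 0)
            cpu->cpsr |= PSR_Z;
        /* C would come from the shifter carry-out (omitted here),
           V is left untouched by logical operations. */
    }
}
```

So plain AND only writes the register, while ANDS writes the register and the flags, which is what lets a following conditional instruction react to the result.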
|\/|-a-\/
September 16th, 2006, 18:56
sure but i can't find it in armcpu.cpp .........
i realised that i don't understand what the cp15 is and how it's used...
|\/|-a-\/
September 17th, 2006, 09:47
it's in ARM_CPU.h of course, because it's inline
ShizZy
September 18th, 2006, 03:54
I have an emu framework setup, a very basic mmu, and a very basic cpu layout. :) That was about a week ago, haven't worked on it since. We'll see how it goes I guess..
ShizZy
September 18th, 2006, 04:08
Ho humm.. how do you guys handle instruction conditions? My quick dirty lame attempt is to call this function at the beginning of every op, then execute if it returns true:
// Checks if Instruction is Conditionally True
static inline int CheckInstructionCondition(u32* _psr, u8 _condition)
{
switch(_condition & 0xf)
{
case ARM9_COND_EQ: return ((*_psr & PSR_Z) != 0); // Z set
case ARM9_COND_NE: return ((*_psr & PSR_Z) == 0); // Z clear
case ARM9_COND_CS: return ((*_psr & PSR_C) != 0); // C set
case ARM9_COND_CC: return ((*_psr & PSR_C) == 0); // C clear
case ARM9_COND_MI: return ((*_psr & PSR_N) != 0); // N set
case ARM9_COND_PL: return ((*_psr & PSR_N) == 0); // N clear
case ARM9_COND_VS: return ((*_psr & PSR_V) != 0); // V set
case ARM9_COND_VC: return ((*_psr & PSR_V) == 0); // V clear
case ARM9_COND_HI: return (((*_psr & PSR_C) != 0) && ((*_psr & PSR_Z) == 0)); // C set and Z clear
case ARM9_COND_LS: return (((*_psr & PSR_C) == 0) || ((*_psr & PSR_Z) != 0)); // C clear or Z set
case ARM9_COND_GE: return ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) == 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) != 0))); // N equals V
case ARM9_COND_LT: return ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) != 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) == 0))); // N does not equal V
case ARM9_COND_GT: return (((*_psr & PSR_Z) == 0) && ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) == 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) != 0)))); // Z clear and N equals V
case ARM9_COND_LE: return (((*_psr & PSR_Z) != 0) || ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) != 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) == 0)))); // Z set or N does not equal V
case ARM9_COND_NV: Log(100, ".ARM9Interpreter: Unimplemented NV Opcode Condition!\n"); // Special
case ARM9_COND_AL: return (1); // Always (unconditional)
}
return 0;
}
Doesn't seem very effective.
synch
September 18th, 2006, 04:59
it's in ARM_CPU.h of course, because it's inline
No, just the instruction decode (INSTRUCTION_INDEX macro) and the test condition (TEST_COND macro), but they're macros to make the opcode execution readable :P
how do you guys handle instruction conditions?
Desmume seems to check all the possible combinations. You can check the TEST_COND macro in arm_cpu.h and check how it's used in arm_cpu.cpp. I haven't even checked if this could be improved, as I'm mainly focused on the 3D core and compatibility fixes, rather than speed, atm.
BTW: why I'm member of the month? Is that a random thingie?
ShizZy
September 18th, 2006, 23:44
Yes, member of the month is random.
Exophase
September 19th, 2006, 08:26
Ho humm.. how do you guys handle instruction conditions? My quick dirty lame attempt is to call this function at the beginning of every op, then execute if it returns true: (snipped)
For an interpretive emulator you have to check the condition code for every instruction. It might be worthwhile to only perform the check if the condition is not AL (this check is slightly cheaper than entering the switch), VBA does this. Now, as to how those checks are performed..
You're generally better off keeping the computational result flags (N, Z, C, and V) cached in separate variables, and extracting these from the CPSR register after msr instructions/mode changing returns, etc. (in addition to when you first enter the interpretive loop). Likewise, you'd be collapsing them to CPSR upon mrs instructions and so forth. These flags are modified and accessed significantly more in isolation than in CPSR representation.
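As a minimal sketch of that extract/collapse idea (the function names are only illustrative, and the flag variables are assumed to hold strictly 0 or 1):

```c
#include <stdint.h>

/* Cached copies of the result flags, each strictly 0 or 1. */
static uint32_t n_flag, z_flag, c_flag, v_flag;

/* Split N, Z, C, V (CPSR bits 31..28) out into the cached variables,
   e.g. after an MSR or when entering the interpreter loop. */
static void extract_flags(uint32_t cpsr)
{
    n_flag = (cpsr >> 31) & 1;
    z_flag = (cpsr >> 30) & 1;
    c_flag = (cpsr >> 29) & 1;
    v_flag = (cpsr >> 28) & 1;
}

/* Fold the cached flags back into a CPSR value, e.g. for an MRS.
   The lower 28 bits of cpsr are passed through unchanged. */
static uint32_t collapse_flags(uint32_t cpsr)
{
    return (cpsr & 0x0FFFFFFFu) |
           (n_flag << 31) | (z_flag << 30) |
           (c_flag << 29) | (v_flag << 28);
}
```

The ALU handlers then only ever touch the four small variables, and the packed CPSR is rebuilt in the comparatively rare places that need it.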
For the actual condition checking there isn't anything that can be done beyond what's obvious (GBATek's logical combinations). However, since the values are strictly boolean, you're best off using purely binary/arithmetic operations (as opposed to logical ones). Furthermore, you can use ^ (binary XOR) instead of !=, which may be or may not be faster (depends on platform. On x86 != seems to be about the same or better)
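A sketch of what that could look like, assuming the flags are passed in as plain 0/1 values. The function names are hypothetical, the condition numbers are the standard ARM encodings (EQ = 0 ... AL = 14), and the 0xF case is simply treated as "don't execute" here, which a real ARM9 core would have to handle properly.

```c
#include <stdint.h>

/* Evaluate an ARM condition field from flags held as 0/1 values,
   using only bitwise operations on them. */
static uint32_t condition_passes(uint32_t cond, uint32_t n, uint32_t z,
                                 uint32_t c, uint32_t v)
{
    switch (cond & 0xF) {
    case 0x0: return z;                  /* EQ: Z set */
    case 0x1: return z ^ 1;              /* NE: Z clear */
    case 0x2: return c;                  /* CS: C set */
    case 0x3: return c ^ 1;              /* CC: C clear */
    case 0x4: return n;                  /* MI: N set */
    case 0x5: return n ^ 1;              /* PL: N clear */
    case 0x6: return v;                  /* VS: V set */
    case 0x7: return v ^ 1;              /* VC: V clear */
    case 0x8: return c & (z ^ 1);        /* HI: C set and Z clear */
    case 0x9: return (c ^ 1) | z;        /* LS: C clear or Z set */
    case 0xA: return (n ^ v) ^ 1;        /* GE: N equals V */
    case 0xB: return n ^ v;              /* LT: N differs from V */
    case 0xC: return (z | (n ^ v)) ^ 1;  /* GT: Z clear and N equals V */
    case 0xD: return z | (n ^ v);        /* LE: Z set or N differs from V */
    case 0xE: return 1;                  /* AL: always */
    default:  return 0;                  /* 0xF: special, not handled here */
    }
}

/* Early-out for AL before paying for the switch. */
static uint32_t should_execute(uint32_t opcode, uint32_t n, uint32_t z,
                               uint32_t c, uint32_t v)
{
    uint32_t cond = opcode >> 28;
    if (cond == 0xE)
        return 1;
    return condition_passes(cond, n, z, c, v);
}
```

Whether the early-out actually pays off depends on the compiler and the host CPU, so it is worth measuring rather than assuming.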
Now, you didn't ask for these in particular, but you might find them helpful.. let me know how you think these compare with what you're already using:
#define calculate_c_flag_sub(dest, src_a, src_b) \
c_flag = ((unsigned)src_b <= (unsigned)src_a)
#define calculate_v_flag_sub(dest, src_a, src_b) \
v_flag = ((signed)src_b > (signed)src_a) != ((signed)dest < 0)
#define calculate_c_flag_add(dest, src_a, src_b) \
c_flag = ((unsigned)dest < (unsigned)src_a)
#define calculate_v_flag_add(dest, src_a, src_b) \
v_flag = ((signed)dest < (signed)src_a) != ((signed)src_b < 0)
These should perform significantly better than versions I've typically seen used that perform binary logic on the sign bits and what have you. These perform especially well on MIPS (it is possible to update all 4 flags in only 6 instructions). Again, ^ can be used instead of != where applicable.
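For anyone who wants to try the comparison-based flags in isolation, here is a small sketch that wraps the same expressions in two functions. adds and subs are hypothetical helper names, there is no carry-in (so this is ADD/SUB, not ADC/SBC), and the casts to int32_t assume the usual two's complement conversion.

```c
#include <stdint.h>

static uint32_t c_flag, v_flag;

/* dest = a + b, with C and V derived from comparisons only. */
static uint32_t adds(uint32_t a, uint32_t b)
{
    uint32_t dest = a + b;
    c_flag = (dest < a);   /* unsigned wrap-around means carry out */
    v_flag = ((int32_t)dest < (int32_t)a) != ((int32_t)b < 0);
    return dest;
}

/* dest = a - b. On ARM, C set after a subtraction means "no borrow". */
static uint32_t subs(uint32_t a, uint32_t b)
{
    uint32_t dest = a - b;
    c_flag = (b <= a);
    v_flag = ((int32_t)b > (int32_t)a) != ((int32_t)dest < 0);
    return dest;
}
```

For example, adds(0x7FFFFFFF, 1) leaves C clear and V set, while subs(5, 5) leaves C set (no borrow) and V clear.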
I have a few other (pretty typical..) optimization techniques for ARM interpreters, nothing very amazing. Let me know if you're ever interested in doing a recompiler...
|\/|-a-\/
September 19th, 2006, 13:35
hey shizzy, can i have your source, i'm very interested in a nds emu source which is in a very early state!
hey exophase, why are programmers like you always mixing c++ code and c #define macros, which are imo very obsolete because you can use inline functions? the person who taught me some c++ stuff said that i should not use those #defines :)
ps.: i think it's a good idea to make this ds emulation thread sticky like the others ^^
Exophase
September 19th, 2006, 14:34
Okay, first of all, the code I pulled this from is 100% C (C99, which does include inline), don't know why you assumed I was using C++. Second, unlike what some people tried to jam down my throat, inlines are NOT as general as macros. Don't use macros if you don't want to, chances are you'll only be using them for things in which inlines ARE better. I use macros because they can generate anything you want. I build functions out of macros. I access the same set of variables across macros. I use the C preprocessor's name pasting, IE:
#define arm_access_memory(access_type, direction, adjust_op, mem_type, \
offset_type) \
{ \
arm_data_trans_##offset_type(adjust_op, direction); \
arm_access_memory_##access_type(mem_type ); \
} \
Personally, macros have never given me a hard time in terms of the popular caveats (not type safe, don't require actual arguments, multiple evaluation). I'm a low level programmer, I see through things like this easily. The only headaches they've given me is that in C they're not truly multi-line and they're much harder to debug (or sometimes, even get to compile).
Summary: macros are not obsolete if you use C or C++ because inline functions can't accomplish the same thing. And anyone who tells you that what I do is "undefined" by the standard or some garbage like that is full of it.
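A tiny, self-contained illustration of the "build functions out of macros" point, using name pasting. DEFINE_ALU_OP and the alu_* names are invented for this sketch and have nothing to do with the real emulator source.

```c
#include <stdint.h>

/* One macro stamps out a whole family of functions. The operator itself is
   a macro argument, which an inline function cannot take as a parameter. */
#define DEFINE_ALU_OP(name, op)                          \
    static uint32_t alu_##name(uint32_t a, uint32_t b)   \
    {                                                    \
        return a op b;                                   \
    }

DEFINE_ALU_OP(orr, |)    /* defines alu_orr() */
DEFINE_ALU_OP(eor, ^)    /* defines alu_eor() */
DEFINE_ALU_OP(bic, & ~)  /* defines alu_bic(): a & ~b */
```

The ## operator glues the macro argument onto alu_, so each invocation defines a differently named function with the same body shape.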
Here's another example to illustrate the usefulness of macros:
// These must be statically declared arrays (ie, global or on the stack,
// not dynamically allocated on the heap)
#define file_read_array(filename_tag, array) \
file_read(filename_tag, array, sizeof(array))
#define file_write_array(filename_tag, array) \
file_write(filename_tag, array, sizeof(array))
Even though this is restricted (and in such a way that if you use it incorrectly it'll mess up your program rather than give you an error) it is something that as far as I'm aware you simply cannot do with inline functions at all. This pretty much characterizes macros in general: they're more powerful and can save you a lot of typing compared to inlines (where you have to at least pass variables around, in C by address (there's no pass by reference in C) and you're restricted by type) but are more dangerous, so should only be used when the person really knows what they're doing. So a lot of teachers will tell their C/C++ students to use inlines because said students DON'T know what they're doing, or because the teachers learned it that way themselves and don't know any better.
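To show that restriction in action, here is a small sketch. fake_file_write is a hypothetical stand-in that only reports the size it was handed, so the difference between a real array and a pointer is visible.

```c
#include <stddef.h>

/* Hypothetical stand-in for file_write(): returns the byte count it was given. */
static size_t fake_file_write(const void *data, size_t size)
{
    (void)data;
    return size;
}

#define write_array(array) fake_file_write(array, sizeof(array))

static size_t demo_static(void)
{
    static unsigned char vram[1024];
    return write_array(vram);   /* sizeof sees the whole array: 1024 */
}

static size_t demo_pointer(void)
{
    static unsigned char vram[1024];
    unsigned char *p = vram;
    return write_array(p);      /* sizeof sees only the pointer: silently wrong */
}
```

The second call compiles without complaint but would write only a pointer's worth of bytes, which is exactly the "mess up your program rather than give you an error" case.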
This is what tends to happen in general with programming, how "safe" something is determines whether or not it should be used, regardless of its power. I think that should be left to the programmer's discretion, and it shouldn't simply be assumed that problems like the common ones with macros will catch EVERY programmer (I've heard statements like this before. I've used macros very extensively and have never been caught in any of the typical traps except for a precedence mistake, after which I learned my lesson pretty completely).
Macros also have the benefit of being easy to expand at compile time, if you want to see what the actual generated code looks like. This can be useful for finding some hidden performance killers you missed.
|\/|-a-\/
September 19th, 2006, 16:12
well, i didn't say i was told it is "undefined" by the standard... the "multiline problem" mainly annoys me... but do what you want. don't take my comment personally, i asked you, because i saw those #defines so often and now i finally wanted to know why ^^
|\/|-a-\/
September 21st, 2006, 16:26
how do you emu programmers manage the two cpus? i don't know where to start...
(still want an answer ^^)
ShizZy
September 22nd, 2006, 00:54
@Exophase, very nice explanation, thank you. Here are the functions I scratched together for calculating the flags:
// Computes the Negative and Zero bits of a Program status register
static inline void ComputeNegativeZero(u32* _psr, u32 _operandA)
{
// Check if Negative
if(_operandA & 0x80000000)
ARM9_SET_BIT(*_psr, PSR_N); // Negative
else
ARM9_RESET_BIT(*_psr, PSR_N); // Positive
// Check if Zero
if(_operandA == 0)
ARM9_SET_BIT(*_psr, PSR_Z); // Zero
else
ARM9_RESET_BIT(*_psr, PSR_Z); // No Zero
}
// Computes the Carry bit of a Program status register - Addition Opcode
static inline void ComputeCarryAddition(u32* _psr, u32 _operandA, u32 _operandB)
{
// Check for Carry
if((0xffffffff - _operandA) < _operandB)
ARM9_SET_BIT(*_psr, PSR_C); // Carry
else
ARM9_RESET_BIT(*_psr, PSR_C); // No Carry
}
// Computes the Carry bit of a Program status register - Subtraction Opcode
static inline void ComputeCarrySubtraction(u32* _psr, u32 _operandA, u32 _operandB)
{
// Check for Borrow (equal operands do not borrow, so C must be set then too)
if(_operandA >= _operandB)
ARM9_SET_BIT(*_psr, PSR_C); // No Borrow
else
ARM9_RESET_BIT(*_psr, PSR_C); // Borrow
}
// Computes the Carry bit of a Program status register - Shift left Opcode
static inline void ComputeCarryShiftleft(u32* _psr, u32 _operandA, u32 _operandB)
{
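u32 shift;
// A shift by 0 leaves the carry flag unchanged (and would shift by -1 below)
if(_operandB == 0)
return;
// Shifting left by more than 32 always carries out a 0
if(_operandB > 32)
{
ARM9_RESET_BIT(*_psr, PSR_C);
return;
}
shift = _operandA << (_operandB - 1); // 32bit for speed... carry-out lands in bit 31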
// Check for Carry
if(shift & 0x80000000)
ARM9_SET_BIT(*_psr, PSR_C); // High
else
ARM9_RESET_BIT(*_psr, PSR_C); // Low
}
...And so on and so forth. A bit messy, I know, but should work (I think).
@|\/|-a-\/, trust me, you don't want my source :P When it runs something, maybe, but right now it's just a very boring framework. I don't even have any instructions coded, haven't had much time to work on it.
|\/|-a-\/
September 22nd, 2006, 17:53
@shizzy: sure it's a boring framework, but (as mentioned) i do not really know where to start... that's the cause why i'm interested in the first lines of your emulator, otherwise i could take yopyop's desmume source...
|\/|-a-\/
September 30th, 2006, 15:37
how are halfwords written into memory |H|H|L|L| or |L|L|H|H| ?
how are words written into memory?
blueshogun96
September 30th, 2006, 19:34
how are halfwords written into memory |H|H|L|L| or |L|L|H|H| ?
how are words written into memory?
Wouldn't that depend on the endianness? :unsure:
|\/|-a-\/
October 1st, 2006, 17:05
yes, that is actually what i want to know...
Exophase
October 2nd, 2006, 03:07
how are halfwords written into memory |H|H|L|L| or |L|L|H|H| ?
how are words written into memory?
Nintendo DS and GBA are little endian. Low bytes are written first.
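In emulator terms that means something like the sketch below, if your memory is a plain byte array. ram, write16 and write32 are made-up names, and there is no bounds or alignment checking.

```c
#include <stdint.h>

static uint8_t ram[8];  /* hypothetical tiny block of emulated memory */

/* Store a halfword: low byte at the lower address. */
static void write16(uint32_t addr, uint16_t value)
{
    ram[addr]     = (uint8_t)(value & 0xFF);
    ram[addr + 1] = (uint8_t)(value >> 8);
}

/* Store a word: lowest byte first, highest byte last. */
static void write32(uint32_t addr, uint32_t value)
{
    ram[addr]     = (uint8_t)(value & 0xFF);
    ram[addr + 1] = (uint8_t)((value >> 8) & 0xFF);
    ram[addr + 2] = (uint8_t)((value >> 16) & 0xFF);
    ram[addr + 3] = (uint8_t)(value >> 24);
}
```

So write16(0, 0x1234) puts 0x34 at address 0 and 0x12 at address 1, i.e. the |L|L|H|H| layout from the question. Writing byte by byte like this also keeps the emulator correct on a big endian host.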
blueshogun96
October 3rd, 2006, 18:00
btw exophase, you weren't kidding when you said that arm was worse than x86! I'll never complain about x86 ever again! :plain:
ector
October 4th, 2006, 23:51
Here's another example to illustrate the usefulness of macros:
// These must be statically declared arrays (ie, global or on the stack,
// not dynamically allocated on the heap)
#define file_read_array(filename_tag, array) \
file_read(filename_tag, array, sizeof(array))
#define file_write_array(filename_tag, array) \
file_write(filename_tag, array, sizeof(array))
You can do this with inline. A first draft looks like this (i guess your tags are ints):
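template <class T, size_t size>
inline void file_write_array(int filename_tag, const T (&array)[size])
{
// array is taken by reference so it does not decay to a pointer and sizeof covers the whole array
file_write(filename_tag, (void*)array, sizeof(array));
}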
The template parameters will be auto filled in when using it, so you can call it just like your macro.
Not sure how much you gain in type safety in this case, though :P
Exophase
October 5th, 2006, 19:43
That's not C's inline. That's C++, using templates. Not comparable for the programmer using C. Furthermore, filename_tag may be of varying types (FILE * or int in my case).
civilian0746
October 6th, 2006, 02:47
void? file_write_array(int filename_tag, const void* array, int sizeeee)
{
file_write(filename_tag, array, sizeeee);
}
Should be more flexible and should work. Pass the array, which is a pointer, as the second parameter and its size as 3rd.
mono
October 7th, 2006, 22:02
How does that help, civilian? He doesn't want to have to pass the size to file_write_array. You basically missed the point...
civilian0746
October 8th, 2006, 02:13
Well, even with static arrays, you would have to access the size at some point anyway. It won't produce any smaller or faster code. But if he only wants to work with static arrays and doesn't want to pass the size, preprocessor macros as Exophase suggested would be the only option short of going into c++, unless the size is constant.
Exophase
October 8th, 2006, 19:22
Producing smaller/faster code isn't the only motivation to do anything D: Even if it's your #1 priority it's still helpful to actually have smaller SOURCE when possible, and not having to put in the size for things like this reduces some potential errors (of course you have to know not to do it when you're not supposed to..) For what it's worth, in an emulator it's possible and perhaps preferable to have most, if not all of your data statically sized.
I think the ironic thing is that your code, if not using macros/inlines, would actually produce slower code! And it wouldn't be smaller since you're just wrapping a function call of the same number of parameters. In fact, your code isn't doing anything other than being redundant...
civilian0746
October 9th, 2006, 12:56
Well, I did not look into where this code belongs. Most modern compilers with the smallest hint of optimisation would inline something like that without you having to instruct them, unless it is accessed from outside sources or its address is taken. Speed/performance is rarely about inlining/macroing everything on all parts of the program. If someone did that, they'd surely end up with a mess rather than faster code. I'll give you an example of something I picked up in this thread: You said something like checking for some "condition is not AL (this check is slightly cheaper than entering the switch" inside that function. The code he had would be at least as fast or most probably faster than if he had that conditional for AL even if 99% of the instructions he parsed had the conditional AL!
// Checks if Instruction is Conditionally True
static inline int CheckInstructionCondition(u32* _psr, u8 _condition)
{
switch(_condition & 0xf)
{
case ARM9_COND_EQ: return ((*_psr & PSR_Z) != 0); // Z set
case ARM9_COND_NE: return ((*_psr & PSR_Z) == 0); // Z clear
case ARM9_COND_CS: return ((*_psr & PSR_C) != 0); // C set
case ARM9_COND_CC: return ((*_psr & PSR_C) == 0); // C clear
case ARM9_COND_MI: return ((*_psr & PSR_N) != 0); // N set
case ARM9_COND_PL: return ((*_psr & PSR_N) == 0); // N clear
case ARM9_COND_VS: return ((*_psr & PSR_V) != 0); // V set
case ARM9_COND_VC: return ((*_psr & PSR_V) == 0); // V clear
case ARM9_COND_HI: return (((*_psr & PSR_C) != 0) && ((*_psr & PSR_Z) == 0)); // C set and Z clear
case ARM9_COND_LS: return (((*_psr & PSR_C) == 0) || ((*_psr & PSR_Z) != 0)); // C clear or Z set
case ARM9_COND_GE: return ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) == 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) != 0))); // N equals V
case ARM9_COND_LT: return ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) != 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) == 0))); // N does not equal V
case ARM9_COND_GT: return (((*_psr & PSR_Z) == 0) && ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) == 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) != 0)))); // Z clear and N equals V
case ARM9_COND_LE: return (((*_psr & PSR_Z) != 0) || ((((*_psr & PSR_N) == 0) && ((*_psr & PSR_V) != 0)) ||
(((*_psr & PSR_N) != 0) && ((*_psr & PSR_V) == 0)))); // Z set or N does not equal V
case ARM9_COND_NV: Log(100, ".ARM9Interpreter: Unimplemented NV Opcode Condition!\n"); // Special
case ARM9_COND_AL: return (1); // Always (unconditional)
}
return 0;
}
Exophase
October 9th, 2006, 17:15
What is your rationale for this? GCC, at the very least, will generate for the switch a boundary check (to make sure it's within the cases, even with the & - if you don't believe me look at the emitted ASM yourself, I'm certain about this), an AND, a table lookup, and a jump. That is more expensive than a single boundary check.
I'd especially like to know how you think a switch would be FASTER than an if.
civilian0746
October 10th, 2006, 00:36
Well, I haven't worked much with the ARM family. Only once: the night before the day I had to submit a report on the ARM ISA, and I don't tend to remember things I learn at school very well. But I do remember the codes for those COND_* codes. Well, his code was enough to remind me. The reason I would say why those numbers were picked is so that ARM can quickly demultiplex to enable appropriate microoperations without using complex circuitry. I am sure the compiler does something like it as well and you said something like that. But the asm his code would generate is rather something like this:
CheckInstructionCondition: _condition: reg
and reg, 0xf; code to fetch the parameter unless inlined or other situations..
jmp [switch_condition___0_f+reg*4]
ARM9_COND_EQ:
;code
;return or jump to next bit of code if inline
ARM9_COND_NE:
;code
;return or jump to next bit of code if inline
ARM9_COND_CS:
;code
;return or jump to next bit of code if inline
ARM9_COND_CC:
;code
;return or jump to next bit of code if inline
ARM9_COND_MI:
;code
;return or jump to next bit of code if inline
ARM9_COND_PL:
;code
;return or jump to next bit of code if inline
ARM9_COND_VS:
;code
;return or jump to next bit of code if inline
ARM9_COND_VC:
;code
;return or jump to next bit of code if inline
ARM9_COND_HI:
;code
;return or jump to next bit of code if inline
ARM9_COND_LS:
;code
;return or jump to next bit of code if inline
ARM9_COND_GE:
;code
;return or jump to next bit of code if inline
ARM9_COND_LT:
;code
;return or jump to next bit of code if inline
ARM9_COND_GT:
;code
;return or jump to next bit of code if inline
ARM9_COND_LE:
;code
;return or jump to next bit of code if inline
ARM9_COND_NV:
;code
;return or jump to next bit of code if inline
ARM9_COND_AL:
;code
;return if not inline
switch_condition___0_f:
dd ARM9_COND_EQ
dd ARM9_COND_NE
dd ARM9_COND_CS
dd ARM9_COND_CC
dd ARM9_COND_MI
dd ARM9_COND_PL
dd ARM9_COND_VS
dd ARM9_COND_VC
dd ARM9_COND_HI
dd ARM9_COND_LS
dd ARM9_COND_GE
dd ARM9_COND_LT
dd ARM9_COND_GT
dd ARM9_COND_LE
dd ARM9_COND_NV
dd ARM9_COND_AL
Basically, it will multiply (_condition & 0xf) by 4, which is just a left shift by 2, add it to another number and then dereference that number to get where to jump to. One jump, one memory access -> end of the story. The reason it won't add that conditional you were talking about is because there is no point in adding that conditional and the people who code compilers are really really smart.
But if there was a conditional, it would need jumps to jump to the AL case or the rest of the cases. And then when it's near the rest of the cases, it will need to add another conditional around the switch because it no longer has all the possible values. Then you need another extra memory read and jump to implement the switch anyway. The 1% non-AL code will really make it slow. Well, on arm9, he will be doing this to millions of instructions every second and I think that less than 30% of them will be AL. You should know that branching is even slower than memory access in this case.
Exophase
October 10th, 2006, 07:01
Your explanation is absurd, claiming that the 1% of the time it isn't AL will slow it down tremendously when the overhead for that time would be less than 2x. What the actual values are in ARM code is irrelevant, because YOUR example only said for 1% of the time (saying less than 30% AL is insane though, it's probably closer to 80% or higher AL). Either way, the indirect branch in the switch is pretty much guaranteed slower (is more of a pipeline hazard) than the conditional branch in a preemptive test, if it's worthy.
Anyway, you're assuming that the compiler will make optimizations it won't. Look over the following clearly, for a value with a clear & 0x0F in the switch itself and with all 16 switch cases accounted for GCC with -O3 will still emit the following:
andl $15, %edx
cmpl $15, %edx
ja L1
jmp *L8(,%edx,4)
That's the AND, a test to see if it's in range, a jump if it isn't, an indirect jump, and a memory access.
Compare it to what it does for the following escape clause:
if((x & 0x0F) == 15)
{
return 10;
}
andl $15, %edx
cmpl $15, %edx
je L1
That's less an indirect jump and a memory dereference. Per your example, let's say that the second one occurs 99% of the time and the other 1% of the time both must occur. The time for the second we'll call S, the first F, so we have overall time of:
(0.01 * (S + F)) + (0.99 * S) = S + 0.01F
What you're saying is that the first only, or F, is smaller than (S + 0.01F). In other words, you think that S is less than 1% faster than F, when in reality it's probably a solid 2x as fast.
civilian0746
October 10th, 2006, 16:40
So now we are on about pipelines. That's nice. How does it work again: Let's take an average 1.0-2.0ghz Intel processor. Its clock speed is irrelevant but running that fast necessarily implies that this processor has a 10-20 stage pipeline. I.e. you might already know what I am trying to do. Let's just say it has 10. Let's say it has the following stages:
[read instruction identifier][select appropriate path][operand load][operand decode][precalculation logic][memory fetch][calculation logic][state update][store result][memory write]
Of course, made up. But makes sense =/ earlier stages are about getting the operation, middle about executing it and towards the end you store results if needed. For the sake of simplicity, lets just say there wont be any cache miss. Say the time it takes to move data from cache to appropriate internal register is less than 2 clock cycles.
Let's just compare the time it takes from the point when the first instruction is identified to the time when the appropriate instruction is read into the pipeline.
1. ShizZy's jump statement that I said his code translates to which it does with the compiler I compiled with:
- jmp [switch_condition___0_f+reg*4]
The branch is "unpredictable" and hence we will assume that it will not do any branch prediction for this. I.e. The entire pipeline will be discarded on the "[state update]" when it modifies PC and on the next cycle it will read the next instruction identifier as pointed by PC. The time here is 8 clock cycles + the time it took for the jmp instruction to get to the state update stage which is 7 cycles + 1 extra one for memory read = 16 cycles.
2. What you said his jmp statement was:
- cmp reg, 15
- ja NOTINRANGE
- jmp [switch_condition___0_f+reg*4]
As far as intel's 2 level shift register branch predictor is concerned, with 99 to 1 ratio, the possibility of it predicting the branch not to be taken is not exactly 99/100 but a bit less. But it wont be taken anyway so we dont have to worry about it. So the time is just what it took for previous one + 2 cycles + no extra memory reads.
3. Now its your conditional AL aided code:
the setup will be something like:
...
cmp reg, 15
jne NOT_AL
;code for al
...
NOT_AL:
cmp reg, 14
ja NOTINRANGE
jmp [switch_condition___0_f+reg*4]
;;codes for different non-al switches
...
NOTINRANGE:
...
* If its ARM_COND_AL:
- cmp reg, 15
- jne NOT_AL
Assuming the branch is not taken for a long time which is highly unlikely for random code: is just one extra cycle. Most probably 2 or 3.
* If its not ARM_COND_AL:
- cmp reg, 15
- jne NOT_AL
NOT_AL:
- cmp reg, 14
- jx NOTINRANGE
- jmp [switch_condition___0_f+reg*4]
For this, its safe to assume that the first bit is branching prediction failure because most instructions are AL. For obvious reasons, the second one would also fail. and the 3rd one is just that plain jump statement whose timing is 16 cycles. Each branch failure costs 8 cycles + 2 extra cmps gives a total of 16 + 16 + 2 = 34 clock cycles. Of course, this is assuming everything is ideal and chances of things going wrong in pipelining with bigger code is more than smaller code.
Well, let's find a decent ratio:
16 vs (ratio * 3) + ((1 - ratio) * 34)
Simply solve it and you get a ratio of AL > 0.5806, which is about 58%, for that to be effective.
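The break-even point can be checked mechanically. A minimal sketch in C, where the cycle counts (16, 3 and 34) are the estimates from this post rather than measurements, and expected_cost and break_even_ratio are hypothetical helper names:

```c
/* Expected cycles per dispatch when a fraction al_ratio of instructions
   takes the cheap AL path and the rest takes the expensive path. */
double expected_cost(double al_ratio, double al_cycles, double other_cycles)
{
    return al_ratio * al_cycles + (1.0 - al_ratio) * other_cycles;
}

/* AL fraction at which the conditional version costs the same as the
   plain indirect jump. Solves
       r * al + (1 - r) * other = plain
   for r. Returns -1.0 if both paths cost the same, since there is then
   no single break-even point. */
double break_even_ratio(double plain_cycles, double al_cycles,
                        double other_cycles)
{
    if (other_cycles == al_cycles)
        return -1.0;
    return (other_cycles - plain_cycles) / (other_cycles - al_cycles);
}
```

With the figures above, break_even_ratio(16, 3, 34) evaluates to 18/31, about 0.5806, which matches the 58% quoted.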
Wait...lemme double check.
Anyways, it seems that I am always right. Moving on from here, 20 * 36 + 80 * 2 = 880 is not less than or equal to 1/2 * (16 * 100). So even with an 80/20 ratio, it's not two or more times faster than without the conditional, and I love exaggerating. There are other things to consider as well, and some things are slower than I assumed. Your AL does almost nothing compared to the memory accesses and extra comparisons on the other conditionals. Even then, I would not be surprised if the ratio of AL is more like 60% or less, rather than something like 80%. With ARM, most people would use standard C/C++ compilers anyway. Most of them won't care about optimisation that much with their clock running at 300 MHz and a lot of memory. Especially those game developers. They like wrapping everything with conditionals and breaking things up into small logical pieces for tiny things, and compilers would most likely produce those single instructions with those conditionals for things like global values and parameters loaded into registers, because it makes little to no difference running it on the hardware. That's one of the other minor reasons why people like ARM.
And the code that the latest version of the free Visual C++ produces for that function is exactly as I put up there. It seems like GCC does not have that optimisation yet. Well, even though it's widely used for most non-commercial purposes, there are still areas where it can improve.
Exophase
October 10th, 2006, 20:11
There is the extra branch prediction failure for the if falling through, but in the 99% ratio you said that's still not going to come even close to adding up to the same amount, given that the indirect jmp is significantly more expensive than the jne NOT_AL when predicted (and 99% WILL have it predicted most of the time). I don't know what you meant about the indirect branch predicted to "not be taken", it's not a conditional branch. Also, how did you get "each branch failure costs 8 cycles", then 16 + 16? I assume you meant 16, not 8. Finally, GCC will reuse the results of the first cmp for the out of range check.
I was only going on what you said about 99%, and I stand by this. Feel free to benchmark it on whatever CPU. As for the actual in practice usage, I said it MAY be faster, but was pretty skeptical. It depends too much on the code, not to mention the CPU (pipeline penalties, as mentioned), and the compiler. Of course GCC should have that optimization, it has been brought up before (look for it), but it doesn't. In the case of Visual C++ it is probably a special case specifically for ANDs on switches though, and not a more generalized boundary limitation kind of thing. That or there's some kind of unsafe optimization in play. Anyway, you said AL has to be in effect 58% of the time, then you said you were double checking and were "always right" and honestly I lost you from here. We can talk about real ratios but we don't have figures, and honestly I'm not interested - the version w/conditional may or may not be more effective, I really don't know. But if it were 99% of the time AL (like you said) it would definitely be more effective, and if you're saying otherwise then you have to explain it to me more clearly because I am NOT seeing this.
civilian0746
October 11th, 2006, 03:49
The "8 cycle" thing comes from this particular pipeline. The "branch predictor" will probably discard the earlier pipeline stages after the operand decode stage if it thinks that the branch will be taken, and start reading from the place being branched to. Then at the state update stage, when it finally knows whether the branch is taken or not, if it's not what the branch predictor predicted, it has to refetch the next instruction from the appropriate place for it to be executed. The time it would take for that to happen from the start of the conditional jump, despite its branch failures, is 9 on that particular pipeline. It may be more or less and is, just as you said, processor dependent. Of course, there might be microarchitectures where one would work better than the other, but I could not really care less.
You are just way too naive.
Exophase
October 11th, 2006, 07:34
Naive? How?
I still don't see how you've done anything to show that the 99% AL case would not be better with the if, but whatever, if you're just going to explain things I already know and shout other irrelevancies at me, in addition to insults and bragging about who knows what... and none of what you said changes that you said 8 then said 16 + 16, so you either meant one or the other, since they were said with regards to the same branches...
civilian0746
October 11th, 2006, 11:31
Naive? How?
\/
|
|
|
\/
I still don't see how you've done anything to show that the 99% AL case would not be better with the if, but whatever, if you're just going to explain things I already know and shout other irrelevancies at me, in addition to insults and bragging about who knows what... and none of what you said changes that you said 8 then said 16 + 16, so you either meant one or the other, since they were said with regards to the same branches...
Exophase
October 11th, 2006, 19:51
I wonder if you even know what naive means. While you're looking it up check out "conceited" and "condescending."
civilian0746
October 13th, 2006, 12:53
I do know what it means, and I could not find a more appropriate word, hoping that you would understand.
I am yet to mature fully. You need to stay alive till that happens.
Exophase
October 13th, 2006, 17:16
Okay, I'll try not to die then ;P
Ralph
October 14th, 2006, 01:40
I'm pretty sure Exophase knows a little something about fast code, have you guys even tried his GBA emulator on the PSP?
|\/|-a-\/
October 20th, 2006, 12:24
hey guys!
i read a bit through the dsemu 0.4.10 code and began where dsemu is allocating ds memory. I wonder why it's allocating EWRAM, afaik EWRAM is only used in the Gameboy Advance... The next thing is, that it's initializing the PC (R[15]) with 0x08000000 O_o i never heard, that the execution begins at 0x08000000.
Garstyciuks
October 20th, 2006, 13:30
If you are right about EWRAM, ds emu probably does that for compatibility with game boy advance games. I heard that game boy advance can run games of older game boys. Maybe ds can do it too.
Cyberman
October 20th, 2006, 20:54
The DS is designed to run NDS games and GBA games.
Hence you need EWRAM for it to be compatible.
Cyb
|\/|-a-\/
October 20th, 2006, 21:13
and entry address in gba is 0x08000000 ?!?
i don't think so, because dsemu has source files for gba memory and sourcefiles for ds memory... ewram is allocated in both... afaik you can find the entry address(es) in the nds rom header.
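As a rough illustration of that last point: in the NDS cartridge header layout documented in GBATEK, the ARM9 and ARM7 entry addresses are little-endian 32-bit words at offsets 0x24 and 0x34, each sitting between a ROM offset, a RAM load address and a size. A minimal sketch in C that pulls them out of a header buffer follows; nds_boot_info, read_le32 and nds_read_boot_info are hypothetical names, not taken from dsemu.

```c
#include <stddef.h>
#include <stdint.h>

/* Boot-related fields of the NDS cartridge header (offsets per GBATEK). */
typedef struct {
    uint32_t arm9_rom_offset;   /* 0x20 */
    uint32_t arm9_entry;        /* 0x24 */
    uint32_t arm9_ram_address;  /* 0x28 */
    uint32_t arm9_size;         /* 0x2C */
    uint32_t arm7_rom_offset;   /* 0x30 */
    uint32_t arm7_entry;        /* 0x34 */
    uint32_t arm7_ram_address;  /* 0x38 */
    uint32_t arm7_size;         /* 0x3C */
} nds_boot_info;

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fills *out from the first bytes of a ROM image.
   Returns 0 on success, -1 if the arguments are invalid or the buffer
   is too short to contain the fields. */
int nds_read_boot_info(const uint8_t *header, size_t len, nds_boot_info *out)
{
    if (header == NULL || out == NULL || len < 0x40)
        return -1;

    out->arm9_rom_offset  = read_le32(header + 0x20);
    out->arm9_entry       = read_le32(header + 0x24);
    out->arm9_ram_address = read_le32(header + 0x28);
    out->arm9_size        = read_le32(header + 0x2C);
    out->arm7_rom_offset  = read_le32(header + 0x30);
    out->arm7_entry       = read_le32(header + 0x34);
    out->arm7_ram_address = read_le32(header + 0x38);
    out->arm7_size        = read_le32(header + 0x3C);
    return 0;
}
```

An emulator would then copy arm9_size bytes from arm9_rom_offset in the ROM to arm9_ram_address and start the ARM9 at arm9_entry, and likewise for the ARM7, instead of hard-coding a start address.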