Bra64 news

tooie

New member
Plisco said:
Ignore that, Ashade; you do what you want. Keep up the good work. Can't wait for a release ;)

Plisco, read the whole message: it is saying that programming in asm is a waste of time. This is purely a productivity thing: the benefits being claimed are basically nil, but the programming overhead is massive. It had nothing to do with Ashade's plugin.
 

sketzh

New member
..

tooie said:
Plisco, read the whole message: it is saying that programming in asm is a waste of time. This is purely a productivity thing: the benefits being claimed are basically nil, but the programming overhead is massive. It had nothing to do with Ashade's plugin.

exactly my point.

Actually, I knew some guys who did an entire game for Windows a couple of years ago in pure asm. They knew their stuff, and it took up A LOT of time. The game was nice and everything, but it was almost outdated when it was released. I am not saying that I have no respect for asm programmers, because I do. I started coding my very first program in asm too, so I know it's hard work and can be fast.

BTW: Some guys did a bilinear radial blur algorithm a year back. One did it in pure assembler and one did it in C++ with some inline asm. The C++ version was the fastest one. All the inner loops, the ones that cache well, were written in asm. And these guys are VERY skilled at programming. I am just mentioning this because it's not a given that asm is faster.

I respect your work Ashade, don't stop your plugin; this is only my opinion and what I would do. Anybody making even a little effort at being creative and productive is to be rewarded, so go on mate! :)
 

Cyberman

Moderator
ashade said:
yeah, I can unroll loops, but look at this example:

to divide a number in C++ by a constant, you have to do this:

// consider x is a 16-bit number
x /= 10; // using 10 for example

this is slow, because the divide operation takes too many cycles... my idea is to use the multiply operation to divide by a constant (and the mul instruction is much faster!).

look at this:

__asm {
    mov ax, x     ; ax = x (the 16-bit dividend)
    mov dx, 6554  ; 6554 = 65536 / 10, rounded up
    mul dx        ; dx:ax = x * 6554
    mov x, dx     ; keep the high word: (x * 6554) >> 16, roughly x / 10
}


it is a little hard to explain why this works, but this kind of multiplication can't be done in C++... test it yourself if you want: make a loop repeating each version about 10^10 times and compare the results

Ashade.. there is a slight flaw in your logic: you are assuming you are a better optimizer than the compilor. Consider the fact that the compilor has several thousand various automatic optimizations. Also, newer CPUs that execute CISC code aren't really CISC machines; they actually do dynamic opcode decoding for a fast RISC engine internally, and they eliminate wasted opcode operations to enhance execution. Don't try to out-think the compilor. You're struggling against people with fifteen to twenty years of experience in enhancing performance and optimizing code.

You are assuming, WRONGLY I might point out, that the compilor's code will be exactly what you coded it as in C++. This is a very flawed and myopic view of what the backend of a compilor does. Unless you want to discuss how the code generator on a typical compilor works these days, I won't get into it. Needless to say, compilor output is VERY optimized already. It would be very hard for you to do better in assembly unless you are using MMX instructions (no compilor that I'm aware of generates these from C/C++ code, because they have inherent parallelism which is not expressible in C/C++).

Assembly is most needed in a few areas. These are the following:
  1. When the corresponding C/C++ code generates a cumbersome bit of code that is simpler in assembly. You find this out by compiling the C code to assembly and seeing what it generates. That's what I do.
  2. When there is no corresponding code equivalent in C or C++.
  3. When one is doing DIRECT hardware access and manipulation.

These are quite rare. Things like MMX code can improve performance, BUT most assembly won't improve anything; it will, however, give you a big headache if you make a mistake.

    Cyb
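Point 1 above is easy to try with the compiler itself; a sketch, assuming gcc and a file div.c containing the C division (the file name is made up for illustration):

```shell
# Ask the compiler for the optimized assembly instead of an object
# file, then read it to see how the division was actually lowered.
gcc -O2 -S div.c -o div.s
```

With MSVC, the /FAs switch produces a comparable listing with the source interleaved.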
 

Hacktarux

Emulator Developer
Moderator
From what I've heard, the latest Intel C++ compiler can produce MMX code as well as SSE and SSE2. You have to define in the compiler options which instruction sets to use and what the target processor is. From what I've understood, it tries to use them in loops in your program and also in conditional jumps. You can also switch on an option that gives you some info at compile time on how to slightly change your code so that the compiler can parallelize it with MMX. Unfortunately, this compiler isn't free on Windows. The Linux version is free if you don't use it for commercial purposes.

There are also some classes made by Intel to handle SIMD, but it's like doing it in asm IMO...
 

Slougi

New member
Hacktarux said:
From what I've heard, the latest Intel C++ compiler can produce MMX code as well as SSE and SSE2. You have to define in the compiler options which instruction sets to use and what the target processor is. From what I've understood, it tries to use them in loops in your program and also in conditional jumps. You can also switch on an option that gives you some info at compile time on how to slightly change your code so that the compiler can parallelize it with MMX. Unfortunately, this compiler isn't free on Windows. The Linux version is free if you don't use it for commercial purposes.

There are also some classes made by Intel to handle SIMD, but it's like doing it in asm IMO...
GCC does the same; try the -m3dnow, -msse, -msse2 and -mmmx options. Also, -ffast-math does most of the stuff described above by ashade and others automatically.

BTW, Cyberman, is it not called compiler? Compilor just hurst my eyes and imaginary ears :p
 

cooliscoo

EmuTalk Member
Slougi said:
GCC does the same; try the -m3dnow, -msse, -msse2 and -mmmx options. Also, -ffast-math does most of the stuff described above by ashade and others automatically.

BTW, Cyberman, is it not called compiler? Compilor just hurst my eyes and imaginary ears :p

Isn't it called hurts, and not hurst?

:happy:
 

Hacktarux

Emulator Developer
Moderator
Hacktarux said:
From what I've heard, the latest Intel C++ compiler can produce MMX code as well as SSE and SSE2. You have to define in the compiler options which instruction sets to use and what the target processor is. From what I've understood, it tries to use them in loops in your program and also in conditional jumps. You can also switch on an option that gives you some info at compile time on how to slightly change your code so that the compiler can parallelize it with MMX. Unfortunately, this compiler isn't free on Windows. The Linux version is free if you don't use it for commercial purposes.

There are also some classes made by Intel to handle SIMD, but it's like doing it in asm IMO...

Hey, I didn't know that :)
But I've just checked the manual and it's not as good as the Intel C++ compiler yet... These switches only enable these instructions; this means gcc will recognize them when you use them in inline asm or in gcc's builtin wrapper functions for MMX. The only exception is SSE, which gcc can use to do fast floating point. Gcc can't optimize loops, conditions, and, generally speaking, can't do all the integer optimizations the Intel C++ compiler does.

It's still good to know that they're working on it ;)
 

Slougi

New member
Hacktarux said:
Hey, I didn't know that :)
But I've just checked the manual and it's not as good as the Intel C++ compiler yet... These switches only enable these instructions; this means gcc will recognize them when you use them in inline asm or in gcc's builtin wrapper functions for MMX. The only exception is SSE, which gcc can use to do fast floating point. Gcc can't optimize loops, conditions, and, generally speaking, can't do all the integer optimizations the Intel C++ compiler does.

It's still good to know that they're working on it ;)
Hmm, seems I got it wrong at some point :(
They did give me pretty nice speed-ups on my Duron (Morgan) though. Also, -funroll-loops and -frerun-cse-after-loop might have something to do with this :) And of course -fomit-frame-pointer, which frees up a register on x86 platforms. It just makes debugging impossible :( We are completely off-topic btw :doh:

Cooliscoo: whoooops :blush:
 

Cyberman

Moderator
Hacktarux said:
Hey, I didn't know that :)
But I've just checked the manual and it's not as good as the Intel C++ compiler yet... These switches only enable these instructions; this means gcc will recognize them when you use them in inline asm or in gcc's builtin wrapper functions for MMX. The only exception is SSE, which gcc can use to do fast floating point. Gcc can't optimize loops, conditions, and, generally speaking, can't do all the integer optimizations the Intel C++ compiler does.

It's still good to know that they're working on it ;)

Hmmm... from what I understand, on the backend part of things many compilors use GCC's processor model optimizations. Having written a few of these (what a pain is all I can describe them as), I would say GCC definitely has integer optimizations. You can see this by playing with the -O# optimization options. As for optimizing MMX, that's a horse of a different color, or should I say a real PITA. The real fun with MMX instructions comes when you have to switch the MMX state off (the emms instruction) before any floating point instructions can be executed.

Yeah, it's off topic (what's your problem! ;) ). I remember working on converting 16-bit ABGR values to 32-bit RGBA values in a PSX GPU plugin (going from 24-bit BGR to 32-bit RGB was fun as well). Fortunately these aren't too bad: I just loaded 2 ABGR values into the lower 32 bits of an MMX register, then 'expanded' them into the lower 16 bits of 2 32-bit values. Then masked, shifted, masked, shifted, masked... tada, 2 32-bit RGB values (snicker). I never compared the speed with the old algo though (might have been wise ;) ).

Cyb
 

Cyberman

Moderator
Is it me or is cooliscoo really uncool at times? :)

Yeah Slougi I'm spelling challenged sorry ;)
 

Slougi

New member
Cyberman said:
Hmmm... from what I understand, on the backend part of things many compilors use GCC's processor model optimizations. Having written a few of these (what a pain is all I can describe them as), I would say GCC definitely has integer optimizations. You can see this by playing with the -O# optimization options. As for optimizing MMX, that's a horse of a different color, or should I say a real PITA. The real fun with MMX instructions comes when you have to switch the MMX state off (the emms instruction) before any floating point instructions can be executed.

Yeah, it's off topic (what's your problem! ;) ). I remember working on converting 16-bit ABGR values to 32-bit RGBA values in a PSX GPU plugin (going from 24-bit BGR to 32-bit RGB was fun as well). Fortunately these aren't too bad: I just loaded 2 ABGR values into the lower 32 bits of an MMX register, then 'expanded' them into the lower 16 bits of 2 32-bit values. Then masked, shifted, masked, shifted, masked... tada, 2 32-bit RGB values (snicker). I never compared the speed with the old algo though (might have been wise ;) ).

Cyb
The -Ox flags don't do anything by themselves except turn on extra flags to optimize for binary speed (1-3) or size (s). From the man page:

Code:
Options That Control Optimization

       These options control various sorts of optimizations:

       -O
       -O1 Optimize.  Optimizing compilation takes somewhat more time, and a
           lot more memory for a large function.

           Without -O, the compiler's goal is to reduce the cost of compila-
           tion and to make debugging produce the expected results.  State-
           ments are independent: if you stop the program with a breakpoint
           between statements, you can then assign a new value to any variable
           or change the program counter to any other statement in the func-
           tion and get exactly the results you would expect from the source
           code.

           With -O, the compiler tries to reduce code size and execution time,
           without performing any optimizations that take a great deal of com-
           pilation time.

       -O2 Optimize even more.  GCC performs nearly all supported optimiza-
           tions that do not involve a space-speed tradeoff.  The compiler
           does not perform loop unrolling or function inlining when you spec-
           ify -O2.  As compared to -O, this option increases both compilation
           time and the performance of the generated code.

           -O2 turns on all optional optimizations except for loop unrolling,
           function inlining, and register renaming.  It also turns on the
           -fforce-mem option on all machines and frame pointer elimination on
           machines where doing so does not interfere with debugging.

           Please note the warning under -fgcse about invoking -O2 on programs
           that use computed gotos.

       -O3 Optimize yet more.  -O3 turns on all optimizations specified by -O2
           and also turns on the -finline-functions and -frename-registers
           options.

       -O0 Do not optimize.

       -Os Optimize for size.  -Os enables all -O2 optimizations that do not
           typically increase code size.  It also performs further optimiza-
           tions designed to reduce code size.

           If you use multiple -O options, with or without level numbers, the
           last such option is the one that is effective.
So -O1, for example, does not use -funroll-loops etc. Personally I have found -Os to produce the best results, due to decreased loading times; for example, OOo and Mozilla really load MUCH faster. Unfortunately it makes some apps very fragile (gimp, xfree, evolution, cdrdao etc), but I use it whenever I can, along with -fomit-frame-pointer. On the other hand, Mupen would probably run faster if compiled with, say, -O3 -funroll-loops -ffast-math -frerun-cse-after-loop -fforce-addr -frerun-loop-opt -falign-functions=4 -maccumulate-outgoing-args, since its code size is much smaller. Compiler flags can be loads of fun :saint:
 

Hacktarux

Emulator Developer
Moderator
I know gcc can do integer optimizations; I was talking about automatic integer optimizations using MMX or other SIMD instructions. Maybe gcc's structure makes it possible, you seem to know it better than me, but from what I've read it's not done yet...

And yes, those posts are off topic, but we're actually talking about programming and that's exceptional in this forum.... :D
 

Slougi

New member
Cyberman said:
Is it me or is cooliscoo really uncool at times? :)

Yeah Slougi I'm spelling challenged sorry ;)
Yeah, cooliscoo is sometimes pretty annoying, especially when he notices something like this :p

No probs Cyb, it just makes me queasy when someone puts -or at the end of words like reporter, compiler, and other such words :) Didn't mean to nitpick.
 

Slougi

New member
Hacktarux said:
I know gcc can do integer optimizations; I was talking about automatic integer optimizations using MMX or other SIMD instructions. Maybe gcc's structure makes it possible, you seem to know it better than me, but from what I've read it's not done yet...
Heh, I bet you know much more than I do about the structure; I just know what Gentoo has taught me in the few months I have been using it :) I always thought GCC did 3DNow and SSE automatically though, but maybe I am wrong.

And yes, those posts are off topic, but we're actually talking about programming and that's exceptional in this forum.... :D
Actually we are not talking about programming, we are talking about compiling ;) Two very different things. I couldn't program even a simple input plugin, yet I can compile and patch OOo by hand ;)
 
