Bra64 news



ashade
January 28th, 2003, 22:38
Hello everyone, bra64 is not dead! I got it working on wireframes, but some strange things are happening... the first one is that the image is flipped upside down... I've tried everything! Look at this:

http://n64rommania.hpg.com.br/imagem.jpg

It is Super Mario 64 start screen... does anyone have any idea how to fix this?

Tagrineth
January 28th, 2003, 23:33
Well, offhand without knowing how your code works, the only suggestion I can make is a quick flip in the DAC, or something along those lines.

ShadowPrince
January 29th, 2003, 01:00
Originally posted by ashade
It is Super Mario 64 start screen... does anyone have any idea how to fix this?

This is Super Mario start screen indeed :)
Congratulations on your progress.

Knuckles
January 29th, 2003, 01:03
Originally posted by ShadowPrince
This is Super Mario start screen indeed :)
Congratulations on your progress.

It's better than nothing ;) good work!

CpU MasteR
January 29th, 2003, 06:08
Nice Job ashade.

aprentice
January 29th, 2003, 06:12
can we get a screenshot with the fps counter?

euphoria
January 29th, 2003, 09:40
Originally posted by ashade
Hello everyone, bra64 is not dead! I got it working on wireframes, but some strange things are happening... the first one is that the image is flipped upside down... I've tried everything! Look at this:

http://n64rommania.hpg.com.br/imagem.jpg

It is Super Mario 64 start screen... does anyone have any idea how to fix this?

Maybe you're using upper-left coordinates instead of lower-left coords, and because of this the image is upside down.


(0,0)  <- this is where your origin is
  *------*
  |      |
  |      |
  *------*
(0,0)  <- and here is where it should be
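
If that's the case, the usual fix is just to mirror Y against the viewport height when building screen coordinates. A minimal sketch (the helper name is made up, not from the plugin):

// Mirror a vertex's Y around the viewport so geometry built with a top-left
// origin displays correctly on a bottom-left-origin target (or vice versa).
inline float flip_y(float y, float viewport_height)
{
    return viewport_height - y;
}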

GuestX
January 29th, 2003, 13:49
Congrats ashade!!!! It's great to see a new & gr8 gfx plugin!!! Congrats!!! :) :)

Plisco
January 29th, 2003, 14:52
Sweet!!! BTW, why the name Bra64?

ashade
January 29th, 2003, 15:59
Oh yeah, I corrected this bug!! Look at this:

http://n64rommania.hpg.com.br/SUPERMARIO.jpg

the bitmap version:
http://n64rommania.hpg.com.br/SUPERMARIO.bmp

I think I have done the hardest part... now it is much easier to complete the plugin...

WildSOfT
January 29th, 2003, 16:17
Nice job, ashade.

But why use PJ64 1.5 beta 1 when there's PJ64 1.5 final?

Anyway, good luck with the progress of the plugin! ;)

GuestX
January 29th, 2003, 16:42
hah ;) great progress !! :)

mightyrocket
January 29th, 2003, 16:57
This looks very good. Maybe it'll turn out to be a good competitor for Jabo and Icepir8 after all! But keep working on it!

A small question: Does it (currently) run fast?

sytaylor
January 29th, 2003, 17:32
Well, considering it's running wireframe, it's a bit unfair to compare it to fully shaded, polygon-producing plugins.

Doomulation
January 29th, 2003, 18:01
Well, it is developed in asm, somewhat at least, so yes it is FAST!!!
Very nice work, ashade! :inlove:

tooie
January 29th, 2003, 18:25
Written in asm means nothing .. you can easily have it in asm and be slower .. it is the algorithms that are important .. granted, if you had them all perfect .. then you could possibly re-do some in asm to truly optimize

ashade
January 29th, 2003, 20:40
Hey man, writing in assembler really means that the code will be faster than C++, just because assembler has special instructions that C++ doesn't have... I will give you a little example:

when u want to set bit 3 of a dword in C++, you do it like this:

n |= 0x8

this code is fast, but not as fast as this one in assembler:

__asm bts n, 3

because assembler has specific instructions to set, clear and invert the value of one or more bits...

another advantage of assembler is that you work directly with the hardware; the code is written your own way... when u use a high level language, u have to compile the code, and the compiler follows some "standards" to convert from C++ to assembler code. The compiler doesn't think like humans, so it can't avoid emitting some slow code, even if it is configured for fast code...

icepir8
January 29th, 2003, 21:06
Originally posted by ashade
Hey man, writing in assembler really means that the code will be faster than C++, just because assembler has special instructions that C++ doesn't have... I will give you a little example:

when u want to set bit 3 of a dword in C++, you do it like this:

n |= 0x8

this code is fast, but not as fast as this one in assembler:

__asm bts n, 3

because assembler has specific instructions to set, clear and invert the value of one or more bits...

another advantage of assembler is that you work directly with the hardware; the code is written your own way... when u use a high level language, u have to compile the code, and the compiler follows some "standards" to convert from C++ to assembler code. The compiler doesn't think like humans, so it can't avoid emitting some slow code, even if it is configured for fast code...

Just because there is a special instruction to do something, it doesn't mean that it is faster. There are a lot of examples of Intel adding special instructions that take more time to execute than the multiple-instruction solutions. The set/clear/test bit instructions are among them.

Hacktarux
January 29th, 2003, 21:24
Originally posted by ashade
Hey man, writing in assembler really means that the code will be faster than C++, just because assembler has special instructions that C++ doesn't have... I will give you a little example:

when u want to set bit 3 of a dword in C++, you do it like this:

n |= 0x8

this code is fast, but not as fast as this one in assembler:

__asm bts n, 3

because assembler has specific instructions to set, clear and invert the value of one or more bits...

another advantage of assembler is that you work directly with the hardware; the code is written your own way... when u use a high level language, u have to compile the code, and the compiler follows some "standards" to convert from C++ to assembler code. The compiler doesn't think like humans, so it can't avoid emitting some slow code, even if it is configured for fast code...

If you want to play this game i'll play with you :D
I've checked the OR and BTS opcodes, and guess what, BTS is far slower:
OR mem, imm : 3 cycles
BTS mem, imm : 13 cycles

In some cases, on modern cpus, the compiler can be faster than manual optimizations coz it isn't only about which opcodes you choose but also the sequence, coz modern cpus can execute several things at the same time and you have to know which part of the cpu is used by which opcode and for how long each part is needed... It also affects cache management, it's not as simple as it seems to be...

Finally, you can optimize your functions as much as u can, but as Tooie said, if you don't have a good algorithm with low complexity, it'll be totally useless work. You can have a function 10x bigger and it can still be faster than the short one. And if the program is as complex as a gfx plugin, optimizations can require very complex functions that can possibly be hard to do in asm...
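
A rough way to check such claims is to time both forms yourself. A hedged sketch (MSVC-style inline asm, 32-bit x86 only; loop overhead and the compiler's own optimizations can easily dominate, so the numbers are only indicative):

#include <cstdio>
#include <ctime>

volatile unsigned int n = 0;   // volatile so the loops aren't optimized away

int main()
{
    const long iterations = 100000000L;

    clock_t t0 = clock();
    for (long i = 0; i < iterations; i++)
        n |= 0x8;                       // plain C++ version
    clock_t t1 = clock();

    for (long i = 0; i < iterations; i++) {
        __asm { bts n, 3 }              // inline-asm BTS version
    }
    clock_t t2 = clock();

    printf("or : %ld ticks\n", (long)(t1 - t0));
    printf("bts: %ld ticks\n", (long)(t2 - t1));
    return 0;
}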

ashade
January 29th, 2003, 21:54
New shot (with the fps counter):

http://n64rommania.hpg.com.br/newscr.jpg

ashade
January 29th, 2003, 22:03
ok, so let's play this game...
take a look at my matrix multiplication function and Glide64's multiplication function... compare their speed and tell me what u think...

To compare the speed, put both in a huge loop (repeating each function about 1000000 times) and compare the time each one takes...

my function:

typedef struct {
    union {
        float M[4][4];
        struct { // direct access to the elements
            float a11, a12, a13, a14,
                  a21, a22, a23, a24,
                  a31, a32, a33, a34,
                  a41, a42, a43, a44;
        };
    };
} MATRIS;

void MATRIS_dot(MATRIS* mDest, MATRIS* mSrc1, MATRIS* mSrc2) {

/*
  {{a11 b11 + a12 b21 + a13 b31 + a14 b41,  a11 b12 + a12 b22 + a13 b32 + a14 b42,
    a11 b13 + a12 b23 + a13 b33 + a14 b43,  a11 b14 + a12 b24 + a13 b34 + a14 b44},
   {a21 b11 + a22 b21 + a23 b31 + a24 b41,  a21 b12 + a22 b22 + a23 b32 + a24 b42,
    a21 b13 + a22 b23 + a23 b33 + a24 b43,  a21 b14 + a22 b24 + a23 b34 + a24 b44},
   {a31 b11 + a32 b21 + a33 b31 + a34 b41,  a31 b12 + a32 b22 + a33 b32 + a34 b42,
    a31 b13 + a32 b23 + a33 b33 + a34 b43,  a31 b14 + a32 b24 + a33 b34 + a34 b44},
   {a41 b11 + a42 b21 + a43 b31 + a44 b41,  a41 b12 + a42 b22 + a43 b32 + a44 b42,
    a41 b13 + a42 b23 + a43 b33 + a44 b43,  a41 b14 + a42 b24 + a43 b34 + a44 b44}}
*/

__asm {
push eax
push ebx
push ecx


mov eax, mSrc2
mov ebx, mSrc1
mov ecx, mDest
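
// eax = mSrc2, ebx = mSrc1, ecx = mDest; each block below pushes one row of
// mSrc2 onto the FPU stack, forms its dot products with the four columns of
// mSrc1, and stores the corresponding row of mDest.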

fld [eax].a11
fld [eax].a12
fld [eax].a13
fld [eax].a14

fld st(3)
fmul [ebx].a11

fld st(3)
fmul [ebx].a21
faddp st(1), st(0)

fld st(2)
fmul [ebx].a31
faddp st(1), st(0)

fld st(1)
fmul [ebx].a41
faddp st(1), st(0)
fstp [ecx].a11


fld st(3)
fmul [ebx].a12

fld st(3)
fmul [ebx].a22
faddp st(1), st(0)

fld st(2)
fmul [ebx].a32
faddp st(1), st(0)

fld st(1)
fmul [ebx].a42
faddp st(1), st(0)
fstp [ecx].a12


fld st(3)
fmul [ebx].a13

fld st(3)
fmul [ebx].a23
faddp st(1), st(0)

fld st(2)
fmul [ebx].a33
faddp st(1), st(0)

fld st(1)
fmul [ebx].a43
faddp st(1), st(0)
fstp [ecx].a13


fld st(3)
fmul [ebx].a14

fld st(3)
fmul [ebx].a24
faddp st(1), st(0)

fld st(2)
fmul [ebx].a34
faddp st(1), st(0)

fld st(1)
fmul [ebx].a44
faddp st(1), st(0)
fstp [ecx].a14

fstp st(0)
fstp st(0)
fstp st(0)
fstp st(0)

fld [eax].a21
fld [eax].a22
fld [eax].a23
fld [eax].a24

fld st(3)
fmul [ebx].a11

fld st(3)
fmul [ebx].a21
faddp st(1), st(0)

fld st(2)
fmul [ebx].a31
faddp st(1), st(0)

fld st(1)
fmul [ebx].a41
faddp st(1), st(0)
fstp [ecx].a21


fld st(3)
fmul [ebx].a12

fld st(3)
fmul [ebx].a22
faddp st(1), st(0)

fld st(2)
fmul [ebx].a32
faddp st(1), st(0)

fld st(1)
fmul [ebx].a42
faddp st(1), st(0)
fstp [ecx].a22


fld st(3)
fmul [ebx].a13

fld st(3)
fmul [ebx].a23
faddp st(1), st(0)

fld st(2)
fmul [ebx].a33
faddp st(1), st(0)

fld st(1)
fmul [ebx].a43
faddp st(1), st(0)
fstp [ecx].a23


fld st(3)
fmul [ebx].a14

fld st(3)
fmul [ebx].a24
faddp st(1), st(0)

fld st(2)
fmul [ebx].a34
faddp st(1), st(0)

fld st(1)
fmul [ebx].a44
faddp st(1), st(0)
fstp [ecx].a24

fstp st(0)
fstp st(0)
fstp st(0)
fstp st(0)

fld [eax].a31
fld [eax].a32
fld [eax].a33
fld [eax].a34

fld st(3)
fmul [ebx].a11

fld st(3)
fmul [ebx].a21
faddp st(1), st(0)

fld st(2)
fmul [ebx].a31
faddp st(1), st(0)

fld st(1)
fmul [ebx].a41
faddp st(1), st(0)
fstp [ecx].a31


fld st(3)
fmul [ebx].a12

fld st(3)
fmul [ebx].a22
faddp st(1), st(0)

fld st(2)
fmul [ebx].a32
faddp st(1), st(0)

fld st(1)
fmul [ebx].a42
faddp st(1), st(0)
fstp [ecx].a32


fld st(3)
fmul [ebx].a13

fld st(3)
fmul [ebx].a23
faddp st(1), st(0)

fld st(2)
fmul [ebx].a33
faddp st(1), st(0)

fld st(1)
fmul [ebx].a43
faddp st(1), st(0)
fstp [ecx].a33


fld st(3)
fmul [ebx].a14

fld st(3)
fmul [ebx].a24
faddp st(1), st(0)

fld st(2)
fmul [ebx].a34
faddp st(1), st(0)

fld st(1)
fmul [ebx].a44
faddp st(1), st(0)
fstp [ecx].a34

fstp st(0)
fstp st(0)
fstp st(0)
fstp st(0)

fld [eax].a41
fld [eax].a42
fld [eax].a43
fld [eax].a44

fld st(3)
fmul [ebx].a11

fld st(3)
fmul [ebx].a21
faddp st(1), st(0)

fld st(2)
fmul [ebx].a31
faddp st(1), st(0)

fld st(1)
fmul [ebx].a41
faddp st(1), st(0)
fstp [ecx].a41


fld st(3)
fmul [ebx].a12

fld st(3)
fmul [ebx].a22
faddp st(1), st(0)

fld st(2)
fmul [ebx].a32
faddp st(1), st(0)

fld st(1)
fmul [ebx].a42
faddp st(1), st(0)
fstp [ecx].a42


fld st(3)
fmul [ebx].a13

fld st(3)
fmul [ebx].a23
faddp st(1), st(0)

fld st(2)
fmul [ebx].a33
faddp st(1), st(0)

fld st(1)
fmul [ebx].a43
faddp st(1), st(0)
fstp [ecx].a43


fld st(3)
fmul [ebx].a14

fld st(3)
fmul [ebx].a24
faddp st(1), st(0)

fld st(2)
fmul [ebx].a34
faddp st(1), st(0)

fld st(1)
fmul [ebx].a44
faddp st(1), st(0)
fstp [ecx].a44

fstp st(0)
fstp st(0)
fstp st(0)
fstp st(0)


pop ecx
pop ebx
pop eax
}
}


Glide64 function:

void projection_mul (float proj[4][4], float m_src[4][4], float m[4][4])
{
    for (int i=0; i<4; i++)
    {
        for (int j=0; j<4; j++)
        {
            proj[j][i] =
                m_src[0][i] * m[j][0] +
                m_src[1][i] * m[j][1] +
                m_src[2][i] * m[j][2] +
                m_src[3][i] * m[j][3];
        }
    }
}

Remote
January 29th, 2003, 22:15
Perhaps someone knowing could give his view on this...

EDIT: Nevermind I just read Hacktarux's reply...

Hacktarux
January 29th, 2003, 22:46
i've tried it just to be sure, and the results were what i thought: i get nearly exactly the same results with your function and glide64's one. After 10^7 random matrix multiplications (it takes 30 seconds on my computer) the difference is less than 20ms.

You may be wondering why...
my guess is that your code takes way too much space. CPUs are generally not designed to cache such big loops; they are optimized for little loops. Nevertheless, it still fits in the cache after some iterations. Remember that this test is not done in real conditions. In a real plugin there are many things happening between two matrix multiplications, so the code has to be reloaded each time. This means that your function is probably slower in a real plugin than Glide64's one... Again, it has nothing to do with asm vs C, it's the algorithm.... You can also unroll loops in C and you'll still have the same issue.
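
For anyone who wants to repeat this kind of test, a hedged sketch of a harness (it assumes the MATRIS/MATRIS_dot and projection_mul listings above are compiled in; the rand() fill keeps the compiler from folding the work away, though unlike the test described here the inputs aren't re-randomized every iteration):

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    MATRIS a, b, c;                        // types/functions from the listings above
    float pa[4][4], pb[4][4], pc[4][4];

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            a.M[i][j] = pa[i][j] = (float)rand() / RAND_MAX;
            b.M[i][j] = pb[i][j] = (float)rand() / RAND_MAX;
        }

    const long iterations = 10000000L;

    clock_t t0 = clock();
    for (long i = 0; i < iterations; i++)
        MATRIS_dot(&c, &a, &b);            // hand-written FPU asm version
    clock_t t1 = clock();

    for (long i = 0; i < iterations; i++)
        projection_mul(pc, pa, pb);        // plain C version
    clock_t t2 = clock();

    printf("asm version: %.0f ms\n", 1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    printf("C version  : %.0f ms\n", 1000.0 * (t2 - t1) / CLOCKS_PER_SEC);
    printf("%f %f\n", c.M[0][0], pc[0][0]); // keep the results live
    return 0;
}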

Knuckles
January 29th, 2003, 22:54
Did you only make Mario 64 run, or some other games as well? And is only the title screen working, or can you get in-game with the wireframe display?

ashade
January 30th, 2003, 00:39
yeah, i can unroll loops, but look at this example:

to divide a number in C++ by a constant, you have to do it like this:

//consider x is a 16bit number
x /= 10; //using 10 for example

this is slow, because the divide operation takes too many cycles... my idea is to use the multiply operation to divide by a constant (and the mul instruction is much faster!).

look at this:

__asm {
    mov ax, x
    mov dx, 6554
    mul dx
    mov x, dx
}


it is a little hard to explain why this works, but this kind of multiplication can't be done in C++... test it yourself if u want... make a loop repeating each version about 10^10 times and see the results
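
For what it's worth, the same reciprocal trick can be expressed in C/C++ as a fixed-point multiply; a hedged sketch (the function names are made up for illustration):

// The asm above keeps the high 16 bits of x * 6554, i.e. x * (6554/65536),
// and 6554/65536 only approximates 1/10, so this is off by one for some
// larger 16-bit inputs (e.g. 16389).
unsigned short div10_approx(unsigned short x)
{
    return (unsigned short)(((unsigned long)x * 6554) >> 16);
}

// A wider reciprocal (52429/2^19) makes it exact for every 16-bit input;
// this multiply-and-shift form is also what optimizing compilers typically
// generate on their own for x / 10.
unsigned short div10_exact(unsigned short x)
{
    return (unsigned short)(((unsigned long)x * 52429) >> 19);
}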

tooie
January 30th, 2003, 06:17
Asm can be faster than C++ .. but what you have here is an algorithm question .. granted, some algorithms can be done in asm and not in C .. that is where inline asm comes in .. writing highly intensive functions in asm can be great .. I know for your matrix stuff, using the different P3 SIMD ops can be dramatically faster ..

but you run into maintainability and being able to change code as well .. no matter how fast you write an interpreter .. a dynamic recompiler will always be faster .. this has to do with algorithms more than language ..


Originally posted by ashade
yeah, i can unroll loops, but look at this example:

to divide a number in C++ by a constant, you have to do it like this:

//consider x is a 16bit number
x /= 10; //using 10 for example

this is slow, because the divide operation takes too many cycles... my idea is to use the multiply operation to divide by a constant (and the mul instruction is much faster!).

look at this:

__asm {
    mov ax, x
    mov dx, 6554
    mul dx
    mov x, dx
}


it is a little hard to explain why this works, but this kind of multiplication can't be done in C++... test it yourself if u want... make a loop repeating each version about 10^10 times and see the results

sketzh
January 30th, 2003, 08:23
Originally posted by ashade
//consider x is a 16bit number
x /= 10; //using 10 for example


Well to divide faster in C++ you do the same trick.
You just multiply like this instead:

x *= 0.10f;

Doomulation
January 30th, 2003, 08:57
Although asm is a little faster in general. Try a small strcpy and you'll see how it jumps around and takes time to complete...
Function jumping is slow, and this especially happens on the first call, dunno if it happens on later calls...

With asm, however, the processor handles the instructions immediately (afaik?) and is thus a little faster. But it might not matter much in the end, as many have said......
And there's the fact that the compiler cannot optimize asm :(

Anyway, good luck on the plugin ashade and post a shot with real gfx and the speed limiter off (press F4 in pj) when you've got it working! :happy:

ScottJC
January 30th, 2003, 09:24
Why the hell are you people arguing? Ashade can program his gfx plugin with whatever compiler he wants to.

And it is a fact that assembler is faster than C++, not by much on these modern computers, but it definitely is. It is also a fact that C++ compiles INTO assembler (machine code), all compilers do, and a typical compiler compiles code into assembler which isn't exactly brilliantly optimized.

In assembler you have complete control over your software and how much optimization it can have; in C++, you do not, because it will always compile your C++ in the exact same way, optimized for what the compiler thinks is optimal.

A function in assembler can be just as slow as one in C++, but I'm willing to bet such a function would have a lot more instructions than the code produced in C++ and still do them all in the same amount of time. In the end, C++ produces lengthy assembler code as a result.

BTW ashade, good work, I look forward to the future of this plugin :D

radTube
January 30th, 2003, 11:43
Why do you guys sound like you want ashade to quit or code just like everybody else has done? I know these things tend to turn into some sort of competitions, usually made into such by people who have nothing to do with the coding, but couldn't we just give our support to this project and see what comes of it?

Good luck ashade, I hope bra64 develops into another great plugin. :flowers:

Doomulation
January 30th, 2003, 12:14
'Tis not good!
Ashade, we're not making you develop any differently, if that's what you're thinking! You do it as you want, as long as the plugin turns out good :D

Good luck! :flowers:

Trotterwatch
January 30th, 2003, 12:49
I don't think anyone here has been nasty to Ashade. He has posted stating he has written code in ASM that is superior to anything that could be written in C++. As a result some experienced C Coders have stated that what he has said isn't totally correct, and explained why. Ashade should take this as a challenge rather than an insult.

Hacktarux
January 30th, 2003, 13:50
The problem is to define in which respect asm is superior to C. It's superior in speed when you deeply optimize a single block of lines, but on the other hand it's harder to maintain...

What i was trying to say is that asm speed optimization isn't that important when you begin, coz algorithms have much more impact. It's like tuning your car, adding a spoiler and such while still having a bad engine. It'll probably be faster, but still slow, and i'd prefer starting by improving the engine...

I personally believe that it's very hard to do it all in asm the first time you do it. I think it would be faster to do it in C, optimize the algorithms, and once everything is working for sure, it can still be optimized in asm... starting with the most time-consuming functions and converting parts to asm step by step... I believe this approach takes less dev time, coz if u write everything in asm it'll be very hard to read, correct and modify once the plugin starts to get huge, and it will finally take more time than doing it twice (once in C and once in asm). Now, it's only my opinion, you don't need to agree :D

tooie
January 30th, 2003, 22:19
Originally posted by Sayargh
Why the hell are you people arguing? Ashade can program his gfx plugin with whatever compiler he wants to.

And it is a fact that assembler is faster than C++, not by much on these modern computers, but it definitely is. It is also a fact that C++ compiles INTO assembler (machine code), all compilers do, and a typical compiler compiles code into assembler which isn't exactly brilliantly optimized.

In assembler you have complete control over your software and how much optimization it can have; in C++, you do not, because it will always compile your C++ in the exact same way, optimized for what the compiler thinks is optimal.

A function in assembler can be just as slow as one in C++, but I'm willing to bet such a function would have a lot more instructions than the code produced in C++ and still do them all in the same amount of time. In the end, C++ produces lengthy assembler code as a result.

BTW ashade, good work, I look forward to the future of this plugin :D

the discussion is more about:

Originally posted by Doomulation
Well, it is developed in asm, somewhat at least, so yes it is FAST!!!
Very nice work, ashade! :inlove:

which, we are just saying, can be true but is not necessarily so .. there is a lot more to programming than just the language .. maintainability is a major thing, as well as readability.

pj64er
January 30th, 2003, 22:55
i don't see any nastiness either. i just see three emu programmers telling ashade what they learned through cold, hard experience.

even in my limited programming knowledge, i know that assembly (low level) can be more optimised than c++ (high level). but you guys cannot think like:

-low level is better than high level!
-asm is better than c++!
-plugin in asm is better than plugin in c++!
-w00t! revolution!
-quick! defend guy who write in asm at all costs!


ashade, hacktarux, tooie and icepir8 all know their stuff. let them have their little debate. I have a feeling the rest of you (like me) don't really know what's going on :flowers:

tooie
January 31st, 2003, 00:37
Originally posted by pj64er

-low level is better than high level!


I never really think of C++ as high level .. mostly cause I do work at times with Visual Basic, web stuff, SQL .. those I would think of as more high level. But I guess it depends on what you're comparing it to.

pj64er
January 31st, 2003, 01:57
Originally posted by tooie
I never really think of C++ as high level .. mostly cause I do work at times with Visual Basic, web stuff, SQL .. those I would think of as more high level. But I guess it depends on what you're comparing it to.

may I emphasize the limited knowledge part...:doh:

:happy:

mesman00
January 31st, 2003, 15:20
What happened to the screens? Just dead links!

sketzh
January 31st, 2003, 15:28
Many gfx coders these days don't bother coding in asm anymore.. they know it takes too much effort to make anything faster than what the compiler can do..

The optimizers today ain't that bad.. and with all the GPUs on the market there is really no point in optimizing it yourself..

And speaking of CPUs, I don't even know anybody with a CPU slower than 1.5 GHz these days..

My opinion is that it's just a waste of time..

Plisco
January 31st, 2003, 17:34
My opinion is that it's just a waste of time..

Ignore that, Ashade, you do what you want, keep up the good work. Can't wait for a release ;)

tooie
January 31st, 2003, 19:39
Originally posted by Plisco
Ignore that, Ashade, you do what you want, keep up the good work. Can't wait for a release ;)

Plisco, read the whole message .. it is saying that programming in asm is a waste of time .. this is purely a productivity thing. The benefits being claimed are basically nil, but the programming overhead is massive. It had nothing to do with Ashade's plugin.

sketzh
January 31st, 2003, 20:20
Originally posted by tooie
Plisco, read the whole message .. it is saying that programming in asm is a waste of time .. this is purely a productivity thing. The benefits being claimed are basically nil, but the programming overhead is massive. It had nothing to do with Ashade's plugin.

exactly my point.

actually I knew some guys who did an entire game for Windows a couple of years ago in pure asm. They knew their stuff and it took up A LOT of time. The game was nice and everything, but it was almost outdated when it was released. I am not saying that I have no respect for asm programmers, because I do. I started coding my very first program in asm too, so I know it's hard work and it can be fast.

BTW: some guys did a bilinear radial blur algorithm a year back. One did it in assembler and one did it in C++ with some inline asm. The C++ version was the faster one.. all the inner loops, the ones that cache well, were done in asm. And these guys are VERY skilled at programming. I am just mentioning this because it's not a given that asm is faster..

I respect your work Ashade, don't stop your plugin, this is only my opinion and what I would do. Anybody making even a little effort at being creative and productive deserves to be rewarded, so go on mate! :)

Cyberman
January 31st, 2003, 20:56
Originally posted by ashade
yeah, i can unroll loops, but look at this example:

to divide a number in C++ by a constant, you have to do it like this:

//consider x is a 16bit number
x /= 10; //using 10 for example

this is slow, because the divide operation takes too many cycles... my idea is to use the multiply operation to divide by a constant (and the mul instruction is much faster!).

look at this:

__asm {
    mov ax, x
    mov dx, 6554
    mul dx
    mov x, dx
}


it is a little hard to explain why this works, but this kind of multiplication can't be done in C++... test it yourself if u want... make a loop repeating each version about 10^10 times and see the results

Ashade.. there is a slight flaw in your logic: you are assuming you are a better optimizer than the compilor. Consider the fact that the compilor has several thousand various automatic optimizations. Also, newer CPUs that execute CISC code aren't really CISC machines: they actually do dynamic opcode decoding for a fast RISC engine internally, and they eliminate wasted opcode operations to enhance execution. Don't try to out-think the compilor. You're struggling against people with fifteen to twenty years of experience in enhancing performance and optimizing code.

You are assuming, WRONGLY I might point out, that the compilor's code will be exactly what you coded in C++. This is a very flawed and myopic view of what the backend of a compilor does. Unless you want to discuss how the code generator on a typical compilor works these days, I won't get into it. Needless to say, compilor output is VERY optimized already. It would be very hard for you to do better in assembly unless you are using MMX instructions (no compilors that I'm aware of generate these from C/C++ code, because they have inherent parallelism which is not expressible in C/C++).

Assembly is most needed in a few areas. These are the following:

1. When the corresponding C/C++ code generates a cumbersome bit of code that is simpler in assembly. You find this out by compiling the C code to assembly and seeing what it generates. That's what I do.
2. When there is no corresponding code equivalent in C or C++.
3. When one is doing DIRECT hardware access and manipulation.

These are quite rare. Things like MMX code can improve performance, BUT most assembly won't improve anything, and it will give you a big headache if you make a mistake.

Cyb

Hacktarux
January 31st, 2003, 21:14
From what i've heard, the latest Intel C++ compiler can produce MMX code as well as SSE and SSE2. You have to define in the compiler options which instruction sets to use and what the target processor is. From what i've understood, it tries to use them in loops in your program and also in conditional jumps. You can also switch on an option that gives you some info at compile time on how to slightly change your code so that the compiler can parallelize it with MMX. Unfortunately, this compiler isn't free on Windows. The Linux version is free if u don't use it for commercial purposes.

There are also some classes made by Intel to handle SIMD, but it's like doing it in asm IMO...
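
As a rough illustration (hedged, not a claim about any specific compiler version): a simple, dependence-free loop like the one below is the shape of code those auto-vectorizers target, since each iteration maps directly onto a packed instruction such as paddusb.

// Saturating add of two byte buffers; every iteration is independent, so a
// SIMD-aware compiler may process 8 (MMX) or 16 (SSE2) bytes per instruction.
void add_saturate(unsigned char *dst, const unsigned char *a,
                  const unsigned char *b, int n)
{
    for (int i = 0; i < n; i++) {
        int s = a[i] + b[i];
        dst[i] = (unsigned char)(s > 255 ? 255 : s);
    }
}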

Slougi
January 31st, 2003, 21:26
Originally posted by Hacktarux
From what i've heard, the latest Intel C++ compiler can produce MMX code as well as SSE and SSE2. You have to define in the compiler options which instruction sets to use and what the target processor is. From what i've understood, it tries to use them in loops in your program and also in conditional jumps. You can also switch on an option that gives you some info at compile time on how to slightly change your code so that the compiler can parallelize it with MMX. Unfortunately, this compiler isn't free on Windows. The Linux version is free if u don't use it for commercial purposes.

There are also some classes made by Intel to handle SIMD, but it's like doing it in asm IMO...
GCC does the same, try the -m3dnow, -msse, -msse2 and -mmmx options. Also -ffast-math does most of the stuff described above by ashade and others automatically.

BTW, Cyberman, is it not called compiler? Compilor just hurst my eyes and imaginary ears :P

cooliscoo
January 31st, 2003, 21:37
Originally posted by Slougi
GCC does the same, try the -m3dnow, -msse, -msse2 and -mmmx options. Also -ffast-math does most of the stuff described above by ashade and others automatically.

BTW, Cyberman, is it not called compiler? Compilor just hurst my eyes and imaginary ears :P

Isn't it called hurts, and not hurst?

:happy:

Hacktarux
January 31st, 2003, 21:45
Originally posted by Slougi
GCC does the same, try the -m3dnow, -msse, -msse2 and -mmmx options. Also -ffast-math does most of the stuff described above by ashade and others automatically.

Hey, didn't know that :)
But i've just checked the manual and it's not as good as the Intel C++ compiler yet... These switches only enable the instructions; this means gcc will recognize them when you use them in inline asm or in its builtin MMX wrapper functions. The only exception is SSE, which gcc can use to do fast floating point. Gcc can't do the loop, condition and, generally speaking, integer optimizations that the Intel C++ compiler does.

It's still good to know that they're working on it ;)

Slougi
January 31st, 2003, 22:31
Originally posted by Hacktarux
Hey, didn't know that :)
But i've just checked the manual and it's not as good as the Intel C++ compiler yet... These switches only enable the instructions; this means gcc will recognize them when you use them in inline asm or in its builtin MMX wrapper functions. The only exception is SSE, which gcc can use to do fast floating point. Gcc can't do the loop, condition and, generally speaking, integer optimizations that the Intel C++ compiler does.

It's still good to know that they're working on it ;)
Hmm seems I got it wrong at some point :(
They did give me pretty nice speed-ups on my duron morgan though. Also -funroll-loops and -frerun-cse-after-loop might have something to do with this :) And of course -fomit-frame-pointer which frees up a register on x86 platforms. Just makes debugging impossible :( We are completely offtopic btw :doh:

Cooliscoo: whoooops :blush:

Cyberman
February 1st, 2003, 19:39
Originally posted by Hacktarux
Hey, didn't know that :)
But i've just checked the manual and it's not as good as the Intel C++ compiler yet... These switches only enable the instructions; this means gcc will recognize them when you use them in inline asm or in its builtin MMX wrapper functions. The only exception is SSE, which gcc can use to do fast floating point. Gcc can't do the loop, condition and, generally speaking, integer optimizations that the Intel C++ compiler does.

It's still good to know that they're working on it ;)

Hmmm.. from what I understand, on the backend part of things many compilors use GCC's processor model optimizations. Having written a few of these (what a pain is all I can describe them as), I would say GCC definitely has integer optimizations. You can see this by playing with the -O# optimization options. As for optimizing with MMX, that's a horse of a different color, or should I say a real PITA. The real fun with MMX instructions comes when you have to switch off MMX before any floating point instructions can be executed.

Yeah, it's off topic (what's your problem! ;) ). I remember working on converting 16-bit ABGR values to 32-bit RGBA values in a PSX GPU plugin (going from 24-bit BGR to 32-bit RGB was fun as well). Fortunately these aren't too bad, as I just loaded 2 ABGR values into the lower 32 bits of an MMX register, then 'expanded' them into the lower 16 bits of two 32-bit values. Then masked, shifted, masked, shifted, masked .. tada, two 32-bit RGB values (snicker). I never compared the speed with the old algo though (might have been wise ;) ).

Cyb
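
For reference, the scalar version of that conversion looks roughly like this (a hedged sketch: it assumes the usual PSX 15-bit layout with red in the low 5 bits and the mask bit on top; the MMX version described above just does two of these at once with unpacks, masks and shifts):

// Expand one 16-bit frame-buffer pixel into a 32-bit value laid out as
// 0xAABBGGRR, i.e. R,G,B,A byte order in little-endian memory.
unsigned int expand_1555(unsigned short c)
{
    unsigned int r = (c >>  0) & 0x1f;
    unsigned int g = (c >>  5) & 0x1f;
    unsigned int b = (c >> 10) & 0x1f;
    unsigned int a = (c >> 15) & 0x01;

    // replicate the top bits so 0x1f scales to 0xff instead of 0xf8
    r = (r << 3) | (r >> 2);
    g = (g << 3) | (g >> 2);
    b = (b << 3) | (b >> 2);

    return (a ? 0xff000000u : 0x00000000u) | (b << 16) | (g << 8) | r;
}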

Cyberman
February 1st, 2003, 19:41
Is it me or is cooliscoo really uncool at times? :)

Yeah Slougi I'm spelling challenged sorry ;)

Slougi
February 1st, 2003, 19:54
Originally posted by Cyberman
Hmmm.. from what I understand, on the backend part of things many compilors use GCC's processor model optimizations. Having written a few of these (what a pain is all I can describe them as), I would say GCC definitely has integer optimizations. You can see this by playing with the -O# optimization options. As for optimizing with MMX, that's a horse of a different color, or should I say a real PITA. The real fun with MMX instructions comes when you have to switch off MMX before any floating point instructions can be executed.

Yeah, it's off topic (what's your problem! ;) ). I remember working on converting 16-bit ABGR values to 32-bit RGBA values in a PSX GPU plugin (going from 24-bit BGR to 32-bit RGB was fun as well). Fortunately these aren't too bad, as I just loaded 2 ABGR values into the lower 32 bits of an MMX register, then 'expanded' them into the lower 16 bits of two 32-bit values. Then masked, shifted, masked, shifted, masked .. tada, two 32-bit RGB values (snicker). I never compared the speed with the old algo though (might have been wise ;) ).

Cyb
The -Ox flags don't do anything by themselves except turn on extra flags to optimize for binary speed (1-3) or size (s). From the man page:


Options That Control Optimization

These options control various sorts of optimizations:

-O
-O1 Optimize. Optimizing compilation takes somewhat more time, and a
lot more memory for a large function.

Without -O, the compiler's goal is to reduce the cost of compila-
tion and to make debugging produce the expected results. State-
ments are independent: if you stop the program with a breakpoint
between statements, you can then assign a new value to any variable
or change the program counter to any other statement in the func-
tion and get exactly the results you would expect from the source
code.

With -O, the compiler tries to reduce code size and execution time,
without performing any optimizations that take a great deal of com-
pilation time.

-O2 Optimize even more. GCC performs nearly all supported optimiza-
tions that do not involve a space-speed tradeoff. The compiler
does not perform loop unrolling or function inlining when you spec-
ify -O2. As compared to -O, this option increases both compilation
time and the performance of the generated code.

-O2 turns on all optional optimizations except for loop unrolling,
function inlining, and register renaming. It also turns on the
-fforce-mem option on all machines and frame pointer elimination on
machines where doing so does not interfere with debugging.

Please note the warning under -fgcse about invoking -O2 on programs
that use computed gotos.

-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions and -frename-registers
options.

-O0 Do not optimize.

-Os Optimize for size. -Os enables all -O2 optimizations that do not
typically increase code size. It also performs further optimiza-
tions designed to reduce code size.

If you use multiple -O options, with or without level numbers, the
last such option is the one that is effective.
So -O1 for example does not use -funroll-loops etc. Personally I have found -Os to produce the best results, due to decreased loading times; for example OOo and Mozilla really load MUCH faster. Unfortunately it makes some apps very fragile (gimp, xfree, evolution, cdrdao etc), but I use it whenever I can, along with -fomit-frame-pointer. On the other hand, Mupen would prolly run faster if compiled with, say, -O3 -funroll-loops -ffast-math -frerun-cse-after-loop -fforce-addr -frerun-loop-opt -falign-functions=4 -maccumulate-outgoing-args, since its code size is much smaller. Compiler flags can be loads of fun :saint:

Hacktarux
February 1st, 2003, 19:55
I know gcc can do integer optimizations, i was talking about automatic integer optimizations using MMX or other SIMD instructions. Maybe gcc's structure makes it possible, you seem to know it better than me, but from what i've read it's not done yet...

And yes, those posts are off topic, but we're actually talking about programming and that's exceptional in this forum.... :D

Slougi
February 1st, 2003, 19:58
Originally posted by Cyberman
Is it me or is cooliscoo really uncool at times? :)

Yeah Slougi I'm spelling challenged sorry ;)
Yeah cooliscoo is sometimes pretty annoying, especially if he notices something like this :P

No probs Cyb, it just makes me queasy when someone puts -or at the end of words like reporter, compiler, and other such words :) Didn't mean to nit-pick.

Slougi
February 1st, 2003, 20:05
Originally posted by Hacktarux
I know gcc can do integer optimizations, i was talking about automatic integer optimizations using MMX or other SIMD instructions. Maybe gcc's structure makes it possible, you seem to know it better than me, but from what i've read it's not done yet...
Heh, I bet you know much more than I do about the structure, I just know what gentoo has taught me in the few months I have been using it :) I always thought GCC did 3dnow and sse automatically though, but maybe I am wrong.


And yes, those posts are off topic, but we're actually talking about programming and that's exceptional in this forum.... :D
Actually we are not talking about programming, we are talking about compiling ;) Two very different things. I couldn't program even a simple input plugin, yet I can compile and patch OOo by hand ;)