Announcement: Cycle-accurate N64 development underway.

Reznor007

New member
AMD has supported SSE3 since the Athlon 64 E3/E4 steppings, according to Wikipedia.

JustDesserts recently added some partial SSE support for RDP in MAME/MESS but it's not enabled by default since it isn't done yet.
 

Reznor007

New member
Right, though I thought you meant normal SSE3 at first. I think CEN64 only uses SSE2/3 at the moment anyway, so AMD not supporting SSSE3 until Bulldozer in 2011 shouldn't have been a major hindrance to anyone working on a similar project.
 

Reznor007

New member
Theoretically yes, but things may have changed since the beginning and when that binary was made.

"Completed all but a handful of the LWC2/SWC2 opcodes, and optimized them with SSE2/3" by MarathonMan earlier in this thread.
 

mrmudlord

New member
<MooglyGuy> Reznor007 isn't even remotely technically apt and should not be speaking on topics he knows fuckall about
<MooglyGuy> If he doesn't even know the difference between the RDP (which I haven't vectorized at all) and the RSP (which I have started vectorizing), he should stay the hell out of the thread. I've already been discussing RSP SSE optimizations with MarathonMan, and there are a handful of opcodes he has vectorized which I haven't, and I have a handful of opcodes I've vectorized that he hasn't

a notice
 
OP
MarathonMan

MarathonMan

Emulator Developer
Right, though I thought you meant normal SSE3 at first. I think CEN64 only uses SSE2/3 at the moment anyway, so AMD not supporting SSSE3 until Bulldozer in 2011 shouldn't have been a major hindrance to anyone working on a similar project.

SSE4.1

I planned a fallback to SSE2 and it's mostly in place, but I'm thinking of making SSSE3 a baseline requirement for shuffles, since processors that don't have the SSSE3 instructions don't have the performance for it anyway.
 
OP
MarathonMan

MarathonMan

Emulator Developer
Case in point:

The edgewalker is atrocious when it comes to operation count.

angrylion SVN after placing variables into arrays (logic unchanged):
Code:
  ewvars[EW_R]        = (ewdata[8] & 0xffff0000) | ((ewdata[12] >> 16) & 0x0000ffff);
  ewvars[EW_G]        = ((ewdata[8] << 16) & 0xffff0000) | (ewdata[12] & 0x0000ffff);
  ewvars[EW_B]        = (ewdata[9] & 0xffff0000) | ((ewdata[13] >> 16) & 0x0000ffff);
  ewvars[EW_A]        = ((ewdata[9] << 16) & 0xffff0000) | (ewdata[13] & 0x0000ffff);
...
  ewdxvars[EWDX_DRDX] = (ewdata[10] & 0xffff0000) | ((ewdata[14] >> 16) & 0x0000ffff);
  ewdxvars[EWDX_DGDX] = ((ewdata[10] << 16) & 0xffff0000) | (ewdata[14] & 0x0000ffff);
  ewdxvars[EWDX_DBDX] = (ewdata[11] & 0xffff0000) | ((ewdata[15] >> 16) & 0x0000ffff);
  ewdxvars[EWDX_DADX] = ((ewdata[11] << 16) & 0xffff0000) | (ewdata[15] & 0x0000ffff);

I don't want to think about how many scalar CPU operations that is. Let's vectorize, shall we?
Code:
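  /* Load ewdata[8..11] and ewdata[12..15]; the aligned loads need 16-byte alignment. */
  ewData1 = _mm_load_si128((__m128i*) (ewdata + 8));
  ewData2 = _mm_load_si128((__m128i*) (ewdata + 12));
  /* Pair words 8,9 with 12,13 and words 10,11 with 14,15. */
  ewDataLo = _mm_unpacklo_epi64(ewData1, ewData2);
  ewDataHi = _mm_unpackhi_epi64(ewData1, ewData2);
  /* SSSE3 byte shuffle moves the 16-bit halves into place (ewShuffleKey is defined elsewhere). */
  ewDataLo = _mm_shuffle_epi8(ewDataLo, ewShuffleKey);
  ewDataHi = _mm_shuffle_epi8(ewDataHi, ewShuffleKey);
  _mm_store_si128((__m128i*) (ewvarstest + 0), ewDataLo);
  _mm_store_si128((__m128i*) (ewdxvarstest + 0), ewDataHi);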

A mere 8 vector instructions to do the same amount of work.
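
The post doesn't show ewShuffleKey. Working backwards from the scalar listing, one key that reproduces it on little-endian x86 is sketched below. This is a reconstruction, not necessarily the constant CEN64 uses: the function names are hypothetical, and it assumes EW_R..EW_A and EWDX_DRDX..EWDX_DADX are indices 0..3 in that order. Unaligned loads are used so the sketch doesn't depend on how ewdata is allocated, and it falls back to the scalar loop when built without SSSE3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
#endif

/* Scalar reference, same bit logic as the scalar listing above.
 * out[0..3] = R, G, B, A; out[4..7] = DRDX, DGDX, DBDX, DADX (assumed order). */
static void ew_unpack_scalar(const uint32_t *ewdata, uint32_t out[8]) {
  for (int i = 0; i < 4; i++) {
    uint32_t a = ewdata[8 + i];  /* supplies the upper 16 bits */
    uint32_t b = ewdata[12 + i]; /* supplies the lower 16 bits */
    out[2 * i + 0] = (a & 0xffff0000u) | (b >> 16);
    out[2 * i + 1] = (a << 16) | (b & 0x0000ffffu);
  }
}

static void ew_unpack(const uint32_t *ewdata, uint32_t out[8]) {
  assert(ewdata != NULL && out != NULL);
#if defined(__SSSE3__)
  /* Destination byte i takes source byte key[i]. After the 64-bit unpack the
   * source holds words {a0, a1, b0, b1} in bytes 0-3, 4-7, 8-11, 12-15. */
  const __m128i key = _mm_setr_epi8(10, 11, 2, 3,  /* hi16(a0) : hi16(b0) */
                                    8, 9, 0, 1,    /* lo16(a0) : lo16(b0) */
                                    14, 15, 6, 7,  /* hi16(a1) : hi16(b1) */
                                    12, 13, 4, 5); /* lo16(a1) : lo16(b1) */
  __m128i d1 = _mm_loadu_si128((const __m128i *)(ewdata + 8));
  __m128i d2 = _mm_loadu_si128((const __m128i *)(ewdata + 12));
  __m128i lo = _mm_unpacklo_epi64(d1, d2); /* words 8, 9, 12, 13 */
  __m128i hi = _mm_unpackhi_epi64(d1, d2); /* words 10, 11, 14, 15 */
  _mm_storeu_si128((__m128i *)(out + 0), _mm_shuffle_epi8(lo, key));
  _mm_storeu_si128((__m128i *)(out + 4), _mm_shuffle_epi8(hi, key));
#else
  ew_unpack_scalar(ewdata, out); /* portable fallback */
#endif
}

/* Returns 1 if ew_unpack agrees with the scalar reference on a test pattern. */
static int ew_unpack_selftest(void) {
  uint32_t ewdata[16], want[8], got[8];
  for (uint32_t i = 0; i < 16; i++)
    ewdata[i] = 0x01020304u * (i + 1) + 0x80000000u * (i & 1);
  ew_unpack_scalar(ewdata, want);
  ew_unpack(ewdata, got);
  return memcmp(want, got, sizeof want) == 0;
}
```

Build with something like -mssse3 on GCC or Clang to exercise the vector path; without it the self-test only compares the scalar loop with itself.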

Before two nights of stupid obvious vectorization:
Code:
$ du -b cen64
212064	cen64

... and after:
Code:
$ du -b cen64
208736	cen64
 

mrmudlord

New member
Before two nights of stupid obvious vectorization:
Code:
$ du -b cen64
212064	cen64

... and after:
Code:
$ du -b cen64
208736	cen64


You see? This is exactly the issue I have with byuu and angrylion. They just don't give a fuck about optimization at all. No matter how simple.
It boggles the mind as to how truly incompetent they really are, and how they are blinded by their dogmatic practices.

Seriously, they think SSE/NEON/etc is the devil. And they tout code == documentation. I mean, ffs.
 

Nintendo Maniac

New member
I find it interesting that mrmudlord here is no longer spewing hate at MarathonMan...

Anyway, I was always under the impression that you could have (slower) fallbacks so that the newest SSE extensions aren't a flat-out requirement. Was my impression wrong? If it wasn't, then I REALLY don't understand why you wouldn't want to take advantage of those newer SSE extensions...
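
For what it's worth, the usual way to get fallbacks like this is to detect the CPU's features once at startup and pick a code path. A minimal sketch, assuming GCC or Clang on x86 (where the __builtin_cpu_supports builtin exists); the enum and function names are made up for illustration and this is not how CEN64 itself selects paths:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tiers, ordered from slowest fallback to newest extension. */
typedef enum { SIMD_GENERIC = 0, SIMD_SSE2, SIMD_SSSE3, SIMD_SSE41 } simd_tier;

/* Pick the best tier the running CPU supports. Anything that is not
 * GCC/Clang on x86 falls through to the generic C path. */
static simd_tier detect_simd_tier(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse4.1")) return SIMD_SSE41;
  if (__builtin_cpu_supports("ssse3"))  return SIMD_SSSE3;
  if (__builtin_cpu_supports("sse2"))   return SIMD_SSE2;
#endif
  return SIMD_GENERIC;
}

/* Human-readable name for logging which path was chosen. */
static const char *simd_tier_name(simd_tier t) {
  static const char *const names[] = { "generic C", "SSE2", "SSSE3", "SSE4.1" };
  assert((int)t >= 0 && (int)t <= (int)SIMD_SSE41);
  return names[t];
}
```

The returned tier could then index a table of function pointers, one per code path, chosen once at startup so the hot loops never re-check.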
 

mrmudlord

New member
I find it interesting that mrmudlord here is no longer spewing hate at MarathonMan...

Mainly because MarathonMan is on the level. I have zero issue with cycle accuracy itself.
My issue is with incompetent hacks trying their hand at it. And failing.
 

Nintendo Maniac

New member
Mainly because MarathonMan is on the level. I have zero issue with cycle accuracy itself.
Several of your posts back in March-April seemed quite... I'll let them speak for themselves:

Meh, performance isn't the problem.

3 GHz Ivy Bridges are no problem at all. The issue should be accuracy. Since everyone cares about that, who gives a toss about speed? Byuu clearly doesn't give a damn, nor do mamedev. So neither should you. :)

Don't forget, there is also the pixel/cycle exact RDP too to emulate. :)

Oh yes, future proofing. Just like Crysis when it was released.

Who cares if it runs 10 seconds per frame if in 50 years time it will run fine. We must leave a legacy to our children.

If you are not an idiot, I can give you the oman archive alegend.

but people like Exophase are here, so they can [expletive removed].
 
OP
MarathonMan

MarathonMan

Emulator Developer
Anyway, I was always under the impression that you could have (slower) fallbacks so that the newest SSE extensions aren't a flat-out requirement. Was my impression wrong? If it wasn't, then I REALLY don't understand why you wouldn't want to take advantage of those newer SSE extensions...

I can, and I have. I released the PJ64 RSP plugin that's backed by CEN64's vectorized codepaths in both SSE2 and SSE4.1 flavours. When I ported the changes back into CEN64, I never bothered to re-enable the SSE2 path (SSSE3 is still a hard requirement, with switches to enable SSE4.1 if you have it). If you don't have SSSE3, chances are you can't run CEN64 anyway. And even if you can, all that has to be done is to copy-paste some old code and slap it in.
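
The SSE2 path itself isn't shown in the thread. As a rough idea of what an SSE2-only version of the edgewalker unpack from earlier could look like, here is a sketch that swaps the SSSE3 byte shuffle for shifts, masks, and 32-bit interleaves. The function names and the output order (R, G, B, A, then DRDX, DGDX, DBDX, DADX) are assumptions, and this is not the code from the plugin.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 */
#define EW_HAVE_SSE2 1
#endif

/* Scalar reference with the same bit logic as the scalar edgewalker listing. */
static void ew_unpack_scalar(const uint32_t *ewdata, uint32_t out[8]) {
  for (int i = 0; i < 4; i++) {
    uint32_t a = ewdata[8 + i];
    uint32_t b = ewdata[12 + i];
    out[2 * i + 0] = (a & 0xffff0000u) | (b >> 16);
    out[2 * i + 1] = (a << 16) | (b & 0x0000ffffu);
  }
}

static void ew_unpack_sse2(const uint32_t *ewdata, uint32_t out[8]) {
  assert(ewdata != NULL && out != NULL);
#if defined(EW_HAVE_SSE2)
  const __m128i hi_mask = _mm_set1_epi32(-0x10000);   /* 0xffff0000 per lane */
  const __m128i lo_mask = _mm_set1_epi32(0x0000ffff); /* 0x0000ffff per lane */
  __m128i a = _mm_loadu_si128((const __m128i *)(ewdata + 8));
  __m128i b = _mm_loadu_si128((const __m128i *)(ewdata + 12));
  /* hi = {R, B, DRDX, DBDX}, lo = {G, A, DGDX, DADX} */
  __m128i hi = _mm_or_si128(_mm_and_si128(a, hi_mask), _mm_srli_epi32(b, 16));
  __m128i lo = _mm_or_si128(_mm_slli_epi32(a, 16), _mm_and_si128(b, lo_mask));
  /* Interleave 32-bit lanes: {R, G, B, A} and {DRDX, DGDX, DBDX, DADX}. */
  _mm_storeu_si128((__m128i *)(out + 0), _mm_unpacklo_epi32(hi, lo));
  _mm_storeu_si128((__m128i *)(out + 4), _mm_unpackhi_epi32(hi, lo));
#else
  ew_unpack_scalar(ewdata, out); /* generic fallback for non-SSE2 targets */
#endif
}

/* Returns 1 if the SSE2 path agrees with the scalar reference. */
static int ew_unpack_sse2_selftest(void) {
  uint32_t ewdata[16], want[8], got[8];
  for (uint32_t i = 0; i < 16; i++)
    ewdata[i] = 0x01020304u * (i + 1) + 0x80000000u * (i & 1);
  ew_unpack_scalar(ewdata, want);
  ew_unpack_sse2(ewdata, got);
  return memcmp(want, got, sizeof want) == 0;
}
```

That comes to roughly 12 intrinsics instead of 8, which is the kind of cost that makes SSSE3's byte shuffle attractive as a baseline.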
 

Alegend45

New member
You see? This is exactly the issue I have with byuu and angrylion. They just don't give a fuck about optimization at all. No matter how simple.
It boggles the mind as to how truly incompetent they really are, and how they are blinded by their dogmatic practices.

Seriously, they think SSE/NEON/etc is the devil. And they tout code == documentation. I mean, ffs.

To be fair, one of the principal rules of programming is to optimize last, after debugging constantly for a while.
Also, C++ is more exact than English for documentation purposes.

I'm happy to optimize, but I always insist that it's portable, and if it isn't, that there is a generic C++ fallback to maintain portability.
 
