Announcement: Cycle-accurate N64 development underway.

MarathonMan · Jun 4, 2013

krom and durza007 rewrote portions of the video system. Not only does CEN64 now build on mingw64, but it also no longer uses slow glBegin()/glEnd() calls to render graphics... just a single glDrawArrays() call.

I don't want to bother with the RDP for the time being, and the RSP looks like it's stable enough to run simple commercial ROMs. So, when I find time, I am going to merge angrylion's z64 renderer into CEN64 and use that until I get a chance to go back and do all of my TODO's with the VR4300 and RSP (and audio and PIF and... sigh).

angrylion · Jun 5, 2013

When I started converting the renderer code from MESS 0.128 into my plugin, I used this header, z64.h from ziggy's z64gl plugin, which contained a handful of trivial defines to facilitate porting. Besides these defines there's no code from z64gl in my plugin, so there's no reason to call it z64, it's called angrylion's RDP plugin.
I'd like to share some thoughts about vectorizing the RDP. It should be safe to vectorize calculations for R, G, B color components. This can provide a visible speedup.
The biggest gains when vectorizing the renderer are typically expected from processing several pixels simultaneously. This can be done, but it's not straightforward at all. First of all, if you do this, you'll have a harder time adding cycle accuracy later. The RDP can hang on an arbitrary pixel, global parameters like "fill color" change on an arbitrary pixel. A lot of things can change in the middle of your batch of pixels, maybe even CPU writes into the frame buffer can intervene.
Besides that, you'll have a hard time maintaining the level of accuracy achieved in my code. You see, certain calculations for a particular pixel often depend on a previous/next pixel or even later pixels. There is this damned blender input called "memory color /alpha". In the 1st cycle of two-cycle blender it actually belongs to the previous pixel. I think it's even inherited across primitives. But when the 2nd part of the blender equation is "+ memrgb*memalpha", then in some circumstances, when many one-pixel spans have been processed, memory color and memory alpha belong to two different pixels. Likewise, parameters called "blender shifters" used inside the blender belong to the previous pixel in the 1st cycle of two-cycle blender. The combiner input called "texel 1 color" belongs to the next pixel in the 2nd cycle of two-cycle combiner, and this is exploited by the text in Monster Truck Madness 64, that's why the text is not rendered correctly by existing hardware-accelerated plugins. In one-cycle mode, however, "texel 1 color" may belong to the next or later pixels depending on the placement of the current pixel in its span and the length of said span. These dependencies on a span also come into play when computing lod fraction and tile indices (if tex_lod_en bit is set) in one-cycle mode. I've assumed in my code that these dependencies are similar for all frame buffer sizes, but I can be wrong. I'm sure other backward/forward dependencies for pixels can be discovered later.
I'm not saying this vectorization can't be done, I'm just bringing this point about high complexity of the resulting SIMD code.
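To make the "safe" kind of vectorization above concrete (treating the color components of one pixel as the lanes of a vector), here is a minimal sketch in C. The names `rgba_lanes` and `rgba_step` are hypothetical and are not taken from the plugin; the loop is written so a compiler can turn it into a 4-wide SIMD add.

```c
#include <stdint.h>

/* Hypothetical type: one pixel's color state, one lane per component. */
typedef struct {
    int32_t c[4]; /* R, G, B, A */
} rgba_lanes;

/* Step all four components by their per-pixel deltas.
 * The lanes never read each other, so the loop maps onto a single
 * 4-wide SIMD add. Only one pixel is handled per call, so any
 * dependency between neighbouring pixels is left untouched. */
static rgba_lanes rgba_step(rgba_lanes color, rgba_lanes delta)
{
    rgba_lanes out;
    for (int i = 0; i < 4; i++)
        out.c[i] = color.c[i] + delta.c[i];
    return out;
}
```

Because the vector spans components rather than pixels, nothing here needs to know what happened to the previous or next pixel, which is why this form avoids the accuracy problems described above.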

Zuzma · Jun 5, 2013

What about something like mlg.h (mylittle generic header) instead of z64.h? You could stick that it's from z64 somewhere inside the file maybe, I dunno.

MarathonMan · Jun 5, 2013

angrylion said:
When I started converting the renderer code from MESS 0.128 into my plugin, I used this header, z64.h from ziggy's z64gl plugin, which contained a handful of trivial defines to facilitate porting. Besides these defines there's no code from z64gl in my plugin, so there's no reason to call it z64, it's called angrylion's RDP plugin.
I'd like to share some thoughts about vectorizing the RDP. It should be safe to vectorize calculations for R, G, B color components. This can provide a visible speedup.
The biggest gains when vectorizing the renderer are typically expected from processing several pixels simultaneously. This can be done, but it's not straightforward at all. First of all, if you do this, you'll have a harder time adding cycle accuracy later. The RDP can hang on an arbitrary pixel, global parameters like "fill color" change on an arbitrary pixel. A lot of things can change in the middle of your batch of pixels, maybe even CPU writes into the frame buffer can intervene.
Besides that, you'll have a hard time maintaining the level of accuracy achieved in my code. You see, certain calculations for a particular pixel often depend on a previous/next pixel or even later pixels. There is this damned blender input called "memory color /alpha". In the 1st cycle of two-cycle blender it actually belongs to the previous pixel. I think it's even inherited across primitives. But when the 2nd part of the blender equation is "+ memrgb*memalpha", then in some circumstances, when many one-pixel spans have been processed, memory color and memory alpha belong to two different pixels. Likewise, parameters called "blender shifters" used inside the blender belong to the previous pixel in the 1st cycle of two-cycle blender. The combiner input called "texel 1 color" belongs to the next pixel in the 2nd cycle of two-cycle combiner, and this is exploited by the text in Monster Truck Madness 64, that's why the text is not rendered correctly by existing hardware-accelerated plugins. In one-cycle mode, however, "texel 1 color" may belong to the next or later pixels depending on the placement of the current pixel in its span and the length of said span. These dependencies on a span also come into play when computing lod fraction and tile indices (if tex_lod_en bit is set) in one-cycle mode. I've assumed in my code that these dependencies are similar for all frame buffer sizes, but I can be wrong. I'm sure other backward/forward dependencies for pixels can be discovered later.
I'm not saying this vectorization can't be done, I'm just bringing this point about high complexity of the resulting SIMD code.

Holy smokes. Thanks for the overview.

Stopping mid-pixel wouldn't be an overly large burden if you were to just throw away some results, I would think. I would have to see it in context.

There are still some things in your renderer that, at a quick glance, appear to be amenable to vectorization:

Code:

INT32 summand_xr = offx * SIGN13(object.SpanBase.m_span_dr >> 14);
INT32 summand_yr = offy * SIGN13(object.SpanBase.m_span_drdy >> 14);
INT32 summand_xb = offx * SIGN13(object.SpanBase.m_span_db >> 14);
INT32 summand_yb = offy * SIGN13(object.SpanBase.m_span_dbdy >> 14);
INT32 summand_xg = offx * SIGN13(object.SpanBase.m_span_dg >> 14);
INT32 summand_yg = offy * SIGN13(object.SpanBase.m_span_dgdy >> 14);
INT32 summand_xa = offx * SIGN13(object.SpanBase.m_span_da >> 14);
INT32 summand_ya = offy * SIGN13(object.SpanBase.m_span_dady >> 14);

OTOH, I was completely unaware of the fact the memory/alpha blender had inputs that relied on other outputs. I don't see how this would have played out in hardware (and not be really slow). That really throws a wrench in my plans if that's the case.

I don't know if I've said it before, but my knowledge of the RDP is absolutely horrendous.

angrylion · Jun 5, 2013

Good example, and it's related to the thing I wrote about, processing RGB/RGBA color components at once. In my current code, however, this looks like

Code:

summand_xr = offx * spans_cdr;
summand_yr = offy * spans_drdy;
summand_xg = offx * spans_cdg;
summand_yg = offy * spans_dgdy;

There's no need to do shifts and sign-extensions per-pixel in this case, we can do them once per primitive. You'll only find my recent updates in "nocomment" folder. The example you posted looks old to me. Maybe it was taken from the outdated and effectively abandoned "fptr" folder?
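A rough sketch of that hoisting, for the red channel only. Everything named here (`sign13`, `span_deltas`, `precompute_deltas`, `red_summands`) is hypothetical, and `sign13` is only assumed to behave like the SIGN13 macro (sign-extend the low 13 bits); it is not copied from either code base.

```c
#include <stdint.h>

/* Assumed equivalent of SIGN13: sign-extend the low 13 bits of x. */
static int32_t sign13(int32_t x)
{
    x &= 0x1fff;
    return (x & 0x1000) ? (x - 0x2000) : x;
}

/* Hypothetical per-primitive cache of already shifted, sign-extended deltas. */
typedef struct {
    int32_t cdr;  /* red delta along x */
    int32_t drdy; /* red delta along y */
} span_deltas;

/* Done once per primitive. Note that >> on a negative int32_t is
 * implementation-defined in C; this sketch assumes an arithmetic shift. */
static span_deltas precompute_deltas(int32_t span_dr, int32_t span_drdy)
{
    span_deltas d;
    d.cdr  = sign13(span_dr >> 14);
    d.drdy = sign13(span_drdy >> 14);
    return d;
}

/* Done per pixel: only the two multiplies remain. */
static void red_summands(const span_deltas *d, int32_t offx, int32_t offy,
                         int32_t *summand_xr, int32_t *summand_yr)
{
    *summand_xr = offx * d->cdr;
    *summand_yr = offy * d->drdy;
}
```

The per-pixel function then has the same shape as the four-line snippet above: one multiply per summand, with no shifts or sign extensions left in the inner loop.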

Concerning these inter-pixel dependencies, the RDP is pipelined, so it's understandable that texel1 in one-cycle mode (in most cases) refers to the next pixel, because the pipeline already prepares texel parameters for the latter. The behavior of "memory color / alpha" and "blender shifters" looks like an outright hardware bug, but it's obvious where it comes from. The blender on the 1st cycle of the two doesn't have enough time to deduce all its variables and takes some from the previous pixel, because they're marked as ready. So the principle behind these glitches is transparent, it's only that the behavior is unfortunately more complicated than just taking outputs from the previous/next pixel.

MarathonMan · Jun 6, 2013

angrylion said:
Good example, and it's related to the thing I wrote about, processing RGB/RGBA color components at once. In my current code, however, this looks like

Code:

summand_xr = offx * spans_cdr;
summand_yr = offy * spans_drdy;
summand_xg = offx * spans_cdg;
summand_yg = offy * spans_dgdy;

There's no need to do shifts and sign-extensions per-pixel in this case, we can do them once per primitive. You'll only find my recent updates in "nocomment" folder. The example you posted looks old to me. Maybe it was taken from the outdated and effectively abandoned "fptr" folder?

I think I grabbed it from MAME 0.148 (no patches)... so yeah, it's likely old code.

angrylion said:
Concerning these inter-pixel dependencies, the RDP is pipelined, so it's understandable that texel1 in one-cycle mode (in most cases) refers to the next pixel, because the pipeline already prepares texel parameters for the latter. The behavior of "memory color / alpha" and "blender shifters" looks like an outright hardware bug, but it's obvious where it comes from. The blender on the 1st cycle of the two doesn't have enough time to deduce all its variables and takes some from the previous pixel, because they're marked as ready. So the principle behind these glitches is transparent, it's only that the behavior is unfortunately more complicated than just taking outputs from the previous/next pixel.

I wouldn't blink twice if it was a HW bug. The amount of misinformation in the RSP docs alone is enough to cause concern.

angrylion · Jun 8, 2013

Ah, forgot to mention a combiner input called "combined color". In one-cycle mode and the 1st cycle of two-cycle mode it refers to the previous pixel.
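A toy illustration of why an input like this gets in the way of processing several pixels at once. This is a deliberately simplified sketch: a single channel, a made-up equation (the average of the texel and "combined"), and hypothetical names (`combiner_state`, `run_span`). It is not the real combiner equation, only the shape of the dependency.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state carried from pixel to pixel. */
typedef struct {
    uint8_t combined; /* output of the previously processed pixel */
} combiner_state;

/* Process one span. Each pixel reads st->combined, which still holds the
 * PREVIOUS pixel's result, and then overwrites it. Pixel i cannot be
 * computed before pixel i-1 has finished, so a naive "four pixels per
 * SIMD register" scheme would read stale values. */
static void run_span(combiner_state *st, const uint8_t *texel,
                     uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t result = (uint8_t)((texel[i] + st->combined) / 2);
        out[i] = result;
        st->combined = result; /* becomes "combined color" for pixel i+1 */
    }
}
```

Since the state outlives the call, the value also carries over into the next span, which mirrors the kind of inheritance across spans mentioned earlier for the blender inputs.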

MarathonMan · Jun 9, 2013

angrylion said:
Ah, forgot to mention a combiner input called "combined color". In one-cycle mode and the 1st cycle of two-cycle mode it refers to the previous pixel.

Thanks for that.

Question: your show_* functions are all for debugging, right? I don't want to put the effort into supporting them at this point. My current plan is to change update_rsp() so that it copies the framebuffer to RDRAM, right before a vsync, so that the video subsystem displays the proper image.

angrylion · Jun 9, 2013

That's true.

jevankovich · Jun 14, 2013

MarathonMan, your idea for run-ahead execution seems an awful lot like software transactional memory (STM). If I remember rightly, STM allows multiple threads to run asynchronously until one of the threads does something that would affect the other threads. When such an event occurs, all of the threads backtrack to resynchronize so they are all correct. I might be remembering this wrong, though. Information is available on Wikipedia (I'm not allowed to post links yet. A simple search on Wikipedia should do the job.).

MarathonMan · Jun 14, 2013

jevankovich said:
MarathonMan, your idea for run-ahead execution seems an awful lot like software transactional memory (STM). If I remember rightly, STM allows multiple threads to run asynchronously until one of the threads does something that would affect the other threads. When such an event occurs, all of the threads backtrack to resynchronize so they are all correct. I might be remembering this wrong, though. Information is available on Wikipedia (I'm not allowed to post links yet. A simple search on Wikipedia should do the job.).

Right, that's more or less the idea.
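For anyone following along, the checkpoint/rollback pattern being discussed can be sketched roughly like this. It is a single-threaded toy with hypothetical names (`device_state`, `step`, `run_ahead`) and a stand-in for real work; it does not reflect how CEN64 is actually structured.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state of one emulated component. */
typedef struct {
    uint64_t cycles;
    uint32_t regs[4];
} device_state;

/* Stand-in for one cycle of real work. */
static void step(device_state *dev)
{
    dev->regs[0] += 1;
    dev->cycles++;
}

/* Run `dev` ahead by `budget` cycles without synchronizing.
 * `conflict_at` is the cycle at which some other component touched
 * shared state, or UINT64_MAX if nothing did. Returns true if the
 * speculative work was kept, false if it was rolled back and replayed
 * only up to the conflict point. */
static bool run_ahead(device_state *dev, uint64_t budget, uint64_t conflict_at)
{
    device_state checkpoint = *dev; /* snapshot before speculating */

    for (uint64_t i = 0; i < budget; i++)
        step(dev);

    if (conflict_at >= checkpoint.cycles && conflict_at < dev->cycles) {
        *dev = checkpoint; /* discard the speculative work */
        while (dev->cycles < conflict_at)
            step(dev);     /* redo the part that is known to be safe */
        return false;
    }
    return true;
}
```

The cost model is the usual one for this kind of scheme: cheap when conflicts are rare, since the snapshot is the only overhead, and wasteful when they are frequent, since the discarded cycles are run twice.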

Remote · Jun 14, 2013

Give us some sweet info about the current status

MarathonMan · Jun 15, 2013

Remote said:
Give us some sweet info about the current status

To be honest, I bought a Wii U and Skyward Sword, so the CEN64 gears haven't been churning as much lately.

I am trying to get angrylion's pixel-accurate RDP renderer merged into CEN64. I'm hoping the RSP is ready enough to make use of it. The code compiles and the RDP gets the commands, but there are some issues that I need to iron out.

EDIT: krom has completed a vast majority of the CP1 opcodes for me since I last posted as well... I will be committing them soon too. This enables several additional homebrew ROMs to run amongst other things.

Nintendo Maniac · Jun 20, 2013

Something interesting I just found out, surprisingly Haswell is about 15-20% faster clock-for-clock than Ivy Bridge in both Dolphin and PCSX2. This is particularly interesting because no other benchmark or performance test seems to demonstrate such a performance increase with Haswell, which may mean that, for whatever reason, Haswell is particularly fast at emulation. Therefore, in theory, this 15-20% performance increase over Ivy could very well apply to CEN64 as well.

Sources:
http://forums.dolphin-emu.org/Threa...wind-waker-cpu-benchmark?pid=278859#pid278859

http://forums.pcsx2.net/Thread-CPU-Benchmark-designed-for-PCSX2-based-on-FFX-2?pid=303948#pid303948

MarathonMan · Jun 20, 2013

Nintendo Maniac said:
Something interesting I just found out, surprisingly Haswell is about 15-20% faster clock-for-clock than Ivy Bridge in both Dolphin and PCSX2. This is particularly interesting because no other benchmark or performance test seems to demonstrate such a performance increase with Haswell, which may mean that, for whatever reason, Haswell is particularly fast at emulation. Therefore, in theory, this 15-20% performance increase over Ivy could very well apply to CEN64 as well.

Sources:
http://forums.dolphin-emu.org/Threa...wind-waker-cpu-benchmark?pid=278859#pid278859

http://forums.pcsx2.net/Thread-CPU-Benchmark-designed-for-PCSX2-based-on-FFX-2?pid=303948#pid303948

There's some MAME benchmarks floating around that suggest the same thing.

I don't see why CEN64 would be an exception to these findings.

Looking at the microarchitectural changes, I'm not sure what would have caused this: http://www.realworldtech.com/haswell-cpu/6/

Anyone with a Haswell on hand, feel free to report in.

MarathonMan · Jun 23, 2013

Been scouting for the RSP (?) bug, can't seem to find it.

Many games, like Rampage: World Tour, and public domain demos issue an RSP BREAK instruction after their task is complete. The status register has SIG2 set to mark the task as finished.

... yet the RSP never gets any more tasks, and the whole scheduler seems to come to a grinding halt?

I'm not sure why.

EDIT: It appears that the RDP is not executing a SyncFull command (confirmed)... hmm...

Alegend45 · Jun 23, 2013

And Super Mario 64 executes VMULF, which at least looks easy to code. Hell, I tried to.

MarathonMan · Jun 24, 2013

Alegend45 said:
And Super Mario 64 executes VMULF, which at least looks easy to code. Hell, I tried to.

The problems are MUCH more severe than VMULF being an unimplemented instruction.

Alegend45 · Jun 24, 2013

I know, I'm just saying. Besides, there seems to be some sort of RDP-related deadlock.

MarathonMan · Jun 24, 2013

Alegend45 said:
I know, I'm just saying. Besides, there seems to be some sort of RDP-related deadlock.

This is actually an RSP issue (I think). If I understand libultra correctly, the RSP "builds" the RDP commands dynamically at runtime. It then ships them off to the RDP, which executes them.

For some reason, the RSP never generates the SyncFull command, so the RDP never raises an interrupt, and the R4300 ends up believing that the RDP still hasn't rendered the last display list (or something to that effect).

If I raise an RDP interrupt along with the RSP interrupts on a BREAK instruction, the ROMs that "deadlock" proceed as normal.
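Roughly, the flow and the workaround look like the sketch below. The struct and function names (`toy_bus`, `rsp_break`, `rdp_sync_full`) are made up for illustration, and the bit values are my understanding of the SP_STATUS and MI_INTR layouts; treat them as assumptions and check them against a hardware reference before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, not verified against CEN64's headers. */
#define SP_STATUS_HALT       0x0001u
#define SP_STATUS_BROKE      0x0002u
#define SP_STATUS_INTR_BREAK 0x0040u /* raise an interrupt on BREAK */
#define MI_INTR_SP           0x01u
#define MI_INTR_DP           0x20u

/* Hypothetical, minimal view of the shared registers. */
typedef struct {
    uint32_t sp_status;
    uint32_t mi_intr;
} toy_bus;

/* RSP executes BREAK: it halts, and raises the SP interrupt if enabled.
 * `dp_workaround` models the hack described above: also raise the DP
 * interrupt, as if the RDP had seen a SyncFull. */
static void rsp_break(toy_bus *bus, bool dp_workaround)
{
    bus->sp_status |= SP_STATUS_HALT | SP_STATUS_BROKE;
    if (bus->sp_status & SP_STATUS_INTR_BREAK)
        bus->mi_intr |= MI_INTR_SP;
    if (dp_workaround)
        bus->mi_intr |= MI_INTR_DP;
}

/* What should normally raise the DP interrupt: the RDP reaching SyncFull. */
static void rdp_sync_full(toy_bus *bus)
{
    bus->mi_intr |= MI_INTR_DP;
}
```

With `dp_workaround` set, the CPU sees both interrupts after a BREAK even though `rdp_sync_full` never ran, which matches the symptom that the "deadlocked" ROMs start moving again.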

EDIT: It appears to be an issue related to the merging of angrylion's renderer. Somehow, I must have botched it enough that it wasn't emulating instructions properly.

Switching to an older RDP pipeline that only really accepts SyncFull commands worked.
