When I started converting the renderer code from MESS 0.128 into my plugin, I used this header, z64.h from ziggy's z64gl plugin, which contained a handful of trivial defines to facilitate porting. Besides these defines, there's no code from z64gl in my plugin, so there's no reason to call it z64. It's called angrylion's RDP plugin.
I'd like to share some thoughts about vectorizing the RDP. It should be safe to vectorize calculations for R, G, B color components. This can provide a visible speedup.
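To make the per-component idea concrete, here is a minimal sketch in C. It is not code from the plugin: the blend formula, the names rgb_t, blend_scalar and blend_lanes, and the 0..32 weight range are all assumptions chosen for illustration. It computes R, G and B of one pixel in separate 16-bit lanes of a single SSE2 register and falls back to scalar code where SSE2 is not available.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical pixel type, for illustration only. */
typedef struct { uint8_t r, g, b; } rgb_t;

/* Scalar reference: the same illustrative formula applied to each component.
   Assumption: weights a and b are in 0..32, so every sum fits in 16 bits. */
static rgb_t blend_scalar(rgb_t p, rgb_t m, int a, int b)
{
    rgb_t out;
    assert(a >= 0 && a <= 32 && b >= 0 && b <= 32);
    out.r = (uint8_t)((p.r * a + m.r * b) >> 5);
    out.g = (uint8_t)((p.g * a + m.g * b) >> 5);
    out.b = (uint8_t)((p.b * a + m.b * b) >> 5);
    return out;
}

/* Lane-wise version: R, G, B of ONE pixel occupy lanes 0, 1, 2. */
static rgb_t blend_lanes(rgb_t p, rgb_t m, int a, int b)
{
#if defined(__SSE2__)
    rgb_t out;
    __m128i vp, vm, va, vb, sum, res;
    assert(a >= 0 && a <= 32 && b >= 0 && b <= 32);
    /* _mm_set_epi16 takes lanes from highest to lowest; the last argument is lane 0. */
    vp  = _mm_set_epi16(0, 0, 0, 0, 0, p.b, p.g, p.r);
    vm  = _mm_set_epi16(0, 0, 0, 0, 0, m.b, m.g, m.r);
    va  = _mm_set1_epi16((short)a);
    vb  = _mm_set1_epi16((short)b);
    sum = _mm_add_epi16(_mm_mullo_epi16(vp, va), _mm_mullo_epi16(vm, vb));
    res = _mm_srli_epi16(sum, 5);
    out.r = (uint8_t)_mm_extract_epi16(res, 0);
    out.g = (uint8_t)_mm_extract_epi16(res, 1);
    out.b = (uint8_t)_mm_extract_epi16(res, 2);
    return out;
#else
    return blend_scalar(p, m, a, b); /* no SSE2: keep the scalar path */
#endif
}
```

Because all three lanes belong to the same pixel, nothing here depends on neighbouring pixels, which is why this kind of vectorization is the safe one.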
The biggest gains when vectorizing the renderer are typically expected from processing several pixels simultaneously. This can be done, but it's not straightforward at all. First of all, if you do this, you'll have a harder time adding cycle accuracy later. The RDP can hang on an arbitrary pixel, and global parameters like "fill color" can change on an arbitrary pixel. A lot of things can change in the middle of your batch of pixels, and perhaps even CPU writes into the frame buffer can intervene.
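The mid-batch problem can be sketched as follows. This is a hypothetical model, not plugin code: BATCH, pending_change_t and fill_span_batched are made-up names, and a single pending fill color change stands in for everything that could happen on an arbitrary pixel. The inner loop stands in for one SIMD store. The point is that a batch has to be cut short at the pixel where the change takes effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH 4 /* assumed SIMD width in pixels */

/* Hypothetical: one pending fill color change that takes effect at pixel `at`. */
typedef struct { size_t at; uint32_t new_color; } pending_change_t;

/* Fills `count` pixels and returns how many batches had to be issued. */
static size_t fill_span_batched(uint32_t *fb, size_t count, uint32_t color,
                                const pending_change_t *chg)
{
    size_t x = 0, batches = 0;
    while (x < count) {
        size_t i, n = (count - x < BATCH) ? count - x : BATCH;
        /* The change lands inside this batch: stop the batch right before it. */
        if (chg && chg->at > x && chg->at < x + n)
            n = chg->at - x;
        /* The change lands exactly on the batch start: apply it first. */
        if (chg && chg->at == x)
            color = chg->new_color;
        for (i = 0; i < n; i++) /* stands in for one vector store */
            fb[x + i] = color;
        x += n;
        batches++;
    }
    return batches;
}
```

With eight pixels and a change at pixel 2, this issues three batches (2 + 4 + 2 pixels) instead of two, so part of the expected gain is spent on bookkeeping.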
Besides that, you'll have a hard time maintaining the level of accuracy achieved in my code. You see, certain calculations for a particular pixel often depend on the previous pixel, the next pixel, or even later pixels. There is this damned blender input called "memory color/alpha". In the 1st cycle of the two-cycle blender it actually belongs to the previous pixel. I think it's even inherited across primitives. But when the 2nd part of the blender equation is "+ memrgb*memalpha", then in some circumstances, when many one-pixel spans have been processed, memory color and memory alpha belong to two different pixels. Likewise, the parameters called "blender shifters" used inside the blender belong to the previous pixel in the 1st cycle of the two-cycle blender. The combiner input called "texel 1 color" belongs to the next pixel in the 2nd cycle of the two-cycle combiner. This is exploited by the text in Monster Truck Madness 64, which is why that text is not rendered correctly by existing hardware-accelerated plugins. In one-cycle mode, however, "texel 1 color" may belong to the next pixel or to later pixels, depending on the placement of the current pixel in its span and the length of said span. These dependencies on a span also come into play when computing the lod fraction and tile indices (if the tex_lod_en bit is set) in one-cycle mode. I've assumed in my code that these dependencies are similar for all frame buffer sizes, but I could be wrong. I'm sure other backward/forward dependencies between pixels can be discovered later.
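The "memory color belongs to the previous pixel" case can be modelled as a value carried from one loop iteration to the next. The sketch below is a deliberately simplified, single-channel model under assumed names (blender_state_t, blend_span) and an assumed blend formula. It is not the plugin's blender. It only shows the shape of the dependency: the value used for pixel x is the one fetched for pixel x-1, and the carried state outlives the span.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state that persists across spans and primitives. */
typedef struct { uint8_t prev_mem; } blender_state_t;

/* Blends one span in place. Assumption: weights a and b are in 0..32. */
static void blend_span(blender_state_t *st, const uint8_t *src, uint8_t *fb,
                       size_t count, int a, int b)
{
    size_t x;
    assert(a >= 0 && a <= 32 && b >= 0 && b <= 32);
    for (x = 0; x < count; x++) {
        uint8_t mem_now  = fb[x];        /* memory color fetched for this pixel */
        uint8_t mem_used = st->prev_mem; /* the blend sees the previous pixel's value */
        st->prev_mem = mem_now;          /* carried into the next pixel or span */
        fb[x] = (uint8_t)((src[x] * a + mem_used * b) >> 5);
    }
}
```

A batch of pixels cannot treat its lanes as independent here: lane 0 needs the last value fetched by the previous batch (or by the previous span), and the two-pixel case for memory color versus memory alpha would need two such carried values with different update rules.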
I'm not saying this vectorization can't be done. I'm just pointing out the high complexity of the resulting SIMD code.