Announcement: Cycle-accurate N64 development underway.

Nintendo Maniac

New member
I know Haswell's gain over Ivy is only about as big as Ivy's was over Sandy, but I figured the minor IPC improvements in Ivy and Haswell together would be enough. Perhaps I was overestimating CEN64's current performance?

You could overclock a low-end SB/IB chip and it'd run circles around a stock Haswell chip as far as the simulator's performance goes.
I'm talking about OC'd Haswell vs your OC'd Sandy...
 
OP
MarathonMan

MarathonMan

Emulator Developer
I know Haswell's gain over Ivy is only about as big as Ivy's was over Sandy, but I figured the minor IPC improvements in Ivy and Haswell together would be enough. Perhaps I was overestimating CEN64's current performance?


I'm talking about OC'd Haswell vs your OC'd Sandy...

The difference in IPC should more or less linearly correlate with performance improvements for CEN64. So if you were getting 40 VI/s at N MHz on IB, you'll likely see 42 VI/s at N MHz on Haswell. OTOH, you could just clock your chip an extra couple hundred MHz and you'd probably see the same returns. Unless you're shooting for a world record, the new architecture isn't really going to do much good IMO.

Also: The RSP simulator works almost perfectly after I merged it into CEN64. I was expecting lots of trouble. Instead, I got commands being FIFO'd to the graphics processor within the first couple of boots of SM64. :)

Code:
59274 [RSP]: "SPRegWrite: Writing to register [CMD_STATUS]."
59275 [RSP]: "SPRegWrite: Writing to register [CMD_START]."                          
59276 [RSP]: "SPRegWrite: Writing to register [CMD_END]."
59277 [RSP]: "SPRegRead: Reading from register [SP_DMA_FULL_REG]."
 

Alegend45

New member
Holy crap. Dude, this is seriously impressive! I couldn't do this alone... I can somewhat code in C, but I can't test due to my computer. I only have SSE2 on the damn thing.
 
Last edited by a moderator:

Exophase

Emulator Developer
Cool that you're focusing on performance with this. I saw you mention a few times the TLB emulation as one of the big optimizations. I was surprised to see you use a red-black tree representation. Have you tried a flat sorted array with hardcoded interval lookups? I'm curious how that'd compare..

I also wonder how many games enable the TLB. I know a few do, but IIRC they're a minority.
 
OP
MarathonMan

MarathonMan

Emulator Developer
Holy crap. Dude, this is seriously impressive! I couldn't do this alone... I can somewhat code in C, but I can't test due to my computer. I only have SSE2 on the damn thing.

In my experience, if your processor only supports SSE2, chances are it isn't cut out to run it fast enough anyways. I'm hoping to provide alternatives to the SSE codepath at a later time, though.

Cool that you're focusing on performance with this. I saw you mention a few times the TLB emulation as one of the big optimizations. I was surprised to see you use a red-black tree representation. Have you tried a flat sorted array with hardcoded interval lookups? I'm curious how that'd compare..

I also wonder how many games enable the TLB. I know a few do, but IIRC they're a minority.

I knew zilch about the N64 six months ago when I started the project. The TLB was a logical starting point because I could testcase it, and I planned to work outwards from there. I kind of regret it after learning how the system really works, because it turns out that, yes, most games don't even use it. :)

The RB tree was beating a linear scan back in the day when I tried it, but only by a very small margin. I might nix it down the road, but it's nice to have on hand to experiment with when the time comes.
 

Exophase

Emulator Developer
Rather than a linear scan I was thinking of a fully expanded binary search over a sorted flat array, something like this:

http://pastebin.com/ZXsZA7St

Most of the performance will probably be tied up in how regular the branches are. Maybe better to do linear searches at the lower levels. Possibly worth using SSE for searching elements in parallel. In practice a small hash table would probably be faster.. and either way putting a small micro-ITLB and DTLB in front of it is a must. You were probably already thinking something like that. I'd go with a single entry for micro-ITLB and maybe 2-4 for micro-DTLB.
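In case the pastebin link dies, here's a rough sketch of the idea (entry count, struct layout, and names are all my own placeholders, not the pastebin's contents):

Code:
#include <stdint.h>

/* Fully unrolled lower-bound binary search over a flat, sorted array
 * of 32 TLB entries. Struct layout is illustrative only. */
struct tlb_entry { uint32_t vpn_start; uint32_t pfn; };

static unsigned tlb_search(const struct tlb_entry *e, uint32_t vaddr) {
  unsigned i = 0;
  /* log2(32) = 5 hardcoded steps; compilers tend to emit cmovs here. */
  if (e[i + 16].vpn_start <= vaddr) i += 16;
  if (e[i +  8].vpn_start <= vaddr) i +=  8;
  if (e[i +  4].vpn_start <= vaddr) i +=  4;
  if (e[i +  2].vpn_start <= vaddr) i +=  2;
  if (e[i +  1].vpn_start <= vaddr) i +=  1;
  return i;  /* best candidate; caller still checks the page bounds. */
}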

I guess it's more of an academic exercise, but still useful for a few games, Goldeneye being one of them IIRC.
 
OP
MarathonMan

MarathonMan

Emulator Developer
Rather than a linear scan I was thinking of a fully expanded binary search over a sorted flat array, something like this:

http://pastebin.com/ZXsZA7St

Most of the performance will probably be tied up in how regular the branches are. Maybe better to do linear searches at the lower levels. Possibly worth using SSE for searching elements in parallel. In practice a small hash table would probably be faster.. and either way putting a small micro-ITLB and DTLB in front of it is a must. You were probably already thinking something like that. I'd go with a single entry for micro-ITLB and maybe 2-4 for micro-DTLB.

I guess it's more of an academic exercise, but still useful for a few games, Goldeneye being one of them IIRC.

Hash table would be ideal, but the problem is the MIPS variable page size. You don't know how wide the page offset is, so you can't hash on the virtual page number. The RB tree is more or less performing a binary search; however, instead of copying around all the entries to keep them in order (each entry is 64B), it maintains a sorted tree. Also, even with the binary search, you still need to spend a while inspecting each entry because of the variable page size. The tree lets you do just a subtraction and a comparison per node (it's designed for very, very fast lookup, even if insertion is a constant factor more expensive than something simpler like a linear array).
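To illustrate the subtract-and-compare test (field names are made up for the example, not the actual CEN64 structures):

Code:
#include <stdint.h>

/* Each node carries its own extent, since MIPS pages vary in size. */
struct tlb_node {
  uint64_t vpn_start;  /* first virtual address the entry maps */
  uint64_t span;       /* bytes covered; varies with the page mask */
};

/* One subtraction and one unsigned comparison per node visited. */
static int node_hit(const struct tlb_node *n, uint64_t vaddr) {
  return (vaddr - n->vpn_start) < n->span;
}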

Eventually I was hoping to basically leave TLB entries in the tree at all times (to make insertion a non-factor once the game has warmed up) and include a valid/swapped-in field.
 
Last edited:

Exophase

Emulator Developer
Didn't know the CPU had variable page sizes, that's pretty overblown >_> Fancy MMU for something with direct-mapped caches.. Not sure if I get all the implications, but no matter what your tree arrangement is like you should be able to flatten it and make the pointer relationships implicit. And assume fixed interval sizes.

I get that the red-black tree has algorithmically faster insert/delete times, but at this small a fixed size I'm not sure it really beats a flat array.. and besides, the ratio of TLB lookups to TLB modifications has to be huge in any normal game. A flat array is also faster when you have to invalidate the entire TLB.

I'm skeptical that letting the TLB tree grow is really going to do better, since it just adds more and more layers of potential branch mispredicts, especially if you still have to emulate the TLB misses anyway. Yeah, you remove the cost of deleting items from the tree, but you make lookups and insertions grow more and more expensive in the process..
 
OP
MarathonMan

MarathonMan

Emulator Developer
Didn't know the CPU had variable page sizes, that's pretty overblown >_> Fancy MMU for something with direct-mapped caches.. Not sure if I get all the implications, but no matter what your tree arrangement is like you should be able to flatten it and make the pointer relationships implicit. And assume fixed interval sizes.

I get that the red-black tree has algorithmically faster insert/delete times, but at this small a fixed size I'm not sure it really beats a flat array.. and besides, the ratio of TLB lookups to TLB modifications has to be huge in any normal game. A flat array is also faster when you have to invalidate the entire TLB.

I'm skeptical that letting the TLB tree grow is really going to do better, since it just adds more and more layers of potential branch mispredicts, especially if you still have to emulate the TLB misses anyway. Yeah, you remove the cost of deleting items from the tree, but you make lookups and insertions grow more and more expensive in the process..

I agree that it's speculative and certainly a gamble. I aligned the TLB entries to a cache line on x86 to make sure that cache misses during lookups are kept to a minimum. I originally designed it with the idea in mind that I'd have to do 93.75mil lookups/sec under constant bombardment of the instruction cache (think beq $0, $0, -4). The branch misprediction is no worse than that of a linear lookup (it's actually better than PJ64's IIRC, since they use multiple branches to evaluate one entry), but yes, I do plan on testing it to make sure that it's actually faster before using it as the de facto component.

The idea was also to hide the cost of removal/modification by simulating the tree as if it had an infinite number of entries, only 32 of which are flagged as 'active' at any given time. Then all you need to do to mark something as swapped in or out is flip a bit. This idea doesn't support modification of existing entries, though, so it certainly has a big limitation attached.
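As a rough picture of what such a node could look like (purely illustrative; the real layout surely differs):

Code:
#include <stdint.h>

/* One node per 64B x86 cache line; swapping an entry in or out of the
 * emulated TLB is just a write to 'active'. */
struct tlb_tree_node {
  uint64_t vpn_start, span;  /* mapped range */
  uint32_t pfn[2];           /* even/odd physical page numbers */
  uint8_t  active;           /* is this one of the 32 live entries? */
  uint8_t  pad[39];          /* pad to exactly 64 bytes */
} __attribute__((aligned(64)));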

Nice to have somebody technical to talk to. :)
 
Last edited:
OP
MarathonMan

MarathonMan

Emulator Developer
Where did you get your docs, and can I see?

I'm not sure I can share everything. The source code should hopefully serve as documentation. ;)

The VR4300 datasheet can easily be found via Google, the N64 Ultra Programming Manual is on this board a few posts down, etc.
 

mrmudlord

New member
If you are not an idiot, I can give you the oman archive, alegend.

But people like Exophase are here, so they can [expletive removed].
 
Last edited by a moderator:

Exophase

Emulator Developer
I was thinking about the TLB stuff more and I'm not sure why I started talking about emulating it as a direct data structure.. other emulators just use a one-level page table for it. Have you compared performance against something like that? I'd be really surprised if searching a TLB, no matter how optimized, was faster, except in really weird, not-game-like scenarios. Especially if the alternative is a tree that keeps growing with invalid entries.
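In other words, something along these lines (table size and names are assumptions, for 32-bit mode with 4KB pages):

Code:
#include <stddef.h>
#include <stdint.h>

/* One host pointer per 4KB page of the 32-bit virtual address space. */
static uint8_t *page_table[1 << 20];

static uint8_t *translate(uint32_t vaddr) {
  uint8_t *page = page_table[vaddr >> 12];  /* shift + array lookup */
  return page ? page + (vaddr & 0xFFF)      /* hit: direct host pointer */
              : NULL;                       /* miss: take the slow path */
}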
 
OP
MarathonMan

MarathonMan

Emulator Developer
I was thinking about the TLB stuff more and I'm not sure why I started talking about emulating it as a direct data structure.. other emulators just use a one-level page table for it. Have you compared performance against something like that? I'd be really surprised if searching a TLB, no matter how optimized, was faster, except in really weird, not-game-like scenarios. Especially if the alternative is a tree that keeps growing with invalid entries.

One-level page tables are generally horrendous in performance. I've tried them before. That being said, when I try to squeeze out extra performance for things like Goldeneye, it might be worth trying to see whether the opposite holds for the N64.
 

Exophase

Emulator Developer
One-level page tables are generally horrendous in performance. I've tried them before. That being said, when I try to squeeze out extra performance for things like Goldeneye, it might be worth trying to see whether the opposite holds for the N64.

Can you give any more context for that comparison? Like, vs. what, and tested with what kind of code? Did you do any profiling?

First, I should make it clear I'm only talking about 32-bit address mode.. I have no idea whether N64 games use the 64-bit address mode or not. For that, at least another level would be in order.

In terms of operations, single-level page table vs TLB lookup is a no-brainer: a shift + array lookup to get the translation vs traversing a tree. It's also branchless, so there's no concern of mispredicts. The part that can potentially differ is cache footprint. Your current TLB has a fixed ~2KB cache impact, while a one-level page table needs up to a cache line for each 4KB page mapped. But the thing is, if the TLB accesses have bad locality of reference (so they cache poorly) then the emulated icache and dcache will have poor locality of reference too, and that will greatly throttle the number of TLB accesses you need to emulate. Furthermore, unless the game uses fairly big page sizes, you'll pretty easily keep the page table entries corresponding to the resident 32 TLB entries in L1 or at least L2 cache (@ 4KB it's at most 32 cache lines, in practice far fewer). If the game does end up accessing lots of different pages under the same large-page TLB entry, that'd be a good argument for multi-level page tables, but still only if it really has poor locality of reference.

And if the game goes outside those 32 entries you'll be throttled by emulating the TLB fault handler.

If you keep a TLB tree that keeps growing, then with your current node size (64 bytes per entry) and strictly 4KB pages you'll spend as much cache as the very worst case for a page table. It gets better with larger pages, but you'd still need pretty sparse TLB mappings, which is very unusual..
 
OP
MarathonMan

MarathonMan

Emulator Developer
Can you give any more context for that comparison? Like, vs. what, and tested with what kind of code? Did you do any profiling?

First, I should make it clear I'm only talking about 32-bit address mode.. I have no idea whether N64 games use the 64-bit address mode or not. For that, at least another level would be in order.

In terms of operations, single-level page table vs TLB lookup is a no-brainer: a shift + array lookup to get the translation vs traversing a tree. It's also branchless, so there's no concern of mispredicts. The part that can potentially differ is cache footprint. Your current TLB has a fixed ~2KB cache impact, while a one-level page table needs up to a cache line for each 4KB page mapped. But the thing is, if the TLB accesses have bad locality of reference (so they cache poorly) then the emulated icache and dcache will have poor locality of reference too, and that will greatly throttle the number of TLB accesses you need to emulate. Furthermore, unless the game uses fairly big page sizes, you'll pretty easily keep the page table entries corresponding to the resident 32 TLB entries in L1 or at least L2 cache (@ 4KB it's at most 32 cache lines, in practice far fewer). If the game does end up accessing lots of different pages under the same large-page TLB entry, that'd be a good argument for multi-level page tables, but still only if it really has poor locality of reference.

And if the game goes outside those 32 entries you'll be throttled by emulating the TLB fault handler.

If you keep a TLB tree that keeps growing, then with your current node size (64 bytes per entry) and strictly 4KB pages you'll spend as much cache as the very worst case for a page table. It gets better with larger pages, but you'd still need pretty sparse TLB mappings, which is very unusual..

I work on closed- and open-source commercial and professional-grade simulators for x86/ARM/etc. When we used a std::unordered_map to hash the page tables, it was taking over 23% (?) of our runtime IIRC. We got a noticeable boost when switching to optimized range trees. I can't vouch for the hash function or table size that was chosen, though; I just remember seeing the report. :D

I'm emulating the full 64-bit address space, regardless of what games do. Accuracy > performance. Because of this, even 2-level page tables are probably out of the question. The VR4300 has a 2.5TB (total) TLB-mapped virtual address space... IMO that's 3-level worthy!

I'd be willing to bet that by changing the TLBTree so that it has:
* A fixed, co-allocated pool of entries in memory.
* Integer indices instead of pointers (also less space).
* etc etc.

it'd be faster than nested page tables (see the sketch at the end of this post). I wonder if you could also gamble and maintain an ordered queue of TLB entries, pulling entries up as they get hit. I'd be willing to guess that it's unlikely for something like an N64 game to be jumping all over the memory space.

EDIT: A hash table using [address & ~(largest_page_size - 1)] as the key might also work well for narrowing down candidates.
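For concreteness, the bullet points above might shake out into something like this (pool size, field names, and layout all hypothetical):

Code:
#include <stdint.h>

#define TLB_POOL_SIZE 64
#define NIL 0xFF

/* All nodes co-allocated in one array; children are byte-sized pool
 * indices rather than pointers. */
struct tlb_tree_node {
  uint64_t vpn_start, span;
  uint8_t  left, right;  /* pool indices; NIL = no child */
  uint8_t  active;       /* the swapped-in flag described above */
};

static struct tlb_tree_node pool[TLB_POOL_SIZE];

static int tree_lookup(uint64_t vaddr) {
  uint8_t i = 0;  /* root kept at slot 0 */
  while (i != NIL) {
    if (vaddr - pool[i].vpn_start < pool[i].span)
      return pool[i].active ? (int)i : -1;  /* inactive entry = miss */
    i = vaddr < pool[i].vpn_start ? pool[i].left : pool[i].right;
  }
  return -1;
}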
 
Last edited:

Exophase

Emulator Developer
I work on closed- and open-source commercial and professional-grade simulators for x86/ARM/etc. When we used a std::unordered_map to hash the page tables, it was taking over 23% (?) of our runtime IIRC. We got a noticeable boost when switching to optimized range trees. I can't vouch for the hash function or table size that was chosen, though; I just remember seeing the report. :D

Fair, but a hash table isn't the same as a flat one-level page table like I'm suggesting. This is for the 32-bit address space mode (31 bits of virtual address) with 1M entries, so an array that's something like 4-8MB.

I'm emulating the full 64-bit address space, regardless of what games do. Accuracy > performance. Because of this, even 2-level page tables are probably out of the question. The VR4300 has a 2.5TB (total) TLB-mapped virtual address space... IMO that's 3-level worthy!

I'd be willing to bet that by changing the TLBTree so that it has:
* A fixed, co-allocated pool of entries in memory.
* Integer indices instead of pointers (also less space).
* etc etc.

it'd be faster than nested page tables. I wonder if you could also gamble and maintain an ordered queue of TLB entries, pulling entries up as they get hit. I'd be willing to guess that it's unlikely for something like an N64 game to be jumping all over the memory space.

EDIT: A hash table using [address & ~(largest_page_size - 1)] as the key might also work well for narrowing down candidates.

Since the processor has separate 32-bit and 64-bit address modes, emulating it accurately means emulating both separately. No reason you need to use the same emulation strategy for both, and in this case what games use will matter (personally I think you'd be crazy to use the 64-bit mode on N64, but who knows). I also don't agree that the 64-bit mode (40-bit virtual address space) necessarily needs more than two-level tables. This isn't like an OS where you need a separate page table for many processes. If your emulator needs a high-end CPU to run fast enough, then I think using data structures that take even dozens of MB isn't strictly out of the question.

That said, your improvements for the TLB tree should improve things.

Did you notice that the CPU actually does have a two-entry micro-ITLB? I don't know what the penalty is for missing it but it has to be something or it wouldn't be there.. so you'll have to emulate that. That it'll help performance substantially is just a bonus.
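A minimal sketch of fronting the full lookup with such a micro-TLB (sizes and names assumed):

Code:
#include <stdint.h>

/* Two-entry micro-ITLB consulted before any tree or table walk. */
static struct { uint64_t vpn_start, span; uint32_t pfn; } u_itlb[2];

static int u_itlb_lookup(uint64_t vaddr, uint32_t *pfn) {
  for (int i = 0; i < 2; i++) {
    if (vaddr - u_itlb[i].vpn_start < u_itlb[i].span) {
      *pfn = u_itlb[i].pfn;
      return 1;  /* hit: the expensive lookup is skipped entirely */
    }
  }
  return 0;      /* miss: do the full lookup, then refill a slot */
}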
 
OP
MarathonMan

MarathonMan

Emulator Developer
Fair, but a hash table isn't the same as a flat one-level page table like I'm suggesting. This is for the 32-bit address space mode (31 bits of virtual address) with 1M entries, so an array that's something like 4-8MB.



Since the processor has separate 32-bit and 64-bit address modes, emulating it accurately means emulating both separately. No reason you need to use the same emulation strategy for both, and in this case what games use will matter (personally I think you'd be crazy to use the 64-bit mode on N64, but who knows). I also don't agree that the 64-bit mode (40-bit virtual address space) necessarily needs more than two-level tables. This isn't like an OS where you need a separate page table for many processes. If your emulator needs a high-end CPU to run fast enough, then I think using data structures that take even dozens of MB isn't strictly out of the question.

That said, your improvements for the TLB tree should improve things.

Did you notice that the CPU actually does have a two-entry micro-ITLB? I don't know what the penalty is for missing it but it has to be something or it wouldn't be there.. so you'll have to emulate that. That it'll help performance substantially is just a bonus.

But you can't use hash tables on this arch due to the variable page sizes, right? Or, at least, you would have to hash on the largest page size and somehow track the smaller pages contained within it?

It's always in 64-bit mode; the 32 bits just get sign-extended to create the illusion of 32-bit mode within a 64-bit uarch. :)
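In C terms, the illusion is just:

Code:
#include <stdint.h>

/* 32-bit addresses are sign-extended into the 64-bit space, e.g.
 * 0x80000000 becomes 0xFFFFFFFF80000000. */
static uint64_t extend_address(uint32_t vaddr) {
  return (uint64_t)(int64_t)(int32_t)vaddr;
}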

IIRC the uTLB refill penalty is mentioned somewhere officially... something along the lines of 2-3 cycles.
 

Exophase

Emulator Developer
But you can't use hash tables on this arch due to the variable page sizes, right? Or, at least, you would have to hash on the largest page size and somehow track the smaller pages contained within it?

I'm not sure I'm making the idea very clear.. I'm not talking about a hash table. I'm talking about a single-level page table (a straight dense array) that has an entry for every page in the address space, where a page is the smallest size (4KB). Larger TLB entries just get smashed into multiple 4KB pages. This isn't really anything new; for instance, as I'm sure you already know, 32-bit ARM requires 16 4KB page table entries for a 64KB page. You would only want to have valid entries for what maps to the TLB, so for instance you can use a reverse mapping to quickly clear all of them if the TLB is flushed. There's only really a fair amount of extra overhead maintaining the page table if the game uses very large pages, but in that case it'll probably miss the TLB less, so it evens out.

This is for a single-level page table, which can work for 32-bit mode.
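Roughly like this (names and table layout are assumptions, same caveats as the earlier sketch):

Code:
#include <stdint.h>

/* Writing a TLB entry that covers 'span' bytes fills every 4KB slot
 * beneath it, so lookups stay a single shift + index. */
static void map_tlb_entry(uint8_t **table, uint64_t vpn_start,
                          uint64_t span, uint8_t *host_base) {
  for (uint64_t off = 0; off < span; off += 4096)
    table[(vpn_start + off) >> 12] = host_base + off;
}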

It's always in 64-bit mode; the 32 bits just get sign-extended to create the illusion of 32-bit mode within a 64-bit uarch. :)

This is the datasheet I'm reading:

http://datasheets.chipdb.org/NEC/Vr-Series/Vr43xx/U10504EJ7V0UMJ1.pdf

Tell me if my understanding is correct, but in section 5.2 and onwards they clearly refer to separate 32-bit and 64-bit address modes. Not only that, but there's a separate TLB exception for each of the two modes. The UX bit in the status register selects which mode is currently used.

I don't know what mode TLB-using N64 games use, but I do know that there are other emulators (like mupen64plus, at least Ari64's branch) that use a flat table, so the 32-bit mode has to be used somewhere. I really doubt the 64-bit mode is that useful, given how huge 31 bits of address space is when cartridges cap out around 64MB.

IIRC the uTLB refill penalty is mentioned somewhere officially... something along the lines of 2-3 cycles.

Yeah, it would appear that way; page 107 says 3 cycles.
 
Last edited:
