Can you give any more context for that comparison? Like, vs what and with what kind of code testing it? Did you do any profiling?
First, I should make it clear that I'm only talking about 32-bit address mode. I have no idea whether N64 games use the 64-bit address mode or not; for that, at least another level would be in order.
In terms of operations, a single-level page table vs. a TLB lookup is a no-brainer: a shift + array lookup to get the translation vs. traversing a tree. It's also branchless, so there's no concern about mispredicts.

The part that can potentially differ is cache footprint. Your current TLB has a fixed ~2KB cache impact, while a 1-level page table needs up to a cache line for each 4KB region mapped. But the thing is, if the TLB accesses have bad locality of reference (so they cache poorly vs. the memory accesses themselves), then the emulated icache and dcache will also have poor locality of reference, and that will greatly throttle the number of TLB accesses you need to emulate. Furthermore, unless the game uses fairly big page sizes, you'll pretty easily keep the page table entries corresponding to the resident 32 TLB entries in L1, or at least L2, cache (at 4KB pages it's at most 32 cache lines, in practice far fewer). If the game does end up accessing lots of different pages through the same large-page TLB entry, that'd be a good argument for multi-level page tables, but still only if it really has poor locality of reference.
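To make the "shift + array lookup" point concrete, here's a minimal sketch of a single-level page table for a 32-bit guest address space with 4KB pages. All names and the pointer-per-page layout are illustrative assumptions, not anyone's actual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-level page table: 2^20 entries for a 32-bit
// space at 4KB granularity, one host base pointer per guest page.
constexpr uint32_t PAGE_SHIFT = 12;
constexpr size_t   NUM_PAGES  = size_t{1} << (32 - PAGE_SHIFT);

struct PageTable {
    // nullptr means unmapped: take the slow path / raise a TLB fault.
    std::vector<uint8_t*> entries = std::vector<uint8_t*>(NUM_PAGES, nullptr);

    // Branchless shift + index; the caller checks for nullptr.
    uint8_t* translate(uint32_t vaddr) const {
        return entries[vaddr >> PAGE_SHIFT];
    }
};
```

The translation itself is one shift and one load; the only "branch" is the caller's unmapped check, which predicts almost perfectly for well-behaved guests.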
And if the game goes outside those 32 entries you'll be throttled by emulating the TLB fault handler.
If you keep a TLB tree that keeps growing, then with your current data type (64 bytes per node), for strictly 4KB pages you'll spend as much cache as the very worst case for a page table. It gets better for larger pages, but you'd still have to have pretty sparse TLB mappings, which is very unusual.
I work on closed- and open-source commercial, professional-grade simulators for x86/ARM/etc. When we used a std::unordered_map to hash the page table lookups, it was taking over 23% (?) of our runtime, IIRC. We got a noticeable boost when switching to optimized range trees. I claim nothing about the hash function or table size that was chosen, though; I just remember seeing the report.
I'm emulating the full 64-bit address space, regardless of what games do. Accuracy > performance. Because of this, even 2-level page tables are probably out of the question. The VR4300 has a 2.5TB (total) TLB-mapped virtual address space... IMO that's 3-level worthy!
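For scale, here's what a 3-level walk over a space that size might look like. The 10/10/9 index split, 4KB leaves, and all names are illustrative assumptions for a ~2^41-byte space, not the VR4300's actual segment layout:

```cpp
#include <cstdint>

// Hypothetical 3-level page table covering a ~2TB+ TLB-mapped space
// with 4KB leaves: 41 - 12 = 29 index bits split 10/10/9.
constexpr int L2_BITS = 10, L1_BITS = 10, L0_BITS = 9, PAGE_BITS = 12;

struct Level0 { uint8_t* pages[1 << L0_BITS] = {}; };  // leaf: host pages
struct Level1 { Level0*  next [1 << L1_BITS] = {}; };
struct Level2 { Level1*  next [1 << L2_BITS] = {}; };

// Walk top-down; any null intermediate means the region is unmapped.
inline uint8_t* walk(const Level2& root, uint64_t vaddr) {
    Level1* l1 = root.next[(vaddr >> (PAGE_BITS + L0_BITS + L1_BITS))
                           & ((1 << L2_BITS) - 1)];
    if (!l1) return nullptr;
    Level0* l0 = l1->next[(vaddr >> (PAGE_BITS + L0_BITS))
                          & ((1 << L1_BITS) - 1)];
    if (!l0) return nullptr;
    uint8_t* page = l0->pages[(vaddr >> PAGE_BITS) & ((1 << L0_BITS) - 1)];
    return page ? page + (vaddr & ((1 << PAGE_BITS) - 1)) : nullptr;
}
```

Three dependent loads per translation is exactly the cost being debated above: sparse mappings stay cheap in memory, but every access pays the pointer chase.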
I'd be willing to bet that by changing the TLBTree so that it:
* Uses a fixed, co-allocated pool of entries in memory.
* Uses integers for indexing instead of pointers (also less space).
* Etc., etc.
I bet it'd be faster than nested page tables. I also wonder whether you could gamble and maintain an ordered queue of TLB entries, pulling entries toward the front as they get hit; you might have better luck that way. I'd be willing to guess that it's unlikely for something like an N64 game to be jumping all over the memory space.
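A minimal sketch of that idea, under my assumptions above: a fixed co-allocated pool of 32 entries (the VR4300's TLB size), byte-sized indices instead of pointers, and a move-to-front probe order. All names and the entry layout are made up for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical flat software TLB: contiguous storage, integer indices,
// and MRU-first probing so hot entries are found in the first few steps.
struct SoftTlb {
    static constexpr int N = 32;            // VR4300 has 32 TLB entries
    struct Entry {
        uint64_t vpn_base  = ~0ull;         // first virtual address covered
        uint64_t page_mask = 0;             // mapping size minus 1
        uint8_t* host_base = nullptr;       // null => invalid entry
    };
    Entry   entries[N];                     // fixed, co-allocated pool
    uint8_t order[N];                       // probe order, MRU first

    SoftTlb() { for (int i = 0; i < N; ++i) order[i] = (uint8_t)i; }

    // Linear probe in MRU order; on hit, rotate the entry to the front.
    uint8_t* lookup(uint64_t vaddr) {
        for (int i = 0; i < N; ++i) {
            Entry& e = entries[order[i]];
            if (e.host_base && (vaddr & ~e.page_mask) == e.vpn_base) {
                uint8_t hit = order[i];
                std::memmove(&order[1], &order[0], (size_t)i);
                order[0] = hit;
                return e.host_base + (vaddr & e.page_mask);
            }
        }
        return nullptr;                     // miss: emulate the TLB refill
    }
};
```

The whole structure is ~800 bytes, so it lives comfortably in L1; the gamble is that the branchy linear probe usually terminates after one or two iterations.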
EDIT: A hash table using [address & ~(largest_page_size - 1)] as the input to a hash function might also work well for narrowing down candidates.
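Sketching that EDIT under my own assumptions (the 16MB "largest page", bucket count, and mixer constant are all arbitrary choices for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical candidate-narrowing hash: quantize the address by the
// largest page size in use, then mix it down to a small bucket index.
// Each bucket would hold the few TLB entry indices that could match.
constexpr uint64_t LARGEST_PAGE_SIZE = 1ull << 24;  // e.g. 16MB pages
constexpr size_t   NUM_BUCKETS = 256;

inline size_t tlb_bucket(uint64_t vaddr) {
    uint64_t key = vaddr & ~(LARGEST_PAGE_SIZE - 1);
    // Cheap multiplicative (Fibonacci) hash; any decent mixer would do.
    return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 56) & (NUM_BUCKETS - 1);
}
```

Since every address inside one largest-page-sized region quantizes to the same key, all candidates for that region land in one bucket, so a hit costs one hash plus a compare against a handful of entries.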