
Announcement: Cycle-accurate N64 development underway.

Hacktarux

Emulator Developer
Moderator
No game or homebrew software uses 64-bit addresses as far as I know. The available boot codes all leave the CPU in 32-bit kernel mode. Kernel mode does allow 64-bit instructions to be used. That's why nobody ever bothered implementing 64-bit addressing. I guess it should be implemented for completeness, but it is hard to implement something when you know nothing will use it and it can't really be tested. Anyway, it also means that all the tricks that optimize 32-bit address access can be used.
 

MarathonMan

Emulator Developer
I'm not sure I'm making the idea very clear. I'm not talking about a hash table. I'm talking about a single-level page table (a straight dense array) that has an entry for every page in the address space, where a page is the smallest size (4KB). Larger TLB entries just get split into multiple 4KB pages. This isn't really anything new; for instance, as I'm sure you already know, 32-bit ARM requires 16 4KB page table entries for 64KB pages. You would only want to have valid entries for what maps to the TLB, so for instance you can use a reverse mapping to quickly clear all of them if the TLB is flushed. There's only really a fair amount more overhead maintaining the page table if the game uses very large pages, but in that case they'll probably miss the TLB less, so it evens out.

This is for a single level page table, which can work for 32-bit mode.

OH. Wow. Yeah, I see what you're saying now. Yeah, that seems like a definite winner. I've never even thought of implementing something like PTs in software before. Even the cache locality shouldn't be too much of a problem because only 32 entries across the entire allocation should be used at one time anyways. I can't think of any reason why this wouldn't work, and the performance will surely be better than what I have now. Thank you kindly.
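
For anyone following along, here is a minimal C sketch of the single-level table described above, assuming 32-bit mode, 4KB base pages, and a 32-entry TLB. The names (`flat_pt`, `flat_pt_map`, and so on) and the entry layout are made up for illustration and are not CEN64's actual code. It also ignores the VR4300's even/odd page pairs and ASIDs, and assumes no two TLB entries overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT 12u                         /* 4KB: the smallest page size */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  (1u << (32u - PAGE_SHIFT))  /* 2^20 slots cover 32-bit mode */
#define NUM_TLB    32u                         /* TLB entries being shadowed */

/* Hypothetical slot layout: bit 0 = valid, bits 12..31 = physical page base. */
typedef uint32_t pt_entry;

struct tlb_owner {       /* reverse mapping: the slots one TLB entry filled */
    uint32_t first_vpn;
    uint32_t num_pages;  /* 0 means this TLB entry maps nothing */
};

struct flat_pt {
    pt_entry *table;                  /* NUM_PAGES slots, zero = invalid */
    struct tlb_owner owner[NUM_TLB];
};

int flat_pt_init(struct flat_pt *pt) {
    memset(pt->owner, 0, sizeof(pt->owner));
    pt->table = calloc(NUM_PAGES, sizeof(pt_entry));
    return pt->table ? 0 : -1;
}

/* Clear every slot that TLB entry `idx` filled, using the reverse mapping. */
void flat_pt_unmap(struct flat_pt *pt, unsigned idx) {
    assert(idx < NUM_TLB);
    struct tlb_owner *o = &pt->owner[idx];
    for (uint32_t i = 0; i < o->num_pages; i++)
        pt->table[o->first_vpn + i] = 0;
    o->num_pages = 0;
}

/* Shadow a TLB write. `size` is a multiple of 4KB; a large page simply
 * fills size/4KB consecutive slots. */
void flat_pt_map(struct flat_pt *pt, unsigned idx,
                 uint32_t vaddr, uint32_t paddr, uint32_t size) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t pfn = paddr >> PAGE_SHIFT;
    uint32_t n = size >> PAGE_SHIFT;
    assert(n <= NUM_PAGES - vpn && n <= NUM_PAGES - pfn);
    flat_pt_unmap(pt, idx);           /* drop whatever this entry mapped before */
    for (uint32_t i = 0; i < n; i++)
        pt->table[vpn + i] = ((pfn + i) << PAGE_SHIFT) | 1u;
    pt->owner[idx].first_vpn = vpn;
    pt->owner[idx].num_pages = n;
}

/* One array load per translation. Returns -1 on a (simulated) TLB miss. */
int flat_pt_translate(const struct flat_pt *pt, uint32_t vaddr, uint32_t *paddr) {
    pt_entry e = pt->table[vaddr >> PAGE_SHIFT];
    if (!(e & 1u))
        return -1;
    *paddr = (e & ~PAGE_MASK) | (vaddr & PAGE_MASK);
    return 0;
}
```

With this layout the table costs 4MB (2^20 slots of 4 bytes each), a hit is one load plus a mask, and rewriting a TLB entry only touches the slots that entry owned.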

No game or homebrew software uses 64-bit addresses as far as I know. The available boot codes all leave the CPU in 32-bit kernel mode. Kernel mode does allow 64-bit instructions to be used. That's why nobody ever bothered implementing 64-bit addressing. I guess it should be implemented for completeness, but it is hard to implement something when you know nothing will use it and it can't really be tested. Anyway, it also means that all the tricks that optimize 32-bit address access can be used.

Well I guess that nails it in the coffin. I'll throw an assert on ux == 1 just to make sure.
 

MarathonMan

Emulator Developer
So, without multithreading, this will never run realtime. I'm going to try and do something that I haven't seen done in the emulation community: "runahead" execution.

The problem with cycle-accuracy, and one of the reasons why many believe it isn't going to be possible on N64-generation consoles, is lock contention. Let me begin by stating that a cycle-accurate emulation for this console will NEVER, ever run fast enough on a single core... unless those guys at Intel start shipping processors that use graphene as a substrate. (CEN64, in its current state, can demonstrate this already!)

So why not multithread? Well, processors are relatively slow at shipping data to other cores. On the other hand, cycle-accurate simulation requires all cores to effectively "sync" each cycle to check for interrupts or other events... conflicting ideas. My "runahead" execution revolves around the two following assumptions:

(a) Simulated processors take a relatively large number of cycles to perform a bus access. If the VR4300 is going to read from memory and it misses the cache, it's going to spin for a few cycles and not do any work anyway.

(b) Interrupts are not TOO common. We have to simulate 93.75 million cycles/second. Most of the time, whatever we're simulating isn't going to be bothered by anything else, nor are we going to bother anyone else. And, if we do, we're going to be classified in (a) and in a "dead" period until our access is fulfilled.

So, basically, what we can do is run each core in its own thread, and allow each core to become unsync-ed with other cores. To not invalidate the guarantees of cycle-accuracy, when doing so, we will keep a "history log" of what occurred during each one of the cycles that we run ahead. On the off chance we do get interrupted by somebody else, we can use the history log to "micro reboot" ourselves to the exact point in time we got the interrupt.

Certainly a risky idea, but it's the only way I can imagine that contention can possibly be reduced enough. The idea comes from reorder buffers in conventional processors that handle rollback on branch misprediction; same game here, just in software! Crossing my fingers...

EDIT: Quick and dirty inspection says that I can expect only a couple hundred interrupts to occur each second. The only other worry will be external accesses on some devices, but that should be manageable (I can use the history buffer to determine what the data was at that cycle and not even have to revert in some cases).
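
To make the "history log" idea concrete, here is a rough C sketch under stated assumptions: a toy `cpu_state`, a fixed-size ring buffer with one snapshot per simulated cycle, and a `history_rollback` that restores the snapshot for the cycle an interrupt landed on. Every name here is hypothetical and the state is far smaller than a real VR4300's; a real log would more likely store deltas than full snapshots.

```c
#include <assert.h>
#include <stdint.h>

#define HISTORY_LEN 1024u   /* assumed maximum run-ahead distance, in cycles */

struct cpu_state {          /* hypothetical, heavily reduced core state */
    uint64_t cycle;
    uint32_t pc;
    uint32_t regs[32];
};

struct history {
    struct cpu_state log[HISTORY_LEN];  /* ring buffer indexed by cycle */
    uint64_t next;    /* cycle number the next snapshot will carry */
    uint64_t count;   /* consecutive cycles currently held, ending at next-1 */
};

void history_init(struct history *h) {
    h->next = 0;
    h->count = 0;
}

/* Save the state as it was at the start of cycle s->cycle.
 * Snapshots are assumed to arrive in consecutive cycle order. */
void history_record(struct history *h, const struct cpu_state *s) {
    h->log[s->cycle % HISTORY_LEN] = *s;
    h->next = s->cycle + 1;
    if (h->count < HISTORY_LEN)
        h->count++;
}

/* "Micro reboot": restore the state from the start of `cycle` and throw away
 * everything speculated after it. Returns -1 if that cycle is not in the log. */
int history_rollback(struct history *h, uint64_t cycle, struct cpu_state *out) {
    if (cycle >= h->next || h->next - cycle > h->count)
        return -1;
    *out = h->log[cycle % HISTORY_LEN];
    h->count -= h->next - cycle;
    h->next = cycle;
    return 0;
}

/* Stand-in for one simulated cycle of real work. */
static void cpu_step(struct cpu_state *s) {
    s->regs[1] += 1;
    s->pc += 4;
    s->cycle++;
}

/* Run `n` cycles without syncing with anyone, logging as we go. */
void run_ahead(struct history *h, struct cpu_state *s, unsigned n) {
    while (n--) {
        history_record(h, s);
        cpu_step(s);
    }
}
```

If another device raises an interrupt stamped with cycle 40 while this core has already run to cycle 100, `history_rollback(&h, 40, &state)` puts the core back at the start of cycle 40 and execution resumes from there. The log length bounds how far a core may run ahead before it has to sync.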
 

Nintendo Maniac

New member
Even without running at full speed, a cycle-accurate emulator can still be extremely valuable by essentially being a reference implementation of the emulated console itself.

Nevertheless, if you could pull off multithreading in cycle-accurate single-core system emulation, that in itself could set a HUGE precedent for future emulation projects.
 

MarathonMan

Emulator Developer
Even without running at full speed, a cycle-accurate emulator can still be extremely valuable by essentially being a reference implementation of the emulated console itself.

I'm too invested in getting it to run at full speed to lower myself to that just yet!
 

Nintendo Maniac

New member
Well I hope the multi-threaded thing works, because AFAIK you'd be the first person to ever pull off multi-threading the emulation/simulation of a single-core CPU.
 

MarathonMan

Emulator Developer
Well I hope the multi-threaded thing works, because AFAIK you'd be the first person to ever pull off multi-threading the emulation/simulation of a single-core CPU.

Already done that :p... that's easier. The challenge here is multi-threading the whole system, where each device is simulated in a dedicated thread. This would enable the RDP, RSP, VR4300, and RDRAM plugins (the big boys) to execute simultaneously... which would surely give me the performance I need. Just need to hope the overhead is < 25% or so.
 

MarathonMan

Emulator Developer
o_O

I think that's a first as well. Dolphin I know has the PowerPC emulation on a single thread but can have other things on other threads like the GPU and the audio DSP.

Not emulation related... that was for academic research. It's easier when you're dealing with more complex cores: you have to lock less frequently, because the speed at which you simulate is ultimately much slower. It wouldn't work for emulated CPUs, which only have a handful of cycles and need to sustain high frequencies.
 

DETOMINE

New member
I'm going to try[...]"runahead" execution.[...]
So, basically, what we can do is run each core in its own thread, and allow each core to become unsync-ed with other cores. To not invalidate the guarantees of cycle-accuracy, when doing so, we will keep a "history log" of what occurred during each one of the cycles that we run ahead. On the off chance we do get interrupted by somebody else, we can use the history log to "micro reboot" ourselves to the exact point in time we got the interrupt.[...]
The only other worry will be external accesses on some devices, but that should be manageable (I can use the history buffer to determine what the data was at that cycle and not even have to revert in some cases).
Sounds clever, but looks quite difficult to implement.
This could make a great subject for a thesis I think.
 

PsyMan

Just Another Wacko ;)
I don't believe that it's possible to accurately specify how the physical cores should behave in current x86/x64 systems. Neither Linux nor Windows (nor the development interfaces coming with them) allow for this kind of specific control.

It's already a nightmare to keep threads in sync on various inaccurate emulators that make use of many cores and threads. If you look at PCSX2 and Dolphin (and even nullDC to a degree), there are cases where things just break because the load is split across many threads. If it's so hard to keep things in sync for emulators that don't depend on synchronization so much, imagine how things will be when attempting to be cycle-accurate.

From past experience I concluded that we need lower level access to the processor cores in order to do "runahead" execution successfully. Hopefully someone might be able to prove me wrong. I'll actually be glad if it happens. :p
 

MarathonMan

Emulator Developer
I don't believe that it's possible to accurately specify how the physical cores should behave in current x86/x64 systems. Neither Linux nor Windows (nor the development interfaces coming with them) allow for this kind of specific control.

It's already a nightmare to keep threads in sync on various inaccurate emulators that make use of many cores and threads. If you look at PCSX2 and Dolphin (and even nullDC to a degree), there are cases where things just break because the load is split across many threads. If it's so hard to keep things in sync for emulators that don't depend on synchronization so much, imagine how things will be when attempting to be cycle-accurate.

From past experience I concluded that we need lower level access to the processor cores in order to do "runahead" execution successfully. Hopefully someone might be able to prove me wrong. I'll actually be glad if it happens. :p

Linux provides sched_setaffinity and POSIX provides sched_setscheduler... you can do some pretty crafty things with them. Windows probably doesn't because it's... well, Windows. I'm hoping that I won't need to rely on such interfaces, or that I can at least modularize the POSIX component so it still builds on Windows if I do resort to them.
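
As a rough illustration of what those calls look like, here is a hedged C sketch. `sched_setaffinity` is a Linux/glibc extension (hence `_GNU_SOURCE`), while `sched_setscheduler` is the POSIX call. The helper names `pin_to_first_allowed_cpu` and `try_realtime` are invented for this example, and everything is wrapped in `#ifdef __linux__` so the file still builds elsewhere, in the spirit of keeping the platform-specific part modular.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for cpu_set_t and sched_setaffinity */
#endif
#include <assert.h>
#ifdef __linux__
#include <sched.h>
#endif

/* Pin the calling thread to the first CPU it is currently allowed to use.
 * Returns that CPU's number, or -1 on failure or on non-Linux builds. */
int pin_to_first_allowed_cpu(void) {
#ifdef __linux__
    cpu_set_t allowed, one;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
    }
#endif
    return -1;
}

/* Ask for the lowest SCHED_FIFO priority. This normally needs elevated
 * privileges, so a -1 return is the expected result for ordinary users. */
int try_realtime(void) {
#ifdef __linux__
    struct sched_param p = { .sched_priority = sched_get_priority_min(SCHED_FIFO) };
    return sched_setscheduler(0, SCHED_FIFO, &p);
#else
    return -1;
#endif
}
```

A per-device thread would call something like `pin_to_first_allowed_cpu()` (or a variant that takes the CPU number) right after it starts, so the scheduler stops migrating it between cores.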

The project is more or less dead in the road without more processing power, so I'm going to try it regardless. Thanks for the words of warning, though.
 

zlb

New member
Been watching this thread with interest.

The 'runahead' thing sounds interesting, I have written something very similar myself for database synchronization. If I understand correctly though it does mean that you'll have to 'journal' all system state changes so that after a rollback, the other thread sees the system as it would have been.
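
A small C sketch of what such a journal could look like for memory writes, assuming byte-granular stores that are logged in cycle order. The struct and function names are hypothetical. `journal_peek` covers the case of answering "what was this byte at cycle N?" without reverting anything, and `journal_rollback` undoes every store made at or after a given cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_CAP 4096u    /* assumed bound on stores made while running ahead */

struct journal_rec {
    uint64_t cycle;          /* cycle on which the store happened */
    uint32_t addr;
    uint8_t  old;            /* the byte the store overwrote */
};

struct journal {
    struct journal_rec rec[JOURNAL_CAP];
    size_t len;              /* records are kept in cycle order */
};

/* Do the store and remember what it overwrote. Returns -1 when the journal
 * is full, meaning the caller has to sync before running any further ahead. */
int journaled_store(struct journal *j, uint8_t *mem,
                    uint64_t cycle, uint32_t addr, uint8_t val) {
    if (j->len == JOURNAL_CAP)
        return -1;
    j->rec[j->len++] = (struct journal_rec){ cycle, addr, mem[addr] };
    mem[addr] = val;
    return 0;
}

/* Undo every store made at or after `cycle`, newest first. */
void journal_rollback(struct journal *j, uint8_t *mem, uint64_t cycle) {
    while (j->len > 0 && j->rec[j->len - 1].cycle >= cycle) {
        const struct journal_rec *r = &j->rec[--j->len];
        mem[r->addr] = r->old;
    }
}

/* The byte another thread would have seen at the start of `cycle`,
 * computed without reverting any state. */
uint8_t journal_peek(const struct journal *j, const uint8_t *mem,
                     uint64_t cycle, uint32_t addr) {
    for (size_t i = 0; i < j->len; i++)
        if (j->rec[i].cycle >= cycle && j->rec[i].addr == addr)
            return j->rec[i].old;  /* oldest later store overwrote our value */
    return mem[addr];              /* untouched since `cycle` */
}
```

The linear scan in `journal_peek` is only acceptable because external accesses are expected to be rare; a real implementation would also need to trim records once all threads have passed a given cycle.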

Re: Threading, under Win32 you can use Fibers if you need very fine control over your threads, want to manually schedule which fibers run, and need low context-switch overhead, though obviously this won't be portable if there's no equivalent functionality on other platforms.
 

Nintendo Maniac

New member
Totally random idea - could HSA and hUMA theoretically help at all in terms of cycle-accurate simulation/emulation performance? Normally such a thing would only really be useful in parallel loads, which simulation/emulation typically isn't, but since you're investigating larger amounts of multitasking and various advanced techniques, it got me thinking about future cutting-edge hardware functionality... especially since you claim that Haswell by itself surely won't be enough with the current model CEN64 uses.
 

MarathonMan

Emulator Developer
Totally random idea - could HSA and hUMA theoretically help at all in terms of cycle-accurate simulation/emulation performance? Normally such a thing would only really be useful in parallel loads, which simulation/emulation typically isn't, but since you're investigating larger amounts of multitasking and various advanced techniques, it got me thinking about future cutting-edge hardware functionality... especially since you claim that Haswell by itself surely won't be enough with the current model CEN64 uses.

Anytime you touch the GPU, you're going to get terrible latency. And it's virtually impossible to write anything that's not embarrassingly parallel. I'll look into it eventually probably, but I'm more focused now on sticking with SSE-based stuff since that's more widespread at the moment.

Hacked together support for Retrolink N64 USB controllers on Linux and fooled around in Namco Arcade tonight. Works great. Thanks to whoever suggested the joystick pages and pages ago.
 

Nintendo Maniac

New member
Anytime you touch the GPU, you're going to get terrible latency

Erm... isn't that due to the distance from the CPU to the GPU and the copying of memory back and forth due to separate memory pools? That's why I said HSA and, by association, hUMA, which will have the CPU and GPU on the same physical die and have them share the same memory space.

Are you not familiar in general with what HSA is? Because that's the impression I'm getting...
 

Guru64

New member
Are you not familiar in general with what HSA is? Because that's the impression I'm getting...
Who is? I sure haven't used hardware capable of doing that.

Either way, I'm not sure what part of the emulation you were thinking of doing on the GPU, but emulation workloads are generally a very bad fit for GPUs. Maybe parts of the RDP would be possible, but I'm not sure.
 

MarathonMan

Emulator Developer
Erm... isn't that due to the distance from the CPU to the GPU and the copying of memory back and forth due to separate memory pools? That's why I said HSA and, by association, hUMA, which will have the CPU and GPU on the same physical die and have them share the same memory space.

Are you not familiar in general with what HSA is? Because that's the impression I'm getting...

Yes. I know what HSA is. It still means your workload has to be EP. And again, most people don't have access to HSA hardware yet (myself included), so I see no need to even consider it right now.

EDIT 1: The latency is still going to be not awesome, too. It might be a lot better, but it's still not the latency that you'll get by using SSE or something on-core.

EDIT 2: HSA won't even have context switching support until 2014, making it more or less useless at its advertised purpose for another year or so.
 

didado

New member
Hacked together support for Retrolink N64 USB controllers on Linux and fooled around in Namco Arcade tonight. Works great. Thanks to whoever suggested the joystick pages and pages ago.

That was me, I guess ;) .. I'm honored you thanked me for it lol

I follow this thread every day.

Keep up the research into better performance.. :bouncy: and it's nice to see you working together with other well-known N64 emu coders.
 
