Don't blame the OS; its overhead is maybe 1-2% max. A Dolphin for DOS or bare hardware wouldn't be much faster.
It's just extremely time-consuming to emulate a full system with a completely alien architecture on a PC. Nothing can be run directly; everything (including every memory access and every little GPU command) must be translated, all the time. We do cache the translation of CPU instructions in blocks (this is what's called a dynarec), which improves performance somewhat, but it's not nearly enough.
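To make the block caching idea concrete, here is a toy sketch in Python. It is not Dolphin's code: the three-instruction guest program, the `addi`/`loop` opcodes, and the names `translate_block` and `run` are all made up for illustration. The only point is that a block of guest instructions gets translated once, stored by its start address, and reused every time execution comes back to it.

```python
# Toy guest program: each instruction is (opcode, operand).
# Hypothetical ISA for illustration only; unrelated to the real console CPU.
PROGRAM = [
    ("addi", 1),  # 0: acc += 1
    ("addi", 2),  # 1: acc += 2
    ("loop", 0),  # 2: counter -= 1; jump to 0 while counter > 0
]


def translate_block(program, pc):
    """Translate guest code from pc up to the next branch into one host function.

    Assumes every block ends in a "loop" instruction.
    """
    lines = ["def block(state):"]
    while True:
        opcode, operand = program[pc]
        pc += 1
        if opcode == "addi":
            lines.append(f"    state['acc'] += {operand}")
        elif opcode == "loop":
            lines.append("    state['counter'] -= 1")
            # A block ends at a branch and returns the next guest pc.
            lines.append(f"    return {operand} if state['counter'] > 0 else {pc}")
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    namespace = {}
    exec("\n".join(lines), namespace)  # "compile" the block to host code
    return namespace["block"]


def run(program, counter):
    """Run the guest program, translating each block only on first use."""
    state = {"acc": 0, "counter": counter}
    cache = {}  # guest start address -> translated host function
    translations = 0
    pc = 0
    while pc < len(program):
        block = cache.get(pc)
        if block is None:
            block = translate_block(program, pc)
            cache[pc] = block
            translations += 1
        pc = block(state)
    return state["acc"], translations


if __name__ == "__main__":
    acc, translations = run(PROGRAM, 1000)
    print(acc, translations)
```

The loop body executes 1000 times but is translated only once. A real dynarec emits host machine code rather than Python source, and it still pays for every memory access and hardware register the guest touches, which is why caching alone doesn't close the gap.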
We need 30 GHz diamond laser CPUs.
