Okay, for instance: rendering samples at the rate of the audio clock (usually in the MHz; for PC-Engine it'd be 3.58MHz), then downsampling that to the output rate (in the kHz, where something like 44.1kHz is typical, though lower rates can be offered too). This is how Hap does it, and how he recommends doing it in the NES thread. Doing it this way is both simple and very accurate, but it comes at a huge speed hit. For high-frequency noise it's definitely the most obvious way to do it.
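To make the cost concrete, here's a minimal sketch of that approach in C, assuming a hypothetical render_channels_one_tick() that mixes every channel for one clock tick. The inner loop runs roughly 81 times per output sample at these rates, which is where the speed hit comes from:

```c
#include <stdint.h>

#define CLOCK_HZ  3579545   /* PC-Engine master clock, ~3.58MHz */
#define OUTPUT_HZ 44100

/* hypothetical: mixes all channels for a single clock tick */
extern int32_t render_channels_one_tick(void);

void render_block(int16_t *out, int out_samples)
{
    /* clock ticks per output sample, in 16.16 fixed point to avoid drift */
    uint32_t step = (uint32_t)(((uint64_t)CLOCK_HZ << 16) / OUTPUT_HZ);
    uint32_t frac = 0;

    for (int i = 0; i < out_samples; i++) {
        frac += step;
        uint32_t ticks = frac >> 16;
        frac &= 0xFFFF;

        int64_t acc = 0;
        for (uint32_t t = 0; t < ticks; t++)
            acc += render_channels_one_tick();   /* the expensive part */

        out[i] = (int16_t)(acc / ticks);         /* box-filter downsample */
    }
}
```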
There are less extreme examples of sound cores that can be very sub-optimal as well. For instance, for PS1's SPU, or any wave-synth style of playback with compressed voices: not caching decoded voices, and not rendering several samples in a row when possible. These platforms have very high polyphony, and if you don't optimize you won't get good speed. That's fine on PCs now, but I guarantee that Sony's PS1 emulator for PSP and Bleemcast for DC had more optimized SPU implementations than that.
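What I mean by caching, roughly sketched: keep decoded blocks keyed by their address, so a looping voice (or several voices sharing the same sample data) only pays the decode cost once. The 16-byte-block/28-sample layout matches the SPU's ADPCM format, but the cache structure and decode_block() here are made up for illustration:

```c
#include <stdint.h>

#define BLOCK_BYTES   16
#define BLOCK_SAMPLES 28
#define CACHE_LINES   2048

typedef struct {
    uint32_t addr;                 /* SPU RAM address of the block */
    int16_t  pcm[BLOCK_SAMPLES];   /* decoded samples */
    int      valid;
} block_cache_entry;

static block_cache_entry cache[CACHE_LINES];

/* hypothetical decoder: one 16-byte ADPCM block -> 28 PCM samples */
extern void decode_block(uint32_t addr, int16_t *out);

const int16_t *get_decoded_block(uint32_t addr)
{
    block_cache_entry *e = &cache[(addr / BLOCK_BYTES) % CACHE_LINES];
    if (!e->valid || e->addr != addr) {
        decode_block(addr, e->pcm);   /* only on a cache miss */
        e->addr = addr;
        e->valid = 1;
    }
    /* note: writes to SPU RAM have to invalidate the matching lines */
    return e->pcm;   /* the render loop can now copy a run of samples */
}
```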
Also have to consider that a lot of platforms have secondary CPUs that are either dedicated to sound or used almost exclusively for it, and whose emulation can be deferred until later. This tends to get bundled with sound emulation. SNES is an obvious example.
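Deferring it usually looks something like this sketch: track how far the sound CPU has fallen behind the master timeline, and only run it to catch up when the main CPU touches the communication ports or when a frame of audio output is due. All the function names are placeholders:

```c
#include <stdint.h>

static uint64_t main_cycles;    /* master timeline */
static uint64_t sound_cycles;   /* how far the sound CPU has advanced */

extern void    run_sound_cpu(uint64_t cycles);  /* hypothetical core */
extern uint8_t read_port(int port);             /* hypothetical */

static void sound_catch_up(void)
{
    if (sound_cycles < main_cycles) {
        run_sound_cpu(main_cycles - sound_cycles);
        sound_cycles = main_cycles;
    }
}

uint8_t sound_port_read(int port)
{
    sound_catch_up();   /* must be exact at this boundary */
    return read_port(port);
}

void end_of_frame(void)
{
    sound_catch_up();   /* generate the frame's audio in one burst */
}
```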
Then you have to consider the high level of timing accuracy needed to handle real-time streamed sound on platforms that support it but don't provide any kind of buffering for it. Even on platforms with small buffers, like GBA, synchronization can still be very painful (which is why most GBA emulators actually have pretty bad sound). Synchronization in general is important if you don't want the audio to start skipping over itself. Of course, this is a good place to get the emulator properly synchronized anyway.
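For the synchronization side, one common scheme is to let the audio output pace the emulator: the emulation thread blocks when a ring buffer of samples is full, and the audio callback drains it. A very rough sketch (real code would use a condition variable instead of the busy-wait, plus proper atomics):

```c
#include <stdint.h>

#define RING_SIZE 8192   /* samples; power of two */

static int16_t  ring[RING_SIZE];
static volatile uint32_t write_pos, read_pos;

static uint32_t ring_used(void) { return write_pos - read_pos; }

void push_sample(int16_t s)
{
    while (ring_used() >= RING_SIZE)
        ;  /* buffer full: stall the emulator until audio catches up */
    ring[write_pos % RING_SIZE] = s;
    write_pos++;
}

/* called by the audio driver (e.g. an SDL-style callback) */
void audio_callback(int16_t *out, int count)
{
    for (int i = 0; i < count; i++) {
        if (ring_used()) {
            out[i] = ring[read_pos % RING_SIZE];
            read_pos++;
        } else {
            /* underrun: hold the last sample instead of glitching */
            out[i] = ring[(read_pos - 1) % RING_SIZE];
        }
    }
}
```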
For gpSP I did manage to sample at the output rate directly, and it sounds good. But it's a mess, because it has to deal with a lot of fixed-point arithmetic and there are more divisions than I'd like. For a platform with frequency modulation this just isn't going to work well at all (especially one with any real amount of it, like Genesis). PC-Engine only has one FM channel, but that's still enough to make me not want to do things the way I did them before.
So there are kind of two "directions" to do this in; both involve iterating counters. One is to iterate a position that maps from the destination stream to the source streams. To do this you have to transform the frequencies from their natural rate to the destination rate. With FM you have to do this a lot, and since you end up with fractional values you'll accumulate error. It's best to use fixed point even if you have floating point (unless you have very fast float-to-int conversion), because you'll be using the counter to index the source data. So for every 1 step you take in the destination, you take a step of F through the source data, where F is a fixed-point frequency value converted from the natural frequency. One advantage of this method is that you can perform interpolation on every source channel, since you're taking fractional steps through it.
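Here's a sketch of that first direction with linear interpolation, where the channel struct and the 16-bit fixed-point split are just illustrative choices:

```c
#include <stdint.h>

#define FRAC_BITS 16

typedef struct {
    const int16_t *data;   /* source waveform */
    uint32_t length;       /* in samples */
    uint32_t phase;        /* fixed-point position, FRAC_BITS fractional */
    uint32_t step;         /* the "F step" per destination sample */
} channel;

/* convert a channel's natural rate to a per-output-sample step;
 * with FM, this has to be redone whenever the frequency changes */
uint32_t make_step(uint32_t source_hz, uint32_t output_hz)
{
    return (uint32_t)(((uint64_t)source_hz << FRAC_BITS) / output_hz);
}

int16_t channel_sample(channel *c)
{
    uint32_t idx  = c->phase >> FRAC_BITS;
    uint32_t frac = c->phase & ((1 << FRAC_BITS) - 1);
    int32_t  a = c->data[idx % c->length];
    int32_t  b = c->data[(idx + 1) % c->length];

    c->phase += c->step;

    /* linear interpolation between adjacent source samples */
    return (int16_t)(a + (int32_t)(((int64_t)(b - a) * frac) >> FRAC_BITS));
}
```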
The other way to do it is to render at a frequency that "fits" the natural frequency of the platform, which the channel frequencies are calibrated to. If you render at the clock frequency, this means decrementing each channel's frequency counter by one, and when it hits zero, stepping that channel's source index forward by one; every iteration moves one sample forward in the destination. The thing is, even though these machines really did render audio internally at several MHz, nowhere near that much is actually needed (there will be a cutoff frequency thanks to low-pass filters, the limitations of the speakers used, and your ears).
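Sketched out, with a down-counter per channel reloaded from its period register (the structure here is again illustrative, not any particular chip):

```c
#include <stdint.h>

typedef struct {
    const int16_t *wave;    /* e.g. a small wavetable */
    uint32_t wave_len;
    uint32_t position;      /* current index into the waveform */
    uint32_t period;        /* reload value, straight from the register */
    int32_t  counter;       /* counts down one per clock tick */
} channel;

/* one output sample per machine clock tick */
int16_t clock_tick(channel *ch, int nchannels)
{
    int32_t mix = 0;
    for (int i = 0; i < nchannels; i++) {
        if (--ch[i].counter <= 0) {
            ch[i].counter += ch[i].period;   /* reload */
            ch[i].position = (ch[i].position + 1) % ch[i].wave_len;
        }
        mix += ch[i].wave[ch[i].position];
    }
    return (int16_t)(mix / nchannels);
}
```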
What I'm thinking is an approach that's kind of in between the two. Instead of decrementing by one, decrement by several; the frequency you render at then becomes some integer division of the machine's natural frequency. When a counter goes to zero or below, step the source index forward and add the full reload value back to the counter. You could render directly at the output frequency by taking fractional steps, but that will accumulate error. If you render at a different frequency, you can then resample the entire thing to the destination frequency. This allows you to use interpolation on the final result, which is only one channel (well, two, for stereo) as opposed to all of the input channels. I don't really know which will sound better..
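And a sketch of the in-between version: decrement by STEP clocks per iteration so the internal rate becomes CLOCK_HZ / STEP, keeping the per-channel counters integer so no error accumulates, then resample the single mixed stream to the output rate afterwards:

```c
#include <stdint.h>

#define CLOCK_HZ  3579545
#define STEP      64                       /* render at ~55.9kHz internally */
#define RENDER_HZ (CLOCK_HZ / STEP)

typedef struct {
    const int16_t *wave;
    uint32_t wave_len;
    uint32_t position;
    uint32_t period;      /* in machine clocks, from the register */
    int32_t  counter;
} channel;

/* one internal sample per STEP machine clocks; assumes period > 0 */
int16_t render_tick(channel *ch, int nchannels)
{
    int32_t mix = 0;
    for (int i = 0; i < nchannels; i++) {
        ch[i].counter -= STEP;            /* several clocks per iteration */
        while (ch[i].counter <= 0) {      /* may cross more than one edge */
            ch[i].counter += ch[i].period;
            ch[i].position = (ch[i].position + 1) % ch[i].wave_len;
        }
        mix += ch[i].wave[ch[i].position];
    }
    /* a buffer of these then gets resampled RENDER_HZ -> output rate,
     * with interpolation applied once to the single mixed stream */
    return (int16_t)(mix / nchannels);
}
```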
With this approach you could have configurable internal sound quality. You could set the render frequency very low to save CPU time at the expense of quality, or set it higher and see how it works out. I'm not really sure if this is a very good approach yet.. it's more of an idea.
Maybe some other people could share some of their approaches to sound emulation.
EDIT: sarencele pointed out that I was mistaken about how blargg's libraries work (so I removed that part). Sorry blargg :<