I was just reminded of the big difference between Windows' process scheduler and a sane one, which fully explains why it migrates threads so often.
For reference,
this easily findable book chapter explains how several different types of multiprocessor scheduler work. Pay particular attention to the "work stealing" balancing algorithm; it runs on an idle or lightly-loaded CPU, and looks for CPUs with greater load than itself. An alternative approach is for a heavily-loaded CPU to look for CPUs with *less* load than itself, in order to *give* them some of its excess work - this works better in cases where idle CPUs are not periodically woken (which is more power efficient).
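The two balancing directions can be sketched as a toy model, treating each CPU as nothing more than a run queue. The function names and the "move half" / "equalise" heuristics here are my own illustration, not the book's exact algorithms:

```python
from collections import deque

def work_steal(queues, idle_cpu):
    """Work stealing: an idle or lightly-loaded CPU pulls work
    from the most heavily loaded CPU it can find."""
    victim = max(range(len(queues)), key=lambda i: len(queues[i]))
    if victim == idle_cpu or len(queues[victim]) < 2:
        return  # nothing worth stealing
    for _ in range(len(queues[victim]) // 2):
        queues[idle_cpu].append(queues[victim].pop())

def work_give(queues, busy_cpu):
    """Work giving: a heavily-loaded CPU pushes excess work to the
    least-loaded CPU, so idle CPUs need not wake up to scan."""
    target = min(range(len(queues)), key=lambda i: len(queues[i]))
    if target == busy_cpu:
        return
    while len(queues[busy_cpu]) - len(queues[target]) > 1:
        queues[target].append(queues[busy_cpu].pop())
```

The key difference is *who runs the balancer*: in the first case the idle CPU must be awake to scan, in the second the busy CPU does the scanning and idle CPUs can stay asleep.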
Whichever approach Windows uses, it
constantly attempts to move threads to less-loaded CPUs - even when a thread is the *only* runnable thread on its original CPU - and it
counts each thread's own past load against its current CPU. This is inhibited only by the parking and affinity masks (which are clearly bolted-on afterthoughts), and
makes no allowance whatsoever for SMT, NUMA, cache affinity, or the cost of context switches. The book chapter I linked doesn't mention SMT or NUMA (it may be a relatively old book, in which those concepts were not yet widespread), but it *does* talk about the other two factors as being key for efficiency.
This *should* be very easy for Microsoft to fix, if they can be bothered. Simply make any thread meeting all of the following criteria ineligible for migration:
- It is the only thread currently in its CPU's run queue.
- It currently satisfies its own affinity mask, if any.
- Its CPU is not parked.
- It shares the same LLC as all other threads in the same process.
This would make the precise behaviour of the core-parking algorithm much less important for meeting short-term performance and efficiency goals. A useful additional parameter to that algorithm would then be an optimisation target, taking one of the following values:
- Execution resources - the current behaviour, preferentially unparking just one thread per physical core.
- Cache affinity - as above, but only within each LLC block. When all cores are unparked in one LLC, begin on the next.
- Power efficiency - always unpark all virtual cores in the same physical core before proceeding to another physical core. Also unpark all physical cores in one LLC before proceeding to the next.
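The three targets amount to unparking cores in different sort orders over the machine topology. A toy sketch, modelling each virtual core as an (LLC, physical core, SMT sibling) triple - the representation and policy names are my assumptions:

```python
def unpark_order(cores, target):
    """Return the virtual cores in the order they should be unparked.
    Each core is an (llc, physical, smt) tuple."""
    if target == "execution":
        # One SMT sibling per physical core first, across the whole machine.
        key = lambda c: (c[2], c[0], c[1])
    elif target == "cache":
        # As above, but finish each LLC block before starting the next.
        key = lambda c: (c[0], c[2], c[1])
    elif target == "power":
        # Fill each physical core completely, and each LLC completely,
        # before touching the next one.
        key = lambda c: (c[0], c[1], c[2])
    else:
        raise ValueError(target)
    return sorted(cores, key=key)
```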
Well, we can dream.