So the scheduler saga continues. This time I wanted to try something different and see just how bad the thread migration issue is. As it turns out, there is a way to programmatically find out which core your thread is running on (see
https://msdn.microsoft.com/en-us/library/ms683181(v=vs.85).aspx). Calling it on every iteration would be quite excessive, so instead I added a condition so that it gets called only when there is too big a difference in latency compared to the previous iteration. Maybe not perfect, but the idea is that a big difference between iterations could be explained by a context switch.
I also changed the fixed number of iterations to a timed run instead. Now each thread runs for 60 seconds and at the end reports its average latency as well as all the cores it visited during the run. To allow migration, each run was done without affinity. The first run was with 4 threads, each referencing one and the same tree instance.
Here are the results:
[Thread: 1] Avg: 495ns
[Thread: 1] Cores: 0000000000000001
[Thread: 2] Avg: 483ns
[Thread: 2] Cores: 1000010000001000
[Thread: 3] Avg: 483ns
[Thread: 3] Cores: 0010000000110000
[Thread: 4] Avg: 491ns
[Thread: 4] Cores: 0000100011000000
Cores are numbered 0-15, and a value of 1 means the core was visited by the thread, whereas 0 means it wasn't. So during the 60-second run, every thread with the exception of #1 got bounced between cores, and in fact between CCXs as well. One shortcoming is that I'm not collecting metrics on how much time was spent on each core or how frequently migration occurs. The main point, however, is to verify that cross-CCX migration does happen.
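To make those core strings concrete, here's a small sketch that decodes one into the list of visited core numbers. Reading the string left-to-right as cores 0 through 15 is an inference on my part, but it is consistent with the affinity run later in the post, where pinning to cores 0, 2, 4 and 6 sets the bits at exactly those string positions.

```python
def visited_cores(mask: str) -> list[int]:
    # Leftmost character is assumed to be core 0 (matches the post's
    # affinity run, where cores 0,2,4,6 light up positions 0,2,4,6).
    return [core for core, bit in enumerate(mask) if bit == "1"]

print(visited_cores("1000010000001000"))  # thread 2's run -> [0, 5, 12]
```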
Next up is a slightly different setup with 4 smaller 1MB instances of the tree structure containing a dictionary. This time each thread had a reference to a different instance, so we have 1:1 cardinality. In this case it shouldn't make any difference which CCX the threads get assigned to, as long as they stay there.
Here are the results:
[Thread: 1] Avg: 231ns
[Thread: 1] Cores: 0000101000100000
[Thread: 2] Avg: 232ns
[Thread: 2] Cores: 1010000000000001
[Thread: 3] Avg: 230ns
[Thread: 3] Cores: 0100000010000000
[Thread: 4] Avg: 231ns
[Thread: 4] Cores: 0000000001100000
So again we can see that all but one thread were at some point scheduled on a different CCX. The overall lower latencies can be explained by the smaller data set and hence a smaller tree to traverse. There's again no clue as to how frequently migration occurs here; if it only happens once during the run, it would have very little effect on the averages over 60 seconds. Indeed, if we look at the same scenario with 0,2,4,6 affinity, we can see that there is practically no difference:
[Thread: 1] Avg: 228ns
[Thread: 1] Cores: 1000000000000000
[Thread: 2] Avg: 231ns
[Thread: 2] Cores: 0010000000000000
[Thread: 3] Avg: 225ns
[Thread: 3] Cores: 0000100000000000
[Thread: 4] Avg: 231ns
[Thread: 4] Cores: 0000001000000000
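Pinning threads as in the 0,2,4,6 run above is done on Windows with SetThreadAffinityMask; a minimal sketch using the Linux analogue, os.sched_setaffinity (pid 0 meaning the calling thread), looks like this. Rather than hard-coding core 0, it pins to the lowest core currently allowed, since the available core set varies by machine.

```python
import os

# Linux analogue of SetThreadAffinityMask: restrict the calling thread
# to a single core so the scheduler cannot migrate it.
allowed = os.sched_getaffinity(0)   # cores we may currently run on
target = min(allowed)               # pick one (placeholder choice)
os.sched_setaffinity(0, {target})

# The OS now reports only the one allowed core.
print(os.sched_getaffinity(0))
```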
Finally, for the last scenario I had only one thread, assigned the larger 3MB data set. During these runs I had no other active software running aside from Task Manager. All other cores were sitting idle with CPU usage at 1%. My benchmarking software runs a continuous loop without any blocking operations, locking, thread sleeping, I/O, or any other operation which might encourage context switching. There should be little to no reason to interrupt my loop during the run; if any other operations need to be scheduled, they could be scheduled on any of the idle cores.
So let's see what happened over 3 runs, each with only one thread:
[Thread: 1] Avg: 311ns
[Thread: 1] Cores: 0000000100000010
[Thread: 1] Avg: 309ns
[Thread: 1] Cores: 0000000000010000
[Thread: 1] Avg: 312ns
[Thread: 1] Cores: 0001000000000010
In 2 of 3 runs our thread not only got scheduled on different cores, but it also got bounced between the two CCXs. Based on what I saw in Task Manager, I believe this migration happened only once during each run, so it should hardly even show up in the averages.
In conclusion, if there was still any doubt as to whether the Windows 10 scheduler is needlessly bouncing threads between CCXs, I can now say with a high degree of certainty:
it actually does!
Q: "But wait a minute... aren't your results suggesting that there is very little latency penalty in single-threaded applications even if thread migration is real?"
A: Well, again, my scenario gives little to no reason for context switching to occur in the first place. Something like a single-threaded game is much more complex, so these results do not translate well to such scenarios. This is just one specific case, but it does suffice to substantiate that thread migration is real.
Also, a bonus factoid: over a 5-minute run with 4 threads, each thread visited every single physical and logical core at some point during the run.