On the Ethernet point: Ethernet as used with "dumb hubs" is a half-duplex protocol. Data fabrics are designed to support multiple full-duplex data paths, which is why they are used in Ethernet switches - switches are far more efficient than hubs and support Ethernet in full-duplex mode. In fact, it's impossible to use Gigabit Ethernet with dumb hubs, because in practice it doesn't *have* a half-duplex mode.
So forget about 40% of anything.
Based on stuff I've found out in the past few days, here's how Windows appears to assign threads to cores. I'm basing this on experiments with Windows 7, but I believe Win10 is essentially similar.
1: After every timeslice, the core assignment for each thread is recalculated from scratch - it is not stable. This results in frequent migration of threads between cores, unless only one core is available to that thread (of which more below) or all other available cores are already busy.
2: The timeslice is based on the thread priority, which itself is influenced by the process priority. Higher priorities get longer timeslices - "realtime" gets a (pseudo?) infinite timeslice. You can therefore reduce the thread migration frequency just by increasing the process priority.
3: Core availability is determined solely by Core Parking. The scheduler itself **is not SMT aware**, but the Core Parking algorithm is. Core Parking is not part of the scheduler - it is part of the power management subsystem, and its tunables can be made available in the Power Options control panel with some registry tweaks. I emphasise: if all cores are unparked, Windows will happily migrate threads randomly onto both physical and virtual cores, just as if they were all physical.
4: If the thread or process has CPU affinity set, that further limits the set of cores available to that thread, on top of Core Parking. However, if all of a thread's affinity cores are parked, Core Parking is overridden and the thread is assigned to one of its affinity cores anyway. Core Parking can detect this and will unpark cores that see significant affinity activity. NB: a parked core is not necessarily shut down (in a C-state) - but a parked core is *usually* idle and *therefore* usually shut down. Cores sitting in C-states are needed before the CPU will automatically boost beyond its all-core turbo speed.
5: On Windows 7, by default every physical core has at least one virtual core unparked at all times. Judging by screenshots, this might be different on Win10. There is a toggle for this behaviour.
6: The Core Parking algorithm is very dumb - it doesn't seem to know how many threads are "runnable" at any given instant - and is therefore tuned to be very liberal about how many cores it unparks. I can fully believe that it will unpark about twice as many cores as there are full-time threads. This means that a 2T workload will appear to be spread evenly across 4 cores on a many-core system - but this **must not** be taken as evidence of NUMA awareness.
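To make the interaction of these rules concrete, here's a toy Python model of the assignment behaviour described in points 1, 3 and 4. This is my own sketch, not Windows code - all the function names are mine, and it only captures the policy as I've described it:

```python
import random

def eligible_cores(affinity, unparked):
    """Cores a thread may run on: its affinity set intersected with the
    unparked set - but if every affinity core is parked, the affinity set
    wins and Core Parking is overridden (point 4)."""
    usable = affinity & unparked
    return usable if usable else affinity

def assign(current_core, affinity, unparked, busy):
    """Pick a core for the next timeslice. Per point 1, the choice is
    recalculated from scratch every time - there is no preference for the
    core the thread already occupies, so migration is likely whenever more
    than one eligible core is free."""
    free = [c for c in eligible_cores(affinity, unparked) if c not in busy]
    if not free:
        return current_core  # all eligible cores busy: stay put
    return random.choice(free)
```

For example, a thread with affinity {2, 3} whose cores are all parked still gets {2, 3} back from `eligible_cores`, exactly the override in point 4.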
In short, it's a mess. As I noted over on Reddit, Linux does all of this about 500% better, because Linux people care about technical results, not risk-averse arse-covering.
Well, now the chickens have come home to roost in Redmond, as far as technical debt in the basic scheduler is concerned. Here are a couple of things that could be fixed by one competent engineer in a week:
1: Make the core assignment algorithm stable, i.e. prefer to assign the core that the thread is already running on, or the core it last ran on (if that core isn't busy). This will have immediate performance benefits on *all* SMP, SMT, CMT, and NUMA systems, not just Ryzen: less context-switch overhead, fewer cache misses, better branch prediction, and more truly-idle cores that can be C-stated. It's pure win.
2: Rework the Core Parking algorithm to use the runnable thread count to guide the number of unparked cores. On NUMA systems, unpark cores preferentially on the same node for runnable threads belonging to the same process, and preferentially on different nodes for runnable threads in different processes. Better yet, scrap Core Parking entirely and work on the best practical implementation of Point 1.
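A minimal sketch of what Point 1 (plus the runnable-count heuristic from Point 2) might look like, again in toy Python. The names and structure are mine, and the NUMA node preference is omitted for brevity:

```python
def assign_sticky(current, last, eligible, busy):
    """Stable variant of the assignment algorithm: prefer the core the
    thread is on, then the core it last ran on, and only then migrate."""
    free = [c for c in eligible if c not in busy]
    if current in free or not free:
        return current   # stay put: no migration, warm caches
    if last in free:
        return last      # likely still-warm caches and branch predictors
    return min(free)     # migrate only as a last resort

def cores_to_unpark(runnable_threads, total_cores):
    """Let the runnable thread count guide unparking: one unparked core
    per runnable thread, capped at the number of physical cores - rather
    than unparking roughly twice as many as needed."""
    return min(runnable_threads, total_cores)
```

The sticky assignment alone gets you most of the benefit listed under Point 1: an idle thread keeps returning to the same core, and every other core stays genuinely idle and free to be C-stated.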
That, of course, is how Linux does it.