Well, Intel is still the biggest player, their architecture is the one software gets optimized for, and that will remain so for some time.
OK, I see your argument: by aligning itself with Intel's architecture, AMD will get better performance on software optimized for Intel. So by going to a larger CCX with a mesh interconnect, similar to Intel's, AMD would benefit. Possibly.
But even if AMD moves to a 6-core or 8-core CCX with a mesh topology, you'd still have cross-CCX latency to deal with. It only changes the partition size a little. To really take your argument to its logical conclusion, AMD should eliminate CCXs altogether and create a monolithic die like Intel. Personally, I think the design trend will be the other way around, i.e. Intel moving to modular designs using their EMIB multi-die interconnect technology.
With regard to the 4-core vs. 6-core question, you have to consider the system as a whole and the workloads you care about. Say, a 48-core CPU with 4-core CCXs vs. a 48-core CPU with 6-core CCXs. The first system will have lower core-to-core latency within a CCX, although some workloads may suffer a little more cross-CCX latency because there are more CCXs. To evaluate which topology is best overall, you would have to test or simulate the systems on a variety of workloads you care about and make a judgment call. My bet is on a 4-core CCX.
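To make the trade-off concrete, here is a quick back-of-the-envelope sketch (my own simplification: I only count what fraction of core pairs share a CCX, ignoring everything else that differs between the designs):

```python
from math import comb

def intra_ccx_fraction(total_cores: int, ccx_size: int) -> float:
    """Fraction of all core pairs that can communicate without
    crossing a CCX boundary."""
    n_ccx = total_cores // ccx_size
    intra_pairs = n_ccx * comb(ccx_size, 2)    # pairs inside some CCX
    return intra_pairs / comb(total_cores, 2)  # out of all core pairs

for size in (4, 6, 8):
    print(f"{size}-core CCX: {intra_ccx_fraction(48, size):.1%} of pairs stay on-CCX")
# 4-core CCX: 6.4%, 6-core CCX: 10.6%, 8-core CCX: 14.9%
```

So larger CCXs keep more core pairs local, but at the cost of slower, indirect connections inside the CCX; the numbers alone do not settle it.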
Of course direct connections are best, and the big question is, of course: will a workload always fit in that cache? I doubt it.
The workload does not need to fit in the cache to avoid cross-CCX latency. As long as it does not access remote memory (connected to another CCX) and does not have to snoop a remote cache (e.g., due to shared memory or locks), it will be fine. In other words, as long as it only accesses memory connected to the local memory controller and does not share memory with cores outside the CCX, it will not suffer cross-CCX latency, as I understand it.
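For example, on Linux you can keep a process inside one CCX by pinning it to that CCX's cores. A minimal sketch; the assumption that cores 0-3 form a CCX depends on the CPU, and binding memory to the local node would additionally need numactl or libnuma:

```python
import os

# Assumption: cores 0-3 belong to the same CCX on this machine.
CCX0_CORES = {0, 1, 2, 3}

# Pin the calling process (pid 0 = self) so its threads never
# migrate outside the CCX and keep hitting the local L3.
os.sched_setaffinity(0, CCX0_CORES)
print("running on cores:", os.sched_getaffinity(0))
```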
And besides, the whole issue is that threads are migrated from core to core when several kinds of software are running.
That's a good point. For optimal performance, the operating system needs to be NUMA-aware and schedule threads according to the system topology; I am not sure how well (or how poorly) operating systems do this today.
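At minimum, the scheduler (and you) can see the CCX boundaries, since cores in the same CCX share an L3. A Linux sketch, assuming the L3 is exposed at cache/index3 (typical, but not guaranteed on every kernel):

```python
import glob

# Group cores by the L3 they share; on Zen each group is one CCX.
ccx_map = {}
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
    with open(path) as f:
        shared = f.read().strip()          # e.g. "0-3"
    ccx_map.setdefault(shared, set()).add(path.split("/")[5])

for shared, cores in sorted(ccx_map.items()):
    print(f"L3 shared by CPUs {shared}: {sorted(cores)}")
```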
A server CPU can be running multiple virtual machines that run programs with lots of threads, and that is where the inter-CCX communication hurts.
On the other hand, if a virtual machine is allocated a single CCX (e.g., by pinning its vCPUs to that CCX's cores with libvirt's vcpupin), you should see no penalty. In fact, it should be as good as it gets, since all cores in the virtual machine are direct-connected. For VMs spanning more than one CCX it will depend on the workload, I guess; NUMA-aware and latency-insensitive workloads should run well.
It does seem that the Scalable Data Fabric is also used to connect the different L3 caches. But that still does not explain the average latency between L3 caches. I cannot understand that yet.
The way I understand it, the interleaving scheme I described in my last post ensures a consistent average latency. For a memory address that hits the local cache slice, the latency is lower; for addresses that hit the slices belonging to the other cores, the latency is higher. The interleaving ensures that the latency averages out, and it may also increase bandwidth, since every core uses all 4 slices of the L3. (This is similar to the UMA mode on Threadripper, where memory is interleaved across all memory controllers, giving a higher average latency but also higher bandwidth.)
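To illustrate, here is a toy model of that interleaving (the round-robin slice selection and the latency numbers are my own assumptions for the example, not AMD's actual hash or timings):

```python
LOCAL_NS, REMOTE_NS = 8, 12      # hypothetical L3 slice hit latencies

def slice_for(addr: int, n_slices: int = 4) -> int:
    # Assume 64-byte cache lines are distributed round-robin over slices.
    return (addr // 64) % n_slices

def avg_latency(core: int, addrs) -> float:
    return sum(LOCAL_NS if slice_for(a) == core else REMOTE_NS
               for a in addrs) / len(addrs)

addrs = range(0, 64 * 1024, 64)  # 1024 consecutive cache lines
for core in range(4):
    print(f"core {core}: {avg_latency(core, addrs):.1f} ns average")
# Every core sees the same 0.25*8 + 0.75*12 = 11.0 ns average,
# while using all four slices (hence the bandwidth gain).
```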