I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel's. Or are they not using the 4-core CCX blocks at all?
How does a silicon revision solve what is fundamentally an interconnectivity issue?
Intel doesn't have a monolithic block with data hyperloops between each core and the IMC either. There is a ring bus.
I'd wait for further data points before putting all the blame on the data fabric latency. I saw similar comments on Reddit, suggesting the CCX setup would be no better than Intel's old separate quad-core dies connected via FSB. It's actually not that bad. And the ring-bus-based designs also don't have direct connections between each core and the memory controller. In all those cases, the access requests and the returning data (or store addresses and data) have to pass one or more hops/ring bus stops to reach the UMC/IMC, and again on the way back to the core.
Here's what I assume is happening:
CCX mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check L3 tags (L3 IF/CCX XBar) -> [clock domain crossing] send request to DF (router or XBar?) -> (1+ hops?) -> UMC -> access DRAMs -> data received at UMC -> transmit 64B line to DF (2 cycles) -> (1+ hops) -> receive data at CCX [clock domain crossing] -> move data to requesting core.
Intel ring bus mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check local L3 tags (L3 IF) -> [clock domain crossing to core-like ring bus clock] send request via ring bus -> (1 to n hops) -> IMC -> access DRAMs -> data received at IMC -> transmit 64B line to ring bus -> (1 to n hops) -> receive data at core
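Boiled down to the uncore part, the two paths differ mainly in how many hops sit between core and memory controller and what clock those hops run at. A rough sketch of that comparison (all counts, crossings and clock descriptions below are my own assumptions for illustration, not documented values):

# Per-direction uncore trip, reduced to hops and clock domain crossings.
# All values are assumptions for illustration, not documented specs.

zen_uncore = {
    "crossings": 1,                  # CCX clock -> DF clock
    "hops": "1+",                    # DF hop(s); assumed no direct CCX<->UMC link
    "hop_clock": "DF clock (tracks memclk, well below core clock)",
}

intel_uncore = {
    "crossings": 1,                  # core clock -> ring clock, if any
    "hops": "1 to 4 (2.5 avg on an 8C die)",
    "hop_clock": "ring clock (close to core clock)",
}

The tag checks and the DRAM access itself are common to both paths, so the interesting delta is really just those hops and crossings.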
So UMC accesses via the DF should add at least one hop (there is no direct connection), i.e. roughly 0.5 to 0.9 ns per direction (address out, then data back) depending on the DF clock.
On Intel's 8C ring bus SoCs the average distance should be 2.5 hops (1 to 4 hops per 4-core half), but those run at clocks as high as the core clocks, so roughly 0.6 to 0.8 ns per direction.
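For anyone who wants to check the arithmetic, this is the back-of-envelope behind those ranges; the one-cycle-per-hop cost and the exact clock ranges are my assumptions:

# Back-of-envelope: ns per direction = hops / interconnect clock in GHz,
# assuming each hop/stop costs one cycle of that clock (an assumption).

def ns_per_direction(hops, clock_ghz):
    return hops / clock_ghz

# Zen DF: 1 extra hop, DF clock assumed to track memclk (~1.1 to 2.0 GHz)
print(ns_per_direction(1, 1.1))    # ~0.91 ns
print(ns_per_direction(1, 2.0))    # 0.50 ns

# Intel 8C ring: 2.5 stops on average, ring clock near core clock (~3.1 to 4.2 GHz)
print(ns_per_direction(2.5, 3.1))  # ~0.81 ns
print(ns_per_direction(2.5, 4.2))  # ~0.60 ns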