The Buddhists are probably pleased more than anything else.
I'd say they're more likely to be content with it than pleased. I mean... they're Buddhists.
Hmm... that could work. L3 is per CCX, right? What if they're using something like a modernized version of HT Assist to mitigate inter-CCX communication problems? We might not detect it unless using something with a sufficiently large data set.
L3 is per CCX, yes.
I know some of how the inter-CCX communications might work, based on the die shots, the CCX design, and some comments by AMD personnel.
I see two possibilities. The first is that there is NO inter-CCX communication and it is all done through the memory controller. This would be the slowest method and a very bad idea: it would result in horrible multi-threaded scaling issues, which we simply aren't seeing at all with the leaks.
The second is an external clone of the internal L3 tags. Each CCX would have an external L3 tag set to which it can write, but not read. When a CCX misses in its own cache, it sends a request out on its high-speed, low-latency external command bus to a specific memory controller. The memory controller would scan the L3 tag cache that clones the other CCX's tags, which lets it know whether the data is on-die, just in the other L3.
If the tag is in the other L3, it will make a request for that data from the other CCX and that data will be tagged for the other CCX & core and sent on the wide, high speed, inter-CCX data bus (which resides in an upper metal layer).
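To make the second possibility concrete, here is a rough toy model of that miss path in Python. It is only a sketch of my guess, not anything AMD has confirmed: the names (`CCX`, `MemoryController`, `shadow_tags`, `load`) are made up for illustration, each L3 is just a dict, and evictions and coherency are ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class CCX:
    """Hypothetical core complex with its own private L3 (address -> data)."""
    name: str
    l3: dict = field(default_factory=dict)


@dataclass
class MemoryController:
    """Hypothetical controller holding a write-only clone of each CCX's L3 tags."""
    dram: dict
    shadow_tags: dict = field(default_factory=dict)  # CCX name -> set of addresses

    def record_fill(self, ccx, addr):
        # The CCX writes its tag into the external clone; it never reads it back.
        self.shadow_tags.setdefault(ccx.name, set()).add(addr)

    def service_miss(self, other, addr):
        # Scan the clone of the OTHER CCX's tags to see if the line is on-die.
        if addr in self.shadow_tags.get(other.name, set()):
            # Forward from the other L3 (stands in for the inter-CCX data bus).
            return other.l3[addr], "other-L3"
        return self.dram[addr], "DRAM"


def load(ccx, other, mc, addr):
    """Return (data, source) for a load issued by a core in `ccx`."""
    if addr in ccx.l3:
        return ccx.l3[addr], "local-L3"
    data, source = mc.service_miss(other, addr)
    ccx.l3[addr] = data        # fill the requester's own L3
    mc.record_fill(ccx, addr)  # and update its external tag clone
    return data, source
```

With two CCXes sharing one controller, a line first loaded by one CCX comes from DRAM, and a later load of the same address from the other CCX is served out of the first CCX's L3 instead of going to memory.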
As the servicing L3 would not need to do an L3 lookup, you save a cycle or two, which pays for some of the round-trip costs. The extra cost would only be a couple of nanoseconds of latency, compared to 20+ ns for going to main memory, with double the bandwidth (100 GB/s vs 50 GB/s).
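As a back-of-the-envelope illustration of why that matters, the average cost of a local L3 miss can be sketched like this. The 2 ns and 20 ns defaults are just my rough figures from above ("a couple" and "20+"), and the fraction of misses found in the other L3 is a purely hypothetical input, not a measured number.

```python
def avg_miss_penalty_ns(frac_in_other_l3, cross_ccx_ns=2.0, dram_ns=20.0):
    """Average extra latency of a local L3 miss, in nanoseconds.

    frac_in_other_l3: assumed share of misses that hit in the other CCX's L3.
    cross_ccx_ns / dram_ns: rough guesses, not measured values.
    """
    if not 0.0 <= frac_in_other_l3 <= 1.0:
        raise ValueError("frac_in_other_l3 must be between 0 and 1")
    return frac_in_other_l3 * cross_ccx_ns + (1.0 - frac_in_other_l3) * dram_ns
```

If none of the misses are found in the other L3, this is just the full trip to memory; the more shared data sits in the other CCX, the closer the average gets to the couple-of-nanoseconds case.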
In addition to all of that, there needs to be some method of synchronizing shared data reliably. I'm very surprised by the layout of the CCXes - I would have thought the L3 caches butting against each other would have made for easier and faster communications, but it seems each CCX has its own preferred memory channel which may have an attached global L3 tag clone set.
It would be efficient enough, but would cost quite a bit of area outside of the CCX...
It will be REALLY interesting to see what route AMD took (I expect we will get some solid details on this with the launch-day reviews).