My post on the previous page seems to have been largely glossed over: nobody is noticing that cross-CCX traffic on the monolithic Strix Point is nearly as bad as cross-CCX/CCD traffic on Granite Ridge.
Comments on this were made right after the Strix Point launch. It's just that the special group of people whose job it is to make funny faces in YouTube thumbnails didn't jump on the cross-CCX workloads topic until the 9950X/9900X review window, when AMD made them reinstall Windows numerous times just for the core parking driver.
Just from that, you can throw out anything to do with chiplets, IF, or IOD. This is somehow related to the new core designs.
I'm still guessing that it's more likely an uncore issue.
STX is different in that the path between CCXs and MMCs doesn't need to go through the substrate,¹ and in that it is the first(?) Zen device with CCXs of different core counts.²
¹) Perhaps they could have optimized the monolithic chip more, but didn't because co-developing and validating all the different Zen 5 products is taxing as it is.
²) Let's hope that the need to manage differing CCXs in STX is not a driver for compromises in GNR. But who knows.
If it was a conscious decision to make it that way, they should come out and explain the logic behind it.
There will be the Hot Chips presentation soon. Although they will surely prefer to concentrate on what got better, not so much on what was compromised.
The latency itself isn't as big of a problem as people make it out to be. It only made the existing issue more visible.
I agree.
I for one have home computers with up to sixteen last-level cache domains (and have had them for quite some time now), and in the few cases where it matters at all, I know how to optimize at the application level based on my own targeted benchmarks. Sure, it's a bit of a nuisance, but the payoff compared to Intel chips is better performance on cache-fitting workloads, among other fundamental benefits. (Of course I don't dedicate computers with sixteen last-level caches to playing video games. And these workloads aren't about sharing just a couple of cache lines; their hot data sets are on the same order of magnitude as the cache sizes, so they're really not comparable to those microbenchmarks.)

Now, what remains to be seen is whether Zen 5's regression in cacheline-bouncing microbenchmarks is connected with any payoffs elsewhere.
To be honest, on the CPU front they were executing perfectly, especially with Zen 3 and Zen 4 (which brought HUGE performance uplifts). We got spoiled.
For parallel workloads, Zen 2 and Zen 4 brought the big uplifts.
It seems true that Zen 5 cores are super power-hungry: they scale performance well past what Zen 4 does at high power, but scale much worse than Zen 4 at very low power.
Basically true, but your wording seems exaggerated to me. Two things have been known for a while now:
– Zen 5 is quite a step up in core width compared to its predecessors,
– Turin's socket power budget is going to be increased merely ≈proportionally to the core-count increase over Genoa.
Per-core power budget in servers isn't big, and remains about the same in Turin. Yet what will matter a lot to AMD's bottom line is how much Turin improves over Genoa and over the competition in iso-core-count performance, iso-power performance, and absolute per-socket performance. And I for one am curious to see that (not that I care much about AMD's bottom line). I don't feel confident making any guesses based on the STX and GNR performance-over-power scaling figures we have seen so far, though maybe I'm simply not seeing the forest for the trees.

What also matters is how many designs AMD wins in mobile. In contrast, how much it matters which variety of funny faces the YouTubers make in their thumbnails can be left for everyone to decide for themselves.
huge resource investment in the FP/AVX512 while integer was left behind,
That's a bit exaggerated too. Apart from the fact that vectorized integer arithmetic benefits from the vector/"FP" pipeline enhancements, the integer backend got resource investments too (e.g. not just more execution units, but more capable units to boot), and a good deal of resources was poured into the frontend.