This is something I cannot wrap my head around:
EPYC Rome was released in 2019, so Intel would have had detailed information by 2018 at the latest. That was five years ago, and their roadmap still shows little sign of a response. Even SRF and GNR seem to be designed around an outdated philosophy.
There must be some significant customer demand for such large coherency domains. It's not like Intel lacks the tech to make a more "traditional" NUMA-type arrangement; they could have just run their existing UPI protocol over EMIB for lower power and latency and called it a day, and yet they went to all this effort. So either they were just being stupid, or they had some real engineering reason for it (even if those reasons ultimately led to poor tradeoffs). In the absence of compelling evidence one way or the other, the latter seems more likely. Though I'm frustrated at the lack of proper server benchmarking for us to really work from; Cinebench and even SPEC just don't cut it. ServeTheHome has some data, but it feels like we're working with scraps.
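For what it's worth, the customer pull for a large flat coherency domain is easy to illustrate: any software that isn't explicitly NUMA-aware pays a penalty on a multi-node topology, and NUMA-aware software has to manage placement itself. Here's a minimal sketch (illustrative only, using Linux's libnuma, nothing Intel- or AMD-specific) of what that awareness looks like in practice:

```c
/* Illustrative sketch: what "NUMA-aware" costs software on a multi-node topology.
 * Build with: gcc numa_sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not supported on this system\n");
        return EXIT_FAILURE;
    }

    /* On a single flat coherency domain this reports one node and placement
     * is a non-issue; on a multi-node part the application (or runtime)
     * has to care about which node backs each allocation. */
    printf("NUMA nodes visible to this process: %d\n", numa_max_node() + 1);

    size_t len = 1UL << 20; /* 1 MiB */

    /* Pin an allocation to node 0: fast for threads on node 0,
     * slower for everyone else. */
    void *local = numa_alloc_onnode(len, 0);

    /* Interleave pages across all nodes: uniform but optimal nowhere. */
    void *spread = numa_alloc_interleaved(len);

    if (!local || !spread) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    numa_free(local, len);
    numa_free(spread, len);
    return EXIT_SUCCESS;
}
```

A flat domain lets customers skip all of that placement bookkeeping, which is one plausible "real engineering reason" for going to the trouble.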
Remember, Intel mocked AMD when AMD unveiled first-gen EPYC, because AMD had a NUMA architecture while Intel boldly claimed its approach was superior. For Intel to admit that AMD's approach is actually the correct one because it scales better would be a big blow to their credibility, in my opinion.
It's worth noting that Naples never really got much traction. It's Rome where AMD really took off in servers, both from the doubling of core count and the move away from a NUMA architecture. Still chiplet, but clearly NUMA wasn't the future for them either.
AMD's chiplet architecture also burns tremendous power, but unlike SPR we don't have a monolithic reference point, and they have enough advantages elsewhere (process, core, etc.) to more than make up for it against the competition.
Honestly, I think the focus on SPR's chiplet architecture is misplaced. You'd expect a ~60c GLC product to perform similarly to Milan, and that's more or less what we see. What's more notable are the gaps in process and CPU core, and the resulting inefficiency. Well, that, and of course the delays. If GNR/SRF can close the process and schedule gaps, that would go a long way regardless of the uncore.