Discussion Intel current and future Lakes & Rapids thread


Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
Yes, it is of course more than monolithic, but there is more to it than meets the eye at first sight:
Intel had this idea of a unified L3 cache across several tiles. That is why there are huge amounts of traffic across EMIB for next to no benefit. AMD OTOH has much less bandwidth via IFoP, at much higher power per bit transferred, to begin with. But because their L3 is private per CCD, there is only a tiny bit of coherency traffic going on. And this is why they are much more efficient.
Of course there might be edge cases, but it is more than questionable why Intel took that design direction in the grand scheme.
I wish we had some more data on SNC4 vs SNC-OFF modes. It would give a lot more insight into which workloads, if any, benefit.
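Absent better published data, one crude way to probe that yourself on Linux is to run the same workload confined to a single SNC domain and then spread across the whole socket, and compare. A minimal sketch, assuming numactl is installed and SNC is enabled so each sub-NUMA cluster shows up as its own NUMA node; the benchmark command is just a placeholder:

```python
# Rough A/B sketch (not a proper benchmark): run the same workload confined to
# one SNC domain vs. spread across the socket. Assumes Linux with numactl
# installed; with SNC enabled, each sub-NUMA cluster appears as its own NUMA node.
import subprocess

BENCH = ["./my_benchmark"]  # placeholder for whatever workload you care about

def run(numactl_args, label):
    print(f"--- {label} ---")
    subprocess.run(["numactl", *numactl_args, *BENCH], check=True)

# CPUs and memory pinned to SNC node 0 only
run(["--cpunodebind=0", "--membind=0"], "single SNC domain")

# memory interleaved across all nodes, CPUs unconstrained
run(["--interleave=all"], "whole socket, interleaved")
```

If the confined run is meaningfully better, the workload actually cares about SNC locality; if the two are a wash, that's a point for just leaving SNC off.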
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
Pretty sure that's in Chips and Cheese's article about SPR.
They touched on it a bit. I remember they saw a significant decrease in average L3 latency with SNC on. But I don't think they really did a deep dive.
A modicum of benefit at best.
SPEC, Linpack, etc. probably aren't the benchmarks that would show the most difference. I feel like we need some big database workloads in there to get a real understanding of the tradeoffs. After all, with those numbers, Intel should have just done die-to-die UPI instead of stretching the mesh across.
 

soresu

Platinum Member
Dec 19, 2014
2,972
2,202
136
A monolithic 24C SPR vs 24C Zen 3 is like 15% more efficient in CB23, but a 56C SPR uses >30% more energy for the same score as a 64C Zen 3 TR? Yes the TR has more cores, but it still shouldn't be this bad.
I think AMD planned from the outset of the Zen/EPYC programme to maximise core count and steadily increase it generation on generation - which means appropriately scaling efficiency with core count.

Intel may well have been caught off guard by how aggressively AMD increased core counts gen to gen and didn't have the necessary SoC and platform preparation to scale without significant overhead.

That's the only way I can explain it.
 
Reactions: Tlh97 and moinmoin

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
This is something I cannot wrap my head around:
EPYC Rome was released in 2019, so Intel will have had detailed information somewhere around 2018 at the latest. That was five years ago, and their roadmap still does not show good signs of a reaction. Even SRF and GNR seem to be designed according to an outdated philosophy.
 

eek2121

Diamond Member
Aug 2, 2005
3,053
4,281
136
Delays in 10nm (Ice Lake) messed everything up, further delays in Sapphire Rapids compounded the issues, and Sierra Forest and Granite Rapids-SP were designed many years ago.
They will catch up. Once Intel has node parity, AMD will have a tough fight.

If Sapphire Rapids were ported to Intel 4/3, power consumption would be almost halved.
 

Saylick

Diamond Member
Sep 10, 2012
3,394
7,159
136
Yes, it is of course more than monolithic, but there is more to it than meets the eye at first sight:
Intel had this idea of a unified L3 cache across several tiles. That is why there are huge amounts of traffic across EMIB for next to no benefit. AMD OTOH has much less bandwidth via IFoP, at much higher power per bit transferred, to begin with. But because their L3 is private per CCD, there is only a tiny bit of coherency traffic going on. And this is why they are much more efficient.
Of course there might be edge cases, but it is more than questionable why Intel took that design direction in the grand scheme.
It's probably just a relic of the tried-and-true Intel design approach where all cores needed to be connected to each other with as uniform memory access as possible. Remember, Intel mocked AMD when AMD first unveiled first gen EPYC because AMD had a NUMA architecture while Intel boldly claimed that their approach was superior. For Intel to admit that AMD's approach is actually the correct one because it can scale better would be a big blow to their credibility in my opinion.

AMD will eventually have more cores per CCX moving forward, but they do it judiciously because I suppose they are letting the math dictate when it should be done, rather than Intel's approach of "We have to connect all cores via a mesh because it's how we've always done it". Ironically, Intel doing things a certain way because that's how they've always done it is precisely what got them into this pickle. If they want to compete, they need to be more agile and be willing to scrap old ideas.
 

Geddagod

Golden Member
Dec 28, 2021
1,214
1,177
106
Is it just me, or does GLC use way more power iso-frequency compared to Zen 3, especially at higher clocks?
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
Is it just me, or does GLC use way more power iso-frequency compared to Zen 3, especially at higher clocks?
No, it is not just you. I would go as far as to say that this is common knowledge. Now the exact "why" is still up for discussion. It surely is a mix of process as well as design choices.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
This is something I cannot wrap my head around:
EPYC Rome was released in 2019, so Intel will have had detailed information somewhere around 2018 at the latest. That was five years ago, and their roadmap still does not show good signs of a reaction. Even SRF and GNR seem to be designed according to an outdated philosophy.
There must be some significant customer demand for such large coherency domains. It's not like Intel lacks the tech to make a more "traditional" NUMA-type arrangement. They could have just used their existing UPI protocol over EMIB for lower power/latency and called it a day, and yet they went to all this effort. So the two options are they were just being stupid, or they had some real engineering reason for it (even if those reasons ultimately led to poor tradeoffs). In the absence of compelling evidence one way or the other, the latter seems more likely. Though I'm frustrated at the lack of proper server benchmarking for us to really work off of. Cinebench and even SPEC just don't cut it. ServeTheHome has some data, but it feels like we're working with scraps.
Remember, Intel mocked AMD when AMD first unveiled first gen EPYC because AMD had a NUMA architecture while Intel boldly claimed that their approach was superior. For Intel to admit that AMD's approach is actually the correct one because it can scale better would be a big blow to their credibility in my opinion
It's worth noting that Naples never really got much traction. It's Rome where AMD really took off in server, both from the doubling of core count, and the move away from a NUMA architecture. Still chiplet, but clearly NUMA wasn't the future for them either.

AMD's chiplet architecture also burns tremendous power, but unlike SPR we don't have a monolithic reference point and they have enough advantages elsewhere (process, core, etc) to more than make up for it compared to the competition.

Honestly, I think the focus on SPR's chiplet architecture is misplaced. You'd expect a ~60c GLC product to perform similarly to Milan, and that's more or less what we see. What's more notable is the gaps with the process and CPU core, and the resulting inefficiency. Well, that, and of course the delays. If GNR/SRF can close the process and schedule gaps, then that would go a long way regardless of the uncore.
 

Geddagod

Golden Member
Dec 28, 2021
1,214
1,177
106
No, it is not just you. I would go as far as to say that this is common knowledge. Now the exact "why" is still up for discussion. It surely is a mix of process as well as design choices.
I mean, I always thought it was only marginally more, but from what I'm seeing it's drastically worse. Not just marginally, but way, way worse.

In CB, for example, 8 P-cores use >80% more energy than 8 Zen 3 cores while clocking <200 MHz higher (~4.7 vs ~4.5 GHz), and they only score ~20% better.
At 6 cores, the 12400F scores 10% better than the 5600X while boosting 300 MHz lower (4.0 vs 4.3 GHz), but still consumes 28% more power.

And yet for perf/watt, at iso or near-iso perf, they look competitive.
The 12400F scores 10% better points/watt, which is even more impressive considering it scores higher.

I would love to find more data on frequencies/power draw for these CPUs, but it looks like that data is nonexistent :/

It looks like, especially at higher clocks, Intel is heavily depending on their PPC advantage to carry their efficiency, but if that PPC advantage is smaller than expected, efficiency craters as a result.
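Just to put a number on how far that can fall, a back-of-the-envelope calculation using the rough figures above (treated as ratios, not exact measurements):

```python
# Back-of-the-envelope perf/energy ratio from the rough figures quoted above:
# 8 GLC P-cores score ~20% higher than 8 Zen 3 cores while using >80% more energy.
score_ratio = 1.20    # GLC score / Zen 3 score (approximate)
energy_ratio = 1.80   # GLC energy / Zen 3 energy (approximate, ">80% more")

print(f"GLC work per joule relative to Zen 3: ~{score_ratio / energy_ratio:.2f}x")  # ~0.67x
```

So at those clocks it's roughly a third less work per joule, which is why the picture changes so much once the PPC advantage shrinks in a given workload.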
 

Saylick

Diamond Member
Sep 10, 2012
3,394
7,159
136
It's worth noting that Naples never really got much traction. It's Rome where AMD really took off in server, both from the doubling of core count, and the move away from a NUMA architecture. Still chiplet, but clearly NUMA wasn't the future for them either.

AMD's chiplet architecture also burns tremendous power, but unlike SPR we don't have a monolithic reference point and they have enough advantages elsewhere (process, core, etc) to more than make up for it compared to the competition.

Honestly, I think the focus on SPR's chiplet architecture is misplaced. You'd expect a ~60c GLC product to perform similarly to Milan, and that's more or less what we see. What's more notable is the gaps with the process and CPU core, and the resulting inefficiency. Well, that, and of course the delays. If GNR/SRF can close the process and schedule gaps, then that would go a long way regardless of the uncore.
I don't think AMD ever moved away from a NUMA architecture, even with Rome and beyond, because there is a latency penalty for accessing data between CCDs. I'd have to see what the core-to-core latency table is for SPR, but I wouldn't be surprised if it's more uniform.

Edit: Ah, I see what you were getting at. Yes, the latency from accessing RAM is now uniform across the entire package for Rome and beyond, i.e. it's all just one NUMA node within package. There is still a penalty going between CCDs though, which is less prevalent on SPR since those off-die connections are all silicon bridges.

Yes, AMD's approach burns more power because IFOP uses more J/bit, but that's largely an issue with the selection of interconnect. If AMD hypothetically could use a silicon bridge for all of its IOD-to-CCD connections, I think the power consumption goes down significantly.
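On the core-to-core table: you can get a rough one yourself with a ping-pong test. A minimal sketch, Linux-only (it relies on sched_setaffinity); the absolute numbers are inflated by Python overhead, so only the relative jump between same-die and cross-die pairs is meaningful, and the CPU numbers are placeholders you'd pick from the topology lscpu -e reports:

```python
# Rough core-to-core "ping-pong" sketch: two processes pinned to specific CPUs
# bounce a shared flag back and forth. Linux-only (sched_setaffinity). Python
# overhead dominates the absolute numbers; only relative differences between
# same-die and cross-die CPU pairs are meaningful.
import multiprocessing as mp
import os
import time

ITERS = 100_000

def bouncer(cpu, flag, my_turn):
    os.sched_setaffinity(0, {cpu})          # pin this process to one CPU
    for _ in range(ITERS):
        while flag.value != my_turn:        # spin until the other side hands over
            pass
        flag.value = 1 - my_turn            # hand back

def pingpong(cpu_a, cpu_b):
    flag = mp.Value("i", 0, lock=False)     # raw shared int, no lock needed here
    other = mp.Process(target=bouncer, args=(cpu_b, flag, 1))
    other.start()
    t0 = time.perf_counter()
    bouncer(cpu_a, flag, 0)
    other.join()
    return (time.perf_counter() - t0) / (2 * ITERS) * 1e9   # ns per hand-off

if __name__ == "__main__":
    # Placeholder CPU pairs: pick one same-die pair and one cross-die pair
    # for your machine's topology.
    for a, b in [(0, 1), (0, 16)]:
        print(f"CPU {a} <-> CPU {b}: ~{pingpong(a, b):.0f} ns per hand-off")
```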
 
Last edited:

Abwx

Lifer
Apr 2, 2011
11,172
3,869
136
I mean I always thought it was marginally more, but from what I'm seeing it's like drastically worse. Not just marginally, but like way, way worse.

In CB, for example 8 P cores use >80% more energy than 8 Zen 3 cores while clocking <200MHz faster (~4.7 vs ~4.5). And it only scores ~20% better.
At 6 cores, the 12400f scores 10% better than the 5600x while boosting 300MHz lower, (4 vs 4.3), but still consumes 28% more power.

And yet for perf/watt, with ISO or near ISO perf, they look competitive.
The 12400f scores 10% better points/watt which is even more impressive considering it scores higher.

I would love to find more data at frequencies/power draw for these CPUs but it looks like that data is nonexistent :/

It looks like, especially at higher clocks, Intel is heavily depending on their PPC advantage to carry them through efficiency, but if that PPC advantage is less than expected, the efficiency craters as a result.

It's because the Intel 7 process used for ADL was not as good as the one used for RPL.

Perf/watt was significantly improved with the ESF process variant; if SPR is fabbed on the former process, then it's no wonder that it has terrible perf/watt, although ESF won't close the gap with TSMC's 5nm, as can be seen in DT part comparisons.

That being said, server chips' uncore is much more power-hungry than the one in DT parts, and comparatively less of the power budget goes to the cores.
 

Geddagod

Golden Member
Dec 28, 2021
1,214
1,177
106
It's because the Intel 7 process used for ADL was not as good as the one used for RPL.

Perf/watt was significantly improved with the ESF process variant; if SPR is fabbed on the former process, then it's no wonder that it has terrible perf/watt, although ESF won't close the gap with TSMC's 5nm, as can be seen in DT part comparisons.

That being said, server chips' uncore is much more power-hungry than the one in DT parts, and comparatively less of the power budget goes to the cores.
Node for node though, Intel 7 and TSMC 7nm should be around the same.
And yet GLC chugs power. I mean, I get it's a larger architecture, but c'mon...
It's still pretty good in overall efficiency because of its PPC advantage, but when that advantage goes away depending on the application (or when SPR has to pull data out of the L3 lol), efficiency should tank.
 
Reactions: Tlh97

Geddagod

Golden Member
Dec 28, 2021
1,214
1,177
106
Palm Cove (if it worked) would have been the last 'good' Intel core until NGC, change my mind lol.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
Edit: Ah, I see what you were getting at. Yes, the latency from accessing RAM is now uniform across the entire package for Rome and beyond, i.e. it's all just one NUMA node within package. There is still a penalty going between CCDs though, which is less prevalent on SPR since those off-die connections are all silicon bridges.
Yeah, though technically the memory latency isn't actually uniform for each CCD, and AMD's introduced a "NUMA nodes per socket" feature that's conceptually similar to Intel's SNC. But it's definitely less "NUMA-like" than Naples.
Yes, AMD's approach burns more power because IFOP uses more J/bit, but that's largely an issue with the selection of interconnect. If AMD hypothetically could use a silicon bridge for all of its IOD-to-CCD connections, I think the power consumption goes down significantly.
IFOP is definitely significant, but the IO die consumes quite a bit of power in its own right. I think AMD's current uncore takes roughly half of the total TDP. Definitely interesting to speculate what different packaging tech would bring them.
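On the NPS/SNC point: both show up to software the same way, as extra NUMA nodes, so it's easy to check how a given box is configured. A minimal sketch reading the standard Linux sysfs topology (nothing assumed beyond a stock kernel):

```python
# List NUMA nodes, their CPUs, and the kernel's node-distance table. With SNC4
# (Intel) or NPS4 (AMD) enabled, one socket shows up as four nodes here.
from pathlib import Path

base = Path("/sys/devices/system/node")
nodes = sorted(base.glob("node[0-9]*"), key=lambda p: int(p.name[4:]))
for node in nodes:
    cpulist = (node / "cpulist").read_text().strip()
    distances = (node / "distance").read_text().split()
    print(f"{node.name}: cpus={cpulist} distances={distances}")
```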
 

Saylick

Diamond Member
Sep 10, 2012
3,394
7,159
136
IFOP is definitely significant, but the IO die consumes quite a bit of power in its own right. I think AMD's current uncore takes roughly half of the total TDP. Definitely interesting to speculate what different packaging tech would bring them.
Hm... Half of the TDP seems a little high. Was that belief based on this AT article? https://www.anandtech.com/show/16529/amd-epyc-milan-review/5


If so, FWIW, in AT's formal Milan review, the high idle was resolved. https://www.anandtech.com/show/16778/amd-epyc-milan-review-part-2/3


Either way, ~70W for an IOD with 8 memory controllers is roughly a quarter of the total 280W TDP. Not great, not terrible. I suppose if you scaled that IOD up to Genoa's 12 memory controllers but factored in the improved N6 process, the power consumption might still be roughly the same at ~70W. Considering that Genoa's TDP can be as high as 360W (stock), 70W is a smaller portion of the total TDP. Not bad, I suppose.
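As a quick sanity check on those fractions (the ~70W Genoa IOD figure is just the assumption above, not a measurement):

```python
# IO die power as a share of package TDP, using the figures discussed above.
milan_iod_w, milan_tdp_w = 70, 280
genoa_iod_w, genoa_tdp_w = 70, 360   # assumes N6 keeps the larger IOD at ~70 W

print(f"Milan: {milan_iod_w / milan_tdp_w:.0%} of TDP")   # 25%
print(f"Genoa: {genoa_iod_w / genoa_tdp_w:.0%} of TDP")   # ~19%
```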
 