This is something I cannot wrap my head around:
EPYC Rome was released in 2019, so Intel would have had detailed information by 2018 at the latest. That was five years ago, and their roadmap still shows little sign of a response. Even SRF and GNR seem to be designed around an outdated philosophy.
There must be some significant customer demand for such large coherency domains. It's not like Intel lacks the tech to make a more "traditional" NUMA-type arrangement; they could have just run their existing UPI protocol over EMIB for lower power and latency and called it a day, and yet they went to all this effort. So either they were just being stupid, or they had some real engineering reason for it (even if those reasons ultimately led to poor tradeoffs). In the absence of compelling evidence one way or the other, the latter seems more likely. Though I'm frustrated at the lack of proper server benchmarking for us to really work from; Cinebench and even SPEC just don't cut it. ServeTheHome has some data, but it feels like we're working with scraps.
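For what it's worth, the customer pull for a large flat coherency domain is easy to illustrate: any software that isn't explicitly NUMA-aware pays a penalty on a multi-node topology, and NUMA-aware software has to manage placement itself. Here's a minimal sketch (illustrative only, using Linux's libnuma, nothing Intel- or AMD-specific) of what that awareness looks like in practice:

```c
/* Illustrative sketch: what "NUMA-aware" costs software on a multi-node topology.
 * Build with: gcc numa_sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not supported on this system\n");
        return EXIT_FAILURE;
    }

    /* On a single flat coherency domain this reports one node and placement
     * is a non-issue; on a multi-node part the application (or runtime)
     * has to care about which node backs each allocation. */
    printf("NUMA nodes visible to this process: %d\n", numa_max_node() + 1);

    size_t len = 1UL << 20; /* 1 MiB */

    /* Pin an allocation to node 0: fast for threads on node 0,
     * slower for everyone else. */
    void *local = numa_alloc_onnode(len, 0);

    /* Interleave pages across all nodes: uniform but optimal nowhere. */
    void *spread = numa_alloc_interleaved(len);

    if (!local || !spread) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    numa_free(local, len);
    numa_free(spread, len);
    return EXIT_SUCCESS;
}
```

A flat domain lets customers skip all of that placement bookkeeping, which is one plausible "real engineering reason" for going to the trouble.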
Remember, Intel mocked AMD when AMD unveiled first-gen EPYC, because AMD had a NUMA architecture while Intel boldly claimed its approach was superior. For Intel to admit that AMD's approach is actually the correct one because it scales better would be a big blow to their credibility, in my opinion.
It's worth noting that Naples never really got much traction. It's Rome where AMD really took off in servers, both from the doubling of core count and the move away from a NUMA architecture. Still chiplet, but clearly NUMA wasn't the future for them either.
AMD's chiplet architecture also burns tremendous power, but unlike SPR we don't have a monolithic reference point, and they have enough advantages elsewhere (process, core, etc.) to more than make up for it against the competition.
Honestly, I think the focus on SPR's chiplet architecture is misplaced. You'd expect a ~60c GLC product to perform similarly to Milan, and that's more or less what we see. What's more notable are the gaps in process and CPU core, and the resulting inefficiency. Well, that, and of course the delays. If GNR/SRF can close the process and schedule gaps, that would go a long way regardless of the uncore.