Discussion Intel current and future Lakes & Rapids thread

Page 796

Exist50

Platinum Member
Aug 18, 2016
Hm... Half of the TDP seems a little high. Was that belief based on this AT article? https://www.anandtech.com/show/16529/amd-epyc-milan-review/5
Kinda been ballparking the numbers from multiple sources, but a key datapoint was this: https://www.anandtech.com/show/1647...ness-the-amd-threadripper-pro-3995wx-review/3



Rome-based, but you get the gist. The ratio might not actually be half (though I'd expect a server to be worse than workstation because of lower CPU clocks), but it's certainly extremely significant. I wouldn't be surprised if AMD and Intel's uncore power ratios end up similar, albeit for very different reasons.
 

Saylick

Diamond Member
Sep 10, 2012
Kinda been ballparking the numbers from multiple sources, but a key datapoint was this: https://www.anandtech.com/show/1647...ness-the-amd-threadripper-pro-3995wx-review/3



Rome-based, but you get the gist. The ratio might not actually be half (though I'd expect a server to be worse than workstation because of lower CPU clocks), but it's certainly extremely significant. I wouldn't be surprised if AMD and Intel's uncore power ratios end up similar, albeit for very different reasons.
Ah, I see... As Ian notes, the uncore might include the L3 cache as well. If so, then a portion of that uncore power is technically not from the IOD itself. If anything, that chart suggests the upper bound of the IOD's power consumption. FWIW, Intel considers the L3 cache as part of the uncore, too, I believe.
Going through our POV-Ray scaling power test for per-core consumption, we’re seeing a trend whereby 40% of the power goes to the non-core operation of the system, which is also likely to include the L3 cache.
I suppose there is an argument to be made that if AMD went with an approach more in-line with Intel's where all cores can access the L3 cache, AMD would not need as much of it, thereby cutting down on some uncore power.
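That 40% split can be put into a quick back-of-the-envelope calculation. The helper below is just illustrative: the 40% fraction comes from the AnandTech quote above, while the package power and core count are hypothetical example inputs, not figures from the article.

```python
def split_package_power(package_w: float, uncore_fraction: float = 0.40,
                        cores: int = 64) -> tuple[float, float]:
    """Split package power into an uncore share and a per-core share.

    uncore_fraction=0.40 follows the AT observation quoted above;
    package power and core count here are hypothetical.
    """
    uncore_w = package_w * uncore_fraction
    per_core_w = (package_w - uncore_w) / cores
    return uncore_w, per_core_w

# e.g. a hypothetical 280 W, 64-core part
uncore, per_core = split_package_power(280.0, 0.40, 64)
print(f"uncore ~{uncore:.0f} W, per core ~{per_core:.2f} W")
```

Note that, per the discussion above, the "uncore" share here may lump in L3 power, so it is an upper bound on IOD power specifically.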
 

BorisTheBlade82

Senior member
May 1, 2020
Yeah, though technically the memory latency isn't actually uniform for each CCD, and AMD's introduced a "NUMA nodes per socket" feature that's conceptually similar to Intel's SNC. But it's definitely less "NUMA-like" than Naples.

IFOP is definitely significant, but the IO die consumes quite a bit of power in its own right. I think AMD's current uncore takes roughly half of the total TDP. Definitely interesting to speculate what different packaging tech would bring them.
Generally I am with you, just two points:

The L3 caches of all the Rome/Milan/Genoa CCDs are coherent as well - that is, by the way, how they do core-to-core communication (see MOESI). The big difference is that one CCD is not able to directly access the L3 of another one - which in turn saves a lot of data transfer.

The IOD being a power hog has directly to do with IFoP: the former literally has to drive the latter, providing all the pJ per bit needed for transfers. So the IOD is not a big consumer in and of itself.
 

Exist50

Platinum Member
Aug 18, 2016
The big difference is that one CCD is not able to directly access the L3 of another one - which in turn saves a lot of data transfer.
Yeah, I guess coherency domain is not the right term here. Idk, maybe "snoop domain"? "Allocation domain"? Eh, whatever the name, you get what I mean.
The IOD being a power hog has directly to do with IFoP: the former literally has to drive the latter, providing all the pJ per bit needed for transfers. So the IOD is not a big consumer in and of itself.
True, but the fabric needed to support ~500GB/s routed between many possible endpoints will not be free. Unfortunately, I don't think anyone's broken it down at that level.
 

BorisTheBlade82

Senior member
May 1, 2020
Yeah, I guess coherency domain is not the right term here. Idk, maybe "snoop domain"? "Allocation domain"? Eh, whatever the name, you get what I mean.

True, but the fabric needed to support ~500GB/s routed between many possible endpoints will not be free. Unfortunately, I don't think anyone's broken it down at that level.
Well, we can calculate this:

The IFoP provides 64/32 GByte/s (read/write) per standard (narrow) link. At 2 pJ/bit that nets you around 1.6 W per link. I suspect there is some fixed per-link cost involved as well.

If we assume that AMD might want to widen the link to be future-proof for RAM generations to come, let's calculate with 256/128 GByte/s. With InFO-R you get around 0.3 pJ/bit, so that would net you around 0.9 W - but only if one assumes full throttle right from the start.
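The arithmetic above can be sketched directly. Note the link widths and pJ/bit figures are the post's own assumptions (and the widened InFO-R link is purely hypothetical), not confirmed specs:

```python
def link_power_w(read_gb_s: float, write_gb_s: float, pj_per_bit: float) -> float:
    """Power needed to drive a chiplet link: bytes/s -> bits/s x energy per bit."""
    bits_per_s = (read_gb_s + write_gb_s) * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

# Current IFoP narrow link: 64/32 GB/s at ~2 pJ/bit
print(round(link_power_w(64, 32, 2.0), 2))    # -> 1.54, i.e. "around 1.6 W"

# Hypothetical widened InFO-R link: 256/128 GB/s at ~0.3 pJ/bit
print(round(link_power_w(256, 128, 0.3), 2))  # -> 0.92, i.e. "around 0.9 W"
```

Any fixed per-link overhead (PHY idle power, clocking) would come on top of these bandwidth-proportional figures.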
 

ryanjagtap

Member
Sep 25, 2021
Right now, with reference to AMD desktop processors, each CCD has a link to the IO die and all inter-CCD communication goes through the IO die. What if AMD made a way for the two CCDs to communicate without going through the IO die, with a single link connecting the CCDs? That link would need higher bandwidth, but it would still be less than two IFOP links, right? [Don't know if this is viable or not, just speculating. Don't even know if it is appropriate to post it in an Intel thread]
 

BorisTheBlade82

Senior member
May 1, 2020
Right now, with reference to AMD desktop processors, each CCD has a link to the IO die and all inter-CCD communication goes through the IO die. What if AMD made a way for the two CCDs to communicate without going through the IO die, with a single link connecting the CCDs? That link would need higher bandwidth, but it would still be less than two IFOP links, right? [Don't know if this is viable or not, just speculating. Don't even know if it is appropriate to post it in an Intel thread]
Intel's SPR is living proof that there are no big gains to be expected.
Also: the CCDs do not communicate with each other at all. There is only cache-coherency traffic, which needs only a fraction of the bandwidth you'd need for reading from remote caches.
 

eek2121

Diamond Member
Aug 2, 2005
Regarding power consumption, I think you would want to compare Raptor Cove vs. Zen 3 instead of Golden Cove.

Reason: Zen 3 is a second generation product on TSMC N7, Raptor Lake is a second generation product on Intel 7.

I suspect Sapphire Rapids may be on an older version of Intel 7.

Neither holds a candle to the efficiency Zen 4/TSMC N5 is capable of.

Just my two cents.
 

Geddagod

Golden Member
Dec 28, 2021
I lol'd a bit: Intel 7 is the third-gen Intel 10nm node, and that's if we ignore the actual first gen that Intel buried. We had "vanilla" 10nm with Ice Lake, 10nm SuperFin with Tiger Lake, and then 10nm Enhanced SuperFin with Alder Lake.

Intel's first-generation products on even their 'refreshed' Intel 10nm nodes are always pretty rough optimization-wise, tbh.
'Fixed' 10nm was ICL, and TGL got what, >20% better clocks? Some of that was process, sure, but I also think TGL got heavy DTCO optimizations to enable that.
'Fixed' - or not so much fixed as heavily improved - 10nmSF was Intel 7, with GLC. RPC was another >10% frequency upgrade, which is insanely impressive considering the node should not have contributed nearly as much as it did from ICL to TGL.
Intel's refresh team goes hard on DTCO.
 

Abwx

Lifer
Apr 2, 2011
Node for node though, Intel 7 and TSMC 7nm should be around the same.
And yet GLC chugs power. I mean I get it's a larger architecture, but cmon...
It's still pretty good in overall efficiency because of PPC advantage, but when that goes depending on the application (or when SPR has to pull data out of the L3 lol) efficiency should tank.

Methinks Intel 7's first iteration is not as good as TSMC's first 7nm iteration used for both Zen 2 and Zen 3, but the Intel 7 ESF variant is quite a bit better than TSMC's vanilla 7nm.

Anyway, there's a review of the 56C SPR at Hardwareluxx; among other numbers, CB23 at stock (350W) yields 72k points, roughly on par with a 280W Zen 3 64C TR:


Edit: the stock score @ PL1 350W / PL2 420W is 54k; the 72k score is at unlimited power, 500W in this case. The test at 250W yields 37k+, and from the test at 150W it's clear that the uncore uses roughly 100W:
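Using the scores and power limits quoted from the Hardwareluxx review above, the efficiency scaling works out roughly as follows (points-per-watt is just a convenient comparison metric here, not the review's own):

```python
# (package power W, CB23 score) pairs from the review figures cited above
spr_runs = {
    "250 W": (250, 37_000),
    "350 W stock": (350, 54_000),
    "500 W unlimited": (500, 72_000),
}

for label, (watts, score) in spr_runs.items():
    # efficiency rises slightly toward stock, then drops when power-unlimited
    print(f"{label:>16}: {score / watts:6.1f} pts/W")
```

So SPR peaks around 154 pts/W at stock and falls back to 144 pts/W unlimited, while a 72k-class score from a 280W Zen 3 part would work out to roughly 257 pts/W on the same metric.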

 

Abwx

Lifer
Apr 2, 2011
Oof, that doesn't look great, but that is a 56 core beast, probably power and thermal constrained. What about one of the models with 20 or fewer cores?

It would be worse, since idle power, at 93W, would be about the same.
As it is, it would have been competitive with 64C Zen 3, although at much lower perf/watt, since a TR 5995X @ 280W has 63% better perf/watt; a 320W 64C Zen 3 would be a little less efficient than the 5995X while performing 7% better, but still way above SPR efficiency-wise.

This SKU was surely meant to compete with Zen 3 based SKUs; compared to Zen 4 it doesn't hold a candle and will be at least 2x less efficient. No wonder Intel is releasing hyped marketing materials, because any IT dept with some technical knowledge will realize that getting SPR instead of Genoa is the best way to render your company uncompetitive.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
It would be worse, since idle power, at 93W, would be about the same.
As it is, it would have been competitive with 64C Zen 3, although at much lower perf/watt, since a TR 5995X @ 280W has 63% better perf/watt; a 320W 64C Zen 3 would be a little less efficient than the 5995X while performing 7% better, but still way above SPR efficiency-wise.

This SKU was surely meant to compete with Zen 3 based SKUs; compared to Zen 4 it doesn't hold a candle and will be at least 2x less efficient. No wonder Intel is releasing hyped marketing materials, because any IT dept with some technical knowledge will realize that getting SPR instead of Genoa is the best way to render your company uncompetitive.
I wouldn't think idle power would be relevant for games? I was thinking a lower-core-count SKU would fare better, especially if the cores it did use would boost higher.

Or is there another reason that having fewer cores on die wouldn't allow for higher game/boost clocks for the cores used?
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
Nobody is buying these things to play games.
I disagree, I would be interested in a HEDT platform with around 16-24 cores that did well in games. The problem with the mainstream platforms is not enough IO.
 

Abwx

Lifer
Apr 2, 2011
I disagree, I would be interested in a HEDT platform with around 16-24 cores that did well in games. The problem with the mainstream platforms is not enough IO.

I'm talking about performance in apps excluding games, which are not relevant tests for such a CPU.

The mediocre perf/watt is partly due to server CPU requirements, which mandate at least 10% higher voltage than DT parts at the same frequencies, because stability is more important than relative performance.

In this respect, putting a lot of cores in a monolithic part is not a good idea, as very high currents (500A or more) have to find their way through a single piece of silicon, with large voltage drops that are compensated by an even higher supply voltage.
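The current-delivery point can be illustrated with Ohm's law. The 500 A figure comes from the post above, but the power-delivery-network resistance here is an illustrative assumption, not a measured value:

```python
def supply_overhead(current_a: float, pdn_resistance_ohm: float) -> tuple[float, float]:
    """IR drop across the power-delivery network, and the power burned in it."""
    v_drop = current_a * pdn_resistance_ohm       # V = I * R
    p_pdn = current_a ** 2 * pdn_resistance_ohm   # P = I^2 * R
    return v_drop, p_pdn

# 500 A through a hypothetical 0.1 milliohm of package/on-die distribution resistance
v_drop, p_loss = supply_overhead(500, 0.0001)
print(f"IR drop: {v_drop * 1000:.0f} mV, PDN loss: {p_loss:.0f} W")
# IR drop: 50 mV, PDN loss: 25 W
```

At core voltages well under 1 V, a 50 mV drop is a significant margin to compensate with a higher supply voltage, and the I²R loss grows quadratically with current - which is the monolithic many-core problem the post describes.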

Last but not least, they possibly used the first iteration of their 7 process, in which case they can do a refresh on 7 ESF and reduce the gap with AMD - but that won't be enough; for competitiveness they have to go to their next node. In the meantime they'll surely lose a big chunk of their market share, as the numbers are such that it would be total incompetence to use SPR rather than a Zen 4 based SKU.
 

JoeRambo

Golden Member
Jun 13, 2013
One can imagine these new HEDT Intel Xeons as basically lower-clocked Raptor Lake cores with 2MB of L2, no (effective) L3 cache, and horrible memory latency at whatever speed due to multiple channels and mesh transitions.


It is disaster-level L3 caching for games; 30-40 ns at whatever speed will simply destroy gaming performance. Obviously the 56C SKU from the XCC die is even more useless, as those 4 chiplets degrade L3 and memory performance even further. MCC die SKUs with up to 32C would perform somewhat better.

AMD's HEDT chips have equally bad server-SoC memory latency, but they have a functioning chiplet-local L3 cache of decent size that would definitely perform quite well in games.


What is unfortunate for Intel is that this "gaming" performance deficit carries over into server-style workloads as well. For example, in database loads I'd expect Intel to completely and utterly outperform AMD, yet their advantage is not that big. A dysfunctional L3 cache with disaster-level latency and no size to show for it is a big part of that.
Like I've said a few pages before - IF they can sort out the process, they have the architecture in place for some 512-768 MB L3 cache chip to really shine (or a scheme with an L4 cache could actually work out for them; their L3 is hurting anyway, so they might as well use it for L4 tags and gain capacity).

EDIT: I was generous with "30-40ns" above; in fact it is way worse.
From the same CnC review:

Up until we exceed L2 TLB coverage, we see about 39 ns of effective access latency for the L3 cache. As we spill out of that though, we see an incredible 48.5 ns of effective L3 latency.

With the same tool the authors used, and 2MB TLB, I get the following on my desktop:



Let that sink in for a moment: desktop Raptor Lake has the same memory access latency as what Intel's engineers managed to achieve in SPR's L3 cache.
Quite the achievement, with extra points for having the IMC on the same "chip", from a company that has had industry-leading memory controllers for decades.
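For reference, latency tools like the one referenced (Chips and Cheese style) measure with a dependent pointer chase: each load's address comes from the previous load's result, so accesses cannot be overlapped and the average time per step approximates true access latency. A rough Python sketch of the pattern follows; interpreter overhead swamps the actual memory latency here, so it only illustrates the technique, not real numbers (serious tools do this in C or assembly):

```python
import random
import time

def pointer_chase_ns(n_elems: int, iters: int) -> float:
    """Average time per dependent access over a random cyclic permutation."""
    # Build a single random cycle so every access depends on the previous one.
    order = list(range(n_elems))
    random.shuffle(order)
    nxt = [0] * n_elems
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b

    idx = 0
    start = time.perf_counter()
    for _ in range(iters):
        idx = nxt[idx]          # each load depends on the previous result
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e9

# Chase through ~1M entries; the random order defeats prefetchers.
print(f"{pointer_chase_ns(1 << 20, 1_000_000):.1f} ns per access (interpreter-dominated)")
```

Sweeping `n_elems` from L1-sized to DRAM-sized working sets is what produces the latency-vs-footprint curves in reviews like the one quoted above.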
 