> Not sure why AVX512 should be special in this regard.

It's possible if the FP units crunch through data so fast that they waste much of their time waiting for data from RAM because the L3 cache is too small to hold enough data to keep them fed and busy. If this is true, the X3D parts will be something special in AVX-512 workloads.
> It's possible if the FP units crunch through data so fast that they waste much of their time waiting for data from RAM because the L3 cache is too small to hold enough data to keep them fed and busy. If this is true, the X3D parts will be something special in AVX-512 workloads.

Yes, but this assumes the amount of data processed per unit of time is big. Which is not the case for many MT workloads, and not the case for many AVX512 workloads either.
But clearly Zen5, especially when running AVX-512, is a memory bandwidth hog.
> I'm doubtful any consumer Zen 6 implementation will enjoy full speed AVX 512.

Everybody keeps denying the change in AVX-512 applications. First it was a PURE AVX-512 benchmark showing 98% faster than Zen 4. Now we have at least one application that is actually used (in the application areas) where it is 30-40% faster. See the benchmarks in the DC forum here. These are not imaginary gains. In scientific and other areas, it should see these huge gains. That being the case, I doubt that it will die at Zen 5.
> DDR5 is roughly twice as fast as DDR4. Are the new cores on DDR5 twice as fast as the latest cores on DDR4? Do they require twice the memory bandwidth?

Yes, at least when running in AVX-512 mode - Zen5 is twice the throughput of Zen4, let alone Zen3 which only had AVX2!
> Depends on what kind of workload and data you're using AVX512 for. If the amount of data being processed is limited, it'll be kept in cache, which is fast and has very high memory bandwidth. Same as for other operations BTW. Not sure why AVX512 should be special in this regard.

It's special because it's chungus. AVX is all about loading huge amounts of data and processing it quickly. Then the more cores you add into the mix, the more likely it is that applications that weren't memory bandwidth dependent before will become memory bandwidth dependent.
> Source for this claim?

My source is the sporadic lackluster benchmark results.
Also, not all MT workloads use pure AVX512 anyway. In fact, probably very few do.
> Because for ideal MT workloads, you can double performance by doubling the number of cores.

**ideal MT workloads that aren't memory bandwidth bound, of which there will be fewer and fewer as you add more cores
> Yes, at least when running in AVX-512 mode - Zen5 is twice the throughput of Zen4, let alone Zen3 which only had AVX2!

That's a cherry-picked special case. What's the average perf increase when comparing the fastest DDR4 Zen core vs the fastest DDR5 Zen core?
> It's special because it's chungus. AVX is all about loading huge amounts of data and processing them quick. Then the more cores you add into the mix, the more likely it is applications that weren't memory bandwidth dependent before will become memory bandwidth dependent

If the active dataset that is being processed is limited, it'll fit in cache, which has very high memory bandwidth, so it won't be a problem. As igor_kavinski mentioned, X3D parts will help too, for those workloads where the active dataset is bigger.
Why do you think even Zen4 Epyc has 12 memory channels per socket?
> Right, but I'd argue almost all AVX512 workloads use MT. Once we see more apps compiled with AVX512, we'll see more lackluster gains from 9950x.

That does not mean all MT workloads are pure AVX512 though. Far from it.
> **ideal MT workloads that aren't memory bandwidth bound, of which there will be fewer and fewer as you add more cores

I don't think 16 cores is a lot for a Zen5 CPU. We've been on 16 cores since Zen2 on DT. Intel DT CPUs are already on 24 cores, while also using DDR5.
I get it - there are plenty of cases where more cores are better and not much memory bandwidth is needed (encryption, encoding, CPU-based photorendering, hypervisor spam).
But for each of those there are plenty of cases that are membw bound too, so the already niche idea of "moar cores" becomes increasingly so when combined with the ridiculously high AVX512 throughput the 9950x has.
Anyway, my main point is it would just generally be "imbalanced". 16 P-cores is already a lot for a non-HEDT part with only 2 channels of memory. Then there is also the power budget to worry about.
> Regarding power budget I agree though. But then the better solution is to use P + E cores. E.g. 8P cores for max ST perf, then X amount of E cores for max MT perf. Where the E cores are designed for lower power consumption, and optimal perf/watt.

I think we both agree on this one. I'm only arguing against 24-32 P-cores on dual-channel consumer desktop. IMO 16 is even too many (and I say this as an owner of a 5950X, 5900X, and 7840HS).
Regarding power budget I agree though. But then the better solution is to use P + E cores. E.g. 8P cores for max ST perf, then X amount of E cores for max MT perf. Where the E cores are designed for lower power consumption, and optimal perf/watt.
> If the active dataset that is being processed is limited, it'll fit in cache which has very high memory bandwidth, so it won't be a problem. As igor_kavinski mentioned, X3D parts will help too, for those workloads where the active dataset is bigger.

I partly agree with you. But for sure some computational programs such as y-cruncher and Prime95 are limited by main memory BW.
...
And even for pure AVX512 workloads, memory bandwidth does not have to be a problem anyway if the active dataset is limited, as mentioned above.
AIDA64 measures Zen5's memory bandwidth to be about 60 GB/s. To the untrained eye, this may seem like a lot. But when you break it down, it becomes clear that it is suffocatingly insufficient for Zen5's computational power.
- 60 GB/s divided across 16 cores becomes 3.75 GB/s per core.
- 3.75 GB/s divided by ~5 GHz CPU clock becomes 0.75 bytes/cycle.
- 0.75 bytes/cycle divided by a 64-byte (512-bit) load becomes 0.0117 loads/cycle.
- 1/0.0117 loads/cycle ≈ 85.3 cycles per load.
- Zen5's 4 x 512-bit execution width means 4 x 85.3 = ~340 instructions/load.
In plain English:
A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth.
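The arithmetic above can be checked in a few lines (a back-of-envelope sketch; the 60 GB/s, 16-core, and 5 GHz figures are the thread's estimates, not measurements of any particular workload):

```python
# Back-of-envelope version of the calculation above.
mem_bw = 60e9            # AIDA64-measured memory bandwidth, bytes/s (thread's figure)
cores = 16               # 9950X core count
clock = 5e9              # ~5 GHz per core
load_bytes = 512 // 8    # one 512-bit load = 64 bytes
simd_ops_per_cycle = 4   # Zen5: up to 4 x 512-bit ops per cycle

bytes_per_cycle = mem_bw / cores / clock            # 0.75 B/cycle per core
cycles_per_load = load_bytes / bytes_per_cycle      # ~85.3 cycles per 64-byte load
ops_per_load = simd_ops_per_cycle * cycles_per_load # ~341, i.e. the "~340" above

print(bytes_per_cycle, round(cycles_per_load, 1), round(ops_per_load))
```

The exact product is 341.3, which the post rounds down to ~340; the conclusion is the same either way.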
> Right, but I'd argue almost all AVX512 workloads use MT. Once we see more apps compiled with AVX512, we'll see more lackluster gains from 9950x.

If it's AVX-512 that has to be used for the CPU cores to use up the available DDR5 bandwidth, then we are cool, because there are pretty much no notable apps on desktop that use it. And I have serious doubts we are going to see more of them compiled anytime soon, especially since Intel does not even support it on desktop anymore.
> About why Zen5 SIMD is memory bottlenecked I will let myself cite Y-Cruncher author:

In a way it's even worse than that if you consider you can do 2x FMA + 2x FADD per cycle, which amounts to 6 512-bit ops per cycle (counting each FMA as two).
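Plugging that 6-ops-per-cycle figure into the same 60 GB/s back-of-envelope numbers from earlier in the thread makes the ratio steeper still (same assumptions as before):

```python
# 2 x FMA + 2 x FADD per cycle, with each FMA counted as two FP ops.
ops_per_cycle = 2 * 2 + 2 * 1                  # = 6 512-bit ops/cycle
cycles_per_load = 64 / (60e9 / 16 / 5e9)       # ~85.3 cycles per 64-byte DRAM load
print(round(ops_per_cycle * cycles_per_load))  # 512 ops per memory load
```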
> AVX-512 has been at this point available on desktop cpus since what, 2017 and Skylake-X? Yet no adoption of it happened.

Given how much Intel supports AVX-512 in the consumer market, how could it have been different? Intel sucks at supporting their own extensions. It took them 10 years to get AVX2 on all of their CPUs.
Beginning of 2027 is the rumour
> AVX-512 has been at this point available on desktop cpus since what, 2017 and Skylake-X? Yet no adoption of it happened.

It's been available on consumer desktop since Rocket Lake, so 2021 - and for one generation only, as Alder Lake and its derivatives don't have it due to the E-cores. Then Zen4 appeared and maintained availability, but from a different vendor. Skylake-X provided support only on HEDT machines, and its implementation was troublesome due to AVX offsets, the AVX512 "cold start" and other things, so if you were to only sprinkle a few AVX512 instructions into otherwise scalar code, you would hurt performance. On laptops you also had Tiger Lake and probably Ice Lake, but Ice Lake was short-lived on laptops. And since Rocket Lake was a real failure, not many people bought it, so to be honest we should count consumer availability as starting with Zen4.
> It's special because it's chungus. AVX is all about loading huge amounts of data and processing them quick. Then the more cores you add into the mix, the more likely it is applications that weren't memory bandwidth dependent before will become memory bandwidth dependent

Actually that's only one side of the coin. The other is saving power at the front-end. If you can find a few independent operations of the same type, you can execute them all with one instruction. To give a contrived example: with a 512b register you can do 16 32b float additions. You have 2 instructions to load the operands, one to do the arithmetic operation (the addition) and one to store the result. That's 4 instructions to decode. In pure scalar code you would need to decode 64 instructions in an unrolled loop to do the same work. [This contrived example assumes the uop cache does not exist, but even if the ops were served from the uop cache, each load takes at least 4 cycles to complete, and so on.] That's why AVX512 gave Zen4 a noticeable benefit and lower power draw than AVX2 or below. Also remember that for floating point arithmetic operations it does not matter whether you add only scalar values or the full width of 16 values in the case of 32b floats - the latency is the same.
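The 4-versus-64 decode count in that contrived example is easy to verify (a sketch of the counting only; a real compiler would of course also unroll and schedule differently):

```python
# Contrived example from the post above: add 16 pairs of float32 values.
lanes = 512 // 32               # 16 float32 lanes in one 512-bit register
per_element_insns = 2 + 1 + 1   # per scalar element: 2 loads + 1 add + 1 store
scalar_insns = lanes * per_element_insns  # 64 instructions to decode
vector_insns = 2 + 1 + 1        # 2 vector loads + 1 vector add + 1 vector store
print(scalar_insns, vector_insns)  # 64 4
```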
> This I think you'll have to clarify.

I am not referring to watts (energy / second) but total calories (total energy for the task).
A P-core consumes less power than an E-core? And you are talking about ARL DT?
> About why Zen5 SIMD is memory bottlenecked I will let myself cite Y-Cruncher author:

The y-cruncher example is essentially a worst-case scenario. It's also next to impossible to achieve. In reality MOST, but not all, AVX-512 workloads that are not purely synthetic will not be constantly streaming the maximum amount of data. They will digest chunks, manipulate them, test the results, then store the results of the manipulation or the findings of the test, then either wait on the non-AVX-512 portion of the code to do things, or move on to the next chunk of data.
> I am not a project management expert, but if the information that ZEN 5 has been developed by several completely different teams and faced reworks and delays is correct, it is a miracle that it even powers on.

Ordered an 8600G yesterday. I await its arrival with much delight for my usage case in one of my rigs.
The inter-CCD latency problem may be strongly related to some hardware flaw or peculiarity, which would make fixing it very difficult or impossible.
I wonder how many engineers will need to work on this "flopper" even after its release to make it less bad. It may have flopped many times already during development.
As an owner of a 5700G, I wonder why these CPUs are not much more popular in PCs. How are the 8x00G CPUs selling now?
> It's possible if the FP units crunch through data so fast that they waste much of their time waiting for data from RAM because the L3 cache is too small to hold enough data to keep them fed and busy. If this is true, the X3D parts will be something special in AVX-512 workloads.

Most likely because L3 bandwidth hasn't increased compared to Zen4. In Zen6 they may increase it to 64 B/cycle.