Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

inquiss · Mar 19, 2025

Win2012R2 said:
It's very marginally faster (frequency wise) in Zen 5 - and no, that's not better, because the large discrepancy in cache size makes those cores not the same, therefore presenting a problem for the scheduler.

I am arguing for two chiplets with 3D cache rated to the same frequency, as close to identical as possible to reduce scheduling issues, plus NUMA mode should be activatable in BIOS so that the OS knows they are different cache domains.

It doesn't solve any scheduling issues. If you want something with a large cache, pin it to the same CCD. AMD does this in software now.

You still don't want cross-CCD talk, ideally, whether they both have cache, both don't have cache, or are asymmetric. What you want is for games to pin to one CCD, and then move to the other if threads are full. What games need more than 8 cores? Even with two vcache CCDs you'd pin the game to the first 8 cores. You're in the same position.
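For illustration, pinning a process to one CCD by hand might look like the sketch below on Linux. It is only a sketch under assumptions: the CPU numbering used for the first CCD (`0-7,16-23`) is hypothetical and varies by system, and `parse_cpulist` and `pin_to_cpus` are helper names made up for this example. `os.sched_setaffinity` is the standard-library call and is Linux-only.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-7,16-23" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_cpus(pid, cpus):
    """Restrict a process to the given CPUs (Linux only; pid 0 is the caller)."""
    os.sched_setaffinity(pid, cpus)

# Hypothetical layout: first CCD = cores 0-7 plus their SMT siblings 16-23.
CCD0 = parse_cpulist("0-7,16-23")
```

Calling `pin_to_cpus(game_pid, CCD0)` would then keep all of that process's threads on the one CCD, regardless of what the other CCD looks like.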

gdansk · Mar 19, 2025

I also doubt whether AMD would throw two good-bin X3D CCDs in the same part. You may get one mid X3D CCD mixed with a nice one like the 9950X3D has. You're not getting uniformity unless AMD gives them an Epyc haircut.

Maybe still desirable to some.

igor_kavinski · Mar 19, 2025

The problem is they are NOT even trying. They could release a Ryzen PRO version that's priced maybe $250 higher. OK, maybe that's not enough for them. How about $500 higher? Let people discover what workloads are accelerated with V-cache on both CCDs. We have seen from both AMD and Intel that they are clueless about what people actually use for their workflow. AMD included Geekbench 5 in their Zen 5 slides and Intel keeps presenting Cinebench scores. In reality, there could be hundreds of applications benefitting from V-cache which AMD can't test unless they create a whole new division called Special V-cache Software Testing and Benchmarking Division.

StefanR5R · Mar 19, 2025

MS_AT said:
Sorry I thought we were discussing why there is no client halo part with 2 x3d chiplets.

Perhaps; I didn't pay close attention to all this circular discussion of dual V-cache. My own comments were on homogeneous vs. heterogeneous CPUs. And I let myself get pulled into this only because I too easily get irritated by apologism of heterogeneous CPUs.

Tuna-Fish said:
vcache on both chiplets does not help with the scheduler issues!

StefanR5R said:
Wrong.

MS_AT said:
Since with Zen5 the freq diff is small enough, the x3d chiplet is usually the better choice.

It is not this easy.
I assume the owner bought a 16c/32t CPU in order to frequently use 16c/32t. (It's a 2x 8c/16t CPU actually but limiting the thread pool size per computational subtask is something which operators have been doing for ages now; not just because machine topology may be asking for it but also plainly because of Amdahl's law.)
– Now what do you do when you have a homogeneous workload, CFD for instance? You want a homogeneous CPU.
– Or what do you do when you have a heterogeneous workload? Either you profile it sufficiently and give the necessary hints to the OS how to fit it onto the heterogeneous machine. Or you simply take a homogeneous machine if you can get your hands on one.
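As an aside on the Amdahl's law remark above, a minimal sketch of why thread pools get capped: speedup flattens quickly once the serial part dominates. The parallel fraction of 0.9 is a made-up value for illustration, not a measurement of any workload.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / threads)

# Hypothetical workload that is 90% parallelizable.
for n in (8, 16, 32):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Under that assumption, doubling from 8 to 16 threads raises the speedup from about 4.7x to 6.4x, and doubling again to 32 only reaches about 7.8x.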

Tuna-Fish said:
The problem is that the round-trip from one CCD to another takes forever, not just that only one of the CCDs has vcache. If they both had vcache, you would still take a horrible penalty if game threads got scheduled across the split.

StefanR5R said:
"The problem" which you are stating only exists if threads share large hot data.

Tuna-Fish said:
This describes all games.

I take your word for it, as I last ran computer games myself (and even wrote one) in the 1990s.

[However, if a game is CPU intensive and is parallel enough to make good use of more than eight cores, then everything which is old knowledge in the HPC world can and should be applied in game engines too — but isn't, because performance optimization is typically last in line for budget allocation.]

Win2012R2 · Mar 19, 2025

inquiss said:
What you want is games to pin to one CCD

No, what I want is 3D cache on both chiplets that are rated to the same frequency (so no B grade 2nd chiplet). That's what I want and am prepared to pay for ($1K max).

What's the problem? It's an upsell to AMD easily done.

GTracing · Mar 19, 2025

Win2012R2 said:
No, what I want is 3D cache on both chiplets that are rated to the same frequency (so no B grade 2nd chiplet). That's what I want and am prepared to pay for ($1K max).

What's the problem? It's an upsell to AMD easily done.

The CCD with the 3D-Vcache runs slower because of the cache, not because of binning. See the 9800X3D clock speeds versus the 9700X clock speeds.

If AMD made a CPU where both CCDs had a cache die, it would have two slow CCDs, not two fast CCDs.

desrever · Mar 19, 2025

GTracing said:
The CCD with the 3D-Vcache runs slower because of the cache, not because of binning. See the 9800X3D clock speeds versus the 9700X clock speeds.

If AMD made a CPU where both CCDs had a cache die, it would have two slow CCDs, not two fast CCDs.

2 slow CCDs and cost $100 more, think of the value lol

inquiss · Mar 19, 2025

Win2012R2 said:
No, what I want is 3D cache on both chiplets that are rated to the same frequency (so no B grade 2nd chiplet). That's what I want and am prepared to pay for ($1K max).

What's the problem? It's an upsell to AMD easily done.

Sure. So, in the cases where one core is maxed out in something, the other has a vcache and can perform those functions quicker if it's sensitive to vcache and slower if it's not. That's all well and good.

Doesn't help with scheduling though.

I guess all I can say is that people with those wants exist. You're one of them. But I can't think of a benefit to making the SKU besides pleasing some forum dwellers (affectionately), and that can't monetise this new product. Maybe when games use more than 16 threads it will have a market.

Win2012R2 · Mar 19, 2025

GTracing said:
The CCD with the 3D-Vcache runs slower because of the cache, not because of binning

Obviously, but in Zen 5 clocks for the 3D version are close enough to non-3D for the difference to be immaterial in my view. What's material is uneven chiplets, both in terms of frequency and cache. This may have been the only way in Zen 4, but now having a non-3D chiplet with 3% faster clocks is a downside, not an upside.

Anything that is not even makes scheduling harder

moinmoin · Mar 19, 2025

The whole discussion about determinism using two more similar CCDs with X3D is very silly, because perfect determinism is already defeated by the fact that cores on the same die have different max f and f/v behavior. And servers achieve more determinism by generally clocking lower.

CouncilorIrissa · Mar 19, 2025

Win2012R2 said:
Obviously, but in Zen 5 clocks for the 3D version are close enough to non-3D for the difference to be immaterial in my view. What's material is uneven chiplets, both in terms of frequency and cache. This may have been the only way in Zen 4, but now having a non-3D chiplet with 3% faster clocks is a downside, not an upside.

Anything that is not even makes scheduling harder

It's not only about clock speed. The V$ die isn't universally faster than the non-V$ die even at the same clock, because the larger cache incurs (well, at least on Zen 4 it did) a 4-cycle penalty for accessing L3. So if your workload fits within 32MB, it would be faster on the non-V$ die even if it clocked the same.
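A toy model of that trade-off, as a rough sketch only: it assumes uniform random accesses over the working set, and the base L3 latency (50 cycles) and miss cost (400 cycles) are hypothetical numbers picked for illustration. Only the 4-cycle penalty and the 32 MB / 96 MB sizes come from this thread.

```python
def avg_access_cycles(working_set_mb, l3_mb, l3_cycles, miss_cycles):
    """Toy model: the share of the working set that fits in L3 hits at
    l3_cycles; the remainder pays miss_cycles."""
    hit_rate = min(1.0, l3_mb / working_set_mb)
    return hit_rate * l3_cycles + (1.0 - hit_rate) * miss_cycles

# Hypothetical latencies, for illustration only (not measured values).
BASE_L3_CYCLES = 50
VCACHE_PENALTY = 4   # the 4-cycle figure mentioned above
MISS_CYCLES = 400

def compare(working_set_mb):
    """Return (non-V$ die, V$ die) average access cost in cycles."""
    plain = avg_access_cycles(working_set_mb, 32, BASE_L3_CYCLES, MISS_CYCLES)
    vcache = avg_access_cycles(
        working_set_mb, 96, BASE_L3_CYCLES + VCACHE_PENALTY, MISS_CYCLES
    )
    return plain, vcache
```

With these made-up numbers, `compare(24)` favours the non-V$ die (50 vs 54 cycles), while `compare(64)` favours the V$ die by a wide margin (225 vs 54 cycles).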

StefanR5R · Mar 19, 2025

gdansk said:
You're not getting uniformity unless AMD gives them an Epyc haircut.

I agree. Alas, it's anyone's guess whether or not AMD is ever going to treat AM5 EPYC (alias EPYC 4000) better than a least-effort Ryzen derivative.

Win2012R2 said:
[…] plus NUMA mode should be activatable in BIOS so that OS knows it's different cache domains.

Actually the operating system is well aware of cache topology. (Linux is; I suppose Windows is too.) It just does not apply a cache-aware scheduling policy by default. I suspect this is because the kernel authors do not believe that a generally good enough default policy exists. This is in contrast to non-uniform main memory access (NUMA): Operating systems (Linux at least, I suppose several Windows flavors too) actually do apply a default NUMA-aware scheduling policy. (1. Try to keep a process, including all of its subthreads, running on one NUMA node, such that the process accesses mostly near memory. 2. Spread the overall load from different processes across NUMA nodes. This is a good NUMA related policy in many but not all cases.)

Now, this BIOS option which you mentioned — which lets the BIOS tell the OS that each last-level cache domain is a NUMA node — is actually a bit of a hack:
– It tricks the OS into applying its default NUMA-aware scheduling policy as if it was a cache-domain-aware scheduling policy.
– It tricks NUMA-aware userspace tools and settings to function like cache-domain-aware tools and settings.
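To make the cache-topology point concrete, here is a minimal sketch of how userspace on Linux could group CPUs by last-level cache domain from sysfs. It assumes that `cache/index3` is the L3 on the machine in question (the `level` file next to it can confirm this); `read_l3_lists` and `group_by_l3` are names made up for this example.

```python
import glob
import os

def read_l3_lists(root="/sys/devices/system/cpu"):
    """Read each CPU's index3 shared_cpu_list from Linux sysfs.
    Assumes index3 is the L3 cache on this machine."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cache", "index3", "shared_cpu_list")
    for path in glob.glob(pattern):
        cpu_dir = path.split(os.sep)[-4]   # e.g. "cpu12"
        with open(path) as f:
            result[int(cpu_dir[3:])] = f.read()
    return result

def group_by_l3(shared_lists):
    """Group CPU ids that report the same shared_cpu_list string."""
    domains = {}
    for cpu, cpulist in shared_lists.items():
        domains.setdefault(cpulist.strip(), []).append(cpu)
    # Order domains by their lowest CPU id for stable output.
    return sorted((sorted(cpus) for cpus in domains.values()), key=lambda c: c[0])
```

On a dual-CCD part, `group_by_l3(read_l3_lists())` would be expected to return two groups, one per CCD. A tool could then apply its own cache-aware placement without the NUMA-per-L3 BIOS option.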

GTracing said:
If AMD made a CPU where both CCDs had a cache die, it would have two slow CCDs, not two fast CCDs.

a) He referred to performance determinism, not to extreme peak performance.
b) f_max = 5.2 GHz (9800X3D) or 5.55 GHz (9950X3D) is not "slow".

desrever said:
2 slow CCDs and cost $100 more, think of the value lol

a) They are not slow. b) Performance is workload dependent, and so is value.

moinmoin said:
The whole discussion about determinism using two more similar CCDs with X3D is very silly, because perfect determinism is already defeated by the fact that cores on the same die have different max f and f/v behavior.

This is wrong if you consider the particular workloads in which 96 MB L3$/CCX actually make a difference to 32 MB L3$/CCX.

moinmoin said:
And servers achieve more determinism by generally clocking lower.

A dual-CCD CPU which runs twelve or more computationally intense threads does not run at f_max either.

CouncilorIrissa said:
if your workload fits within 32MB,

you buy a vanilla CPU.

Markfw · Mar 19, 2025

Question: why is a 16-core EPYC 4004 (4.5 GHz) + motherboard ($399 + $299 = $698 total) close to the price of a 9950X (4.3 GHz, $542), and the 4004 runs faster?

inquiss · Mar 19, 2025

CouncilorIrissa said:
It's not only about clock speed. The V$ die isn't universally faster than the non-V$ die even at the same clock, because the larger cache incurs (well, at least on Zen 4 it did) a 4-cycle penalty for accessing L3. So if your workload fits within 32MB, it would be faster on the non-V$ die even if it clocked the same.

I think this is a red herring? I think the point here is that, even if the cores were exactly the same, going across to the other CCD will incur a penalty whether it's the same or not. You want all threads on one CCD when you can. If everything was identical, you'd still want to pin the game to one CCD.

Shmee · Mar 19, 2025

Markfw said:
Question: why is a 16-core EPYC 4004 (4.5 GHz) + motherboard ($399 + $299 = $698 total) close to the price of a 9950X (4.3 GHz, $542), and the 4004 runs faster?

What generation is the 4004? I don't keep track of all the server parts, but if it is an older generation like Zen 2 or 3 etc, it may be faster in rated frequency but still be using an older architecture, thus often slower clock per clock. Also, in actual benchmarks/usage, it may be slower.

gdansk · Mar 19, 2025

Shmee said:
What generation is the 4004? I don't keep track of all the server parts, but if it is an older generation like Zen 2 or 3 etc, it may be faster in rated frequency but still be using an older architecture, thus often slower clock per clock. Also, in actual benchmarks/usage, it may be slower.

It's Zen 4. The 4564P he's talking about is a 7950X by another name.

GTracing · Mar 19, 2025

Markfw said:
Question: why is a 16-core EPYC 4004 (4.5 GHz) + motherboard ($399 + $299 = $698 total) close to the price of a 9950X (4.3 GHz, $542), and the 4004 runs faster?

Zen 4 vs. Zen 5? The 7950X also has a base clock of 4.5 GHz.

Shmee · Mar 19, 2025

Ok, so that helps answer the question... the 4004 is a generation older, though still not much older. You may be able to get it at a bit of a discount though. And where are you seeing it for this price? Prices will vary by seller, and can be especially low for used parts on eBay and similar.

fastandfurious6 · Mar 19, 2025

Markfw said:
Question: why is a 16-core EPYC 4004 (4.5 GHz) + motherboard ($399 + $299 = $698 total) close to the price of a 9950X (4.3 GHz, $542), and the 4004 runs faster?

it doesn't. slower than 7950x too. 4584PX is the fast 16core with 3D

Markfw · Mar 19, 2025

gdansk said:
It's Zen 4. The 4564P he's talking about is a 7950X by another name.

Thanks, I missed that. It says gen 4 in the description of the product. No way I want it then.

Shmee said:
Ok so that helps answer the question...the 4004 is a generation older, though still not much older. You may be able to get it at a bit of a discount though. And where are you seeing it for this price? Prices will vary by seller, and especially can be lower from used on Ebay and similar.

see above.

MS_AT · Mar 19, 2025

StefanR5R said:
It is not this easy.

I am afraid we are arguing about 2 different things. All I am saying is that the 9950x3D has nullified the weakness of the 7950x3D by minimising the frequency difference between the vanilla and x3D CCDs, to the point where betting on the x3D CCD as the default one is a set-and-forget solution, as it won't hurt your general performance meaningfully. That could not be said about the 7950x3D. Since the OS cannot know whether the apps it's running prefer MHz or MBs of cache until they tell it, the 9950x3D is easier to set up for the scheduler than the 7950x3D in the general case. Adding a second x3D CCD wouldn't make its job meaningfully easier, as the biggest problem with 2 CCDs is that they are 2 CCDs. But yes, some workloads would see better performance from 2x x3D CCDs. It's just orthogonal to the scheduling problem, IMO.

dr1337 · Mar 19, 2025

inquiss said:
I think this is a red herring? I think the point here is that, even if the cores were exactly the same, going across to the other CCD will incur a penalty whether it's the same or not. You want all threads on one CCD when you can. If everything was identical, you'd still want to pin the game to one CCD.

It's quite a penalty, both in latency and absolute bandwidth. Current GMI3 links are at 36 GB/s, but looking at Aida64 results of 9800X3Ds on Google, L3 bandwidth clocks in at over 700 GB/s. So an order of magnitude reduction (and then some) in IO speed just to request from the cache on a different CCD.
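A quick back-of-the-envelope check of the "order of magnitude (and then some)" claim, using only the two figures quoted above. Treating them as directly comparable is itself an assumption, since one is a link rate and the other a benchmark result.

```python
import math

gmi_gb_s = 36.0    # GMI3 link figure quoted in the post
l3_gb_s = 700.0    # Aida64 L3 bandwidth figure quoted in the post

ratio = l3_gb_s / gmi_gb_s
orders = math.log10(ratio)
print(f"ratio: {ratio:.1f}x, about {orders:.2f} orders of magnitude")
```

That works out to roughly a 19x gap, or about 1.3 orders of magnitude.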

MadRat · Mar 19, 2025

dr1337 said:
It's quite a penalty, both in latency and absolute bandwidth. Current GMI3 links are at 36 GB/s, but looking at Aida64 results of 9800X3Ds on Google, L3 bandwidth clocks in at over 700 GB/s. So an order of magnitude reduction (and then some) in IO speed just to request from the cache on a different CCD.

Seems like cache communication between cores is limited by the laws of physics. So until there's a new law of physics, focus on what is one off from the current market. So much speculation in the thread appears to throw the baby out with the bathwater. There is no magic what-if coming in Zen 5. It's awesome that AMD broke the decorum demanding no big caches for each successive generation. But the innovation today is pretty cool: talk about L1, L2, etc. now focuses on whole chips being added on to the package.

fastandfurious6 · Mar 20, 2025

1) Medusa 12core CCD = easier for scheduler to fit more stuff in each

2) OS scheduler gets smarter until H2 '26

3) 3D cache production ramped up for next gen so Medusa likely has x2 3Dcache models

igor_kavinski · Mar 22, 2025

Our Zen Wizard AKA Det0x hit north of 6 GHz on his 9950X3D sample: https://www.overclock.net/posts/29445411/

Like, seriously, AMD! Make him the Global Director of Overclocking Affairs
