Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 812

Hitman928

Diamond Member
Apr 15, 2012
6,050
10,380
136

I mean it looks stylish to see an 8 core Zen 5 beat Intel's best...but let's be honest nobody should care.
It's 50% more power draw on both the 6- and 8-core parts, for a performance gain of 10% at best.

If you’re going by the overall average, it’s 41% more power for the 9700X and 31% for the 9600X. Looking at the overall average performance increase doesn’t make much sense, though.
 

MarkPost

Senior member
Mar 1, 2017
295
531
136

I mean it looks stylish to see an 8 core Zen 5 beat Intel's best...but let's be honest nobody should care.
It's 50% more power draw on both the 6- and 8-core parts, for a performance gain of 10% at best.

What I highlight from this chart is that the 9600X at the same TDP is 29% faster than the 7600X, and the 9700X at the same TDP is 23% faster than the 7700X. In my view that's a total success for a new gen.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
// Models 00h-0Fh (Breithorn).
// Models 10h-1Fh (Breithorn-Dense).
// Models 20h-2Fh (Strix 1).
// Models 30h-37h (Strix 2).
// Models 38h-3Fh (Strix 3).
// Models 40h-4Fh (Granite Ridge).
// Models 50h-5Fh (Weisshorn).
// Models 60h-6Fh (Krackan1).
// Models 70h-77h (Sarlak).

Some new codenames for Zen 5 CPUs
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
What I highlight from this chart is that the 9600X at the same TDP is 29% faster than the 7600X, and the 9700X at the same TDP is 23% faster than the 7700X. In my view that's a total success for a new gen.
Overall yes, Zen 5 is fine.
But it's notably poor for scalar/INT and relies heavily on SIMD for its performance increase. Games get 5%, compilation 10%. It's strongly carried by interpreters and massive data crunchers.

A server gen, through and through.
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
Yay finally, maybe they will make it to clang 19 with real tunings instead of copy/paste from zen4.
I am genuinely veeeeery curious about possible compiler improvements in the upcoming year.
How much can that split frontend improve? Everyone seems to agree that it's still misused/mostly unused...how far could compiler upgrades help it, especially in scalar?
 

misuspita

Senior member
Jul 15, 2006
497
589
136

MS_AT

Senior member
Jul 15, 2024
207
497
96
I am genuinely veeeeery curious about possible compiler improvements in the upcoming year.
How much can that split frontend improve? Everyone seems to agree that it's still misused/mostly unused...how far could compiler upgrades help it, especially in scalar?
Not at all, I'm afraid. According to AMD, a single thread gets a 4-wide decoder, and the design is more latency bound than throughput bound anyway. So while I guess the compiler guys can do wonders, I wouldn't keep my hopes high.
Overall yes, Zen 5 is fine.
But it's notably poor for scalar/INT and relies heavily on SIMD for its performance increase. Games get 5%, compilation 10%. It's strongly carried by interpreters and massive data crunchers.

A server gen, through and through.
Well, servers are also used for compiling. The problem with compilation is that a poor build system can kill CPU scaling, so the actual performance improvement might be a tad higher; I have heard about 15%. You could call it splitting hairs, but well...
 

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,485
2,406
136
May I interrupt you for a new interpretation of the miniPC?



I mean... wtf??? Speakers? )) Other than that, nice design, if it's anything like the previous model it's really quiet but.... again... speakers? In that form factor? OMG...
Maybe they're going to use the speakers for active noise cancellation of the fan
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
Not at all, I'm afraid. According to AMD, a single thread gets a 4-wide decoder, and the design is more latency bound than throughput bound anyway. So while I guess the compiler guys can do wonders, I wouldn't keep my hopes high.
Speaking of that, AMD sounds like they're hiding something under the carpet.
- C&C found that actually AMD's statements regarding double decoder throughput isn't applied with SMT being kind of not fully working (yes vague, I forgot the details)
- David Huang also found something of that extent IIRC

It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gained 5% perf in scalar apps.

As for latency bound...I honestly don't even understand how that works. Decode finishes its job and passes it to the backend and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency from within the CPU once decode is done?
 

yuri69

Senior member
Jul 16, 2013
531
948
136
Speaking of that, AMD sounds like they're hiding something under the carpet.
- C&C found that actually AMD's statements regarding double decoder throughput isn't applied with SMT being kind of not fully working (yes vague, I forgot the details)
- David Huang also found something of that extent IIRC

It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gained 5% perf in scalar apps.

As for latency bound...I honestly don't even understand how that works. Decode finishes its job and passes it to the backend and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency from within the CPU once decode is done?
In layman's terms: it doesn't matter whether you decode 4 or 14 instructions if you often have to wait to "get" even the first instruction. E.g. you are waiting for the branch predictor to decide due to its deep structures, or you were unlucky and the TLB doesn't hold the needed address - you wait.
 

yottabit

Golden Member
Jun 5, 2008
1,482
513
146
Speaking of that, AMD sounds like they're hiding something under the carpet.
- C&C found that actually AMD's statements regarding double decoder throughput isn't applied with SMT being kind of not fully working (yes vague, I forgot the details)
- David Huang also found something of that extent IIRC

It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gained 5% perf in scalar apps.

As for latency bound...I honestly don't even understand how that works. Decode finishes its job and passes it to the backend and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency from within the CPU once decode is done?
The TL;DR I remember is that the 2x 4-wide decoder is only fully utilized with SMT (2 threads per core)

And as for why a single thread can't utilize the entire 8-wide decoder: AMD claims it wouldn't have improved performance much (and I would guess it may have hurt perf/watt too)

C&C did some analysis basically supporting AMD's conclusion. I believe C&C showed that something like ~90% of the time, 1T workloads could be served without penalty by just the single 4-wide decoder, because they were able to fetch the associated instructions out of the micro-op cache
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
I expect branch prediction to evolve every generation, so Zen 6 should be somewhat better there anyway; but if the problem is TLB/memory lookup, wouldn't the improved uncore in Zen 6 help with that too? Less latency and all?
 

yottabit

Golden Member
Jun 5, 2008
1,482
513
146
"A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location."
Thanks so much for the noob explanation.

every day, the acronyms stray further from God
Well if you want an hours-long explanation I can recommend this video starting from -29 minutes
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136
In layman's terms: it doesn't matter whether you decode 4 or 14 instructions if you often have to wait to "get" even the first instruction. E.g. you are waiting for the branch predictor to decide due to its deep structures, or you were unlucky and the TLB doesn't hold the needed address - you wait.
Also the uop cache has such an insanely high hit rate that it doesn't seem to matter all that much.

I wonder how much performance would have been improved if:

min page size = 16k
L1D = 128k 8 way
L2/L3 = 5-6 MB L2 per slice, dynamically partitioned L3 (aka Z16)
int reg file = 400+


Bigger page sizes just seem to win these days: fewer TLB misses, an easier L1D cache. All the really high-IPC cores have massive L2s at good latency, but AMD would still want to keep the scale-out nature of L1 -> L2 -> L3 -> CCX -> CCD -> socket, so a massive L2 plus a massive L3 seems not ideal for this design target; they need to be "smarter" about the L2/L3 cache.

Hope we get something more ambitious going forward; I just bought a 7800X3D... rofl. I don't care for full-rate AVX-512, and I figure Zen 5 might be a few % faster in some benchmarks, but I bet in late-game Civ/simulation the X3D still wins by a massive amount.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136

MS_AT

Senior member
Jul 15, 2024
207
497
96
Also the uop cache is having an insanely high hit rate that it doesn't seem to matter all that much.
I think the problem is how much you have to wait when it misses. Games reach something like 75% uop cache utilization on Zen 4 per C&C's analysis, and if you are unlucky the core will sit there doing nothing for ~40 cycles while it fetches instructions to decode from L3, even worse if from memory. So personally I am curious whether L1i cache growth would help here, and why they haven't bumped its size since Zen 2? I think Intel uses at least a 64 kB I-cache, not to mention the gigantic L1i caches on leading ARM uarchs [192 kB? iirc].

Also for gaming workloads I recommend this C&C analysis of Zen 4, https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ , which partially explains why Zen 5 doesn't improve in gaming by a lot.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136
I think the problem is how much you have to wait when it misses. Games reach something like 75% uop cache utilization on Zen 4 per C&C's analysis, and if you are unlucky the core will sit there doing nothing for ~40 cycles while it fetches instructions to decode from L3, even worse if from memory. So personally I am curious whether L1i cache growth would help here, and why they haven't bumped its size since Zen 2? I think Intel uses at least a 64 kB I-cache, not to mention the gigantic L1i caches on leading ARM uarchs [192 kB? iirc].

Also for gaming workloads I recommend this C&C analysis of Zen 4, https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ , which partially explains why Zen 5 doesn't improve in gaming by a lot.
Zen 5's uop cache is significantly different from Zen 4's; I wouldn't draw assumptions/extrapolations. They did some profiling, but nothing gaming-specific.

 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Zen 5's uop cache is significantly different from Zen 4's; I wouldn't draw assumptions/extrapolations. They did some profiling, but nothing gaming-specific.

It's different in what it stores, and it's higher throughput, but comparing the Software Optimization Guides the overall fill mechanism looks the same, so on a miss either one will need to wait, unless I have misunderstood something.
 