Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 812

Hitman928

Diamond Member
Apr 15, 2012
6,050
10,380
136

I mean it looks stylish to see an 8 core Zen 5 beat Intel's best...but let's be honest nobody should care.
It's 50% more power draw on both the 6- and 8-core parts, for a performance gain of 10% at best.

If you’re going by the overall average, it’s 41% more power for the 9700X and 31% for the 9600X. Looking at the overall average performance increase doesn’t make much sense, though.
 

MarkPost

Senior member
Mar 1, 2017
295
531
136

I mean it looks stylish to see an 8 core Zen 5 beat Intel's best...but let's be honest nobody should care.
It's 50% more power draw on both the 6- and 8-core parts, for a performance gain of 10% at best.

What I highlight from this chart is that the 9600X at the same TDP is 29% faster than the 7600X, and the 9700X at the same TDP is 23% faster than the 7700X. In my view that's a total success for a new gen.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
// Models 00h-0Fh (Breithorn).
// Models 10h-1Fh (Breithorn-Dense).
// Models 20h-2Fh (Strix 1).
// Models 30h-37h (Strix 2).
// Models 38h-3Fh (Strix 3).
// Models 40h-4Fh (Granite Ridge).
// Models 50h-5Fh (Weisshorn).
// Models 60h-6Fh (Krackan1).
// Models 70h-77h (Sarlak).

Some new codenames for Zen 5 CPUs
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
What I highlight from this chart is that the 9600X at the same TDP is 29% faster than the 7600X, and the 9700X at the same TDP is 23% faster than the 7700X. In my view that's a total success for a new gen.
Overall yes, Zen 5 is fine.
But it's notably poor for scalar/INT and relies heavily on SIMD for its performance increase. Games get 5%, compilation 10%. It's strongly carried by interpreters and massive data crunchers.

A server gen, through and through.
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
Yay finally, maybe they will make it to clang 19 with real tunings instead of copy/paste from zen4.
I am genuinely veeeeery curious about possible compiler improvements in the upcoming year.
How much can that split frontend improve? Everyone seems to agree that it's still misused/mostly unused...how far could compiler upgrades help it, especially in scalar?
 

misuspita

Senior member
Jul 15, 2006
497
589
136

MS_AT

Senior member
Jul 15, 2024
207
497
96
I am genuinely veeeeery curious about possible compiler improvements in the upcoming year.
How much can that split frontend improve? Everyone seems to agree that it's still misused/mostly unused...how far could compiler upgrades help it, especially in scalar?
Not at all, I'm afraid. According to AMD, a single thread gets a 4-wide decoder, and the design is more latency bound than throughput bound anyway. So while I guess the compiler guys can do wonders, I wouldn't keep my hopes high.
Overall yes, Zen 5 is fine.
But it's notably poor for scalar/INT and relies heavily on SIMD for its performance increase. Games get 5%, compilation 10%. It's strongly carried by interpreters and massive data crunchers.

A server gen, through and through.
Well, servers are also used for compiling. The problem with compilation is that a poor build system can kill CPU scaling, so the actual performance improvement might be a tad higher; I have heard about 15%. You could call it splitting hairs, but well...
 

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,485
2,406
136
May I interrupt you for a new interpretation of the miniPC?



I mean... wtf??? Speakers? )) Other than that, nice design, if it's anything like the previous model it's really quiet but.... again... speakers? In that form factor? OMG...
Maybe they're going to use the speakers for active noise cancellation of the fan
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
Not at all, I'm afraid. According to AMD, a single thread gets a 4-wide decoder, and the design is more latency bound than throughput bound anyway. So while I guess the compiler guys can do wonders, I wouldn't keep my hopes high.
Speaking of that, AMD sounds like they're hiding something under the carpet.
- C&C found that actually AMD's statements regarding double decoder throughput isn't applied with SMT being kind of not fully working (yes vague, I forgot the details)
- David Huang also found something of that extent IIRC

It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gained 5% perf in scalar apps.

As for latency bound...I honestly don't even understand how that works. Decode finishes its job and passes it to the backend and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency from within the CPU once decode is done?
 

yuri69

Senior member
Jul 16, 2013
531
948
136
Speaking of that, AMD sounds like they're hiding something under the carpet.
- C&C found that actually AMD's statements regarding double decoder throughput isn't applied with SMT being kind of not fully working (yes vague, I forgot the details)
- David Huang also found something of that extent IIRC

It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gained 5% perf in scalar apps.

As for latency bound...I honestly don't even understand how that works. Decode finishes its job and passes it to the backend and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency from within the CPU once decode is done?
In layman's terms: it doesn't matter whether you decode 4 or 14 instructions if you often have to wait to "get" even the first instruction. E.g. you are waiting for the branch predictor to decide due to its deep structures, or you were unlucky and the TLB doesn't hold the needed address - you wait.
 

yottabit

Golden Member
Jun 5, 2008
1,482
513
146
Speaking of that, AMD sounds like they're hiding something under the carpet.
- C&C found that actually AMD's statements regarding double decoder throughput isn't applied with SMT being kind of not fully working (yes vague, I forgot the details)
- David Huang also found something of that extent IIRC

It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gained 5% perf in scalar apps.

As for latency bound...I honestly don't even understand how that works. Decode finishes its job and passes it to the backend and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency from within the CPU once decode is done?
The TL;DR I remember is that the 2x 4-wide decoder is only fully utilized with SMT (2 threads per core)

And as for why a single thread can't utilize the entire 8-wide decoder: AMD claims it wouldn't have improved performance much (and I would guess it may have hurt perf/watt too)

C&C did some analysis basically supporting AMD's conclusion. I believe C&C showed that something like ~90% of the time, 1T workloads could be served without penalty by just the single 4-wide decoder, because they were able to fetch the associated instructions out of the micro-op cache
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,804
96
I expect branch prediction to evolve every generation, so Zen 6 should be somewhat better there anyway; but if the problem is TLB/memory lookup, wouldn't the improved uncore in Zen 6 help with that too? Less latency and all?
 

yottabit

Golden Member
Jun 5, 2008
1,482
513
146
"A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location."
Thanks so much for the noob explanation.

every day, the acronyms stray further from God
Well if you want an hours-long explanation I can recommend this video starting from -29 minutes
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136
In layman's terms: it doesn't matter whether you decode 4 or 14 instructions if you often have to wait to "get" even the first instruction. E.g. you are waiting for the branch predictor to decide due to its deep structures, or you were unlucky and the TLB doesn't hold the needed address - you wait.
Also the uop cache has such an insanely high hit rate that it doesn't seem to matter all that much.

I wonder how much performance would have been improved if:

min page size = 16k
L1D = 128k 8 way
L2/L3 = 5-6 MB L2 per slice, dynamically partitioned L3 (aka Z16)
int reg file = 400+


Bigger page sizes just seem to win these days: fewer TLB misses, an easier L1D cache. All the really high-IPC cores have massive L2s at good latency, but AMD would still want to keep the scale-out nature of L1 -> L2 -> L3 -> CCX -> CCD -> socket, so a massive L2 plus a massive L3 seems not ideal for this design target; they need to be "smarter" about the L2/L3 cache.

Hope we get something more ambitious going forward; I just bought a 7800X3D... rofl. I don't care for full-rate AVX-512, and I figure Zen 5 might be a few % faster in some benchmarks, but I bet in late-game Civ/simulation the X3D still wins by a massive amount.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136

MS_AT

Senior member
Jul 15, 2024
207
497
96
Also the uop cache is having an insanely high hit rate that it doesn't seem to matter all that much.
I think the problem is how much you have to wait when it misses. Games reach something like 75% uop cache utilization on Zen 4 per C&C's analysis, and if you are unlucky the core will sit there doing nothing for ~40 cycles while it fetches instructions to decode from L3, even worse if from memory. So personally I am curious whether L1i cache growth would help here, and why they haven't bumped its size since Zen 2? I think Intel uses at least a 64 kB I-cache, not to mention the gigantic L1i caches on leading ARM uarchs [192 kB? iirc].

Also for gaming workloads I recommend this C&C analysis of Zen 4, https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ , which partially explains why Zen 5 doesn't improve in gaming by a lot.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136
I think the problem is how much you have to wait when it misses. Games reach something like 75% uop cache utilization on Zen 4 per C&C's analysis, and if you are unlucky the core will sit there doing nothing for ~40 cycles while it fetches instructions to decode from L3, even worse if from memory. So personally I am curious whether L1i cache growth would help here, and why they haven't bumped its size since Zen 2? I think Intel uses at least a 64 kB I-cache, not to mention the gigantic L1i caches on leading ARM uarchs [192 kB? iirc].

Also for gaming workloads I recommend this C&C analysis of Zen 4, https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ , which partially explains why Zen 5 doesn't improve in gaming by a lot.
Zen 5's uop cache is significantly different from Zen 4's; I wouldn't draw assumptions/extrapolations. They did some profiling, but nothing gaming-specific.

 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Zen 5's uop cache is significantly different from Zen 4's; I wouldn't draw assumptions/extrapolations. They did some profiling, but nothing gaming-specific.

It's different in what it stores, and it's higher throughput, but comparing the Software Optimization Guides the overall fill mechanism looks the same, so on a miss either one will need to wait, unless I have misunderstood something.
 