Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 838 - AnandTech Forums

OneEng2

Senior member
Sep 19, 2022
AMD hasn't OoOE'd the SMT away. 8-wide decode on Zen 5 is only really feasible with SMT enabled.

Intel has too much OoOE optimization to make SMT have a worthwhile massive uplift.
True that. It seems like they have Zen 5 balanced pretty well with respect to decode, execution, and retire paths. It wouldn't be able to feed all those execution units in parallel without the wide decode front end.
 

OneEng2

Senior member
Sep 19, 2022
Intel never got more than 10% for some reason. Their design must be garbage.
Ya know, I really don't quite understand this, but you're right (although I generally hear 10-15%). Intel was first to the desktop with SMT. Before the P4, SMT was only used in high-end server/workstation chip designs. Strange that they haven't managed to get better at it in all these years.

Also interesting is the spin from the two companies. When AMD had only SMP, their marketing spin was "we believe in having FULL cores". Now I hear the same type of discussion from Intel. It didn't go so well for AMD. Hope Intel has better luck with the idea.
 

Saylick

Diamond Member
Sep 10, 2012
True that. It seems like they have Zen 5 balanced pretty well with respect to decode, execution, and retire paths. It wouldn't be able to feed all those execution units in parallel without the wide decode front end.
I'm really, really hoping AMD improves Zen 6 such that both sets of decoders can work on the same thread in SMT-off mode. Then for those who want maximum MT throughput, they can leave SMT on. For those who want a stronk core, they can turn SMT off.
 

HurleyBird

Platinum Member
Apr 22, 2003
I'm really, really hoping AMD improves Zen 6 such that both sets of decoders can work on the same thread in SMT-off mode. Then for those who want maximum MT throughput, they can leave SMT on. For those who want a stronk core, they can turn SMT off.

Given that Mike Clark himself was confused for a bit over whether both sets of decoders could be utilized without SMT, my guess is that it's just bugged in Zen 5.
 

DavidC1

Golden Member
Dec 29, 2023
If it’s a bug, I hope it can be addressed via some sort of microcode update.
The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.
 

Saylick

Diamond Member
Sep 10, 2012
The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.
Well, there’s also the improved connection to the IOD, which hopefully reduces memory access latency. All in all, should lead to a core that’s better fed.
 

DrMrLordX

Lifer
Apr 27, 2000
The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.
That, or Zen 5 hasn't reached all its performance targets, and Zen 6's slated uptick in performance is rated against non-bugged Zen 5.
 

naukkis

Senior member
Jun 5, 2002
Intel never got more than 10% for some reason. Their design must be garbage.

You've got it backwards. 2-way SMT scaling is exactly 100% when the workload is 100% memory-access-latency bound. So obviously SMT scaling is at its lowest when the memory system is fast and efficient. Intel has pretty good SMT scaling on the server side, where memory accesses are pretty slow.

So those tests where AMD SMT scaling is near 100% actually show the weak points of AMD's chiplet designs. In those same tests we should see less SMT improvement on Intel mesh systems but better overall performance, which makes admiring great SMT gains pretty pointless.
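As a rough sketch of the arithmetic behind this (an idealized toy model, not a simulation of any real core): if a single thread is stalled on memory some fraction of the time, a second SMT thread can fill those idle issue slots, and the uplift reaches 100% once each thread is stalled at least half the time.

```python
# Toy model: 2-way SMT throughput gain as a function of the fraction of
# time a single thread spends stalled on memory. Assumes the two threads'
# stalls overlap perfectly and execution resources never conflict --
# an idealized upper bound, not a microarchitectural simulation.

def smt_uplift(stall_fraction: float) -> float:
    """Fractional throughput gain from adding a second thread.

    One thread does useful work (1 - stall_fraction) of the time.
    Two ideal threads fill each other's stall slots, but combined
    useful work is capped at 100% of the core's issue capacity.
    """
    busy = 1.0 - stall_fraction
    combined = min(2.0 * busy, 1.0)  # two threads, one core's worth of slots
    return combined / busy - 1.0

for stall in (0.0, 0.2, 0.5, 0.7):
    print(f"stall={stall:.0%} -> SMT uplift={smt_uplift(stall):+.0%}")
# stall=0%  -> +0%
# stall=20% -> +25%
# stall=50% -> +100%
# stall=70% -> +100%
```

The model shows both halves of the argument: a fast, efficient memory system (low stall fraction) leaves little for SMT to reclaim, while a heavily latency-bound workload scales close to 100%.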
 

Abwx

Lifer
Apr 2, 2011
You've got it backwards. 2-way SMT scaling is exactly 100% when the workload is 100% memory-access-latency bound. So obviously SMT scaling is at its lowest when the memory system is fast and efficient. Intel has pretty good SMT scaling on the server side, where memory accesses are pretty slow.
Cinebench R11.5 through CB R23 doesn't rely on memory access, yet SMT scaling is 30%, which says that SMT scales when single-thread IPC is low and there are a lot of resources left unused.
 

MS_AT

Senior member
Jul 15, 2024
The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.
Here we go again. I will quote Chester from Chips&Cheese:
My view is tunnel visioning on the decoders misses the elephant in the room. Backend memory access latency and frontend latency are holding back perf. You can find frontend bandwidth bound slots but there aren’t a lot of them. If the frontend was struggling to feed a 4-wide decoder due to BTB/iTLB/L1i miss latency, it’s not clear how much benefit you’d get from adding more decode slots that you also can’t feed. Also the uop cache covers most of the instruction stream
In other words, they have more pressing issues to fix than clustered decode, because the uop cache is doing pretty well for now.
 

yuri69

Senior member
Jul 16, 2013
That, or Zen 5 hasn't reached all its performance targets, and Zen 6's slated uptick in performance is rated against non-bugged Zen 5.
Zen 6 10+% IPC comes from the same slide stating Zen 5 10-15+%. Translated from marketing speak: Zen 6 10-14%, Zen 5 10-19%.

The current shape of Zen 5 matches this range, so the 10-14% uptick for Zen 6 is measured against the current shape of Zen 5.

The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.
Larger decode width brings better decode bandwidth. But according to various profiling results that's not the issue for Zen 5. Zen 5 is frontend *latency*-bound. This means throwing wider decode or more bandwidth from the uOP cache won't change the overall latency.

Sure it might help some workloads, but don't expect miraculous "fixed Zen" outcome.
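The latency-bound point above can be sketched with a toy model (all numbers invented; nothing here is a Zen 5 measurement): if the frontend sits idle for a fixed number of cycles after every fetch redirect (branch mispredict, BTB miss), achievable fetch IPC is capped by that latency, and doubling decode width helps only modestly.

```python
# Toy model of why wider decode gives limited gains on a frontend-
# *latency*-bound core: after every fetch redirect the frontend is idle
# for `redirect_latency` cycles regardless of decode width. The numbers
# below are purely illustrative.

def fetch_ipc(decode_width: int, redirect_latency: int,
              instrs_per_redirect: int) -> float:
    """Average sustainable frontend instructions per cycle."""
    decode_cycles = instrs_per_redirect / decode_width
    return instrs_per_redirect / (decode_cycles + redirect_latency)

for width in (4, 8):
    print(width, round(fetch_ipc(width, redirect_latency=10,
                                 instrs_per_redirect=40), 2))
```

With 40 instructions between redirects and a 10-cycle redirect penalty, doubling decode width from 4 to 8 only raises fetch IPC from 2.0 to about 2.67: the fixed latency term dominates, which is the "might help some workloads, but no miracle" point.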
 
Reactions: OneEng2

OneEng2

Senior member
Sep 19, 2022
Cinebench R11.5 through CB R23 doesn't rely on memory access, yet SMT scaling is 30%, which says that SMT scales when single-thread IPC is low and there are a lot of resources left unused.
Which makes sense.

No matter how you slice it and analyze it, Zen 5 ends up with faster single thread performance than Arrow Lake (obviously SMT isn't helping it here, and could conceivably be hurting it), and higher MT performance than Arrow Lake (with SMT helping quite a lot).

I suspect that every CPU design will be limited in some use cases by an imbalance of resources. It can't be 100% balanced for every kind of use case. Zen 5 could have design limitations that Zen 6 will fix (almost certainly), but I suspect that Zen 6 will have more going on than just a fixed decoder path. It has already been confirmed that we can expect 16-core CCDs on N3P. That alone is going to raise performance in both ST and MT (ST coming from higher clocks on a new process in this statement ... understanding that the design changes will likely impact ST much more than any clock increase).

Zen 5 appears to rip through data center workloads, so it would appear that it is well designed to scale to very high core and CCD counts. It also does this at pretty energy-efficient usage numbers compared to the competition (Turin D is even stomping ARM, as I understand it).
 

MS_AT

Senior member
Jul 15, 2024
No matter how you slice it and analyze it, Zen 5 ends up with faster single thread performance than Arrow Lake (obviously SMT isn't helping it here, and could conceivably be hurting it), and higher MT performance than Arrow Lake (with SMT helping quite a lot).
Nah, I don't think we have sufficient data to say Zen5 is universally faster in ST workloads or that it does universally better at MT.
It has already been confirmed we can expect 16 core CCD's on N3P
Where? And was it made clear this will be a consumer-facing die? I mean, we already have 16-core CCDs in Turin D.
understanding that the design changes will likely impact ST much more than any clock increase
Sometimes design changes are done to enable clock increases.
Turin D is even stomping ARM as I understand it
It's doing better than Ampere, which is using custom ARM cores.
 
Reactions: Elfear

blackangus

Member
Aug 5, 2022
Where? And was it made clear this will be consumer facing die? I mean we already have 16 core CCDs in Turin D.
At some point in the past there was a preso floating around that showed both a 16- and a 32-core CCD for Zen 6/Zen 6c.
Don't ask me to find it, I'm sure I can't! And who knows if it's still relevant?
Plans change, as always.
Not holding my breath, but it would be nice.
 

Josh128

Senior member
Oct 14, 2022
Recommendations for a 9900X build? Is B650 enough, or 670/870 as far as VRMs and power delivery? I've seen enough to know that Arrow Lake is not worth it, with AM5 having the potential for future 9000 series X3D and possibly even Zen 6. What's the consensus on the best bang-for-the-buck pair of 16GB sticks for 1:1 IF speeds and CAS latencies?

 

LightningZ71

Golden Member
Mar 10, 2017
Typically, published errata are only concerned with stuff like "Don't use this combination of instructions back to back or a meteorite will land on your user!" and not "we borked an implementation at the hardware level and we're choosing to tell you, even though we've made the necessary operational changes to make sure that it will never cause you a program error condition." Sometimes they will reveal issues like that through unofficial channels, but it's just not typically published in manuals and guides because it isn't relevant.
 

naukkis

Senior member
Jun 5, 2002
Cinebench R11.5 through CB R23 doesn't rely on memory access, yet SMT scaling is 30%, which says that SMT scales when single-thread IPC is low and there are a lot of resources left unused.

That was my point - 2-way SMT uplift is exactly 100% when the workload is 100% memory-access-latency bound. This is a purely mathematical fact of two independent threads scaling when they don't hamper each other's execution at all. Compute-intensive FP workloads scale pretty well with SMT because instruction latencies are many clock cycles, but nowhere near 100% scaling. And with those kinds of FP workloads you see similar SMT gains on Intel hardware as well.
 

Rheingold

Member
Aug 17, 2022
Intel never got more than 10% for some reason
What do you mean by "never"? Software that's well suited to SMT yields much more than that on Intel systems. I couldn't find newer benchmarks, but an 8700K had several uplifts of 30% and more here. The SMT uplift on Ryzen/Epyc is higher, but Intel's was not just 10%. Or is this just about Xeons, and can you provide some data for this?
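For clarity on how the uplift percentages thrown around in this thread are derived (scores below are made up, not measured results): run the same MT benchmark with SMT on and off at a fixed core count and compare.

```python
# Hypothetical numbers only: how the SMT uplift figures quoted in this
# thread are computed from a pair of benchmark runs.

def smt_uplift_pct(score_smt_on: float, score_smt_off: float) -> float:
    """Percent gain in MT score from enabling SMT at the same core count."""
    return (score_smt_on / score_smt_off - 1.0) * 100.0

# e.g. an invented Cinebench-style pair of scores:
print(round(smt_uplift_pct(26000.0, 20000.0), 1))  # -> 30.0
```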
 