Discussion: Zen 5 Architecture & Technical discussion


MS_AT

Member
Jul 15, 2024
192
438
96
Could it be simply that it was not worth it and that is why it was disabled? Quoting clamchowder from C&C
My view is tunnel visioning on the decoders misses the elephant in the room. Backend memory access latency and frontend latency are holding back perf. You can find frontend bandwidth bound slots but there aren’t a lot of them. If the frontend was struggling to feed a 4-wide decoder due to BTB/iTLB/L1i miss latency, it’s not clear how much benefit you’d get from adding more decode slots that you also can’t feed. Also the uop cache covers most of the instruction stream even with kernel compilation.
I mean, if the 2 decoders on 1 HW thread would require extra validation and they were seeing diminishing returns because the other parts of the system were not keeping up with demand, maybe they simply gave up.
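(Side note on the "frontend bandwidth bound slots" wording above: in top-down accounting, a dispatch slot counts as frontend bound when the backend could have accepted an op but the frontend delivered nothing. A minimal sketch of the arithmetic with made-up counter values, assuming 8 dispatch slots per cycle on Zen 5:)

```c
#include <stdio.h>

int main(void)
{
    /* Made-up numbers, purely to illustrate the slot accounting. */
    const double dispatch_width = 8.0;   /* assumed Zen 5 dispatch slots per cycle */
    const double cycles         = 1.0e9; /* sampled core cycles */
    const double empty_fe_slots = 0.9e9; /* slots with nothing delivered by the frontend
                                            while the backend was not the one stalling */

    double total_slots    = dispatch_width * cycles;
    double frontend_bound = empty_fe_slots / total_slots;

    printf("frontend-bound: %.1f%% of dispatch slots\n", 100.0 * frontend_bound);
    return 0;
}
```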

Also, a question: which stage is doing the decoding now? Funnily, it seems not to be the decoders themselves, as both the decoders and the uop cache are sending instructions down to rename according to the diagrams.

Are the decoders only doing part of the job? IIRC macro-ops should be the "decoded" instructions that are later turned into uops before execution. Of course I might have misunderstood something. But in the software optimization guide they make a clear wording distinction: the OpCache holds "instructions", vs macro-ops in Zen 4. Therefore, if my understanding is correct, what the "decoders" are actually doing is identifying the instruction boundaries [the hardest part of the job, I guess].
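To make the "identifying instruction boundaries" part concrete, here is a deliberately oversimplified sketch in C. The opcodes and lengths are a tiny made-up subset; real x86 length decoding also has to handle prefixes, ModRM/SIB, displacements and immediates, which is exactly why finding where the next instruction starts is the hard, serial part of the job:

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Toy length decoder: only a handful of opcodes, no prefixes/ModRM/SIB. */
static size_t toy_insn_length(const uint8_t *b)
{
    uint8_t op = b[0];
    if (op == 0x90)               return 1;  /* nop */
    if (op >= 0x50 && op <= 0x57) return 1;  /* push r64 */
    if (op >= 0xB8 && op <= 0xBF) return 5;  /* mov r32, imm32 */
    if (op == 0xE8)               return 5;  /* call rel32 */
    return 1;                                /* pretend everything else is 1 byte */
}

int main(void)
{
    const uint8_t code[] = { 0x90, 0xB8, 1, 0, 0, 0, 0x50, 0xE8, 0, 0, 0, 0, 0x90 };

    /* Boundaries are inherently sequential: only after the current instruction's
     * length is known do we know where the next one starts. */
    for (size_t off = 0; off < sizeof code; ) {
        size_t len = toy_insn_length(&code[off]);
        printf("instruction at offset %2zu, %zu byte(s)\n", off, len);
        off += len;
    }
    return 0;
}
```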
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Could it be simply that it was not worth it and that is why it was disabled? Quoting clamchowder from C&C

I mean, if the 2 decoders on 1 HW thread would require extra validation and they were seeing diminishing returns because the other parts of the system were not keeping up with demand, maybe they simply gave up.

Also, a question: which stage is doing the decoding now? Funnily, it seems not to be the decoders themselves, as both the decoders and the uop cache are sending instructions down to rename according to the diagrams.
Are the decoders only doing part of the job? IIRC macro-ops should be the "decoded" instructions that are later turned into uops before execution. Of course I might have misunderstood something. But in the software optimization guide they make a clear wording distinction: the OpCache holds "instructions", vs macro-ops in Zen 4. Therefore, if my understanding is correct, what the "decoders" are actually doing is identifying the instruction boundaries [the hardest part of the job, I guess].
Perhaps, but this is the last mile we are talking about. It does seem to be a bit better than Zen 4 at keeping the backend fed.

This is an architecture thread after all, so answers would be nice; hopefully tomorrow?

AMD Next Generation “Zen 5” Core - Brad Cohen & Mahesh Subramony, AMD
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
in the software optimization guide they make a clear wording distinction: the OpCache holds "instructions", vs macro-ops in Zen 4.
CnC for one took the liberty to stick with the term "Micro-Ops" for the output from the decoders = input to "micro-"opqueue and "micro-"opcache... in the articles which they published in August:
https://chipsandcheese.com/2024/08/10/amds-strix-point-zen-5-hits-mobile/
https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5-on-desktop/
https://chipsandcheese.com/2024/08/20/zen-5-variants-and-more-clock-for-clock/
Whereas in July, they used the prefix "macro-" in the same places:
https://chipsandcheese.com/2024/07/...t-how-30-year-old-idea-allows-for-new-tricks/
I am guessing it is purely in the eye of the beholder how micro or how macro these ops are.
 

MS_AT

Member
Jul 15, 2024
192
438
96
CnC for one took the liberty to stick with the term "Micro-Ops" for the output from the decoders = input to "micro-"opqueue and "micro-"opcache... in the articles which they wrote in August:
https://chipsandcheese.com/2024/08/10/amds-strix-point-zen-5-hits-mobile/
https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5-on-desktop/
https://chipsandcheese.com/2024/08/20/zen-5-variants-and-more-clock-for-clock/
Whereas in July, they used the prefix "macro-" in the same places:
https://chipsandcheese.com/2024/07/...t-how-30-year-old-idea-allows-for-new-tricks/
I am guessing it is purely in the eye of the beholder how micro or how macro these ops are.
I took my pic from the software optimization guide for Zen 5. I have seen that C&C were using Zen 4-consistent labeling for their analysis, but I don't remember if they commented on this in the text itself. David Huang also noted the difference in what is stored in the op cache, IIRC. So at least something changed, and hopefully Hot Chips will provide some more definitive answers.
 

Nothingness

Diamond Member
Jul 3, 2013
3,023
1,952
136
CnC for one took the liberty to stick with the term "Micro-Ops" for the output from the decoders = input to "micro-"opqueue and "micro-"opcache... in the articles which they published in August:
https://chipsandcheese.com/2024/08/10/amds-strix-point-zen-5-hits-mobile/
https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5-on-desktop/
https://chipsandcheese.com/2024/08/20/zen-5-variants-and-more-clock-for-clock/
Whereas in July, they used the prefix "macro-" in the same places:
https://chipsandcheese.com/2024/07/...t-how-30-year-old-idea-allows-for-new-tricks/
I am guessing it is purely in the eye of the beholder how micro or how macro these ops are.
IIRC AMD stated in the release slides that one of the changes was going from a uop cache of 6.75k macro-ops on Zen 4 to 6k micro-ops with fusion on Zen 5. It might have been a marketing error. Or it could be a deeper change.
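To make the capacity comparison concrete, here's a toy sketch of why "6k micro-ops with fusion" isn't directly comparable to "6.75k macro-ops": if, say, a cmp/test+branch pair can fuse into a single cached entry, the same instruction stream occupies fewer slots. The stream and the fusion rule below are just an illustrative assumption, not the actual Zen 5 policy:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* A made-up instruction stream; only cmp/test followed by a branch fuses here. */
    const char *stream[] = { "add", "cmp", "jne", "mov", "test", "jz", "sub" };
    const int n = (int)(sizeof stream / sizeof stream[0]);

    int unfused = n;   /* one cache entry per instruction */
    int fused   = 0;
    for (int i = 0; i < n; i++) {
        int flag_setter = !strcmp(stream[i], "cmp") || !strcmp(stream[i], "test");
        int branch_next = (i + 1 < n) &&
                          (!strcmp(stream[i + 1], "jne") || !strcmp(stream[i + 1], "jz"));
        fused++;                               /* this entry always costs one slot... */
        if (flag_setter && branch_next) i++;   /* ...and absorbs the branch when fused */
    }
    printf("entries without fusion: %d, with cmp/test+branch fusion: %d\n",
           unfused, fused);
    return 0;
}
```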

I hope this will get clarified at HotChips.
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
But that's not really the case. Zen 5 paid no attention to balance since only 512bit AVX-512 got doubled width (in the ideal case up to doubling compute performance), and scalar integer got expanded (in the ideal case offering up to 35% more compute performance). Everything else is essentially untouched, while the improvement in scalar integer seems to be hard to make use of, and that's the imbalance.


As @LightningZ71 already noted, that's actually happening with Zen 5: clearly client-oriented dies don't get the server Zen 5 512-bit AVX-512 implementation but keep the double-pumped 256-bit AVX-512 implementation from Zen 4. We can expect that separation to increase, perhaps similar to what happened with the RDNA and CDNA split for GPUs.
You are focused on the wrong thing.

SIMD (AVX-512) can only ever be widened by 2x; nothing else is possible. That's why you get widening of the units only once per several generations.

Whereas the numbers of ALUs or ROB entries have much more freedom in the increments that can be made, so it is natural to do it in smaller steps spread over generations.

Architecture-wise, it is probably much harder to add 50% more ALUs or other such resources, because the control aspect gets very hard to implement and verify. Meanwhile, SIMD widening costs transistors but is easy, because the control part is not much harder (and that's the point of SIMD).
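A minimal sketch of that point in code (assumes a compiler with AVX2/AVX-512F support, e.g. -mavx512f; remainder handling omitted): doubling the vector width leaves the per-iteration control flow identical, only the register width changes, whereas keeping 50% more ALUs busy means the frontend and scheduler have to find that many more independent scalar ops every cycle.

```c
#include <immintrin.h>
#include <stddef.h>

/* Same dataflow at two widths: one load, one load, one add, one store per iteration. */
void add_arrays_256(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {          /* 8 floats per 256-bit op */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}

void add_arrays_512(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 16 <= n; i += 16) {        /* 16 floats per 512-bit op */
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
}
```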

So really, pointing at the AVX-512 thing as a smoking gun that shows the criminal is guilty of neglecting other parts of the architecture is missing the point (by a huge margin, IMHO).
 

yuri69

Senior member
Jul 16, 2013
530
945
136
You are focused on the wrong thing.
Well, the width expansion doesn't come in isolation, right?

Along with the widening the team reworked/invested in:
* blowing up the NSQ size
* blowing up the vRF size
* doubling FPU-IEU datapath
* widening L1D/L2 datapaths
* redesign of FP pipes composition
* yet another redesign of the FP scheduler
 

moinmoin

Diamond Member
Jun 1, 2017
5,063
8,025
136
So really, pointing at the AVX-512 thing as a smoking gun that shows the criminal is guilty of neglecting other parts of the architecture is missing the point (by a huge margin, IMHO).
I was talking about the performance essentially being unchanged aside from that. In a follow-up post I did state that the other way of looking at this is that AMD redid the whole design and still managed to achieve balanced performance similar to the previous gen.
Another way to look at it is that despite all the changes in the frontend, the performance didn't degrade and is still "balanced" (i.e. close enough to the previous gen), except for the positive 512-bit and INT outliers that appear even without the memory bandwidth bottlenecks having been alleviated yet. Zen 5's design looks like one requiring substantial uncore changes, but with none of them applied before Zen 6.
On top of that, @Josh128 picked out the quotes from Mike Clark suggesting that the design seemed to have been done for 3nm first, with the 4nm version being either cut down or having node-related shortcomings.
Mike Clark hinted at this in a recent interview about Zen 5. Now, it would be wise to take MC's comments with a grain of salt, because he previously gushed so much about Zen 5, and the reality, so far, has been far less impressive than he led us to believe, IMO.




He seems to indicate in the bolded part that 3nm Zen 5 was the primary architecture design target -- the fact that he admitted certain architected features look great in 3nm but not so great in 4nm is very telling and basically confirms this. He also says here (and in another interview I can't seem to find) that the Zen 6 team might come off looking great, but what they achieve is only possible due to the groundwork laid out by the Zen 5 design.
Let's see how much of that is copium. I just wish AMD would finally stick to a <=18-month release cadence.
 

MS_AT

Member
Jul 15, 2024
192
438
96
Well, the width expansion doesn't come in isolation, right?

Along with the widening the team reworked/invested in:
* blowing up the NSQ size
* blowing up the vRF size
* doubling FPU-IEU datapath
* widening L1D/L2 datapaths
* redesign of FP pipes composition
* yet another redesign of the FP scheduler
But those things were bottlenecks before, according to the Zen 4 C&C analysis, so all those changes would have been beneficial even if the unit had remained double-pumped.
 
Reactions: lightmanek

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
Well, the width expansion doesn't come in isolation, right?

Along with the widening the team reworked/invested in:
* blowing up the NSQ size
* blowing up the vRF size
* doubling FPU-IEU datapath
* widening L1D/L2 datapaths
* redesign of FP pipes composition
* yet another redesign of the FP scheduler
Nice (half of that is just part of the unit widening pretty much), but what do you call +50 % ALUs, +33 % AGU, whatever % increase to RoB, BTBs, +50% L1D cache capacity (for the first time since Zen 1)? There's redone branch prediction, redone decode (I know, I know, not popular) and probably more significant stuff I can't recall ATM - because summarizing just the officially presented changes in an article was a lot of writing (and the AVX-512 chapter was relatively small portion of the text).

Look at Zen 2, that core also widened SIMD units 2x while it did way less than Zen 5 in all the other parts.
 

yuri69

Senior member
Jul 16, 2013
530
945
136
Nice (half of that is just part of the unit widening pretty much), but what do you call +50 % ALUs, +33 % AGU, whatever % increase to RoB, BTBs, +50% L1D cache capacity (for the first time since Zen 1)? There's redone branch prediction, redone decode (I know, I know, not popular) and probably more significant stuff I can't recall ATM - because summarizing just the officially presented changes in an article was a lot of writing (and the AVX-512 chapter was relatively small portion of the text).

Look at Zen 2, that core also widened SIMD units 2x while it did way less than Zen 5 in all the other parts.
One might say they did too much considering the end result.
 

DavidC1

Senior member
Dec 29, 2023
776
1,236
96
Architecture-wise, it is probably much harder to add 50% more ALUs or other such resources, because the control aspect gets very hard to implement and verify.
Chief architect of Skymont said the extra 4 ALUs were done because it was "cheap" to add, but in both SKT and Zen 5's case the extra ALUs are not capable of executing all instructions.
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
Chief architect of Skymont said the extra 4 ALUs were done because it was "cheap" to add,
Cheap maybe in die area. But they need to be connected to the register files, they need their ports, scheduler(s) need to handle more ports and decisions are more complex...

but in both SKT and Zen 5's case the extra ALUs are not capable of executing all instructions.
The extra ALUs are not simple ALUs in Zen 5, they may not handle every single instruction but they are complex ALUs (they handle MUL...)
 
Reactions: lightmanek

MS_AT

Member
Jul 15, 2024
192
438
96
Cheap maybe in die area. But they need to be connected to the register files, they need their ports, scheduler(s) need to handle more ports and decisions are more complex...
I guess this is why AMD went with full 512b registers. It was simply a cheaper thing to do in terms of silicon area than adding more, smaller execution units. After all, the vRF still has 10 read ports, the same number Zen 4 had.
 
Reactions: lightmanek

Nothingness

Diamond Member
Jul 3, 2013
3,023
1,952
136
I guess this is why AMD went with full 512b registers. It was simply a cheaper thing to do in terms of silicon area than adding more, smaller execution units. After all, the vRF still has 10 read ports, the same number Zen 4 had.
It's still not easy to place and route 512-bit-wide paths. It's very likely the difference in backend work between Zen 5 and Zen 5c is significant, resulting in different layouts.

Do we already have pictures of both versions of Zen5?
 
Reactions: lightmanek

Saylick

Diamond Member
Sep 10, 2012
3,504
7,763
136
This is a fake, these are not HC slides. These people are shameless.
Even if they were legit, WTFTech at it again by writing a whole article where the content is straight up lifted word-for-word from the slides... I mean, at that point just give me the dang slide deck and I can read it myself.

No one in this industry should ever give these guys a press pass because they don't understand jack. Sadly, the people that do give them a press pass use them as a mouthpiece to disseminate any talking point you want to spread, and they are glad to do it as long as they get to write articles and generate clicks from it.
 
Jul 27, 2020
19,613
13,472
146

Maybe WTFtech was not to blame? I don't know. Even the above site is saying that most of the slides are old.

Anyone spot anything new?
 

poke01

Golden Member
Mar 8, 2022
1,985
2,518
106
Even if they were legit, WTFTech at it again by writing a whole article where the content is straight up lifted word-for-word from the slides... I mean, at that point just give me the dang slide deck and I can read it myself.

No one in this industry should ever give these guys a press pass because they don't understand jack. Sadly, the people that do give them a press pass use them as a mouthpiece to disseminate any talking point you want to spread, and they are glad to do it as long as they get to write articles and generate clicks from it.
Ehh, not one to defend them but others did the same. Only good architecture deep-dives come from Chipsandcheese and Geekerwan..
 

Saylick

Diamond Member
Sep 10, 2012
3,504
7,763
136
Ehh, not one to defend them but others did the same. Only good architecture deep-dives come from Chipsandcheese and Geekerwan..
What annoys me isn't that WTFTech didn't go in-depth (because I know they lack the skills to do so), but rather that they waste people's time by inserting big blocks of text in between the slides which do nothing but copy verbatim what's on the slide, which is clearly an effort to pad the article so that it's long enough for the various ads they have. At least STH's article keeps the text light, because they know that if there's nothing worthwhile to add, you simply don't bother.
 

PJVol

Senior member
May 25, 2020
696
618
136
The glue holding the two CCDs and the IOD together?
Ahh... yes, indeed ) I confused Die and CCD, i.e. I read that as CCD-to-CCD infinity fabric (need one more coffee), wondering if they finally implemented direct interconnect...

One more question, if you don't mind:
does the Zen 5c CCD still have two CCXs?

I mean, if the layout is the same as in Zen 4c, that would explain why Strix Point has two separate CCXs.
 