Discussion Zen 5 Architecture & Technical discussion


naukkis

Senior member
Jun 5, 2002
965
832
136
This is the direct quote from the Software Optimization Guide for the AMD Zen5 Microarchitecture published in August 2024, revision 1.00:

So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

That ain't a split between simple and complex instructions - complex instructions can be quite short too. There's really no point in making the decode fetch matrix wider to allow decoding those overly long instructions simultaneously - there isn't the fetch bandwidth or mop-extraction bandwidth to support those kinds of instruction combinations anyway.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,788
1,277
136
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.
Decode pipe0 = D0,D1,D2,D3
D0 = >10-byte, Vector-path
Decode pipe1 = D4,D5,D6,D7
D4 = >10-byte, Vector-path

Fastpath, Double Fastpath = All decode slots.
Vectorpath = Only the first slot of each.
This behavior has actually been present on all AMD processors ever since the terminology was created. Even Steamroller/Excavator behaves this way.

"The outputs of the early decoders keep all (DirectPath or VectorPath) instructions in program order. Early decoding produces three macro-ops per cycle from either path. The outputs of both decoders are multiplexed together and passed to the next stage in the pipeline, the instruction control unit. Decoding a VectorPath instruction may prevent the simultaneous decoding of a DirectPath instruction." In effect, D0/D4 each act like a large VectorPath/microcode decoder.
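To make the slot restriction concrete, here's a toy sketch of one 4-slot decode pipe where only the first slot takes VectorPath instructions. This is my own illustration of the behavior described above, not AMD's actual allocation logic; the instruction names are placeholders.

```python
def assign_to_pipe(instrs, width=4):
    """instrs: list of (name, is_vectorpath) in program order.
    Returns the slots filled this cycle. A VectorPath instruction can
    only occupy the first slot and ends the decode group, mirroring
    'Decoding a VectorPath instruction may prevent the simultaneous
    decoding of a DirectPath instruction.'"""
    slots = []
    for name, is_vec in instrs:
        if len(slots) == width:
            break
        if is_vec:
            if slots:           # VectorPath must go in slot 0:
                break           # defer it to the next cycle
            slots.append(name)  # decodes alone this cycle
            break
        slots.append(name)      # Fastpath fills any slot
    return slots

# Fastpath ops fill all four slots:
print(assign_to_pipe([("add", False), ("mov", False), ("sub", False), ("xor", False)]))
# A VectorPath op mid-group cuts the cycle short:
print(assign_to_pipe([("add", False), ("rep movsb", True), ("sub", False)]))
```

Running it shows the group ending early as soon as a VectorPath op is hit in any slot but the first.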
 
Last edited:

Cardyak

Member
Sep 12, 2018
75
164
106
His diagram contains errors (there is only one complex decoder in the cluster if we are to trust software optimization guide) so I would not rule out other mistakes.
Yeah, I was guessing here regarding the 2 micro-op queues in Zen 5

If you find out the true answer please let me know and I’ll update the diagram
 

naukkis

Senior member
Jun 5, 2002
965
832
136
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

x86 considers an instruction "simple" when it outputs just one micro-op. Complex instructions output at least 2 micro-ops. All AMD decoders, sans K5's, are "complex" - able to decode instructions which output more than just one mop. Intel decoders are split into simple and complex - simple decoders can only decode one-to-one mapped instructions, meaning the hardware instruction is equal to the ISA instruction and does not split into multiple different instructions.
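A quick sketch of that classification. The µop counts below are illustrative approximations I picked for common cases, not figures from any official decode table; real counts vary by microarchitecture.

```python
# Approximate/illustrative micro-op counts, not from any vendor table.
UOP_COUNT = {
    "add eax, ebx":    1,  # one-to-one mapping -> "simple"
    "mov eax, [mem]":  1,  # a plain load is also one µop here
    "push eax":        2,  # store + stack-pointer update -> "complex"
    "xchg eax, [mem]": 2,  # load + store (locked) -> "complex"
}

def is_simple(instr):
    """Simple = exactly one µop, the only kind an Intel-style
    'simple' decoder can handle; an AMD decoder takes all of these."""
    return UOP_COUNT[instr] == 1

print(sorted(i for i in UOP_COUNT if is_simple(i)))
```

Note the split has nothing to do with instruction length, which is the confusion MS_AT mentions below.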
 
Reactions: lightmanek

MS_AT

Senior member
Jul 15, 2024
449
972
96
x86 considers an instruction "simple" when it outputs just one micro-op. Complex instructions output at least 2 micro-ops. All AMD decoders, sans K5's, are "complex" - able to decode instructions which output more than just one mop. Intel decoders are split into simple and complex - simple decoders can only decode one-to-one mapped instructions, meaning the hardware instruction is equal to the ISA instruction and does not split into multiple different instructions.
Thanks for the explanation. Obviously this was a lack of knowledge on my side, as I had simply assumed the split was based on instruction length and not on the number of micro-ops produced.

Therefore the decoders were correctly labelled on Cardyak's diagram. Sorry for the confusion.
 

StefanR5R

Elite Member
Dec 10, 2016
6,274
9,588
136
I was guessing here regarding the 2 micro-op queues in Zen 5
Looks plausible, given that the µop cache is dual-ported. (The split could be static or dynamic though...)

Gotta re-read CnC's analysis and the SOG whether something is said about
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
(b) if there is any word on the µop queue depth. If it is shallower in 1T mode than ideally possible, then it may be harder to make full use of the next stage (the ROB) in 1T mode, in which case Zen 5's deficit at "stitching the out-of-order instruction streams back in-order at the micro-op queue" might hinder 1T performance beyond just the decode bandwidth limit...?
 
Last edited:

MS_AT

Senior member
Jul 15, 2024
449
972
96
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
To quote: https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile
To further speed up instruction delivery, Zen 5 fills decoded micro-ops into a 6K entry, 16-way set associative micro-op cache. This micro-op cache can service two 6-wide fetches per cycle. Evidently both 6-wide fetch pipes can be used for a single thread.
This also matches what can be found in software optimization guide (chapter 2.9.1)
The Op Cache (OC) is a cache of previously decoded instructions. When instructions are being
fetched from the Op Cache, normal instruction fetch and decode are bypassed. This improves
pipeline latency because the Op Cache pipeline is shorter than the traditional fetch and decode
pipeline. It improves bandwidth because the maximum throughput from the Op Cache is 12
instructions per cycle, whereas the maximum throughput from the traditional fetch and decode
pipeline is 4 instructions per cycle per thread.
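Putting the SOG's numbers side by side makes the gap obvious. The clock speed below is a made-up example value just to express the rates in absolute terms; the 12 and 4 instructions/cycle figures come from the quote above.

```python
OC_INSTR_PER_CYCLE = 12      # op cache: two 6-wide fetch pipes per cycle
DECODE_INSTR_PER_CYCLE = 4   # traditional fetch/decode, per thread

ratio = OC_INSTR_PER_CYCLE / DECODE_INSTR_PER_CYCLE

freq_ghz = 5.0  # hypothetical clock, for illustration only
print(f"op cache:    {OC_INSTR_PER_CYCLE * freq_ghz:.0f} Ginstr/s")
print(f"decode path: {DECODE_INSTR_PER_CYCLE * freq_ghz:.0f} Ginstr/s")
print(f"op cache delivers {ratio:.0f}x the per-thread decode throughput")
```

So a single thread hitting in the op cache gets 3x the frontend instruction bandwidth of one going through the legacy decoders.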
 

Jan Olšan

Senior member
Jan 12, 2017
467
874
136
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.
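For anyone unfamiliar with the instruction, here's a pure-Python sketch of VP2INTERSECTD's semantics as I understand them (my own emulation, not generated output): for two integer vectors it produces a pair of lane masks marking which elements of each vector appear anywhere in the other.

```python
def vp2intersect(a, b):
    """Emulate VP2INTERSECTD on two equal-width integer lane lists.
    Returns (k1, k2): bit i of k1 is set if a[i] occurs in b,
    bit j of k2 is set if b[j] occurs in a."""
    k1 = sum(1 << i for i, x in enumerate(a) if x in b)
    k2 = sum(1 << j for j, y in enumerate(b) if y in a)
    return k1, k2

k1, k2 = vp2intersect([1, 2, 3, 4], [3, 5, 1, 7])
print(bin(k1), bin(k2))  # lanes 0 and 2 match on both sides
```

Doing this mask-pair computation in one instruction is what makes it attractive for set-intersection and sparse-data kernels.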

 

Kepler_L2

Senior member
Sep 6, 2020
679
2,742
136
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.

Too bad Zen6 deprecates it
 

yuri69

Senior member
Jul 16, 2013
601
1,056
136
Very impressive gains. I think 2025 will finally showcase the software reorg AMD has been working on for some years.
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
 

carrotmania

Member
Oct 3, 2020
96
245
106
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
What kind of logic is that? Does that mean every ray-traced game coming out this year is 5 years late? Software is done when it's done. And 3 months "late" is better than never at all, like AMD's previous form. 400% is worth the delay. I take it this will run even better on MI300...
 

branch_suggestion

Senior member
Aug 4, 2023
504
1,051
96
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
Progress is progress.
Better to release late than not at all, and better than an on-time release that is buggy and missing features.
 
Reactions: Tlh97

MS_AT

Senior member
Jul 15, 2024
449
972
96
Very impressive gains. I think 2025 will finally showcase the software reorg AMD has been working on for some years.
I find this comparison lacking in details. I mean, they give you enough to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up or actually improved things. Inference is heavily dependent on memory BW, and they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.
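The memory-BW point can be made with back-of-envelope arithmetic: during token generation an LLM typically streams all of its weights once per token, so bandwidth caps throughput regardless of compute. All numbers below are hypothetical examples, not figures from the ZenDNN release.

```python
def max_tokens_per_sec(params_billions, bytes_per_param, mem_bw_gb_s):
    """Rough upper bound on decode tokens/s, assuming the full set of
    weights is read from memory once per generated token."""
    model_gb = params_billions * bytes_per_param  # weight footprint in GB
    return mem_bw_gb_s / model_gb

# e.g. a 7B-parameter model in bf16 (2 bytes/param) on ~100 GB/s DRAM:
print(f"{max_tokens_per_sec(7, 2, 100):.1f} tokens/s upper bound")
```

Without the platform's actual bandwidth figure, a "400% faster" claim can't be placed against this ceiling or against competing frameworks.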
 
Reactions: Tlh97

branch_suggestion

Senior member
Aug 4, 2023
504
1,051
96
I find this comparison lacking in details. I mean, they give you enough to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up or actually improved things. Inference is heavily dependent on memory BW, and they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.
They compare it to IPEX 2.4.0 iso-hardware.
Phoronix will compare it to GNR, don't worry.
 