Discussion Zen 5 Architecture & Technical discussion

naukkis · Jan 26, 2025

MS_AT said:
This is the direct quote from the Software Optimization Guide for the AMD Zen5 Microarchitecture published in August 2024, revision 1.00:

So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot, of course I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put more "complex" decoders in, if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

That ain't split between simple and complex instructions - complex instructions can be quite short too. There's really no point of making decode fetch matrix wider for allowing decoding those overly long instructions simultaneously - there isn't fetch bandwidth or mop extraction bandwidth to support those kind of instruction combinations anyway.

NostaSeronx · Jan 26, 2025

MS_AT said:
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot, of course I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put more "complex" decoders in, if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

Decode pipe0 = D0,D1,D2,D3
D0 = >10-byte, Vector-path
Decode pipe1 = D4,D5,D6,D7
D4 = >10-byte, Vector-path

Fastpath, Double Fastpath = All decode slots.
Vectorpath = Only the first slot of each.
This behavior is actually on all AMD processors. Since, this terminology was created. Even Steamroller/Excavator has this behavior.

"The outputs of the early decoders keep all (DirectPath or VectorPath) instructions in program order. Early decoding produces three macro-ops per cycle from either path. The outputs of both decoders are multiplexed together and passed to the next stage in the pipeline, the instruction control unit. Decoding a VectorPath instruction may prevent the simultaneous decoding of a DirectPath instruction." Virtually D0/D4 each act like a large VectorPath/microcode decoder.

Cardyak · Jan 26, 2025

MS_AT said:
His diagram contains errors (there is only one complex decoder in the cluster if we are to trust software optimization guide) so I would not rule out other mistakes.

Yeah, I was guessing here regarding the 2 micro-op queues in Zen 5

If you find out the true answer please let me know and I’ll update the diagram

naukkis · Jan 26, 2025

MS_AT said:
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot, of course I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put more "complex" decoders in, if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

x86 considers instruction "simple" when it outputs just one micro-op. Complex instructions output at least 2 micro-ops. All AMD decoders sans K5 are "complex" - being able to decode instructions which output more than just one mop. Intel decoders are split to simple and complex - simple decoders can only decode one-to-one mapped instructions - meaning hardware instruction is equal to ISA instruction and does not split to multiple different instructions.

MS_AT · Jan 26, 2025

naukkis said:
x86 considers instruction "simple" when it outputs just one micro-op. Complex instructions output at least 2 micro-ops. All AMD decoders sans K5 are "complex" - being able to decode instructions which output more than just one mop. Intel decoders are split to simple and complex - simple decoders can only decode one-to-one mapped instructions - meaning hardware instruction is equal to ISA instruction and does not split to multiple different instructions.

Thanks for explanation. Obviously this was lack of knowledge on my side as I have simply assumed the split was done on the instruction length and not on the micro-ops produced.

Therefore the decoders were correctly labelled on Cardyak diagram. Sorry for the confusion.

StefanR5R · Jan 27, 2025

Cardyak said:
I was guessing here regarding the 2 micro-op queues in Zen 5

Looks plausible, given that the µop cache is dual-ported. (The split could be static or dynamic though...)

Gotta re-read CnC's analysis and the SOG whether something is said about
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
(b) if there is any word on the µop queue depth. If it is shallower in 1T mode than ideally possible, then it may be harder to make full use of the next stage (the ROB) in 1T mode, in which case Zen 5's deficit at "stitching the out-of-order instructions streams back in-order at the micro-op queue" might hinder 1T performance more than just WRT decoding bandwidth limit...?

MS_AT · Jan 27, 2025

StefanR5R said:
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,

To quote: https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile

To further speed up instruction delivery, Zen 5 fills decoded micro-ops into a 6K entry, 16-way set associative micro-op cache. This micro-op cache can service two 6-wide fetches per cycle. Evidently both 6-wide fetch pipes can be used for a single thread.

This also matches what can be found in software optimization guide (chapter 2.9.1)

The Op Cache (OC) is a cache of previously decoded instructions. When instructions are being
fetched from the Op Cache, normal instruction fetch and decode are bypassed. This improves
pipeline latency because the Op Cache pipeline is shorter than the traditional fetch and decode
pipeline. It improves bandwidth because the maximum throughput from the Op Cache is 12
instructions per cycle, whereas the maximum throughput from the traditional fetch and decode
pipeline is 4 instructions per cycle per thread.

Jan Olšan · Jan 27, 2025

Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo

Disclaimers before we start For those who don’t want to read/don’t care that much, here are the results. I hope after seeing them you are compelled to read. TL;DR: I wrote a super fast phrase search algorithm using AVX-512 and achieved wins up to 1600x the performance of Meilisearch. The source...

gab-menezes.github.io

Kepler_L2 · Jan 27, 2025

Jan Olšan said:
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo

Disclaimers before we start For those who don’t want to read/don’t care that much, here are the results. I hope after seeing them you are compelled to read. TL;DR: I wrote a super fast phrase search algorithm using AVX-512 and achieved wins up to 1600x the performance of Meilisearch. The source...

gab-menezes.github.io

Too bad Zen6 deprecates it

branch_suggestion · Jan 27, 2025

https://www.amd.com/en/developer/resources/technical-articles/zendnn-5-0-supercharge-ai-on-amd-epyc-server-cpus.html

Very impressive gains, 2025 I think will finally showcase the software reorg AMD has been working on for some years.

yuri69 · Jan 28, 2025

branch_suggestion said:
Very impressive gains, 2025 I think will finally showcase the software reorg AMD has been working on for some years.

It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is a 3 months late...

carrotmania · Jan 28, 2025

yuri69 said:
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is a 3 months late...

What kind of logic is that? Does that mean every ray traced game coming out this year is 5yrs late? Software is done when it's done. And 3mo "late" it's better than never at all, like AMDs previous form. 400% is worth the delay. I take it this will run even better on MI300...

StefanR5R · Jan 28, 2025

yuri69 said:
this particular release targets a product released in Oct 2024.

Wrong. Turin servers were not "released" in October 2024.

branch_suggestion · Jan 28, 2025

yuri69 said:
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is a 3 months late...

Progress is progress.
Better to release late than not at all. And better than an on time release that is buggy and missing features.

MS_AT · Jan 28, 2025

branch_suggestion said:
https://www.amd.com/en/developer/resources/technical-articles/zendnn-5-0-supercharge-ai-on-amd-epyc-server-cpus.html
Very impressive gains, 2025 I think will finally showcase the software reorg AMD has been working on for some years.

I find this comparison lacking in details. I mean they give you enough to be able to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up, or they actually improved things. I mean inference is heavily dependant on memory BW, they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.

branch_suggestion · Jan 28, 2025

MS_AT said:
I find this comparison lacking in details. I mean they give you enough to be able to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up, or they actually improved things. I mean inference is heavily dependant on memory BW, they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.

They compare it to IPEX 2.4.0 iso-hardware.
Phoronix will compare it to GNR, don't worry.

igor_kavinski · Mar 24, 2025

An Interview with Zen Chief Architect Mike Clark

Zen is one of the most important microarchitectures in the history of the x86 ecosystem.

www.computerenhance.com

Discussion Zen 5 Architecture & Technical discussion

naukkis

Golden Member

NostaSeronx

Diamond Member

Cardyak

Member

naukkis

Golden Member

MS_AT

Senior member

StefanR5R

Elite Member

MS_AT

Senior member

Jan Olšan

Senior member

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo

Kepler_L2

Senior member

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo

branch_suggestion

Senior member

yuri69

Senior member

carrotmania

Member

StefanR5R

Elite Member

branch_suggestion

Senior member

MS_AT

Senior member

branch_suggestion

Senior member

igor_kavinski

Lifer

An Interview with Zen Chief Architect Mike Clark

TRENDING THREADS