Discussion Zen 5 Architecture & Technical discussion

adroc_thurston · Sep 30, 2024

coercitiv said:
less efficiency than vanilla cores is weird

That's normal, SIR hates cachelet setups.

StefanR5R · Sep 30, 2024

yuri69 said:
Lion Cove/Skymont analysis at David Huang's Blog there are interesting comparisons to Zen 5.

coercitiv said:
The one thing that does not sit right with me is the efficiency of the cores in the dense cluster, less efficiency than vanilla cores is weird.
View attachment 108440

The graph does not show core efficiency. It is SIR rate-1 performance over package power. Operating system and maybe the test regime may have kept too much of the other components in the package from powering down deeply enough to make single core power usage a dominant part of package power consumption.

The other graph which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data) shows the classic core and dense core closer to each other WRT SIR r1 perf/W.

coercitiv · Sep 30, 2024

StefanR5R said:
The graph does not show core efficiency. It is SIR rate-1 performance over package power. Operating system and maybe the test regime may have kept too much of the other components in the package from powering down deeply enough to make single core power usage a dominant part of package power consumption.

The other graph which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data) shows the classic core and dense core closer to each other WRT SIR r1 perf/W.

I'm aware of both the issues, that's why I said that for now all we can do is to take it for what it's worth, essentially observe the strange result and raise an eyebrow.

Geddagod · Sep 30, 2024

coercitiv said:
TL;DR - cores dangerously close to each other, collisions expected.

The one thing that does not sit right with me is the efficiency of the cores in the dense cluster, less efficiency than vanilla cores is weird.
View attachment 108440

He noticed the same thing with Zen 4 vs Zen 4 Dense IIRC. Though that seemed to be much closer than the difference that is seen here.
I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?

Abwx said:
20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB 6 ST, so that s close only apparently, in the real world they are far apart, about two gens apart, and to think that we had people here expecting SKT to match or even beat Zen 4, i once said that it was at Zen 3 level at best.

Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...
Wait till Skymont appears in ARL.

coercitiv · Sep 30, 2024

Geddagod said:
I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?

The workload plays a role (affinity for cache) and the very limited method of measurement plays another (probably bigger) role. Nevertheless, I think we're more informed about the subject than we were before Huang published the data, as long as we don't attempt to draw definitive conclusions.

coercitiv · Sep 30, 2024

Geddagod said:
Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...

No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there as more of an bandwidth saving solution.

StefanR5R · Sep 30, 2024

coercitiv said:
I'm aware of both the issues, that's why I said that for now all we can do is to take it for what it's worth, essentially observe the strange result and raise an eyebrow.

Admittedly I didn't read your other post very carefully, as I was already preoccupied with typing my post... :-)

Phoenix 2 was later to market than Genoa + Bergamo. According to low resolution die shots (Phoenix 2/ Durango/ Vindhya), Zen 4-classic and Zen 4-dense core layouts in Phoenix 2 on 4nm are still looking the same as Genoa's and Bergamo's on 5nm. That is, all the optimization work which went into Bergamo relative to Genoa, in order to press 33% more fully-featured cores into the same socket power envelope, should reflect in Phoenix 2.

Strix Point on the other hand is much earlier ready to market than Turin-dense, and the latter is designed for a newer manufacturing node. (Plus, Strix Point's Zen 5 and Zen 5c incarnations are somewhat cut down from Turin's and Turin-dense's, though basically just in the FP department.) This is not the speculation thread, but I can't help myself wondering whether power efficiency tweaking of Strix Point's dense cores was somewhat cut short for sake of time to market, and due to being a 3nm/4nm co-design to some extent.

DavidC1 · Sep 30, 2024

Abwx said:
20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB 6 ST, so that s close only apparently, in the real world they are far apart, about two gens apart, and to think that we had people here expecting SKT to match or even beat Zen 4, i once said that it was at Zen 3 level at best.

Skymont in Lunarlake performs better 41% in both SpecInt and Geekbench Integer per clock(assuming 90% clock scaling) while in FP it performs 60% faster in GB5 and 70% faster in GB6 compared to Crestmont LP, which is almost exactly as Intel claims.

Now we're waiting for the ring version which Intel said will outperform Raptorlake by 2% in Spec.

Even though Lunarlake's version is not that performant, it scales extremely low in power and is very power efficient.

coercitiv said:
No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there as more of an bandwidth saving solution.

This would make sense as higher performing caches need a tradeoff as larger cell sizes and/or lower power efficiency due to higher leakage and outright higher power use.

igor_kavinski · Oct 5, 2024

I think a special game mode could be developed for the dual chiplet Ryzens, using a combination of Windows and BIOS trickery, where the lower binned CCD acts essentially as a zombie CCD for the preferred CCD. The zombie CCD's threads don't do any actual work. But they help to prefetch data for the preferred CCD threads and when needed, provide that data, acting as a dedicated virtual cache CCD, using its all available L2+L3 capacity for this purpose and reducing the instances of waiting on data from system RAM. How hard would this be?

StefanR5R · Oct 6, 2024

L3$ is CCX-private and L2$ is core-private. Prefetching into a certain core's L2$ does nothing for a different core but steal memory access bandwidth (and while doing so, increases memory access latency). I am not sure whether prefetching into one CCX's L3$ would help with latency of the other CCX's L3$ misses, but in any case all sorts of more aggressive prefetch policies run the risk of regressing due to memory access bandwidth and latency limitations.

CouncilorIrissa · Oct 6, 2024

GNR CCD floor plan annotated by Nemez.

https://twitter.com/x/status/1843054429773459889

lightmanek · Oct 6, 2024

Love the Magical SRAM of Mystery block 😅

Tuna-Fish · Oct 7, 2024

CouncilorIrissa said:
GNR CCD floor plan annotated by Nemez.

https://twitter.com/x/status/1843054429773459889

I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound with the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make maximum communication distances much shorter from registers to units.

naukkis · Oct 7, 2024

Tuna-Fish said:
I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound with the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make maximum communication distances much shorter from registers to units.

You can't slice AVX512 instructions efficiently. For being able to do fast 512 bit permutations register file needs to hold whole 512 bit register. So those are 256 bit quarters and those nearby are combined for executing 512 bit instructions( or there's just two double-pumped pipelines). And there are those two different register file clusters so when dependent instructions are scheduled to different clusters there's additional one cycle latency involved.

MS_AT · Oct 7, 2024

naukkis said:
And there are those two different register file clusters so when dependent instructions are scheduled to different clusters there's additional one cycle latency involved.

Any data to back this up? Optimization manual doesn't talk about it, it's not included in any latency tables that I know of. And the way you state it, it sounds different to the one additional latency cycle due to schedulers being full. [that applies only to instructions that would themselves have only 1 cycle of latency, anything with longer latency is not affected].

Bigos · Oct 7, 2024

Another explanation would be the register file being duplicated to both the top and the bottom parts. This is done usually in order to increase the number of read ports of a register file at the cost of area and power. Writes to the register file have to access both parts so the number of write ports does not increase in this scheme.

And with complex enough forwarding network, the latency of register file writes does not matter much as it can be delayed - subsequent operations can source their results from either the back end of execution units directly or from the forwarding network "buffers".

This might still mean that at the critical path an operation done on the bottom EU needs to be forwarded to the top EU immediately (or vice-versa), all within the same cycle. This might be easier to do with the EU as they are closer to each other than the register files (and the middle area is probably mostly the forwarding network itself).

Tuna-Fish · Oct 7, 2024

naukkis said:
You can't slice AVX512 instructions efficiently. For being able to do fast 512 bit permutations register file needs to hold whole 512 bit register. So those are 256 bit quarters and those nearby are combined for executing 512 bit instructions( or there's just two double-pumped pipelines). And there are those two different register file clusters so when dependent instructions are scheduled to different clusters there's additional one cycle latency involved.

That's if you try to make wide permutes very low latency. I find that AMD's optimization manual for Zen5(zip, contains pdf and spreadsheet with latencies) is very illuminating here.

Notably:

All types of VPERMW emit the same amount of Mops, and have same high throughput of 2 per cycle. However...
VPERMW x, x, x (that is, 128-bit registers) has a latency of 2 cycles.
VPERMW y, y, y (256-bit regs) has a latency of 4 cycles.
VPERMW z, z, z (512-bit regs) has a latency of 5 cycles.

This to me screams a horizontally split implementation. And it's also, at least to me, the technically better way to build it, because it means you only pay for lane-crossing when you actually intentionally cross lanes.

CouncilorIrissa · Nov 24, 2024

Pushing AMD’s Infinity Fabric to its Limits

I recently wrote code to test memory latency under load, seeking to reproduce data in various presentations with bandwidth on the X axis and latency on the Y axis.

chipsandcheese.com

igor_kavinski · Jan 24, 2025

Thunder 57 said:
C&C has an article up testing Zen 5 with the opcache disabled to see how well the 2x4 decoder works. Short answer, in single thread not great. With SMT enabled surprisingly good. It does seem like an obvious place for AMD to work some magic in Zen 6 though, assuming they still care about client performance. The results vary quite a bit so if you are interested check it out:

Disabling Zen 5’s Op Cache and Exploring its Clustered Decoder

Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters.

chipsandcheese.com

Very nicely written article. Love that he wrote his own quick and dirty performance counter monitoring program. That could be fun to use or experiment with. Seems the clustered decoders are there mainly to support and boost SMT which speaks strongly to Zen 5's server focused roots.

StefanR5R · Jan 25, 2025

From https://chipsandcheese.com/p/discussing-amds-zen-5-at-hot-chips-2024:

"Dual decode clusters came up in sideline discussions. The core only uses one of its decode clusters when running a single thread, regardless of whether the sibling thread is idle or SMT is turned off. From those sideline conversations, apparently the challenge was stitching the out-of-order instructions streams back in-order at the micro-op queue. The micro-op queue is in-order because it has to serve the renamer, and register renaming is an inherently serial process."

In Zen 5, the two 4-wide decode clusters and the op cache all feed into a single deep micro-op queue.

Compare with Crestmont and Skymont, which don't have SMT (and don't have op cache, and target lower f_max):
https://chipsandcheese.com/p/skymont-intels-e-cores-reach-for-the-sky

There each decode cluster (two 3-wide in Crestmont, three 3-wide in Skymont) feeds into its own private shallow micro-op queue. These 2…3 separate micro-op queues then feed the single rename/dispatch block (8-wide in ~~Crestmont and~~ Skymont). (Rename/dispatch is also 8-wide in Zen 5 and Lion Cove, 6-wide in Zen 4 , Redwood Cove, and Crestmont.)

Crestmont's predecessor, Gracemont, has got the same decoder/µopq/renamer config. Gracemont's predecessor was Tremont, and it was the very first Atom µarch with clustered decoder. I haven't found whether that one had two µopqs as well.

In Lion Cove and Redwood Cove (and maybe all their µop$ featuring predecessors), the decoder and the µop$ feed into a common deep micro-op queue:
https://chipsandcheese.com/p/lion-cove-intels-p-core-roars

So, Intel's team which designed Lion Cove and earlier cores, as well as AMD's team which designed Zen 5, went with a single micro-op queue. Intel's team which designed Crestmont and earlier cores went with one micro-op queue per decode cluster. Does the asymmetry of sourcing micro-ops from decoder(s) _and_ from a µop cache necessitate a single common micro-op queue? Or is it merely the aim for very high peak clock frequency which makes a single but deep micro-op queue the preferred choice?

edit: corrected Crestmont rename/dispatch width

StefanR5R · Jan 25, 2025

StefanR5R said:
In Zen 5, the two 4-wide decode clusters and the op cache all feed into a single deep micro-op queue.

Or do they? According to Cardyak, …

Saylick said:
Courtesy of Cardyak:
View attachment 104532

Cardyak's Microarchitecture Block Diagram Images -> AMD -> High Power -> 2024 - Zen 5.jpg
... there are *two* micro-op queues, not one. Each decoder cluster feeds into one of the two queues. And the micro-op cache is dual-ported and feeds into both of the two micro-op queues. These are then muxed into the reorder buffer.

MS_AT · Jan 25, 2025

StefanR5R said:
Or do they? According to Cardyak, …

Cardyak's Microarchitecture Block Diagram Images -> AMD -> High Power -> 2024 - Zen 5.jpg
... there are *two* micro-op queues, not one. Each decoder cluster feeds into one of the two queues. And the micro-op cache is dual-ported and feeds into both of the two micro-op queues. These are then muxed into the reorder buffer.

His diagram contains errors (there is only one complex decoder in the cluster if we are to trust software optimization guide) so I would not rule out other mistakes.

naukkis · Jan 26, 2025

MS_AT said:
His diagram contains errors (there is only one complex decoder in the cluster if we are to trust software optimization guide) so I would not rule out other mistakes.

AMD cpu's have always had unified features on all decoders unlike Intel. With complex decoding all decoders can't work simultaneously but AMD hardware has no split between simple/complex decoders.

MS_AT · Jan 26, 2025

naukkis said:
AMD cpu's have always had unified features on all decoders unlike Intel. With complex decoding all decoders can't work simultaneously but AMD hardware has no split between simple/complex decoders.

This is the direct quote from the Software Optimization Guide for the AMD Zen5 Microarchitecture published in August 2024, revision 1.00:

Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.

So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot, of course I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put more "complex" decoders in, if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

bakyt115 · Jan 26, 2025

is it true that zens loose predecoding(instructions boundaries mark) L1i ?

Discussion Zen 5 Architecture & Technical discussion

Diamond Member

Elite Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Elite Member

Golden Member

Lifer

Elite Member

Senior member

Attachments

Senior member

Golden Member

Golden Member

Senior member

Member

Golden Member

Senior member

Lifer

Elite Member

Elite Member

Senior member

Golden Member

Senior member

Member