Discussion: Zen 5 Architecture & Technical Discussion


StefanR5R

Elite Member
Dec 10, 2016
Lion Cove/Skymont analysis at David Huang's Blog: there are interesting comparisons to Zen 5.
The one thing that does not sit right with me is the efficiency of the cores in the dense cluster; less efficiency than the vanilla cores is weird.
The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too many of the other components in the package from powering down deeply enough for single-core power usage to become the dominant part of package power consumption.

The other graph, which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data), shows the classic core and the dense core closer to each other WRT SIR r1 perf/W.
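
(For anyone who wants to check this at home: on Linux, Zen exposes RAPL-style energy counters that let you separate per-core energy from package energy. A minimal sketch follows; the MSR addresses are the ones used by the amd_energy/zenpower drivers, so treat them as an assumption and verify against the PPR for your part.)

Code:
/* Minimal sketch (Linux, root, "msr" kernel module loaded): read the
 * RAPL-style energy MSRs on Zen so per-core energy can be separated from
 * whole-package energy. MSR addresses are the ones used by the
 * amd_energy/zenpower drivers; verify them against the PPR for your part. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_PWR_UNIT    0xC0010299  /* energy status unit (bits 12:8) */
#define MSR_CORE_ENERGY 0xC001029A  /* per-core accumulated energy    */
#define MSR_PKG_ENERGY  0xC001029B  /* package accumulated energy     */

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t v = 0;
    pread(fd, &v, sizeof v, reg);   /* offset selects the MSR number */
    return v;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* One energy count = 1 / 2^ESU joules. */
    double unit = 1.0 / (double)(1ULL << ((rdmsr(fd, MSR_PWR_UNIT) >> 8) & 0x1F));

    uint64_t c0 = rdmsr(fd, MSR_CORE_ENERGY), p0 = rdmsr(fd, MSR_PKG_ENERGY);
    sleep(1);
    uint64_t c1 = rdmsr(fd, MSR_CORE_ENERGY), p1 = rdmsr(fd, MSR_PKG_ENERGY);

    /* The counters are 32-bit; the cast handles wraparound. */
    printf("core 0: %.3f W   package: %.3f W\n",
           (uint32_t)(c1 - c0) * unit, (uint32_t)(p1 - p0) * unit);
    return 0;
}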
 

coercitiv

Diamond Member
Jan 24, 2014
The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too many of the other components in the package from powering down deeply enough for single-core power usage to become the dominant part of package power consumption.

The other graph, which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data), shows the classic core and the dense core closer to each other WRT SIR r1 perf/W.
I'm aware of both issues; that's why I said that for now all we can do is take it for what it's worth: observe the strange result and raise an eyebrow.
 

Geddagod

Golden Member
Dec 28, 2021
TL;DR - cores dangerously close to each other, collisions expected.

The one thing that does not sit right with me is the efficiency of the cores in the dense cluster; less efficiency than the vanilla cores is weird.
He noticed the same thing with Zen 4 vs Zen 4 Dense, IIRC, though the gap there seemed much smaller than the one seen here.
I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?
20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB6 ST, so that's only apparently close. In the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.
Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...
Wait till Skymont appears in ARL.
 

coercitiv

Diamond Member
Jan 24, 2014
I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?
The workload plays a role (affinity for cache) and the very limited method of measurement plays another (probably bigger) role. Nevertheless, I think we're more informed about the subject than we were before Huang published the data, as long as we don't attempt to draw definitive conclusions.
 

StefanR5R

Elite Member
Dec 10, 2016
I'm aware of both issues; that's why I said that for now all we can do is take it for what it's worth: observe the strange result and raise an eyebrow.
Admittedly I didn't read your other post very carefully, as I was already preoccupied with typing my post... :-)

Phoenix 2 was later to market than Genoa + Bergamo. According to low-resolution die shots (Phoenix 2/ Durango/ Vindhya), the Zen 4-classic and Zen 4-dense core layouts in Phoenix 2 on 4nm still look the same as Genoa's and Bergamo's on 5nm. That is, all the optimization work which went into Bergamo relative to Genoa, in order to fit 33% more fully featured cores into the same socket power envelope, should be reflected in Phoenix 2.

Strix Point, on the other hand, is ready for market much earlier than Turin-dense, and the latter is designed for a newer manufacturing node. (Plus, Strix Point's Zen 5 and Zen 5c incarnations are somewhat cut down from Turin's and Turin-dense's, though basically just in the FP department.) This is not the speculation thread, but I can't help wondering whether power-efficiency tweaking of Strix Point's dense cores was somewhat cut short for the sake of time to market, and due to being a 3nm/4nm co-design to some extent.
 

DavidC1

Golden Member
Dec 29, 2023
20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB6 ST, so that's only apparently close. In the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.
Skymont in Lunarlake performs 41% better per clock in both SPECint and Geekbench Integer (assuming 90% clock scaling), while in FP it performs 60% faster in GB5 and 70% faster in GB6 compared to Crestmont LP, which is almost exactly what Intel claims.
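
To make the normalization explicit: a per-clock comparison is the score ratio divided by the sustained-clock ratio. A minimal sketch with placeholder numbers (only the 90% clock-scaling assumption is from the comparison above; the scores and clocks below are purely illustrative):

Code:
#include <stdio.h>

/* Sketch of the per-clock normalization behind claims like "41% better
 * per clock". All scores and clocks are made-up placeholders; only the
 * formula and the 90% clock-scaling assumption are the point. */
int main(void)
{
    double score_new = 1269.0, nominal_clock_new = 3.8; /* hypothetical */
    double score_old = 1000.0, clock_old         = 3.8; /* hypothetical */

    /* "90% clock scaling": assume the new core only sustains 90% of its
     * nominal clock during the benchmark. */
    double sustained_new = nominal_clock_new * 0.90;

    double per_clock_gain =
        (score_new / sustained_new) / (score_old / clock_old) - 1.0;

    printf("per-clock uplift: %.1f%%\n", per_clock_gain * 100.0); /* ~41% */
    return 0;
}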

Now we're waiting for the ring version, which Intel said will outperform Raptorlake by 2% in SPEC.

Even though Lunarlake's version is not that performant, it scales down to extremely low power and is very power efficient.
No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there more as a bandwidth-saving solution.
This would make sense, as higher-performing caches involve tradeoffs such as larger cell sizes and/or lower power efficiency due to higher leakage and outright higher power use.
 
Jul 27, 2020
I think a special game mode could be developed for the dual-chiplet Ryzens, using a combination of Windows and BIOS trickery, where the lower-binned CCD essentially acts as a zombie CCD for the preferred CCD. The zombie CCD's threads don't do any actual work, but they help prefetch data for the preferred CCD's threads and provide that data when needed, acting as a dedicated virtual cache CCD, using all its available L2+L3 capacity for this purpose and reducing the instances of waiting on data from system RAM. How hard would this be?
 

StefanR5R

Elite Member
Dec 10, 2016
L3$ is CCX-private and L2$ is core-private. Prefetching into a certain core's L2$ does nothing for a different core but steal memory access bandwidth (and, while doing so, increase memory access latency). I am not sure whether prefetching into one CCX's L3$ would help with the latency of the other CCX's L3$ misses, but in any case, all sorts of more aggressive prefetch policies run the risk of regressing due to memory access bandwidth and latency limitations.
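
To illustrate why: software prefetch hints are executed by a core and fill that core's caches only, so a "helper" thread on the other CCD cannot pre-warm anything for the game's threads. A minimal sketch of what a prefetch hint actually does (function name and prefetch distance are made up):

Code:
#include <immintrin.h>
#include <stddef.h>

/* Software prefetch is core-local: the hint fills the L1/L2 of the core
 * executing this loop (and at best the L3 of *its own* CCX). A thread on
 * the other CCD gains nothing from these hints except losing the memory
 * bandwidth they consume. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 64;   /* elements ahead; tune per workload */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            _mm_prefetch((const char *)&a[i + dist], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}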
 

Tuna-Fish

Golden Member
Mar 4, 2011

I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound to the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make the maximum communication distance from registers to units much shorter.
 

naukkis

Senior member
Jun 5, 2002
I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound to the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make the maximum communication distance from registers to units much shorter.

You can't slice AVX-512 instructions efficiently. To be able to do fast 512-bit permutations, the register file needs to hold the whole 512-bit register. So those are 256-bit quarters, and the adjacent ones are combined for executing 512-bit instructions (or there are just two double-pumped pipelines). And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one-cycle latency involved.
 

MS_AT

Senior member
Jul 15, 2024
And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one-cycle latency involved.
Any data to back this up? The optimization manual doesn't talk about it, and it's not included in any latency tables that I know of. And the way you state it, it sounds different from the one additional latency cycle due to schedulers being full. [That applies only to instructions that would themselves have only 1 cycle of latency; anything with longer latency is not affected.]
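
If someone wanted to test it, the usual approach is a serially dependent chain of 1-cycle vector ops: if some scheduling patterns pay a cluster-crossing cycle, cycles-per-op creeps above 1. A rough sketch of that methodology only (no pinning, no fixed clocks, and rdtsc counts reference cycles rather than core cycles, so take the raw output with a grain of salt):

Code:
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* Rough sketch: a serially dependent chain of VPADDQ. Every add consumes
 * its own previous result, so cycles/iteration approximates the effective
 * latency, including any extra forwarding cost if dependent ops land in
 * different register file clusters. A real test needs core pinning, fixed
 * clocks and many runs. */
int main(void)
{
    __m128i x = _mm_set1_epi64x(1);
    const int iters = 100000000;

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        x = _mm_add_epi64(x, x);            /* dependent chain */
        __asm__ __volatile__("" : "+x"(x)); /* keep the compiler from folding it */
    }
    uint64_t t1 = __rdtsc();

    printf("~%.2f cycles per dependent add (result %lld)\n",
           (double)(t1 - t0) / iters, (long long)_mm_cvtsi128_si64(x));
    return 0;
}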
 

Bigos

Member
Jun 2, 2019
Another explanation would be the register file being duplicated to both the top and the bottom parts. This is usually done to increase the number of read ports of a register file, at the cost of area and power. Writes to the register file have to access both parts, so the number of write ports does not increase in this scheme.

And with a complex enough forwarding network, the latency of register file writes does not matter much, as it can be delayed: subsequent operations can source their operands either directly from the back end of the execution units or from the forwarding network's "buffers".

This might still mean that, on the critical path, an operation done on the bottom EU needs to be forwarded to the top EU immediately (or vice versa), all within the same cycle. This might be easier to do with the EUs, as they are closer to each other than the register files are (and the middle area is probably mostly the forwarding network itself).
 

Tuna-Fish

Golden Member
Mar 4, 2011
You can't slice AVX-512 instructions efficiently. To be able to do fast 512-bit permutations, the register file needs to hold the whole 512-bit register. So those are 256-bit quarters, and the adjacent ones are combined for executing 512-bit instructions (or there are just two double-pumped pipelines). And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one-cycle latency involved.

That's if you try to make wide permutes very low latency. I find AMD's optimization manual for Zen 5 (zip, contains a PDF and a spreadsheet with latencies) very illuminating here.

Notably:

All forms of VPERMW emit the same number of Mops and have the same high throughput of 2 per cycle. However...
VPERMW x, x, x (that is, 128-bit registers) has a latency of 2 cycles.
VPERMW y, y, y (256-bit regs) has a latency of 4 cycles.
VPERMW z, z, z (512-bit regs) has a latency of 5 cycles.

This to me screams a horizontally split implementation. And it's also, at least to me, the technically better way to build it, because it means you only pay for lane-crossing when you actually intentionally cross lanes.
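
For reference, the three variants in that table map to these intrinsics (AVX-512BW, plus AVX-512VL for the narrower forms); a tiny compile-and-run sketch to make the width progression concrete:

Code:
#include <immintrin.h>
#include <stdio.h>

/* The three VPERMW variants from the latency table, expressed as
 * intrinsics. Same operation at three widths; per AMD's Zen 5 tables the
 * latency grows 2 -> 4 -> 5 cycles as more lane-crossing becomes possible.
 * Build with -mavx512bw -mavx512vl. */
int main(void)
{
    __m128i idx1 = _mm_set1_epi16(3),    a1 = _mm_set1_epi16(7);
    __m256i idx2 = _mm256_set1_epi16(3), a2 = _mm256_set1_epi16(7);
    __m512i idx3 = _mm512_set1_epi16(3), a3 = _mm512_set1_epi16(7);

    __m128i r1 = _mm_permutexvar_epi16(idx1, a1);    /* VPERMW x,x,x: 2c */
    __m256i r2 = _mm256_permutexvar_epi16(idx2, a2); /* VPERMW y,y,y: 4c */
    __m512i r3 = _mm512_permutexvar_epi16(idx3, a3); /* VPERMW z,z,z: 5c */

    printf("%d %d %d\n",
           _mm_extract_epi16(r1, 0),
           _mm256_extract_epi16(r2, 0),
           _mm512_cvtsi512_si32(r3) & 0xFFFF);
    return 0;
}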
 