adroc_thurston
> less efficiency than vanilla cores is weird

That's normal, SIR hates cachelet setups.
Lion Cove/Skymont analysis at David Huang's Blog; there are interesting comparisons to Zen 5.
> The one thing that does not sit right with me is the efficiency of the cores in the dense cluster, less efficiency than vanilla cores is weird.

The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too much of the other components in the package from powering down deeply enough to make single-core power usage a dominant part of package power consumption.
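To make the denominator problem concrete, here is a toy sketch (all numbers invented for illustration, none of them Huang's measurements) of how much a fixed uncore/idle floor distorts a rate-1 perf-per-package-watt figure:

```c
#include <stdio.h>

/* Toy illustration of the measurement problem described above: in a
 * rate-1 run, perf / package-power also pays for uncore, memory and
 * idle cores, so a core that sips power can still look inefficient.
 * All numbers are made up for illustration; none are measured data. */
int main(void)
{
    double score       = 10.0;  /* assumed rate-1 score, arbitrary units */
    double core_power  =  4.0;  /* assumed active-core power, W          */
    double rest_of_pkg =  6.0;  /* assumed uncore/idle power, W          */

    printf("perf per core W:    %.2f\n", score / core_power);
    printf("perf per package W: %.2f\n", score / (core_power + rest_of_pkg));
    /* the second number is dominated by the 6 W that has nothing to do
     * with the core actually being measured */
    return 0;
}
```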
> The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too much of the other components in the package from powering down deeply enough to make single-core power usage a dominant part of package power consumption.

I'm aware of both the issues; that's why I said that for now all we can do is take it for what it's worth, essentially observe the strange result and raise an eyebrow.

The other graph, which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data), shows the classic core and the dense core closer to each other WRT SIR r1 perf/W. He noticed the same thing with Zen 4 vs Zen 4 Dense, IIRC, though that seemed to be much closer than the difference seen here.

TL;DR: cores dangerously close to each other, collisions expected.
> 20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB 6 ST, so that's close only apparently; in the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.

Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...
> I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?

The workload plays a role (affinity for cache) and the very limited method of measurement plays another (probably bigger) role. Nevertheless, I think we're more informed about the subject than we were before Huang published the data, as long as we don't attempt to draw definitive conclusions.
> Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...

No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there as more of a bandwidth-saving solution.
> I'm aware of both the issues; that's why I said that for now all we can do is take it for what it's worth, essentially observe the strange result and raise an eyebrow.

Admittedly I didn't read your other post very carefully, as I was already preoccupied with typing my post... :-)
> 20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB 6 ST, so that's close only apparently; in the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.

Skymont in Lunar Lake performs about 41% better per clock in both SPECint and Geekbench Integer (assuming 90% clock scaling), while in FP it performs 60% faster in GB5 and 70% faster in GB6 compared to Crestmont LP, which is almost exactly what Intel claims.
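For what it's worth, this is roughly the normalization those per-clock numbers imply. A minimal sketch under one reading of the "90% clock scaling" assumption (that only about 90% of a frequency increase turns into performance); all inputs are placeholders, not Huang's or Intel's data:

```c
#include <stdio.h>

/* Back out a per-clock gain from a raw score ratio and a clock ratio.
 * "scaling" models the assumption that performance only scales ~90%
 * as fast as frequency, so only that fraction of the clock gap is
 * credited to clock rather than to the architecture. */
static double per_clock_gain(double score_ratio, double clock_ratio, double scaling)
{
    double clock_credit = 1.0 + scaling * (clock_ratio - 1.0);
    return score_ratio / clock_credit;   /* what's left is the per-clock gain */
}

int main(void)
{
    /* placeholder example: 80% higher score at a 30% higher clock */
    double gain = per_clock_gain(1.80, 1.30, 0.90);
    printf("per-clock gain: +%.0f%%\n", (gain - 1.0) * 100.0);  /* ~+42% */
    return 0;
}
```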
> No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there as more of a bandwidth-saving solution.

This would make sense, as higher-performing caches come with tradeoffs such as larger cell sizes and/or lower power efficiency due to higher leakage and outright higher power use.
GNR CCD floor plan annotated by Nemez.
I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound with the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make maximum communication distances much shorter from registers to units.
> And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one cycle of latency involved.

Any data to back this up? The optimization manual doesn't talk about it, and it's not included in any latency tables that I know of. And the way you state it, it sounds different from the one additional latency cycle due to schedulers being full (that applies only to instructions that would themselves have only 1 cycle of latency; anything with longer latency is not affected).
You can't slice AVX512 instructions efficiently. To do fast 512-bit permutations, the register file needs to hold the whole 512-bit register. So those are 256-bit quarters, and nearby ones are combined for executing 512-bit instructions (or there are just two double-pumped pipelines). And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one cycle of latency involved.
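To illustrate the permutation point: with a full-width cross-lane permute such as vpermd, any output lane can source any input lane, so a 128-bit slice of the destination can depend on every 128-bit slice of the source. A minimal AVX-512F example (a generic illustration, not tied to any particular floor plan):

```c
#include <immintrin.h>
#include <stdio.h>

/* Why a full-width 512-bit permute is hard to slice: with vpermd
 * (_mm512_permutexvar_epi32), any of the 16 output dwords can come
 * from any of the 16 source dwords, so a 128-bit slice of the result
 * may need data held in every other slice of the source register.
 * Build with e.g. gcc -O2 -mavx512f; needs an AVX-512F capable CPU. */
int main(void)
{
    int src[16], dst[16];
    for (int i = 0; i < 16; i++) src[i] = 100 + i;

    /* index vector that reverses the 16 dwords: lane 0 of the result
     * reads lane 15 of the source, i.e. data from the "far" end */
    __m512i idx = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                   8, 9, 10, 11, 12, 13, 14, 15);
    __m512i a   = _mm512_loadu_si512(src);
    __m512i r   = _mm512_permutexvar_epi32(idx, a);
    _mm512_storeu_si512(dst, r);

    for (int i = 0; i < 16; i++) printf("%d ", dst[i]);
    printf("\n");   /* prints 115 114 ... 100 */
    return 0;
}
```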
> C&C has an article up testing Zen 5 with the op cache disabled to see how well the 2x4 decoder works. Short answer: in single thread, not great; with SMT enabled, surprisingly good. It does seem like an obvious place for AMD to work some magic in Zen 6 though, assuming they still care about client performance. The results vary quite a bit, so if you are interested check it out:
>
> Disabling Zen 5’s Op Cache and Exploring its Clustered Decoder
> Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters. (chipsandcheese.com)

Very nicely written article. Love that he wrote his own quick and dirty performance counter monitoring program; that could be fun to use or experiment with. Seems the clustered decoders are there mainly to support and boost SMT, which speaks strongly to Zen 5's server-focused roots.
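For anyone who wants to experiment along the same lines, here is a minimal Linux perf_event_open sketch that counts retired instructions and cycles around a loop. It is a generic example, not the monitoring program from the article:

```c
/* Minimal Linux perf_event_open sketch: count retired instructions and
 * core cycles around a workload.  Generic example for experimentation.
 * Build: gcc -O2 perfcnt.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(__u64 config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: measure this thread on any CPU */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (fd_ins < 0 || fd_cyc < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_ins, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_cyc, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_ins, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_cyc, PERF_EVENT_IOC_ENABLE, 0);

    /* workload under test: something trivial to count */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) x += i;

    ioctl(fd_ins, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_cyc, PERF_EVENT_IOC_DISABLE, 0);

    long long ins = 0, cyc = 0;
    read(fd_ins, &ins, sizeof(ins));
    read(fd_cyc, &cyc, sizeof(cyc));
    printf("instructions: %lld  cycles: %lld  IPC: %.2f\n",
           ins, cyc, cyc ? (double)ins / (double)cyc : 0.0);

    close(fd_ins);
    close(fd_cyc);
    return 0;
}
```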
> In Zen 5, the two 4-wide decode clusters and the op cache all feed into a single deep micro-op queue.

Or do they? According to Cardyak, …

Cardyak's Microarchitecture Block Diagram Images -> AMD -> High Power -> 2024 - Zen 5.jpg

... there are *two* micro-op queues, not one. Each decoder cluster feeds into one of the two queues. And the micro-op cache is dual-ported and feeds into both of the two micro-op queues. These are then muxed into the reorder buffer.

> Or do they? According to Cardyak, …
> Cardyak's Microarchitecture Block Diagram Images -> AMD -> High Power -> 2024 - Zen 5.jpg

His diagram contains errors (there is only one complex decoder in the cluster, if we are to trust the software optimization guide), so I would not rule out other mistakes.
> AMD cpu's have always had unified features on all decoders unlike Intel. With complex decoding all decoders can't work simultaneously but AMD hardware has no split between simple/complex decoders.

This is the direct quote from the Software Optimization Guide for the AMD Zen5 Microarchitecture, published in August 2024, revision 1.00:

> Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.

So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.
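For a sense of which instructions trip that rule, here is one hand-encoded example that comes out at 11 bytes (the specific instruction is my own illustration, not one from the guide):

```c
/* One example of an x86-64 instruction longer than 10 bytes:
 *   add dword ptr [rax + rbx*4 + 0x12345678], 0x11223344
 * hand-encoded below purely for illustration (11 bytes total). */
unsigned char long_insn[11] = {
    0x81,                   /* opcode: ADD r/m32, imm32                     */
    0x84,                   /* ModRM: mod=10 (disp32), reg=/0 (ADD), rm=SIB */
    0x98,                   /* SIB: scale=4, index=rbx, base=rax            */
    0x78, 0x56, 0x34, 0x12, /* disp32 = 0x12345678 (little-endian)          */
    0x44, 0x33, 0x22, 0x11  /* imm32  = 0x11223344 (little-endian)          */
};
```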