adroc_thurston
> less efficiency than vanilla cores is weird
That's normal, SIR hates cachelet setups.
Lion Cove/Skymont analysis at David Huang's Blog; there are interesting comparisons to Zen 5.
> The one thing that does not sit right with me is the efficiency of the cores in the dense cluster, less efficiency than vanilla cores is weird.
> View attachment 108440
The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too many of the other components in the package from powering down deeply enough to make single-core power usage a dominant part of package power consumption.
> The graph does not show core efficiency. It is SIR rate-1 performance over package power. ...
I'm aware of both the issues; that's why I said that for now all we can do is take it for what it's worth: essentially observe the strange result and raise an eyebrow.
The other graph, which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data), shows the classic core and the dense core closer to each other WRT SIR r1 perf/W.
He noticed the same thing with Zen 4 vs Zen 4 Dense IIRC, though that seemed to be much closer than the difference seen here.
TL;DR - cores dangerously close to each other, collisions expected.
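To make the package-power vs core-rail distinction concrete, here is a minimal sketch of the two perf/W definitions being compared; every number and variable name below is a made-up placeholder, not Huang's data.

[CODE]
/* Sketch only: shows why dividing a rate-1 score by package power instead of
 * the core rail penalizes a low-power dense core more, in relative terms.
 * All values are placeholders. */
#include <stdio.h>

int main(void) {
    double sir_r1_score  = 10.0;  /* hypothetical SPECint rate-1 score         */
    double core_power_w  = 4.0;   /* hypothetical core-rail (VDDCR-like) power */
    double rest_of_pkg_w = 3.0;   /* hypothetical SoC/uncore power that never
                                     powers down during the run                */

    double perf_per_w_core = sir_r1_score / core_power_w;
    double perf_per_w_pkg  = sir_r1_score / (core_power_w + rest_of_pkg_w);

    printf("perf/W over core rail:     %.2f\n", perf_per_w_core);
    printf("perf/W over package power: %.2f\n", perf_per_w_pkg);
    return 0;
}
[/CODE]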
> 20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB 6 ST, so that's close only apparently; in the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.
Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...
> I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?
The workload plays a role (affinity for cache) and the very limited method of measurement plays another (probably bigger) role. Nevertheless, I think we're more informed about the subject than we were before Huang published the data, as long as we don't attempt to draw definitive conclusions.
> Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...
No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there as more of a bandwidth saving solution.
> I'm aware of both the issues; that's why I said that for now all we can do is take it for what it's worth...
Admittedly I didn't read your other post very carefully, as I was already preoccupied with typing my post... :-)
> 20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB 6 ST...
Skymont in Lunarlake performs 41% better per clock in both SPECint and Geekbench Integer (assuming 90% clock scaling), while in FP it performs 60% faster in GB5 and 70% faster in GB6 compared to Crestmont LP, which is almost exactly what Intel claims.
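The per-clock figures above come from a simple ratio. A hedged sketch of that arithmetic follows; the score and clock ratios are placeholders, and treating "90% clock scaling" as a 0.9 efficiency factor is just one possible reading of that assumption.

[CODE]
/* Sketch of the perf-per-clock arithmetic, with placeholder inputs. */
#include <stdio.h>

int main(void) {
    double score_ratio = 1.50;  /* hypothetical Skymont / Crestmont LP score ratio */
    double clock_ratio = 1.10;  /* hypothetical frequency ratio between the runs   */
    double scaling_eff = 0.90;  /* assumed fraction of the clock gain that turns
                                   into performance ("90% clock scaling")          */

    /* Back out the per-clock uplift by removing the effective clock advantage. */
    double effective_clock_gain = 1.0 + scaling_eff * (clock_ratio - 1.0);
    double per_clock_uplift     = score_ratio / effective_clock_gain;

    printf("per-clock uplift: %+.1f%%\n", (per_clock_uplift - 1.0) * 100.0);
    return 0;
}
[/CODE]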
> No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there as more of a bandwidth saving solution.
This would make sense, as higher-performing caches involve tradeoffs such as larger cell sizes and/or lower power efficiency due to higher leakage and outright higher power use.
GNR CCD floor plan annotated by Nemez.
I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound to the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make the maximum communication distance from registers to units much shorter.
> And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one cycle of latency involved.
Any data to back this up? The optimization manual doesn't talk about it, and it's not included in any latency tables that I know of. And the way you state it, it sounds different from the one additional latency cycle due to schedulers being full [that applies only to instructions that would themselves have only 1 cycle of latency; anything with longer latency is not affected].
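Absent documented numbers, the usual way to probe something like this is a dependent-chain microbenchmark. A rough sketch is below; note its limits: software cannot force instructions onto a particular register-file cluster, so any extra forwarding cycle would at best show up as a fractional increase in the average chain latency, and __rdtsc counts reference cycles rather than core cycles.

[CODE]
/* Rough sketch of a dependent-chain latency probe.  Placement onto a specific
 * register-file cluster is not software-controllable, so this only yields an
 * averaged latency; TSC ticks are reference cycles, not core cycles. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

int main(void) {
    const long iters = 200000000L;
    volatile double seed = 1.0;   /* keeps the compiler from folding the chain */
    double x = seed;              /* every add below depends on the previous one */

    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        x += 1e-9;
    uint64_t t1 = __rdtsc();

    printf("~%.2f TSC ticks per dependent add (x = %f)\n",
           (double)(t1 - t0) / (double)iters, x);
    return 0;
}
[/CODE]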
You can't slice AVX512 instructions efficiently. To be able to do fast 512-bit permutations, the register file needs to hold the whole 512-bit register. So those are 256-bit quarters, and nearby ones are combined for executing 512-bit instructions (or there are just two double-pumped pipelines). And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one cycle of latency involved.
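To illustrate the permutation point: a full-width cross-lane shuffle such as vpermd can route any of the 16 source dwords to any output lane, so whatever feeds the shuffle unit has to present all 512 bits of the source at once. A minimal example using standard AVX-512F intrinsics; the build command is an assumption (e.g. gcc -O2 -mavx512f permute.c) and nothing here describes the actual Zen 5 datapath.

[CODE]
/* Every output dword of vpermd may come from any of the 16 source dwords,
 * which is the argument above for why a register-file entry can't simply be
 * treated as independent 128-bit slices. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i src = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                    8, 9, 10, 11, 12, 13, 14, 15);
    /* Reverse the 16 dwords: each result lane reads from the "far" end. */
    __m512i idx = _mm512_setr_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                    7, 6, 5, 4, 3, 2, 1, 0);
    __m512i dst = _mm512_permutexvar_epi32(idx, src);  /* vpermd */

    int out[16];
    _mm512_storeu_si512(out, dst);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}
[/CODE]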